Problem/Motivation

Html::load() + Html::serialize() produces different output across environments.

Steps to reproduce

use Drupal\Component\Utility\Html;

    $source = <<<EOT
<table>
  <tr>
    <td>L</td>
    <td>R</td>
  </tr>
</table>
EOT;
    $document = Html::load($source);
    $out = Html::serialize($document);
    var_export($out);

Run this in different environments.
Sometimes the output is like so:

<table>
  <tr>
    <td>L</td>
    <td>R</td>
  </tr>
</table>

Other times it is like so:

<table><tr><td>L</td>
    <td>R</td>
  </tr></table>

This has no practical effect in most cases, but it causes inconsistency in unit tests.

(Unforutnately I can't tell which differences in the env cause this inconsistency)

Proposed resolution

Pass either LIBXML_HTML_NODEFDTD or LIBXML_NOBLANKS to DOMDocument->loadHTML(), inside Html::load().
One of them consistently removes these line breaks, the other consistently preserves them.

If BC is a concern, we could add an optional parameter to Html::load(), to control this behavior.
And then in FilterHtml we could provide this parameter.

Remaining tasks

User interface changes

API changes

Data model changes

Release notes snippet

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

donquixote created an issue. See original summary.

cilefen’s picture

Interesting! I suppose it is something to do with the system's libxml.

longwave’s picture

Can you describe the two different environments a bit (OS, PHP version, etc), it may give someone enough info to find the root cause.

donquixote’s picture

Interesting! I suppose it is something to do with the system's libxml.

Possibly. But I was not able to put my finger on it.

Can you describe the two different environments a bit (OS, PHP version, etc),

- System A (dockerized Linux system) where some line breaks are removed.
+ System B (local Linux system) where line breaks are preserved.

 phpinfo()
-PHP Version => 7.3.24-3+ubuntu18.04.1+deb.sury.org+1
+PHP Version => 7.3.27-9+ubuntu20.04.1+deb.sury.org+1
[..] 
 libxml
 
 libXML support => active
 libXML Compiled Version => 2.9.10
-libXML Loaded Version => 20904
+libXML Loaded Version => 20910
 libXML streams => enabled
[..]
 xml
 
 XML Support => active
 XML Namespace Support => active
 libxml2 Version => 2.9.10
 
 xmlreader
 
 XMLReader => enabled
 
 xmlwriter
 
 XMLWriter => enabled

So, the PHP extensions are not so different.

I suspect the main difference lies in libxml libraries installed on system level.

Comparing apt-cache policy libxml2

- System A (dockerized Linux system) where some line breaks are removed.
+ System B (local Linux system) where line breaks are preserved.

 libxml2:
-  Installed: 2.9.4+dfsg1-6.1ubuntu1.3
-  Candidate: 2.9.10+dfsg-5+ubuntu18.04.1+deb.sury.org+3
+  Installed: 2.9.10+dfsg-5+ubuntu20.04.1+deb.sury.org+3
+  Candidate: 2.9.10+dfsg-5+ubuntu20.04.1+deb.sury.org+3
donquixote’s picture

Btw I doubt that it has anything to do with php version.

There is no difference between versions here: https://3v4l.org/2jSIa

donquixote’s picture

donquixote’s picture

Title: Html::load() inconsistent line breaks between environments. » Html::load() inconsistent line breaks between environments (old libxml2 version).

So I can confirm that the change is due to the version of the libxml2 library on system level (NOT the php extension).

old: libxml2 2.9.4. Here the line breaks were removed.
new: libxml2 2.9.10. Here the line breaks remain, unless a specific parameter is provided to remove them.

If we want to standardize, we should replicate the behavior from the newer version.
This can be done by passing LIBXML_HTML_NODEFDTD to DOMDocument->loadHTML().

cilefen’s picture

Issue tags: +Novice

That sounds like a plan.

donquixote’s picture

Title: Html::load() inconsistent line breaks between environments (old libxml2 version). » Html::load() inconsistent space removal with old libxml2 versions.

Could be this commit in libxml2:
https://gitlab.gnome.org/GNOME/libxml2/-/commit/f933c898132f20a50ba39ac6...

Keep non-significant blanks node in HTML parser

For https://bugzilla.gnome.org/show_bug.cgi?id=681822

Regardless if the option HTML_PARSE_NOBLANKS is set or not, blank nodes
are removed from a HTML document, for example:
[..]

Version: 9.2.x-dev » 9.3.x-dev

Drupal 9.2.0-alpha1 will be released the week of May 3, 2021, which means new developments and disruptive changes should now be targeted for the 9.3.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

longwave’s picture

I think I am running into this in existing tests when working on #2441811: Upgrade filter system to HTML5 - that issue would fix this one by using a fixed PHP library for HTML parsing and serialization so any differences should be removed.

Version: 9.3.x-dev » 9.4.x-dev

Drupal 9.3.0-rc1 was released on November 26, 2021, which means new developments and disruptive changes should now be targeted for the 9.4.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.4.x-dev » 9.5.x-dev

Drupal 9.4.0-alpha1 was released on May 6, 2022, which means new developments and disruptive changes should now be targeted for the 9.5.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

longwave’s picture

Status: Active » Needs review
FileSize
484 bytes

It's possible this is now affecting core as I get the same failures locally as in HEAD: https://www.drupal.org/pift-ci-job/2431849

Attached patch fixes the filter module tests locally, so let's see how this goes on DrupalCI.

longwave’s picture

longwave’s picture

If this does fix the tests, as per #7 we should decide whether we want to keep the old behaviour with this flag, or enforce the new behaviour with the other flag and update the tests to match.

andypost’s picture

Looking at commit https://gitlab.gnome.org/GNOME/libxml2/-/commit/f933c898132f20a50ba39ac6... it looks like #15 should be enough to keep consistency within libxml2 versions

The last submitted patch, 15: 3204929-15.patch, failed testing. View results

longwave’s picture

A similar fix is required for Drupal\Tests\layout_discovery\Kernel\LayoutTest.

I can't reproduce either of the FunctionalJavascript failures locally.

catch’s picture

If this does fix the tests, as per #7 we should decide whether we want to keep the old behaviour with this flag

IMO if this fixes the tests, let's commit this and open a follow-up for removing the flag (or not if we eventually decided not to).

Status: Needs review » Needs work

The last submitted patch, 19: 3204929-19.patch, failed testing. View results

Spokje’s picture

Priority: Normal » Critical
Status: Needs work » Reviewed & tested by the community

i think this patch had given all it can give on fixing the errors in HEAD that are related to this issue.

The remaining failures:

- Seem unrelated to the this patch
- Can't be reproduced locally

So I _think_ they are CI-env related.

I think we should commit this and find out (with the DA) what's going on with the remaining 3 failures.

Bumping priority to Critical, since it's related to failing HEADs.

  • catch committed bb5e69b on 10.0.x
    Issue #3204929 by longwave, donquixote, andypost: Html::load()...

  • catch committed 9a07492 on 9.5.x
    Issue #3204929 by longwave, donquixote, andypost: Html::load()...

  • catch committed d8d13be on 10.1.x
    Issue #3204929 by longwave, donquixote, andypost: Html::load()...
  • catch committed e51ecb6 on 9.4.x
    Issue #3204929 by longwave, donquixote, andypost: Html::load()...
catch’s picture

Version: 9.5.x-dev » 9.4.x-dev
Status: Reviewed & tested by the community » Fixed

Committed/pushed to 10.1.x and cherry-picked back through to 9.4.x, thanks!

longwave’s picture

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.