Problem/Motivation

It appears that "\r\n" is saved differently when calling this function with valid HMTL markup.

Steps to reproduce

$html = Html::transformRootRelativeUrlsToAbsolute("Hi there\r\nTesting", "https://test.com");
print $html; // prints Hi there
\nTesting

Proposed resolution

Not entirely sure of the best approach, but at least cover that case.

Remaining tasks

  • ✅ Provide a test.
  • ✅ Provide a fix.
  • ✅ Review.

User interface changes

None.

API changes

None.

Data model changes

None.

Release notes snippet

\Drupal\Component\Utility\Html::load(), which is used by several core and contrib filters as well as helper methods such as
\Drupal\Component\Utility\Html::transformRootRelativeUrlsToAbsolute(), now normalizes all newlines to the line feed character.

Comments

fjgarlin created an issue. See original summary.

fjgarlin’s picture

StatusFileSize
new541 bytes

Test only patch.

fjgarlin’s picture

Title: Html::transformRootRelativeUrlsToAbsolute replaces which are not part of any URL » Html::transformRootRelativeUrlsToAbsolute replaces characters which are not part of any URL
Issue summary: View changes
fjgarlin’s picture

Issue summary: View changes
fjgarlin’s picture

Title: Html::transformRootRelativeUrlsToAbsolute replaces characters which are not part of any URL » Html::transformRootRelativeUrlsToAbsolute replaces "\r" with "
"
Issue summary: View changes
fjgarlin’s picture

Status: Active » Needs review
StatusFileSize
new1.15 KB

Providing fix together with the test.

fjgarlin’s picture

Assigned: fjgarlin » Unassigned
Issue summary: View changes
wim leers’s picture

Status: Needs review » Needs work

Nice edge case!

+++ b/core/tests/Drupal/Tests/Component/Utility/HtmlTest.php
@@ -387,6 +387,8 @@ public function providerTestTransformRootRelativeUrlsToAbsolute() {
+    $data['html without links'] = ["Test without links but with\r\nsome special characters", 'http://example.com', FALSE];

Could you also add a test case that already contains 
 in the original HTML? 🤓🙏

fjgarlin’s picture

Status: Needs work » Needs review
StatusFileSize
new1.84 KB

Good point. I added a test for that case and slightly refactored the fix to cover both cases.

Version: 9.5.x-dev » 10.1.x-dev

Drupal 9.5.0-beta2 and Drupal 10.0.0-beta2 were released on September 29, 2022, which means new developments and disruptive changes should now be targeted for the 10.1.x-dev branch. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

colorfield’s picture

Thanks @fjgarlin tested manually and works great!

One potential edge case: we are using $placeholder = uniqid();
For a mail, it is not uncommon to generate tokens, it is very unlikely that we could have uniquid matching the substring of a token, but that could theoretically happen, so perhaps we could use a safeguard value (e.g. use a prefix as the first uniquid parameter).
What do you think?

fjgarlin’s picture

StatusFileSize
new1.86 KB

Good point @colorfield. I added a random prefix and more entropy to the placeholder, like this uniqid(mt_rand(), TRUE) (this is used in the same way in other places in core).

Uploaded new patch.

colorfield’s picture

Status: Needs review » Reviewed & tested by the community

Nice! re-tested manually, setting as RTBC

mfb’s picture

Note there are some edge cases where this would have side effects on markup. For example, if the string has just "\r" as whitespace in a tag, that whitespace gets replaced with uniqid, DOMDocument doesn't see any whitespace separating element from attributes, and you end up with attribute values being lost and a weird closing tag added to try to fix the markup. In addition, carriage returns are added back to the output after DOMDocument has cleaned up extra whitespace from tags.

I think calling saveHTML() in Html::serialize() would prevent the 
 from being added - I can't remember why saveXML() is used instead

fjgarlin’s picture

@mfb - thanks for the suggestion, the fix seems much cleaner. I re-tested manually for my case, uploading a patch to see the rest of the tests.

--

Edit: I can see the patch is already failing in some cases, so I'd still go with the one that was RTBC where all tests come fully green.
@mfb - maybe you can provide a test case for the edge case you mention? I'm not entirely sure I know what you mean. The test would be something like this:
$data['html edge case'] = ["Test with\r\n", 'http://example.com', FALSE];. The last parameter is FALSE for same output as input or you can provide the expected output if this should be different from the input.

fjgarlin’s picture

mfb’s picture

Right, we cannot simply use DOMDocument::saveHTML(), but eventually perhaps things can be refactored to make it possible; here are two related issues: #1333730: [Meta] PHP DOM (libxml2) misinterprets HTML5 and #3216912: Html::serialize adds unwanted/duplicate xml:* attributes

And here's an additional test case to illustrate potential issue. It's easier to see the output using phpunit --testdox:
Given <a⟵href='/node'>My·link</a>, output is <a⟵href>My·link</a⟵href> rather than <a·href="http://example.com/node">My·link</a>

fjgarlin’s picture

But this function (transform links to absolute) already keeps the "\n" for example, so why should it replace the "\r" and not leave it as it is?

I would expect that the only changes after calling it would be to replace the links, but I wouldn't expect anything else regarding the spacing.

So, for me, the test that would make sense would be:
["<a\rhref='/node'>My link</a>", 'http://example.com', '<a\rhref="http://example.com/node">My link</a>'];

(waiting for test result in any case as I'm not 100% sure about the spacing part)

Status: Reviewed & tested by the community » Needs work
mfb’s picture

DOMDocument does some normalizing of elements, e.g. a newline inside a tag is converted to a space.

The main thing I was trying to point out is how in this edge case (specifically chosen to break this patch), the value of the href attribute is totally lost, and a spurious href attribute is added to the closing tag.

fjgarlin’s picture

It's great that you thought of that edge case and provided the test for it, I really appreciate it, thanks! I'll try to see the options base on all the new tests we have now.

fjgarlin’s picture

Status: Needs work » Needs review
StatusFileSize
new1.64 KB

I found this useful: https://stackoverflow.com/questions/20525741/xml-create-element-new-line

I think the only problem is actually when we have "\r\n", which can be normalized to "\n" before parsing.
"&#13" is the encoding of just "\r", so it's not wrong to encode it when it's by itself, I think.

fjgarlin’s picture

StatusFileSize
new1.64 KB

The last submitted patch, 15: fix_character_replacement-3311595-15.patch, failed testing. View results

fjgarlin’s picture

Title: Html::transformRootRelativeUrlsToAbsolute replaces "\r" with "&#13;" » Html::transformRootRelativeUrlsToAbsolute replaces "\r\n" with "&#13\n;"
Issue summary: View changes

Status: Needs review » Needs work

The last submitted patch, 23: fix_character_replacement-3311595-23.patch, failed testing. View results

fjgarlin’s picture

StatusFileSize
new1.68 KB
fjgarlin’s picture

Status: Needs work » Needs review

Status: Needs review » Needs work

The last submitted patch, 27: fix_character_replacement-3311595-27.patch, failed testing. View results

mfb’s picture

I would recommend str_replace(["\r\n", "\r"], "\n", $html)

This will replace the common CR-LF sequence with LF, then it will also replace any lone CR with LF.

I think it makes sense to just get rid of all CR as DOMDocument::saveXML() doesn't handle them correctly, for HTML purposes, when they are found in text nodes

mfb’s picture

To further clarify on why get rid of CR, &#13; is valid in XML so that's why saveXML() outputs it, but is considered invalid in HTML - the W3C HTML validation error is "A numeric character reference expanded to carriage return." There are some arcane HTML rules about which numeric character entities are allowed

pooja saraah’s picture

StatusFileSize
new1.68 KB
new614 bytes

Addressed the comment #30
Attached patch against Drupal 10.1.x

pooja saraah’s picture

Status: Needs work » Needs review
fjgarlin’s picture

StatusFileSize
new1.69 KB

Fixed faulty PHP in #32 (it wasn't assigning back to $html after the str_replace) and applied #30 suggestion.

fjgarlin’s picture

mfb’s picture

The comment seems a little vague, howabout something like PHP's \DOMDocument::saveXML() encodes carriage returns as &#13; so normalize all newlines to line feeds.

I'm also adding a duplicate related issue: #2975602: Html::serialize() escapes \r to &#13;

fjgarlin’s picture

StatusFileSize
new1.67 KB

Good suggestion, thanks. Patch updated. Good job finding the duplicate too, I checked when I first created the issue but didn't see that one.

mfb’s picture

Title: Html::transformRootRelativeUrlsToAbsolute replaces "\r\n" with "&#13\n;" » Html::transformRootRelativeUrlsToAbsolute replaces "\r\n" with "&#13;\n"
Issue summary: View changes
Status: Needs review » Reviewed & tested by the community

LGTM.

I updated the release notes snippet to better describe the consequences of this change, which will affect some core and contrib filters as well as the transformRootRelativeUrlsToAbsolute() method.

alexpott’s picture

Status: Reviewed & tested by the community » Needs work
Issue tags: +Needs change record

Discussed with @longwave who pointed out that someone somewhere might be relying on the behaviour. It'm not sure that that should prevent use from making the change but it does mean we should have a CR.

fjgarlin’s picture

Status: Needs work » Needs review

I created the change records with some basic information. It's here: https://www.drupal.org/node/3320471
Not sure what else needs adding there as I haven't created a CR for core before.

alexpott’s picture

@fjgarlin I fixed up the CR to have before and after - that always helps on a CR.

longwave’s picture

Status: Needs review » Reviewed & tested by the community
Issue tags: -Needs change record

The change record looks good to me.

xjm’s picture

Title: Html::transformRootRelativeUrlsToAbsolute replaces "\r\n" with "&#13;\n" » Html::transformRootRelativeUrlsToAbsolute() replaces "\r\n" with "&#13;\n"

  • xjm committed 3589a503 on 10.1.x
    Issue #3311595 by fjgarlin, mfb, pooja saraah, colorfield, alexpott, Wim...
xjm’s picture

Status: Reviewed & tested by the community » Fixed

Good find and the fix seems sensible. Committed to 10.1.x and published the CR. Note that I did not backport it on account of #39. Thanks!

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.