
Problem/Motivation
It appears that "\r\n" is saved differently when calling this function with valid HMTL markup.
Steps to reproduce
$html = Html::transformRootRelativeUrlsToAbsolute("Hi there\r\nTesting", "https://test.com");
print $html; // prints Hi there \nTesting
Proposed resolution
Not entirely sure of the best approach, but at least cover that case.
Remaining tasks
- ✅ Provide a test.
- ✅ Provide a fix.
- ✅ Review.
User interface changes
None.
API changes
None.
Data model changes
None.
Release notes snippet
\Drupal\Component\Utility\Html::load(), which is used by several core and contrib filters as well as helper methods such as
\Drupal\Component\Utility\Html::transformRootRelativeUrlsToAbsolute(), now normalizes all newlines to the line feed character.
Comment | File | Size | Author |
---|---|---|---|
#37 | fix_character_replacement-3311595-37.patch | 1.67 KB | fjgarlin |
#2 | test-only.patch | 541 bytes | fjgarlin |
Comments
Comment #2
fjgarlin CreditAttribution: fjgarlin as a volunteer and at Drupal Association commentedTest only patch.
Comment #3
fjgarlin CreditAttribution: fjgarlin as a volunteer and at Drupal Association commentedComment #4
fjgarlin CreditAttribution: fjgarlin as a volunteer and at Drupal Association commentedComment #5
fjgarlin CreditAttribution: fjgarlin as a volunteer and at Drupal Association commentedComment #6
fjgarlin CreditAttribution: fjgarlin as a volunteer and at Drupal Association commentedProviding fix together with the test.
Comment #7
fjgarlin CreditAttribution: fjgarlin as a volunteer and at Drupal Association commentedComment #8
wim leersNice edge case!
Could you also add a test case that already contains
in the original HTML? 🤓🙏Comment #9
fjgarlin CreditAttribution: fjgarlin as a volunteer and at Drupal Association commentedGood point. I added a test for that case and slightly refactored the fix to cover both cases.
Comment #11
colorfieldThanks @fjgarlin tested manually and works great!
One potential edge case: we are using
$placeholder = uniqid();
For a mail, it is not uncommon to generate tokens, it is very unlikely that we could have uniquid matching the substring of a token, but that could theoretically happen, so perhaps we could use a safeguard value (e.g. use a prefix as the first uniquid parameter).
What do you think?
Comment #12
fjgarlin CreditAttribution: fjgarlin as a volunteer and at Drupal Association commentedGood point @colorfield. I added a random prefix and more entropy to the placeholder, like this
uniqid(mt_rand(), TRUE)
(this is used in the same way in other places in core).Uploaded new patch.
Comment #13
colorfieldNice! re-tested manually, setting as RTBC
Comment #14
mfbNote there are some edge cases where this would have side effects on markup. For example, if the string has just "\r" as whitespace in a tag, that whitespace gets replaced with uniqid, DOMDocument doesn't see any whitespace separating element from attributes, and you end up with attribute values being lost and a weird closing tag added to try to fix the markup. In addition, carriage returns are added back to the output after DOMDocument has cleaned up extra whitespace from tags.
I think calling saveHTML() in Html::serialize() would prevent the
from being added - I can't remember why saveXML() is used insteadComment #15
fjgarlin CreditAttribution: fjgarlin as a volunteer and at Drupal Association commented@mfb - thanks for the suggestion, the fix seems much cleaner. I re-tested manually for my case, uploading a patch to see the rest of the tests.
--
Edit: I can see the patch is already failing in some cases, so I'd still go with the one that was RTBC where all tests come fully green.
@mfb - maybe you can provide a test case for the edge case you mention? I'm not entirely sure I know what you mean. The test would be something like this:
$data['html edge case'] = ["Test with\r\n", 'http://example.com', FALSE];
. The last parameter is FALSE for same output as input or you can provide the expected output if this should be different from the input.Comment #16
fjgarlin CreditAttribution: fjgarlin as a volunteer and at Drupal Association commentedComment #17
mfbRight, we cannot simply use DOMDocument::saveHTML(), but eventually perhaps things can be refactored to make it possible; here are two related issues: #1333730: [Meta] PHP DOM (libxml2) misinterprets HTML5 and #3216912: Html::serialize adds unwanted/duplicate xml:* attributes
And here's an additional test case to illustrate potential issue. It's easier to see the output using
phpunit --testdox
:Given
<a⟵href='/node'>My·link</a>
, output is<a⟵href>My·link</a⟵href>
rather than<a·href="http://example.com/node">My·link</a>
Comment #18
fjgarlin CreditAttribution: fjgarlin as a volunteer and at Drupal Association commentedBut this function (transform links to absolute) already keeps the "\n" for example, so why should it replace the "\r" and not leave it as it is?
I would expect that the only changes after calling it would be to replace the links, but I wouldn't expect anything else regarding the spacing.
So, for me, the test that would make sense would be:
["<a\rhref='/node'>My link</a>", 'http://example.com', '<a\rhref="http://example.com/node">My link</a>'];
(waiting for test result in any case as I'm not 100% sure about the spacing part)
Comment #20
mfbDOMDocument does some normalizing of elements, e.g. a newline inside a tag is converted to a space.
The main thing I was trying to point out is how in this edge case (specifically chosen to break this patch), the value of the href attribute is totally lost, and a spurious href attribute is added to the closing tag.
Comment #21
fjgarlin CreditAttribution: fjgarlin as a volunteer and at Drupal Association commentedIt's great that you thought of that edge case and provided the test for it, I really appreciate it, thanks! I'll try to see the options base on all the new tests we have now.
Comment #22
fjgarlin CreditAttribution: fjgarlin as a volunteer and at Drupal Association commentedI found this useful: https://stackoverflow.com/questions/20525741/xml-create-element-new-line
I think the only problem is actually when we have "\r\n", which can be normalized to "\n" before parsing.
"
" is the encoding of just "\r", so it's not wrong to encode it when it's by itself, I think.
Comment #23
fjgarlin CreditAttribution: fjgarlin as a volunteer and at Drupal Association commentedComment #25
fjgarlin CreditAttribution: fjgarlin as a volunteer and at Drupal Association commentedComment #27
fjgarlin CreditAttribution: fjgarlin as a volunteer and at Drupal Association commentedComment #28
fjgarlin CreditAttribution: fjgarlin as a volunteer and at Drupal Association commentedComment #30
mfbI would recommend
str_replace(["\r\n", "\r"], "\n", $html)
This will replace the common CR-LF sequence with LF, then it will also replace any lone CR with LF.
I think it makes sense to just get rid of all CR as DOMDocument::saveXML() doesn't handle them correctly, for HTML purposes, when they are found in text nodes
Comment #31
mfbTo further clarify on why get rid of CR,
is valid in XML so that's why saveXML() outputs it, but is considered invalid in HTML - the W3C HTML validation error is "A numeric character reference expanded to carriage return." There are some arcane HTML rules about which numeric character entities are allowedComment #32
pooja saraah CreditAttribution: pooja saraah at Material for Drupal India Association commentedAddressed the comment #30
Attached patch against Drupal 10.1.x
Comment #33
pooja saraah CreditAttribution: pooja saraah at Material for Drupal India Association commentedComment #34
fjgarlin CreditAttribution: fjgarlin as a volunteer and at Drupal Association commentedFixed faulty PHP in #32 (it wasn't assigning back to $html after the str_replace) and applied #30 suggestion.
Comment #35
fjgarlin CreditAttribution: fjgarlin as a volunteer and at Drupal Association commentedComment #36
mfbThe comment seems a little vague, howabout something like
PHP's \DOMDocument::saveXML() encodes carriage returns as so normalize all newlines to line feeds.
I'm also adding a duplicate related issue: #2975602: Html::serialize() escapes \r to
Comment #37
fjgarlin CreditAttribution: fjgarlin as a volunteer and at Drupal Association commentedGood suggestion, thanks. Patch updated. Good job finding the duplicate too, I checked when I first created the issue but didn't see that one.
Comment #38
mfbLGTM.
I updated the release notes snippet to better describe the consequences of this change, which will affect some core and contrib filters as well as the transformRootRelativeUrlsToAbsolute() method.
Comment #39
alexpottDiscussed with @longwave who pointed out that someone somewhere might be relying on the behaviour. It'm not sure that that should prevent use from making the change but it does mean we should have a CR.
Comment #40
fjgarlin CreditAttribution: fjgarlin as a volunteer and at Drupal Association commentedI created the change records with some basic information. It's here: https://www.drupal.org/node/3320471
Not sure what else needs adding there as I haven't created a CR for core before.
Comment #41
alexpott@fjgarlin I fixed up the CR to have before and after - that always helps on a CR.
Comment #42
longwaveThe change record looks good to me.
Comment #43
xjmComment #45
xjmGood find and the fix seems sensible. Committed to 10.1.x and published the CR. Note that I did not backport it on account of #39. Thanks!