Download & Extend

Refactor Drupal HTML corrector (PHP5)

Project:Drupal core
Version:7.x-dev
Component:filter.module
Category:task
Priority:normal
Assigned:Unassigned
Status:closed (fixed)

Issue Summary

Because Drupal 7 requires PHP 5.2, we can make use of PHP XML parsing facilities... ;)

Comments

#1

Status:active» needs review

And here it is...

filter.module |   69 ++--------------------------------------------------------
1 file changed, 3 insertions(+), 66 deletions(-)
AttachmentSizeStatusTest resultOperations
374441-refactor-html-corrector.patch3.02 KBIdleFailed: 9561 passes, 8 fails, 0 exceptionsView details

#2

Status:needs review» needs work

The last submitted patch failed testing.

#3

Also you need to add XML then to Drupal hook_requirements. There are idiot distributions outta there that disable core PHP extensions...

Edit: not disable, just package separately.

#4

Status:needs work» needs review

I'm not sure it's even possible to package the DOM extension separately (seems to be part of PHP core these days), but I added a requirement test anyway.

Also created a proper test case for this.

AttachmentSizeStatusTest resultOperations
374441-refactor-html-corrector.patch7.1 KBIdleFailed: 9651 passes, 67 fails, 0 exceptionsView details

#5

Status:needs review» needs work

The last submitted patch failed testing.

#6

Status:needs work» needs review

There was an issue with the regexp: we need "." to match the newline character (that's the role of the "s" modifier). Extended the test case to check for this.

AttachmentSizeStatusTest resultOperations
374441-refactor-html-corrector.patch7.26 KBIdleFailed: 9735 passes, 3 fails, 0 exceptionsView details

#7

Status:needs review» needs work

The last submitted patch failed testing.

#8

Status:needs work» needs review

Reroll, resubmit, goto 1.

AttachmentSizeStatusTest resultOperations
374441-refactor-html-corrector.patch7.26 KBIdleFailed: Failed to apply patch.View details

#9

Status:needs review» needs work

This would a great way of streamlining and removing a lot of code.

Some issues after review though:

  1. I don't think DOMDocument properly handles encoding in this case. I attempted to run some Georgian text through the filter and it came out mangled (not so without this patch applied). More info: http://www.onphp5.com/article/57.
  2. I'm wondering if we can use saveXML(DOMNode) instead of the regex to get to the contents of the body tag:
    something like (untested and currently returns the body element too, not just contents)
    $bodyNode = $htmlDom->getDocumentElement()->getElementsByTagName('body')->item(0);
    return $htmlDom->saveXML($bodyNode);

    instead of
    +  // The result of DOMDocument->saveXML() is a full (X)HTML document. Only
    +  // extract the body of it. The "s" modifier makes "." match newlines too.
    +  if (preg_match("|<body[^>]*>(.*)</body>|s", $htmlDom->saveXML(), $matches)) {
    +    return $matches[1];
       }

#10

1. I don't think DOMDocument properly handles encoding in this case. I attempted to run some Georgian text through the filter and it came out mangled (not so without this patch applied). More info: http://www.onphp5.com/article/57.

Ok, apparently we need to add a XML declaration to force the DOM to load as UTF-8 (I guess it still defaults to latin1, silly PHP).

2. I'm wondering if we can use saveXML(DOMNode) instead of the regex to get to the contents of the body tag:
something like (untested and currently returns the body element too, not just contents)

*So* great ;) I wasn't aware that saveXML() can be passed a DOMNode. I was more looking for a saveXML() method on DOMNode itself (which would make a lot more sense).

#11

Status:needs work» needs review

Let's try that.

AttachmentSizeStatusTest resultOperations
374441-refactor-html-corrector.patch7.43 KBIdleFailed: Failed to apply patch.View details

#12

Apparently, DOM only read HTML meta tags to detect encoding. Fixed, and added a test for that.

AttachmentSizeStatusTest resultOperations
374441-refactor-html-corrector.patch7.56 KBIdleFailed: Failed to apply patch.View details

#13

Strange, I was testing the patch in #11 and the encoding problems I experienced with the patch from #8 were gone...

I guess we should add a test to test.filter that ensures non-Latin characters make it through the filters?

#14

I guess we should add a test to test.filter that ensures non-Latin characters make it through the filters?

Already done.

<?php
     
// Encoding is correctly kept.
     
'<p>دروبال' => '<p>دروبال</p>'
?>

... which is apparently the literal transcription of Drupal.

#15

Some minor comments after review:

  1. If I apply the patch, the test string is converted from "دروبال" to "دروبال". I'm not sure if this is my configuration (windows/eclipse), or if this has something to do with the patch or the file's encoding. Ignore this if there is no problem applying this against HEAD
  2. Tested some extra single-tag elements like area, hr, ... (to ensure they're properly closed): works fine.
  3. Should we add a test to verify that non-nesting tags are auto-closed (I personally think this differs enough from "Properly close unclosed tags, and remove useless closing tags." to warant its own test):
  4. <?php
    // Auto-close improperly nested tags.
    '<p>test<p>test</p>\n' => '<p>test</p><p>test</p>\n',
    ?>

    In any case, above test passes succesfully with patch applied.
  5. Going through the old code, I think all 'replaced' logic is properly addressed via the tests, and by passing the tests the new logic proves to be sound.

I think this is good to go...

#16

1. If I apply the patch, the test string is converted from "دروبال" to "دروبال". I'm not sure if this is my configuration (windows/eclipse), or if this has something to do with the patch or the file's encoding. Ignore this if there is no problem applying this against HEAD

I believe this is your editor misbehaving. The patch is clean UTF-8.

2. Tested some extra single-tag elements like area, hr, ... (to ensure they're properly closed): works fine.

Extended the tests to test for


too. I don't believe it's necessary to test all of them (the list is readily available).

3. Should we add a test to verify that non-nesting tags are auto-closed (I personally think this differs enough from "Properly close unclosed tags, and remove useless closing tags." to warant its own test):

<?php
     
// Auto-close improperly nested tags.
     
'<p>test<p>test</p>\n' => '<p>test</p><p>test</p>\n',
     
?>

In any case, above test passes succesfully with patch applied.

Added. Pass fine.

5. Going through the old code, I think all 'replaced' logic is properly addressed via the tests, and by passing the tests the new logic proves to be sound.

There is no way that the old code pass all those tests. The old HTML corrector is broken in so many ways, and I really don't want to fix it.

AttachmentSizeStatusTest resultOperations
374441-refactor-html-corrector.patch7.71 KBIdleFailed: Failed to apply patch.View details

#17

I believe this is your editor misbehaving. The patch is clean UTF-8.

If I open the patch from d.o. in my browser, it's exhibiting the same behavior. I have to manually switch my browser to UTF-8 encoding to see the correct characters because d.o doesn't send encoding info with the patch. This doesn't have anything to do with the patch itself though, just something to take into account when saving/downloading the patch for review...

There is no way that the old code pass all those tests. The old HTML corrector is broken in so many ways, and I really don't want to fix it.

I understand. I just meant to say that whatever the old corrector claims to do, your new version does correctly (and has the test results to prove it). I haven't ran the new tests against the old corrector.

IMHO, with a change like this, there's bound to be some exotic edge cases where the old and new corrector produce (slightly) different results, but given the test results and the fact that core PHP functionality (DOMDocument) is used, I'm confident that this patch is an improvement over the current situation.

#18

I have to manually switch my browser to UTF-8 encoding to see the correct characters because d.o doesn't send encoding info with the patch.

Yes, this is a known issue.

IMHO, with a change like this, there's bound to be some exotic edge cases where the old and new corrector produce (slightly) different results, but given the test results and the fact that core PHP functionality (DOMDocument) is used, I'm confident that this patch is an improvement over the current situation.

Of course. Using PHP DOM is way better because the HTML parser of the DOM will not only automatically close dangling tags, but make proper valid XHTML. For example, it will automatically close tags that can only contain inline elements when there is a block level element in them:

<p>my test case<div>a block</div>the rest of the test case</p>

... will become:

<p>my test case</p><div>a block</div>the rest of the test case

This is a good thing at two levels: (1) we are guaranteed to produce valid (X)HTML, (2) we will have to be sure that other filters behave correctly, because the HTML corrector became unforgiving.

In this new version:

- I added several new DTD correctness tests.
- I moved from ->assertIdentical() to a fuzzy ->assertHtmlIdentical(), that is more fair to the old HTML corrector in accepting "<br/>" and "<br />" as identical

AttachmentSizeStatusTest resultOperations
374441-refactor-html-corrector.patch8.54 KBIdleFailed: Failed to apply patch.View details

#19

...and this change would fix the problem outlined in #372454: Filtered HTML filter modifies class attribute...

(verified by using the test '<pre class="brush: bash;"></pre>' => '<pre class="brush: bash;"/>',)

Crossing my fingers that this patch gets applied to HEAD asap...

#20

@mr.baileys: you tested this carefully, reviewed the code and suggested modifications. Why not RTBC-ing it now?

#21

Status:needs review» reviewed & tested by the community

I thought issue etiquette was to wait for 2 positive reviews for a RTBC? Let's try and see what happens :)

#22

- Do we know if this might have any performance implications? We all know how XML parsers can be. On certain pages (e.g. forum topics with lots of comments), this filter can be called a lot ...

- Code looks good. Good job on the code size reduction.

- If this improves the quality of the HTML filter, this might be worthy a CHANGELOG.txt item. I'll leave that up to Damien to decide.

#23

Interesting... I ran 4 quick performance tests (100000 iterations):
A. With patch (new code)
B. Without patch (old code)

1. just one div as input
2. the source for Dries's comment as input

and here are the results:
A1: 5.5 seconds
A2: 17.12 seconds

B1: 2.83 seconds (old code beats new)
B2: 30.72 seconds (new code is almost twice as fast as old code)

So it looks like the new code is faster, except when it comes to really small pieces of content. I guess the overhead of initialising the DOMDocument is to blame for this...

Anyway, probably no match for some real performance testing, but gives an indication...

#24

Status:reviewed & tested by the community» needs work

The last submitted patch failed testing.

#25

Added an entry in CHANGELOG.txt.

AttachmentSizeStatusTest resultOperations
374441-refactor-html-corrector.patch9.25 KBIdleFailed: 11879 passes, 2 fails, 0 exceptionsView details

#26

Very nice. I want.

+  // DOM can load HTML soup. But, HTML soup can throw warnings, suppress them.

I understand that.. But we really want less emotions in comments. ;)

+  // The "s" modifier makes "." match newlines too.

Let's not document PCRE basics.

+  @$htmlDom = DOMDocument::loadHTML('<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /></head><body>' . $text . '</body></html>');
...
+  if (preg_match("|^<body[^>]*>(.*)</body>$|s", $htmlDom->saveXML($bodyNode), $matches)) {

I don't see why we need a preg_match() here? We hard-code <body></body>, but we can't rely on it?

+    return "";

Single quotes.

+      'description' => t('Filter each filter individually: Convert line breaks, Correct broken HTML.'),

Lowercase "correct" after comma.

+      // Do not touch HTML comments.
+      '<!-- this is a comment -->' => '<!-- this is a comment -->',
+      'test <!-- comment -->' => 'test <!-- comment -->',

Can we add a test for "improper" comments as well? This:
<!--this is a comment-->

#27

Damien explained me why we need preg_match() here - the content could potentially contain a BODY tag.

We should test that as well though :)

#28

Status:needs work» needs review

Just refreshing the same patch.

AttachmentSizeStatusTest resultOperations
374441-refactor-html-corrector.patch9.25 KBIdleFailed: 11879 passes, 2 fails, 0 exceptionsView details

#29

Status:needs review» needs work

The last submitted patch failed testing.

#30

@damZ: why did you remove the createFormat() and deleteFormat() functions in the previous patches? If these are not used anywhere, they should be removed, but maybe in a different patch.

I've adapted the foreach listing of the tests to the current, more verbose syntax used in core. I do think the foreach approach is more readable and less verbose. any reason why it is not used in filter.test?

I've also moved the silent switch @ to the DOMDocument::loadHTML function.

It still does not passes all tests though.

AttachmentSizeStatusTest resultOperations
374441-refactor-html-corrector_9.patch10.1 KBIdleFailed: 11499 passes, 16 fails, 0 exceptionsView details

#31

Status:needs work» needs review
AttachmentSizeStatusTest resultOperations
374441-refactor-html-corrector.patch11.06 KBIdleFailed: 11501 passes, 14 fails, 0 exceptionsView details

#32

Status:needs review» needs work

The last submitted patch failed testing.

#33

Status:needs work» needs review

Changed the tests that assumed a faulty behavior of the HTML corrector.

AttachmentSizeStatusTest resultOperations
374441-refactor-html-corrector.patch12.03 KBIdleFailed: Failed to apply patch.View details

#34

Seems like sun's review was ignored

#35

Drupal strives to produce XHTML code, let's not break the XHTML compliant tests just for this patch to pass the tests. Instead, this patch adds the missing XHTML whitespace for empty elements in the HTML corrector filter and fix the non XHTML tests.

AttachmentSizeStatusTest resultOperations
374441-refactor-html-corrector_12.patch10.74 KBIdleFailed: 11502 passes, 12 fails, 0 exceptionsView details

#36

Status:needs review» needs work

The last submitted patch failed testing.

#37

Status:needs work» needs review

update the tests for the text_summary() test accordingly to comply with XHTML and the output of the HTML corrector.

AttachmentSizeStatusTest resultOperations
374441-refactor-html-corrector_13.patch11.57 KBIdleFailed: Failed to apply patch.View details

#38

This patch addresses sun's comments in #26.

in #27:

Damien explained me why we need preg_match() here - the content could potentially contain a BODY tag.

We should test that as well though :)

If I try with <div><body>test</body></div>, the body tags are removed and <div>test</div> is returned. does a test make sense here?

AttachmentSizeStatusTest resultOperations
374441-refactor-html-corrector_14.patch12.23 KBIdleFailed: Failed to apply patch.View details

#39

Works, too, but the more natural case would probably be that $text contains <html><head><title>Foo bar</title></head><body>Some content</body></html> already.

#40

@Dries:

- Do we know if this might have any performance implications? We all know how XML parsers can be. On certain pages (e.g. forum topics with lots of comments), this filter can be called a lot ...

The filtered content is cached, so the impact is minimized and would only be visible if the cache has just been cleared.

The whitespace added via preg_replace('|<([^>]*)/>|i', '<$1 />', $body_content) is to support the compatibility with the old HTML 4 browsers, see http://www.w3.org/TR/xhtml1/#C_2. The code would still be valid XHTML without it but I'm unsure how the older browsers would react. Since Drupal and the old HTML Corrector have been supporting it, we would need to make sure it does not break anywhere before removing it.

#41

The (old) HTML Corrector converts the valid XHTML <span class="test" /> to non valid <span class="test" /></span>. The new implementation in this patch fixes that. Adding a test for it.

AttachmentSizeStatusTest resultOperations
374441-refactor-html-corrector_15.patch11.81 KBIdleFailed: Failed to apply patch.View details

#42

Status:needs review» needs work

The last submitted patch failed testing.

#43

Status:needs work» needs review

rerolling the patch: Remove t() from getInfo

AttachmentSizeStatusTest resultOperations
374441-refactor-html-corrector_16.patch11.81 KBIdlePassed: 11817 passes, 0 fails, 0 exceptionsView details

#44

talked with tic2000 which pointed out a couple of related issues on the HTML corrector breaking comments, so I added more tests with complex comments and also removed one which was redundant. Also normalized the spelling of PHP 5 (with space).

AttachmentSizeStatusTest resultOperations
374441-refactor-html-corrector_17.patch12.32 KBIdlePassed: 11819 passes, 0 fails, 0 exceptionsView details

#45

Status:needs review» reviewed & tested by the community

The patch is working properly.
What I did notice while reviewing and testing this:
Using Filtered HTML input will remove the comments if there is a clean comment or even worse if you have something like <!-- comment <p>comment</p> --> which will output comment -->
Using Full HTML will not strip the comment, but because of the line brake filter it will output in source view

<p><!-- comment --><br /><!-- comment
<p>comment</p>
<p> --></p>

The second problem is fixed by the patch in #222926: HTML Corrector filter escapes HTML comments the part that changes the _filter_autop function.

But both of those problems are worst without the patch, not in the scope of this patch and should be treated in a follow-up issue so this one is RTBC.

#46

I assume this doesn't fubar the <!--break--> tag that we are using for node teasers? At first glance, there don't seem to be any tests for the break-tag elsewhere in the code so this might require some manual testing after applying the patch.

#47

@Dries
The html corrector in this patch doesn't touch comments which is a huge improvements over what we have now.
What removes comments is Filtered HTML input, and it does that before the patch too. <!-- is not an allowed tag.
What brakes comments with html tags inside is the xss filter and that also is not touched by this patched and is present in the current state of things.
To fix those problems we need a separate issue to address them. As I said in #45 the second part is already addressed in another issue and it will be an easy commit I think (if this one gets committed #222926: HTML Corrector filter escapes HTML comments becomes a one line patch).

#48

Status:reviewed & tested by the community» fixed

Alright, thanks for explaining that (again). Committed to CVS HEAD. Thanks!

#49

Status:fixed» closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.