Function needs tests - it's in mail.inc :)

CommentFileSizeAuthor
#164 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-164.patch32.32 KBpillarsdotnet
#247 drupal_html_to_text-improvements-299138-247.patch55.87 KBpillarsdotnet
#232 drupal_html_to_text-improvements-299138-232.patch50.12 KBpillarsdotnet
#246 drupal_html_to_text-improvements-299138-246.patch56.36 KBpillarsdotnet
#243 drupal_html_to_text-improvements-299138-243.patch55.86 KBpillarsdotnet
#239 drupal_html_to_text-improvements-299138-239.patch50.36 KBpillarsdotnet
#238 drupal_html_to_text-improvements-299138-238.patch50.18 KBpillarsdotnet
#237 drupal_html_to_text-improvements-299138-237.patch50.12 KBpillarsdotnet
#236 drupal_html_to_text-improvements-299138-236.patch50.12 KBpillarsdotnet
#234 drupal_html_to_text-improvements-299138-234.patch50.1 KBpillarsdotnet
#231 drupal_html_to_text-improvements-299138-231.patch50.17 KBpillarsdotnet
#230 drupal_html_to_text-improvements-299138-230.patch49.74 KBpillarsdotnet
#228 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-228.patch49.18 KBpillarsdotnet
#227 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-226.patch46.94 KBpillarsdotnet
#225 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-225.patch43.67 KBpillarsdotnet
#218 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-218.patch42.53 KBpillarsdotnet
#216 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-216.patch42.55 KBpillarsdotnet
#208 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-208.patch40.64 KBpillarsdotnet
#204 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-204.patch40.75 KBpillarsdotnet
#205 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-205.patch40.64 KBpillarsdotnet
#200 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-200.patch39.51 KBpillarsdotnet
#198 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-198.patch39.33 KBpillarsdotnet
#190 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-190.patch37.23 KBpillarsdotnet
#157 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-157.patch32.61 KBpillarsdotnet
#167 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-167.patch32.78 KBpillarsdotnet
#192 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-192.patch38.19 KBpillarsdotnet
#197 drupal_html_to_text-fix_broken_formatting_and_forget_about_the_tests-299138-197.patch10.11 KBpillarsdotnet
#196 drupal_html_to_text-fix_broken_formatting_and_forget_about_the_tests-299138-195.patch39.64 KBpillarsdotnet
#193 drupal_html_to_text-fix_broken_formatting_and_forget_about_the_tests-299138-193.patch9.93 KBpillarsdotnet
#184 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-184.patch37.19 KBpillarsdotnet
#185 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-185.patch37.2 KBpillarsdotnet
#182 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-182.patch35.79 KBpillarsdotnet
#177 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-177.patch34.52 KBpillarsdotnet
#174 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-174.patch34.52 KBpillarsdotnet
#171 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-171.patch34.51 KBpillarsdotnet
#165 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-164.patch32.09 KBstephandale
#166 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-166.patch32.78 KBpillarsdotnet
#155 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-155.patch32.63 KBpillarsdotnet
#154 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-154.patch32.47 KBpillarsdotnet
#153 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-153.patch32.86 KBpillarsdotnet
#152 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-152.patch33.27 KBpillarsdotnet
#148 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-148.patch31.07 KBpillarsdotnet
#147 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-147.patch31.06 KBpillarsdotnet
#145 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-145.patch30.9 KBpillarsdotnet
#143 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-143.patch30.89 KBpillarsdotnet
#141 drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-141.patch29.18 KBpillarsdotnet
#140 drupal_html_to_text-tests+fix-299138-136.patch26.05 KBstephandale
#135 drupal_html_to_text-test_only-299138-135.patch8.35 KBpillarsdotnet
#135 drupal_html_to_text-tests+fix-299138-135.patch22.17 KBpillarsdotnet
#133 drupal_html_to_text-test_only-299138-133.patch8.32 KBpillarsdotnet
#133 drupal_html_to_text-tests+fix-299138-133.patch22.28 KBpillarsdotnet
#134 drupal_html_to_text-test_only-299138-134.patch8.32 KBpillarsdotnet
#134 drupal_html_to_text-tests+fix-299138-134.patch21.62 KBpillarsdotnet
#130 drupal_html_to_text-test_only-299138-130.patch7.9 KBpillarsdotnet
#130 drupal_html_to_text-tests+fix-299138-130.patch19.76 KBpillarsdotnet
#121 drupal_html_to_text-tests+fix-299138-121.patch19.4 KBpillarsdotnet
#120 drupal_html_to_text-test_only-299138-120.patch7.93 KBpillarsdotnet
#120 drupal_html_to_text-tests+fix-299138-120.patch19.4 KBpillarsdotnet
#119 drupal_html_to_text-test_only-299138-119.patch7.99 KBpillarsdotnet
#119 drupal_html_to_text-tests+fix-299138-119.patch19.41 KBpillarsdotnet
#108 drupal_html_to_text-tests_only-299138-107.patch7.53 KBpillarsdotnet
#108 drupal_html_to_text-tests+fix-299138-107.patch16.97 KBpillarsdotnet
#106 drupal_html_to_text-tests_only-299138-106.patch7.53 KBpillarsdotnet
#106 drupal_html_to_text-tests+fix-299138-106.patch16.98 KBpillarsdotnet
#105 drupal_html_to_text-tests+fix-299138-105.patch16.95 KBpillarsdotnet
#101 drupal_html_to_text-tests_only-299138-101.patch7.51 KBpillarsdotnet
#101 drupal_html_to_text-tests+fix-299138-101.patch16.93 KBpillarsdotnet
#99 drupal_html_to_text-tests-299138-99.patch6.99 KBpillarsdotnet
#99 drupal_html_to_text-tests+fix-299138-99.patch14.9 KBpillarsdotnet
#75 drupal_html_to_text-299138-75.patch1.06 KBpillarsdotnet
#94 drupal_html_to_text-tests+fix-299138-94.patch14.73 KBpillarsdotnet
#92 drupal_html_to_text-tests+fix-299138-92.patch14.73 KBpillarsdotnet
#89 drupal_html_to_text-tests_only-299138-89.patch5.8 KBpillarsdotnet
#89 drupal_html_to_text-tests+fix-299138-89.patch10.02 KBpillarsdotnet
#84 drupal_html_to_text-299138-84.patch9.54 KBpillarsdotnet
#82 drupal_html_to_text-299138-82.patch15.71 KBpillarsdotnet
#81 drupal_html_to_text-299138-81.patch14.55 KBpillarsdotnet
#77 drupal_html_to_text-299138-77.patch14.57 KBpillarsdotnet
#78 drupal_html_to_text-299138-78.patch14.62 KBpillarsdotnet
#69 mail.inc_.patch585 bytespillarsdotnet
#66 mail.inc_.patch548 bytespillarsdotnet
#68 mail.inc_.patch545 bytespillarsdotnet
#57 drupal.html2text.57.patch12.23 KBsun
#52 issue_299138_mail.test.patch13.28 KBTiburón
#37 mail.test-299138-37.patch13.52 KBarjenk
#33 mail.test-299138-33.patch14.14 KBarjenk
#32 mail.test-299138-32.patch13.41 KBarjenk
#29 mail.test-299138-29.patch13.42 KBarjenk
#25 mail.test-299138-25.patch13.21 KBarjenk
#22 299138-drupal_html_to_text-test_1.patch13.02 KBjrglasgow
#21 299138-drupal_html_to_text-test.patch12.58 KBjrglasgow
#17 mail.test-299138-17.patch11.98 KBarjenk
#13 mail.test-299138-13.patch11.22 KBarjenk
#11 mail.test-299138-11.patch2.89 KBarjenk
#9 mail.test-299138-9.patch3.8 KBarjenk
#8 mail.inc-299138-8.patch1.51 KBKevin Hankens
#5 mail.test1.48 KBKevin Hankens
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Kevin Hankens’s picture

One quick question - sorry if this is in the wrong forum. If the $allowed_tags argument is empty, would you consider using the allowed tags list from the Filtered HTML input format settings? This would be more aligned with the bulk of the site content.

I use a lot of div tags in my content and wouldn't want to pass the entire list of allowed tags plus 'div' in an array each time I call drupal_html_to_text.

Kevin Hankens’s picture

Oh, I forgot to add that for me a div tag would signify a separate section, so I also wanted to see a case for it in the switch stanza that worked like the p tag.

Thanks for letting me share :)

Kevin Hankens’s picture

I haven't submitted any simpletests before, but I started working on one yesterday for this function. If you are interested, I would be happy to finish it up and send it this way for review. If so, here are a couple of quick questions.

  1. Would the testing be as simple as: a) checking that there are no HTML tags; and b) testing a piece of HTML code against an expected result - or can you foresee something more complex?
  2. There aren't any core tests under Mail, would this require a new Group, or grouped with another?

Thanks!

Kevin

catch’s picture

1. That'd probably be fine. Anything else can always be added to later, but the idea here is provide input, pass it through the function, verify expected output, not much more than that.
2. You'd need a new group and a new mail.test - you could copy modules/simpletest/tests/form.test (or some other .inc that already as tests) to get started quickly.

Also feel free to post patches that are still in progress if you think there's something worth looking at, even if it's not quite ready.

Kevin Hankens’s picture

FileSize
1.48 KB

Here's a proposed mail.test file. It's a little rough, but the tests worked, and I was able to make them fail by manipulating the output of drupal_html_to_text(). Here are some considerations:

  1. The line breaks in the expected output are ugly, but I wasn't able to replicate it using new line codes. Perhaps we could just use a simpler piece of HTML for testing.
  2. The regex expression for the HTML matching is something that I've had saved for quite some time, but I don't know the identity of the original author. If this is a problem, we'll need a Drupal coder to replace it with something of ours.
  3. I tried using assertNoPattern() instead of using preg_match with a boolean because it was cleaner, but I couldn't get it to work without it throwing a warning. It appeared to work, but I got the yellow warning bar saying "delimiters cannot be alpha numeric or a back slash"

Otherwise, take a look and see if it's something you can use!

webchick’s picture

Status: Active » Needs review

Marking needs review.

Kevin Hankens’s picture

On a related note, here are two things that I noticed in mail.inc that might warrant revision - hope it's ok to compile both into one patch!

  1. The first change would also allow single quotes in <a> tags to generate footnote URLs.
  2. The second creates a case for <div> tags which act like a <p> tag, since divs are usually a separate piece of info

Thanks!

[edit]Just found an error - will post revision soon[/edit]

Kevin Hankens’s picture

FileSize
1.51 KB

Apologies, I forgot to include div in the list of allowed tags.

One other note, would it make sense to include a blank line before paragraphs, divs, etc.?

arjenk’s picture

FileSize
3.8 KB

Well, i started reviewing the test patch, but ended writing it. Sorry about that. Some notes:

  • I solved the the not-so-nice-linebreaks with the use of a define LF
  • the tests are all defined in an array for readability;
  • Kevin, i'm not sure what to do with the div patch, since the issue is about the making of tests, i would say move it to another issue with the div patch alone?;
  • next step should probably be tests for the parameter $allowed_tags, or more for exceptional use, or combined use of tags, or can we accept the scope of the patch as is, considering the size of it?
drewish’s picture

Status: Needs review » Needs work

A few things:

  • Why are we using the setUp() function if there's only one test?
  • Comments need proper capitalization.
  • I don't think LF is an improvement, what's wrong with "\n"?
  • It's not clear to me what the second half of the test is doing... it seems like it's testing a separate aspect and would be better served as a separate test with a proper description.
arjenk’s picture

FileSize
2.89 KB

cleanup

drewish’s picture

Status: Needs work » Needs review

visually it looks a lot better. haven't tried running them.

arjenk’s picture

FileSize
11.22 KB

Thanks,

After searching Drupal for references to drupal_html_to_text i stumbled upon the initial patch for this function (#154218: Add HTML to text convertor for email (and wrap mails properly)) by Steven. Here he showed a sample conversion, which gave me the idea to add an more functional unit test to the patch. Pro: it tests the function in a more reallife (tm) use, Cons: i adds a lot of lines to the mail.test file. What do you think?

test results so far:

There is some strange behaviour with utf8 characters, and some trivial bugs in drupal_html_to():

  • A single <li>foo</li> gives a PHP:notice;
  • There is always an extra whiteline at the end;
  • There are trailing whitespaces in the output.

I will make a separate issue for these findings (if not too trivial), after this patch is committed ;-) ...

Please comment.

Status: Needs review » Needs work

The last submitted patch failed testing.

lilou’s picture

Status: Needs work » Needs review
catch’s picture

Category: bug » task
Status: Needs review » Needs work

There are redundant empty lines in the phpdoc. Also I think this could use some extra inline comments about what's being tested and how.

arjenk’s picture

Status: Needs work » Needs review
FileSize
11.98 KB

Tried to improve comments and generated output.

catch’s picture

Component: tests » base system
Issue tags: +Needs tests
cwgordon7’s picture

It probably doesn't matter, but since this is a test case and should try to be as near the actual case as possible, shouldn't valid HTML be used? For example, the line <ul><li>Drupal<li>Drupal</ul> should probably be <ul><li>Drupal</li><li>Drupal</li></ul>, unless there is a reason for the invalid htm — perhaps you were trying to test the function's handling of such a case?

Status: Needs review » Needs work

The last submitted patch failed testing.

jrglasgow’s picture

Status: Needs work » Needs review
FileSize
12.58 KB

I added the tests suggested by cwgordon7 in #19 in additional to the tests already there.

jrglasgow’s picture

I fixed some alignment problem based on the recommendation of cwgordon7

cwgordon7’s picture

Status: Needs review » Reviewed & tested by the community

Code style is good, and all tests passed, so rtbc. :)

sun’s picture

Status: Reviewed & tested by the community » Needs work
+      'description'  => t('Test mail.inc function drupal_html_to_text().'),

Spot the double spacing here.

+  function testDrupalHtmlToText() {
+
+    $html_to_text_cases = array(
...
+  function testDrupalHtmlToTextArgs() {
+

Please remove the blank line.

+      '<blockquote>Drupal</blockquote>'                 => '>Drupal' . "\n",

Shouldn't it actually be '> Drupal' ?

+      '<ol><li>Drupal<li>Drupal</ol>'                   => ' 1) Drupal' . "\n" . ' 2) Drupal' . "\n\n",
+      '<ol><li>Drupal</li><li>Drupal</li></ol>'         => ' 1) Drupal' . "\n" . ' 2) Drupal' . "\n\n",

Could we ammend the first line with a comment that states that we're explicitly testing malformed HTML?
Also, there is no test for an UL in an OL and vice-versa.

+      '<h2>Drupal</h2>'                                 => '-------- DRUPAL ' . str_repeat("-", 62). "\n\n",

Concatentation mistake here.

arjenk’s picture

Status: Needs work » Needs review
FileSize
13.21 KB

More code cleanup, coder gives no errors anymore. Also i changed the test to DrupalUnitTestCase for speedup of the test (see #464714: Speed up the tests).

'<blockquote>Drupal</blockquote>'                 => '>Drupal' . "\n",

True, '> Drupal' makes more sense, lets fix this after the unittest gets in. I also added a test with a nested <ul> and <ol>, and visa-versa.

stella’s picture

Status: Needs review » Needs work

Minor fixes required:
There are a number of trailing spaces in the patch. Also should 'Drupal<br>Drupal' not really be 'Drupal<br />Drupal' inline with Drupal standards, or is that <br> on purpose as part of the test?

Also not sure about the assertion message text below, it could use some simplification:

$this->assertEqual($result, $expected_result, t('Test output of drupal_html_to_text(Long html mail with all sorts of html tags.)'));

Ditto with the assertion messages for the assertions in testDrupalHtmlToTextArgs(). I'd rather English sentences explaining the functionality that passed/failed, rather than a piece of code. And how does the 3rd assertion here differ in functionality from the 1st? They both appear to be testing that the array() overrides the default html tags allowed.

Other than those few things, patch is looking good, great work!

sun’s picture

Also should 'Drupal
Drupal' not really be 'Drupal
Drupal' inline with Drupal standards, or is that
on purpose as part of the test?

Good catch! The answer is: both. We need to test both, since the handling of <br> and <br /> may break. Whereas the former should go into the "malformed" array section.

The other remarks from stella I agree with.

This is a wonderful & great patch! Let's get it done! :)

sun’s picture

Ah well. And that was a nice test for HTML Filter...

arjenk’s picture

FileSize
13.42 KB

Thanks for the reviews!

  • ah, those trailing spaces; believe it on not, they are on purpose. Those are generated by drupal_html_to_text(). Removing them would cause this test to fail. This is added as a comment in the
    code.
  • Added separate tests for the <br> and <br /> tags.
  • Tried to improve the test messages to more English sentences; t('Test of drupal_html_to_text() with a long html mail.'). Could use a suggestion here, if this is not what you were looking for.
  • The idea of the third test of testDrupalHtmlToTextArgs() was to test if passing a second parameter would not break the existing funcionality. I combined this now with passing two tags, so it makes somewhat more sense (...)
arjenk’s picture

Status: Needs work » Needs review
sun’s picture

Status: Needs review » Needs work

Now that I look at it again:

+class MailIncTestCase extends DrupalUnitTestCase {

I think we should call this DrupalHtmlToTextTestCase.

+      'description' => t('Test mail.inc function drupal_html_to_text().'),

Likewise, mail.inc function can be removed here.

+      $this->assertEqual($result,  $expected_text, t('drupal_html_to_text(@func_call ...)', array('@func_call' => drupal_substr($html_string, 0, 15))));

Duplicate space before $expected_text.

+   * Tests html_to_text() with a full mail example.

s/Tests/Test/ - PHPDoc summaries never use the third form.

+    // This is the html code to test.

HTML we want to write uppercase here.

+    /*
+     * The expected result after html_to_text($html).
+     * Do not remove the trailing whitespaces, as these are generated, and
+     * will otherwise fail this test.
+     */

That's a wrong notation for inline comments. We only use multiline comments for PHPDoc, just use // here.
Also, I believe the function name is drupal_html_to_text(), no?
Can we simplify that to Trailing white-space is required to match the output of drupal_html_to_text(). ?

+   * Test $allowed_tags argument of html_to_text().

One more instance of html_to_text() here.

arjenk’s picture

Status: Needs work » Needs review
FileSize
13.41 KB

re-roll with sun's comments.

arjenk’s picture

FileSize
14.14 KB

Some small updates according to the latest test guidelines [http://drupal.org/node/325974]:

The getInfo is now a public static, and the descriptions use a more active tense now. Also changed mail to e-mail, and spotted a missed lowercase html. Last: added an mail.test entry to simpletest.info, otherwise it wont
show up in the testlist.

kscheirer’s picture

will review this when I get a chance

kscheirer’s picture

Status: Needs review » Needs work

All the tests pass, this is a great addition!

just a few quibbles:

  • Need a newline at the end of modules/simpletest/tests/mail.test to follow coding standards
  • You have two tests that report with the message "Tests drupal_html_to_text() basic working of the parameter $allowed_tags." - perhaps the second one could explain how it differs from the first?
  • When reporting the function and parameters used, why truncate at 15 characters? Most of the time the test text will not be seen by anyone, but when one fails it will be useful to see the complete test performed.
lilou’s picture

Now, t() is removed in getinfo() : #500866: [META] remove t() from assert message

arjenk’s picture

Status: Needs work » Needs review
FileSize
13.52 KB

# Need a newline at the end of modules/simpletest/tests/mail.test to follow coding standards

changed.

# You have two tests that report with the message "Tests drupal_html_to_text() basic working of the parameter $allowed_tags." - perhaps the second one could explain how it differs from the first?

how about: "Tests drupal_html_to_text() with tags which are not given by the parameter $allowed_tags.". Better suggestions are welcome!

# When reporting the function and parameters used, why truncate at 15 characters? Most of the time the test text will not be seen by anyone, but when one fails it will be useful to see the complete test performed.

ok, looks better indeed. Also removed the t() there as t('drupal_html_to_text()') makes no sense.

#Now, t() is removed in getinfo()

done.

Status: Needs review » Needs work

The last submitted patch failed testing.

sun.core’s picture

Priority: Critical » Normal

Tests don't qualify as critical.

sreynen’s picture

These tests seem to be failing due to a bug in the actual function. That is, I think the tests are working as intended, and catching actual errors. Specifically, line breaks aren't be added as expected in drupal_html_to_text(). It's not clear to me whether it's really necessary to have line breaks at the end of many of the examples, but there definitely needs to be a line break in place of an actual page break (e.g. "Drupal<br />Drupal", and there isn't currently.

So I think the function needs to be updated to pass the tests, but I'm not very familiar with the function. Maybe someone more familiar with the function can confirm this?

gowriabhaya’s picture

Title: TestingParty08: drupal_html_to_text() » Tests for drupal_html_to_text()
Issue tags: +TestingPartySF

Code sprint tag

MichaelCole’s picture

Working this today. Will fix test and code, and resubmit patch.

MichaelCole’s picture

Giving up. Catch was unavailable on IRC.

Problem is how newlines are tested in the tests.

Eric_A’s picture

Eric_A’s picture

The testbot is happy in #508738: drupal_html_to_text negative padding error. Please review! (This issue depends on that one.)

Eric_A’s picture

Status: Needs work » Needs review
Issue tags: -Needs tests, -TestingPartySF

#37: mail.test-299138-37.patch queued for re-testing.

Status: Needs review » Needs work
Issue tags: +Needs tests, +TestingPartySF

The last submitted patch, mail.test-299138-37.patch, failed testing.

arjenk’s picture

Patch 1.26 of mail.inc (http://drupalcode.org/viewvc/drupal/drupal/includes/mail.inc?r1=1.25&r2=...) broke the test by doing:

@@ -466,7 +496,7 @@
         $chunk = $casing($chunk);
       }
       // Format it and apply the current indentation.
-      $output .= drupal_wrap_mail($chunk, implode('', $indent)) . "\n";
+      $output .= drupal_wrap_mail($chunk, implode('', $indent));
       // Remove non-quotation markers from indentation.
       $indent = array_map('_drupal_html_to_text_clean', $indent);
     }

The commit message was "- Patch #356074 by chx, Damien Tournoud: provide a sequences API." Which does not make much sense to me for this change... Perhaps the idea was to replace \n by MAIL_LINE_ENDINGS ?

sun’s picture

Just remove "\n" from all the assertions. As you can see, the input text of those assertions does not contain a newline or BR, so the outlined change in #48 was a fix to not insert newlines, which do not actually exist; therefore making newlines match between input and output.

kscheirer’s picture

Title: Tests for drupal_html_to_text() » Fix and test drupal_html_to_text()
Category: task » bug

Man, this took a while to track down. First of all, Dries mixed up his commit messages! Arjenk, the issue for that patch is really #407452: Remove use of drupal_html_to_text() from system mails. Which was basically sun's patch.

sun - what wasn't obvious (because we had no tests for this function!) was that many of the tags were relying on that extra \n to format themselves properly. Stuff like h1, h2, dl, br, p, ol, ul, li, dl, dt, dd, etc. I think your patch was trying to make this function more general - but it doesn't seem to fulfill its original purpose now. Also _drupal_html_to_text_pad() is doing a substr, relying on the fact that there was an extra \n to throw away.

So right now drupal_html_to_text('Drupal<br />Drupal') returns DrupalDrupal, which is probably not what we want. Some of these can be fixed by changing
$chunk = ''; // Ensure blank new-line. to $chunk = "\n"; // Ensure blank new-line. .

But that doesn't take care of all the cases - like '<ul><li>Drupal</li><li><ol><li>Drupal</li><li>Drupal</li></ol></li></ul>' - what text should that result in? Reading RFC 3676 did not make it clear either.

Arjenk - you wrote many of the test cases way back, based on #154218: Add HTML to text convertor for email (and wrap mails properly) - which included some examples that aren't available online anymore. It now looks like we have broken tests for a broken function. This may be a case where we write the test cases first, and then have the function match them - anyone up for it?

Tiburón’s picture

Assigned: Unassigned » Tiburón

Assigning to me.

(DrupalCon CPH 2010 Core Dev Summit)

Tiburón’s picture

FileSize
13.28 KB

The attached patch (based on #37) fixes the expected result of five tests of which four currently parses locally.

The tests for em, i, strong, b and br incorrectly expected a newline at the end of the result with the given input.

The test for br still fails as br is broken in drupal_html_to_text() (allowed but not handled).

In includes/mail.inc the functions drupal_html_to_text() and _drupal_html_to_text_pad() both needs to be fixed.

arjenk’s picture

Wait, i'm puzzled here

First, kscheirer gave a excellent summary. And i agree with his
conclusion that the function is broken, but i'm not sure we can fix it
in the test.

The function is called drupal_html_to_text(), it is (was) a function
that was capable of formatting html to readable, formatted (hear the
newlines here?) text. It did need to add newlines in order to make
something out of the orginal html. To not allow the function to add
newlines, is to break the use of the function. So that is the status
of the current function: broken.

to give an example to illustrate:

[user:name],

<p>Thank you for registering at <a href="[site:url]">[site:name]</a>. Your application for an account is currently pending approval. </br>Once it has been approved, you will receive another e-mail containing information about 

<ul>
<li>how to log in,
<li>set your password, 
<li>and other details.
</p>

--  [site:name] team

(this is not the real text from core) gave the following text email:

test,

Thank you for registering at localhost [1]. Your application for an account
is currently pending approval. Once it has been approved, you will receive
another e-mail containing information about

  * how to log in,
  * set your password,
  * and other details.


    --  localhost team


[1] http://localhost/drupal7/

but now:

test,

Thank you for registering at localhost [1]. Your application for an account
is currently pending approval. Once it has been approved, you will receive
another e-mail containing information about * how to log in, * set your  
password, * and other details.


    --  localhost team
[1] http://localhost/drupal7/

The options are, to my opinion:

  • If it really is a problem to change the number of newlines, remove
    the function from core, and change it with strip_tags, or
    likewise. Which effectively removes all tags in mails.
  • Make a limited drupal_html_to_text() that only accepts (em, i,
    strong, b) and removes the others. Perhaps also change the name as
    well.
  • Agree that the fix in #407452: Remove use of drupal_html_to_text() from system mails broke the function, and go back to
    there and try to fix it in another way. Perhaps make a test first if
    we are dealing with a html text, and then call the function, and
    else do not...

(EDIT: fixed the correct issue number)

sun’s picture

Status: Needs work » Needs review

Last option 3) in #53 is not possible. I tried to track and keep up with this issue, but didn't manage to do so. I'm not sure whether I understand #50. I like the tests added in #53, but didn't verify in detail whether they are valid. Overall, drupal_html_to_text() implements quite complex logic, so I'm pretty sure that it is possible to achieve any newline behavior we can think of. We should start by making sure that our test contains all expectations. Afterwards fix the logic.

Status: Needs review » Needs work

The last submitted patch, issue_299138_mail.test.patch, failed testing.

kscheirer’s picture

We should start by making sure that our test contains all expectations.

Agreed, we're back to square one.

  • Define what exactly this function is supposed to do
  • Write tests for it
  • Fix the function to pass tests

Need to track down when this function came into Drupal and see what its original purpose was?

@sun: The short version of #50 is - your patch in #407452: Remove use of drupal_html_to_text() from system mails broke this function (or at least, it doesn't do what it did before that patch). There were no tests for this function in core to catch that. The old function may not have been correct, but the current version definitely is not.

sun’s picture

Status: Needs work » Needs review
FileSize
12.23 KB

I've slightly fixed the tests (which also means that expectations might be slightly changed).

The tests added here should definitely not try to be "compatible" with the current output of drupal_html_to_text(), but rather assert our expectations.

Our expectations are pretty clear: A proper conversion from HTML to plain text. If you happen to get e-mail notifications from drupal.org, then you may know most details of the expected format already.

Test results are clear, the current function is horribly broken. We should have another round to make sure that the expectations are right, then start to actually fix the bugs.

sun’s picture

Title: Fix and test drupal_html_to_text() » drupal_html_to_text() is broken
Component: base system » mail system
Assigned: Tiburón » sun

Feedback welcome.

Status: Needs review » Needs work

The last submitted patch, drupal.html2text.57.patch, failed testing.

arjenk’s picture

It would also help to add a test for the issue which led to the troublesome patch. Perhaps it had something to do with wordwrap and adding a newline on long lines? So how about adding the test:

str_repeat('Drupal ', 15) => str_repeat('Drupal ', 11) . "\n" . str_repeat('Drupal ', 4)

arjenk’s picture

rfc3676 :

A generating agent SHOULD:

   o  Ensure all lines (fixed and flowed) are 78 characters or fewer in
      length, counting any trailing space as well as a space added as
      stuffing, but not counting the CRLF, unless a word by itself
      exceeds 78 characters.

   o  Trim spaces before user-inserted hard line breaks.

among others, we should also add some tests to test if this works properly.

sun’s picture

Title: drupal_html_to_text() is broken » drupal_html_to_text() formatting is broken and does not have tests
Priority: Normal » Major

ok, but what about the existing expectations/assertions in the patch? I need a confirmation that they are correct, before I'll continue with this patch.

sun.core’s picture

Status: Needs work » Postponed (maintainer needs more info)

Either we have serious problems here or we don't. Feedback on the added tests would be helpful to move on here (and fix the actual bugs).

kscheirer’s picture

Well considering we don't even have enough information to write tests, I think we have a serious problem here. No one posting in this issue queue seems to know what the tests should look like. Someone has to know what this function is supposed to do, can we find the person that committed it?

sun’s picture

Status: Postponed (maintainer needs more info) » Needs work

We already have tests in the latest patch, the current task is to confirm (or object to) the expectations in the tests.

Only with precisely tested expectations, we can fix drupal_html_to_text().

pillarsdotnet’s picture

Status: Needs review » Needs work
FileSize
548 bytes

Patch to mail.inc restoring functionality in converting lists.

pillarsdotnet’s picture

Status: Needs work » Needs review
pillarsdotnet’s picture

FileSize
545 bytes

That failed. Let's try just reversing the commit that caused the current brokenness.

pillarsdotnet’s picture

Status: Needs work » Needs review
FileSize
585 bytes

I wish the testbot would make up its mind as to whether it prefers "-p0" or "-p1" in its patches...

EDIT: Now I see -- earlier patch was hand-edited and I deleted a couple lines i shouldn't have.

pillarsdotnet’s picture

Note: all tests passed.

Pancho’s picture

Status: Needs review » Needs work

This bug renders the Simplenews module unusable and blocks further progress towards a D7 release there.
Please commit what you have there. #69 fixes the most urgent problem with drupal_html_to_text(). Further tests are a good idea but shouldn't block fixing a major bug. Thanks.

pillarsdotnet’s picture

Title: Fix the broken formatting in drupal_html_to_text() and also provide tests » drupal_html_to_text() formatting is broken and does not have tests
Version: 8.x-dev » 7.x-dev
Assigned: Unassigned » sun
Status: Needs review » Needs work
Issue tags: -Needs backport to D6, -Needs backport to D7 +Needs tests, +TestingPartySF
Pancho’s picture

Status: Needs work » Needs review

@pillarsdotnet: Please reconsider your attitude. (You're not talking to a script kiddie, and even then you don't need to assume it was bullshit.)

Actually, I'm using Simplenews on D7 just as well to deliver our newsletter every two or three days.
So "unusable and blocking further progress" might in fact be a bit overstated. Correct would have been: "out of the box it is unusable, and because the bug is to be solved upstream, it paralyzes further development, plus: because admins don't get it work correctly, they refrain from further testing".

I thought to be specific enough, because it is obvious enough what patch #69 does. However to be even more detailed:
After applying #69 from this queue, drupal_html_to_text() stopped stripping off newlines from my simplemail newsletters (see #1099188: simplenews_html_to_text() strips off newlines).

Now, how come you're obviously not affected by the bug?
You might be using a different backend, such as Mime Mail, bypassing drupal_html_to_text().

Now the title of this major (!) issue here states "drupal_html_to_text() formatting is broken", so it doesn't mainly refer to adding tests. So what I'd like you to consider is fixing "drupal_html_to_text()" asap based on the tests we have now, even if the tests are not 100% complete. Thank you.

Simon Georges’s picture

Closing #1099188: simplenews_html_to_text() strips off newlines as a duplicate of this one, and subscribing.
If I may be of any help to help solve this issue, I'm clearly willing to help, to stop people raising this in Simplenews issue queue ;-).

pillarsdotnet’s picture

Assigned: Unassigned » sun
Category: feature » bug
FileSize
1.06 KB

Bumping, tagging, and re-rolling according to standard policy.

@sun: I still haven't mastered simpletest, but a minimum test requirement should be that:

  • Chunks of text located within separate block-level HTML tags must be separated by (at least) a newline in the plaintext version.
  • Chunks of text separated by a <br /> or <hr /> tag must be separated by (at least) a newline in the plaintext version.
sun’s picture

Version: 7.x-dev » 8.x-dev
Status: Needs review » Needs work
Issue tags: -TestingPartySF +Needs backport to D7

It would tremendously help if people would re-roll the proper patch. See #57.

pillarsdotnet’s picture

@sun -- Obviously we want to do more, but at least the single-line change in #75 should be committed ASAP, because it fixes an obvious regression.

Added patch from #57, part of which had already been applied.

Also added my suggested test from #75 above.

pillarsdotnet’s picture

Status: Needs review » Needs work
FileSize
14.62 KB

Sorry; forgot the second argument to assertFalse().

pillarsdotnet’s picture

Status: Needs work » Needs review

The last submitted patch, drupal_html_to_text-299138-78.patch, failed testing.

pillarsdotnet’s picture

Status: Needs work » Needs review
FileSize
14.55 KB

Trying again...

pillarsdotnet’s picture

As long as I'm doing major surgery to drupal_html_to_text(), I might as well merge in #345931: Support value attribute in ordered lists.

Status: Needs review » Needs work

The last submitted patch, drupal_html_to_text-299138-82.patch, failed testing.

pillarsdotnet’s picture

Status: Needs work » Needs review
FileSize
9.54 KB

Removed the "Full HTML Document" test because:

  1. I couldn't get it to pass no matter what I did.
  2. I think it's going about things the wrong way.
pillarsdotnet’s picture

Assigned: sun » pillarsdotnet
sun’s picture

Status: Needs review » Needs work
  1. Removing a test that has been added for a good reason is a bad idea. The stated reasons are, in fact, critical reasons to keep it.
  2. Merging in other feature requests will do nothing else than to slow down progress on this issue even more.
pillarsdotnet’s picture

Category: feature » bug
Status: Needs review » Needs work

@sun -- You fix it, then. This whole issue is a "feature request". The only part that was a bugfix was the patch in #75 which you so offhandedly rejected.

EDIT: Opened #1130198: Regression: line-breaks are mangled by \Drupal\Core\Utility\Mail::htmlToText()

pillarsdotnet’s picture

Assigned: pillarsdotnet » Unassigned
Category: bug » feature
pillarsdotnet’s picture

Status: Needs work » Needs review
FileSize
10.02 KB
5.8 KB

To justify a possible RTBC by someone other than @sun:

pillarsdotnet’s picture

Category: bug » feature
Status: Needs work » Needs review

Tests-only: failed (Good!)

Tests+fix: passed (Good!)

BeatnikDude’s picture

subscribe

pillarsdotnet’s picture

Replaced all "\n" with MAIL_LINE_ENDINGS for consistency. May possibly fail tests now, as tests assume: (MAIL_LINE_ENDINGS === "\n")

This begs the question: Should we write separate tests for each possible value of MAIL_LINE_ENDINGS?

Status: Needs review » Needs work

The last submitted patch, drupal_html_to_text-tests+fix-299138-92.patch, failed testing.

pillarsdotnet’s picture

Status: Needs work » Needs review
FileSize
14.73 KB

Fixed typo.

Simon Georges’s picture

Is there still something missing here ?
I'm really looking forward to having it in the 7.1, since we can't release Simplenews without it.

pillarsdotnet’s picture

You can help by posting a review of the patch in #94. An acceptable review would include the following:

  • Confirm that all code and comments comply with Drupal coding standards.

  • Confirm that the tests included with the patch are necessary and sufficient.

  • Confirm that the patch actually solves the problem. Describe a reproducible test where the current code fails and the patched code succeeds.

  • Change the issue status from "Needs Review" to "Reviewed and Tested By the Community".

If, after thorough testing, you cannot in good conscience affirm all of the above, please explain what is lacking in the current patch and I will do my best to address your concerns.

Lars Toomre’s picture

@pillarsdotnet:

I spent some time reading through the patch file tonight and the new tests. As a whole, the new patch and tests look good. However, I would suggest that you expand the tests to include some additional cases:

1) Multiple good tags override
$text = '<ul><li>This is a <em>complex</em> text string.</li></ul>';
$result = drupal_html_to_text($text, array('em', 'ul'));

2) Purposefully mal-formed tags in string test (from user input)
$text = '<ul><li>This is a <em>complex</em> text string.</ul></ul>';
$result = drupal_html_to_text($text, array('li','em'));
should be => ' * This is a *complex* text string.'

3) Ignorant user example of highlighting
$text = "This is <B>highlighted</B> text.';
$result = drupal_html_to_text($text, array('B'));
should be => 'This is *highlighted* text.'

Cases 2) and 3) I have experienced in the last few months from users hand-entered node bodies and/or comments. These are examples of the edge cases that I think this particular function needs to test for. Users are notorious for what they might enter without trying to be malicious.

I am unsure whether this function is intended to be used with utf-8 strings. If it is, I would want to add some text strings that include characters outside of the usual ASCII range like the European special vowels and non-Roman language sets.

Finally, I can easily see this function used as part of a larger webpage data extraction function. That typically involves using something like cURL to retrieve a webpage as $text and then start stripping out the tags to get to whatever information is desired. Hence, I think that we also want a test that takes a mocked up web page and demonstrate that this function handles that "complex case" as well.

Edit: Two further thoughts/questions:

1) What happens if 'html' is passed in as the second parameter?
2) Looking at #732992: Allow filter_xss_admin() to accept HTML5 tags, I am wondering if the permissible html tags acceptable as the second variable should be constrained to the acceptable terms used in filter_xss()? If so, how are we going to deal with what is acceptable for HTML5 (being dealt with in that issue)?

pillarsdotnet’s picture

Assigned: sun » pillarsdotnet
Category: bug » feature
Status: Needs review » Needs work

@Lars

I don't understand the significance of testing 'This is <B>highlighted</B> text.'.

Did you mean that the <b> tag should be supported?
It already is, and there is a test to prove it.

 

Did you mean that uppercase tags should be supported as well as lowercase tags?
The tag matching is already case-insensitive, but I suppose we should add a test to prove it.

 

pillarsdotnet’s picture

Assigned: pillarsdotnet » Unassigned
Status: Needs work » Needs review
FileSize
14.9 KB
6.99 KB

Okay, I've added all the tests that Lars asked for, except for the utf-8 ones. I really don't know what to do about those and would welcome any suggestions.

@further thoughts/questions:

1) What happens if 'html' is passed in as the second parameter?

Nothing. I added a test to prove it. Unsupported tags are stripped and ignored.

2) Looking at #732992: Change filter_xss_admin() to take a variable by default instead of a hardcoded list., I am wondering if the permissible html tags acceptable as the second variable should be constrained to the acceptable terms used in filter_xss()? If so, how are we going to deal with what is acceptable for HTML5 (being dealt with in that issue)?

Same as the previous. Passing unsupported tags into the second parameter will not affect the output, unless someone rewrites the drupal_html_to_text function.

If I get *real* ambitious, I might rewrite the entire function to make it easier to add new tags in the future, but right now I'm just looking for short-term results.

Lars Toomre’s picture

Sorry if my point was not clear enough... I was trying to highlight the <B> case as in capitalized. I know that well formatted XHTML wants all tags in all lower-case, but users have the funny tendency to want to use both <B> and <b>. I thought a test to prove that we could accept either was appropriate.

OK... I have now read through all of the function changes and they make sense to me. I have not installed this patch though so I cannot say that it installs or runs as expected.

The added comments in the code look good, with full sentences ending in periods. I did not open up the patch in a text editor so I am unable to confirm that the comments do not extend beyond line 80. I also was unable to check for extra white spaces at the end of lines. There is one phrase 'see #345931: Support value attribute in ordered lists.' which I am unsure about in a comment. I am pretty sure we are not suppose to reference d.o. issues in comments, except in special cases. Is this the format that a special case should be in?

From reading the code and the tests, this patch appears to solve the problem and passes appropriate tests. I am not prepared to RTBC this however since I cannot confirm that the patch actually applies. Other steps to complete the review of this patch include:

1) A couple of people to also review this patch and make sure that it applies appropriately.
2) Add a couple of utf-8 strings with HTML tags embedded therein to make sure that the regex expressions do not degrade for instance an o with omlats to an ascii o.

Perhaps someone who does not speak English as a primary language could make some suggestions for item #2??

pillarsdotnet’s picture

I did not open up the patch in a text editor so I am unable to confirm that the comments do not extend beyond line 80.

The comments don't. Some of the code does, a little.

I was trying to highlight the <B> case as in capitalized

See the very last test. It compares drupal_strtoupper(drupal_html_to_text($input)) with drupal_html_to_text(drupal_strtoupper($input)) where $input is a fairly long mix of tags and text.

There is one phrase 'see #345931: Support value attribute in ordered lists.' which I am unsure about in a comment.

You are correct. I have removed the references.

Now running coder review... done. Fixed problems identified by coder.

pillarsdotnet’s picture

Issue tags: +i18n

Responding to #97 and also to sun's comment in #86:

I think that we also want a test that takes a mocked up web page and demonstrate that this function handles that "complex case" as well.

First of all, this function is not supposed to do as good a job as a dedicated text-mode browser such as Lynx, links, or w3m. If you want a general data extractor, write a module. This is for sending plain-text email messages. That's why it's located in includes/mail.inc instead of modules/filter.

Secondly, producing such a test as you describe basically means taking a big chunk of html, running it through the function, and copying the output to your $expected variable. What does that prove? Only that any future improvements to the function must also be accompanied by a change this huge test. And very likely, the future coders will likewise run the html through the function and retroactively "expect" whatever comes out the other end. I think this is a poor testing methodology. That's what I meant when I said in #84 above, "I think it's going about things the wrong way."

Thirdly, I *have* run web-pages through the original function (as it existed in d6 or in #1130198) and I can testify that the resultant output is acceptably readable. I have about twenty people on my Simplenews mailing list who must receive text-only messages, and they're doing just fine.

Lars Toomre’s picture

OK... Glad to know that you had run a larger more complex text string (webpage) through this function. Skip the thought since you already demonstrated that 'html' tag cannot be 'slipped in' as an exception tag. My only other concern was what would happen if it encountered 'mal formed' HTMl code.

Glad that you ran the code through coder and corrected anything that came from that. Now if we can get the two items I raised previously in #100, we should be good to go!!

P.S. - I have code in contrib module that mimics this function and another that just eliminates one HTML tag. I am curious to test out this function as a possible replacement/enhancement there.

pillarsdotnet’s picture

My only other concern was what would happen if it encountered 'mal formed' HTMl code.

I already added the "malformed HTML" tests that you suggested. If you think those tests are insufficient, please suggest additional ones.

From the patch:

      // Tests malformed HTML tags.
      '<br>Drupal<br>Drupal' => "Drupal\nDrupal",
      '<hr>Drupal<hr>Drupal' => str_repeat('-', 78) . "\nDrupal\n" . str_repeat('-', 78) . "\nDrupal",
      '<ol><li>Drupal<li>Drupal</ol>' => " 1) Drupal\n 2) Drupal\n",
      '<ul><li>Drupal <em>Drupal</em> Drupal</ul></ul>' => " * Drupal /Drupal/ Drupal\n",
      '<ul><li>Drupal<li>Drupal</ol>' => " * Drupal\n * Drupal\n",
      '<ul><li>Drupal<li>Drupal</ul>' => " * Drupal\n * Drupal\n",
      '<ul>Drupal</ul>' => "Drupal\n",
      'Drupal</ul></ol></dl>' => "Drupal",
      // Tests some unsupported HTML tags.
      '<html>Drupal</html>' => "Drupal",
      '<script type="text/javascript">Drupal</script>' => "Drupal",
pillarsdotnet’s picture

Found a bug and fixed it. Dunno how to test for it, but calling drupal_html_to_text('<li>Drupal</li>') would produce a PHP warning.

pillarsdotnet’s picture

Status: Needs review » Needs work

The last submitted patch, drupal_html_to_text-tests+fix-299138-106.patch, failed testing.

pillarsdotnet’s picture

Must have brushed against the middle mouse button while editing.

pillarsdotnet’s picture

Status: Needs work » Needs review
pillarsdotnet’s picture

Title: Fix the broken formatting in drupal_html_to_text() and also provide tests » drupal_html_to_text() formatting is broken and does not have tests
Category: bug » feature
Issue tags: -Needs backport to D6 +Needs tests
kscheirer’s picture

don't give up! You're trying to get drupal_html_to_text-tests+fix-299138-107.patch in right? I'll take another look as soon as I can, and thanks for picking this up.

pillarsdotnet’s picture

Included a patched copy of drupal_html_to_text() in the Mail System 6.x-2.3 release to resolve the bug reported here:

#1135978: Mail System format() method uses drupal_html_to_text() which strips linefeeds from input.

It appears that this bugfix is also applicable to D6. Tagging accordingly.

Also removing the "Needs tests" tag because I honestly believe the tests in this patch are sufficient. If anyone disagrees, please do so constructively. That is, don't just complain -- provide a suggestion for the test you think is missing.

pillarsdotnet’s picture

Title: drupal_html_to_text() formatting is broken and does not have tests » Fix the broken formatting in drupal_html_to_text() and also provide tests

More constructive issue title.

Heine’s picture

D6 behaviour is by design and correct. Only pass real HTML to the function. Additional whitespace will be collapsed, see #298708-12: drupal_html_to_text() removes line endings.

pillarsdotnet’s picture

@Heine -- Are you saying that the DefaultMailSystem::format() method is broken because it blindly applies drupal_html_to_text() to all input?

If so, have you submitted a bug report?

Or perhaps you are claiming that the d6 and d7 versions of drupal_html_to_text() are somehow completely different despite sharing identical doxygen headers?

If so, have you submitted a bug report?

Heine’s picture

I was under the impression that this is the bug report dealing with drupal_html_to_text(). The following is what I understand is wrong about the current D7 drupal_html_to_text implementation.

Suppose you pass the following HTML through drupal_html_to_text.

<p>foo</p><p>bar</p><p>baz</p>

Expected output ( aka the 'acceptable plaintext representation') should keep the paragraph divisons of the text with:

foo

bar

baz

Output in D6 is according to expectation, in D7 we get:

foobarbaz
pillarsdotnet’s picture

The only functions that call drupal_html_to-text() are:

None of the above functions check whether the argument to drupal_html_to_text() is HTML or plaintext.

Heine asserts that drupal_html_to_text() should collapse whitespace, by design. Yet this would break the functionality of the only cases where the function is used.

So I ask Heine, for the record, which of the following is true?

  • Your assertion was in error, and we should expect drupal_html_to_text() to preserve plaintext formatting.

  • All the documented use cases are in error, and drupal_html_to_text() should be deleted from drupal core.

Damien Tournoud’s picture

Category: feature » bug
-  $text = substr($text, 0, -1);
+  $text = preg_replace('/\n$/s', '', $text);

Shouldn't that be MAIL_LINE_ENDINGS too?

Other then that, this looks RTBC to me. Also, this is purely a bug report. drupal_html_to_text() was accidentally broken on D7, and this patch fixes it.

@Heine: I think that pillarsdotnet was talking about backporting some additional bug fixes for drupal_html_to_text() that are also in that patch. The behavior around line breaks between the tags should obviously be the same on D6 and D7.

@pillarsdotnet: Heine is not saying that D7 behavior is correct.

pillarsdotnet’s picture

@Damien -- Corrected.

Also corrected doxygen to reflect the current list of supported tags.

Added a test confirming that internal (not leading or trailing) whitespace is preserved; this revealed a bug which I subsequently fixed.

Double-checked that everything passes coder review.

Hopefully this is ready to commit.

pillarsdotnet’s picture

Couldn't resist one more tweak -- it bothered me that some of the "expected" results had trailing whitespace. Fixed.

pillarsdotnet’s picture

Even better...

Heine’s picture

Why should drupal_html_to_text operate on non-HTML?

Now it incorrectly converts HTML suchs as

<p>Foo

bar

baz
</p>

Which renders as

Foo bar baz

The acceptable plaintext representation should match this rendition and display

Foo bar baz

Not what this patch displays.

Shoehorning two incompatible formats into a conversion function indicates that the callers are in the wrong.

Heine’s picture

Well, that wasn't exactly fair; the incorrect whitespace treatment is not related to the patch.
I'll open a new issue.

pillarsdotnet’s picture

@Heine

See the comment starting at line 497:

      // Convert inline HTML text to plain text; not removing line-breaks or
      // white-space, since that breaks newlines when sanitizing plain-text.

I find it interesting that the relevant commit is the same one which provoked this issue in the first place. Obviously there was more intended than the "provide a sequence API" comment would suggest.

I restate my conclusion in #117: either whitespace should be preserved, as required by the above comment and documented use-cases, or the function should be removed entirely. As I also said in #102, this function is not intended as a general-purpose text-mode browser. It is a convenience function to ensure that text emails remain readable, regardless of the input format.

Perhaps the simplest way to satisfy you would be to rename this function something else, like "drupal_sanitize_plaintext", or "drupal_html_or_text_to_plaintext".

Heine’s picture

Not really, sending mail is fundamentally broken; the API is misdesigned.

Plaintext should be passed through nl2br or perhaps the linebreak filter before being passed to drupal_html_to_text.

pillarsdotnet’s picture

Title: Improve drupal_html_to_text() » Fix the broken formatting in drupal_html_to_text() and also provide tests
Category: feature » bug
Heine’s picture

If you can't bear reading reviews / musings aimed at the betterment of Drupal, stop posting patches, pillarsdotnet.

/me out

Damien Tournoud’s picture

Status: Needs review » Needs work

Not really, sending mail is fundamentally broken; the API is misdesigned.

Nope, not at all. Mails are handled 100% in HTML in Drupal 7, and are only converted into plaintext (via drupal_html_to_text()) at the very end of the chain (when sending out the mail), because core by default only support sending email serialized as plain-text. You can see a summary of the design on #407452-35: Remove use of drupal_html_to_text() from system mails.

Converting plain-text to HTML before calling drupal_mail() is the responsibility of the caller.

You are completely right in saying that drupal_html_to_text() should collapse whitespace. pillarsdotnet is wrong in expecting that plain-text input should be handled.

Needs work again: we need to fix drupal_html_to_text() so that it properly collapse whitespace. The initial patch from #407452: Remove use of drupal_html_to_text() from system mails was clearly misguided in that particular part.

Damien Tournoud’s picture

Also see http://drupal.org/update/modules/6/7#email-html for cristal clear statements that every mail is HTML and that we don't handle plain-text.

pillarsdotnet’s picture

I give up; you win. Re-rolled as requested. Also removed the comment.

If and when this ever trickles down to D6 it will either need to preserve whitespace or be accompanied by a patch to every place in core that calls drupal_mail().

(looking...)

Hmm... Not such a big list as I feared... Will probably tackle this myself when the time comes.

pillarsdotnet’s picture

Does anybody care about this? Fixing it would be ... complicated, and I don't have a use-case, so I'm tempted to leave it broken-as-designed:

drupal_html_to_text("<pre>Drupal\n     Drupal\nDrupal</pre>") == "Drupal Drupal Drupal"

The old code doesn't support the <pre> tag at all. I only added it as a synonym for <p> because it's listed among the block-level tags and I wanted a test that every block-level tag forces a line-break.

pillarsdotnet’s picture

Found another bug:

drupal_html_to_text('Drupal<br /><br />Drupal') == "Drupal\nDrupal"

Trying to fix it, but I'm having problems.

pillarsdotnet’s picture

Finally got it fixed. <pre> tags are now fully supported and tested, as are multiple consecutive <br /> tags.

The only gotcha is that spaces within <pre> tags turn into unicode non-breaking spaces.

There's probably a cleaner way to do this, but for now it passes all tests so I'm gonna go to bed and await further review.

pillarsdotnet’s picture

Fixed the gotcha and simplified the code.

pillarsdotnet’s picture

Nested <pre> tags are now tested and properly handled.

pillarsdotnet’s picture

pillarsdotnet’s picture

Status: Needs review » Needs work

The last submitted patch, drupal_html_to_text-tests+fix-299138-135.patch, failed testing.

pillarsdotnet’s picture

Status: Needs work » Needs review
Issue tags: +Needs backport to D6, +Needs backport to D7
stephandale’s picture

I've made some modifications so that consecutive block level elements are separated by a blank line (i.e. two new line characters rather than one) and nested lists are handled better. As part of these changes block level elements output a new line before and afterwards.

The most significant changes are a) the tags now output chunks as they need this control when nested, b) the indent is applied by the following value rather than by the current tag and c) $item is used as the current element of $split so $value can store values only (as they may be needed by closing tags when determining their chunk).

More tests have been added, to test these modifications and test short tags that don't contain a space (which is the more correct use - the space was only recommended as a guideline to support old user agents - see http://www.w3.org/TR/xhtml1/#guidelines)

I also replaced the 3 instances of \r\n in drupal_html_to_text() with MAIL_LINE_ENDINGS as I assumed that they were forgotten earlier. Hopefully there wasn't a reason they hadn't been changed!

I wasn't sure what to name the patch so I kept the same format as the last one and incremented the last integer in its name. Hope this is ok.

P.S. I couldn't find the Coder module for 8.x. Is there any way to automatically test coding guidelines on 8?

pillarsdotnet’s picture

New approach using a DOM tree instead of a custom parser. Looks cleaner to me and passes all tests locally.

Status: Needs review » Needs work
pillarsdotnet’s picture

Okay; let's try that again...

pillarsdotnet’s picture

Status: Needs work » Needs review
pillarsdotnet’s picture

Coder griped about using substr instead of drupal_substr and found an indentation error.

Corrected patch.

pillarsdotnet’s picture

@stephandale --

pillarsdotnet’s picture

Found and fixed a subtle error in character counting.

pillarsdotnet’s picture

Another coder gripe.

Note that the net difference in size between #140 and this patch is to add 31 comment lines and to take away 15 code lines.

pillarsdotnet’s picture

Code in #148 has been added to Mail System module:

And is being considered for incorporation into the Simplenews module:

salvis’s picture

Paste this

$html = "<p>line 1<br />\nline 2<br />line 3\n<br />line 4</p><p>paragraph</p>";
dpm($html);
return drupal_html_to_text($html);

into Devel's Execute PHP textarea (devel/php) and you'll get

  • <p>line 1<br />
    line 2<br />line 3
    <br />line 4</p><p>paragraph</p>
  • line 1
    line 2
    line 3
    line 4
     
    paragraph

I think line 2 should not be indented.

salvis’s picture

Status: Needs review » Needs work

Here's another test case — I've put the following into a node body:

line 1
line 2

line 4
line 5

0X

where the X is a space, not followed by any newline.

This shows up just fine as

line 1
line 2

line 4
line 5

0

and it results in the markup in the following snippet:

$html = "<p>line 1<br /> line 2</p> <p>line 4<br /> line 5</p> <p>0</p>";
dpm($html);
return drupal_html_to_text($html);

Running the snippet in Devel's Execute PHP function gives:

  • <p>line 1<br /> line 2</p> <p>line 4<br /> line 5</p> <p>0</p>
  • line 1
    line 2
     
    line 4
    line 5

Note the indenting again, and especially the missing 0!

pillarsdotnet’s picture

Status: Needs work » Needs review
FileSize
33.27 KB

Fixed, and tests added; thanks.

pillarsdotnet’s picture

Turns out I didn't need that block-tag function after all. No functional change from previous patch; just deletion of cruft.

pillarsdotnet’s picture

Slight refactoring; no functional changes.

pillarsdotnet’s picture

Fixed a couple of gripes identified by coder. Again, no functional changes.

pillarsdotnet’s picture

Assigned: Unassigned » pillarsdotnet
pillarsdotnet’s picture

Issue tags: -Needs committer feedback
FileSize
32.61 KB

Minor nitpicks; no code changes.

Applied to Mail System
6.x-2.17
7.x-2.17
8.x-2.17

Proposed patch for:

pillarsdotnet’s picture

Status: Needs review » Reviewed & tested by the community

Note that the code in this issue has been incorporated into Mail System since 6.x-2.6/7.x-2.6/8.x-2.6. According to the usage statistics page, that's 1,348 users. Does this qualify as "Reviewed and Tested by the Community" ?

pillarsdotnet’s picture

webchick’s picture

Status: Reviewed & tested by the community » Needs review

No. You never mark your own patch RTBC. All patches require peer review.

pillarsdotnet’s picture

Issue tags: +Needs committer feedback

Well, it's had plenty of review along the lines of "Looks good but..." and all the noted shortcomings have been fixed and tests added to ensure that they stay fixed.

Marking "Needs committer feedback" in hopes that someone will be willing to stick their neck out.

salvis’s picture

Status: Needs review » Needs work
  • here's a <a href='/node/123'>node</a> link
  • here's a node [1] link
    [1] /node/123

The link must be fully qualified, like http://example.com/node/123

Also, there should be an empty line between the text and the link list.

salvis’s picture

I'm going through all the tags. You've done an amazing job with this. Here are a few more comments:

  1. I like your <hX> implementations, but I'd suggest adding 3/2/1 additional newlines above the H1/H2/H3 heading to separate them from the text above.
  2. <p>a</p><p><hr /></p><p>b</p>
    currently produces two empty lines above and one below the separator line; there should be just one empty line above and below.
  3. <pre> lines must NOT be line-wrapped.
  4. <ins> and <del> are inline elements, not block elements. Currently you're treating the tags as line breaks, which is not correct. They are typically rendered as underline and strikeout. The former is usually _underline_. I'm not aware of any standard for strikeout, but I'd suggest -strikeout-.
  5. <u> and <strike> are missing.
  6. For ordered lists, more than 99 items are probably rare, but more than 9 are not unreasonable. Could you indent #1 though #9 by an additional space so that they'll line up with #10?
  7. This is debatable, but I'd suggest inserting 2 spaces between table cells, and not wrapping long table rows. This would preserve a little bit of structure.

What is your strategy regarding soft line breaks (http://www.ietf.org/rfc/rfc2646.txt), space stuffing, etc.?

The difficulty in this issue is that it's so complex, and once a patch is committed, it'll be almost impossible to get new issues fixed that might pop up in the future. Putting this into production is certainly a good way to build confidence, but the feedback generated that way is very slow.

I'll incorporate your code into Subscriptions to generate some more feedback (hopefully), but the switch to HTML will take some major surgery before I can actually do this (right now I'm using the D6 mail.inc).

pillarsdotnet’s picture

Title: Improve drupal_html_to_text() » Fix the broken formatting in drupal_html_to_text() and also provide tests
Assigned: Unassigned » pillarsdotnet
Category: feature » bug
Status: Needs work » Needs review
Issue tags: +Needs committer feedback
FileSize
32.32 KB

Re #163:

  1. I like your <hX> implementations, but I'd suggest adding 3/2/1 additional newlines above the H1/H2/H3 heading to separate them from the text above.

    I don't like them, and they aren't mine. I suggest emulating Markdown.

  2. <p>a</p><p><hr /></p><p>b</p> currently produces two empty lines above and one below the separator line; there should be just one empty line above and below.

    Two out of three text-mode browsers disagree with you. Both w3m and links render two empty lines above and below; only the older and arguably less-sophisticated lynx renders one empty line above and below. None of them, by the way, render headers as you suggest.

  3. <pre> lines must NOT be line-wrapped.

    RFC 3676 disagrees.

  4. <ins> and <del> are inline elements, not block elements.

    The w3 org says:

    These two elements are unusual for HTML in that they may serve as either block-level or inline elements (but not both). They may contain one or more words within a paragraph or contain one or more block-level elements such as paragraphs, lists and tables.

    The HTML-5 draft says you have to check all the child tags to decide whether to render them as phrasing content or flow content.
    The Web Design Group says that they are inline if e.g., within another inline element or a P.
    On reflection, I'm dropping support for both, as neither one has any accurate plain-text representation.

  5. <u> and <strike> are missing.

    And they were missing before, too. This is a bug report, not a feature request. See my comment in #102 above.

  6. For ordered lists, more than 99 items are probably rare, but more than 9 are not unreasonable. Could you indent #1 though #9 by an additional space so that they'll line up with #10?

    Since you ask so nicely, and I happen to agree, done. (2 files changed, 8 insertions, 8 deletions)

  7. I'd suggest inserting 2 spaces between table cells, and not wrapping long table rows.

    Wrapping is mandatory, according to RFC 3676, and will often be performed by the email-reading software even if we don't.

    I considered coding full support for tables, but I thought about the colspan and rowspan attributes, and decided that it wasn't worth the trouble.

stephandale’s picture

Status: Needs review » Needs work
FileSize
32.09 KB

Edit 01/06/2011 16:03: Patch in comment#167 does the trick.
Edit 01/06/2011 15:49: Bad timing - pillarsdotnet posted at same time. I'll make the mods to his latest patch and repost.

I've found that the use of the PCRE_MULTILINE modifier m in preg_replace() resulted in different behaviour on different servers, possibly due to the difference in the PCRE version though I can't be sure at this stage.

Server 1 (tests pass):

php > $s = "one\ntwo\n";
php > echo preg_replace('/^/m','FOO',$s);
FOOone
FOOtwo

Server 2 (tests fail):

php > $s = "one\ntwo\n";
php > echo preg_replace('/^/m','FOO',$s);
FOOone
FOOtwo
FOO

Note how on server 2 (which has an older PCRE library), the replacement is also made on the trailing new line.

I've added a three line workaround to the drupal_wrap_mail() function, so tests are now green on both servers. The change was made to the last three lines before the return statement and simply removes the $clean_indent if it's present at the end of the text.

  // Apply indentation. We only include non-'>' indentation on the first line.
  $indented_text = preg_replace('/^/m', $clean_indent, $text);
  $indented_text = preg_replace('/' . $clean_indent . '$/', '', $indented_text); // Workaround for differences in multiline behaviour.
  $text = $indent . drupal_substr($indented_text, drupal_strlen($indent));
  return $text;

I updated to the latest Drupal 8.x, applied the patch from comment#157 (http://drupal.org/node/299138#comment-4535020) and made my modifications. The attached patch should include those changes and my modifications. Note that it differs to the patch in comment#157 by more than the three lines I edited, which I can only assume is due to other changes in core since comment#157.

pillarsdotnet’s picture

Status: Needs work » Needs review
FileSize
32.78 KB

Re #162:

I agree; done.

Re #165:

Try this slightly simpler workaround and see if it passes on your server 2.

pillarsdotnet’s picture

Assigned: Unassigned » pillarsdotnet
Status: Needs work » Needs review
Issue tags: +Needs committer feedback
FileSize
32.78 KB
stephandale’s picture

Status: Needs work » Needs review

Re #166: It does, yes. Thanks.

Note: I tried patch from #167.

salvis’s picture

Re #164:

Two out of three text-mode browsers disagree with you. Both w3m and links render two empty lines above and below; only the older and arguably less-sophisticated lynx renders one empty line above and below.

<hr> is a separator. It separates two equal parts. There's no way to tell whether it should belong to the part above or below. Putting it in the middle is the right thing to do. Apparently, all three text-mode browsers agree with me that the space above and below should be equal. That's my point. I'd say one empty line is enough, but if you prefer two, that's perfectly fine with me, as long as the line is in the middle.

None of them, by the way, render headers as you suggest.

Let's take a step back. Headings are not 'footings' and not separators. They announce what comes after them, and as such they semantically belong to what follows. They have very little connection to what's above them. Thus the distance above should be larger than the distance below.

I don't know what the design goals and priorities of the text-mode browsers have been — probably to pack as much information on a 25-line CRT as possible, to minimize the required navigation (= transmission over a slow connection). That's completely irrelevant in a text file (or email message).

The goal here, IMHO, should be to render HTML as text in a way that helps the human reader to understand its structure and content. You're forced to drop a lot of information to convert to plain text, and any hint that you can reasonably provide will be appreciated by the human reader.

<pre> lines must NOT be line-wrapped.

RFC 3676 disagrees.

Huh? Our interpretations of RFC3676 seem to differ. Let me cite:

Text/Plain is usually displayed as preformatted text, often in a fixed font. That is, the characters start at the left margin of the display window, and advance to the right until a CRLF sequence is seen, at which point a new line is started, again at the left margin. When a line length exceeds the display window, some clients will wrap the line, while others invoke a horizontal scroll bar.

Text which meets this description is defined by this memo as "fixed".

RFC3676 goes on to say that there can be problems with this type of text, and that in general text should be flowed. BUT (my interpretation now) if the author explicitly says that he formatted the text exactly as he wants it to appear, then the renderer should comply, if technically possible.

The <pre> tag explicitly asks for the fixed format (preformatted text). Preformatted means that indents should be preserved, spaces should be preserved, line breaks should be preserved, etc. — essentially as much formatting should be preserved as possible. If you don't have a horizontal scrollbar (e.g. on a cell phone), then you obviously have to wrap the lines to avoid clipping them, but if you can avoid it, then you should preserve formatting.
Some email clients may still munge it, but others will do the right thing. If you munge it even before it leaves the site, then you ensure that it'll be equally incomprehensible for both camps. Is that the goal?

See also HTML5. It doesn't say this explicitly, but the intention of <pre> clearly is that line breaks should neither be added nor removed inside <pre>. That's what <pre> is all about.

See my comment in #102 above. --> This is for sending plain-text email messages.

Yes, and it's called drupal_html_to_text(). Thus, if I want to send a node body through email, then this function is the obvious choice. I thought the idea of this exercise was to make the D7 version better than the D6 one, was it not?

_Underlined_ is firmly established. I don't know whether it was supported in the D6 version or not, and it doesn't really matter. Call its absence a pre-existing bug if you want, but that doesn't justify carrying its absence over into your new implementation.

IMO, the goal should be to try to preserve as much information as possible. If a node author decides that underline or strikeout are needed to convey his meaning, then it would be pity to lose those markups when transmitting the node body through text-mode email, given that there is a reasonable way to express them.

Wrapping is mandatory, according to RFC 3676, and will often be performed by the email-reading software even if we don't.

Here's where we differ. Where in RFC 3676 does it say "MUST wrap"? In my reading it says "SHOULD wrap", which means there can be overriding considerations. The author's clearly expressed intention via <pre> or <table> are such an overriding consideration, IMO.

RFC 3676 even says explicitly that Hand-aligned text, such as ASCII tables or art, source code, etc., SHOULD be sent as fixed, not flowed lines.

salvis’s picture

Status: Needs review » Needs work

The patch in #167 now produces

  • here's a <a href='/node/123'>node</a> link
  • here's a node [1] link
     
    [1] http://example.com//node/123

Note the extraneous slash. Apparently, there's no test for this yet.

It seems like you have to remove the leading slash before you call url().

pillarsdotnet’s picture

Status: Needs work » Needs review
FileSize
34.51 KB

Point by point:

  • Headings ... have very little connection to what's above them. Thus the distance above should be larger than the distance below.

    Please revise the following test code to meet your expectations:

      /**
       * Test that headers are properly separated from surrounding text.
       */
      function testHeaderSeparation() {
        // Text before and after.
        $html = 'Drupal<h1>Drupal</h1>Drupal';
        $text = "Drupal\n\n======== DRUPAL " . str_repeat('=', 61) . "\nDrupal";
        $this->_testHtmlToText($html, $text);
        // Paragraph before; text after.
        $html = '<p>Drupal</p><h1>Drupal</h1>Drupal';
        $text = "Drupal\n\n\n======== DRUPAL " . str_repeat('=', 61) . "\nDrupal";
        $this->_testHtmlToText($html, $text);
        // Text before; paragraph after.
        $html = 'Drupal<h1>Drupal</h1><p>Drupal</p>';
        $text = "Drupal\n\n======== DRUPAL " . str_repeat('=', 61) . "\n\nDrupal";
        $this->_testHtmlToText($html, $text);
        //  Paragraph before and after.
        $html = '<p>Drupal</p><h1>Drupal</h1><p>Drupal</p>';
        $text = "Drupal\n\n\n======== DRUPAL " . str_repeat('=', 61) . "\n\nDrupal";
        $this->_testHtmlToText($html, $text);
      }
    
  • Where in RFC 3676 does it say "MUST wrap"?

    From RFC 3676:

    The Text/Plain media type is the lowest common denominator of Internet email, with lines of no more than 998 characters

    From RFC 2046:

    NOTE: Some protocols defines a maximum line length. E.g. SMTP [RFC-821] allows a maximum of 998 octets before the next CRLF sequence. To be transported by such protocols, data which includes too long segments without CRLF sequences must be encoded with a suitable content-transfer-encoding.

    From RFC 821:

    text line
    The maximum total length of a text line including the <CRLF> is 1000 characters (but not counting the leading dot duplicated for transparency).

    So line wrapping MUST occur before 1000 characters (998 plus CRLF). Outside of <pre> tags, it SHOULD occur at a convenient width for text-mode readers, which historically has been 80 characters (78 plus CRLF). If there is a problem, it lies in the fact that we use the same code to meet both constraints.

  • I thought the idea of this exercise was to make the D7 version better than the D6 one, was it not?

    The idea of this exercise is to "fix the broken formatting in drupal_html_to_text() and also provide tests". Please understand that "broken" means something entirely different from "sub-optimal". I now regret that I overzealously added support for tags that were not originally supported, as I may have unnecessarily delayed adoption of the more important fix. Also note the "Needs backport to D6" tag.

  • _Underlined_ is firmly established. ... Call its absence a pre-existing bug

    Fine; added support for the <u> tag, even though its actual use is quite rare. Stylesheets would be the far more common use case, but supporting them is way out of scope.

  • It seems like you have to remove the leading slash before you call url().

    Thanks. Fixed, and test added.

pillarsdotnet’s picture

<p>a</p><p><hr /></p><p>b</p> currently produces two empty lines above and one below the separator line; there should be just one empty line above and below.

I wasted four hours on this one. Try running the following program:

$dom = new DOMDocument;
$dom->loadHTML('<html><body><p>a</p><p><hr /></p><p>b</p></body></html>');
echo $dom->saveXML();

You will discover that <p><hr /></p> gets corrected to <p/><hr/> because you can't nest <hr /> inside a <p> container.

sreynen’s picture

Status: Needs review » Reviewed & tested by the community

Looking over this thread, the scope has expanded far beyond the initial bug report, which was addressed a long time ago.

The latest patch solves the original bug (lack of tests) without introducing any new bugs. If there are additional feature requests, or even other bugs that aren't yet fixed by this patch, I suggest we take them to separate issues and close this thread before it reaches 3 years old.

pillarsdotnet’s picture

Re-ran #171 through coder review and found one strlen() that could be replaced by drupal_strlen().

Changing to avoid rejection by pedants, even though it makes absolutely no functional difference. Leaving status at RTBC.

Lars Toomre’s picture

I agree with @sreynen #173 that this patch fixes the original bug report and goes far beyond. Hence, I strongly urge that this patch be applied with the clear understanding that 'better' may not be 'perfect'. However, 'better' is also *much* better than the current state of core Drupal.

This patch adds the initial requested tests and passes tests also included for all of the extra code that was added to better handle the complex challenge of converting HTML text to its plain text equivalent. Edge cases, additional bugs and/or feature requests can be started in separate issues. It is time for this RTBC patch to be applied. It makes HTML-text to plain text conversion so much better!!

bdragon’s picture

Skimmed through the patch and noticed a couple of minor nits in the doxygen docs:

* From somewhere whereabouts line 300ish in includes/mail.inc:

@@ -303,40 +305,47 @@ interface MailSystemInterface {
  * We use delsp=yes wrapping, but only break non-spaced languages when
  * absolutely necessary to avoid compatibility issues.
  *
- * We deliberately use LF rather than CRLF, see drupal_mail().
+ * We deliberately use variable_get('mail_line_endings), MAIL_LINE_ENDINGS)
+ * rather than "\r\n".

Typo, need a ' instead of ) after mail_line_endings.

* From the comment block above function _drupal_html_to_text():

+/**
+ * Helper function for drupal_html_to_text
+ *
+ * Recursively converts $node to text, wrapping and indenting as necessary.

Should have () after drupal_html_to_text there.

pillarsdotnet’s picture

sun’s picture

Status: Reviewed & tested by the community » Needs work

Needs work for @bdragon's review.

Please also make sure that all the RFC research is documented/summarized in inline comments (including RFC numbers). It would be silly to go through all of this research once again in the future.

Furthermore:

+++ b/includes/mail.inc
@@ -12,6 +12,8 @@
+define('NBSP', html_entity_decode('&nbsp;'));

Should be MAIL_NBSP. Although "NBSP" is not really descriptive. I don't understand why this is a constant, and why we're decoding &nbsp; to get the character instead of directly specifying the character. This needs to be documented in the (missing) phpDoc.

+++ b/includes/mail.inc
@@ -303,40 +305,47 @@ interface MailSystemInterface {
+ * We deliberately use variable_get('mail_line_endings), MAIL_LINE_ENDINGS)
+ * rather than "\r\n".
...
+  // Convert LF or CRLF into platform-specific line-endings.

I don't understand why we're adding a variable for this. Especially since the actual function code talks about platform-specific line-endings -- so if something is platform-specific, why is it a variable? A short explanation for why it's a variable would be helpful.

+++ b/includes/mail.inc
@@ -303,40 +305,47 @@ interface MailSystemInterface {
+  $text = preg_replace('/\r?\n/', $eol, $text);
...
+    $text = preg_replace('/ +\r?\n/m', $eol, $text);

The first preg_replace() replaces \r\n with platform-specific or variable-defined line endings. The second preg_replace() still tries to match original line endings even though they got already replaced with $eol.

+++ b/includes/mail.inc
@@ -347,177 +356,328 @@ function drupal_wrap_mail($text, $indent = '') {
+ * @param $notes
+ *   The list of footnotes, an associative array of (url => reference number) items.

Exceeds 80 chars.

+++ b/includes/mail.inc
@@ -347,177 +356,328 @@ function drupal_wrap_mail($text, $indent = '') {
+function _drupal_html_to_text(DOMNode $node, array $allowed_tags, array &$notes, $parents = array(), &$count = NULL) {

$parents should use array type-hinting, too.

+++ b/includes/mail.inc
@@ -347,177 +356,328 @@ function drupal_wrap_mail($text, $indent = '') {
+  if (is_null($count)) {

Always use !isset() instead of is_null().

+++ b/includes/mail.inc
@@ -347,177 +356,328 @@ function drupal_wrap_mail($text, $indent = '') {
+        if ( !empty($node->attributes)
...
+          && valid_url($url) ) {

Stray leading and trailing space in if condition.

+++ b/includes/mail.inc
@@ -347,177 +356,328 @@ function drupal_wrap_mail($text, $indent = '') {
+      // Dictionary list.
+      case 'dl':
+        // Start on a newline except inside other lists.
+        if (!in_array('li', $parents)) {
+          $text = $eol;
+        }
+
+      // Dictionary term.
+      case 'dt':

Missing break; or missing inline comment that explains the fall-through to 'dt'.

+++ b/includes/mail.inc
@@ -347,177 +356,328 @@ function drupal_wrap_mail($text, $indent = '') {
+          if ( !empty($node->attributes)
+            && ($value = $node->attributes->getNamedItem('value'))) {
...
+        if ( !empty($node->attributes)
+          && ($value = $node->attributes->getNamedItem('start')) ) {

1) Stray leading and trailing space in if condition.

2) Extra parenthesis around $value can be removed.

3) Shouldn't wrap.

+++ b/includes/mail.inc
@@ -347,177 +356,328 @@ function drupal_wrap_mail($text, $indent = '') {
+      default:
+        break;

The empty default case needs an inline comment for why it's needed (if it is needed).

+++ b/modules/simpletest/tests/mail.test
@@ -43,7 +44,7 @@ class MailTestCase extends DrupalWebTestCase implements MailSystemInterface {
-   * @see DefaultMailSystem
+   * @see DefaultMailSystem()

Bogus change: DefaultMailSystem is a class, not a function.

+++ b/modules/simpletest/tests/mail.test
@@ -63,3 +64,285 @@ class MailTestCase extends DrupalWebTestCase implements MailSystemInterface {
+  function _string_to_html($text) {
+    return '"' .
+      str_replace(
+        array("\n", ' '),
+        array('\n', '&nbsp;'),
+        check_plain($text)
+      ) . '"';

1) Please remove the leading underscore and rename to stringToHtml().

2) Missing phpDoc.

+++ b/modules/simpletest/tests/mail.test
@@ -63,3 +64,285 @@ class MailTestCase extends DrupalWebTestCase implements MailSystemInterface {
+  function _testHtmlToText($html, $text, $args = NULL, $reason = '') {
+    $result = drupal_html_to_text($html, $args);
+    $this->assertEqual(
+      $result,
+      $text,
+      (drupal_strlen($reason) ? $reason . '<br />' : '')
+      . 'html = ' . $this->_string_to_html($html) . '<br />'
+      . 'result = ' . $this->_string_to_html($result) . '<br />'
+      . 'expected = ' . $this->_string_to_html($text)
+    );
+  }

1) Rename to assertHtmlToText().

2) Rename $args to $allowed_tags.

3) $reason should be $message.

4) Use a separate $this->verbose() after assertEqual() to output the debugging info as a page instead of squeezing it into the assertion message.

5) Missing phpDoc.

+++ b/modules/simpletest/tests/mail.test
@@ -63,3 +64,285 @@ class MailTestCase extends DrupalWebTestCase implements MailSystemInterface {
+      'Allowed &lt;b&gt; tag found.'
...
+      'Disallowed &lt;h1&gt; tag not found.'
...
+      'Disallowed &lt;p&gt;, &lt;em&gt;, and &lt;b&gt; tags not found.'
...
+      'Unsupported &lt;html&gt; and &lt;body&gt; tags not found.'

Might make sense to unconditionally run check_plain() on assertHtmlToText()'s $message, so we don't need those &lt;...&gt;.

+++ b/modules/simpletest/tests/mail.test
@@ -63,3 +64,285 @@ class MailTestCase extends DrupalWebTestCase implements MailSystemInterface {
+    $tests = array(
+      array(
+        'html' => "<p>line 1<br />\nline 2<br />line 3\n<br />line 4</p><p>paragraph</p>",

Let's make these a bit cleaner:

$tests = array();
$tests[] = array(
  'html' => '...',
  'text' => '...',
);

Powered by Dreditor.

webchick’s picture

Status: Needs work » Needs review

Guys, if this gets marked RTBC one more time before it's ready, it's going to the very bottom of my stack.

webchick’s picture

Status: Needs review » Needs work
Issue tags: -Needs committer feedback
pillarsdotnet’s picture

  • Needs work for @bdragon's review.

    Already fixed.

  • Please also make sure that all the RFC research is documented/summarized in inline comments (including RFC numbers). It would be silly to go through all of this research once again in the future.

    If the objective is absolute RFC-compliant pedantry, I'm unqualified for the job. Someone else will have to do it.

  • Should be MAIL_NBSP. Although "NBSP" is not really descriptive. I don't understand why this is a constant, and why we're decoding &nbsp; to get the character instead of directly specifying the character. This needs to be documented in the (missing) phpDoc.

    Changed globally to chr(160), which is what it evaluates to in my copy of PHP. Added inline comments to explain the magic number.

  • I don't understand why we're adding a variable for this. Especially since the actual function code talks about platform-specific line-endings -- so if something is platform-specific, why is it a variable? A short explanation for why it's a variable would be helpful.

    The explanation is at the top of the file, lines 8-13 inclusive:

    /**
     * Auto-detect appropriate line endings for e-mails.
     *
     * $conf['mail_line_endings'] will override this setting.
     */
    define('MAIL_LINE_ENDINGS', isset($_SERVER['WINDIR']) || strpos($_SERVER['SERVER_SOFTWARE'], 'Win32') !== FALSE ? "\r\n" : "\n");
    

    I'm using a variable $eol instead of repeating the code variable_get('mail_line_endings', MAIL_LINE_ENDINGS) but if you prefer the repetition, I will comply.

  • The first preg_replace() replaces \r\n with platform-specific or variable-defined line endings. The second preg_replace() still tries to match original line endings even though they got already replaced with $eol.

    Okay, replaced '/ +\r?\n/' with "/ +$eol/" but I don't think it makes any difference.

  • Exceeds 80 chars.

    Fixed.

  • $parents should use array type-hinting, too.

    Changed. Drupal in general should use type-hinting. It doesn't.

  • Always use !isset() instead of is_null().

    Changed. (PHP is such a screwball language! If we can't use the function, why does it exist?)

  • Stray leading and trailing space in if condition.

    Re-wrote long if conditions using a temporary variable.

  • Missing break; or missing inline comment that explains the fall-through to 'dt'.

    Good catch. Fixed.

  • The empty default case needs an inline comment for why it's needed (if it is needed).

    Added an inline comment.

  • Bogus change: DefaultMailSystem is a class, not a function.

    Fixed, but please file an issue against Coder Review, which claims the parentheses are needed, and against API, which fails to create a link without them.

  • Please remove the leading underscore and rename to stringToHtml().

    Done.

  • Missing phpDoc.

    Added, but is phpDoc really necessary for internal-use-only methods within a test class that don't show up on api.drupal.org anyway?

  • 1) Rename to assertHtmlToText().

    2) Rename $args to $allowed_tags.

    3) $reason should be $message.

    4) Use a separate $this->verbose() after assertEqual() to output the debugging info as a page instead of squeezing it into the assertion message.

    5) Missing phpDoc.

    All fixed. I didn't know about the verbose() function.

  • Might make sense to unconditionally run check_plain() on assertHtmlToText()'s $message, so we don't need those &amp;lt;...&amp;gt;.

    Done.

  • Let's make these a bit cleaner:

    Done.

pillarsdotnet’s picture

Status: Needs work » Needs review
FileSize
35.79 KB
pillarsdotnet’s picture

Title: drupal_html_to_text() formatting is broken and does not have tests » Fix the broken formatting in drupal_html_to_text() and also provide tests<li>
Assigned: Unassigned » pillarsdotnet
Category: feature » bug
Issue tags: -Needs tests +Needs backport to D6
pillarsdotnet’s picture

Title: Fix the broken formatting in drupal_html_to_text() and also provide tests » Fix the broken formatting in drupal_html_to_text() and also provide tests<li>
FileSize
37.19 KB

Fixed typos, and added a test for wrapping very long text lines. The test fails, however. Trying to figure out why.

pillarsdotnet’s picture

salvis’s picture

Title: Fix the broken formatting in drupal_html_to_text() and also provide tests<li> » Fix the broken formatting in drupal_html_to_text() and also provide tests
Status: Needs review » Needs work
Issue tags: +Needs committer feedback

The patch produces PHP syntax errors, so it's NW in any case. I think item 2 below (wrapping <pre> and <table> below 80 characters) also makes this NW.

Re #171:

  1. Headings: This was/is a suggestion for discussion — I'd simply add three newlines above the level 1 heading, two for level 2, and 1 for level 3, independently of the context, in order to further emphasize the importance of the heading.

    That would give us

      /**
       * Test that headers are properly separated from surrounding text.
       */
      function testHeaderSeparation() {
        // Text before and after.
        $html = 'Drupal<h1>Drupal</h1>Drupal';
        $text = "Drupal\n\n\n\n\n======== DRUPAL " . str_repeat('=', 61) . "\nDrupal";
        $this->_testHtmlToText($html, $text);
        // Paragraph before; text after.
        $html = '<p>Drupal</p><h1>Drupal</h1>Drupal';
        $text = "Drupal\n\n\n\n\n\n======== DRUPAL " . str_repeat('=', 61) . "\nDrupal";
        $this->_testHtmlToText($html, $text);
        // Text before; paragraph after.
        $html = 'Drupal<h1>Drupal</h1><p>Drupal</p>';
        $text = "Drupal\n\n\n\n\n======== DRUPAL " . str_repeat('=', 61) . "\n\nDrupal";
        $this->_testHtmlToText($html, $text);
        //  Paragraph before and after.
        $html = '<p>Drupal</p><h1>Drupal</h1><p>Drupal</p>';
        $text = "Drupal\n\n\n\n\n\n======== DRUPAL " . str_repeat('=', 61) . "\n\nDrupal";
        $this->_testHtmlToText($html, $text);
      }
  2. Wrapping <pre> lines and table rows: I'm sorry, I wasn't communicating well enough. Yes, I agree that lines MUST wrap shortly before 1000. However, right now you're wrapping them before 80, and I thought that was what you were defending. Can we agree that newlines SHOULD NOT be added or removed inside <pre> and <tr> elements as long as lines are not longer than 998 characters?

    Try this:

    $html = "<pre>aaaaaaaaaaaaaaaaaaaaaaa\nbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbb\n ccccccccccccccc</pre>";
    dpm($html);
    return drupal_html_to_text($html);
    

    It gives

    • <pre>aaaaaaaaaaaaaaaaaaaaaaa
      bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbb
      ccccccccccccccc</pre>
    • aaaaaaaaaaaaaaaaaaaaaaa
      bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
      bbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbb
      bbbbbbbbbbbbbbbbbbbbbbbbb
      ccccccccccccccc

    There are two newlines among the b's which shouldn't be there.

    The one space ahead of the c's turns into two spaces. It's a nit, but if you're trying to convert something that depends on exact indenting, then it's maddening.

  3. The thought of doing more to separate table cells briefly occurred to me, too, but just like vertical separators, horizontal separators should be symmetrical IMO, and thus it would take at least " | " (three characters), and it still wouldn't work so well if the cells weren't aligned. Tabs rarely help with alignment either.

    However, not getting any clue whatsoever about where a cell ends and the next one starts is not so great. Hence my suggestion to put two spaces between cells, rather than just one. This would usually allow the reader to count the cells in a row and give him a fighting chance to analyze and reconstruct the table structure, if he really wanted to. With only one space, it's completely impossible unless the cell contents are just single words.

  4. Thank you for adding the <u> tag! We're generally not seeing a lot of markup in nodes here on d.o, probably because most people don't particularly like to write HTML by hand. However, we have a list of "Allowed HTML tags" at the bottom of this textarea (<h1> <h2> <h3> <h4> <h5> <h6> <em> <strong> <code> <del> <blockquote> <q> <cite> <sup> <sub> <p> <br> <ul> <ol> <li> <dl> <dt> <dd> <a> <b> <u> <i> <table> <tr> <td> <th> <tbody> <thead> <tfoot> <colgroup> <caption>). This is a reasonable list for a technically oriented site, and it's a good assumption that any of these will be used at some time or other. If they can be rendered sensibly in plain text, and drupal_html_to_text() can support them with reasonable effort, then I'm sure that support will be appreciated in the right context.

    I was going to argue that <strike> was on this list, and I think it was yesterday, and I think I've seen it used by people to show where they have made minor edits... It's still on g.d.o, for example here. It would be a pity to lose the <strike> information when such a node is transmitted through email. So, please, would you reconsider adding <strike>?

  5. There's a smaller list in the default 'Filtered HTML' text format: <a> <em> <strong> <cite> <blockquote> <code> <ul> <ol> <li> <dl> <dt> <dd>. I think these should all be supported in some way for the sake of consistency. Only two are missing in your implementation: <cite> could be approximated reasonably well as <i>; &ltcode> is more difficult, because it can be either inline or block. As a block, I think it should behave like <pre> — inline is tough though... Maybe `backticks` (Markdown), or { braces and spaces }?
salvis’s picture

Status: Needs work » Needs review
Issue tags: -Needs committer feedback

Oh, sorry, our messages crossed. My item 2 referred to your patch #182. Apparently, you've noticed the problem now.

salvis’s picture

Status: Needs review » Needs work

No, the <pre> line is still getting wrapped below 80.

pillarsdotnet’s picture

Title: Fix the broken formatting in drupal_html_to_text() and also provide tests<li> » Fix the broken formatting in drupal_html_to_text() and also provide tests
Assigned: pillarsdotnet » Unassigned
Status: Needs review » Needs work
pillarsdotnet’s picture

Assigned: pillarsdotnet » Unassigned
Status: Needs review » Needs work
FileSize
37.23 KB
salvis’s picture

Assigned: pillarsdotnet » Unassigned

I'm sorry for being a source of frustration for you. You've taken on a big task and done a great job. No one knows your code as well as you do — please don't give up! Take a break, let some feedback accumulate, and then deal with it calmly, rather than frantically shooting patches.

Re #190:

$html = "<pre>aaaaaaaaaaaaaaaaaaaaaaa\nbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbb\n ccccccccccccccc</pre>";
dpm($html);
return drupal_html_to_text($html);

on Devel's Execute PHP page now gives

  • <pre>aaaaaaaaaaaaaaaaaaaaaaa
    bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbb
    ccccccccccccccc</pre>
  • aaaaaaaaaaaaaaaaaaaaaaa
    bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
    bbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbb
    bbbbbbbbbbbbbbbbbbbbbbbbb
    ccccccccccccccc

The one space in front of the c's now turns into no space, it should be preserved.

Two newlines are inserted between the b's inside the <pre> element, that shouldn't happen, as long as the resulting line is shorter than 998 characters.

The trailing spaces are gone, which is a clear improvement.

pillarsdotnet’s picture

Status: Needs review » Needs work
FileSize
38.19 KB

Here's a patch with a failing test. I think that making it pass will require a rewrite/replacement of (at least) drupal_wrap_mail() and friends, which would prevent the possibility of a backport.

EDIT: Had to change the test after learning some more about space-stuffing:

  /**
   * Ensure that content within <pre> tags is not changed.
   */
  function testNoWrapWithinPre() {
    // This awkward construct comes from includes/mail.inc lines 8-13.
    $eol = variable_get('mail_line_endings', MAIL_LINE_ENDINGS);
    $text = str_repeat('a', 80)
      . ' ' . str_repeat('b', 80) // Single space.
      . '  ' . str_repeat('c', 80) // Two spaces.
      . "$eol" . str_repeat('d', 80) // Single linefeed.
      . "$eol$eol" . str_repeat('e', 80) // Double linefeed.
      . "$eol " . str_repeat('f', 80) // Linefeed and space.
      . " $eol" . str_repeat('g', 80); // Space and linefeed.
    // The total length of $text is under 1000 octets and should not wrap.
    $html = "<pre>$text</pre>";
    // Space-stuffing will add an extra space to any line starting with a space.
    $text = str_replace("$eol ", "$eol  ", $text);
    $this->assertHtmlToText($html, $text, 'Text within <pre> is unchanged.');
  }
pillarsdotnet’s picture

Here's an entirely different approach. ;->

pillarsdotnet’s picture

Status: Needs work » Needs review

Status: Needs review » Needs work
pillarsdotnet’s picture

Fixed.

pillarsdotnet’s picture

pillarsdotnet’s picture

pillarsdotnet’s picture

pillarsdotnet’s picture

Status: Needs review » Needs work
FileSize
39.51 KB

Weird. Looks like we somehow broke core.

salvis’s picture

Please wait and give us some time to think...

In #192 you've pointed to drupal_wrap_mail(). Why would this stop working after having worked for D6?

The goal is to do format=flowed, delsp=yes wrapping, and this is NOT what I thought we should do!

pillarsdotnet’s picture

Actually, I think the most interesting solution is in #197.

salvis’s picture

After thinking about this some more I believe that RFC 3676 calls for soft wrapping <pre> tags.

I think the proper test for <pre> is the following:

/**
   * Ensure that content within <pre> tags is not changed.
   */
  function testNoWrapWithinPre() {
    // This awkward construct comes from includes/mail.inc lines 8-13.
    $eol = variable_get('mail_line_endings', MAIL_LINE_ENDINGS);
    $html = '<pre>'
      . str_repeat('a', 30) . ' ' . str_repeat('a', 30) . ' ' // Single space.
      . str_repeat('b', 30) . ' ' . str_repeat('b', 30) . '  ' // Two spaces.
      . str_repeat('c', 30) . ' ' . str_repeat('c', 30) . "$eol" // Single newline.
      . str_repeat('d', 30) . ' ' . str_repeat('d', 30) . "$eol$eol" // Double newline.
      . str_repeat('e', 30) . ' ' . str_repeat('e', 30) . "$eol " // Newline and space.
      . str_repeat('f', 30) . ' ' . str_repeat('f', 30) . "$eol  " // Newline and two spaces.
      . str_repeat('g', 30) . ' ' . str_repeat('g', 30) . " $eol" // Space and newline.
      . str_repeat('h', 30) . ' ' . str_repeat('h', 30) . "  $eol" // Two spaces and newline.
      . str_repeat('i', 30) . ' ' . str_repeat('i', 30) . '</pre>'; 
    $text =
      . str_repeat('a', 30) . ' ' . str_repeat('a', 30) . "  $eol" // Two spaces and newline.
      . str_repeat('b', 30) . ' ' . str_repeat('b', 30) . "   $eol" // Three spaces and newline.
      . str_repeat('c', 30) . ' ' . str_repeat('c', 30) . "$eol" // Single newline.
      . str_repeat('d', 30) . ' ' . str_repeat('d', 30) . "$eol$eol" // Double newline.
      . str_repeat('e', 30) . ' ' . str_repeat('e', 30) . "$eol  " // Newline and two spaces.
      . str_repeat('f', 30) . ' ' . str_repeat('f', 30) . "$eol   " // Newline and three spaces.
      . str_repeat('g', 30) . ' ' . str_repeat('g', 30) . "$eol" // Newline only.
      . str_repeat('h', 30) . ' ' . str_repeat('h', 30) . "$eol" // Newline only.
      . str_repeat('i', 30) . ' ' . str_repeat('i', 30) . '</pre>'; 
    // Soft wrapping and space stuffing for lines between about 78 and 998 characters long,
    // according to http ://www.ietf.org/rfc/rfc3676.txt.
    $this->assertHtmlToText($html, $text, 'Text within <pre> treated according to RFC 3676.');
  }
pillarsdotnet’s picture

After thinking about this some more I believe that RFC 3676 calls for soft wrapping

 tags.
You can only soft-wrap content that starts at the left margin. You can't soft-wrap anything that is a descendant of a <blockquote>, <li>, <ol>, or <ul> tag. Because of the way we are rendering them, you also can't soft-wrap anything that is a descendant of a <b>, <cite>, <em>, <i>, <strong>, or <u> tag. The more I consider this, the more I think the approach in #197 is the way to go. There is no right way to do things here.
pillarsdotnet’s picture

Note that this will get rejected the first time it gets reviewed, as the comments extend beyond column 80.

EDIT: nevermind; code fails with a syntax error anyway.

salvis’s picture

Actually, I think the most interesting solution is in #197.

You're not serious, are you? I'm pretty sure that spawning a child process is way too slow, and the text-mode browsers are designed to produce output for 25x80 terminals, which is not the same thing as a text email message.

Ah, this is a test, right? I'll readily admit that I'm not able to look at each of your patches, at the rate that you produce them...

pillarsdotnet’s picture

I'm pretty sure that spawning a child process is way too slow,

All tests complete in less than one second on my six-year-old Lenovo T-60.

the text-mode browsers are designed to produce output for 25x80 terminals, which is not the same thing as a text email message.

All of the ones I referenced have command-line options to set the line width. None of them care about the screen height.

I'll readily admit that I'm not able to look at each of your patches, at the rate that you produce them...

You can tell which ones I'm serious about because I commit them to Mail System, which is the only thing that is going to benefit from this issue before D9 ships. And also, interdiff is handy when comparing large patchsets.

pillarsdotnet’s picture

#205 has a syntax error and a test error which are now both corrected.

The final test, cut-and-pasted from #203, fails because it now requires behavior which you condemned in #186/#191 and which I fixed in #196.

The comments still extend beyond column 80.

pillarsdotnet’s picture

Status: Needs work » Closed (works as designed)
salvis’s picture

You can only soft-wrap content that starts at the left margin.

I think this is wrong. Space stuffing allows you to have leading spaces:

A generating agent MUST:

   o  Space-stuff lines which start with a space, "From ", or ">".

I believe this has been working in D6, but I'm not sure.

Because of the way we are rendering them, you also can't soft-wrap anything that is a descendant of a <b>, <cite>, <em>, <i>, <strong>, or <u> tag.

Why? Is this a limitation of your algorithm (can it be improved?) or do you see something in RFC 3676 that causes difficulties?

The final test, cut-and-pasted from #203, fails because it now requires behavior which you condemned in #191 and which I fixed in #196.

Yes, I got wiser between #191 and #203. I'm sorry for leading you down a path which I now think is a wrong one.

My hat's off to you for your drive, but maybe you're taking "talk is silver, code is gold" a bit too far. The basic problem in this issue is that we not only didn't have any tests, but we didn't even know how things are supposed work and why. In this situation it would be wise to discuss first and make sure we really understand what we're talking about (ideally getting consent from several people), before you start coding. This may seem slower, but ultimately it's probably faster, and certainly less frustrating.

Why would this stop working after having worked for D6?

You're absolutely right.

I was referring to drupal_wrap_mail(), which apparently hadn't been touched since D6. This is what caused me to question whether my test case was correct.

Obviously, the D7 drupal_html_to_text() is badly broken, but I still think that the D6 version was mostly working (except that it made it difficult to send HTML mail). It was broken by inadvertently committing #407452-64: Remove use of drupal_html_to_text() from system mails. The error was noticed a year ago in #299138-48: Improve \Drupal\Core\Utility\Mail::htmlToText(). Your patch in #299138-69: Improve \Drupal\Core\Utility\Mail::htmlToText() and in #1130198: Regression: line-breaks are mangled by \Drupal\Core\Utility\Mail::htmlToText() may have been all that was needed to get things working again, but without tests we can't really know.

There's a tremendous amount of work, know-how, and tuning in the D6 version that we (you, mostly) have had to redo and rebuild. Once you've pulled this off, D7 will be in much better shape, of course, but it takes a lot of learning and hard work. It's not just coding...

salvis’s picture

I'm pretty sure that spawning a child process is way too slow,

All tests complete in less than one second on my six-year-old Lenovo T-60.

Guessing performance is very difficult. A more relevant test would be how many mails can be pumped through drupal_html_to_text() and sent out in one cron run. This may include spawning sendmail (or something similar), which could explain why that number is notoriously low. Spawning a text-mode browser may be a comparable effort, which might mean that you'd be cutting the number in half.

Not an attractive proposition...

pillarsdotnet’s picture

Once you've pulled this off, D7 will be in much better shape,

Once someone has pulled this off, D9 will be in much better shape.

There; fixed that for you.

salvis’s picture

Oh, you're using <del>!

;-)

pillarsdotnet’s picture

Status: Needs work » Needs review

Testbots are idle; let's put them to work.

Status: Needs review » Needs work
pillarsdotnet’s picture

Status: Needs work » Needs review
FileSize
42.55 KB
pillarsdotnet’s picture

Status: Needs review » Needs work
pillarsdotnet’s picture

pillarsdotnet’s picture

Title: Improve drupal_html_to_text() » Fix the broken formatting in drupal_html_to_text() and also provide tests
Category: feature » bug

Spawning a text-mode browser may be a comparable effort...

My point is that we're rapidly approaching the complexity of (e.g.) html2text. If the goal is to duplicate the entire functionality, we might as well spawn the compiled binary, which is bound to be more efficient than its interpreted-language equivalent.

#1130198 may have been all that was needed to get things working again, but without tests we can't really know.

Re-opened #1130198. Please add relevant tests to that issue.

Copied tests from this issue, then adjusted the "expected" values until they passed. Discovered that the current version of drupal_wrap_mail() will happily generate lines of over 2000 characters in length, so removed the RFC-3676 test replaced the testVeryLongLineWrap() assertion with a pass().

pillarsdotnet’s picture

Status: Needs work » Needs review

Looks like the testbots broke down around 10pm last night.

pillarsdotnet’s picture

pillarsdotnet’s picture

Status: Needs review » Needs work
Issue tags: +Needs backport to D6, +Needs backport to D7
pillarsdotnet’s picture

Yay! Got rid of all magic constants except for 160 (the non-breaking space character), 80 (the default line length for soft-wrapping) and 1000 (the default line length for hard-wrapping).

Working on adding support for tables; stay tuned.

Damien Tournoud’s picture

Title: Fix the broken formatting in drupal_html_to_text() and also provide tests » Improve drupal_html_to_text()
Category: bug » feature

This issue transformed into a massive feature request. Let's keep the other #1130198: Regression: line-breaks are mangled by \Drupal\Core\Utility\Mail::htmlToText() to fix the regression.

pillarsdotnet’s picture

Title: Improve drupal_html_to_text() » Fix the broken formatting in drupal_html_to_text() and also provide tests
Category: feature » bug
FileSize
46.94 KB

Full table support added.

pillarsdotnet’s picture

Added inline comments to explain the strategy for generating tables, improved efficiency slightly, and cleaned up inline comments in test code.

pillarsdotnet’s picture

Title: Fix the broken formatting in drupal_html_to_text() and also provide tests » Improve drupal_html_to_text()
Category: bug » feature
Status: Needs work » Needs review
pillarsdotnet’s picture

pillarsdotnet’s picture

Minor changes from actual use-testing.

pillarsdotnet’s picture

Status: Needs work » Needs review
FileSize
50.12 KB

Last one; I promise... :-) EDIT: must have had my fingers crossed.

Applied to Mail System
6.x-2.23 release
7.x-2.23 release
8.x-2.23 release

Status: Needs review » Needs work

The last submitted patch, drupal_html_to_text-improvements-299138-232.patch, failed testing.

pillarsdotnet’s picture

Status: Needs review » Needs work
FileSize
50.1 KB

Fixed errors in tests.

pillarsdotnet’s picture

Status: Needs work » Needs review
pillarsdotnet’s picture

Status: Needs work » Needs review
FileSize
50.12 KB

Corrected a one-off error: Need to subtract the length of the line-break character from the maximum line length before wrapping or adding padding.

pillarsdotnet’s picture

Removed url() calls from test code and also switched to DrupalWebTestCase.

pillarsdotnet’s picture

Corrected the initial calculation of table cell wrapping width, and eliminated a duplicate assignment.

pillarsdotnet’s picture

#237 fails, which means that #238 will probably fail too, but #1179448: verbose() results are missing.

Moving verbose() text to assertion message so that I can figure out why the testbots are failing where my local tests are succeeding.

sun’s picture

@pillarsdotnet: In light of #239, it would be helpful if you would run tests and debug test failures on your local box, before submitting patches to drupal.org. The common exception are patches that affect all of Drupal core, which take hours to test locally and only ~30 minutes for the testbot. But especially if it's only a single test-case that needs to be run, it's not really time-consuming to do that on your local box. Probably 90% of all follow-ups on this issue are pure patch/testing noise, which could have been avoided, in turn leading to more solid and focused follow-ups, reviews, and faster progress. Thanks!

pillarsdotnet’s picture

Title: Fix the broken formatting in drupal_html_to_text() and also provide tests » Improve drupal_html_to_text()
Category: bug » feature
Status: Closed (works as designed) » Needs review

I am testing on my local box. Tests pass here but fail on drupal.org. Dunno why.

EDIT: Okay, I need to run local tests with clean urls turned off and drupal running from a subdirectory, else the results aren't valid.

Probably 90% of all follow-ups on this issue are pure patch/testing noise

Really? 220 comments worth? (counting non-noise comments... 148 out of 243)
So.... 40% noise, including my (now removed) snarkiness.

Status: Needs review » Needs work

The last submitted patch, drupal_html_to_text-improvements-299138-239.patch, failed testing.

pillarsdotnet’s picture

Title: drupal_html_to_text() formatting is broken and does not have tests » Improve drupal_html_to_text()
Version: 7.x-dev » 8.x-dev
Assigned: sun » Unassigned
Category: bug » feature
Status: Needs work » Needs review
Issue tags: -Needs tests, -TestingPartySF +Needs backport to D6
FileSize
55.86 KB

Improved table generation accuracy and performance.
Added test for correctly formed deeply-nested tables.
Test also asserts that table generation takes less than one second.

pillarsdotnet’s picture

Status: Needs work » Needs review
pillarsdotnet’s picture

Status: Needs review » Needs work
pillarsdotnet’s picture

Status: Needs review » Needs work
Issue tags: +Needs backport to D7
FileSize
56.36 KB

Found and fixed a typo -- doxygen comment didn't match parameter name.

Applied to Mail system
6.x-2.25
7.x-2.25
8.x-2.25
pillarsdotnet’s picture

Fixed an edge case reported by Seph. Haven't added a test because I haven't yet figured out how to reproduce the problem. CNR for testbot only.

Applied to Mail System
6.x-2.26
7.x-2.26
8.x-2.26
pillarsdotnet’s picture

Changing back to "needs work" because #247 still lacks the following features:

  • A test for the edge case reported by Seph.
  • Support for tags such as <ins>, <del>, and <strike>.
  • Tag attributes, notably align, rowspan, and colspan.
  • A distinction between <th> and <td>.
  • CSS / stylesheet support.

Also, it doesn't end world hunger, save the whales, or promote world peace. Definitely needs work.

Meanwhile, please support #1130198-46: Regression: line-breaks are mangled by \Drupal\Core\Utility\Mail::htmlToText()

salvis’s picture

Title: Fix the broken formatting in drupal_html_to_text() and also provide tests » Improve drupal_html_to_text()
Assigned: pillarsdotnet » Unassigned
Category: bug » feature
Status: Needs review » Needs work
Issue tags: -Needs committer feedback

@pillarsdotnet: Please stop rewriting history (like #164 that you posted on June 1 and edited just now).

attiks’s picture

subscribing

Wim Leers’s picture

Version: 8.x-dev » 9.x-dev
catch’s picture

Version: 9.x-dev » 8.1.x-dev
Issue summary: View changes
Issue tags: +API addition, +Needs issue summary update

Could be done in 8.1.x as an API addition.

ianthomas_uk’s picture

Title: Improve drupal_html_to_text() » Improve \Drupal\Core\Utility\Mail::htmlToText()

Version: 8.1.x-dev » 8.2.x-dev

Drupal 8.1.0-beta1 was released on March 2, 2016, which means new developments and disruptive changes should now be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.0-beta1 was released on August 3, 2016, which means new developments and disruptive changes should now be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.0-alpha1 will be released the week of January 30, 2017, which means new developments and disruptive changes should now be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.4.x-dev » 8.5.x-dev

Drupal 8.4.0-alpha1 will be released the week of July 31, 2017, which means new developments and disruptive changes should now be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.5.x-dev » 8.6.x-dev

Drupal 8.5.0-alpha1 will be released the week of January 17, 2018, which means new developments and disruptive changes should now be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.6.x-dev » 8.7.x-dev

Drupal 8.6.0-alpha1 will be released the week of July 16, 2018, which means new developments and disruptive changes should now be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.7.x-dev » 8.8.x-dev

Drupal 8.7.0-alpha1 will be released the week of March 11, 2019, which means new developments and disruptive changes should now be targeted against the 8.8.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.8.x-dev » 8.9.x-dev

Drupal 8.8.0-alpha1 will be released the week of October 14th, 2019, which means new developments and disruptive changes should now be targeted against the 8.9.x-dev branch. (Any changes to 8.9.x will also be committed to 9.0.x in preparation for Drupal 9’s release, but some changes like significant feature additions will be deferred to 9.1.x.). For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 8.9.x-dev » 9.1.x-dev

Drupal 8.9.0-beta1 was released on March 20, 2020. 8.9.x is the final, long-term support (LTS) minor release of Drupal 8, which means new developments and disruptive changes should now be targeted against the 9.1.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

jungle’s picture

Outdated? htmlToText() is in \Drupal\Core\Mail\MailFormatHelper ATM. \Drupal\Core\Utility\Mail::htmlToText() does not exist in 9.1.x.

jungle’s picture

Status: Needs work » Needs review

Setting to NR for closing this if applicable.

kscheirer’s picture

Category: Feature request » Task
Status: Needs review » Needs work
Issue tags: +testing coverage

Well the original issue (12 years ago?? sheesh) was to add tests for htmlToText() since it's part of Drupal core.

Looking at core/tests/Drupal/Tests/Core/Mail/MailFormatHelperTest.php::MailFormatHelperTest() we still are not testing the htmlToText() function, so I think this issue still applies. Along the way this issue also wanted to improve the function. I haven't read through all the comments, but perhaps it would be easier to commit tests as-is, and then a new issue for improvements. Or close if there is no desire for the improvements anymore.

Version: 9.1.x-dev » 9.2.x-dev

Drupal 9.1.0-alpha1 will be released the week of October 19, 2020, which means new developments and disruptive changes should now be targeted for the 9.2.x-dev branch. For more information see the Drupal 9 minor version schedule and the Allowed changes during the Drupal 9 release cycle.

quietone’s picture

Status: Needs work » Closed (outdated)

I spent some time reading the patch and I agree with jungle in #263 that this is outdated.

The tests are being converted to Unit tests in #2035133: Convert system module's Mail unit test to phpunit. And it there appears to be agreement to change the mail system, see #1803948: [META] Adopt the symfony mailer component.