Support for Drupal 7 is ending on 5 January 2025—it’s time to migrate to Drupal 10! Learn about the many benefits of Drupal 10 and find migration tools in our resource center.
After being applied to the 7.x branch, it should be considered for backport to the 6.x branch. Note: This tag should generally remain even after the backport has been written, approved, and committed.
Needs backport to D7
After being applied to the 8.x branch, it should be considered for backport to the 7.x branch. Note: This tag should generally remain even after the backport has been written, approved, and committed.
API addition
Enhances an existing API or introduces a new subsystem. Depending on the size and impact, possibly backportable to earlier major versions.
Kevin HankensCreditAttribution: Kevin Hankens commented
One quick question - sorry if this is in the wrong forum. If the $allowed_tags argument is empty, would you consider using the allowed tags list from the Filtered HTML input format settings? This would be more aligned with the bulk of the site content.
I use a lot of div tags in my content and wouldn't want to pass the entire list of allowed tags plus 'div' in an array each time I call drupal_html_to_text.
Kevin HankensCreditAttribution: Kevin Hankens commented
Oh, I forgot to add that for me a div tag would signify a separate section, so I also wanted to see a case for it in the switch stanza that worked like the p tag.
Kevin HankensCreditAttribution: Kevin Hankens commented
I haven't submitted any simpletests before, but I started working on one yesterday for this function. If you are interested, I would be happy to finish it up and send it this way for review. If so, here are a couple of quick questions.
Would the testing be as simple as: a) checking that there are no HTML tags; and b) testing a piece of HTML code against an expected result - or can you foresee something more complex?
There aren't any core tests under Mail, would this require a new Group, or grouped with another?
1. That'd probably be fine. Anything else can always be added to later, but the idea here is provide input, pass it through the function, verify expected output, not much more than that.
2. You'd need a new group and a new mail.test - you could copy modules/simpletest/tests/form.test (or some other .inc that already as tests) to get started quickly.
Also feel free to post patches that are still in progress if you think there's something worth looking at, even if it's not quite ready.
Here's a proposed mail.test file. It's a little rough, but the tests worked, and I was able to make them fail by manipulating the output of drupal_html_to_text(). Here are some considerations:
The line breaks in the expected output are ugly, but I wasn't able to replicate it using new line codes. Perhaps we could just use a simpler piece of HTML for testing.
The regex expression for the HTML matching is something that I've had saved for quite some time, but I don't know the identity of the original author. If this is a problem, we'll need a Drupal coder to replace it with something of ours.
I tried using assertNoPattern() instead of using preg_match with a boolean because it was cleaner, but I couldn't get it to work without it throwing a warning. It appeared to work, but I got the yellow warning bar saying "delimiters cannot be alpha numeric or a back slash"
Otherwise, take a look and see if it's something you can use!
Well, i started reviewing the test patch, but ended writing it. Sorry about that. Some notes:
I solved the the not-so-nice-linebreaks with the use of a define LF
the tests are all defined in an array for readability;
Kevin, i'm not sure what to do with the div patch, since the issue is about the making of tests, i would say move it to another issue with the div patch alone?;
next step should probably be tests for the parameter $allowed_tags, or more for exceptional use, or combined use of tags, or can we accept the scope of the patch as is, considering the size of it?
Why are we using the setUp() function if there's only one test?
Comments need proper capitalization.
I don't think LF is an improvement, what's wrong with "\n"?
It's not clear to me what the second half of the test is doing... it seems like it's testing a separate aspect and would be better served as a separate test with a proper description.
After searching Drupal for references to drupal_html_to_text i stumbled upon the initial patch for this function (#154218: Add HTML to text convertor for email (and wrap mails properly)) by Steven. Here he showed a sample conversion, which gave me the idea to add an more functional unit test to the patch. Pro: it tests the function in a more reallife (tm) use, Cons: i adds a lot of lines to the mail.test file. What do you think?
test results so far:
There is some strange behaviour with utf8 characters, and some trivial bugs in drupal_html_to():
A single <li>foo</li> gives a PHP:notice;
There is always an extra whiteline at the end;
There are trailing whitespaces in the output.
I will make a separate issue for these findings (if not too trivial), after this patch is committed ;-) ...
It probably doesn't matter, but since this is a test case and should try to be as near the actual case as possible, shouldn't valid HTML be used? For example, the line <ul><li>Drupal<li>Drupal</ul> should probably be <ul><li>Drupal</li><li>Drupal</li></ul>, unless there is a reason for the invalid htm — perhaps you were trying to test the function's handling of such a case?
Could we ammend the first line with a comment that states that we're explicitly testing malformed HTML?
Also, there is no test for an UL in an OL and vice-versa.
More code cleanup, coder gives no errors anymore. Also i changed the test to DrupalUnitTestCase for speedup of the test (see #464714: Speed up the tests).
Minor fixes required:
There are a number of trailing spaces in the patch. Also should 'Drupal<br>Drupal' not really be 'Drupal<br />Drupal' inline with Drupal standards, or is that <br> on purpose as part of the test?
Also not sure about the assertion message text below, it could use some simplification:
$this->assertEqual($result, $expected_result, t('Test output of drupal_html_to_text(Long html mail with all sorts of html tags.)'));
Ditto with the assertion messages for the assertions in testDrupalHtmlToTextArgs(). I'd rather English sentences explaining the functionality that passed/failed, rather than a piece of code. And how does the 3rd assertion here differ in functionality from the 1st? They both appear to be testing that the array() overrides the default html tags allowed.
Other than those few things, patch is looking good, great work!
Also should 'Drupal Drupal' not really be 'Drupal Drupal' inline with Drupal standards, or is that on purpose as part of the test?
Good catch! The answer is: both. We need to test both, since the handling of <br> and <br /> may break. Whereas the former should go into the "malformed" array section.
The other remarks from stella I agree with.
This is a wonderful & great patch! Let's get it done! :)
ah, those trailing spaces; believe it on not, they are on purpose. Those are generated by drupal_html_to_text(). Removing them would cause this test to fail. This is added as a comment in the
code.
Added separate tests for the <br> and <br /> tags.
Tried to improve the test messages to more English sentences; t('Test of drupal_html_to_text() with a long html mail.'). Could use a suggestion here, if this is not what you were looking for.
The idea of the third test of testDrupalHtmlToTextArgs() was to test if passing a second parameter would not break the existing funcionality. I combined this now with passing two tags, so it makes somewhat more sense (...)
+ * Tests html_to_text() with a full mail example.
s/Tests/Test/ - PHPDoc summaries never use the third form.
+ // This is the html code to test.
HTML we want to write uppercase here.
+ /*
+ * The expected result after html_to_text($html).
+ * Do not remove the trailing whitespaces, as these are generated, and
+ * will otherwise fail this test.
+ */
That's a wrong notation for inline comments. We only use multiline comments for PHPDoc, just use // here.
Also, I believe the function name is drupal_html_to_text(), no?
Can we simplify that to Trailing white-space is required to match the output of drupal_html_to_text(). ?
+ * Test $allowed_tags argument of html_to_text().
The getInfo is now a public static, and the descriptions use a more active tense now. Also changed mail to e-mail, and spotted a missed lowercase html. Last: added an mail.test entry to simpletest.info, otherwise it wont
show up in the testlist.
Need a newline at the end of modules/simpletest/tests/mail.test to follow coding standards
You have two tests that report with the message "Tests drupal_html_to_text() basic working of the parameter $allowed_tags." - perhaps the second one could explain how it differs from the first?
When reporting the function and parameters used, why truncate at 15 characters? Most of the time the test text will not be seen by anyone, but when one fails it will be useful to see the complete test performed.
# Need a newline at the end of modules/simpletest/tests/mail.test to follow coding standards
changed.
# You have two tests that report with the message "Tests drupal_html_to_text() basic working of the parameter $allowed_tags." - perhaps the second one could explain how it differs from the first?
how about: "Tests drupal_html_to_text() with tags which are not given by the parameter $allowed_tags.". Better suggestions are welcome!
# When reporting the function and parameters used, why truncate at 15 characters? Most of the time the test text will not be seen by anyone, but when one fails it will be useful to see the complete test performed.
ok, looks better indeed. Also removed the t() there as t('drupal_html_to_text()') makes no sense.
These tests seem to be failing due to a bug in the actual function. That is, I think the tests are working as intended, and catching actual errors. Specifically, line breaks aren't be added as expected in drupal_html_to_text(). It's not clear to me whether it's really necessary to have line breaks at the end of many of the examples, but there definitely needs to be a line break in place of an actual page break (e.g. "Drupal<br />Drupal", and there isn't currently.
So I think the function needs to be updated to pass the tests, but I'm not very familiar with the function. Maybe someone more familiar with the function can confirm this?
@@ -466,7 +496,7 @@
$chunk = $casing($chunk);
}
// Format it and apply the current indentation.
- $output .= drupal_wrap_mail($chunk, implode('', $indent)) . "\n";
+ $output .= drupal_wrap_mail($chunk, implode('', $indent));
// Remove non-quotation markers from indentation.
$indent = array_map('_drupal_html_to_text_clean', $indent);
}
The commit message was "- Patch #356074 by chx, Damien Tournoud: provide a sequences API." Which does not make much sense to me for this change... Perhaps the idea was to replace \n by MAIL_LINE_ENDINGS ?
Just remove "\n" from all the assertions. As you can see, the input text of those assertions does not contain a newline or BR, so the outlined change in #48 was a fix to not insert newlines, which do not actually exist; therefore making newlines match between input and output.
sun - what wasn't obvious (because we had no tests for this function!) was that many of the tags were relying on that extra \n to format themselves properly. Stuff like h1, h2, dl, br, p, ol, ul, li, dl, dt, dd, etc. I think your patch was trying to make this function more general - but it doesn't seem to fulfill its original purpose now. Also _drupal_html_to_text_pad() is doing a substr, relying on the fact that there was an extra \n to throw away.
So right now drupal_html_to_text('Drupal<br />Drupal') returns DrupalDrupal, which is probably not what we want. Some of these can be fixed by changing $chunk = ''; // Ensure blank new-line. to $chunk = "\n"; // Ensure blank new-line. .
But that doesn't take care of all the cases - like '<ul><li>Drupal</li><li><ol><li>Drupal</li><li>Drupal</li></ol></li></ul>' - what text should that result in? Reading RFC 3676 did not make it clear either.
Arjenk - you wrote many of the test cases way back, based on #154218: Add HTML to text convertor for email (and wrap mails properly) - which included some examples that aren't available online anymore. It now looks like we have broken tests for a broken function. This may be a case where we write the test cases first, and then have the function match them - anyone up for it?
First, kscheirer gave a excellent summary. And i agree with his
conclusion that the function is broken, but i'm not sure we can fix it
in the test.
The function is called drupal_html_to_text(), it is (was) a function
that was capable of formatting html to readable, formatted (hear the
newlines here?) text. It did need to add newlines in order to make
something out of the orginal html. To not allow the function to add
newlines, is to break the use of the function. So that is the status
of the current function: broken.
to give an example to illustrate:
[user:name],
<p>Thank you for registering at <a href="[site:url]">[site:name]</a>. Your application for an account is currently pending approval. </br>Once it has been approved, you will receive another e-mail containing information about
<ul>
<li>how to log in,
<li>set your password,
<li>and other details.
</p>
-- [site:name] team
(this is not the real text from core) gave the following text email:
test,
Thank you for registering at localhost [1]. Your application for an account
is currently pending approval. Once it has been approved, you will receive
another e-mail containing information about
* how to log in,
* set your password,
* and other details.
-- localhost team
[1] http://localhost/drupal7/
but now:
test,
Thank you for registering at localhost [1]. Your application for an account
is currently pending approval. Once it has been approved, you will receive
another e-mail containing information about * how to log in, * set your
password, * and other details.
-- localhost team
[1] http://localhost/drupal7/
The options are, to my opinion:
If it really is a problem to change the number of newlines, remove
the function from core, and change it with strip_tags, or
likewise. Which effectively removes all tags in mails.
Make a limited drupal_html_to_text() that only accepts (em, i,
strong, b) and removes the others. Perhaps also change the name as
well.
Agree that the fix in #407452: Remove use of drupal_html_to_text() from system mails broke the function, and go back to
there and try to fix it in another way. Perhaps make a test first if
we are dealing with a html text, and then call the function, and
else do not...
Last option 3) in #53 is not possible. I tried to track and keep up with this issue, but didn't manage to do so. I'm not sure whether I understand #50. I like the tests added in #53, but didn't verify in detail whether they are valid. Overall, drupal_html_to_text() implements quite complex logic, so I'm pretty sure that it is possible to achieve any newline behavior we can think of. We should start by making sure that our test contains all expectations. Afterwards fix the logic.
We should start by making sure that our test contains all expectations.
Agreed, we're back to square one.
Define what exactly this function is supposed to do
Write tests for it
Fix the function to pass tests
Need to track down when this function came into Drupal and see what its original purpose was?
@sun: The short version of #50 is - your patch in #407452: Remove use of drupal_html_to_text() from system mails broke this function (or at least, it doesn't do what it did before that patch). There were no tests for this function in core to catch that. The old function may not have been correct, but the current version definitely is not.
I've slightly fixed the tests (which also means that expectations might be slightly changed).
The tests added here should definitely not try to be "compatible" with the current output of drupal_html_to_text(), but rather assert our expectations.
Our expectations are pretty clear: A proper conversion from HTML to plain text. If you happen to get e-mail notifications from drupal.org, then you may know most details of the expected format already.
Test results are clear, the current function is horribly broken. We should have another round to make sure that the expectations are right, then start to actually fix the bugs.
It would also help to add a test for the issue which led to the troublesome patch. Perhaps it had something to do with wordwrap and adding a newline on long lines? So how about adding the test:
A generating agent SHOULD:
o Ensure all lines (fixed and flowed) are 78 characters or fewer in
length, counting any trailing space as well as a space added as
stuffing, but not counting the CRLF, unless a word by itself
exceeds 78 characters.
o Trim spaces before user-inserted hard line breaks.
among others, we should also add some tests to test if this works properly.
» drupal_html_to_text() formatting is broken and does not have tests
Priority:
Normal
» Major
ok, but what about the existing expectations/assertions in the patch? I need a confirmation that they are correct, before I'll continue with this patch.
Well considering we don't even have enough information to write tests, I think we have a serious problem here. No one posting in this issue queue seems to know what the tests should look like. Someone has to know what this function is supposed to do, can we find the person that committed it?
This bug renders the Simplenews module unusable and blocks further progress towards a D7 release there.
Please commit what you have there. #69 fixes the most urgent problem with drupal_html_to_text(). Further tests are a good idea but shouldn't block fixing a major bug. Thanks.
@pillarsdotnet: Please reconsider your attitude. (You're not talking to a script kiddie, and even then you don't need to assume it was bullshit.)
Actually, I'm using Simplenews on D7 just as well to deliver our newsletter every two or three days.
So "unusable and blocking further progress" might in fact be a bit overstated. Correct would have been: "out of the box it is unusable, and because the bug is to be solved upstream, it paralyzes further development, plus: because admins don't get it work correctly, they refrain from further testing".
I thought to be specific enough, because it is obvious enough what patch #69 does. However to be even more detailed:
After applying #69 from this queue, drupal_html_to_text() stopped stripping off newlines from my simplemail newsletters (see #1099188: simplenews_html_to_text() strips off newlines).
Now, how come you're obviously not affected by the bug?
You might be using a different backend, such as Mime Mail, bypassing drupal_html_to_text().
Now the title of this major (!) issue here states "drupal_html_to_text() formatting is broken", so it doesn't mainly refer to adding tests. So what I'd like you to consider is fixing "drupal_html_to_text()" asap based on the tests we have now, even if the tests are not 100% complete. Thank you.
Simon GeorgesCreditAttribution: Simon Georges commented
Closing #1099188: simplenews_html_to_text() strips off newlines as a duplicate of this one, and subscribing.
If I may be of any help to help solve this issue, I'm clearly willing to help, to stop people raising this in Simplenews issue queue ;-).
@sun -- You fix it, then. This whole issue is a "feature request". The only part that was a bugfix was the patch in #75 which you so offhandedly rejected.
Confirm that the tests included with the patch are necessary and sufficient.
Confirm that the patch actually solves the problem. Describe a reproducible test where the current code fails and the patched code succeeds.
Change the issue status from "Needs Review" to "Reviewed and Tested By the Community".
If, after thorough testing, you cannot in good conscience affirm all of the above, please explain what is lacking in the current patch and I will do my best to address your concerns.
Lars ToomreCreditAttribution: Lars Toomre commented
@pillarsdotnet:
I spent some time reading through the patch file tonight and the new tests. As a whole, the new patch and tests look good. However, I would suggest that you expand the tests to include some additional cases:
1) Multiple good tags override
$text = '<ul><li>This is a <em>complex</em> text string.</li></ul>';
$result = drupal_html_to_text($text, array('em', 'ul'));
2) Purposefully mal-formed tags in string test (from user input)
$text = '<ul><li>This is a <em>complex</em> text string.</ul></ul>';
$result = drupal_html_to_text($text, array('li','em'));
should be => ' * This is a *complex* text string.'
3) Ignorant user example of highlighting
$text = "This is <B>highlighted</B> text.';
$result = drupal_html_to_text($text, array('B'));
should be => 'This is *highlighted* text.'
Cases 2) and 3) I have experienced in the last few months from users hand-entered node bodies and/or comments. These are examples of the edge cases that I think this particular function needs to test for. Users are notorious for what they might enter without trying to be malicious.
I am unsure whether this function is intended to be used with utf-8 strings. If it is, I would want to add some text strings that include characters outside of the usual ASCII range like the European special vowels and non-Roman language sets.
Finally, I can easily see this function used as part of a larger webpage data extraction function. That typically involves using something like cURL to retrieve a webpage as $text and then start stripping out the tags to get to whatever information is desired. Hence, I think that we also want a test that takes a mocked up web page and demonstrate that this function handles that "complex case" as well.
Edit: Two further thoughts/questions:
1) What happens if 'html' is passed in as the second parameter?
2) Looking at #732992: Allow filter_xss_admin() to accept HTML5 tags, I am wondering if the permissible html tags acceptable as the second variable should be constrained to the acceptable terms used in filter_xss()? If so, how are we going to deal with what is acceptable for HTML5 (being dealt with in that issue)?
Okay, I've added all the tests that Lars asked for, except for the utf-8 ones. I really don't know what to do about those and would welcome any suggestions.
@further thoughts/questions:
1) What happens if 'html' is passed in as the second parameter?
Nothing. I added a test to prove it. Unsupported tags are stripped and ignored.
2) Looking at #732992: Change filter_xss_admin() to take a variable by default instead of a hardcoded list., I am wondering if the permissible html tags acceptable as the second variable should be constrained to the acceptable terms used in filter_xss()? If so, how are we going to deal with what is acceptable for HTML5 (being dealt with in that issue)?
Same as the previous. Passing unsupported tags into the second parameter will not affect the output, unless someone rewrites the drupal_html_to_text function.
If I get *real* ambitious, I might rewrite the entire function to make it easier to add new tags in the future, but right now I'm just looking for short-term results.
Lars ToomreCreditAttribution: Lars Toomre commented
Sorry if my point was not clear enough... I was trying to highlight the <B> case as in capitalized. I know that well formatted XHTML wants all tags in all lower-case, but users have the funny tendency to want to use both <B> and <b>. I thought a test to prove that we could accept either was appropriate.
OK... I have now read through all of the function changes and they make sense to me. I have not installed this patch though so I cannot say that it installs or runs as expected.
The added comments in the code look good, with full sentences ending in periods. I did not open up the patch in a text editor so I am unable to confirm that the comments do not extend beyond line 80. I also was unable to check for extra white spaces at the end of lines. There is one phrase 'see #345931: Support value attribute in ordered lists.' which I am unsure about in a comment. I am pretty sure we are not suppose to reference d.o. issues in comments, except in special cases. Is this the format that a special case should be in?
From reading the code and the tests, this patch appears to solve the problem and passes appropriate tests. I am not prepared to RTBC this however since I cannot confirm that the patch actually applies. Other steps to complete the review of this patch include:
1) A couple of people to also review this patch and make sure that it applies appropriately.
2) Add a couple of utf-8 strings with HTML tags embedded therein to make sure that the regex expressions do not degrade for instance an o with omlats to an ascii o.
Perhaps someone who does not speak English as a primary language could make some suggestions for item #2??
I did not open up the patch in a text editor so I am unable to confirm that the comments do not extend beyond line 80.
The comments don't. Some of the code does, a little.
I was trying to highlight the <B> case as in capitalized
See the very last test. It compares drupal_strtoupper(drupal_html_to_text($input)) with drupal_html_to_text(drupal_strtoupper($input)) where $input is a fairly long mix of tags and text.
There is one phrase 'see #345931: Support value attribute in ordered lists.' which I am unsure about in a comment.
You are correct. I have removed the references.
Now running coder review... done. Fixed problems identified by coder.
Responding to #97 and also to sun's comment in #86:
I think that we also want a test that takes a mocked up web page and demonstrate that this function handles that "complex case" as well.
First of all, this function is not supposed to do as good a job as a dedicated text-mode browser such as Lynx, links, or w3m. If you want a general data extractor, write a module. This is for sending plain-text email messages. That's why it's located in includes/mail.inc instead of modules/filter.
Secondly, producing such a test as you describe basically means taking a big chunk of html, running it through the function, and copying the output to your $expected variable. What does that prove? Only that any future improvements to the function must also be accompanied by a change this huge test. And very likely, the future coders will likewise run the html through the function and retroactively "expect" whatever comes out the other end. I think this is a poor testing methodology. That's what I meant when I said in #84 above, "I think it's going about things the wrong way."
Thirdly, I *have* run web-pages through the original function (as it existed in d6 or in #1130198) and I can testify that the resultant output is acceptably readable. I have about twenty people on my Simplenews mailing list who must receive text-only messages, and they're doing just fine.
Lars ToomreCreditAttribution: Lars Toomre commented
OK... Glad to know that you had run a larger more complex text string (webpage) through this function. Skip the thought since you already demonstrated that 'html' tag cannot be 'slipped in' as an exception tag. My only other concern was what would happen if it encountered 'mal formed' HTMl code.
Glad that you ran the code through coder and corrected anything that came from that. Now if we can get the two items I raised previously in #100, we should be good to go!!
P.S. - I have code in contrib module that mimics this function and another that just eliminates one HTML tag. I am curious to test out this function as a possible replacement/enhancement there.
don't give up! You're trying to get drupal_html_to_text-tests+fix-299138-107.patch in right? I'll take another look as soon as I can, and thanks for picking this up.
It appears that this bugfix is also applicable to D6. Tagging accordingly.
Also removing the "Needs tests" tag because I honestly believe the tests in this patch are sufficient. If anyone disagrees, please do so constructively. That is, don't just complain -- provide a suggestion for the test you think is missing.
Or perhaps you are claiming that the d6 and d7 versions of drupal_html_to_text() are somehow completely different despite sharing identical doxygen headers?
I was under the impression that this is the bug report dealing with drupal_html_to_text(). The following is what I understand is wrong about the current D7 drupal_html_to_text implementation.
Suppose you pass the following HTML through drupal_html_to_text.
<p>foo</p><p>bar</p><p>baz</p>
Expected output ( aka the 'acceptable plaintext representation') should keep the paragraph divisons of the text with:
foo
bar
baz
Output in D6 is according to expectation, in D7 we get:
None of the above functions check whether the argument to drupal_html_to_text() is HTML or plaintext.
Heine asserts that drupal_html_to_text() should collapse whitespace, by design. Yet this would break the functionality of the only cases where the function is used.
So I ask Heine, for the record, which of the following is true?
Your assertion was in error, and we should expect drupal_html_to_text() to preserve plaintext formatting.
All the documented use cases are in error, and drupal_html_to_text() should be deleted from drupal core.
Other then that, this looks RTBC to me. Also, this is purely a bug report. drupal_html_to_text() was accidentally broken on D7, and this patch fixes it.
@Heine: I think that pillarsdotnet was talking about backporting some additional bug fixes for drupal_html_to_text() that are also in that patch. The behavior around line breaks between the tags should obviously be the same on D6 and D7.
@pillarsdotnet: Heine is not saying that D7 behavior is correct.
// Convert inline HTML text to plain text; not removing line-breaks or
// white-space, since that breaks newlines when sanitizing plain-text.
I find it interesting that the relevant commit is the same one which provoked this issue in the first place. Obviously there was more intended than the "provide a sequence API" comment would suggest.
I restate my conclusion in #117: either whitespace should be preserved, as required by the above comment and documented use-cases, or the function should be removed entirely. As I also said in #102, this function is not intended as a general-purpose text-mode browser. It is a convenience function to ensure that text emails remain readable, regardless of the input format.
Perhaps the simplest way to satisfy you would be to rename this function something else, like "drupal_sanitize_plaintext", or "drupal_html_or_text_to_plaintext".
Not really, sending mail is fundamentally broken; the API is misdesigned.
Nope, not at all. Mails are handled 100% in HTML in Drupal 7, and are only converted into plaintext (via drupal_html_to_text()) at the very end of the chain (when sending out the mail), because core by default only support sending email serialized as plain-text. You can see a summary of the design on #407452-35: Remove use of drupal_html_to_text() from system mails.
Converting plain-text to HTML before calling drupal_mail() is the responsibility of the caller.
You are completely right in saying that drupal_html_to_text() should collapse whitespace. pillarsdotnet is wrong in expecting that plain-text input should be handled.
I give up; you win. Re-rolled as requested. Also removed the comment.
If and when this ever trickles down to D6 it will either need to preserve whitespace or be accompanied by a patch to every place in core that calls drupal_mail().
(looking...)
Hmm... Not such a big list as I feared... Will probably tackle this myself when the time comes.
The old code doesn't support the <pre> tag at all. I only added it as a synonym for <p> because it's listed among the block-level tags and I wanted a test that every block-level tag forces a line-break.
I've made some modifications so that consecutive block level elements are separated by a blank line (i.e. two new line characters rather than one) and nested lists are handled better. As part of these changes block level elements output a new line before and afterwards.
The most significant changes are a) the tags now output chunks as they need this control when nested, b) the indent is applied by the following value rather than by the current tag and c) $item is used as the current element of $split so $value can store values only (as they may be needed by closing tags when determining their chunk).
More tests have been added, to test these modifications and test short tags that don't contain a space (which is the more correct use - the space was only recommended as a guideline to support old user agents - see http://www.w3.org/TR/xhtml1/#guidelines)
I also replaced the 3 instances of \r\n in drupal_html_to_text() with MAIL_LINE_ENDINGS as I assumed that they were forgotten earlier. Hopefully there wasn't a reason they hadn't been changed!
I wasn't sure what to name the patch so I kept the same format as the last one and incremented the last integer in its name. Hope this is ok.
P.S. I couldn't find the Coder module for 8.x. Is there any way to automatically test coding guidelines on 8?
Note that the code in this issue has been incorporated into Mail System since 6.x-2.6/7.x-2.6/8.x-2.6. According to the usage statistics page, that's 1,348 users. Does this qualify as "Reviewed and Tested by the Community" ?
Well, it's had plenty of review along the lines of "Looks good but..." and all the noted shortcomings have been fixed and tests added to ensure that they stay fixed.
Marking "Needs committer feedback" in hopes that someone will be willing to stick their neck out.
I'm going through all the tags. You've done an amazing job with this. Here are a few more comments:
I like your <hX> implementations, but I'd suggest adding 3/2/1 additional newlines above the H1/H2/H3 heading to separate them from the text above.
<p>a</p><p><hr /></p><p>b</p>
currently produces two empty lines above and one below the separator line; there should be just one empty line above and below.
<pre> lines must NOT be line-wrapped.
<ins> and <del> are inline elements, not block elements. Currently you're treating the tags as line breaks, which is not correct. They are typically rendered as underline and strikeout. The former is usually _underline_. I'm not aware of any standard for strikeout, but I'd suggest -strikeout-.
<u> and <strike> are missing.
For ordered lists, more than 99 items are probably rare, but more than 9 are not unreasonable. Could you indent #1 though #9 by an additional space so that they'll line up with #10?
This is debatable, but I'd suggest inserting 2 spaces between table cells, and not wrapping long table rows. This would preserve a little bit of structure.
The difficulty in this issue is that it's so complex, and once a patch is committed, it'll be almost impossible to get new issues fixed that might pop up in the future. Putting this into production is certainly a good way to build confidence, but the feedback generated that way is very slow.
I'll incorporate your code into Subscriptions to generate some more feedback (hopefully), but the switch to HTML will take some major surgery before I can actually do this (right now I'm using the D6 mail.inc).
I like your <hX> implementations, but I'd suggest adding 3/2/1 additional newlines above the H1/H2/H3 heading to separate them from the text above.
I don't like them, and they aren't mine. I suggest emulating Markdown.
<p>a</p><p><hr /></p><p>b</p> currently produces two empty lines above and one below the separator line; there should be just one empty line above and below.
Two out of three text-mode browsers disagree with you. Both w3m and links render two empty lines above and below; only the older and arguably less-sophisticated lynx renders one empty line above and below. None of them, by the way, render headers as you suggest.
These two elements are unusual for HTML in that they may serve as either block-level or inline elements (but not both). They may contain one or more words within a paragraph or contain one or more block-level elements such as paragraphs, lists and tables.
The HTML-5 draft says you have to check all the child tags to decide whether to render them as phrasing content or flow content.
The Web Design Group says that they are inline if e.g., within another inline element or a P.
On reflection, I'm dropping support for both, as neither one has any accurate plain-text representation.
<u> and <strike> are missing.
And they were missing before, too. This is a bug report, not a feature request. See my comment in #102 above.
For ordered lists, more than 99 items are probably rare, but more than 9 are not unreasonable. Could you indent #1 though #9 by an additional space so that they'll line up with #10?
Since you ask so nicely, and I happen to agree, done. (2 files changed, 8 insertions, 8 deletions)
I'd suggest inserting 2 spaces between table cells, and not wrapping long table rows.
Wrapping is mandatory, according to RFC 3676, and will often be performed by the email-reading software even if we don't.
I considered coding full support for tables, but I thought about the colspan and rowspan attributes, and decided that it wasn't worth the trouble.
Edit 01/06/2011 16:03: Patch in comment#167 does the trick.
Edit 01/06/2011 15:49: Bad timing - pillarsdotnet posted at same time. I'll make the mods to his latest patch and repost.
I've found that the use of the PCRE_MULTILINE modifier m in preg_replace() resulted in different behaviour on different servers, possibly due to the difference in the PCRE version though I can't be sure at this stage.
Note how on server 2 (which has an older PCRE library), the replacement is also made on the trailing new line.
I've added a three line workaround to the drupal_wrap_mail() function, so tests are now green on both servers. The change was made to the last three lines before the return statement and simply removes the $clean_indent if it's present at the end of the text.
// Apply indentation. We only include non-'>' indentation on the first line.
$indented_text = preg_replace('/^/m', $clean_indent, $text);
$indented_text = preg_replace('/' . $clean_indent . '$/', '', $indented_text); // Workaround for differences in multiline behaviour.
$text = $indent . drupal_substr($indented_text, drupal_strlen($indent));
return $text;
I updated to the latest Drupal 8.x, applied the patch from comment#157 (http://drupal.org/node/299138#comment-4535020) and made my modifications. The attached patch should include those changes and my modifications. Note that it differs to the patch in comment#157 by more than the three lines I edited, which I can only assume is due to other changes in core since comment#157.
Two out of three text-mode browsers disagree with you. Both w3m and links render two empty lines above and below; only the older and arguably less-sophisticated lynx renders one empty line above and below.
<hr> is a separator. It separates two equal parts. There's no way to tell whether it should belong to the part above or below. Putting it in the middle is the right thing to do. Apparently, all three text-mode browsers agree with me that the space above and below should be equal. That's my point. I'd say one empty line is enough, but if you prefer two, that's perfectly fine with me, as long as the line is in the middle.
None of them, by the way, render headers as you suggest.
Let's take a step back. Headings are not 'footings' and not separators. They announce what comes after them, and as such they semantically belong to what follows. They have very little connection to what's above them. Thus the distance above should be larger than the distance below.
I don't know what the design goals and priorities of the text-mode browsers have been — probably to pack as much information on a 25-line CRT as possible, to minimize the required navigation (= transmission over a slow connection). That's completely irrelevant in a text file (or email message).
The goal here, IMHO, should be to render HTML as text in a way that helps the human reader to understand its structure and content. You're forced to drop a lot of information to convert to plain text, and any hint that you can reasonably provide will be appreciated by the human reader.
<pre> lines must NOT be line-wrapped.
RFC 3676 disagrees.
Huh? Our interpretations of RFC3676 seem to differ. Let me cite:
Text/Plain is usually displayed as preformatted text, often in a fixed font. That is, the characters start at the left margin of the display window, and advance to the right until a CRLF sequence is seen, at which point a new line is started, again at the left margin. When a line length exceeds the display window, some clients will wrap the line, while others invoke a horizontal scroll bar.
Text which meets this description is defined by this memo as "fixed".
RFC3676 goes on to say that there can be problems with this type of text, and that in general text should be flowed. BUT (my interpretation now) if the author explicitly says that he formatted the text exactly as he wants it to appear, then the renderer should comply, if technically possible.
The <pre> tag explicitly asks for the fixed format (preformatted text). Preformatted means that indents should be preserved, spaces should be preserved, line breaks should be preserved, etc. — essentially as much formatting should be preserved as possible. If you don't have a horizontal scrollbar (e.g. on a cell phone), then you obviously have to wrap the lines to avoid clipping them, but if you can avoid it, then you should preserve formatting.
Some email clients may still munge it, but others will do the right thing. If you munge it even before it leaves the site, then you ensure that it'll be equally incomprehensible for both camps. Is that the goal?
See also HTML5. It doesn't say this explicitly, but the intention of <pre> clearly is that line breaks should neither be added nor removed inside <pre>. That's what <pre> is all about.
See my comment in #102 above. --> This is for sending plain-text email messages.
Yes, and it's called drupal_html_to_text(). Thus, if I want to send a node body through email, then this function is the obvious choice. I thought the idea of this exercise was to make the D7 version better than the D6 one, was it not?
_Underlined_ is firmly established. I don't know whether it was supported in the D6 version or not, and it doesn't really matter. Call its absence a pre-existing bug if you want, but that doesn't justify carrying its absence over into your new implementation.
IMO, the goal should be to try to preserve as much information as possible. If a node author decides that underline or strikeout are needed to convey his meaning, then it would be pity to lose those markups when transmitting the node body through text-mode email, given that there is a reasonable way to express them.
Wrapping is mandatory, according to RFC 3676, and will often be performed by the email-reading software even if we don't.
Here's where we differ. Where in RFC 3676 does it say "MUST wrap"? In my reading it says "SHOULD wrap", which means there can be overriding considerations. The author's clearly expressed intention via <pre> or <table> are such an overriding consideration, IMO.
RFC 3676 even says explicitly that Hand-aligned text, such as ASCII tables or art, source code, etc., SHOULD be sent as fixed, not flowed lines.
NOTE: Some protocols defines a maximum line length. E.g. SMTP [RFC-821] allows a maximum of 998 octets before the next CRLF sequence. To be transported by such protocols, data which includes too long segments without CRLF sequences must be encoded with a suitable content-transfer-encoding.
The maximum total length of a text line including the <CRLF> is 1000 characters (but not counting the leading dot duplicated for transparency).
So line wrapping MUST occur before 1000 characters (998 plus CRLF). Outside of <pre> tags, it SHOULD occur at a convenient width for text-mode readers, which historically has been 80 characters (78 plus CRLF). If there is a problem, it lies in the fact that we use the same code to meet both constraints.
I thought the idea of this exercise was to make the D7 version better than the D6 one, was it not?
The idea of this exercise is to "fix the broken formatting in drupal_html_to_text() and also provide tests". Please understand that "broken" means something entirely different from "sub-optimal". I now regret that I overzealously added support for tags that were not originally supported, as I may have unnecessarily delayed adoption of the more important fix. Also note the "Needs backport to D6" tag.
_Underlined_ is firmly established. ... Call its absence a pre-existing bug
Fine; added support for the <u> tag, even though its actual use is quite rare. Stylesheets would be the far more common use case, but supporting them is way out of scope.
It seems like you have to remove the leading slash before you call url().
<p>a</p><p><hr /></p><p>b</p> currently produces two empty lines above and one below the separator line; there should be just one empty line above and below.
I wasted four hours on this one. Try running the following program:
$dom = new DOMDocument;
$dom->loadHTML('<html><body><p>a</p><p><hr /></p><p>b</p></body></html>');
echo $dom->saveXML();
You will discover that <p><hr /></p> gets corrected to <p/><hr/> because you can't nest <hr /> inside a <p> container.
Looking over this thread, the scope has expanded far beyond the initial bug report, which was addressed a long time ago.
The latest patch solves the original bug (lack of tests) without introducing any new bugs. If there are additional feature requests, or even other bugs that aren't yet fixed by this patch, I suggest we take them to separate issues and close this thread before it reaches 3 years old.
Lars ToomreCreditAttribution: Lars Toomre commented
I agree with @sreynen #173 that this patch fixes the original bug report and goes far beyond. Hence, I strongly urge that this patch be applied with the clear understanding that 'better' may not be 'perfect'. However, 'better' is also *much* better than the current state of core Drupal.
This patch adds the initial requested tests and passes tests also included for all of the extra code that was added to better handle the complex challenge of converting HTML text to its plain text equivalent. Edge cases, additional bugs and/or feature requests can be started in separate issues. It is time for this RTBC patch to be applied. It makes HTML-text to plain text conversion so much better!!
Skimmed through the patch and noticed a couple of minor nits in the doxygen docs:
* From somewhere whereabouts line 300ish in includes/mail.inc:
@@ -303,40 +305,47 @@ interface MailSystemInterface {
* We use delsp=yes wrapping, but only break non-spaced languages when
* absolutely necessary to avoid compatibility issues.
*
- * We deliberately use LF rather than CRLF, see drupal_mail().
+ * We deliberately use variable_get('mail_line_endings), MAIL_LINE_ENDINGS)
+ * rather than "\r\n".
Typo, need a ' instead of ) after mail_line_endings.
* From the comment block above function _drupal_html_to_text():
+/**
+ * Helper function for drupal_html_to_text
+ *
+ * Recursively converts $node to text, wrapping and indenting as necessary.
Please also make sure that all the RFC research is documented/summarized in inline comments (including RFC numbers). It would be silly to go through all of this research once again in the future.
Should be MAIL_NBSP. Although "NBSP" is not really descriptive. I don't understand why this is a constant, and why we're decoding to get the character instead of directly specifying the character. This needs to be documented in the (missing) phpDoc.
+++ b/includes/mail.inc
@@ -303,40 +305,47 @@ interface MailSystemInterface {
+ * We deliberately use variable_get('mail_line_endings), MAIL_LINE_ENDINGS)
+ * rather than "\r\n".
...
+ // Convert LF or CRLF into platform-specific line-endings.
I don't understand why we're adding a variable for this. Especially since the actual function code talks about platform-specific line-endings -- so if something is platform-specific, why is it a variable? A short explanation for why it's a variable would be helpful.
The first preg_replace() replaces \r\n with platform-specific or variable-defined line endings. The second preg_replace() still tries to match original line endings even though they got already replaced with $eol.
+++ b/includes/mail.inc
@@ -347,177 +356,328 @@ function drupal_wrap_mail($text, $indent = '') {
+ * @param $notes
+ * The list of footnotes, an associative array of (url => reference number) items.
Please also make sure that all the RFC research is documented/summarized in inline comments (including RFC numbers). It would be silly to go through all of this research once again in the future.
If the objective is absolute RFC-compliant pedantry, I'm unqualified for the job. Someone else will have to do it.
Should be MAIL_NBSP. Although "NBSP" is not really descriptive. I don't understand why this is a constant, and why we're decoding to get the character instead of directly specifying the character. This needs to be documented in the (missing) phpDoc.
Changed globally to chr(160), which is what it evaluates to in my copy of PHP. Added inline comments to explain the magic number.
I don't understand why we're adding a variable for this. Especially since the actual function code talks about platform-specific line-endings -- so if something is platform-specific, why is it a variable? A short explanation for why it's a variable would be helpful.
The explanation is at the top of the file, lines 8-13 inclusive:
/**
* Auto-detect appropriate line endings for e-mails.
*
* $conf['mail_line_endings'] will override this setting.
*/
define('MAIL_LINE_ENDINGS', isset($_SERVER['WINDIR']) || strpos($_SERVER['SERVER_SOFTWARE'], 'Win32') !== FALSE ? "\r\n" : "\n");
I'm using a variable $eol instead of repeating the code variable_get('mail_line_endings', MAIL_LINE_ENDINGS) but if you prefer the repetition, I will comply.
The first preg_replace() replaces \r\n with platform-specific or variable-defined line endings. The second preg_replace() still tries to match original line endings even though they got already replaced with $eol.
Okay, replaced '/ +\r?\n/' with "/ +$eol/" but I don't think it makes any difference.
Exceeds 80 chars.
Fixed.
$parents should use array type-hinting, too.
Changed. Drupal in general should use type-hinting. It doesn't.
Always use !isset() instead of is_null().
Changed. (PHP is such a screwball language! If we can't use the function, why does it exist?)
Stray leading and trailing space in if condition.
Re-wrote long if conditions using a temporary variable.
Missing break; or missing inline comment that explains the fall-through to 'dt'.
Good catch. Fixed.
The empty default case needs an inline comment for why it's needed (if it is needed).
Added an inline comment.
Bogus change: DefaultMailSystem is a class, not a function.
Fixed, but please file an issue against Coder Review, which claims the parentheses are needed, and against API, which fails to create a link without them.
Please remove the leading underscore and rename to stringToHtml().
Done.
Missing phpDoc.
Added, but is phpDoc really necessary for internal-use-only methods within a test class that don't show up on api.drupal.org anyway?
1) Rename to assertHtmlToText().
2) Rename $args to $allowed_tags.
3) $reason should be $message.
4) Use a separate $this->verbose() after assertEqual() to output the debugging info as a page instead of squeezing it into the assertion message.
5) Missing phpDoc.
All fixed. I didn't know about the verbose() function.
Might make sense to unconditionally run check_plain() on assertHtmlToText()'s $message, so we don't need those &lt;...&gt;.
Fix the broken formatting in drupal_html_to_text() and also provide tests<li>
» Fix the broken formatting in drupal_html_to_text() and also provide tests
Status:
Needs review
» Needs work
Issue tags:
+Needs committer feedback
The patch produces PHP syntax errors, so it's NW in any case. I think item 2 below (wrapping <pre> and <table> below 80 characters) also makes this NW.
Re #171:
Headings: This was/is a suggestion for discussion — I'd simply add three newlines above the level 1 heading, two for level 2, and 1 for level 3, independently of the context, in order to further emphasize the importance of the heading.
That would give us
/**
* Test that headers are properly separated from surrounding text.
*/
function testHeaderSeparation() {
// Text before and after.
$html = 'Drupal<h1>Drupal</h1>Drupal';
$text = "Drupal\n\n\n\n\n======== DRUPAL " . str_repeat('=', 61) . "\nDrupal";
$this->_testHtmlToText($html, $text);
// Paragraph before; text after.
$html = '<p>Drupal</p><h1>Drupal</h1>Drupal';
$text = "Drupal\n\n\n\n\n\n======== DRUPAL " . str_repeat('=', 61) . "\nDrupal";
$this->_testHtmlToText($html, $text);
// Text before; paragraph after.
$html = 'Drupal<h1>Drupal</h1><p>Drupal</p>';
$text = "Drupal\n\n\n\n\n======== DRUPAL " . str_repeat('=', 61) . "\n\nDrupal";
$this->_testHtmlToText($html, $text);
// Paragraph before and after.
$html = '<p>Drupal</p><h1>Drupal</h1><p>Drupal</p>';
$text = "Drupal\n\n\n\n\n\n======== DRUPAL " . str_repeat('=', 61) . "\n\nDrupal";
$this->_testHtmlToText($html, $text);
}
Wrapping <pre> lines and table rows: I'm sorry, I wasn't communicating well enough. Yes, I agree that lines MUST wrap shortly before 1000. However, right now you're wrapping them before 80, and I thought that was what you were defending. Can we agree that newlines SHOULD NOT be added or removed inside <pre> and <tr> elements as long as lines are not longer than 998 characters?
There are two newlines among the b's which shouldn't be there.
The one space ahead of the c's turns into two spaces. It's a nit, but if you're trying to convert something that depends on exact indenting, then it's maddening.
The thought of doing more to separate table cells briefly occurred to me, too, but just like vertical separators, horizontal separators should be symmetrical IMO, and thus it would take at least " | " (three characters), and it still wouldn't work so well if the cells weren't aligned. Tabs rarely help with alignment either.
However, not getting any clue whatsoever about where a cell ends and the next one starts is not so great. Hence my suggestion to put two spaces between cells, rather than just one. This would usually allow the reader to count the cells in a row and give him a fighting chance to analyze and reconstruct the table structure, if he really wanted to. With only one space, it's completely impossible unless the cell contents are just single words.
Thank you for adding the <u> tag! We're generally not seeing a lot of markup in nodes here on d.o, probably because most people don't particularly like to write HTML by hand. However, we have a list of "Allowed HTML tags" at the bottom of this textarea (<h1> <h2> <h3> <h4> <h5> <h6> <em> <strong> <code> <del> <blockquote> <q> <cite> <sup> <sub> <p> <br> <ul> <ol> <li> <dl> <dt> <dd> <a> <b> <u> <i> <table> <tr> <td> <th> <tbody> <thead> <tfoot> <colgroup> <caption>). This is a reasonable list for a technically oriented site, and it's a good assumption that any of these will be used at some time or other. If they can be rendered sensibly in plain text, and drupal_html_to_text() can support them with reasonable effort, then I'm sure that support will be appreciated in the right context.
I was going to argue that <strike> was on this list, and I think it was yesterday, and I think I've seen it used by people to show where they have made minor edits... It's still on g.d.o, for example here. It would be a pity to lose the <strike> information when such a node is transmitted through email. So, please, would you reconsider adding <strike>?
There's a smaller list in the default 'Filtered HTML' text format: <a> <em> <strong> <cite> <blockquote> <code> <ul> <ol> <li> <dl> <dt> <dd>. I think these should all be supported in some way for the sake of consistency. Only two are missing in your implementation: <cite> could be approximated reasonably well as <i>; <code> is more difficult, because it can be either inline or block. As a block, I think it should behave like <pre> — inline is tough though... Maybe `backticks` (Markdown), or { braces and spaces }?
I'm sorry for being a source of frustration for you. You've taken on a big task and done a great job. No one knows your code as well as you do — please don't give up! Take a break, let some feedback accumulate, and then deal with it calmly, rather than frantically shooting patches.
The one space in front of the c's now turns into no space, it should be preserved.
Two newlines are inserted between the b's inside the <pre> element, that shouldn't happen, as long as the resulting line is shorter than 998 characters.
The trailing spaces are gone, which is a clear improvement.
Here's a patch with a failing test. I think that making it pass will require a rewrite/replacement of (at least) drupal_wrap_mail() and friends, which would prevent the possibility of a backport.
EDIT: Had to change the test after learning some more about space-stuffing:
/**
* Ensure that content within <pre> tags is not changed.
*/
function testNoWrapWithinPre() {
// This awkward construct comes from includes/mail.inc lines 8-13.
$eol = variable_get('mail_line_endings', MAIL_LINE_ENDINGS);
$text = str_repeat('a', 80)
. ' ' . str_repeat('b', 80) // Single space.
. ' ' . str_repeat('c', 80) // Two spaces.
. "$eol" . str_repeat('d', 80) // Single linefeed.
. "$eol$eol" . str_repeat('e', 80) // Double linefeed.
. "$eol " . str_repeat('f', 80) // Linefeed and space.
. " $eol" . str_repeat('g', 80); // Space and linefeed.
// The total length of $text is under 1000 octets and should not wrap.
$html = "<pre>$text</pre>";
// Space-stuffing will add an extra space to any line starting with a space.
$text = str_replace("$eol ", "$eol ", $text);
$this->assertHtmlToText($html, $text, 'Text within <pre> is unchanged.');
}
After thinking about this some more I believe that RFC 3676 calls for soft wrapping
tags.
You can only soft-wrap content that starts at the left margin. You can't soft-wrap anything that is a descendant of a <blockquote>, <li>, <ol>, or <ul> tag.
Because of the way we are rendering them, you also can't soft-wrap anything that is a descendant of a <b>, <cite>, <em>, <i>, <strong>, or <u> tag.
The more I consider this, the more I think the approach in #197 is the way to go. There is no right way to do things here.
Actually, I think the most interesting solution is in #197.
You're not serious, are you? I'm pretty sure that spawning a child process is way too slow, and the text-mode browsers are designed to produce output for 25x80 terminals, which is not the same thing as a text email message.
Ah, this is a test, right? I'll readily admit that I'm not able to look at each of your patches, at the rate that you produce them...
I'm pretty sure that spawning a child process is way too slow,
All tests complete in less than one second on my six-year-old Lenovo T-60.
the text-mode browsers are designed to produce output for 25x80 terminals, which is not the same thing as a text email message.
All of the ones I referenced have command-line options to set the line width. None of them care about the screen height.
I'll readily admit that I'm not able to look at each of your patches, at the rate that you produce them...
You can tell which ones I'm serious about because I commit them to Mail System, which is the only thing that is going to benefit from this issue before D9 ships. And also, interdiff is handy when comparing large patchsets.
You can only soft-wrap content that starts at the left margin.
I think this is wrong. Space stuffing allows you to have leading spaces:
A generating agent MUST:
o Space-stuff lines which start with a space, "From ", or ">".
I believe this has been working in D6, but I'm not sure.
Because of the way we are rendering them, you also can't soft-wrap anything that is a descendant of a <b>, <cite>, <em>, <i>, <strong>, or <u> tag.
Why? Is this a limitation of your algorithm (can it be improved?) or do you see something in RFC 3676 that causes difficulties?
The final test, cut-and-pasted from #203, fails because it now requires behavior which you condemned in #191 and which I fixed in #196.
Yes, I got wiser between #191 and #203. I'm sorry for leading you down a path which I now think is a wrong one.
My hat's off to you for your drive, but maybe you're taking "talk is silver, code is gold" a bit too far. The basic problem in this issue is that we not only didn't have any tests, but we didn't even know how things are supposed work and why. In this situation it would be wise to discuss first and make sure we really understand what we're talking about (ideally getting consent from several people), before you start coding. This may seem slower, but ultimately it's probably faster, and certainly less frustrating.
Why would this stop working after having worked for D6?
You're absolutely right.
I was referring to drupal_wrap_mail(), which apparently hadn't been touched since D6. This is what caused me to question whether my test case was correct.
There's a tremendous amount of work, know-how, and tuning in the D6 version that we (you, mostly) have had to redo and rebuild. Once you've pulled this off, D7 will be in much better shape, of course, but it takes a lot of learning and hard work. It's not just coding...
I'm pretty sure that spawning a child process is way too slow,
All tests complete in less than one second on my six-year-old Lenovo T-60.
Guessing performance is very difficult. A more relevant test would be how many mails can be pumped through drupal_html_to_text() and sent out in one cron run. This may include spawning sendmail (or something similar), which could explain why that number is notoriously low. Spawning a text-mode browser may be a comparable effort, which might mean that you'd be cutting the number in half.
» Fix the broken formatting in drupal_html_to_text() and also provide tests
Category:
feature
» bug
Spawning a text-mode browser may be a comparable effort...
My point is that we're rapidly approaching the complexity of (e.g.) html2text. If the goal is to duplicate the entire functionality, we might as well spawn the compiled binary, which is bound to be more efficient than its interpreted-language equivalent.
#1130198 may have been all that was needed to get things working again, but without tests we can't really know.
Re-opened #1130198. Please add relevant tests to that issue.
Copied tests from this issue, then adjusted the "expected" values until they passed. Discovered that the current version of drupal_wrap_mail() will happily generate lines of over 2000 characters in length, so removed the RFC-3676 test replaced the testVeryLongLineWrap() assertion with a pass().
Yay! Got rid of all magic constants except for 160 (the non-breaking space character), 80 (the default line length for soft-wrapping) and 1000 (the default line length for hard-wrapping).
@pillarsdotnet: In light of #239, it would be helpful if you would run tests and debug test failures on your local box, before submitting patches to drupal.org. The common exception are patches that affect all of Drupal core, which take hours to test locally and only ~30 minutes for the testbot. But especially if it's only a single test-case that needs to be run, it's not really time-consuming to do that on your local box. Probably 90% of all follow-ups on this issue are pure patch/testing noise, which could have been avoided, in turn leading to more solid and focused follow-ups, reviews, and faster progress. Thanks!
Improved table generation accuracy and performance.
Added test for correctly formed deeply-nested tables.
Test also asserts that table generation takes less than one second.
Drupal 8.8.0-alpha1 will be released the week of October 14th, 2019, which means new developments and disruptive changes should now be targeted against the 8.9.x-dev branch. (Any changes to 8.9.x will also be committed to 9.0.x in preparation for Drupal 9’s release, but some changes like significant feature additions will be deferred to 9.1.x.). For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.
Well the original issue (12 years ago?? sheesh) was to add tests for htmlToText() since it's part of Drupal core.
Looking at core/tests/Drupal/Tests/Core/Mail/MailFormatHelperTest.php::MailFormatHelperTest() we still are not testing the htmlToText() function, so I think this issue still applies. Along the way this issue also wanted to improve the function. I haven't read through all the comments, but perhaps it would be easier to commit tests as-is, and then a new issue for improvements. Or close if there is no desire for the improvements anymore.
Comments
Comment #1
Kevin Hankens CreditAttribution: Kevin Hankens commentedOne quick question - sorry if this is in the wrong forum. If the $allowed_tags argument is empty, would you consider using the allowed tags list from the Filtered HTML input format settings? This would be more aligned with the bulk of the site content.
I use a lot of div tags in my content and wouldn't want to pass the entire list of allowed tags plus 'div' in an array each time I call drupal_html_to_text.
Comment #2
Kevin Hankens CreditAttribution: Kevin Hankens commentedOh, I forgot to add that for me a div tag would signify a separate section, so I also wanted to see a case for it in the switch stanza that worked like the p tag.
Thanks for letting me share :)
Comment #3
Kevin Hankens CreditAttribution: Kevin Hankens commentedI haven't submitted any simpletests before, but I started working on one yesterday for this function. If you are interested, I would be happy to finish it up and send it this way for review. If so, here are a couple of quick questions.
Thanks!
Kevin
Comment #4
catch1. That'd probably be fine. Anything else can always be added to later, but the idea here is provide input, pass it through the function, verify expected output, not much more than that.
2. You'd need a new group and a new mail.test - you could copy modules/simpletest/tests/form.test (or some other .inc that already as tests) to get started quickly.
Also feel free to post patches that are still in progress if you think there's something worth looking at, even if it's not quite ready.
Comment #5
Kevin Hankens CreditAttribution: Kevin Hankens commentedHere's a proposed mail.test file. It's a little rough, but the tests worked, and I was able to make them fail by manipulating the output of drupal_html_to_text(). Here are some considerations:
Otherwise, take a look and see if it's something you can use!
Comment #6
webchickMarking needs review.
Comment #7
Kevin Hankens CreditAttribution: Kevin Hankens commentedOn a related note, here are two things that I noticed in mail.inc that might warrant revision - hope it's ok to compile both into one patch!
<a>
tags to generate footnote URLs.<div>
tags which act like a<p>
tag, since divs are usually a separate piece of infoThanks!
[edit]Just found an error - will post revision soon[/edit]
Comment #8
Kevin Hankens CreditAttribution: Kevin Hankens commentedApologies, I forgot to include div in the list of allowed tags.
One other note, would it make sense to include a blank line before paragraphs, divs, etc.?
Comment #9
arjenk CreditAttribution: arjenk commentedWell, i started reviewing the test patch, but ended writing it. Sorry about that. Some notes:
Comment #10
drewish CreditAttribution: drewish commentedA few things:
Comment #11
arjenk CreditAttribution: arjenk commentedcleanup
Comment #12
drewish CreditAttribution: drewish commentedvisually it looks a lot better. haven't tried running them.
Comment #13
arjenk CreditAttribution: arjenk commentedThanks,
After searching Drupal for references to drupal_html_to_text i stumbled upon the initial patch for this function (#154218: Add HTML to text convertor for email (and wrap mails properly)) by Steven. Here he showed a sample conversion, which gave me the idea to add an more functional unit test to the patch. Pro: it tests the function in a more reallife (tm) use, Cons: i adds a lot of lines to the mail.test file. What do you think?
test results so far:
There is some strange behaviour with utf8 characters, and some trivial bugs in drupal_html_to():
<li>foo</li>
gives a PHP:notice;I will make a separate issue for these findings (if not too trivial), after this patch is committed ;-) ...
Please comment.
Comment #15
lilou CreditAttribution: lilou commentedSee: #335122: Test clean HEAD after every commit and http://pastebin.ca/1258476
Comment #16
catchThere are redundant empty lines in the phpdoc. Also I think this could use some extra inline comments about what's being tested and how.
Comment #17
arjenk CreditAttribution: arjenk commentedTried to improve comments and generated output.
Comment #18
catchComment #19
cwgordon7 CreditAttribution: cwgordon7 commentedIt probably doesn't matter, but since this is a test case and should try to be as near the actual case as possible, shouldn't valid HTML be used? For example, the line
<ul><li>Drupal<li>Drupal</ul>
should probably be<ul><li>Drupal</li><li>Drupal</li></ul>
, unless there is a reason for the invalid htm — perhaps you were trying to test the function's handling of such a case?Comment #21
jrglasgow CreditAttribution: jrglasgow commentedI added the tests suggested by cwgordon7 in #19 in additional to the tests already there.
Comment #22
jrglasgow CreditAttribution: jrglasgow commentedI fixed some alignment problem based on the recommendation of cwgordon7
Comment #23
cwgordon7 CreditAttribution: cwgordon7 commentedCode style is good, and all tests passed, so rtbc. :)
Comment #24
sunSpot the double spacing here.
Please remove the blank line.
Shouldn't it actually be
'> Drupal'
?Could we ammend the first line with a comment that states that we're explicitly testing malformed HTML?
Also, there is no test for an UL in an OL and vice-versa.
Concatentation mistake here.
Comment #25
arjenk CreditAttribution: arjenk commentedMore code cleanup, coder gives no errors anymore. Also i changed the test to DrupalUnitTestCase for speedup of the test (see #464714: Speed up the tests).
True,
'> Drupal'
makes more sense, lets fix this after the unittest gets in. I also added a test with a nested<ul> and <ol>
, and visa-versa.Comment #26
stella CreditAttribution: stella commentedMinor fixes required:
There are a number of trailing spaces in the patch. Also should
'Drupal<br>Drupal'
not really be'Drupal<br />Drupal'
inline with Drupal standards, or is that<br>
on purpose as part of the test?Also not sure about the assertion message text below, it could use some simplification:
Ditto with the assertion messages for the assertions in
testDrupalHtmlToTextArgs()
. I'd rather English sentences explaining the functionality that passed/failed, rather than a piece of code. And how does the 3rd assertion here differ in functionality from the 1st? They both appear to be testing that the array() overrides the default html tags allowed.Other than those few things, patch is looking good, great work!
Comment #27
sunGood catch! The answer is: both. We need to test both, since the handling of
<br>
and<br />
may break. Whereas the former should go into the "malformed" array section.The other remarks from stella I agree with.
This is a wonderful & great patch! Let's get it done! :)
Comment #28
sunAh well. And that was a nice test for HTML Filter...
Comment #29
arjenk CreditAttribution: arjenk commentedThanks for the reviews!
code.
<br> and <br />
tags.Comment #30
arjenk CreditAttribution: arjenk commentedComment #31
sunNow that I look at it again:
I think we should call this
DrupalHtmlToTextTestCase
.Likewise,
can be removed here.Duplicate space before
$expected_text
.s/Tests/Test/ - PHPDoc summaries never use the third form.
That's a wrong notation for inline comments. We only use multiline comments for PHPDoc, just use
//
here.Also, I believe the function name is
drupal_html_to_text()
, no?Can we simplify that to ?
One more instance of html_to_text() here.
Comment #32
arjenk CreditAttribution: arjenk commentedre-roll with sun's comments.
Comment #33
arjenk CreditAttribution: arjenk commentedSome small updates according to the latest test guidelines [http://drupal.org/node/325974]:
The getInfo is now a public static, and the descriptions use a more active tense now. Also changed mail to e-mail, and spotted a missed lowercase html. Last: added an mail.test entry to simpletest.info, otherwise it wont
show up in the testlist.
Comment #34
kscheirerwill review this when I get a chance
Comment #35
kscheirerAll the tests pass, this is a great addition!
just a few quibbles:
Comment #36
lilou CreditAttribution: lilou commentedNow,
t()
is removed ingetinfo()
: #500866: [META] remove t() from assert messageComment #37
arjenk CreditAttribution: arjenk commented# Need a newline at the end of modules/simpletest/tests/mail.test to follow coding standards
changed.
# You have two tests that report with the message "Tests drupal_html_to_text() basic working of the parameter $allowed_tags." - perhaps the second one could explain how it differs from the first?
how about: "Tests drupal_html_to_text() with tags which are not given by the parameter $allowed_tags.". Better suggestions are welcome!
# When reporting the function and parameters used, why truncate at 15 characters? Most of the time the test text will not be seen by anyone, but when one fails it will be useful to see the complete test performed.
ok, looks better indeed. Also removed the t() there as t('drupal_html_to_text()') makes no sense.
#Now, t() is removed in getinfo()
done.
Comment #39
sun.core CreditAttribution: sun.core commentedTests don't qualify as critical.
Comment #40
sreynen CreditAttribution: sreynen commentedThese tests seem to be failing due to a bug in the actual function. That is, I think the tests are working as intended, and catching actual errors. Specifically, line breaks aren't be added as expected in drupal_html_to_text(). It's not clear to me whether it's really necessary to have line breaks at the end of many of the examples, but there definitely needs to be a line break in place of an actual page break (e.g. "Drupal<br />Drupal", and there isn't currently.
So I think the function needs to be updated to pass the tests, but I'm not very familiar with the function. Maybe someone more familiar with the function can confirm this?
Comment #41
gowriabhaya CreditAttribution: gowriabhaya commentedCode sprint tag
Comment #42
MichaelCole CreditAttribution: MichaelCole commentedWorking this today. Will fix test and code, and resubmit patch.
Comment #43
MichaelCole CreditAttribution: MichaelCole commentedGiving up. Catch was unavailable on IRC.
Problem is how newlines are tested in the tests.
Comment #44
Eric_A CreditAttribution: Eric_A commented#508738: drupal_html_to_text negative padding error
Comment #45
Eric_A CreditAttribution: Eric_A commentedThe testbot is happy in #508738: drupal_html_to_text negative padding error. Please review! (This issue depends on that one.)
Comment #46
Eric_A CreditAttribution: Eric_A commented#37: mail.test-299138-37.patch queued for re-testing.
Comment #48
arjenk CreditAttribution: arjenk commentedPatch 1.26 of mail.inc (http://drupalcode.org/viewvc/drupal/drupal/includes/mail.inc?r1=1.25&r2=...) broke the test by doing:
The commit message was "- Patch #356074 by chx, Damien Tournoud: provide a sequences API." Which does not make much sense to me for this change... Perhaps the idea was to replace \n by MAIL_LINE_ENDINGS ?
Comment #49
sunJust remove "\n" from all the assertions. As you can see, the input text of those assertions does not contain a newline or BR, so the outlined change in #48 was a fix to not insert newlines, which do not actually exist; therefore making newlines match between input and output.
Comment #50
kscheirerMan, this took a while to track down. First of all, Dries mixed up his commit messages! Arjenk, the issue for that patch is really #407452: Remove use of drupal_html_to_text() from system mails. Which was basically sun's patch.
sun - what wasn't obvious (because we had no tests for this function!) was that many of the tags were relying on that extra
\n
to format themselves properly. Stuff like h1, h2, dl, br, p, ol, ul, li, dl, dt, dd, etc. I think your patch was trying to make this function more general - but it doesn't seem to fulfill its original purpose now. Also_drupal_html_to_text_pad()
is doing asubstr
, relying on the fact that there was an extra\n
to throw away.So right now
drupal_html_to_text('Drupal<br />Drupal')
returnsDrupalDrupal
, which is probably not what we want. Some of these can be fixed by changing$chunk = ''; // Ensure blank new-line.
to$chunk = "\n"; // Ensure blank new-line.
.But that doesn't take care of all the cases - like
'<ul><li>Drupal</li><li><ol><li>Drupal</li><li>Drupal</li></ol></li></ul>'
- what text should that result in? Reading RFC 3676 did not make it clear either.Arjenk - you wrote many of the test cases way back, based on #154218: Add HTML to text convertor for email (and wrap mails properly) - which included some examples that aren't available online anymore. It now looks like we have broken tests for a broken function. This may be a case where we write the test cases first, and then have the function match them - anyone up for it?
Comment #51
Tiburón CreditAttribution: Tiburón commentedAssigning to me.
(DrupalCon CPH 2010 Core Dev Summit)
Comment #52
Tiburón CreditAttribution: Tiburón commentedThe attached patch (based on #37) fixes the expected result of five tests of which four currently parses locally.
The tests for em, i, strong, b and br incorrectly expected a newline at the end of the result with the given input.
The test for br still fails as br is broken in drupal_html_to_text() (allowed but not handled).
In includes/mail.inc the functions drupal_html_to_text() and _drupal_html_to_text_pad() both needs to be fixed.
Comment #53
arjenk CreditAttribution: arjenk commentedWait, i'm puzzled here
First, kscheirer gave a excellent summary. And i agree with his
conclusion that the function is broken, but i'm not sure we can fix it
in the test.
The function is called drupal_html_to_text(), it is (was) a function
that was capable of formatting html to readable, formatted (hear the
newlines here?) text. It did need to add newlines in order to make
something out of the orginal html. To not allow the function to add
newlines, is to break the use of the function. So that is the status
of the current function: broken.
to give an example to illustrate:
(this is not the real text from core) gave the following text email:
but now:
The options are, to my opinion:
the function from core, and change it with strip_tags, or
likewise. Which effectively removes all tags in mails.
strong, b) and removes the others. Perhaps also change the name as
well.
there and try to fix it in another way. Perhaps make a test first if
we are dealing with a html text, and then call the function, and
else do not...
(EDIT: fixed the correct issue number)
Comment #54
sunLast option 3) in #53 is not possible. I tried to track and keep up with this issue, but didn't manage to do so. I'm not sure whether I understand #50. I like the tests added in #53, but didn't verify in detail whether they are valid. Overall, drupal_html_to_text() implements quite complex logic, so I'm pretty sure that it is possible to achieve any newline behavior we can think of. We should start by making sure that our test contains all expectations. Afterwards fix the logic.
Comment #56
kscheirerAgreed, we're back to square one.
Need to track down when this function came into Drupal and see what its original purpose was?
@sun: The short version of #50 is - your patch in #407452: Remove use of drupal_html_to_text() from system mails broke this function (or at least, it doesn't do what it did before that patch). There were no tests for this function in core to catch that. The old function may not have been correct, but the current version definitely is not.
Comment #57
sunI've slightly fixed the tests (which also means that expectations might be slightly changed).
The tests added here should definitely not try to be "compatible" with the current output of drupal_html_to_text(), but rather assert our expectations.
Our expectations are pretty clear: A proper conversion from HTML to plain text. If you happen to get e-mail notifications from drupal.org, then you may know most details of the expected format already.
Test results are clear, the current function is horribly broken. We should have another round to make sure that the expectations are right, then start to actually fix the bugs.
Comment #58
sunFeedback welcome.
Comment #60
arjenk CreditAttribution: arjenk commentedIt would also help to add a test for the issue which led to the troublesome patch. Perhaps it had something to do with wordwrap and adding a newline on long lines? So how about adding the test:
str_repeat('Drupal ', 15) => str_repeat('Drupal ', 11) . "\n" . str_repeat('Drupal ', 4)
Comment #61
arjenk CreditAttribution: arjenk commentedrfc3676 :
among others, we should also add some tests to test if this works properly.
Comment #62
sunok, but what about the existing expectations/assertions in the patch? I need a confirmation that they are correct, before I'll continue with this patch.
Comment #63
sun.core CreditAttribution: sun.core commentedEither we have serious problems here or we don't. Feedback on the added tests would be helpful to move on here (and fix the actual bugs).
Comment #64
kscheirerWell considering we don't even have enough information to write tests, I think we have a serious problem here. No one posting in this issue queue seems to know what the tests should look like. Someone has to know what this function is supposed to do, can we find the person that committed it?
Comment #65
sunWe already have tests in the latest patch, the current task is to confirm (or object to) the expectations in the tests.
Only with precisely tested expectations, we can fix drupal_html_to_text().
Comment #66
pillarsdotnet CreditAttribution: pillarsdotnet commentedPatch to mail.inc restoring functionality in converting lists.
Comment #67
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #68
pillarsdotnet CreditAttribution: pillarsdotnet commentedThat failed. Let's try just reversing the commit that caused the current brokenness.
Comment #69
pillarsdotnet CreditAttribution: pillarsdotnet commentedI wish the testbot would make up its mind as to whether it prefers "-p0" or "-p1" in its patches...
EDIT: Now I see -- earlier patch was hand-edited and I deleted a couple lines i shouldn't have.
Comment #70
pillarsdotnet CreditAttribution: pillarsdotnet commentedNote: all tests passed.
Comment #71
PanchoThis bug renders the Simplenews module unusable and blocks further progress towards a D7 release there.
Please commit what you have there. #69 fixes the most urgent problem with drupal_html_to_text(). Further tests are a good idea but shouldn't block fixing a major bug. Thanks.
Comment #72
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #73
Pancho@pillarsdotnet: Please reconsider your attitude. (You're not talking to a script kiddie, and even then you don't need to assume it was bullshit.)
Actually, I'm using Simplenews on D7 just as well to deliver our newsletter every two or three days.
So "unusable and blocking further progress" might in fact be a bit overstated. Correct would have been: "out of the box it is unusable, and because the bug is to be solved upstream, it paralyzes further development, plus: because admins don't get it work correctly, they refrain from further testing".
I thought to be specific enough, because it is obvious enough what patch #69 does. However to be even more detailed:
After applying #69 from this queue, drupal_html_to_text() stopped stripping off newlines from my simplemail newsletters (see #1099188: simplenews_html_to_text() strips off newlines).
Now, how come you're obviously not affected by the bug?
You might be using a different backend, such as Mime Mail, bypassing drupal_html_to_text().
Now the title of this major (!) issue here states "drupal_html_to_text() formatting is broken", so it doesn't mainly refer to adding tests. So what I'd like you to consider is fixing "drupal_html_to_text()" asap based on the tests we have now, even if the tests are not 100% complete. Thank you.
Comment #74
Simon Georges CreditAttribution: Simon Georges commentedClosing #1099188: simplenews_html_to_text() strips off newlines as a duplicate of this one, and subscribing.
If I may be of any help to help solve this issue, I'm clearly willing to help, to stop people raising this in Simplenews issue queue ;-).
Comment #75
pillarsdotnet CreditAttribution: pillarsdotnet commentedBumping, tagging, and re-rolling according to standard policy.
@sun: I still haven't mastered simpletest, but a minimum test requirement should be that:
<br />
or<hr />
tag must be separated by (at least) a newline in the plaintext version.Comment #76
sunIt would tremendously help if people would re-roll the proper patch. See #57.
Comment #77
pillarsdotnet CreditAttribution: pillarsdotnet commented@sun -- Obviously we want to do more, but at least the single-line change in #75 should be committed ASAP, because it fixes an obvious regression.
Added patch from #57, part of which had already been applied.
Also added my suggested test from #75 above.
Comment #78
pillarsdotnet CreditAttribution: pillarsdotnet commentedSorry; forgot the second argument to assertFalse().
Comment #79
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #81
pillarsdotnet CreditAttribution: pillarsdotnet commentedTrying again...
Comment #82
pillarsdotnet CreditAttribution: pillarsdotnet commentedAs long as I'm doing major surgery to drupal_html_to_text(), I might as well merge in #345931: Support value attribute in ordered lists.
Comment #84
pillarsdotnet CreditAttribution: pillarsdotnet commentedRemoved the "Full HTML Document" test because:
Comment #85
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #86
sunComment #87
pillarsdotnet CreditAttribution: pillarsdotnet commented@sun -- You fix it, then. This whole issue is a "feature request". The only part that was a bugfix was the patch in #75 which you so offhandedly rejected.
EDIT: Opened #1130198: Regression: line-breaks are mangled by \Drupal\Core\Utility\Mail::htmlToText()
Comment #88
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #89
pillarsdotnet CreditAttribution: pillarsdotnet commentedTo justify a possible RTBC by someone other than @sun:
Comment #90
pillarsdotnet CreditAttribution: pillarsdotnet commentedTests-only: failed (Good!)
Tests+fix: passed (Good!)
Comment #91
BeatnikDude CreditAttribution: BeatnikDude commentedsubscribe
Comment #92
pillarsdotnet CreditAttribution: pillarsdotnet commentedReplaced all
"\n"
withMAIL_LINE_ENDINGS
for consistency. May possibly fail tests now, as tests assume:(MAIL_LINE_ENDINGS === "\n")
This begs the question: Should we write separate tests for each possible value of
MAIL_LINE_ENDINGS
?Comment #94
pillarsdotnet CreditAttribution: pillarsdotnet commentedFixed typo.
Comment #95
Simon Georges CreditAttribution: Simon Georges commentedIs there still something missing here ?
I'm really looking forward to having it in the 7.1, since we can't release Simplenews without it.
Comment #96
pillarsdotnet CreditAttribution: pillarsdotnet commentedYou can help by posting a review of the patch in #94. An acceptable review would include the following:
Confirm that all code and comments comply with Drupal coding standards.
Confirm that the tests included with the patch are necessary and sufficient.
Confirm that the patch actually solves the problem. Describe a reproducible test where the current code fails and the patched code succeeds.
Change the issue status from "Needs Review" to "Reviewed and Tested By the Community".
If, after thorough testing, you cannot in good conscience affirm all of the above, please explain what is lacking in the current patch and I will do my best to address your concerns.
Comment #97
Lars Toomre CreditAttribution: Lars Toomre commented@pillarsdotnet:
I spent some time reading through the patch file tonight and the new tests. As a whole, the new patch and tests look good. However, I would suggest that you expand the tests to include some additional cases:
1) Multiple good tags override
$text = '<ul><li>This is a <em>complex</em> text string.</li></ul>';
$result = drupal_html_to_text($text, array('em', 'ul'));
2) Purposefully mal-formed tags in string test (from user input)
$text = '<ul><li>This is a <em>complex</em> text string.</ul></ul>';
$result = drupal_html_to_text($text, array('li','em'));
should be => ' * This is a *complex* text string.'
3) Ignorant user example of highlighting
$text = "This is <B>highlighted</B> text.';
$result = drupal_html_to_text($text, array('B'));
should be => 'This is *highlighted* text.'
Cases 2) and 3) I have experienced in the last few months from users hand-entered node bodies and/or comments. These are examples of the edge cases that I think this particular function needs to test for. Users are notorious for what they might enter without trying to be malicious.
I am unsure whether this function is intended to be used with utf-8 strings. If it is, I would want to add some text strings that include characters outside of the usual ASCII range like the European special vowels and non-Roman language sets.
Finally, I can easily see this function used as part of a larger webpage data extraction function. That typically involves using something like cURL to retrieve a webpage as $text and then start stripping out the tags to get to whatever information is desired. Hence, I think that we also want a test that takes a mocked up web page and demonstrate that this function handles that "complex case" as well.
Edit: Two further thoughts/questions:
1) What happens if 'html' is passed in as the second parameter?
2) Looking at #732992: Allow filter_xss_admin() to accept HTML5 tags, I am wondering if the permissible html tags acceptable as the second variable should be constrained to the acceptable terms used in filter_xss()? If so, how are we going to deal with what is acceptable for HTML5 (being dealt with in that issue)?
Comment #98
pillarsdotnet CreditAttribution: pillarsdotnet commented@Lars
I don't understand the significance of testing
'This is <B>highlighted</B> text.'
.<b>
tag should be supported?Comment #99
pillarsdotnet CreditAttribution: pillarsdotnet commentedOkay, I've added all the tests that Lars asked for, except for the utf-8 ones. I really don't know what to do about those and would welcome any suggestions.
@further thoughts/questions:
Nothing. I added a test to prove it. Unsupported tags are stripped and ignored.
Same as the previous. Passing unsupported tags into the second parameter will not affect the output, unless someone rewrites the drupal_html_to_text function.
If I get *real* ambitious, I might rewrite the entire function to make it easier to add new tags in the future, but right now I'm just looking for short-term results.
Comment #100
Lars Toomre CreditAttribution: Lars Toomre commentedSorry if my point was not clear enough... I was trying to highlight the <B> case as in capitalized. I know that well formatted XHTML wants all tags in all lower-case, but users have the funny tendency to want to use both <B> and <b>. I thought a test to prove that we could accept either was appropriate.
OK... I have now read through all of the function changes and they make sense to me. I have not installed this patch though so I cannot say that it installs or runs as expected.
The added comments in the code look good, with full sentences ending in periods. I did not open up the patch in a text editor so I am unable to confirm that the comments do not extend beyond line 80. I also was unable to check for extra white spaces at the end of lines. There is one phrase 'see #345931: Support value attribute in ordered lists.' which I am unsure about in a comment. I am pretty sure we are not suppose to reference d.o. issues in comments, except in special cases. Is this the format that a special case should be in?
From reading the code and the tests, this patch appears to solve the problem and passes appropriate tests. I am not prepared to RTBC this however since I cannot confirm that the patch actually applies. Other steps to complete the review of this patch include:
1) A couple of people to also review this patch and make sure that it applies appropriately.
2) Add a couple of utf-8 strings with HTML tags embedded therein to make sure that the regex expressions do not degrade for instance an o with omlats to an ascii o.
Perhaps someone who does not speak English as a primary language could make some suggestions for item #2??
Comment #101
pillarsdotnet CreditAttribution: pillarsdotnet commentedThe comments don't. Some of the code does, a little.
See the very last test. It compares
drupal_strtoupper(drupal_html_to_text($input))
withdrupal_html_to_text(drupal_strtoupper($input))
where$input
is a fairly long mix of tags and text.You are correct. I have removed the references.
Now running coder review... done. Fixed problems identified by coder.
Comment #102
pillarsdotnet CreditAttribution: pillarsdotnet commentedResponding to #97 and also to sun's comment in #86:
First of all, this function is not supposed to do as good a job as a dedicated text-mode browser such as Lynx, links, or w3m. If you want a general data extractor, write a module. This is for sending plain-text email messages. That's why it's located in
includes/mail.inc
instead ofmodules/filter
.Secondly, producing such a test as you describe basically means taking a big chunk of html, running it through the function, and copying the output to your
$expected
variable. What does that prove? Only that any future improvements to the function must also be accompanied by a change this huge test. And very likely, the future coders will likewise run the html through the function and retroactively "expect" whatever comes out the other end. I think this is a poor testing methodology. That's what I meant when I said in #84 above, "I think it's going about things the wrong way."Thirdly, I *have* run web-pages through the original function (as it existed in d6 or in #1130198) and I can testify that the resultant output is acceptably readable. I have about twenty people on my Simplenews mailing list who must receive text-only messages, and they're doing just fine.
Comment #103
Lars Toomre CreditAttribution: Lars Toomre commentedOK... Glad to know that you had run a larger more complex text string (webpage) through this function. Skip the thought since you already demonstrated that 'html' tag cannot be 'slipped in' as an exception tag. My only other concern was what would happen if it encountered 'mal formed' HTMl code.
Glad that you ran the code through coder and corrected anything that came from that. Now if we can get the two items I raised previously in #100, we should be good to go!!
P.S. - I have code in contrib module that mimics this function and another that just eliminates one HTML tag. I am curious to test out this function as a possible replacement/enhancement there.
Comment #104
pillarsdotnet CreditAttribution: pillarsdotnet commentedI already added the "malformed HTML" tests that you suggested. If you think those tests are insufficient, please suggest additional ones.
From the patch:
Comment #105
pillarsdotnet CreditAttribution: pillarsdotnet commentedFound a bug and fixed it. Dunno how to test for it, but calling
drupal_html_to_text('<li>Drupal</li>')
would produce a PHP warning.Comment #106
pillarsdotnet CreditAttribution: pillarsdotnet commentedFigured it out.
Comment #108
pillarsdotnet CreditAttribution: pillarsdotnet commentedMust have brushed against the middle mouse button while editing.
Comment #109
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #110
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #111
kscheirerdon't give up! You're trying to get drupal_html_to_text-tests+fix-299138-107.patch in right? I'll take another look as soon as I can, and thanks for picking this up.
Comment #112
pillarsdotnet CreditAttribution: pillarsdotnet commentedIncluded a patched copy of
drupal_html_to_text()
in the Mail System 6.x-2.3 release to resolve the bug reported here:#1135978: Mail System format() method uses drupal_html_to_text() which strips linefeeds from input.
It appears that this bugfix is also applicable to D6. Tagging accordingly.
Also removing the "Needs tests" tag because I honestly believe the tests in this patch are sufficient. If anyone disagrees, please do so constructively. That is, don't just complain -- provide a suggestion for the test you think is missing.
Comment #113
pillarsdotnet CreditAttribution: pillarsdotnet commentedMore constructive issue title.
Comment #114
Heine CreditAttribution: Heine commentedD6 behaviour is by design and correct. Only pass real HTML to the function. Additional whitespace will be collapsed, see #298708-12: drupal_html_to_text() removes line endings.
Comment #115
pillarsdotnet CreditAttribution: pillarsdotnet commented@Heine -- Are you saying that the
DefaultMailSystem::format()
method is broken because it blindly appliesdrupal_html_to_text()
to all input?If so, have you submitted a bug report?
Or perhaps you are claiming that the d6 and d7 versions of
drupal_html_to_text()
are somehow completely different despite sharing identical doxygen headers?If so, have you submitted a bug report?
Comment #116
Heine CreditAttribution: Heine commentedI was under the impression that this is the bug report dealing with drupal_html_to_text(). The following is what I understand is wrong about the current D7 drupal_html_to_text implementation.
Suppose you pass the following HTML through drupal_html_to_text.
Expected output ( aka the 'acceptable plaintext representation') should keep the paragraph divisons of the text with:
Output in D6 is according to expectation, in D7 we get:
Comment #117
pillarsdotnet CreditAttribution: pillarsdotnet commentedThe only functions that call
drupal_html_to-text()
are:system_mail()
(d6)hook_mail()
(d6 and d7)DefaultMailSystem::format()
(d7)None of the above functions check whether the argument to
drupal_html_to_text()
is HTML or plaintext.Heine asserts that
drupal_html_to_text()
should collapse whitespace, by design. Yet this would break the functionality of the only cases where the function is used.So I ask Heine, for the record, which of the following is true?
Your assertion was in error, and we should expect
drupal_html_to_text()
to preserve plaintext formatting.All the documented use cases are in error, and
drupal_html_to_text()
should be deleted from drupal core.Comment #118
Damien Tournoud CreditAttribution: Damien Tournoud commentedShouldn't that be MAIL_LINE_ENDINGS too?
Other then that, this looks RTBC to me. Also, this is purely a bug report. drupal_html_to_text() was accidentally broken on D7, and this patch fixes it.
@Heine: I think that pillarsdotnet was talking about backporting some additional bug fixes for drupal_html_to_text() that are also in that patch. The behavior around line breaks between the tags should obviously be the same on D6 and D7.
@pillarsdotnet: Heine is not saying that D7 behavior is correct.
Comment #119
pillarsdotnet CreditAttribution: pillarsdotnet commented@Damien -- Corrected.
Also corrected doxygen to reflect the current list of supported tags.
Added a test confirming that internal (not leading or trailing) whitespace is preserved; this revealed a bug which I subsequently fixed.
Double-checked that everything passes coder review.
Hopefully this is ready to commit.
Comment #120
pillarsdotnet CreditAttribution: pillarsdotnet commentedCouldn't resist one more tweak -- it bothered me that some of the "expected" results had trailing whitespace. Fixed.
Comment #121
pillarsdotnet CreditAttribution: pillarsdotnet commentedEven better...
Comment #122
Heine CreditAttribution: Heine commentedWhy should drupal_html_to_text operate on non-HTML?
Now it incorrectly converts HTML suchs as
Which renders as
The acceptable plaintext representation should match this rendition and display
Not what this patch displays.
Shoehorning two incompatible formats into a conversion function indicates that the callers are in the wrong.
Comment #123
Heine CreditAttribution: Heine commentedWell, that wasn't exactly fair; the incorrect whitespace treatment is not related to the patch.
I'll open a new issue.
Comment #124
pillarsdotnet CreditAttribution: pillarsdotnet commented@Heine
See the comment starting at line 497:
I find it interesting that the relevant commit is the same one which provoked this issue in the first place. Obviously there was more intended than the "provide a sequence API" comment would suggest.
I restate my conclusion in #117: either whitespace should be preserved, as required by the above comment and documented use-cases, or the function should be removed entirely. As I also said in #102, this function is not intended as a general-purpose text-mode browser. It is a convenience function to ensure that text emails remain readable, regardless of the input format.
Perhaps the simplest way to satisfy you would be to rename this function something else, like "drupal_sanitize_plaintext", or "drupal_html_or_text_to_plaintext".
Comment #125
Heine CreditAttribution: Heine commentedNot really, sending mail is fundamentally broken; the API is misdesigned.
Plaintext should be passed through nl2br or perhaps the linebreak filter before being passed to drupal_html_to_text.
Comment #126
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #127
Heine CreditAttribution: Heine commentedIf you can't bear reading reviews / musings aimed at the betterment of Drupal, stop posting patches, pillarsdotnet.
/me out
Comment #128
Damien Tournoud CreditAttribution: Damien Tournoud commentedNope, not at all. Mails are handled 100% in HTML in Drupal 7, and are only converted into plaintext (via drupal_html_to_text()) at the very end of the chain (when sending out the mail), because core by default only support sending email serialized as plain-text. You can see a summary of the design on #407452-35: Remove use of drupal_html_to_text() from system mails.
Converting plain-text to HTML before calling
drupal_mail()
is the responsibility of the caller.You are completely right in saying that drupal_html_to_text() should collapse whitespace. pillarsdotnet is wrong in expecting that plain-text input should be handled.
Needs work again: we need to fix drupal_html_to_text() so that it properly collapse whitespace. The initial patch from #407452: Remove use of drupal_html_to_text() from system mails was clearly misguided in that particular part.
Comment #129
Damien Tournoud CreditAttribution: Damien Tournoud commentedAlso see http://drupal.org/update/modules/6/7#email-html for cristal clear statements that every mail is HTML and that we don't handle plain-text.
Comment #130
pillarsdotnet CreditAttribution: pillarsdotnet commentedI give up; you win. Re-rolled as requested. Also removed the comment.
If and when this ever trickles down to D6 it will either need to preserve whitespace or be accompanied by a patch to every place in core that calls
drupal_mail()
.(looking...)
Hmm... Not such a big list as I feared... Will probably tackle this myself when the time comes.
Comment #131
pillarsdotnet CreditAttribution: pillarsdotnet commentedDoes anybody care about this? Fixing it would be ... complicated, and I don't have a use-case, so I'm tempted to leave it broken-as-designed:
The old code doesn't support the
<pre>
tag at all. I only added it as a synonym for<p>
because it's listed among the block-level tags and I wanted a test that every block-level tag forces a line-break.Comment #132
pillarsdotnet CreditAttribution: pillarsdotnet commentedFound another bug:
Trying to fix it, but I'm having problems.
Comment #133
pillarsdotnet CreditAttribution: pillarsdotnet commentedFinally got it fixed.
<pre>
tags are now fully supported and tested, as are multiple consecutive<br />
tags.The only gotcha is that spaces within
<pre>
tags turn into unicode non-breaking spaces.There's probably a cleaner way to do this, but for now it passes all tests so I'm gonna go to bed and await further review.
Comment #134
pillarsdotnet CreditAttribution: pillarsdotnet commentedFixed the gotcha and simplified the code.
Comment #135
pillarsdotnet CreditAttribution: pillarsdotnet commentedNested
<pre>
tags are now tested and properly handled.Comment #136
pillarsdotnet CreditAttribution: pillarsdotnet commented#135: drupal_html_to_text-tests+fix-299138-135.patch queued for re-testing.
Comment #137
pillarsdotnet CreditAttribution: pillarsdotnet commented#135: drupal_html_to_text-tests+fix-299138-135.patch queued for re-testing.
Comment #139
pillarsdotnet CreditAttribution: pillarsdotnet commented#135: drupal_html_to_text-tests+fix-299138-135.patch queued for re-testing.
Comment #140
stephandale CreditAttribution: stephandale commentedI've made some modifications so that consecutive block level elements are separated by a blank line (i.e. two new line characters rather than one) and nested lists are handled better. As part of these changes block level elements output a new line before and afterwards.
The most significant changes are a) the tags now output chunks as they need this control when nested, b) the indent is applied by the following value rather than by the current tag and c) $item is used as the current element of $split so $value can store values only (as they may be needed by closing tags when determining their chunk).
More tests have been added, to test these modifications and test short tags that don't contain a space (which is the more correct use - the space was only recommended as a guideline to support old user agents - see http://www.w3.org/TR/xhtml1/#guidelines)
I also replaced the 3 instances of \r\n in drupal_html_to_text() with MAIL_LINE_ENDINGS as I assumed that they were forgotten earlier. Hopefully there wasn't a reason they hadn't been changed!
I wasn't sure what to name the patch so I kept the same format as the last one and incremented the last integer in its name. Hope this is ok.
P.S. I couldn't find the Coder module for 8.x. Is there any way to automatically test coding guidelines on 8?
Comment #141
pillarsdotnet CreditAttribution: pillarsdotnet commentedNew approach using a DOM tree instead of a custom parser. Looks cleaner to me and passes all tests locally.
Comment #143
pillarsdotnet CreditAttribution: pillarsdotnet commentedOkay; let's try that again...
Comment #144
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #145
pillarsdotnet CreditAttribution: pillarsdotnet commentedCoder griped about using
substr
instead ofdrupal_substr
and found an indentation error.Corrected patch.
Comment #146
pillarsdotnet CreditAttribution: pillarsdotnet commented@stephandale --
Comment #147
pillarsdotnet CreditAttribution: pillarsdotnet commentedFound and fixed a subtle error in character counting.
Comment #148
pillarsdotnet CreditAttribution: pillarsdotnet commentedAnother coder gripe.
Note that the net difference in size between #140 and this patch is to add 31 comment lines and to take away 15 code lines.
Comment #149
pillarsdotnet CreditAttribution: pillarsdotnet commentedCode in #148 has been added to Mail System module:
And is being considered for incorporation into the Simplenews module:
Comment #150
salvisPaste this
into Devel's Execute PHP textarea (devel/php) and you'll get
I think line 2 should not be indented.
Comment #151
salvisHere's another test case — I've put the following into a node body:
where the X is a space, not followed by any newline.
This shows up just fine as
and it results in the markup in the following snippet:
Running the snippet in Devel's Execute PHP function gives:
Note the indenting again, and especially the missing 0!
Comment #152
pillarsdotnet CreditAttribution: pillarsdotnet commentedFixed, and tests added; thanks.
Comment #153
pillarsdotnet CreditAttribution: pillarsdotnet commentedTurns out I didn't need that block-tag function after all. No functional change from previous patch; just deletion of cruft.
Comment #154
pillarsdotnet CreditAttribution: pillarsdotnet commentedSlight refactoring; no functional changes.
Comment #155
pillarsdotnet CreditAttribution: pillarsdotnet commentedFixed a couple of gripes identified by coder. Again, no functional changes.
Comment #156
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #157
pillarsdotnet CreditAttribution: pillarsdotnet commentedMinor nitpicks; no code changes.
Proposed patch for:
Comment #158
pillarsdotnet CreditAttribution: pillarsdotnet commentedNote that the code in this issue has been incorporated into Mail System since 6.x-2.6/7.x-2.6/8.x-2.6. According to the usage statistics page, that's 1,348 users. Does this qualify as "Reviewed and Tested by the Community" ?
Comment #159
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #160
webchickNo. You never mark your own patch RTBC. All patches require peer review.
Comment #161
pillarsdotnet CreditAttribution: pillarsdotnet commentedWell, it's had plenty of review along the lines of "Looks good but..." and all the noted shortcomings have been fixed and tests added to ensure that they stay fixed.
Marking "Needs committer feedback" in hopes that someone will be willing to stick their neck out.
Comment #162
salvishere's a <a href='/node/123'>node</a> link
here's a node [1] link
[1] /node/123
The link must be fully qualified, like http://example.com/node/123
Also, there should be an empty line between the text and the link list.
Comment #163
salvisI'm going through all the tags. You've done an amazing job with this. Here are a few more comments:
<p>a</p><p><hr /></p><p>b</p>
currently produces two empty lines above and one below the separator line; there should be just one empty line above and below.
What is your strategy regarding soft line breaks (http://www.ietf.org/rfc/rfc2646.txt), space stuffing, etc.?
The difficulty in this issue is that it's so complex, and once a patch is committed, it'll be almost impossible to get new issues fixed that might pop up in the future. Putting this into production is certainly a good way to build confidence, but the feedback generated that way is very slow.
I'll incorporate your code into Subscriptions to generate some more feedback (hopefully), but the switch to HTML will take some major surgery before I can actually do this (right now I'm using the D6 mail.inc).
Comment #164
pillarsdotnet CreditAttribution: pillarsdotnet commentedRe #163:
I don't like them, and they aren't mine. I suggest emulating Markdown.
Two out of three text-mode browsers disagree with you. Both w3m and links render two empty lines above and below; only the older and arguably less-sophisticated lynx renders one empty line above and below. None of them, by the way, render headers as you suggest.
RFC 3676 disagrees.
The w3 org says:
The HTML-5 draft says you have to check all the child tags to decide whether to render them as phrasing content or flow content.
The Web Design Group says that they are inline if e.g., within another inline element or a P.
On reflection, I'm dropping support for both, as neither one has any accurate plain-text representation.
And they were missing before, too. This is a bug report, not a feature request. See my comment in #102 above.
Since you ask so nicely, and I happen to agree, done. (2 files changed, 8 insertions, 8 deletions)
Wrapping is mandatory, according to RFC 3676, and will often be performed by the email-reading software even if we don't.
I considered coding full support for tables, but I thought about the
colspan
androwspan
attributes, and decided that it wasn't worth the trouble.Comment #165
stephandale CreditAttribution: stephandale commentedEdit 01/06/2011 16:03: Patch in comment#167 does the trick.
Edit 01/06/2011 15:49: Bad timing - pillarsdotnet posted at same time. I'll make the mods to his latest patch and repost.
I've found that the use of the PCRE_MULTILINE modifier m in preg_replace() resulted in different behaviour on different servers, possibly due to the difference in the PCRE version though I can't be sure at this stage.
Server 1 (tests pass):
Server 2 (tests fail):
Note how on server 2 (which has an older PCRE library), the replacement is also made on the trailing new line.
I've added a three line workaround to the drupal_wrap_mail() function, so tests are now green on both servers. The change was made to the last three lines before the return statement and simply removes the $clean_indent if it's present at the end of the text.
I updated to the latest Drupal 8.x, applied the patch from comment#157 (http://drupal.org/node/299138#comment-4535020) and made my modifications. The attached patch should include those changes and my modifications. Note that it differs to the patch in comment#157 by more than the three lines I edited, which I can only assume is due to other changes in core since comment#157.
Comment #166
pillarsdotnet CreditAttribution: pillarsdotnet commentedRe #162:
I agree; done.
Re #165:
Try this slightly simpler workaround and see if it passes on your server 2.
Comment #167
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #168
stephandale CreditAttribution: stephandale commentedRe #166: It does, yes. Thanks.
Note: I tried patch from #167.
Comment #169
salvisRe #164:
<hr> is a separator. It separates two equal parts. There's no way to tell whether it should belong to the part above or below. Putting it in the middle is the right thing to do. Apparently, all three text-mode browsers agree with me that the space above and below should be equal. That's my point. I'd say one empty line is enough, but if you prefer two, that's perfectly fine with me, as long as the line is in the middle.
Let's take a step back. Headings are not 'footings' and not separators. They announce what comes after them, and as such they semantically belong to what follows. They have very little connection to what's above them. Thus the distance above should be larger than the distance below.
I don't know what the design goals and priorities of the text-mode browsers have been — probably to pack as much information on a 25-line CRT as possible, to minimize the required navigation (= transmission over a slow connection). That's completely irrelevant in a text file (or email message).
The goal here, IMHO, should be to render HTML as text in a way that helps the human reader to understand its structure and content. You're forced to drop a lot of information to convert to plain text, and any hint that you can reasonably provide will be appreciated by the human reader.
Huh? Our interpretations of RFC3676 seem to differ. Let me cite:
RFC3676 goes on to say that there can be problems with this type of text, and that in general text should be flowed. BUT (my interpretation now) if the author explicitly says that he formatted the text exactly as he wants it to appear, then the renderer should comply, if technically possible.
The <pre> tag explicitly asks for the fixed format (preformatted text). Preformatted means that indents should be preserved, spaces should be preserved, line breaks should be preserved, etc. — essentially as much formatting should be preserved as possible. If you don't have a horizontal scrollbar (e.g. on a cell phone), then you obviously have to wrap the lines to avoid clipping them, but if you can avoid it, then you should preserve formatting.
Some email clients may still munge it, but others will do the right thing. If you munge it even before it leaves the site, then you ensure that it'll be equally incomprehensible for both camps. Is that the goal?
See also HTML5. It doesn't say this explicitly, but the intention of <pre> clearly is that line breaks should neither be added nor removed inside <pre>. That's what <pre> is all about.
Yes, and it's called drupal_html_to_text(). Thus, if I want to send a node body through email, then this function is the obvious choice. I thought the idea of this exercise was to make the D7 version better than the D6 one, was it not?
_Underlined_ is firmly established. I don't know whether it was supported in the D6 version or not, and it doesn't really matter. Call its absence a pre-existing bug if you want, but that doesn't justify carrying its absence over into your new implementation.
IMO, the goal should be to try to preserve as much information as possible. If a node author decides that underline or strikeout are needed to convey his meaning, then it would be pity to lose those markups when transmitting the node body through text-mode email, given that there is a reasonable way to express them.
Here's where we differ. Where in RFC 3676 does it say "MUST wrap"? In my reading it says "SHOULD wrap", which means there can be overriding considerations. The author's clearly expressed intention via <pre> or <table> are such an overriding consideration, IMO.
RFC 3676 even says explicitly that Hand-aligned text, such as ASCII tables or art, source code, etc., SHOULD be sent as fixed, not flowed lines.
Comment #170
salvisThe patch in #167 now produces
here's a <a href='/node/123'>node</a> link
here's a node [1] link
[1] http://example.com//node/123
Note the extraneous slash. Apparently, there's no test for this yet.
It seems like you have to remove the leading slash before you call url().
Comment #171
pillarsdotnet CreditAttribution: pillarsdotnet commentedPoint by point:
Please revise the following test code to meet your expectations:
From RFC 3676:
From RFC 2046:
From RFC 821:
So line wrapping MUST occur before 1000 characters (998 plus
CRLF
). Outside of<pre>
tags, it SHOULD occur at a convenient width for text-mode readers, which historically has been 80 characters (78 plusCRLF
). If there is a problem, it lies in the fact that we use the same code to meet both constraints.The idea of this exercise is to "fix the broken formatting in drupal_html_to_text() and also provide tests". Please understand that "broken" means something entirely different from "sub-optimal". I now regret that I overzealously added support for tags that were not originally supported, as I may have unnecessarily delayed adoption of the more important fix. Also note the "Needs backport to D6" tag.
Fine; added support for the
<u>
tag, even though its actual use is quite rare. Stylesheets would be the far more common use case, but supporting them is way out of scope.Thanks. Fixed, and test added.
Comment #172
pillarsdotnet CreditAttribution: pillarsdotnet commentedI wasted four hours on this one. Try running the following program:
You will discover that
<p><hr /></p>
gets corrected to<p/><hr/>
because you can't nest<hr />
inside a<p>
container.Comment #173
sreynen CreditAttribution: sreynen commentedLooking over this thread, the scope has expanded far beyond the initial bug report, which was addressed a long time ago.
The latest patch solves the original bug (lack of tests) without introducing any new bugs. If there are additional feature requests, or even other bugs that aren't yet fixed by this patch, I suggest we take them to separate issues and close this thread before it reaches 3 years old.
Comment #174
pillarsdotnet CreditAttribution: pillarsdotnet commentedRe-ran #171 through coder review and found one
strlen()
that could be replaced bydrupal_strlen()
.Changing to avoid rejection by pedants, even though it makes absolutely no functional difference. Leaving status at RTBC.
Comment #175
Lars Toomre CreditAttribution: Lars Toomre commentedI agree with @sreynen #173 that this patch fixes the original bug report and goes far beyond. Hence, I strongly urge that this patch be applied with the clear understanding that 'better' may not be 'perfect'. However, 'better' is also *much* better than the current state of core Drupal.
This patch adds the initial requested tests and passes tests also included for all of the extra code that was added to better handle the complex challenge of converting HTML text to its plain text equivalent. Edge cases, additional bugs and/or feature requests can be started in separate issues. It is time for this RTBC patch to be applied. It makes HTML-text to plain text conversion so much better!!
Comment #176
bdragon CreditAttribution: bdragon commentedSkimmed through the patch and noticed a couple of minor nits in the doxygen docs:
* From somewhere whereabouts line 300ish in includes/mail.inc:
Typo, need a ' instead of ) after mail_line_endings.
* From the comment block above function _drupal_html_to_text():
Should have () after drupal_html_to_text there.
Comment #177
pillarsdotnet CreditAttribution: pillarsdotnet commentedUpdated.
Comment #178
sunNeeds work for @bdragon's review.
Please also make sure that all the RFC research is documented/summarized in inline comments (including RFC numbers). It would be silly to go through all of this research once again in the future.
Furthermore:
Should be MAIL_NBSP. Although "NBSP" is not really descriptive. I don't understand why this is a constant, and why we're decoding
to get the character instead of directly specifying the character. This needs to be documented in the (missing) phpDoc.I don't understand why we're adding a variable for this. Especially since the actual function code talks about platform-specific line-endings -- so if something is platform-specific, why is it a variable? A short explanation for why it's a variable would be helpful.
The first preg_replace() replaces \r\n with platform-specific or variable-defined line endings. The second preg_replace() still tries to match original line endings even though they got already replaced with $eol.
Exceeds 80 chars.
$parents should use array type-hinting, too.
Always use !isset() instead of is_null().
Stray leading and trailing space in if condition.
Missing
break;
or missing inline comment that explains the fall-through to 'dt'.1) Stray leading and trailing space in if condition.
2) Extra parenthesis around $value can be removed.
3) Shouldn't wrap.
The empty default case needs an inline comment for why it's needed (if it is needed).
Bogus change: DefaultMailSystem is a class, not a function.
1) Please remove the leading underscore and rename to stringToHtml().
2) Missing phpDoc.
1) Rename to assertHtmlToText().
2) Rename $args to $allowed_tags.
3) $reason should be $message.
4) Use a separate $this->verbose() after assertEqual() to output the debugging info as a page instead of squeezing it into the assertion message.
5) Missing phpDoc.
Might make sense to unconditionally run check_plain() on assertHtmlToText()'s $message, so we don't need those
<...>
.Let's make these a bit cleaner:
Powered by Dreditor.
Comment #179
webchickGuys, if this gets marked RTBC one more time before it's ready, it's going to the very bottom of my stack.
Comment #180
webchickComment #181
pillarsdotnet CreditAttribution: pillarsdotnet commentedAlready fixed.
If the objective is absolute RFC-compliant pedantry, I'm unqualified for the job. Someone else will have to do it.
Changed globally to
chr(160)
, which is what it evaluates to in my copy of PHP. Added inline comments to explain the magic number.The explanation is at the top of the file, lines 8-13 inclusive:
I'm using a variable
$eol
instead of repeating the codevariable_get('mail_line_endings', MAIL_LINE_ENDINGS)
but if you prefer the repetition, I will comply.Okay, replaced
'/ +\r?\n/'
with"/ +$eol/"
but I don't think it makes any difference.Fixed.
Changed. Drupal in general should use type-hinting. It doesn't.
Changed. (PHP is such a screwball language! If we can't use the function, why does it exist?)
Re-wrote long
if
conditions using a temporary variable.Good catch. Fixed.
Added an inline comment.
Fixed, but please file an issue against Coder Review, which claims the parentheses are needed, and against API, which fails to create a link without them.
Done.
Added, but is phpDoc really necessary for internal-use-only methods within a test class that don't show up on api.drupal.org anyway?
All fixed. I didn't know about the
verbose()
function.Done.
Done.
Comment #182
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #183
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #184
pillarsdotnet CreditAttribution: pillarsdotnet commentedFixed typos, and added a test for wrapping very long text lines. The test fails, however. Trying to figure out why.
Comment #185
pillarsdotnet CreditAttribution: pillarsdotnet commentedFigured it out.
Comment #186
salvisThe patch produces PHP syntax errors, so it's NW in any case. I think item 2 below (wrapping <pre> and <table> below 80 characters) also makes this NW.
Re #171:
That would give us
Try this:
It gives
<pre>aaaaaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbb
ccccccccccccccc</pre>
aaaaaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbb
ccccccccccccccc
There are two newlines among the b's which shouldn't be there.
The one space ahead of the c's turns into two spaces. It's a nit, but if you're trying to convert something that depends on exact indenting, then it's maddening.
However, not getting any clue whatsoever about where a cell ends and the next one starts is not so great. Hence my suggestion to put two spaces between cells, rather than just one. This would usually allow the reader to count the cells in a row and give him a fighting chance to analyze and reconstruct the table structure, if he really wanted to. With only one space, it's completely impossible unless the cell contents are just single words.
I was going to argue that <strike> was on this list, and I think it was yesterday, and I think I've seen it used by people to show where they have made minor edits... It's still on g.d.o, for example here. It would be a pity to lose the <strike> information when such a node is transmitted through email. So, please, would you reconsider adding <strike>?
Comment #187
salvisOh, sorry, our messages crossed. My item 2 referred to your patch #182. Apparently, you've noticed the problem now.
Comment #188
salvisNo, the <pre> line is still getting wrapped below 80.
Comment #189
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #190
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #191
salvisI'm sorry for being a source of frustration for you. You've taken on a big task and done a great job. No one knows your code as well as you do — please don't give up! Take a break, let some feedback accumulate, and then deal with it calmly, rather than frantically shooting patches.
Re #190:
on Devel's Execute PHP page now gives
<pre>aaaaaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbb
ccccccccccccccc</pre>
aaaaaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbb
bbbbbbbbbbbbbbbbbbbbbbbbb
ccccccccccccccc
The one space in front of the c's now turns into no space, it should be preserved.
Two newlines are inserted between the b's inside the <pre> element, that shouldn't happen, as long as the resulting line is shorter than 998 characters.
The trailing spaces are gone, which is a clear improvement.
Comment #192
pillarsdotnet CreditAttribution: pillarsdotnet commentedHere's a patch with a failing test. I think that making it pass will require a rewrite/replacement of (at least)
drupal_wrap_mail()
and friends, which would prevent the possibility of a backport.EDIT: Had to change the test after learning some more about space-stuffing:
Comment #193
pillarsdotnet CreditAttribution: pillarsdotnet commentedHere's an entirely different approach. ;->
Comment #194
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #196
pillarsdotnet CreditAttribution: pillarsdotnet commentedFixed.
Comment #197
pillarsdotnet CreditAttribution: pillarsdotnet commentedCorrected approach from #193.
Comment #198
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #199
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #200
pillarsdotnet CreditAttribution: pillarsdotnet commentedWeird. Looks like we somehow broke core.
Comment #201
salvisPlease wait and give us some time to think...
In #192 you've pointed to drupal_wrap_mail(). Why would this stop working after having worked for D6?
The goal is to do format=flowed, delsp=yes wrapping, and this is NOT what I thought we should do!
Comment #202
pillarsdotnet CreditAttribution: pillarsdotnet commentedActually, I think the most interesting solution is in #197.
Comment #203
salvisAfter thinking about this some more I believe that RFC 3676 calls for soft wrapping <pre> tags.
I think the proper test for <pre> is the following:
Comment #204
pillarsdotnet CreditAttribution: pillarsdotnet commented<blockquote>
,<li>
,<ol>
, or<ul>
tag. Because of the way we are rendering them, you also can't soft-wrap anything that is a descendant of a<b>
,<cite>
,<em>
,<i>
,<strong>
, or<u>
tag. The more I consider this, the more I think the approach in #197 is the way to go. There is no right way to do things here.Comment #205
pillarsdotnet CreditAttribution: pillarsdotnet commentedNote that this will get rejected the first time it gets reviewed, as the comments extend beyond column 80.
EDIT: nevermind; code fails with a syntax error anyway.
Comment #206
salvisYou're not serious, are you? I'm pretty sure that spawning a child process is way too slow, and the text-mode browsers are designed to produce output for 25x80 terminals, which is not the same thing as a text email message.
Ah, this is a test, right? I'll readily admit that I'm not able to look at each of your patches, at the rate that you produce them...
Comment #207
pillarsdotnet CreditAttribution: pillarsdotnet commentedAll tests complete in less than one second on my six-year-old Lenovo T-60.
All of the ones I referenced have command-line options to set the line width. None of them care about the screen height.
You can tell which ones I'm serious about because I commit them to Mail System, which is the only thing that is going to benefit from this issue before D9 ships. And also, interdiff is handy when comparing large patchsets.
Comment #208
pillarsdotnet CreditAttribution: pillarsdotnet commented#205 has a syntax error and a test error which are now both corrected.
The final test, cut-and-pasted from #203, fails because it now requires behavior which you condemned in #186/#191 and which I fixed in #196.
The comments still extend beyond column 80.
Comment #209
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #210
salvisI think this is wrong. Space stuffing allows you to have leading spaces:
I believe this has been working in D6, but I'm not sure.
Why? Is this a limitation of your algorithm (can it be improved?) or do you see something in RFC 3676 that causes difficulties?
Yes, I got wiser between #191 and #203. I'm sorry for leading you down a path which I now think is a wrong one.
My hat's off to you for your drive, but maybe you're taking "talk is silver, code is gold" a bit too far. The basic problem in this issue is that we not only didn't have any tests, but we didn't even know how things are supposed work and why. In this situation it would be wise to discuss first and make sure we really understand what we're talking about (ideally getting consent from several people), before you start coding. This may seem slower, but ultimately it's probably faster, and certainly less frustrating.
I was referring to drupal_wrap_mail(), which apparently hadn't been touched since D6. This is what caused me to question whether my test case was correct.
Obviously, the D7 drupal_html_to_text() is badly broken, but I still think that the D6 version was mostly working (except that it made it difficult to send HTML mail). It was broken by inadvertently committing #407452-64: Remove use of drupal_html_to_text() from system mails. The error was noticed a year ago in #299138-48: Improve \Drupal\Core\Utility\Mail::htmlToText(). Your patch in #299138-69: Improve \Drupal\Core\Utility\Mail::htmlToText() and in #1130198: Regression: line-breaks are mangled by \Drupal\Core\Utility\Mail::htmlToText() may have been all that was needed to get things working again, but without tests we can't really know.
There's a tremendous amount of work, know-how, and tuning in the D6 version that we (you, mostly) have had to redo and rebuild. Once you've pulled this off, D7 will be in much better shape, of course, but it takes a lot of learning and hard work. It's not just coding...
Comment #211
salvisGuessing performance is very difficult. A more relevant test would be how many mails can be pumped through drupal_html_to_text() and sent out in one cron run. This may include spawning sendmail (or something similar), which could explain why that number is notoriously low. Spawning a text-mode browser may be a comparable effort, which might mean that you'd be cutting the number in half.
Not an attractive proposition...
Comment #212
pillarsdotnet CreditAttribution: pillarsdotnet commentedOnce someone has pulled this off, D9 will be in much better shape.
There; fixed that for you.
Comment #213
salvisOh, you're using <del>!
;-)
Comment #214
pillarsdotnet CreditAttribution: pillarsdotnet commentedTestbots are idle; let's put them to work.
Comment #216
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #217
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #218
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #219
pillarsdotnet CreditAttribution: pillarsdotnet commentedMy point is that we're rapidly approaching the complexity of (e.g.) html2text. If the goal is to duplicate the entire functionality, we might as well spawn the compiled binary, which is bound to be more efficient than its interpreted-language equivalent.
Re-opened #1130198.
Please add relevant tests to that issue.Copied tests from this issue, then adjusted the "expected" values until they passed. Discovered that the current version of
drupal_wrap_mail()
will happily generate lines of over 2000 characters in length, soremoved the RFC-3676 testreplaced thetestVeryLongLineWrap()
assertion with apass()
.Comment #221
pillarsdotnet CreditAttribution: pillarsdotnet commentedLooks like the testbots broke down around 10pm last night.
Comment #222
pillarsdotnet CreditAttribution: pillarsdotnet commented#218: drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-218.patch queued for re-testing.
Comment #223
pillarsdotnet CreditAttribution: pillarsdotnet commented#200: drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-200.patch queued for re-testing.
Comment #225
pillarsdotnet CreditAttribution: pillarsdotnet commentedYay! Got rid of all magic constants except for 160 (the non-breaking space character), 80 (the default line length for soft-wrapping) and 1000 (the default line length for hard-wrapping).
Working on adding support for tables; stay tuned.
Comment #226
Damien Tournoud CreditAttribution: Damien Tournoud commentedThis issue transformed into a massive feature request. Let's keep the other #1130198: Regression: line-breaks are mangled by \Drupal\Core\Utility\Mail::htmlToText() to fix the regression.
Comment #227
pillarsdotnet CreditAttribution: pillarsdotnet commentedFull table support added.
Comment #228
pillarsdotnet CreditAttribution: pillarsdotnet commentedAdded inline comments to explain the strategy for generating tables, improved efficiency slightly, and cleaned up inline comments in test code.
Comment #229
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #230
pillarsdotnet CreditAttribution: pillarsdotnet commentedCoder review and cleanups.
Comment #231
pillarsdotnet CreditAttribution: pillarsdotnet commentedMinor changes from actual use-testing.
Comment #232
pillarsdotnet CreditAttribution: pillarsdotnet commentedLast one;
I promise... :-)EDIT: must have had my fingers crossed.Comment #234
pillarsdotnet CreditAttribution: pillarsdotnet commentedFixed errors in tests.
Comment #235
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #236
pillarsdotnet CreditAttribution: pillarsdotnet commentedCorrected a one-off error: Need to subtract the length of the line-break character from the maximum line length before wrapping or adding padding.
Comment #237
pillarsdotnet CreditAttribution: pillarsdotnet commentedRemoved
url()
calls from test code and also switched toDrupalWebTestCase
.Comment #238
pillarsdotnet CreditAttribution: pillarsdotnet commentedCorrected the initial calculation of table cell wrapping width, and eliminated a duplicate assignment.
Comment #239
pillarsdotnet CreditAttribution: pillarsdotnet commented#237 fails, which means that #238 will probably fail too, but #1179448: verbose() results are missing.
Moving
verbose()
text to assertion message so that I can figure out why the testbots are failing where my local tests are succeeding.Comment #240
sun@pillarsdotnet: In light of #239, it would be helpful if you would run tests and debug test failures on your local box, before submitting patches to drupal.org. The common exception are patches that affect all of Drupal core, which take hours to test locally and only ~30 minutes for the testbot. But especially if it's only a single test-case that needs to be run, it's not really time-consuming to do that on your local box. Probably 90% of all follow-ups on this issue are pure patch/testing noise, which could have been avoided, in turn leading to more solid and focused follow-ups, reviews, and faster progress. Thanks!
Comment #241
pillarsdotnet CreditAttribution: pillarsdotnet commentedI am testing on my local box. Tests pass here but fail on drupal.org.
Dunno why.EDIT: Okay, I need to run local tests with clean urls turned off and drupal running from a subdirectory, else the results aren't valid.
Really? 220 comments worth? (counting non-noise comments... 148 out of 243)
So.... 40% noise, including my (now removed) snarkiness.
Comment #243
pillarsdotnet CreditAttribution: pillarsdotnet commentedImproved table generation accuracy and performance.
Added test for correctly formed deeply-nested tables.
Test also asserts that table generation takes less than one second.
Comment #244
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #245
pillarsdotnet CreditAttribution: pillarsdotnet commentedComment #246
pillarsdotnet CreditAttribution: pillarsdotnet commentedFound and fixed a typo -- doxygen comment didn't match parameter name.
Comment #247
pillarsdotnet CreditAttribution: pillarsdotnet commentedFixed an edge case reported by Seph. Haven't added a test because I haven't yet figured out how to reproduce the problem. CNR for testbot only.
Comment #248
pillarsdotnet CreditAttribution: pillarsdotnet commentedChanging back to "needs work" because #247 still lacks the following features:
<ins>
,<del>
, and<strike>
.align
,rowspan
, andcolspan
.<th>
and<td>
.Also, it doesn't end world hunger, save the whales, or promote world peace. Definitely needs work.
Meanwhile, please support #1130198-46: Regression: line-breaks are mangled by \Drupal\Core\Utility\Mail::htmlToText()
Comment #249
salvis@pillarsdotnet: Please stop rewriting history (like #164 that you posted on June 1 and edited just now).
Comment #250
attiks CreditAttribution: attiks commentedsubscribing
Comment #251
Wim LeersComment #252
catchCould be done in 8.1.x as an API addition.
Comment #253
ianthomas_ukComment #263
jungleOutdated?
htmlToText()
is in\Drupal\Core\Mail\MailFormatHelper
ATM.\Drupal\Core\Utility\Mail::htmlToText()
does not exist in 9.1.x.Comment #264
jungleSetting to NR for closing this if applicable.
Comment #265
kscheirerWell the original issue (12 years ago?? sheesh) was to add tests for htmlToText() since it's part of Drupal core.
Looking at
core/tests/Drupal/Tests/Core/Mail/MailFormatHelperTest.php::MailFormatHelperTest()
we still are not testing thehtmlToText()
function, so I think this issue still applies. Along the way this issue also wanted to improve the function. I haven't read through all the comments, but perhaps it would be easier to commit tests as-is, and then a new issue for improvements. Or close if there is no desire for the improvements anymore.Comment #267
quietone CreditAttribution: quietone as a volunteer commentedI spent some time reading the patch and I agree with jungle in #263 that this is outdated.
The tests are being converted to Unit tests in #2035133: Convert system module's Mail unit test to phpunit. And it there appears to be agreement to change the mail system, see #1803948: [META] Adopt the symfony mailer component.