Closed (outdated)
Project:
Drupal core
Version:
9.2.x-dev
Component:
mail system
Priority:
Major
Category:
Task
Assigned:
Unassigned
Issue tags:
Reporter:
Created:
23 Aug 2008 at 20:46 UTC
Updated:
29 Apr 2021 at 09:46 UTC
Jump to comment: Most recent, Most recent file
Comments
Comment #1
Kevin Hankens commentedOne quick question - sorry if this is in the wrong forum. If the $allowed_tags argument is empty, would you consider using the allowed tags list from the Filtered HTML input format settings? This would be more aligned with the bulk of the site content.
I use a lot of div tags in my content and wouldn't want to pass the entire list of allowed tags plus 'div' in an array each time I call drupal_html_to_text.
Comment #2
Kevin Hankens commentedOh, I forgot to add that for me a div tag would signify a separate section, so I also wanted to see a case for it in the switch stanza that worked like the p tag.
Thanks for letting me share :)
Comment #3
Kevin Hankens commentedI haven't submitted any simpletests before, but I started working on one yesterday for this function. If you are interested, I would be happy to finish it up and send it this way for review. If so, here are a couple of quick questions.
Thanks!
Kevin
Comment #4
catch1. That'd probably be fine. Anything else can always be added to later, but the idea here is provide input, pass it through the function, verify expected output, not much more than that.
2. You'd need a new group and a new mail.test - you could copy modules/simpletest/tests/form.test (or some other .inc that already as tests) to get started quickly.
Also feel free to post patches that are still in progress if you think there's something worth looking at, even if it's not quite ready.
Comment #5
Kevin Hankens commentedHere's a proposed mail.test file. It's a little rough, but the tests worked, and I was able to make them fail by manipulating the output of drupal_html_to_text(). Here are some considerations:
Otherwise, take a look and see if it's something you can use!
Comment #6
webchickMarking needs review.
Comment #7
Kevin Hankens commentedOn a related note, here are two things that I noticed in mail.inc that might warrant revision - hope it's ok to compile both into one patch!
<a>tags to generate footnote URLs.<div>tags which act like a<p>tag, since divs are usually a separate piece of infoThanks!
[edit]Just found an error - will post revision soon[/edit]
Comment #8
Kevin Hankens commentedApologies, I forgot to include div in the list of allowed tags.
One other note, would it make sense to include a blank line before paragraphs, divs, etc.?
Comment #9
arjenk commentedWell, i started reviewing the test patch, but ended writing it. Sorry about that. Some notes:
Comment #10
drewish commentedA few things:
Comment #11
arjenk commentedcleanup
Comment #12
drewish commentedvisually it looks a lot better. haven't tried running them.
Comment #13
arjenk commentedThanks,
After searching Drupal for references to drupal_html_to_text i stumbled upon the initial patch for this function (#154218: Add HTML to text convertor for email (and wrap mails properly)) by Steven. Here he showed a sample conversion, which gave me the idea to add an more functional unit test to the patch. Pro: it tests the function in a more reallife (tm) use, Cons: i adds a lot of lines to the mail.test file. What do you think?
test results so far:
There is some strange behaviour with utf8 characters, and some trivial bugs in drupal_html_to():
<li>foo</li>gives a PHP:notice;I will make a separate issue for these findings (if not too trivial), after this patch is committed ;-) ...
Please comment.
Comment #15
lilou commentedSee: #335122: Test clean HEAD after every commit and http://pastebin.ca/1258476
Comment #16
catchThere are redundant empty lines in the phpdoc. Also I think this could use some extra inline comments about what's being tested and how.
Comment #17
arjenk commentedTried to improve comments and generated output.
Comment #18
catchComment #19
cwgordon7 commentedIt probably doesn't matter, but since this is a test case and should try to be as near the actual case as possible, shouldn't valid HTML be used? For example, the line
<ul><li>Drupal<li>Drupal</ul>should probably be<ul><li>Drupal</li><li>Drupal</li></ul>, unless there is a reason for the invalid htm — perhaps you were trying to test the function's handling of such a case?Comment #21
jrglasgow commentedI added the tests suggested by cwgordon7 in #19 in additional to the tests already there.
Comment #22
jrglasgow commentedI fixed some alignment problem based on the recommendation of cwgordon7
Comment #23
cwgordon7 commentedCode style is good, and all tests passed, so rtbc. :)
Comment #24
sunSpot the double spacing here.
Please remove the blank line.
Shouldn't it actually be
'> Drupal'?Could we ammend the first line with a comment that states that we're explicitly testing malformed HTML?
Also, there is no test for an UL in an OL and vice-versa.
Concatentation mistake here.
Comment #25
arjenk commentedMore code cleanup, coder gives no errors anymore. Also i changed the test to DrupalUnitTestCase for speedup of the test (see #464714: Speed up the tests).
True,
'> Drupal'makes more sense, lets fix this after the unittest gets in. I also added a test with a nested<ul> and <ol>, and visa-versa.Comment #26
stella commentedMinor fixes required:
There are a number of trailing spaces in the patch. Also should
'Drupal<br>Drupal'not really be'Drupal<br />Drupal'inline with Drupal standards, or is that<br>on purpose as part of the test?Also not sure about the assertion message text below, it could use some simplification:
Ditto with the assertion messages for the assertions in
testDrupalHtmlToTextArgs(). I'd rather English sentences explaining the functionality that passed/failed, rather than a piece of code. And how does the 3rd assertion here differ in functionality from the 1st? They both appear to be testing that the array() overrides the default html tags allowed.Other than those few things, patch is looking good, great work!
Comment #27
sunGood catch! The answer is: both. We need to test both, since the handling of
<br>and<br />may break. Whereas the former should go into the "malformed" array section.The other remarks from stella I agree with.
This is a wonderful & great patch! Let's get it done! :)
Comment #28
sunAh well. And that was a nice test for HTML Filter...
Comment #29
arjenk commentedThanks for the reviews!
code.
<br> and <br />tags.Comment #30
arjenk commentedComment #31
sunNow that I look at it again:
I think we should call this
DrupalHtmlToTextTestCase.Likewise, can be removed here.
Duplicate space before
$expected_text.s/Tests/Test/ - PHPDoc summaries never use the third form.
we want to write uppercase here.
That's a wrong notation for inline comments. We only use multiline comments for PHPDoc, just use
//here.Also, I believe the function name is
drupal_html_to_text(), no?Can we simplify that to ?
One more instance of html_to_text() here.
Comment #32
arjenk commentedre-roll with sun's comments.
Comment #33
arjenk commentedSome small updates according to the latest test guidelines [http://drupal.org/node/325974]:
The getInfo is now a public static, and the descriptions use a more active tense now. Also changed mail to e-mail, and spotted a missed lowercase html. Last: added an mail.test entry to simpletest.info, otherwise it wont
show up in the testlist.
Comment #34
kscheirerwill review this when I get a chance
Comment #35
kscheirerAll the tests pass, this is a great addition!
just a few quibbles:
Comment #36
lilou commentedNow,
t()is removed ingetinfo(): #500866: [META] remove t() from assert messageComment #37
arjenk commented# Need a newline at the end of modules/simpletest/tests/mail.test to follow coding standards
changed.
# You have two tests that report with the message "Tests drupal_html_to_text() basic working of the parameter $allowed_tags." - perhaps the second one could explain how it differs from the first?
how about: "Tests drupal_html_to_text() with tags which are not given by the parameter $allowed_tags.". Better suggestions are welcome!
# When reporting the function and parameters used, why truncate at 15 characters? Most of the time the test text will not be seen by anyone, but when one fails it will be useful to see the complete test performed.
ok, looks better indeed. Also removed the t() there as t('drupal_html_to_text()') makes no sense.
#Now, t() is removed in getinfo()
done.
Comment #39
sun.core commentedTests don't qualify as critical.
Comment #40
sreynen commentedThese tests seem to be failing due to a bug in the actual function. That is, I think the tests are working as intended, and catching actual errors. Specifically, line breaks aren't be added as expected in drupal_html_to_text(). It's not clear to me whether it's really necessary to have line breaks at the end of many of the examples, but there definitely needs to be a line break in place of an actual page break (e.g. "Drupal<br />Drupal", and there isn't currently.
So I think the function needs to be updated to pass the tests, but I'm not very familiar with the function. Maybe someone more familiar with the function can confirm this?
Comment #41
gowriabhaya commentedCode sprint tag
Comment #42
MichaelCole commentedWorking this today. Will fix test and code, and resubmit patch.
Comment #43
MichaelCole commentedGiving up. Catch was unavailable on IRC.
Problem is how newlines are tested in the tests.
Comment #44
eric_a commented#508738: drupal_html_to_text negative padding error
Comment #45
eric_a commentedThe testbot is happy in #508738: drupal_html_to_text negative padding error. Please review! (This issue depends on that one.)
Comment #46
eric_a commented#37: mail.test-299138-37.patch queued for re-testing.
Comment #48
arjenk commentedPatch 1.26 of mail.inc (http://drupalcode.org/viewvc/drupal/drupal/includes/mail.inc?r1=1.25&r2=...) broke the test by doing:
The commit message was "- Patch #356074 by chx, Damien Tournoud: provide a sequences API." Which does not make much sense to me for this change... Perhaps the idea was to replace \n by MAIL_LINE_ENDINGS ?
Comment #49
sunJust remove "\n" from all the assertions. As you can see, the input text of those assertions does not contain a newline or BR, so the outlined change in #48 was a fix to not insert newlines, which do not actually exist; therefore making newlines match between input and output.
Comment #50
kscheirerMan, this took a while to track down. First of all, Dries mixed up his commit messages! Arjenk, the issue for that patch is really #407452: Remove use of drupal_html_to_text() from system mails. Which was basically sun's patch.
sun - what wasn't obvious (because we had no tests for this function!) was that many of the tags were relying on that extra
\nto format themselves properly. Stuff like h1, h2, dl, br, p, ol, ul, li, dl, dt, dd, etc. I think your patch was trying to make this function more general - but it doesn't seem to fulfill its original purpose now. Also_drupal_html_to_text_pad()is doing asubstr, relying on the fact that there was an extra\nto throw away.So right now
drupal_html_to_text('Drupal<br />Drupal')returnsDrupalDrupal, which is probably not what we want. Some of these can be fixed by changing$chunk = ''; // Ensure blank new-line.to$chunk = "\n"; // Ensure blank new-line..But that doesn't take care of all the cases - like
'<ul><li>Drupal</li><li><ol><li>Drupal</li><li>Drupal</li></ol></li></ul>'- what text should that result in? Reading RFC 3676 did not make it clear either.Arjenk - you wrote many of the test cases way back, based on #154218: Add HTML to text convertor for email (and wrap mails properly) - which included some examples that aren't available online anymore. It now looks like we have broken tests for a broken function. This may be a case where we write the test cases first, and then have the function match them - anyone up for it?
Comment #51
Tiburón commentedAssigning to me.
(DrupalCon CPH 2010 Core Dev Summit)
Comment #52
Tiburón commentedThe attached patch (based on #37) fixes the expected result of five tests of which four currently parses locally.
The tests for em, i, strong, b and br incorrectly expected a newline at the end of the result with the given input.
The test for br still fails as br is broken in drupal_html_to_text() (allowed but not handled).
In includes/mail.inc the functions drupal_html_to_text() and _drupal_html_to_text_pad() both needs to be fixed.
Comment #53
arjenk commentedWait, i'm puzzled here
First, kscheirer gave a excellent summary. And i agree with his
conclusion that the function is broken, but i'm not sure we can fix it
in the test.
The function is called drupal_html_to_text(), it is (was) a function
that was capable of formatting html to readable, formatted (hear the
newlines here?) text. It did need to add newlines in order to make
something out of the orginal html. To not allow the function to add
newlines, is to break the use of the function. So that is the status
of the current function: broken.
to give an example to illustrate:
(this is not the real text from core) gave the following text email:
but now:
The options are, to my opinion:
the function from core, and change it with strip_tags, or
likewise. Which effectively removes all tags in mails.
strong, b) and removes the others. Perhaps also change the name as
well.
there and try to fix it in another way. Perhaps make a test first if
we are dealing with a html text, and then call the function, and
else do not...
(EDIT: fixed the correct issue number)
Comment #54
sunLast option 3) in #53 is not possible. I tried to track and keep up with this issue, but didn't manage to do so. I'm not sure whether I understand #50. I like the tests added in #53, but didn't verify in detail whether they are valid. Overall, drupal_html_to_text() implements quite complex logic, so I'm pretty sure that it is possible to achieve any newline behavior we can think of. We should start by making sure that our test contains all expectations. Afterwards fix the logic.
Comment #56
kscheirerAgreed, we're back to square one.
Need to track down when this function came into Drupal and see what its original purpose was?
@sun: The short version of #50 is - your patch in #407452: Remove use of drupal_html_to_text() from system mails broke this function (or at least, it doesn't do what it did before that patch). There were no tests for this function in core to catch that. The old function may not have been correct, but the current version definitely is not.
Comment #57
sunI've slightly fixed the tests (which also means that expectations might be slightly changed).
The tests added here should definitely not try to be "compatible" with the current output of drupal_html_to_text(), but rather assert our expectations.
Our expectations are pretty clear: A proper conversion from HTML to plain text. If you happen to get e-mail notifications from drupal.org, then you may know most details of the expected format already.
Test results are clear, the current function is horribly broken. We should have another round to make sure that the expectations are right, then start to actually fix the bugs.
Comment #58
sunFeedback welcome.
Comment #60
arjenk commentedIt would also help to add a test for the issue which led to the troublesome patch. Perhaps it had something to do with wordwrap and adding a newline on long lines? So how about adding the test:
str_repeat('Drupal ', 15) => str_repeat('Drupal ', 11) . "\n" . str_repeat('Drupal ', 4)
Comment #61
arjenk commentedrfc3676 :
among others, we should also add some tests to test if this works properly.
Comment #62
sunok, but what about the existing expectations/assertions in the patch? I need a confirmation that they are correct, before I'll continue with this patch.
Comment #63
sun.core commentedEither we have serious problems here or we don't. Feedback on the added tests would be helpful to move on here (and fix the actual bugs).
Comment #64
kscheirerWell considering we don't even have enough information to write tests, I think we have a serious problem here. No one posting in this issue queue seems to know what the tests should look like. Someone has to know what this function is supposed to do, can we find the person that committed it?
Comment #65
sunWe already have tests in the latest patch, the current task is to confirm (or object to) the expectations in the tests.
Only with precisely tested expectations, we can fix drupal_html_to_text().
Comment #66
pillarsdotnet commentedPatch to mail.inc restoring functionality in converting lists.
Comment #67
pillarsdotnet commentedComment #68
pillarsdotnet commentedThat failed. Let's try just reversing the commit that caused the current brokenness.
Comment #69
pillarsdotnet commentedI wish the testbot would make up its mind as to whether it prefers "-p0" or "-p1" in its patches...
EDIT: Now I see -- earlier patch was hand-edited and I deleted a couple lines i shouldn't have.
Comment #70
pillarsdotnet commentedNote: all tests passed.
Comment #71
panchoThis bug renders the Simplenews module unusable and blocks further progress towards a D7 release there.
Please commit what you have there. #69 fixes the most urgent problem with drupal_html_to_text(). Further tests are a good idea but shouldn't block fixing a major bug. Thanks.
Comment #72
pillarsdotnet commentedComment #73
pancho@pillarsdotnet: Please reconsider your attitude. (You're not talking to a script kiddie, and even then you don't need to assume it was bullshit.)
Actually, I'm using Simplenews on D7 just as well to deliver our newsletter every two or three days.
So "unusable and blocking further progress" might in fact be a bit overstated. Correct would have been: "out of the box it is unusable, and because the bug is to be solved upstream, it paralyzes further development, plus: because admins don't get it work correctly, they refrain from further testing".
I thought to be specific enough, because it is obvious enough what patch #69 does. However to be even more detailed:
After applying #69 from this queue, drupal_html_to_text() stopped stripping off newlines from my simplemail newsletters (see #1099188: simplenews_html_to_text() strips off newlines).
Now, how come you're obviously not affected by the bug?
You might be using a different backend, such as Mime Mail, bypassing drupal_html_to_text().
Now the title of this major (!) issue here states "drupal_html_to_text() formatting is broken", so it doesn't mainly refer to adding tests. So what I'd like you to consider is fixing "drupal_html_to_text()" asap based on the tests we have now, even if the tests are not 100% complete. Thank you.
Comment #74
simon georges commentedClosing #1099188: simplenews_html_to_text() strips off newlines as a duplicate of this one, and subscribing.
If I may be of any help to help solve this issue, I'm clearly willing to help, to stop people raising this in Simplenews issue queue ;-).
Comment #75
pillarsdotnet commentedBumping, tagging, and re-rolling according to standard policy.
@sun: I still haven't mastered simpletest, but a minimum test requirement should be that:
<br />or<hr />tag must be separated by (at least) a newline in the plaintext version.Comment #76
sunIt would tremendously help if people would re-roll the proper patch. See #57.
Comment #77
pillarsdotnet commented@sun -- Obviously we want to do more, but at least the single-line change in #75 should be committed ASAP, because it fixes an obvious regression.
Added patch from #57, part of which had already been applied.
Also added my suggested test from #75 above.
Comment #78
pillarsdotnet commentedSorry; forgot the second argument to assertFalse().
Comment #79
pillarsdotnet commentedComment #81
pillarsdotnet commentedTrying again...
Comment #82
pillarsdotnet commentedAs long as I'm doing major surgery to drupal_html_to_text(), I might as well merge in #345931: Support value attribute in ordered lists.
Comment #84
pillarsdotnet commentedRemoved the "Full HTML Document" test because:
Comment #85
pillarsdotnet commentedComment #86
sunComment #87
pillarsdotnet commented@sun -- You fix it, then. This whole issue is a "feature request". The only part that was a bugfix was the patch in #75 which you so offhandedly rejected.
EDIT: Opened #1130198: Regression: line-breaks are mangled by \Drupal\Core\Utility\Mail::htmlToText()
Comment #88
pillarsdotnet commentedComment #89
pillarsdotnet commentedTo justify a possible RTBC by someone other than @sun:
Comment #90
pillarsdotnet commentedTests-only: failed (Good!)
Tests+fix: passed (Good!)
Comment #91
beatnikdude commentedsubscribe
Comment #92
pillarsdotnet commentedReplaced all
"\n"withMAIL_LINE_ENDINGSfor consistency. May possibly fail tests now, as tests assume:(MAIL_LINE_ENDINGS === "\n")This begs the question: Should we write separate tests for each possible value of
MAIL_LINE_ENDINGS?Comment #94
pillarsdotnet commentedFixed typo.
Comment #95
simon georges commentedIs there still something missing here ?
I'm really looking forward to having it in the 7.1, since we can't release Simplenews without it.
Comment #96
pillarsdotnet commentedYou can help by posting a review of the patch in #94. An acceptable review would include the following:
Confirm that all code and comments comply with Drupal coding standards.
Confirm that the tests included with the patch are necessary and sufficient.
Confirm that the patch actually solves the problem. Describe a reproducible test where the current code fails and the patched code succeeds.
Change the issue status from "Needs Review" to "Reviewed and Tested By the Community".
If, after thorough testing, you cannot in good conscience affirm all of the above, please explain what is lacking in the current patch and I will do my best to address your concerns.
Comment #97
lars toomre commented@pillarsdotnet:
I spent some time reading through the patch file tonight and the new tests. As a whole, the new patch and tests look good. However, I would suggest that you expand the tests to include some additional cases:
1) Multiple good tags override
$text = '<ul><li>This is a <em>complex</em> text string.</li></ul>';
$result = drupal_html_to_text($text, array('em', 'ul'));
2) Purposefully mal-formed tags in string test (from user input)
$text = '<ul><li>This is a <em>complex</em> text string.</ul></ul>';
$result = drupal_html_to_text($text, array('li','em'));
should be => ' * This is a *complex* text string.'
3) Ignorant user example of highlighting
$text = "This is <B>highlighted</B> text.';
$result = drupal_html_to_text($text, array('B'));
should be => 'This is *highlighted* text.'
Cases 2) and 3) I have experienced in the last few months from users hand-entered node bodies and/or comments. These are examples of the edge cases that I think this particular function needs to test for. Users are notorious for what they might enter without trying to be malicious.
I am unsure whether this function is intended to be used with utf-8 strings. If it is, I would want to add some text strings that include characters outside of the usual ASCII range like the European special vowels and non-Roman language sets.
Finally, I can easily see this function used as part of a larger webpage data extraction function. That typically involves using something like cURL to retrieve a webpage as $text and then start stripping out the tags to get to whatever information is desired. Hence, I think that we also want a test that takes a mocked up web page and demonstrate that this function handles that "complex case" as well.
Edit: Two further thoughts/questions:
1) What happens if 'html' is passed in as the second parameter?
2) Looking at #732992: Allow filter_xss_admin() to accept HTML5 tags, I am wondering if the permissible html tags acceptable as the second variable should be constrained to the acceptable terms used in filter_xss()? If so, how are we going to deal with what is acceptable for HTML5 (being dealt with in that issue)?
Comment #98
pillarsdotnet commented@Lars
I don't understand the significance of testing
'This is <B>highlighted</B> text.'.<b>tag should be supported?Comment #99
pillarsdotnet commentedOkay, I've added all the tests that Lars asked for, except for the utf-8 ones. I really don't know what to do about those and would welcome any suggestions.
@further thoughts/questions:
Nothing. I added a test to prove it. Unsupported tags are stripped and ignored.
Same as the previous. Passing unsupported tags into the second parameter will not affect the output, unless someone rewrites the drupal_html_to_text function.
If I get *real* ambitious, I might rewrite the entire function to make it easier to add new tags in the future, but right now I'm just looking for short-term results.
Comment #100
lars toomre commentedSorry if my point was not clear enough... I was trying to highlight the <B> case as in capitalized. I know that well formatted XHTML wants all tags in all lower-case, but users have the funny tendency to want to use both <B> and <b>. I thought a test to prove that we could accept either was appropriate.
OK... I have now read through all of the function changes and they make sense to me. I have not installed this patch though so I cannot say that it installs or runs as expected.
The added comments in the code look good, with full sentences ending in periods. I did not open up the patch in a text editor so I am unable to confirm that the comments do not extend beyond line 80. I also was unable to check for extra white spaces at the end of lines. There is one phrase 'see #345931: Support value attribute in ordered lists.' which I am unsure about in a comment. I am pretty sure we are not suppose to reference d.o. issues in comments, except in special cases. Is this the format that a special case should be in?
From reading the code and the tests, this patch appears to solve the problem and passes appropriate tests. I am not prepared to RTBC this however since I cannot confirm that the patch actually applies. Other steps to complete the review of this patch include:
1) A couple of people to also review this patch and make sure that it applies appropriately.
2) Add a couple of utf-8 strings with HTML tags embedded therein to make sure that the regex expressions do not degrade for instance an o with omlats to an ascii o.
Perhaps someone who does not speak English as a primary language could make some suggestions for item #2??
Comment #101
pillarsdotnet commentedThe comments don't. Some of the code does, a little.
See the very last test. It compares
drupal_strtoupper(drupal_html_to_text($input))withdrupal_html_to_text(drupal_strtoupper($input))where$inputis a fairly long mix of tags and text.You are correct. I have removed the references.
Now running coder review... done. Fixed problems identified by coder.
Comment #102
pillarsdotnet commentedResponding to #97 and also to sun's comment in #86:
First of all, this function is not supposed to do as good a job as a dedicated text-mode browser such as Lynx, links, or w3m. If you want a general data extractor, write a module. This is for sending plain-text email messages. That's why it's located in
includes/mail.incinstead ofmodules/filter.Secondly, producing such a test as you describe basically means taking a big chunk of html, running it through the function, and copying the output to your
$expectedvariable. What does that prove? Only that any future improvements to the function must also be accompanied by a change this huge test. And very likely, the future coders will likewise run the html through the function and retroactively "expect" whatever comes out the other end. I think this is a poor testing methodology. That's what I meant when I said in #84 above, "I think it's going about things the wrong way."Thirdly, I *have* run web-pages through the original function (as it existed in d6 or in #1130198) and I can testify that the resultant output is acceptably readable. I have about twenty people on my Simplenews mailing list who must receive text-only messages, and they're doing just fine.
Comment #103
lars toomre commentedOK... Glad to know that you had run a larger more complex text string (webpage) through this function. Skip the thought since you already demonstrated that 'html' tag cannot be 'slipped in' as an exception tag. My only other concern was what would happen if it encountered 'mal formed' HTMl code.
Glad that you ran the code through coder and corrected anything that came from that. Now if we can get the two items I raised previously in #100, we should be good to go!!
P.S. - I have code in contrib module that mimics this function and another that just eliminates one HTML tag. I am curious to test out this function as a possible replacement/enhancement there.
Comment #104
pillarsdotnet commentedI already added the "malformed HTML" tests that you suggested. If you think those tests are insufficient, please suggest additional ones.
From the patch:
Comment #105
pillarsdotnet commentedFound a bug and fixed it. Dunno how to test for it, but calling
drupal_html_to_text('<li>Drupal</li>')would produce a PHP warning.Comment #106
pillarsdotnet commentedFigured it out.
Comment #108
pillarsdotnet commentedMust have brushed against the middle mouse button while editing.
Comment #109
pillarsdotnet commentedComment #110
pillarsdotnet commentedComment #111
kscheirerdon't give up! You're trying to get drupal_html_to_text-tests+fix-299138-107.patch in right? I'll take another look as soon as I can, and thanks for picking this up.
Comment #112
pillarsdotnet commentedIncluded a patched copy of
drupal_html_to_text()in the Mail System 6.x-2.3 release to resolve the bug reported here:#1135978: Mail System format() method uses drupal_html_to_text() which strips linefeeds from input.
It appears that this bugfix is also applicable to D6. Tagging accordingly.
Also removing the "Needs tests" tag because I honestly believe the tests in this patch are sufficient. If anyone disagrees, please do so constructively. That is, don't just complain -- provide a suggestion for the test you think is missing.
Comment #113
pillarsdotnet commentedMore constructive issue title.
Comment #114
heine commentedD6 behaviour is by design and correct. Only pass real HTML to the function. Additional whitespace will be collapsed, see #298708-12: drupal_html_to_text() removes line endings.
Comment #115
pillarsdotnet commented@Heine -- Are you saying that the
DefaultMailSystem::format()method is broken because it blindly appliesdrupal_html_to_text()to all input?If so, have you submitted a bug report?
Or perhaps you are claiming that the d6 and d7 versions of
drupal_html_to_text()are somehow completely different despite sharing identical doxygen headers?If so, have you submitted a bug report?
Comment #116
heine commentedI was under the impression that this is the bug report dealing with drupal_html_to_text(). The following is what I understand is wrong about the current D7 drupal_html_to_text implementation.
Suppose you pass the following HTML through drupal_html_to_text.
Expected output ( aka the 'acceptable plaintext representation') should keep the paragraph divisons of the text with:
Output in D6 is according to expectation, in D7 we get:
Comment #117
pillarsdotnet commentedThe only functions that call
drupal_html_to-text()are:system_mail()(d6)hook_mail()(d6 and d7)DefaultMailSystem::format()(d7)None of the above functions check whether the argument to
drupal_html_to_text()is HTML or plaintext.Heine asserts that
drupal_html_to_text()should collapse whitespace, by design. Yet this would break the functionality of the only cases where the function is used.So I ask Heine, for the record, which of the following is true?
Your assertion was in error, and we should expect
drupal_html_to_text()to preserve plaintext formatting.All the documented use cases are in error, and
drupal_html_to_text()should be deleted from drupal core.Comment #118
damien tournoud commentedShouldn't that be MAIL_LINE_ENDINGS too?
Other then that, this looks RTBC to me. Also, this is purely a bug report. drupal_html_to_text() was accidentally broken on D7, and this patch fixes it.
@Heine: I think that pillarsdotnet was talking about backporting some additional bug fixes for drupal_html_to_text() that are also in that patch. The behavior around line breaks between the tags should obviously be the same on D6 and D7.
@pillarsdotnet: Heine is not saying that D7 behavior is correct.
Comment #119
pillarsdotnet commented@Damien -- Corrected.
Also corrected doxygen to reflect the current list of supported tags.
Added a test confirming that internal (not leading or trailing) whitespace is preserved; this revealed a bug which I subsequently fixed.
Double-checked that everything passes coder review.
Hopefully this is ready to commit.
Comment #120
pillarsdotnet commentedCouldn't resist one more tweak -- it bothered me that some of the "expected" results had trailing whitespace. Fixed.
Comment #121
pillarsdotnet commentedEven better...
Comment #122
heine commentedWhy should drupal_html_to_text operate on non-HTML?
Now it incorrectly converts HTML suchs as
Which renders as
The acceptable plaintext representation should match this rendition and display
Not what this patch displays.
Shoehorning two incompatible formats into a conversion function indicates that the callers are in the wrong.
Comment #123
heine commentedWell, that wasn't exactly fair; the incorrect whitespace treatment is not related to the patch.
I'll open a new issue.
Comment #124
pillarsdotnet commented@Heine
See the comment starting at line 497:
I find it interesting that the relevant commit is the same one which provoked this issue in the first place. Obviously there was more intended than the "provide a sequence API" comment would suggest.
I restate my conclusion in #117: either whitespace should be preserved, as required by the above comment and documented use-cases, or the function should be removed entirely. As I also said in #102, this function is not intended as a general-purpose text-mode browser. It is a convenience function to ensure that text emails remain readable, regardless of the input format.
Perhaps the simplest way to satisfy you would be to rename this function something else, like "drupal_sanitize_plaintext", or "drupal_html_or_text_to_plaintext".
Comment #125
heine commentedNot really, sending mail is fundamentally broken; the API is misdesigned.
Plaintext should be passed through nl2br or perhaps the linebreak filter before being passed to drupal_html_to_text.
Comment #126
pillarsdotnet commentedComment #127
heine commentedIf you can't bear reading reviews / musings aimed at the betterment of Drupal, stop posting patches, pillarsdotnet.
/me out
Comment #128
damien tournoud commentedNope, not at all. Mails are handled 100% in HTML in Drupal 7, and are only converted into plaintext (via drupal_html_to_text()) at the very end of the chain (when sending out the mail), because core by default only support sending email serialized as plain-text. You can see a summary of the design on #407452-35: Remove use of drupal_html_to_text() from system mails.
Converting plain-text to HTML before calling
drupal_mail()is the responsibility of the caller.You are completely right in saying that drupal_html_to_text() should collapse whitespace. pillarsdotnet is wrong in expecting that plain-text input should be handled.
Needs work again: we need to fix drupal_html_to_text() so that it properly collapse whitespace. The initial patch from #407452: Remove use of drupal_html_to_text() from system mails was clearly misguided in that particular part.
Comment #129
damien tournoud commentedAlso see http://drupal.org/update/modules/6/7#email-html for cristal clear statements that every mail is HTML and that we don't handle plain-text.
Comment #130
pillarsdotnet commentedI give up; you win. Re-rolled as requested. Also removed the comment.
If and when this ever trickles down to D6 it will either need to preserve whitespace or be accompanied by a patch to every place in core that calls
drupal_mail().(looking...)
Hmm... Not such a big list as I feared... Will probably tackle this myself when the time comes.
Comment #131
pillarsdotnet commentedDoes anybody care about this? Fixing it would be ... complicated, and I don't have a use-case, so I'm tempted to leave it broken-as-designed:
The old code doesn't support the
<pre>tag at all. I only added it as a synonym for<p>because it's listed among the block-level tags and I wanted a test that every block-level tag forces a line-break.Comment #132
pillarsdotnet commentedFound another bug:
Trying to fix it, but I'm having problems.
Comment #133
pillarsdotnet commentedFinally got it fixed.
<pre>tags are now fully supported and tested, as are multiple consecutive<br />tags.The only gotcha is that spaces within
<pre>tags turn into unicode non-breaking spaces.There's probably a cleaner way to do this, but for now it passes all tests so I'm gonna go to bed and await further review.
Comment #134
pillarsdotnet commentedFixed the gotcha and simplified the code.
Comment #135
pillarsdotnet commentedNested
<pre>tags are now tested and properly handled.Comment #136
pillarsdotnet commented#135: drupal_html_to_text-tests+fix-299138-135.patch queued for re-testing.
Comment #137
pillarsdotnet commented#135: drupal_html_to_text-tests+fix-299138-135.patch queued for re-testing.
Comment #139
pillarsdotnet commented#135: drupal_html_to_text-tests+fix-299138-135.patch queued for re-testing.
Comment #140
stephandale commentedI've made some modifications so that consecutive block level elements are separated by a blank line (i.e. two new line characters rather than one) and nested lists are handled better. As part of these changes block level elements output a new line before and afterwards.
The most significant changes are a) the tags now output chunks as they need this control when nested, b) the indent is applied by the following value rather than by the current tag and c) $item is used as the current element of $split so $value can store values only (as they may be needed by closing tags when determining their chunk).
More tests have been added, to test these modifications and test short tags that don't contain a space (which is the more correct use - the space was only recommended as a guideline to support old user agents - see http://www.w3.org/TR/xhtml1/#guidelines)
I also replaced the 3 instances of \r\n in drupal_html_to_text() with MAIL_LINE_ENDINGS as I assumed that they were forgotten earlier. Hopefully there wasn't a reason they hadn't been changed!
I wasn't sure what to name the patch so I kept the same format as the last one and incremented the last integer in its name. Hope this is ok.
P.S. I couldn't find the Coder module for 8.x. Is there any way to automatically test coding guidelines on 8?
Comment #141
pillarsdotnet commentedNew approach using a DOM tree instead of a custom parser. Looks cleaner to me and passes all tests locally.
Comment #143
pillarsdotnet commentedOkay; let's try that again...
Comment #144
pillarsdotnet commentedComment #145
pillarsdotnet commentedCoder griped about using
substrinstead ofdrupal_substrand found an indentation error.Corrected patch.
Comment #146
pillarsdotnet commented@stephandale --
Comment #147
pillarsdotnet commentedFound and fixed a subtle error in character counting.
Comment #148
pillarsdotnet commentedAnother coder gripe.
Note that the net difference in size between #140 and this patch is to add 31 comment lines and to take away 15 code lines.
Comment #149
pillarsdotnet commentedCode in #148 has been added to Mail System module:
And is being considered for incorporation into the Simplenews module:
Comment #150
salvisPaste this
into Devel's Execute PHP textarea (devel/php) and you'll get
I think line 2 should not be indented.
Comment #151
salvisHere's another test case — I've put the following into a node body:
where the X is a space, not followed by any newline.
This shows up just fine as
and it results in the markup in the following snippet:
Running the snippet in Devel's Execute PHP function gives:
Note the indenting again, and especially the missing 0!
Comment #152
pillarsdotnet commentedFixed, and tests added; thanks.
Comment #153
pillarsdotnet commentedTurns out I didn't need that block-tag function after all. No functional change from previous patch; just deletion of cruft.
Comment #154
pillarsdotnet commentedSlight refactoring; no functional changes.
Comment #155
pillarsdotnet commentedFixed a couple of gripes identified by coder. Again, no functional changes.
Comment #156
pillarsdotnet commentedComment #157
pillarsdotnet commentedMinor nitpicks; no code changes.
Proposed patch for:
Comment #158
pillarsdotnet commentedNote that the code in this issue has been incorporated into Mail System since 6.x-2.6/7.x-2.6/8.x-2.6. According to the usage statistics page, that's 1,348 users. Does this qualify as "Reviewed and Tested by the Community" ?
Comment #159
pillarsdotnet commentedComment #160
webchickNo. You never mark your own patch RTBC. All patches require peer review.
Comment #161
pillarsdotnet commentedWell, it's had plenty of review along the lines of "Looks good but..." and all the noted shortcomings have been fixed and tests added to ensure that they stay fixed.
Marking "Needs committer feedback" in hopes that someone will be willing to stick their neck out.
Comment #162
salvishere's a <a href='/node/123'>node</a> linkhere's a node [1] link[1] /node/123The link must be fully qualified, like http://example.com/node/123
Also, there should be an empty line between the text and the link list.
Comment #163
salvisI'm going through all the tags. You've done an amazing job with this. Here are a few more comments:
<p>a</p><p><hr /></p><p>b</p>currently produces two empty lines above and one below the separator line; there should be just one empty line above and below.
What is your strategy regarding soft line breaks (http://www.ietf.org/rfc/rfc2646.txt), space stuffing, etc.?
The difficulty in this issue is that it's so complex, and once a patch is committed, it'll be almost impossible to get new issues fixed that might pop up in the future. Putting this into production is certainly a good way to build confidence, but the feedback generated that way is very slow.
I'll incorporate your code into Subscriptions to generate some more feedback (hopefully), but the switch to HTML will take some major surgery before I can actually do this (right now I'm using the D6 mail.inc).
Comment #164
pillarsdotnet commentedRe #163:
I don't like them, and they aren't mine. I suggest emulating Markdown.
Two out of three text-mode browsers disagree with you. Both w3m and links render two empty lines above and below; only the older and arguably less-sophisticated lynx renders one empty line above and below. None of them, by the way, render headers as you suggest.
RFC 3676 disagrees.
The w3 org says:
The HTML-5 draft says you have to check all the child tags to decide whether to render them as phrasing content or flow content.
The Web Design Group says that they are inline if e.g., within another inline element or a P.
On reflection, I'm dropping support for both, as neither one has any accurate plain-text representation.
And they were missing before, too. This is a bug report, not a feature request. See my comment in #102 above.
Since you ask so nicely, and I happen to agree, done. (2 files changed, 8 insertions, 8 deletions)
Wrapping is mandatory, according to RFC 3676, and will often be performed by the email-reading software even if we don't.
I considered coding full support for tables, but I thought about the
colspanandrowspanattributes, and decided that it wasn't worth the trouble.Comment #165
stephandale commentedEdit 01/06/2011 16:03: Patch in comment#167 does the trick.
Edit 01/06/2011 15:49: Bad timing - pillarsdotnet posted at same time. I'll make the mods to his latest patch and repost.
I've found that the use of the PCRE_MULTILINE modifier m in preg_replace() resulted in different behaviour on different servers, possibly due to the difference in the PCRE version though I can't be sure at this stage.
Server 1 (tests pass):
php > $s = "one\ntwo\n"; php > echo preg_replace('/^/m','FOO',$s); FOOone FOOtwoServer 2 (tests fail):
php > $s = "one\ntwo\n"; php > echo preg_replace('/^/m','FOO',$s); FOOone FOOtwo FOONote how on server 2 (which has an older PCRE library), the replacement is also made on the trailing new line.
I've added a three line workaround to the drupal_wrap_mail() function, so tests are now green on both servers. The change was made to the last three lines before the return statement and simply removes the $clean_indent if it's present at the end of the text.
// Apply indentation. We only include non-'>' indentation on the first line. $indented_text = preg_replace('/^/m', $clean_indent, $text); $indented_text = preg_replace('/' . $clean_indent . '$/', '', $indented_text); // Workaround for differences in multiline behaviour. $text = $indent . drupal_substr($indented_text, drupal_strlen($indent)); return $text;I updated to the latest Drupal 8.x, applied the patch from comment#157 (http://drupal.org/node/299138#comment-4535020) and made my modifications. The attached patch should include those changes and my modifications. Note that it differs to the patch in comment#157 by more than the three lines I edited, which I can only assume is due to other changes in core since comment#157.
Comment #166
pillarsdotnet commentedRe #162:
I agree; done.
Re #165:
Try this slightly simpler workaround and see if it passes on your server 2.
Comment #167
pillarsdotnet commentedComment #168
stephandale commentedRe #166: It does, yes. Thanks.
Note: I tried patch from #167.
Comment #169
salvisRe #164:
<hr> is a separator. It separates two equal parts. There's no way to tell whether it should belong to the part above or below. Putting it in the middle is the right thing to do. Apparently, all three text-mode browsers agree with me that the space above and below should be equal. That's my point. I'd say one empty line is enough, but if you prefer two, that's perfectly fine with me, as long as the line is in the middle.
Let's take a step back. Headings are not 'footings' and not separators. They announce what comes after them, and as such they semantically belong to what follows. They have very little connection to what's above them. Thus the distance above should be larger than the distance below.
I don't know what the design goals and priorities of the text-mode browsers have been — probably to pack as much information on a 25-line CRT as possible, to minimize the required navigation (= transmission over a slow connection). That's completely irrelevant in a text file (or email message).
The goal here, IMHO, should be to render HTML as text in a way that helps the human reader to understand its structure and content. You're forced to drop a lot of information to convert to plain text, and any hint that you can reasonably provide will be appreciated by the human reader.
Huh? Our interpretations of RFC3676 seem to differ. Let me cite:
RFC3676 goes on to say that there can be problems with this type of text, and that in general text should be flowed. BUT (my interpretation now) if the author explicitly says that he formatted the text exactly as he wants it to appear, then the renderer should comply, if technically possible.
The <pre> tag explicitly asks for the fixed format (preformatted text). Preformatted means that indents should be preserved, spaces should be preserved, line breaks should be preserved, etc. — essentially as much formatting should be preserved as possible. If you don't have a horizontal scrollbar (e.g. on a cell phone), then you obviously have to wrap the lines to avoid clipping them, but if you can avoid it, then you should preserve formatting.
Some email clients may still munge it, but others will do the right thing. If you munge it even before it leaves the site, then you ensure that it'll be equally incomprehensible for both camps. Is that the goal?
See also HTML5. It doesn't say this explicitly, but the intention of <pre> clearly is that line breaks should neither be added nor removed inside <pre>. That's what <pre> is all about.
Yes, and it's called drupal_html_to_text(). Thus, if I want to send a node body through email, then this function is the obvious choice. I thought the idea of this exercise was to make the D7 version better than the D6 one, was it not?
_Underlined_ is firmly established. I don't know whether it was supported in the D6 version or not, and it doesn't really matter. Call its absence a pre-existing bug if you want, but that doesn't justify carrying its absence over into your new implementation.
IMO, the goal should be to try to preserve as much information as possible. If a node author decides that underline or strikeout are needed to convey his meaning, then it would be pity to lose those markups when transmitting the node body through text-mode email, given that there is a reasonable way to express them.
Here's where we differ. Where in RFC 3676 does it say "MUST wrap"? In my reading it says "SHOULD wrap", which means there can be overriding considerations. The author's clearly expressed intention via <pre> or <table> are such an overriding consideration, IMO.
RFC 3676 even says explicitly that Hand-aligned text, such as ASCII tables or art, source code, etc., SHOULD be sent as fixed, not flowed lines.
Comment #170
salvisThe patch in #167 now produces
here's a <a href='/node/123'>node</a> linkhere's a node [1] link[1] http://example.com//node/123Note the extraneous slash. Apparently, there's no test for this yet.
It seems like you have to remove the leading slash before you call url().
Comment #171
pillarsdotnet commentedPoint by point:
Please revise the following test code to meet your expectations:
From RFC 3676:
From RFC 2046:
From RFC 821:
So line wrapping MUST occur before 1000 characters (998 plus
CRLF). Outside of<pre>tags, it SHOULD occur at a convenient width for text-mode readers, which historically has been 80 characters (78 plusCRLF). If there is a problem, it lies in the fact that we use the same code to meet both constraints.The idea of this exercise is to "fix the broken formatting in drupal_html_to_text() and also provide tests". Please understand that "broken" means something entirely different from "sub-optimal". I now regret that I overzealously added support for tags that were not originally supported, as I may have unnecessarily delayed adoption of the more important fix. Also note the "Needs backport to D6" tag.
Fine; added support for the
<u>tag, even though its actual use is quite rare. Stylesheets would be the far more common use case, but supporting them is way out of scope.Thanks. Fixed, and test added.
Comment #172
pillarsdotnet commentedI wasted four hours on this one. Try running the following program:
You will discover that
<p><hr /></p>gets corrected to<p/><hr/>because you can't nest<hr />inside a<p>container.Comment #173
sreynen commentedLooking over this thread, the scope has expanded far beyond the initial bug report, which was addressed a long time ago.
The latest patch solves the original bug (lack of tests) without introducing any new bugs. If there are additional feature requests, or even other bugs that aren't yet fixed by this patch, I suggest we take them to separate issues and close this thread before it reaches 3 years old.
Comment #174
pillarsdotnet commentedRe-ran #171 through coder review and found one
strlen()that could be replaced bydrupal_strlen().Changing to avoid rejection by pedants, even though it makes absolutely no functional difference. Leaving status at RTBC.
Comment #175
lars toomre commentedI agree with @sreynen #173 that this patch fixes the original bug report and goes far beyond. Hence, I strongly urge that this patch be applied with the clear understanding that 'better' may not be 'perfect'. However, 'better' is also *much* better than the current state of core Drupal.
This patch adds the initial requested tests and passes tests also included for all of the extra code that was added to better handle the complex challenge of converting HTML text to its plain text equivalent. Edge cases, additional bugs and/or feature requests can be started in separate issues. It is time for this RTBC patch to be applied. It makes HTML-text to plain text conversion so much better!!
Comment #176
bdragon commentedSkimmed through the patch and noticed a couple of minor nits in the doxygen docs:
* From somewhere whereabouts line 300ish in includes/mail.inc:
Typo, need a ' instead of ) after mail_line_endings.
* From the comment block above function _drupal_html_to_text():
Should have () after drupal_html_to_text there.
Comment #177
pillarsdotnet commentedUpdated.
Comment #178
sunNeeds work for @bdragon's review.
Please also make sure that all the RFC research is documented/summarized in inline comments (including RFC numbers). It would be silly to go through all of this research once again in the future.
Furthermore:
Should be MAIL_NBSP. Although "NBSP" is not really descriptive. I don't understand why this is a constant, and why we're decoding
to get the character instead of directly specifying the character. This needs to be documented in the (missing) phpDoc.I don't understand why we're adding a variable for this. Especially since the actual function code talks about platform-specific line-endings -- so if something is platform-specific, why is it a variable? A short explanation for why it's a variable would be helpful.
The first preg_replace() replaces \r\n with platform-specific or variable-defined line endings. The second preg_replace() still tries to match original line endings even though they got already replaced with $eol.
Exceeds 80 chars.
$parents should use array type-hinting, too.
Always use !isset() instead of is_null().
Stray leading and trailing space in if condition.
Missing
break;or missing inline comment that explains the fall-through to 'dt'.1) Stray leading and trailing space in if condition.
2) Extra parenthesis around $value can be removed.
3) Shouldn't wrap.
The empty default case needs an inline comment for why it's needed (if it is needed).
Bogus change: DefaultMailSystem is a class, not a function.
1) Please remove the leading underscore and rename to stringToHtml().
2) Missing phpDoc.
1) Rename to assertHtmlToText().
2) Rename $args to $allowed_tags.
3) $reason should be $message.
4) Use a separate $this->verbose() after assertEqual() to output the debugging info as a page instead of squeezing it into the assertion message.
5) Missing phpDoc.
Might make sense to unconditionally run check_plain() on assertHtmlToText()'s $message, so we don't need those
<...>.Let's make these a bit cleaner:
Powered by Dreditor.
Comment #179
webchickGuys, if this gets marked RTBC one more time before it's ready, it's going to the very bottom of my stack.
Comment #180
webchickComment #181
pillarsdotnet commentedAlready fixed.
If the objective is absolute RFC-compliant pedantry, I'm unqualified for the job. Someone else will have to do it.
Changed globally to
chr(160), which is what it evaluates to in my copy of PHP. Added inline comments to explain the magic number.The explanation is at the top of the file, lines 8-13 inclusive:
I'm using a variable
$eolinstead of repeating the codevariable_get('mail_line_endings', MAIL_LINE_ENDINGS)but if you prefer the repetition, I will comply.Okay, replaced
'/ +\r?\n/'with"/ +$eol/"but I don't think it makes any difference.Fixed.
Changed. Drupal in general should use type-hinting. It doesn't.
Changed. (PHP is such a screwball language! If we can't use the function, why does it exist?)
Re-wrote long
ifconditions using a temporary variable.Good catch. Fixed.
Added an inline comment.
Fixed, but please file an issue against Coder Review, which claims the parentheses are needed, and against API, which fails to create a link without them.
Done.
Added, but is phpDoc really necessary for internal-use-only methods within a test class that don't show up on api.drupal.org anyway?
All fixed. I didn't know about the
verbose()function.Done.
Done.
Comment #182
pillarsdotnet commentedComment #183
pillarsdotnet commentedComment #184
pillarsdotnet commentedFixed typos, and added a test for wrapping very long text lines. The test fails, however. Trying to figure out why.
Comment #185
pillarsdotnet commentedFigured it out.
Comment #186
salvisThe patch produces PHP syntax errors, so it's NW in any case. I think item 2 below (wrapping <pre> and <table> below 80 characters) also makes this NW.
Re #171:
That would give us
Try this:
It gives
<pre>aaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbccccccccccccccc</pre>aaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbcccccccccccccccThere are two newlines among the b's which shouldn't be there.
The one space ahead of the c's turns into two spaces. It's a nit, but if you're trying to convert something that depends on exact indenting, then it's maddening.
However, not getting any clue whatsoever about where a cell ends and the next one starts is not so great. Hence my suggestion to put two spaces between cells, rather than just one. This would usually allow the reader to count the cells in a row and give him a fighting chance to analyze and reconstruct the table structure, if he really wanted to. With only one space, it's completely impossible unless the cell contents are just single words.
I was going to argue that <strike> was on this list, and I think it was yesterday, and I think I've seen it used by people to show where they have made minor edits... It's still on g.d.o, for example here. It would be a pity to lose the <strike> information when such a node is transmitted through email. So, please, would you reconsider adding <strike>?
Comment #187
salvisOh, sorry, our messages crossed. My item 2 referred to your patch #182. Apparently, you've noticed the problem now.
Comment #188
salvisNo, the <pre> line is still getting wrapped below 80.
Comment #189
pillarsdotnet commentedComment #190
pillarsdotnet commentedComment #191
salvisI'm sorry for being a source of frustration for you. You've taken on a big task and done a great job. No one knows your code as well as you do — please don't give up! Take a break, let some feedback accumulate, and then deal with it calmly, rather than frantically shooting patches.
Re #190:
on Devel's Execute PHP page now gives
<pre>aaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbccccccccccccccc</pre>aaaaaaaaaaaaaaaaaaaaaaabbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbcccccccccccccccThe one space in front of the c's now turns into no space, it should be preserved.
Two newlines are inserted between the b's inside the <pre> element, that shouldn't happen, as long as the resulting line is shorter than 998 characters.
The trailing spaces are gone, which is a clear improvement.
Comment #192
pillarsdotnet commentedHere's a patch with a failing test. I think that making it pass will require a rewrite/replacement of (at least)
drupal_wrap_mail()and friends, which would prevent the possibility of a backport.EDIT: Had to change the test after learning some more about space-stuffing:
Comment #193
pillarsdotnet commentedHere's an entirely different approach. ;->
Comment #194
pillarsdotnet commentedComment #196
pillarsdotnet commentedFixed.
Comment #197
pillarsdotnet commentedCorrected approach from #193.
Comment #198
pillarsdotnet commentedComment #199
pillarsdotnet commentedComment #200
pillarsdotnet commentedWeird. Looks like we somehow broke core.
Comment #201
salvisPlease wait and give us some time to think...
In #192 you've pointed to drupal_wrap_mail(). Why would this stop working after having worked for D6?
The goal is to do format=flowed, delsp=yes wrapping, and this is NOT what I thought we should do!
Comment #202
pillarsdotnet commentedActually, I think the most interesting solution is in #197.
Comment #203
salvisAfter thinking about this some more I believe that RFC 3676 calls for soft wrapping <pre> tags.
I think the proper test for <pre> is the following:
Comment #204
pillarsdotnet commented<blockquote>,<li>,<ol>, or<ul>tag. Because of the way we are rendering them, you also can't soft-wrap anything that is a descendant of a<b>,<cite>,<em>,<i>,<strong>, or<u>tag. The more I consider this, the more I think the approach in #197 is the way to go. There is no right way to do things here.Comment #205
pillarsdotnet commentedNote that this will get rejected the first time it gets reviewed, as the comments extend beyond column 80.
EDIT: nevermind; code fails with a syntax error anyway.
Comment #206
salvisYou're not serious, are you? I'm pretty sure that spawning a child process is way too slow, and the text-mode browsers are designed to produce output for 25x80 terminals, which is not the same thing as a text email message.
Ah, this is a test, right? I'll readily admit that I'm not able to look at each of your patches, at the rate that you produce them...
Comment #207
pillarsdotnet commentedAll tests complete in less than one second on my six-year-old Lenovo T-60.
All of the ones I referenced have command-line options to set the line width. None of them care about the screen height.
You can tell which ones I'm serious about because I commit them to Mail System, which is the only thing that is going to benefit from this issue before D9 ships. And also, interdiff is handy when comparing large patchsets.
Comment #208
pillarsdotnet commented#205 has a syntax error and a test error which are now both corrected.
The final test, cut-and-pasted from #203, fails because it now requires behavior which you condemned in #186/#191 and which I fixed in #196.
The comments still extend beyond column 80.
Comment #209
pillarsdotnet commentedComment #210
salvisI think this is wrong. Space stuffing allows you to have leading spaces:
I believe this has been working in D6, but I'm not sure.
Why? Is this a limitation of your algorithm (can it be improved?) or do you see something in RFC 3676 that causes difficulties?
Yes, I got wiser between #191 and #203. I'm sorry for leading you down a path which I now think is a wrong one.
My hat's off to you for your drive, but maybe you're taking "talk is silver, code is gold" a bit too far. The basic problem in this issue is that we not only didn't have any tests, but we didn't even know how things are supposed work and why. In this situation it would be wise to discuss first and make sure we really understand what we're talking about (ideally getting consent from several people), before you start coding. This may seem slower, but ultimately it's probably faster, and certainly less frustrating.
I was referring to drupal_wrap_mail(), which apparently hadn't been touched since D6. This is what caused me to question whether my test case was correct.
Obviously, the D7 drupal_html_to_text() is badly broken, but I still think that the D6 version was mostly working (except that it made it difficult to send HTML mail). It was broken by inadvertently committing #407452-64: Remove use of drupal_html_to_text() from system mails. The error was noticed a year ago in #299138-48: Improve \Drupal\Core\Utility\Mail::htmlToText(). Your patch in #299138-69: Improve \Drupal\Core\Utility\Mail::htmlToText() and in #1130198: Regression: line-breaks are mangled by \Drupal\Core\Utility\Mail::htmlToText() may have been all that was needed to get things working again, but without tests we can't really know.
There's a tremendous amount of work, know-how, and tuning in the D6 version that we (you, mostly) have had to redo and rebuild. Once you've pulled this off, D7 will be in much better shape, of course, but it takes a lot of learning and hard work. It's not just coding...
Comment #211
salvisGuessing performance is very difficult. A more relevant test would be how many mails can be pumped through drupal_html_to_text() and sent out in one cron run. This may include spawning sendmail (or something similar), which could explain why that number is notoriously low. Spawning a text-mode browser may be a comparable effort, which might mean that you'd be cutting the number in half.
Not an attractive proposition...
Comment #212
pillarsdotnet commentedOnce someone has pulled this off, D9 will be in much better shape.
There; fixed that for you.
Comment #213
salvisOh, you're using <del>!
;-)
Comment #214
pillarsdotnet commentedTestbots are idle; let's put them to work.
Comment #216
pillarsdotnet commentedComment #217
pillarsdotnet commentedComment #218
pillarsdotnet commentedComment #219
pillarsdotnet commentedMy point is that we're rapidly approaching the complexity of (e.g.) html2text. If the goal is to duplicate the entire functionality, we might as well spawn the compiled binary, which is bound to be more efficient than its interpreted-language equivalent.
Re-opened #1130198.
Please add relevant tests to that issue.Copied tests from this issue, then adjusted the "expected" values until they passed. Discovered that the current version of
drupal_wrap_mail()will happily generate lines of over 2000 characters in length, soremoved the RFC-3676 testreplaced thetestVeryLongLineWrap()assertion with apass().Comment #221
pillarsdotnet commentedLooks like the testbots broke down around 10pm last night.
Comment #222
pillarsdotnet commented#218: drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-218.patch queued for re-testing.
Comment #223
pillarsdotnet commented#200: drupal_html_to_text-fix_broken_formatting_and_add_tests-299138-200.patch queued for re-testing.
Comment #225
pillarsdotnet commentedYay! Got rid of all magic constants except for 160 (the non-breaking space character), 80 (the default line length for soft-wrapping) and 1000 (the default line length for hard-wrapping).
Working on adding support for tables; stay tuned.
Comment #226
damien tournoud commentedThis issue transformed into a massive feature request. Let's keep the other #1130198: Regression: line-breaks are mangled by \Drupal\Core\Utility\Mail::htmlToText() to fix the regression.
Comment #227
pillarsdotnet commentedFull table support added.
Comment #228
pillarsdotnet commentedAdded inline comments to explain the strategy for generating tables, improved efficiency slightly, and cleaned up inline comments in test code.
Comment #229
pillarsdotnet commentedComment #230
pillarsdotnet commentedCoder review and cleanups.
Comment #231
pillarsdotnet commentedMinor changes from actual use-testing.
Comment #232
pillarsdotnet commentedLast one;
I promise... :-)EDIT: must have had my fingers crossed.Comment #234
pillarsdotnet commentedFixed errors in tests.
Comment #235
pillarsdotnet commentedComment #236
pillarsdotnet commentedCorrected a one-off error: Need to subtract the length of the line-break character from the maximum line length before wrapping or adding padding.
Comment #237
pillarsdotnet commentedRemoved
url()calls from test code and also switched toDrupalWebTestCase.Comment #238
pillarsdotnet commentedCorrected the initial calculation of table cell wrapping width, and eliminated a duplicate assignment.
Comment #239
pillarsdotnet commented#237 fails, which means that #238 will probably fail too, but #1179448: verbose() results are missing.
Moving
verbose()text to assertion message so that I can figure out why the testbots are failing where my local tests are succeeding.Comment #240
sun@pillarsdotnet: In light of #239, it would be helpful if you would run tests and debug test failures on your local box, before submitting patches to drupal.org. The common exception are patches that affect all of Drupal core, which take hours to test locally and only ~30 minutes for the testbot. But especially if it's only a single test-case that needs to be run, it's not really time-consuming to do that on your local box. Probably 90% of all follow-ups on this issue are pure patch/testing noise, which could have been avoided, in turn leading to more solid and focused follow-ups, reviews, and faster progress. Thanks!
Comment #241
pillarsdotnet commentedI am testing on my local box. Tests pass here but fail on drupal.org.
Dunno why.EDIT: Okay, I need to run local tests with clean urls turned off and drupal running from a subdirectory, else the results aren't valid.
Really? 220 comments worth? (counting non-noise comments... 148 out of 243)
So.... 40% noise, including my (now removed) snarkiness.
Comment #243
pillarsdotnet commentedImproved table generation accuracy and performance.
Added test for correctly formed deeply-nested tables.
Test also asserts that table generation takes less than one second.
Comment #244
pillarsdotnet commentedComment #245
pillarsdotnet commentedComment #246
pillarsdotnet commentedFound and fixed a typo -- doxygen comment didn't match parameter name.
Comment #247
pillarsdotnet commentedFixed an edge case reported by Seph. Haven't added a test because I haven't yet figured out how to reproduce the problem. CNR for testbot only.
Comment #248
pillarsdotnet commentedChanging back to "needs work" because #247 still lacks the following features:
<ins>,<del>, and<strike>.align,rowspan, andcolspan.<th>and<td>.Also, it doesn't end world hunger, save the whales, or promote world peace. Definitely needs work.
Meanwhile, please support #1130198-46: Regression: line-breaks are mangled by \Drupal\Core\Utility\Mail::htmlToText()
Comment #249
salvis@pillarsdotnet: Please stop rewriting history (like #164 that you posted on June 1 and edited just now).
Comment #250
attiks commentedsubscribing
Comment #251
wim leersComment #252
catchCould be done in 8.1.x as an API addition.
Comment #253
ianthomas_ukComment #263
jungleOutdated?
htmlToText()is in\Drupal\Core\Mail\MailFormatHelperATM.\Drupal\Core\Utility\Mail::htmlToText()does not exist in 9.1.x.Comment #264
jungleSetting to NR for closing this if applicable.
Comment #265
kscheirerWell the original issue (12 years ago?? sheesh) was to add tests for htmlToText() since it's part of Drupal core.
Looking at
core/tests/Drupal/Tests/Core/Mail/MailFormatHelperTest.php::MailFormatHelperTest()we still are not testing thehtmlToText()function, so I think this issue still applies. Along the way this issue also wanted to improve the function. I haven't read through all the comments, but perhaps it would be easier to commit tests as-is, and then a new issue for improvements. Or close if there is no desire for the improvements anymore.Comment #267
quietone commentedI spent some time reading the patch and I agree with jungle in #263 that this is outdated.
The tests are being converted to Unit tests in #2035133: Convert system module's Mail unit test to phpunit. And it there appears to be agreement to change the mail system, see #1803948: [META] Adopt the symfony mailer component.