Email reply text not being stripped from comment body [#1188584]

I have somewhat resolved an issue I've been having with mailcomment 2.x-beta1 that I thought I would share and also ask for feedback on. It seems that when I reply to an email message the portion of the email below the "((( Reply ABOVE this LINE to POST a COMMENT )))" reply text does not get stripped out of the comment post. I scoured the support queue but can't seem to find anyone with the same problem.

Upon further investigation I found the mailcomment_feeds_after_parse function to be the culprit.

Original Function in mailcomment.module


// Need to clean up body text after the comment is parsed and before it is processed
function mailcomment_feeds_after_parse(FeedsImporter $importer, FeedsSource $source) {
  // Now trim out the rest of the message if separator text exists
  // @ TODO May fail for html mails
  if ($marker = variable_get('mailcomment_reply_text', t('((( Reply ABOVE this LINE to POST a COMMENT )))'))) {
    // Allow other modules to act on the text and node object before we start stripping things from it.
    //mailcomment_invoke_mailcomment_alter($node, 'pre');
    // Now the dirty part. May need some more clean up for line endings, spare html, etc...
    $pos = strpos($node->body, $marker);
    if ($pos !== FALSE) {
      // Mailhandler brings this in as a full node with teaser/body but we need to change context to comment
      $node->body = drupal_substr($node->body, 0, $pos);
    }
  }
}

After looking at this function and referring to the feeds source to look at the original hook, I have no idea where the $node object is coming from. After doing some debugging it seems to me that it just simply does not exist within the context of this function. I have modified the code utilizing the $source object and manipulating the 'origbody' string

My Version

// Need to clean up body text after the comment is parsed and before it is processed
function mailcomment_feeds_after_parse(FeedsImporter $importer, FeedsSource $source) {
  // Now trim out the rest of the message if separator text exists
  // @ TODO May fail for html mails 
  if ($marker = variable_get('mailcomment_reply_text', t('((( Reply ABOVE this LINE to POST a COMMENT )))'))) {
    // Allow other modules to act on the text and node object before we start stripping things from it.
    mailcomment_invoke_mailcomment_alter($node, 'pre');
    $count = 0; // Initialize counter for batch items
    foreach ($source->batch->items as $message) {
      $pos = strpos($message['origbody'], $marker);
      if ($pos !== FALSE) {
        $source->batch->items[$count]['origbody'] = drupal_substr($message['origbody'], 0, $pos);
      }
      $count++;
    }
  }
}

So far tests on my version seem to be working ok for my dev site. But I really want to know is there something I'm missing in my configuration? Or is this feature just not working in the beta1 release?

Comment	File	Size	Author
#21	strip_reply_text-11188584-6.patch	15.2 KB	davideads
#21
#20	strip_reply_text-1188584-5.patch	14.47 KB	davideads
#20
#8	strip_reply_text-1188584-4.patch	6.02 KB	davideads
#8
#7	strip_reply_text-1188584-3.patch	2.59 KB	davideads
#7
#2	strip_reply_text-1188584-2.patch	1.47 KB	arvinsingla
#2

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Comment #1

Dane Powell CreditAttribution: Dane Powell commented 15 June 2011 at 00:09

Status:

Active

» Needs work

Sounds legit... it would be great if you could roll a patch, and even more amazing if someone else could review it. Otherwise I may not be able to get to this for a while.

Comment #2

arvinsingla CreditAttribution: arvinsingla commented 16 June 2011 at 02:33

File	Size
strip_reply_text-1188584-2.patch	1.47 KB

Here is a patch for my changes. I didn't know what to do with the implementation of 'mailcomment_invoke_mailcomment_alter' in the function since it relies on the $node object so it's currently commented out.

Comment #4

davideads CreditAttribution: davideads commented 9 August 2011 at 21:41

Field testing report: I'm definitely still having problems with this, even after patch is applied. Two problems I've seen: 1) In some cases it just doesn't work at all (no idea why not) and in a couple other tests, the leading '>' character before the reply-to line is causing some extraneous markup to get inserted.

A few years ago I wrote (and, tragically, have now lost) a processor for the Exim mail transport agent which handled chopping up messages in a similar way. It used a couple of passes with regex matching and was WAY more complicated than the matching techniques used in mailcomment currently.

I hate to be a downer, but I suspect for this to be functional across a wide range of use cases this code will need a lot more work and testing.

Comment #5

robokev CreditAttribution: robokev commented 12 August 2011 at 18:41

I'm using the Mail Comment module in Open Atrium and trying to figure out if the problem I am having is due to this Mail Comment bug.

Does this bug manifest itself by leaving HTML formatting codes in the comment instead of stripping them all out? For example, having an orphan angle bracket, div tag, etc. being left in the Comment?

FYI, the context I have is using Gmail as the mailbox.

Comment #6

davideads CreditAttribution: davideads commented 18 August 2011 at 19:54

@robokev -- based on my recent testing, your problems likely do stem from this bug. Usually the cut-line is preceded by a quote marker like > which is currently not accounted for. This can lead to strange results after being processed by some input format filters.

Comment #7

davideads CreditAttribution: davideads commented 18 August 2011 at 21:53

File	Size
strip_reply_text-1188584-3.patch	2.59 KB

I've rolled a new patch that uses preg_match to look for the cutline preceded by any number of >'s or spaces, as well as common "On XX-XX-XXXX so-and-so wrote" lines, adapted from mailcomment filter. The first match position is used as the end of the proper message body. This means that in many cases the cut-line specified by the user won't be the first match. In essence, the cut line becomes the universal fallback for mail clients whose attribution style is not covered by other regexes.

Unlike many techniques I've seen (including the one used by mailcomment filter), no looping over each potential match is required -- the delimiter regexes are concatenated into one big OR regex and applied against the message body once.

I moved the attribution stripping into this function because I believe that the comments should be as clean as possible -- it makes little sense in my mind to keep extraneous/worthless lines in the database and strip them out later with an input format. And what if someone cuts and pastes from an email in a comment when using the mailcomment filter? Some of the message could inexplicably disappear. I'd love to hear what a module maintainer/developer has to say about this.

The patch still needs work (I think the Thunderbird regex could be better, and a new variable is introduced which needs a discussion and needs to be exposed in the administrative UI), but it is working nicely in my initial testing (see https://ecenter.fnal.gov/issues/74 -- comments were sent from thunderbird and gmail). Please help me test!

Comment #8

davideads CreditAttribution: davideads commented 19 August 2011 at 20:25

File	Size
strip_reply_text-1188584-4.patch	6.02 KB

One last patch: This one adds a regex for Outlook Web Express (at least the version we use at my job) and unit tests for message stripping.

Comment #9

Dane Powell CreditAttribution: Dane Powell commented 30 August 2011 at 17:21

Status:

Needs work

» Fixed

I committed #2 with a few minor changes to both branches:
http://drupal.org/commitlog/commit/12314/9bc8da147c800dbe197f5db7f4e9423...
http://drupal.org/commitlog/commit/12314/2fb70b161048b81b1bf78a906629301...

In regards to your question...

And what if someone cuts and pastes from an email in a comment when using the mailcomment filter? Some of the message could inexplicably disappear.

In fact, this is one reason why we don't strip signatures, etc... from messages any more, and rather allow an input filter to hide such stuff. If we strip the messages, content may be lost forever, whereas a filter can always be tweaked to be more or less conservative.

In the end, it boils down to a question of what you consider to be a "message", which should be stored in the database, boils and all. When we planned the 2.x branch, we decided to define a "message" as everything generated by the replying user, including signatures, etc... but to not consider text below "reply above this line..."

For more discussion about this issue please see #1054334: Convert message cleaner and signature separator to input filters

Anyway, if you'd like to clean up the input filters using the code from your other patches here, please do open a new issue. Thanks.

Comment #10

davideads CreditAttribution: davideads commented 30 August 2011 at 18:59

Thanks for the thoughtful reply, Dane. You've obviously thought about this and framed it well, though I can't say I agree with the choices you've made about where a message begins and ends. My apologies in advance if this is a lot of talkin' for a small issue. It matters to me a lot because I don't want to have to maintain custom patches or a custom fork of mailcomment for all my projects, Mailcomment rules and provides a really killer piece of functionality for social and collaborative sites.

One practical problem with the patch from #2 is that it doesn't account for leading characters in the "reply above this line" token. At the very least, I'd say the fix should include the regex pattern from my later patch to account for any number of leading quote markers. Otherwise you can get extraneous characters (which could be rendered in any number of amusing and unexpected ways) and also, the system simply does not work as advertised -- the way it works now, it should really say "reply before this text".

I think using the message cleaner and signature separator in an input filter, especially combined with the token stripping, is really problematic behavior. I have two big projects I'm working on that will use mailcomment, and in neither case do I think "oh, I'd like to strip some of the message at parse time and the rest of it at display time." I absolutely see why you might wish to preserve the original message body in the database, and also why you might wish to heavily process the message before writing it to the database. But why the hybrid approach? And isn't it that much more likely to be prone to strange glitches?

I prefer the parse heavily before writing to the DB approach because, if it works right, it most closely matches how the Drupal db would look if I wasn't running mailcomment at all. For both my projects using mail comment, we keep messages in our mailboxes and will independently archive them, so the original message should always be available if necessary.

Another thought: Once you are stripping above the special token, in most cases mailcomment will already be stripping out the signature (gmail and many other mail clients always put the sig at the bottom, below the quote). So that leaves only the "On XX-XX-XXXX So and So wrote:" line, which, if you've already taken the hatchet to the message aggressively before writing to the DB, is most definitely noise.

That last thought makes me wonder if you'd be open to changing mailcomment's configuration and behavior to support at least two and probably three message parsing modes: 1) No message stripping, 2) "Above this line" stripping, and 3) "Aggressive" stripping (2 and 3 are similar enough they could/should be a single option). Instead of my DRY-violatin' patch above, if mode 2 or 3 are used, the input filter can be called directly on the message. If the user goes with "no stripping" mode (or reply above this line mode), it will be up to the user to implement their own feed processing hooks and/or enable the message cleaning input filters. These modes could come with some useful help text, i.e. "Aggressive message processing runs the message cleaner input filter on the message *before* it is written to the database. It is highly recommended that you configure mailhandler to keep messages on the server when using this mode."

Comment #11

Dane Powell CreditAttribution: Dane Powell commented 31 August 2011 at 16:33

Hey David- thanks for the response. I've thought about it some more and here is what I propose:

Move the actual text-stripping functionality into private functions that can be called either from an input filter or during parsing, and improve the regexes as you propose.
Give the option of stripping messages during parsing, as you wish to do, and also keep input filters as an option.
Reorganize message cleaners to have the following options (again, either during parsing or in an input filter):
- Do nothing (like it sounds)
- Strip Mailcomment-related text (including the 'reply above this line' line and the mailcomment signature - but leave everything else untouched, including content below the 'reply above this line' line)
- Aggressive cleaning (gets rid of quoted replies, signatures, etc...)

I'd have to think about how to unify the interface between Mailhandler and Mail Comment, since Mailhandler needs at least a signature separator, but I don't really want to have a separate "signature remover". Let me know what you think about this plan.

Comment #12

davideads CreditAttribution: davideads commented 31 August 2011 at 17:03

Dane, much thanks. Your plan is exactly what I had in mind -- it preserves the old functionality and afaict covers every possible parsing scenario. I'm not trying to discourage the use of the input filters or any other combination of options.

Why would the "strip mailcomment-related text" cleaner not delete content below the line? That seems like it would be confusing when used alone...

I don't entirely understand your last point. Why does mailhandler need a signature separator?

I can get approval at work to tackle this if you'd like me to write a patch implementing it.

Comment #13

Dane Powell CreditAttribution: Dane Powell commented 31 August 2011 at 23:41

Status:

Fixed

» Active

Why would the "strip mailcomment-related text" cleaner not delete content below the line?

As you pointed out, it's kind of silly to strip everything below that line out and be left with a message that looks like "body body body >On Aug 31, so and so said:". But there's absolutely no point (that I can see) to leaving Mailcomment-generated gunk in the message that ends up in the database. In fact, maybe we should just automatically strip it out and not even make it an option...

I don't entirely understand your last point. Why does mailhandler need a signature separator?

Mailhandler currently has a filter that strips out people's email signatures. I'm just saying that we need to consider how that should integrate with these new filters.

If you have the resources to roll a patch, I would certainly appreciate it- it would allow me to focus on other issues. Just let me know.

Comment #14

davideads CreditAttribution: davideads commented 1 September 2011 at 15:16

But there's absolutely no point (that I can see) to leaving Mailcomment-generated gunk in the message that ends up in the database. In fact, maybe we should just automatically strip it out and not even make it an option...

Cosign. So would this imply only two modes? No stripping and aggressive stripping?

If you have the resources to roll a patch, I would certainly appreciate it- it would allow me to focus on other issues. Just let me know.

I got approved to work on it, so I have resources to roll this one, though it could be a week or two since I don't have the final say in my development priorities. I hope to have something on this by some time next week.

Comment #15

Dane Powell CreditAttribution: Dane Powell commented 1 September 2011 at 18:26

Maybe "none", "basic cleaner" and "aggressive cleaner". Each cleaner can either be run as a filter (filters with these names already exist) or as a stripper/parser. The reason for offering people the option of having no cleaner is that in order to run the aggressive cleaner as a filter, you might need to have the "reply above this line" line to key off of. So we might not want to automatically strip that out.

If that sounds good, I'll look forward to a patch!

Comment #16

Dane Powell CreditAttribution: Dane Powell commented 6 September 2011 at 22:43

Version:

6.x-2.0-beta1

» 6.x-2.x-dev

Hey David, is the ETA on this still this week? I'd love to incorporate it into 6.x-2.0-rc1 sometime this week.

Comment #17

davideads CreditAttribution: davideads commented 7 September 2011 at 16:09

Dane, I'm not sure. I should have some time to work on this tomorrow, and will update then.

Comment #18

davideads CreditAttribution: davideads commented 9 September 2011 at 12:58

Dane -- hopefully I don't get sidetracked at work. If not, I will try and have a patch today.

Comment #19

davideads CreditAttribution: davideads commented 9 September 2011 at 19:41

Quick update on this: Finally free to work on it, so I should have this done early next week. Now that I'm digging in again, I'd like to explore the possibility of dropping the basic cleaner as an option (while of course keeping it as a filter option). The basic cleaner seems to simply strip off a few trailing lines, which doesn't work seem to work very well in my tests.

Comment #20

davideads CreditAttribution: davideads commented 13 September 2011 at 18:45

File	Size
strip_reply_text-1188584-5.patch	14.47 KB

Okay, patch is here. This adds the behavior we discussed. Here are some subtleties that may be cause for discussion:

Added cutting below the reply separator to the basic filter.
Added update hook: If reply separator variable is set, switch to basic filtering to emulate prior behavior, otherwise set mode to no filtering.
Added tests for the aggressive filtering function.
Fixed minor style mistake (incorrect comment line) in existing update hook.

I've testing this pretty thoroughly at this point. A test thread can be found (at the moment) at https://ecenter.fnal.gov/issues/87

These are all individual commits in my git branch if we need to re-roll patch and omit / change any of these behaviors.

Comment #21

davideads CreditAttribution: davideads commented 21 September 2011 at 17:25

File	Size
strip_reply_text-11188584-6.patch	15.2 KB

New patch: Added Roger Saner's Mac Mail 4.2 regex from #597756: Mac Mail and old Outlook emails not being cleaned properly with improvements, an additional test to cover the new Mac Mail style. and cleaned up inconsistent use of quotation marks.

Comment #22

Dane Powell CreditAttribution: Dane Powell commented 16 October 2011 at 18:52

Version:	6.x-2.x-dev	» 7.x-2.x-dev
Status:	Active	» Patch (to be ported)

Thanks for your help with this David- I've committed #21 with some minor changes to the 6.x-2.x branch:
https://drupal.org/commitlog/commit/12314/4966fdb8cba28e682b85d2dd68856a...

Comment #23

Dane Powell CreditAttribution: Dane Powell commented 16 October 2011 at 19:14

Status:

Patch (to be ported)

» Fixed

Committed to 7.x-2.x:
http://drupal.org/commitlog/commit/12314/f2087fd26ff11325a132527c18f858d...

Comment #24

davideads CreditAttribution: davideads commented 18 October 2011 at 14:22

Dane, this is awesome, thanks for committing! Now we can gently mock Github every time their email integration fails to strip a quote line :)

Comment #25

1 November 2011 at 14:30

Status:

Fixed

» Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Comment #26

MickC CreditAttribution: MickC commented 27 November 2013 at 05:48

I had to add a couple of lines to the regex for Mac Mail 4.6 using dd/mm/yyyy - for one or two word names:

"On \d{1,2}\/\d{1,2}\/\d{4}, at \d{1,2}:\d{2} [AP]M, [a-zA-Z0-9._%-]+@?[a-zA-Z0-9.-]*[\.]?[a-zA-Z]{0,4} wrote:", // Mac mail 4.6
"On \d{1,2}\/\d{1,2}\/\d{4}, at \d{1,2}:\d{2} [AP]M, [a-zA-Z0-9._%-]+@?[a-zA-Z0-9.-]*[\.]?[a-zA-Z]{0,4} [a-zA-Z0-9._%-]+@?[a-zA-Z0-9.-]*[\.]?[a-zA-Z]{0,4} wrote:", // Mac mail 4.6