Situation:
Using Mailhandler in combination with listhandler to import an old mailman mailing list archive via a mail server. Most of the time, the mails stored on such a server have their "order received" IDs in reverse order, because the mailing list was transferred to another email server.
In these cases the newer emails are read in before the older ones. When listhandler then reconstructs the threads, the thread ordering is wrong. This is because no topic existed yet when the newer email was read, so one was created for it. An older email that comes in afterwards is then linked BELOW the newer message, because its In-Reply-To ID is the same. If there is no In-Reply-To ID, the topic is the same, so it is assumed to belong to the same thread, with the same wrong result.
What the behavior should be:
Before starting to read emails from the server, we should sort them by their arrival date at the email server. Since nobody can reply to an email that has not yet been received, this will always work.
What to do:
Therefore the messages should be sorted with the SORTARRIVAL criterion (as used with PHP's imap_sort()), which is implemented on all current IMAP servers known, before iterating through the emails to retrieve.
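As a rough illustration of what a server-side arrival sort looks like at the protocol level, here is a minimal sketch using Python's imaplib rather than PHP's imap extension. It assumes the server supports the IMAP SORT extension; the function name `unseen_by_arrival` and the host and credentials in the usage comment are hypothetical, and this is not the attached patch.

```python
from typing import List


def unseen_by_arrival(conn) -> List[int]:
    """Return unseen message numbers, oldest arrival first.

    `conn` is expected to behave like a logged-in imaplib.IMAP4 object
    with a mailbox already selected.
    """
    # SORT (ARRIVAL) orders by the server's internal date, i.e. arrival time.
    status, data = conn.sort("(ARRIVAL)", "UTF-8", "UNSEEN")
    if status != "OK":
        raise RuntimeError("server rejected SORT: %r" % (data,))
    # imaplib returns one space-separated bytes string of message numbers.
    if not data or not data[0]:
        return []
    return [int(n) for n in data[0].split()]


# Hypothetical usage (host and credentials are placeholders):
#   import imaplib
#   conn = imaplib.IMAP4_SSL("imap.example.org")
#   conn.login("user", "secret")
#   conn.select("INBOX")
#   for msgno in unseen_by_arrival(conn):
#       ...  # fetch and import message `msgno`
```

The import loop would then walk the returned numbers in order instead of counting up from 1.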
Disadvantage:
The sort behaviour does not alter any other behaviour. However, if the mailbox contains a lot of unread emails that need to be read via mailhandler, e.g. more than 100,000, the import of the mail could slow down.
A patch to solve the issue is attached; it has been tested with a reasonable dataset.
| Comment | File | Size | Author |
|---|---|---|---|
| #8 | imap_perf.zip | 2.64 KB | ilo |
| | mailhandler.retrieve.inc_.patch | 1.32 KB | cor3huis |
Comments
Comment #1
cor3huis commented
Comment #2
cor3huis commented
Comment #3
cor3huis commented
Changed status. It would be great if this were integrated into the main dev version, with a new patch made if needed against the 6.x-1.x-dev version of 14 March 2011, which has all of ilo's improvements.
Comment #5
cor3huis commented
Thanks for testing. I will make a new patch, but I am unsure which version of 6.x-1.x to make it against.
It would be great if anyone could guide me there.
Comment #6
ilo commented
cor3huis, I'll take this one and provide a patch later.
Comment #7
ilo commented
After some research, I've found that for IMAP it could be a great improvement, but practically a blocker for POP accounts, because PHP's imap extension downloads the full message body when sorting POP accounts. On the other hand, since the module retrieves a message by its number, I think it will be easier to sort the array of message numbers instead.
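The idea of sorting the array of message numbers on the client side can be sketched as follows. This is a minimal illustration in Python, not mailhandler's PHP code: it assumes the arrival time of each message is already known as a Unix timestamp (for example taken from the headers), and the name `order_by_arrival` is hypothetical.

```python
from typing import List, Mapping


def order_by_arrival(arrival: Mapping[int, int]) -> List[int]:
    """Return message numbers sorted oldest arrival first.

    `arrival` maps message number -> arrival time (Unix timestamp).
    Ties are broken by message number so the order is deterministic.
    """
    return sorted(arrival, key=lambda msgno: (arrival[msgno], msgno))


# A mailbox whose message numbers run opposite to arrival order.
arrival = {1: 1300000300, 2: 1300000200, 3: 1300000100}
print(order_by_arrival(arrival))  # prints [3, 2, 1]
```

Only the list of numbers is reordered; each message is still retrieved by its number exactly as before, so no message bodies need to be downloaded just to sort.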
I have also found that it might be quicker to call imap_headers() to return all the headers in a single network call, with a great performance boost on larger mailboxes. The downside is that the headers might be limited to a 25-character subject, according to a comment on PHP's site.
Cor3huis, do you have a large inbox to do some performance testing? I'll upload a pair of scripts to do so.
Comment #8
ilo commented
I'm not getting any significant improvement using imap_headers(). It looks like imap_headerinfo() performs a call and caches the headers, using that cached information in all subsequent imap_headerinfo() calls.
as_mailhandler.php: imap > Unread messages: 443 time 14.690289974213 secs
fetch_headers.php: imap > Unread messages: 443 time 15.039510011673 secs
The 'not so funny part' is that the same logic as in mailhandler does not return the same results using POP as protocol (against the same email account):
as_mailhandler.php: pop > Unread messages: 276 time 102.59804010391 secs
fetch_headers.php: pop > Unread messages: 0 time 130.51864504814 secs
Comparing sorted and unsorted:
as_mailhandler.php: imap > Unread messages: 443 time 14.128609113693 secs
as_mailhandler_imap_sorted.php: Imap > Unread messages: 443 time 15.430547952652 secs (once it reached: 13.430300951004 secs)
Using Apache's benchmark tool (ab), the results are more or less the same.
The mailbox information I've used:
stdClass Object
(
[flags] => 31
[messages] => 702 <-- total messages
[recent] => 0
[unseen] => 443 <-- new messages as processed by mailhandler.
[uidnext] => 704
[uidvalidity] => 2
)
So, after this, I come to the conclusion that sorting is not such a pain, even though it performs a little slower (as a background process that might not matter much). I also think that we should reconsider the whole retrieval logic as long as POP and IMAP results are not the same.
Comment #9
ilo commented
Just out of curiosity, cor3huis, which server does not return the email list sorted by arrival date using the default imap_open()? I'm asking because Gmail does.
Comment #10
ilo commented
Cor3huis, considering your description, and in a setup where there are several mailboxes, do you think it makes sense to parse all headers of all mailboxes, sort the messages by their arrival date (mixing mails from all mailboxes), and then perform a retrieve operation from all these mailboxes, reading the emails according to this arrival date (and not only per mailbox)? This is the only way I can think of to get all emails from all mailboxes in the correct order.
Marking as critical, as it may affect the way the current retrieval process works.
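The cross-mailbox ordering described above can be sketched as a merge of per-mailbox lists. This is a minimal illustration in Python under stated assumptions, not a proposed implementation: the mailbox names, the `(arrival, message number)` pairs, and the function name `merged_by_arrival` are all hypothetical.

```python
import heapq
from typing import Dict, Iterator, List, Tuple


def merged_by_arrival(
    mailboxes: Dict[str, List[Tuple[int, int]]]
) -> Iterator[Tuple[int, str, int]]:
    """Yield (arrival, mailbox, msgno) across all mailboxes, oldest first.

    `mailboxes` maps a mailbox name to a list of
    (arrival timestamp, message number) pairs in any order.
    """
    # Sort each mailbox on its own, then merge the sorted streams lazily.
    streams = [
        sorted((arrival, name, msgno) for arrival, msgno in messages)
        for name, messages in mailboxes.items()
    ]
    return heapq.merge(*streams)


mailboxes = {
    "list-a": [(300, 1), (100, 2)],
    "list-b": [(200, 1)],
}
for arrival, name, msgno in merged_by_arrival(mailboxes):
    print(arrival, name, msgno)
```

Each yielded tuple says which mailbox to read from next and which message number to retrieve there, so the import follows the global arrival order rather than finishing one mailbox before starting the next.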
Comment #11
danepowell commented
Sorry, 6.x-1.x is no longer a supported release. Please try upgrading to 6.x-2.x and reopen if still an issue.