I'm a bit worried about this line: $attachment_text = shell_exec($helper_command);

Does anyone know what happens if the document is, e.g., a 150 MB Word document? Is there a way to prevent PHP from running out of memory in such a case? Or don't I need to worry about this?

Comments

markj

Wouldn't this depend on the helper app? I think timeouts are a real issue, and will try to add some check to protect against them (IIRC search.module does this so I'll take a look).

Jürgen Depicker

My question was more about the possibility of PHP running out of memory, since the whole text content of the file is assigned to the variable $attachment_text in one go. I think this cannot work for huge documents, which means parts of the attachment (or the whole attachment?) will not get indexed.
I want to know whether my assumption is right or wrong. If it is right, I would suggest chopping up the output into temp files, each half the size of the PHP memory limit, and indexing them part by part. It is rather complicated, since we need to keep track of which part we're indexing in case a timeout occurs. I did learn how to do that from the Pro Drupal Development book, but I just don't know whether my assumption is right in the first place. I could check it empirically, of course...
Any ideas, anyone?
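
For illustration, here is a minimal sketch of how the helper's output could be streamed to a temp file in small chunks rather than captured in one go with shell_exec(). It is not the module's code: the function name stream_helper_output() and the 8192-byte chunk size are assumptions made up for this example, and only standard PHP calls (popen(), fread(), tempnam()) are used.

```php
<?php
// Hypothetical helper (not part of the module): copy a command's output
// to a temp file chunk by chunk, so that only $chunk_bytes of it is held
// in PHP memory at any moment.
function stream_helper_output($helper_command, $chunk_bytes = 8192) {
  $temp_path = tempnam(sys_get_temp_dir(), 'attach_');
  if ($temp_path === FALSE) {
    return FALSE;
  }
  $out = fopen($temp_path, 'wb');
  if ($out === FALSE) {
    return FALSE;
  }
  $pipe = popen($helper_command, 'r');
  if ($pipe === FALSE) {
    fclose($out);
    return FALSE;
  }
  while (!feof($pipe)) {
    $chunk = fread($pipe, $chunk_bytes);
    if ($chunk === FALSE) {
      break;
    }
    if ($chunk !== '') {
      fwrite($out, $chunk);
    }
  }
  pclose($pipe);
  fclose($out);
  // The caller is responsible for deleting the temp file when done.
  return $temp_path;
}
```

With something like this, memory use would depend on the chunk size rather than on the size of the extracted text. It does nothing about timeouts, though.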

markj

I don't think that a buffer overflow is an issue, as long as PHP handles running out of memory sanely. However, timing out due to large attachments is a real issue. Running cron.php from the command line would reduce that problem, but most Drupal admins run cron.php via wget or some other HTTP client. Perhaps writing out text to temp files, as you suggest, would be a good option for large files.

Would adding a setting that lets admins define the largest file that can be indexed work as an (admittedly bad) short-term solution, until we can figure out something that uses the function you mention, which I think is PHP's register_shutdown_function()?
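
As a rough sketch of what such a size cap could look like (the function name attachment_is_indexable() and the way the limit is passed in are assumptions for illustration; in the module the limit would presumably come from an admin setting):

```php
<?php
// Hypothetical check: skip files larger than an admin-defined limit.
// $max_bytes is assumed to come from a module setting.
function attachment_is_indexable($path, $max_bytes) {
  $size = @filesize($path);
  // filesize() returns FALSE for missing or unreadable files.
  return $size !== FALSE && $size <= $max_bytes;
}
```

Note that this checks the size of the attachment on disk, not the size of the text the helper extracts from it, so it is only a coarse guard.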

Jürgen Depicker

Mmm... If PHP times out, or if there's a buffer overflow, attachments don't get indexed properly (I think; is this right?).
So I think we should always pipe the output to a temporary file first.
Secondly, we should somehow chop it into parts of, let's say, 500000 words (perhaps we could use sed for that? I have no experience with it; Google to the rescue ;-)).
Thirdly, these 'pieces' should be indexed one by one.

We should use the variable_set() function to store info about the temp filename, the number of parts, and the last part that was completed. A handler registered with register_shutdown_function() can do that for us. A good example is on page 209 of Pro Drupal Development (great book, oh yes!).
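
To make the idea concrete, here is a minimal sketch under several assumptions. All attachment_* names and the 'attachment_index_progress' variable name are hypothetical. The temp file is split by bytes rather than by words, which is simpler but can cut a word or a multibyte character in half at a boundary. The variable_get()/variable_set() stand-ins only exist so the sketch runs outside Drupal.

```php
<?php
// Stand-ins so the sketch runs outside Drupal; inside Drupal the real
// variable_get()/variable_set() would be used instead.
if (!function_exists('variable_set')) {
  $GLOBALS['sketch_vars'] = array();
  function variable_set($name, $value) {
    $GLOBALS['sketch_vars'][$name] = $value;
  }
  function variable_get($name, $default) {
    return isset($GLOBALS['sketch_vars'][$name]) ? $GLOBALS['sketch_vars'][$name] : $default;
  }
}

// Hypothetical shutdown handler: persist the progress record, so a later
// run can resume even if this one is cut off by a timeout.
function attachment_save_progress() {
  if (isset($GLOBALS['attachment_progress'])) {
    variable_set('attachment_index_progress', $GLOBALS['attachment_progress']);
  }
}

// Hypothetical driver: index $temp_path in parts of $part_bytes bytes,
// passing each part to $index_callback and recording completed parts.
function attachment_index_in_parts($temp_path, $part_bytes, $index_callback) {
  $saved = variable_get('attachment_index_progress', NULL);
  $parts = (int) ceil(filesize($temp_path) / $part_bytes);
  // Resume only if the saved record refers to the same temp file.
  $done = ($saved && $saved['file'] === $temp_path) ? $saved['done'] : 0;
  $GLOBALS['attachment_progress'] = array(
    'file' => $temp_path,
    'parts' => $parts,
    'done' => $done,
  );
  register_shutdown_function('attachment_save_progress');

  $handle = fopen($temp_path, 'rb');
  fseek($handle, $done * $part_bytes);
  while ($GLOBALS['attachment_progress']['done'] < $parts) {
    $text = fread($handle, $part_bytes);
    call_user_func($index_callback, $text);
    $GLOBALS['attachment_progress']['done']++;
  }
  fclose($handle);
  attachment_save_progress();
}
```

If a run is interrupted, the shutdown handler stores how many parts were completed, and the next call with the same temp file starts from that part.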

I haven't got time to look into this deeply right now, but I'm certainly ready to cooperate if you want to give this a try. A very busy week ahead...

markj

I agree potential timeouts are a problem, and I'm sure we can work something out. I'll make sure we address this problem, probably along the lines you suggest, once other core functionality (searching attachments separately from node text, better display, etc.) stabilizes. I too have a busy couple of weeks ahead but will have more time to look into this toward the end of June.

texas-bronius

Good forward thinking. But has anyone actually experienced a buffer overflow or timeout yet?

markj

Title: Potential buffer overflow? » register_shutdown_function added to version 5.x-4
Version: » 5.x-3.0
Status: Active » Fixed

I've just added register_shutdown_function() plus the required variables to the second version of 5.x-4-dev (to be released once I fix a couple of other things).

Also, 5.x-4 has a feature that allows the admin to define how many files are indexed at one time, which should provide a way to reduce timeouts when indexing sites with large numbers of files.

Anonymous

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.