I need to parse large (5000+ entries) RSS files for a project I'm working on.

So far I've crashed Apache once, blown past PHP's memory and execution time limits by a mile, and not managed to process even 10% of the file before something goes wrong.

Looking at how FeedAPI processes RSS files, I have a feeling my only option, short of changing FeedAPI's internals, is not to use large files.

Has anyone looked into how you could reasonably process large RSS files in chunks (a batch of items at a time)?

Or should I just bite the bullet and implement a custom solution?

Comments

quickcel’s picture

I'm not sure if these will help, but have you tried increasing the max memory and execution time for PHP?

For example, in the settings.php file you can use:

ini_set('max_execution_time', 300); #Number value is in seconds
ini_set('memory_limit', '96M'); #reference link: http://drupal.org/node/207036

Or, if it's timing out during cron, have you tried adjusting the "Cron Time for FeedApi %" variable in admin/settings/feedapi?

pixelantegroup’s picture

I'm at 2400 seconds, and 768M of memory.

At some point, when the files grow "too large", it is simply not possible to process them in one go; they have to be processed in chunks. FeedAPI insists on processing the whole file: it stores all the items in an array and then iterates through that array to do the database operations. That approach won't work for large files, because no matter how high you set your limits, they won't be enough.
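For illustration, here is a minimal sketch of what chunked processing could look like outside FeedAPI, using PHP's XMLReader to stream the file so that only one batch of items is in memory at a time. The function name `rss_item_chunks`, the default chunk size of 500, and the choice to keep only `title` and `link` are assumptions for the example, not anything FeedAPI provides. It also assumes a PHP version with generators and scalar type hints (PHP 7 or later).

```php
<?php
/**
 * Hypothetical helper (not part of FeedAPI): yields RSS <item> elements
 * in arrays of at most $chunkSize, reading the file as a stream instead
 * of loading the whole document into memory.
 */
function rss_item_chunks(string $path, int $chunkSize = 500): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open feed: $path");
    }

    // Move the cursor forward until it sits on the first <item> element.
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
            break;
        }
    }

    $doc = new DOMDocument();
    $chunk = [];
    while ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
        // Expand only the current <item> into a small DOM fragment.
        $item = simplexml_import_dom($doc->importNode($reader->expand(), true));
        $chunk[] = [
            'title' => (string) $item->title,
            'link'  => (string) $item->link,
        ];

        if (count($chunk) === $chunkSize) {
            yield $chunk;   // hand one full batch to the caller
            $chunk = [];    // and release it before reading further
        }

        // Jump to the next <item>, skipping the current item's subtree.
        if (!$reader->next('item')) {
            break;
        }
    }
    $reader->close();

    // Yield whatever is left over as a final, smaller batch.
    if ($chunk !== []) {
        yield $chunk;
    }
}
```

A caller would loop with `foreach (rss_item_chunks('/path/to/feed.xml') as $chunk) { ... }` and do the database operations for each batch inside the loop, so memory use is bounded by the chunk size rather than by the size of the feed.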

quickcel’s picture

Is there a way you can break your file into smaller parts then?

Can you page through the entries in one feed, loading 500 at a time? Or maybe set up multiple feeds that handle different parts of the larger feed?

Not sure what the solution might be; just trying to brainstorm.

aron novak’s picture

If you have shell access to the box, I suggest using drush.
One possible way is described here: http://developmentseed.org/blog/2009/jun/24/feedapi-and-drush-refresh-yo...
With drush, you can fine-tune the PHP settings for the CLI environment without touching the settings for the web server's PHP instance.
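One way to keep the two environments separate from inside PHP itself, rather than through a separate CLI php.ini, is to raise the limits only when the script runs under the command-line SAPI. The sketch below is an assumption about how that might look in settings.php; the helper name `raise_limits_for_cli` and the default values are hypothetical and are not part of drush or Drupal.

```php
<?php
/**
 * Hypothetical helper: raise PHP limits only for command-line runs
 * (such as drush), leaving the web server's limits alone.
 * Returns true if the limits were changed, false otherwise.
 */
function raise_limits_for_cli(string $sapi, string $memory = '512M', int $seconds = 0): bool
{
    if ($sapi !== 'cli') {
        return false; // web requests keep their normal limits
    }
    ini_set('memory_limit', $memory);
    // 0 means "no time limit", which is already PHP's default on the CLI.
    ini_set('max_execution_time', (string) $seconds);
    return true;
}

// PHP_SAPI is 'cli' when PHP is started from the command line.
raise_limits_for_cli(PHP_SAPI);
```

Under Apache, `PHP_SAPI` has a different value (for example `apache2handler`), so the call does nothing there.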