Feedapi doesn't scale due to high memory usage. Is it true?
| Project: | FeedAPI |
| Version: | 6.x-1.9-beta2 |
| Component: | Code parser_common |
| Category: | support request |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | active |
Hi folks.
I'm facing the dreaded "allowed memory size of XXX exhausted" again. Here's my story:
- Installed FeedAPI with node creation a few years ago
- Added some feeds (28 in total)
- Started to fetch the feeds
- A few months later, started to get "memory exhausted" errors. My PHP memory limit was 16MB, so I bumped it to 32MB
- One month after, feeds stopped with same memory problem. Bumped the config to 64MB
- Started to use all memory again a few months later. I started to investigate the issue and saw many users having the same problem. Well, maybe cron is not having time to complete, so the nodes to expire just pile up, adding to the problem...
- Switched then from cron.php to drush. Everything went ok... for one month.
- Bumped the memory limit to 128MB. Memory was exhausted again a couple of months after doing that.
- Read about a memory leak in SimplePie. Applied the patch, to no avail. Still facing memory exhausted errors.
- I started to get the memory exhausted error in all cron runs, so this bug was disrupting the service of my site. I decided then to switch from SimplePie to the Common Syndication Parser.
- No change. Still seeing "memory exhausted" errors (which also tells us that it's not a SimplePie problem).
So what do I need to do? Bump memory to 256MB? 1GB? Buy a web farm just to aggregate content?
I'm wondering why the module needs so much memory to parse a simple XML file! Note that none of my feeds' RSS files are that "big", in the sense that you shouldn't need more than a few MBs to parse them.
Upon investigating further, I'm guessing that this problem may not be related to feed size, but to the number of feed items you keep in your database. My feeds are set up to only expire items that are 6 months old, or even older than that. There are several thousand feed items in the database at any time. One thing that makes me think the problem is in expiring nodes is that if I refresh some feed using my browser I get the WSOD, but if I reload the site I can see in the session messages that the module expired some very, very old feed items, which makes me think that expiration is not working as expected.
Currently I'm getting the error on all cron jobs, and all my feeds are stalled. Comments and suggestions to make things work again are welcome.

#1
The memory consumption of feedapi/aggregation _is_ high. That said, there seems to be something off in your case. Here are the options I see:
- Use drush to refresh feeds. This lowers the memory usage. It is a FeedAPI Drupal 6 feature, but the drush integration should be easy to backport.
- The number of feeds you're handling seems low. Are they very large? You could force FeedAPI to refresh them less often. Again, this is a standard feature in Drupal 6; in Drupal 5 you will need to patch the module.
- Profile the application with xdebug + kcachegrind.
Not a critical issue (we use 'critical' only for features that are critical for the next planned release).
#2
Hi Alex. Thanks for replying.
I'm using drush already. In fact, I was forced to do that, because my cron runs were running forever and being aborted by the PHP timeout (yes... I adjusted the timeout setting even to a lot-more-than-reasonable value just for testing, and it didn't solve it).
Yep, 28 feeds is a relatively low number, and they're not big. But as I said, I have the impression that the problem is more related to expiring nodes than to RSS parsing itself. I understand that aggregating RSS uses some memory, but c'mon, 128MB is way too much for such a task! I have some modules that may provide expensive hook_node() functions, so they may be the culprit. I will dig into it soon.
#3
Care to xdebug/kcachegrind an isolated version of your setup?
#4
Never heard of such beasts before. Any pointer for downloads/docs?
Can I enable them temporarily on a production site just to collect the needed statistics? I mean, will those tools disrupt service if I run them on my production site for a few minutes or so (e.g. during a cron run)? I'm more willing to do that if it would not cause any problem for the site. Duplicating the site to debug it will take more effort and more time, because I'm currently loaded with work.
FYI, I'm using Feedapi on a D5 site on a Debian etch server (standard OS Apache and PHP5 packages).
#5
You need to get a VPS and set PHP's max. execution time high if you have so many feeds.
#6
#4:
http://www.google.com/search?client=safari&rls=en&q=xdebug&ie=UTF-8&oe=U...
http://www.google.com/search?hl=en&client=safari&rls=en&q=kcachegrind&aq...
Wouldn't run them on production.
If you're doing critical aggregation-based applications, you'll want to have a good staging/testing environment to do profiling.
#7
I guess I'll just add my anecdotal experience. I have an application monitoring 185 feeds (Drupal 6). The feeds are mixed, some update frequently, others less so - although all are refreshed every 30 minutes. I'm not sure what the actual memory usage is, but the PHP memory limit is set to 256MB (dedicated server). We haven't had any performance problems thus far.
As you can see by the mention in this post, feed monitoring is quite resource intensive.
#8
I have the same problem. I'm running version 6.0 beta2 and refreshing with Drush (the shell script at developmentseed). There are more than 10K feeds to parse, and I refresh only one feed at a time. After a couple of hours the RAM gets full: of 2 GB total, almost 1.9 GB is used.
Today I found that the cache directories under files/parser_common_syndication* contain a large number of files, many of them in the range of 100s of MBs. Deleting all of the files for some reason freed the RAM back to its normal free value (1GB free).
I suspect that some function is reading and storing all the files into memory, and as more files are added the buffer gets full (?)
I expected the cache to be cleared frequently on cron runs (?), so I checked whether apache had permission on the files. Since I was running Drush as root, all the temporary files were created by root. So I re-ran drush as apache. Still the same issue.
As a temporary workaround I plan to write a cron task that deletes all the files frequently. Could it have a side effect?
I hope my finding helps. I'm not very familiar with the code, so cannot help with debugging.
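One way to confirm whether the cache is really growing between runs is to record its file count and total size over time. The following is a minimal sketch in Python, not part of FeedAPI; the glob comes from the post above, while the function names and the relative path are assumptions for illustration.

```python
import glob
import os


def iter_cache_files(pattern):
    """Yield regular files matched by the glob, descending into matched directories."""
    for path in glob.glob(pattern):
        if os.path.isfile(path):
            yield path
        elif os.path.isdir(path):
            for root, _dirs, names in os.walk(path):
                for name in names:
                    full = os.path.join(root, name)
                    if os.path.isfile(full):
                        yield full


def summarize_cache(pattern, top=5):
    """Return (file_count, total_bytes, largest), where largest is a list of (size, path)."""
    sizes = [(os.path.getsize(p), p) for p in iter_cache_files(pattern)]
    sizes.sort(reverse=True)  # biggest files first
    return len(sizes), sum(size for size, _ in sizes), sizes[:top]


if __name__ == "__main__":
    # Assumed location, relative to the Drupal root; adjust to the site's files directory.
    count, total, largest = summarize_cache("files/parser_common_syndication*")
    print(f"{count} files, {total / 1e6:.1f} MB total")
    for size, path in largest:
        print(f"{size / 1e6:10.1f} MB  {path}")
```

Running it before and after a refresh would show whether the number or size of the files keeps climbing.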
#9
I'm not sure if my conclusion above and my decision to purge the cache are good ones. Maybe feedapi is expected to behave that way, and the large number of feeds that I have is what is causing this issue. Can someone tell me if it is OK to delete the cache files without any unwanted consequences (other than the extra time it will take to fetch the data again)?
#10
Deleting caches is fine.
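The cleanup task proposed in #8 could look roughly like the sketch below, written in Python as an assumption-laden illustration rather than anything shipped with FeedAPI. The function name, the one-hour age threshold and the relative path are hypothetical; only the parser_common_syndication* glob comes from the thread. It removes only files older than the threshold, so a refresh that is currently writing a file is less likely to be disturbed.

```python
import glob
import os
import time


def purge_old_cache_files(pattern, max_age_seconds, now=None, dry_run=False):
    """Delete files matched by the glob (or inside matched directories) whose
    modification time is at least max_age_seconds in the past.

    Returns the paths that were removed (or would be removed, with dry_run).
    """
    now = time.time() if now is None else now
    removed = []
    for path in glob.glob(pattern):
        if os.path.isdir(path):
            candidates = [
                os.path.join(root, name)
                for root, _dirs, names in os.walk(path)
                for name in names
            ]
        else:
            candidates = [path]
        for candidate in candidates:
            try:
                if not os.path.isfile(candidate):
                    continue
                if now - os.path.getmtime(candidate) < max_age_seconds:
                    continue  # recent file, possibly still in use
                if not dry_run:
                    os.remove(candidate)
                removed.append(candidate)
            except OSError:
                # File vanished or is not removable by this user; skip it.
                continue
    return removed


if __name__ == "__main__":
    # Assumed path and threshold; dry_run only reports what would be deleted.
    for path in purge_old_cache_files(
        "files/parser_common_syndication*", max_age_seconds=3600, dry_run=True
    ):
        print("would delete", path)
```

If scheduled from the system crontab, it should run under an account that can delete the files; as noted in #8, files created while running drush as root are owned by root.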
#11
Related: #608812: Option to prevent a node from being cached when calling node_load()