There is a memory leak in the module for very large XML files:

PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 85 bytes) in /var/www/drupal/sites/all/modules/feeds_xpathparser/FeedsXPathParserBase.inc on line 73

(If somebody wants to reproduce this error, just drop me a line and I'll upload a small feature (including an example importer/content type) and the XML file.)

Because I want to import some large XML files, I have created a patch for that ;-)
My approach was to fix FeedsXPathParserBase using the batch importing support from the Feeds module (the same way FeedsCSVParser does).
The limit for each loop is $source->importer->getLimit(), which seems to get its value from the FEEDS_PROCESS_LIMIT definition at the top of FeedsProcessor.inc.
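For reference, the mechanism looks roughly like this (a minimal sketch of the Feeds 7.x batching API, not the attached patch itself):

// Inside FeedsXPathParserBase::parse($source, $fetcher_result), sketch only.
$state = $source->state(FEEDS_PARSE);

// Batch size per run; FeedsImporter::getLimit() ends up at the
// FEEDS_PROCESS_LIMIT default defined in FeedsProcessor.inc.
$limit = $source->importer->getLimit();

// Resume where the previous batch run stopped.
$start = isset($state->pointer) ? $state->pointer : 0;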


Comments

Shuairan’s picture

Status: Active » Needs review
FileSize
2.87 KB
twistor’s picture

Status: Needs review » Needs work

A few notes first:

  1. It's not a memory leak, it's just using that much memory.
  2. This is not the same method the CSVParser uses. That parser stores a pointer to a line in a file and continues reading from that file.
  3. At first glance, I don't think this patch will work for an item count greater than FEEDS_PROCESS_LIMIT. The state of the object isn't stored between batch runs, i.e. $this->all_nodes and $this->xpath. That's why the $state object is there.

That said, I'm not entirely opposed to this idea, but it will have to be a bit different. You keep a pointer to the node you are parsing, but you have to reload the entire XML file each time. There could still be some memory savings with this approach, since the $items array you create will be smaller, but I'm not sure how great they would be.
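In other words, something along these lines (a rough sketch of the approach described above; parseContextNode() is a placeholder, not an actual method of this module):

$state = $source->state(FEEDS_PARSE);
$pointer = isset($state->pointer) ? $state->pointer : 0;

// The whole document has to be reloaded on every batch run.
$doc = new DOMDocument();
$doc->loadXML($fetcher_result->getRaw());
$xpath = new FeedsXPathParserDOMXPath($doc);
$all_nodes = $xpath->namespacedQuery($source_config['context'], NULL, 'context');

// Only build $items for the slice belonging to this batch run.
$limit = $source->importer->getLimit();
$items = array();
for ($i = $pointer; $i < $all_nodes->length && $i < $pointer + $limit; $i++) {
  $items[] = $this->parseContextNode($all_nodes->item($i), $source_config);
}
$state->pointer = $i;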

If I'm mistaken about any of the above, let me know.

Also, if you could send me your example, I'd love to take a look at it.

twistor’s picture

Also, patches should be created from the module directory, not the root.

elliotttf’s picture

Status: Needs work » Needs review
FileSize
2.48 KB

I think this patch is actually valid. While it won't solve the memory issue with an insanely large file, it will help avoid memory exhaustion that could arise from whatever the processors might be doing. For example, I've recently been working with a large feed (about 1000 items) that creates file entities from images. At around 300 items, memory is exhausted and the import halts. Batching the process prevents this from being an issue, since the memory the processors are eating up is released after the batch is reloaded.

I've re-rolled the patch against 7.x-1.x.

I think there should be a separate issue to tackle handling large files; however, I suspect that will be more complicated, since you'd have to come up with a way to ensure the integrity of the XML across file chunks, or the XPath parsing might not work.

twistor’s picture

Category: bug » feature
Status: Needs review » Needs work

I like this patch, but why these changes?

-    $this->xpath = new FeedsXPathParserDOMXPath($this->doc);
+    if (empty($this->xpath)) {
+      $this->xpath = new FeedsXPathParserDOMXPath($this->doc);
+    }
+

...

-    $all_nodes = $this->xpath->namespacedQuery($source_config['context'], NULL, 'context');
 
-    foreach ($all_nodes as $node) {
+    if (empty($this->all_nodes)) {
+      $this->all_nodes = $this->xpath->namespacedQuery($source_config['context'], NULL, 'context');
+    }
elliotttf’s picture

This way the xpath and all_nodes variables will persist between batches since the object will be reused.

twistor’s picture

The object is not reused though. It's just the $state object that is kept.

twistor’s picture

Title: Allowed memory size error (patch included) » Implement batch parsing.
Version: 7.x-1.0-beta3 » 7.x-1.x-dev
randallknutson’s picture

The patch in #4 worked for me, although I did notice that the points in #5 don't make sense. It would be nice if there were some sort of static caching for xpath and all_nodes to speed things up, although I don't know how much of a difference that would make. It seems to run fine on my 32MB XML file with 28,000 items.

geefin’s picture

Could you expand on how you are getting through a 32MB XML file?

The parser is falling over for me on a 5MB XML file. I've tried upping the memory limit to 1024M, processing in the background/on cron, and applying this patch, still with no joy.

Processing normally I get the memory allocation error. Processing in the background the process just seems to hang.

I noticed this: http://drupal.org/sandbox/darthsteven/1728218 and tested it, but received errors with that one too... :S

twistor’s picture

Status: Needs work » Needs review
FileSize
1.92 KB

This is about what I had in mind. It needs to be cleaned up a bit.
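The key piece, sketched here against the general Feeds batching contract (this is not the exact content of the attached patch), is that the parser reports partial progress through the state object, so Feeds keeps calling parse() until it is told the batch is complete:

$state = $source->state(FEEDS_PARSE);
if (!isset($state->total)) {
  $state->total = $all_nodes->length;  // number of context nodes in the document
  $state->pointer = 0;
}

// ... parse up to $limit context nodes here, advancing $state->pointer ...

// Any value below FEEDS_BATCH_COMPLETE makes Feeds call parse() again.
$state->progress($state->total, $state->pointer);

return new FeedsParserResult($items);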

bropp’s picture

The patch in #11 works for me -- Feeds now imports in separate batches of the size set in the feeds_process_limit variable.
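For anyone tuning this, a quick note (assuming the standard Feeds 7.x behaviour where FeedsProcessor::getLimit() reads this variable): the batch size can be overridden site-wide, for example:

// Raise or lower the per-batch item count; Feeds falls back to
// FEEDS_PROCESS_LIMIT (50) when the variable is not set.
variable_set('feeds_process_limit', 100);

or from the command line with drush vset feeds_process_limit 100.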

oryx’s picture

Applied #11, and it looks like it is working for me with my 16MB XML file (although the parsing takes ages :)

Thanks a million !!

oryx’s picture

Have you tried playing with the HTTP timeout setting at admin/structure/feeds/%your_feed%/settings/FeedsHTTPFetcher?

klausi’s picture

Status: Needs review » Reviewed & tested by the community

Excellent. It took me a couple of hours to figure out that this was not a core Feeds problem but something that needed to be fixed in the XPath parser. The patch works fine!

twistor’s picture

Version: 7.x-1.x-dev » 6.x-1.x-dev
Status: Reviewed & tested by the community » Patch (to be ported)
twistor’s picture

Version: 6.x-1.x-dev » 7.x-1.x-dev
Issue summary: View changes
Status: Patch (to be ported) » Closed (fixed)

6.x usage is declining slowly. I don't think there's a reason to backport this. If anyone has the desire, feel free.