There is a memory leak in the module for very large XML files:

PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 85 bytes) in /var/www/drupal/sites/all/modules/feeds_xpathparser/FeedsXPathParserBase.inc on line 73

(If somebody wants to reproduce this error, just drop me a line and I'll upload a small feature (including an example importer/content type) and the XML file.)

Because I want to import some large XML files, I have created a patch for that ;-)
My approach was to fix FeedsXPathParserBase using the batch importing support from the Feeds module (the same way FeedsCSVParser does).
The limit for each loop is $source->importer->getLimit(), which seems to get its value from the FEEDS_PROCESS_LIMIT definition at the top of FeedsProcessor.inc.
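For reference, the mechanism looks roughly like this (a minimal sketch of the Feeds 7.x batching API, not the attached patch itself):

// Inside FeedsXPathParserBase::parse($source, $fetcher_result), sketch only.
$state = $source->state(FEEDS_PARSE);

// Batch size per run; FeedsImporter::getLimit() ends up at the
// FEEDS_PROCESS_LIMIT default defined in FeedsProcessor.inc.
$limit = $source->importer->getLimit();

// Resume where the previous batch run stopped.
$start = isset($state->pointer) ? $state->pointer : 0;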


Comments

Shuairan’s picture

Status: Active » Needs review
FileSize
2.87 KB
twistor’s picture

Status: Needs review » Needs work

A few notes first:

  1. It's not a memory leak, it's just using that much memory.
  2. This is not the same method the CSVParser uses. That parser stores a pointer to a line in a file and continues reading from that file.
  3. At first glance, I don't think this patch will work for an item count greater than FEEDS_PROCESS_LIMIT. The state of the object isn't stored between batch runs, i.e. $this->all_nodes and $this->xpath. That's why the $state object is there.

That said, I'm not entirely opposed to this idea, but it will have to be a bit different. You keep a pointer to the node you are parsing, but you have to reload the entire XML file each time. There could still be some memory savings with this approach, since the $items array you create will be smaller, but I'm not sure how great they would be.
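In other words, something along these lines (a rough sketch of the approach described above; parseContextNode() is a placeholder, not an actual method of this module):

$state = $source->state(FEEDS_PARSE);
$pointer = isset($state->pointer) ? $state->pointer : 0;

// The whole document has to be reloaded on every batch run.
$doc = new DOMDocument();
$doc->loadXML($fetcher_result->getRaw());
$xpath = new FeedsXPathParserDOMXPath($doc);
$all_nodes = $xpath->namespacedQuery($source_config['context'], NULL, 'context');

// Only build $items for the slice belonging to this batch run.
$limit = $source->importer->getLimit();
$items = array();
for ($i = $pointer; $i < $all_nodes->length && $i < $pointer + $limit; $i++) {
  $items[] = $this->parseContextNode($all_nodes->item($i), $source_config);
}
$state->pointer = $i;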

If I'm mistaken about any of the above, let me know.

Also, if you could send me your example, I'd love to take a look at it.

twistor’s picture

Also, patches should be created from the module directory, not the root.

elliotttf’s picture

Status: Needs work » Needs review
FileSize
2.48 KB

I think this patch is actually valid. While it won't solve the memory issue with an insanely large file, it will help avoid memory exhaustion that could arise from whatever the processors might be doing. For example, I've recently been working with a large feed (about 1000 items) that creates file entities from images. At around 300 items, memory is exhausted and the import halts. Batching the process prevents this from being an issue, since the memory the processors are eating up is released after the batch is reloaded.

I've re-rolled the patch against 7.x-1.x.

I think there should be a separate issue to tackle handling large files; however, I suspect that will be more complicated, since you'd have to come up with a way to ensure the integrity of the XML across file chunks, or the XPath parsing might not work.

twistor’s picture

Category: bug » feature
Status: Needs review » Needs work

I like this patch, but why these changes?

-    $this->xpath = new FeedsXPathParserDOMXPath($this->doc);
+    if (empty($this->xpath)) {
+      $this->xpath = new FeedsXPathParserDOMXPath($this->doc);
+    }
+

...

-    $all_nodes = $this->xpath->namespacedQuery($source_config['context'], NULL, 'context');
 
-    foreach ($all_nodes as $node) {
+    if (empty($this->all_nodes)) {
+      $this->all_nodes = $this->xpath->namespacedQuery($source_config['context'], NULL, 'context');
+    }
elliotttf’s picture

This way the xpath and all_nodes variables will persist between batches since the object will be reused.

twistor’s picture

The object is not reused though. It's just the $state object that is kept.

twistor’s picture

Title: Allowed memory size error (patch included) » Implement batch parsing.
Version: 7.x-1.0-beta3 » 7.x-1.x-dev
randallknutson’s picture

The patch in #4 worked for me, although I did notice that the points in #5 don't make sense. It would be nice if there were some sort of static caching for xpath and all_nodes to speed things up, although I don't know how much of a difference that would make. It seems to run fine on my 32MB XML file with 28,000 items.

geefin’s picture

Could you expand on how you are getting through a 32MB XML file?

The parser is falling over for me on a 5MB XML file. I've tried upping the memory limit to 1024M, processing in the background/on cron, and applying this patch, still with no joy.

Processing normally I get the memory allocation error. Processing in the background the process just seems to hang.

I noticed this: http://drupal.org/sandbox/darthsteven/1728218 and tested it, but received errors with that one too... :S

twistor’s picture

Status: Needs work » Needs review
FileSize
1.92 KB

This is about what I had in mind. It needs to be cleaned up a bit.
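The key piece, sketched here against the general Feeds batching contract (this is not the exact content of the attached patch), is that the parser reports partial progress through the state object, so Feeds keeps calling parse() until it is told the batch is complete:

$state = $source->state(FEEDS_PARSE);
if (!isset($state->total)) {
  $state->total = $all_nodes->length;  // number of context nodes in the document
  $state->pointer = 0;
}

// ... parse up to $limit context nodes here, advancing $state->pointer ...

// Any value below FEEDS_BATCH_COMPLETE makes Feeds call parse() again.
$state->progress($state->total, $state->pointer);

return new FeedsParserResult($items);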

bropp’s picture

The patch in #11 works for me -- Feeds now imports in separate batches of the size set in the feeds_process_limit variable.
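For anyone tuning this, a quick note (assuming the standard Feeds 7.x behaviour where FeedsProcessor::getLimit() reads this variable): the batch size can be overridden site-wide, for example:

// Raise or lower the per-batch item count; Feeds falls back to
// FEEDS_PROCESS_LIMIT (50) when the variable is not set.
variable_set('feeds_process_limit', 100);

or from the command line with drush vset feeds_process_limit 100.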

oryx’s picture

Applied #11, and it looks like it is working for me with my 16MB XML file (although the parsing takes ages :)

Thanks a million !!

oryx’s picture

Have you tried playing with the HTTP timeout setting at admin/structure/feeds/%your_feed%/settings/FeedsHTTPFetcher?

klausi’s picture

Status: Needs review » Reviewed & tested by the community

Excellent. It took me a couple of hours to figure out that this was not a core Feeds problem but something that needed to be fixed in the XPath parser. The patch works fine!

twistor’s picture

Version: 7.x-1.x-dev » 6.x-1.x-dev
Status: Reviewed & tested by the community » Patch (to be ported)
twistor’s picture

Version: 6.x-1.x-dev » 7.x-1.x-dev
Issue summary: View changes
Status: Patch (to be ported) » Closed (fixed)

6.x usage is declining slowly. I don't think there's a reason to backport this. If anyone has the desire, feel free.