There is a memory leak in the module for very large XML files:
PHP Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 85 bytes) in /var/www/drupal/sites/all/modules/feeds_xpathparser/FeedsXPathParserBase.inc on line 73
(If somebody wants to reproduce this error, just drop me a line and I'll upload a small feature (including an example importer/content type) and the XML file.)
Since I want to import some large XML files, I have created a patch for that ;-)
My approach was to fix FeedsXPathParserBase using the batched importing support from the Feeds module (the same way FeedsCSVParser does).
The limit for each pass is $source->importer->getLimit(), which seems to get its value from the FEEDS_PROCESS_LIMIT definition at the top of FeedsProcessor.inc.
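For reference, the shape of that batched loop looks roughly like the following. This is a minimal sketch of the idea, not the actual patch: the $state->pointer property, the parseSourceElement() helper, and $this->config['context'] are illustrative stand-ins for the module's real internals.

```php
public function parse(FeedsSource $source, FeedsFetcherResult $fetcher_result) {
  $state = $source->state(FEEDS_PARSE);
  $limit = $source->importer->getLimit();

  // DOM objects cannot be carried over between batch passes, so the whole
  // document is reloaded and re-queried on every pass.
  $doc = new DOMDocument();
  $doc->loadXML($fetcher_result->getRaw());
  $xpath = new DOMXPath($doc);
  $all_nodes = $xpath->query($this->config['context']);

  // First pass: record the total and start at the first context node.
  if (empty($state->total)) {
    $state->total = $all_nodes->length;
    $state->pointer = 0;
  }

  // Parse at most $limit context nodes on this pass.
  $items = array();
  $end = min($state->pointer + $limit, $state->total);
  for ($i = $state->pointer; $i < $end; $i++) {
    $items[] = $this->parseSourceElement($xpath, $all_nodes->item($i));
  }

  // Report progress; Feeds keeps scheduling passes until the total is reached.
  $state->pointer = $end;
  $state->progress($state->total, $state->pointer);

  return new FeedsParserResult($items);
}
```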
| Comment | File | Size | Author |
|---|---|---|---|
| #11 | feeds_xpathparser-batch-parsing-1397434-11.patch | 1.92 KB | twistor |
| #4 | 1397434-4-feeds_xpathparser-batched_processing.patch | 2.48 KB | elliotttf |
| #1 | batched_parsing-1397434-1.patch | 2.87 KB | Shuairan |
Comments
Comment #1
Shuairan CreditAttribution: Shuairan commented
Comment #2
twistor CreditAttribution: twistor commented
A few notes first:
That said, I'm not entirely opposed to this idea, but it will have to be a bit different. You keep a pointer to the node you are parsing, but you have to reload the entire XML file each time. There could still be some memory savings with this approach, since the $items array you create will be smaller, but I'm not sure how great they would be.
If I'm mistaken about any of the above, let me know.
Also, if you could send me your example, I'd love to take a look at it.
Comment #3
twistor CreditAttribution: twistor commented
Also, patches should be created from the module directory, not the root.
Comment #4
elliotttf CreditAttribution: elliotttf commented
I think this patch is actually valid. While it won't solve the memory issue for an insanely large file, it will help avoid memory leaks that could arise from whatever the processors might be doing. For example, I've recently been working with a large feed (about 1000 items) that creates file entities from images. At around 300 items, memory is exhausted and the import halts. Batching the process prevents this from being an issue, since the memory the processors are eating up is released after the batch is reloaded.
I've re-rolled the patch against 7.x-1.x.
I think there should be a separate issue to tackle handling large files; however, I suspect that will be more complicated, since you'd have to come up with a way to ensure the integrity of the XML across file chunks or else the XPath parsing might not work.
Comment #5
twistor CreditAttribution: twistor commented
I like this patch, but why these changes?
Comment #6
elliotttf CreditAttribution: elliotttf commented
This way the xpath and all_nodes variables will persist between batches, since the object will be reused.
Comment #7
twistor CreditAttribution: twistor commented
The object is not reused, though. It's just the $state object that is kept.
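For context, the $state object twistor refers to is a FeedsState instance that Feeds keeps with the source between batch passes, while the parser plugin and its DOM objects are rebuilt each time. A small illustration (the pointer property is a made-up name):

```php
// Only plain, serializable bookkeeping survives between passes on FeedsState;
// PHP's DOM classes cannot be serialized at all.
$state = $source->state(FEEDS_PARSE);
$state->pointer = 150;      // Fine: a plain integer persists to the next pass.
// $state->xpath = $xpath;  // Not viable: DOMXPath is not serializable, which
//                          // is why each pass has to reload the document.
```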
Comment #8
twistor CreditAttribution: twistor commented
Marking #1447788: Adding support for step batch import or background processing as a dupe.
Comment #9
randallknutson CreditAttribution: randallknutson commented
Patch in #4 worked for me, although I did notice the items questioned in #5 don't make sense. It would be nice if there were some sort of static caching for xpath and all_nodes to speed things up, although I don't know how much of a difference that would make. It seems to run fine on my 32MB XML file with 28,000 items.
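A rough sketch of what that static caching could look like; getXPath() is a hypothetical helper, and note that drupal_static() only lives for a single PHP request, so this would help when several batch passes run in one request (drush, cron), but not for browser-driven batches:

```php
protected function getXPath(FeedsSource $source, $raw) {
  // Cache the parsed document and XPath object per feed for this request.
  $cache = &drupal_static(__METHOD__, array());
  if (!isset($cache[$source->feed_nid])) {
    $doc = new DOMDocument();
    $doc->loadXML($raw);
    $cache[$source->feed_nid] = new DOMXPath($doc);
  }
  return $cache[$source->feed_nid];
}
```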
Comment #10
geefin CreditAttribution: geefin commented
Could you expand on how you are getting through a 32MB XML file?
The parser is falling over for me on a 5MB XML file. I've tried upping the memory limit to 1024M, processing in the background/cron, and implementing this patch, still with no joy.
Processing normally, I get the memory allocation error. Processing in the background, the process just seems to hang.
I noticed this: http://drupal.org/sandbox/darthsteven/1728218 and tested it, but received errors on that one too... :S
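For anyone else stuck here: the memory limit geefin mentions is typically raised in php.ini or, per site, in settings.php, though with this parser the whole file is still loaded into a DOM at once, so this only postpones the ceiling:

```php
// sites/default/settings.php
ini_set('memory_limit', '1024M');
```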
Comment #11
twistor CreditAttribution: twistor commented
This is about what I had in mind. It needs to be cleaned up a bit.
Comment #12
bropp CreditAttribution: bropp commented
Patch in #11 works for me: Feeds now imports in separate batches of the size set in the feeds_process_limit variable.
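For reference, Feeds falls back to the FEEDS_PROCESS_LIMIT constant (50 in Feeds 7.x) when that variable is unset, so the per-pass batch size can be tuned per site, e.g.:

```php
// Use smaller batches when processors do heavy work (e.g. creating file
// entities from remote images).
variable_set('feeds_process_limit', 25);
```

Or with Drush: drush vset feeds_process_limit 25.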
Comment #13
oryx CreditAttribution: oryx commented
Applied #11, and it looks like it is working for me with my 16MB XML file (although the parsing takes ages :)
Thanks a million!!
Comment #14
oryx CreditAttribution: oryx commented
Have you tried playing with the HTTP timeout setting? admin/structure/feeds/%your_feed%/settings/FeedsHTTPFetcher
Comment #15
klausi commented
Excellent. It took me a couple of hours to figure out this was not a core Feeds problem but needed to be fixed in the XPath parser. Patch works fine!
Comment #16
twistor CreditAttribution: twistor commented
Thanks!
Committed to 7.x: http://drupalcode.org/project/feeds_xpathparser.git/commit/815c456
Comment #17
twistor CreditAttribution: twistor commented
6.x usage is declining slowly. I don't think there's a reason to backport this. If anyone has the desire, feel free.