I posted a question in the Feeds issue queue asking about larger XML files and what is the best way to import them. I am looking at 1500 to 2000 items at most in an XML file, so I may not have an issue, but within the next year I may want to import 100k+. Someone else in Feeds was having a problem with 180k items, and they suggested using a "state-based parser", which PHP has, but there is no module to use it with Drupal. Since I am neither a designer nor a programmer, I have a few questions.

Here is a quote from the Feeds issue queue:

You're looking for a state based parser, SimpleXML will always need to load the entire document and thus require a lot of memory.

PHP has an XML library that is state based: http://www.php.net/manual/en/ref.xml.php - no feeds integration for it AFAIK.

I didn't say what parser I was using, which is why SimpleXML was mentioned, but I have never installed that. Does XPath work the same way, where the whole file has to load first? If so, how hard would it be to implement a state-based parser as an option? Wouldn't it be easy, since there is a PHP library for it?

How much more work would this be to add as a feature request? It seems like a more efficient way of doing it, unless I am missing something.
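From skimming the manual, that library's event-based API looks roughly like this (an untested sketch on my part; the feed.xml file name and the <item> element are just assumptions about my feed, not anything Feeds defines):

<?php
// Event-based ("state based") parsing with PHP's xml extension:
// the file is fed to the parser in small chunks and a handler fires
// per element, so the whole document is never held in memory.
$count = 0;
$parser = xml_parser_create();
xml_set_element_handler(
  $parser,
  // Start-tag handler; element names are uppercased by default.
  function ($parser, $name, $attrs) use (&$count) {
    if ($name === 'ITEM') {
      $count++;
    }
  },
  // End-tag handler (required by the API, even if unused here).
  function ($parser, $name) {}
);

$handle = fopen('feed.xml', 'r');
while ($chunk = fread($handle, 8192)) {
  xml_parse($parser, $chunk, feof($handle));
}
fclose($handle);
xml_parser_free($parser);
print "Counted $count items\n";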

Comments

twistor’s picture

Well,

This module uses SimpleXML under the hood, so my first response will be 'no'. XPath needs access to the DOM, so it has to load the whole document.

However, I have been thinking about this myself, and it would be possible to have each item the context returns be its own DOM. The problems with this approach are many: we'd have to come up with a new way to describe the context, and it would be more or less a whole rewrite.

Honestly, if you're planning on parsing 100k items, I see a custom parser in your future. There are other issues too, such as batching. The D7 version of Feeds supports batching at the parser level, which is needed in the 100k case.

Also, I much prefer http://us2.php.net/XMLReader to the SAX parser you linked to. Both are included in PHP, as is SimpleXML.
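For what it's worth, here is a rough, untested sketch of what the XMLReader approach might look like (feed.xml and the <item> element are placeholders): stream the document and expand each item into its own small DOM, which is basically the "each item is its own DOM" idea above.

<?php
// Pull parsing with XMLReader: the cursor streams through the file, so
// memory use stays proportional to one item, not the whole document.
$reader = new XMLReader();
$reader->open('feed.xml');

while ($reader->read()) {
  if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'item') {
    // expand() builds a DOM fragment for just this element; SimpleXML
    // (and XPath) can then be run against the single item.
    $item = simplexml_import_dom($reader->expand());
    // ... map $item->title, $item->link, etc. to a node here ...
  }
}
$reader->close();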

tyler-durden’s picture

Thanks so much for your input, twistor; it is helpful to me since I am not much of a programmer and know little about PHP.

As for my 100k+ feed needs, that will not be until next summer or fall, and I should be ready to dive into D7 by then, so that problem is solved.

As for now, I think I can make the current Feeds for D6 work, since at most I will have 1300 feeds. I ran a test of 100 items and it worked; I just need to ramp it up now.

Should we close this then?

twistor’s picture

Status: Active » Closed (works as designed)
osopolar’s picture

Version: 6.x-1.8 » 7.x-1.x-dev
Component: Documentation » Code
Category: feature » support

It is still designed to read the whole file into memory, which performs poorly on large files, e.g. 100k or more items in a 150 MB file.

Running the XPath query on the command line, xpath feed.xml "(/items/item)[position() > 0 and position() <= 100]", I also get an out-of-memory message.

Did anybody write a custom parser as described in #1? So far I have found Steven Jones's sandbox: Feeds XPath Parser + XMLReader.
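I haven't dug into that sandbox, but I imagine the batching could work roughly like this (an untested sketch, not the sandbox's actual code; feed.xml, <item>, $offset, and $limit are assumptions): XMLReader streams the file and only the current batch's items are ever expanded, giving the same slice as the position() predicate without loading the whole document.

<?php
// Batched streaming with XMLReader: equivalent to the slice
// (/items/item)[position() > 0 and position() <= 100], but the file is
// read as a stream, so memory use stays flat regardless of file size.
$offset = 0;
$limit  = 100;
$seen   = 0;

$reader = new XMLReader();
$reader->open('feed.xml');

while ($reader->read()) {
  if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == 'item') {
    $seen++;
    if ($seen <= $offset) {
      continue;              // Before this batch's window; keep streaming.
    }
    if ($seen > $offset + $limit) {
      break;                 // Batch complete; a later run resumes here.
    }
    $item = simplexml_import_dom($reader->expand());
    // ... process one item of the current batch ...
  }
}
$reader->close();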

osopolar’s picture

Status: Closed (works as designed) » Closed (duplicate)