Hi Alex. I'm starting this as a support request as I'm not sure if it's my issue or something within core Feeds. I'm using a recent Feeds -dev from CVS (around the last week of Feb) with #712304: Batch import does not continue where it left off, instead starts from the beginning and #727332: Only batch if necessary applied. I'm using Pressflow, but I don't think that would matter.

OK, so here's the deal. Using my own class that extends FeedsNodeProcessor (to build new features into process), I am still getting out of memory errors. So what I did at the top of my process function was to add this:

  $memusage = memory_get_usage(); watchdog('apublisher', 'Process Start: ' . $memusage);

and then I added a similar call at the start of the while ($item = $batch->shiftItem()) loop and another at the end of the loop to see what memory looked like.
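
Roughly like this (simplified - just showing where the instrumentation sits, not the rest of my processor):

  public function process(FeedsImportBatch $batch, FeedsSource $source) {
    watchdog('apublisher', 'Process Start: ' . memory_get_usage());
    while ($item = $batch->shiftItem()) {
      watchdog('apublisher', 'Parser memory start: ' . memory_get_usage());
      // ... my custom per-item processing ...
      watchdog('apublisher', 'Parser memory end: ' . memory_get_usage());
    }
  }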

Trying this with an XML file with 24 fields and a single record, the process start / parser start / parser end memory looks like this:
Process Start: 4272652
Parser memory start: 5389644
Parser memory end: 6435520

OK, it starts at 4MB and ends up at 6MB. It's going up a bit, but most of my feeds have been working. Now I'm dealing with the largest import I have done: an XML file with 17,100 records and 23 fields, which is about 32MB in size. Here's what memory usage looks like for that.
Process Start: 100895176
Parser memory start: 102012392
Parser memory end: 103352632

... and so on until I run out of memory. It STARTS at more than 100MB of memory. Same Feeds code, same processor, same importer, same feeds node config. The only difference is the source XML file.

PHP Fatal error:  Allowed memory size of 805306368 bytes exhausted (tried to allocate 103424027 bytes) in /mypath/includes/database.mysqli.inc on line 382, referer: http://mysite/node/52317/import

Yes, you read that right - I set memory_limit to 768MB just to see if it would work. No dice.

I edited FeedsNodeProcessor.inc and dropped FEEDS_NODE_BATCH_SIZE down to 10, but it still craps out. I can get through about 21 records before it dies.

On the console I generally see a WSOD or a Batch API error. In all cases it's PHP memory exhaustion.

I don't really know what to ask you for here other than for some advice as to where to look. Why is it using such a large amount of initial memory? Any way to make it smaller? What would you start looking at to debug this further?

Thanks in advance for any help you can provide.


Comments

alex_b’s picture

Measure the memory consumption in FeedsCSVParser::parse() - my hunch is that the array that feeds builds from the source CSV file is eating up all that memory.

rjbrown99’s picture

Awesome, thanks a ton for the reply - I'll check it out and report back.

rjbrown99’s picture

For extra fun, I am using my own parser with this :) It's based on #631104: Extensible XML parser (mapping more sources) comment #21 but cleaned up for the current API.

Good call on memory:
Parser Start: 4348340
Parser End: 102224480
Myprocessor single item memory start: 102226748
Myprocessor single item memory end: 103702204

So you are right that it's not the processor; it's the parser that is chewing up memory before things even get to the processor. After a bit more work, I found it happening specifically in my parser include file, which calls myinclude_parser_parse. My code is a cross between opml_parser.inc and common_syndication_parser.inc with more fields. A simplified and shortened version of my code looks like this:

function cjxml_parser_parse($raw) {
  // Memory starts here at 38168360 (38MB).
  @ $xml = simplexml_load_string($raw, NULL, LIBXML_NOCDATA | LIBXML_NOERROR | LIBXML_NOWARNING);
  // Memory is now at 38168728 (38MB).
  $customItems = $xml->xpath('//product');
  // Memory is now at 43113880 (43MB).
  $parsed_source = array('items' => array());
  foreach ($customItems as $cItem) {
    $item = array();
    $item['sku'] = isset($cItem->sku) ? "{$cItem->sku}" : "";
    // Lots more key/value creation goes on now for $item['price'] and the other fields.
    $parsed_source['items'][] = $item;
  }
  // Memory is now at 139431696 (139MB).
  return $parsed_source;
}

The huge jump happens within that foreach, probably because I'm dealing with more than 17,000 items. Given this case, I would think the actual Feeds opml_parser.inc or c_s_p.inc may run into the same issue when scaling to a lot of items?

I was adding a bunch of unset() calls in there during testing but it didn't really deallocate enough to make a difference.

If you have any ideas as to how to shrink this down I'd love a suggested approach. Is this another candidate for batching? I'm also using drupal_queue... perhaps the $customItems product nodes could be stuck into the queue and processed one at a time?

alex_b’s picture

That's what I was expecting...

What you're looking for is batching on the parsing level. It's not implemented (yet) and I assume that it's not going to be straightforward; it may require some further-reaching refactoring of Feeds than just adding a return value to parse() and modifying FeedsSource::import() accordingly. You could give it a shot (or just live with high memory consumption for now).

rjbrown99’s picture

Title: BatchAPI - Scaling with large files » Implement BatchAPI for Parsers
Category: support » feature

Thanks again Alex, I'm changing this to a feature request.

Depending on the complexity I'd be happy to take a crack at this. For me it's not just a matter of living with high memory consumption - I can't set the limit high enough to make my imports work at scale. With about 384MB of PHP memory I can pull in about 12,000 items, but even with 768MB I can't get to 17,000 items. I'd like to get to around 100K products in a single import file at some point. Right now my max is at that 17,000 number.

What would you suggest as an approach for how to implement this? I'm talking fairly high level here. I'm wondering if there are helpful things that could be done to assist in scaling without a full-scale refactoring. At the same time I don't want to write a bunch of 'band-aid' code that will be tossed out after a new, more universal approach is adopted. Ideally it would be something that could help now without huge lifting and could also be leveraged into the final solution.

alex_b’s picture

#5: Can't think of any workarounds other than breaking your CSV files into pieces for now. Otherwise, the way to go is to attack FeedsSource::import().

rjbrown99’s picture

Thanks Alex. It's difficult to break up my file, as I would have to somehow cleanly break it in the same way each time. If I simply split it into one file of 10,000 XML nodes and another with 7,000, some items could be in either file and would result in duplicates. There's no easy way for me to cleanly 'break' the file so that the same XML nodes go into the same file each time.

I'm working on a different parser at the moment. I'm going to come back to this in a few days and will start to populate this issue with thoughts on FeedsSource::import(). Thanks for the reply.

rjbrown99’s picture

I looked into this in a bit more detail. I can get my 17,000 node XML file going if I allocate 1500MB of memory to PHP. I also set FEEDS_NODE_BATCH_SIZE to 5.

Although the parser does eat up some initial memory, the biggest consumer seems to be in FeedsNodeProcessor, specifically:

        // Execute mappings from $item to $node.
        $this->map($item, $node);

In most instances, I get a jump like this, with my memory-checking lines immediately before and after that line.
Before map: 112,398,912
After: 112,897,648

Not bad... but there are a few occasions where I see this:
Before map: 113,751,956
After: 161,764,020

And every subsequent item that is processed starts at the higher memory utilization figure, i.e. the next item started at 162,235,304. The memory usage drops back down on the next batch run at the parser start - 43,975,232. So where I am really running out of memory during processing is on that map.

The map leads to FeedsProcessor.inc, protected function map(). Within that function, it's the foreach ($this->config['mappings'] as $mapping) loop that is taking the memory. Eureka - it's the callback in that loop.

Feeds Map Callback Before: 113472308
Callback: content_taxonomy_feeds_set_target
Feeds Map Callback After: 161440352

I'm using #637334: Content Taxonomy Mapper, so it looks like I should keep working on this and narrow it down and move on to that ticket.

alex_b’s picture

#8:

This is clear as FP::map() takes a very small stub of a $node object and populates it by copying values to it. It's a copy machine! It's a memory hog!

Unfortunately PHP doesn't give you much direct control over memory management, but you should definitely try an unset($node) at the end of the while block in FNP::process().
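
Something like this (heavily simplified - just to show where the unset would go, not the actual FNP::process() body):

public function process(FeedsImportBatch $batch, FeedsSource $source) {
  while ($item = $batch->shiftItem()) {
    $node = new stdClass(); // Stub node; the real code also sets type, uid, feed item info, etc.
    $this->map($item, $node);
    node_save($node);
    // Let PHP reclaim the (potentially huge) node before the next iteration.
    unset($node, $item);
  }
}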

Generally: seeing that you're starting at a very high level already (after a lot of memory has been consumed at the previous parsing stage) I'm wondering whether you should just bite the bullet and fix batching on parsing.

rjbrown99’s picture

It's actually working a lot better now. The taxonomy issue was killing me. It's still using a lot of memory, but not nearly as much now that I went from almost 50,000 taxonomy terms to about 400. Do you have a wiki page anywhere of best practices for Feeds? I have a few important ones that I learned over the past few days.

1) Be careful when choosing to import to taxonomy. This is especially true with tens of thousands of feed items that may create independent terms. It's better to use a CCK text field and index it with Solr or something else.

2) Be aware that most of the actions within Feeds will trigger an entry into the MySQL binary log. I found this out the hard way when I imported 17,000 items, ran out of disk space on the drive holding the binary logs, and the site went down. Make sure you either have lots of space for the binary logs, only keep them for a short time, or run "reset master" to flush them out.

I'm not sure that I can easily fix batching on parsing which is why I'm working around it. I looked at it, read into Batch API, and wasn't sure if I could make any meaningful improvements there.

alex_b’s picture

#10: Do you want to start an "Importing large files" section in the site builder's guide? http://drupal.org/node/622696

Don't create a subpage, just add to #622696 under the chapter on Performance.

alex_b’s picture

pounard found that FeedsSource::save() uses a lot of memory when handling large batches - this may cause some of the high memory consumption reported here.

#764324: FeedsSource::save() method will consume too much memory when parsing large file using FeedsFileFetcher

pounard’s picture

Just to repeat my point of view here:

  • items should not be stored in the database
  • parsers that can do incremental work should always parse only the subset of items that the processors can process
  • if the processor does not limit itself to a subset of elements, the parser should be able to (through a global variable)
  • then only the current offset should be stored in Drupal's batch table

This would imply the following changes:

  • the FeedsSource::import() method should be fully rewritten
  • so should the FeedsSource::save() and load() methods
  • processors should have an optional method they can override from the FeedsProcessor class to give their own item processing limit
  • parsers should also have this optional method from the FeedsParser class to give their own parsing limit
  • the lower of these two limits should be used for both parsing and processing during the same import, to avoid useless parsing
  • if the parser does not support incremental parsing, the import method should be able to extract a subset of the parsed elements and give only this subset to the processor

Piece of cake, 10 minutes of coding :)

EDIT: This is extracted from an IRC chat with alex_b; I don't want to take credit for his ideas, which account for most of these statements.
Re-EDIT: Typo.

pounard’s picture

FileSize
8.69 KB

Here is a sample patch (likely needs some rewrite).

  • Added public method getItemLimit() to both FeedsParser and FeedsProcessor super classes.
  • Wrote an algorithm to determine which limit to use in FeedsSource::import()
  • Added $offset and $limit optional parameters to FeedsParser::parse() abstract method
  • Added $limit optional parameter to the FeedsProcessor::process() abstract method
  • Added FeedsBatch::$offset property
  • Rewrote the FeedsSource::save() method to drop the current items if any remain
  • Rewrote the FeedsImportBatch::shiftItem() method to increment the offset on each shift

EDIT: With this I no longer have memory limit problems using a 512MB PHP memory limit (for parsing a 50MB RSS file). Using 256MB was still fatal and PHP died, so I assume there are a lot of remaining memory leaks.
I think the SimplePie parser is also a bottleneck on my configuration; 50MB of RSS on each cron run is really resource intensive to parse. Now I get memory errors inside SimplePie parsing, which is not a Feeds module problem anymore.

pounard’s picture

Anyone alive here?

alex_b’s picture

Alive. Incredibly busy. This is on my review list, but I gotta tell you: DrupalCon is coming up and I have a big project to push out before that...

pounard’s picture

OK, I'm going to stand by on this issue.

rjbrown99’s picture

I'm here - I haven't had a chance to try out the patch yet, but I am very interested in doing so. Either later this week or this weekend I will give it a go.

pounard’s picture

The best optimization anyone could do here is to use an XML parser based on a SAX implementation, which would be able to easily fetch a subset of the content without loading the full XML file into memory. In my case, when importing a large file with the SimplePie parser, it would be incredibly faster.

SimplePie loading a 50MB file is hellish, and with my patch it does that on every batch pass (loading it once and putting it in the database would be just as slow, maybe even impossible).

alex_b’s picture

Status: Active » Needs review

#19: right.

setting to NR, will get to review later.

pounard’s picture

This patch may need some improvements. I'm currently using it (it works well), but the $total property of the batch is never set (saw this this morning), so the batch just stays at 0% progress until it finishes.

I'll do some fixes and improve the inline code documentation as soon as I have some time.

alex_b’s picture

#21: I see.

Don't worry about documentation yet: I am going to review with an eye towards expanding batching to the fetcher, too. This may bring some significant changes. I recently had discussions with people who'd like to use Feeds for importing from IMAP, POP, or log files, or for paging through XML feeds. In all of these cases batch support on the fetcher level would be ideal.

pounard’s picture

#22: I was going to add this feature myself :) I will need it in the coming weeks.

alex_b’s picture

Title: Implement BatchAPI for Parsers » Expand batch support to fetchers and parsers

There we go, busting the scope :) Feel free to give it a first stab.

pounard’s picture

I think that in order to handle the limit/offset correctly, at least the processors should be implemented differently.

The FeedsSource class must remain the master: when it asks the parser for N elements, it should then call a processItem() method on the processor N times, instead of asking the processor to work through N elements of the entire items array.

Implemented this way, no processor could override this behavior; it would be more fail-safe, it would remove complexity from processor implementations, and it would make them easier to write and override. Parsers and fetchers are the bottleneck in the full pipeline, so they should still be able to have their own algorithm with less control (the fail-safe behavior is then entirely in the hands of the parser and fetcher developers, which is not that dreadful).
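
A rough sketch of what I have in mind (hypothetical code - none of these method signatures exists yet):

// FeedsSource (or the batch) drives the loop; the processor only ever sees one item.
function import_step(FeedsSource $source, FeedsImportBatch $batch) {
  $limit = min($source->importer->parser->getItemLimit(),
               $source->importer->processor->getItemLimit());
  // The parser only parses the slice the processor can digest on this page load.
  $items = $source->importer->parser->parse($batch, $source, $batch->offset, $limit);
  foreach ($items as $item) {
    $source->importer->processor->processItem($item, $source);
    $batch->offset++;
  }
  // Only $batch->offset has to be persisted between page loads, not the items themselves.
}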

This doesn't mean the processor can't advertise its own limit; it should at least give some sort of "level of bottleneckness", some sort of division factor for the FeedsSource batch limit. In any case, every limit given (by processor, fetcher, or parser) should be overridable by the site admin with global (or less global) site variables.

Plus, the Batch class has nothing to do with the processor class; the processor does not have to know about it, there is no use for that, and I think this pattern must be changed. The processor should never control FEEDS_BATCH_COMPLETE itself; that makes no sense at all. It makes the code harder to read and breaks the encapsulation principle.

EDIT: Another note: in order for the UI to be more comprehensible (I'm talking about bugs such as the "I don't know the total number of items" bug), the batch should display information such as the number of items treated so far, even if it does not know the total; at least that shows the batch isn't stuck.
Even better, the batch class could carry information such as the number of duplicates it found, the number of processing failures, and so on, and display them on the batch screen.

RE-EDIT: Duplicates are hard to track without keeping some sort of cached identifier list, which could be database intensive, so maybe a good compromise would be to display the total number side by side with the count of updated items and the count of freshly created items within that total.

rjbrown99’s picture

I'm using this now. No feedback yet since it seems to work so far. I'm importing around 50K items over the next few days so it should give it a nice run.

rjbrown99’s picture

FileSize
10.07 KB

Re-rolled the patch from #14 so it applies against HEAD, which at the moment is the same as Alpha 15. No other changes.

robbertnl’s picture

I am using #27, but I am not sure how to use it. I wrote my own processor (extending FeedsProcessor) using the Feeds XML parser (http://drupal.org/project/feeds_xmlparser) as the parser. I added the getItemLimit() function and changed the process function to process(FeedsImportBatch $batch, FeedsSource $source, $offset = 0, $limit = 0).
But I still have memory problems.

stella’s picture

Status: Needs review » Needs work

Applying the patch from #27 gives me the error:

Fatal error: Declaration of FeedsCSVParser::parse() must be compatible with that of FeedsParser::parse() in feeds/plugins/FeedsCSVParser.inc on line 125

pounard’s picture

@stella: The API changes in this patch have not been ported to all parsers yet; this is pending work. Thanks for reporting anyway.

stella’s picture

I did tweak the parse() function definition, but still ran into memory problems after about 350 records - I've got over 10,000 to import nightly :( I've gone back to using the migrate module, but would be interested in this if you can get it working.

rjbrown99’s picture

This is not a silver bullet for memory issues - you still need a fairly high memory allowance in php.ini. In my experience so far with this patch, though, I can use a much smaller value - on the order of 384MB instead of the 1100MB I was using previously for very large feed imports.

stella’s picture

Increasing the php memory_limit setting to 1100MB is not realistic for live sites. Depending on your server's available memory, number of webserver processes and other apps running on the server, you could quickly run out of memory. Then on shared hosting you often can't change your php memory_limit setting.

I don't think this patch is a silver bullet, but the code shouldn't run out of memory so quickly, and it should handle the scenario where memory starts becoming a problem. Maybe the maintainers can take a peek at the Migrate module code - it detects memory usage and, when the memory limit is nearly reached, it triggers a new batch. Kinda nifty really, but I think Feeds would be a better solution if I could get it working.
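
Something along these lines is the kind of check I mean (illustrative only, not migrate's actual code):

// Bail out of the current batch pass early when we get close to memory_limit,
// so a fresh page load can continue with a clean slate.
function mymodule_memory_nearly_exhausted($threshold = 0.8) {
  $limit = trim(ini_get('memory_limit'));
  if ($limit == '-1') {
    return FALSE; // No limit configured.
  }
  // Convert shorthand like "512M" to bytes.
  $units = array('K' => 1024, 'M' => 1048576, 'G' => 1073741824);
  $unit = strtoupper(substr($limit, -1));
  $bytes = (int) $limit * (isset($units[$unit]) ? $units[$unit] : 1);
  return memory_get_usage() > $threshold * $bytes;
}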

pounard’s picture

@stella: What this patch intends to do (it is highly WIP, by the way) is to allow batch support for parsers. This will decrease memory usage drastically once it is done. For large operations such as a massive node import, whatever you do you will need a lot of CPU time and a large memory footprint due to Drupal node management and PHP's extremely bad memory management, unless you cut the massive import into pieces - which is the whole point here. The remaining problem is that the SimplePie library can't parse XML documents the way a SAX parser would, but any SAX-based specialized parser would do the trick, and that is another issue.

EDIT: I totally agree with you about the memory limit; 128MB is already too much in many cases on production sites, and 1GB certainly would be.

stella’s picture

@pounard - I totally agree. While I don't have the resources to work on a solution, I'm more than happy to help test patches.

pounard’s picture

@stella: I must admit I don't have time to spend on dev seed's module right now, even though I need this feature :) But if I propose more patches, I'd be glad to have you test them.

rjbrown99’s picture

I agree with everything said about PHP memory usage, and the overall stated goal is to lower it. The challenge, as with any open source project, is who has the itch and the skills to scratch it. I have the itch, but I'm still a much stronger network+OS person than a PHP person, so I can't scratch it as easily. That being said, here's my initial plan to solve my immediate large-scale import challenge.

In short, I am going to split the import to a different server from the one that hosts the site. The idea here would be one database, two webservers.

Webserver #1 is your normal production site serving users.
Webserver #2 would be used when needed to handle the large imports.

They are not load balanced, but webserver #2 talks to the same back-end database and can either rsync or NFS-mount the filesystem from webserver #1.

On Webserver #2, you would configure a much larger share of PHP memory to allow for an entire fetch, parse, process loop without impacting the users. I'm personally hosted on Amazon EC2 so this would not be a huge deal to spin up a machine for an hour to do the import. A large instance with 7.5GB of memory is $0.34 per hour and I can do all of my imports in under an hour on a box like that. I can throw 2GB of memory at PHP without breaking a sweat and $11 a month in machine time doesn't break my wallet.

It would be something like this, minus the load balancer.
http://www.johnandcailin.com/blog/john/scaling-drupal-step-two-sticky-lo...

I will say however that the patch in this thread did significantly lower my memory usage so I may not even need to spin up the second EC2 instance.

pounard’s picture

@rjbrown99: I strongly agree that the environment around the software must be at the right scale. This issue is about writing performant (or at least robust) code. A good web app needs good admins, and also good developers :)

milesw’s picture

subscribing

Very interested in batching at the fetching and/or parsing stage.

pounard’s picture

I'm still looking forward to doing it; I just don't have enough time these days. It's still written on a small Post-it in a corner of my mind.

rjbrown99’s picture

Well, it doesn't fix the root of the problem but I did make a change to my parser that helped quite a bit and I thought I would share.

My parser is pulling in XML files, and there are 43 possible fields in the definition. For my purposes, I only map and import 22 of those 43. In my custom parser, I simply eliminated the extra 21 mapping targets by commenting them out. This effectively removed them from the $batch array during the FeedsNodeProcessor/process step. Since that is a LOT smaller than it used to be I can scale a bit higher.

Just an FYI: if you are running into this issue, make sure to check your parser and kill any extraneous object fields.

alex_b’s picture

#849986: Cleaner batch support aims to clean up batch support which should make it easier to implement batching on a fetcher/parser level.

alex_b’s picture

Another prerequisite for this issue: #850298: ParserCSV: Support batching

alex_b’s picture

This patch is my first shot at full batching support using #849986: Cleaner batch support - in fact, #849986 is included in this patch.

The patch is in a proof of concept state. See FeedsSource::import() and FeedsCSVParser::parse() for core changes. Basic CSV file import works, not all tests are passing.

This patch implements the loosely coupled approach where any fetcher, parser and processor can consume as many items as they want and Feeds import will handle it.

This contrasts with the tightly coupled approach pounard suggests in #25 where the number of items to be consumed is negotiated between at least the parsing and the processing stage.

I am not entirely happy with the loosely coupled approach as implemented in the patch I am posting here as it introduces some weirdness around tracking the status of the batch object (see FeedsImportBatch::needsFetching() etc. methods).

I will give the tightly coupled approach more thought once I get some more time. In the meantime: here is the status quo of my work.

Overall, I don't think the patch in #27 is the way to go: I think we should keep the status of an import wrapped in the FeedsImportBatch object rather than tracking it by variables that we pass into import stages in addition to the batch object.

For that very reason I have created #849986: Cleaner batch support to clean up and expand on the FeedsBatch object concept in a way that is powerful enough for full batching support.

alex_b’s picture

Status: Needs work » Needs review
FileSize
19 KB

Small adjustment - now all tests are passing.

pounard’s picture

Hello again, it's been a while! I'm going to test your patch right now; I'll give you some feedback in the next few days.

pounard’s picture

Ouch, still a nightmare for memory consumption... I'll try to adapt my old work to the new patch and participate in the new issue you opened.

pounard’s picture

@alex_b : What about my suggestion:

The FeedsSource class must remain the master: when it asks the parser for N elements, it should then call a processItem() method on the processor N times, instead of asking the processor to work through N elements of the entire items array.

I still think that processors should never know about the FeedsBatch; the batch itself should call the processor's methods. One key to performance would be to avoid multiple array copies.

Are you still not fond of the 'offset' approach rather than shiftItem()? It avoids modifying the array in memory, which makes for a much faster algorithm when processing a large number of items.

I'm trying to reduce memory consumption at all costs. Right now, when I try to import a 40MB file containing thousands of items, Feeds blows the PHP memory limit (set to 500MB) after importing only 50 nodes.

EDIT: Another note: if the fetcher is not batching-enabled, the processor, with the current implementation, will still attempt to import all items without any control. It is the batch's responsibility to give the processor only a subset of items, one by one, to remove this kind of processor behavior, which totally overrides any item limit we could set on the batch.

pounard’s picture

I made a working proof-of-concept patch; see the attached file.

It's based upon the 1.0-beta3 version and the #45 patch, and it includes some concepts similar to the #27 patch.

What's different from #27:

  • I removed the negotiation you don't like
  • I did not reimplement the incremental fetch

What's added since the #45 patch:

  • Changed the FeedsProcessor::process() method signature to do per-item processing, without knowing anything about the batch class
  • Added FeedsBatch::getLimit() and FeedsBatch::setLimit() methods
  • Added a FeedsImportBatch::process() method which moves the limit logic from the processor to the batch itself
  • Added back the 'offset'-style array walking instead of the array-shift approach
  • Ported the FeedsNodeProcessor class to the new signature (did not do the other processors yet)

Why did I use the 'offset' method? Because in a batched environment, the shiftItem() method assumes that the fetcher was able to fetch only a subset of the items the batch will process. In my case, the problem is that the fetcher generates an item array far too big to be stored in the database, and the file has to be re-parsed on each iteration (which is quite heavy, but seems to save some memory and avoid errors in my environment). So I can't rely on saving items in the batch instance.

There are still some things that bother me:

  • The fetcher should be able to do incremental fetching (SimplePie can't handle this, though)
  • The batch class itself should know the fetcher in order to be able to fetch the needed data on an as-needed basis
  • I'm still forced to pass the FeedsSource object in the process() method signature (a processor should never know its environment)

EDIT: This code is really buggy; I'm currently fixing it.

pounard’s picture

This is exactly why items should never be temporarily saved in the database while batching:

[Mon Jul 19 17:58:18 2010] [error] [client 127.0.0.1] PHP Warning:  Got a packet bigger than 'max_allowed_packet' bytes\nquery: INSERT INTO watchdog\n    (uid, type, message, variables, severity, link, location, referer, hostname, timestamp)\n    VALUES\n    (1, 'php', '%message in %file on line %line.', 'a:4:{s:6:\\"%error\\";s:12:\\"user warning\\";s:8:\\"%message\\";s:11380875:\\"Got a packet bigger than 'max_allowed_packet' bytes\\nquery: UPDATE feeds_source SET id = 'pcf_original_site', feed_nid = 0, config = 'a:1:{s:16:\\\\"FeedsFileFetcher\\\\";a:2:{s:6:\\\\"source\\\\";s:52:\\\\"sites/default/files/feeds/20100407-pcf_rss-small.xml\\\\";s:6:\\\\"upload\\\\";s:0:\\\\"\\\\";}}', source = 'sites/default/files/feeds/20100407-pcf_rss-small.xml', batch = 'O:14:\\\\"FeedsFileBatch\\\\":12:{s:12:\\\\"\\\\0*\\\\0file_path\\\\";s:52:\\\\"sites in /var/www/test.triage/www/includes/database.mysql.inc on line 128, referer: http://test.triage.guinevere/

The query says that it fails while saving the watchdog message, but if you look a bit further, you can see that watchdog is attempting to save the batch itself, because saving the batch into feeds_source failed!

And BTW, believe me, my max_allowed_packet is already quite big, but Feeds is attempting to save parsed and serialized data that was originally a 40MB XML file.

pounard’s picture

Self-moderation: deleted this comment's content, which had no use.

alex_b’s picture

#48:

The FeedsSource class must remain the master: when it asks the parser for N elements, it should then call a processItem() method on the processor N times, instead of asking the processor to work through N elements of the entire items array.

As we can't assume that every parser is going to implement batching, we will have to handle the case where the parsing stage returns more items than the processing stage consumes.

Hence negotiating the number of items is a special case where parser _and_ processor support batching. To address that, we could introduce the following methods:

// Returns the number of items consumed per page load, 0 if batching is not supported.
FeedsParser::getBatchSize();
FeedsProcessor::getBatchSize();

A parser like FeedsCSVParser could use FeedsProcessor::getBatchSize() to inspect the processor's batching capabilities and adjust its batch size accordingly.

pounard’s picture

@alex_b

I'm continuing at http://drupal.org/node/849986#comment-3225522, which seems more appropriate since my code also overlaps with that issue.

pounard’s picture

As mentioned in the http://drupal.org/node/849986#comment-3225522 comment, I made a new version of my patch (currently working against the Feeds 6.x-1.0-beta3 version).

This patch adds:

  • Streamed representation and usage of fetchers and parsers
  • Better code encapsulation, with all the progress logic moved to the FeedsImportBatch class
  • Implicit support for old fetcher implementations without needing to port them
  • Heavy API changes on processors, which require them to be ported (only the node processor has been done so far, so tests won't pass until all the others have been ported)
  • Same for parsers; only the SimplePie parser has been ported so far
  • Temporary raw fetched content is now stored in physical files instead of the database, which relieves the database a lot (putting a 50MB row in the database is madness)

I can now import a 50MB XML file containing 4000+ nodes with a 512MB memory limit (which is still high) without breaking the Feeds import. For comparison, in the same environment, the current unpatched Feeds version breaks my memory limit with a 6MB XML file containing 400 nodes. And with the same 50MB file, the unpatched version breaks after attempting to allocate more than 1GB!

This is still in a "proof of concept" state, needs thorough testing (at least unit testing), and a lot of debug messages (heavy use of watchdog) remain in this patch. Feel free to test! Sadly, you can only test with FeedsNodeProcessor and the SimplePie parser, since those are the ones I really needed.

adityakg’s picture

Subscribing. I will try to help whenever necessary, please let me know.

IRC: adityakg

pounard’s picture

@adityakg: Any help is appreciated from my point of view, but changes are useless unless their direction is supported by the module maintainer. I did this because I really needed it. Nevertheless, any testing, bug reports, wrong-behavior reports, feedback, or design-inconsistency reports would be appreciated; the more stable this patch becomes, the better the chances it turns into something that won't be thrown away in the next release of Feeds :)

andrewlevine’s picture

Just another idea for those that need an immediate solution:
I had a several-hundred-MB file. I made my own fetcher that reads the file line by line and only returns a few hundred "items" at a time. It then saves a pointer to its progress in the file. This actually means I can process multiple sections of the file in parallel.

pounard’s picture

My current files are pretty poorly formatted; they come from various exports.

alex_b’s picture

pounard: I just read your patch from #54 again.

A. A FeedsBatch takes on the responsibility of iterating over items. I don't think this assumption holds for all use cases. E.g. it breaks FeedsFTProcessor.
B. There is a practical problem with the fact that the patch changes a range of function signatures that are part of the API. This would break existing third party modules. This could of course be mitigated by going to a new major version.
C. But most importantly, given the far-reaching changes proposed, the only advertised advantage is that it avoids storing large swaths of temporary data in the database. This is something that can just as well be avoided with the much less invasive approach suggested in #45/#52.

Here is how: implement batching in the parser. Use $batch->getFilePath() to get a local file to iterate on and avoid pulling the document down over the network on each page load. Use $source->processor->getBatchSize() to create exactly the number of items that the processor will consume on the same page load.
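
For concreteness, roughly something like this on the parser side (a sketch only - getBatchSize() is the proposed method from above, and the byte-pointer bookkeeping on the batch is hand-waved):

public function parse(FeedsImportBatch $batch, FeedsSource $source) {
  $limit = $source->processor->getBatchSize();
  $handle = fopen($batch->getFilePath(), 'r'); // Local file; never slurped into memory.
  fseek($handle, $batch->pointer);             // Hypothetical: where the last page load stopped.
  $items = array();
  while (count($items) < $limit && ($row = fgetcsv($handle)) !== FALSE) {
    $items[] = $row; // In reality: combine with the header row into named sources.
  }
  $batch->pointer = ftell($handle);
  $batch->setTotal(FEEDS_PARSING, filesize($batch->getFilePath()));
  $batch->setProgress(FEEDS_PARSING, feof($handle) ? FEEDS_BATCH_COMPLETE : $batch->pointer);
  fclose($handle);
  return $items; // Or hand them to the batch, depending on where the final API lands.
}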

Am I missing something?

milesw’s picture

alex_b:

I've been trying to get batching working in my fetcher using the patch from #45 (which I think needs to be re-rolled against current HEAD). After staring at FeedsSource::import() and FeedsCSVParser::parse() for far too long I'm hoping you can give a quick description of how it's supposed to work.

When I go to import a single feed source it seems to create new FeedsImportBatches. So if my fetcher batch size is 10, those 10 items are fetched, parsed and processed, then fetch() is called again and a new batch is created. But $batch->total and $batch->progress and any other status saved to $batch have been lost, so my fetcher can't pick up where it left off.

Got any hints?

pounard’s picture

@alex_b

You're right about the API changes; I'm always fond of refactoring :) These are not such big API changes, though, and with a bit of help, module maintainers would port existing code quite quickly, I think.

Iterating over items is good, I think; it forces all processors and parsers to implement a single interface, a single way of doing things. Besides, it keeps batch control in the batch class. Processors should never have to deal with the batch class themselves; that is an ugly design and really limits processor code reusability.

alex_b’s picture

Iterating over items is good, I think; it forces all processors and parsers to implement a single interface, a single way of doing things. Besides, it keeps batch control in the batch class. Processors should never have to deal with the batch class themselves; that is an ugly design and really limits processor code reusability.

I completely fail to see the advantages of the design you propose and the shortcomings you hint at. I say this not to belittle your proposal, but to lay it out very clearly: I do not understand. Please explain, and please be as specific as possible.

It may help to explain my thinking behind the current design. I am going to use bullet points and numbers for ease of reference; none of this is set in stone:

- Importer with a Fetcher, Parser, Processor plugin (FeedsImporter, FeedsFetcher...)
- Source (FeedsSource)
- Batch Object (FeedsBatch)

The responsibilities of these structures are as follows:

1.1 Importers hold the configuration that describes how to import something.
1.2 A Source captures information of what is to be imported.
1.3 A Batch Object holds the state of an ongoing import.

The relationship between these structures is:

2.1 A Source holds a pointer to its importer and a pointer to a batch object, IF there is an ongoing import.
2.2 Pointers from an importer to a source or from a batch object to a source are explicitly avoided to keep relationships simple.

The paradigms that these structures follow are:

3.1 Importers and its Plugins never hold any state of a particular source that is to be imported, thus one importer object can be used to import multiple sources without reconfiguration.
3.2 Fetchers spawn Batch Objects. This gives us the ability to lazy-fetch documents depending on the requirements of the parser. (Before we had that, we ran into problems like this one: the File fetcher would load a document entirely into memory and then pass it on to the CSV parser, which would actually prefer the file path so it can read directly from disk.)
3.3 Fetchers have no knowledge whatsoever of what's in a document, Parsers do.
3.4 There are as few assumptions as possible about what is passed between Fetcher, Parser and Processor. The common pattern right now is: fetch something that's going to contain multiple similar entities, parse them out, then hand them on to the processor to produce one Drupal entity per item in one call. It doesn't need to be that way.
3.5 Sources hold the state of an ongoing import in their batch property and nowhere else.

Anonymous’s picture

subscribing

@rjbrown99, for your XML parser you would need a different approach. You can't use SimpleXML to load the entire feed anymore; you have to switch to XMLReader and then only load the current node into SimpleXML. Read only one node at a time:

$xml = new XMLReader();
$xml->open('file.xml');

// move cursor to first node we need
while ($xml->read() && $xml->name !== 'product');

$doc  = new DOMDocument; // Not actually needed when using readOuterXML() below.
$node = new SimpleXMLElement($xml->readOuterXML());

$node then holds the current node as a small SimpleXML thing. You can move to the next node with:

$xml->next('product');
$node = new SimpleXMLElement($xml->readOuterXML());

-----------------

But the core issue here is: you can't load the entire XML document in the parser. Only single nodes should be held in memory.

I'll try to help support this when I have time, because I'm dealing with a 500MB XML file... and I need to run it live with PHP memory at about 96M-128M max.

pounard’s picture

@alex_b

I'm OK with your design, but I don't really understand why the processor itself would have to fill in the number of elements it updated or created on the Batch class. By centralizing this, you prevent buggy processors from giving wrong results, you factor out a huge part of the code, and you make the work easier for third-party processor developers (and I stand by the point that it would make processors reusable in other use cases).

For parsers you have a point: they can behave differently depending on how each parser works. But still, I don't see why this can't be factored out, at least in part.

Third note: the whole point of batching is doing as little as you can in each operation. If you let the parser do the full parsing itself, it won't know how many elements, and which elements, it has to parse. Your current design forces all parsers to parse the full file on each batch iteration (well, not quite, because you save the internally formatted items array into the database, which is absolutely WRONG; if you want to avoid that with the current implementation, you are forced to load the full file on each operation).

Even for the CSV parser, I can't see why a limit/offset parameter pair couldn't be used.

milesw’s picture

After spending more time working with the #45 patch I see where I was going wrong.

I'm trying to implement batching at the fetcher level because I'm making hundreds of HTTP requests for sometimes large amounts of data. This is all happening in $batch->getRaw(). I first tried to use $batch->setTotal(FEEDS_FETCHING, $total) and $batch->setProgress(FEEDS_FETCHING, $progress) since I consider this the fetching stage.

However, when FEEDS_FETCHING progress is set to anything less than FEEDS_BATCH_COMPLETE, $batch->needsFetching() will be TRUE. This leads to another call to $importer->fetch(), creating a whole new $batch, wiping out any saved progress and leading to a never-ending import.

I managed to get things working by setting the total and progress for FEEDS_PARSING instead. I guess it makes sense since $batch->getRaw() is called by the parser, but it still seems a little awkward.

Anonymous’s picture

I also tested patch #45 now. I used a 500MB CSV file (300K nodes) with FeedsCSVParser and FeedsNodeProcessor -- and it worked! The CSV can now be parsed step by step.

Seems to work really clean.

-edit-

I also noticed the memory was still high. I set PHP's memory to 512M for a 500MB CSV file. I can live with that though.

A similar approach could easily be used for XML parsing: use XMLReader to skip to a start byte, then load that batch of nodes into SimpleXML for processing.

infojunkie’s picture

Attached is an updated patch #45 for today's dev release. It's worth noting that much of patch #45 has already been committed.

robbertnl’s picture

I am not sure how to use the batching support. I updated to the latest dev release and made a processor plugin which extends FeedsProcessor.
I keep track of the progress using a batch size of 100 and $batch->updated++ / $batch->created++, and I end the batch using if ($processed >= variable_get(...)), just like the provided example.
When cron starts, the import starts as well, but it never completes, and I see warnings in the Drupal log like 'Cron run exceeded the time limit and was aborted.' and 'Cron has been running for more than an hour and is most likely stuck.'.

My importer imports about 7000 records, which should result in updating/creating users, user content profiles and organic groups. The import file is an XML file of about 5MB; with the previous Feeds version it resulted in huge memory use (more than 1GB) and a runtime of over 15 hours.

robbertnl’s picture

#67
Hunk #4 FAILED at 155.
1 out of 4 hunks FAILED -- saving rejects to file includes/FeedsSource.inc.rej

***************
*** 144
- $this->importer->processor->process($this->batch, $this);
--- 0 -----

infojunkie’s picture

Re #69: I just tried with feeds-6.x-1.x-dev and there were no patching failures. Please make sure you're using an unaltered version from HEAD.

Anonymous’s picture

Concerning high memory usage, this happens to me only on cron runs. I have many feeds, so I guess the cron scheduler is stuffing them all into memory.

So I think we need to extend batching to the cron scheduler as well - schedule-level batching...

Anonymous’s picture

I can confirm there is a real memory leak in the scheduler system.

In FeedsCSVParser.inc I added a watchdog message to track memory:
watchdog('Memory', round(memory_get_usage() / 1048576, 2) . ' MB', array()); // outputs memory in MB

This is at the top of public function parse(). I have many feeds; for example, 20 feeds are in the schedule. An individual feed can run in under 70MB tops, but using the scheduler the memory slowly builds up like this:

11.78 MB
75.35 MB
129.58 MB
177.04 MB
231.52 MB
287.68 MB
340.49 MB
396.2 MB
452.17 MB
.. etc.

It seems that the scheduler keeps each round in memory instead of flushing it.

Anonymous’s picture

After a lot of investigating, I can only conclude that the cron() scheduler needs batching support.

I have 20+ feeds across 5 importers. The current cron logic in FeedsScheduler.inc causes 25 import runs to be executed (5 importers * a feeds_schedule_num of 5).

But this process is not batched! We've added batch support for parsers and fetchers, but none of that has any effect as long as this still happens in cron() runs:

if ($importers = feeds_importer_load_all()) {
  $num = $this->queue() ? variable_get('feeds_schedule_queue_num', 200) : variable_get('feeds_schedule_num', 5);
  foreach ($importers as $importer) {
    foreach ($importer->getScheduleCallbacks() as $callback) {
      $period = $importer->getSchedulePeriod($callback);
      if ($period != FEEDS_SCHEDULE_NEVER) {
        $result = db_query_range("SELECT feed_nid, id, callback, last_executed_time FROM {feeds_schedule} WHERE id = '%s' AND callback = '%s' AND scheduled = 0 AND (last_executed_time < %d OR last_executed_time = 0) ORDER BY last_executed_time ASC", $importer->id, $callback, FEEDS_REQUEST_TIME - $period, 0, $num);
        while ($job = db_fetch_array($result)) {
          $this->schedule($job);
          // @todo Add time limit.
        }
      }
    }
  }
}

You see, it just foreaches through each importer and executes 5 imports each. This process should be batched as well, or it stacks all import runs into memory.

Anonymous’s picture

Attached is my attempt to do schedule-level batching. Maybe someone can have a look.

rjbrown99’s picture

#73 have you tried Drupal Queue API? http://drupal.org/project/drupal_queue

This is inherently supported in Feeds if you turn it on and could make a difference for you. Notice that the module author is the same :)

alex_b’s picture

#72 say hello to PHP :) - I second #75's recommendation of Drupal Queue.

alex_b’s picture

Back from vacation and DrupalCon.

pounard's feedback made me think a little harder and revisit the current import architecture.

Problems:

- Currently, we don't have batch support on the fetcher and parser level (this is the subject of this patch).
- The FeedsBatch class is a fuzzy concept, capturing both the results of import stages and their status. This complicates the solution to this issue, making more complex batching messy very fast.

Solution:

- Break FeedsBatch into FeedsResult objects and FeedsState objects. This has the wonderful benefit that FeedsResult objects can now be generated by the fetcher and parser stages, making the parser more independent of the fetcher. Further, the resulting code is much cleaner (see the FeedsResult classes in the patch).
- Introduce a FeedsState object that is used to capture the state of an import or clearing operation across page loads. A FeedsSource object holds an array of up to three FeedsState objects, each one for a separate operation such as fetch, parse, process or clear.
- Introduce a FeedsProcessor::getLimit() that allows processors to specify the number of items they can process per page load. Parsers that support batching must query this limit by calling FeedsImporter::getLimit(). This is a clear paradigm change from the initially proposed approach where a parser's results would be buffered until they are worked off by the processor (hat tip to pounard).
- There is no similar getLimit() for the parser; we assume that a parser can always parse more than a processor can digest. If that is not true for some reason, this assumption could easily be adjusted by making getLimit() a little smarter.
- Batching on the fetcher level is supported through the FeedsState object.

This changes the function signature of parse(), process() and clear() slightly:

// Old:
FeedsParser::parse(FeedsImportBatch $batch, FeedsSource $source);
FeedsProcessor::process(FeedsImportBatch $batch, FeedsSource $source);
// New:
FeedsParser::parse(FeedsFetcherResult $fetcher_result, FeedsSource $source);
FeedsProcessor::process(FeedsParserResult $parser_result, FeedsSource $source);

We could make the upgrade to the new signature less painful by using CTools' plugin API's plugin version numbering.

Now you will wonder how the state of an import is being tracked; see this simple example from FeedsNodeProcessor:

public function process(FeedsParserResult $parser_result, FeedsSource $source) {
  // This retrieves a status object for this plugin, status() lazily instantiates one for us if it does not exist yet.
  $status = $source->status($this);
  while ($item = $parser_result->shiftItem()) { 
    $status->created++;
    //...
  }
  if ($source->progressImporting() == FEEDS_BATCH_COMPLETE) {
    drupal_set_message($status->created);
  }
}

In case FeedsNodeProcessor is chained together with a parser and/or a fetcher that supports batching, progressImporting() will not return FEEDS_BATCH_COMPLETE until fetching AND parsing say that they are finished. Until then the $status object retrieved by $source->status($this); will be the same object aggregating any modifications to it.

I have reviewed our fetchers and I am confident that we can avoid any large cached swaths of data with this approach. I would argue that we should force any parser that wants to support parsing _very large files_ to use FeedsFetcherResult::getFile() instead of getRaw() - this will avoid any loading or caching or repeated download of large content. getFile() at most copies a file and then returns a path to that file. For instance, when using the CSV parser together with the HTTP or File fetcher, no large piece of content is ever loaded into memory, rather the imported file is written to the disk from a URL or is present on the disk already. In either case, the fetcher only holds a file path. The CSV parser then also does not load the actual file, but reads from it line by line. In the proposed architecture, the CSV parser would only read as many lines from the file as requested by the processor.

Todo:

- Upgrade all remaining plugins to the new API
- Introduce new API version number and display friendly notes to upgrade older plugins to latest API (we may be even able to write an adapter for that?)
- Test and write tests.

alex_b’s picture

Approach from #77, more cleanups. Many tests still failing.

For newcomers to this thread: #67 is the latest working approach AFAIK (infojunkie correct me if I am wrong). #77 is the way we should go though. It is much cleaner and the more sustainable approach.

#77 however requires us to change the API of fetchers/parsers/processors. What's more, it's a pretty far-reaching refactoring that needs more fine-tuning and tests. For instance, I don't think that the current getProgress() handling in the FeedsSource class is very tight at this point. I'd like to improve the call structures before we go live with it.

At this point, Drupal 7 is drawing nigh and I will have to focus on an upgrade to D7 before pouring more time into full batching support. Once Feeds Drupal 7 is released, we should be working on a patch to 7 and then - if there is enough demand and active contributors - backport it to 6.

alex_b’s picture

Atom's Feed paging and archiving extension is a great example of where Fetcher level paging will be useful:

Feed paging specifies the link rel=next element:

<link rel="next" href="http://example.org/index.atom?page=2"/>

The fetcher can use this information to deliver page after page and only report "done" when there is no next tag anymore.

alex_b’s picture

Rerolled #78 against Drupal 7 branch currently residing in github.

http://github.com/lxbarth/feeds/tree/DRUPAL-7--1-0-alpha1

While this patch has 67k, it only adds 52 lines of code:

27 files changed, 579 insertions(+), 527 deletions(-)

Please review changelog in diff to learn about changes.

I need to open a Drupal 7 branch on the issue tracker.

alex_b’s picture

Status: Needs review » Patch (to be ported)

#80 is now committed to the Drupal 7 branch on github: http://github.com/lxbarth/feeds (see the diff between 7.x 1.0 alpha 1 and 7.x 1.0 alpha 2)

For usage in Drupal 6 it would have to be ported.

Given the sheer size of it, I am tempted to close this issue as won't fix for Drupal 6 - I will set it to 'patch to be ported' though. Let me know if someone wants to tackle this. It's quite tricky as much is moving in the 7 upgrade.

andrewlevine’s picture

I'm very glad this made it into Feeds 7. I tend to agree with alex_b that this is likely not a good candidate for 1.0, as it is a far-reaching change at a point that was supposed to be fairly close to a release (as far as I understand; alex_b, please correct me if I'm wrong).

I am, however, very interested in collaborating on a "contrib contrib" module that works around this deficiency.

Again, if this has a chance of getting into 6.x-1.0, then let's work on getting it done here.

milesw’s picture

I was hoping this news about Feeds 7.x including support was going to push me toward working with D7, but the reality is that I have to stick with 6 for a while.

I would also be interested in working out a solution for 6.x, whether it's a separate contrib or part of Feeds 6.x itself.

rjbrown99’s picture

I have a requirement to get this working in some fashion in 6.x, whether that means helping via coding, testing, or hiring somebody to do it. I haven't had a lot of time to test the most recent patches but I plan to give them a look this weekend. There may be enough there in #67 or #78 to make it work for me. For me Drupal 7 is far into the future so any enhancements to that codebase aren't terribly useful :)

digi24’s picture

I am also in need of batch functions in D6. I tried to apply #78 only to notice that it probably would have been better to start with #80.

I still have some minor problems and some self-made bugs when applying some changes from #80 manually, but generally it seems to work.

alex_b’s picture

None of the patches here are much more than proofs of concept. What needs to happen is a full backport of the Drupal 7 work. That's going to be a serious task and will require us to bump the major version of Feeds (go to 2) because too many APIs change in the process... I will very likely not tackle this backport. If someone wants to step up and take on this task, please get in touch with me on IRC (#drupal).

Steven Jones’s picture

subscribe.

NicolasH’s picture

subscribe

twistor’s picture

I have a working version of this. It's actually a port of the entire 7 version to 6 with enough changes to get it working. The tests are passing with minimal changes. Definitely a version 2 release. I can post a patch in a bit, but it's rather gigantic. Thoughts?

rjbrown99’s picture

My thought is that you should post the patch :) I'd be interested in checking it out.

twistor’s picture

Alright, well, here it goes. The DataProcessor hasn't been touched yet, so it's not working; I'm not too familiar with the Data module. Everything else should be working OK if the tests are any indication. This is a 7-to-6 port.

**Currently there is no upgrade path. This patch should be applied on a Feeds-less install.

twistor’s picture

@Alex, I looked for you on IRC the past couple of days but no luck. I realize this is a big patch; I'll be on IRC as much as I can. Put it on github? Anyway, my justification for this direction is that it synchronizes the two branches so much that ports from 6 to 7, or vice versa, become trivial. That will help the long-term maintainability of both branches.

twistor’s picture

twistor’s picture

Status: Patch (to be ported) » Needs review

For attention-grabbing's sake.

digi24’s picture

Twistor, thanks for the patch. I think it would be a great way to keep up with the developments in 7.2 if there were a 6.2 branch.

I am going to port my Feeds installation to your version over the weekend (I just need a minor database update for feeds_item and the changes to targets_alter) and post a patch for feeds_querypath_parser.

Fabianx’s picture

Issue tags: +patch
FileSize
4.98 KB

@twistor: Great work!

However I needed something now, so I ported the old patch in #67 to 6.x-1.0-beta10.

It also works for me, with FEEDS_PARSER_CSV_BATCH_SIZE set to 1000.

Thanks for the patch, @alex_b - perhaps this is still helpful to some people running into memory limits with the CSV parser.

Patch for 6.x-1.0-beta10 attached.

Best Wishes,

Fabian

digi24’s picture

FileSize
1.33 KB

Twistor's patch from #93 works for me. I have tested it with my setup, which differs slightly from a plain install, but I think that should not matter. It would be good to see a feeds-6--2 branch so other plugins could be adapted to the new API. And feeds-7--2 users could profit too, as plugins currently only available for the Drupal 6 branch could easily be ported to 7.

I have it working with FeedsQueryPathParser (required patch attached) and FeedsTamper (Twistor probably already has a patch for this).

As a side note, I think FeedsBatch.inc and FeedsFeedNodeProcessor.inc are not removed by the patch; I haven't looked into it, though.

twistor’s picture

@digi24, Thanks for the review! Thanks for the patch as well, but that needs to go in the feeds_querypath_parser issue queue.

Ayesh’s picture

#96 is working like magic!
However, I'm getting a status report for every five rows imported:

Created 5 X nodes.
Created 10 X nodes.
Created 15 X nodes.

jamesdixon’s picture

Thanks to everyone who worked hard to get large-batch support going for the D6 Feeds version! #96 saved my life and worked with 6.x-1.0-beta11+9-dev.

I also had a problem where the % processed wasn't calculated correctly and the first 50 nodes would be created over and over:
http://drupal.org/node/1139376

Magically, if my file was under 2,218 lines I wouldn't get this problem. Changing server environments made the problem go away.

karengrey’s picture

There's been a lot of help in this thread, but I still need some advice:

I am using xpath_parser to import XML files to my site, which works well. I have now been given a 40MB XML file with about 6,400 nodes inside. I get a 500 error as soon as I try to import the file; I'm pretty sure the problem lies with xpath using SimpleXMLElement to load the whole file first, which eats all the memory.
I took one node out of the huge file to test the memory with that. It started at 88MB and finished at 116MB. That's an awful lot of memory for one node, right?

Can someone help me write a parser that uses XMLReader so it passes just one node at a time instead of the whole document? An alteration to xpath_parser would probably be ideal, as that would improve the module for everyone trying to import large XML files without having to write stuff from scratch!

Thanks :)

pounard’s picture

When working with big XML files, you should consider writing your own XML parser based on a SAX-style implementation such as the XMLReader library (http://www.php.net/manual/en/intro.xmlreader.php). This allows you to parse the document incrementally without loading it completely and, in theory, lets you parse huge documents using very little memory.

For CSV files, to achieve the same goal you'll need a CSV reader that uses fgetcsv but doesn't cache or buffer results. I wrote this one myself: https://github.com/pounard/fat-csv-reader/blob/master/lib/File/CSV/Smoot... Its memory consumption is stable even for CSV files with a million+ lines.

If you still experience memory leaks, then Feeds still has a huge problem. I would recommend you either start implementing your own processors that force some Drupal static caches to be cleared, or use the Migrate module, which seems to take care of those problems.

Fabianx’s picture

@Ayesh: Change the 5 to 1000 in the code, where the define is:

+// Number of lines to parse in one page load.
+define('FEEDS_PARSER_CSV_BATCH_SIZE', 5);

in sites/all/modules/feeds/plugins/FeedsCSVParser.inc.

@jamesdixon: Great that it's still working for you :-).

It makes me kind of proud that this patch is still working one year after its creation :-).

chingis’s picture

FileSize
5.01 KB

Attached is a patch for 6.x-1.0-beta12.

PQ’s picture

I know this is a very old issue, but it seems to have dropped off the radar without ever really reaching a conclusion... Is batching of fetchers/parsers likely to be supported at any stage?

Apologies if the discussion has continued in other issues, but I couldn't find anything elsewhere in the queue.

MegaChriz’s picture

Version: 6.x-1.x-dev » 7.x-2.x-dev
Status: Needs review » Needs work
Related issues: +#1363088: Feeds batch process not fully implemented - needed for large imports

There is an issue that looks at least related: #1363088: Feeds batch process not fully implemented - needed for large imports

At this point it seems unlikely that batching of fetchers and parsers is going to be supported, because some parsers need to parse everything at once. The CSV parser has sort of implemented batching: it only parses x items at a time, where 'x' is the value set in the variable 'feeds_process_limit'.
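
For example, to let each batch pass handle more items, that variable can be overridden in settings.php (100 is just an illustrative value):

$conf['feeds_process_limit'] = 100;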

From that other issue, there is one useful comment (#5) that says this:

Theoretically, this would be fine. Fetchers and parsers would respect the processor's limit, and only give it what it could handle. In practice, this doesn't work at all. Most fetchers fetch 1 item, and ignore batching altogether. LOTS of parsers HAVE to parse everything at once. There is a limited number of parsers that can batch. CSV is one, because it can read from a file. Most XML based parsers use the DOM, and parse everything at once.

imclean’s picture

Would it be better for individual parsers to take care of their own batching? There may be different requirements for different parsers so a generic solution may not be the best, or even possible.

An XML parser could possibly use XMLReader: http://php.net/manual/en/book.xmlreader.php

It seems someone has already done this: https://www.drupal.org/sandbox/darthsteven/1728218

Does this work for large XML feeds? #1397434: Implement batch parsing.

twistor’s picture

Version: 7.x-2.x-dev » 6.x-1.x-dev
Status: Needs work » Closed (outdated)
Issue tags: -patch

Moving this back to 6.x

The 7.x side of this has been solved for quite some time.

See https://www.drupal.org/node/1363088#comment-11303269 for details.