Especially for importing feeds and for deleting the nodes that feeds create through the UI, we should use the Batch API.

- Ideally we can implement this on the FeedsImporter level
- Keep the overhead for FeedsPlugins low
- Make sure fallback with JS is clean
- How can we do incremental import?
-- CSV parser could read X items from the file
-- XML parser could parse the whole file, but pass on only X items
-- Both would require an internal iterator that can be incremented and reset.
-- Batching may be supported by only some parsers.
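Such an internal iterator could be sketched roughly like this (the class and method names here are purely illustrative, not part of the Feeds API):

```php
<?php

/**
 * Sketch of a resettable internal iterator a parser could use to hand
 * off X items per batch. Names are illustrative only.
 */
class ParserIteratorSketch {

  protected $items = array();
  protected $pointer = 0;

  public function __construct(array $items) {
    $this->items = $items;
  }

  /**
   * Return the next $limit items and advance the internal pointer.
   */
  public function shiftItems($limit) {
    $batch = array_slice($this->items, $this->pointer, $limit);
    $this->pointer += count($batch);
    return $batch;
  }

  /**
   * TRUE once all items have been handed off.
   */
  public function finished() {
    return $this->pointer >= count($this->items);
  }

  public function reset() {
    $this->pointer = 0;
  }
}
```

A CSV parser could fill such an iterator lazily; an XML parser would parse the whole document once and then hand off slices.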

Comments

Version: 6.x-1.x-dev

A patch for a more generic approach is here: #626352: A transporter interface to allow for batching

Status: Active » Needs work
Attachment: new file, 17.06 KB

This is an alternative approach to #626352

- Minor modifications to support batching in FeedsImporter and FeedsProcessor
- Implements batching for FeedsNodeProcessor
- Batches node import on stand-alone forms. To try it, use the Node import importer from Feeds defaults and upload the test file feeds/tests/feeds/many_nodes.csv.
- Leaves door open to support batching on parsing stage with a future patch.

Open:

- Use batching on imports triggered from the node form or from the Import tab on nodes.
- Use batching for clear().
- Clean up $source->cache. Contextual information for batches is kept with the source. The approach is OK, but the handling of the cache is cumbersome and needs cleanup.
- Support batching on cron. The whole reason this patch does not rely too heavily on Batch API is that batching must work with both Batch API and FeedsScheduler. There is a @todo in FeedsScheduler on the concrete steps that need to happen. A reasonable stopgap is in place for now, so this patch could go in without actually implementing batching on cron.
- Write tests.

Reviews welcome. I'm setting this issue to NW as there are a couple of known open tasks.

I'll try to take a look at this as soon as possible. My schedule hasn't allowed me to invest as much time as I wanted, lately.

That said, I'd really like to have the batch engine pluggable, as in my first patch. But if we have reasonable cron support, that may be OK :) I'll try to take a look at it soon.

Attachment: new file, 25.92 KB

This patch expands on #2; it is not finalized yet.

- Batching on import/delete on stand-alone and node-embedded import forms. Works beautifully.
- Updated/created user feedback is cached and computed correctly, so that users see how many nodes have been created/updated in total, not just in the latest batch.
- Caching of batches is improved, though not ideal yet; this could be simpler. I'm also starting to dislike shoving the cache into the feeds_source table, since feeds_source is otherwise completely stateless. This may be a slippery slope.
- The actual handling of the batch is neatly separated between FeedsImporter::import() and the actual processor. Reading FeedsImporter::import(), one can see how easily batching could be expanded to the parsing stage if need be.

Todo:

- Finalize batching for clear and expire - I've only started this.
- Further simplify caching.
- Implement batching on cron: I did more thinking on this, and it shouldn't be a big deal to make it work. We basically have to set the last_scheduled_time flag not before importing but after importing, and only if the import finished. That way FeedsScheduler will pick up unfinished sources on the next cron run.
- Write/modify tests.
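The cron idea above could be sketched like this (the function, method, and property names are illustrative only; they are not the real Feeds/FeedsScheduler API):

```php
<?php

/**
 * Sketch: only advance last_scheduled_time once an import actually
 * finished, so FeedsScheduler re-queues unfinished sources on the
 * next cron run. Names are illustrative, not the real Feeds API.
 */
function feeds_scheduler_sketch_work($source) {
  // Work off one batch; assume import() reports whether it completed.
  $finished = $source->import();

  if ($finished) {
    // Only now mark the task as done for this period.
    $source->last_scheduled_time = time();
    $source->save();
  }
  // Otherwise leave last_scheduled_time untouched; the scheduler will
  // pick this source up again on the next cron run.
}
```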

Attachment: new file, 18.96 KB

This patch builds on #641522-15: Consolidate import stage results and provide generalized enclosure class.

- Simplified batching/caching of batches.
- Implement batching for clear()

Open:

- Implement batching on cron: I did more thinking on this, and it shouldn't be a big deal to make it work. We basically have to set the last_scheduled_time flag not before importing but after importing, and only if the import finished. That way FeedsScheduler will pick up unfinished sources on the next cron run.
- Implement batching for expire() (not 100% necessary for the first commit).

Attachment: new file, 56.67 KB

- Added batching on cron
- Added source-level locking

Todo:

- batching for expire() (will require abstracting out source-level locking, as expire is importer-level)
- further testing

Quick comments -

Functions have changed. Gone are FeedsFetcherResult and FeedsParserResult. This required me to make a few small changes to my custom parser. FeedsOPMLParser.inc provides a good example of what changed and how to update your module. Anyone testing the patch with a custom parser: make sure you port it (or be prepared to), otherwise you will get a WSOD.

So far so good. I have only imported a few nodes but it's working. Assuming that I can fix #686470: Filefield mapper: URLs with spaces not parsed properly I will probably try an import of over 1000 items this weekend. My previous imports would break at about the 500 item mark so this will be a good scalability test of batch for me.

I have now imported up to 1,300 nodes in a single session, and it no longer breaks with memory errors like it did prior to the Batch API patch. Sweet. I actually have two feed import nodes going now: one with ~150 items (an 18 KB XML file), and one with 1,300 items (a 178 KB XML file).

My initial comments are:

1) Visually it stays at "Initializing" for a while, moves to "Remaining 1 of 1" with no blue progress bar and 0% completion, and then eventually finishes. It is definitely parsing and creating nodes during the "Remaining" step even though it stays at 0%. It may be expensive or impossible to tell the user accurately where the import is, but perhaps the Initializing and Remaining text could be changed to let them know it might take a while. It may also be a good idea to capture the start time in a variable and, when all is done, show the user how much time the import took. Mine are still taking a pretty long time, and I'm not sure whether speed-wise this will scale to where I need it to go (130,000+ items in a 28 MB file, re-parsed daily for changes).

2) There seems to be no way to cancel a batch import after you start one.

I initially had an auto_nodetitle error, but that was my goof using a raw token instead of a formatted one.
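The elapsed-time suggestion in (1) could be sketched with core Batch API callbacks like this (the mymodule_* callback names are hypothetical; only batch_set(), $context['results'], and the 'finished' callback signature are standard Drupal 6 Batch API):

```php
<?php

/**
 * Sketch of timing an import via Batch API. Callback names are
 * hypothetical and not part of the committed patch.
 */
function mymodule_import_batch_start() {
  $batch = array(
    'title' => t('Importing'),
    'operations' => array(array('mymodule_import_batch_op', array())),
    'finished' => 'mymodule_import_batch_finished',
  );
  batch_set($batch);
}

function mymodule_import_batch_op(&$context) {
  if (!isset($context['results']['start'])) {
    // Record the start time on the first pass only.
    $context['results']['start'] = time();
  }
  // ... work off one chunk of the import here ...
}

function mymodule_import_batch_finished($success, $results, $operations) {
  $elapsed = time() - $results['start'];
  drupal_set_message(t('Import finished in @sec seconds.', array('@sec' => $elapsed)));
}
```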

subscribing

#698356: Allow other scheduled tasks than 'import' and 'expire' will break this patch. The conflicts *should* be limited to the formatting level; there is no conflict of concept. Indeed, #698356 will make it a tad easier to accomplish batching for expire(), as it creates a single point of entry for working off a task (FeedsImporter::work()).

Attachment: new file, 968.35 KB

- Updated after commit of #698356: Allow other scheduled tasks than 'import' and 'expire'

Todo:

- some tests still failing
- batching for expire() (will require abstracting out source-level locking, as expire is importer-level)
- further testing
- Make the many_items.rss2 test file much smaller

Issue tags: +beta blocker

This issue should be included in the first beta release.

Did not test nor really review, just a quick note :

+++ feeds.module 1 Feb 2010 21:06:31 -0000
@@ -370,6 +372,71 @@ function feeds_scheduler_work($feed_info
+  $context['finished'] = FALSE;

Instead of FALSE (0), you should provide a float value between 0 and 1 (1 meaning "finished, don't call me back"), giving a notion of completion level.
This avoids your users staring at a progress bar stuck at 0% for the whole process until it jumps to 100%.

Also, $context['finished'] = TRUE; (or 1) can be omitted, that's the value Batch API assumes if none is set.
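That suggestion can be sketched as a Batch API operation callback like this (the callback name and the total/pointer bookkeeping are illustrative; only $context['sandbox'] and $context['finished'] are standard Batch API):

```php
<?php

/**
 * Sketch of a Batch API operation reporting fractional progress.
 * The bookkeeping in $context['sandbox'] is illustrative.
 */
function mymodule_batch_operation(&$context) {
  if (!isset($context['sandbox']['total'])) {
    $context['sandbox']['total'] = 700;   // e.g. number of items to import
    $context['sandbox']['pointer'] = 0;
  }

  // ... process one chunk here, then advance the pointer ...
  $context['sandbox']['pointer'] += 50;

  if ($context['sandbox']['pointer'] < $context['sandbox']['total']) {
    // A float between 0 and 1 drives the progress bar.
    $context['finished'] = $context['sandbox']['pointer'] / $context['sandbox']['total'];
  }
  // When done, $context['finished'] can simply be left unset;
  // Batch API then assumes 1 (finished).
}
```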


Status: Needs work » Needs review
Attachment: new file, 187.27 KB

yched - thank you, this was helpful. I've incorporated your suggestions, even progress tracking. I really had wanted to dodge this one.

Here's the new patch; it addresses all issues from #12 except locking. Locking needs more fleshing out and is a larger task than I had thought. I'd like to deal with it in a separate patch: #640508: Lock sources before importing/expiring/clearing them.

All tests passing. I'm tempted to commit this patch here very soon.

Status: Needs review » Fixed

Committed - for follow ups please post new issues.

FWIW, I imported 9000+ items to nodes with the patch, works great.

@alex_b
Thank you for this patch. My 20 MB CSV is no longer a demon!

I'm seeing some really odd behavior with this and I'm having a hard time debugging. To be fair, I need to go back and test this with stock feeds before I can really open an issue, but I wanted to get just a general sense on a couple of things I'm seeing.

In this use case I have a 700-line CSV file. The file is imported twice by two different stand-alone feeds (no feed node) to populate two different content types.

In the first case, each line results in a node creation, as each line is unique. This creates user profiles containing a subset of the CSV data, using a GUID unique to each line. The import works fine through the Batch API implementation; however, the progress bar does not update, as the properties of $batch are not being updated. At the end of the import it says that 50 nodes were created, although it actually created 700.

In the second case we're using a different set of columns from the CSV to create a related organization. Each line contains information about the organization the member belongs to, but many members belong to the same organizations, so rather than 700 nodes we create 150, using a different column as the unique value. While not really efficient, it works. Or it did before the Batch API implementation.

This second case behaves very oddly. The import starts and the batch is initialized. A small set, usually 10 nodes, is created. Then those same 10 nodes are updated continually, and it never moves past them to the rest. I think this may be loosely tied to the $batch-properties issue in the previous example, but with more severe consequences: the import gets thrown into an infinite loop and fails.

So, right now my question is: how fully tested is this Batch API implementation? Does anyone have thoughts on what is preventing $batch from being updated? I did notice that if I added the public properties of FeedsBatch to FeedsImportBatch, things worked a bit better (I could now see those values in the serialized column in the feeds_source table). However, it did not solve the other issues. I'm wondering whether part of the issue goes back to the fact that $batch contains $batch->items, which may be causing a problem with the serialization?

#19: First off: Batch API needs more testing. Thanks for your concise report.

At the end of the import it says that 50 nodes were created, although it actually created 700.

Looks like $batch->created in FeedsNodeProcessor::process() is starting over on each page load. This indeed points to a problem that could be the same that you describe next:

A small set, usually 10 nodes is created. Then those same 10 nodes are updated continually and it never moves past to the rest.

There is an issue for this already here: #712304: Batch import does not continue where it left off, instead starts from the beginning.

I would start debugging in FeedsSource::import() by printing information about the results of $this->importer->processor->process() to the error log.
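A low-tech way to do that logging (a sketch only; the process() call and the $batch properties here are taken from this thread, not verified against the patch):

```php
<?php
// Inside FeedsSource::import(), around the processing stage (sketch):
$this->importer->processor->process($batch, $this);
error_log('feeds batch: created=' . $batch->created
  . ' updated=' . $batch->updated
  . ' items left=' . count($batch->items));
```

If created/updated reset to 0 on every page load, the batch state is not surviving between requests.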

Thanks Alex. I did spend some time debugging yesterday and was frankly unsure why it wasn't working. I'll look into this again with fresh eyes and your latest commits.

In my testing I wondered whether it made sense to keep a counter like $batch->deleted for processing, perhaps $batch->processed. I would think this would be more reliable for the return calculation, since (especially in my second example) $batch->created + $batch->updated does not always equal how far through processing you are, as not every row will produce an updated or created node.

If you think this makes sense I'll include it in any work I might put into this.
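For reference, the processed-counter idea could look roughly like this ($batch->processed and $batch->total are hypothetical properties, not part of the committed patch):

```php
<?php
// In the processor loop, count every source row, regardless of
// whether it produced a created, updated, or skipped node (sketch):
$batch->processed++;

// Progress derived from processed/total cannot undercount the way
// (created + updated)/total does when rows are skipped:
$finished = $batch->total > 0 ? $batch->processed / $batch->total : 1;
```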

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

subscribe

#23: This issue is closed - there is nothing to subscribe to.