My problem

I've recently come across a client that wants to index a product registry for their website. The registry is a CSV file and contains almost 8000 rows with 26 columns. 10 of the columns is mapped into different taxonomy and CCK fields with the Feed API Mapper. This is a huge feed with a complex configuration that Feed API can't handle due to PHP's max_execution_time and memory_limit (I've tried with a different kind of setups).

FeedAPI has support for internal batching. But that isn't implemented everywhere and makes the code a little complex. Nor will it be effective enough on huge feeds like the one above.

My solution

The optimal solution would be to implement the Batch API for both the parsing and processing. But implementing it at the parsing level is hard without too much hacking. After some simple benchmarking I found that roughly 95% of the time is spent on the process level. And implementing the Batch API at the process level is possible and would result in a much smoother experience.

The only real problem with implementing the Batch API at the process level is to run the batch_process() during hook_cron(), as the redirect will mess up the cron flow. But I solved it by registering batch_process() as a shutdown function, that runs after all cron jobs.

Summary

So where does this leave us? Well, all items will always be parsed and processed on every refresh. Parsing is done without any batching or other timeout controls. Processing is done completely with the Batch API. The only drawback with this is that the user needs to sit trough the batching if the cron job is manually invoked.

Tests

I've done some simple tests of this approach on my local machine. I made a CSV file with 5 columns and 10,000 rows. Then I mapped those columns with the Feed API Mapper to node title, node body, one taxonomy field and two different CCK text fields. When hitting refresh it takes about 5 - 6 seconds to parse the CSV file. Then the Batch API kicks in with the processing which takes about 4 minutes.

The parsing needed at least 128 MB in memory_limit. The max_execution_time was never a problem as only 5 - 6 seconds was spent inside Feed API. With the Batch API handling all processing, this could scale very well until the parsing becomes a problem. But I think my simple tests proves that wont happen unless the document is freaking huge.

The patch

I've tried to make this first patch as clean as possible, at this early stage. Some functionality has been removed just to get the Batch API up and running in a clean way. But my intention is to land this with all previous functionality.

Functionality missing in this first patch:

  • hook_feedapi_after_refresh() isn't called. The problem is that the batch operations are missing contextual information about the process. Making use of &$context['sandbox'] inside the batch operations will solve this.
  • No stats are recorded. The problem is the same as above -- lack of contextual information for the batch operations.
  • The counter variable isn't being used in hook_cron() (was this variable used in any useful way, anyway?).

Things the patch still needs to do:

  • Remove all use of the half_done flag, as that isn't used any more.

So, go ahead and burn me if I'm doing the wrong thing here! ;)

Comments

dixon_’s picture

Status: Active » Needs work
StatusFileSize
new8.39 KB

And here is the initial patch mentioned above.

If the patch is interesting I'll try to keep it up speed with the 6.x-1.x-dev branch. At least update the patch once a week or so.

aron novak’s picture

Really amazing!
My ideas here:

  • About hook_feedapi_after_refresh() : what about appending this to the last item of the batch queue? This needs an additional function.
    To do this, where do you need contextual information? I see that it needs to be executed after all of the previous tasks in the queue was finished and the $feed variable should be available.
    $func = $module .'_feedapi_after_refresh';
    $func($feed);
    
  • About the stats: yeah, if we're doing the updates in this way, a permanent location is needed to store these data. Stats are really important.

All in all, this seems to be a great idea!

dixon_’s picture

StatusFileSize
new9.27 KB

Okey, here is another roll.

I didn't use the Batch API as effective as I could do in the previous patch. Now, instead of creating one batch operation for each item, we iterate over the same operation. This will hit the batch table in the database much less. This also allows us to use the &$context variable provided by the Batch API. This is because &$context only lives through iterations of one operation, not between separate operations.

This patch has brought back hook_feedapi_after_refresh(). I haven't had time to fix the stats for now. But it's totally doable.

The function documentations still needs to be updated to reflect the changes.

dixon_’s picture

StatusFileSize
new9.27 KB

Minor code style mistake in the previous patch.

dixon_’s picture

Status: Needs work » Needs review
StatusFileSize
new14.05 KB

Here is a more complete patch:

  • Now we record stats and set messages properly.
  • Function documentations is updated.
  • The half_done flag is removed everywhere.
  • An update function is created to drop the half_done column.

This i still to be done:

  • Record the processing time for the stats. The operation iterations is split over several page request, so timer_start() and timer_read() won't do it here. So whats the cleanest way to solve this?

I have done some testing of this patch and it seems to work pretty well. But I haven't run Simpletest on it yet. I mark this for a review if someone else got time.

dixon_’s picture

Oh, I forgot one thing ... All the code for the "cron percentage time" should be removed as this isn't used any more.

Anyways, a review of the patch so far, is still a good thing.

dixon_’s picture

StatusFileSize
new16.26 KB

Okey, here is a patch fixing the "cron percentage" stuff mentioned in my last comment.

Still to be fixed:

Record the processing time for the stats. The operation iterations is split over several page request, so timer_start() and timer_read() won't do it here. So whats the cleanest way to solve this?

alex_b’s picture

Status: Needs review » Closed (won't fix)

As discussed on IRC, the show stopper for using batch API for feed processing is that it assumes a browser with either JS or meta refresh support on the other end. This would kill using drush for refreshing feeds.

Further, I think it does not attack the problem at its core where FeedAPI starts a $half_done feed from scratch because it cannot keep a pointer of where it left off.

alex_b’s picture

Title: Improvements to refresh batching » Use Batch API to improve refresh batching

Clarify title.

dixon_’s picture

Yes, alex_b is right. I totally missed that point of view.

Then, the way to go is probably to cache all unprocessed items in the database. And have _feedapi_invoke_refresh() prioritize them during the next refresh. Or something like that.

alex_b’s picture

dixon_: I thought more about this problem.

While you said that importing chokes on the processing level and not on the parsing level, I think the parsing level, and more specifically, the fetch (read) level is where batching needs to happen.

Ultimately, this is the very front of the import process and thus the best place to batch. If we add batching somewhere in the middle of the import process (i. e. the processing stage) we don't attack the problem at its core. We might at some later point find out that really, really huge files need to be imported and then we wind up with a scenario where we have batching on more than one level.

I've tried to solve such cascaded batching with some pseudo code yesterday and it inevitably leads to a mess.

Further, we would waste resources by importing a huge but parseable file, even loading it from cache, to then let the processors nibble from it in a batch cycle.

Thus, I think batching should happen in parsers.

We will still have to download full HTTP documents, we will even have to parse them completely, but the parser should only pass on as many items as the rest of the stack can take. This will need to be some sort of setting with a good default.

It's getting much more interesing when we're talking about files. CSV files can be read line by line and fetching and parsing can be batched easily. This would allow us to process extremely big files, gigabytes of data if need be.

In regards to parsing extremely big files, have a look at killes' patch over here: #548312: Allow feeds to be collected from files

dixon_’s picture

alex_b: Yes, I guess you are theoretically right. But I don't know if it would make much sense in the real world...

The CSV file I talkes about in my issue description (1,6 MB with 8000 rows and 26 columns) takes some 5 - 6 seconds to parse and about half an hour (!) to process with the Batch API patch above. It's on the process level that the whole Drupal machine kicks in with hook_nodeapi() etc. And god knows how many modules that can hook into hook_nodeapi().

What I'm trying to say is that batching at the parser level isn't needed in 99% of the cases. Thus, it's not a bad idea if we were talking about gigabytes of data ofcourse, but thats a freeeeaking huge CSV file :)

alex_b’s picture

#13: I completely see your point.

in the scenario I'm describing in #12, there would be a default setting for how many feed items can be parsed at once - a 'chunk size setting'. It would be pretty low, something like '20'.

With such a default setting, we could guarantee that pretty much any processor set up would be able to complete within PHP execution time (this is _always_ a guess btw). While we wouldn't break most feeds (news feeds with less than 20 items) into chunks.

Further, this setting would only limit how big a chunk FeedAPI can bite off at once, not how many.

If users need more performance or if the processor set up is particulary heavy, this setting can be tuned.

Obviously, the disadvantage of this approach is that there is a setting that needs to be tuned in special cases - but I would argue those are very few cases. On the upside, it keeps FeedAPIs import procedure simple and flexible by moving the slicing of feeds all the way to the front of the process.

In a first iteration, chunk size setting could even be a parser_csv internal setting. You could read N lines from the CSV file, parse it, pass it on to FeedAPI and cache the position of the file pointer. Next time FeedAPI calls you, you keep reading where you've left off until you hit the end of the file where you reset the pointer to the beginning of the file.

parrottvision’s picture

subscribing