We should fold batch progress reporting into the FeedsBatch class, to make it easier for implementing processors to support batching and to make future support for parser-level processing cleaner.

Specifically, I'd like to suggest replacing the return value of process() with a status on the batch object. Patch coming.

Comments

alex_b’s picture

Status: Active » Needs work
FileSize
10.13 KB

Breaks tests currently. Look at FeedsNodeProcessor::process() for improvements:

return (1.0 / ($batch->total + 1)) * ($batch->updated + $batch->created); // Add + 1 to make sure that result is not 1.0 = finished.

becomes obsolete and

return FEEDS_BATCH_COMPLETE;

becomes

$batch->complete();

The latter may need to change to $batch->complete('processing'); to distinguish between the statuses of the separate stages: fetching, parsing, and processing.

Proof of concept, breaks tests.
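To make the direction concrete, here is a minimal sketch of what stage-aware completion could look like; the class body, constants, and method names are assumptions based on this comment, not the actual patch.

```php
<?php

// Hypothetical sketch only: these names are inferred from the discussion,
// not taken from the Feeds patch.
define('FEEDS_BATCH_COMPLETE', 1.0);
define('FEEDS_PROCESSING', 'processing');

class FeedsBatch {
  // Progress per stage, keyed by stage name; a float in [0, 1.0].
  protected $progress = array();

  // Mark a stage as finished, instead of the processor returning a
  // magic value from process().
  public function complete($stage) {
    $this->progress[$stage] = FEEDS_BATCH_COMPLETE;
  }

  public function getProgress($stage) {
    return isset($this->progress[$stage]) ? $this->progress[$stage] : 0.0;
  }
}

// A processor would then call this instead of `return FEEDS_BATCH_COMPLETE;`.
$batch = new FeedsBatch();
$batch->complete(FEEDS_PROCESSING);
```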

alex_b’s picture

Status: Needs work » Needs review
FileSize
13.69 KB

Slightly improved approach.

- Takes into account that fetching, parsing, processing may have different completion levels at some point.
- Removes any responsibility from fetchers, parsers, processors to set a progress status if they do not implement batching.

alex_b’s picture

- Make setting the total number of items to be processed the processor's responsibility. This requires less 'insider knowledge' about the batch object on the processor's part.

alex_b’s picture

Ignore patch in #3.

lyricnz’s picture

Status: Needs review » Needs work

Some (pedantic) comments on #4:

- imho, there should not be a default second argument on this - completion should be set explicitly (none of the code omits this parameter anyway)

setProgress($stage, $progress = FEEDS_BATCH_COMPLETE)

- variable initialization in getProgress() should be moved into the else branch.

- actually, using array_sum() rather than a loop for the following code might be a little more efficient, and not require initializing that $total variable

+      foreach ($this->total as $t) {
+        $total += $t;
+      }

- code sample above class FeedsImportBatch says:

$batch->setProgress(FEEDS_PROCESSING, $batch->created);

but this doesn't match the prototype ($batch->created is a count, not a fraction)

  /**
   * @param $stage
   *   The stage to set the progress for. One of FEEDS_FETCHING, FEEDS_PARSING,
   *   FEEDS_PROCESSING or FEEDS_CLEARING.
   * @param $progress
   *   A float between 0 and .9999 that indicates the progress on the given
   *   stage, or FEEDS_BATCH_COMPLETE if the stage is complete.
   */
public function setProgress($stage, $progress = FEEDS_BATCH_COMPLETE) {

This same comment is also duplicated in the FeedsNodeProcessor.inc code???

- suggest members in the FeedsImportBatch constructor are initialized in the same order as the fields are declared, to make it more obvious they are all covered? (super pedantic)

... Do the tests work?

lyricnz’s picture

2659 passes, 46 fails, and 3014 exceptions (but my setup makes noise about notices)

pounard’s picture

If you want to be able to do incremental fetching and incremental parsing, either the batch must carry references to the fetcher, parser and processors, or the FeedsSource class must remain the master. But if you choose a "strong" batch while keeping track of all instances in FeedsSource, you'll have a really weird pattern here, and you may not be able to do such things.

I would rather say that, if the batch remains only a carrier for the current Drupal batch data, it *must not* contain the items array. I tried to import a 40 MB file; if on each Drupal request I attempt to update a 40 MB SQL row, that is quite the worst nightmare for a sysadmin. Worse, that 40 MB would be bigger than the rest of the Drupal database, and doing this expands the physical SQL storage files without any way to shrink them afterwards, wasting a lot of disk space forever. Think about it twice before storing items in the database before processing.

The whole point of batching is performance and low resource consumption. If you store items in the database, you do exactly the opposite of what you want to achieve. Please consider incremental fetching, and you'll see that your batch class does not have much left to do: if you are able to fetch incrementally (or, if the fetcher can't, parse incrementally instead), then the batch class has no use beyond carrying the offset. EXCEPT if you get rid of the FeedsSource class and merge it with the batch class.

EDIT: I would rather find the best algorithm for incremental fetching before revamping the batch class itself, since this class depends on the design, not the other way around. Find the right design for the interaction between fetchers, parsers and processors before reworking the batch class, or you'll have some surprises.

RE-EDIT: HTTP requests fetching an XML feed over the internet may be a lot slower than loading a local file, so what I said above about performance may not be true in that case. For batched HTTP feed fetching, you may consider storing the result in a physical file rather than in the database, which could be a lot better. I'm thinking about HA environments where there is more than one Drupal frontend: having a single physical machine working on a file is better than one hammering the shared database, because if you slow down the database you hurt performance for all frontends.
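A minimal sketch of that file-based idea, under the assumption of invented helper names (feeds_batch_store_raw() and feeds_batch_read_chunk() are not part of Feeds): only the small file path would travel with the batch row, and later passes re-read just the slice they need.

```php
<?php

// Hypothetical helpers, not the Feeds API: keep the raw fetched feed on
// disk instead of serializing megabytes into a SQL row on every request.
function feeds_batch_store_raw($raw, $dir = '/tmp') {
  // One temp file per import; only its short path is stored with the batch.
  $path = tempnam($dir, 'feeds_');
  file_put_contents($path, $raw);
  return $path;
}

function feeds_batch_read_chunk($path, $offset, $length) {
  // Later batch passes re-read only the byte range they need.
  return file_get_contents($path, FALSE, NULL, $offset, $length);
}

$path = feeds_batch_store_raw('<items>...</items>');
$chunk = feeds_batch_read_chunk($path, 0, 7); // first 7 bytes: '<items>'
```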

lyricnz’s picture

Agree about not loading the entire batch into memory on every iteration (I noticed this too, and it's pretty horrible - I was importing very large files, which created a memory-management nightmare, especially with the PHP 5.2 memory leak issues).

Simplistically, it seems:

- the fetcher should run once and store the file somewhere locally. Maybe one day this could be divisible and resumable, for consistency with the two stages below and for processing really big data sets.

- the parser should be able to run in fractions, parsing the feed into batch items (say we have a 100 MB file; it may not be possible to parse it in one chunk, so we should support resumable parsing)

- the processor continues to run lumps of batch items, reading no more than the number of batch-items required for processing
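The three stages above could be sketched roughly as follows; run_import() and the array-based "file" are invented for illustration and are not the Feeds API:

```php
<?php

// Illustrative sketch of fetch-once / parse-in-fractions / process-in-lumps.
// The fetched "file" is simulated by an array of raw lines.
function run_import(array $raw_lines, $batch_size = 50) {
  $processed = 0;
  $offset = 0;
  do {
    // Parse only a fraction of the fetched file (resumable parsing).
    $items = array_slice($raw_lines, $offset, $batch_size);
    // Process exactly this lump of batch items.
    $processed += count($items);
    $offset += $batch_size;
    // A short final slice means the end of the file was reached.
  } while (count($items) === $batch_size);
  return $processed;
}

// 120 "lines" processed in lumps of 50: 50 + 50 + 20.
$total = run_import(range(1, 120), 50);
```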

pounard’s picture

@lyricnz: Totally agree with you. Actually, I think I'll finish a working patch that does all this by tonight (not that I have much choice; I have a deadline here). The patch adds incremental support for everything, including the fetcher (which could be useful in some use cases), in a generic manner where the batch class is the master of the whole operation, based on alex_b's initial patch.

pounard’s picture

While attempting to run the tests, I got this error:

[Tue Jul 20 12:50:58 2010] [error] [client 127.0.0.1] PHP Fatal error:  Call to undefined function _drupal_get_last_caller() in /var/www/drupal-installation-profiles/testing/modules/contrib/simpletest/drupal_web_test_case.php on line 217, referer: http://test.triage.guinevere

Any ideas? I'm using SimpleTest 2.10, D6 core 6.17, SimpleTest patched (no fails while patching).

lyricnz’s picture

Are you sure you applied the patch? That function is added to common.inc by the patch.

pounard’s picture

I just did; maybe it did not go so well with D6.17. Yup, that's it: the SimpleTest core patch failed on my Drupal core. I wasn't patching the right Drupal root.

pounard’s picture

Moving what I started in #744660: Expand batch support to fetchers and parsers over here.

@all: Attached is a proof-of-concept, working patch that allowed me to import a 50 MB XML file containing 4000+ items without saving parsed items to the database and without exceeding my 512 MB memory limit (which is still high, but with the original Feeds the import couldn't finish and broke even a 1 GB limit).

@alex_b

As we can't assume that every parser is going to implement batching, we will have to handle the case where the parsing stage returns more items than the processing stage consumes.

Did that in my patch. It assumes that fetching and parsing are streams, which means that on each batch hit it attempts to fetch only the necessary items (using limit/offset parameters), then parses them all (the fetched subset). If the fetcher is unable to fetch incrementally (most fetchers won't be, and with a local file incremental fetching makes no sense), it then attempts to fetch a subrange of items using the parser. If the parser can't either, the batch class can still extract a subset of the result array and process only that subset.

It's quite easy for most parsers to extract a subrange, and it should not be difficult to implement (for now, I did it with the SimplePie parser only). I made some API changes and moved a lot of logic back into the batch class. The API modifications I made for fetchers have a default implementation in the FeedsFetcher abstract class, which means that any new or existing fetcher need not implement the new API and will still work flawlessly.

I have two problems now:

  • Because of this "streamed" orientation, fetchers are not able to give the total number of items (which is only really useful for user display anyway);
  • The current algorithm that should tell the FeedsSource class when the end has been reached does not work (it should be easy to fix, I think).

EDIT: I took the liberty of giving some air to parts of the original code, added a lot of inline documentation on some class properties, and also declared some methods final, for the sake of stability and code readability. I also moved some FeedsImporterBatch logic up into the FeedsBatch class.

pounard’s picture

Hence negotiating the number of items is a special case where parser _and_ processor support batching. To address that, we could introduce the following methods:

// Returns the number of items consumed per page load, 0 if batching is not supported.
FeedsParser::getBatchSize();
FeedsProcessor::getBatchSize();

A parser like FeedsCSVParser could use FeedsProcessor::getBatchSize() to inspect the processor's batching capabilities and adjust its batch size accordingly.

Totally agree with that.
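The negotiation described above might look like the following sketch; only getBatchSize() comes from the comment, everything else is invented for illustration.

```php
<?php

// Sketch of batch-size negotiation between parser and processor.
// Class names here are hypothetical, not the Feeds implementation.
interface FeedsProcessorInterface {
  // Number of items consumed per page load, 0 if batching is unsupported.
  public function getBatchSize();
}

class ExampleNodeProcessor implements FeedsProcessorInterface {
  public function getBatchSize() {
    return 50;
  }
}

class ExampleCSVParser {
  // The parser asks the processor how many items it can consume per page
  // load and parses no more rows than that.
  public function negotiatedBatchSize(FeedsProcessorInterface $processor) {
    $size = $processor->getBatchSize();
    // Fall back to parsing everything if the processor can't batch.
    return $size > 0 ? $size : PHP_INT_MAX;
  }
}

$parser = new ExampleCSVParser();
$size = $parser->negotiatedBatchSize(new ExampleNodeProcessor());
```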

alex_b’s picture

Let's move #13 (#14) back to #744660: Expand batch support to fetchers and parsers as this issue is really just about cleaning up existing batch support.

pounard’s picture

@alex_b: OK, but cleaning up batch support is also tied to the other issue; should both be merged?

alex_b’s picture

#16: yes, this issue is closely related, but it really just aims at cleaning up the existing implementation to make it easier to implement full batching. I'd love to keep these separate. That makes it easier for folks who are using the existing batch support to review this patch without wrapping their minds around full batch support.

If you need to post a single patch because the groundwork here does not work for the approach you are suggesting, please do it on #744660. Thanks!

pounard’s picture

Ok NP, I will post future patches there.

alex_b’s picture

Status: Needs work » Needs review
FileSize
14.07 KB

lyricnz, thanks for the review, here are the adjustments:

- DONE there should not be a default second argument on setProgress()
- DONE variable initialization in getProgress() should be moved to the else().
- DONE use array_sum() where possible
- DONE, ADJUSTED COMMENT $batch->created is documented as fraction, but used as a count
- DONE initialize members in FeedsImportBatch constructor in the same order as the fields declared

All tests are passing. I'm using the feeds_test profile. The setProgress() mechanism still feels a little odd. I wonder how we could make this cleaner.

pounard’s picture

setProgress() could be removed and the array accessed directly, but that requires that the processor and the parser cannot access it (I think the processor and the parser should never know about the batch class anyway; there's no use for it).

EDIT: Plus, it would come closer to the single responsibility principle and would give better-encapsulated code; processors and parsers would then be easier to develop and extend if they did not have the responsibility of updating the batch themselves.

Re-EDIT: That's why I think this batch class cannot be cleaned up without a refactor of the whole design: achieving this would require signature changes in the FeedsProcessor and FeedsParser classes.

lyricnz’s picture

FileSize
2.57 KB

@alex_b: Where should I post issues testing this patch?

- Call to undefined function feeds_include_library() in ..feeds_parser_csv.test line 31
==> add this above line 31
drupal_load('module', 'feeds');

- (pedantic) FeedsDefaultsFastFeedTestCase doesn't actually need views_ui (currently)

- FeedsMapperFileFieldTestCase needs Libraries and SimplePie parser (not mentioned in test description like most others) -- not needed up until this point

- FeedsMapperLinkTestCase should not require the devel module, nor check for its permission

Attached patch fixes 1, 3 and 4 above (and should be applied on top of #19). With just {cck, ctools, data, date, feeds, filefield, format_number, formatted_number, libraries, link, simpletest, views} this gets: 2945 passes, 0 fails, and 0 exceptions.

(Call that RTBC on your code, and Needs Review on mine).

lyricnz’s picture

Happy to provide a patch to fix up Coder warnings (mostly about whitespace, use of capitals for FALSE etc, and @file comments) once this patch is committed. No point until then.

alex_b’s picture

alex_b’s picture

Status: Needs review » Fixed

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.