Support for Drupal 7 is ending on 5 January 2025—it’s time to migrate to Drupal 10! Learn about the many benefits of Drupal 10 and find migration tools in our resource center.
We should wrap batch progress reporting into the FeedsBatch class to make it easier for implementing processors to support batching and to make a future support of parser level processing cleaner.
Specifically, I'd like to suggest to replace the return value of process() with a status of the batch object. Patch coming.
Comment | File | Size | Author |
---|---|---|---|
#21 | feeds-tests.patch | 2.57 KB | lyricnz |
#19 | 849986-19_clean_batching.patch | 14.07 KB | alex_b |
#13 | feeds-6.x-1.0-beta3-new_batch_and_fetch_and_parse_as_stream.patch | 39.2 KB | pounard |
#4 | 849986-4_clean_batching.patch | 14 KB | alex_b |
#3 | 850298-4_parser_csv_batching.patch | 10.3 KB | alex_b |
Comments
Comment #1
alex_b CreditAttribution: alex_b commentedBreaks tests currently. Look at FeedsNodeProcessor::process() for improvements:
becomes obsolete and
becomes
The latter may need to change to
$batch->complete('processing');
to distinguish between the status of the separate stages fetching, parsing, processing.Proof of concept, breaks tests.
Comment #2
alex_b CreditAttribution: alex_b commentedSlightly improved approach.
- Takes into account that fetching, parsing, processing may have different completion levels at some point.
- Removes any responsibility from fetchers, parsers, processors to set a progress status if they do not implement batching.
Comment #3
alex_b CreditAttribution: alex_b commented- Make setting the total number of items to be processed the processors responsibility. Requires less 'insider knowledge' of the processor about the batch object.
Comment #4
alex_b CreditAttribution: alex_b commentedIgnore patch in #3.
Comment #5
lyricnz CreditAttribution: lyricnz commentedSome (pedantic) comments on #4:
- imho, there should not be a default second argument on this - completion should be set explicitly (none of the code omits this parameter anyway)
setProgress($stage, $progress = FEEDS_BATCH_COMPLETE)
- variable initialization in getProgress() should be moved to the else().
- actually, using array_sum() rather than a loop for the following code might be a little more efficient, and not require initializing that $total variable
- code sample above class FeedsImportBatch says:
$batch->setProgress(FEEDS_PROCESSING, $batch->created);
but this doesn't match the prototype ($batch->created is a count, not a fraction)
This is also in FeedsNodeProcessor.inc code???
- suggest members in FeedsImportBatch constructor are initialized in the same order as the fields declared, to make it more obvious they are all done? (super pedantic)
... Do the tests work?
Comment #6
lyricnz CreditAttribution: lyricnz commented2659 passes, 46 fails, and 3014 exceptions (but my setup makes noise about notices)
Comment #7
pounardIf you want to be able to do incremental fetching, and incremental parsing, either the batch must carries references to fetcher, parser and processors, either the FeedsSource class must remains the master, but if you choose to have a "strong" batch but keeps tracks of all instances in FeedsSource, you'll have a really weird pattern here, and you may not be able to do such things.
I would rather say that, if the Batch remains only a carrier for current Drupal batch data, it *must not* contain the items array (I tried to import a 40Mb file; If on each Drupal request I aptempt to update a 40Mb SQL row, this is quite the worst nightmare of sysadmin. The worst thing is that 40Mb would be bigger than the full Drupal database aside, and doing this you will expand the physical SQL storage files, without being able to make them thiner by the end, which will eat a lot of useless, and lost forever disk space, think about it twice before storing items before processing in database).
The whole point of batching is performances, and low consumption. If you store items in database, you do exactly the opposite of what you want to achieve. Please, consider about doing incremental fetching, and you'll see then your batch class does not have much to do, because you are able to do incremental fetching (or, if the fetcher can't, do incremental parsing instead), then the batch class has no use anymore than carry the offset. EXCEPT if you get rid of the FeedsSource class and merge it with the batch class.
EDIT: I would rather find the best algorithm to apply for incremental fetching before revamping the batch class itself, since this class is design dependant, and not the opposite. Find the right desing for parsers, fetchers and processors interaction before the batch class itself, else you'll have some surprises.
RE-EDIT: HTTP requests to get back an XML feed over the internet may be a lot slower than loading a local file, so what I said upper about performances may be not true in that case. For batched HTTP feeds fetching, you may consider storing the result on a physical file rather than into the database, which could be a lot better. I'm thinking about HA environments, where there is more than one Drupal frontend, for example, having only one physical machine working on a file is better than one messing up with the only database, if you slow down the database you will hit performances for all frontends.
Comment #8
lyricnz CreditAttribution: lyricnz commentedAgree about not loading the entire batch into memory with every iteration (I noticed this too, and it's pretty horrible - I was importing very large files, which was creating a memory-management nightmare - especially with PHP 5.2 memory leak issues).
Simplistically, it seems:
- the fetcher should run once, and store the file somewhere locally. Maybe one day this could be divisible and resumable, just for consistency with the two below, and processing really big data sets.
- the parser should be able to run in fractions, to parse the feed into batch-items (say we have 100MB file, it may not be possible to parse it in one chuck - we should support resumable parsing)
- the processor continues to run lumps of batch items, reading no more than the number of batch-items required for processing
Comment #9
pounard@lyrincz Totally agree with you, actually I think I'd finish a working patch that does all this for tonight (actually, if I have choice I have a deadline here). This patch add the incremental support for all including the fetcher (could be useful in some use cases), in a generic manner, where the batch class is the master of the whole operation, based on alex_b initial patch.
Comment #10
pounardWhile aptempting to run test, I got this error:
Any ideas? I'm using SimpleTest 2.10, D6 core 6.17, SimpleTest patched (no fails while patching).
Comment #11
lyricnz CreditAttribution: lyricnz commentedAre you sure you applied the patch? That function is added to common.inc by the patch.
Comment #12
pounardI just did it, may be it did not go so well with D6.17. Yup that's it, SimpleTest core patch failed on my Drupal core.I wasn't patching the right drupal root..Comment #13
pounardMoving what I started there #744660: Expand batch support to fetchers and parsers right here.
@all : Attached a proof of concept, working patch, that allowed me to import a 50Mb XML file containing 4000+ items, without saving parsed items into database, and without exceeding my 512Mb memory limit (which is still high, but with original feeds, it couldn't and broke a 1Gb limit).
@alex_b
Did that in my patch, it assumes that fetching and parsing are streams, which means, at each batch hit, it aptempt to fetch only the necessary items (using limit/offset parameters), then parse them all (the fetched subset). If the fetcher was unable to fetch incrementally (most fetchers won't, or with a local file fetching as not any sens), then aptempt to fetch a subrange of items using the parser. If the parser can't, the batch class can still extract a subset of the result array, and process only this subset.
It's quite easy for most parsers to extract a subrange, and it should not be difficult to implement (now, I did it with the SimplePie parser only). I did some API changes and moved back a lot of logic in the batch class. The API modifications I did for fetchers have a default implementation in FeedsFetcher abstract class, which means that any new or existing parser may not implement this new API and will work flawlessly.
I have two problems now:
EDIT: I took the liberty to give some air to some part of the original code, added a lot of inline documentation on some classes properties, and also defined some methods as final. For the sack of stability and code readability. I also moved up some FeedsImporterBatch stuff into the FeedsBatch class.
Comment #14
pounardTotally agree with that.
Comment #15
alex_b CreditAttribution: alex_b commentedLet's move #13 (#14) back to #744660: Expand batch support to fetchers and parsers as this issue is really just about cleaning up existing batch support.
Comment #16
pounard@alex_b : Ok, but cleaning batch support is also tied to the other issue, both may be merged?
Comment #17
alex_b CreditAttribution: alex_b commented#16: yes, this issue is closely related, but really just aims at cleaning up the existing implementation to make it easier to implement full batching. I'd love to keep this separated. That makes it easier for folks who are using existing batch support to review this patch without wrapping their minds around full batch support.
If you're in need to post a single patch because the groundwork here does not work for the approach you are suggesting, please do it on #744660. Thanks!
Comment #18
pounardOk NP, I will post future patches there.
Comment #19
alex_b CreditAttribution: alex_b commentedlyricnz, thanks for the review, here are the adjustments:
- DONE there should not be a default second argument on setProgress()
- DONE variable initialization in getProgress() should be moved to the else().
- DONE use array_sum() where possible
- DONE, ADJUSTED COMMENT $batch->created is documented as fraction, but used as a count
- DONE initialize members in FeedsImportBatch constructor in the same order as the fields declared
All tests are passing. I'm using feeds_test profile. The setProgress() mechanism still feels a little odd. I wonder how we could make this cleaner.
Comment #20
pounardsetProgress() could be removed and the array could be accessed directly, but it needs that the processor and the parser could not access it (I think that the processor and the parser should never know about the batch class btw, this has no use).
EDIT: Plus, it would be quite near from the single responsability principle, and would give more encapsuled code, then processors and parsers would be easier to develop and extends if they do not have the responsability to update the batch themselves.
Re-EDIT: That's why I think this batch class can not be cleaned up without the whole desing refactor, because to achieve this it would need FeedsProcessor and FeedsParser classes signatures changes.
Comment #21
lyricnz CreditAttribution: lyricnz commented@alex_b: Where should I post issues testing this patch?
- Call to undefined function feeds_include_library() in ..feeds_parser_csv.test line 31
==> add this above line 31
drupal_load('module', 'feeds');
- (pedantic) FeedsDefaultsFastFeedTestCase doesn't actually need views_ui (currently)
- FeedsMapperFileFieldTestCase needs Libraries and SimplePie parser (not mentioned in test description like most others) -- not needed up until this point
- FeedsMapperLinkTestCase should not require devel module, not check for it's permission
Attached patch fixes 1,3,4 above (and should be applied on top of #19). With just {cck, ctools, data, date, feeds, filefield, format_number, formatted_number, libraries, link, simpletest, views} this gets: 2945 passes, 0 fails, and 0 exceptions
(Call that RTBC on your code, and Needs Review on mine).
Comment #22
lyricnz CreditAttribution: lyricnz commentedHappy to provide a patch to fix up Coder warnings (mostly about whitespace, use of capitals for FALSE etc, and @file comments) once this patch is committed. No point until then.
Comment #23
alex_b CreditAttribution: alex_b commented#21: #866492-1: Clean up tests
Comment #24
alex_b CreditAttribution: alex_b commentedThank you.
Committed: http://drupal.org/cvs?commit=398568