Hi,
I noticed that my Feeds batch files are often duplicated in /files/feeds. Every time a batch starts, the feed is re-downloaded and stored under a different timestamp. It looks like this:
FeedsHTTPBatch_ad788980927c3e7bb5b534c4b0b3189b_1280416692
FeedsHTTPBatch_ad788980927c3e7bb5b534c4b0b3189b_1280413332
FeedsHTTPBatch_ad788980927c3e7bb5b534c4b0b3189b_1280414442
etc.
These files are becoming a bit of a storage problem (mine are 10-50 MB apiece). They aren't deleted automatically, but they aren't being used either; they look like one-time-use, orphaned files.
What is the reason/logic behind this? How can I implement a call that wipes the unnecessary batch files?
Comments
Comment #1
Anonymous (not verified) CreditAttribution: Anonymous commented
It happens on cron runs as well: duplicate feed batches and unnecessary re-downloading.
I think the problem is exactly the batch file's timestamp ("_1280414442"): that information isn't stored in the database anywhere, so how can the Feeds module know a file is already available? It doesn't know, so it re-downloads (the timestamp exists only in memory and is never stored for re-use).
Also, when re-downloading, it should overwrite the existing batch file for the same feed.
As a starting point, I would go into FeedsBatch.inc and remove the "_timestamp" part of the filename. That should, in theory, make identical feeds overwrite their existing files.
Comment #2
Anonymous (not verified) CreditAttribution: Anonymous commented
Have a look at this (from FeedsBatch.inc):
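(The original snippet isn't preserved in this thread; what follows is a rough reconstruction of the getFilePath() logic being quoted, assuming the Feeds for Drupal 6 code layout and an md5() hash of the source URL, so details may differ from the exact release.)

// Rough reconstruction of getFilePath() in FeedsBatch.inc (not verbatim).
public function getFilePath() {
  if (!isset($this->file_path)) {
    $dir = file_directory_path() . '/feeds';
    if (!file_check_directory($dir, TRUE)) {
      throw new Exception(t('Feeds directory either cannot be created or is not writable.'));
    }
    // The destination name includes time(), which produces names like
    // FeedsHTTPBatch_ad788980927c3e7bb5b534c4b0b3189b_1280416692.
    $dest = file_destination($dir . '/' . get_class($this) . '_' . md5($this->url) . '_' . time(), FILE_EXISTS_RENAME);
    $this->file_path = file_save_data($this->getRaw(), $dest);
    if ($this->file_path === 0) {
      throw new Exception(t('Cannot write content to %dest', array('%dest' => $dest)));
    }
  }
  return $this->file_path;
}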
What is wrong here? ;-) It uses FILE_EXISTS_RENAME, which is correct, but the filename has time() in it! That means a duplicate name can never occur, so Feeds always re-downloads every feed.
The solution must therefore be to remove time():
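Something along these lines (a sketch of the idea, not the exact change that was posted; file_save_data() with FILE_EXISTS_REPLACE overwrites the previous download):

// Before: time() gives a new name on every run, so FILE_EXISTS_RENAME never
// has an existing file to rename against.
$dest = file_destination($dir . '/' . get_class($this) . '_' . md5($this->url) . '_' . time(), FILE_EXISTS_RENAME);
$this->file_path = file_save_data($this->getRaw(), $dest);

// After: a stable name per feed; overwrite the previous copy instead of
// accumulating new ones.
$dest = $dir . '/' . get_class($this) . '_' . md5($this->url);
$this->file_path = file_save_data($this->getRaw(), $dest, FILE_EXISTS_REPLACE);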
Now you stop getting duplicate batch files.
Comment #3
alex_b CreditAttribution: alex_b commented
The problem is concurrency: what if a new version of the file is stored on top of an older one while that older one is being imported by a different process? This is unlikely, but possible. I know that time() is really not a good safeguard at the moment.
I wonder whether we could treat any FeedsHTTPFetcher file paths as temporary and just remove them after import. New imports can always download fresh.
- Create a protected method FeedsFetcher::getFileDestination() that breaks out the generation of the file destination from getFilePath().
- Override getFileDestination() in FeedsHTTPFetcher and use file_directory_temp() as base directory.
- Introduce a FeedsBatch::destroy() method, invoke it in FeedsSource::import() when the import result is FEEDS_BATCH_COMPLETE.
- Override destroy() in FeedsHTTPFetcher and delete all temporary files.
The destroy() mechanism is important because the alternative, PHP's __destruct(), would be invoked on every page load, while the file needs to stay alive for a full batched import spanning several page loads.
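A rough sketch of how that proposal could look (class and method placement are simplified here and will differ from the real Feeds class hierarchy; file_directory_temp() and file_delete() are the Drupal 6 file API calls assumed):

class FeedsImportBatch extends FeedsBatch {

  /**
   * Break destination generation out of getFilePath() so subclasses can
   * decide where downloaded content is stored.
   */
  protected function getFileDestination() {
    $dir = file_directory_path() . '/feeds';
    file_check_directory($dir, TRUE);
    return $dir . '/' . get_class($this) . '_' . md5($this->url);
  }

  public function getFilePath() {
    if (!isset($this->file_path)) {
      $this->file_path = file_save_data($this->getRaw(), $this->getFileDestination(), FILE_EXISTS_REPLACE);
    }
    return $this->file_path;
  }

  /**
   * Clean up resources once an import has fully completed.
   */
  public function destroy() {
    // Default: nothing to clean up.
  }
}

class FeedsHTTPBatch extends FeedsImportBatch {

  /**
   * HTTP downloads are throwaway copies; keep them in the temp directory.
   */
  protected function getFileDestination() {
    return file_directory_temp() . '/' . get_class($this) . '_' . md5($this->url);
  }

  /**
   * Remove the temporary download when the batched import is complete.
   */
  public function destroy() {
    if (isset($this->file_path) && file_exists($this->file_path)) {
      file_delete($this->file_path);
    }
  }
}

// In FeedsSource::import(), once the batch reports completion:
//   if ($result == FEEDS_BATCH_COMPLETE) {
//     $batch->destroy();
//   }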
Comment #4
Anonymous (not verified) CreditAttribution: Anonymous commented
Good point about concurrency. But I would keep files around for re-use during the refresh period (if one is set):
With a refresh period:
- check the "import_period" before re-downloading. Re-use the existing file, avoid another FeedsHTTPFetch
- after the import_period: re-download and overwrite the existing file (no duplicates)
Without a refresh period, always delete the file after a full import. This both cleans up files and reduces unnecessary data transfer.
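A minimal sketch of that decision, written as a hypothetical helper (the function name and the way import_period is passed in are assumptions, not Feeds API):

/**
 * Decide whether an existing batch file can be re-used instead of re-downloaded.
 *
 * $dest is the batch file path; $import_period is the importer's refresh
 * period in seconds (empty meaning "no refresh period").
 */
function feeds_batch_file_is_fresh($dest, $import_period) {
  if (!file_exists($dest)) {
    // Nothing to re-use.
    return FALSE;
  }
  if (empty($import_period)) {
    // No refresh period configured: never re-use; delete after import instead.
    return FALSE;
  }
  // Re-use the download as long as it is younger than the refresh period.
  return (time() - filemtime($dest)) < $import_period;
}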
Comment #5
Anonymous (not verified) CreditAttribution: Anonymous commented
Have a look at this (ugly, but it works). I changed the decision logic in getFilePath() in FeedsBatch.inc.
I assumed getRaw() is what calls FeedsHTTPFetcher to fetch the remote data. I didn't know how to cleanly get the import_period, so I had to do a db query. This needs work.
Result:
- if $this->file_path exists, return it
- otherwise, if file_exists($dest): check the import_period against the file's timestamp
- in any other case: delete the file and re-download
This prevents unnecessary FeedsHTTPFetcher calls and avoids duplicate files.
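Put together, the flow described in this comment amounts to something like the following sketch (not the poster's actual patch; the import_period lookup is hand-waved behind a hypothetical accessor, since the original resorted to a raw db query for it):

public function getFilePath() {
  // 1. A path was already resolved during this import: re-use it.
  if (isset($this->file_path)) {
    return $this->file_path;
  }
  $dir = file_directory_path() . '/feeds';
  file_check_directory($dir, TRUE);
  $dest = $dir . '/' . get_class($this) . '_' . md5($this->url);

  // 2. A file from a previous run exists and is still within the refresh
  //    period: re-use it and skip the FeedsHTTPFetcher round trip.
  $import_period = $this->getImportPeriod(); // Hypothetical accessor for the importer's refresh period.
  if (file_exists($dest) && !empty($import_period) && (time() - filemtime($dest)) < $import_period) {
    $this->file_path = $dest;
    return $this->file_path;
  }

  // 3. Any other case: drop the stale copy and download fresh, writing over
  //    the old name so duplicates never accumulate.
  if (file_exists($dest)) {
    file_delete($dest);
  }
  $this->file_path = file_save_data($this->getRaw(), $dest, FILE_EXISTS_REPLACE);
  return $this->file_path;
}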
Comment #6
brad.bulger CreditAttribution: brad.bulger commented
3 years later and this is still a problem. Any chance of it being fixed?
Comment #7
twistor CreditAttribution: twistor as a volunteer commented