Hi,

I noticed that my feeds batch files are often duplicated in /files/feeds. For each batch start the feed is re-downloaded and stored with a different timestamp. It looks like this:

FeedsHTTPBatch_ad788980927c3e7bb5b534c4b0b3189b_1280416692
FeedsHTTPBatch_ad788980927c3e7bb5b534c4b0b3189b_1280413332
FeedsHTTPBatch_ad788980927c3e7bb5b534c4b0b3189b_1280414442
etc.

These files are becoming a bit of a storage problem (mine are 10-50 MB apiece). They aren't deleted automatically, but they aren't being used either. They look like one-time-use, orphaned files.

What is the reason/logic behind this, and how can I implement a call that wipes the unnecessary batch files?
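
For now I am wiping them by hand with something along these lines, run from hook_cron() or drush php-eval (untested sketch; the path and the one-day age threshold are just my guesses):

// Hypothetical cleanup: delete FeedsHTTPBatch_* files older than one day.
// file_scan_directory() recurses by default, so files/feeds/batch is covered too.
$dir = file_directory_path() . '/feeds';
foreach (file_scan_directory($dir, '^FeedsHTTPBatch_') as $path => $file) {
  if (time() - filemtime($path) > 86400) {
    file_delete($path);
  }
}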

Comments

Anonymous’s picture

It happens on cron runs as well: duplicate feed batches and unnecessary re-downloading.

I think the problem is exactly the batch file's timestamp suffix ("_1280414442"): that information is not stored in the database anywhere, so how can the Feeds module know there is already a file available? It doesn't know, so it re-downloads (the timestamp exists only in memory during the request and is never stored for re-use).

Also, when re-downloading, it should overwrite the existing batch file for the same feed.

As a starting point, I would go into FeedsBatch.inc and remove the "_timestamp" part of the filename. That should, in theory, make an identical feed overwrite its existing file.

Anonymous’s picture

Title: Many duplicate feeds batch files, not being re-used » Many duplicate feeds batch files, caused by the batch filename time() extension

Have a look at this (from FeedsBatch.inc):

$dest = file_destination($dir .  '/' . get_class($this) .'_'. drupal_get_token($this->url) .'_'. time(), FILE_EXISTS_RENAME); // MT

What is wrong here? ;-) It uses FILE_EXISTS_RENAME, fine, but the filename has time() in it! That means a duplicate name can never occur, so Feeds always re-downloads every feed.

The solution must therefore be to remove time():

$dest = file_destination($dir .  '/' . get_class($this) .'_'. drupal_get_token($this->url), FILE_EXISTS_RENAME); // MT

Now you stop getting duplicate batch files.

alex_b’s picture

Component: Miscellaneous » Code

The problem is concurrency: what if a new version of the file is stored on top of an older one while that older one is still being imported by a different process? This is unlikely, but possible. I know that time() is really not a good safeguard at the moment.

I wonder whether we could treat all FeedsHTTPFetcher file paths as temporary and simply remove them after import. New imports can always download a fresh copy.

- Create a protected method FeedsFetcher::getFileDestination() that breaks out the generation of the file destination from getFilePath().
- Override getFileDestination() in FeedsHTTPFetcher and use file_directory_temp() as base directory.
- Introduce a FeedsBatch::destroy() method and invoke it in FeedsSource::import() once the import returns FEEDS_BATCH_COMPLETE.
- Override destroy() in FeedsHTTPFetcher and delete all temporary files.

The destroy() mechanism is important because the alternative, PHP's __destruct(), is invoked after every page load, while the file needs to stay alive for a full batched import spanning several page loads. A rough sketch of what I mean is below.
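
Roughly what I have in mind, sketched on the batch classes since that is where getFilePath() currently lives (untested; the exact classes and method names are assumptions, and the same idea applies if the destination logic ends up on the fetchers instead):

// Sketch only: give FeedsImportBatch a getFileDestination() that getFilePath()
// delegates to, so subclasses can decide where their files live.
class FeedsImportBatch extends FeedsBatch {

  protected function getFileDestination() {
    // Default behavior: permanent files directory, as today.
    $dir = file_directory_path() . '/feeds/batch';
    if (!file_check_directory($dir, TRUE)) {
      throw new Exception(t('Feeds directory either cannot be created or is not writable.'));
    }
    return $dir . '/' . get_class($this) . '_' . drupal_get_token($this->url);
  }

  public function destroy() {
    // Default: nothing to clean up.
  }
}

// Sketch only: the HTTP batch keeps its download in the temporary directory
// and removes it once the batched import has completed.
class FeedsHTTPBatch extends FeedsImportBatch {

  protected function getFileDestination() {
    return file_directory_temp() . '/' . get_class($this) . '_' . drupal_get_token($this->url);
  }

  public function destroy() {
    if (isset($this->file_path) && file_exists($this->file_path)) {
      file_delete($this->file_path);
    }
  }
}

// Sketch only: in FeedsSource::import(), after the batch reports completion,
// let it clean up after itself.
// if ($result == FEEDS_BATCH_COMPLETE) {
//   $batch->destroy();
// }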

Anonymous’s picture

Good point about concurrency. But I would keep files around for re-use during the refresh period (if one is set):

With a refresh period:
- before re-downloading, check the "import_period". If it has not passed yet, re-use the existing file and avoid another HTTP fetch.
- after the import_period has passed: re-download and overwrite the existing file (no duplicates).

Without a refresh period, always delete the file after a full import. This both cleans up files and reduces unnecessary data transfer.

Anonymous’s picture

Have a look at this (ugly, but it works). I changed the decision logic in getFilePath() in FeedsBatch.inc.

I assumed getRaw() is what calls FeedsHTTPFetcher to fetch the remote data. I didn't know how to cleanly get the import_period, so I had to do a database query; that part needs work.

public function getFilePath() {
  if (!isset($this->file_path)) {
    $dir = file_directory_path() . '/feeds/batch';

    if (!file_check_directory($dir, TRUE)) {
      throw new Exception(t('Feeds directory either cannot be created or is not writable.'));
    }

    $dest = file_destination($dir . '/' . get_class($this) . '_' . drupal_get_token($this->url), FILE_EXISTS_REPLACE);

    // Morningtime: implement (re-)download decision logic.
    if (file_exists($dest)) {
      $config = unserialize(db_result(db_query('SELECT fi.config AS config FROM {feeds_importer} fi, {feeds_source} fs WHERE fs.source = "%s" AND fs.id = fi.id', $this->url)));

      // Morningtime: if a refresh period is set and it has not passed yet, re-use the existing file.
      if (!empty($config['import_period']) && (time() - filemtime($dest) < $config['import_period'])) {
        $this->file_path = $dest;
        return $this->file_path;
      }

      // Morningtime: the existing file is stale, delete it before re-downloading with getRaw().
      file_delete($dest);
    }

    $this->file_path = file_save_data($this->getRaw(), $dest);

    if ($this->file_path === 0) {
      throw new Exception(t('Cannot write content to %dest', array('%dest' => $dest)));
    }
  }
  return $this->file_path;
}

Result:

- if $this->file_path is already set, return it
- if file_exists($dest): check import_period against the file's modification time and re-use the file while it is still fresh
- otherwise: delete the stale file and re-download

This prevents unnecessary HTTP fetches and avoids duplicate files.

brad.bulger’s picture

3 years later and this is still a problem. Any chance of it being fixed?

twistor’s picture

Issue summary: View changes
Status: Active » Closed (outdated)