NOTE: this patch depends on #1211166: Create separate directories for each cache bins

1)

The reason I started doing this reorganization is:
while patching/testing other parts of this module, I spotted that get() calls delete_flushed(), which at some semi-unpredictable time starts reading the full contents of a cache bin. This could take minutes, for all we know. It's really wrong for a random get() call do do this.

(This behavior was copied from the DrupalDatabaseCache class. The big difference there is that DrupalDatabaseCache's entries are indexed by expiry time, so it can just delete all expired entries with one quick SQL DELETE statement. DrupalFileCache has only flat files named after (i.e. indexed by) cache ID - and no index by expiry time. Which is why it needs to read the contents of all files.)

2)

Then I needed to learn/think about all the variables involved. This caching stuff is complicated.
I'm going to write (perhaps way too) much here, to possibly use it as a reference later:

Current behavior:

  • The clear() function works exactly like DrupalDatabaseCache::clear(). quoting my code comments:
    // won't delete anything until <cache_lifetime> has passed since the
    // last cache clear, but when that time comes only entries with an expiry
    // time in the future are preserved. (i.e. the created time is not checked
    // against <cache_lifetime> and all CACHE_TEMPORARY items are deleted.)

    This behavior is kept intact.

  • even though clear() sets $user->cache, this value is never used anywhere!
  • filecache_cron() calls delete_expired() directly (which deletes all epxired entries always). This mostly invalidates the check on 'cache_lifetime' which is done by clear(), since entries are deleted 'behind its back'.

What the patch does in terms of 'code organization' is:

  • remove delete_flushed(), and the call from get() to it. This call is merged with the big 'if' statement below it; both things are doing essentially the same. The logic in the resulting 'if' statement has been brought in line with the one in DrupalDatabaseCache::get(), but not completely; see "differences" below. (Note in the existing code there is a bug in the variable name which means $cache_flush is always 0. The patched code is functionally equal to the actual existing code, not to the intended code.)
  • add a check on $user->cache to get(), in the same way DrupalDatabaseCache()::get() has it. (See 2nd point above.)
  • make delete_expired() the only function that does any 'bulk' reading of files and deleting them, and replace all 'logic/assumptions' by function arguments.
  • make a public function called garbageCollection() which encodes the 'logic'. That is: decide on which expiry/creation time should be checked by delete_flushed(), and check the 'cache_lifetime' / 'cache_flush_BIN' variable to see whether or not delete_flushed() should be called at all.
    (This checking of 'cache_flush_BIN' is moved here from clear() - and filecache_cron() now also uses it. Note that some 'logic' is abstracted into variables even though they never have more than one fixed value; this may be good for future experimenting.)

This means that the only two places in the module which encode 'logic' about expiry of cache entries, are garbageCollection(), and get(). (The former checks all items in the bin, the latter just checks one item).

Behavior changes

The only behavior changes should be:

  • filecache_cron() deletes less than it does now, for sites that have a minimum cache lifetime set. (It flushes the bin only at certain times, just like clear() does, but then entries deleted are not exactly the same; entries created after the last cache flush are not deleted.)
  • get() sometimes does not return data, according to $user->cache.
  • (sometimes a lot) less disk reads, to achieve the same results.

Differences in behavior between this module and DrupalDatabaseCache

  • Regular cache maintenance (flushing of expired entries) is done on cron, since it cannot realistically be done during get().
  • in get(), entries which have a real expire time set (i.e. CACHE_TEMPORARY), are also deleted if the site has no 'minimum cache lifetime' setting, while DrupalDatabaseCache just uses them in that case, until an explicit clear() is done. (IMHO at first sight this seems like a core bug...)
CommentFileSizeAuthor
#2 1488870-2.patch12.59 KBroderik
filecache-expiry.patch13.26 KBroderik
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

roderik’s picture

A note: after this patch, which does not introduce much actual behavior change (except eliminating disk trashing on get() calls)... the code is IMHO ready for more thinking, testing and fixing.

Here are some things that come to mind:

  • The delete_expired() call (from cron, and from clear(NULL)) takes much time and disk I/O. And it does not even make a practical difference, because get() refuses to use the cache entries anyway, and deletes them from disk, when they're supposed to be gone. (Note: this statement is not true for clear(NULL) calls IF your site has no 'minimum cache lifetime' set.)
  • Should we move more logic (w.r.t. the last sentence in italics) into get() and eliminate the actual deletion of files on clear(NULL) calls altogether?
  • If your minimum cache lifetime is small (or 0) and your cron runs are done often... way too much time / disk I/O is wasted. What can we do to fix this?
    • Make a setting to prevent these cleanups altogether, for people who don't care about wasting disk space on expired cache files?
    • Introduce a maximum number of deletes per cron run?
    • ...by making use of some round-robin pointer? Can this be done by using Drupal Queues?
    • Introduce some indexing by expiry date, so that we don't read in loads of files for nothing?
      • Optional? Make index on expired files using SQLite?
      • Or at least create a flat file that notes all CACHE_PERMANENT files so we don't read those for nothing?

I don't know if this is for this issue, or followup issues.

roderik’s picture

ogi’s picture

Status: Needs review » Active