When a node is saved, cache_clear_all() is called. If a cache minimum lifetime is in effect, cache_clear_all() does not clear any cache entries until cache_clear_all() is called again after the minimum lifetime has passed since the first call. Imagine this scenario:

0. You are an editor at a busy newspaper site.
1. Your cache minimum lifetime is 1 hour.
2. At 8am, you create a new story. It is temporarily unpublished.
3. At 8:15am, a curious user visits node/larger-nid-than-yesterday and hits your story. She gets Access Denied because the story is unpublished.
4. At 8:45am, you publish the story and it goes on your home page. cache_clear_all() is called but, because you have a minimum cache lifetime, the Access Denied page entry is not wiped.
5. Because the story's URL is in the page cache with Access Denied, that is what all anonymous users see.
6. You have cron running, but cron does not call cache_clear_all().
7. At 10am, no new stories have been saved, so cache_clear_all() is still not called, and your users still see Access Denied.
7. At *noon*, you save a new story.
8. Now, users finally see the first story, "published" three hours ago.
9. The publisher fires you for deploying such a stupid CMS.

It seems to me that when a node is saved, that node's page view entries should be wiped from the cache immediately. See the attached patch.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

macgirvin’s picture

Looks good to me. The Drupal code police are going to call you on putting spaces around the string concat operator '.' - I'm not one of them but felt I should probably give you a heads up.

kbahey’s picture

@macgirvin

The Drupal Coding Standards Council has declared it heresy to have a space on one side of the '.' but not the other. For 7.x, any blasphemer who commits this violation will be coding Drupal in COBOL for a month. The exact issue # is left as a penance for the transgressor.

@bjaspan

I agree with your that this is a case where the cache needs to be forced cleared. I wondering if we really need two attempts, rather than only one (which we know would be the way it works). Also, what about comments and other node related actions? Should those trigger a flush as well?

macgirvin’s picture

@kbahey - If you only knew how many batches of punched cards I've had to redo because my fingers tripped up on IDENTFICATION DIVISION.

IDENTIFICATION DIVISION.
PROGRAM-ID. SMILEY.
PROCEDURE DIVISION.
MAIN.
     DISPLAY ':-)'.
     STOP RUN.
Damien Tournoud’s picture

@macgirvin: In that case, why calling the full cache_clear_all() anyway? Isn't that purely wasteful?

bjaspan’s picture

@kbahey: Yes, the same idea probably applies in a number of other locations (e.g. comments), though nodes are probably the single most important.

@Damien: The comments explain why cache_clear_all() is still needed. Many page might be affected by a change in a node, it is just that "minimum cache lifetime" should still apply to them.

floretan’s picture

The patch uses the drupal 5 syntax for the url() function. Here's a patch that uses the new syntax with an associative array for the arguments.

I was able to reproduce the problem manually (and the patch does indeed fix it), but when I wrote a test for it the page never got cached even though page caching was enabled.

c960657’s picture

Status: Needs review » Needs work

+ cache_clear_all($base_url .'/node/'. $node->nid, 'cache_page');
This trick doesn't work when clean URLs are disabled. In this case, cid is the actual URL. You can get the un-aliased path by calling url() with 'alias' => FALSE in the $options array.

c960657’s picture

gpk’s picture

> "Editing a node does not wipe its cache entry"
I think this is actually the intended behavior. I think of the cache_lifetime variable as a "cache flush delay" - how long the system waits before removing all expired and temporary entries. However there is the code in cache_get() that ensures a user who has changed content actually sees changed content.

If trying to remove a specific entry from the page cache, there are potentially unlimited numbers of URLs that could correspond to the same page (different domain names, different path aliases, query parameters, clan URLs or not as noted above, etc.). Even if cache_clear_all was modified to allow clearing everything that matched something like "*cid*", what about a node embedded in another node etc? Views?

bjaspan’s picture

If the cache minimum lifetime actually worked, that would be fine. But as the scenario described at the top of this issue shows, it doesn't---cached pages can survive far longer than their minimum lifetime, well past when they are no longer valid.

catch’s picture

I don't understand point #6 in the opening post.

6. You have cron running, but cron does not call cache_clear_all().

  $core = array('cache', 'cache_filter', 'cache_page', 'cache_form', 'cache_menu');
  $cache_tables = array_merge(module_invoke_all('flush_caches'), $core);
  foreach ($cache_tables as $table) {
    cache_clear_all(NULL, $table);
  }
gpk’s picture

@10, oh my, this cache_lifetime headache is bigger than I thought :P :P !!... #198855: check_markup() hardcodes cache expiration.

The admin/settings/performance screen describes it as follows:

"On high-traffic sites, it may be necessary to enforce a minimum cache lifetime. The minimum cache lifetime is the minimum amount of time that will elapse before the cache is emptied and recreated, and is applied to both page and block caches. A larger minimum cache lifetime offers better performance, but users will not see new content for a longer period of time."

As I understand it, it is the amount of time that cache entries that have already expired (or are of type CACHE_TEMPORARY) will remain in the cache, once a wipe of expired/temporary entries has been requested. So in the case of the page cache and anonymous visitors, it is the length of time for which anonymous visitors may see stale pages.

Re. 11, this was added in 6.2 #226728: Temporary cache table entries are not flushed. The upshot is that if you run cron regularly and have a minimum lifetime of 0, then normal mode page caching won't do as much for you in 6.x as in 5.x, since the cache keeps getting cleared every cron run. Of course this particularly affects sites where content is not added continually (but you might still want cron to run frequently and the lifetime set to 0 so that when there *is* new content it immediately gets indexed and is visible etc.). *

Re. your initial scenario, I think it will not be quite as bad:
4. At 8:45, when the story is published, the cache_flush_block and cache_flush_page variables will be set to time() + cache_lifetime.
6. As catch noted, cron *does* call cache_clear_all(), but this won't make any difference in your scenario (if, as you say, it is a busy site) because it won't get rid of the expired/temporary entries until after an hour (i.e. cache_lifetime) has passed since the original call to cache_clear_all(). Probably the garbage collection routine (per 7.) will be triggered first by an anonymous visitor.
7. At 10am, assume a visitor tries to view the story, or the front page of the site, or any other page for that matter. The page cache tries to serve the page from the cache (http://api.drupal.org/api/function/cache_get/6). However, the garbage collection will be triggered, so the new content will now be visible to anonymous visitors.
9. Well yes, quite possibly ;)

Hope this helps clarify...

* I've seen the difference in performance for 5.x and 6.x by comparing sites running on the 2 versions: the 5.x site builds a huge page cache which makes the site extremely responsive to visitors. Well under 10% of anonymous pageviews require a full bootstrap (i.e. are not served from the cache). On the 6.x site, about 60% of pageviews require a full bootstrap. Neither site is very busy (the 6.x one gets a few thousand pageviews a month), but this is probably true for 95% of the sites out there. I'm still wondering how I might get the 6.x site to build a bigger page cache while keeping cron regular and the cache_lifetime low (preferably 0). Part of the problem is that the page cache can't really set an expire time on its entries and has to use CACHE_TEMPORARY so the only thing would be to inhibit system_cron() somehow.

The bottom line is that the regular cache clearing in 6.x via cron doesn't meet my needs at all, because it means that 6 times (in fact, possibly 10 or even 20 times) as many page views are of the "less snappy" variety. Cron currently runs every hour, so setting the minimum cache lifetime to say 1 hour wouldn't make a huge different to snappiness. (OK, site responsiveness not bad once you are in, its just not as good as the 5.x site. And because this is (albeit fairly good) shared hosting, the first page view in particular can be slow, because the cached PHP code and DB tables may not all be sitting in RAM on the server... this is where having stuff in the page cache really helps.)