So, we should probably clear the node cache on comments? I've written the code, but turned it off. I wonder if it needs to be a config option somewhere, but I don't want to end up adding a ton of variables. This could turn into a nightmare if I'm not careful.

Comments

steven jones’s picture

Title: Clear cache on comments » Central Cache Clear API

So, thinking this out, it probably makes sense (in the loosest sense) to provide some kind of central way for other modules to say:

Hey, I've got some data that you should store a 'best before' date for.

The module can tell me when the data is no longer fresh, and the caching plugin can provide an easy way for users to select from a number of different things that could invalidate the data they are using to construct their view. The external modules could implement hooks that return forms for setting options on the 'types' of data they expose.

For example, the 'node' module could expose a hook that defines that users can look for changes on nodes. It would supply an option form, allowing users to restrict the types that they'd like to monitor.
Or the votingapi module could define a set of options for the different voting result tags that it supplies.

steven jones’s picture

Title: Central Cache Clear API » Central Cache API
steven jones’s picture

Seems like there are an awful lot of modules all trying to solve the same issues here. Notably, the expire module is trying to cover all the bases with a very advanced cache clearing system.

I think until that lands, we can add value by making it really easy to cover most bases when it comes to caching.

This basically means we'll keep caching on a per-content-type basis and not offer any further granularity, but we'll allow the user to decide whether they want to consider comments and things like Voting API votes as events that should clear the cache.

huesforalice’s picture

I think the module should be kept fairly simple for now. The feature of flushing the cache on node creation is quite essential, and I was quite surprised that Views comes without it. I would definitely include flushing on comments, and make sure it integrates with core comments and node_comments as well. I'm not sure about the interface though; maybe just a checkbox, "clear on comment". I'm pretty sure people will start using this and come up with useful ideas. As soon as there's a bunch of good ideas we can start work on a 2.0 version which maybe has its own API with hooks and submodules and so forth, but that should probably only happen if enough people are interested.

What do you think?

steven jones’s picture

Here's my current thinking:

There are two distinct sides to the caching coin: one is keeping track of what might change the contents of the view, and the other is segmenting viewers so that they only see content they are supposed to. The way Views handles the expiry time quite handily deals with the first case, and the way the cache keys are built up goes some way towards dealing with the second. For the cache keys, I think a simple module_invoke_all() would help a lot here, because modules like OG are going to need to segment the data based on which groups the user is in, etc.

For the cache expiry, here's what I'm thinking:

On some event, say the posting of a comment, we ask modules what keys they'd like to set for that event. The comment module (or rather, our hook implementation on its behalf) would set a 'comment' key.
Then, crucially, we offer these keys up for altering via drupal_alter(). Other modules, such as the node module or OG, can decide to extend or modify the keys: the node module would duplicate the single 'comment' key for the node type, giving two keys, comment-{node_type} plus the original global 'comment' key. OG can then come along and add its own keys, depending on the groups the node is in, duplicating the existing keys and appending its own OG key. We then store these keys against the timestamp of the action in the database.

Then, we allow modules to place bits of form on our plugin form, so they can let users choose what the caching on the view should look for.

When executing a view, we just look up the active options the user chose and concatenate them to give a string that matches a row in our DB table, retrieve the expiry time, and away we go.

Example:

  1. Comment is posted on a 'blog' post in the 'First group' and 'Second group' groups, with nid 11 and 12 respectively.
  2. hook_comment calls Views content cache with a new event, like so:
    event($comment, 'comment', 'insert', array(array(array('module' => 'comment', 'key' => 'changed'))));
    

  3. We allow other modules to supply keys for this action. None do.
  4. We allow the other modules to alter the keys for the action.
    1. Node module duplicates the keys, to make the $keys array look like:
      array(
        array(
          array(
            'module' => 'comment',
            'key' => 'changed',
          ),
        ),
        array(
          array(
            'module' => 'comment',
            'key' => 'changed',
          ),
          array(
            'module' => 'node',
            'key' => 'blog',
          ),
        ),
      );
      
    2. Then OG gets its chance, giving:
      array(
        array(
          array(
            'module' => 'comment',
            'key' => 'changed',
          ),
        ),
        array(
          array(
            'module' => 'comment',
            'key' => 'changed',
          ),
          array(
            'module' => 'node',
            'key' => 'blog',
          ),
        ),
        array(
          array(
            'module' => 'comment',
            'key' => 'changed',
          ),
          array(
            'module' => 'og',
            'key' => '11',
          ),
        ),
        array(
          array(
            'module' => 'comment',
            'key' => 'changed',
          ),
          array(
            'module' => 'og',
            'key' => '12',
          ),
        ),
        array(
          array(
            'module' => 'comment',
            'key' => 'changed',
          ),
          array(
            'module' => 'node',
            'key' => 'blog',
          ),
          array(
            'module' => 'og',
            'key' => '11',
          ),
        ),
        array(
          array(
            'module' => 'comment',
            'key' => 'changed',
          ),
          array(
            'module' => 'node',
            'key' => 'blog',
          ),
          array(
            'module' => 'og',
            'key' => '12',
          ),
        ),
      );
      
  5. We then sort and store the keys as:
    array(
      'comment:changed' => $timestamp,
      'comment:changed|node:blog' => $timestamp,
      'comment:changed|og:11' => $timestamp,
      'comment:changed|og:12' => $timestamp,
      'comment:changed|node:blog|og:11' => $timestamp,
      'comment:changed|node:blog|og:12' => $timestamp,
    );
    
  6. Now we can do lookups to find the last time any comment changed, the last time any comment changed on a blog post, or the last time a comment changed on a blog post in the 'Second group' really quite easily.

    Sounds complex, but isn't actually too bad.
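
To make steps 4 and 5 concrete, here is a short illustrative Python sketch of the same expansion and serialisation. The function names here are invented for the example; the real implementation would live in Drupal hook and alter implementations:

```python
# Illustrative sketch of the key expansion described above.
# Each key set is a list of (module, key) pairs; each "alter" step
# duplicates the existing sets and appends its own segment.

def alter_node(key_sets, node_type):
    """Node module: duplicate each set, appending a node:type segment."""
    return key_sets + [s + [('node', node_type)] for s in key_sets]

def alter_og(key_sets, group_ids):
    """OG module: duplicate each set once per group the node is in."""
    extra = [s + [('og', str(gid))] for s in key_sets for gid in group_ids]
    return key_sets + extra

def serialize(key_sets):
    """Sort each set and join the 'module:key' segments with '|'."""
    return sorted('|'.join('%s:%s' % (m, k) for m, k in sorted(s))
                  for s in key_sets)

# A comment posted on a 'blog' node in groups 11 and 12:
keys = [[('comment', 'changed')]]
keys = alter_node(keys, 'blog')
keys = alter_og(keys, [11, 12])
for k in serialize(keys):
    print(k)  # six keys, from 'comment:changed' down to the deepest ones
```

Each serialised key would then be stored against the event's timestamp, as in step 5 above.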

yhahn’s picture

After talking with Steven today I thought over this method a bit and came up with some more observations/thoughts. Please take it or leave it as you will : )

Suppose...

  1. You have n number of cache segments, where a cache segment is node:type, comment:changed, og:nid, etc. as Steven described above.

  2. If for a given event you choose to generate every combination of cache segment key (as in the example above), then the number of keys you need to generate is (see http://en.wikipedia.org/wiki/Combination):

    C(n, n) + C(n, n-1) + ... + C(n, 1) = 2^n - 1
    

    For 2 segments (node, comment):

    comment, node, comment|node
    C(2, 1) + C(2, 2) = 2 + 1 = 3
    

    For 3 segments (node, comment, og1):

    node, comment, og1
    node|comment, node|og1, comment|og1
    node|comment|og1
    C(3, 1) + C(3, 2) + C(3, 3) = 3 + 3 + 1 = 7
    

    4 segments (node, comment, og1, flag) is 15.

    5 segments (node, comment, og1, flag, taxonomy_term1) is 31.

    etc.
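
The growth described above is just the number of non-empty subsets of the n segments, which is 2^n - 1. A quick check (illustrative Python, not module code):

```python
from math import comb

# Number of keys for n segments: C(n, 1) + C(n, 2) + ... + C(n, n) = 2**n - 1
def key_count(n):
    return sum(comb(n, k) for k in range(1, n + 1))

for n in range(2, 6):
    print(n, 'segments ->', key_count(n), 'keys')  # 3, 7, 15, 31
```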

  3. In order for such a key storage to scale with node types, OG group segments, etc. it seems to be implied that each key/timestamp pair would be stored as a row in a table. If this is the case, we're talking about 3 additional INSERTs/UPDATEs on an event (node_save, comment_save, etc.) with two segments, 7 with three segments, etc.

  4. Based on experience with aggregation tools in the past (Feeds, FeedAPI and Managing News in particular) I know that insert/update is one of the slowest operations and the cause of serious bottlenecks. Adding an additional 15 INSERT operations per node_save, for example, would be quite painful.

In short:

  • You are going to be inserting often (save/update events can be very frequent on some sites)
  • If you have anything beyond a couple of cache segments, you'll be inserting a lot of rows

Possible solution

Suppose that it's possible to query/retrieve the timestamp for a key like node|comment|og1 using only one or two of the segments (e.g. just the node segment, or the node|comment segment). Then it turns out that every key in the combination except the deepest one contains duplicative information. For example, the row

  • node|comment|og1 => timestamp

Would tell you the right answer for all of the following questions:

  • latest og1?
  • latest node?
  • latest comment?
  • latest node|comment?
  • latest comment|og1?
  • etc.

This means that, supposing the retrieval method is robust enough, on a given event we need to write only one row, or at most a couple if multi-value segments like OG are involved.

Implementation possibilities

You could imagine such a retrieval method working very easily if your DB schema were something like

+---------+------+------+-----------+
| comment | node |  og  | timestamp |
+---------+------+------+-----------+
| changed | blog | 1    | 50        |
| changed | book | 1    | 500       |
| NULL    | blog | NULL | 501       |
+---------+------+------+-----------+

A View watching only blog nodes could very easily find out its last valid cache time

SELECT MAX(timestamp) FROM {table} WHERE node = 'blog'

And a View watching a more complex combination like comment|blog,book|og would also have no problem finding out its necessary information:

SELECT MAX(timestamp) FROM {table} WHERE node IN('blog', 'book') AND comment = 'changed' AND og = 1

Now, obviously, we don't want to hardcode our table schema to specific segment types nor do we want to change our schema dynamically to accommodate new/unexpected segment types. We can get around this by a schema like

+----+----+----+----+----+----+----+----+-----------+
| c1 | c2 | c3 | c4 | c5 | c6 | c7 | c8 | timestamp |
+----+----+----+----+----+----+----+----+-----------+

Where we assume a maximum number of cache segments. This seems like a safe assumption to make as in the examples above, when you get past 4 or 5 segments the use cases become increasingly corner-case.

Now, whenever the cache is cleared, we can lazily instantiate a mapping by scanning all the enabled Views on the site that use the caching system:

node => c1,
comment => c2,
og => c3,
etc.

And voila, we're ready to populate our cache table cheaply and retrieve the timestamps we need quickly.
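
As a sketch of how that lookup behaves, here is the example table above rebuilt in SQLite (via Python). The table name and the node => c1, comment => c2, og => c3 mapping are illustrative, following the lazily-built mapping just described:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
# Fixed generic columns; the segment-to-column mapping is built lazily.
conn.execute('CREATE TABLE views_content_cache ('
             'c1 TEXT, c2 TEXT, c3 TEXT, timestamp INTEGER)')
# Rows from the example table: node => c1, comment => c2, og => c3
rows = [
    ('blog', 'changed', '1', 50),
    ('book', 'changed', '1', 500),
    ('blog', None, None, 501),
]
conn.executemany('INSERT INTO views_content_cache VALUES (?, ?, ?, ?)', rows)

# A View watching only blog nodes:
(ts,) = conn.execute(
    "SELECT MAX(timestamp) FROM views_content_cache "
    "WHERE c1 = 'blog'").fetchone()
print(ts)  # 501

# A View watching the comment|blog,book|og combination:
(ts,) = conn.execute(
    "SELECT MAX(timestamp) FROM views_content_cache "
    "WHERE c1 IN ('blog', 'book') AND c2 = 'changed' AND c3 = '1'").fetchone()
print(ts)  # 500
```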

huesforalice’s picture

I'm not an experienced module developer so I can't really help that much with creating the module structure etc., but what you're writing definitely makes sense to me. If there's anywhere you think I can help out, let me know.

Just a quick insert because it has crossed my mind: another feature I was thinking about is the possibility to replace/update certain information after the Views data has been retrieved from the cache. Node comment, for example, has its own Views cache which adds a "new" flag to new comments based on which user is looking at the comment thread. I haven't really studied the code to find out how this is done, but it could come in quite handy. I'm currently working on a website where the client has a sort of Q&A, where the threads are listed below each other. He'd like his timestamp to be an "ago" value instead of an absolute date-time string. In this case our cache wouldn't really work. There'd probably have to be a hook which other modules can use to replace certain tokens. Something like that.

steven jones’s picture

I don't think what we're talking about here is actually complicated in terms of code, so when I start implementing, some basic testing and sanity checking would be very, very useful. Also, documentation is always a bigger task than you'd think, so help there would be very much appreciated.

This sort of functionality (replacing tokens in cached output) is actually in views already, though I've not seen any working examples.

steven jones’s picture

@yhahn, I knew that there should have been a better way of storing all that duplicate data, and making the DB do the heavy lifting in this makes me happy. Seems like a good refinement of the process to me.

coreyp_1’s picture

Just out of curiosity, why not just add Rules integration? That would let the user decide when to clear the cache.

For example, suppose I have a view that shows nodes based on a flag from the Flag module. I could then set up the Rule to clear the cache for that view when a flag is added/removed.

steven jones’s picture

This is now being worked on at GitHub; I will copy it over to the 2.x branch in d.o CVS.

http://github.com/yhahn/views_content_cache

steven jones’s picture

Status: Active » Fixed

This is now in CVS.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.