Download & Extend

PERFORMANCE! Central caching for category API functions

Project:Category
Version:6.x-2.0-beta3
Component:Code
Category:bug report
Priority:critical
Assigned:JirkaRybka
Status:closed (fixed)

Issue Summary

I'm opening this as 5.x-1.1 issue, because that's the version where I had to fix this quickly, and so it's where the initial patch applies. But however, I'm going to rework this for 6.x ASAP, as it's a stopper for my site's upgrade.

The problem and example of my site: It's known for a while that Category is a quite resource-intensive module. It didn't bother me that much until yesterday, when the hosting provider took my site down because it was overloading the shared server too badly. There were many slow queries, most of them originating from Category module (install with some sub-modules). To outline the performance profile of my site - from initial state, I saved as much as 25% of total pages generation time with this patch, and that's a lot for single module out of many. Another 25 to 30% was saved by disabling Statistics module, disabling certain blocks, adding indexes and removing unnecessary DISTINCT from queries, across core and contrib present on the site - in other words, all the other optimizations together are more or less equal to this Category patch alone. The remaining <50% of initial time includes core, quite a few contribs, and several heavy Views-made sidebar blocks, so we're not exactly speaking about 25% out of nothing.

The patch: What this patch does, is adding a central caching mechanism to the query-heavy data loading functions, mostly found in category.inc. There are already attempts on caching visible in code, somewhere static variables, somewhere even a kind of central function handling the queries, but it doesn't cover everything, *and* it only just employs static cache, so it helps with repeated calls in the same request, but doesn't really help with these numerous slow queries on every single page. All the same, existing code made the database server choke - some of the queries seen in category.inc are really horrible, joined multiple times, forcing MySQL to use temporary tables, filesorts etc., and despite static caching, they still need to run on every page (in some cases).

The basic idea of this patch is, that we cache all the data from "horrible" queries not only in static variable, but also in database, to make it persistent across requests. We cache it per-node, so that all data relevant to a given node may be loaded by single query from the cache (as it's quite likely, that a node being processed will need more of the data retrieved by various API functions later). And we make the mechanism central to all Category functions, by defining an uniform function call, minimizing impact on the code in each API function, and allowing for the desired bundled per-node storage of all data.

So first (and biggest) part of the patch is transformation of all the functions (that are likely to benefit from this) to use the new uniform function for caching, instead of previous static / old function / none workflows. It's just a simple get / rebuild+set-if-empty structure, uniform to all these.

Once we have all the caching in a central function (which is a single function with get/set/flush ops, to allow for easy static caching), we're almost there. First, the caching function have static variable caching, which is therefore applied to all the callers (i.e. more places than it was). Second, the cache is stored to a new database table {cache_category}, in the form of merged node-oriented records (serialized), so once a node is queried once, all API functions can process it without further queries (assuming the cache is already built), reducing the whole bunch of horrible joined queries into a single and simple cache get.

With node ID equal to zero, various global data (such as full tree and the like) are stored.

The database cache may persist quite long, not even flushed by most postings on the site. The positive thing about this approach is, that we've all cached data per-node, so we can only flush the cache for the very node being saved or deleted, preserving all the rest. This only applies to regular nodes, obviously - if a category or container gets saved, we need to flush the whole cache, as the hierarchy might be affected. But saving categories and containers is usually not very common. We also flush the zero-key global data on every node save (as it might be affected), but considering that viewing content is much, much more frequent than saving, this is still a win. The horrid big queries now only run once, after a node gets saved.

Since we have the caching centralized, the flushes are now handled directly as a part of hook_nodeapi operations, so the various $reset flags passed around might be obsolete now (although I didn't remove them in this patch).

This patch is a quick fix created overnight for my site, but it appears to work quite well, doesn't break anything, and saves a lot of time (plus saved my site from being shot by the hosting provider). The patch have code style issues at the moment (weird half-indentations, to keep the patch file small where actual code didn't change), but I really consider it a proof-of-concept at this point. I'm going to try, and implement the same for 6.x version, and then roll a proper patch with chances to be considered for commit.

Anyone wanting to try this 5.x-1.1 patch: It's not much tested, although it runs on my production site now, and helps a lot. It's against Category 5.x-1.1, but I didn't really try to apply it to a clean Category package (my code is hacked/patched a bit - I guess a bit of offset is the only trouble, but who knows). The database table needs to be created manually - there's no update function in the patch! I just asked in PhpMyAdmin "SHOW CREATE TABLE cache_filter", changed table name in the returned query to "cache_category", and executed that to create the new table.

AttachmentSize
cat-central-cache-D5.patch19.55 KB

Comments

#1

Version:5.x-1.1» 6.x-2.0-beta3
Status:needs work» needs review

So, here is the big 6.x Category performance patch. Looks BIG in size, but it's mostly just changes of indentation. Beware: This doesn't apply to Category 6.x-2.0-beta3 (or to any other official release) - it only applies to beta3 after ALL my other patches are applied, as listed in my post #65 at #158598: Category port for Drupal 6.x. This is another layer of patching after I'm done with the fixes listed over there, and since I'm getting short on time, I didn't have the nerve to "back-port" this patch to the last official release (which is broken in certain ways). Either my other fixes will get committed, making this patch fine to apply, or I'll deal with a re-roll afterwards, when an other decision is made. Testers: Either apply all the patches, or look into the other issue (I'm going to post a full patched Category tarball there for testing purposes, but not today.)

This patch is large, but highly needed, since Category is a huge performance drain on larger sites (almost took my one down, recently). This patch provides a generic solution for all Category code, and cuts time/number of Category queries on regular pages to a half, if not more (my benchmarking environment is currently broken, but the difference is *really* visible on my [a bit larger] site [some >25% of total page generation time down, somewhere more], Devel shows quite a bunch of very slow queries gone).

Overview of the patch:

- A new central caching function category_cache_op($op, $nid, $key, $data) added to category.inc (plus hook_flush_caches() implementation, and schema/update code to create needed database cache table). It's mostly the same as in my initial 5.x patch, only the global data block is now stored per-key, for better granularity of this (large) data. To recap, this function provides both static variable, and database caching for all the caller-functions, per-node storage of all data to minimize database queries against the cache, and global flushing mechanism (so that all the functions don't need to care about flushing).

- All category (+sub-modules) functions, which are retrieving data from the database (with quite complicated queries, mostly), are adapted to the new caching system, along this pattern:

<?php
function category_foo($nid, $bar) {
 
$cache_key = 'cat_foo'. $bar// Provide all context of stored data in the key.
 
$data = category_cache_op('get', $nid, $cache_key);
  if (!isset(
$data)) {
   
// Here goes the original code of the function, that builds the data.
    // Changes are mostly just indentation, or somewhere simplified array
    // keys due to local static variable caching no more needed.
   
category_cache_op('set', $nid, $cache_key, $data);
  }
  return
$data;
}
?>

This replaces static variable caching, which is now provided by the central function, so the $reset argument is now gone from these functions:
category_node_get_categories()
category_get_parents()
category_get_children()
category_get_category()

I checked carefully - the $reset argument was never used, except for one instance in category_menu, where I replaced it with a call to the central function to do a flush. The same may be done in any external/custom code needing a flush, but all category functions for saving data handle flushing automatically now, so it shouldn't be needed. ($reset wasn't passed through a lot of functions to subsequent calls, and so was of limited use already BTW.)

- All data saving functions got the responsibility for cache flushing, calling the central function to do explicit flushes. Somewhere it relies on another function being called, that already have the flush, somewhere the flushes are a bit duplicate in the workflow, but the idea here is, that any change to the data done through API functions invokes a correct cache flush ASAP. It's either per-node (plus global data) on a regular node save, or full flush on category/container save.

- Wrapper modules are not converted to do the caching. I was thinking about this, and while there *are* some queries worth the attention, it's rather minor [experience from my live site], and the taxonomy/book tables, being core, may be easily modified by other modules/code without a proper cache flush, so I better left them out.

- Additionally, there's an extra performance fix, in fact a follow-up from #457364: TOC (tree) display missing, where we added $behavior check to avoid problems with content pruning from TOC trees. Now I realize, that the new $behavior check is sufficient by itself, so we don't need to run the query at all - and it's a joined query, performed inside a loop for all nodes in a tree (!) So this patch removes it. (As a side-note, after I looked at the code a bit, I'm not a great fan of the transition to menu-system-stored hiearchy in category_display, but that's out of scope now.)

- Another additional performance fix is in the code, where content of category pages is built: Previously, category_select_nodes() was called first, retrieving the nodes to be listed on a category page, through a rather horrible query, and only then a decision was made, whether we render it, or forget it (and either show nothing, if category_display says so, or retrieve the list *again* through Views). Now it's only retrieved just before the rendering phase, where we are sure that it'll be really used.

This patch is not fully tested yet, but I already tested most of normal workflow, plus (carefully) install/uninstall/update, and it seems to work fine. I'll report back after I test this fully, and put it online afterwards. Note to testers (if any): Although this patch have update function and adds a new database table 'cache_category' (requires to run update.php!), there are no changes to existing database schema, so it's possible to revert back to beta3 afterwards (perhaps dropping the new table manually).

AttachmentSize
cat-performance.patch 46.53 KB

#2

After some testing, I've fixed a few problems:

- I've messed the array keys in category_get_tree() in the previous patch (my apologies!), now fixed.

- The caching function now stores serialized data, even into the static variable. This is necessary to consistently store/retrieve copies of data (emulating the original not-cached behavior, where the data got rebuilt afresh on each call), regardless variable type - because, frankly, objects and php suck. PHP5 always pass and assign objects by-reference, so if we just assign an object to static cache, the cache gets changed whenever the original caller change their object, that got stored to the cache previously. Even if we check for objects explicitly through is_object(), and do a clone() where needed, we don't get out of trouble - there's a significant difference between PHP4/PHP5, and we still have the same problem with nested objects (such as array of category objects). So, serialize() is the easiest reliable way, how to copy a variable without knowing it's type/structure. (Why are we using objects for simple arrays of data, is another question.)

(This applies to the 5.x version above too, it needs serialize() and unserialize() in category_cache_op() 'set' and return parts. Stating it just in case that someone needs that 5.x patch.)

New patch attached, see above for the notes how and where it applies.

AttachmentSize
cat-performance-2.patch 48.59 KB

#3

Status:needs review» fixed

Committed to HEAD. Thanks! See:

#158598: Category port for Drupal 6.x

#4

Status:fixed» closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.