File engine: Performance of cache flushes, and related cleanup

JirkaRybka - September 15, 2009 - 22:46
Project:Cache Router
Version:6.x-1.0-rc1
Component:Code
Category:bug report
Priority:critical
Assigned:Unassigned
Status:needs review
Issue tags:cache, file
Description

I recently tried to improve performance of my site (amongst other changes) by installing cacherouter module, with the file engine (which is the only option I have on a shared hosting, apart from default database caching [cache_get() was identified by devel module as slow queriers source, over 20% of total time and about 50% of queries on my site]). Cacherouter seems like the most promising module to me (as Boost module is out of question due to .htaccess restrictions on the hosting, and fastpath_fscache module seems to be even more flawed).

I managed to get the site running with cacherouter+file, with uncertain-to-promising performance results, but then I ran into problems and had to revert to database caching. I would love to have this caching (for a test period at least), and I wanted to hack/patch it somehow, but it turns out to be more complicated than my time permits, so for now, I'm just writing my ideas down here...

My comments below are all about the file engine, and all more or less boil down to filenames/paths and related processing, so I leave them all together in a single issue for now.

---------------------------------

- Why do we use md5 hash in filenames? The cache keys are already supposed to be unique, so should be enough for a filename. We only just need to escape them properly in a non-destructive way (i.e. transform bad characters so that they are still unique, rather than replacing them all by some fixed placeholder). A good candidate to do this in a quick, simple way, might be urlencode() - I believe it replaces everything non-alphanumeric with %xx codes. (If '%' is a problem - which I hope is not - we may convert that further with a simple strtr.)

Per code-comment '%' seems to be OK:

// Make sure we have a good filename when we concatentate $hash with $appendix:
// * Can't be over 255 bytes (ext2, 3, 4) or 255 characters (NTFS, FAT) as per
//  http://en.wikipedia.org/wiki/Comparison_of_file_systems#Limits
//
* Can't include "? * / \ : ; < >" in NTFS and FAT as per
//  http://technet.microsoft.com/en-us/library/cc722482.aspx

- This then allows us to do proper wildcard-matched deletions. We already append the actual cache-key for that purpose (effectively doubling the key), but that's unreliable: One bit is, that we're potentially running into length limitation. We're basically putting together 32-characters hash plus '--' plus the actual cache key (which allows up to 255 characters in core, already), getting up to 289 characters of a filename. That's not guaranteed to be wildcard-deletion compatible already, due to various bad characters being all replaced with uniform '-', and we're then going to truncate the name to the length of 255, making it even worse.

If we use only the escaped cache key, we'll gain extra space of 34 characters, which may, or may not, be consumed by the escaping sequences ('%xx' triples each 'separator' character in size). This means, that if the cache key is really long, if it have less than some 12% of non-alphanumeric characters, then the change helps us. I believe that there are no cache keys that long, and if there were, they would be most probably constructed from larger pieces of data (such as md5 hashes), so being mostly alphanumerical. For the most common short cache keys, this is obviously a simplification, which may (or may not) help the filesystem performance a bit, cutting length of the filenames in half if not more.

- We might still want to calculate the md5() internally, to determine which subdirectory should be used, preserving the random distribution of data into several directories. And I would say - have a setting for how many characters from the hash should be used (0 - 1 - more?), for scalability. Having >5000 files in a directory is no fun.

- The biggest, critical problem for me is the expiration of temporary/lifetime-driven cache entries. The code of flush() and purge() is really heavy performance-wise: After a bit of testing, I tried to put rc1 to a production site, and although the site speed went up, and no obvious problems arised at first, it then killed cron quickly.

Let's see some cache statistics of the site in question, before discussing more: There are 10 cache tables, at the moment with total 11750 entries, 7152 permanent, 4598 temporary. (cache 4-permanent, cache_block 51-temporary, cache_category 4176-permanent, cache_content 140-permanent, cache_filter 4072-temporary, cache_form 51-temporary, cache_menu 2679-permanent, cache_page 424-temporary, cache_views 2-permanent, cache_views_data 151-permanent). It's a site with some 7000 nodes, 2000 not-much-active users, quite a bit of contrib modules, 2500 unique visitors a day; observed less than a day after full flush of caches.

Now what happens is, that at every single cron run, system_cron() collects all core and contrib cache tables (via hook_flush_caches) and invokes cache_clear_all(NULL, $table); for each of these. This goes to flush() in our file engine (BTW retrieving cache lifetime and global user object pointlessly, as far as I can see), and so purge() is called for every single subdirectory. This one in turn iterates through all the files present, and unless there's minimum cache lifetime, every single file is loaded and unserialized, so that we can see what the $cache->expire value is.

So in the end, this means that every single cron run on my site needs to traverse 100 directories, and load and unserialize 11750 files, just to see that there's actually nothing to do. No wonder that my cron hits php time limit, and fails. I had to revert back to database caching because of that, for now.

- IMO, we should somehow divide the data into permanent/non-permanent parts. (On my site, that would reduce the # of items to deal with, from 11750 to 4598). I see basically two options: Store into two subdirectories based on the non/permanent status, or put the non/permanent info into filename (as an extension?). Either way, we would need to check both locations (I suggest permanent first) on reads and deletions, or else process wildcarded filenames on these operations (might need glob(), I didn't really test yet).

- We also need to store the expiration timestamp outside of the serialized cache object in the file, to avoid the excessive loading and unserializing. Together with the above, I feel like having a numeric filename-extension 'filename.[timestamp here]', so that we can tell the expiration from filename alone, and (if not separated into a different directory) a 'filename.p' extension for permanent, so that we can filter these out by the regex pattern in file_scan_directory(), and not iterate through them all later. (And while we're at it, we might use the callback argument of file_scan_directory() for our processing, and drop our own loop.) Again, this will require to access the files by wildcard 'cache_key.*' on reads and deletions, and delete any existing variations before writing a fresh entry, which might be tricky and performance concern(?).

- Still, checking all the non-permanent entries in a single cron run seems to be a killer. I see two options here: One is to have some variable-stored pointer, to have each cron run process only a part of the files/directories. The 16 per-hash-leading-character directories might be a good layer for this.

The other option is to have some sort of an index-record somewhere, having all the expiration info together (database or file), so that we can load it to an array, and expire [a limited amount of] items without searching in the filesystem. I'm quite unsure, how to maintain that index though, given the potentially high concurrency level, and potentially high number of items. Perhaps we can have a per-directory summary file, and insert info about new [non-permanent] cache items in a syntax like 'timestamp:filename/', so that we can add that (under a flock() ) through a simple file-append (fopen with 'a' mode), without need to somehow parse and re-write the whole thing on every cache_set(). Then, flush() might "parse" the file with a simple explode(), [the separate strings working as numeric timestamps with extra filename info - but what about duplicates in the file, if the cache got written multiple times?], and expire some limited amount of entries, plus write the rest of file implode()'d back. Downside is that we would probaly need an exclusive lock for this whole thing, although for only just one subdirectory at a time.

Also worth to mention, we might keep the expiration info in a well-synchronized database table.

- I believe that lowering the rate of cron flushes by dividing the operations to more runs shouldn't hurt, since there's no guarantee of cron frequency already. (BTW, the botton half of get() seems weird to my eyes, in that I can't see where it gets $cache populated, and how it ever runs with the above page_fast_cache() check - but I admit to being rather new to OOP, which still looks weird and messy to me generally).

- Additionally, there's a problem with filepaths being by default defined as relative (per README.txt) in configuration - occassionally, I got errors on is_dir() inside file_scan_directory(), being caught by open_basedir restriction as the script crawled somewhere completely else [don't really know where, as whole http-accessible directory tree for my domain is allowed, and more]. Also I see in the code of Boost module, that they explicitly set the current directory, with a code-comment that Apache sometimes mess this up. I consistently got huge amount of these errors on admin/build/modules page. In the end, I resolved that by setting the path as absolute, but that required quite a bit of gymnastics in settings.php, taking __FILE__ and building the path to my files directory based on that.

---------------------

I might be able to do some work and/or testing on a patch, but I realized that this is probably too big task for me alone.

Setting critical for the cron breakage.

#1

JirkaRybka - September 16, 2009 - 10:39

Today I observed 17209 items total in my caches, 6656 with expiration time (cache_filter and cache_form only), 727 with CACHE_TEMPORARY status, 9826 permanent. I see that generally each table only contains one type of entries, mixed stuff seems to be nowhere used. The critical-for-me tables are either permanent (majority), or temporary (cache_page, cache_block), which would be definitely possible to handle by a status flag in cache filenames - in cron, we would only need to delete some 700 temporary files identified by either filename, or even a table/bin setting [if the assumption of only one type of entries in each table is considered sufficiently safe]. The time-expired tables (cache_filter, cache_form) are definitely either long-term or low-traffic, so slower rate of expiration [chunked processing in cron] shouldn't be a problem.

#2

JirkaRybka - September 16, 2009 - 19:03
Title:File engine: Cron times out, engine needs cleanup» File engine: Performance of cache flushes, and related improvements of filenames

Well, now it occured to me, that Windows is a case-insensitive file system, so that's probably why the hash is there... So now my proposal for filenames is like this: First the actual cache key (escaped), so that we can do proper wildcard deletions with simple matching (not searching over the entire hash first), also gaining more human-readable filenames. Then (with some separator character, for the sake of human-readability of filenames) add the hash, which only just serves to strengthen the uniqueness on Windows. If the cache key was too long (which should be rare, if ever seen), we'll truncate the end, reducing the appended hash first (only just lowering the defense against edge cases on Windows, where even a single character from the hash is still likely to make lower/upper cased cache keys differ), not losing anything from the actual key yet. Cut the name (if running into the length limit) to a fixed size still allowing for extension. We only need to escape very few characters, so let's do a custom way through strtr(), which should be fast, and we can reduce the escape sequences from 3 (i.e. '%40') to 2 (i.e. '!A') characters, and greatly reduce the amount of them (improving filename readability, and preserving space). We don't need to decode that, anywhere. And finally add filename extensions indicating the expiration, to allow for quick filename matching in purge(). This matching should be possible using glob() - should be much faster than php+regex iterations in ancient file_scan_directory(), and is available since php 4.3.0 (Drupal 6 requires php 4.3.5 minimum). We'll also need to find files by wildcard on other operations, though, due to variable extensions.

Going to try some initial patch now. Also improving issue title.

#3

JirkaRybka - September 17, 2009 - 10:35
Title:File engine: Performance of cache flushes, and related improvements of filenames» File engine: Performance of cache flushes, and related cleanup
Status:active» needs review

OK, finally I decided to give it a try, and I reached this version (patch attached). It seems to work just fine, so I have just installed it to my live site - no problems so far, but running for only a hour yet. As for the cron, 6.0-1.0-rc1 died on 20sec. max execution time (with some 6000 cache entries total), while this patched version manages to do full cron run (including all the other core tasks) in 3 seconds (with about 2500 cache entries total, and about the same time with 5706 entries a hour later). Overall performance seems to be more or less identical to rc1 (although I didn't do any benchmarks, I just observed on a few manual requests, that page loading times on localhost are along the lines of: Drupal core - 11 seconds, rc1 - 9 seconds, patched - 9 seconds).

EDIT - addition: After some 10 hours, the cache on my site is almost complete (15489 entries total), and under high load conditions (middle of busy day, crawlers currently hitting a lot of 404's etc.), the cron still finishes in 11 seconds at most (our flushing being just a part of that, obviously). Acceptable perfornamce, I would say :-)

Changes included in the patch:

- some indentation/whitespace fixes (sorry, I couldn't resist)

- port of a recent core fix, where the variable 'cache_flush' needs to be per-table, to avoid interference between various cache tables.

The rest only affects the file engine:

- Introduced the above-proposed filenames in the form of [escaped-cache-key]-[md5-hash].[expiration-timestamp]. key() now returns this thing MINUS extension, because the handling of extensions is different in every single caller function. The escaping is done by a simplified '!X' pattern through simple strtr(), aiming to only change the few dangerous characters, while still keeping them unique, and minimize the impact on string length. The hash comes last, so that wildcard deletions are possible by a simple [escaped-key]* match, and the length limit cuts into the (mostly redundant) hash first.

The extension is .0 for CACHE_TEMPORARY, or .0xxxxxxx for expiration linux timestamp, so that both may be expired properly by a single cast to integer, and the leading zero identifies them all by a simple filename match *.0* Permanent entries have the extension .perm, and are therefore invisible to the purging code entirely (*.0* doesn't match).

It might be questionable, whether we need to care about the unprintable characters \x00 - \x1F, but I made it binary-safe for now; doesn't hurt. In addition to the characters already listed in existing code, we need to escape whitespace (not really nice in filenames), dot (needed for extensions), and exclamation mark (used for escaping now).

- get() now retrieves the actual filename through glob() with [key].* pattern. A bit of benchmarking might be useful here, but I can't do that, and I certainly don't see any noticeable difference. I bet the filesystem caches that, otherwise I would see something different I guess...

- Additionally, get() got some cleanup: $this->content static caching is now in Cache.php, so removing cruft. The minimum cache lifetime handling is now in cacherouter.inc, so again removing cruft - additionally this bit was obviously broken, as the $cache object wasn't getting populated anywhere, should the code run to the else{}. I left the check for page_fast_cache() in place for now, as I don't know what was the intention, but it seems weird to me, too.

- set() now generates the file extensions based on $cache->expire. We also need to deal with the situation, when the actual filename changes due to different expiration of the new entry. I investigated possibilities to rename the old file, but that was quite unsolvable, as I needed to first retrieve the actual filename through glob(), then open it, and only then I'm able to obtain a lock on that file (as flock() only works on actual file handle on some operating systems, per php manual). And even if I locked it somehow, another concurrent request may come after the rename, and add a new (duplicate) entry, thinking that the old one is missing.

So I decided to use second level of locking, only for set() alone, to avoid duplicate writes in the unlockable glob-rename-write sequence. Since we change (potentially) filenames here, we need this lock on entire directory - which isn't that bad, as we have 16 directories per table, so the write-locked part is not really big, and writes to cache are supposed to be rare. The lock is done on a dummy per-directory file, which is stored outside the actual directory, to avoid this file matching '*.*' patterns on mass deletions of data entries. So we have for example 'a.lock' (0 bytes) alongside the actual 'a' directory.

The actual operation on the files is done by deleting any old entries, then writing a fresh file. Considering that we need to do a glob() anyway, and that we're going to re-write the whole file too, this is no worse than a rename, and the benefit is that we remove any duplicate entries (with different timestamps) for sure, even if they existed for some mysterious reason.

- delete() is reduced to three simple conditions, with all these slow file_scan_directory() calls gone. Now it uses a helper function _delete(), shared with set(), which accepts shell-like patterns (i.e. directory/*/key*.* and such). It's just a glob() call and a simple foreach.

- flush() with all it's manual iterations through directories is more or less gone, as glob() does that part for us already. Now, flush() is more or less just a wrapper around purge() - we might want to put the actual code in flush() and get rid of purge(), but I didn't go that far for now.

- purge() now uses glob() to quickly fetch all filenames for non-permanent entries (identified by numeric filename extensions), and the extensions are already the relevant timestamps, so a simple loop does the cleanup just fine.

- Quite a bit of crufty code is now gone from purge() - all the handling of minimum cache lifetime is now in cacherouter.inc. Additionally, the old code in purge() is buggy AFAICT - the minimum cache lifetime is just MINIMUM, not maximum, so it doesn't mean we want to delete all non-permanent entries after that period no matter their timestamps. Also the other case (no minimum cache lifetime) didn't use locking on deletions. Now, that's all gone.

As for my other proposal - doing purge() in smaller chunks to avoid cron time limits - that's not necessary for now. The switch from file_scan_directory() plus parsing all files, to a simple glob() match, gave enough performance gain to allow us keeping the all-in-one processing.

EDIT: BTW, it would be cool if this project had more choices in the "component" field, namely all the separate engines, so that all issues don't need to be just about "code".

AttachmentSize
cache-file.patch 13.92 KB

#4

JirkaRybka - September 17, 2009 - 19:04

Hmm, well... Now I see that the ported core fix already have a different issue open - so in fact this patch have #569436: Cache flush as core (per table) rolled in, slightly different due to some instances of the variable being removed here. (The core fix was: #227228: cache_clear_all and cache_get fail to clear caches when Minimum cache lifetime is on.)

#5

vacilando - September 17, 2009 - 23:04

Excellent findings and suggestions. Esp. the very inefficient cache deletion during cron run is a problem that needs an urgent fix. Subscribing. I'll also try Jirka's patch.

#6

harking - September 17, 2009 - 23:27

I agree that this is a great leap forward for the File cache and that the cache deletion improvements will allow this to scale to larger sites.

I'll look into the patch for any issues.

Thanks for the hard work JirkaRybka

#7

slantview - September 17, 2009 - 23:32

I will get to this patch in the next few days. I don't have time to take a look right now, but so far it looks good.

#8

JirkaRybka - September 18, 2009 - 19:37

Vacilando pointed me to #514030: Another unlink error with file caching - thanks!

The problem in that issue is that unlink() might sometimes throw an error, if we succeeded to open the file, but before we grabbed a lock on it, some concurrent request already unlink()'d it. The other issue suggests an additional file_exists() check, but I guess we just need to add the error control operator: @unlink() - the same effect, and less code.

New patch attached, only that one @ added, otherwise identical.

AttachmentSize
cache-file-2.patch 13.92 KB

#9

vacilando - September 19, 2009 - 12:42

Thanks, JirkaRybka! Your patch worked just as expected. The latest patch (#8) indeed also solved my problem with the unlink errors reported at #514030: Another unlink error with file caching. I am using it for many hours now without any problem.

I recommend others, esp. the maintainer, to test this patch some more and incorporate it in the dev release soon. Thanks.

#10

harking - September 21, 2009 - 16:58

@JirkaRybka

It is much faster to utilize the actual check for the file existing:

http://spreadsheets.google.com/pub?key=tsxyY18qooS3dsjkdX8hybA&oid=3&out...

I've attached the source file that generated these times.

I suggest we remove all error suppression from the file engine.

AttachmentSize
error-suppression.php_.txt 1.51 KB

#11

JirkaRybka - September 21, 2009 - 18:38

Hmm, if that's the case, then something is utterly wrong in the PHP internals (which we can't fix, obviously). I probably can change my patch to that more-code-bloated version if that's significantly more performant, but I would like to see the speed test done in more "real" situation.

I mean, the test script attached to #10 hits always the same file (additional risk of measuring a cache rather than the real thing) that doesn't exist, so the file_exists() version always goes the shorter way with unlink() never really called, while the @unlink() version goes always the (likely) longer way, where the error condition really occurs and PHP have to deal with that (whatever that might mean). In real situations, we're in 99.9999% (or so) hitting different files, that DO exist, so it will be the other way around. Essentially, in the test we compared just file_exists() to failed @unlink(), while in reality we'll compare file_exists() plus successful unlink() [not to mention if {} structure], to a successful @unlink(). I really don't know whether that makes a significant difference, but I guess it might. Can we generate some 1000 real files, and then delete _these_ under the timer?

Other than that, I believe we still need the @ operator on fopen() calls, because we can't know whether the file got removed just prior to fopen() by another request. If so, fopen() throws an error. We can't obtain a lock on the file prior to fopen(), because flock() requires an open file pointer.

The time critical part is get() AFAICT, where we might be able to get rid of @fread(), but that bit is almost not changed by my patch, and is also a different issue (#581476: Filesize errors caused by unlink), so I would rather not roll everything into a single patch. (Deletion was different matter, because this patch already changes that code completely.) Perhaps a follow-up issue on error control operators alone?

Today I don't have the time to test (sorry).

#12

harking - September 22, 2009 - 07:03

I just did some testing with files. Looks like when they exist just using error suppression is faster than checking if it exists first.

http://spreadsheets.google.com/pub?key=tD0h58h9-Zc90xZVEgNtDSA&oid=2&out...

Attached test code

AttachmentSize
error-suppression.php_.txt 2.06 KB

#13

JirkaRybka - September 22, 2009 - 10:49

Thanks for doing these tests, harking - greatly appreciated.

I don't see any problems with the test script at #12, and based on the results, I believe that the patch doesn't need any changes in the @ operator part.

So, still 'needs review' for #8. Thanks everyone.

#14

JirkaRybka - September 26, 2009 - 00:21

I re-rolled the patch with a fix for #588030: File Engine - Cache delete with wildcard "*" does not purge correctly due to incorrect file filter. The additional problem here is, that the cache key needs to be escaped consistently across all operations (including wildcard deletions - and believe it or not, I got bitten by this bug recently on a live site, but didn't have time to investigate).

To avoid further breakage with two separate escape definitions being difficult to maintain, I just abstracted the purely filename-related part to a separate function escape_key(), ensuring consistent output for all operations.

I rolled it into this patch, because the code in question is completely changed here already, and fixing this in existing 6.x-1.x-dev would have further nasty implications [i.e. the key is used directly as a part of regex there]. I suggest considering this patch rather than fixing sub-parts in old code [that needs cleanup anyway].

AttachmentSize
cache-file-3.patch 14.64 KB

#15

harking - September 27, 2009 - 19:41

JirkaRybka:

Thanks for rolling in the delete wildcard fix.

I've read through the patch and error suppression needs to be performed on the unlink() call in purge.

In addition, we do not need to get an flock on unlink calls, as they operate on filenames and not file pointers. If another process has the file open, the unlink call will not harm that process as it still has the file pointer to the location on disk.

The error suppression is needed in case we get swapped out between the glob call and unlink, in which time another process has already unlinked it. _delete() and purge both fopen and then request a flock before unlinking. I suggesst we remove the fopen and flock calls around all unlink()s and simply add error suppression.

Besides these changes, everything looks good. How were the character substitutions generated? They seem arbitrary.

#16

JirkaRybka - September 27, 2009 - 20:47

harking:

Thanks for review!

- Error suppression on unlink(): Good catch! We discussed that above, but I missed that second occurrence of unlink() completely. Now fixed.

- Locks on unlink(): I'm unsure here; I'm not anything like a filesystem-guru, but I've seen a few old-fashioned filesystems before, and I really don't know whether it's a good idea to do unlink() on a file that's possibly being read by other process currently. Another process already have a pointer - but what are the implications of unlink() in this case, and on different filesystems, that's completely unknown to me (looks scary, though). Is there any good documentation out there, exploring this situation on Linux, Windows, and Mac at least? I would like to see that, before changing the existing code in such a way. May be a subject for further research (i.e. other issue).

- Character replacements are indeed arbitrary (hand-written) - feel free to improve if you like, but it's not really important part IMO. My reason for this is, that no already provided system seems to fit here:
* We can't escape the string like for a regex, for SQL, any backslash-or-similar way, because that won't really remove the bad characters. (Plus backslash is illegal on windows by itself, and would require further transformation.)
* We don't want to encode the string to character codes (like '%code' or '&#code;' or whatever), because length is a concern, and any generic encoding like that involves hexadecimal (or even decimal) codes along with an escape character (that's 3 characters per code minimum), and would apply to unnecessarily large set of characters (making the length concern possibly critical).
* We don't want to just remove the bad characters (or replace by a placeholder, like old code did), because that removes the uniqueness from cache keys.
We also don't want to replace more characters than necessary, for pure readability of the filenames. The custom way I finally used replaces only minimum of characters, and additionally & for-this-very-reason manages to do that by just two-legal-characters sequences, minimizing the impact on length.

Attaching updated patch (just one @ added on unlink).

AttachmentSize
cache-file-4.patch 14.64 KB

#17

harking - September 27, 2009 - 23:22

Linux, OSX and Windows all wait until the file is not in use by any process before removing it from the filesystem actual. Linux and OSX hide the file, but the file is still accessible via the pointer until closed.

In Windows case, the unlink will fail if any other processes have it open.

Using the file lock will be necessary in _delete() as we need the file to be removed right then. However, for purge(), we can safely remove the flock()s around it since either the file will be removed without issue on Linux and OSX, or in Windows case, it will be removed in the next CRON run as the comment above it states.

#18

JirkaRybka - September 29, 2009 - 19:28

OK, thanks for the research. This looks good to me, so I updated the patch. I also updated the code-comments on both unlink() occurrences, to briefly state what we explored here; the old comment in delete() was (according to the resources linked in #17) misleading IMO (although actually it applied to the Windows case, it sounded to me like the concurrent reads were in danger, not the unlink, which is not the case).

AttachmentSize
cache-file-5.patch 14.84 KB

#19

slantview - October 2, 2009 - 19:17

Great stuff guys, I'll review and get this committed shortly.

#20

EvanDonovan - October 28, 2009 - 21:46

Subscribing.

#21

nonsie - November 18, 2009 - 17:52

Initial testing shows it works as expected on Windows.

 
 

Drupal is a registered trademark of Dries Buytaert.