Adaptive path caching

chx - February 18, 2008 - 00:39
Project:Drupal
Version:7.x-dev
Component:base system
Category:task
Priority:critical
Assigned:chx
Status:duplicate
Issue tags:Performance
Description

See http://drupal4hu.com/node/124 for a description. Tables are:

CREATE TABLE `path_alias_statistics_map` (
  `page_path` varchar(255) NOT NULL,
  `language` varchar(255) NOT NULL,
  `path` varchar(255) NOT NULL,
  `alias` varchar(255) NOT NULL
);
CREATE TABLE `path_alias_map` (
  `page_path` varchar(255) NOT NULL,
  `language` varchar(255) NOT NULL,
  `path` varchar(255) NOT NULL,
  `alias` varchar(255) NOT NULL,
  `counter` int(10) unsigned NOT NULL default '0'
) ;

I still need to add a timestamp to the statistics table and zap only entries older than a given timeframe instead of throwing out the whole table. I still need to check whether it works at all and need to write the init part in path.inc... but I tested the queries in cron and they were rather promising. I just dump this here -- release early, release often :)

AttachmentSize
path_alias.patch6 KB

#1

pwolanin - February 18, 2008 - 00:52

subscribe

#2

Wim Leers - February 18, 2008 - 22:25

Subscribing.

Sure as hell more advanced than my blacklist/whitelist approach: http://drupal.org/node/106559 :)

#3

neclimdul - February 18, 2008 - 22:30

Nice, I've always liked an idea like this. The logging is the tricky part but I think the benefit makes it worth investigating.

#4

catch - February 18, 2008 - 22:39

subscribing. Does this kill the blacklist/whitelist or are they complementary?

#5

BioALIEN - February 18, 2008 - 23:04

Subscribing to the sound of more caching!

#6

chx - February 19, 2008 - 00:35

Sure can live besides a white/backlist. All this does is collect some statistics and then grabs the URLs that appear on most loads of that page and preloads them in one query. Our model here is a node page with a user block that has changing usernames in it or the forum block with everchanging recent posts links etc. The node links , the node author link etc etc will be present on every load but those dynamic links will appear only on a tiny fraction of the pages. Edit: not pages, but loads of one single page.

#7

Wim Leers - February 19, 2008 - 06:43

blacklist/whitelist was initially pushed back because it would be too hard for users to configure, whereas this works completely autonomously.
Let's first see what the impact of this patch will be, and then consider porting the blacklist/whitelist patch to D7.

#8

Crell - May 1, 2008 - 06:55

I can't believe I missed this. subscribing, note to self, need to test.

#9

m3avrck - May 5, 2008 - 19:42

subscribing

#10

irstudio - May 6, 2008 - 19:38

Hey

I think the tables should be prefixed with url_alias_ eg:(url_alias_map, url_alias_map_log)
or something like that, so they show up right near the other path module tables (url_alias, url_alias_extra)

Secondly,
what's the idea behind using variable names and table field names like $path and $alias,
we already have $src, and $dst and we should use them for consistency, right?

OR
we should change 'src' to 'path' and 'dst' to 'alias' and 'url_alias' to 'path_alias' everywhere else as well

PS
'footer' is not really a lookup action, it should be called 'log' or 'record' or something like that

PPS
path_alias_statistics_src??

#11

chx - May 25, 2008 - 19:57
Status:needs work» needs review

Yay! A patch after so long! Untested.

AttachmentSize
adaptive_path.patch 11.68 KB

#12

dami - June 13, 2008 - 20:53

subscribe

#13

R.Muilwijk - July 11, 2008 - 07:36
Status:needs review» needs work

New patch..

- Applies to HEAD
- Fixes schema in install script
- removes the PRACTICAL_DATABASE define which is not needed in my opinion. Change back if you think differently.
- Reworks the .test file to run the current available tests for the different caching types. I get 5 failures in the PATH_CACHE_ALL. Haven't looked into why yet. So for now code needs work.

AttachmentSize
adaptive_path.patch 19.42 KB

#14

p.brouwers - July 14, 2008 - 09:53

Muilwijk, I'll have a look at those failures somewhere this week.

#15

smk-ka - July 20, 2008 - 22:51

Since we're successfully using a backport of this code in-production on D5 (hard-wired to adaptive caching), I've tried to port it forth for HEAD.

Issues identified:
1. We cannot immediately load the cache, since the first call to drupal_get_path() is a 'source' op triggered by drupal_init_path() to determine the initial value of $_GET['q']. However, since $cid relies on $_GET['q'], we need to lazy-load the cache upon the first 'alias' op.
2. On op 'wipe' we also need to clear the cached paths, otherwise we would be loading the old cached entries on the next page request. However, the currently implemented solution to just wipe anything isn't optimal.
3. The cron queries were broken.
4. Inserting mappings into the statistics table one by one pretty much neutralizes the effect that we actually wanted to achieve. Implemented an extended INSERT that pushes all paths through a single query.

To-dos:
1. Wiping the cache_path table is sub-optimal.
2. The extended INSERT clause can exceed MySQL's max_packet_length if there is a huge number of mappings on a page. The code should break down queries to a reasonable length and execute multiple INSERTs if required.

Patch is untested.

AttachmentSize
adaptive_path.patch 21.13 KB

#16

jaydub - July 21, 2008 - 13:20

subscribe

#17

catch - July 21, 2008 - 13:27

Looking at the patch, adding the extra ops to drupal_lookup_path() makes it a very, very long function. Could we de-op it into drupal_lookup_path_alias() drupal_lookup_path_system() etc.?

#18

smk-ka - July 21, 2008 - 15:45

Using standardized static caching? Or just factoring out the actions, passing $map around the old-style way (for now)?

#19

catch - July 21, 2008 - 16:34

I meant just factoring out the actions, but yeah this would also be a candidate for the static cache facility.

#20

chx - July 28, 2008 - 05:30
Status:needs work» needs review

Adding extended INSERTs are the task of the DB layer IMNSHO. Yes, DELAYED is also the task of that but I think we can add that (much) later should the DBTNG not get in. I have copied your cron and wipe fixes (thanks!) but implemented a new way to not cache on the first call - namely I pass in an argument which says it's a first call. Easy :) I post the patch despite we have four failures in PATH_CACHE_ALL -- I hope I can fix it later but, release early, release often.

AttachmentSize
adaptive_path.patch 17.15 KB

#21

andreiashu - July 28, 2008 - 05:49

subscribe

#22

chx - July 28, 2008 - 12:57

Yeah, easy fix: we needed to take empty language into consideration as fallback. Path tests now pass.

AttachmentSize
adaptive_path.patch 17.17 KB

#23

smk-ka - July 28, 2008 - 13:59
Status:needs review» needs work

1. The shutdown action has got lost.
2. Clearing the cache is actually not enough... In fact, we also need to get rid of changed or deleted aliases from the database tables, otherwise they will simply be re-read on the next page request. Does path module implement actions?

#24

catch - July 28, 2008 - 23:07

Just tested again and I get three fails with "every alias cached" - all 404s.

#25

chx - July 31, 2008 - 03:48
Status:needs work» needs review

The wipe action definitely needs fine graining in a followup patch. I added back the footer call to the drupal_page_footer -- not shutdown action because this way you can write a script which decides to skip the shutdown actions for perf reasons. With register_shutdown_function you can't.

AttachmentSize
adaptive_path.patch 18.06 KB

#26

catch - August 1, 2008 - 11:12

Here's a re-roll with the schema and updateimplemented for cache_path. This now passes all the simpletests, and manual testing is fine too.

I think the caching radios should probably be moved to /admin/settings/performance instead of admin/build/path/settings - will probably re-roll with that a bit later on.

AttachmentSize
adaptive_path_with_tables.patch 21.55 KB

#27

catch - August 1, 2008 - 11:35

And with the form moved to admin/settings/performance plus a few string tweaks.

AttachmentSize
adaptive_path.patch 21.87 KB

#28

Robin Monks - September 1, 2008 - 02:20
Status:needs review» reviewed & tested by the community

Applied with some offset (fixed in attached patch), tested and worked OK.
Simpletest: Path: 180 passes, 0 fails, 0 exceptions

RTBC?

Robin

AttachmentSize
adaptive_update[1].patch 21.53 KB

#29

chx - September 2, 2008 - 16:53
Status:reviewed & tested by the community» needs work

DBTNG is in. Let's fix the insert first.

#30

chx - September 26, 2008 - 23:50
Status:needs work» needs review

Like this.

AttachmentSize
adaptive_path.patch 20.28 KB

#31

chx - September 26, 2008 - 23:54

Patch description. The patch collects a statistics of what aliases appear on a page. If an alias appears frequently enough (60% but overrideable) then that alias is decided to be constant and these are written into a table and then later cached per $_GET['q'] . Testing the patch is not trivial as merely setting the path strategy will not show or change much as the path_alias_map table will be empty. You can, however, generate content, load a few pages, run cron and see that path_alias_map is populated and once that is true, benchmarking would make sense.

#32

meba - September 29, 2008 - 15:25

subscribing

#33

Crell - September 29, 2008 - 18:31

chx: Would this patch apply to D6? If so, we could possibly install it experimentally on some production D6 sites and see what happens.

#34

eli - October 1, 2008 - 00:12

Subscribing.

Curious to see how this plays out on benchmarks simulating real site use.

#35

hass - October 2, 2008 - 12:08

Subscribing

#36

agentrickard - October 3, 2008 - 13:22
Status:needs review» needs work

When running cron with the patch, I get this:

[03-Oct-2008 09:15:04] PHP Fatal error:  Unsupported operand types in /Applications/MAMP/htdocs/drupal-cvs/includes/database/database.inc on line 325

Something in the path_cron() function -- maybe interaction of the temporary tables with DBTNG?

#37

andypost - October 6, 2008 - 19:11

Subscribing

#38

mfer - October 13, 2008 - 02:04

Subscribing

#39

chx - October 25, 2008 - 02:15
Status:needs work» needs review

#325895: [DBTNG + simpletest] queryTemporary is broken and need simpletest test case . patch rerolled against HEAD.

AttachmentSize
adaptive_cache.patch 20.14 KB

#40

mfer - October 30, 2008 - 00:39
Status:needs review» needs work

In the function definition for function drupal_lookup_path() the argument $first_call is missing. This is in path.inc.

<?php
 
function drupal_lookup_path($action, $path = '', $path_language = '') {
?>

should be something like

<?php
 
function drupal_lookup_path($action, $path = '', $path_language = '', $first_call = ????????) {
?>

#41

chx - October 30, 2008 - 00:45
Status:needs work» needs review

Excellent catch!

AttachmentSize
adaptive_cache.patch 20.5 KB

#42

Rob Loach - October 30, 2008 - 01:32

Small tweaks and the form fix.

AttachmentSize
223075.patch 21.3 KB

#43

Rob Loach - October 30, 2008 - 01:44
Status:needs review» needs work

Forgot to report "URL contains the path alias." is reporting a failure.

#44

keith.smith - October 30, 2008 - 01:45

There are several (three or four) comments in here that do not end with periods.

+ * Alters the System Performance Settings form.

I'm not sure we capitalize things like this.

+    PATH_CACHE_ADAPTIVE => t('Cache aliases that appear to change infrequently'),

Appear to infrequently change? Why not just "...that change infrequently"?

+    '#description' => t('Path alias caching can significantly reduce the number of database queries on each page load. If the page cache is also enabled, performance increases from enabling the path alias cache will mainly benefit authenticated users.'),

This is pretty technical (which is probably OK), but could possibly be simpler. How about:

"Path alias caching significantly reduces the number of database queries on each page. Using the page alias cache in conjunction with the page cache will increase performance for authenticated users."

#45

chx - October 30, 2008 - 01:49
Status:needs work» needs review

Added a path.install file.

AttachmentSize
adaptive_cache.patch 20.31 KB

#46

chx - October 30, 2008 - 01:50
Status:needs review» needs work

Opsie crosspost. Did not want to CNR.

#47

mfer - October 30, 2008 - 02:21

This patch fixes path_schema so it now calls the schema 'cache_path' rather than 'cache_page', returns $schema, and doesn't add all of the system schema to this one. Added hook_uninstall.

There is still a test that fails.

AttachmentSize
adaptive_cache-223075-47.patch 21.14 KB

#48

chx - October 30, 2008 - 07:11

That test fails but various bits and pieces were lost durign the rerolls most improtant most importantly the path schema and the new path tests. Interestingly, still that one single path fails. I also moved some of the code into a new utility function.

AttachmentSize
adaptive_cache.patch 25.71 KB

#49

chx - October 30, 2008 - 08:03
Status:needs work» needs review

<?php
 
if (!$count && $action != 'wipe') {
    return
FALSE;
  }
?>

$action != 'wipe') was missing so that adding the first alias was not taken into account. All tests pass.

AttachmentSize
adaptive_cache.patch 25.73 KB

#50

mfer - October 30, 2008 - 10:21

Patch applies, works correctly though drupal install, and passes all tests. When set to 'Cache all aliases on each page (not recommended for sites with a large number of aliases)' I saw it caching all my aliases. When set to 'Cache aliases that appear to change infrequently' I could not get any of my aliases to be cached. The rows were appearing in the cache_path table but the data in the serialized array was always empty. I have not had a chance to look through the code.

Is there something special I need to do to test the adaptive caching part of this patch?

#51

mfer - October 30, 2008 - 12:25

Ah, should have read #31. I'll retest the adaptive caching later tonight.

PATH_CACHE_ADAPTIVE requires cron to work. Shouldn't there be a note on the form where you enable about this requirement?

#52

mfer - October 31, 2008 - 00:02

In drupal_lookup_path and the footer operation it was calling ->delayed() on an insert query. Should have been ->delay(). This patch fixes it and statistics are new being recorded.

AttachmentSize
adaptive_cache-223075-52.patch 26.37 KB

#53

mfer - October 31, 2008 - 01:29
Status:needs review» needs work

cache_path had stale information in it, even after a crop run. cache_path was keeping the old version. It had an expires of 0. When I truncated the cache_path table it started caching the paths. But, not exactly as expected. If I went to node/1 it cached the path. If I went to test1 which is the alias of node/1 nothing is cached.

#54

catch - November 1, 2008 - 13:22

I like this patch but I don't like having two choices for the admin - how much is a 'large number'? 100, 2,000?

If possible we should hide strategy that's used completely if possible - probably by doing a COUNT on path aliases each cron run and switching the strategy based on that. If we can pick a hard number of 500 or whatever over which in-memory caching might cause issues, then hard code it to that, if it's too restrictive make it a variable which can be altered in settings.php or a contrib module. But as it is there's a lot of margin for user error and confusion.

#55

mfer - November 1, 2008 - 15:34

@catch - It's really hard to pick a hard number because so much of that depends on your setup. In some hosting environments it might be 500 and in others it might be 3,000. If we did set a limit it would need to be configurable.

In either case we need some good help documentation outlining when/how to use these and that cron is needed for adaptive caching to work.

#56

catch - November 1, 2008 - 16:11

mfer: In that case I think we need two variables, naming is random but hopefully gets the point across:

path_alias_cache_strategy_threshold - variable is set by the COUNT on {path_alias} on cron runs, defaults to N (1000?)
path_alias_cache_strategy - off, all, adaptive - based on the combination of the on/off setting on the performance page and the strategy threshold.

We don't offer any UI for path_alias_cache_strategy_threshold, but you can change it in settings.php or contrib if you want.

Users already get warnings if cron isn't running, so I'm not sure we need to note that on the performance page - if cron isn't running your watchdog table will fill up and that's much more of a performance hit than not having path alias caching.

Should the help text for caching be in hook_help() or the handbook? I don't remember seeing all that much for CSS and JS aggregation, or block caching either - but it'd be good to fix that.

We also need a way to document variables that aren't triggered by a settings form - but that's not a job for this patch.

#65

basicmagic.net - December 8, 2008 - 13:23

subscribe

#66

swentel - December 9, 2008 - 23:49

Chasing HEAD, nothing changed so far.
Note: We should test upgrade from D6 to D7 too, as I'm quite sure if path is enabled, path_alias_map & path_alias_statistics_map won't be created. Anyway, initial tests dropped a lot of db queries, so this has a lot of potential!

AttachmentSize
adaptive_cache-223075-66.patch 27.32 KB

#67

catch - December 10, 2008 - 00:19

All tests pass too. So it seems like we need to do a bit of dbtng update, resolve the UI issues, and see about that caching bug reported earlier.

#68

lilou - December 10, 2008 - 00:58

#69

killes@www.drop.org - January 3, 2009 - 13:24

path.inc has been converted to dbtng.

#70

chx - January 6, 2009 - 21:16

#71

eaton - January 22, 2009 - 01:37

It seems that this would be a good match for http://drupal.org/node/320132 - making path.inc swappable, and providing an alternate "adaptive" implementation in a separate .inc file, would be a reasonable approach with less arguing about where cutoffs should be.

#72

Crell - January 22, 2009 - 15:58

Swappable include files break the registry.

Note to self: Just damn well write the handlers patch.

#73

eaton - January 22, 2009 - 23:49

Hmmmm. I didn't realize that - has the registry patch broken swappable cache.inc?

#74

justinrandell - January 23, 2009 - 00:35

adding tag.

#75

Crell - January 23, 2009 - 05:57

@Eaton: The registry right now contains a hack to avoid it. The problem is that if the same function/class name appears twice in any file parsed by the registry, we end up with an SQL duplicate key insert error. To avoid that, the registry rebuild process uses a merge query rather than an insert query. That means we are using 1 query per function/class record (which comes out to somewhere north of 1000 merge queries for core alone, which for the degenerate driver is actually 2000 queries because it's a multi-step process) rather than being able to optimize with a single (or a small number of) multi-insert statement. It also means that we end up with essentially garbage data in the registry for those function/class names. Just as api.drupal.org right now has no knowledge of the MySQL version of db_query(), the registry in D7 right now has no knowledge of cache_set() except for whichever version of cache.inc gets parsed second.

For the cache that isn't causing a problem yet, because it initializes before the registry does. For anything that we want to not have to initialize manually (thereby bypassing the registry completely), swappable include files simply do not work. That is why classes exist in the first place. :-)

#76

Flying Drupalist - April 19, 2009 - 00:30

Can anyone backport this patch to 6.x? Even without being committed this will make a large difference on many sites!

#77

CorniI - April 22, 2009 - 18:10

subscribe

#78

michtoen - April 24, 2009 - 23:57

subscribe

#79

Berdir - April 25, 2009 - 00:39

As I already mentioned in an another issue.. drupal_static() allows to handle the caching *outside* of that function, without touching it.

So, if we just want to have an alternative caching, and not an alternative implementation, we don't necessarly need to make this file swap-able.

#80

catch - May 16, 2009 - 22:35
Status:needs work» duplicate

#456824: drupal_lookup_path() speedup - cache system paths per page. just got in - which stole the basic idea from here but didn't actually cache aliases. So closing out as duplicate.

 
 

Drupal is a registered trademark of Dries Buytaert.