Early in the 4.7 release cycle we accepted a patch that changed the way url aliases are retrieved. Instead of putting them all in a giant serialized array we opted to look up each alias separately. This is a major problem for some people with limited SQL resources and therefore a bug. I propose the attached patch (which is not complete yet). What the patch does is to store all the aliases that have been requested on a page view inside the database in a dedicated cache_path table. Next time that page is viewed we retireve all these aliases and first look up the aliases there before checking the db for single path aliases that might have been added later.

On each page view the updated list of aliases is stord in the database again.

What is missing:
1) SQL updates, for testing simply copy one of the other cache tables
2) a cleanup procedure. A site like drupal.org has a lot of pages and the path cache should be cleared once a day or so.
3) testing and benchmarks

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

killes@www.drop.org’s picture

updated patch, fixed some thinkos, added docs.

killes@www.drop.org’s picture

FileSize
2.88 KB

updated patch, fixed some thinkos, added docs.

RobRoy’s picture

Status: Needs review » Needs work

A couple minor style issues

- "+ drupal_set_path_cache ();" should be "+ drupal_set_path_cache();"
- "else if" should be "elseif"

FiReaNGeL’s picture

While I agree that the current path auto tables are far from optimal, I think page caching for paths is only part of the solution. Did you have a look at http://drupal.org/node/63635 (path alias scalability - add path column and possibly _path hook)? This one would help solve 2 more problems with the path system - pathauto indexes, which have very bad performance right now due to the current state of the url_alias table (http://drupal.org/node/91116) and some slow pathauto queries (http://drupal.org/node/76172).

Also, I think pretty much everyone agree that multiple alias for the same target is a bad idea (in general, for search engine optimization, and it complicates some dev work (ask chx about this)).

Would a 'path' column in the 'node table' along with some other changes be a better solution to the '80 SQL queries for paths' problem? I think so.

FiReaNGeL’s picture

Another idea would be to add an 'id' column to the url_alias table, and a 'type' column (node, user, menu). This way we could just do a join to find paths, and it would allow other kinds of query involving paths to be faster too.

RobRoy’s picture

Yeah, I don't really use multiple aliases for a node, and I agree about the SEO duplicated content issue which is a real concern. However, I recently used the multiple aliases for a node when I switched a static site over and wanted to change the alias scheme to something cleaner but not lose traffic coming to those other pages. But this is the lazy approach and I didn't really need multiple aliases per se, as I should have set up some permanent redirect (301) for those URLs. So maybe the 301 issue is for another issue/day (I think it was already discussed somewhere...).

m3avrck’s picture

FileSize
3.57 KB

Here's an updated patch after talking with killes on IRC. Fixes a number of bugs so this patch actually works :-)

m3avrck’s picture

FileSize
2.95 KB

Removed settings.php, oops.

m3avrck’s picture

Quick benchmarks:

on /admin/

Before: 57 queries per page
After: 14 queries
Reload: 13 queries (because of the cache set)

75% redunction in queries where there are lots of aliases. Not too shabby :-)

webchick’s picture

Status: Needs work » Needs review

Looks like this is ready for review?

m3avrck’s picture

Some quick numbers, running ab -n 1000 -c 10:

Before:
Requests per second: 7.65 [#/sec] (mean)
Time per request: 130.782 [ms] (mean, across all concurrent requests)

After:
Requests per second: 8.95 [#/sec] (mean)
Time per request: 111.676 [ms] (mean, across all concurrent requests)

15% faster on average.

chx’s picture

I always wanted to do something like 'store often used aliases' but could not figure out how to 'teach' Drupal which aliases are used 'often'. This idea is nothing short of brilliant. The number of rows in this table would be approximately the same as the cache_page table and we all know that page caching is good.

So, a huge +1 .

chx’s picture

FileSize
2.61 KB

The patch removed a recent bugfix and was much more complicated than necessary. Instead of returning a path alias from drupal_get_path_cache now we return the whole map and initialize drupal_lookup_path with it.

chx’s picture

Priority: Normal » Critical

Given the possible clashes with other parts of Drupal (nil) the benefits (huge) the amount of change (2.61 kbyte patch) I think we can finish this quickly and we shall not release Drupal 5 without it.

killes@www.drop.org’s picture

FileSize
3.48 KB

here is another improved patch due to Karoly. It avoids to save the cache if all url aliases were already in the cache.

chx’s picture

FileSize
4.43 KB

More of the same. We now save $no_src too, but we only save $map or $no_src if absolutely necessary. I removed the new cache functions as they are completely unnecessary, now all changes are inside drupal_lookup_path . The function logic has been changed to a switch model which will also give a small speedup w/ PHP5 I think.

killes@www.drop.org’s picture

One slight problem with this patch is that some of our links are user-specific even in a default Drupal install. Main culprit is the "my account" link, which has the user's UID in it. Because of this, for every user seeing a particular page, that page's path cache must be updated. But I believe it should be easy to omit the user ID for the "my account" link. The problem will of course exist too in a similar form for stuff with changing content such as a "who's online" block or "recent comments". This will lead to a slight growth in the stored arrays. Currently the biggest array is about 60k, while 18-19k seems normal.

chx’s picture

FileSize
4.44 KB

Robert Douglass discovered a missing break which broke the patch.

m3avrck’s picture

Gerhard, the slight problem you mention is somewhat fixed depending on how often we clear this page path cache as well. I don't forsee it being a problem either. The issue Fireangle links to should see consideration in Drupal 6 to really bring this 100% full circle. But this patch now is a great addition and stepping stone.

I'll post some new benchmarks in a couple.

m3avrck’s picture

Before:
Requests per second: 7.41 [#/sec] (mean)
Time per request: 135.012 [ms] (mean, across all concurrent requests)

After:
Requests per second: 8.97 [#/sec] (mean)
Time per request: 111.434 [ms] (mean, across all concurrent requests)

Seems to be 17% faster now -- so that's a nice improvement from Karoly and Gerhard :-)

m3avrck’s picture

FileSize
7.69 KB

Here's and updated patch. This patch adds in the necessary database updates and database create table statements.

@TODO: Figure out *when* this cache should be cleared. Is it a cron that runs daily? Weekly?

Gerhard Killesreiter’s picture

regarding cache invalidation I agree that it fixes the problem of growing cache items. It doesnt fix the need to update the cache each page load, though.

We shoud implement a solution similar to the filter cache.

killes@www.drop.org’s picture

FileSize
9.72 KB

updated patch including properly setting expiration time of one day. We still need a place to put the cron hook in. system_cron anyone?

moshe weitzman’s picture

not all of this is introduced with this patch, but as long as i am reviewing these functions ...

- exactly what is the meaning of the $no_src array? And further, it seems not needed to cache this on every page.
- would be nice if cache keys are prefixed with path_.
- i think that COUNT(*) is optimized in DBMS but COUNT(pid) is not. we should use COUNT(*).
- in lookup_path(), i see us doing doing processing even when $count === NULL. can't we just return there, if the action is 'alias'.
- i'm fuzzy now about cache expiration details, but it seems like setting the expire time should be sufficient and no system_cron() is needed.

on the one hand, i love the saved queries here. on the other hand, it feels a bit odd to write a cache entry for each and every page that re-states that node/12 has no alias (for example). further, we are adding a LOCK TABLES and UPDATE+INSERT queries to many page views.

i can't think of a better solution right now.

Dries’s picture

Some of Moshe's questions should be answered with code comments. :)

chx’s picture

Component: base system » color.module

At least one of Moshe's questions is already answered by a code comment -- no_src. See how the negative result of 'source' is loaded into it and there is a comment which reads We can't record anything into $map because we do not have a valid index and there is no need because we have not learned anything about any Drupal path. Thus cache to $no_src.

chx’s picture

Component: color.module » base system

What the hell..? it recategorized itself. help! sentient issue :)

killes@www.drop.org’s picture

More on Moshe's comments:

- prefixing with path_ seems not neccessary seeing that the cids are in the cache_path table.
- I'll replace ccount(pid) with count(*), but this is not really part of this patch.
- return if $count === NULL, yes we can do that.
- due to my earlier changes to the cache system the cache won't be cleared automatically anymore. We need a cron call, see filter_cron.

moshe weitzman’s picture

i saw that comment about $no_src but it is not clear still. It only has a couple of entries in it when i look into my DB. There are many more system paths without aliases. and those are stored in duplicate in $map_src ... I wonder if other people can look at the data and the code and understand exactly what is going on. If they can. then it is just my problem.

@killes - sounds like the cache API now lets you set an expire time, honors that during cache_get, but doesn't do garbage collection. That seems like an odd combination. I guess the logic is that the system doesn't 'own' ancillary tables like cache_path(). is OK for now, and not so relevant to this patch anyway.

in my last post i mentioned that we are adding a LOCK TABLES and INSERT and UPDATE and UNLOCK TABLES for every new page. I imagine that a crawler will seriously congest the cache_path table. Not sure what to do about that. page cache table has same problem.

Dries’s picture

Category: bug » task

Maybe it is best to postpone this patch. It's not really a critical bug is it, it's more of a critical task. We have more pressing bugs to fix, IMO. Especially because this is not a regression compared to 4.7. Let's make sure we stay focused on real bugs.

killes@www.drop.org’s picture

Version: 5.x-dev » 6.x-dev
Category: task » bug
Priority: Critical » Normal

Well, I still think it is a bug. If not in the code, then in our review process which led to the current system. ;)

I didn't really think I'd get it into 5 but I thought I might try. :p

FiReaNGeL’s picture

I think that this is worth including in 5.0 for 2 reasons. First of all, some hosts limit the number of SQL requests you can make on an hourly / daily basis. See http://www.totalchoicehosting.com/forums/index.php?showtopic=28272&hl=dr... for an example. 50000 SQL requests an hour may seem like a lot, but with the current ~ 100 requests per page Drupal does, you get a mere 500 page views an hour. Most of the request in Drupal right now are path SQL queries, and this patch reduce this number significantly. Secondly, most devs seem to agree that the path system will be redone / improved (see issues I posted above) for 6.0.

I think the proposed patch is an acceptable temporary fix for 5.0 while we wait for a real (and much more complicated patch-wise) fix in 6.0.

I'm not changing the version back to 5.x on this issue, waiting for further comments.

matt westgate’s picture

Subscribing to this thread. I'd like to help continue down this road.

killes@www.drop.org’s picture

FileSize
10.41 KB

Glad to get some help. I've re-rolled the patch, added system_cron, implemented an early return, and changed count(pid) to count(*).

BioALIEN’s picture

I was hoping this makes it into D5 as I'm severely suffering from this problem on a site. But I'm all for testing this and ensuring it gets included for D6.

Wim Leers’s picture

+1 and subscribing.

Complementary optimization in this patch: http://drupal.org/node/106559. (I had almost already started adding functionality similar to this, but using a inferior approach - thanks chx for preventing a huge waste of time :))

killes@www.drop.org’s picture

FileSize
10.3 KB

Updated for language and schema changes.

killes@www.drop.org’s picture

FileSize
10.3 KB

update was borken

killes@www.drop.org’s picture

FileSize
10.35 KB

We already have system_cron...

morphir’s picture

Here is what happens on a fresh install:
1) I try to set up and register the first account and get a series of errors (see log)
2) I'm unable to to login to my account. That is as far I get for now.

Error log

    * notice: Undefined index: mail in /home/morphir/www/HEAD/modules/user/user.module on line 1346.
    * notice: Undefined index: name in /home/morphir/www/HEAD/modules/user/user.module on line 1347.
    * notice: Undefined index: pass in /home/morphir/www/HEAD/modules/user/user.module on line 1349.
    * user warning: Duplicate entry '' for key 2 query: INSERT INTO users (pass, init, status, uid, created) VALUES ('d41d8cd98f3242400998ecf8427e', '', 1, 1, 1180444939) in /home/morphir/www/HEAD/includes/database.mysql.inc on line 163.
    * notice: Trying to get property of non-object in /home/morphir/www/HEAD/modules/user/user.module on line 259.
    * notice: Trying to get property of non-object in /home/morphir/www/HEAD/modules/user/user.module on line 1379.
    * notice: Trying to get property of non-object in /home/morphir/www/HEAD/modules/user/user.module on line 1382.
    * notice: Trying to get property of non-object in /home/morphir/www/HEAD/modules/user/user.module on line 1397.
    * notice: Trying to get property of non-object in /home/morphir/www/HEAD/modules/user/user.module on line 1404.
    * notice: Trying to get property of non-object in /home/morphir/www/HEAD/modules/user/user.module on line 3097.
    * notice: Trying to get property of non-object in /home/morphir/www/HEAD/modules/user/user.module on line 1277.
    * notice: Trying to get property of non-object in /home/morphir/www/HEAD/modules/user/user.module on line 1277.
    * notice: Trying to get property of non-object in /home/morphir/www/HEAD/modules/user/user.module on line 1277.
    * notice: Trying to get property of non-object in /home/morphir/www/HEAD/modules/user/user.module on line 3102.
    * notice: Trying to get property of non-object in /home/morphir/www/HEAD/modules/user/user.module on line 3105.
    * notice: Trying to get property of non-object in /home/morphir/www/HEAD/modules/user/user.module on line 3156.
morphir’s picture

under this error log, I receive this message:

Thank you for applying for an account. Your account is currently pending approval by the site administrator.
In the meantime, your password and further instructions have been sent to your e-mail address.
morphir’s picture

sorry. I'am a bit trigger happy. The current HEAD is cause in this error. And the errors I posted does not concern this patch as far as I can tell.

morphir’s picture

this one is for real:


Warning: Table 'cache_filter' already exists query: CREATE TABLE cache_filter ( `cid` VARCHAR(255) NOT NULL DEFAULT '', `data` LONGBLOB DEFAULT NULL, `expire` INT NOT NULL DEFAULT 0, `created` INT NOT NULL DEFAULT 0, `headers` TEXT DEFAULT NULL, `serialized` SMALLINT NOT NULL DEFAULT 0, PRIMARY KEY (cid), INDEX expire (expire) ) /*!40100 DEFAULT CHARACTER SET UTF8 */ in /home/morphir/www/HEAD/includes/database.mysql.inc on line 163

Warning: session_start() [function.session-start]: Cannot send session cache limiter - headers already sent (output started at /home/morphir/www/HEAD/includes/database.mysql.inc:163) in /home/morphir/www/HEAD/includes/bootstrap.inc on line 895

Warning: Cannot modify header information - headers already sent by (output started at /home/morphir/www/HEAD/includes/database.mysql.inc:163) in /home/morphir/www/HEAD/includes/bootstrap.inc on line 550

Warning: Cannot modify header information - headers already sent by (output started at /home/morphir/www/HEAD/includes/database.mysql.inc:163) in /home/morphir/www/HEAD/includes/bootstrap.inc on line 551

Warning: Cannot modify header information - headers already sent by (output started at /home/morphir/www/HEAD/includes/database.mysql.inc:163) in /home/morphir/www/HEAD/includes/bootstrap.inc on line 552

Warning: Cannot modify header information - headers already sent by (output started at /home/morphir/www/HEAD/includes/database.mysql.inc:163) in /home/morphir/www/HEAD/includes/bootstrap.inc on line 553

* warning: Cannot modify header information - headers already sent by (output started at /home/morphir/www/HEAD/includes/database.mysql.inc:163) in /home/morphir/www/HEAD/includes/common.inc on line 141.
* warning: Cannot modify header information - headers already sent by (output started at /home/morphir/www/HEAD/includes/database.mysql.inc:163) in /home/morphir/www/HEAD/includes/common.inc on line 141.

killes@www.drop.org’s picture

FileSize
10.3 KB

silly me, I had added the cache_filter table to system.schema thinking it was missing, but it was in filter.schema. Updated.

killes@www.drop.org’s picture

FileSize
10.35 KB

Patch had a few issues.

morphir’s picture

this is starting to look at really hot RTBC-candidate. I do not get any errors anymore.

Need ppl to help out testing before code freeze. Please!

moshe weitzman’s picture

Status: Needs review » Needs work

update path is broken. probably needs port to schema model ...

killes@www.drop.org’s picture

Status: Needs work » Needs review
FileSize
9.61 KB

patch updated

morphir’s picture

Status: Needs review » Reviewed & tested by the community

alright!

now update.php runs smoothly too.. I'm gonna set this one to RTBC.

add1sun’s picture

patch applies fine and the install and update are smooth. this isn't playing well with path and aliases though.

Here's what I did:
- enabled path module
- create content with NO alias
- create content with alias
- go to the front page. first time i click the title link to go to read the post for both works fine
- go back to front and refresh the page. go to the content with an alias and i get a page not found, no alias works fine. creating new content with aliases give page not founds as well.
- reverse the patch and the content with alias works again

killes@www.drop.org’s picture

Status: Reviewed & tested by the community » Needs work

I've thought a bit about this patch and I a not yet 100% satisfied with it.

There is one problem remaining that we should fix: On a standard Drupal install, there is on each page one link that is user-specific. This link is the "my account" link. What this means is that for each page and for each user tht hasn't visited the page yet there will be a cache write. This is not really acceptable. Therefore, we should look into getting rid of the "user/n" form of the link. In D5 we could have replaced it by "user" but the auto-redirect is no longer available.

How about introducing a custom "my_account" url?

I also think we should merge the two separate caches into one and will discuss this with chx.

morphir’s picture

@addisun - I did manage to reproduce the bug you got. Which was a weird one.

but I also manage to get this working. by applying the patch after installation. And creating nodes with and without path alias IS working. Both when I type the full url and click the link. All works.

@killes - the question is: what is happening during installation? what is this patch doing with drupal that it shouldn't?

We are almost there now. Need the final touch of love.

kbahey’s picture

Subscribe.

bjaspan’s picture

I've put some though into the path alias inefficiency recently. I did not know this issue existed and went in a very different direction. I'll record my thoughts here, then probably drop them since this issue is so far along.

I looked at the previous issue that led to the "query for every alias" patch. I noticed that, in the previous version, all aliases were loaded into memory but only in a one-way hash table. If the forward hash lookup failed, the backwards lookup was performed with a LINEAR search (array_search). Of course that was slow on large sites. Perhaps I misread the previous code, but if not I think the benchmarks showing that "query for every alias" was a good idea were inaccurate; we should have just had a hash table in both directions. Of course that would have taken twice the memory. But there is basically no way that querying the database can be faster than a hash-table lookup, especially in PHP where adding even a no-op function call has a measurable performance impact!

So, a two-way hash table is definitely the fast way to implement the this, but on very large sites loading 10^6 rows from the db into memory on every page load is itself time consuming and takes a lot of memory. What occurred to me is that the vast majority of those 10^6 path aliases are for nodes. Furthermore, node aliases have tremendous temporal locality: most of the time, pages are only going to be showing links to the most recent N nodes in various blocks, teaser lists, etc. (the obvious exception is search).

This led me to the idea that we ought to have a cache of the most recent N node aliases (perhaps 1000 or so). We always load those aliases into a hash table at startup, in both directions. Any alias lookup not found in the hash table gets looked up by db request as Drupal 5 currently does it. Keeping this cache would be pretty easy: insert a new row into a table "cache_recent_path" each time a new node is created. Once per time period, trim that table to the most recent 1000 entries. So the number is usually not exactly 1000, but so what.

Finally, any path aliases defined manually at admin/build/path/add ought to be loaded into a hash table on every page load as well. Human admins don't manually create many aliases, and the ones they do are probably pretty important/common.

I now return you to your regularly scheduled patch review. :-)

killes@www.drop.org’s picture

@bjaspan: The main problem with the previous approach at url alias handling was actually a lot of memory consumptionif you had a lot of aliases defined. I agree that the tests done for the evaluation which lead to the current appraoch must have been flawed or weren't detailed enough.

webchick’s picture

bjaspan: interesting. another thought? make the hash table lookup only select rows where path NOT LIKE node/%. For everything but the current node you're viewing, take the path alias out of the hash table. Do one single query to get the current node's path alias.

webchick’s picture

Ah, ignore me. That is a stupid idea. I forgot about all the nodes that might be showing in blocks like "active forum discussions" or whatnot.

bjaspan’s picture

Actually, I do think path aliases to node ought to be keyed by nid, not just the string 'node/%'. Nodes are special, so I think we should have a node_path_alias table: int nid primary key, varchar alias. Or it could just be a column in the node table (though then we can only have one per node, which might be a good thing). Then "load the most recent 1000 node aliases" could be "SELECT * FROM node_path_alias ORDER BY nid DESC LIMIT 1000" (or "SELECT n.alias FROM node ORDER BY ..."). Sorting by PK should be fast. If not, then we keep a cache_recent_path table as I previously described.

This would get tricky, though, because when the admin sets up a manual alias to "node/nid", the code should not have to know to parse it out separately and insert into a different table. Ugly. That's why I did not mention it previously.

I think our (or at least my) thinking on this topic is constrained by the way we currently do it and thus I am not really thinking fully clearly about it. ie: I'm confused. I seem to be saying that in many issues recently.

kbahey’s picture

People,

Tying paths to nodes is not possible with the current architecture, since l() can occur anywhere and reference virtually anything, including any node, any term, or any user, and even module generated pages that do not fit in the above 3 categories. There is no relationship among items, say, in a site map page with lots of terms and nodes listed, or blocks with other user's profiles linked, ...etc.

The current approach is far better than the old one for a large number of aliases.

Think about it for a bit: people want Search Engine Friendly URLs, so they install pathauto, and it automatically creates these aliases for nodes, terms, and users. The site grows and grows.

The old approach just did not scale. I had a site that growing fast on 4.6, and the only way was to implement a backport of the 4.7 patch for aliases. Here is an analysis I wrote on the issue on URL aliases slowing down sites. It took 25 to 30 seconds for a page to load the front page, and 15 for single node views!

So, the current approach is better, not flawed.

Is there room for improvement? Yes. Two patches are in the queue:

- #106559 drupal_lookup_path() optimization: provide a configurable blacklist and whitelist of Drupal paths

- Another approach is to find out if there are only (say) 200 aliases (configurable threshold) and do things with the old associative array approach if that is the case.

matt westgate’s picture

This is probably too off-topic and warrants a broader discussion, but after speaking with other CMS teams and getting feedback as to how they solved this problem let's explore denormalizing the path alias system. In otherwards, add an alias column to the menu, node and maybe even user table, that way the most common aliases are pulled in as part of queries that already need to be run.

Crell’s picture

Doesn't the new menu system already store its own path aliases? I seem to recall that being one of the new features, but I could be wrong.

The current system isn't bad for sites with 1000+ aliases. It's sites that have only 30-40 pages, or big sites with only 30-40 aliases (eg, Drupal.org), that currently get hurt the most by all the "search for something that isn't there" queries.

Yes, l() can run anywhere, but if we can denormalize to get the common cases then we can still cut down the number of queries that l() needs to look up in the first place.

kbahey’s picture

I think the solution for 100 or so aliases is just to cache them all like the old system in 4.6 used to do. One SELECT * FROM {url_aliases} and an associative array built in a static variable, and we are good to go.

We can make this adaptive, with a sensible default. For example the site admin can select what is the threshold where the new system kick in you have more than X aliases. Below that, we have the associative array.

Combine that with the black list/white list patch that I linked to above, and things are highly tunable.

We can try this for one release cycle before jumping to denormalization. Remember that on a typical site there may be blocks that have list of nodes, list of users, ...etc. These are not related to whatever node(s) we are viewing in the content area. Not sure if EVERY block out there does node_load() and user_load() so it would load the aliases it needs.

bjaspan’s picture

Review of the code with the current patch applied:

- 'alias' and 'source' cases both test if $count > 0, but if $count == 0 we already returned.

- For 'source', mapping a path alias to a system path, a linear search is performed to find the alias in the in-memory cache of all aliases for the current page (array_search($path, $map[$path_language]). This is an O(N^2) algorithm (N = number of links on the page) that has to be performed once per day per day (if there are 100 unique links on a page (see below), that's an average of 50,000 operations for 'source'). Why not keep a static $path_map and $alias_map so that 'source' can be O(log N) like 'alias' is? You almost are already, since $no_src is exactly the reverse map I'm talking about except it is only used when no source exists. Instead, use $no_src for every 'source' call and store FALSE into it when no source exists, like you store FALSE into $map when no alias exists.

I could see an argument that we do not perform 'source' lookups nearly as often, perhaps only when a request comes in instead of while generating every link on the page. But if that's true, perhaps we should not cache 'source' lookups at all; it may not be worth the extra complexity.

- This patch stores an array of every link on every page. On this particular "New comment" page I'm typing into, there are 199 links (granted, . A node link with an alias of "thisisatest" serializes into 32 characters. If all 199 links are aliases (which on a pathauto site they will be), that's 64k of data in the cache. Even if some links are not aliased, we will end up caching "no_src" entries for them. So, how many unique pages are viewed on d.o per page? If 1,000, that's 64M of cache data. If 100,000, that's 6.4GB of cache data (d.o has about 100,000 nodes less than a year old). Is tossing 6.4GB into cache_path okay? Disk is cheap, but that's still a lot of cache data.

Maybe counting 199 links is unfair; this issue has 62 comments, each of which contains a link to the issue page that will only show up in $map once. So perhaps we're only talking about 3.2GB of data if 100,000 pages per day are viewed.

It just seems inefficient to me to store a cache of links per page. There will be an awful lot of redundancy in the cache. Most links are in sidebars, etc., that vary by day but not by page.

bjaspan’s picture

Typos; crud. In the paragraph on "this page has 199 links":

(a) Ignore the text "(granted, ".

(b) When I say, "how many unique pages are viewed on d.o per page", I mean "are viewd on d.o per day", since one day is how long the cache entries last.

bjaspan’s picture

I've created a separate issue for my url alias optimization proposal based on temporal locality of nids: http://drupal.org/node/153888

Gábor Hojtsy’s picture

Gerhard, I review the patch. Here are my thoughts:

- there could still be other links specifically for users, not the my account page only
- your path_set_alias() part of the patch include cache keys concatenated with underscores, not colons

Otherwise this could work well with the end user (much more automated, less to think of, less to reconfigure if aliased paths change) then http://drupal.org/node/106559 which is another popular contender...

Gábor Hojtsy’s picture

Title: Tidy up the path alias sql problem » Page level path alias caching

Retitled.

Dries’s picture

Personally, I think the only viable (easy) solution is to denormalize the path table. Let's just add path aliases to the node, menu and taxonomy tables (for starters).

kbahey’s picture

Having a 1:1 relationship for each node, term, menu (and I should add user as well) may be desirable long term goal.

But it does not solve the issue about l() being object agnostic. By agnostic, I mean that l() knows only about paths, and not nodes, users, ...etc. and hence can retrieve the alias for anything regardless of whether it is a first class object, module defined path or whatever.

I think that refactoring l() is out of the question for D6, unless we can keep the API the same (can't think how though).

Before we jump the gun on this one, let us see some low hanging fruit first, e.g. reducing the number of queries to the alias table using a dynamic white list, or caching all the alias table if there are only a certain number before a certain threshold.

mercmobily’s picture

FileSize
51.51 KB

Hi,

"caching all the alias table if there are only a certain number before a certain threshold" -> I am _all_ for it.
My situation with queries is tragic right now, and I have a busy site (free software magazine).
If you look at the attached file, you'll see how bad it is - and I can only assume other sites are worse...

merc@merc-laptop:~$ cat fsm_load.txt | wc
507 7464 52746
merc@merc-laptop:~$ cat fsm_load.txt | grep drupal_lookup_path | wc
246 2702 19786
merc@merc-laptop:~$

Pretty much 50% of the queries could be prevented if drupal_lookup_path were "allowed" to cache up to (let's say) 300 URLs, which would be all loaded with one query at request time.

Bye,

Merc.

treksler’s picture

Hey

- does anyone know what is the status of Drupal urlalias caching?
- has any of this made it into D5 D6 D7?
- where do we go from here? is there an active issue/ patch to test?
- i.e. which one of these possible solutions should i be testing?

References
http://drupal.org/node/102311
http://drupal.org/node/106559
http://drupal.org/node/100301

moshe weitzman’s picture

none of this made it into 6. IMO, the patch to test/refine is 'adaptive path caching' at http://drupal.org/node/223075

catch’s picture

Status: Needs work » Closed (duplicate)