This is yet another proposal for improving the performance of the URL alias system. It is complementary with http://drupal.org/node/106559 (regexp whitelist/blacklist) but not, I think, with http://drupal.org/node/100301 (per-page cache of all url aliases).

In 4.7, Drupal sucked all aliases into memory on every request. This worked great until there were too many aliases. In Drupal 5, a separate query is issued for each call to l(). Everyone realizes this is not a great solution---too many small queries.

My proposal is a middle ground that exploits the temporal locality of node aliases. The basic idea is that the most recent nodes tend to be the ones showing up all the time: on the home page, in most recent/most popular blocks, etc. And nodes are created with sequential nids, so if we need the alias for node N, we are likely to need it for node N-1 and/or node N+1, too. Therefore, we should load node aliases from the database in blocks of, say, 100 or 500 or some dynamically determined number.

To do this, node aliases have to be associated with the nid, not just be a string 'node/nid' in the url_aliases table. If there can be at most one alias per node, it can go in the node table. If there can be more than one, they can go in a separate table with two columns, nid and alias, with nid as primary key and a unique index on alias.
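To make the batching idea concrete, here is a minimal sketch. The function name, the `$table` array (standing in for the proposed nid/alias table), and the miss-caching behavior are all illustrative, not part of any existing Drupal API; in real code the loop over `$table` would be a single `db_query()` with `WHERE nid BETWEEN %d AND %d`.

```php
<?php
// Hypothetical batched node-alias lookup. $table simulates the
// two-column nid => alias table; $cache persists across calls.
function node_alias_lookup($nid, $table, &$cache, $batch = 100) {
  if (!array_key_exists($nid, $cache)) {
    // Fetch a window of $batch aliases centered (roughly) on $nid.
    $low = max(1, $nid - (int) ($batch / 2));
    for ($i = $low; $i < $low + $batch; $i++) {
      // A node with no alias is cached as NULL so we never re-fetch it.
      $cache[$i] = isset($table[$i]) ? $table[$i] : NULL;
    }
  }
  return $cache[$nid];
}
```

With this shape, a page that renders nodes N, N-1, and N+2 issues one window query instead of three single-row lookups.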

Using this new table requires some code changes.

1. The minimal-change approach is to modify drupal_lookup_path() and path_set_alias() to treat looking up an alias for 'node/nid' as a special case. This works and is very simple. It is not particularly efficient, however, since these functions need to do substr()s and string comparisons to decide if the special case applies.
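A rough sketch of approach 1, showing the string test that makes it less efficient (function name and the `$node_aliases` parameter are illustrative; the real change would live inside `drupal_lookup_path()` and call the batched loader):

```php
<?php
// Hypothetical special-casing of 'node/nid' paths (approach 1).
// The substr()/is_numeric() work below is the per-call overhead
// the proposal mentions.
function lookup_path_sketch($path, $node_aliases) {
  if (substr($path, 0, 5) == 'node/' && is_numeric(substr($path, 5))) {
    $nid = (int) substr($path, 5);
    // Here the real code would consult the batched node-alias loader.
    return isset($node_aliases[$nid]) ? $node_aliases[$nid] : $path;
  }
  // Not a plain node path: fall through to the normal url_alias lookup.
  return $path;
}
```

Note that a path like 'node/12/edit' fails the `is_numeric()` test and falls through to the ordinary lookup, so such aliases keep working unchanged.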

2. The more efficient approach is to replace all calls to l('text', 'node/'.$nid) with l_node('text', $nid). Now no substr()s or string comparisons are necessary because the nid is immediately available. path_set_alias() could still know about the special case to keep things simple.
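A sketch of approach 2. `l_node()` is a proposed function, not an existing one; `l()` is stubbed here as a plain anchor-tag builder so the example is self-contained:

```php
<?php
// Stand-in for Drupal's l(), reduced to a bare anchor tag.
function l_stub($text, $path) {
  return '<a href="/' . $path . '">' . $text . '</a>';
}

// Hypothetical l_node(): the nid arrives as an integer, so no
// substr() or string comparison is needed to detect the node case.
function l_node_sketch($text, $nid, $aliases) {
  $path = isset($aliases[$nid]) ? $aliases[$nid] : 'node/' . $nid;
  return l_stub($text, $path);
}
```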

The same idea of relating aliases directly to an object id instead of via a string could be applied to other object types. Users are the only other type that I think ever exist in large enough numbers to worry about. I don't think the same kind of temporal locality exists for users, though, so the bulk retrieval approach may not be warranted.

Once we remove all the node (and perhaps user) aliases from url_aliases, there won't ever really be that many left in the table (basically, just those that an admin created manually). At that point, we can and should go back to caching the entire table in memory (either by loading it explicitly at startup or using a cache table entry). This might actually make the whitelist/blacklist patch unnecessary.

I've attached a proof-of-concept patch against Drupal 5 (not HEAD, because devel_generate does not seem to be working). If people think this is a promising approach, I'll pursue it.

Attached file: node-alias.patch (2.25 KB) by bjaspan

Comments

wim leers’s picture

Although there is no one-size-fits-all solution possible here, this one comes pretty close, I think.

But I strongly dislike #2: a forked l() function... blehh! #1 is indeed less efficient, but is the efficiency gain really worth sacrificing the elegance of the l() function? FWIW, the blacklist/whitelist patch uses regexps, so that's *also* not super-efficient. But nevertheless the performance gain was substantial.
In the end though, the most efficient thing to do is to turn off path.module :)

Also, if this patch gets through, I'd like to see the number of cached paths to be configurable.

And lastly: I'm not sure if this approach would be more efficient than the stack-concept. I.e. the concept where you keep pushing aliases to an array (if it's not already IN the array) and then always shift the one at the bottom (the one you pushed first) away. That one would be more efficient on "finding" the most frequently looked up aliases. But on a site like drupal.org, that approach would fail completely. Perhaps some kind of hybrid is possible?
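The stack concept described above can be sketched as a bounded array (this is purely illustrative, not code from any patch; the size limit of 50 is invented):

```php
<?php
// Hypothetical bounded alias stack: new lookups are pushed onto the
// end; when the stack is full, the entry pushed first is shifted away.
function alias_stack_add(&$stack, $path, $alias, $max = 50) {
  if (!array_key_exists($path, $stack)) {
    if (count($stack) >= $max) {
      array_shift($stack);  // evict the oldest entry
    }
    $stack[$path] = $alias;
  }
}
```

As noted, this favors recently looked-up aliases rather than nearby nids, which is why the two approaches might complement each other in a hybrid.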

moshe weitzman’s picture

subscribe.

will this handle an alias to node/x/edit OK?

i have no problem with l_node(), if the substr() calls are too expensive.

devel_generate just got a patch and should work now. if not, file a bug please.

kbahey’s picture

In 4.7, Drupal sucked all aliases into memory on every request.

Minor point: 4.6 was the last release that did this. 4.7 is when we started doing what 5.x does now. Noting this in case you or someone else wants to look at the code, etc.

Temporal locality by nid is not enough.

For example, look at pages like the sitemap on baheyeldin.com. You will see that these are generated by sitemenu, which lists the most recent X nodes in each term.

Also, now with views and panels we have pages that list nodes by taxonomy, by date, by whatever, and combinations thereof. We also have most recent comments, which may be for nodes that were posted a long time ago.

The problem is not the number of aliases per se, as much as it is how we create aliases. Aliases are created by path, and paths do not correspond 1:1 to objects (node, term, user, comment), but rather to an arbitrary string (think of modules that generate their own paths, and possible aliases for those via the custom alias function).

Hence the problem is that we are trying to take that unstructured data and force it into a structured form, but it does not lend itself to this. I don't know of a way to force structure without limiting aliases to a few components (Drupal objects) — e.g. node/xxx would be "node-title", and node/xxx/edit would be derived from that: "node-title-edit", etc. — and preventing arbitrary paths. That would allow 1:1 aliases, and hence we could move the alias to the node table. But it would also break many, many sites, and limit functionality that we have today.

Short of using memcached and Robert's module, and keeping an associative array of all aliases in memory all the time (not built per request, but rebuilt on every cache flush), I don't think there will be a solution that fits all cases, due to the unstructured nature of aliases.

That only applies to sites on dedicated hosts, and probably multihosts. Users on RASH (Random Ass Shared Hosting (tm) Boris Mann) will not benefit from this.

The dilemma continues ...

moshe weitzman’s picture

@kbahey - you don't have to solve a problem completely in order to make a valuable patch. please resist the contrarian urge to highlight flaws in a proposal. this temporal locality will cut down a lot of queries for a lot of sites, and thus is valuable. if it doesn't speed up some taxonomy-based listing somewhere, that's OK.

kbahey’s picture

@moshe

Digressing, maybe, but not contrarian.

My comments were two fold:

If this patch was a module, I would not even comment, since it is optional. But given panels and such, things are not temporal by nid anymore. Blog front pages may benefit some. If there is a way to turn this off, then I am also happy. However if this lookup happens for every page load for sites/pages that do not benefit, then I have to point this out. Make it an option under performance ("Enable aliases cluster lookup [or some other non-tech wording]") for example.

The other part was pointing to the main problem which is the design of the alias by path, not by object. Yes, I digressed.

That was it.

bjaspan’s picture

I agree that the main problem here is dealing with aliases that refer to string paths and not to objects. We do not want to remove that functionality, of course, as sometimes it is useful to have the complete flexibility. With this patch, I'm essentially proposing that we split aliases in half:

- Automatically created aliases to objects (node, comment, user, etc.) are tied to the object id and loaded by whatever means makes the most sense. For nodes, batching by nid makes sense. Even for taxonomy listings, the nodes are often/usually displayed in time order, so grabbing N nodes around each queried node will save time even if not all of them are used (* see below).

- Manually created aliases should just remain as strings. There will never be that many of these (humans just do not manually create that many things) so they can all be loaded into memory as with 4.6.

* Regarding how large N should be, we could keep some automatic statistics on cache hit rates. Start N at 100 and track, on average, how many of those N get used. Then slowly increase or decrease N dynamically based on whether the hit rate is large (make N bigger) or small (make N smaller). On a site that, say, always sorts its nodes alphabetically by title, the hit rate will usually be near zero and N will trend towards 1.
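That feedback loop might look something like this (entirely illustrative: the function name, smoothing thresholds, growth factor, and bounds are all invented for the sketch):

```php
<?php
// Hypothetical tuner for the prefetch batch size N, driven by the
// fraction of prefetched aliases that were actually used.
function tune_batch_size($n, $hits, $fetched) {
  $hit_rate = $fetched > 0 ? $hits / $fetched : 0;
  if ($hit_rate > 0.5) {
    $n = (int) ($n * 1.5);  // most prefetched aliases were used: grow
  }
  elseif ($hit_rate < 0.1) {
    $n = (int) ($n / 2);    // mostly wasted prefetches: shrink
  }
  return max(1, min($n, 1000));  // keep N within sane bounds
}
```

On an alphabetically-sorted site the hit rate stays near zero, so N halves toward 1, degenerating gracefully into the current one-query-per-alias behavior.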

Or something. :-)

catch’s picture

I like the idea of splitting aliases in half, this would seem to solve a lot of problems.

Also worth bearing in mind the active url alias patch http://drupal.org/node/147143

that'd mean something like:
active path for nodes, users, taxonomy terms - 1:1 stored in the node, user and term_data tables? - additional field for feed aliases since those are also generated automatically by pathauto?

+ path_revisions
+ string paths

kbahey’s picture

Barry

Regarding N, another approach is to change the code in the patch to be like so:

 if (substr($path, 0, 5) == 'node/' && is_numeric(substr($path, 5))) {
   $magic = variable_get('foo', 100);
   // Use $magic in place of the hard-coded batch size of 100...
 }

We can have the performance settings for this be None, 50, 100, 500, 1000.

As for splitting the aliases: that is a good idea worth exploring further.

So, are you thinking of moving the object aliases into the main table each object lives in, by adding an alias column to each object table?

Does l() need to change at all? The path lookup function can check the string; if it is an object (node, taxonomy/term, user, etc.), it can use the numeric ID to get the alias from the respective object table.

Other aliases then get passed to the existing logic of fetching from the url_alias table.

bjaspan’s picture

Re #7, storing the primary alias in the primary object table (node, user, etc.): I've been thinking the same thing. Every time we display a node on a page, we have to know its title, so we have already queried the node table for it, and we should get the alias at the same time. This is better than any other loading or caching strategy.

However, I don't especially love the code pattern

$node = node_load($nid);
$link = l($node->title, 'node/'.$nid, array('alias' => $node->alias));

It's too long and redundant. I'd much prefer a per-object-type function like l_node($node, $attrs = array()). $attrs is just like for l(), and the text to display is $attrs['title'], defaulting to $node->title. This would give us:

$node = node_load($nid);
$link = l_node($node);

or even

$partial_node = db_fetch_object(db_query('SELECT * FROM {node} WHERE ...'));
$link = l_node($partial_node); // all we need is ->nid and ->title

Much easier on the hands and eyes.

Re #8: A variable for the number of aliases to retrieve in a batch. Well, with the primary alias in the main object table, we no longer need to load a batch of aliases, so effectively the original purpose of this issue is moot. However, see my next post.

bjaspan’s picture

Title: Exploit temporal locality for URL aliases » Exploit temporal locality for node loads

I retract my "However, see my next post" statement at the end of the previous comment for now.

cburschka’s picture

A suggestion that may be really ridiculous but sounded good when I thought of it: {cache_link}? Constituting an n:n relationship on the system path column in {url_alias}?

This way, Drupal could load those aliases into memory that are known to exist on the current page. The beauty of this is that it would be completely independent of nodes. The system path applies globally to the whole page, so it would apply to any aliased link on any page.

Even cache misses could be fixed immediately: Whenever l() needs an alias it hasn't already loaded, it could insert the new relation into the link cache. Eventually, the cache would need to be cleaned up as link relationships disappear, but still.

Another cache table would of course cause additional database overhead. But it would reduce the number of queries to (ideally) two, not counting the cache misses, while loading as few aliases into memory as possible.
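The idea can be sketched roughly as follows (everything here is hypothetical: the function names, and the `$link_cache` array standing in for the proposed {cache_link} table; in Drupal the first function would be one JOIN between {cache_link} and {url_alias}):

```php
<?php
// Hypothetical per-page alias preload. $link_cache maps each system
// path to the link paths known to appear on that page.
function page_aliases($page, $link_cache, $url_alias) {
  $loaded = array();
  if (isset($link_cache[$page])) {
    foreach ($link_cache[$page] as $src) {
      // Unaliased links resolve to themselves.
      $loaded[$src] = isset($url_alias[$src]) ? $url_alias[$src] : $src;
    }
  }
  return $loaded;
}

// On a cache miss, record the link so it is preloaded next time.
function page_alias_miss($page, &$link_cache, $src) {
  $link_cache[$page][] = $src;
}
```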

catch’s picture

Version: 6.x-dev » 7.x-dev
catch’s picture

Status: Needs work » Closed (won't fix)

Since we have both whitelist/blacklist and the per-page cache in core now, marking this as a won't fix.