Snippet from function drupal_lookup_path

  // Use $count to avoid looking up paths in subsequent calls if there simply are no aliases
  if (!isset($count)) {
    $count = db_result(db_query('SELECT COUNT(pid) FROM {url_alias}'));
  }

  if ($action == 'wipe') {
    $map = array();
    $no_src = array();
  }
  elseif ($count > 0 && $path != '') {

Why do a count of url_alias instead of using module_exists('path')?

I do not see why the code snippet could not be:

  if ($action == 'wipe') {
    $map = array();
    $no_src = array();
  }
  elseif (module_exists('path') && $path != '') {

Am I missing something here?

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

kbahey’s picture

It is not really that expensive (at least on MySQL with MyISAM). On a site with 96,000 aliases, it takes 20 to 40 milliseconds to execute this query.

But the idea here is to ask : "are there aliases". Not just : "is path module installed".

Anonymous’s picture

And with every new alias the count query gets longer and longer and longer. I removed the query (which is run on every new page) which was taking nearly 30ms to run and improved my site performance. You should only be able to have an alias if and only if the path module is installed, correct? And while I'm counting those rows someone else waits for the DB. IMO we should not really care if there are zero aliases and just query url_alias table regardless. Counting rows is expensive; maybe not on its own single occurrence, but it adds up which makes it expensive.

ChrisKennedy’s picture

Status: Active » Closed (duplicate)

See plenty of discussion at http://drupal.org/node/106559

Anonymous’s picture

Status: Closed (duplicate) » Active

Not a duplicate since the count still exists.

kbahey’s picture

Earnie

Still, the path module can be enabled, but no aliases in the database.

It is best to create a proper patch, with benchmarks of before and after. This way, it is more concrete.

Anonymous’s picture

Still, the path module can be enabled, but no aliases in the database.

But the user of the path module should not care if there are currently zero in the database vs 100K within this function. It is something which isn't useful and takes more time to do the count than to just do the select request to begin with.

It is best to create a proper patch, with benchmarks of before and after. This way, it is more concrete.

I plan to do just that. I created a support request for it first to get comment. I will change to a bug report when I have a patch to apply.

kbahey’s picture

Another approach is this:

- Cron would do the select count(*) every so often, and store whether there are aliases in a variable_set().
- The code that now checks count just checks that variable.

Anonymous’s picture

I have thunked a solution of

  if (!isset($have_rows)) {
    $have_rows = db_result(db_query('SELECT pid FROM {url_alias} LIMIT 1'));
  }

  if ($action == 'wipe') {
    $map = array();
    $no_src = array();
  }
  elseif ($have_rows && $path != '') {

How should the LIMIT 1 be specified since it is MySql specific? (Hmm, guess it is ANSI specified. I thought Oracle didn't support it though.)

Anonymous’s picture

Category: support » bug
Status: Active » Needs review
FileSize
1.25 KB

Here's the HEAD patch.

Anonymous’s picture

And the same patch rolled for 6.x and 5.x.

FiReaNGeL’s picture

IMHO a SELECT ... LIMIT 1 answers the question "is there an alias in the db" way more efficiently than counting every alias out there. It may be "only 30-40 ms", but its still an improvement and as earnie said, the more alias you got, the worse the problem. +1 from me.

kbahey’s picture

Any benchmarks for the new patch?

FiReaNGeL’s picture

It would be interesting to benchmark it in both MySQL MyISAM, INNODB and Postgre with lots of nodes if possible. MyISAM has optimization for COUNT(*), so there might not be a terrible difference there.

Anonymous’s picture

I went from 30+-ms to between 0.29ms and 0.45ms on InnoDB.

FiReaNGeL’s picture

Thanks Earnie, that confirms what I've been thinking; this probably got overlooked because COUNT(*) is extremely fast in MyISAM tables. For everyone else (InnoDB and Postgre), this way of doing things is very bad performance wise (when the table is big).

Anonymous’s picture

To Count Or Not To Count

I did the patch before I found this.

kbahey’s picture

Not so fast ...

Turn the MySQL query cache off (or restart MySQL between tests) so that it does not skew the numbers.

kbahey’s picture

Status: Needs review » Needs work

I did some benchmarking of MyISAM and InnoDB on a table with 10,438 nodes.

The results DO NOT SHOW that LIMIT is faster than COUNT.

Here is the script used to test:

require_once('./includes/bootstrap.inc');
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

timer_start('select1');
db_query("SELECT COUNT(*) FROM {node}");
$total = timer_stop('select1');
print "MyISAM COUNT(*): ". $total['time'] ." ms\n";

timer_start('select2');
db_query("SELECT nid FROM {node} LIMIT 1'");
$total = timer_stop('select2');
print "MyISAM LIMIT 1: ". $total['time'] ." ms\n";

timer_start('select3');
db_query("SELECT COUNT(*) FROM {node_inno}'");
$total = timer_stop('select3');
print "InnoDB COUNT(*): ". $total['time'] ." ms\n";

timer_start('select4');
db_query("SELECT nid FROM {node_inno} LIMIT 1'");
$total = timer_stop('select4');
print "InnoDB LIMIT 1: ". $total['time'] ." ms\n";

Here are the results.

$ php ~/sites/benchmark/benchdb.php
MyISAM COUNT(*): 0.14 ms
MyISAM LIMIT 1: 0.44 ms
InnoDB COUNT(*): 0.33 ms
InnoDB LIMIT 1: 0.33 ms

Since MyISAM is in wide use, your patch will cause slowdowns on many sites.

FiReaNGeL’s picture

kbahey: your script have an extra ' at the end in the 3 queries except the original COUNT(*) on MyISAM, which would cause the query to fail entirely. Are you sure the queries ran properly?

kbahey’s picture

[edited -- kbahey]

Fixed.

Here are the results after the fix.

$ php ~/sites/benchmark/benchdb.php
MyISAM COUNT(*): 0.13 ms
MyISAM LIMIT 1: 0.22 ms
InnoDB COUNT(*): 6.7 ms
InnoDB LIMIT 1: 0.39 ms

For PostgreSQL with 10,000 nodes in the table, it says:

PostgreSQL COUNT(*): 5.16 ms
PostgreSQL LIMIT 1: 0.51 ms
Anonymous’s picture

Status: Needs work » Needs review

Seems ready to be committed to me. It serves three database engines well.

catch’s picture

This still applies cleanly. But if the LIMIT is slower on MyISAM, why not the variable_set in cron proposed by kbahey in #7? This could also be checked/set in path_set_alias() for the very rare case that a site has 0 aliases then gets one in between cron runs. Checking the variable has to be quicker than this query on every page.

Anonymous’s picture

For the obvious reason that if someone turns on path aliases they expect it to be available at the time they submit the page and not after waiting for the length of time that the cron timer expires to execute it. And as I've already said, the count isn't important, the fact that we have aliases to serve is.

The whole idea of protecting against having rows in path_alias is debatable since we wouldn't have path turned on if we weren't going to use it. I'm one who likes to eliminate such frivolous things and just let the code handle it just like it would for no alias found. It makes the code smaller and therefore faster, uses less memory and easier to maintain. It's a module and if you're not using it then disable it to make the system faster. I submitted this patch because it is the least intrusive to the existing code.

catch’s picture

earnie:

This could also be checked/set in path_set_alias() for the very rare case that a site has 0 aliases then gets one in between cron runs.

Or for that matter, could be set on saving and deleting path aliases plus enabling path module - and left both out of both page requests and cron entirely no?

Anonymous’s picture

Catch, why should http://api.drupal.org/api/function/drupal_lookup_path/6 care about count? How does having a count help for this function? I understand that if the path alias module isn't used that perhaps the function should just return false; but if it is used how is count useful? What you propose is adding extra overhead for no gain.

catch’s picture

earnie: I think we're talking at cross-purposes.

What I'm saying is keep the altered query from #10, but move it to a variable_get/set in path_set_alias() when paths are added and deleted rather than read. Then all burden goes to writes and it's just if ($has_path) in drupal_lookup_path() itself.

This is the same as kbahey suggested, but without the lag of relying on cron, and any overhead is only on writes. I don't see how this adds extra overhead to drupal_lookup_path() - it even simplifies it and it should gain slightly more than the original patch as well no? Or am I simply wrong here?

Anonymous’s picture

I understood

What I'm saying is keep the altered query from #10, but move it to a
variable_get/set in path_set_alias() when paths are added and deleted
rather than read. Then all burden goes to writes and it's just if
($has_path) in drupal_lookup_path() itself.

you were saying this. My question is why do we care about the count? Why modify the updaters to keep track of a count? Where is it important to have?

catch’s picture

Because a small performance hit on some updates update and one additional variable is a tiny price to pay for removing that query altogether on every page load. Since neither LIMIT 1; and SELECT COUNT() are optimal cross-databse, it seems like a win-win to me to just check the variable instead. You don't need to store the count, just a TRUE/FALSE as to whether there's any rows in the path table and update only if it changes.

Susurrus’s picture

@catch: So how do you determine when you've removed the last path to set that variable back to FALSE? Am I missing somewhere where there's a REMOVE * and you'd therefore know that you were for sure back to no rows?

chx’s picture

The world mostly runs MyISAM (edit: might be that every table engine) which checks against the table definiton (edit: the frm file) and not against the table data when running a count of all rows.

chx’s picture

So. it's not the FRM file but the MyISAM table header. Sorry. Anyways -- it's extremely fast , it's pre stored always. http://www.browardphp.com/mysql_manual_en/manual_Storage_engines.html

catch’s picture

Susurrus: well you could run the query just after path deletion to ensure you catch if there's zero rows. Or if using COUNT() and $count -1 = 0 just before a delete, then you'd know that way as well.

Susurrus’s picture

But, isn't the point to remove queries? Why would we want to run COUNT() as a separate query?

As far as this query, it should be very quick since pid is a primary key on the url_alias. In theory, the number of rows in this table is stored directly by the database (I would imagine this would be the case for all databases) and the query would run very quickly. Has anyone actually benchmarked the cost of this query per page load?

Anonymous’s picture

Status: Needs review » Needs work

Since the reason for needing to have a count hasn't been answered to any satisfaction, I will submit a different patch soon that just checks for the path module existing. If I have path module enabled but no aliases created yet is minimal and I wouldn't care that the lookup for a non existent alias occurs with zero rows in the table.

@Susurrus: I submitted the original patch because the devel module identified the count as slow. This query is executed on every page so eliminating long unneeded queries is important. Moving the query elsewhere isn't feasible either. I'm creating aliases for every tag I create, not to mention every page title. See the statistics at #20 for differences between myisam, innodb and postgres.

@chx: While myisam is good for small sites IMO, I am processing 100K nodes (soon I will be processing more than 500K nodes) in batch and I need transactions. I chose innodb because I deemed that since it was still mysql I would have fewer problems since I really am not familiar with postgres.

catch’s picture

earnie: Oh, now I see! Checking for path module rather than in the table seems completely fine to me, sorry! I was still working on the basis of #10.

Anonymous’s picture

Status: Needs work » Needs review
FileSize
1.28 KB

As promised but this patch only works if http://drupal.org/node/222109 is also applied.

Dries’s picture

I think it is OK to remove that query -- not everyone uses MyISAM. However, I don't like the static $has_path that was introduced. module_exists() should be fast -- if it is not, it should be made faster. Isn't module_exist() caching this already?

Anonymous’s picture

That is what I almost did. The module_exists function sets a variable calling module_list>module_list and then calls array_key_exists. It may be slightly faster to cash the result of module_exists. I've attached a modified patch without the cache.

Dries’s picture

Status: Needs review » Fixed

OK, I've committed this to CVS HEAD. Thanks.

Anonymous’s picture

Dries: As I stated at http://drupal.org/comment/reply/196862#comment-731300 you also need to commit http://drupal.org/node/222109 or it doesn't work because the module_list cache isn't filled.

Anonymous’s picture

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.

David_Rothstein’s picture

This patch seems to have broken URL aliases entirely; see #237336: Url Aliases no longer work. (I tried applying the suggested companion patch at http://drupal.org/node/222109 which @earnie recommends, but that didn't seem to help.)

I don't think the problem is the "fault" of this patch, but seems like it might be a deeper problem that this patch happened to expose... so it's probably best to discuss it over at the new issue.

catch’s picture

IMO this patch is at fault and I think it needs to be rolled back.

I'm not going to mess around with issue statuses, so rollback patch is here: http://drupal.org/node/222109#comment-809566

catch’s picture

Title: drupal_lookup_path - expensive query » path aliases broken (was: drupal_lookup_path - expensive query)
Priority: Normal » Critical
Status: Closed (fixed) » Reviewed & tested by the community
FileSize
1.23 KB

I'm pretty certain it was indeed the fault of this patch, so I'm reopening this issue and marking http://drupal.org/node/222109 as duplicate.

The short version is path module doesn't exist during DRUPAL_BOOTSTRAP_PATH because it doesn't implement hook_boot or hook_exit, so checking for it always returns FALSE.

This is a simple rollback so I'm marking RTBC - the patch is just a reversal of #38 rolled from root for convenience.

Once that's done, we should keep this issue open at CNW and try to prevent the query another way.

Anonymous’s picture

Simply rolling back a fix that showed issues elsewhere isn't fixing the real problem. Please review the patches by David Rothstein at #222109: module_list() should be rewritten as 2 functions: module_list() and module_list_bootstrap() comment #10.

David_Rothstein’s picture

If the goal is to get URL aliases working again quickly, then I agree that rolling back this patch is probably the best move.

However, as @earnie said, we have reopened http://drupal.org/node/222109, since there is indeed a more fundamental problem discussed there that is related to this one... and fixing it may allow the current patch here to work eventually (it seems that it turned out to be more complicated than I originally thought, though).

David_Rothstein’s picture

Hm, even if we can get the patch in #38 to work, is module_exists('path') really the right thing to check?? What if you create some aliases but then later disable the path module... won't that lead to a problem with broken links?

Currently, Drupal checks URL aliases in the database REGARDLESS of whether the path module is enabled... the path module itself basically just controls the administrative interface. I think in many ways that makes more sense.

Freso’s picture

Adding to my issue queue (catch thought I should have a look at it, so... ;)).

Anonymous’s picture

Hm, even if we can get the patch in #38 to work, is module_exists('path') really the right thing to check?? What if you create some aliases but then later disable the path module... won't that lead to a problem with broken links?

My original patch at #9 took care of this. However, if someone (me for instance) disables the path module don't you think they might just think that path aliases are no longer valid? See Dries comment at #37, its obvious from the remark that he was thinking that module_exists('path') should work for this.

David_Rothstein’s picture

Well, it's not wrong, I guess, just drastic. You click on that little checkbox and suddenly a bunch of links all over the Internet are broken ;)

If this patch goes in then maybe part of the solution is to rewrite some of the help text for path.module. Right now, the help text basically just talks about the admin interface for managing aliases (as it should, since in Drupal 6 that's all it controls, not the aliases themselves). For example, the description in the module list is currently Allows users to rename URLs, and the help text has things like:

The path module enables appropriately permissioned users to specify an optional alias in all node input and editing forms, and provides an interface to view and edit all URL aliases.

I guess this is somewhat related to @catch's comment here, that path.inc and path.module are perhaps more tangled up than they should be.

Dries’s picture

I agree with David -- let's roll back and fix the help text as a stop gap solution, then spend the necessary time to work on a proper solution. Care to re-roll with the documentation suggestions?

BioALIEN’s picture

Status: Reviewed & tested by the community » Needs work

As per Dries's request, setting to CNW.

Anonymous’s picture

@Dries: Please don't forget to look at the synergies around #222109: module_list() should be rewritten as 2 functions: module_list() and module_list_bootstrap() to help module_list. Chx has a patch being reviewed now. This simple patch has brought to light some inconsistencies with the use of module_list that will make D-7 even better. Let's not roll back just yet but wait for the outcome of #222109: module_list() should be rewritten as 2 functions: module_list() and module_list_bootstrap() before a decision is made.

catch’s picture

Status: Needs work » Needs review

fwiw I'm not sure the help text needs changing on this, or if it does I'm not sure in what way.

With the rollback, the path module does exactly what it says it does - provides the interface for adding and deleting paths. So the current help text would be correct. It's only with the change from the query to module_exists() as it currently is in HEAD that the path module influences whether paths themselves are activated or not.

I'm fine with leaving things as they are until module_list is reworked (assuming that gets in quickly), in which case we'd just need a patch to update the help text once the other issue is fixed. If we want to roll this back and revisit it again from scratch, then the current patch should do that fine, so marking back to CNR.

catch’s picture

Assigned: » catch
Status: Needs review » Reviewed & tested by the community

http://drupal.org/node/196862#comment-809836 still applies, still fixes the issue, makes all path tests pass.

http://drupal.org/node/259412 is the current issue addressing with the root problem, but IMO this shouldn't be postponed for a complete module system revamp.

I'm marking this back to RTBC again.

chx’s picture

Yeah the module revamp will fix it but until then why keep testing back?

Dries’s picture

Status: Reviewed & tested by the community » Fixed

Committed to CVS HEAD. Thanks.

catch’s picture

Title: path aliases broken (was: drupal_lookup_path - expensive query) » drupal_lookup_path - expensive query
Assigned: catch » Unassigned
Category: bug » task
Priority: Critical » Normal
Status: Fixed » Postponed

Changing status so we can revisit this when the module revamp is in.

chx’s picture

Assigned: Unassigned » chx
Status: Postponed » Needs review
FileSize
1.69 KB

It's easy, really.

chx’s picture

FileSize
1.69 KB

With less tpyos ;)

Crell’s picture

If I read this correctly, it makes the path system not function unless the path module itself is installed. Who's to say the core path module is the only possible way to configure paths? What if I want pathauto (or similar) running, but not have the core path module's UI? What if I want to provide my own API instead? This is like making Views not function unless Views UI is enabled, but keeping Views UI disableable. -1.

chx’s picture

FileSize
2 KB

Found one more count and re-read the whole issue. If you are using aliases without the path module then set the variable too, it's easy.

David_Rothstein’s picture

Status: Needs review » Needs work
FileSize
706 bytes

Hm, at a minimum wouldn't this need to be accompanied by a change to the path.module's help text which explained that disabling the module breaks all your existing URL aliases?... However, I think @Crell is exactly right. Ideally, path.inc should be an API which works on its own, with path.module being one possible UI for accessing it (just like the separation between menu.inc and menu.module).

However, what if instead of making the URL aliases disabled by default, you made them enabled by default, as in the attached patch? There could then be a radio button for disabling path aliases altogether at admin/settings/performance (or even if not, the variable would still be available to set in settings.php for sites that cared about it).

catch’s picture

Latest patch in #73660: Allow path module to be uninstalled uses module_exists(). Setting the variable works too though. Not sure which I prefer.

This one needs a hook_update_N() to set the variable on existing sites regardless, so CNW for now.

David_Rothstein’s picture

Never mind my patch... I wasn't thinking straight about this and didn't realize the point was to get rid of the count query in all cases :)

However, I think what I was saying in the comment still makes sense. If this is a separate variable, it makes sense for path.module to enable it in path_enable(), but not to force it to be disabled in path_disable().

Damien Tournoud’s picture

What is wrong with just:

db_result(db_query_range('SELECT pid FROM {url_alias}'));

The current way of doing this (ie. not relying of the path module being enabled, which makes little sense), is the good one imho.

chx’s picture

Status: Needs work » Needs review
FileSize
2.41 KB

Damien, that's a query, though fine enough that i used it in the update function. I want to avoid a separate query if at all possible. I have removed the path_disable function. Once again, if you provide a path-equivalent module then add a variable_set to your _enable.

Damien Tournoud’s picture

Damien, that's a query, though fine enough that i used it in the update function. I want to avoid a separate query if at all possible. I have removed the path_disable function. Once again, if you provide a path-equivalent module then add a variable_set to your _enable.

Trying to remove one straight-forward query has very little value by itself.

If you really want to remove if in the most probable use case, why not "simply" issue a check query if the first lookup fails?

Status: Needs review » Needs work

The last submitted patch failed testing.

Crell’s picture

Is there an API call for setting a path? Why not just magically regenerate the "there are aliases" variable after each set/delete? If there are some, presumably we want to use the alias look up system. If not, disable it entirely. And if there's no API function and it's just a direct query, FAIL! Add it. :-)

David_Rothstein’s picture

A better API seems like it would be a big win in its own right, and doing the variable_set/variable_del in there is more or less what @catch proposed way up above, I think.... but it might have to be a pretty comprehensive API in order to work for this. For example, a module like Pathauto currently runs queries like DELETE FROM {url_alias} WHERE src LIKE '%s%% in order to support bulk deletion by alias type, and if that kind of query can't be done via the API, it seems like maybe the variable could get out of whack?

Maybe the simplest approach would be to provide a variable, but not bother trying to set it at all. Just leave it as something that people can optionally set in settings.php if they care about ultra-high performance and really want to avoid this single DB query on every page load. For example, look at the approach in http://api.drupal.org/api/function/drupal_is_denied/7 - something like that...

moshe weitzman’s picture

We need a universal solution to the COUNT(*) problem. We just made InnoDb the default engine, so we can ignore it no longer. If you think it is a minor issue, grep the source or the query log and you find several of these queries on every page. And sometimes on big tables like url_alias and node_access.

I agree with an earlier proposal that modules should calculate and store row counts during cron. They can do that using a new core table or with separate variables.

Anonymous’s picture

Based on http://dev.mysql.com/doc/refman/5.0/en/innodb-restrictions.html

To get a fast count, you have to use a counter table you create yourself and let your application update it according to the inserts and deletes it does. SHOW TABLE STATUS also can be used if an approximate row count is sufficient.

Anonymous’s picture

Title: drupal_lookup_path - expensive query » COUNT(*) is an expensive query in InnoDB.

Setting a new title.

Damien Tournoud’s picture

Status: Needs work » Needs review
FileSize
13.69 KB

The attached patch goes part of the way: it removes "useless" COUNT(*) queries, ie. queries where we only want to know if the count is non zero. It's probably a good start to commit that to D7, and then to backport to D6.

moshe weitzman’s picture

@Damien - see #18 where kbahey bechmarked this - not promising.

Status: Needs review » Needs work

The last submitted patch failed testing.

Anonymous’s picture

Re: #76: Yes, which is why MySQL suggests http://dev.mysql.com/doc/refman/5.0/en/show-table-status.html for those tables we just want to know data exists. This can be used for MyISAM as well, it isn't specific to InnoDB. We probably need some API db_table_has_rows($table) or some such that returns boolean. Then other DB vendor can do the right thing.

catch’s picture

Looks like we've got a choice between a SHOW TABLE STATUS and variable_get('path_table_has_rows', FALSE); then. Anyone able to run benchmarks on SHOW TABLE STATUS?

sun’s picture

Status: Needs work » Closed (duplicate)

Marking as duplicate of #464164: Move URL alias functionality into a Path API module (still separate from the Path UI). You can follow up on that issue instead to track its status. If any information from this issue is missing in the other issue, please make sure you provide it over there.

However, thanks for taking the time to report this issue.

moshe weitzman’s picture

Status: Closed (duplicate) » Needs work

This issue has real discussion and real benchmarks. That issue no proper writeup, no followups, and is newer. Removing dupe status.

moshe weitzman’s picture

Priority: Normal » Critical

Priority = Critical since we now ship with InnoDB.

sun’s picture

#464164: Move URL alias functionality into a Path API module (still separate from the Path UI) removes path.inc and makes URL aliases optional. Proper fix for the real, underlying cause.

This patch touches the surface only.

JimBroad’s picture

Subscribing...

@sun: I'd say this issus has moved well beyond the scope of the original patch/description.

Agreeing with catch @ #79, need some tests on 'SHOW TABLE STATUS'.

and with ernie @ #78, I think for developers and and maintenance tasks the db_table_has_rows() or db_table_empty() solution would be appreciated, but alot of time we actually are counting so that won't fix everything.

At first I was thinking of this new 'count' table as a pain similar to the 'sequence' table and way it used to be with nids. Still thinking about consequences of direction taken here now that InnoDB will be default engine.

Damien Tournoud’s picture

I stand by my patch in #75.

I used the following script on a 350,000 rows table (a copy of drupal.org node table):

<?php
require_once('./includes/bootstrap.inc');
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

db_query("SET SESSION query_cache_type = OFF");

$count = 100;
print "<pre>";

timer_start('select1');
for ($i = 0; $i < $count; $i++) {
  db_query("SELECT COUNT(*) FROM {node}");
}
$total = timer_stop('select1');
print "MyISAM COUNT(*): ". $total['time'] ." ms\n";

timer_start('select2');
for ($i = 0; $i < $count; $i++) {
  db_query("SELECT nid FROM {node} LIMIT 1");
}
$total = timer_stop('select2');
print "MyISAM LIMIT 1: ". $total['time'] ." ms\n";

timer_start('select3');
for ($i = 0; $i < $count; $i++) {
  db_query("SELECT COUNT(*) FROM {node_inno}");
}
$total = timer_stop('select3');
print "InnoDB COUNT(*): ". $total['time'] ." ms\n";

timer_start('select4');
for ($i = 0; $i < $count; $i++) {
  db_query("SELECT nid FROM {node_inno} LIMIT 1");
}
$total = timer_stop('select4');
print "InnoDB LIMIT 1: ". $total['time'] ." ms\n";

and I'm getting:

MyISAM COUNT(*): 10.88 ms
MyISAM LIMIT 1: 12.08 ms
InnoDB COUNT(*): 24778.75 ms
InnoDB LIMIT 1: 12.17 ms

No comment.

Damien Tournoud’s picture

Damien Tournoud’s picture

Status: Needs work » Needs review

Status: Needs review » Needs work

The last submitted patch failed testing.

Damien Tournoud’s picture

Status: Needs work » Needs review
FileSize
13.13 KB

Thanks you, testing bot!

moshe weitzman’s picture

Status: Needs review » Reviewed & tested by the community

Benchmarks in #85 are compelling. Bot is happy. RTBC.

Damien Tournoud’s picture

The benchmark in #85 was not conclusive for comparing COUNT(*) and LIMIT 1 on MyISAM. Here is a typical result with $count = 100000:

MyISAM COUNT(*): 10726.87 ms
MyISAM LIMIT 1: 10114.81 ms
InnoDB LIMIT 1: 10177.2 ms

It seems that COUNT(*) is consistently a tiny bit slower (in a 1-5% range) then LIMIT 1 on MyISAM, while LIMIT 1 are equally costly on MyISAM and InnoDB.

webchick’s picture

Status: Reviewed & tested by the community » Needs review

I spent a good amount of time reviewing this patch, reading the linked resources, etc. It seems like there's a good bit of confusion and cross-talk which was a bit confusing, but here's where I think we stand:

There are sometimes legitimate reasons to use COUNT(*) to provide a count of the rows in a table. These queries are prohibitively slow on InnoDB (and really, anything but MyISAM, which is specifically optimized for this case). In this case you need a summary table, like we do with node_comment_statistics.comment_count. Or, alternatively, a variable in the variables table as chx's attempts at this patch did.

However, in the *vast* majority of cases, all we are using COUNT(*) for is "Are there rows in this table?" There is no point in running a COUNT query on a table that's getting bigger... and bigger.. and biiigggggeerrrrr.... if that's all you want to know. Damien's patch addresses these queries by changing them to just a simple 'SELECT 1 FROM ... LIMIT 1'. This makes a lot more sense, since the query is expressing what you actually mean. I'm a bit concerned about the verbosity and complexity of these queries (db_query_range, along with a cast to bool), and discussed with Damien whether a wrapper function might make sense, but he pointed out that there's really nothing these queries have in common other than that, so it would just be a needless layer of indrection, really.

Others in this issue have attempted to remove the path.inc query altogether by various means (just checking for whether path module is enabled, which is a no-go because then URL aliases would break if the optional Path module were disabled, updating a variable on cron which can be checked at read time, ec.) but none were very optimal, and none apply across the board to all of the COUNT(*) queries in core. I think that Damien's approach is best of the presented options.

However, what's a little confusing to me is that past benchmarks have showed SELECT 1 as slower than SELECT COUNT(*) in MyISAM, up until #91 by Damien. I know InnoDB is now our default, but we also deliberately did not put in an upgrade path for this change, so all of the hundreds of thousands of sites upgrading from 6.x and below will be stuck with MyISAM performance on these queries.

Damien, could you share why you got different results last time than anyone else? And could someone do a more holistic benchmark with ab now that this change is affecting more than just path.inc?

webchick’s picture

Status: Needs review » Needs work

Sorry, I meant...

chx’s picture

Status: Needs work » Needs review

This does not need a benchmark. SELECT COUNT(*) reads from the table header and does not read from the table so of course it's faster.

webchick’s picture

Status: Needs review » Needs work
Issue tags: +Needs documentation

Spent some time talking benchmarks with Damien in IRC. The values presented in #18, #20, and #85 are highly variable. For example, here were four more test runs @ 100 queries:

MyISAM COUNT(*): 13.32 ms MyISAM LIMIT 1: 12.1 ms InnoDB LIMIT 1: 11.82 ms 
MyISAM COUNT(*): 9.98 ms MyISAM LIMIT 1: 12.12 ms InnoDB LIMIT 1: 12.04 ms 
MyISAM COUNT(*): 8.97 ms MyISAM LIMIT 1: 8.48 ms InnoDB LIMIT 1: 12.03 ms 
MyISAM COUNT(*): 9.95 ms MyISAM LIMIT 1: 17.13 ms InnoDB LIMIT 1: 12.27 ms 

Sometimes COUNT(*) is faster on MyISAM, sometimes it isn't. There's a pretty wide margin of error.

In #91, Damien cranked the number of iterations to 100K instead of 100 to try and make these margins a bit more pronounced. Over 100K iterations, LIMIT 1 comes out on top.

The end result being that MyISAM users shouldn't see a dramatic difference, and on the whole these spots in core should be marginally faster.

Therefore, committed to HEAD. We need to document this practice in the upgrade guide, at least, and it would be great to document it in the Programming best practices guide as well.

pwolanin’s picture

Aside from documentation, this also needs to be backported to D6 it would seem.

webchick’s picture

Bump.

catch’s picture

Status: Needs work » Patch (to be ported)

Documented in upgrade guide, moving back to D6.

catch’s picture

Version: 7.x-dev » 6.x-dev

Really moving back to 6 this time.

hass’s picture

I believe it could be a good idea for module maintainers to change their count querys in D6 times, too. I'm only a bit confused why I should write SELECT COUNT(1) FROM {mytable} as written in http://drupal.org/node/224333#select_count. I like SELECT COUNT(primary_key_field) FROM {mytable} more than using anonymous numbers... nevertheless - it doesn't change so much in the result of the query, but I find it easier to read.

Is a named column slower?

tuffnatty’s picture

on pgsql, named column is slower.

tuffnatty’s picture

So, why this COUNT is still there in D6?

Anonymous’s picture

So, why this COUNT is still there in D6?

Because InnoDB isn't the default like it will be for D7.

on pgsql, named column is slower.

Why should we code differently for a broken DB server. Using named columns is beneficial for self documenting code.

I like SELECT COUNT(primary_key_field) FROM {mytable} more than using anonymous numbers

Ditto. It documents the code better but otherwise is no different. It should be just as fast (or nearly as fast) regardless of the column. It is a count of rows based on the WHERE clause if one is given. Using * you select all columns so, if you are just interested in a count of rows, using * is a bad idea. Using count(1) makes an assumption that column 1 is the primary key but that column might not be the primary key.

tuffnatty’s picture

on pgsql, named column is slower.

Why should we code differently for a broken DB server. Using named columns is beneficial for self documenting code.

I like SELECT COUNT(primary_key_field) FROM {mytable} more than using anonymous numbers

Ditto. It documents the code better but otherwise is no different. It should be just as fast (or nearly as fast) regardless of the column. It is a count of rows based on the WHERE clause if one is given. Using * you select all columns so, if you are just interested in a count of rows, using * is a bad idea. Using count(1) makes an assumption that column 1 is the primary key but that column might not be the primary key.

COUNT(column) is a count of non-NULL values in that column. COUNT(1) is the count of rows where the constant 1 evaluates to non-NULL (that is, all rows), 1 is not column 1, but a constant.
The real thing we want to know is not the count of these rows, but whether they exist at all. It is best achieved by
SELECT EXISTS(SELECT 1 FROM {url_alias});
It's also 2 orders faster than the fastest of the cited solutions, at least on PostgreSQL. Please check it under MySQL also.

Damien Tournoud’s picture

To clarify, what we are using is:

SELECT 1 FROM {url_alias} LIMIT 1

We are *not* counting.

tuffnatty’s picture

To clarify, what we are using is:
SELECT 1 FROM {url_alias} LIMIT 1
We are *not* counting.

By "we" you must mean HEAD or something else, as D6, to which this bug is dedicated, uses COUNT...
SELECT 1...LIMIT 1 is OK from the performance point of view, it is as fast as EXISTS. Semantically, EXISTS is better.

mattyoung’s picture

.

Anonymous’s picture

COUNT(column) is a count of non-NULL values in that column.

COUNT(primary_key_field) FROM {mytable}

Primary key fields cannot be NULL.

Dave Reid’s picture

Marked #260400: path.inc query should run faster on InnoDB as a duplicate of this issue.

arhak’s picture

Issue tags: +Ancient

tag

Damien Tournoud’s picture

Issue tags: -Ancient

Patches are welcome.

arhak’s picture

@#111 I disapprove your attitude http://drupal.org/node/7881#comment-2983710

Damien Tournoud’s picture

What about rolling a patch here, instead of discussing my "attitude"?

What needs to be done is:

  • Grep Drupal 6 core for COUNT(
  • Identify the cases where we use COUNT() only to know if there are rows matching a set of conditions
  • Replace those db_query('SELECT COUNT(*) FROM table WHERE xxx') by db_query_range('SELECT 1 FROM table', 0, 1)
valthebald’s picture

Status: Patch (to be ported) » Needs review
Issue tags: +Performance, +db queries
FileSize
12.51 KB

#131: I've done basically that. Attached is replaced version of occurences that for sure don't need SELECT COUNT(*)
By the way, I confirm that the COUNT(*) is slow on Postgres (8.3/8.4/9.0) as well
Anyone can review please?

valthebald’s picture

Added LIMIT 1 to url_alias query, other cases remain intact

catch’s picture

Priority: Critical » Major
Status: Needs review » Needs work

This should use db_query_range(), instead of LIMIT.

Downgrading from critical since it's not a site breaking issue (other queries in Drupal 6 are much worse than COUNT(*)) but this is still 'major' for D6 IMO.

valthebald’s picture

Status: Needs work » Needs review
FileSize
12.56 KB

Here you go

Anonymous’s picture

Status: Needs review » Needs work

All of the queries need to use the db_query_range() even if the query would naturally only return one row. The use of db_query_range for all of these should be used for consistency in coding.

valthebald’s picture

Status: Needs work » Needs review
FileSize
12.67 KB

Another attempt

valthebald’s picture

Issue tags: +Ancient

Marking this Ancient - any chance we can get this fixed in 4 years? :)

Damien Tournoud’s picture

#119 looks good to me on visual review, but I haven't tested it.

Crell’s picture

Status: Needs review » Reviewed & tested by the community

I am also OK with #119, and the testbot likes it. I don't know off hand if there are any other COUNT(*) queries it doesn't get, but frankly if so we can make those separate patches.

crea’s picture

Subscribing

BenVercammen’s picture

I tried the patch with Drupal 6.22 and got errors as some count queries seemed to fail. Apparently the signature for db_query_range() isn't used correctly.

I tested this with the following lines:

$existing_type = 'panel';
// Orginal:
$is_existing = db_result(db_query("SELECT COUNT(*) FROM {node_type} WHERE type = '%s'", $existing_type));
echo (var_dump($is_existing));

// Patch:
$is_existing = db_result(db_query_range("SELECT 1 FROM {node_type} WHERE type = '%s'", 0, 1, $existing_type));
echo (var_dump($is_existing));

// Patched patch:
$is_existing = db_result(db_query_range("SELECT 1 FROM {node_type} WHERE type = '%s'", $existing_type, 0, 1));
echo (var_dump($is_existing));

Apparently all patched lines use the wrong signature, adding the query parameters after the $from and $count parameters, instead of before them...

The db_query_range() function expects the $count and $from arguments in the end (hence the usage of the array_pop() function)...

function db_query_range($query) {
  $args = func_get_args();
  $count = array_pop($args);
  $from = array_pop($args);
  array_shift($args);
  ...
}

Is this really something that's been overlooked, or am I missing something? I just checked the current 6.x-dev build, but the db_query_range() function hasn't changed...

valthebald’s picture

Status: Reviewed & tested by the community » Needs review
FileSize
12.67 KB

This is really something overlooked. I was under (wrong) impression that count and limit params are taken before dynamic params of db_query_range(). Patch at #119 works only for queries without dynamic parameters. Corrected patch attached.

Status: Needs review » Needs work

The last submitted patch, 196862-expensive-count-query-range-all-125.patch, failed testing.

valthebald’s picture

Status: Needs work » Needs review
FileSize
13.5 KB

Reroll with patch against latest 6.x-dev

rjbrown99’s picture

FWIW, this patch could also benefit Pressflow. All but one of the count queries from patch #127 would be applicable.

The one that is not applicable to Pressflow is the path.inc modification (2nd one down if you read the patch file) because they use a different drupal_lookup_path() function.

mikeytown2’s picture

#127 doesn't apply cleanly to pressflow; I've attached a patch just for pressflow.

gielfeldt’s picture

Looking at the SELECT COUNT() for paths only, isn't this a perfectly valid question:

* How many sites/installation actually have 0 url aliases?

And the follow-up question:

* Is it really worth the effort to optimize for this case?

Anonymous’s picture

Bump. Is this getting committed for D6?

Status: Needs review » Needs work

The last submitted patch, pressflow-count-query-196862-129-D6.patch, failed testing.

valthebald’s picture

Status: Needs work » Needs review

Someone (but not me) has to RTBC that :)

rodricels’s picture

#129 Pressflow removes the LOWER() in the function _user_edit_validate(), why introduce it again?

mikeytown2’s picture

Thanks for catching that!

chx’s picture

Assigned: chx » Unassigned
Issue summary: View changes

Status: Needs review » Closed (outdated)

Automatically closed because Drupal 6 is no longer supported. If the issue verifiably applies to later versions, please reopen with details and update the version.