COUNT(*) is an expensive query in InnoDB.

earnie - December 1, 2007 - 15:45
Project:Drupal
Version:6.x-dev
Component:base system
Category:task
Priority:critical
Assigned:chx
Status:patch (to be ported)
Issue tags:Needs Documentation
Description

Snippet from function drupal_lookup_path

  // Use $count to avoid looking up paths in subsequent calls if there simply are no aliases
  if (!isset($count)) {
    $count = db_result(db_query('SELECT COUNT(pid) FROM {url_alias}'));
  }

  if ($action == 'wipe') {
    $map = array();
    $no_src = array();
  }
  elseif ($count > 0 && $path != '') {

Why do a count of url_alias instead of using module_exists('path')?

I do not see why the code snippet could not be:

  if ($action == 'wipe') {
    $map = array();
    $no_src = array();
  }
  elseif (module_exists('path') && $path != '') {

Am I missing something here?

#1

kbahey - December 1, 2007 - 21:05

It is not really that expensive (at least on MySQL with MyISAM). On a site with 96,000 aliases, it takes 20 to 40 milliseconds to execute this query.

But the idea here is to ask : "are there aliases". Not just : "is path module installed".

#2

earnie - December 2, 2007 - 05:10

And with every new alias the count query gets longer and longer and longer. I removed the query (which is run on every new page) which was taking nearly 30ms to run and improved my site performance. You should only be able to have an alias if and only if the path module is installed, correct? And while I'm counting those rows someone else waits for the DB. IMO we should not really care if there are zero aliases and just query url_alias table regardless. Counting rows is expensive; maybe not on its own single occurrence, but it adds up which makes it expensive.

#3

ChrisKennedy - December 2, 2007 - 05:17
Status:active» duplicate

See plenty of discussion at http://drupal.org/node/106559

#4

earnie - December 2, 2007 - 09:08
Status:duplicate» active

Not a duplicate since the count still exists.

#5

kbahey - December 3, 2007 - 01:35

Earnie

Still, the path module can be enabled, but no aliases in the database.

It is best to create a proper patch, with benchmarks of before and after. This way, it is more concrete.

#6

earnie - December 3, 2007 - 13:27

Still, the path module can be enabled, but no aliases in the database.

But the user of the path module should not care if there are currently zero in the database vs 100K within this function. It is something which isn't useful and takes more time to do the count than to just do the select request to begin with.

It is best to create a proper patch, with benchmarks of before and after. This way, it is more concrete.

I plan to do just that. I created a support request for it first to get comment. I will change to a bug report when I have a patch to apply.

#7

kbahey - December 3, 2007 - 18:31

Another approach is this:

- Cron would do the select count(*) every so often, and store whether there are aliases in a variable_set().
- The code that now checks count just checks that variable.

#8

earnie - December 4, 2007 - 20:49

I have thunked a solution of

  if (!isset($have_rows)) {
    $have_rows = db_result(db_query('SELECT pid FROM {url_alias} LIMIT 1'));
  }

  if ($action == 'wipe') {
    $map = array();
    $no_src = array();
  }
  elseif ($have_rows && $path != '') {

How should the LIMIT 1 be specified since it is MySql specific? (Hmm, guess it is ANSI specified. I thought Oracle didn't support it though.)

#9

earnie - December 4, 2007 - 21:01
Category:support request» bug report
Status:active» needs review

Here's the HEAD patch.

AttachmentSize
drupal_lookup_path-HEAD.patch.txt 1.25 KB

#10

earnie - December 4, 2007 - 21:06

And the same patch rolled for 6.x and 5.x.

AttachmentSize
drupal_lookup_path-6.x.patch.txt 1.25 KB
drupal_lookup_path-5.x.patch.txt 1.07 KB

#11

FiReaNG3L - December 4, 2007 - 21:37

IMHO a SELECT ... LIMIT 1 answers the question "is there an alias in the db" way more efficiently than counting every alias out there. It may be "only 30-40 ms", but its still an improvement and as earnie said, the more alias you got, the worse the problem. +1 from me.

#12

kbahey - December 5, 2007 - 17:04

Any benchmarks for the new patch?

#13

FiReaNG3L - December 5, 2007 - 17:25

It would be interesting to benchmark it in both MySQL MyISAM, INNODB and Postgre with lots of nodes if possible. MyISAM has optimization for COUNT(*), so there might not be a terrible difference there.

#14

earnie - December 5, 2007 - 19:32

I went from 30+-ms to between 0.29ms and 0.45ms on InnoDB.

#15

FiReaNG3L - December 5, 2007 - 19:35

Thanks Earnie, that confirms what I've been thinking; this probably got overlooked because COUNT(*) is extremely fast in MyISAM tables. For everyone else (InnoDB and Postgre), this way of doing things is very bad performance wise (when the table is big).

#16

earnie - December 5, 2007 - 19:54

To Count Or Not To Count

I did the patch before I found this.

#17

kbahey - December 5, 2007 - 21:23

Not so fast ...

Turn the MySQL query cache off (or restart MySQL between tests) so that it does not skew the numbers.

#18

kbahey - December 5, 2007 - 23:55
Status:needs review» needs work

I did some benchmarking of MyISAM and InnoDB on a table with 10,438 nodes.

The results DO NOT SHOW that LIMIT is faster than COUNT.

Here is the script used to test:

<?php
require_once('./includes/bootstrap.inc');
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

timer_start('select1');
db_query("SELECT COUNT(*) FROM {node}");
$total = timer_stop('select1');
print
"MyISAM COUNT(*): ". $total['time'] ." ms\n";

timer_start('select2');
db_query("SELECT nid FROM {node} LIMIT 1'");
$total = timer_stop('select2');
print
"MyISAM LIMIT 1: ". $total['time'] ." ms\n";

timer_start('select3');
db_query("SELECT COUNT(*) FROM {node_inno}'");
$total = timer_stop('select3');
print
"InnoDB COUNT(*): ". $total['time'] ." ms\n";

timer_start('select4');
db_query("SELECT nid FROM {node_inno} LIMIT 1'");
$total = timer_stop('select4');
print
"InnoDB LIMIT 1: ". $total['time'] ." ms\n";
?>

Here are the results.

$ php ~/sites/benchmark/benchdb.php
MyISAM COUNT(*): 0.14 ms
MyISAM LIMIT 1: 0.44 ms
InnoDB COUNT(*): 0.33 ms
InnoDB LIMIT 1: 0.33 ms

Since MyISAM is in wide use, your patch will cause slowdowns on many sites.

#19

FiReaNG3L - December 6, 2007 - 02:27

kbahey: your script have an extra ' at the end in the 3 queries except the original COUNT(*) on MyISAM, which would cause the query to fail entirely. Are you sure the queries ran properly?

#20

kbahey - December 6, 2007 - 03:11

[edited -- kbahey]

Fixed.

Here are the results after the fix.

$ php ~/sites/benchmark/benchdb.php
MyISAM COUNT(*): 0.13 ms
MyISAM LIMIT 1: 0.22 ms
InnoDB COUNT(*): 6.7 ms
InnoDB LIMIT 1: 0.39 ms

For PostgreSQL with 10,000 nodes in the table, it says:

PostgreSQL COUNT(*): 5.16 ms
PostgreSQL LIMIT 1: 0.51 ms

#21

earnie - December 6, 2007 - 13:16
Status:needs work» needs review

Seems ready to be committed to me. It serves three database engines well.

#22

catch - February 14, 2008 - 11:14

This still applies cleanly. But if the LIMIT is slower on MyISAM, why not the variable_set in cron proposed by kbahey in #7? This could also be checked/set in path_set_alias() for the very rare case that a site has 0 aliases then gets one in between cron runs. Checking the variable has to be quicker than this query on every page.

#23

earnie - February 14, 2008 - 12:47

For the obvious reason that if someone turns on path aliases they expect it to be available at the time they submit the page and not after waiting for the length of time that the cron timer expires to execute it. And as I've already said, the count isn't important, the fact that we have aliases to serve is.

The whole idea of protecting against having rows in path_alias is debatable since we wouldn't have path turned on if we weren't going to use it. I'm one who likes to eliminate such frivolous things and just let the code handle it just like it would for no alias found. It makes the code smaller and therefore faster, uses less memory and easier to maintain. It's a module and if you're not using it then disable it to make the system faster. I submitted this patch because it is the least intrusive to the existing code.

#24

catch - February 14, 2008 - 14:30

earnie:

This could also be checked/set in path_set_alias() for the very rare case that a site has 0 aliases then gets one in between cron runs.

Or for that matter, could be set on saving and deleting path aliases plus enabling path module - and left both out of both page requests and cron entirely no?

#25

earnie - February 14, 2008 - 16:06

Catch, why should http://api.drupal.org/api/function/drupal_lookup_path/6 care about count? How does having a count help for this function? I understand that if the path alias module isn't used that perhaps the function should just return false; but if it is used how is count useful? What you propose is adding extra overhead for no gain.

#26

catch - February 14, 2008 - 16:24

earnie: I think we're talking at cross-purposes.

What I'm saying is keep the altered query from #10, but move it to a variable_get/set in path_set_alias() when paths are added and deleted rather than read. Then all burden goes to writes and it's just if ($has_path) in drupal_lookup_path() itself.

This is the same as kbahey suggested, but without the lag of relying on cron, and any overhead is only on writes. I don't see how this adds extra overhead to drupal_lookup_path() - it even simplifies it and it should gain slightly more than the original patch as well no? Or am I simply wrong here?

#27

earnie - February 14, 2008 - 18:41

I understood

What I'm saying is keep the altered query from #10, but move it to a
variable_get/set in path_set_alias() when paths are added and deleted
rather than read. Then all burden goes to writes and it's just if
($has_path) in drupal_lookup_path() itself.

you were saying this. My question is why do we care about the count? Why modify the updaters to keep track of a count? Where is it important to have?

#28

catch - February 15, 2008 - 01:35

Because a small performance hit on some updates update and one additional variable is a tiny price to pay for removing that query altogether on every page load. Since neither LIMIT 1; and SELECT COUNT() are optimal cross-databse, it seems like a win-win to me to just check the variable instead. You don't need to store the count, just a TRUE/FALSE as to whether there's any rows in the path table and update only if it changes.

#29

Susurrus - February 15, 2008 - 01:41

@catch: So how do you determine when you've removed the last path to set that variable back to FALSE? Am I missing somewhere where there's a REMOVE * and you'd therefore know that you were for sure back to no rows?

#30

chx - February 15, 2008 - 01:47

The world mostly runs MyISAM (edit: might be that every table engine) which checks against the table definiton (edit: the frm file) and not against the table data when running a count of all rows.

#31

chx - February 15, 2008 - 02:02

So. it's not the FRM file but the MyISAM table header. Sorry. Anyways -- it's extremely fast , it's pre stored always. http://www.browardphp.com/mysql_manual_en/manual_Storage_engines.html

#32

catch - February 15, 2008 - 02:02

Susurrus: well you could run the query just after path deletion to ensure you catch if there's zero rows. Or if using COUNT() and $count -1 = 0 just before a delete, then you'd know that way as well.

#33

Susurrus - February 15, 2008 - 02:24

But, isn't the point to remove queries? Why would we want to run COUNT() as a separate query?

As far as this query, it should be very quick since pid is a primary key on the url_alias. In theory, the number of rows in this table is stored directly by the database (I would imagine this would be the case for all databases) and the query would run very quickly. Has anyone actually benchmarked the cost of this query per page load?

#34

earnie - February 15, 2008 - 04:49
Status:needs review» needs work

Since the reason for needing to have a count hasn't been answered to any satisfaction, I will submit a different patch soon that just checks for the path module existing. If I have path module enabled but no aliases created yet is minimal and I wouldn't care that the lookup for a non existent alias occurs with zero rows in the table.

@Susurrus: I submitted the original patch because the devel module identified the count as slow. This query is executed on every page so eliminating long unneeded queries is important. Moving the query elsewhere isn't feasible either. I'm creating aliases for every tag I create, not to mention every page title. See the statistics at #20 for differences between myisam, innodb and postgres.

@chx: While myisam is good for small sites IMO, I am processing 100K nodes (soon I will be processing more than 500K nodes) in batch and I need transactions. I chose innodb because I deemed that since it was still mysql I would have fewer problems since I really am not familiar with postgres.

#35

catch - February 15, 2008 - 10:00

earnie: Oh, now I see! Checking for path module rather than in the table seems completely fine to me, sorry! I was still working on the basis of #10.

#36

earnie - February 15, 2008 - 13:38
Status:needs work» needs review

As promised but this patch only works if http://drupal.org/node/222109 is also applied.

AttachmentSize
drupal_lookup_path-HEAD.patch.txt 1.28 KB

#37

Dries - February 17, 2008 - 20:38

I think it is OK to remove that query -- not everyone uses MyISAM. However, I don't like the static $has_path that was introduced. module_exists() should be fast -- if it is not, it should be made faster. Isn't module_exist() caching this already?

#38

earnie - February 18, 2008 - 15:18

That is what I almost did. The module_exists function sets a variable calling module_list>module_list and then calls array_key_exists. It may be slightly faster to cash the result of module_exists. I've attached a modified patch without the cache.

AttachmentSize
drupal_lookup_path-HEAD.patch.txt 1.15 KB

#39

Dries - February 18, 2008 - 16:49
Status:needs review» fixed

OK, I've committed this to CVS HEAD. Thanks.

#40

earnie - February 18, 2008 - 19:30

Dries: As I stated at http://drupal.org/comment/reply/196862#comment-731300 you also need to commit http://drupal.org/node/222109 or it doesn't work because the module_list cache isn't filled.

#41

Anonymous (not verified) - March 3, 2008 - 19:32
Status:fixed» closed

Automatically closed -- issue fixed for two weeks with no activity.

#42

David_Rothstein - April 15, 2008 - 02:11

This patch seems to have broken URL aliases entirely; see #237336: Url Aliases no longer work. (I tried applying the suggested companion patch at http://drupal.org/node/222109 which @earnie recommends, but that didn't seem to help.)

I don't think the problem is the "fault" of this patch, but seems like it might be a deeper problem that this patch happened to expose... so it's probably best to discuss it over at the new issue.

#43

catch - April 16, 2008 - 10:06

IMO this patch is at fault and I think it needs to be rolled back.

I'm not going to mess around with issue statuses, so rollback patch is here: http://drupal.org/node/222109#comment-809566

#44

catch - April 16, 2008 - 14:19
Title:drupal_lookup_path - expensive query» path aliases broken (was: drupal_lookup_path - expensive query)
Priority:normal» critical
Status:closed» reviewed & tested by the community

I'm pretty certain it was indeed the fault of this patch, so I'm reopening this issue and marking http://drupal.org/node/222109 as duplicate.

The short version is path module doesn't exist during DRUPAL_BOOTSTRAP_PATH because it doesn't implement hook_boot or hook_exit, so checking for it always returns FALSE.

This is a simple rollback so I'm marking RTBC - the patch is just a reversal of #38 rolled from root for convenience.

Once that's done, we should keep this issue open at CNW and try to prevent the query another way.

AttachmentSize
path-rollback.patch 1.23 KB
Testbed results
path-rollback.patchfailedFailed: Failed to apply patch. Detailed results

#45

earnie - April 16, 2008 - 16:23

Simply rolling back a fix that showed issues elsewhere isn't fixing the real problem. Please review the patches by David Rothstein at #222109: module_list() is broken comment #10.

#46

David_Rothstein - April 16, 2008 - 17:56

If the goal is to get URL aliases working again quickly, then I agree that rolling back this patch is probably the best move.

However, as @earnie said, we have reopened http://drupal.org/node/222109, since there is indeed a more fundamental problem discussed there that is related to this one... and fixing it may allow the current patch here to work eventually (it seems that it turned out to be more complicated than I originally thought, though).

#47

David_Rothstein - April 16, 2008 - 18:01

Hm, even if we can get the patch in #38 to work, is module_exists('path') really the right thing to check?? What if you create some aliases but then later disable the path module... won't that lead to a problem with broken links?

Currently, Drupal checks URL aliases in the database REGARDLESS of whether the path module is enabled... the path module itself basically just controls the administrative interface. I think in many ways that makes more sense.

#48

Freso - April 16, 2008 - 21:36

Adding to my issue queue (catch thought I should have a look at it, so... ;)).

#49

earnie - April 16, 2008 - 22:46

Hm, even if we can get the patch in #38 to work, is module_exists('path') really the right thing to check?? What if you create some aliases but then later disable the path module... won't that lead to a problem with broken links?

My original patch at #9 took care of this. However, if someone (me for instance) disables the path module don't you think they might just think that path aliases are no longer valid? See Dries comment at #37, its obvious from the remark that he was thinking that module_exists('path') should work for this.

#50

David_Rothstein - April 16, 2008 - 23:49

Well, it's not wrong, I guess, just drastic. You click on that little checkbox and suddenly a bunch of links all over the Internet are broken ;)

If this patch goes in then maybe part of the solution is to rewrite some of the help text for path.module. Right now, the help text basically just talks about the admin interface for managing aliases (as it should, since in Drupal 6 that's all it controls, not the aliases themselves). For example, the description in the module list is currently Allows users to rename URLs, and the help text has things like:

The path module enables appropriately permissioned users to specify an optional alias in all node input and editing forms, and provides an interface to view and edit all URL aliases.

I guess this is somewhat related to @catch's comment here, that path.inc and path.module are perhaps more tangled up than they should be.

#51

Dries - April 19, 2008 - 17:03

I agree with David -- let's roll back and fix the help text as a stop gap solution, then spend the necessary time to work on a proper solution. Care to re-roll with the documentation suggestions?

#52

BioALIEN - April 20, 2008 - 16:43
Status:reviewed & tested by the community» needs work

As per Dries's request, setting to CNW.

#53

earnie - April 21, 2008 - 13:41

@Dries: Please don't forget to look at the synergies around #222109: module_list() is broken to help module_list. Chx has a patch being reviewed now. This simple patch has brought to light some inconsistencies with the use of module_list that will make D-7 even better. Let's not roll back just yet but wait for the outcome of #222109: module_list() is broken before a decision is made.

#54

catch - April 21, 2008 - 13:50
Status:needs work» needs review

fwiw I'm not sure the help text needs changing on this, or if it does I'm not sure in what way.

With the rollback, the path module does exactly what it says it does - provides the interface for adding and deleting paths. So the current help text would be correct. It's only with the change from the query to module_exists() as it currently is in HEAD that the path module influences whether paths themselves are activated or not.

I'm fine with leaving things as they are until module_list is reworked (assuming that gets in quickly), in which case we'd just need a patch to update the help text once the other issue is fixed. If we want to roll this back and revisit it again from scratch, then the current patch should do that fine, so marking back to CNR.

#55

catch - June 4, 2008 - 10:36
Assigned to:earnie» catch
Status:needs review» reviewed & tested by the community

http://drupal.org/node/196862#comment-809836 still applies, still fixes the issue, makes all path tests pass.

http://drupal.org/node/259412 is the current issue addressing with the root problem, but IMO this shouldn't be postponed for a complete module system revamp.

I'm marking this back to RTBC again.

#56

chx - June 24, 2008 - 19:16

Yeah the module revamp will fix it but until then why keep testing back?

#57

Dries - June 24, 2008 - 22:12
Status:reviewed & tested by the community» fixed

Committed to CVS HEAD. Thanks.

#58

catch - June 24, 2008 - 22:21
Title:path aliases broken (was: drupal_lookup_path - expensive query)» drupal_lookup_path - expensive query
Category:bug report» task
Priority:critical» normal
Assigned to:catch» Anonymous
Status:fixed» postponed

Changing status so we can revisit this when the module revamp is in.

#59

chx - February 21, 2009 - 17:02
Assigned to:Anonymous» chx
Status:postponed» needs review

It's easy, really.

AttachmentSize
path_enable.patch 1.69 KB
Testbed results
path_enable.patchfailedFailed: Failed to install HEAD. Detailed results

#60

chx - February 21, 2009 - 17:03

With less tpyos ;)

AttachmentSize
path_enable.patch 1.69 KB
Testbed results
path_enable.patchfailedFailed: Failed to install HEAD. Detailed results

#61

Crell - February 21, 2009 - 17:07

If I read this correctly, it makes the path system not function unless the path module itself is installed. Who's to say the core path module is the only possible way to configure paths? What if I want pathauto (or similar) running, but not have the core path module's UI? What if I want to provide my own API instead? This is like making Views not function unless Views UI is enabled, but keeping Views UI disableable. -1.

#62

chx - February 21, 2009 - 17:11

Found one more count and re-read the whole issue. If you are using aliases without the path module then set the variable too, it's easy.

AttachmentSize
path_enable.patch 2 KB
Testbed results
path_enable.patchfailedFailed: Failed to apply patch. Detailed results

#63

David_Rothstein - February 21, 2009 - 18:03
Status:needs review» needs work

Hm, at a minimum wouldn't this need to be accompanied by a change to the path.module's help text which explained that disabling the module breaks all your existing URL aliases?... However, I think @Crell is exactly right. Ideally, path.inc should be an API which works on its own, with path.module being one possible UI for accessing it (just like the separation between menu.inc and menu.module).

However, what if instead of making the URL aliases disabled by default, you made them enabled by default, as in the attached patch? There could then be a radio button for disabling path aliases altogether at admin/settings/performance (or even if not, the variable would still be available to set in settings.php for sites that cared about it).

AttachmentSize
path_disable.patch 706 bytes
Testbed results
path_disable.patchre-testing

#64

catch - February 21, 2009 - 18:05

Latest patch in #73660: Allow path module to be uninstalled uses module_exists(). Setting the variable works too though. Not sure which I prefer.

This one needs a hook_update_N() to set the variable on existing sites regardless, so CNW for now.

#65

David_Rothstein - February 21, 2009 - 18:13

Never mind my patch... I wasn't thinking straight about this and didn't realize the point was to get rid of the count query in all cases :)

However, I think what I was saying in the comment still makes sense. If this is a separate variable, it makes sense for path.module to enable it in path_enable(), but not to force it to be disabled in path_disable().

#66

Damien Tournoud - February 21, 2009 - 18:21

What is wrong with just:

<?php
db_result
(db_query_range('SELECT pid FROM {url_alias}'));
?>

The current way of doing this (ie. not relying of the path module being enabled, which makes little sense), is the good one imho.

#67

chx - February 21, 2009 - 18:49
Status:needs work» needs review

Damien, that's a query, though fine enough that i used it in the update function. I want to avoid a separate query if at all possible. I have removed the path_disable function. Once again, if you provide a path-equivalent module then add a variable_set to your _enable.

AttachmentSize
path_enable.patch 2.41 KB
Testbed results
path_enable.patchfailedFailed: 9745 passes, 7 fails, 0 exceptions Detailed results

#68

Damien Tournoud - February 21, 2009 - 19:06

Damien, that's a query, though fine enough that i used it in the update function. I want to avoid a separate query if at all possible. I have removed the path_disable function. Once again, if you provide a path-equivalent module then add a variable_set to your _enable.

Trying to remove one straight-forward query has very little value by itself.

If you really want to remove if in the most probable use case, why not "simply" issue a check query if the first lookup fails?

#69

System Message - February 21, 2009 - 19:35
Status:needs review» needs work

The last submitted patch failed testing.

#70

Crell - February 21, 2009 - 23:03

Is there an API call for setting a path? Why not just magically regenerate the "there are aliases" variable after each set/delete? If there are some, presumably we want to use the alias look up system. If not, disable it entirely. And if there's no API function and it's just a direct query, FAIL! Add it. :-)

#71

David_Rothstein - February 26, 2009 - 06:46

A better API seems like it would be a big win in its own right, and doing the variable_set/variable_del in there is more or less what @catch proposed way up above, I think.... but it might have to be a pretty comprehensive API in order to work for this. For example, a module like Pathauto currently runs queries like DELETE FROM {url_alias} WHERE src LIKE '%s%% in order to support bulk deletion by alias type, and if that kind of query can't be done via the API, it seems like maybe the variable could get out of whack?

Maybe the simplest approach would be to provide a variable, but not bother trying to set it at all. Just leave it as something that people can optionally set in settings.php if they care about ultra-high performance and really want to avoid this single DB query on every page load. For example, look at the approach in http://api.drupal.org/api/function/drupal_is_denied/7 - something like that...

#72

moshe weitzman - April 26, 2009 - 15:02

We need a universal solution to the COUNT(*) problem. We just made InnoDb the default engine, so we can ignore it no longer. If you think it is a minor issue, grep the source or the query log and you find several of these queries on every page. And sometimes on big tables like url_alias and node_access.

I agree with an earlier proposal that modules should calculate and store row counts during cron. They can do that using a new core table or with separate variables.

#73

earnie - April 27, 2009 - 12:11

Based on http://dev.mysql.com/doc/refman/5.0/en/innodb-restrictions.html

To get a fast count, you have to use a counter table you create yourself and let your application update it according to the inserts and deletes it does. SHOW TABLE STATUS also can be used if an approximate row count is sufficient.

#74

earnie - April 27, 2009 - 13:40
Title:drupal_lookup_path - expensive query» COUNT(*) is an expensive query in InnoDB.

Setting a new title.

#75

Damien Tournoud - April 27, 2009 - 14:14
Status:needs work» needs review

The attached patch goes part of the way: it removes "useless" COUNT(*) queries, ie. queries where we only want to know if the count is non zero. It's probably a good start to commit that to D7, and then to backport to D6.

AttachmentSize
196862-remove-useless-count-queries-we-do-not-like-you-please-go-away.patch 13.69 KB
Testbed results
196862-remove-useless-count-queries-we-do-not-like-you-please-go-away.patchfailedFailed: 10903 passes, 32 fails, 9 exceptions Detailed results

#76

moshe weitzman - April 27, 2009 - 14:37

@Damien - see #18 where kbahey bechmarked this - not promising.

#77

System Message - April 27, 2009 - 15:16
Status:needs review» needs work

The last submitted patch failed testing.

#78

earnie - April 27, 2009 - 15:46

Re: #76: Yes, which is why MySQL suggests http://dev.mysql.com/doc/refman/5.0/en/show-table-status.html for those tables we just want to know data exists. This can be used for MyISAM as well, it isn't specific to InnoDB. We probably need some API db_table_has_rows($table) or some such that returns boolean. Then other DB vendor can do the right thing.

#79

catch - April 27, 2009 - 15:57

Looks like we've got a choice between a SHOW TABLE STATUS and variable_get('path_table_has_rows', FALSE); then. Anyone able to run benchmarks on SHOW TABLE STATUS?

#80

sun - May 16, 2009 - 00:36
Status:needs work» duplicate

Marking as duplicate of #464164: path.inc your time has come. You can follow up on that issue instead to track its status. If any information from this issue is missing in the other issue, please make sure you provide it over there.

However, thanks for taking the time to report this issue.

#81

moshe weitzman - May 16, 2009 - 01:40
Status:duplicate» needs work

This issue has real discussion and real benchmarks. That issue no proper writeup, no followups, and is newer. Removing dupe status.

#82

moshe weitzman - May 16, 2009 - 01:42
Priority:normal» critical

Priority = Critical since we now ship with InnoDB.

#83

sun - May 16, 2009 - 02:17

#464164: path.inc your time has come removes path.inc and makes URL aliases optional. Proper fix for the real, underlying cause.

This patch touches the surface only.

#84

JimBroad - May 16, 2009 - 09:04

Subscribing...

@sun: I'd say this issus has moved well beyond the scope of the original patch/description.

Agreeing with catch @ #79, need some tests on 'SHOW TABLE STATUS'.

and with ernie @ #78, I think for developers and and maintenance tasks the db_table_has_rows() or db_table_empty() solution would be appreciated, but alot of time we actually are counting so that won't fix everything.

At first I was thinking of this new 'count' table as a pain similar to the 'sequence' table and way it used to be with nids. Still thinking about consequences of direction taken here now that InnoDB will be default engine.

#85

Damien Tournoud - May 16, 2009 - 09:58

I stand by my patch in #75.

I used the following script on a 350,000 rows table (a copy of drupal.org node table):

<?php
<?php
require_once('./includes/bootstrap.inc');
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

db_query("SET SESSION query_cache_type = OFF");

$count = 100;
print
"<pre>";

timer_start('select1');
for (
$i = 0; $i < $count; $i++) {
 
db_query("SELECT COUNT(*) FROM {node}");
}
$total = timer_stop('select1');
print
"MyISAM COUNT(*): ". $total['time'] ." ms\n";

timer_start('select2');
for (
$i = 0; $i < $count; $i++) {
 
db_query("SELECT nid FROM {node} LIMIT 1");
}
$total = timer_stop('select2');
print
"MyISAM LIMIT 1: ". $total['time'] ." ms\n";

timer_start('select3');
for (
$i = 0; $i < $count; $i++) {
 
db_query("SELECT COUNT(*) FROM {node_inno}");
}
$total = timer_stop('select3');
print
"InnoDB COUNT(*): ". $total['time'] ." ms\n";

timer_start('select4');
for (
$i = 0; $i < $count; $i++) {
 
db_query("SELECT nid FROM {node_inno} LIMIT 1");
}
$total = timer_stop('select4');
print
"InnoDB LIMIT 1: ". $total['time'] ." ms\n";
?>

and I'm getting:

MyISAM COUNT(*): 10.88 ms
MyISAM LIMIT 1: 12.08 ms
InnoDB COUNT(*): 24778.75 ms
InnoDB LIMIT 1: 12.17 ms

No comment.

#86

Damien Tournoud - May 16, 2009 - 10:19

Let's try again.

AttachmentSize
196862-remove-useless-count-queries-we-do-not-like-you-please-go-away.patch 13.15 KB
Testbed results
196862-remove-useless-count-queries-we-do-not-like-you-please-go-away.patchfailedFailed: 11254 passes, 1 fail, 6 exceptions Detailed results

#87

Damien Tournoud - May 16, 2009 - 10:19
Status:needs work» needs review

#88

System Message - May 16, 2009 - 10:31
Status:needs review» needs work

The last submitted patch failed testing.

#89

Damien Tournoud - May 16, 2009 - 10:39
Status:needs work» needs review

Thanks you, testing bot!

AttachmentSize
196862-remove-useless-count-queries-we-do-not-like-you-please-go-away.patch 13.13 KB
Testbed results
196862-remove-useless-count-queries-we-do-not-like-you-please-go-away.patchpassedPassed: 11255 passes, 0 fails, 0 exceptions Detailed results

#90

moshe weitzman - May 16, 2009 - 11:15
Status:needs review» reviewed & tested by the community

Benchmarks in #85 are compelling. Bot is happy. RTBC.

#91

Damien Tournoud - May 16, 2009 - 11:52

The benchmark in #85 was not conclusive for comparing COUNT(*) and LIMIT 1 on MyISAM. Here is a typical result with $count = 100000:

MyISAM COUNT(*): 10726.87 ms
MyISAM LIMIT 1: 10114.81 ms
InnoDB LIMIT 1: 10177.2 ms

It seems that COUNT(*) is consistently a tiny bit slower (in a 1-5% range) then LIMIT 1 on MyISAM, while LIMIT 1 are equally costly on MyISAM and InnoDB.

#92

webchick - May 16, 2009 - 14:47
Status:reviewed & tested by the community» needs review

I spent a good amount of time reviewing this patch, reading the linked resources, etc. It seems like there's a good bit of confusion and cross-talk which was a bit confusing, but here's where I think we stand:

There are sometimes legitimate reasons to use COUNT(*) to provide a count of the rows in a table. These queries are prohibitively slow on InnoDB (and really, anything but MyISAM, which is specifically optimized for this case). In this case you need a summary table, like we do with node_comment_statistics.comment_count. Or, alternatively, a variable in the variables table as chx's attempts at this patch did.

However, in the *vast* majority of cases, all we are using COUNT(*) for is "Are there rows in this table?" There is no point in running a COUNT query on a table that's getting bigger... and bigger.. and biiigggggeerrrrr.... if that's all you want to know. Damien's patch addresses these queries by changing them to just a simple 'SELECT 1 FROM ... LIMIT 1'. This makes a lot more sense, since the query is expressing what you actually mean. I'm a bit concerned about the verbosity and complexity of these queries (db_query_range, along with a cast to bool), and discussed with Damien whether a wrapper function might make sense, but he pointed out that there's really nothing these queries have in common other than that, so it would just be a needless layer of indrection, really.

Others in this issue have attempted to remove the path.inc query altogether by various means (just checking for whether path module is enabled, which is a no-go because then URL aliases would break if the optional Path module were disabled, updating a variable on cron which can be checked at read time, ec.) but none were very optimal, and none apply across the board to all of the COUNT(*) queries in core. I think that Damien's approach is best of the presented options.

However, what's a little confusing to me is that past benchmarks have showed SELECT 1 as slower than SELECT COUNT(*) in MyISAM, up until #91 by Damien. I know InnoDB is now our default, but we also deliberately did not put in an upgrade path for this change, so all of the hundreds of thousands of sites upgrading from 6.x and below will be stuck with MyISAM performance on these queries.

Damien, could you share why you got different results last time than anyone else? And could someone do a more holistic benchmark with ab now that this change is affecting more than just path.inc?

#93

webchick - May 16, 2009 - 14:50
Status:needs review» needs work

Sorry, I meant...

#94

chx - May 16, 2009 - 15:16
Status:needs work» needs review

This does not need a benchmark. SELECT COUNT(*) reads from the table header and does not read from the table so of course it's faster.

#95

webchick - May 16, 2009 - 15:23
Status:needs review» needs work
Issue tags:-Needs benchmarks+Needs Documentation

Spent some time talking benchmarks with Damien in IRC. The values presented in #18, #20, and #85 are highly variable. For example, here were four more test runs @ 100 queries:

MyISAM COUNT(*): 13.32 ms MyISAM LIMIT 1: 12.1 ms InnoDB LIMIT 1: 11.82 ms
MyISAM COUNT(*): 9.98 ms MyISAM LIMIT 1: 12.12 ms InnoDB LIMIT 1: 12.04 ms
MyISAM COUNT(*): 8.97 ms MyISAM LIMIT 1: 8.48 ms InnoDB LIMIT 1: 12.03 ms
MyISAM COUNT(*): 9.95 ms MyISAM LIMIT 1: 17.13 ms InnoDB LIMIT 1: 12.27 ms

Sometimes COUNT(*) is faster on MyISAM, sometimes it isn't. There's a pretty wide margin of error.

In #91, Damien cranked the number of iterations to 100K instead of 100 to try and make these margins a bit more pronounced. Over 100K iterations, LIMIT 1 comes out on top.

The end result being that MyISAM users shouldn't see a dramatic difference, and on the whole these spots in core should be marginally faster.

Therefore, committed to HEAD. We need to document this practice in the upgrade guide, at least, and it would be great to document it in the Programming best practices guide as well.

#96

pwolanin - May 17, 2009 - 15:31

Aside from documentation, this also needs to be backported to D6 it would seem.

#97

webchick - July 11, 2009 - 01:50

Bump.

#98

catch - August 6, 2009 - 00:06
Status:needs work» patch (to be ported)

Documented in upgrade guide, moving back to D6.

#99

catch - August 6, 2009 - 20:17
Version:7.x-dev» 6.x-dev

Really moving back to 6 this time.

#100

hass - August 17, 2009 - 16:23

I believe it could be a good idea for module maintainers to change their count querys in D6 times, too. I'm only a bit confused why I should write SELECT COUNT(1) FROM {mytable} as written in http://drupal.org/node/224333#select_count. I like SELECT COUNT(primary_key_field) FROM {mytable} more than using anonymous numbers... nevertheless - it doesn't change so much in the result of the query, but I find it easier to read.

Is a named column slower?

#101

tuffnatty - August 31, 2009 - 13:07

on pgsql, named column is slower.

#102

tuffnatty - August 31, 2009 - 13:11

So, why this COUNT is still there in D6?

#103

earnie - September 1, 2009 - 20:16

So, why this COUNT is still there in D6?

Because InnoDB isn't the default like it will be for D7.

on pgsql, named column is slower.

Why should we code differently for a broken DB server. Using named columns is beneficial for self documenting code.

I like SELECT COUNT(primary_key_field) FROM {mytable} more than using anonymous numbers

Ditto. It documents the code better but otherwise is no different. It should be just as fast (or nearly as fast) regardless of the column. It is a count of rows based on the WHERE clause if one is given. Using * you select all columns so, if you are just interested in a count of rows, using * is a bad idea. Using count(1) makes an assumption that column 1 is the primary key but that column might not be the primary key.

#104

tuffnatty - September 22, 2009 - 06:33
on pgsql, named column is slower.

Why should we code differently for a broken DB server. Using named columns is beneficial for self documenting code.

I like SELECT COUNT(primary_key_field) FROM {mytable} more than using anonymous numbers

Ditto. It documents the code better but otherwise is no different. It should be just as fast (or nearly as fast) regardless of the column. It is a count of rows based on the WHERE clause if one is given. Using * you select all columns so, if you are just interested in a count of rows, using * is a bad idea. Using count(1) makes an assumption that column 1 is the primary key but that column might not be the primary key.

COUNT(column) is a count of non-NULL values in that column. COUNT(1) is the count of rows where the constant 1 evaluates to non-NULL (that is, all rows), 1 is not column 1, but a constant.
The real thing we want to know is not the count of these rows, but whether they exist at all. It is best achieved by
SELECT EXISTS(SELECT 1 FROM {url_alias});
It's also 2 orders faster than the fastest of the cited solutions, at least on PostgreSQL. Please check it under MySQL also.

#105

Damien Tournoud - September 22, 2009 - 06:37

To clarify, what we are using is:

SELECT 1 FROM {url_alias} LIMIT 1

We are *not* counting.

#106

tuffnatty - September 22, 2009 - 06:46

To clarify, what we are using is:
SELECT 1 FROM {url_alias} LIMIT 1
We are *not* counting.

By "we" you must mean HEAD or something else, as D6, to which this bug is dedicated, uses COUNT...
SELECT 1...LIMIT 1 is OK from the performance point of view, it is as fast as EXISTS. Semantically, EXISTS is better.

#107

mattyoung - September 22, 2009 - 07:02

.

#108

earnie - September 22, 2009 - 12:06

COUNT(column) is a count of non-NULL values in that column.

COUNT(primary_key_field) FROM {mytable}

Primary key fields cannot be NULL.

 
 

Drupal is a registered trademark of Dries Buytaert.