do_search refactoring: performance, legibility, move validation to validation/submit handlers of form.

robertDouglass - May 15, 2008 - 20:24
Project:Drupal
Version:7.x-dev
Component:search.module
Category:feature request
Priority:normal
Assigned:Unassigned
Status:needs work
Issue tags:Performance
Description

Started at the MINNESOTA SEARCH SPRINT

The principle features are:

- Easier to read do_search.
- do_search count query is more efficient because it only contains the part of the query relevant to counting.
- drupal_set_message and form validation has been moved out of do_search and into search_form_submit (where it *almost* belongs).
- The existing tests are sufficient for coverage of this code.

This patch is one step in a series that will loosen up the rigid structure of do_search. The code in this patch is closer to being a query builder, and eventually I want to move there. Getting the drupal_set_message out of do_search lets us call it programmatically without unforseen UI artifacts. The performance boost in the count query is non-trivial, reducing the time to run the query by 40-60%.

AttachmentSize
do_search_refactor.patch5.47 KB

#1

robertDouglass - May 15, 2008 - 20:45

To make the performance aspect clear, here is the count query before and after this patch:

# Before

SELECT COUNT(*) FROM (
  SELECT i.type, i.sid,
         (5 * COALESCE((2.0 - 2.0 / (1.0 + node_comment_statistics.comment_count * 0.00271002710027)), 0) +
          5 * COALESCE(((34.5707888161 * SUM(i.score * t.count))), 0) +
          5 * COALESCE((n.sticky), 0) +
          5 * COALESCE((n.promote), 0) +
          5 * COALESCE(((POW(2, GREATEST(n.created, n.changed) - 1072477462) * 6.43e-8)), 0)) AS score
    FROM search_index i
    INNER JOIN search_total t ON i.word = t.word
    INNER JOIN node n ON n.nid = i.sid
    LEFT JOIN node_comment_statistics node_comment_statistics ON node_comment_statistics.nid = i.sid
    WHERE n.status = 1 AND (i.word = 'dries') AND i.type = 'node'
    GROUP BY i.type, i.sid HAVING COUNT(*) >= 1) n1

# After

SELECT COUNT(*) FROM (
  SELECT 1 FROM search_index i
  INNER JOIN node n ON n.nid = i.sid
  WHERE n.status = 1 AND (i.word = 'dries') AND i.type = 'node'
  GROUP BY i.type, i.sid HAVING COUNT(*) >= 1) n1 

#2

robertDouglass - May 15, 2008 - 21:21
Status:needs review» needs work

Oops, looks like I lost part of the patch. Need to try and recover and merge.

#3

robertDouglass - May 16, 2008 - 09:58

This patch includes the changes to the search box form and the search form submit handler which are important for keeping do_search out of the input validation business.

AttachmentSize
do_search_refactor.patch 7.12 KB

#4

robertDouglass - May 16, 2008 - 09:58
Status:needs work» needs review

#5

robertDouglass - May 16, 2008 - 10:49

Still managed to miss a bit. This version has it all.

AttachmentSize
do_search_refactor.patch 7.56 KB

#6

douggreen - May 16, 2008 - 14:56

Tracking...

I started to do some performance testing on this with the 100k node database, but I had problems with devel module not displaying the query results. I debugged this to the new registry, and devel_boot not getting registered or called.

#7

douggreen - May 17, 2008 - 02:06

I get the following warning:

notice: Undefined variable: keyword_arguments1 in ../drupal-7/modules/search/search.module on line 925

When I changed this from "keyword_arguments1" to "keyword_arguments" no results were returned.

I'm also getting warnings when submitting from the search block instead of the search form.

# notice: Undefined index: basic in ../drupal-7/modules/search/search.pages.inc on line 114.
# notice: Undefined index: keys in .../drupal-7/modules/search/search.pages.inc on line 114.
# warning: array_shift() [function.array-shift]: The argument should be an array in .../drupal-7/includes/form.inc on line 1308.

(btw, There's definitely something wrong with the new registry, but that's another issue. I was able to finagle it to register the devel functions to run these tests.)

Performance Tests

I tested search for "commoveo" on the 100k node database with mysql query cache enabled. All of the search rankings were set to 0, which is the same as keyword relevancy set to 5. I suspect that the performance gains would be greater had I used more search rankings.

The pre-patch count query is below. Because I have the mysql query cache enabled, this query is instantaneous when the page is refreshed, and on subsequent pages of the pager query.

pager_query 1243.67 SELECT COUNT(*) FROM (SELECT i.type, i.sid, (4840.64572508 * SUM(i.score * t.count)) AS score FROM search_index i INNER JOIN search_total t ON i.word = t.word INNER JOIN node n ON n.nid = i.sid WHERE n.status = 1 AND (i.word = 'commoveo') AND i.type = 'node' GROUP BY i.type, i.sid HAVING COUNT(*) >= 1) n1

I then RESET QUERY CACHE and applied the patch. The post-patch count queries are:

do_search 874.84 SELECT COUNT(*) FROM (SELECT 1 FROM search_index i INNER JOIN node n ON n.nid = i.sid WHERE n.status = 1 AND (i.word = 'commoveo') AND i.type = 'node' GROUP BY i.type, i.sid HAVING COUNT(*) >= 1) n1
pager_query 0.53 SELECT 63733

I ran each test (pre-patch and post-patch) three times, resetting the query cache each time, with similar results. So, I can confirm Robert's results that this patch results in a faster count query!

Results Testing

I confirmed that the page counts are the same pre-patch and post-patch.

#8

douggreen - May 17, 2008 - 03:01
Status:needs review» needs work

I fixed the first error, relating to "keyword_arguments1". I continued the naming of the search_parse_query variables that Robert started, pushing the renaming into that function, and making it return a keyed array. I also modified the creation of the arguments arrays, using array_merge instead of looping through them. I ran simpletest, and the new patched code (as well as the old one) passes all tests.

AttachmentSize
258998.patch 10.41 KB

#9

Susurrus - July 6, 2008 - 01:56

subscribe

#10

moshe weitzman - July 9, 2008 - 11:42

Great to see this moving along. serscribe

#11

robertDouglass - July 9, 2008 - 12:07
Status:needs work» needs review

I've tested this again and it still applies, still improves the queries and legibility of the search code. Please review #8.

#12

robertDouglass - July 9, 2008 - 12:10

I'm also getting warnings when submitting from the search block instead of the search form.

# notice: Undefined index: basic in ../drupal-7/modules/search/search.pages.inc on line 114.
# notice: Undefined index: keys in .../drupal-7/modules/search/search.pages.inc on line 114.
# warning: array_shift() [function.array-shift]: The argument should be an array in .../drupal-7/includes/form.inc on line 1308.

These warnings are dealt with here: http://drupal.org/node/280319

They're unrelated to this patch.

#13

Dries - July 10, 2008 - 11:03
Status:needs review» needs work

This patch does not apply.

My only comment is that in the output, we use the term 'positive keyword'. Why don't we simply write 'keyword'? :)

#14

robertDouglass - July 10, 2008 - 11:26
Status:needs work» needs review

Rerolled to reflect http://drupal.org/cvs?commit=126469

Changed "positive keyword" to "keyword".

Ran tests, did some taste testing manually. Looks ok.

AttachmentSize
search.patch 10.24 KB

#15

robertDouglass - July 10, 2008 - 11:32

The word "positive" stayed in with that last one.

AttachmentSize
search.patch 10.27 KB

#16

catch - July 10, 2008 - 11:39

edit: wrong tab.

#17

Susurrus - July 10, 2008 - 18:51
Status:needs review» needs work

The advanced search doesn't seem to work for me. I created a single article title "Blah" with no content. Ran cron. Searched for blah using the basic search and it found the article. Ran it again restricting the search to node types of article and it didn't find it.

This is the query it created "SELECT COUNT(*) FROM (SELECT 1 FROM search_index i INNER JOIN node n ON n.nid = i.sid WHERE n.status = 1 AND (n.type = 'blah' OR n.type = 'node') AND (i.word = '1') AND i.type = '' GROUP BY i.type, i.sid HAVING COUNT(*) >= 0) n1".

Also, has someone tested this on postgre?

#18

robertDouglass - July 11, 2008 - 20:46

I've written tests that check whether the advanced search works. In particular I test the type option. It would be good to add tests to test vocabulary options, language options, etc., but at least I solved the problem of how to test the advanced search form (was a non-trivial effort). I also adjusted do_search so that it is now passing all of the tests. I don't have full confidence in its correctness, though, and would love for others to do some more testing, and perhaps follow the model I set up in the test to test other parts of the advanced search form.

AttachmentSize
search.patch 12.4 KB

#19

Susurrus - July 11, 2008 - 22:36
Status:needs work» needs review

Needs review now?

#20

robertDouglass - July 12, 2008 - 00:33

Yes, please review what is there. Eventually someone needs to add the other tests, but your reviews are very helpful. Thank you.

#21

Susurrus - July 12, 2008 - 02:24
Status:needs review» needs work

Just looking over the code, the last assertion is wrong. Looks like a copy paste error on the description of the last assertText.

I think it'd actually be great to have unit tests for search_parse_query and do_search as well as having a webTest like the tests in this patch currently.

#22

Damien Tournoud - August 12, 2008 - 17:30

Could we also drop the default "WHERE 1 AND" that get inserted when do_search is called without a $where1 parameter, to respond to Dries' concern in #293504: Argument of AND must be type boolean, not type integer?

#23

douggreen - August 15, 2008 - 17:55
Status:needs work» needs review

The attached patch fixes the language of the final article search test.

It also removed the additional WHERE 1 problem identified in [293504], although I kinda wish that this was offered on that issue rather than this one, because it is kinda a separate issue.

AttachmentSize
258998.patch 12.61 KB

#24

robertDouglass - September 1, 2008 - 08:31
Status:needs review» needs work

Re-roll. Tests are broken.

AttachmentSize
do_search.patch 13.74 KB

#25

robertDouglass - September 1, 2008 - 11:02

Some tests still failing.

AttachmentSize
do_search.patch 12.75 KB

#26

robertDouglass - September 1, 2008 - 15:42

Tests pass.

AttachmentSize
do_search.patch 12.77 KB

#27

robertDouglass - September 1, 2008 - 15:52

And now for the cool part. It's not in the patch yet. The count query (and I think the search query) can be simplified (for the search query "dolore -sit") in the cases of phrase queries from this:

SELECT COUNT(*) FROM (
  SELECT count(*) FROM
    search_dataset d
    INNER JOIN node n ON n.nid = d.sid
    WHERE n.status = 1 AND
    d.type = 'node' AND
    (d.data LIKE '% dolore %' AND
    d.data NOT LIKE '% sit %')
    GROUP BY d.type, d.sid
    HAVING COUNT(*) >= 1
  ) n1;

to this:

SELECT count(*)
  FROM search_dataset d
  INNER JOIN node n ON n.nid = d.sid
  WHERE n.status = 1 AND
    d.type = 'node' AND
    (d.data LIKE '% dolore %' AND
     d.data NOT LIKE '% sit %');

So leaving marked as "needs work" because there is still so much that can be done here.

#28

robertDouglass - September 1, 2008 - 15:55

Furthermore, in some cases, the HAVING clause is not needed to get the right result. This needs more investigation:

mysql> SELECT COUNT(*) FROM (SELECT 1 FROM search_index i INNER JOIN node n ON n.nid = i.sid WHERE n.status = 1 AND (i.word = 'dolore') AND i.type = 'node' GROUP BY i.type, i.sid HAVING COUNT(*) >= 1) n1;
+----------+
| COUNT(*) |
+----------+
|        3 |
+----------+
1 row in set (0.00 sec)

mysql> SELECT COUNT(*) FROM (SELECT 1 FROM search_index i INNER JOIN node n ON n.nid = i.sid WHERE n.status = 1 AND (i.word = 'dolore') AND i.type = 'node' GROUP BY i.type, i.sid) n1;
+----------+
| COUNT(*) |
+----------+
|        3 |
+----------+
1 row in set (0.00 sec)

#29

robertDouglass - September 1, 2008 - 18:19

I'm testing right now to find out if the query variant without the search_index table is a good idea. It probably causes full table scans which is what Steven wanted to avoid, thus the original structure of the query. Will report.

#30

robertDouglass - September 2, 2008 - 09:31

Benchmarks answered the question in #29: the shorter version of the query is faster, and markedly so. I tested three different queries of differing complexity, and just used PHP to execute the database query a number of times. In all cases the shorter query was faster and produced identical results. Here is an example of two of the queries I used, and the results.

SELECT COUNT(*) FROM (
  SELECT 1 FROM search_index i
    INNER JOIN node n ON n.nid = i.sid
    INNER JOIN search_dataset d ON i.sid = d.sid AND i.type = d.type
  WHERE n.status = 1 AND
    (i.word = 'genitus' OR i.word = 'ratis' OR i.word = 'abigo' OR i.word = 'natu' OR i.word = 'ideo') AND
    i.type = 'node' AND
    (d.data LIKE '% os genitus ratis abigo natu ideo %')
  GROUP BY i.type, i.sid
  HAVING COUNT(*) >= 5) n1;   

SELECT COUNT(*) FROM search_dataset d
  INNER JOIN node n ON n.nid = d.sid 
  WHERE n.status = 1 AND
    d.type = 'node' AND
    (d.data LIKE '% os genitus ratis abigo natu ideo %');
Test # Big query Small query
1 4037.23 2095.99
2 56724.5 2894.56
3 7473.2 1090.48

#31

robertDouglass - September 2, 2008 - 09:36

Here's the patch that uses the new query structure for count queries. I haven't gotten to using it for search queries (and haven't seen whether it is possible). However, a certain type of tests fail.

The tests that fail seem to fail for reasons that are only documented in the code comments of the unit tests:

<?php
 
/**
   * Run predefine queries looking for indexed terms.
   */
 
function _testQueries() {
   
/*
      Note: OR queries that include short words in OR groups are only accepted
      if the ORed terms are ANDed with at least one long word in the rest of the query.

      e.g. enim dolore OR ut = enim (dolore OR ut) = (enim dolor) OR (enim ut) -> good
      e.g. dolore OR ut = (dolore) OR (ut) -> bad

      This is a design limitation to avoid full table scans.
    */
?>

Even though the tests fail, searching for the same terms in a normal website context works. So I don't know what's going on and have to investigate more.

AttachmentSize
do_search_tests_fail.patch 13.3 KB

#32

robertDouglass - September 14, 2008 - 10:03

I've moved the test to a separate issue in the hopes that making a smaller patch will help people review. The test is now here: http://drupal.org/node/308233

AttachmentSize
do_search_no_test.patch 10.4 KB

#33

robertDouglass - September 14, 2008 - 14:00

Tests pass in this. But don't rejoice! It's only because of an ugly hack that proves the concept and also proves that the signature of do_search has to be refactored. Basically it comes down to this: passing in snippets of SQL to inject into queries is too fragile. Since the technique of this patch is to remove the search_index table from the query in some cases, the node module can no longer assume that the search_index i part of the query will be there. So we need to move to a more meta-data driven query builder (like the views query builder!!!)

AttachmentSize
do_search.patch 10.8 KB

#34

moshe weitzman - September 14, 2008 - 17:02

Er, you mean "like the core query builder", now. the code in this patch begs to use db_select. From there, just add your conditions and joins and so on.

#35

lilou - May 17, 2009 - 20:58

Add tag.

#36

Berdir - June 4, 2009 - 20:32

#394182: DBTNG search.module is converting do_search() to a (sort of) Query Extenders, please review! ;)

I did not actually *change* the resulting sql, if there is anything that can be improved in my patch, please comment on it..

 
 

Drupal is a registered trademark of Dries Buytaert.