| Project: | Drupal core |
| Version: | 7.x-dev |
| Component: | search.module |
| Category: | feature request |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | closed (won't fix) |
| Issue tags: | Performance |
Issue Summary
Started at the MINNESOTA SEARCH SPRINT
The principle features are:
- Easier to read do_search.
- do_search count query is more efficient because it only contains the part of the query relevant to counting.
- drupal_set_message and form validation has been moved out of do_search and into search_form_submit (where it *almost* belongs).
- The existing tests are sufficient for coverage of this code.
This patch is one step in a series that will loosen up the rigid structure of do_search. The code in this patch is closer to being a query builder, and eventually I want to move there. Getting the drupal_set_message out of do_search lets us call it programmatically without unforseen UI artifacts. The performance boost in the count query is non-trivial, reducing the time to run the query by 40-60%.
| Attachment | Size | Status | Test result | Operations |
|---|---|---|---|---|
| do_search_refactor.patch | 5.47 KB | Ignored: Check issue status. | None | None |
Comments
#1
To make the performance aspect clear, here is the count query before and after this patch:
# Before
SELECT COUNT(*) FROM (
SELECT i.type, i.sid,
(5 * COALESCE((2.0 - 2.0 / (1.0 + node_comment_statistics.comment_count * 0.00271002710027)), 0) +
5 * COALESCE(((34.5707888161 * SUM(i.score * t.count))), 0) +
5 * COALESCE((n.sticky), 0) +
5 * COALESCE((n.promote), 0) +
5 * COALESCE(((POW(2, GREATEST(n.created, n.changed) - 1072477462) * 6.43e-8)), 0)) AS score
FROM search_index i
INNER JOIN search_total t ON i.word = t.word
INNER JOIN node n ON n.nid = i.sid
LEFT JOIN node_comment_statistics node_comment_statistics ON node_comment_statistics.nid = i.sid
WHERE n.status = 1 AND (i.word = 'dries') AND i.type = 'node'
GROUP BY i.type, i.sid HAVING COUNT(*) >= 1) n1
# After
SELECT COUNT(*) FROM (
SELECT 1 FROM search_index i
INNER JOIN node n ON n.nid = i.sid
WHERE n.status = 1 AND (i.word = 'dries') AND i.type = 'node'
GROUP BY i.type, i.sid HAVING COUNT(*) >= 1) n1
#2
Oops, looks like I lost part of the patch. Need to try and recover and merge.
#3
This patch includes the changes to the search box form and the search form submit handler which are important for keeping do_search out of the input validation business.
#4
#5
Still managed to miss a bit. This version has it all.
#6
Tracking...
I started to do some performance testing on this with the 100k node database, but I had problems with devel module not displaying the query results. I debugged this to the new registry, and devel_boot not getting registered or called.
#7
I get the following warning:
When I changed this from "keyword_arguments1" to "keyword_arguments" no results were returned.
I'm also getting warnings when submitting from the search block instead of the search form.
(btw, There's definitely something wrong with the new registry, but that's another issue. I was able to finagle it to register the devel functions to run these tests.)
Performance Tests
I tested search for "commoveo" on the 100k node database with mysql query cache enabled. All of the search rankings were set to 0, which is the same as keyword relevancy set to 5. I suspect that the performance gains would be greater had I used more search rankings.
The pre-patch count query is below. Because I have the mysql query cache enabled, this query is instantaneous when the page is refreshed, and on subsequent pages of the pager query.
SELECT COUNT(*) FROM (SELECT i.type, i.sid, (4840.64572508 * SUM(i.score * t.count)) AS score FROM search_index i INNER JOIN search_total t ON i.word = t.word INNER JOIN node n ON n.nid = i.sid WHERE n.status = 1 AND (i.word = 'commoveo') AND i.type = 'node' GROUP BY i.type, i.sid HAVING COUNT(*) >= 1) n1I then
RESET QUERY CACHEand applied the patch. The post-patch count queries are:SELECT COUNT(*) FROM (SELECT 1 FROM search_index i INNER JOIN node n ON n.nid = i.sid WHERE n.status = 1 AND (i.word = 'commoveo') AND i.type = 'node' GROUP BY i.type, i.sid HAVING COUNT(*) >= 1) n1SELECT 63733I ran each test (pre-patch and post-patch) three times, resetting the query cache each time, with similar results. So, I can confirm Robert's results that this patch results in a faster count query!
Results Testing
I confirmed that the page counts are the same pre-patch and post-patch.
#8
I fixed the first error, relating to "keyword_arguments1". I continued the naming of the search_parse_query variables that Robert started, pushing the renaming into that function, and making it return a keyed array. I also modified the creation of the arguments arrays, using array_merge instead of looping through them. I ran simpletest, and the new patched code (as well as the old one) passes all tests.
#9
subscribe
#10
Great to see this moving along. serscribe
#11
I've tested this again and it still applies, still improves the queries and legibility of the search code. Please review #8.
#12
These warnings are dealt with here: http://drupal.org/node/280319
They're unrelated to this patch.
#13
This patch does not apply.
My only comment is that in the output, we use the term 'positive keyword'. Why don't we simply write 'keyword'? :)
#14
Rerolled to reflect http://drupal.org/cvs?commit=126469
Changed "positive keyword" to "keyword".
Ran tests, did some taste testing manually. Looks ok.
#15
The word "positive" stayed in with that last one.
#16
edit: wrong tab.
#17
The advanced search doesn't seem to work for me. I created a single article title "Blah" with no content. Ran cron. Searched for blah using the basic search and it found the article. Ran it again restricting the search to node types of article and it didn't find it.
This is the query it created "SELECT COUNT(*) FROM (SELECT 1 FROM search_index i INNER JOIN node n ON n.nid = i.sid WHERE n.status = 1 AND (n.type = 'blah' OR n.type = 'node') AND (i.word = '1') AND i.type = '' GROUP BY i.type, i.sid HAVING COUNT(*) >= 0) n1".
Also, has someone tested this on postgre?
#18
I've written tests that check whether the advanced search works. In particular I test the type option. It would be good to add tests to test vocabulary options, language options, etc., but at least I solved the problem of how to test the advanced search form (was a non-trivial effort). I also adjusted do_search so that it is now passing all of the tests. I don't have full confidence in its correctness, though, and would love for others to do some more testing, and perhaps follow the model I set up in the test to test other parts of the advanced search form.
#19
Needs review now?
#20
Yes, please review what is there. Eventually someone needs to add the other tests, but your reviews are very helpful. Thank you.
#21
Just looking over the code, the last assertion is wrong. Looks like a copy paste error on the description of the last
assertText.I think it'd actually be great to have unit tests for search_parse_query and do_search as well as having a webTest like the tests in this patch currently.
#22
Could we also drop the default "WHERE 1 AND" that get inserted when do_search is called without a $where1 parameter, to respond to Dries' concern in #293504: Argument of AND must be type boolean, not type integer?
#23
The attached patch fixes the language of the final article search test.
It also removed the additional WHERE 1 problem identified in [293504], although I kinda wish that this was offered on that issue rather than this one, because it is kinda a separate issue.
#24
Re-roll. Tests are broken.
#25
Some tests still failing.
#26
Tests pass.
#27
And now for the cool part. It's not in the patch yet. The count query (and I think the search query) can be simplified (for the search query "dolore -sit") in the cases of phrase queries from this:
SELECT COUNT(*) FROM (SELECT count(*) FROM
search_dataset d
INNER JOIN node n ON n.nid = d.sid
WHERE n.status = 1 AND
d.type = 'node' AND
(d.data LIKE '% dolore %' AND
d.data NOT LIKE '% sit %')
GROUP BY d.type, d.sid
HAVING COUNT(*) >= 1
) n1;
to this:
SELECT count(*)FROM search_dataset d
INNER JOIN node n ON n.nid = d.sid
WHERE n.status = 1 AND
d.type = 'node' AND
(d.data LIKE '% dolore %' AND
d.data NOT LIKE '% sit %');
So leaving marked as "needs work" because there is still so much that can be done here.
#28
Furthermore, in some cases, the HAVING clause is not needed to get the right result. This needs more investigation:
mysql> SELECT COUNT(*) FROM (SELECT 1 FROM search_index i INNER JOIN node n ON n.nid = i.sid WHERE n.status = 1 AND (i.word = 'dolore') AND i.type = 'node' GROUP BY i.type, i.sid HAVING COUNT(*) >= 1) n1;
+----------+
| COUNT(*) |
+----------+
| 3 |
+----------+
1 row in set (0.00 sec)
mysql> SELECT COUNT(*) FROM (SELECT 1 FROM search_index i INNER JOIN node n ON n.nid = i.sid WHERE n.status = 1 AND (i.word = 'dolore') AND i.type = 'node' GROUP BY i.type, i.sid) n1;
+----------+
| COUNT(*) |
+----------+
| 3 |
+----------+
1 row in set (0.00 sec)
#29
I'm testing right now to find out if the query variant without the search_index table is a good idea. It probably causes full table scans which is what Steven wanted to avoid, thus the original structure of the query. Will report.
#30
Benchmarks answered the question in #29: the shorter version of the query is faster, and markedly so. I tested three different queries of differing complexity, and just used PHP to execute the database query a number of times. In all cases the shorter query was faster and produced identical results. Here is an example of two of the queries I used, and the results.
#31
Here's the patch that uses the new query structure for count queries. I haven't gotten to using it for search queries (and haven't seen whether it is possible). However, a certain type of tests fail.
The tests that fail seem to fail for reasons that are only documented in the code comments of the unit tests:
<?php
/**
* Run predefine queries looking for indexed terms.
*/
function _testQueries() {
/*
Note: OR queries that include short words in OR groups are only accepted
if the ORed terms are ANDed with at least one long word in the rest of the query.
e.g. enim dolore OR ut = enim (dolore OR ut) = (enim dolor) OR (enim ut) -> good
e.g. dolore OR ut = (dolore) OR (ut) -> bad
This is a design limitation to avoid full table scans.
*/
?>
Even though the tests fail, searching for the same terms in a normal website context works. So I don't know what's going on and have to investigate more.
#32
I've moved the test to a separate issue in the hopes that making a smaller patch will help people review. The test is now here: http://drupal.org/node/308233
#33
Tests pass in this. But don't rejoice! It's only because of an ugly hack that proves the concept and also proves that the signature of do_search has to be refactored. Basically it comes down to this: passing in snippets of SQL to inject into queries is too fragile. Since the technique of this patch is to remove the search_index table from the query in some cases, the node module can no longer assume that the search_index i part of the query will be there. So we need to move to a more meta-data driven query builder (like the views query builder!!!)
#34
Er, you mean "like the core query builder", now. the code in this patch begs to use db_select. From there, just add your conditions and joins and so on.
#35
Add tag.
#36
#394182: DBTNG search.module is converting do_search() to a (sort of) Query Extenders, please review! ;)
I did not actually *change* the resulting sql, if there is anything that can be improved in my patch, please comment on it..
#37
I'm assuming this particular refactoring of the search module is dead now, since the DBTNG code totally redid it. The patches won't do anything... I'm setting to postponed for now -- are there things that you feel still should be fixed in the current D7 code, or for D8?
#38
#39
Feel free to reopen this issue or file a new one if there are still performance things that you think need fixing with the current search architecture.