more optimal search totals calculation
| Project: | Drupal |
| Version: | 7.x-dev |
| Component: | search.module |
| Category: | bug report |
| Priority: | critical |
| Assigned: | douggreen |
| Status: | needs work |
| Issue tags: | Performance |
Not LIVE FROM THE MINNESOTA SEARCH SPRINT, but found on the plane ride home...
After cron completes indexing a chunk of nodes, search_update_totals() updates {search_total} with the new totals for each modified word. It performs some of the calculations in php and thus makes 3 SQL statements for each word. Since you can update 100's of nodes with 1000's of words, that's a lot of SQL to perform at the end of each cron run.
The attached patch greatly simplifies this by updating 500 words at a time. I arbitrarily limited it 500 words, not knowing what was a good limit to the SQL packet size.
The one concern I have about this approach is databases other than mysql. I'm relying on the non-standard function LOG10(). I believe that we use GREATEST() somewhere else, so I don't think that this is an issue. So, we need to check if pgsql supports LOG10().
And we should discuss how many words at a time are reasonable to update.
| Attachment | Size | Status | Test result | Operations |
|---|---|---|---|---|
| search_totals.patch | 2.39 KB | Idle | Failed: Failed to apply patch. | View details | Re-test |

#1
This is a clever idea. Two remarks:
LOG(B,X)form exists in both MySQL and PostgreSQL. You can useLOG(10, X)as an alternative toLOG10#2
#3
fwiw there's contrib support for LOG/LOG10 in sqlite - http://www.sqlite.org/contrib
#4
I've changed
LOG10(X)toLOG(10, X). I'm leaving the 500 word chunk in for now. I'd like to see some testing with this before we decrease it. If people find problems on certain platforms, we can easily change this. But if nobody finds problems, the bigger the chunk size the better (I would think).#5
I modified mine to do 200 and 500. I have it set to 200.
#6
I don't know about the optimal word limit, but other than that the patch looks good to me. I've also compared results before and after, and counts in search_total were the same.
I would not delay the patch because of the uncertain word limit, which could easily be changed later. This is already a great performance improvement.
#7
Given David's comments, I'm marking my own patch as RTBC...
#8
When I run the search tests after applying this patch I get:
Unexpected PHP error [You can't specify target table 'simpletest128363search_total' for update in FROM clause query: DELETE FROM simpletest128363search_total WHERE word IN (SELECT t.word AS realword FROM simpletest128363search_total t LEFT JOIN simpletest128363search_index i ON t.word = i.word WHERE i.word IS NULL)] severity [E_USER_WARNING] in [/Users/dries/Sites/drupal-cvs/includes/database.mysqli.inc line 130]Can somebody reproduce this?
#9
It think the error is a quirk in MySql and you need to clarify the ``word'' column after the first WHERE phrase. So I'm thinking the SQL should read something like:
DELETE FROM simpletest128363search_total st WHERE st.word IN (SELECT t.word AS realword FROM simpletest128363search_total t LEFT JOIN simpletest128363search_index i ON t.word = i.word WHERE i.word IS NULL)I'm not in development mode until June so i can't test this yet but thought it deserved mentioning anyway.
#11
I tried ernie's suggestion, and it didn't solve the problem. I found a solution here which suggests a subquery in a subquery to force a temporary table:
DELETE FROM {search_total} WHERE word IN (SELECT word FROM (SELECT t.word AS realword FROM {search_total} t LEFT JOIN {search_index} i ON t.word = i.word WHERE i.word IS NULL) t1)This solution passes the simpletests.
#12
Some code-style and commenting issues with this.
1.) -1 on the magic number, especially in the "$dirty500" variable - couldn't this be made a constant?
2.) SQL queries have WHERE clauses. I've never heard of "an SQL clause". Also, "this prevents" does not specify the alternative case it is preventing - would the big clause this avoids occur when updating them one by one (like now), or all at once?
3.) Please consider running this through the code-cleaner - I see trailing white-space.
4.) "We use the two embedded subqueries to avoid the database 'update in FROM clause' error." This comment doesn't clarify that it is a MySQL quirk, why this fixes it (forcing a temp table) or where to find more information on it. Specifically, on mysql.com.Workarounds should be documented much more clearly.
#13
This is a performance issue, and marking "critical" for that reason.
#14
Marked #312390: Performance: search_update_totals() slow with large number of nodes as duplicate. Here's Wesley's / BrianV's last patch from that issue for reference.
#15