Performance Issue during Indexing - more optimal search totals calculation

douggreen - May 13, 2008 - 13:22
Project:Drupal
Version:7.x-dev
Component:search.module
Category:bug report
Priority:normal
Assigned:douggreen
Status:needs work
Description

Not LIVE FROM THE MINNESOTA SEARCH SPRINT, but found on the plane ride home...

After cron completes indexing a chunk of nodes, search_update_totals() updates {search_total} with the new totals for each modified word. It performs some of the calculations in php and thus makes 3 SQL statements for each word. Since you can update 100's of nodes with 1000's of words, that's a lot of SQL to perform at the end of each cron run.

The attached patch greatly simplifies this by updating 500 words at a time. I arbitrarily limited it 500 words, not knowing what was a good limit to the SQL packet size.

The one concern I have about this approach is databases other than mysql. I'm relying on the non-standard function LOG10(). I believe that we use GREATEST() somewhere else, so I don't think that this is an issue. So, we need to check if pgsql supports LOG10().

And we should discuss how many words at a time are reasonable to update.

AttachmentSize
search_totals.patch2.39 KB
Testbed results
search_totals.patchfailedFailed: Failed to apply patch. Detailed results

#1

Damien Tournoud - May 13, 2008 - 13:46

This is a clever idea. Two remarks:

  • the LOG(B,X) form exists in both MySQL and PostgreSQL. You can use LOG(10, X) as an alternative to LOG10
  • I'm not sure there is much to gain in increasing the batch number more. This needs testing, but 50 or 100 seems more than enough.

#2

douggreen - May 13, 2008 - 14:01
Title:Performance Issue during Indexing - thousands of queries becomes 2» Performance Issue during Indexing - more optimal search totals calculation

#3

catch - May 13, 2008 - 14:11

fwiw there's contrib support for LOG/LOG10 in sqlite - http://www.sqlite.org/contrib

#4

douggreen - May 13, 2008 - 15:24

I've changed LOG10(X) to LOG(10, X). I'm leaving the 500 word chunk in for now. I'd like to see some testing with this before we decrease it. If people find problems on certain platforms, we can easily change this. But if nobody finds problems, the bigger the chunk size the better (I would think).

AttachmentSize
257916.patch 2.4 KB
Testbed results
257916.patchfailedFailed: Failed to apply patch. Detailed results

#5

earnie - May 15, 2008 - 15:12

I'm not sure there is much to gain in increasing the batch number more. This needs testing, but 50 or 100 seems more than enough.

I modified mine to do 200 and 500. I have it set to 200.

#6

David Lesieur - May 17, 2008 - 14:47

I don't know about the optimal word limit, but other than that the patch looks good to me. I've also compared results before and after, and counts in search_total were the same.

I would not delay the patch because of the uncertain word limit, which could easily be changed later. This is already a great performance improvement.

#7

douggreen - May 20, 2008 - 12:41
Status:needs review» reviewed & tested by the community

Given David's comments, I'm marking my own patch as RTBC...

#8

Dries - May 20, 2008 - 20:24
Status:reviewed & tested by the community» needs work

When I run the search tests after applying this patch I get:

Unexpected PHP error [You can't specify target table 'simpletest128363search_total' for update in FROM clause query: DELETE FROM simpletest128363search_total WHERE word IN (SELECT t.word AS realword FROM simpletest128363search_total t LEFT JOIN simpletest128363search_index i ON t.word = i.word WHERE i.word IS NULL)] severity [E_USER_WARNING] in [/Users/dries/Sites/drupal-cvs/includes/database.mysqli.inc line 130]

Can somebody reproduce this?

#9

earnie - May 26, 2008 - 17:43

It think the error is a quirk in MySql and you need to clarify the ``word'' column after the first WHERE phrase. So I'm thinking the SQL should read something like:

DELETE FROM simpletest128363search_total st WHERE st.word IN (SELECT t.word AS realword FROM simpletest128363search_total t LEFT JOIN simpletest128363search_index i ON t.word = i.word WHERE i.word IS NULL)

I'm not in development mode until June so i can't test this yet but thought it deserved mentioning anyway.

#11

douggreen - August 28, 2008 - 07:52
Status:needs work» needs review

I tried ernie's suggestion, and it didn't solve the problem. I found a solution here which suggests a subquery in a subquery to force a temporary table:

DELETE FROM {search_total} WHERE word IN (SELECT word FROM (SELECT t.word AS realword FROM {search_total} t LEFT JOIN {search_index} i ON t.word = i.word WHERE i.word IS NULL) t1)

This solution passes the simpletests.

AttachmentSize
257916-1.patch 2.43 KB
Testbed results
257916-1.patchpassedPassed: 7187 passes, 0 fails, 0 exceptions Detailed results

#12

Arancaytar - November 16, 2008 - 20:43
Status:needs review» needs work

Some code-style and commenting issues with this.

1.) -1 on the magic number, especially in the "$dirty500" variable - couldn't this be made a constant?

2.) SQL queries have WHERE clauses. I've never heard of "an SQL clause". Also, "this prevents" does not specify the alternative case it is preventing - would the big clause this avoids occur when updating them one by one (like now), or all at once?

3.) Please consider running this through the code-cleaner - I see trailing white-space.

4.) "We use the two embedded subqueries to avoid the database 'update in FROM clause' error." This comment doesn't clarify that it is a MySQL quirk, why this fixes it (forcing a temp table) or where to find more information on it. Specifically, on mysql.com.Workarounds should be documented much more clearly.

 
 

Drupal is a registered trademark of Dries Buytaert.