Performance: search_update_totals() slow with large number of nodes

Wesley Tanaka - September 23, 2008 - 18:09
Project:Drupal
Version:7.x-dev
Component:search.module
Category:bug report
Priority:critical
Assigned:Unassigned
Status:duplicate
Issue tags:Performance
Description

I've been looking into using Drupal with millions of nodes. One problem occurs when the search module is enabled, and after {search_total} has grown to a reasonably large size.

search_update_totals() runs this query:
SELECT t.word AS realword, i.word
FROM search_total t
LEFT JOIN search_index i ON t.word = i.word
WHERE i.word IS NULL
which does a full table scan on {search_total}

Given that this bug exists: http://drupal.org/node/312385
and thus the {search_total} table isn't 100% guaranteed to be up to date anyway, it seems like it might be useful to handle deletions from {search_total} incrementally with the search_dirty() function the same way that inserts and updates are handled.

I'd like to hear what someone more familiar with the search module thinks of that idea.

#1

Wesley Tanaka - October 6, 2008 - 04:13
Status:active» needs review

Untested patch which conceptually implements the proposed change
(update: deleted broken patch)

#2

earnie - September 24, 2008 - 12:28
Version:6.4» 7.x-dev
Status:needs review» needs work

Make the patch against D7.

I turned the Drupal search module off on my site because it was consuming too much of my resources.

#3

Damien Tournoud - September 24, 2008 - 13:29
Status:needs work» won't fix

@Wesley: the search module is not designed to scale to millions of nodes. For high-performance solutions, please look at integration between Drupal and Solr or Xapian.

#4

earnie - September 24, 2008 - 13:20
Status:won't fix» needs work

Just because it isn't designed to be scalable to millions doesn't mean we shouldn't fix it. This is a huge issue for sites creating large amounts of nodes. We offer a search module in core, it should work with whatever data we throw at it.

#5

catch - September 24, 2008 - 13:34

While the patch is already at needs work, it'll also need to be updated for the new database layer. If this results in a small improvement or negligible effect for sites with smaller numbers of nodes, then there's no harm done fixing it.

#6

Damien Tournoud - September 24, 2008 - 13:35

@earnie: the patch is clearly not a bad idea, but no fulltext search implemented in a SQL database can scale correctly. And of course, there is no such thing as a system that works "whatever data we throw at it". Every system has its limitations, and every system is tuned to a specific scale.

#7

earnie - September 24, 2008 - 14:30

@Damien: I agree with what you say of course. But when someone is trying to tune out of the box search module we should at least try it out. The other thing would be a test for the search.module so that we can measure the affects of the patch.

#8

Wesley Tanaka - September 25, 2008 - 14:28

I'm personally interested to hear from someone familiar with the 6.x search.module whether or not the general idea of keeping track of which rows to delete from {search_total} is a reasonable one.

Where might be a good place to ask that question/get feedback on that issue if not in this bug report?

#9

robertDouglass - September 25, 2008 - 19:02

Sub.

#10

Wesley Tanaka - September 28, 2008 - 14:25
Version:7.x-dev» 6.4
Status:needs work» needs review

Fixed a syntax error in the previously attached patch.
Regenerated it without the -w flag to correct indentation.

Smoke tested the patch against Drupal 6.4 by:

1. Editing a node to contain a nonsense word that didn't previously exist
2. Running cron
3. Confirming that the word was contained in {search_total} by running SELECT * FROM d_search_total WHERE word='nonsenseword';
4. Editing the same node to delete the nonsense word
5. Running cron
6. Confirming that the word no longer appeared in {search_total} by running SELECT * FROM d_search_total WHERE word='nonsenseword';

which worked.

Caveat: I'm pretty sure the patch exacerbates #312385: {search_index} and {search_total} can get out of sync somewhat.

AttachmentSize
312390-6.4.patch 2 KB
Testbed results
312390-6.4.patchfailedFailed: Failed to apply patch. Detailed results

#11

Babalu - October 5, 2008 - 10:34

thx

#12

Wesley Tanaka - October 9, 2008 - 10:24
Version:6.4» 6.5

Patch applies cleanly without modification to Drupal 6.5

#13

veelo - October 16, 2008 - 10:32

A real world case: our node table has 547 rows. search_index and search_total have 36089 and 12905 rows respectively. Up until I applied this patch, the above query took well over 500 seconds. There have been occasions where SHOW PROCESSLIST showed a host of these queries which obviously were all battling for CPU cycles, bringing our production server to a crawl. Restarting mysql was the only way out. I imagine that this situation occurs if the duration to complete one such query exceeds the cron interval.

With this patch that situation is completely gone. Thank you Wesley, you saved us.

Bastiaan Veelo,
running Drupal 6.5.

#14

earnie - October 16, 2008 - 11:49
Priority:normal» critical
Status:needs review» reviewed & tested by the community

With @veelo's comment in #13 I'm making this critical.

@Wesley can you port forward your patch to 7.x-dev please?

#15

catch - October 16, 2008 - 12:47
Version:6.5» 7.x-dev
Status:reviewed & tested by the community» needs work

Yes, this needs to go into 7.x first.

#16

Wesley Tanaka - October 22, 2008 - 11:53

@earnie: My apologies, I don't really know anything about 7.x-dev. =P

#17

earnie - October 22, 2008 - 12:10

@wesley: See http://drupal.org/handbook/cvs and http://drupal.org/patch. 7.x-dev is the cvs HEAD version.

#18

robertDouglass - February 26, 2009 - 13:06

As I too have been looking at sites with millions of nodes I've got renewed interest in this. Thanks for the effort so far, Wesley.

#19

geerlingguy - July 8, 2009 - 20:59

Does the patch in #10 apply for Drupal 6.x still? I'd like to try it out on stlouisreview.com to see if it lets cron actually run instead of hanging all the time on search indexing (even if it's only indexing one or two nodes, cron times out after 2 minutes).

[Edit: Patch applies cleanly to search.module in Drupal 6.13. Reduced average cron run time for search cron functions from 8 seconds to 1 second, according to the Supercron module.]

Major kudos - and hope we can get Drupal 7 search functions to work so well!

#20

Wesley Tanaka - May 15, 2009 - 12:39

geerlingguy: Yes, it still applies to Drupal 6.12

#21

As If - July 9, 2009 - 00:45

bravo. sub.

#22

joelstein - July 9, 2009 - 15:53

subscribing.

#23

brianV - July 29, 2009 - 05:03
Status:needs work» needs review

Here is the patch from #10 ported to D7 / DBTNG.

Patch #10 should be reviewed for D6.

AttachmentSize
312930-speedup-search-update-totals.patch 2.3 KB
Testbed results
312930-speedup-search-update-totals.patchfailedFailed: Failed to apply patch. Detailed results

#24

brianV - July 29, 2009 - 18:04
Title:search_update_totals() slow with large number of nodes» Performance: search_update_totals() slow with large number of nodes

Just tagging and editing the title for better organization.

#25

System Message - August 1, 2009 - 07:00
Status:needs review» needs work

The last submitted patch failed testing.

#26

catch - August 1, 2009 - 11:33
Status:needs work» needs review

#27

System Message - August 30, 2009 - 00:25
Status:needs review» needs work

The last submitted patch failed testing.

#28

catch - September 5, 2009 - 03:25
Status:needs work» duplicate

This is a duplicate of #257916: more optimal search totals calculation.

 
 

Drupal is a registered trademark of Dries Buytaert.