I've been looking into using Drupal with millions of nodes. One problem occurs when the search module is enabled, and after {search_total} has grown to a reasonably large size.
search_update_totals() runs this query:
SELECT t.word AS realword, i.word
FROM search_total t
LEFT JOIN search_index i ON t.word = i.word
WHERE i.word IS NULL
which does a full table scan on {search_total}
Given that this bug exists: http://drupal.org/node/312385
and thus the {search_total} table isn't 100% guaranteed to be up to date anyway, it seems like it might be useful to handle deletions from {search_total} incrementally with the search_dirty() function the same way that inserts and updates are handled.
I'd like to hear what someone more familiar with the search module thinks of that idea.
Comment | File | Size | Author |
---|---|---|---|
#32 | search-performance.patch | 2.08 KB | gnindl |
#29 | 312390-6.4-2.patch | 1.76 KB | zyxware |
#23 | 312930-speedup-search-update-totals.patch | 2.3 KB | brianV |
#10 | 312390-6.4.patch | 2 KB | Wesley Tanaka |
Comments
Comment #1
Wesley Tanaka CreditAttribution: Wesley Tanaka commentedUntested patch which conceptually implements the proposed change
(update: deleted broken patch)
Comment #2
Anonymous (not verified) CreditAttribution: Anonymous commentedMake the patch against D7.
I turned the Drupal search module off on my site because it was consuming too much of my resources.
Comment #3
Damien Tournoud CreditAttribution: Damien Tournoud commented@Wesley: the search module is not designed to scale to millions of nodes. For high-performance solutions, please look at integration between Drupal and Solr or Xapian.
Comment #4
Anonymous (not verified) CreditAttribution: Anonymous commentedJust because it isn't designed to be scalable to millions doesn't mean we shouldn't fix it. This is a huge issue for sites creating large amounts of nodes. We offer a search module in core, it should work with whatever data we throw at it.
Comment #5
catchWhile the patch is already at needs work, it'll also need to be updated for the new database layer. If this results in a small improvement or negligible effect for sites with smaller numbers of nodes, then there's no harm done fixing it.
Comment #6
Damien Tournoud CreditAttribution: Damien Tournoud commented@earnie: the patch is clearly not a bad idea, but no fulltext search implemented in a SQL database can scale correctly. And of course, there is no such thing as a system that works "whatever data we throw at it". Every system has its limitations, and every system is tuned to a specific scale.
Comment #7
Anonymous (not verified) CreditAttribution: Anonymous commented@Damien: I agree with what you say of course. But when someone is trying to tune out of the box search module we should at least try it out. The other thing would be a test for the search.module so that we can measure the affects of the patch.
Comment #8
Wesley Tanaka CreditAttribution: Wesley Tanaka commentedI'm personally interested to hear from someone familiar with the 6.x search.module whether or not the general idea of keeping track of which rows to delete from {search_total} is a reasonable one.
Where might be a good place to ask that question/get feedback on that issue if not in this bug report?
Comment #9
robertDouglass CreditAttribution: robertDouglass commentedSub.
Comment #10
Wesley Tanaka CreditAttribution: Wesley Tanaka commentedFixed a syntax error in the previously attached patch.
Regenerated it without the -w flag to correct indentation.
Smoke tested the patch against Drupal 6.4 by:
1. Editing a node to contain a nonsense word that didn't previously exist
2. Running cron
3. Confirming that the word was contained in {search_total} by running SELECT * FROM d_search_total WHERE word='nonsenseword';
4. Editing the same node to delete the nonsense word
5. Running cron
6. Confirming that the word no longer appeared in {search_total} by running SELECT * FROM d_search_total WHERE word='nonsenseword';
which worked.
Caveat: I'm pretty sure the patch exacerbates #312385: {search_index} and {search_total} can get out of sync somewhat.
Comment #11
Babalu CreditAttribution: Babalu commentedthx
Comment #12
Wesley Tanaka CreditAttribution: Wesley Tanaka commentedPatch applies cleanly without modification to Drupal 6.5
Comment #13
veelo CreditAttribution: veelo commentedA real world case: our node table has 547 rows. search_index and search_total have 36089 and 12905 rows respectively. Up until I applied this patch, the above query took well over 500 seconds. There have been occasions where SHOW PROCESSLIST showed a host of these queries which obviously were all battling for CPU cycles, bringing our production server to a crawl. Restarting mysql was the only way out. I imagine that this situation occurs if the duration to complete one such query exceeds the cron interval.
With this patch that situation is completely gone. Thank you Wesley, you saved us.
Bastiaan Veelo,
running Drupal 6.5.
Comment #14
Anonymous (not verified) CreditAttribution: Anonymous commentedWith @veelo's comment in #13 I'm making this critical.
@Wesley can you port forward your patch to 7.x-dev please?
Comment #15
catchYes, this needs to go into 7.x first.
Comment #16
Wesley Tanaka CreditAttribution: Wesley Tanaka commented@earnie: My apologies, I don't really know anything about 7.x-dev. =P
Comment #17
Anonymous (not verified) CreditAttribution: Anonymous commented@wesley: See http://drupal.org/handbook/cvs and http://drupal.org/patch. 7.x-dev is the cvs HEAD version.
Comment #18
robertDouglass CreditAttribution: robertDouglass commentedAs I too have been looking at sites with millions of nodes I've got renewed interest in this. Thanks for the effort so far, Wesley.
Comment #19
geerlingguy CreditAttribution: geerlingguy commentedDoes the patch in #10 apply for Drupal 6.x still? I'd like to try it out on stlouisreview.com to see if it lets cron actually run instead of hanging all the time on search indexing (even if it's only indexing one or two nodes, cron times out after 2 minutes).
[Edit: Patch applies cleanly to search.module in Drupal 6.13. Reduced average cron run time for search cron functions from 8 seconds to 1 second, according to the Supercron module.]
Major kudos - and hope we can get Drupal 7 search functions to work so well!
Comment #20
Wesley Tanaka CreditAttribution: Wesley Tanaka commentedgeerlingguy: Yes, it still applies to Drupal 6.12
Comment #21
As If CreditAttribution: As If commentedbravo. sub.
Comment #22
joelstein CreditAttribution: joelstein commentedsubscribing.
Comment #23
brianV CreditAttribution: brianV commentedHere is the patch from #10 ported to D7 / DBTNG.
Patch #10 should be reviewed for D6.
Comment #24
brianV CreditAttribution: brianV commentedJust tagging and editing the title for better organization.
Comment #26
catchComment #28
catchThis is a duplicate of #257916: Improve performance of search totals calculation.
Comment #29
zyxware CreditAttribution: zyxware commentedHere is the patch in #10, updated for Pressflow 6.20.97.
Comment #30
Encarte CreditAttribution: Encarte commentedComment #31
Niklas Fiekas CreditAttribution: Niklas Fiekas commentedSubscribe.
Comment #32
gnindl CreditAttribution: gnindl commentedPorted patch from comment #10 to Drupal core version 6.17. Saved my day, thx guys!
Comment #34
binuraman CreditAttribution: binuraman commented#32: search-performance.patch queued for re-testing.
Comment #36
bleen CreditAttribution: bleen commentedThis needs to be fixed in D8 first
Comment #37
bleen CreditAttribution: bleen commentedoops .. just noticed catch closed this higher up the chain
see #28