I've been looking into using Drupal with millions of nodes. One problem occurs when the search module is enabled, and after {search_total} has grown to a reasonably large size.

search_update_totals() runs this query:
SELECT t.word AS realword, i.word
FROM search_total t
LEFT JOIN search_index i ON t.word = i.word
WHERE i.word IS NULL
which does a full table scan on {search_total}

Given that this bug exists: http://drupal.org/node/312385
and thus the {search_total} table isn't 100% guaranteed to be up to date anyway, it seems like it might be useful to handle deletions from {search_total} incrementally with the search_dirty() function the same way that inserts and updates are handled.

I'd like to hear what someone more familiar with the search module thinks of that idea.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Wesley Tanaka’s picture

Untested patch which conceptually implements the proposed change
(update: deleted broken patch)

Anonymous’s picture

Version: 6.4 » 7.x-dev
Status: Active » Needs work

Make the patch against D7.

I turned the Drupal search module off on my site because it was consuming too much of my resources.

Damien Tournoud’s picture

Status: Needs work » Closed (won't fix)

@Wesley: the search module is not designed to scale to millions of nodes. For high-performance solutions, please look at integration between Drupal and Solr or Xapian.

Anonymous’s picture

Just because it isn't designed to be scalable to millions doesn't mean we shouldn't fix it. This is a huge issue for sites creating large amounts of nodes. We offer a search module in core, it should work with whatever data we throw at it.

catch’s picture

Status: Closed (won't fix) » Needs work

While the patch is already at needs work, it'll also need to be updated for the new database layer. If this results in a small improvement or negligible effect for sites with smaller numbers of nodes, then there's no harm done fixing it.

Damien Tournoud’s picture

@earnie: the patch is clearly not a bad idea, but no fulltext search implemented in a SQL database can scale correctly. And of course, there is no such thing as a system that works "whatever data we throw at it". Every system has its limitations, and every system is tuned to a specific scale.

Anonymous’s picture

@Damien: I agree with what you say of course. But when someone is trying to tune out of the box search module we should at least try it out. The other thing would be a test for the search.module so that we can measure the affects of the patch.

Wesley Tanaka’s picture

I'm personally interested to hear from someone familiar with the 6.x search.module whether or not the general idea of keeping track of which rows to delete from {search_total} is a reasonable one.

Where might be a good place to ask that question/get feedback on that issue if not in this bug report?

robertDouglass’s picture

Sub.

Wesley Tanaka’s picture

Version: 7.x-dev » 6.4
Status: Needs work » Needs review
FileSize
2 KB

Fixed a syntax error in the previously attached patch.
Regenerated it without the -w flag to correct indentation.

Smoke tested the patch against Drupal 6.4 by:

1. Editing a node to contain a nonsense word that didn't previously exist
2. Running cron
3. Confirming that the word was contained in {search_total} by running SELECT * FROM d_search_total WHERE word='nonsenseword';
4. Editing the same node to delete the nonsense word
5. Running cron
6. Confirming that the word no longer appeared in {search_total} by running SELECT * FROM d_search_total WHERE word='nonsenseword';

which worked.

Caveat: I'm pretty sure the patch exacerbates #312385: {search_index} and {search_total} can get out of sync somewhat.

Babalu’s picture

thx

Wesley Tanaka’s picture

Version: 6.4 » 6.5

Patch applies cleanly without modification to Drupal 6.5

veelo’s picture

A real world case: our node table has 547 rows. search_index and search_total have 36089 and 12905 rows respectively. Up until I applied this patch, the above query took well over 500 seconds. There have been occasions where SHOW PROCESSLIST showed a host of these queries which obviously were all battling for CPU cycles, bringing our production server to a crawl. Restarting mysql was the only way out. I imagine that this situation occurs if the duration to complete one such query exceeds the cron interval.

With this patch that situation is completely gone. Thank you Wesley, you saved us.

Bastiaan Veelo,
running Drupal 6.5.

Anonymous’s picture

Priority: Normal » Critical
Status: Needs review » Reviewed & tested by the community

With @veelo's comment in #13 I'm making this critical.

@Wesley can you port forward your patch to 7.x-dev please?

catch’s picture

Version: 6.5 » 7.x-dev
Status: Reviewed & tested by the community » Needs work

Yes, this needs to go into 7.x first.

Wesley Tanaka’s picture

@earnie: My apologies, I don't really know anything about 7.x-dev. =P

Anonymous’s picture

@wesley: See http://drupal.org/handbook/cvs and http://drupal.org/patch. 7.x-dev is the cvs HEAD version.

robertDouglass’s picture

As I too have been looking at sites with millions of nodes I've got renewed interest in this. Thanks for the effort so far, Wesley.

geerlingguy’s picture

Does the patch in #10 apply for Drupal 6.x still? I'd like to try it out on stlouisreview.com to see if it lets cron actually run instead of hanging all the time on search indexing (even if it's only indexing one or two nodes, cron times out after 2 minutes).

[Edit: Patch applies cleanly to search.module in Drupal 6.13. Reduced average cron run time for search cron functions from 8 seconds to 1 second, according to the Supercron module.]

Major kudos - and hope we can get Drupal 7 search functions to work so well!

Wesley Tanaka’s picture

geerlingguy: Yes, it still applies to Drupal 6.12

As If’s picture

bravo. sub.

joelstein’s picture

subscribing.

brianV’s picture

Status: Needs work » Needs review
FileSize
2.3 KB

Here is the patch from #10 ported to D7 / DBTNG.

Patch #10 should be reviewed for D6.

brianV’s picture

Title: search_update_totals() slow with large number of nodes » Performance: search_update_totals() slow with large number of nodes
Issue tags: +Performance

Just tagging and editing the title for better organization.

Status: Needs review » Needs work

The last submitted patch failed testing.

catch’s picture

Status: Needs work » Needs review

Status: Needs review » Needs work

The last submitted patch failed testing.

catch’s picture

Status: Needs work » Closed (duplicate)
zyxware’s picture

FileSize
1.76 KB

Here is the patch in #10, updated for Pressflow 6.20.97.

Encarte’s picture

Niklas Fiekas’s picture

Subscribe.

gnindl’s picture

Version: 7.x-dev » 6.x-dev
Status: Closed (duplicate) » Needs review
FileSize
2.08 KB

Ported patch from comment #10 to Drupal core version 6.17. Saved my day, thx guys!

Status: Needs review » Needs work
Issue tags: -Performance

The last submitted patch, search-performance.patch, failed testing.

binuraman’s picture

Status: Needs work » Needs review

#32: search-performance.patch queued for re-testing.

Status: Needs review » Needs work
Issue tags: +Performance

The last submitted patch, search-performance.patch, failed testing.

bleen’s picture

Version: 6.x-dev » 8.x-dev

This needs to be fixed in D8 first

bleen’s picture

Status: Needs work » Closed (duplicate)

oops .. just noticed catch closed this higher up the chain

see #28