Closed (fixed)
Project: Multisite Search
Version: 6.x-2.x-dev
Component: Code
Priority: Normal
Category: Feature request
Assigned: Unassigned
Reporter:
Created: 16 Oct 2008 at 11:47 UTC
Updated: 25 Oct 2010 at 17:00 UTC
The way I read this module's cron hook, it seems that on every cron run you delete the entire search index, select the entire search index from each of the multisites, and insert the rows back into the multisite search index. Is this true?
Have you tried this with many multisites and reasonably large indexes? I can't imagine that this will scale (feel free to argue otherwise!).
Also, the search index is not supposed to have distinct words in it, so I think it is quite possible that this line is a bug:
// ?????? how to proceed with this in a better way
// insert into search total -- need to have done without cron job
$res3 = db_query("SELECT DISTINCT (word) FROM multisite_drupal_search_index");
But I'll let you judge since I'm not wholly familiar with the way the module is supposed to work.
Comments
Comment #1
phdhiren commented
How should a deleted node's content be handled in the search results? Maybe that is the reason the whole index is being deleted.
Comment #2
techrobo commented
I assume this module works on the single-database multisite concept and aggregates the search tables of all the other sites on the base site.
In that case, if a node is deleted on one site, the deletion is probably not propagated to the multisite module's search tables, and that could be the reason the whole table is truncated and rebuilt from the search tables of the other sites.
A workaround could be to delete the rows from the multisite module's search tables whenever a node is deleted from the source site. Then you probably would not need to truncate and rebuild the search tables all over again.
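The workaround described above could look roughly like the following sketch. It is plain PHP using PDO with an in-memory SQLite database so that it runs on its own. The table layout (a `site` column next to `sid` and `type`), the table names, and the function name `multisite_search_forget_node()` are assumptions made for illustration, not the module's actual schema or API. In Drupal 6 such a call would presumably be made from the module's hook_nodeapi() implementation when the operation is 'delete'.

```php
<?php
// Hypothetical aggregate tables; the real module's schema may differ.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE multisite_search_index (site TEXT, sid INTEGER, type TEXT, word TEXT, score REAL)");
$db->exec("CREATE TABLE multisite_search_dataset (site TEXT, sid INTEGER, type TEXT, data TEXT)");

// Remove one node's rows from the aggregate tables instead of truncating them.
// Returns the total number of rows deleted.
function multisite_search_forget_node(PDO $db, $site, $nid) {
  $removed = 0;
  foreach (array('multisite_search_index', 'multisite_search_dataset') as $table) {
    $stmt = $db->prepare("DELETE FROM $table WHERE site = ? AND sid = ? AND type = 'node'");
    $stmt->execute(array($site, $nid));
    $removed += $stmt->rowCount();
  }
  return $removed;
}

// Sample rows: two nodes on site_a, one node on site_b.
$db->exec("INSERT INTO multisite_search_index VALUES ('site_a', 1, 'node', 'drupal', 1.0)");
$db->exec("INSERT INTO multisite_search_index VALUES ('site_a', 2, 'node', 'search', 1.0)");
$db->exec("INSERT INTO multisite_search_index VALUES ('site_b', 1, 'node', 'drupal', 1.0)");
$db->exec("INSERT INTO multisite_search_dataset VALUES ('site_a', 1, 'node', ' drupal ')");
$db->exec("INSERT INTO multisite_search_dataset VALUES ('site_a', 2, 'node', ' search ')");
$db->exec("INSERT INTO multisite_search_dataset VALUES ('site_b', 1, 'node', ' drupal ')");

// Node 1 on site_a was deleted: drop only its rows.
$removed = multisite_search_forget_node($db, 'site_a', 1);
$left = (int) $db->query("SELECT COUNT(*) FROM multisite_search_index")->fetchColumn();
```

Only the rows belonging to the deleted node on the given site are touched, so node 1 on site_b stays searchable.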
Comment #3
grawat commented
I'm trying to use this module on a multisite installation with about 220 other sites (they are all new and have very little content in them), and when cron runs, this module causes cron to exceed the time limit and abort. I don't know if the module is being maintained, but in its current form it's not going to work for large installations.
Comment #4
robertdouglass commented
grawat: you may be better off using ApacheSolr, which also has a multisite search capability.
Comment #5
grawat commented
Thanks.
Comment #6
jeff.cote commented
A number of changes can be made to reduce the time it takes to rebuild the tables. First, you can copy over only the published nodes. Second, you can use a single query per site to copy over all of that site's entries at once, instead of a PHP loop that copies them over one entry at a time.
Also, there is a snippet that removes the custom '404 page not found' from the search results.
file: multisite_search.module
function: multisite_search_cron
original lines are:
change lines to:
Comment #7
earthday47 commented
Interesting snippet... I'll look into it further and test.
I have to look closer at the code, but from my initial run-throughs, each site maintains its own search index, which is then aggregated when the search is run. This is, of course, very inefficient, but the first question that came to my mind is: where should you run cron? On only one site? On any site?
However, you can share the 4 database tables among all the sites, which will prevent 200 indexes from appearing:
I don't know if it would solve the problem for the 200+ multisite installation, but it's a start...
Comment #8
earthday47 commented
New version (6.x-2.0) has been committed!
I looked closely at jeff's code, and at the way core search.module works, and I don't think it's necessary to pull data from the node table. All the hook_cron() call is doing is copying the search_dataset table to multisite_search_dataset, and the published permissions, etc., are all handled by core Search.
I did take inspiration from #6 and removed the while() loops in favor of an INSERT INTO ... SELECT statement:
One query's better than 500!
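To illustrate the difference, here is a minimal sketch of the set-based approach. It is not the code committed in 6.x-2.0. It uses PDO with an in-memory SQLite database so that it is self-contained; the column layout, the extra `site` column, and the helper name `multisite_search_copy_dataset()` are assumptions, and on a real multisite the source table would carry each site's own table prefix.

```php
<?php
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// Stand-ins for one site's core table and the shared aggregate table.
$db->exec("CREATE TABLE search_dataset (sid INTEGER, type TEXT, data TEXT)");
$db->exec("CREATE TABLE multisite_search_dataset (site TEXT, sid INTEGER, type TEXT, data TEXT)");

// Fill the source table with 500 sample rows.
$insert = $db->prepare("INSERT INTO search_dataset VALUES (?, 'node', ?)");
for ($i = 1; $i <= 500; $i++) {
  $insert->execute(array($i, "content of node $i"));
}

// Copy a site's rows with one set-based statement rather than a PHP while() loop.
// Old rows for the site are cleared first so repeated runs do not duplicate data.
function multisite_search_copy_dataset(PDO $db, $site) {
  $delete = $db->prepare("DELETE FROM multisite_search_dataset WHERE site = ?");
  $delete->execute(array($site));
  $copy = $db->prepare(
    "INSERT INTO multisite_search_dataset (site, sid, type, data)
     SELECT ?, sid, type, data FROM search_dataset"
  );
  $copy->execute(array($site));
  return $copy->rowCount();
}

$copied = multisite_search_copy_dataset($db, 'site_a');
```

The database does the copying in a single statement, so PHP never has to fetch and re-insert the rows one by one.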
Also, for comment #3, there is a new variable, "TTL", that you can set on each site, so the site won't reindex on every cron run. Good practice might be to set the master site's TTL to 0, and then set the others to some high number, such as 10000. I haven't tested this extensively, so we may want to revisit it.
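One way such a TTL check might work is sketched below. The function name `multisite_search_needs_reindex()` is hypothetical, and the sketch assumes the TTL is a number of seconds compared against the time of the last rebuild; the thread does not state the unit. In Drupal 6 the last-run timestamp would presumably be stored with variable_get() and variable_set().

```php
<?php
// Decide whether this cron run should rebuild the site's part of the index.
// $ttl is assumed to be the number of seconds between rebuilds; 0 means "every cron run".
function multisite_search_needs_reindex($last_run, $ttl, $now) {
  return ($now - $last_run) >= $ttl;
}

$now = 1288026000; // fixed timestamp so the example is deterministic

// Master site with TTL 0: rebuilds on every cron run.
$master = multisite_search_needs_reindex($now - 60, 0, $now);
// Other site with TTL 10000, rebuilt a minute ago: skipped.
$other_fresh = multisite_search_needs_reindex($now - 60, 10000, $now);
// Other site with TTL 10000, last rebuilt 20000 seconds ago: rebuilt.
$other_stale = multisite_search_needs_reindex($now - 20000, 10000, $now);
```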
What about a checkbox: "Re-index on cron"?