I've been looking into the Solr internals a bit, particularly with regard to replication and load balancing.
It seems to me that commit shouldn't be called on every write to the index, but should fire only a few times a day at most on high-performance sites, via the autoCommit settings in solrconfig.xml.
Optimization is expensive, and the commit creates a snapshot if you are doing replication, which may not be desired...
Should this be a setting? I'm thinking it would be a boolean for now: either the app fires the commit, or the server does it at whatever interval it is configured for.
Comments
Comment #1
voidberg commented
It's not only the snapshot generation; calling commit too often also results in the "too many open files" error (unless you're using a compound index).
Comment #2
robertdouglass commented
Yes. This must change. JacobSingh, please share your thoughts on the autocommit setting.
Comment #3
JacobSingh commented
Background on autocommits and optimize
The following code in solrconfig.xml is where it happens:
What this means is that Solr will run a commit operation when more than 10000 docs are waiting to be committed OR 1 second (1000 ms) after a document has been added (assuming no other autocommit is in play). I really don't understand why one would want a config like this by default. A saner value would probably be:
Unfortunately, this is commented out by default (at least in 1.2x).
This means that when apachesolr sends over some documents in a batch, a commit will happen as long as it doesn't send any more docs for 50 sec. This seems pretty good. Then, if there is a backlog for whatever reason, the commit will eventually run at 10,000 docs (although since most Drupal sites are much smaller, this may be too large for our purposes).
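As a rough sketch of the values discussed above (10,000 pending documents or 50 seconds), the autocommit block in solrconfig.xml might look like the following. The enclosing `updateHandler` element and its class are taken from the stock example config and are an assumption about the local setup; the numbers are only the ones floated in this comment, not tested recommendations.

```xml
<!-- Sketch only: autocommit settings inside the update handler of solrconfig.xml. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <!-- Commit once this many documents are waiting to be committed. -->
    <maxDocs>10000</maxDocs>
    <!-- Or commit this many milliseconds after a document was added (50 s). -->
    <maxTime>50000</maxTime>
  </autoCommit>
</updateHandler>
```

Smaller sites would presumably want a lower `maxDocs`, per the caveat above.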
I think we *should* leave the commit() call upon deletion or node_access updates, because those should take effect ASAP.
Optimization is a different beast. I'm not sure, but I believe we aren't even doing any optimization currently. Perhaps we should suggest that people set it up on their own.
The cheap way is to optimize after every commit using a postCommit handler like this:
But this can be heavy. It is probably preferable to optimize on cron once or twice a day. After an optimize, a commit is always called, and generally a snapshot is created for replication.
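For illustration, the snapshot-on-optimize behaviour is usually wired up with a listener in solrconfig.xml along these lines. This is a sketch modelled on the commented-out listener in the stock example config; the script name and directory are assumptions and depend on where the replication scripts are installed.

```xml
<!-- Sketch only: placed inside the <updateHandler> element of solrconfig.xml. -->
<listener event="postOptimize" class="solr.RunExecutableListener">
  <!-- Snapshot script from Solr's script-based replication (assumed name). -->
  <str name="exe">snapshooter</str>
  <!-- Directory to run the script from; adjust to the local install. -->
  <str name="dir">solr/bin</str>
  <!-- Block the calling thread until the script has finished. -->
  <bool name="wait">true</bool>
</listener>
```

Using `postOptimize` rather than `postCommit` means a snapshot is taken only after an optimize, not after every commit.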
Practical changes needed in the module
K, there it is, should we go for this?
Comment #4
drunken monkey
In my opinion we should provide the altered schema.xml and the optimize call in cron ourselves (the latter maybe with an option for how often, or whether at all, it gets executed). Telling the user to specify one manually does not seem very user-friendly, and it shouldn't be too much work.
Comment #5
JacobSingh commented
I'm all for usability too, but there are a couple of concerns here:
1. The setting isn't in schema.xml, but solrconfig.xml... This isn't a big issue, but instead of providing (and having to maintain) a solrconfig.xml, I think just telling them to uncomment a few lines is not too bad. After all, they are also configuring a Java server :)
2. Running optimize via Drupal cron. While I like the premise, I don't know if this will fly. Optimize commands can take a REALLY long time, and I'd be worried about that when cron is supposed to run every 15 min on most CMSs. Also, it could cause PHP timeouts if the user has default timeouts set (unless we unset them). It could be done, but I don't like the idea of combining them. Someone running a Solr server (which is different from using the module) will have to be a bit of a sysadmin to get the thing going anyway. I think if our instructions are good, they should be able to add a cronjob for optimization...
What do you think?
Comment #6
drunken monkey
Sorry, didn't know about solrconfig.xml. In your "practical changes" list you say it needs to be changed in schema.xml, and I had already forgotten the other mention at the beginning of your comment. In this case, this of course makes sense.
Regarding the cron optimize call: I wouldn't have voted for executing it on every Drupal cron run, but what you say about PHP timing out does of course apply. So yeah, you're right, let's do it that way!
Comment #7
john.money commented
Another possible option is to introduce a dependency on the cronplus module. We've used it extensively in production environments to reliably fire certain cron tasks once per hour or once per day. Our Drupal cron hook is set at 15 minutes.
*edit: Could also be conditional dependency like
Comment #8
robertdouglass commented
Isn't there a way to tell Solr from cron to optimize asynchronously? Why does the PHP thread have to hang around?
Comment #9
JacobSingh commented
That's true, we could use popen() or just system() with a background indicator, but I think it is unneeded. Consider this:
1. Users who have Drupal cron running can set up cron.
2. Users who are running a Solr server have a dedicated box or VPS, so they are not using poormanscron.
3. Users who are using a Solr server they do not maintain should not be sending index requests. The server should initiate these requests. It is possible for a client to initiate an optimize command from outside the box if the admin allows it, but it is not recommended.
4. Optimize is not needed. Certainly it speeds things up, but skipping it is only about a 10% performance hit.
I'm okay going forward with it, but I think the larger issue is: should the indexing client be telling the server when to run operations needed for replication and for optimization, or should the server do it based on its own needs?
From http://wiki.apache.org/solr/SolrPerformanceFactors:
So if a client is spamming commits, it will create a huge load on slave servers to keep updating, and it will also be a drag on disk space.
Optimize commands can be run more often, but they also create CPU load.
What about this:
By default:
We provide a commit and an optimize command (perhaps backgrounded) after the entire cron operation has finished.
AND:
We create a new setting (radio) for whether you are handling this on the client side or the server side.
I see the need for usability, but from my discussions about this on the Solr list, all the responses I've gotten suggest the latter is the best practice.
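For reference, the client-side commit and optimize commands in the proposal above correspond to Solr's XML update messages, roughly as sketched here. Each message would be sent as its own POST body to the server's update handler; the exact URL depends on the install and is not shown.

```xml
<!-- Sketch only: each message below is a separate request body. -->

<!-- Make pending adds and deletes visible to searchers. -->
<commit/>

<!-- Merge index segments; Solr also performs a commit as part of this. -->
<optimize/>
```

With the server-side option, neither message is sent by the client and the timing is left to solrconfig.xml and the server's own cron.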
Comment #10
robertdouglass commented
Ok. I'm convinced. The argument that the client shouldn't be telling the server when to do maintenance is compelling. I'd be ok with taking optimize out altogether and documenting how to configure solrconfig.xml.
Comment #11
pwolanin commented
OK, initial go at the solrconfig.xml and apachesolr_search change.
Comment #12
pwolanin commented
Better patch; committing to 6.x.
Comment #13
pwolanin commented
Updating the README.
Comment #14
pwolanin commented
Better version; committing to 6.x.
Comment #15
pwolanin commented
I think we should add back the commit after we delete the index; it's bad UX otherwise.
Comment #16
pwolanin commented
Committed the patch in #15 to 6.x.
Comment #17
pwolanin commented
Comment #18
jbruvoll commented
Hi,
Has this patch made it into any official Drupal release yet? Or, possibly better: which release could we expect to see it in?
Thanks
Jan
Comment #19
pwolanin commented
We will likely make a beta release within the coming week. As of now you'd need to check out 6.x from CVS.
Comment #20
pwolanin commented