I've been looking into the Solr internals a bit, particularly with regard to replication and load balancing.

It seems that commit shouldn't be called on every write to the index, but should only fire a few times a day at most on high-performance sites, via the autoCommit settings in solrconfig.xml.

Optimization is expensive, and the commit creates a snapshot if you are doing replication, which may not be desired...

Should this be a setting? I'm thinking it would be a boolean for now: either the app fires the commit, or it is left to the server at whatever interval the server is configured to use.

Comments

voidberg’s picture

It's not only the snapshot generation: calling commit too often also results in the "too many open files" error (unless you're using a compound index).

robertdouglass’s picture

Category: support » bug
Priority: Normal » Critical

Yes. This must change. JacobSingh, please share your thoughts on the autocommit setting.

JacobSingh’s picture

Background on autocommits and optimize

The following code in solrconfig.xml is where it happens:

<autoCommit>
      <maxDocs>10000</maxDocs>
      <maxTime>1000</maxTime>
 </autoCommit>

What this means is that it will run a commit operation when there are more than 10000 docs waiting to be committed, OR 1 second (1000 ms) after a document has been added (assuming no other autocommit is already pending). I really don't understand why one would want a config like this by default. A more sane value would probably be:

<autoCommit>
      <maxDocs>10000</maxDocs> <!-- maximum uncommitted docs before autocommit is triggered -->
      <maxTime>50000</maxTime> <!-- maximum time (in ms) after adding a doc before an autocommit is triggered -->
</autoCommit>

Unfortunately, this is commented out by default (at least in 1.2.x).

This means that when apachesolr sends over some documents in a batch, a commit will happen as long as no more docs are sent for 50 seconds. This seems pretty good. Then, if there is a backlog for whatever reason, it will run eventually at 10,000 docs (although since most Drupal sites are much smaller, this may be too large for our purposes).

I think we *should* leave the commit() call upon deletion or node_access updates, because those should take effect ASAP.
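For context, that commit() boils down to a single HTTP request against Solr's update handler. A minimal sketch, assuming a default single-core Solr 1.x install (the host, port, and path are assumptions; the echo makes this a dry run so nothing is actually sent):

```shell
# Dry-run sketch of the commit the module sends, as a raw HTTP request.
# SOLR_UPDATE_URL is an assumption -- adjust for your install.
SOLR_UPDATE_URL="http://localhost:8983/solr/update"
CMD="curl $SOLR_UPDATE_URL -H 'Content-type: text/xml' --data-binary '<commit/>'"
# Echo instead of executing; drop the echo to actually send the commit.
echo "$CMD"
```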

Optimization is a different beast. I'm not sure, but I believe we aren't even doing any optimization currently. Perhaps we should suggest that people set it up on their own.

The cheap way is to optimize after every commit using a postCommit handler like this:

<listener event="postCommit" class="solr.RunExecutableListener">
      <str name="exe">optimize</str>
      <str name="dir">solr/bin</str>
      <bool name="wait">true</bool>
</listener>

But this can be heavy. It is probably preferable to optimize on cron once or twice a day. After an optimize, a commit is always called, and generally a snapshot is created for replication.
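To make the cron variant concrete, here's one hedged sketch of a nightly optimize job. The script path, schedule, and Solr URL are all assumptions for a default single-core 1.x install; it only prints the command, so you can inspect it before wiring it up:

```shell
#!/bin/sh
# Sketch: run nightly from the system crontab, e.g.
#   30 3 * * * /usr/local/bin/solr-optimize.sh
# (path and schedule are hypothetical).
SOLR_UPDATE_URL="http://localhost:8983/solr/update"
CMD="curl $SOLR_UPDATE_URL -H 'Content-type: text/xml' --data-binary '<optimize/>'"
# Printed rather than executed here; remove the echo to run it for real.
echo "$CMD"
```

Running this from the system crontab (rather than Drupal's cron) sidesteps PHP timeouts entirely.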

Practical changes Needed in the module

  • Remove the ->commit() calls from the cron update_index section
  • Change the documentation telling users to uncomment the autoCommit settings from their schema.xml
  • Add documentation telling users to add their own optimize calls in cron

K, there it is, should we go for this?

drunken monkey’s picture

In my opinion we should provide the altered schema.xml and the optimize call in cron ourselves (the latter maybe with an option for how often (or whether at all) it gets executed). Telling the user to specify one manually does not seem very user-friendly and it shouldn't be too much work.

JacobSingh’s picture

I'm all for usability too, but there are a couple of concerns here:

1. The setting isn't in schema.xml, but solrconfig.xml... This isn't a big issue, but instead of providing (and having to maintain) a solrconfig.xml, I think just telling them to uncomment a few lines is not too bad. After all, they are also configuring a Java server :)

2. Running optimize via Drupal cron. While I like the premise, I don't know if this will fly. Optimize commands can take a REALLY long time, and I'd be worried about that when cron is supposed to run every 15 minutes on most sites. Also, it could cause PHP timeouts if the user has default timeouts set (unless we unset them). It could be done, but I don't like the idea of combining them. Someone running a Solr server (which is different from using the module) will have to be a bit of a sysadmin to get the thing going anyway. I think if our instructions are good, they should be able to add a cronjob for optimization...

What do you think?

drunken monkey’s picture

Sorry, I didn't know about solrconfig.xml - in your "practical changes" list you say it needs to be changed in schema.xml, and I had already forgotten the other mention at the beginning of your comment. In that case, this of course makes sense.

Regarding the cron optimize call: I wouldn't have voted for executing it on every Drupal cron run, but what you say about PHP timing out does, of course, apply. So yeah, you're right, let's do it that way!

john.money’s picture

Another possible option is to introduce a dependency on the cronplus module. We've used it extensively in production environments to reliably fire certain cron tasks once per hour or once per day. Our Drupal cron hook is set at 15 minutes.

*edit: It could also be a conditional dependency, like:

/**
 * Daily callback invoked by cronplus (fires at most once per day).
 */
function apachesolr_cronplus_daily($now, $last_cron, $last_this) {}

/**
 * Implementation of hook_cron(); fallback when cronplus is not installed.
 */
if (!function_exists('cronplus_cron')) {
  function apachesolr_cron() {}
}

robertdouglass’s picture

Isn't there a way to tell Solr from cron to optimize asynchronously? Why does the PHP thread have to hang around?

JacobSingh’s picture

That's true: we could use popen(), or just system() with a backgrounding indicator, but I think it is unneeded. Consider this:

1. Users who have drupal cron running can setup cron.
2. Users who are running a Solr server have a dedicated box or VPS, so they are not using poormanscron.
3. Users who are using a Solr server they do not maintain should not be sending index requests; the server should initiate these requests. It is possible for a client to initiate an optimize command from outside the box if the admin allows it, but it is not recommended.
4. Optimize is not strictly needed. Certainly it speeds things up, but skipping it is only about a 10% performance hit.
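As a sketch of the non-blocking variant mentioned above: rather than popen() tricks, Solr 1.x accepts waitFlush/waitSearcher attributes on the optimize message itself, so the request can return immediately. The URL is again an assumption, and the echo keeps this a dry run:

```shell
# Sketch: an optimize request that returns without blocking the caller.
# waitFlush/waitSearcher are standard Solr 1.x attributes; the URL is assumed.
SOLR_UPDATE_URL="http://localhost:8983/solr/update"
PAYLOAD='<optimize waitFlush="false" waitSearcher="false"/>'
CMD="curl $SOLR_UPDATE_URL -H 'Content-type: text/xml' --data-binary '$PAYLOAD'"
echo "$CMD"   # dry run; drop the echo to send it
```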

I'm okay going forward with it, but I think the larger issue is: should the indexing client be telling the server when to run operations needed for replication and optimization, or should the server do it based on its needs?

From http://wiki.apache.org/solr/SolrPerformanceFactors:

Every time a new index searcher is opened, some autowarming of the cache occurs before Solr hands queries over to that version of the collection. It is crucial to individual query latency that queries have warmed caches.

The three relevant parameters:

* The number/frequency of snapshots is completely up to the indexing client. Therefore, the number of versions of the collection is determined by the client's activity.
* The snappullers are cron'd. They could run every second, once a day, or anything in between. When they run, they will retrieve only the most recent collection that they do not have.
* Cache autowarming is configured for each cache in solrconfig.xml.

So if a client is spamming commits, it will create a huge load on slave servers to keep updating, and it will also be a drag on disk space.

Optimize commands can be run more often, but they also create CPU load.

What about this:

By default:
We provide a commit and an optimize command (perhaps backgrounded) after the entire cron operation has finished.
AND:
We create a new setting (radio) for whether you are handling this on the client side or the server side.

I see the need for usability, but from my discussions about this on the Solr list, all the responses I've gotten suggest the latter is the best practice.

robertdouglass’s picture

Ok. I'm convinced. The argument that the client shouldn't be telling the server when to do maintenance is compelling. I'd be ok with taking optimize out altogether and documenting how to configure solrconfig.xml.

pwolanin’s picture

Version: 5.x-1.x-dev » 6.x-1.x-dev
Status: Active » Needs review
Attachment: patch, 2.8 KB

OK, initial go at the solrconfig.xml and apachesolr_search changes.

pwolanin’s picture

Attachment: patch, 3.33 KB

Better patch - committing to 6.x.

pwolanin’s picture

Attachment: patch, 2.49 KB

Updating the README.

pwolanin’s picture

Attachment: patch, 2.98 KB

Better version, committing to 6.x.

pwolanin’s picture

Attachment: patch, 2.27 KB

I think we should add back the commit after we delete the index - bad UX otherwise.

pwolanin’s picture

Committed the patch in #15 to 6.x.

pwolanin’s picture

Status: Needs review » Fixed
jbruvoll’s picture

Hi,

Has this patch made it into any official Drupal release yet, or, even better: which release can we expect to see this patch in?

Thanks
Jan

pwolanin’s picture

We will likely make a beta release within the coming week - as of now you'd need to check out 6.x from CVS.

pwolanin’s picture

Status: Fixed » Closed (fixed)