I'm not sure if this slow indexing is intended or I'm doing something incorrectly, but Apache Solr has been indexing my site at a very slow pace. After about 24 hours, it has only indexed about 1100 nodes out of a possible 12000+ nodes in total. Is there a way to speed things up?
Thanks in advance!
| Comment | File | Size | Author |
|---|---|---|---|
| #10 | drush.patch | 4.58 KB | erlendstromsvik |
| #9 | solr_scripts.zip | 8.65 KB | erlendstromsvik |
Comments
Comment #1
erlendstromsvik commented
For ordinary sites using cron for indexing, just increase the number of nodes indexed per cron run under admin/settings/apachesolr and have cron run as often as you require.
We have some large sites where we have hacked the apachesolr module. Those sites index up to 432,000 nodes a day, or 18,000 nodes per hour. If the first tip doesn't work, I could post the hacks here. They would require drush to work.
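To see how the two settings from the first tip interact, here is a rough back-of-the-envelope sketch. The function names are made up for illustration, and the figure of 50 nodes per run is only an assumption chosen because it roughly matches the 1100 nodes in 24 hours reported above with an hourly cron; check your own value under admin/settings/apachesolr.

```python
import math


def nodes_per_day(nodes_per_run: int, cron_interval_minutes: int) -> int:
    """Upper bound on nodes indexed per day for a given cron setup."""
    runs_per_day = (24 * 60) // cron_interval_minutes
    return nodes_per_run * runs_per_day


def days_to_index(total_nodes: int, nodes_per_run: int,
                  cron_interval_minutes: int) -> int:
    """Whole days needed to index total_nodes at that rate."""
    per_day = nodes_per_day(nodes_per_run, cron_interval_minutes)
    return math.ceil(total_nodes / per_day)


# Assumed: 50 nodes per run, cron once per hour.
print(nodes_per_day(50, 60))          # 1200
print(days_to_index(12000, 50, 60))   # 10

# Assumed: 200 nodes per run, cron every 15 minutes.
print(days_to_index(12000, 200, 15))  # 1
```

This ignores the time each run takes, so treat the numbers as a best case.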
Comment #2
Quarantine commented
Thanks for the tip, quaoar! I'll try out the first tip first and see if it's sufficient for my needs or not. :)
Comment #3
Quarantine commented
Oh yeah, I just discovered that my poormanscron module can run at most once per hour. I can't seem to find a setting that would make it run every 30 minutes or at any other shorter interval.
Comment #4
Daniel Ferreira Jorge commented
I have a similar problem. I have a website with a very large number of nodes (50 million), but they are very small nodes (only titles, no body). The problem with my setup seems to be not the time Solr takes to index, but the time the Drupal database API takes to retrieve the nodes from MySQL and send them to Solr. This is the actual bottleneck.
@quaoar, I already tried increasing the maximum number of nodes sent per cron run to 1000-5000, but it had no significant effect. If you have a better way to gather the nodes from the database (for instance, bypassing the Drupal API and working directly with MySQL), can you post it here? My website is a shopping comparison engine that gathers product feeds from merchants. When INSERTING and UPDATING products in Drupal, I did something similar: I insert the nodes directly into the database, bypassing Drupal completely. If I use the Drupal API to insert the nodes, it takes about 5 days to insert/update 50 million products, but if I insert directly into MySQL, it takes about 3 hours.
Thanks
Comment #5
geerlingguy commented
Also, make sure you disable the core search indexer to speed things up mightily:
http://www.drupalcoder.com/story/742-performance-tip-disable-drupals-cor...
(in case you missed that... it really does bog things down).
Comment #6
reallyordinary commented
I'd be really interested to see the hacks. I'm just getting started with Solr and have a site with 20 million nodes that I need to get indexed as quickly as possible.
Comment #7
pwolanin commented
If you need to index via Drupal, try using drush to invoke just the Solr cron hook, not the full Drupal cron.
For 20M nodes, you might want to look at DIH, though it will require a bunch of custom transformations.
Comment #8
geerlingguy commented
Another way to do this is to use Elysia Cron to run the cron hook more times per hour (on one of my sites, I have search_cron and apachesolr_cron running every 5 minutes).
Comment #9
erlendstromsvik commented
We have been working with Drupal and Solr for some time now. We previously used our own hacks in the apachesolr modules, but last month we worked with a Solr expert and needed faster indexing.
So with very little time to do it, we hastily wrote a python script to send information to our solr-server. I've included this script with this post.
One warning: this script is written to suit our own selfish needs and is not for general use as is. I have tried to remove most of the code which is strictly for our own use, so as not to make this more confusing than necessary. The good part is that we are indexing 14 million nodes per day, close to 600,000 nodes an hour, with very little server load and a small memory footprint.
To use this, you have to look at both index_nodes.py and Lib.py. There are some values there which should be set before using.
Hopefully this can be a starting point for anyone wanting to index other large Drupal sites. Just contact me if you need some help or just want to reprimand me for writing awful python code =/
If you have a concrete problem with a site I could always try to help.
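For anyone who wants the general idea before opening the zip: the sketch below is not taken from index_nodes.py or Lib.py. It only illustrates, under my own assumptions, how rows fetched from the database could be turned into Solr's XML `<add>` update message in batches. The field names (`id`, `nid`, `title`) and the function names are hypothetical; use whatever fields your Solr schema defines.

```python
import xml.etree.ElementTree as ET


def build_add_message(rows):
    """Build a Solr XML <add> message from a list of dict rows."""
    add = ET.Element("add")
    for row in rows:
        doc = ET.SubElement(add, "doc")
        for name, value in row.items():
            if value is None:
                continue  # skip empty columns instead of sending "None"
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)
    # ElementTree escapes characters such as & and < for us.
    return ET.tostring(add, encoding="unicode")


def batches(rows, size):
    """Yield consecutive slices of rows with at most size items each."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


# Hypothetical rows, as they might come back from a direct SQL query.
rows = [
    {"id": "site/node/1", "nid": 1, "title": "First product"},
    {"id": "site/node/2", "nid": 2, "title": "Cables & adapters"},
    {"id": "site/node/3", "nid": 3, "title": "Third product"},
]

for batch in batches(rows, 2):
    print(build_add_message(batch))
```

Each message would then be POSTed to the Solr update handler, followed by a commit once all batches are sent. Sending many documents per request is what keeps the overhead per node low.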
Comment #10
erlendstromsvik commented
These are our changes to the apachesolr modules:
- apachesolr.admin.inc : added a select field to apachesolr_settings for number of iterations
- apachesolr.admin.inc : added a checkbox to apachesolr_settings to disable indexing with cron (this should be required for anything hooking into hook_cron)
- apachesolr.module : moved everything out of hook_cron to apachesolr_cron_scheduled and added a check for cron disabled
- apachesolr_search.module : added check for cron disabled in hook_cron
- apachesolr_search.module :
- apachesolr.drush.inc : added a constant APACHESOLR_DRUSH_LOCK
- apachesolr.drush.inc : added new command solr-update-index
I've included a patch for the apachesolr-6.x-2.0-beta3 branch. Didn't get to test it right now, since I'm not at the office. If someone has problems I'll test it with a test site after the weekend.
The key to this patch is to max out the memory limit and max execution time of Apache/PHP. By running the drush command "solr-update-index" as often as possible from ordinary cron, and by tuning the index limit and iteration limit, we have been able to index around 500,000 nodes per day.
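The general pattern behind these changes can be sketched roughly as follows. This is not the patch's code (the patch is PHP); it is a Python illustration under my own assumptions: that the lock exists to stop overlapping runs when the command is started very often, and that each run indexes up to `limit` nodes per iteration for a fixed number of iterations. The names `run_with_lock` and `index_batch` are hypothetical.

```python
import os
import time


def run_with_lock(lock_path, index_batch, limit, iterations, max_seconds):
    """Run up to `iterations` batches unless another run holds the lock.

    index_batch(limit) must index at most `limit` nodes and return how
    many it actually indexed. Returns the total indexed, or None if the
    lock was already taken.
    """
    try:
        # O_EXCL makes creation fail if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None  # a previous run is still working; skip this one

    indexed = 0
    deadline = time.monotonic() + max_seconds
    try:
        os.write(fd, str(os.getpid()).encode())
        for _ in range(iterations):
            if time.monotonic() >= deadline:
                break  # stay inside the execution time budget
            count = index_batch(limit)
            indexed += count
            if count < limit:
                break  # nothing left to index
    finally:
        os.close(fd)
        os.remove(lock_path)  # always release the lock
    return indexed
```

With this shape, cron can fire the command far more often than a single run takes: a run that finds the lock in place exits immediately, and the limit and iteration count decide how much work each successful run does.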
This is what our calling line in cron.d looks like:
Comment #11
geerlingguy commented
You could simplify the cron line by typing */2 for "every two minutes":
Comment #12
erlendstromsvik commented
Thanks for the tip! =)