We have a major installation of this module on one of our sites, and unfortunately it's causing us a lot of issues. The apachesolr_index_entities_node table is enormous: roughly 400 GB. It's indexing more than 25,000 PDFs, some of them many pages long and quite large.
It's causing, or is related to, the following issues:
- We've had to change MySQL settings just to be able to save the PDF data and to export it for backups or for replication to dev environments.
- Import scripts for migrations (bulk node additions, essentially) time out because of the search-and-delete required in the _insert and _update hooks.
- Seemingly at random, various nodes are not added to the Solr index at creation.
- Editing nodes times out, causing the dreaded "The website encountered an unexpected error. Please try again later." general Drupal error.
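For context, the server-side changes mentioned above look roughly like the following. This is a hedged sketch only: the values are illustrative, not the settings we actually used, and the right numbers depend on your largest extracted-text rows.

```ini
# my.cnf — illustrative values, tune to your own row sizes
[mysqld]
max_allowed_packet  = 256M   # rows holding extracted PDF text exceed the small default
innodb_log_file_size = 512M  # large blob writes need room in the redo log

[mysqldump]
max_allowed_packet  = 256M   # exports of the same rows hit the same limit
```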
Caching the body in the database seemed like a good idea to me at first: it looks like it would really reduce load on the server. In practice, though, in an enterprise-level setting it has been a big problem.
I propose removing that field entirely, along with the hash, since the hash is not useful without the body cache field.
Patch for this coming soon.
Comment | File | Size | Author |
---|---|---|---|
#31 | foo2.po | 54 bytes | pwolanin |
#31 | foo2.mkv | 54 bytes | pwolanin |
#25 | 1936662--solr_cache_bin_schema-24.patch | 4.67 KB | janusman |
#22 | 1936662-cache-acquia-agent-subscription-22.patch | 2.69 KB | janusman |
#21 | 1936662--solr_cache_bin_schema-21.patch | 3.35 KB | escuriola |
Comments
Comment #1
srjosh commented:
Patch attached.
Comment #2
Nick_vh commented:
This kind of makes sense for large sites. I'm not sure whether we want to remove it completely rather than having a checkbox: fetch it from Solr at all times?
Comment #3
posulliv CreditAttribution: posulliv commented:
Updated to keep the caching-related columns in the underlying database table. Added an option to disable caching in the configuration menu.
Comment #4
pwolanin CreditAttribution: pwolanin commented:
OK, right, so the issue is basically the caching of the extracted text:
An alternative here would be to save the extracted text to a series of text files (e.g. named after the file hash). That might be better than ever putting it in the DB?
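A minimal sketch of that alternative, purely for illustration: none of these function names are part of the module's API, and the cache directory is a placeholder.

```php
<?php
// Hypothetical sketch: cache extracted text in flat files named after the
// source file's hash, instead of in a database column.

define('EXTRACT_CACHE_DIR', sys_get_temp_dir() . '/solr_extract_cache');

function extract_cache_path($hash) {
  return EXTRACT_CACHE_DIR . '/' . $hash . '.txt';
}

function extract_cache_set($hash, $text) {
  if (!is_dir(EXTRACT_CACHE_DIR)) {
    mkdir(EXTRACT_CACHE_DIR, 0770, TRUE);
  }
  file_put_contents(extract_cache_path($hash), $text);
}

function extract_cache_get($hash) {
  $path = extract_cache_path($hash);
  // FALSE on a miss mirrors the "re-extract from the source file" path.
  return is_file($path) ? file_get_contents($path) : FALSE;
}
```

The nice property is that the database row shrinks to the hash alone, while the bulk text lives on disk where backups can skip it.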
Comment #5
posulliv CreditAttribution: posulliv commented:
Yep, pretty much. I have a project where I don't want to cache the extracted text in the database because of how large the caching table can get.
Does the caching of the extracted text help much? Is it really needed?
Comment #6
pwolanin CreditAttribution: pwolanin commented:
Caching the text is especially useful because otherwise you need to do a round-trip to the Solr server for each doc to re-extract the text, so I wouldn't want to remove it.
Comment #7
posulliv CreditAttribution: posulliv commented:
That's fair enough.
The patch I created in #5 makes the database caching configurable with the default set to caching enabled. Does that seem like a reasonable approach?
Comment #8
Nick_vh commented:
Please keep the default value set to TRUE; we should not change the default behaviour.
Also, add a description of why it would be useful to disable the cache, rather than just saying what the option does.
Comment #9
nasia123 CreditAttribution: nasia123 commented:
Will this patch be committed?
In the latest version there is no such field for selecting whether or not to use the cache.
I think that if this is going to be committed, a more detailed description should be added, so that users understand the pros and cons of each selection.
Comment #10
pwolanin CreditAttribution: pwolanin commented:
I would prefer the new default option to be caching to the filesystem, so I don't think the patch is ready to be committed.
Comment #11
Kukulcan CreditAttribution: Kukulcan commented:
I get an SQL error in that db update function at line 101.
Is it safe to remove it? Will it only affect the performance of future re-indexing?
Comment #12
amontero commented:
In addition to the disable-caching option, perhaps offloading the cache column to a standard Drupal cache bin would bring a nice tradeoff.
Since we would be using cache_get() and cache_set() with the table PK as the cid, the site administrator could fine-tune the cache bin's TTL to a balance that better suits each case. Also, moving the cache bin to another backend such as MongoDB would let the DB offload the body column's storage.
Would that be possible?
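To make the idea concrete, here is an illustrative fragment using Drupal 7's cache API. It only runs inside a bootstrapped Drupal site, and the bin name is hypothetical (it may differ from whatever a patch ends up using):

```php
// Illustrative fragment only — requires a bootstrapped Drupal 7 site.
$cid = $entity_type . ':' . $entity_id;  // reuse the table PK as the cache ID

// Store the extracted body text in a dedicated cache bin.
cache_set($cid, $extracted_text, 'cache_apachesolr_attachments_file_body');

// Later, try the cache before another extraction round-trip.
$cached = cache_get($cid, 'cache_apachesolr_attachments_file_body');
$text = $cached ? $cached->data : FALSE;
```

Because cache_set()/cache_get() route through whatever backend is configured for that bin, the storage location becomes a deployment decision rather than a schema decision.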
Comment #13
amontero commented:
Using Drupal's standard cache functions would also easily fulfill pwolanin's request in #10, by way of the File Cache backend module.
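For example, a settings.php override along these lines would move that bin onto disk. This is a sketch assuming the File Cache module's documented cache_class pattern; the module path and bin name are illustrative and may differ on a given site:

```php
// settings.php — illustrative; paths and bin name may differ on your site.
$conf['cache_backends'][] = 'sites/all/modules/filecache/filecache.inc';
$conf['cache_class_cache_apachesolr_attachments_file_body'] = 'DrupalFileCache';
```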
Comment #14
Nick_vh commented:
That seems very reasonable, as long as you are able to choose how you want to keep the file-extraction cache: either on a filesystem, or in a cache bin that is then regulated using Drupal's cache configuration. Anxiously awaiting a patch :)
Comment #15
escuriola CreditAttribution: escuriola commented:
I'm working on the patch for #12.
Our platform has more than 1M documents, so this is really important for us.
Comment #16
escuriola CreditAttribution: escuriola commented:
First attempt at #12.
Comment #17
escuriola CreditAttribution: escuriola commented:
Comment #18
escuriola CreditAttribution: escuriola commented:
Added a flush-all-cache button.
Comment #19
pwolanin CreditAttribution: pwolanin commented:
There is stray commented-out code:
Also, does the other table's schema need to change?
Comment #20
amonteroreturn $cached . ' cache';
The localExtracted text cache cleared.' as done in core. Also, it will not be necessarily local.Other than these, the patch is right and solves the problem. I think that moving the extracted body cache out of the 'apachesolr_index_entity_file' table will avoid all the deadlocks I've observed there. Managing the cache via Drupal's cache API will enable even moving it out of MySQL (to MongoDB, for instance).
Comment #21
escuriola CreditAttribution: escuriola commented:
Well-formed patch, re-rolled against a recent HEAD update.
Comment #22
janusman CreditAttribution: janusman commented:
Committed a version of this to 7.x-1.x-dev. Patch attached.
Comment #23
janusman CreditAttribution: janusman commented:
Derp. I commented on the wrong issue! Setting back to "needs review" (see the patch from #21).
Comment #24
janusman CreditAttribution: janusman at Acquia commented:
New patch.
The patch in #21 had a few problems:
I also tried this out with https://www.drupal.org/project/filecache and it seems to work with this:
Comment #25
janusman CreditAttribution: janusman at Acquia commented:
Aaand of course I forgot the patch.
Comment #26
zdw CreditAttribution: zdw commented:
Tested the patch in #25 with the FileCache backend, and it appears to solve the issue.
Database size on a 35k document site went from >4GB to less than 200MB.
Comment #27
referup CreditAttribution: referup commented:
Ours went from 8.5 GB for 1.3M docs to about 300 MB. It's working flawlessly on our dev, test, and prod systems.
This also lets us follow Drupal's best practice of not backing up cache data.
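Concretely, once the extracted text lives in a cache bin you can exclude it from dumps. A hedged example of the idea (the database and table names are illustrative, and these commands obviously need a live MySQL server):

```
# Dump everything except the cache bin's data,
# then append the table's structure so a restore recreates it empty.
mysqldump mydb --ignore-table=mydb.cache_apachesolr_attachments_file_body > backup.sql
mysqldump mydb cache_apachesolr_attachments_file_body --no-data >> backup.sql
```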
Comment #29
Nick_vh commented:
Committed. Thanks, all!
[7.x-1.x c9834f5] Issue #1936662 by escuriola, janusman, posulliv, srjosh, pwolanin, amontero, Nick_vh: Database not scaling well
Author: janusman
4 files changed, 26 insertions(+), 17 deletions(-)
Comment #31
pwolanin CreditAttribution: pwolanin as a volunteer and at Acquia commented:
Comment #32
claudiu.cristea commented:
This is related, and it also improves performance: #2017705: Performance! Add missed indexes to {apachesolr_index_entities_file} table.
Comment #33
Mingsong commented:
Thanks for the great job on the #25 patch.
The size of our database is significantly reduced with the 1.4-3-dev development version.
I noticed that the cache_apachesolr_attachments_file_body table is always empty, even after I re-index all content.
Does anyone know why?
I just noticed that we are using the Memcache module for local caching. I suppose that is why the custom cache_apachesolr_attachments_file_body table is empty in our database.