We have a major installation of this module on one of our sites. And, unfortunately, it's causing us lots of issues. The apachesolr_index_entities_node table is enormous - like, 400 Gigs big. It's indexing about 25k+ pdfs, some of them multiple pages and quite large.

It's causing or related to the following issues:

  • We've had to change mysql settings just to get it to be able to save the pdf data and to export for backup purposes or replication to dev enviroments.
  • Import scripts for migration - bulk node additions, essentially - time out due to the required search and delete in the _insert and _update hooks.
  • Randomly, various nodes are not being added to the solr index at creation.
  • Editing nodes times out causing the dreaded "The website encountered an unexpected error. Please try again later." general Drupal error.

The body cache in the database seemed to me like a good idea at first - it seems like it would really reduce load on the server. However, when we put it into practice in an enterprise-level setting, it has been a big problem.

I am proposing removing that field entirely, and the hash as well as it is not useful without the body cache field.

Patch for this coming soon.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

srjosh’s picture

Status: Needs work » Needs review
FileSize
8.56 KB

Patch attached.

Nick_vh’s picture

This kind of makes sense for large sites. I'm not sure if we want to remove it completely rahter than having a checkbox : get it from solr at all times?

posulliv’s picture

Updated to keep caching related columns in the underlying database table. Added option to disable caching in the configuration menu.

pwolanin’s picture

Status: Needs review » Needs work

ok, right so the issue is basically the caching of the extracted text:

-  // Save the extracted, cleaned text to the DB.
-  db_update($indexer_table)->fields(array('hash' => $hash, 'body' => $text))->condition('entity_id', $file->fid)->execute();

An alternative here would be to save the extracted text to a series of text files (e.g. named for the file hash). That might be better than ever putting them in the DB?

posulliv’s picture

Yep, pretty much. I have a project where I don't want to cache the extracted text in the database due to how large the caching table can get.

Does the caching of the extracted text help much? Is it really needed?

pwolanin’s picture

Caching the text is especially useful when you need to do a round-trip to the Solr server for each doc to re-extract the text, so I wouldn't want to remove it.

posulliv’s picture

That's fair enough.

The patch I created in #5 makes the database caching configurable with the default set to caching enabled. Does that seem like a reasonable approach?

Nick_vh’s picture

+++ b/apachesolr_attachments.admin.incundefined
@@ -47,6 +47,12 @@ function apachesolr_attachments_settings($form, &$form_state, $env_id) {
+    '#default_value' => variable_get('apachesolr_attachments_use_cache', FALSE),

Please keep default value to true. We should not change the default behaviour.

+++ b/apachesolr_attachments.admin.incundefined
@@ -47,6 +47,12 @@ function apachesolr_attachments_settings($form, &$form_state, $env_id) {
+    '#description' => t('Whether or not to cache attachment text in the database. If disabled, requests always go to Solr.'),

Add a description why it would be useful to disable cache rather than saying what it does.

nasia123’s picture

will this patch be committed ?
in the latest version there in no such field for selecting whether or not to use cache...
I think if this is going to be committed a more detailed description should be added, so that users understand the pros and cons of using each selection.

pwolanin’s picture

I would prefer to have the new default option be to cache to the filesystem, so I don't think the patch is ready to be committed.

Kukulcan’s picture

I get an SQL error on that dbupdate function in line 101.

Is it safe to remove it? Will it only affect performance of future re-indexing?

amontero’s picture

Issue summary: View changes

In addition to the disable caching option, perhaps offloading the cache column to a standard Drupal cache bin would bring a nice tradeoff.
Since we would be using cache_get and cache_set using the table PK as cid the site administrator could fine tune the cache bin TTL to a balance that better suits each case. Also moving the cache bin to another backend such a MongoDB would allow offload the DB from the body column space.
Would that be possible?

amontero’s picture

Using Drupal standard cache functions would also fulfill pwolanin's request in #10 easily by way of the File Cache backend module

Nick_vh’s picture

That seems very reasonable, as long as you are able to choose how you want to keep the file extraction cache. Either on a filesystem or in a cache bin that is then regulated using the drupal cache configs. Anxiously awaiting a patch :)

escuriola’s picture

I'm working in the patch for #12.
Our platform has more than 1M of documents so it's really important this for us.

escuriola’s picture

First attempt for #12

escuriola’s picture

Status: Needs work » Needs review
escuriola’s picture

Add flush all cache button

pwolanin’s picture

Status: Needs review » Needs work

stray code commented out:

+  $cached = cache_get('entity_id:' . $file->fid . ':hash:' . $hash, 'cache_apachesolr_attachments_file_body');
+ // $cached = db_query("SELECT * FROM {{$indexer_table}} WHERE entity_id = :entity_id", array(':entity_id' => $file->fid))->fetchAssoc();
 
-  $cached = db_query("SELECT * FROM {{$indexer_table}} WHERE entity_id = :entity_id", array(':entity_id' => $file->fid))->fetchAssoc();

Also does the other table schema need to change?

amontero’s picture

  • +1 to @pwolanin's drop of obsolete body field.
  • +1 to @pwolanin's removal of stray code.
  • Cache table description: replace 'apachesolr_attachments' by full module name 'Apache Solr Attachments'.
  • Extra chars returned in return $cached . ' cache';
  • Change 'The local cache of extracted text has been deleted.' message to 'The local Extracted text cache cleared.' as done in core. Also, it will not be necessarily local.
  • Extra linefeeds before and after functions.

Other than these, the patch is right and solves the problem. I think that moving the extracted body cache out of the 'apachesolr_index_entity_file' table will avoid all the deadlocks I've observed there. Managing the cache via Drupal's cache API will enable even moving it out of MySQL (to MongoDB, for instance).

escuriola’s picture

Status: Needs work » Needs review
FileSize
3.35 KB

Well formed patch review with recent head update.

janusman’s picture

Status: Needs review » Fixed
FileSize
2.69 KB

Committed a version of this to 7.x-1.x-dev. Patch attached.

janusman’s picture

Status: Fixed » Needs review

Derp. Commented on the wrong issue! Setting back to needs review (see patch from #21).

janusman’s picture

New patch.

Patch in #21 had a few problems:

  • Did not store the calculated hash on the indexer table.
  • Left the 'body' field on the indexer table schema.
  • apachesolr_attachments_solr_reindex() was still attempting to delete any rows with a NULL 'body' field.
  • apachesolr_attachments_get_attachment_text() was returning the $cached_body object instead of just $cached_body->data.

I also tried this out with https://www.drupal.org/project/filecache and it seems to work with this:

$conf['cache_backends'][] = './sites/all/modules/contrib/filecache/filecache.inc';
$conf['cache_class_cache_apachesolr_attachments_file_body'] = 'DrupalFileCache';
janusman’s picture

Aaand of course I forgot the patch.

zdw’s picture

Tested the patch in #25 with the FileCache backend, and it appears to solve the issue.

Database size on a 35k document site went from >4GB to less than 200MB.

referup’s picture

Status: Needs review » Reviewed & tested by the community

Ours went from 8.5Gb for 1.3M docs to about 300mb. It's working flwalessly on our dev, test and prod systems.
This allows Drupal's best practice of not backing up cache data.

  • Nick_vh committed c9834f5 on 7.x-1.x authored by janusman
    Issue #1936662 by escuriola, janusman, posulliv, srjosh, pwolanin,...
Nick_vh’s picture

Status: Reviewed & tested by the community » Fixed

Commited. Thanks all!

[7.x-1.x c9834f5] Issue #1936662 by escuriola, janusman, posulliv, srjosh, pwolanin, amontero, Nick_vh: Database not scaling well
Author: janusman
4 files changed, 26 insertions(+), 17 deletions(-)

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.

pwolanin’s picture

FileSize
54 bytes
54 bytes
claudiu.cristea’s picture

Mingsong’s picture

Thanks for the great job to #25 patch.

The size of our database is significantly reduced with the develop version of 1.4-3-dev.

I was aware that the cache_apachesolr_attachments_file_body table is always empty, even after I re-index all contents.

Is there anyone know why?

I just noticed that we are using Memcache module for local caching. I suppose that is why the custom cache table of cache_apachesolr_attachments_file_body is empty in our database.