The implementations of hook_node_type, hook_user and hook_taxonomy in apachesolr.module completely knocks out our MySQL-server.
For example when someone commits a change to a term, this code can swamp the MySQL-server with very slow queries. We have seen queries like this running for 5+ hours on our test server.
Troublesome queries:

db_query("UPDATE {apachesolr_search_node} SET changed = %d WHERE nid IN (SELECT nid FROM {node} WHERE uid = %d)", time(), $account->uid);
db_query("UPDATE {apachesolr_search_node} SET changed = %d WHERE nid IN (SELECT nid FROM {term_node} WHERE tid = %d)", time(), $edit['tid']);
db_query("UPDATE {apachesolr_search_node} SET changed = %d WHERE nid IN (SELECT nid FROM {node} WHERE type = '%s' OR type = '%s')", time(), $info->old_type, $info->type);

This server is a dual quad core xeon with 8GB memory and 15K rpm disks, with MySQL optimized with our standard settings.
The Drupal installation is of medium size - around 1.6 million nodes and 2.4 million term node connections.

We have changed these queries to this:

db_query("UPDATE {UPDATE apachesolr_search_node} a INNER JOIN {node} b ON a.nid = b.nid SET a.changed = %d WHERE b.uid = %d", array(time(), $account->uid));
db_query("UPDATE {UPDATE apachesolr_search_node} a INNER JOIN {term_node} b ON a.nid = b.nid SET a.changed = %d WHERE b.tid = %d", array(time(), $edit['tid']));
db_query("UPDATE {apachesolr_search_node} a INNER JOIN {node} b ON a.nid = b.nid SET a.changed = %d WHERE (b.type = '%s' OR b.type = '%s')", array(time(), $info->old_type, $info->type));

For us this is a critical flaw, because it renders the whole system unusable if the queries are not changed.

Comments

pwolanin’s picture

We're certainly open to optimizing them - but I'm wondering if that syntax is MySQL only?

Also, how many nodes do you have - I imagine it must be very large?

erlendstromsvik’s picture

What part of that syntax do you consider to be MySQL only?
I would guess that other database systems supports joins of type inner join. Especially since a lot of Drupal core uses joins by default.
But optimizing or not, it seems that the current queries simply don't work on any Drupal installation of a certain size for those that use MySQL.

This site has 1.6 mill nodes and 2.4 million term node connections.
We are already setting up some other sites for Apache Solr support, with 5+ million nodes. But it's a bit tiresome to always have to rewrite "stock modules".

Scott Reynolds’s picture

Ah would love to see a patch (drupal.org/patch) and would love to see the EXPLAIN for before and after.

The debate for and against subqueries/JOINS continues on!

pwolanin’s picture

it doesn't look like postgres supports this syntax

pwolanin’s picture

Looks like Postgres can use a similar but incompatible syntax like:

UPDATE {apachesolr_search_node} sn SET changed = %d FROM {node} n WHERE sn.nid=n.nid AND n.uid = %d

http://www.postgresql.org/docs/8.0/static/sql-update.html

erlendstromsvik’s picture

I'm not trying to start a join vs. subqueries discussion.
Just voicing my experience with the current code and MySQL. If 90% of all Drupal installations are on MySQL (anyone have any numbers?), having queries which are highly unoptimized for MySQL does not seem to be overly clever.

Scott Reynolds, is there a way to do an EXPLAIN on an update query with subquery?

pwolanin’s picture

Discussed this with David Strauss - MySQL might handle this better in MySQL 6, but likely to support this better in the module we'll need to switch on the DB type. I have no idea how that will work in D7.

Scott Reynolds’s picture

Scott Reynolds, is there a way to do an EXPLAIN on an update query with subquery?

Nope, apparently I hadn't had my coffee yet before posting that comment :-D

Looks like Postgres can use a similar but incompatible syntax like:

Incompatible how? Does it accomplish exactly what the original query was doing? I believe MySQL supports that syntax as well. http://dev.mysql.com/doc/refman/5.0/en/update.html

pwolanin’s picture

Mysql orders the table references differently and does not use the "FROM" keyword. I tried it before posting the above. MySQL can do something like:

UPDATE {apachesolr_search_node} sn, {node} n SET changed = %d WHERE sn.nid=n.nid AND n.uid = %d

compared to Postgres, where the table being updated is the only one before the SET:

UPDATE {apachesolr_search_node} sn SET changed = %d FROM {node} n WHERE sn.nid=n.nid AND n.uid = %d
erlendstromsvik’s picture

I've spent some time trying to find ways to make this work.
It seems that MySQL really has some problems with "WHERE IN (subquery)" because of it's query optimizing algorithms. Found some more technical documentation, where someone proved that doing a "WHERE IN (subquery (subquery))" would force MySQL to create a temporary result set for the inner most query. Didn't read too much into it, because it seems to me that would just be trying to solve the problem in a "wrong way" :)

But I am curious.
Does someone have a large data set to test these queries in PostgreSQL or any other database?
Is this the proper way to write such queries - query with subquery?

pwolanin’s picture

Looks like we need to switch on DB type if we want to support this - anyone want to roll a patch and test it?

pwolanin’s picture

David Straus mentions that MySQL 6 will handle this bettter, but that's a ways off: http://www.scribd.com/doc/2546837/New-Subquery-Optimizations-In-MySQL-6

pwolanin’s picture

Version: 6.x-1.0-rc2 » 6.x-1.x-dev

let's get this fixed before the next release

janusman’s picture

One possible improvement is changing OR to UNION.

From:

UPDATE {apachesolr_search_node} SET changed = %d WHERE nid IN (SELECT nid FROM {node} WHERE type = '%s' OR type = '%s')

To:

UPDATE {apachesolr_search_node} SET changed = %d WHERE nid IN (SELECT nid FROM {node} WHERE type = '%s' UNION SELECT nid FROM {node} WHERE type = '%s')

I think this is PgSQL compatible.

pwolanin’s picture

I think the problem is the IN query, not the OR

pwolanin’s picture

StatusFileSize
new3.84 KB

untested patch

pwolanin’s picture

Status: Active » Needs review
pwolanin’s picture

Status: Needs review » Needs work

Seems like that last query deletes from the {node} table too!!!

pwolanin’s picture

Status: Needs work » Needs review
StatusFileSize
new3.78 KB

ok, this form seems to behave as expected for the delete.

pwolanin’s picture

Version: 6.x-1.x-dev » 7.x-1.x-dev
Status: Needs review » Patch (to be ported)

Quick test to all 3 types of changes - looks ok. Committing to 5.x-2.x, 6.x-1.x, 6.x-2.x.

Needs to be ported to HEAD.

Anonymous’s picture

a couple of ideas spring to mind for the D7 port of this.

first, it seems weird to hold on to a real-time update that could be potentially server-bashing, when all we are doing is marking stuff for the real work to happen later. why not just record the type of change and the relevant data, and push all the work to the "do it later" code?

second, it would be great to see this sort or update be alterable by modules. in the degenerate case, the apachesolr module can continue to do the "just update everything in one go" approach. but, someone could write a "apachesolr_gearman" module, which could dispatch the updates as jobs, which would seem to fit the needs here much better.

totally untested code:

/**
 * Implements hook_user_update().
 */
function apachesolr_user_update(&$edit, $account, $category) {
  if (isset($edit['name']) && $account->name != $edit['name']) {
    $type = 'node_author_change';
    $data = array(
      $account->uid,
      REQUEST_TIME,
    );
    apachesolr_queue_updates($type, $data);
  }
}

/**
 * Queue search index updates.
 *
 * @param mixed $type
 * @param array $data
 * @return void
 */
function apachesolr_queue_updates($type, array $data) {
  static $types;
  if ($types === NULL) {
    $types = module_invoke_all('apachesolr_update_queue_info');
    drupal_alter('apachesolr_update_queue_types', $types);
  }

  if (isset($types[$type])) {
    call_user_func_array($types[$type]['callback'], $data);
  }
}

/**
 * Implements hook_apachesolr_update_queue_info().
 */ 
function apachesolr_apachesolr_update_queue_info() {
  return array(
    'node_author_change' => array(
      'callback' => 'apachesolr_update_user_nodes',
    ),
  );
}

/**
 * Mark nodes with the given $uid as needing re-indexing.
 *
 * @param int $uid
 * @param int $update_time
 */
function apachesolr_update_user_nodes($uid, $update_time) {
  $nid_query = db_select('node')->fields('node', array('nid'))->where("uid = :uid", array(':uid' => $uid));
  db_update('apachesolr_search_node')->condition('nid', $nid_query, 'IN')->fields(array('changed' => $update_time))->execute();
}
pwolanin’s picture

@justinrandell - I agree with your points, but in discussion with Crell it seems rather unfortunate that there is no way around sending such a poorly performing query to mysql in the context of DBTNG.

pwolanin’s picture

jpmckinney’s picture

Marking #924214: Calling drush solr-reindex errors with unknown table in MULTI DELETE duplicate. It mentions a bug for some versions of MySQL:

In Mysql version: 5.0.47 the query:

db_query("DELETE FROM {apachesolr_search_node} USING {apachesolr_search_node} asn INNER JOIN {node} n ON asn.nid = n.nid WHERE n.type = '%s'", $type);

Causes the error:

WD php: Unknown table 'apachesolr_search_node' in MULTI DELETE

The issue is that using the alias for apachesolr_search_node means that mysql can no longer recognise the table to delete.

pwolanin’s picture

Still waiting for the core fix.

nick_vh’s picture

A solution until the DBTNG has resolved this could be :

if(db_driver() === 'mysql') {

}
else {
  db_update('apachesolr_search_node')
        ->condition('nid', $nid, 'IN')
        ->fields(array('changed' => REQUEST_TIME))
        ->execute();
}

Still have to investigate how the exact query would be for mysql and for other types (needs a lot of testing)

nick_vh’s picture

Status: Patch (to be ported) » Needs work
nick_vh’s picture

StatusFileSize
new3.26 KB

A patch that already separates the db_driver. Still a todo would be to write another query for mysql or for another db type.

nick_vh’s picture

Also, it seems that we are doing the same query 3 times. We could move this to

apachesolr_mark_node($nid);

This function could possibly retrieve one or more nids?

pwolanin’s picture

In 7 mysql and mysql are not separate.

nick_vh’s picture

Very interesting post about this behavior
http://stackoverflow.com/questions/3417074/why-would-an-in-condition-be-...

i first tried to make it happen in the dbtng layer but I bumped into a problem where the UPDATE does not support joins
(Info about that here : http://api.drupal.org/api/drupal/includes--database--database.inc/functi...)

// UPDATE {apachesolr_search_node} asn
// INNER JOIN {node} n ON asn.nid = n.nid
// SET changed = %d
// WHERE (n.type = '%s' OR n.type = '%s')
$types = db_or();
$types
  ->condition('n.type', $info->type)
  ->condition('n.type', $info->old_type);
$update = db_update('apachesolr_search_node')
  ->join('node', 'n', 'apachesolr_search_node.nid = n.nid')
  ->fields(array('changed' => REQUEST_TIME))
  ->condition($ored);

I think the only solution will be the db_query for mysql only

switch (db_driver()) {
  case 'mysql' :
    $query = "UPDATE {apachesolr_search_node} asn
      INNER JOIN {node} n ON asn.nid = n.nid SET asn.changed = :changed
      WHERE (n.type = :type OR n.type = :old_type)";
    $result = db_query($query, array(':changed' => REQUEST_TIME,
      ':type' => $info->type,
      ':old_type' => $info->old_type,
    ));
    break;
  default :
    $nids = db_select('node')->fields('node', array('nid'))->where("type = :new OR type = :old", array(':new' => $info->type, ':old' => $info->old_type));
    $update = db_update('apachesolr_search_node')
      ->condition('nid', $nids, 'IN')
      ->fields(array('changed' => REQUEST_TIME))
      ->execute();
    }
nick_vh’s picture

StatusFileSize
new4.69 KB

And the accompanying patch

Note: I am aware of a whitespace issue in the patch, but let's look at the concept and resolution first

nick_vh’s picture

StatusFileSize
new6.29 KB

Updated some documentation so future updates become easier

pwolanin’s picture

Yes, from my prior discussion with Crell, it probably just has to be db_query()

pwolanin’s picture

need to also fix this function:

function apachesolr_rebuild_index_table($type = NULL) {
nick_vh’s picture

There are more of them :

function apachesolr_nodeapi_mass_update($nodes) {
...
if ($published_ids) {
  db_update('apachesolr_search_node')->fields(array('changed' => REQUEST_TIME, 'status' => 1))->condition('nid', array_keys($published_ids), 'IN')->execute();
}
if ($unpublished_ids) {
  db_update('apachesolr_search_node')->fields(array('changed' => REQUEST_TIME, 'status' => 0))->condition('nid', array_keys($unpublished_ids), 'IN')->execute();
}
function apachesolr_nodeapi_mass_delete($nodes) {
...
  db_delete('apachesolr_search_node')->condition('nid', $nids, 'IN')->execute();

function apachesolr_search_type_boost_form_submit($form, &$form_state) {
...
db_update('apachesolr_search_node')->fields(array('changed' => REQUEST_TIME))->condition('nid', $nids, 'IN')->execute();
nick_vh’s picture

StatusFileSize
new18.29 KB

Mixing up patches, ignore this

pwolanin’s picture

A simple IN query is fine, these don't need to change - it's only the combination with a subquery

nick_vh’s picture

StatusFileSize
new8.67 KB

Untested patch but it does take the delete into account
will report back if I have tested all this functionality

nick_vh’s picture

Status: Needs work » Needs review
nick_vh’s picture

StatusFileSize
new4.97 KB

Diffed on latest branch

nick_vh’s picture

I'm testing out all of the possibilities and somehow I get stuck/struck with the following :

  $query = "UPDATE {:table} asn
        INNER JOIN {taxonomy_index} ti ON asn.entity_id = ti.nid
        SET asn.changed = :changed
        WHERE ti.tid = :tid";
      $arguments = array(
        ':table' => $table,
        ':changed' => REQUEST_TIME,
        ':tid' => $term->tid,
      );
      $result = db_query($query, $arguments);

PDOException: SQLSTATE[42000]: Syntax error or access violation: 1064 You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ''apachesolr_index_entities_node' asn INNER JOIN taxonomy_index ti ON asn' at line 1: UPDATE {:table} asn INNER JOIN {taxonomy_index} ti ON asn.entity_id = ti.nid SET asn.changed = :changed WHERE ti.tid = :tid; Array ( [:table] => apachesolr_index_entities_node [:changed] => 1325850555 [:tid] => 73

Translated to a real query this becomes
UPDATE apachesolr_index_entities_node asn INNER JOIN taxonomy_index ti ON asn.entity_id = ti.nid SET asn.changed = 1325850555 WHERE ti.tid = 73

When this query is executed straight in mysql it does not give me any problems... Weird..

nick_vh’s picture

StatusFileSize
new5.15 KB

Progress so far after review

nick_vh’s picture

StatusFileSize
new5.01 KB

With help from mhrabovcin, the following patch works but it does have a certain direct SQL injection problem. (that is, if entity_info can be altered in the cache)

Should we do some extra checking on the table variable?

nick_vh’s picture

StatusFileSize
new5.15 KB

An escaped version. I feel more comfortable using mysql_escape_string instead of nothing

pwolanin’s picture

nick_vh’s picture

StatusFileSize
new5.14 KB

Adjusted

nick_vh’s picture

StatusFileSize
new5.38 KB

Needed some reroll

nick_vh’s picture

Status: Needs review » Fixed

Wooaa, a 2 year old issue can be closed and marked as fixed! Committed :-)

nick_vh’s picture

Updated it in a follow up that apachesolr_index_get_indexer_table needed to become apachesolr_get_indexer_table

nick_vh’s picture

Status: Fixed » Closed (fixed)
nick_vh’s picture

Version: 7.x-1.x-dev » 6.x-3.x-dev
StatusFileSize
new5.45 KB

Reroll for 6.x-3.x