Download & Extend

Hooks node_type, taxonomy and user knocks out our database server

Project:Apache Solr Search Integration
Version:6.x-3.x-dev
Component:Code
Category:bug report
Priority:normal
Assigned:Unassigned
Status:closed (fixed)

Issue Summary

The implementations of hook_node_type, hook_user and hook_taxonomy in apachesolr.module completely knocks out our MySQL-server.
For example when someone commits a change to a term, this code can swamp the MySQL-server with very slow queries. We have seen queries like this running for 5+ hours on our test server.
Troublesome queries:

db_query("UPDATE {apachesolr_search_node} SET changed = %d WHERE nid IN (SELECT nid FROM {node} WHERE uid = %d)", time(), $account->uid);
db_query("UPDATE {apachesolr_search_node} SET changed = %d WHERE nid IN (SELECT nid FROM {term_node} WHERE tid = %d)", time(), $edit['tid']);
db_query("UPDATE {apachesolr_search_node} SET changed = %d WHERE nid IN (SELECT nid FROM {node} WHERE type = '%s' OR type = '%s')", time(), $info->old_type, $info->type);

This server is a dual quad core xeon with 8GB memory and 15K rpm disks, with MySQL optimized with our standard settings.
The Drupal installation is of medium size - around 1.6 million nodes and 2.4 million term node connections.

We have changed these queries to this:

db_query("UPDATE {UPDATE apachesolr_search_node} a INNER JOIN {node} b ON a.nid = b.nid SET a.changed = %d WHERE b.uid = %d", array(time(), $account->uid));
db_query("UPDATE {UPDATE apachesolr_search_node} a INNER JOIN {term_node} b ON a.nid = b.nid SET a.changed = %d WHERE b.tid = %d", array(time(), $edit['tid']));
db_query("UPDATE {apachesolr_search_node} a INNER JOIN {node} b ON a.nid = b.nid SET a.changed = %d WHERE (b.type = '%s' OR b.type = '%s')", array(time(), $info->old_type, $info->type));

For us this is a critical flaw, because it renders the whole system unusable if the queries are not changed.

Comments

#1

We're certainly open to optimizing them - but I'm wondering if that syntax is MySQL only?

Also, how many nodes do you have - I imagine it must be very large?

#2

What part of that syntax do you consider to be MySQL only?
I would guess that other database systems supports joins of type inner join. Especially since a lot of Drupal core uses joins by default.
But optimizing or not, it seems that the current queries simply don't work on any Drupal installation of a certain size for those that use MySQL.

This site has 1.6 mill nodes and 2.4 million term node connections.
We are already setting up some other sites for Apache Solr support, with 5+ million nodes. But it's a bit tiresome to always have to rewrite "stock modules".

#3

Ah would love to see a patch (drupal.org/patch) and would love to see the EXPLAIN for before and after.

The debate for and against subqueries/JOINS continues on!

#4

it doesn't look like postgres supports this syntax

#5

Looks like Postgres can use a similar but incompatible syntax like:

UPDATE {apachesolr_search_node} sn SET changed = %d FROM {node} n WHERE sn.nid=n.nid AND n.uid = %d

http://www.postgresql.org/docs/8.0/static/sql-update.html

#6

I'm not trying to start a join vs. subqueries discussion.
Just voicing my experience with the current code and MySQL. If 90% of all Drupal installations are on MySQL (anyone have any numbers?), having queries which are highly unoptimized for MySQL does not seem to be overly clever.

Scott Reynolds, is there a way to do an EXPLAIN on an update query with subquery?

#7

Discussed this with David Strauss - MySQL might handle this better in MySQL 6, but likely to support this better in the module we'll need to switch on the DB type. I have no idea how that will work in D7.

#8

Scott Reynolds, is there a way to do an EXPLAIN on an update query with subquery?

Nope, apparently I hadn't had my coffee yet before posting that comment :-D

Looks like Postgres can use a similar but incompatible syntax like:

Incompatible how? Does it accomplish exactly what the original query was doing? I believe MySQL supports that syntax as well. http://dev.mysql.com/doc/refman/5.0/en/update.html

#9

Mysql orders the table references differently and does not use the "FROM" keyword. I tried it before posting the above. MySQL can do something like:

UPDATE {apachesolr_search_node} sn, {node} n SET changed = %d WHERE sn.nid=n.nid AND n.uid = %d

compared to Postgres, where the table being updated is the only one before the SET:

UPDATE {apachesolr_search_node} sn SET changed = %d FROM {node} n WHERE sn.nid=n.nid AND n.uid = %d

#10

I've spent some time trying to find ways to make this work.
It seems that MySQL really has some problems with "WHERE IN (subquery)" because of it's query optimizing algorithms. Found some more technical documentation, where someone proved that doing a "WHERE IN (subquery (subquery))" would force MySQL to create a temporary result set for the inner most query. Didn't read too much into it, because it seems to me that would just be trying to solve the problem in a "wrong way" :)

But I am curious.
Does someone have a large data set to test these queries in PostgreSQL or any other database?
Is this the proper way to write such queries - query with subquery?

#11

Looks like we need to switch on DB type if we want to support this - anyone want to roll a patch and test it?

#12

David Straus mentions that MySQL 6 will handle this bettter, but that's a ways off: http://www.scribd.com/doc/2546837/New-Subquery-Optimizations-In-MySQL-6

#13

Version:6.x-1.0-rc2» 6.x-1.x-dev

let's get this fixed before the next release

#14

One possible improvement is changing OR to UNION.

From:

UPDATE {apachesolr_search_node} SET changed = %d WHERE nid IN (SELECT nid FROM {node} WHERE type = '%s' OR type = '%s')

To:
UPDATE {apachesolr_search_node} SET changed = %d WHERE nid IN (SELECT nid FROM {node} WHERE type = '%s' UNION SELECT nid FROM {node} WHERE type = '%s')

I think this is PgSQL compatible.

#15

I think the problem is the IN query, not the OR

#16

untested patch

AttachmentSizeStatusTest resultOperations
mysql-opt-592522-16.patch3.84 KBIgnored: Check issue status.NoneNone

#17

Status:active» needs review

#18

Status:needs review» needs work

Seems like that last query deletes from the {node} table too!!!

#19

Status:needs work» needs review

ok, this form seems to behave as expected for the delete.

AttachmentSizeStatusTest resultOperations
mysql-opt-592522-19.patch3.78 KBIgnored: Check issue status.NoneNone

#20

Version:6.x-1.x-dev» 7.x-1.x-dev
Status:needs review» patch (to be ported)

Quick test to all 3 types of changes - looks ok. Committing to 5.x-2.x, 6.x-1.x, 6.x-2.x.

Needs to be ported to HEAD.

#21

a couple of ideas spring to mind for the D7 port of this.

first, it seems weird to hold on to a real-time update that could be potentially server-bashing, when all we are doing is marking stuff for the real work to happen later. why not just record the type of change and the relevant data, and push all the work to the "do it later" code?

second, it would be great to see this sort or update be alterable by modules. in the degenerate case, the apachesolr module can continue to do the "just update everything in one go" approach. but, someone could write a "apachesolr_gearman" module, which could dispatch the updates as jobs, which would seem to fit the needs here much better.

totally untested code:

<?php
/**
* Implements hook_user_update().
*/
function apachesolr_user_update(&$edit, $account, $category) {
  if (isset(
$edit['name']) && $account->name != $edit['name']) {
   
$type = 'node_author_change';
   
$data = array(
     
$account->uid,
     
REQUEST_TIME,
    );
   
apachesolr_queue_updates($type, $data);
  }
}

/**
* Queue search index updates.
*
* @param mixed $type
* @param array $data
* @return void
*/
function apachesolr_queue_updates($type, array $data) {
  static
$types;
  if (
$types === NULL) {
   
$types = module_invoke_all('apachesolr_update_queue_info');
   
drupal_alter('apachesolr_update_queue_types', $types);
  }

  if (isset(
$types[$type])) {
   
call_user_func_array($types[$type]['callback'], $data);
  }
}

/**
* Implements hook_apachesolr_update_queue_info().
*/
function apachesolr_apachesolr_update_queue_info() {
  return array(
   
'node_author_change' => array(
     
'callback' => 'apachesolr_update_user_nodes',
    ),
  );
}

/**
* Mark nodes with the given $uid as needing re-indexing.
*
* @param int $uid
* @param int $update_time
*/
function apachesolr_update_user_nodes($uid, $update_time) {
 
$nid_query = db_select('node')->fields('node', array('nid'))->where("uid = :uid", array(':uid' => $uid));
 
db_update('apachesolr_search_node')->condition('nid', $nid_query, 'IN')->fields(array('changed' => $update_time))->execute();
}
?>

#22

@justinrandell - I agree with your points, but in discussion with Crell it seems rather unfortunate that there is no way around sending such a poorly performing query to mysql in the context of DBTNG.

#23

#24

Marking #924214: Calling drush solr-reindex errors with unknown table in MULTI DELETE duplicate. It mentions a bug for some versions of MySQL:

In Mysql version: 5.0.47 the query:

<?php
db_query
("DELETE FROM {apachesolr_search_node} USING {apachesolr_search_node} asn INNER JOIN {node} n ON asn.nid = n.nid WHERE n.type = '%s'", $type);
?>

Causes the error:

WD php: Unknown table 'apachesolr_search_node' in MULTI DELETE

The issue is that using the alias for apachesolr_search_node means that mysql can no longer recognise the table to delete.

#25

Still waiting for the core fix.

#26

A solution until the DBTNG has resolved this could be :

<?php
if(db_driver() === 'mysql') {

}
else {
 
db_update('apachesolr_search_node')
        ->
condition('nid', $nid, 'IN')
        ->
fields(array('changed' => REQUEST_TIME))
        ->
execute();
}
?>

Still have to investigate how the exact query would be for mysql and for other types (needs a lot of testing)

#27

Status:patch (to be ported)» needs work

#28

A patch that already separates the db_driver. Still a todo would be to write another query for mysql or for another db type.

AttachmentSizeStatusTest resultOperations
592522-28.patch3.26 KBIgnored: Check issue status.NoneNone

#29

Also, it seems that we are doing the same query 3 times. We could move this to

<?php
apachesolr_mark_node
($nid);
?>

This function could possibly retrieve one or more nids?

#30

In 7 mysql and mysql are not separate.

#31

Very interesting post about this behavior
http://stackoverflow.com/questions/3417074/why-would-an-in-condition-be-...

i first tried to make it happen in the dbtng layer but I bumped into a problem where the UPDATE does not support joins
(Info about that here : http://api.drupal.org/api/drupal/includes--database--database.inc/functi...)

<?php
// UPDATE {apachesolr_search_node} asn
// INNER JOIN {node} n ON asn.nid = n.nid
// SET changed = %d
// WHERE (n.type = '%s' OR n.type = '%s')
$types = db_or();
$types
 
->condition('n.type', $info->type)
  ->
condition('n.type', $info->old_type);
$update = db_update('apachesolr_search_node')
  ->
join('node', 'n', 'apachesolr_search_node.nid = n.nid')
  ->
fields(array('changed' => REQUEST_TIME))
  ->
condition($ored);
?>

I think the only solution will be the db_query for mysql only

<?php
switch (db_driver()) {
  case
'mysql' :
   
$query = "UPDATE {apachesolr_search_node} asn
      INNER JOIN {node} n ON asn.nid = n.nid SET asn.changed = :changed
      WHERE (n.type = :type OR n.type = :old_type)"
;
   
$result = db_query($query, array(':changed' => REQUEST_TIME,
     
':type' => $info->type,
     
':old_type' => $info->old_type,
    ));
    break;
  default :
   
$nids = db_select('node')->fields('node', array('nid'))->where("type = :new OR type = :old", array(':new' => $info->type, ':old' => $info->old_type));
   
$update = db_update('apachesolr_search_node')
      ->
condition('nid', $nids, 'IN')
      ->
fields(array('changed' => REQUEST_TIME))
      ->
execute();
    }
?>

#32

And the accompanying patch

Note: I am aware of a whitespace issue in the patch, but let's look at the concept and resolution first

AttachmentSizeStatusTest resultOperations
592522-32.patch4.69 KBIgnored: Check issue status.NoneNone

#33

Updated some documentation so future updates become easier

AttachmentSizeStatusTest resultOperations
592522-33.patch6.29 KBIgnored: Check issue status.NoneNone

#34

Yes, from my prior discussion with Crell, it probably just has to be db_query()

#35

need to also fix this function:

function apachesolr_rebuild_index_table($type = NULL) {

#36

There are more of them :

<?php
function apachesolr_nodeapi_mass_update($nodes) {
...
if (
$published_ids) {
 
db_update('apachesolr_search_node')->fields(array('changed' => REQUEST_TIME, 'status' => 1))->condition('nid', array_keys($published_ids), 'IN')->execute();
}
if (
$unpublished_ids) {
 
db_update('apachesolr_search_node')->fields(array('changed' => REQUEST_TIME, 'status' => 0))->condition('nid', array_keys($unpublished_ids), 'IN')->execute();
}
?>

<?php
function apachesolr_nodeapi_mass_delete($nodes) {
...
 
db_delete('apachesolr_search_node')->condition('nid', $nids, 'IN')->execute();
?>

<?php
function apachesolr_search_type_boost_form_submit($form, &$form_state) {
...
db_update('apachesolr_search_node')->fields(array('changed' => REQUEST_TIME))->condition('nid', $nids, 'IN')->execute();
?>

#37

Mixing up patches, ignore this

AttachmentSizeStatusTest resultOperations
592522-37.patch18.29 KBIgnored: Check issue status.NoneNone

#38

A simple IN query is fine, these don't need to change - it's only the combination with a subquery

#39

Untested patch but it does take the delete into account
will report back if I have tested all this functionality

AttachmentSizeStatusTest resultOperations
592522-38.patch8.67 KBIgnored: Check issue status.NoneNone

#40

Status:needs work» needs review

#41

Diffed on latest branch

AttachmentSizeStatusTest resultOperations
592522-41.patch4.97 KBIgnored: Check issue status.NoneNone

#42

I'm testing out all of the possibilities and somehow I get stuck/struck with the following :

<?php
  $query
= "UPDATE {:table} asn
        INNER JOIN {taxonomy_index} ti ON asn.entity_id = ti.nid
        SET asn.changed = :changed
        WHERE ti.tid = :tid"
;
     
$arguments = array(
       
':table' => $table,
       
':changed' => REQUEST_TIME,
       
':tid' => $term->tid,
      );
     
$result = db_query($query, $arguments);
?>

PDOException: SQLSTATE[42000]: Syntax error or access violation: 1064 You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ''apachesolr_index_entities_node' asn INNER JOIN taxonomy_index ti ON asn' at line 1: UPDATE {:table} asn INNER JOIN {taxonomy_index} ti ON asn.entity_id = ti.nid SET asn.changed = :changed WHERE ti.tid = :tid; Array ( [:table] => apachesolr_index_entities_node [:changed] => 1325850555 [:tid] => 73

Translated to a real query this becomes
UPDATE apachesolr_index_entities_node asn INNER JOIN taxonomy_index ti ON asn.entity_id = ti.nid SET asn.changed = 1325850555 WHERE ti.tid = 73

When this query is executed straight in mysql it does not give me any problems... Weird..

#43

Progress so far after review

AttachmentSizeStatusTest resultOperations
592522-43.patch5.15 KBIgnored: Check issue status.NoneNone

#44

With help from mhrabovcin, the following patch works but it does have a certain direct SQL injection problem. (that is, if entity_info can be altered in the cache)

Should we do some extra checking on the table variable?

AttachmentSizeStatusTest resultOperations
592522-44.patch5.01 KBIgnored: Check issue status.NoneNone

#45

An escaped version. I feel more comfortable using mysql_escape_string instead of nothing

AttachmentSizeStatusTest resultOperations
592522-45.patch5.15 KBIgnored: Check issue status.NoneNone

#46

#47

Adjusted

AttachmentSizeStatusTest resultOperations
592522-47.patch5.14 KBIgnored: Check issue status.NoneNone

#48

Needed some reroll

AttachmentSizeStatusTest resultOperations
592522-48.patch5.38 KBIgnored: Check issue status.NoneNone

#49

Status:needs review» fixed

Wooaa, a 2 year old issue can be closed and marked as fixed! Committed :-)

#50

Updated it in a follow up that apachesolr_index_get_indexer_table needed to become apachesolr_get_indexer_table

#51

Status:fixed» closed (fixed)

#52

Version:7.x-1.x-dev» 6.x-3.x-dev

Reroll for 6.x-3.x

AttachmentSizeStatusTest resultOperations
592522-52.patch5.45 KBIgnored: Check issue status.NoneNone
nobody click here