| Project: | Apache Solr Search Integration |
| Version: | 6.x-3.x-dev |
| Component: | Code |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | closed (fixed) |
Issue Summary
The implementations of hook_node_type, hook_user and hook_taxonomy in apachesolr.module completely knocks out our MySQL-server.
For example when someone commits a change to a term, this code can swamp the MySQL-server with very slow queries. We have seen queries like this running for 5+ hours on our test server.
Troublesome queries:
db_query("UPDATE {apachesolr_search_node} SET changed = %d WHERE nid IN (SELECT nid FROM {node} WHERE uid = %d)", time(), $account->uid);
db_query("UPDATE {apachesolr_search_node} SET changed = %d WHERE nid IN (SELECT nid FROM {term_node} WHERE tid = %d)", time(), $edit['tid']);
db_query("UPDATE {apachesolr_search_node} SET changed = %d WHERE nid IN (SELECT nid FROM {node} WHERE type = '%s' OR type = '%s')", time(), $info->old_type, $info->type);This server is a dual quad core xeon with 8GB memory and 15K rpm disks, with MySQL optimized with our standard settings.
The Drupal installation is of medium size - around 1.6 million nodes and 2.4 million term node connections.
We have changed these queries to this:
db_query("UPDATE {UPDATE apachesolr_search_node} a INNER JOIN {node} b ON a.nid = b.nid SET a.changed = %d WHERE b.uid = %d", array(time(), $account->uid));
db_query("UPDATE {UPDATE apachesolr_search_node} a INNER JOIN {term_node} b ON a.nid = b.nid SET a.changed = %d WHERE b.tid = %d", array(time(), $edit['tid']));
db_query("UPDATE {apachesolr_search_node} a INNER JOIN {node} b ON a.nid = b.nid SET a.changed = %d WHERE (b.type = '%s' OR b.type = '%s')", array(time(), $info->old_type, $info->type));For us this is a critical flaw, because it renders the whole system unusable if the queries are not changed.
Comments
#1
We're certainly open to optimizing them - but I'm wondering if that syntax is MySQL only?
Also, how many nodes do you have - I imagine it must be very large?
#2
What part of that syntax do you consider to be MySQL only?
I would guess that other database systems supports joins of type inner join. Especially since a lot of Drupal core uses joins by default.
But optimizing or not, it seems that the current queries simply don't work on any Drupal installation of a certain size for those that use MySQL.
This site has 1.6 mill nodes and 2.4 million term node connections.
We are already setting up some other sites for Apache Solr support, with 5+ million nodes. But it's a bit tiresome to always have to rewrite "stock modules".
#3
Ah would love to see a patch (drupal.org/patch) and would love to see the EXPLAIN for before and after.
The debate for and against subqueries/JOINS continues on!
#4
it doesn't look like postgres supports this syntax
#5
Looks like Postgres can use a similar but incompatible syntax like:
UPDATE {apachesolr_search_node} sn SET changed = %d FROM {node} n WHERE sn.nid=n.nid AND n.uid = %dhttp://www.postgresql.org/docs/8.0/static/sql-update.html
#6
I'm not trying to start a join vs. subqueries discussion.
Just voicing my experience with the current code and MySQL. If 90% of all Drupal installations are on MySQL (anyone have any numbers?), having queries which are highly unoptimized for MySQL does not seem to be overly clever.
Scott Reynolds, is there a way to do an EXPLAIN on an update query with subquery?
#7
Discussed this with David Strauss - MySQL might handle this better in MySQL 6, but likely to support this better in the module we'll need to switch on the DB type. I have no idea how that will work in D7.
#8
Nope, apparently I hadn't had my coffee yet before posting that comment :-D
Incompatible how? Does it accomplish exactly what the original query was doing? I believe MySQL supports that syntax as well. http://dev.mysql.com/doc/refman/5.0/en/update.html
#9
Mysql orders the table references differently and does not use the "FROM" keyword. I tried it before posting the above. MySQL can do something like:
UPDATE {apachesolr_search_node} sn, {node} n SET changed = %d WHERE sn.nid=n.nid AND n.uid = %dcompared to Postgres, where the table being updated is the only one before the SET:
UPDATE {apachesolr_search_node} sn SET changed = %d FROM {node} n WHERE sn.nid=n.nid AND n.uid = %d#10
I've spent some time trying to find ways to make this work.
It seems that MySQL really has some problems with "WHERE IN (subquery)" because of it's query optimizing algorithms. Found some more technical documentation, where someone proved that doing a "WHERE IN (subquery (subquery))" would force MySQL to create a temporary result set for the inner most query. Didn't read too much into it, because it seems to me that would just be trying to solve the problem in a "wrong way" :)
But I am curious.
Does someone have a large data set to test these queries in PostgreSQL or any other database?
Is this the proper way to write such queries - query with subquery?
#11
Looks like we need to switch on DB type if we want to support this - anyone want to roll a patch and test it?
#12
David Straus mentions that MySQL 6 will handle this bettter, but that's a ways off: http://www.scribd.com/doc/2546837/New-Subquery-Optimizations-In-MySQL-6
#13
let's get this fixed before the next release
#14
One possible improvement is changing OR to UNION.
From:
UPDATE {apachesolr_search_node} SET changed = %d WHERE nid IN (SELECT nid FROM {node} WHERE type = '%s' OR type = '%s')To:
UPDATE {apachesolr_search_node} SET changed = %d WHERE nid IN (SELECT nid FROM {node} WHERE type = '%s' UNION SELECT nid FROM {node} WHERE type = '%s')I think this is PgSQL compatible.
#15
I think the problem is the IN query, not the OR
#16
untested patch
#17
#18
Seems like that last query deletes from the {node} table too!!!
#19
ok, this form seems to behave as expected for the delete.
#20
Quick test to all 3 types of changes - looks ok. Committing to 5.x-2.x, 6.x-1.x, 6.x-2.x.
Needs to be ported to HEAD.
#21
a couple of ideas spring to mind for the D7 port of this.
first, it seems weird to hold on to a real-time update that could be potentially server-bashing, when all we are doing is marking stuff for the real work to happen later. why not just record the type of change and the relevant data, and push all the work to the "do it later" code?
second, it would be great to see this sort or update be alterable by modules. in the degenerate case, the apachesolr module can continue to do the "just update everything in one go" approach. but, someone could write a "apachesolr_gearman" module, which could dispatch the updates as jobs, which would seem to fit the needs here much better.
totally untested code:
<?php
/**
* Implements hook_user_update().
*/
function apachesolr_user_update(&$edit, $account, $category) {
if (isset($edit['name']) && $account->name != $edit['name']) {
$type = 'node_author_change';
$data = array(
$account->uid,
REQUEST_TIME,
);
apachesolr_queue_updates($type, $data);
}
}
/**
* Queue search index updates.
*
* @param mixed $type
* @param array $data
* @return void
*/
function apachesolr_queue_updates($type, array $data) {
static $types;
if ($types === NULL) {
$types = module_invoke_all('apachesolr_update_queue_info');
drupal_alter('apachesolr_update_queue_types', $types);
}
if (isset($types[$type])) {
call_user_func_array($types[$type]['callback'], $data);
}
}
/**
* Implements hook_apachesolr_update_queue_info().
*/
function apachesolr_apachesolr_update_queue_info() {
return array(
'node_author_change' => array(
'callback' => 'apachesolr_update_user_nodes',
),
);
}
/**
* Mark nodes with the given $uid as needing re-indexing.
*
* @param int $uid
* @param int $update_time
*/
function apachesolr_update_user_nodes($uid, $update_time) {
$nid_query = db_select('node')->fields('node', array('nid'))->where("uid = :uid", array(':uid' => $uid));
db_update('apachesolr_search_node')->condition('nid', $nid_query, 'IN')->fields(array('changed' => $update_time))->execute();
}
?>
#22
@justinrandell - I agree with your points, but in discussion with Crell it seems rather unfortunate that there is no way around sending such a poorly performing query to mysql in the context of DBTNG.
#23
related core issue opened by Crell: #961604: dbtng drivers are not able to have their own condition classes
#24
Marking #924214: Calling drush solr-reindex errors with unknown table in MULTI DELETE duplicate. It mentions a bug for some versions of MySQL:
In Mysql version: 5.0.47 the query:
<?phpdb_query("DELETE FROM {apachesolr_search_node} USING {apachesolr_search_node} asn INNER JOIN {node} n ON asn.nid = n.nid WHERE n.type = '%s'", $type);
?>
Causes the error:
WD php: Unknown table 'apachesolr_search_node' in MULTI DELETEThe issue is that using the alias for apachesolr_search_node means that mysql can no longer recognise the table to delete.
#25
Still waiting for the core fix.
#26
A solution until the DBTNG has resolved this could be :
<?php
if(db_driver() === 'mysql') {
}
else {
db_update('apachesolr_search_node')
->condition('nid', $nid, 'IN')
->fields(array('changed' => REQUEST_TIME))
->execute();
}
?>
Still have to investigate how the exact query would be for mysql and for other types (needs a lot of testing)
#27
#28
A patch that already separates the db_driver. Still a todo would be to write another query for mysql or for another db type.
#29
Also, it seems that we are doing the same query 3 times. We could move this to
<?phpapachesolr_mark_node($nid);
?>
This function could possibly retrieve one or more nids?
#30
In 7 mysql and mysql are not separate.
#31
Very interesting post about this behavior
http://stackoverflow.com/questions/3417074/why-would-an-in-condition-be-...
i first tried to make it happen in the dbtng layer but I bumped into a problem where the UPDATE does not support joins
(Info about that here : http://api.drupal.org/api/drupal/includes--database--database.inc/functi...)
<?php// UPDATE {apachesolr_search_node} asn
// INNER JOIN {node} n ON asn.nid = n.nid
// SET changed = %d
// WHERE (n.type = '%s' OR n.type = '%s')
$types = db_or();
$types
->condition('n.type', $info->type)
->condition('n.type', $info->old_type);
$update = db_update('apachesolr_search_node')
->join('node', 'n', 'apachesolr_search_node.nid = n.nid')
->fields(array('changed' => REQUEST_TIME))
->condition($ored);
?>
I think the only solution will be the db_query for mysql only
<?phpswitch (db_driver()) {
case 'mysql' :
$query = "UPDATE {apachesolr_search_node} asn
INNER JOIN {node} n ON asn.nid = n.nid SET asn.changed = :changed
WHERE (n.type = :type OR n.type = :old_type)";
$result = db_query($query, array(':changed' => REQUEST_TIME,
':type' => $info->type,
':old_type' => $info->old_type,
));
break;
default :
$nids = db_select('node')->fields('node', array('nid'))->where("type = :new OR type = :old", array(':new' => $info->type, ':old' => $info->old_type));
$update = db_update('apachesolr_search_node')
->condition('nid', $nids, 'IN')
->fields(array('changed' => REQUEST_TIME))
->execute();
}
?>
#32
And the accompanying patch
Note: I am aware of a whitespace issue in the patch, but let's look at the concept and resolution first
#33
Updated some documentation so future updates become easier
#34
Yes, from my prior discussion with Crell, it probably just has to be db_query()
#35
need to also fix this function:
function apachesolr_rebuild_index_table($type = NULL) {#36
There are more of them :
<?phpfunction apachesolr_nodeapi_mass_update($nodes) {
...
if ($published_ids) {
db_update('apachesolr_search_node')->fields(array('changed' => REQUEST_TIME, 'status' => 1))->condition('nid', array_keys($published_ids), 'IN')->execute();
}
if ($unpublished_ids) {
db_update('apachesolr_search_node')->fields(array('changed' => REQUEST_TIME, 'status' => 0))->condition('nid', array_keys($unpublished_ids), 'IN')->execute();
}
?>
<?phpfunction apachesolr_nodeapi_mass_delete($nodes) {
...
db_delete('apachesolr_search_node')->condition('nid', $nids, 'IN')->execute();
?>
<?phpfunction apachesolr_search_type_boost_form_submit($form, &$form_state) {
...
db_update('apachesolr_search_node')->fields(array('changed' => REQUEST_TIME))->condition('nid', $nids, 'IN')->execute();
?>
#37
Mixing up patches, ignore this#38
A simple IN query is fine, these don't need to change - it's only the combination with a subquery
#39
Untested patch but it does take the delete into account
will report back if I have tested all this functionality
#40
#41
Diffed on latest branch
#42
I'm testing out all of the possibilities and somehow I get stuck/struck with the following :
<?php$query = "UPDATE {:table} asn
INNER JOIN {taxonomy_index} ti ON asn.entity_id = ti.nid
SET asn.changed = :changed
WHERE ti.tid = :tid";
$arguments = array(
':table' => $table,
':changed' => REQUEST_TIME,
':tid' => $term->tid,
);
$result = db_query($query, $arguments);
?>
PDOException: SQLSTATE[42000]: Syntax error or access violation: 1064 You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ''apachesolr_index_entities_node' asn INNER JOIN taxonomy_index ti ON asn' at line 1: UPDATE {:table} asn INNER JOIN {taxonomy_index} ti ON asn.entity_id = ti.nid SET asn.changed = :changed WHERE ti.tid = :tid; Array ( [:table] => apachesolr_index_entities_node [:changed] => 1325850555 [:tid] => 73Translated to a real query this becomes
UPDATE apachesolr_index_entities_node asn INNER JOIN taxonomy_index ti ON asn.entity_id = ti.nid SET asn.changed = 1325850555 WHERE ti.tid = 73When this query is executed straight in mysql it does not give me any problems... Weird..
#43
Progress so far after review
#44
With help from mhrabovcin, the following patch works but it does have a certain direct SQL injection problem. (that is, if entity_info can be altered in the cache)
Should we do some extra checking on the table variable?
#45
An escaped version. I feel more comfortable using mysql_escape_string instead of nothing
#46
probably want to use this
http://api.drupal.org/api/drupal/includes--database--database.inc/functi...
#47
Adjusted
#48
Needed some reroll
#49
Wooaa, a 2 year old issue can be closed and marked as fixed! Committed :-)
#50
Updated it in a follow up that apachesolr_index_get_indexer_table needed to become apachesolr_get_indexer_table
#51
#52
Reroll for 6.x-3.x