The implementations of hook_node_type, hook_user and hook_taxonomy in apachesolr.module completely knocks out our MySQL-server.
For example when someone commits a change to a term, this code can swamp the MySQL-server with very slow queries. We have seen queries like this running for 5+ hours on our test server.
Troublesome queries:
db_query("UPDATE {apachesolr_search_node} SET changed = %d WHERE nid IN (SELECT nid FROM {node} WHERE uid = %d)", time(), $account->uid);
db_query("UPDATE {apachesolr_search_node} SET changed = %d WHERE nid IN (SELECT nid FROM {term_node} WHERE tid = %d)", time(), $edit['tid']);
db_query("UPDATE {apachesolr_search_node} SET changed = %d WHERE nid IN (SELECT nid FROM {node} WHERE type = '%s' OR type = '%s')", time(), $info->old_type, $info->type);
This server is a dual quad core xeon with 8GB memory and 15K rpm disks, with MySQL optimized with our standard settings.
The Drupal installation is of medium size - around 1.6 million nodes and 2.4 million term node connections.
We have changed these queries to this:
db_query("UPDATE {UPDATE apachesolr_search_node} a INNER JOIN {node} b ON a.nid = b.nid SET a.changed = %d WHERE b.uid = %d", array(time(), $account->uid));
db_query("UPDATE {UPDATE apachesolr_search_node} a INNER JOIN {term_node} b ON a.nid = b.nid SET a.changed = %d WHERE b.tid = %d", array(time(), $edit['tid']));
db_query("UPDATE {apachesolr_search_node} a INNER JOIN {node} b ON a.nid = b.nid SET a.changed = %d WHERE (b.type = '%s' OR b.type = '%s')", array(time(), $info->old_type, $info->type));
For us this is a critical flaw, because it renders the whole system unusable if the queries are not changed.
| Comment | File | Size | Author |
|---|---|---|---|
| #52 | 592522-52.patch | 5.45 KB | nick_vh |
| #48 | 592522-48.patch | 5.38 KB | nick_vh |
| #47 | 592522-47.patch | 5.14 KB | nick_vh |
| #45 | 592522-45.patch | 5.15 KB | nick_vh |
| #44 | 592522-44.patch | 5.01 KB | nick_vh |
Comments
Comment #1
pwolanin commentedWe're certainly open to optimizing them - but I'm wondering if that syntax is MySQL only?
Also, how many nodes do you have - I imagine it must be very large?
Comment #2
erlendstromsvik commentedWhat part of that syntax do you consider to be MySQL only?
I would guess that other database systems supports joins of type inner join. Especially since a lot of Drupal core uses joins by default.
But optimizing or not, it seems that the current queries simply don't work on any Drupal installation of a certain size for those that use MySQL.
This site has 1.6 mill nodes and 2.4 million term node connections.
We are already setting up some other sites for Apache Solr support, with 5+ million nodes. But it's a bit tiresome to always have to rewrite "stock modules".
Comment #3
Scott Reynolds commentedAh would love to see a patch (drupal.org/patch) and would love to see the EXPLAIN for before and after.
The debate for and against subqueries/JOINS continues on!
Comment #4
pwolanin commentedit doesn't look like postgres supports this syntax
Comment #5
pwolanin commentedLooks like Postgres can use a similar but incompatible syntax like:
http://www.postgresql.org/docs/8.0/static/sql-update.html
Comment #6
erlendstromsvik commentedI'm not trying to start a join vs. subqueries discussion.
Just voicing my experience with the current code and MySQL. If 90% of all Drupal installations are on MySQL (anyone have any numbers?), having queries which are highly unoptimized for MySQL does not seem to be overly clever.
Scott Reynolds, is there a way to do an EXPLAIN on an update query with subquery?
Comment #7
pwolanin commentedDiscussed this with David Strauss - MySQL might handle this better in MySQL 6, but likely to support this better in the module we'll need to switch on the DB type. I have no idea how that will work in D7.
Comment #8
Scott Reynolds commentedNope, apparently I hadn't had my coffee yet before posting that comment :-D
Incompatible how? Does it accomplish exactly what the original query was doing? I believe MySQL supports that syntax as well. http://dev.mysql.com/doc/refman/5.0/en/update.html
Comment #9
pwolanin commentedMysql orders the table references differently and does not use the "FROM" keyword. I tried it before posting the above. MySQL can do something like:
compared to Postgres, where the table being updated is the only one before the SET:
Comment #10
erlendstromsvik commentedI've spent some time trying to find ways to make this work.
It seems that MySQL really has some problems with "WHERE IN (subquery)" because of it's query optimizing algorithms. Found some more technical documentation, where someone proved that doing a "WHERE IN (subquery (subquery))" would force MySQL to create a temporary result set for the inner most query. Didn't read too much into it, because it seems to me that would just be trying to solve the problem in a "wrong way" :)
But I am curious.
Does someone have a large data set to test these queries in PostgreSQL or any other database?
Is this the proper way to write such queries - query with subquery?
Comment #11
pwolanin commentedLooks like we need to switch on DB type if we want to support this - anyone want to roll a patch and test it?
Comment #12
pwolanin commentedDavid Straus mentions that MySQL 6 will handle this bettter, but that's a ways off: http://www.scribd.com/doc/2546837/New-Subquery-Optimizations-In-MySQL-6
Comment #13
pwolanin commentedlet's get this fixed before the next release
Comment #14
janusman commentedOne possible improvement is changing OR to UNION.
From:
To:
I think this is PgSQL compatible.
Comment #15
pwolanin commentedI think the problem is the IN query, not the OR
Comment #16
pwolanin commenteduntested patch
Comment #17
pwolanin commentedComment #18
pwolanin commentedSeems like that last query deletes from the {node} table too!!!
Comment #19
pwolanin commentedok, this form seems to behave as expected for the delete.
Comment #20
pwolanin commentedQuick test to all 3 types of changes - looks ok. Committing to 5.x-2.x, 6.x-1.x, 6.x-2.x.
Needs to be ported to HEAD.
Comment #21
Anonymous (not verified) commenteda couple of ideas spring to mind for the D7 port of this.
first, it seems weird to hold on to a real-time update that could be potentially server-bashing, when all we are doing is marking stuff for the real work to happen later. why not just record the type of change and the relevant data, and push all the work to the "do it later" code?
second, it would be great to see this sort or update be alterable by modules. in the degenerate case, the apachesolr module can continue to do the "just update everything in one go" approach. but, someone could write a "apachesolr_gearman" module, which could dispatch the updates as jobs, which would seem to fit the needs here much better.
totally untested code:
Comment #22
pwolanin commented@justinrandell - I agree with your points, but in discussion with Crell it seems rather unfortunate that there is no way around sending such a poorly performing query to mysql in the context of DBTNG.
Comment #23
pwolanin commentedrelated core issue opened by Crell: #961604: dbtng drivers are not able to have their own condition classes
Comment #24
jpmckinney commentedMarking #924214: Calling drush solr-reindex errors with unknown table in MULTI DELETE duplicate. It mentions a bug for some versions of MySQL:
In Mysql version: 5.0.47 the query:
Causes the error:
The issue is that using the alias for apachesolr_search_node means that mysql can no longer recognise the table to delete.
Comment #25
pwolanin commentedStill waiting for the core fix.
Comment #26
nick_vhA solution until the DBTNG has resolved this could be :
Still have to investigate how the exact query would be for mysql and for other types (needs a lot of testing)
Comment #27
nick_vhComment #28
nick_vhA patch that already separates the db_driver. Still a todo would be to write another query for mysql or for another db type.
Comment #29
nick_vhAlso, it seems that we are doing the same query 3 times. We could move this to
This function could possibly retrieve one or more nids?
Comment #30
pwolanin commentedIn 7 mysql and mysql are not separate.
Comment #31
nick_vhVery interesting post about this behavior
http://stackoverflow.com/questions/3417074/why-would-an-in-condition-be-...
i first tried to make it happen in the dbtng layer but I bumped into a problem where the UPDATE does not support joins
(Info about that here : http://api.drupal.org/api/drupal/includes--database--database.inc/functi...)
I think the only solution will be the db_query for mysql only
Comment #32
nick_vhAnd the accompanying patch
Note: I am aware of a whitespace issue in the patch, but let's look at the concept and resolution first
Comment #33
nick_vhUpdated some documentation so future updates become easier
Comment #34
pwolanin commentedYes, from my prior discussion with Crell, it probably just has to be db_query()
Comment #35
pwolanin commentedneed to also fix this function:
Comment #36
nick_vhThere are more of them :
Comment #37
nick_vhMixing up patches, ignore thisComment #38
pwolanin commentedA simple IN query is fine, these don't need to change - it's only the combination with a subquery
Comment #39
nick_vhUntested patch but it does take the delete into account
will report back if I have tested all this functionality
Comment #40
nick_vhComment #41
nick_vhDiffed on latest branch
Comment #42
nick_vhI'm testing out all of the possibilities and somehow I get stuck/struck with the following :
PDOException: SQLSTATE[42000]: Syntax error or access violation: 1064 You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ''apachesolr_index_entities_node' asn INNER JOIN taxonomy_index ti ON asn' at line 1: UPDATE {:table} asn INNER JOIN {taxonomy_index} ti ON asn.entity_id = ti.nid SET asn.changed = :changed WHERE ti.tid = :tid; Array ( [:table] => apachesolr_index_entities_node [:changed] => 1325850555 [:tid] => 73Translated to a real query this becomes
UPDATE apachesolr_index_entities_node asn INNER JOIN taxonomy_index ti ON asn.entity_id = ti.nid SET asn.changed = 1325850555 WHERE ti.tid = 73When this query is executed straight in mysql it does not give me any problems... Weird..
Comment #43
nick_vhProgress so far after review
Comment #44
nick_vhWith help from mhrabovcin, the following patch works but it does have a certain direct SQL injection problem. (that is, if entity_info can be altered in the cache)
Should we do some extra checking on the table variable?
Comment #45
nick_vhAn escaped version. I feel more comfortable using mysql_escape_string instead of nothing
Comment #46
pwolanin commentedprobably want to use this
http://api.drupal.org/api/drupal/includes--database--database.inc/functi...
Comment #47
nick_vhAdjusted
Comment #48
nick_vhNeeded some reroll
Comment #49
nick_vhWooaa, a 2 year old issue can be closed and marked as fixed! Committed :-)
Comment #50
nick_vhUpdated it in a follow up that apachesolr_index_get_indexer_table needed to become apachesolr_get_indexer_table
Comment #51
nick_vhComment #52
nick_vhReroll for 6.x-3.x