Each node takes longer to do. SQL Issue. Recommend disabling pathauto.module
mikeytown2 - September 8, 2009 - 22:35
| Project: | Calais |
| Version: | 6.x-3.2 |
| Component: | Code |
| Category: | support request |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | won't fix |
Description
I'm doing a bulk process that will be running for quite some time. When I started the average time per node was around 15 seconds; now it's up to 20. I have about 400k rows in calais_term and 1.5M rows in calais_term_node. I'm 10% done.
Example Query from the process list...
SELECT t.tid, t . * , parent
FROM term_data t
INNER JOIN term_hierarchy h ON t.tid = h.tid
WHERE t.vidSeems like it's missing
WHERE t.vid = %d.
Also I'm getting a lot of table locks on the calais_term table. Can I convert this table to InnoDB?
Also I would recommend normalizing the data; create a GUID lookup table (?) and do searches on that result. It takes 2.48 seconds to run this command
SELECT * FROM calais_term WHERE vid = 70 AND guid = 'http://d.opencalais.com/pershash-1/87027943-556'Transform that to this somehow
SELECT * FROM calais_term WHERE vid = 70 AND guid = 55555Long story short, numbers are faster vs text. It also seems to be quite inefficient (multiple updates to the same row).
5808977 Query 2 Locked UPDATE calais_term SET name = 'Proposed', vid = 49, tdid = 337434, guid = 'http://d.opencalais.com/g...'
5808978 Query 2 Locked UPDATE calais_term SET name = 'Proposed', vid = 49, tdid = 337434, guid = 'http://d.opencalais.com/g...'
#1
2 ideas for calais_term table & GUID:
kill 'http://d.opencalais.com/pershash-1/' since it's varchar type (make it smaller).
create an index for GUID & name.
#2
First point about missing the %d is me being dumb... example query for that type
SELECT t.tid, t . * , parentFROM term_data t
INNER JOIN term_hierarchy h ON t.tid = h.tid
WHERE t.vid =70
ORDER BY weight, name
This still takes a very long time though, since this returns 200k rows...
#3
This is now my longest running query
SELECT t.tid, t . * , parentFROM term_data t
INNER JOIN term_hierarchy h ON t.tid = h.tid
WHERE t.vid
Shouldn't there be a limit on this???
http://api.drupal.org/api/function/db_query_range/6
#4
Issue has to do with
pathauto.module
File: /var/www/vhosts/openjurist.org/sites/all/modules/pathauto/pathauto.module (Line: 360)
Function: taxonomy_get_tree
Args: 70, 353707
File: (Line: )
Function: pathauto_taxonomy
Args: insert, term, Array
File: /var/www/vhosts/openjurist.org/includes/module.inc (Line: 471)
Function: call_user_func_array
Args: pathauto_taxonomy, Array
File: /var/www/vhosts/openjurist.org/modules/taxonomy/taxonomy.module (Line: 362)
Function: module_invoke_all
Args: taxonomy, insert, term, Array
File: /var/www/vhosts/openjurist.org/sites/all/modules/opencalais/calais.module (Line: 494)
Function: taxonomy_save_term
Args: Array
#5
#6
#7
Hey @mikeytown2.
Thanks for digging into it a bit. Lemme see if I can answer or at least address everything.
1) Yes, absolutely, I would probably recommend every table in your database be InnoDB, but I don't know your site's usage profile, the {calais_term} and {calais_term_*} tables definitely.
2) The GUID should be unique for each and every term in the calais_term table so I'm not sure exactly why we should break it out into a different table. At the very least it is not normalization, it's optimization.
3) Duplicate Updates, not sure where or why that is happening. Were you able to dig any deeper there? That would definitely qualify as a bug.
4) Another thing to look as is, are you also saving the RDF? If you do your {rdf_data_calais_node} table is likely enormous. I would recommend that, if you have a ton of content, don't save RDF until the RDF storage mechanism is more performant.
How many nodes are we talking about here?
#8
pathauto isue: #572604: If there is no taxonomy pattern, don't do any work
650K nodes
Not saving RDF
Once the table locks went away (switch to InnoDB) I couldn't find multiple queries. I would recommend hacking core and adding in a debug to file routine that saves each query. Thats how I figured out pathauto was eating up a lot of time. I put in in the _db_query function (see patch).
GUID isn't that slow after all; don't worry about it.
Try restoring a backup when its all InnoDB... #541728: InnoDB Engine & Slow Restores I'm very careful on what tables I switch to InnoDB.
#9
Subscribing....
#10