Each node takes longer to do. SQL Issue. Recommend disabling pathauto.module

mikeytown2 - September 8, 2009 - 22:35
Project:Calais
Version:6.x-3.2
Component:Code
Category:support request
Priority:normal
Assigned:Unassigned
Status:won't fix
Description

I'm doing a bulk process that will be running for quite some time. When I started the average time per node was around 15 seconds; now it's up to 20. I have about 400k rows in calais_term and 1.5M rows in calais_term_node. I'm 10% done.

Example Query from the process list...

SELECT t.tid, t . * , parent
FROM term_data t
INNER JOIN term_hierarchy h ON t.tid = h.tid
WHERE t.vid

Seems like it's missing WHERE t.vid = %d.

Also I'm getting a lot of table locks on the calais_term table. Can I convert this table to InnoDB?

Also I would recommend normalizing the data; create a GUID lookup table (?) and do searches on that result. It takes 2.48 seconds to run this command

SELECT * FROM calais_term WHERE vid = 70 AND guid = 'http://d.opencalais.com/pershash-1/87027943-556'

Transform that to this somehow
SELECT * FROM calais_term WHERE vid = 70 AND guid = 55555

Long story short, numbers are faster vs text. It also seems to be quite inefficient (multiple updates to the same row).
5808977  Query  2  Locked  UPDATE calais_term SET name = 'Proposed', vid = 49, tdid = 337434, guid = 'http://d.opencalais.com/g...'
5808978  Query  2  Locked  UPDATE calais_term SET name = 'Proposed', vid = 49, tdid = 337434, guid = 'http://d.opencalais.com/g...'

#1

mikeytown2 - September 9, 2009 - 04:38

2 ideas for calais_term table & GUID:
kill 'http://d.opencalais.com/pershash-1/' since it's varchar type (make it smaller).
create an index for GUID & name.

#2

mikeytown2 - September 8, 2009 - 22:47

First point about missing the %d is me being dumb... example query for that type

SELECT t.tid, t . * , parent
FROM term_data t
INNER JOIN term_hierarchy h ON t.tid = h.tid
WHERE t.vid =70
ORDER BY weight, name

This still takes a very long time though, since this returns 200k rows...

#3

mikeytown2 - September 9, 2009 - 09:27

This is now my longest running query

SELECT t.tid, t . * , parent
FROM term_data t
INNER JOIN term_hierarchy h ON t.tid = h.tid
WHERE t.vid

Shouldn't there be a limit on this???
http://api.drupal.org/api/function/db_query_range/6

#4

mikeytown2 - September 9, 2009 - 10:47

Issue has to do with
pathauto.module

File: /var/www/vhosts/openjurist.org/sites/all/modules/pathauto/pathauto.module (Line: 360)
Function: taxonomy_get_tree
Args: 70, 353707

File:  (Line: )
Function: pathauto_taxonomy
Args: insert, term, Array

File: /var/www/vhosts/openjurist.org/includes/module.inc (Line: 471)
Function: call_user_func_array
Args: pathauto_taxonomy, Array

File: /var/www/vhosts/openjurist.org/modules/taxonomy/taxonomy.module (Line: 362)
Function: module_invoke_all
Args: taxonomy, insert, term, Array

File: /var/www/vhosts/openjurist.org/sites/all/modules/opencalais/calais.module (Line: 494)
Function: taxonomy_save_term
Args: Array

#5

mikeytown2 - September 9, 2009 - 10:54
Title:Each node takes longer to do. SQL Issue» Each node takes longer to do. SQL Issue. Recommend disabling pathauto.module

#6

mikeytown2 - September 9, 2009 - 10:57
Status:active» needs review

#7

febbraro - September 9, 2009 - 17:38

Hey @mikeytown2.

Thanks for digging into it a bit. Lemme see if I can answer or at least address everything.

1) Yes, absolutely, I would probably recommend every table in your database be InnoDB, but I don't know your site's usage profile, the {calais_term} and {calais_term_*} tables definitely.

2) The GUID should be unique for each and every term in the calais_term table so I'm not sure exactly why we should break it out into a different table. At the very least it is not normalization, it's optimization.

3) Duplicate Updates, not sure where or why that is happening. Were you able to dig any deeper there? That would definitely qualify as a bug.

4) Another thing to look as is, are you also saving the RDF? If you do your {rdf_data_calais_node} table is likely enormous. I would recommend that, if you have a ton of content, don't save RDF until the RDF storage mechanism is more performant.

How many nodes are we talking about here?

#8

mikeytown2 - September 9, 2009 - 20:09

pathauto isue: #572604: If there is no taxonomy pattern, don't do any work
650K nodes
Not saving RDF
Once the table locks went away (switch to InnoDB) I couldn't find multiple queries. I would recommend hacking core and adding in a debug to file routine that saves each query. Thats how I figured out pathauto was eating up a lot of time. I put in in the _db_query function (see patch).
GUID isn't that slow after all; don't worry about it.
Try restoring a backup when its all InnoDB... #541728: InnoDB Engine & Slow Restores I'm very careful on what tables I switch to InnoDB.

AttachmentSize
db.patch 1.24 KB

#9

BenK - October 25, 2009 - 18:52

Subscribing....

#10

febbraro - November 3, 2009 - 02:11
Status:needs review» won't fix
 
 

Drupal is a registered trademark of Dries Buytaert.