http://drupal.org/node/114774#taxonomy-revisions

Thus, if you have some nodes with revisions, the count of the associated taxonomy terms will be inflated.

This patch does two things:
- Does a LEFT JOIN with the node table to include only the terms associated with the current revision of each node.
- Removes some unnecessary columns from the GROUP BY clause (unrelated to this issue).

Comments

cpugeniusmv’s picture

StatusFileSize
new1.14 KB

After reading some older issues (specifically #172054: In get weighted tags, why group by so many fields?), I see why the GROUP BY had those extra columns. Here's an updated patch that does not modify the behavior of the original GROUP BY clause.

Bèr Kessels’s picture

I would like to see a small benchmark, or at the very least a test on a larger database. I don't have any real-live Drupal6 sites yet.
Problem here is that the node table is slow and large, so joining against it should IMO be avoided at all costs.

Though, if that is the only solution to this problem, it will have to do.

Please, someone review?

cpugeniusmv’s picture

Here are the results from my (very) informal testing:

SELECT COUNT(*) FROM node
/* 5000 */

SELECT COUNT(*) FROM term_node
/* 75000 */

SELECT COUNT(*) FROM term_data
/* 500 */

/* WITH PATCH */
SELECT COUNT(*) AS count, td.tid, td.name, td.vid FROM node n LEFT JOIN term_node tn ON n.vid = tn.vid INNER JOIN term_data td ON tn.tid = td.tid WHERE td.vid IN (1,2,3) GROUP BY td.tid, td.name, td.vid ORDER BY count DESC

/* Query took 2.1832 sec */



/* ORIGINAL */
SELECT COUNT(*) AS count, d.tid, d.name, d.vid FROM term_data d INNER JOIN term_node n ON d.tid = n.tid WHERE d.vid IN (1,2,3) GROUP BY d.tid, d.name, d.vid ORDER BY count DESC

/* Query took 1.5072 sec */

This is a default install on MySQL5. Terms and nodes were generated with the Devel module. So there is a fairly significant impact on performance. Unfortunately, I can't think of any better solution.

Because this probably a fairly rare corner case, I suggest leaving it out of the module. If anyone needs more precise counts, they can always apply this patch themselves.

Bèr Kessels’s picture

Pity about hte performance. But yes, the penalty is really high.

However, though not being a SQL guru, I know some tricks:
If you left join always make the SELECT on the smallest table (one with the least records) and the JOIN on the largest.
e.g. SELECT * FROM node n LEFT JOIN node_image i ON n.nid = i.nid where i.size = ' large'
should perform a lot worse then
SELECT * FROM node_image i LEFT JOIN node n ON n.nid = i.nid where i.size = ' large'
where nodes has thousands and node_image contains a few (say ten) images.

The reasoning is that mysql
1 makes a temprorary table in memory containing all columns of both joined tables.
2 scans that for the where.

In the first case the temporary table has thousands of records, in the second, the table has only ten.

I must say that I have never tested this, but it makes sense.

Now, in our case that means the primary select should contain the term_data (returning at most 500 records) then a left join on term_node, then an inner join on node. That way, the part ( WHERE td.vid IN (1,2,3) GROUP BY td.tid, td.name, td.vid) can really minimise its possible results. We will not have to scan the entire node table, but only the ter_data.

This would result in something like (pseudo code!)
SELECT COUNT(*) AS count, td.tid, td.name, td.vid FROM term_data td LEFT JOIN term_node tn ON tn.tid = td.tid INNER JOIN node n ON n.vid = tn.vid WHERE td.vid IN (1,2,3) GROUP BY td.tid, td.name, td.vid ORDER BY count DESC

Could you, if you want and are interested, test this too?
And are you aware of the EXPLAIN function in mysql?
Running:
EXPLAIN SELECT COUNT(*) AS count, td.tid, td.name, td.vid FROM term_data td LEFT JOIN term_node tn ON tn.tid = td.tid INNER JOIN node n ON n.vid = tn.vid WHERE td.vid IN (1,2,3) GROUP BY td.tid, td.name, td.vid ORDER BY count DESC
will give you data such as " amount of rows scanned" and so forth. Often running explain shows you that joining one way around requires 10000 tables to be scanned while the reverse needs only 15. really usefull. http://dev.mysql.com/doc/refman/5.0/en/explain.html

yesct’s picture

over my head... did anyone try this, and/or have a patch to implement it this way?

cpugeniusmv’s picture

I'm still playing around with some other ideas to improve performance. I'll post back here if I find anything useful.

Bèr Kessels’s picture

If we join against the node table, we can also finally do a sql_rewrite, fixing all these gazillion i18n/permission and so on issues.
Could we not fix that in the same patch? IMO it does not impact the process there, right?

If this is possible, I will make a highlight of this patch and get people in here, to fix all these (over 40!) issues at once.

jstoller’s picture

IMO this is not a rare case at all. This is a major bug in the module which should be addressed. I'm working with a site right now that has many times more revisions than nodes. It's rather obvious that something's wrong when you click on one of the largest links in the cloud and there is only one node listed. And I agree with Bèr that that this patch has the potential to fix many other issues in one fell swoop.

jstoller’s picture

I should also point out that joining the node table adds the possibility of other useful features. For instance, in my case I need to restrict the cloud to counting terms used by one specific node type, even though the same vocabulary is used to categorize several different node types. The only way to accomplish this is to look in the node table.

neochief’s picture

StatusFileSize
new977 bytes

Prepared a new patch which:
- adds nodes join to the tail of the query (as Bèr suggested)
- provides db_placeholders() support
- throws away terms without nodes (HAVING COUNT(*) > 0)

Please, review.

svdoord’s picture

Just my couple of cents: I agree with jstoller that this is a very real and important bug. The tricky part is that you don't notice it unless you very carefully check the outcome of the tag cloud. Which, I think, most people don't do, assuming that it works as expected...

Bèr Kessels’s picture

this patch really needs some attention from testers. :)

svdoord’s picture

Ah, yes, forgot to mention that: patch works for me :-)

roderik’s picture

Title: SQL does not account for revision history of taxonomy terms maintained in Drupal 6 » SQL accounting for revision history and i18n
StatusFileSize
new1.33 KB

I took comment #7 to heart:
"we can also finally do a sql_rewrite, fixing all these gazillion i18n/permission and so on issues.
Could we not fix that in the same patch?"

(...also out of slight frustration that there was no i18n patch for D6 yet, in between the fifteen zillion issues talking about db_sql-rewrite.)
So now this is a patch for a few intermingled issues which should really be dealt with soon, to take some strain off the issue queue.

I rerolled the patch from #10 for the current 1.x-dev (accounting for the change from #393494: Adding title attribute with description to terms )
and:
- inserted the db_sql_rewrite() call as mentioned in #7;
- included the current language in the 'cache identifier'. (So that a block containing tags for nodes in language XX, is cached/shown when the active language is XX. If you don't do this, ONE block is cached, showing the tags for nodes in 'whatever language happened to be active when the cache was last cleared'.)
- incorporated #290863: Cache Retrieves Wrong Tags because it patches the same line.

I changed the issue title accordingly.

The 'include language in cache identifier' really should be added at the same time as the db_rewrite_sql call.
If this is too much changes for one patch, I'll open another issue for this. But in that case please please regard this issue 'reviewed and tested by the community' and commit the patch from #10 (but modified for the already-committed #393494, i.e. adding the description field).

Bèr Kessels’s picture

committed to D6 branch. Thanks!
Since I already committed #290863: Cache Retrieves Wrong Tags, I manually altered the module to work again.

Bèr Kessels’s picture

Status: Needs review » Fixed

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.