The current central query in tagadelic_get_weighted_tags counts each revision of a node. The attached patch corrects this by joining on distinct (tid, nid) pairs in {term_node} rather than the full table. Tested to work against MySQL 5.0 and Postgres 8.4.

Comments

brad.bulger’s picture

StatusFileSize
new947 bytes

it should be joining to node, otherwise even if you reduce it to unique node IDs, you're getting a count of all the nodes that have ever linked to a term, now or in past revisions.

brad.bulger’s picture

also, should this be counting unpublished nodes?

petrovnn’s picture

Thank you HorsePunchKid, you patch realy great!

Bèr Kessels’s picture

I need a benchmark for this before consider commiting it.

An uncached timer in MySQL, of the query, on a large dataset, will do. e.g.
SELECT SQL_NO_CACHE tid, name, vid....

Before and after will give you an estimated loss in performance. I suspect this addition makes tagadelic a factor 4 to N*4 slower. Where N is the average amount of revisions for each node. I'd love to be proven wrong.

Bèr Kessels’s picture

Status: Needs review » Needs work
brad.bulger’s picture

well, there's slow, and there's wrong. it'd seem to me that the one is unfortunate while the other isn't acceptable. but if my numbers were right enough, so the only change i saw was the slower, i might have a different opinion :)

i can see that if all the nodes have a similar number of revisions, then even if the raw numbers are high the proportions are still OK. but this came up for me in the context of some forum-like content, where some topics were being frequently edited while others were quiet. the terms linked to the active nodes were showing up with cloud weights far above what they should have been.

i did a comparison of the original query and one that joins to node where status = 1, and the latter actually ran a bit faster for me. without restricting it by status, the two ran about the same. but it's not a very large data set, and about 25% of our nodes are not published. EXPLAIN says it's going after the node table first in either case, so results might be different on a bigger db. (this was in mysql 5.0)

Bèr Kessels’s picture

So your benchmark shows the difference is small? That could indeed be the case, seeing that the node table is the primary to-be-searched and there are no conditions on the joined table.

If my interpretation of your comment is correct, then this patch improves the behaviour and has a tiny, mostly academic performance penalty.

If so, I will gladly commit it.

brad.bulger’s picture

my benchmark showed a small difference but my database is also small. a bigger database might show different results.

i also still wonder, should it be counting unpublished nodes? or is that a feature request?

Bèr Kessels’s picture

Then I keep this as "needs work" we still need a benchmark on a large dataset.

Creating such a dataset is easy: on a clean install run "devel generator" module to create tags and create many nodes. I am not famliar with creating many updates/revisions with that module, though.

benlotter’s picture

Status: Needs work » Needs review

Sorry I don't know how to write a patch but I've modified the code I used to fix the following issue (Show a list (x) of taxonomy terms with node counts) and now my Tagadelic is working perfectly. If someone wants to make a patch out of this or approve it for a release that would be great thanks.

Basically I adapted one line in tagadelic.module and now it works. The line in question is line 228 which in case this is a different line number for you, here is the old line.

$result = db_query_range('SELECT COUNT(*) AS count, d.tid, d.name, d.vid FROM {term_data} d INNER JOIN {term_node} n ON d.tid = n.tid WHERE d.vid IN ('. substr(str_repeat('%d,', count($vids)), 0, -1) .') GROUP BY d.tid, d.name, d.vid ORDER BY count DESC', $vids, 0, $size);

And now here is the new line which simply joins to the {node} table (JOIN {node} node ON node.nid = n.nid AND node.vid = n.vid) and adds a filter for published nodes only (AND node.status = 1) which is another issue with the current tagadelic.

<?php
$result = db_query_range('SELECT COUNT(*) AS count, d.tid, d.name, d.vid FROM {term_data} d INNER JOIN {term_node} n ON d.tid = n.tid JOIN {node} node ON node.nid = n.nid AND node.vid = n.vid WHERE d.vid IN ('. substr(str_repeat('%d,', count($vids)), 0, -1) .') AND node.status = 1 GROUP BY d.tid, d.name, d.vid ORDER BY count DESC', $vids, 0, $size);
?>
    I can already here the objections, so I'll guess I'll just answer them up front.
  • Yes, I only have a few terms and nodes so I don't know how many is reasonable before advising you not to go this route.
  • Yes, I have caching enabled so I'm not going to run this query very often.
  • Yes, they are all INNER JOINs though I haven't checked the tables for indexes, but you could if things got too big.
  • Okay, that's all the objections I can think of other than I have no idea how the rest of the module works because I just searched for that specific line, found it right away and fixed it in 5 minutes including my testing time. In other words, use as is...an amateur fix.

Thanks.

brad.bulger’s picture

Status: Needs review » Needs work

that's exactly what the patch that is already here is doing, thanks.

benlotter’s picture

Status: Needs work » Needs review

Well, not exactly. In my situation, I schedule a lot of content ahead of time. The patch that was submitted does not seem to account for unpublished nodes or rather it does count them in the weight. My solution only counts published content AND only one revision.

brad.bulger’s picture

ah, yes, you're right, the original patch didn't include the status check, because it was a different question, and i wasn't sure if there was a reason behind it.

this is all moot now, i think, as the join to the node table is already in the 1.x-dev code and the last patch for #411852: Don't count unpublished posts adds the status check. so it looks like this is all set to be done in the next release anyway.

brad.bulger’s picture

StatusFileSize
new964 bytes

in the meantime, this is an updated patch of 1.2

as far as performance concerns, what i saw is that adding the check of node.status changes the query to need to apply the WHERE clause against the node table. without it, a join to node.vid can run directly against the vid index. so i changed the join to the node table to use node.nid and node.vid - it uses the primary index of node in that case, and i think that that's a bit better when you are going to end up accessing the whole table vs just the index.

since it's cached either way this probably doesn't matter too much.

Bèr Kessels’s picture

Status: Needs review » Needs work

As I have pointed out on all the issues: this needs benchmarks. The patch introduces not just a slow Query, but it has the potence to bring down sites if not optimized. As such, I will not consider the patch, unless accompanied with benchmarks before and after on both small and large datasets.

Bèr Kessels’s picture

Status: Needs work » Fixed

This patch fails against the latest version.

Also, this patch removes the td.description. We must fetch the description in the latest versions.

Also, please use db_placeholders() instead of the extremely ugly str-repeat-substr pattern.

A slightly modified version is committed.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.