Closed (fixed)
Project:
Tagadelic
Version:
6.x-1.2
Component:
Code
Priority:
Normal
Category:
Bug report
Assigned:
Unassigned
Reporter:
Created:
19 Nov 2010 at 15:40 UTC
Updated:
19 Apr 2011 at 09:51 UTC
Jump to comment: Most recent file
The current central query in tagadelic_get_weighted_tags counts each revision of a node. The attached patch corrects this by joining on distinct (tid, nid) pairs in {term_node} rather than the full table. Tested to work against MySQL 5.0 and Postgres 8.4.
| Comment | File | Size | Author |
|---|---|---|---|
| #14 | tagadelic.b976674-1.2.patch | 964 bytes | brad.bulger |
| #1 | tagadelic.b976674.patch | 947 bytes | brad.bulger |
| tagadelic-6.x-1.2_norevs_0.patch | 956 bytes | HorsePunchKid |
Comments
Comment #1
brad.bulger commentedit should be joining to node, otherwise even if you reduce it to unique node IDs, you're getting a count of all the nodes that have ever linked to a term, now or in past revisions.
Comment #2
brad.bulger commentedalso, should this be counting unpublished nodes?
Comment #3
petrovnn commentedThank you HorsePunchKid, you patch realy great!
Comment #4
Bèr Kessels commentedI need a benchmark for this before consider commiting it.
An uncached timer in MySQL, of the query, on a large dataset, will do. e.g.
SELECT SQL_NO_CACHE tid, name, vid....Before and after will give you an estimated loss in performance. I suspect this addition makes tagadelic a factor 4 to N*4 slower. Where N is the average amount of revisions for each node. I'd love to be proven wrong.
Comment #5
Bèr Kessels commentedComment #6
brad.bulger commentedwell, there's slow, and there's wrong. it'd seem to me that the one is unfortunate while the other isn't acceptable. but if my numbers were right enough, so the only change i saw was the slower, i might have a different opinion :)
i can see that if all the nodes have a similar number of revisions, then even if the raw numbers are high the proportions are still OK. but this came up for me in the context of some forum-like content, where some topics were being frequently edited while others were quiet. the terms linked to the active nodes were showing up with cloud weights far above what they should have been.
i did a comparison of the original query and one that joins to node where status = 1, and the latter actually ran a bit faster for me. without restricting it by status, the two ran about the same. but it's not a very large data set, and about 25% of our nodes are not published. EXPLAIN says it's going after the node table first in either case, so results might be different on a bigger db. (this was in mysql 5.0)
Comment #7
Bèr Kessels commentedSo your benchmark shows the difference is small? That could indeed be the case, seeing that the node table is the primary to-be-searched and there are no conditions on the joined table.
If my interpretation of your comment is correct, then this patch improves the behaviour and has a tiny, mostly academic performance penalty.
If so, I will gladly commit it.
Comment #8
brad.bulger commentedmy benchmark showed a small difference but my database is also small. a bigger database might show different results.
i also still wonder, should it be counting unpublished nodes? or is that a feature request?
Comment #9
Bèr Kessels commentedThen I keep this as "needs work" we still need a benchmark on a large dataset.
Creating such a dataset is easy: on a clean install run "devel generator" module to create tags and create many nodes. I am not famliar with creating many updates/revisions with that module, though.
Comment #10
benlotter commentedSorry I don't know how to write a patch but I've modified the code I used to fix the following issue (Show a list (x) of taxonomy terms with node counts) and now my Tagadelic is working perfectly. If someone wants to make a patch out of this or approve it for a release that would be great thanks.
Basically I adapted one line in tagadelic.module and now it works. The line in question is line 228 which in case this is a different line number for you, here is the old line.
And now here is the new line which simply joins to the {node} table (JOIN {node} node ON node.nid = n.nid AND node.vid = n.vid) and adds a filter for published nodes only (AND node.status = 1) which is another issue with the current tagadelic.
I can already here the objections, so I'll guess I'll just answer them up front.- Yes, I only have a few terms and nodes so I don't know how many is reasonable before advising you not to go this route.
- Yes, I have caching enabled so I'm not going to run this query very often.
- Yes, they are all INNER JOINs though I haven't checked the tables for indexes, but you could if things got too big.
- Okay, that's all the objections I can think of other than I have no idea how the rest of the module works because I just searched for that specific line, found it right away and fixed it in 5 minutes including my testing time. In other words, use as is...an amateur fix.
Thanks.
Comment #11
brad.bulger commentedthat's exactly what the patch that is already here is doing, thanks.
Comment #12
benlotter commentedWell, not exactly. In my situation, I schedule a lot of content ahead of time. The patch that was submitted does not seem to account for unpublished nodes or rather it does count them in the weight. My solution only counts published content AND only one revision.
Comment #13
brad.bulger commentedah, yes, you're right, the original patch didn't include the status check, because it was a different question, and i wasn't sure if there was a reason behind it.
this is all moot now, i think, as the join to the node table is already in the 1.x-dev code and the last patch for #411852: Don't count unpublished posts adds the status check. so it looks like this is all set to be done in the next release anyway.
Comment #14
brad.bulger commentedin the meantime, this is an updated patch of 1.2
as far as performance concerns, what i saw is that adding the check of node.status changes the query to need to apply the WHERE clause against the node table. without it, a join to node.vid can run directly against the vid index. so i changed the join to the node table to use node.nid and node.vid - it uses the primary index of node in that case, and i think that that's a bit better when you are going to end up accessing the whole table vs just the index.
since it's cached either way this probably doesn't matter too much.
Comment #15
Bèr Kessels commentedAs I have pointed out on all the issues: this needs benchmarks. The patch introduces not just a slow Query, but it has the potence to bring down sites if not optimized. As such, I will not consider the patch, unless accompanied with benchmarks before and after on both small and large datasets.
Comment #16
Bèr Kessels commentedThis patch fails against the latest version.
Also, this patch removes the td.description. We must fetch the description in the latest versions.
Also, please use db_placeholders() instead of the extremely ugly str-repeat-substr pattern.
A slightly modified version is committed.