Don't include multiple revisions of a node in weight count [#976674]

Comment	File	Size	Author
#14	tagadelic.b976674-1.2.patch	964 bytes	brad.bulger
#1	tagadelic.b976674.patch	947 bytes	brad.bulger
	tagadelic-6.x-1.2_norevs_0.patch	956 bytes	HorsePunchKid

Comment #1

brad.bulger commented 7 December 2010 at 23:25

Status	File	Size
new	tagadelic.b976674.patch	947 bytes

it should be joining to node, otherwise even if you reduce it to unique node IDs, you're getting a count of all the nodes that have ever linked to a term, now or in past revisions.

Log in or register to post comments

Comment #2

brad.bulger commented 8 December 2010 at 00:44

also, should this be counting unpublished nodes?

Log in or register to post comments

Comment #3

petrovnn commented 17 February 2011 at 12:24

Thank you HorsePunchKid, you patch realy great!

Log in or register to post comments

Comment #4

Bèr Kessels commented 18 February 2011 at 09:10

I need a benchmark for this before consider commiting it.

An uncached timer in MySQL, of the query, on a large dataset, will do. e.g.
SELECT SQL_NO_CACHE tid, name, vid....

Before and after will give you an estimated loss in performance. I suspect this addition makes tagadelic a factor 4 to N*4 slower. Where N is the average amount of revisions for each node. I'd love to be proven wrong.

Log in or register to post comments

Comment #5

Bèr Kessels commented 18 February 2011 at 09:10

Status:

Needs review

» Needs work

Log in or register to post comments

Comment #6

brad.bulger commented 18 February 2011 at 20:06

well, there's slow, and there's wrong. it'd seem to me that the one is unfortunate while the other isn't acceptable. but if my numbers were right enough, so the only change i saw was the slower, i might have a different opinion :)

i can see that if all the nodes have a similar number of revisions, then even if the raw numbers are high the proportions are still OK. but this came up for me in the context of some forum-like content, where some topics were being frequently edited while others were quiet. the terms linked to the active nodes were showing up with cloud weights far above what they should have been.

i did a comparison of the original query and one that joins to node where status = 1, and the latter actually ran a bit faster for me. without restricting it by status, the two ran about the same. but it's not a very large data set, and about 25% of our nodes are not published. EXPLAIN says it's going after the node table first in either case, so results might be different on a bigger db. (this was in mysql 5.0)

Log in or register to post comments

Comment #7

Bèr Kessels commented 21 February 2011 at 15:12

So your benchmark shows the difference is small? That could indeed be the case, seeing that the node table is the primary to-be-searched and there are no conditions on the joined table.

If my interpretation of your comment is correct, then this patch improves the behaviour and has a tiny, mostly academic performance penalty.

If so, I will gladly commit it.

Log in or register to post comments

Comment #8

brad.bulger commented 21 February 2011 at 17:07

my benchmark showed a small difference but my database is also small. a bigger database might show different results.

i also still wonder, should it be counting unpublished nodes? or is that a feature request?

Log in or register to post comments

Comment #9

Bèr Kessels commented 22 February 2011 at 15:01

Then I keep this as "needs work" we still need a benchmark on a large dataset.

Creating such a dataset is easy: on a clean install run "devel generator" module to create tags and create many nodes. I am not famliar with creating many updates/revisions with that module, though.

Log in or register to post comments

Comment #10

benlotter commented 2 March 2011 at 21:58

Status:

Needs work

» Needs review

Sorry I don't know how to write a patch but I've modified the code I used to fix the following issue (Show a list (x) of taxonomy terms with node counts) and now my Tagadelic is working perfectly. If someone wants to make a patch out of this or approve it for a release that would be great thanks.

Basically I adapted one line in tagadelic.module and now it works. The line in question is line 228 which in case this is a different line number for you, here is the old line.

$result = db_query_range('SELECT COUNT(*) AS count, d.tid, d.name, d.vid FROM {term_data} d INNER JOIN {term_node} n ON d.tid = n.tid WHERE d.vid IN ('. substr(str_repeat('%d,', count($vids)), 0, -1) .') GROUP BY d.tid, d.name, d.vid ORDER BY count DESC', $vids, 0, $size);

And now here is the new line which simply joins to the {node} table (JOIN {node} node ON node.nid = n.nid AND node.vid = n.vid) and adds a filter for published nodes only (AND node.status = 1) which is another issue with the current tagadelic.

<?php
$result = db_query_range('SELECT COUNT(*) AS count, d.tid, d.name, d.vid FROM {term_data} d INNER JOIN {term_node} n ON d.tid = n.tid JOIN {node} node ON node.nid = n.nid AND node.vid = n.vid WHERE d.vid IN ('. substr(str_repeat('%d,', count($vids)), 0, -1) .') AND node.status = 1 GROUP BY d.tid, d.name, d.vid ORDER BY count DESC', $vids, 0, $size);
?>

Yes, I only have a few terms and nodes so I don't know how many is reasonable before advising you not to go this route.
Yes, I have caching enabled so I'm not going to run this query very often.
Yes, they are all INNER JOINs though I haven't checked the tables for indexes, but you could if things got too big.
Okay, that's all the objections I can think of other than I have no idea how the rest of the module works because I just searched for that specific line, found it right away and fixed it in 5 minutes including my testing time. In other words, use as is...an amateur fix.

Thanks.

Log in or register to post comments

Comment #11

brad.bulger commented 2 March 2011 at 22:55

Status:

Needs review

» Needs work

that's exactly what the patch that is already here is doing, thanks.

Log in or register to post comments

Comment #12

benlotter commented 3 March 2011 at 16:11

Status:

Needs work

» Needs review

Well, not exactly. In my situation, I schedule a lot of content ahead of time. The patch that was submitted does not seem to account for unpublished nodes or rather it does count them in the weight. My solution only counts published content AND only one revision.

Log in or register to post comments

Comment #13

brad.bulger commented 3 March 2011 at 21:10

ah, yes, you're right, the original patch didn't include the status check, because it was a different question, and i wasn't sure if there was a reason behind it.

this is all moot now, i think, as the join to the node table is already in the 1.x-dev code and the last patch for #411852: Don't count unpublished posts adds the status check. so it looks like this is all set to be done in the next release anyway.

Log in or register to post comments

Comment #14

brad.bulger commented 3 March 2011 at 21:45

Status	File	Size
new	tagadelic.b976674-1.2.patch	964 bytes

in the meantime, this is an updated patch of 1.2

as far as performance concerns, what i saw is that adding the check of node.status changes the query to need to apply the WHERE clause against the node table. without it, a join to node.vid can run directly against the vid index. so i changed the join to the node table to use node.nid and node.vid - it uses the primary index of node in that case, and i think that that's a bit better when you are going to end up accessing the whole table vs just the index.

since it's cached either way this probably doesn't matter too much.

Log in or register to post comments

Comment #15

Bèr Kessels commented 5 March 2011 at 16:14

Status:

Needs review

» Needs work

As I have pointed out on all the issues: this needs benchmarks. The patch introduces not just a slow Query, but it has the potence to bring down sites if not optimized. As such, I will not consider the patch, unless accompanied with benchmarks before and after on both small and large datasets.

Log in or register to post comments

Comment #16

Bèr Kessels commented 5 April 2011 at 09:48

Status:

Needs work

» Fixed

This patch fails against the latest version.

Also, this patch removes the td.description. We must fetch the description in the latest versions.

Also, please use db_placeholders() instead of the extremely ugly str-repeat-substr pattern.

A slightly modified version is committed.

Log in or register to post comments

Comment #17

19 April 2011 at 09:51

Status:

Fixed

» Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Log in or register to post comments

Don't include multiple revisions of a node in weight count

Comments

Comment #1

Comment #2

Comment #3

Comment #4

Comment #5

Comment #6

Comment #7

Comment #8

Comment #9

Comment #10

Comment #11

Comment #12

Comment #13

Comment #14

Comment #15

Comment #16

Comment #17

News items

Our community

Documentation

Drupal code base

Governance of community