Refinement through nodes only sees terms of first 10 nodes
george@dynapres.nl - February 18, 2007 - 13:29
| Project: | Refine by taxonomy |
| Version: | HEAD |
| Component: | Code |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | needs review |
Description
Due to a bug (feature?) in taxonomy_select_nodes of taxonomy.module only the terms of the first variable_get('feed_default_items', 10) number of nodes is seen in refine_by_taxo_current_nids. When having more nodes a lot of terms are not seen in the refinement blocks!
Tested in Drupal 4.7.6. The Drupal 5.1 version of taxonomy.module behaves the same.

#1
This patch creates a new version of
taxonomy_select_nodes</codes> which returns all the nodes not just a selection. Based on the original of <code>taxonomy.moduleand modified a bit to suit the needs of refine_by_taxonomy (see comments in the code).#2
Removed some superfluous code and incorporated http://drupal.org/node/120265 which is needed for this patch to function.
#3
First, he DISTINCT from another (now committed) patch is in this patch too, please remove, it makes the patch no longer apply.
Second, I think your idea is a very bad one. Consider a site like Drupal, with hundreds of thousands nodes in certain tags. This module would make the CPU burn away within minutes.
The logic as it stands, returning all nodes without limits is not an option.
I have no idea how to solve this yet, but at the very least, we must work out some way to introduce your idea, without the severe performance penalty that comes with loading all the nodes.
#4
And another thing that concerns me, is that $wheres .= can grow immense. SQL can normally have only so much WHERE thingies before it starts trowing errors, we should at the very least avoid such errors, by limiting the amount somewhere.
#5
By using a bit more elaborate SQL query you can bypass most logic and directly select the related terms, like this:
select * from term_data td wheretd.vid = {current-vid} and exists(
select * from term_node tn inner join node n on tn.nid = n.nid where
n.status = 1 and n.moderate = 0 and tn.tid = td.tid and exists(
select * from term_node tn2 where tn.nid = tn2.nid and tn2.tid in ({current-tids})
)
);
This query selects all terms from a given vid which have at least one (published) node in another category than the current one(s). The only problem maybe that the query relies on the "exists" clause which is only present in MySQL 4.1 and up. The Drupal system requirements states: "Drupal will work on v3.23.17 and 4.0 but it is strongly suggested you use 4.1 or 5.0 for future compatibility with Drupal 6 which will drop support for older versions of MySQL" so this may not be too great a problem. Exists() is supported in PostgreSQL (at least in version 7.3 and up).
I'm not sure if this approach really is faster but it probably will.
#6
I mean: 'the query relies on subqueries' instead of 'the query relies on the "exists" clause'.
#7
I've spend quite some time to evaluate a solution using subqueries to see if it's any faster than the "pull in all the nodes" version. The bottom line is that the subquery version has quite a complicated SQL (and is MySQL version dependent) and isn't substantially faster.
I propose to make the maximum number of nodes to consider for the query configurable (say 10, 50, 100, 500, 1000, unlimited) and go for the "refine_by_taxo-find-more-terms-2.patch" solution. This way the site administrator can make a trade-off between completeness and speed.
If this is a viable solution I'll work out the patch.
#8
George, thank you for your comments in http://drupal.org/node/85726
As a workaround I made some code (a PHP snippet, not a patch of refine by taxo module) that's useful for me to show in a block related categories of another vocab (only AND).
Please take a look of this example (spanish block in the right listing locations) at http://www.inforo.com.ar/avisos/restaurants_bares_y_pubs/restaurantes . Has no limit in terms # and its very fast (only one select).
If this is useful for you, the snippet is in http://drupal.org/node/85726 post #18.
Regards,
Gustavo
#9
Gustavo, thank you for your contribution. Your snippet works but involves a costly join of term_node with itself. Also your query doesn't take in account the visibility of nodes. Take for example a category with one unpublished node. This category will show up in your query where it shouldn't be. To solve this you'd have to join in the node table (twice) to check for unpublished nodes. If you do this your snippet isn't (alas) any better than what refine_by_taxo does now.
#10
Thanks for the observation.
I'll add the status=0 in the where clause. I little busy now, but I'll test this addition tomorrow.
Gustavo
#11
I follow this motion. It makes sense to put it in the hands of the site owner to balance performance with functionality.
Thanks.
"I propose to make the maximum number of nodes to consider for the query configurable (say 10, 50, 100, 500, 1000, unlimited) and go for the "refine_by_taxo-find-more-terms-2.patch" solution. This way the site administrator can make a trade-off between completeness and speed."
#12
What is the current status on this? I tried the snippet and was fairly pleased until I discovered that although I could see the related terms from all nodes in a relevant sublist, once I tried to refine by one term in a specific free-tagging vocabulary, I was unable to further reduce my list by refining with yet another term from the same.
The refine_by_taxo module does what I want quite nicely (allowing me to pick term after term, combining the results so I get something like "cold" AND "wet" AND "windy", but with the serious limitation of only bringing me choices from the first results page of my list.
I understand that the proposed patch brings in what I need, but also reduces performance. What about the suggestion of allowing an admin some sort of load/performance balance setting? Also, knowing relatively very very little about SQL queries & subsequent load, would there be any way of seeking balance by putting the limit on the terms themselves instead of the nodes considered (for example, only bothering to check for terms that had been applied to >n nodes)?
#13
The bug is in Taxonomy module - just change the line in taxonomy.module to somthing like:
$result = db_query_range($sql, 0, variable_get('feed_default_items',255))
instead of 10
#14
Changing that line certainly works. I had a applied a patch before which added the $count variable which you could set to 0 for all. I had this working and on the last update to 5.2 I forgot about the change and updated the taxonomy module overriding the patch.
I cannot see to find this patch anywhere now. Do you know where this would be? I'm also curious why this hasn't made it into core yet, it is functionality that many people need...
#15
This is a patch which include george@dynapres.nl proposals (his patch and his idea of a settings for the query limit), without the patch on the DISTINCT (already commited and mentioned by Bèr) and with my own patch posted here (http://drupal.org/node/122570).
This is working for me so it couod be ready for commit, so give it a try.
#16
And this is the same patch I've just posted WITHOUT my own patch posted here (http://drupal.org/node/122570). So, if you have already applied this patch (http://drupal.org/node/122570), you would need to apply this second patch I'm posting now, instead of my first above. I hope am not confusing you.
#17
I won't have any proper time to test this patch. So feel free to test, review it. And please post back your expericences. That would help me a lot to get this committed!
#18
Wondering about progress on this.
I've just applied the patch in #15 to version 5.x-0.1, and the refine by blocks display only terms from the first 10 nodes that appear in the view.
Thanks!
#19
OK, so i just realized that the patch adds admin settings for the module where you can select the number of nodes taken in to account when populating refine by taxo blocks with links.
Have set it to 100 nodes.
Am testing with version version 5.x-0.1, and so far so good.
Thanks!