Active
Project:
Drupal core
Version:
main
Component:
entity system
Priority:
Normal
Category:
Plan
Assigned:
Unassigned
Reporter:
Created:
7 Dec 2008 at 19:05 UTC
Updated:
24 Mar 2025 at 17:08 UTC
Comments
Comment #1
driki_ commented: patch file
Comment #2
driki_ commented: patch file
Comment #3
driki_ commented: txt version of the test.inc.php file for testing purposes.
Comment #4
catch commented: So some more background on this. Since the EOL taxonomy sprint, we've been discussing a lot of ways to denormalise taxonomy module's hierarchy storage so we don't need to do crazy recursive functions any more. There's the http://drupal.org/project/leftandright module by sdrycroft which implements nested sets, and we already have some materialised path in core with the menu system and comment's threading. Materialised path has storage issues with very deep taxonomies, not to mention multiple parents. Nested sets need to be rebuilt when something is inserted or moved within the tree.
However, the nested sets keyed with rational numbers implementation allows for the light storage needs and fast retrieval of nested sets, without the terrible update woes, since there's always an available rational number you can use. This gives us the best of both approaches, and while the patch needs tidying up, it doesn't look like too much code weight either.
Suggested steps for this:
* Get the patch in line with core coding standards and using dbtng for query building.
* At the same time, write tests for taxonomy_get_tree(), taxonomy_get_children() and taxonomy_get_parents() (we have some coverage, could use more).
* Convert taxonomy_get_tree() to use the new API, hopefully combined with multiple load so we only get the tree then load the objects into it.
* Look at comments and other core hierarchies to see if they could benefit.
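For reference, the rational keying from Hazel's paper can be sketched in a few lines. This is an illustrative Python translation, not the patch's PHP code, and the function names are mine. Each node gets a fraction nv/dv plus the next sibling's fraction snv/sdv, and every descendant's key falls strictly between the two:

```python
from fractions import Fraction

def encode(path):
    """Compute the Hazel-style rational key for a node.

    `path` is a non-empty list of 1-based child positions from the
    root, e.g. [2, 1] is the first child of the root's second child.
    Returns (nv/dv, snv/sdv): the node's own key and the next
    sibling's key, which is also the exclusive upper bound for every
    descendant of this node.
    """
    nv, dv = 0, 1    # node value: numerator / denominator
    snv, sdv = 1, 0  # sibling value: the next sibling's key
    for c in path:
        nv, dv = nv + c * snv, dv + c * sdv
        snv, sdv = nv + snv, dv + sdv
    return Fraction(nv, dv), Fraction(snv, sdv)

def is_descendant(node_path, ancestor_path):
    """A node descends from an ancestor iff its key lies strictly
    inside the ancestor's (key, sibling key) interval."""
    key, _ = encode(node_path)
    lo, hi = encode(ancestor_path)
    return lo < key < hi
```

For example, encode([1, 1]) gives 3/2, which lies between its parent's key 1/1 and the parent's sibling key 2/1, so descendant queries are a single pair of inequality conditions. A deep nine-level path such as 8.32.15.15.4.1.3.43.4 already yields a numerator around 1.5 billion and a denominator around 170 million, which is why key width becomes a recurring concern in this discussion.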
Comment #5
pwolanin commented: catch - if this is really the right solution, then we should move menu links to it too?
Comment #6
catch commented: pwolanin: if it does everything that paper promises, and especially if we can distill it into an easy to use API, then it'd be great to have everything using the same system. Menu is still pretty fresh though, so that seems like a consistency patch to fix the monsters after this goes in.
Comment #7
wim leers commented: Now *this* is interesting stuff! First time that I'll actually *like* reading a mathematical paper :)
Thanks drico & catch!
Subscribing.
Comment #8
chx commented: pwolanin, wtf? When I wanted to add a very similar notation to the menu system, you ran screaming. See http://drupal.org/node/141866/revisions/159158/view .
The problem with this system (and any similar one) is that we need to understand what depth we can describe with floats as used by the database systems supported by Drupal. That will be fun. The patch is full of whitespace errors as well. I would also ask Mr. Tropashko (he answers emails :) ) what he thinks of Mr. Hazel's encoding.
Comment #9
driki_ commented: Hi
I will be working tonight to make the patch more Drupal 7 compliant (query, whitespace, etc.).
About the float problem, I used the bcdiv() PHP function and bcscale() to determine precision on the float value.
So for the float value field I'm thinking of making it a varchar(30) instead of a DOUBLE(10,9). That would maybe clear the float depth problem.
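chx's question about representable depth can be made concrete. A small Python sketch (the specific fractions are made up for illustration; PHP's bcmath, via bcdiv()/bcscale(), is the exact-arithmetic counterpart): two distinct keys deep in the tree can sit closer together than a 64-bit double can resolve, so ordering by a raw DOUBLE column eventually breaks.

```python
from fractions import Fraction

# Two distinct rational keys whose exact values differ by about 1e-18,
# far below the resolution of a 64-bit double near 1.0 (about 2.2e-16).
a = Fraction(10**9 + 1, 10**9)
b = Fraction(10**9 + 2, 10**9 + 1)

assert a > b                             # exact arithmetic keeps the order
assert abs(float(a) - float(b)) < 1e-15  # as doubles they are (nearly) equal
```

Exact integer comparison of nv/dv pairs (cross-multiplying) or arbitrary-precision decimals sidestep this entirely, at the cost of wider columns.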
Comment #10
sdrycroft commented: Correct me if I'm wrong, but a path of 8.32.15.15.4.1.3.43.4 (an example from the Catalogue of Life classification) would give rise to values for the denominator and numerator of ~170M and ~1.5G respectively. This means that we'd be forced to use the float data type, rather than int, resulting in slower queries. I've investigated using the Tropashko encoding before, and found it resulted in much slower search queries compared to both regular nested sets and materialised path tree encoding.
Comment #11
driki_ commented: I will make a test with this value and give feedback.
Comment #12
driki_ commented: Hi
This is a new patch file, more compliant I hope. And also a new test file, using values given by sdrycroft.
I needed to use a large bcscale() value, as suspected.
For the test I ran, I inserted 2454 rows in 8 seconds, and moved 2105 rows inside the tree in 3s.
Comment #13
chx commented: You do not need to build relatively simple SQL queries like this -- unless you want to make sure they are alterable, which is not the case here IMO.
count($parents->parent) > 0 is hardly needed; just $parents->parent is enough. Kindly please run coder module over your code, it's a pain to read! Edit: you are using normal maths operators instead of bcmath in places? Won't that be a problem?
Comment #14
driki_ commented: Hello,
Here is a coder compliant (I hope) patch.
I'm gonna make some optimizations now, add some other functions, and correct bugs.
Any remarks appreciated.
Comment #15
catch commented: This is looking better - some more code style issues though:
All comments should start with a capital letter and end with a full stop.
Is it worth putting the database table creation into system.install at this point? If so it needs to use schema API - can copy the way cache does it.
pflv, nv, dv etc. - all of these need to be unabbreviated, for both variable names and column names.
We don't use C style comments, // is fine.
Did you do any benchmarking on tree selection yet, or just insert/update?
@chx, we need to use db_select() for everything here because $table is dynamic - that gets us pluggable/re-usable tree storage, the same as cache.inc.
Comment #16
driki_ commented: @catch: thank you very much for your reply. I'm gonna add some more comments, the right way :). And I'm gonna benchmark some tree retrieval also.
Comment #17
driki_ commented: Just for info:
Retrieving the tree of a term took 0.03s with a tree of 3 parents and 1804 children.
Retrieving the tree of a term took 0.02s with a tree of 6 parents and 608 children.
Comment #18
driki_ commentedok naming seems ok now,
time for comments then : database creation, functions renaming (?) and parameter change,optimization.
Comment #19
chx commented: I think I mentioned already that using the SELECT query builder for these relatively simple queries is unnecessary bloat. Rule of thumb: only use the select builder if any code, be that yours or someone else's, needs to change the query. A static SELECT query can be run just fine through db_query().
Comment #20
driki_ commented: @chx, sorry, but like catch said, I thought it was needed because of '$table', which I think is useful for usage by modules other than taxonomy.
I'm OK to change it, but as I'm not really used to the database layer yet, can you check with catch that it is really not useful?
Thanks.
Comment #21
catch commented: chx: we can't/shouldn't do db_query('SELECT * FROM {' . $table . '} ...'), so we have to use the query builder here as a matter of good practice, even if it's unnecessary otherwise.
Comment #22
chx commented: Oh, I missed that! Fine then. Next question: why are we using db_and() when conditions are ANDed together by default anyway?
Comment #23
david strauss commented: I think this approach is over-complicated and unnecessary. The "crazy recursive functions" can be avoided by walking up the taxonomy tree when you save a node's terms. The term_node table would get an additional column like "steps." Terms directly applied to a node would have a value of zero in this column. As we recurse up the tree, the number of "steps" increases. We end up with a row mapping every term associated with the node, no matter how far up the tree, to the node.
We then index (tid, steps). It becomes a fast query to find items linked to a tid with any recursive depth limit.
The only tricky part is when you modify a term's parent(s). You have to clear the non-zero steps for the term and insert the updated parents.
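David's scheme is essentially a transitive-closure table. A minimal Python sketch of the row-building step he describes (the table and column names are his; the function name and data shapes are mine):

```python
def closure_rows(nid, term_ids, parents):
    """Build (nid, tid, steps) rows for the term_node-style table:
    steps=0 for terms directly applied to the node, increasing by one
    for each level walked up the tree.

    `parents` maps tid -> iterable of parent tids (multiple parents
    are allowed, so we track every distinct (tid, steps) pair).
    """
    rows = set()
    frontier = [(tid, 0) for tid in term_ids]
    while frontier:
        tid, steps = frontier.pop()
        if (nid, tid, steps) in rows:
            continue
        rows.add((nid, tid, steps))
        for parent in parents.get(tid, ()):
            frontier.append((parent, steps + 1))
    return rows
```

With an index on (tid, steps), "nodes under this term within depth N" becomes a single indexed range condition rather than a recursive walk.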
Comment #24
catch commented: @David, while that might help with taxonomy/term/%/$depth pages, it doesn't help at all for rendering the entire taxonomy tree, which is currently done for all non-autocomplete term selection (both node/add and term administration), and has a very wide range of use cases and implementations in contrib.
Comment #25
david strauss commented: @catch If the problem is having so many terms that it becomes slow to build a "select" element offering the whole tree and the question is "How can we build the tree faster?", we're asking the wrong question. If building the tree is slow, then we're putting way too many elements in the select element. We should find a different, better interface for selecting the terms.
(But I'm not saying this work won't see use elsewhere.)
Comment #26
catch commented: David, yes, for the form widgets I'm in favour of getting something along the lines of hierarchical select module or chx's menu chooser patch into core - ideally to replace taxonomy, book and menu select lists. These have both performance and usability issues as they stand now, and the usability ones hit long, long before the queries start getting expensive.
That still leaves http://api.drupal.org/api/function/forum_get_forums/7 which uses taxonomy_get_tree to load the actual hierarchy rather than just calculating term/node associations. Now for forum.module we could do something specific - use materialised path, cache the tree etc. - since there's no multiple parents and likely to be < 1000 discussion forums and limited depth - but that still leaves us with a crappy API for getting trees and branches of trees even if we work around it for all the places it's actually used in core. comment.module's threading desperately needs a revamp too, and it looks like this approach could also be applicable there.
Now it might turn out that this solution won't scale up as much as it needs to but if it does, then we potentially fix a lot of related issues in one go.
Comment #27
driki_ commented: Hello,
A new file, with some more comments that I hope will make it clearer.
Comment #28
david strauss commented: @catch Forums needs its own denormalized implementation for topic management. Even a perfectly optimized taxonomy system wouldn't fix the scalability issues in the forum module.
Comment #29
pwolanin commented: @catch - does the approach here support multiple parents?
If not, is it feasible to consider different implementations of materialized path? Assuming we used a 255-char varchar column (rather than individual int columns), stored numbers as hex, and had 32-bit ints as IDs (i.e. a max of 8 chars), plus used a separator character, we could be guaranteed to handle a depth of 28. MySQL 5.0+ and Postgres support longer varchar columns, and of course SQLite is its own beast in this regard.
I'm not sure what the performance of indexed varchar columns looks like in terms of matching and sorting, but at least all the algorithms involved are immediately understandable.
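pwolanin's arithmetic checks out: 8 hex characters per 32-bit ID plus a one-character separator costs 9 characters per level, so 28 levels fit inside 255 characters (9 × 28 − 1 = 251). A Python sketch of that encoding (the padding and separator choices are assumptions for illustration, not from any patch):

```python
def hex_path(ids):
    """Materialized path for 32-bit IDs: fixed-width hex, '/'-separated.
    Zero-padding makes lexicographic string order match numeric order,
    so an ordinary varchar index sorts the tree depth-first."""
    return '/'.join(format(i, '08x') for i in ids)

def path_is_descendant(path, ancestor):
    """A descendant's path is a strict prefix extension of its
    ancestor's, so descendant queries are a LIKE 'prefix/%' match."""
    return path.startswith(ancestor + '/')
```

For example, hex_path([1, 300]) is '00000001/0000012c', and every descendant of that node sorts between it and '00000002'.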
Comment #30
catch commented: @pwolanin - as far as I know it does - it'll treat each instance of the term as a different entity.
Comment #31
david strauss commented: @pwolanin: Official limits on VARCHAR column length are deceptive. Many relational databases only store the first X bytes in the main row record and store the rest elsewhere, which is little better than TEXT. Indexes have their own limits, independent of VARCHAR ones.
Comment #32
driki_ commented: Yes it does.
I'm quite sorry, I'm still finishing it; I hope it's gonna be OK in less than a week.
Comment #33
sun commented: Subscribing.
Comment #34
driki_ commented: Hello,
Here is an intermediate patch file.
I need to:
- optimize calculation in tree_move
- finish the tree_rebuild function, to be put into cron to clean the tree after a subtree move.
As we insert into a level as the rightmost child of that level, and manage child ordering with a position field, we can do subtree moves quite efficiently:
we don't need to move every subtree of a level to insert a subtree between two of them: we just update the position field of the other children of that level.
we can continue to use inequality comparisons for filtering descendants of a node.
we put all expensive calculations into cron, so insert, move and delete should be quick for the end user.
Then, as some joins are made at the moment in taxonomy, I think we need to pass all the filtering functions another table name, the field to join on, and the field we want to get back from the second table.
That way we won't have to make two queries to get a descendant and its name (as with taxonomy_term_data and taxonomy_term_hierarchy).
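A heavily simplified sketch of the position-field idea, assuming a per-level ordered list of term IDs (all names here are invented; the real patch works against database rows): reordering siblings rewrites only that level's position values, leaving the interval keys of the moved subtrees alone until the cron rebuild re-packs them.

```python
def move_child(child_tids, tid, new_index):
    """Reorder one level's children by renumbering only their
    `position` values. `child_tids` is the level's tids in current
    order; returns the (position, tid) rows to UPDATE for this level.
    No other level, and no subtree key, is touched."""
    child_tids = list(child_tids)
    child_tids.remove(tid)
    child_tids.insert(new_index, tid)
    return list(enumerate(child_tids))
```

So moving term 30 to the front of its three siblings costs three single-row updates, not a rekeying of three subtrees.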
Comment #35
natuk commented: Subscribing.
Comment #36
sdrycroft commented: Please tell me this isn't seriously being considered for core.
Comment #37
driki_ commented: Why?
It's pretty efficient even if it's not finished yet.
Comment #38
sdrycroft commented: I guess what needs to be asked is: what issues is this trying to fix with Drupal's current taxonomy? I wrote the leftandright (nested set) module because Drupal is unable to handle large taxonomies efficiently (or at all), and this patch doesn't fix that issue.
I've attempted to load a large (2M terms) tree into a database using the algorithm that this patch is using, only to find that it results in denominator and numerator values of ~10^15. This results in having to use bigint columns (or in the case of the patch, varchar), which in turn results in having to use a high level of precision for the float_value column. Comparing varchar columns against each other is never going to be as fast as comparing int columns. If we can't handle large taxonomies using this patch, then what is the point in using it at all?
"It's pretty efficient": really? I doubt that one very much; in fact, looking at the table layout, I'd say it's far from efficient.
Assuming the purpose of this patch is to improve the speed at which taxonomies can be loaded into Drupal, then it's a large trade-off against the speed at which taxonomies can be edited/added.
Drupal's current taxonomy system works for the vast majority of users, so why is it necessary to complicate it, and slow it down, for what appears to be zero gain? What needs looking at, instead of the tree storage algorithm, is how taxonomy is being used, and whether or not we can work without methods like taxonomy_get_tree().
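The growth sdrycroft describes is easy to reproduce with the paper's nv/dv recurrence (a Python sketch; the uniform all-8s path is an arbitrary illustrative case, not his dataset):

```python
def encode_int(path):
    """Integer numerator/denominator of the Hazel-style rational key
    for a node at the given list of 1-based child positions."""
    nv, dv, snv, sdv = 0, 1, 1, 0
    for c in path:
        nv, dv = nv + c * snv, dv + c * sdv
        snv, sdv = nv + snv, dv + sdv
    return nv, dv

# Each level multiplies the numerator by roughly c + 2, so a tree
# 20 levels deep along branch position 8 already overflows a signed
# 64-bit BIGINT column:
nv, dv = encode_int([8] * 20)
assert nv > 2**63
```

Exponential key growth with depth is intrinsic to this encoding; wide paths just make the base bigger, which is why 2M-term trees push the values past what native integer columns can hold.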
Comment #39
Mgccl commented: Subscribe.
Nice idea. So this is still nested sets instead of nested intervals?
I don't know if what I'm going to say also applies to nested sets, but nested intervals are also used to represent hierarchy.
The way to pick those rational numbers doesn't look optimal; the size grows pretty fast. I don't know if there are other ways to pick rational numbers - maybe nested intervals with Farey fractions? With those, 1M records produce numerators/denominators below 30k, at least on the data used in the article.
I could work on this for GSoC... if Farey fractions could work...
Comment #40
Scott Reynolds commented: Good to see people doing nested set work. I recently came across this concept when reading a MySQL book. Varchar columns seem to not be the way to go. I haven't had a chance to review this patch, but I am putting it on my radar.
I believe that this technique will allow the taxonomy landing pages to pull in all nodes with child tags of that term efficiently.
Comment #41
catch commented: Tagging.
Comment #42
catch commented: This won't get into Drupal 7; we need to seriously revisit this in Drupal 8 to get it properly sorted out, though.
Comment #43
giorgio79 commented: Consider this as well for hierarchical taxonomy handling:
http://drupal.org/project/lineage
I believe it is the best solution for large hierarchical taxonomy handling.
Comment #44
xjm commented: Ran across this while checking to see what issues are already open about the taxonomy API.
Re: #43. Um. I'm not sure lineage is the best solution for anything other than building taxonomy views more simply. It stores strings in database fields, regexes them to render, and loops over big arrays to sort and render hierarchical taxonomy lists. The storage method may be faster for large, complicated hierarchies, but the module itself, while useful for views plugins, contains redundancies and probably isn't that performant.
Comment #45
xjm commented: Crossposting #1207326: Refactor taxonomy hierarchy API for performance, consistency, and convenience.
Comment #46
pwolanin commented: We should look at entity_tree as the basis for a new (performant) tree API to be used across menu_links, book, taxonomy, comment, etc.
Comment #47
giorgio79 commented: Or at this: https://www.drupal.org/project/tree
Comment #48
Anonymous (not verified) commented: Good reading: http://bojanz.wordpress.com/2014/04/25/storing-hierarchical-data-materia...
Comment #49
pwolanin commented: @ivanjaros - we already use a variant on materialized path for menu links, since Drupal 6.
Comment #63
smustgrave commented: Thank you for creating this issue to improve Drupal.
We are working to decide if this task is still relevant to a currently supported version of Drupal. There hasn't been any discussion here for over 8 years which suggests that this has either been implemented or is no longer relevant. Your thoughts on this will allow a decision to be made.
Since we need more information to move forward with this issue, the status is now Postponed (maintainer needs more info). If we don't receive additional information to help with the issue, it may be closed after three months.
Thanks!
Comment #64
catch commented: This is still an overall scalability issue in Drupal core, and these days there are some contrib approaches, like https://www.drupal.org/project/entity_hierarchy which uses nested sets.
Moving back to 'active' but converting to a plan.