So I was doing some research, and I came across this:
By storing a DFS search tree, you can make SQL do some really cool stuff very quickly. Watch this:
So you have your nodes, maybe something like this: (node:list of children)
A:B,D,E
B:C,D
C:
D:E
E:
You build a DFS traversal tree, but allow visiting nodes more than once (creating a new record for each revisit.)
Your DFS tree will end up like this:
NODE LEFT RIGHT
A 1 16
B 2 9
C 3 4
D 5 8
D 10 13
E 6 7
E 11 12
E 14 15
Store this in the database.
Here's the cool part:
Suppose you want to find all paths to the root from a point.. Say, D. By using D's left and right in a where clause, you can do this very easily:
mysql> select * from testing where l<5 and r>8 order by l,r;
+------+------+------+
| name | l | r |
+------+------+------+
| a | 1 | 16 |
| b | 2 | 9 |
+------+------+------+
2 rows in set (0.00 sec)
mysql> select * from testing where l<10 and r>13 order by l,r;
+------+------+------+
| name | l | r |
+------+------+------+
| a | 1 | 16 |
+------+------+------+
1 row in set (0.00 sec)
Assuming you took weight into consideration while building the search tree, the first occurrence of the node in question will get you the "primary path" back to the root. Checking additional occurrences will get you alternate paths. (Anything not found in the DFS tree at all was an orphan, for obvious reasons)
This also works the other way.
B's children, recursively? No sweat.
mysql> select * from testing where l >2 and r < 9 order by l,r;
+------+------+------+
| name | l | r |
+------+------+------+
| c | 3 | 4 |
| d | 5 | 8 |
| e | 6 | 7 |
+------+------+------+
3 rows in set (0.00 sec)
If left skips more than 1, that means it's the start of a new child path.
Third trick: Let's find all the childless nodes:
mysql> select * from testing where l+1=r order by l,r;
+------+------+------+
| name | l | r |
+------+------+------+
| c | 3 | 4 |
| e | 6 | 7 |
| e | 11 | 12 |
| e | 14 | 15 |
+------+------+------+
4 rows in set (0.00 sec)
C and E both have no children. Obviously, there's other ways to accomplish this, but it's kind of neat to do it using the search tree.
Finally, let's find nodes that DO have children.
mysql> select * from testing where l+1<r order by l,r;
+------+------+------+
| name | l | r |
+------+------+------+
| a | 1 | 16 |
| b | 2 | 9 |
| d | 5 | 8 |
| d | 10 | 13 |
+------+------+------+
4 rows in set (0.00 sec)
A,B, and D. Whenever the difference between left and right was greater than 1, there were children.
So, I'm thinking of rewriting some of the trees stuff to be able to utilize this method. Any thoughts?
Comments
Comment #1
bdragon commentedI screwed up some escaping there somehow....
Repost.
So I was doing some research, and I came across this:
By storing a DFS search tree, you can make SQL do some really cool stuff very quickly. Watch this:
So you have your nodes, maybe something like this: (node:list of children)
A:B,D,E
B:C,D
C:
D:E
E:
You build a DFS traversal tree, but allow visiting nodes more than once (creating a new record for each revisit.)
Your DFS tree will end up like this:
Store this in the database.
Here's the cool part:
Suppose you want to find all paths to the root from a point.. Say, D. By using D's left and right in a where clause, you can do this very easily:
Assuming you took weight into consideration while building the search tree, the first occurrence of the node in question will get you the "primary path" back to the root. Checking additional occurrences will get you alternate paths. (Anything not found in the DFS tree at all was an orphan, for obvious reasons)
This also works the other way.
B's children, recursively? No sweat.
If left skips more than 1, that means it's the start of a new child path.
Third trick: Let's find all the childless nodes:
C and E both have no children. Obviously, there's other ways to accomplish this, but it's kind of neat to do it using the search tree.
Finally, let's find nodes that DO have children.
A,B, and D. Whenever the difference between left and right was greater than 1, there were children.
So, I'm thinking of rewriting some of the trees stuff to be able to utilize this method. Any thoughts?
Comment #2
bdragon commentedCould get even more flexibility by keeping track of depth on each record...
Tomorrow, I think I'll try to figure out if making changes to children and parents can be done incrementally. That would be nice, because then we could avoid the cost of completely rebuilding the tree every time. I was thinking of something involving finding an offset and offsetting all values in a certain range by a certain number... Computing the number and range I don't know how to do yet, or even if it's possible. I'll get a friend of mine that's good at algorithms to help me.
--Brandon
Comment #3
bdragon commentedHmm, I just read that incremental changes ARE a possibility with this model. Excellent.
Comment #4
Jaza commentedLooks very promising! If this offers a big performance boost to the category module, then it would certainly be worthwhile implementing it. More efficient tree traversal is also something that other parts of Drupal could benefit from, including the comment system, the menu system, and (of course) the core book and taxonomy systems. I am more than happy for the category module to be a testing ground for performance increases that can then be ported into core.
Just a few things to keep in mind:
Comment #5
bdragon commentedWorked on this more this morning.
Theoretical gains:
Stuff I haven't figured out yet, but I'm pretty sure is reasonable:
More ideas:
Comment #6
bdragon commentedHmm.... I just realized that making copies of subtrees and modifying them isn't as expensive as I thought.
By removing depth culling from the tree (which would cause it to contain EVERYTHING that's not orphaned), handling depth manipulations and tree generation is just a case of INSERT INTO ... SELECT queries and a little bit of math. (So the number of queries can be minimized...)
Comment #7
bdragon commentedstupid escaping...
...removing depth culling from the <root> tree...
Comment #8
bdragon commentedHmm, it appears I was confusing TOC depth with category depth.
So each container can only have a single depth? That makes things quite a bit easier.
Comment #9
bdragon commentedHmm, never mind about that last comment, my test data was buggy.
Comment #10
bdragon commentedATTN: Ramdak and other non-technical people who have used category a lot.
This question is a question about the behavior of TOC settings.
Ok, I suppose this one should be answered by someone who's actually used category to categorize stuff. I've spent so much more time with the hood off and a screwdriver in my hand than actually *using* category that I'm not sure how the GUI part works ;)
Assume this:
Container: "A", children "B", "D", "G"
Category: "B", children "C"
Category: "C"
Category: "D", children "C","F"
Category: "E", children "G"
Category: "F", children "E"
Category: "G"
So the category tree of the container would look like this, ignoring depth constraints. (Yell if I'm wrong here)
If the container's TOC depth is "2", the view at "A" should look like this, right?
And viewing "D" with the TOC depth being "2" would get you this, right?
Comment #11
bdragon commentedOk, talked it over with my friend, and it looks valid.
I'll do more testing and write some routines. Things are looking good. :)
Also, depth culling is "free" during query, so I believe that we will only have to store one tree per container :)
In addition, doing child queries can be done from any instance of a cid in the tree.
Also also, all of this IS compatible with weights and having categories in a certain order.
Example:
Node:Children
A: C,B
B: C
C: D
D: E
E:
Visual tree:
DFS table:
Say we wanted to find one level of children from C. There are two points we could use, and both work equally well.
C: Left 2, Right 7, Depth 1
C: Left 9, Right 14, Depth 2
One level of children from C fufill the following criteria:
Child left > C's left
Child right < C's right
Child depth <= C's depth + 1
Records fufilling the criteria for the first C (L>2,R<7,D<=2):
D: Left 3, Right 6, Depth 2
Records fufilling the criteria for the second C (L>9,R<14,D<=3):
D: Left 10, Right 13, Depth 3
That's nice, you may say, but how about something more complex?
Two levels of children from A:
A: Left 1, Right 16, Depth 0
Criteria: (L>1,R<16,D<=2)
Results:
Visually:
So, by keeping track of depth, we can easily do TOC culling on the SQL side, wherever your point of reference lies.
Comment #12
bdragon commentedAs a teaser for what my next post is going to be about:
How to add or remove ANY parent->child relationship from the tree with only two very simple and fast queries (even adding a new parent to a category that has 50000 children categories, or removing the last parent from a category with 50000 children categories) :-)
I haven't come up with the pseudocode yet, I have the theory banged out though, just have to get it down on paper :-)
Comment #13
bdragon commentedOk, this isn't what I promised, but I got sidetracked a little.
~~~How to swap adjacent siblings~~~
A: Left 2, Right 3
B: Left 4, Right 7
Desired result:
A B -> B A
Procedure:
Adiff = SELECT (R - L) WHERE N='A' LIMIT 1;
Bdiff = SELECT (R - L) WHERE N='B' LIMIT 1;
UPDATE TESTING SET L=L+(Adiff-Bdiff), R=R+(Adiff-Bdiff) WHERE L>=2 AND R<=3;
UPDATE TESTING SET L=L+(Adiff+Bdiff), R=R+(Adiff+Bdiff) WHERE L>=4 AND R<=7;
Comment #14
bdragon commentedUm, I think I got a little mixed up there.
The SQL queries are backwards, B is subtraction and A is addition.
Comment #15
TheWhippinpost commentedEr... just cheering from the sidelines bdragon - Ain't got a feckin clue what you're rambling on about really but sounds exciting ;)
Comment #16
bdragon commentedOk, looks like I was totally off base there.
THIS I double checked against a non trivial tree by hand.
XY -> YX
XS=XR-XL
YS=YR=YL
XL,XR += YS+1
YL,YR -= XS+1
Comment #17
bdragon commentedI haven't tested this code at all, it's all theory so far. I'm not even sure if it is valid PHP, I've been doing all of this on paper...
This function (if it works) will graft a child onto a parent. (It will appear at the end of the parent's child list. It's also possible to put it at the front, but I think writing a function for shuffling children around based on the formula in the previous post would work better...)
I'll work on a companion function to add a sibling (so you can place a child at a certain point in the child list right away.)
Also, a function for actually putting stuff into the tree in the first place would be nice. :-D
Comment #18
bdragon commentedI'm starting to get comfortable with the way this all works. :-)
Still haven't tested the code, but I've factored the grafting out into a seperate function.
Coming up soon is:
Comment #19
bdragon commentedThe payoff, part 1:
This is a rewrite of category_location to use the DFS tree.
The queries are tested, but the PHP is not.
SQL overhead: Minimal (A key lookup and three ref lookups, number of rows equals number of links in the lightest path to root)
Network overhead: Minimal (Amount of extra data transferred: A single row for each hidden container that ends up getting thrown away -- Rewriting the SQL slightly would eliminate even this.)
PHP overhead: Minimal (Say goodbye to recursion.)
Of course, this won't be testable as such until the rest is done.
Comment #20
bdragon commentedI should be receiving some books in the mail shortly that will help me out with this. Stay Tuned ;-)
Comment #21
bdragon commented(Just got my books)
It appears to be a variation on the "Nested Set with Depth Model" described on P. 160 of "Trees and Hierarchies in SQL For Dummies"
My implementation has the additional property of non-unique node id's, due to the need for handling of multiple-parents.
This partially breaks the ease of finding parents, as to get all paths to the root, you need to query each instance of the node.
Children are still very simple, as all instances of the node have a matched subgraph. I've just been grabbing the first one.
However, usually it's just the lightest weight path that is needed, so grabbing the left and right of the first match will get the wanted results.
This headache is due to the ability of Category to support stuff like this:
A
--B
----C
----D
--E
----B
------C
------D
For the most part, we only need to work downwards, so it may not be too bad.
I'll investigate the various techniques in the books.
Comment #22
bdragon commentedI've been doing more thinking than coding on this.
Current code piggybacks on category_hierarchy. Performance comparison between that and a seperate table has not been done. Primary storage for hierarchy data will be (cid,parent) because that is way easier to repair by hand if needed, and it's easier to generate the lrd data from the cid,parent data than the other way around.
Here's a regen function that will walk the tree. It's a bit of a last resort thing, most operations won't require rebuilding the whole thing from scratch, but it's really simple and can repair damage done by the "smarter" methods if they break down.
The database table currently looks like this:
The code I'm writing is running on a test site at http://category-dev.rtk0.net/ .
Currently, all that is available is a page that shows what the hierarchy table looks like, but as I get more of this done, it'll gain more testing tools.
(as an aside, using a pager in a php page snippet is a nice way to view a table when one doesn't want to bother installing views :wink:)
Comment #23
bdragon commentedGuess I'll have to split the table after all. Forgot that >1 row is needed to represent (cid,parent) when using multiple parents.
Comment #24
bdragon commentedI have been working on this a lot lately.
I managed to make an iterative version of category_display_toc, among other things.
By tweaking category_load to not incur a category_get_parents() cache (I made a lite version that doesn't cache and only gets the parents of the requested cid), serving a depth-limited version of a container (which is optimized on the SQL side with the LRD code active, i.e. only rows we want are returned) takes <10ms. For comparison, the hit taken to do a full category_get_parents() call with the same data is between 900 and 1000 ms.
The entire page generation time was 320.19 ms (This is for a TOC of a container with 2277 categories, single random parents, depth 1.)
Test page is at:
http://category-dev.rtk0.net/index.php?q=node/14999
The new TOC code is only used for this test page, clicking on any category on the page will use the old rendering code. (I've found that it takes 3-4 seconds on average)
Here's another test. I turned the depth limiter off for the container (so it displays the entire tree of categories)
http://category-dev.rtk0.net/index.php?q=node/15000
I was getting a time between 3-6 seconds. (This is on a linode with a bunch of debugging stuff enabled, on a "real" site this would be faster.)
Comment #25
bdragon commentedHmm, odd... I just dropped a key that I didn't think I needed, and the page time went from <400ms to >2s... I readded the key and it didn't seem to change.
/me retraces his steps...
Comment #26
bdragon commented... And regenerating the tree fixed it. Odd.
Comment #27
bdragon commentedOk, it seems the query was executed in a different order.
For the record, this execution plan is good:
and this one is bad:
Comment #28
moonray commentedAfter seeing your post on using the "modified preorder tree traversals" I decided to use it for the tasks_advanced module. One question: how do you move a branch (change the sort order)?
I came up with the following formula.
UPDATE tree SET left = (left + [(b-right - b-left + 1)]), right = (right + [(b-right - b-left + 1)]) WHERE left BETWEEN [a-left] AND [a-right]
UPDATE tree SET left = (left - [(a-right - a-left + 1)]), right = (right - [(a-right - a-left + 1)]) WHERE left BETWEEN [b-left] AND [b-right]
a-left and a-right are the left and right of the node you're swapping.
b-left and b-right are the left and right of the node you're swapping with.
The problem is that this requires the two SQL statements to be run in the same transaction (simultaneously). And transactions aren't an option with most drupal setups.
How would you tackle this? Pull a list of IDs before running these on specific IDs? That would require 4 queries to be run instead of 2. :-S
Comment #29
moonray commentedbtw,
Depth can be pulled from the database on the fly. But it might impact performance. (not sure)
example from http://dev.mysql.com/tech-resources/articles/hierarchical-data.html:
Comment #30
bdragon commentedAlso, I have to deal with multiple parents, so a cid can have multiple depths.
As far as shuffling subtrees go, I'd find out which is bigger, make a hole that size next to the smaller one, shift the bigger one in to place, shift the smaller one into place. That would leave the tree consistent between queries. I wouldn't even worry about fixing the two gaps that occur after doing this, it's not like gaps in the numbering are a problem or anything...
I was thinking about it, and I believe spreading things apart a bit would make moves significantly faster, due to not having to shift large parts of the table. Knowing how far to do so is the tricky part. (Perhaps an advanced option to say how many subcategories you are expecting this category to have in the near future, etc...) And then there's the matter of checking whether a gap exists....
Also, (again mainly because I have to handle multiple parents) sometimes it's faster to truncate the table and rewalk. Especially if we're talking about moving a 6000-wide node with 20 parents. ;-)
Comment #31
bdragon commentedI just read that article, and cringed at some of those queries. Definately not realtime.
~~ An aside about multiple parents ~~
The problem with multiple parents (Which I'm sure you're glad you don't have to worry about, moonray ;) is how they turn what would normally be a strict hierarchy into a general graph.
Imagine ten categories, 1-10. Now imagine each category has the set of parents 1 : (self-1).
a) How many rows does it take to store this?
b) Don't you wish Mysql 4.0 and 4.1 had stored procedures?
As an aside, I tested this situation, and observed the following:
a) It's damn slow to generate.
b) It makes a very impressive table of contents.
c) Isn't this somewhere around O(n!)?
d) This would be much quicker if I turned off indexes temporarily.
e) This would be much quicker as a bulk insert.
f) I was originally going to do the alphabet, A-Z, but I gave up somewhere in the middle due to the exponential increase in the amount of time I had to wait for the table rebuild to finish.
g) Anyone using multiple parents in this manner is crazier than I am.
Comment #32
moonray commentedThe only problem with leaving holes is that you can no longer derive stats such as amount of children, etc. from the left and right values.
Comment #33
moonray commentedI finally finished my implementation of the Modified Preorder Tree Traversal in the tasks_advanced module. For the curious: http://drupal.org/cvs?commit=46311.
Comment #34
pkej commentedHave you ever read the following articles? If not, read them now:
http://www.dbmsmag.com/9603d06.html
http://www.dbmsmag.com/9604d06.html
http://www.dbmsmag.com/9605d06.html
There you will find the pseudocode for merging two trees, and more.
His book on the subject might also be a nice fit in any sql library: http://www.elsevier.com/wps/find/bookbibliographicinfo.cws_home/702605/d...
Paul
Comment #35
pkej commentedOK, here is the ultimate MYSQL solution to most problems with the NSM Tree:
http://dev.mysql.com/tech-resources/articles/hierarchical-data.html
Hope this helps you with your work.