Idea: Bloody fast trees [#87918]

So I was doing some research, and I came across this:
By storing a DFS search tree, you can make SQL do some really cool stuff very quickly. Watch this:

So you have your nodes, maybe something like this: (node:list of children)
A:B,D,E
B:C,D
C:
D:E
E:

You build a DFS traversal tree, but allow visiting nodes more than once (creating a new record for each revisit.)
Your DFS tree will end up like this:

NODE  LEFT  RIGHT
A     1     16
B     2     9
C     3     4
D     5     8
D     10    13
E     6     7
E     11    12
E     14    15

Store this in the database.

Here's the cool part:

Suppose you want to find all paths to the root from a point.. Say, D. By using D's left and right in a where clause, you can do this very easily:

mysql> select * from testing where l<5 and r>8 order by l,r;
+------+------+------+
| name | l    | r    |
+------+------+------+
| a    |    1 |   16 |
| b    |    2 |    9 |
+------+------+------+
2 rows in set (0.00 sec)
mysql> select * from testing where l<10 and r>13 order by l,r;
+------+------+------+
| name | l    | r    |
+------+------+------+
| a    |    1 |   16 |
+------+------+------+
1 row in set (0.00 sec)

Assuming you took weight into consideration while building the search tree, the first occurrence of the node in question will get you the "primary path" back to the root. Checking additional occurrences will get you alternate paths. (Anything not found in the DFS tree at all was an orphan, for obvious reasons)

This also works the other way.
B's children, recursively? No sweat.

mysql> select * from testing where l >2 and r < 9 order by l,r;
+------+------+------+
| name | l    | r    |
+------+------+------+
| c    |    3 |    4 |
| d    |    5 |    8 |
| e    |    6 |    7 |
+------+------+------+
3 rows in set (0.00 sec)

If left skips more than 1, that means it's the start of a new child path.

Third trick: Let's find all the childless nodes:

mysql> select * from testing where l+1=r order by l,r;
+------+------+------+
| name | l    | r    |
+------+------+------+
| c    |    3 |    4 |
| e    |    6 |    7 |
| e    |   11 |   12 |
| e    |   14 |   15 |
+------+------+------+
4 rows in set (0.00 sec)

C and E both have no children. Obviously, there's other ways to accomplish this, but it's kind of neat to do it using the search tree.

Finally, let's find nodes that DO have children.

mysql> select * from testing where l+1<r order by l,r;
+------+------+------+
| name | l    | r    |
+------+------+------+
| a    |    1 |   16 |
| b    |    2 |    9 |
| d    |    5 |    8 |
| d    |   10 |   13 |
+------+------+------+
4 rows in set (0.00 sec)

A,B, and D. Whenever the difference between left and right was greater than 1, there were children.

So, I'm thinking of rewriting some of the trees stuff to be able to utilize this method. Any thoughts?

Comments

Comment #1

bdragon commented 7 October 2006 at 08:30

I screwed up some escaping there somehow....

Repost.

So I was doing some research, and I came across this:
By storing a DFS search tree, you can make SQL do some really cool stuff very quickly. Watch this:

So you have your nodes, maybe something like this: (node:list of children)
A:B,D,E
B:C,D
C:
D:E
E:

You build a DFS traversal tree, but allow visiting nodes more than once (creating a new record for each revisit.)
Your DFS tree will end up like this:

NODE  LEFT  RIGHT
A     1     16
B     2     9
C     3     4
D     5     8
D     10    13
E     6     7
E     11    12
E     14    15

Store this in the database.

Here's the cool part:

Suppose you want to find all paths to the root from a point.. Say, D. By using D's left and right in a where clause, you can do this very easily:

mysql> select * from testing where l<5 and r>8 order by l,r;
+------+------+------+
| name | l    | r    |
+------+------+------+
| a    |    1 |   16 |
| b    |    2 |    9 |
+------+------+------+
2 rows in set (0.00 sec)
mysql> select * from testing where l<10 and r>13 order by l,r;
+------+------+------+
| name | l    | r    |
+------+------+------+
| a    |    1 |   16 |
+------+------+------+
1 row in set (0.00 sec)

This also works the other way.
B's children, recursively? No sweat.

mysql> select * from testing where l>2 and r<9 order by l,r;
+------+------+------+
| name | l    | r    |
+------+------+------+
| c    |    3 |    4 |
| d    |    5 |    8 |
| e    |    6 |    7 |
+------+------+------+
3 rows in set (0.00 sec)

If left skips more than 1, that means it's the start of a new child path.

Third trick: Let's find all the childless nodes:

mysql> select * from testing where l+1=r order by l,r;
+------+------+------+
| name | l    | r    |
+------+------+------+
| c    |    3 |    4 |
| e    |    6 |    7 |
| e    |   11 |   12 |
| e    |   14 |   15 |
+------+------+------+
4 rows in set (0.00 sec)

C and E both have no children. Obviously, there's other ways to accomplish this, but it's kind of neat to do it using the search tree.

Finally, let's find nodes that DO have children.

mysql> select * from testing where l+1<r order by l,r;
+------+------+------+
| name | l    | r    |
+------+------+------+
| a    |    1 |   16 |
| b    |    2 |    9 |
| d    |    5 |    8 |
| d    |   10 |   13 |
+------+------+------+
4 rows in set (0.00 sec)

A,B, and D. Whenever the difference between left and right was greater than 1, there were children.

So, I'm thinking of rewriting some of the trees stuff to be able to utilize this method. Any thoughts?

Comment #2

bdragon commented 7 October 2006 at 08:37

Could get even more flexibility by keeping track of depth on each record...

Tomorrow, I think I'll try to figure out if making changes to children and parents can be done incrementally. That would be nice, because then we could avoid the cost of completely rebuilding the tree every time. I was thinking of something involving finding an offset and offsetting all values in a certain range by a certain number... Computing the number and range I don't know how to do yet, or even if it's possible. I'll get a friend of mine that's good at algorithms to help me.

--Brandon

Comment #3

bdragon commented 7 October 2006 at 08:42

Hmm, I just read that incremental changes ARE a possibility with this model. Excellent.

Comment #4

Jaza commented 7 October 2006 at 09:09

Looks very promising! If this offers a big performance boost to the category module, then it would certainly be worthwhile implementing it. More efficient tree traversal is also something that other parts of Drupal could benefit from, including the comment system, the menu system, and (of course) the core book and taxonomy systems. I am more than happy for the category module to be a testing ground for performance increases that can then be ported into core.

Just a few things to keep in mind:

The algorithm has to fully handle nodes that have multiple parents. Taxonomy/category trees are not strict trees, i.e. strict trees (in graph theory, e.g. binary trees, b-trees) do not allow multiple parents per node. From what you've posted, looks like this won't be a problem.
Whatever gets implemented has to be capable of producing the same results as the current solution, just in a more efficient way. The category module still has a strict policy of being fully backwards-compatible through the book and taxonomy wrappers. Any changes to the internals of the category module have to be of such a nature, that they can be exposed through the wrappers in a backwards-compatible way. At the moment, the category module works pretty similar internally to book and taxonomy - but that can change, as long as it is still able to expose a compatible public API.

Comment #5

bdragon commented 7 October 2006 at 18:59

Worked on this more this morning.

Theoretical gains:

Grabbing the flat tree with 1 query, properly ordered. (Can anyone say pager_query? ;)
Deleting items is just a delete operation, a simple addition, and a second query to renumber the tree (very fast, constant subtraction on all numbers >x)
Adding items only requires walking the children, inserts for that, and a query to renumber the tree (very fast, constant addition on all numbers >x)
It is possible to do loop detection easily during traversal! Just store an array of nodes visited, and empty it whenever you backtrack!
Container depth limits can be enforced on the SQL side! I added a depth column to storage, and this made things very trivial.
The speed of adding or removing parents is dependent on the number of children of the current category
A parent list is a single SQL query
A child list is a single SQL query
Huge reduction in memory usage
Much potential for simplification in various parts of Category.

Stuff I haven't figured out yet, but I'm pretty sure is reasonable:

Changing weights
- This means pieces of the subtrees will swap places
- I'm not sure what the actual algorithm will be, will discuss it with my friend (he's better at algorithms than me)
Per -category depth limits
- Probabaly doable by having a set of subtrees, one for each category. This shouldn't have too bad of a performance penalty. It pretty much just changes the tree key from being cnid to being cid. The id of the container will still be used to grab the container's tree.
- Should make deep hierarchies easily doable.

More ideas:

JIT creation for category subtrees. Could spread the cost of category subtrees out a bit. Probabaly best to have as a toggle, and an option to "generate subtrees now" or "generate X subtrees during cron runs"
- The cost to create a subtree is equal to the number of children in the recursive children list. Could probabaly use a heuristic to determine whether it is cheaper to fixup all relevant trees vs/ nuke the whole subtree and rebuild when needed.

Comment #6

bdragon commented 7 October 2006 at 19:09

Hmm.... I just realized that making copies of subtrees and modifying them isn't as expensive as I thought.
By removing depth culling from the tree (which would cause it to contain EVERYTHING that's not orphaned), handling depth manipulations and tree generation is just a case of INSERT INTO ... SELECT queries and a little bit of math. (So the number of queries can be minimized...)

Comment #7

bdragon commented 7 October 2006 at 19:10

stupid escaping...

...removing depth culling from the <root> tree...

Comment #8

bdragon commented 7 October 2006 at 20:32

Hmm, it appears I was confusing TOC depth with category depth.

So each container can only have a single depth? That makes things quite a bit easier.

Comment #9

bdragon commented 7 October 2006 at 20:39

Hmm, never mind about that last comment, my test data was buggy.

Comment #10

bdragon commented 7 October 2006 at 21:01

ATTN: Ramdak and other non-technical people who have used category a lot.

This question is a question about the behavior of TOC settings.

Ok, I suppose this one should be answered by someone who's actually used category to categorize stuff. I've spent so much more time with the hood off and a screwdriver in my hand than actually *using* category that I'm not sure how the GUI part works ;)

Assume this:
Container: "A", children "B", "D", "G"
Category: "B", children "C"
Category: "C"
Category: "D", children "C","F"
Category: "E", children "G"
Category: "F", children "E"
Category: "G"

So the category tree of the container would look like this, ignoring depth constraints. (Yell if I'm wrong here)

B
--C
D
--C
--F
----E
------G
G

If the container's TOC depth is "2", the view at "A" should look like this, right?

B
--C
D
--C
--F
----E
G

And viewing "D" with the TOC depth being "2" would get you this, right?

C
F
--E
----G

Comment #11

bdragon commented 7 October 2006 at 22:32

Ok, talked it over with my friend, and it looks valid.

I'll do more testing and write some routines. Things are looking good. :)

Also, depth culling is "free" during query, so I believe that we will only have to store one tree per container :)

In addition, doing child queries can be done from any instance of a cid in the tree.

Also also, all of this IS compatible with weights and having categories in a certain order.

Example:
Node:Children
A: C,B
B: C
C: D
D: E
E:

Visual tree:

A
--C
----D
------E
--B
----C
------D
--------E

DFS table:

N   L   R   D
----------------
A   1   16  0
C   2   7   1
D   3   6   2
E   4   5   3
B   8   15  1
C   9   14  2
D   10  13  3
E   11  12  4

Say we wanted to find one level of children from C. There are two points we could use, and both work equally well.
C: Left 2, Right 7, Depth 1
C: Left 9, Right 14, Depth 2

One level of children from C fufill the following criteria:
Child left > C's left
Child right < C's right
Child depth <= C's depth + 1

Records fufilling the criteria for the first C (L>2,R<7,D<=2):
D: Left 3, Right 6, Depth 2

Records fufilling the criteria for the second C (L>9,R<14,D<=3):
D: Left 10, Right 13, Depth 3

That's nice, you may say, but how about something more complex?
Two levels of children from A:

A: Left 1, Right 16, Depth 0

Criteria: (L>1,R<16,D<=2)
Results:

N   L   R   D
----------------
A   1   16  0
C   2   7   1
D   3   6   2
B   8   15  1
C   9   14  2

Visually:

A
--C
----D
--B
----C

So, by keeping track of depth, we can easily do TOC culling on the SQL side, wherever your point of reference lies.

Comment #12

bdragon commented 7 October 2006 at 22:38

As a teaser for what my next post is going to be about:

How to add or remove ANY parent->child relationship from the tree with only two very simple and fast queries (even adding a new parent to a category that has 50000 children categories, or removing the last parent from a category with 50000 children categories) :-)

I haven't come up with the pseudocode yet, I have the theory banged out though, just have to get it down on paper :-)

Comment #13

bdragon commented 7 October 2006 at 23:53

Ok, this isn't what I promised, but I got sidetracked a little.

~~~How to swap adjacent siblings~~~

A: Left 2, Right 3
B: Left 4, Right 7

Desired result:
A B -> B A

Procedure:
Adiff = SELECT (R - L) WHERE N='A' LIMIT 1;
Bdiff = SELECT (R - L) WHERE N='B' LIMIT 1;

UPDATE TESTING SET L=L+(Adiff-Bdiff), R=R+(Adiff-Bdiff) WHERE L>=2 AND R<=3;
UPDATE TESTING SET L=L+(Adiff+Bdiff), R=R+(Adiff+Bdiff) WHERE L>=4 AND R<=7;

Comment #14

bdragon commented 8 October 2006 at 00:05

Um, I think I got a little mixed up there.
The SQL queries are backwards, B is subtraction and A is addition.

Comment #15

TheWhippinpost commented 8 October 2006 at 00:07

Er... just cheering from the sidelines bdragon - Ain't got a feckin clue what you're rambling on about really but sounds exciting ;)

Comment #16

bdragon commented 8 October 2006 at 00:35

Ok, looks like I was totally off base there.
THIS I double checked against a non trivial tree by hand.

XY -> YX

XS=XR-XL
YS=YR=YL
XL,XR += YS+1
YL,YR -= XS+1

Comment #17

bdragon commented 8 October 2006 at 03:40

I haven't tested this code at all, it's all theory so far. I'm not even sure if it is valid PHP, I've been doing all of this on paper...

This function (if it works) will graft a child onto a parent. (It will appear at the end of the parent's child list. It's also possible to put it at the front, but I think writing a function for shuffling children around based on the formula in the previous post would work better...)

I'll work on a companion function to add a sibling (so you can place a child at a certain point in the child list right away.)

Also, a function for actually putting stuff into the tree in the first place would be nice. :-D

/**
 * This function will graft $cid to the end of $parent's child list.
 */
function addChildToParent($cid,$parent) {

  $result = db_query('SELECT c.* FROM {testing} c WHERE c.cid=%d LIMIT 1',$child);
  // Any copy of the child node will do. Just grab the first one, it's got the lowest numbers anyway.
  $child = db_fetch_object($result);

  // Find the width of the child subtree.
  $width = $child->r - $child->l;
  // Find our offset so we can treat our child as a zero-based subtree. This makes the graft calculations easier.
	$offset = $child->l;
  // Same story for depth.
  $depth = $child->d;

	// Find our graft points.
  $junctions = array();
  $result = db_query('SELECT c.d,c.r FROM {testing} c WHERE c.cid=%d',$parent);
  while($j = db_fetch_object($result)) {
    $junctions[] = $j;
  }

	// Make holes for our grafts.
	// The holes appear at the end of the child list.
  foreach($junctions as $junction) {
    db_query('UPDATE {testing} c SET c.l = c.l + %d WHERE c.l >= %d',$width+1,$junction->r);
    db_query('UPDATE {testing} c SET c.r = c.r + %d WHERE c.r >= %d',$width+1,$junction->r);
  }

  foreach($junctions as $junction) {
    // Graft ourselves into place.
		db_query('INSERT INTO {testing} (cid,l,r,d) SELECT c.cid, c.l - %d + %d, c.r - %d + %d, c.d - %d + %d FROM {testing} c WHERE c.l >= %d AND c.r <= %d',
			$offset,$junction->r, // left
			$offset,$junction->r, // right
			$depth,$junction->d + 1,  // depth
			$child->l,$child->r // subtree selector (inclusive)
		);
	}
}

Comment #18

bdragon commented 8 October 2006 at 21:24

I'm starting to get comfortable with the way this all works. :-)

Still haven't tested the code, but I've factored the grafting out into a seperate function.

Coming up soon is:

Children order manipulation
Consistency check/repair function (easier than it sounds)
Search utility functions

/**
 * This function will add $cid to the end of $parent's child list.
 */
function addChildBack($cid,$parent) {
	// Adding a child to the end of the child list means our graft point is the parent's right.

	// There may be more than one graft point, due to the way multiple-parent relationships work.
	$result = db_query('SELECT c.d+1 AS depth, c.r AS point FROM {testing} c WHERE c.cid=%d',$parent);
	while($point = db_fetch_object($result)) {
    _testcode_graft($cid,$point->point,$point->depth);
  }
}

/**
 * This function will add $cid to the beginning of $parent's child list.
 */
function addChildFront($cid,$parent) {
	// Adding a child to the beginning of the child list means our graft point is the parent's
	// left + 1 (i.e. equal to the first child's left.)

	// There may be more than one graft point, due to the way multiple-parent relationships work.
	$result = db_query('SELECT c.d+1 AS depth, c.l+1 AS point FROM {testing} c WHERE c.cid=%d',$parent);
  while($point = db_fetch_object($result)) {
    _testcode_graft($cid,$point->point,$point->depth);
  }
}

/**
 * Perform a graft.
 *
 * @param $cid
 *   The cid of the subtree to graft from.
 * @param $point
 *   The point in the tree to graft onto.
 * @param $point_depth
 *   The depth to graft at. Since we're shortcutting the traversal, we need to know this.
 */
function _testcode_graft($cid,$point,$point_depth) {

  // Determine the width of our subtree.
	$width = db_result(db_query('SELECT c.r-c.l FROM {testing} c WHERE c.cid=%d',$cid));

  // Make a hole in the graph to fit the graft into.
  // Todo: Make this a single query.
  db_query('UPDATE {testing} c SET c.l = c.l + %d WHERE c.l >= %d',$width+1,$point);
	db_query('UPDATE {testing} c SET c.r = c.r + %d WHERE c.r >= %d',$width+1,$point);

	// We need to wait until after the holes are made to fetch stats about the subtree.
  // It may have moved during the hole creation.
	$child = db_fetch_object(db_query('SELECT c.l,c.r,c.d FROM {testing} c WHERE c.cid=%d',$cid));
	// Calculate the offset needed to copy the subgraph to the correct location.
	$offset = $point - $child->l;
	$depth_offset = $point_depth - $child->d;

  // Do the actual graft.
  db_query('INSERT INTO {testing} (cid,l,r,d) SELECT c.cid, c.l + %d, c.r + %d, c.d + %d FROM {testing} c WHERE c.l >= %d AND c.r <= %d',
		$offset,$offset, // Translate the lefts and rights to the correct number
		$depth_delta,    // The caller of the graft function is responsible for using the correct depth. Since a graft doesn't care whether
                     // we're making children or siblings, we rely on the caller to set the delta correctly.
    $child->l,$child->r // And this is the range to copy from. It includes the root of the subgraph (hence the = signs in the query)
  );
}

Comment #19

bdragon commented 8 October 2006 at 23:49

The payoff, part 1:

This is a rewrite of category_location to use the DFS tree.
The queries are tested, but the PHP is not.

SQL overhead: Minimal (A key lookup and three ref lookups, number of rows equals number of links in the lightest path to root)
Network overhead: Minimal (Amount of extra data transferred: A single row for each hidden container that ends up getting thrown away -- Rewriting the SQL slightly would eliminate even this.)
PHP overhead: Minimal (Say goodbye to recursion.)

Of course, this won't be testable as such until the rest is done.

function testcode_category_location($nid, $hidden=FALSE) {
	$nodes = array();

	// Query for the l and r of the first occurrence.
	$result = db_query_range('SELECT c.l,c.r FROM {testing} c WHERE c.cid=%d ORDER BY c.l',0,1));

	if(db_num_rows($result)) {
		$src = db_fetch_object($result);
		$hidden_sql = $hidden ? '' : 'AND cc.hidden_cont = 0';
		$result = db_query('SELECT c.*, n.title FROM {category} c INNER JOIN {node} n ON c.cid = n.nid INNER JOIN {testing} d ON d.cid=c.cid LEFT JOIN {category_cont} cc ON c.cnid=cc.cid WHERE d.l < %d AND d.r > %d ORDER BY d.l'.$hidden_sql,$src->l,$src->r);
		
		while($n = db_fetch_object($result)) {
			$nodes[] = $n;
		}
	}
	return $nodes;
}

Comment #20

bdragon commented 10 October 2006 at 16:17

I should be receiving some books in the mail shortly that will help me out with this. Stay Tuned ;-)

Comment #21

bdragon commented 10 October 2006 at 17:51

(Just got my books)

It appears to be a variation on the "Nested Set with Depth Model" described on P. 160 of "Trees and Hierarchies in SQL For Dummies"

My implementation has the additional property of non-unique node id's, due to the need for handling of multiple-parents.

This partially breaks the ease of finding parents, as to get all paths to the root, you need to query each instance of the node.

Children are still very simple, as all instances of the node have a matched subgraph. I've just been grabbing the first one.

However, usually it's just the lightest weight path that is needed, so grabbing the left and right of the first match will get the wanted results.

This headache is due to the ability of Category to support stuff like this:

A
--B
----C
----D
--E
----B
------C
------D

For the most part, we only need to work downwards, so it may not be too bad.

I'll investigate the various techniques in the books.

Comment #22

bdragon commented 18 October 2006 at 04:44

I've been doing more thinking than coding on this.

Current code piggybacks on category_hierarchy. Performance comparison between that and a seperate table has not been done. Primary storage for hierarchy data will be (cid,parent) because that is way easier to repair by hand if needed, and it's easier to generate the lrd data from the cid,parent data than the other way around.

Here's a regen function that will walk the tree. It's a bit of a last resort thing, most operations won't require rebuilding the whole thing from scratch, but it's really simple and can repair damage done by the "smarter" methods if they break down.

/**
 * Completely rebuild the LRD tree using the (cid,parent) data.
 * Do not use it unless you fubared the tree.
 * This function will remove holes, use category_lrd_optimize() to regain insertion speed.
 */
function category_lrd_regen() {
  $counter = 0;
  $depth = 0;
  _category_lrd_regen_recurse($counter,$depth,0);
}

/**
 * Internal function used by category_lrd_regen().
 * Do not use.
 */
function _category_lrd_regen_recurse(&$counter,$depth,$parent) {
  $children = array();
  $result = db_query('SELECT h.cid,h.parent from {category_hierarchy} h INNER JOIN {category} c ON h.cid=c.cid INNER JOIN {node} n ON c.cid=n.nid WHERE h.parent = %d ORDER BY c.weight, n.title',$parent);
  while ($child = db_fetch_array($result)) {
    $l = ++$counter;
    _category_lrd_regen_recurse($counter,$depth+1,$child['cid']);
    $r = ++$counter;
    db_query('UPDATE {category_hierarchy} SET l = %d, r = %d, d = %d WHERE cid = %d AND parent = %d',$l,$r,$depth,$child['cid'],$child['parent']);
  }
}

The database table currently looks like this:

      db_query("CREATE TABLE {category_hierarchy} (
        cid int(10) unsigned NOT NULL default '0',
        parent int(10) unsigned NOT NULL default '0',
        l int(10) unsigned NOT NULL default '0',
        r int(10) unsigned NOT NULL default '0',
        d tinyint(4) unsigned NOT NULL default '0',
        PRIMARY KEY (cid, parent),
        KEY lrd (l, r, d),
        KEY parent (parent)
      ");

The code I'm writing is running on a test site at http://category-dev.rtk0.net/ .
Currently, all that is available is a page that shows what the hierarchy table looks like, but as I get more of this done, it'll gain more testing tools.

(as an aside, using a pager in a php page snippet is a nice way to view a table when one doesn't want to bother installing views :wink:)

Comment #23

bdragon commented 18 October 2006 at 05:08

Guess I'll have to split the table after all. Forgot that >1 row is needed to represent (cid,parent) when using multiple parents.

Comment #24

bdragon commented 1 November 2006 at 23:17

I have been working on this a lot lately.

I managed to make an iterative version of category_display_toc, among other things.

By tweaking category_load to not incur a category_get_parents() cache (I made a lite version that doesn't cache and only gets the parents of the requested cid), serving a depth-limited version of a container (which is optimized on the SQL side with the LRD code active, i.e. only rows we want are returned) takes <10ms. For comparison, the hit taken to do a full category_get_parents() call with the same data is between 900 and 1000 ms.

The entire page generation time was 320.19 ms (This is for a TOC of a container with 2277 categories, single random parents, depth 1.)

Test page is at:
http://category-dev.rtk0.net/index.php?q=node/14999

The new TOC code is only used for this test page, clicking on any category on the page will use the old rendering code. (I've found that it takes 3-4 seconds on average)

Here's another test. I turned the depth limiter off for the container (so it displays the entire tree of categories)
http://category-dev.rtk0.net/index.php?q=node/15000

I was getting a time between 3-6 seconds. (This is on a linode with a bunch of debugging stuff enabled, on a "real" site this would be faster.)

Comment #25

bdragon commented 1 November 2006 at 23:34

Hmm, odd... I just dropped a key that I didn't think I needed, and the page time went from <400ms to >2s... I readded the key and it didn't seem to change.

/me retraces his steps...

Comment #26

bdragon commented 1 November 2006 at 23:38

... And regenerating the tree fixed it. Odd.

Comment #27

bdragon commented 1 November 2006 at 23:45

Ok, it seems the query was executed in a different order.

For the record, this execution plan is good:

mysql> describe SELECT c.*, n.title, l.d as level, l.r-l.l as width FROM category c INNER JOIN
category_cont cn ON c.cnid=cn.cid INNER JOIN category_lrd l ON c.cid=l.cid
INNER JOIN node n ON c.cid=n.nid WHERE l.l > 7 AND l.r <= 4562 
AND l.d <= 2 AND cn.hidden_cont = 0 ORDER BY l.l;
+----+-------------+-------+--------+----------------------------------+---------+---------+---------------------+------+-----------------------------+
| id | select_type | table | type   | possible_keys                    | key     | key_len | ref                 | rows | Extra                       |
+----+-------------+-------+--------+----------------------------------+---------+---------+---------------------+------+-----------------------------+
|  1 | SIMPLE      | l     | ALL    | PRIMARY,upwards,parent_group,nid | NULL    |    NULL | NULL                | 2337 | Using where; Using filesort |
|  1 | SIMPLE      | c     | eq_ref | PRIMARY,tid                      | PRIMARY |       4 | category_dev.l.cid  |    1 |                             |
|  1 | SIMPLE      | cn    | eq_ref | PRIMARY                          | PRIMARY |       4 | category_dev.c.cnid |    1 | Using where                 |
|  1 | SIMPLE      | n     | ref    | PRIMARY,nid                      | nid     |       4 | category_dev.c.cid  |   19 |                             |
+----+-------------+-------+--------+----------------------------------+---------+---------+---------------------+------+-----------------------------+
4 rows in set (0.00 sec)

and this one is bad:

mysql> describe SELECT c.*, n.title, l.d as level, l.r-l.l as width FROM category c 
INNER JOIN category_cont cn ON c.cnid=cn.cid INNER JOIN category_lrd l ON c.cid=l.cid
 INNER JOIN node n ON c.cid=n.nid WHERE l.l > 7 AND l.r <= 4562 
AND l.d <= 2 AND cn.hidden_cont = 0 ORDER BY l.l;
+----+-------------+-------+------+----------------------------------+------+---------+---------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys                    | key  | key_len | ref                 | rows | Extra                                        |
+----+-------------+-------+------+----------------------------------+------+---------+---------------------+------+----------------------------------------------+
|  1 | SIMPLE      | cn    | ALL  | PRIMARY                          | NULL |    NULL | NULL                |    4 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | c     | ref  | PRIMARY,tid                      | tid  |       4 | category_dev.cn.cid |  179 |                                              |
|  1 | SIMPLE      | l     | ref  | PRIMARY,upwards,parent_group,nid | nid  |       4 | category_dev.c.cid  |    1 | Using where                                  |
|  1 | SIMPLE      | n     | ref  | PRIMARY,nid                      | nid  |       4 | category_dev.c.cid  |   19 |                                              |
+----+-------------+-------+------+----------------------------------+------+---------+---------------------+------+----------------------------------------------+

Comment #28

moonray commented 15 November 2006 at 22:02

After seeing your post on using the "modified preorder tree traversals" I decided to use it for the tasks_advanced module. One question: how do you move a branch (change the sort order)?

I came up with the following formula.

UPDATE tree SET left = (left + [(b-right - b-left + 1)]), right = (right + [(b-right - b-left + 1)]) WHERE left BETWEEN [a-left] AND [a-right]
UPDATE tree SET left = (left - [(a-right - a-left + 1)]), right = (right - [(a-right - a-left + 1)]) WHERE left BETWEEN [b-left] AND [b-right]

a-left and a-right are the left and right of the node you're swapping.
b-left and b-right are the left and right of the node you're swapping with.

The problem is that this requires the two SQL statements to be run in the same transaction (simultaneously). And transactions aren't an option with most drupal setups.

How would you tackle this? Pull a list of IDs before running these on specific IDs? That would require 4 queries to be run instead of 2. :-S

Comment #29

moonray commented 15 November 2006 at 22:32

btw,

Depth can be pulled from the database on the fly. But it might impact performance. (not sure)
example from http://dev.mysql.com/tech-resources/articles/hierarchical-data.html:

SELECT node.name, (COUNT(parent.name) - 1) AS depth
FROM nested_category AS node,
nested_category AS parent
WHERE node.lft BETWEEN parent.lft AND parent.rgt
GROUP BY node.name
ORDER BY node.lft;

Comment #30

bdragon commented 16 November 2006 at 05:46

mysql> describe select l.cid, (count(p.cid)-1) as d from category_lrd l, category_lrd p where l.l>20 and l.r<50 and l.l between p.l and p.r group by l.cid order by l.l;
+----+-------------+-------+-------+-----------------+---------+---------+------+------+----------------------------------------------+
| id | select_type | table | type  | possible_keys   | key     | key_len | ref  | rows | Extra                                        |
+----+-------------+-------+-------+-----------------+---------+---------+------+------+----------------------------------------------+
|  1 | SIMPLE      | l     | range | PRIMARY,upwards | upwards |       4 | NULL |   22 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | p     | ALL   | NULL            | NULL    |    NULL | NULL | 2337 | Using where                                  |
+----+-------------+-------+-------+-----------------+---------+---------+------+------+----------------------------------------------+
2 rows in set (0.00 sec)

mysql> describe select cid, d from category_lrd where l>20 and r<50;
+----+-------------+--------------+-------+-----------------+---------+---------+------+------+-------------+
| id | select_type | table        | type  | possible_keys   | key     | key_len | ref  | rows | Extra       |
+----+-------------+--------------+-------+-----------------+---------+---------+------+------+-------------+
|  1 | SIMPLE      | category_lrd | range | PRIMARY,upwards | upwards |       4 | NULL |   22 | Using where |
+----+-------------+--------------+-------+-----------------+---------+---------+------+------+-------------+
1 row in set (0.00 sec)

mysql>

Also, I have to deal with multiple parents, so a cid can have multiple depths.

As far as shuffling subtrees go, I'd find out which is bigger, make a hole that size next to the smaller one, shift the bigger one in to place, shift the smaller one into place. That would leave the tree consistent between queries. I wouldn't even worry about fixing the two gaps that occur after doing this, it's not like gaps in the numbering are a problem or anything...

I was thinking about it, and I believe spreading things apart a bit would make moves significantly faster, due to not having to shift large parts of the table. Knowing how far to do so is the tricky part. (Perhaps an advanced option to say how many subcategories you are expecting this category to have in the near future, etc...) And then there's the matter of checking whether a gap exists....

Also, (again mainly because I have to handle multiple parents) sometimes it's faster to truncate the table and rewalk. Especially if we're talking about moving a 6000-wide node with 20 parents. ;-)

Comment #31

bdragon commented 16 November 2006 at 06:23

I just read that article, and cringed at some of those queries. Definately not realtime.

~~ An aside about multiple parents ~~

The problem with multiple parents (Which I'm sure you're glad you don't have to worry about, moonray ;) is how they turn what would normally be a strict hierarchy into a general graph.

Imagine ten categories, 1-10. Now imagine each category has the set of parents 1 : (self-1).

a) How many rows does it take to store this?

b) Don't you wish Mysql 4.0 and 4.1 had stored procedures?

As an aside, I tested this situation, and observed the following:

a) It's damn slow to generate.

b) It makes a very impressive table of contents.

c) Isn't this somewhere around O(n!)?

d) This would be much quicker if I turned off indexes temporarily.

e) This would be much quicker as a bulk insert.

f) I was originally going to do the alphabet, A-Z, but I gave up somewhere in the middle due to the exponential increase in the amount of time I had to wait for the table rebuild to finish.

g) Anyone using multiple parents in this manner is crazier than I am.

http://www.dbmsmag.com/9603d06.html
http://www.dbmsmag.com/9604d06.html
http://www.dbmsmag.com/9605d06.html

There you will find the pseudocode for merging two trees, and more.

His book on the subject might also be a nice fit in any sql library: http://www.elsevier.com/wps/find/bookbibliographicinfo.cws_home/702605/d...

Paul

Comment #35

pkej commented 1 March 2007 at 11:25

OK, here is the ultimate MYSQL solution to most problems with the NSM Tree:

http://dev.mysql.com/tech-resources/articles/hierarchical-data.html

Hope this helps you with your work.