I've been testing this module and using it in production apps, and I've found some problems with it.

With more than 5000 categories/containers, this module causes:

- Excessive memory use: about 45 MB when viewing a category, 55 MB while editing it.
- Slowness: all of Drupal gets slower while this module is in use. Pages load in 5000 ms (avg) to 30000 ms (max).

I took a look at the code in "category.inc" and found a set of functions that are wasting CPU, memory, and MySQL queries.

I'll try to explain why they are slow and how to fix this:

- function category_get_category($cid, $reset = FALSE)

This function caches all results, including results that the calling function did not ask for.
The cache is not efficient, because it is stored in PHP memory and is released at the end of the request (click).

How to solve:

You can try this code instead:

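function category_get_category($cid, $reset = FALSE) {
  static $categories = array();

  if (!is_numeric($cid)) {
    return FALSE;
  }

  // Honour the $reset argument by dropping everything cached so far.
  if ($reset) {
    $categories = array();
  }

  // Simple cache to eliminate duplicate queries: only the requested
  // category is fetched, and only the first time it is asked for.
  if (!isset($categories[$cid])) {
    $result = db_query('SELECT c.*, n.title FROM {category} c INNER JOIN {node} n ON c.cid = n.nid AND c.cid = %d', $cid);
    while ($category = db_fetch_object($result)) {
      $categories[$category->cid] = $category;
    }
  }

  // Avoid an undefined-index notice when no such category exists.
  return isset($categories[$cid]) ? $categories[$cid] : NULL;
}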

This code keeps a small cache and doesn't cache unnecessary data.

But I think the best approach would be to cache the data in new database tables.

- function category_is_cat_or_cont($nid, $reset = FALSE)

Same problem. You're caching all nodes in PHP memory. With 1 million nodes, we would need more than 100 MB to run this module.

- function category_get_parents($cid, $key = 'cid', $distant = TRUE, $reset = FALSE)
Same problem.

- function category_get_children($cid, $cnid = 0, $key = 'cid', $reset = FALSE)
Same problem.

- function category_get_tree($cnid, $parent = NULL, $depth = -1, $max_depth = NULL, $distant = FALSE)
Same problem. (This function eats a lot of memory.)

- function category_category_count_nodes($cid, $type = 0)
Same problem.

- function category_node_get_categories($nid, $key = 'cid', $reset = FALSE)
Same problem.

- function _category_category_children($cid)
Same problem.

Remember that it is wrong to collect data that the caller has not requested from the function.
Only cache data when it has been requested from the function.
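
As a rough illustration of the "only cache what was requested" rule, here is a standalone sketch in plain PHP. It does not use the Drupal API: fetch_category_row() and its in-memory table are hypothetical stand-ins for a single-row database query, and get_category_lazy() is an invented name, not a function from category.inc.

```php
<?php
// Hypothetical stand-in for a one-row query against the category table.
// It counts calls so we can see how often the "database" is hit.
function fetch_category_row($cid, &$query_count) {
  $table = array(
    1 => array('cid' => 1, 'title' => 'Books'),
    2 => array('cid' => 2, 'title' => 'Films'),
    3 => array('cid' => 3, 'title' => 'Manga'),
  );
  $query_count++;
  return isset($table[$cid]) ? $table[$cid] : NULL;
}

// Lazy cache: a row enters the cache only when a caller asks for it.
function get_category_lazy($cid, &$query_count, $reset = FALSE) {
  static $cache = array();
  if ($reset) {
    $cache = array();
  }
  if (!array_key_exists($cid, $cache)) {
    // Misses are cached too (as NULL), so unknown ids are not re-queried.
    $cache[$cid] = fetch_category_row($cid, $query_count);
  }
  return $cache[$cid];
}

$queries = 0;
$first = get_category_lazy(2, $queries);
$again = get_category_lazy(2, $queries);    // served from the cache
$missing = get_category_lazy(99, $queries); // unknown id, one query
echo $first['title'], ' after ', $queries, " queries\n";
```

Memory then grows with the number of distinct categories a request actually touches, not with the size of the table.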

I'm attaching my category.inc file, which is not working at all, but is faster and needs less memory than yours.

I hope this has been useful,
David.


Comments

marcoBauli’s picture

Title: Memory & Perforormance problems using Category module » Memory & Performance problems using Category module

Hi David, and thanks for submitting this issue.

Can I ask how your production site based on Category behaved after your patch to category.inc?

I am close to going online with quite a big project based on Category too, and if things go as expected it will easily reach 5000-10000 categories.

How close to Taxonomy's performance would the Category module be after this patch? What other potential slow-downs did you find in the module? Are there other potential problems in using Category for big production sites? Is core Taxonomy still a better choice (if yes, why)?

This would be extremely valuable information for everyone following this module, and for Jaza, who will hopefully be committed to Category once he is back from Summer of Code!

Thank you very much in advance

notabenem’s picture

I am also interested in this!
What do you mean by "not working at all", but makes your site load faster?

marcoBauli’s picture

This problem was also raised some time ago in this other issue.

Here is the solution proposed there in brief, but please refer to the full issue for further comments:

One possible workaround would be to cache the variable in the database between calls; another possibility would be to reduce the scope of the array to build only the useful part of the tree for the request.

I would set this to 'critical', since this is one of those issues that make the difference in choosing Category instead of Taxonomy.
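
The second idea, building only the part of the tree that a request needs, might look roughly like the following standalone PHP sketch. Everything in it is an assumption for illustration: fetch_children() stands in for a query on a parent-relationship table restricted to one parent, and collect_subtree() is an invented helper, not part of the module.

```php
<?php
// Hypothetical stand-in for a per-parent query on the hierarchy table.
// The array maps cid => parent cid (0 = top level).
function fetch_children($parent) {
  static $hierarchy = array(1 => 0, 2 => 1, 3 => 1, 4 => 2, 5 => 0, 6 => 5);
  return array_keys($hierarchy, $parent);
}

// Collect only the descendants of $root, down to $max_depth levels
// (NULL means no limit), instead of materialising the whole tree.
function collect_subtree($root, $max_depth = NULL, $depth = 0) {
  $out = array();
  if ($max_depth !== NULL && $depth >= $max_depth) {
    return $out;
  }
  foreach (fetch_children($root) as $cid) {
    $out[] = array('cid' => $cid, 'depth' => $depth);
    $out = array_merge($out, collect_subtree($cid, $max_depth, $depth + 1));
  }
  return $out;
}

$subtree = collect_subtree(1);     // everything below category 1
$shallow = collect_subtree(1, 1);  // direct children of category 1 only
```

Categories 5 and 6 are never touched when the request only concerns category 1. The trade-off is one lookup per visited parent instead of one large query.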

deavidsedice’s picture

Priority: Normal » Critical
FileSize
11.79 KB

Can I ask how your production site based on Category behaved after your patch to category.inc?

Before
-----------
Memory: 20 MB (min), 35 MB (avg), 55 MB (max)
Performance: 2.5 seconds/click (min), 5 seconds/click (avg), 30 seconds/click (max)

After
----------
Memory: 12 MB (min), 20 MB (avg), 25 MB (max)
Performance: 0.8 seconds/click (min), 2.5 seconds/click (avg), 10 seconds/click (max)
(Maximums are reached when trying to edit categories)

I am close to going online with quite a big project based on Category too, and if things go as expected it will easily reach 5000-10000 categories.

My site has 5000 categories now, and I think it will reach 500,000.

How close to Taxonomy's performance would the Category module be after this patch?

I don't know how fast the Taxonomy module is. But if we patch the whole Category module, it should be as fast as the Taxonomy module.
My patch is incomplete. I only patched some functions to gain maximum performance with minimum work. Nothing more.
It is possible that my patch doesn't work on other sites or doesn't change performance in certain situations. But I think it should work for most sites.

What other potential slow-downs did you find in the module?

Hmm... another big problem is in category.module.
It has a function called:
function _category_category_select_options($cnid, $multiple, $blank, &$value, $exclude = array())

And it has serious problems:

1742: $tree = category_get_tree(0); // It retrieves all ...

What if I have one million categories? Hmm... too much memory. (But this function is so fast...)

1760: foreach ($tree as $category) {
// Boolean blasphemy has been cleaned up and moved to separate function
if (_category_category_select_check_category($category, $container, $cnid, $exclude)) {
$options[$category->cid] = _category_depth($category->depth, '-') . ($category->cnid ? $category->title : $category->admin_title) . ($category->cnid ? '' : ' *');
}
}

Here, we read item by item to determine whether each one can be the parent of a category... this is a BIG problem.

I don't know how to solve this. Could I create another database table and put some data in it? I think this would be much faster as a MySQL query. (But I don't know if it is possible to write a MySQL query that achieves this.)

Are there other potential problems in using Category for big production sites?

I don't know. The only thing I've seen is that the Category module is not stable or easy enough for some sites. (Think of selling Drupal sites to clients.)

Is core Taxonomy still a better choice (if yes, why)?

I have never used Taxonomy. Taxonomy is not complex enough for my sites.
For example, the site where I patched this module is a library. We have Books, Films, Manga, etc... This is impossible with the Taxonomy module.

----------------------------------------------------

What do you mean by "not working at all", but makes your site load faster?

I don't write English very well... but I'll try to explain it:
After applying the patch, the Category module will have defects. Some of the module's own features will be missing, since I had to delete them to gain speed. (Thanks to Google translation ;)
However, the patch that I submitted is only for category.inc, and the bug I introduced is in category.module. Sorry!

-----------------------------------------------------

One possible workaround would be to cache the variable in the database between calls; another possibility would be to reduce the scope of the array to build only the useful part of the tree for the request.

This is what I want to do. First I reduced the cached data to make it more useful. I also reduced the data retrieved from MySQL.

I believe that it is possible to do it better. I'm still working on this...

Oh, I changed priority to "critical" as suggested by kiteatlas.

Finally: My site is http://www.leelibros.com/biblioteca and I'm attaching statistics created by http://www.site24x7.com/

deavidsedice’s picture

FileSize
66.99 KB

I've decided to upload my category.module...

This patch introduces a bug in the Category module: when you try to add or edit a category, the "parent" combo will have fewer options.

This patch only affects editing categories, nothing more. After applying it:

- PHP will require less memory.
- The click will take less time to process.

REMEMBER: The parent combo will have fewer options.

This is a pre-patch for "_category_category_select_options" to speed it up. This function fills the "Parent" combo box.

deavidsedice’s picture

Assigned: Unassigned » deavidsedice

I found the problem that slowed down the page load when we published a category.

The problem lies in the function that loads the possible parents of the category being edited, which in turn calls another function to check, for each node, whether it should be offered as a parent or not.

To check each node, this function, called "_category_category_select_check_category", verifies one by one whether we have access to each of the categories.

Checking a node requires that it be loaded into memory completely, so that the resulting object can be passed to the function "node_access".

This means loading from the database into memory all the nodes that are categories, which in many cases can mean loading more than 100000 nodes at once.

This causes the delay of more than 30 seconds per click and the memory demand of more than 50 MB.

-------------------------------------

Personally, I think that checking the categories one by one is excessive; commonly, access is given either to all the categories or to none.

I've decided to comment out these lines of code:
1806: // Next, we get rid of categories that the user does not have access to
1807: //if (!node_access('view', node_load($category->cid))) {
1808: // return FALSE;
1809: //}

marcoBauli’s picture

Status: Active » Needs review

David, there's another small patch about performance at http://drupal.org/node/84424

Meanwhile, let's hope Jaza (or some other savvy coder ;) will find some time to dedicate to this issue...

I am also thinking about trying to raise some funds to help the Category module get a serious bug-fix round... hmm... a thread about this will probably follow soon.

setting this as 'code needs work'

Jaza’s picture

http://drupal.org/node/84424 has been marked duplicate of this thread

deavidsedice’s picture

David, there's another small patch about performance at http://drupal.org/node/84424

Thanks! :D

I'll look at it more closely tomorrow ;)

jvandervort’s picture

I think the 84424 patch may cause other issues.

Issue?

deavidsedice’s picture

Yes, the 84424 patch will corrupt "function category_get_children".
....

I'm looking at the _category_category_select_check_category function in category.module. Here we have a problem: every node is loaded entirely to check whether the user has access to it or not. There are two options:

a) Check it at the end. This will solve some memory problems and improve performance somewhat. But when we want to create another category, all nodes will still get loaded into memory.

b) Never check it. This can create an issue, because users with the privilege to create categories could create categories that are children of other categories that they can neither edit nor read.
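
Option (a) can be sketched in standalone PHP as follows. This is only an illustration under stated assumptions: expensive_access_check() is a made-up stand-in for the node_access('view', node_load(...)) call, cheap_is_candidate() is a made-up stand-in for the structural tests, and the ids and rules are invented.

```php
<?php
// Counts how often the expensive stand-in for the access check runs.
$loads = 0;

function expensive_access_check($cid, &$loads) {
  $loads++;
  return $cid % 2 === 0;  // made-up rule: even ids are viewable
}

// Cheap structural test, made up for this sketch.
function cheap_is_candidate($cid, array $exclude, array $allowed_parents) {
  return !in_array($cid, $exclude) && in_array($cid, $allowed_parents);
}

function select_options(array $cids, array $exclude, array $allowed, &$loads) {
  $options = array();
  foreach ($cids as $cid) {
    // Cheap test first: && short-circuits, so the expensive check only
    // runs for categories that survived the cheap one.
    if (cheap_is_candidate($cid, $exclude, $allowed)
        && expensive_access_check($cid, $loads)) {
      $options[] = $cid;
    }
  }
  return $options;
}

$options = select_options(range(1, 1000), array(4), array(2, 4, 6, 7), $loads);
echo count($options), ' options, ', $loads, " expensive checks\n";
```

Here only three of the thousand categories reach the expensive check. When every category passes the cheap test, as in the "create another category" case mentioned in (a), nothing is saved.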

Jaza’s picture

I've committed a fix to HEAD and 4.7, that improves the performance of category_get_category() and category_is_cat_or_cont() a fair bit (uses much more intelligent caching). However, when testing on a site with 4,000+ categories, the module is still extremely slow. More work needs to be done to improve performance.

Leaving this issue at present status.

bdragon’s picture

I have an idea regarding the fixes committed.

I was analyzing the cached queries, and I noticed some problems with the way category_get_cached_item uses the database.

I'll post a patch shortly incorporating my ideas for speedup.

--Brandon

bdragon’s picture

I did a profiling run, and I highly suspect _category_category_select_check_category of being a major part of the slowdowns.

It is node_loading EVERY CATEGORY to check whether the user is allowed to access it.

JonathanDStopchick’s picture

I finally pounded away at this thing for a few hours and found the issue. There are two instances of node_load which are used on a per-category basis to determine whether each category can be viewed or not. This slows it down grotesquely. I'm not sure, but maybe one query for all the categories, parsed afterwards, might be better.

But anyway the two that are a problem are:

function _category_category_select_check_category
*****
  if (!node_access('view', node_load($category->cid))) { //This line
    return FALSE;
  }

and

function _category_node_select_options($cnid, $blank, &$value, $exclude = array()) {
************
  if ($tree) {
    foreach ($tree as $category) {
      if (!in_array($category->cid, $exclude) && node_access('view', node_load($category->cid))) { //and this line
        $options[$category->cid] = _category_depth($category->depth, '-') . $category->title;
      }
    }

Until a better solution presents itself my quick fix has been commenting out the entire first one -

  //if (!node_access('view', node_load($category->cid))) { //This line
  //  return FALSE;
  //}
  

and editing the line in the second to remove the node_load. It's less secure, but at least it speeds things up A LOT.

  if ($tree) {
    foreach ($tree as $category) {
      //if (!in_array($category->cid, $exclude) && node_access('view', node_load($category->cid))) {
      if (!in_array($category->cid, $exclude)) { //replaced with me!!
        $options[$category->cid] = _category_depth($category->depth, '-') . $category->title;
      }
    }

I hope this information is helpful to everyone!

bdragon’s picture

It's not so much a security thing, really. People who don't limit access to categories won't even need this ability..

For now, here's a patch to add a toggle in the settings page.

In the future, perhaps adding a flag to categories or containers to selectively run the check only for them may be worth implementing. (checkbox -- Enable advanced access control for this category or something...)

bdragon’s picture

Here's one that does the same thing for both cases pointed out in #15.

Both of these patches were against DRUPAL-4-7, btw.

bdragon’s picture

And then it occurs to me...

WHY THE HELL AREN'T WE USING db_rewrite_sql() HERE??!?!??

bdragon’s picture

category_get_tree is already db_rewrite_sql()ing.....

If it is working properly, both those node_load()s should be completely redundant.....

bdragon’s picture

Unfortunately, that appears to not be the case...
I traced the return values from db_rewrite_sql for a normal user and got:
"SELECT c.cid, c.*, h.parent, n.title, cn.admin_title FROM {category} c INNER JOIN {category_hierarchy} h ON c.cid = h.cid INNER JOIN {node} n ON c.cid = n.nid LEFT JOIN {category_cont} cn ON c.cid = cn.cid LEFT JOIN {category} c2 ON h.parent = c2.cid WHERE (c2.cid IS NOT NULL OR h.parent = 0) AND n.status = 1 AND n.moderate = 0 AND c.cnid = %d ORDER BY c.weight, n.title"
"SELECT c.cid, c.*, h.parent, n.title, cn.admin_title FROM {category} c INNER JOIN {category_hierarchy} h ON c.cid = h.cid INNER JOIN {node} n ON c.cid = n.nid LEFT JOIN {category_cont} cn ON c.cid = cn.cid LEFT JOIN {category} c2 ON h.parent = c2.cid WHERE (c2.cid IS NOT NULL OR h.parent = 0) AND n.status = 1 AND n.moderate = 0 AND c.cnid = %d ORDER BY c.weight, n.title "

Looking into why db_rewrite_sql isn't kicking in properly....

bdragon’s picture

Hmm, it appears I should install a node access module of some sort before trying to test this XD

bdragon’s picture

With cac_lite, I get:

SELECT c.cid, c.*, h.parent, n.title, cn.admin_title FROM {category} c INNER JOIN {category_hierarchy} h ON c.cid = h.cid INNER JOIN {node} n ON c.cid = n.nid LEFT JOIN {category_cont} cn ON c.cid = cn.cid LEFT JOIN {category} c2 ON h.parent = c2.cid INNER JOIN {node_access} na ON na.nid = n.nid WHERE (na.grant_view >= 1 AND ((na.gid = -1 AND na.realm = 'all') OR (na.gid = 2340 AND na.realm = 'cac_lite') OR (na.gid = 0 AND na.realm = 'cac_lite'))) AND (c2.cid IS NOT NULL OR h.parent = 0) AND n.status = 1 AND n.moderate = 0 AND c.cnid = %d ORDER BY c.weight, n.title

There's definitely something wrong with the join order....

bdragon’s picture

And then I read that joins are commutative. Meh.
Sorry about the trackerspam, it appears to work....

bdragon’s picture

Here's the issue that led to the code in question:
http://drupal.org/node/52552

So, OG doesn't use the node_access table?

bdragon’s picture

Looks like Drupal 5 will be much better with access controls:
http://drupal.org/node/71420
http://drupal.org/node/75395

In any case, it appears that sites using regular node_access style access control already work fine, so I'd recommend having the two cases in #15 at the very least be a toggle, so that people using no access control plugin / a "normal" access control plugin don't have to take the node_load hit.

--Brandon

deavidsedice’s picture

So far, the best optimization I have obtained has come from doing the following things:

1) category.inc: Improve how the cache is filled in most of the functions.

2) category.module, _category_category_select_check_category function: Remove the permission check so that the function does not load all the nodes.

2.1) We can also force the system to run this check at the very end, thus avoiding useless loads.

2.2) And we can cache the result of this check.

3) category.inc, category_get_tree function: Create a database table to store the cached data for this function.

4) node.module, node_load function: I was playing with different types of cache, and I decided to remove Drupal's node cache.

5) Use the Turck MMCache accelerator with Drupal. This decreases the memory required to execute our scripts.

Problems:

> In category.inc, the category_get_tree function is slow and eats a lot of memory. We should rewrite this function to use as few resources as possible.

> In category.module, _category_category_select_check_category checks the permissions of every node... Is this necessary? Can it be cached or not? Is it possible to write code that checks permissions without loading the entire node?

scroogie’s picture

Sorry if I'm getting a bit off-topic, but what kind of tree structure does category.module use? Perhaps using e.g. the Nested Set model instead of a parent relationship could speed things up.
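
For readers who haven't met the Nested Set model, here is a minimal standalone PHP sketch of the idea. The sample tree, number_tree() and descendants() are all invented for illustration and are not part of category.module. Each node gets a left and a right number from a depth-first walk, and a node's descendants are exactly the nodes whose interval lies inside its own.

```php
<?php
// Hypothetical adjacency list: parent cid => child cids (0 is a virtual root).
$children = array(0 => array(1, 5), 1 => array(2, 3), 2 => array(4), 5 => array(6));

// Depth-first walk assigning left/right numbers (nested set encoding).
function number_tree(array $children, $cid, &$counter, array &$sets) {
  $left = $counter++;
  if (isset($children[$cid])) {
    foreach ($children[$cid] as $child) {
      number_tree($children, $child, $counter, $sets);
    }
  }
  $sets[$cid] = array('lft' => $left, 'rgt' => $counter++);
}

// All descendants of $cid: rows whose interval lies strictly inside its own.
// In SQL this becomes a single range condition instead of a recursive walk.
function descendants(array $sets, $cid) {
  $out = array();
  foreach ($sets as $id => $s) {
    if ($s['lft'] > $sets[$cid]['lft'] && $s['rgt'] < $sets[$cid]['rgt']) {
      $out[] = $id;
    }
  }
  sort($out);
  return $out;
}

$counter = 1;
$sets = array();
number_tree($children, 0, $counter, $sets);
$under_1 = descendants($sets, 1);
```

The trade-off is that reads of whole subtrees become cheap, while inserting or moving a category means renumbering part of the tree.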

deavidsedice’s picture

Sorry if I'm getting a bit off-topic, but what kind of tree structure does category.module use? Perhaps using e.g. the Nested Set model instead of a parent relationship could speed things up.

Uses a parent relationship.
I believe that we cannot change this without recoding all modules... but i'm not sure.

JonathanDStopchick’s picture

I wonder if it would be faster to create a new function in category called "category_load" based off of node load, but the way the data is processed and accessed would be different. Firstly the function grabs the entire tree structure of categories, not just the one at a time method. Secondly the data grabbed via the query is ONLY what is needed. Node_load grabs everything, not cool. Lastly, and this should be obvious by now, the use of this function wouldn't be looped like it is currently, it would be loaded once and parsed against the current tree.

Anyone else think this could be a sound solution?
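
A rough standalone PHP sketch of that "load once, then parse against the tree" idea follows. It is hypothetical throughout: category_load_minimal() is an invented name, node_rows() is an in-memory stand-in for the node table, and a real version would presumably issue a single SELECT of a few columns with an IN (...) list.

```php
<?php
// In-memory stand-in for the node table, reduced to the few columns needed.
function node_rows() {
  return array(
    10 => array('nid' => 10, 'status' => 1),
    11 => array('nid' => 11, 'status' => 0),
    12 => array('nid' => 12, 'status' => 1),
  );
}

// Hypothetical batch loader: one "query" for many ids, minimal columns only.
function category_load_minimal(array $nids, &$query_count) {
  $query_count++;  // one round trip regardless of count($nids)
  return array_intersect_key(node_rows(), array_flip($nids));
}

$tree = array(10, 11, 12, 99);  // cids from the category tree (99 has no row)
$queries = 0;
$loaded = category_load_minimal($tree, $queries);

// Parse the preloaded rows against the tree: no per-category load in the loop.
$published = array();
foreach ($tree as $cid) {
  if (isset($loaded[$cid]) && $loaded[$cid]['status'] == 1) {
    $published[] = $cid;
  }
}
echo implode(',', $published), ' with ', $queries, " query\n";
```

The loop still visits every category, but the data source is hit once instead of once per category.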

bdragon’s picture

scroogie:

Is THAT the name of what I'm trying to invent in http://drupal.org/node/87918 ?

FWIW, I'll be receiving some Joe Celko books in the post shortly, I'm sure they will help me in my quest to make Category faster ;)

--Brandon

scroogie’s picture

bdragon: I will reply in the issue you pointed me to once I have had time to read and understand your idea. At first sight, I don't think it's the same. Actually, the Nested Sets Model is an idea of Joe Celko's, too.

deavidsedice’s picture

JonathanDStopchick wrote: I wonder if it would be faster to create a new function in category called "category_load" based off of node load, but the way the data is processed and accessed would be different. Firstly the function grabs the entire tree structure of categories, not just the one at a time method. Secondly the data grabbed via the query is ONLY what is needed. Node_load grabs everything, not cool. Lastly, and this should be obvious by now, the use of this function wouldn't be looped like it is currently, it would be loaded once and parsed against the current tree.

Anyone else think this could be a sound solution?

Yes, I think it will be faster if we load less data.

I've done some tests, and I coded a function named node_preload(). It's a very simplified version of node_load(). I think there is no need to load all the data at once: that would eat some RAM, and we can run more than 1000 queries against our database without delay.

Now I have a very fast _category_category_select_check_category function.
Code:

function _category_category_select_check_category($category, $container, $cnid, $exclude) {
  if (in_array($category->cid, $exclude)) return FALSE;
  $return = false;

  if (!isset($container)) $return = TRUE;

  if ($category->cnid) 
  {
      if (in_array($category->cnid, $container->children_allowed_parents)) $return = TRUE;
      if (!$container->children_have_distant && $category->cnid == $cnid) $return = TRUE;
  }
  else 
  {
    if (in_array($category->cid. '#', $container->children_allowed_parents)) $return = TRUE;
    if (!$container->children_have_distant && $category->cid == $cnid) $return = TRUE;     
  }

  if ($return) // I've inverted the order. Now, node_access is at the end. By this way, we'll load less nodes.
  {
    if (!node_access('view', node_preload($category->cid))) return FALSE;
  }

  return $return;

}

And here the code for node_preload:

function node_preload($param = array()) {
  $arguments = array();
  if (is_numeric($param)) {
    $cond = 'n.nid = %d';
    $arguments[] = $param;
  }
  else {
    // Turn the conditions into a query.
    $conds = array();
    foreach ($param as $key => $value) {
      $conds[] = 'n.'. db_escape_string($key) ." = '%s'";
      $arguments[] = $value;
    }
    $cond = implode(' AND ', $conds);
  }
  // Fetch only the basic {node} columns, with no joins.
  $node = db_fetch_object(db_query('SELECT n.nid, n.vid, n.type, n.status, 
    n.created, n.changed, n.comment, n.promote, n.moderate, 
    n.sticky FROM {node} n WHERE '. $cond . ' LIMIT 1', $arguments));
  return $node;
}

deavidsedice’s picture

oops!

I missed a bug in the previous code:

function _category_category_select_check_category($category, $container, $cnid, $exclude) {
  if (in_array($category->cid, $exclude)) return FALSE;
  $return = FALSE;

  if (!isset($container)) $return = TRUE;
  if (!$return) // This is needed to avoid errors with in_array function
  {
    if ($category->cnid)
    {
      if (in_array($category->cnid, $container->children_allowed_parents)) $return = TRUE;
      if (!$container->children_have_distant && $category->cnid == $cnid) $return = TRUE;
    }
    else
    {
      if (in_array($category->cid. '#', $container->children_allowed_parents)) $return = TRUE;
      if (!$container->children_have_distant && $category->cid == $cnid) $return = TRUE;
    }
  }
  if ($return) // I've inverted the order. Now, node_access is at the end. By this way, we'll load less nodes.
  {
    if (!node_access('view', node_preload($category->cid))) return FALSE;
  }

  return $return;
}

chud’s picture

I am having some serious performance issues with a site that now has several thousand categories. I have been keeping up with the latest versions of the category module, and I was wondering if anyone else has been as well, and could post a new patch for the latest category.inc??

I did manage to get this patched category.inc working by checking out the version of category module from July 11, but I'd prefer to use a more recent version of category.

Thanks,
Colin

deavidsedice’s picture

Hi, guys. I've been optimizing my site these days. I'm still fixing bugs, but it is now very fast (400-1000 ms per click, 7500 categories).

If you want, I can post some of the modified files here, but they aren't 100% bug free.

I've modified category.inc, and also the Image module and the Drupal core. Some tables were added to MySQL, and I've created some new queries, but they are only compatible with MySQL. (I think it will be easy to make the MySQL queries compatible with PostgreSQL, but I don't know how.)

chud’s picture

Hi David,

I for one would be very interested in getting a hold of your latest changes. Any chance you can post your category module (all files) in its entirety? Also interested in your other patches to core.

Many thanks,
Colin

deavidsedice’s picture

Hi, Colin.

Yes, I want to send my latest changes. But it will be hard, because there are too many changes in the code.

I have no problem packing up all the code (except uploaded files) and making a copy of the database structure (with no data). This will be easy.

But I'd like to send some documentation of the changes. I speak Spanish, not English, so I have to write the doc and then translate it entirely.

Some of the things I have done to the code:

--- Category module

1.- Change the cache policy for all functions: try to avoid memory load and make the cached data faster to reach.
2.- Avoid compulsive node_loads. This can be done by loading nodes only when it is _really_ needed.
3.- Change the cache type of category_get_tree, and put the cache in a MySQL table. This saves up to 10 MB of memory per click.

--- Node module

1.- Change the cache type of node_load. Put the cached data in MySQL and remove it from memory.

--- Image module

1.- Add a MySQL table to cache the theme_node and theme_teaser functions. This saves some time, because each node is rendered only once per modification.
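
The "rendered only once per modification" idea can be sketched in standalone PHP like this. It is not the code from the attached files: the array $render_cache stands in for the MySQL cache table, render_teaser() is a made-up stand-in for the theming step, and the node's changed timestamp is assumed to be what marks a cached entry as stale.

```php
<?php
// In-memory stand-in for a cache table keyed by node id.
$render_cache = array();
$render_count = 0;

// Hypothetical expensive theming step; counts how often it runs.
function render_teaser(array $node, &$render_count) {
  $render_count++;
  return '<h2>' . htmlspecialchars($node['title']) . '</h2>';
}

// Return cached markup unless the node changed since it was cached.
function cached_teaser(array $node, array &$cache, &$render_count) {
  $nid = $node['nid'];
  if (!isset($cache[$nid]) || $cache[$nid]['changed'] !== $node['changed']) {
    $cache[$nid] = array(
      'changed' => $node['changed'],
      'html' => render_teaser($node, $render_count),
    );
  }
  return $cache[$nid]['html'];
}

$node = array('nid' => 7, 'title' => 'Dune', 'changed' => 1000);
cached_teaser($node, $render_cache, $render_count);
cached_teaser($node, $render_cache, $render_count);  // cache hit, no render
$node['title'] = 'Dune (2nd ed.)';
$node['changed'] = 1001;                             // simulated edit
$html = cached_teaser($node, $render_cache, $render_count);
```

Comparing the stored timestamp with the node's current one avoids an explicit invalidation step when a node is saved, at the cost of reading the timestamp on every request.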

Before making any changes, the pages, hosted on a dedicated Debian server with PHP compiled and Turck MMCache installed, were using:

- On front page clicks: from 7 to 10 seconds. Memory up to 25 MB (without Turck MMCache)
- Editing a category: from 25 to 50 seconds. Memory up to 55 MB (without Turck MMCache)

Now:

- On front page clicks: from 0.5 to 1.2 seconds. Memory up to 0.9 MB (with Turck MMCache)
- Editing a category: from 1 to 4 seconds. Memory up to 3 MB (with Turck MMCache)

But there are some known bugs when editing categories. I'm working on this now.

You can see this site at: http://www.leelibros.com/biblioteca/
(At the bottom, you can see the time and memory used to create the page.)

deavidsedice’s picture

FileSize
5.69 KB

In my last post I attached a tar.bz file of my site. Here is the SQL structure.

senthiln’s picture

This thread has been silent for almost a year. Does this mean that the Category module is now free of performance problems?

I am using the Category module to provide free tagging on this site for submitting free press releases. The site runs on Drupal 5.1 and uses Category module version 5.x-1.1. There are more than 3000 categories. If someone tries to submit a press release (article) with some free tags, PHP throws a fatal error because of insufficient memory. It asks for more than 32 MB.

So is there any patch to improve the performance of the Category 5.x-1.1 module? I don't know whether this patch for the Drupal 4.7 version of the Category module is already incorporated in the new version for 5.x. Could someone who knows more about the Category module help me, please?

bdragon’s picture

Category still has performance issues, and I am currently swamped with other work and really need to find a block of time to spend on things... :-/

liquidcms’s picture

Wow, wish i had seen this sooner. Just launching 2 sites this week which could end up being very large sites - and of course both use category module. I have never done one yet that doesn't use this module.

The main thing i use Cat for is the ability to categorize with nodes. Am i correct in thinking this is cleanest solution for doing this? I am pretty sure i can't do this with taxonomy module??

So, if i am right about needing category module to be able to tag something with a thing that also has image fields, url fields, etc - then we need to get the performance issue sorted out. Personally i don't think i care how much memory is being used - if the site is big it likely shouldn't be running on a low end server anyway - BUT... page loads taking 3 times as long is kind of scary.

But did i read this right??? If i use cat module but only have a few categories.. but 100,000 nodes that are tagged with those categories - is this a performance issue?? Regardless i have no issue throwing some money into a bounty to get category module issues sorted out - anyone else. Brandon can't be expected to work for free - i don't.. well trying not to more these days.. :)

Peter Lindstrom
LiquidCMS - Content Management Solution Experts

ps - stay tuned for major site release notice tomorrow - and this site uses category module

bdragon’s picture

No, a bunch of nodes being tagged should not cause performance issues. Only the creation of lots of categories has caused any issues in my experience.

andrabr’s picture

So, does this mean that so-called "free tagging" is a kiss of death?

deavidsedice’s picture

andrabr, it is possible to run the Category module with a million tags, but you must patch your category.module (by yourself) and tune it up.
Of course, you must remove some features that produce a high load.

I'm using this module at http://www.leelibros.com/biblioteca , and I have around 6,000 nodes with 10 categories each (60,000 cat-terms).

But I'll uninstall this module when I upgrade leelibros to Drupal 5 (or 6, if later). I prefer using Taxonomy & CCK instead. It is much simpler, has fewer features, and is faster.

andrabr’s picture

Thanks, David! - Wow, even I get it now!
Good karma!!

inforeto’s picture

I'd like to try some of the suggestions here, but I'm not sure which patch to use, and there seem to be several problems addressed.

I have an issue with submitting categories, using drupal 5.
Some performance reports:

I have a moderate-traffic site where submitting a new category creates a spike that ends up crashing eaccelerator.
There are about 5000 categories, but these are accessed all the time by a hundred visitors.
The spike is not too big, and the server is only at 1/3 of max load, but mysql runs over the 32m memory_limit by the time watchdog entries are written.
(it often throws mysql errors, like watchdog tables locked or mysql gone away).
I also suspect it triggers a drupal cache clear, which might be the cause of the actual crashes on eaccelerator.
(eaccelerator is known to be sensitive to sudden loads, but gets the work done).

So far, troubleshooting points to category nodes being either edited or added.
Devel and mysql slow queries are still being examined, without conclusive findings.
However, if the site is put into maintenance mode, there's no bottleneck, and while the spike is still observed it is not noticeable by users and there's no crash.
In this case the brief spike happens after the site is put back online, but drupal cache is not always cleared.
The result is that if you send 1 container and 10 categories in a row you have 11 chances of a crash.
With the site offline you send the 11 nodes with only a single spike, without a bottleneck because the site is idle.

The site in question has undergone patching for the tid issue, has exposed cid in views and has heavy views optimization via theming.
It began with drupal 4.7 and now uses drupal 5.7, so it is hard to tell how old the problem is; it has only worsened recently as the site grew.
An identical setup on another site, with only a couple hundred categories and few visitors, has no such problem.
So far, only regular categories are used, no book-style multi-page articles transformed into categories.
Users are unable to post this kind of node, and use the outline module instead.

JirkaRybka’s picture

Version: 4.7.x-1.x-dev » 6.x-2.0-rc1
Assigned: deavidsedice » Unassigned
Category: bug » task
Status: Needs review » Active

This is silent for a year and a half, and 4.7.x is unsupported now - but still, there's a lot of interesting reports in the above discussion.

As of 6.x-2.0-rc1, my performance patch got in (#501378: PERFORMANCE! Central caching for category API functions) - that's focused on slow queries, though, solving them with caching. It helped a lot on my site, which has some 100-150 categories, thousands of tagged nodes, a 64M memory limit, and traffic just big enough to turn slow queries into a problem on a shared host (difficult to estimate any numbers on that part). Now that it's in, it'll get real "testing" in the wild, and we can further improve/build on that. But apart from that, there are areas unrelated to that patch:

The status of 6.x version is most probably still not examined yet, regarding these points mentioned here:

- memory footprint of category (do we avoid loading of unnecessary data?)
- scalability to thousands of categories
- whether we still have performance hit on access checks (the extra node_load() calls are gone now, but I suspect there might be some indirect replacement in use?)
- whether it makes sense to extend caching to wrapper modules (for example legacy taxonomy queries - my experience from live site says yes, but possible side-effects must be carefully examined in this legacy code).

So, while the 4.7.x patching attempts are obsolete now, it's entirely possible that some of the underlying problems still stand, so I'm turning this issue to a 6.x task, in hope that it gets a bit of research.

(Edit - also marking as duplicate: #245897: When trying to edit a category: "Fatal error: Allowed memory size of...")