This usability issue is that when large taxonomies are apply to a content type the end users wait between 15 to 60 seconds before some pages are fully loaded and display.

STEPS TO REPRODUCE THE ISSUE
Download & install Drupal 7.x-dev http://drupal.org/project/drupal

Download & install Devel module 7.x-1.x-dev http://drupal.org/project/drupal

Activate Devel & 'Devel generate' modules admin/config/modules

Go to admin/config/development/generate/taxonomy to generate a new taxonomy. Using the following values. Feel free to use bigger values if your server has more resources.
-How many vocabularies would you like to generate? 10
-How many terms would you like to generate? 1000
-Max word length of term/vocab names: 12
-Leave unchecked "Delete existing terms and vocabularies before generating new content" option.

Generate 14 more taxonomies using the same values as above. So you'll get 15 vocabularies with a total of 150,000 terms. The important part is a total of 150,000 term. The number of vocabularies doesn't make a big difference. Note that to avoid error, each time you create a new taxonomy you must go to admin/config/development/generate/taxonomy Do NOT use the browser back button. Otherwise your Drupal installation might no longer work.

(Alternative: If none of the above works try creating large taxonomy on a local site instead of remote site. Then import the database to your remote site.)

(Note: Before generating a large vocabulary your PHP and MySQL limits might need to be increase appropriately. Read more at http://drupal.org/node/29268 Another option is to use a local site to create your large vocabulary.)

Go to admin/structure/taxonomy. Under the column 'Content types' make sure that all your vocabularies are apply to the content type 'Page'. NOT apply with Blog, Forum or Story.

Once your large vocabularies have been created and applied you'll notice that the following pages are now very slow to load:
-'Find content' at admin/content. Find below two screenshots 'admin-content-1-during-loading.png' and 'admin-content-2-after-loading.png' to clarify.
-'Create Page' at node/add/page
-And all other pages displaying large taxonomies.

The usability issue is that the end users wait between 15 seconds to 60 seconds before the page is fully loaded and display. I have set this issue to 'critical' because most end users give up after waiting only a few seconds. Thinking that Drupal is too slow or falsely that Drupal 'Bug'. Advanced end users will be annoy. But will know that the page is still loading. This usability issue is present in Drupal 5, 6 & 7 dev. Large taxonomies are often use for various lists such as 'Countries, Cities, Accent Cities' or Book numbers.

Expected result is waiting 1 to 3 seconds before the page is fully loaded and display. I will post a few suggested solutions below. Hope this help.

I'm not a developer but I would be happy to contribute patch testing.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Francewhoa’s picture

Here are a few suggested solutions. Hope this help.

  • Provides an autocomplete form field instead of a drop down if a taxonomy is bigger than a predefined number of terms. Similar to http://drupal.org/project/batax
  • On content type page or taxonomy page. Offer 2 options: 1-Use drop down. 2-Use autocomplete.
  • Collapse the taxonomy fieldset by default. Then load it when extended by user. Using AJAX/AHAH.
  • Loading a drop down menu populated with taxonomy title but not all terms. Then when user select a taxonomy title load only the term associated with the taxonomy. Using AJAX/AHAH. Similar to http://drupal.org/project/taxonomy_widget or http://drupal.org/project/hierarchical_select

My vote goes for the first solution. Because it seems it would answer all type of user needs. The drop down menu would be the default value. But user could set it to change if a taxonomy is bigger than a predefined number of terms. Plus user could set a number of terms to trigger an autocomplete.

Francewhoa’s picture

Here is another suggestion that might help fix this issue. Overriding core taxonomy module with custom code. One way of doing this is using taxonomy_override_selector to reduce load overheads.

Alan did that to improve memory usage for another module call 'TaxiSelect'. And had great results. Read more at http://drupal.org/node/544070

Any volunteers?

catch’s picture

Category: feature » bug

This can still get in as a usability/performance improvement in Drupal 7.

Bojhan’s picture

Issue tags: -Usability, -D7UX +Performance

Moving it from my queue as it doesn't seem to be a real usability bug.

moshe weitzman’s picture

I'd say that we should fold into core the http://drupal.org/project/batax module

bleen’s picture

I'd say that we should fold into core the http://drupal.org/project/batax module

+1

sun.core’s picture

Version: 7.x-dev » 8.x-dev
Category: bug » task
Priority: Critical » Normal

This issue exists since like ever.

moshe weitzman’s picture

Version: 8.x-dev » 7.x-dev
Priority: Normal » Critical

The fact that this is old is not too relevant IMO. If this bug showed up today, we would mark it critical.

fabsor’s picture

I like the idea to fold the batax module into core.

Subscribing!

webchick’s picture

Priority: Critical » Normal

We're not blocking the release of 7 on a bug that's present in every prior version of Drupal, and represents a big edge case. Please use the 'critical' status judiciously.

catch’s picture

Title: Large taxonomies. Reducing pages load time. » taxonomy_get_tree() should never be called for an entire vocabulary

Crashing people's sites just because they happen to have a decent amount of content is a bug, and a critical one. If we were loading the entire contents of the node table every time you go to search/node, admin/content, node/n/edit and taxonomy/term/n/edit this wouldn't have been downgraded. But I won't play status pong.

webchick’s picture

Could you clarify? Is what's happening in 7.x worse than what's happening in 6.x and below? If not, I agree it is a "major" bug, but it is not a release-blocking "critical" one. If so, then restore the status.

moshe weitzman’s picture

Whats happenning is the same as what happens in 6. I disagree with the criteria that "if it is in 6, it is not critical". Bugs should be evaluated as if they are new. catch is saying that if we had this same bug for nodes, it would be critical. taxonomy is a similar fundamental system. Looks at the paths that catch posted. Those are fundamental paths that won't even load if you have a lot of terms.

Anyway, the point is not to argue about status. I propose that the best fix is to move code from http://drupal.org/project/batax into core. We'd just be adding a new field api widget. Please comment on the feasibility of committing such a patch. If it is plausible (i know there are no guarantees), we'll prepare a patch.

catch’s picture

Priority: Normal » Critical

It's no worse than D6, except that D7 memory usage is way up, so it increases the likelihood of hitting memory limits (especially now we have menu_rebuild() a bit more under control).

However I would say that a lot more sites use large taxonomies now, compared to when D6 was released, which is why there's several D7 issues trying to deal with this (many of which have been bumped to D8 because they require huge API changes and refactoring), and a growing number of contrib modules to work around it.

We added variable_get('taxonomy_override_selector') at the very last minute in the Drupal 6 cycle, with the hope of getting hierarchical select (or some other scaleable solution to select lists and hierarchy), into core in Drupal 7, which we haven't done. So all that's left at this point is damage control.

I'm moving this back to critical for those reasons. At the worst case, we need to ensure that variable_get('taxonomy_override_selector') is called anywhere taxonomy_get_tree() might get triggered in core. Where it isn't, that's either a critical leftover from the D6 patch or a regression.

This is currently done by taxonomy_form_all() - see #693362: taxonomy_form_all() is dangerous
Field allowed values (no issue yet)
Taxonomy parent selection (still has the variable iirc)

edit: or like Moshe says, try to actually fix it - with something along the lines of batax being the lowest impact solution.

JeremyFrench’s picture

Priority: Critical » Normal

subscribe

moshe weitzman’s picture

Category: task » bug
Priority: Normal » Critical

that status change was accidental, i believe. jeremy is sufferring from this bug badly on economist.com

catch’s picture

Also opened #698918: Reduce memory usage from \Drupal\taxonomy\TermStorageInterface::loadTree() when parent is specified when I remembered that even if you specify $parent in taxonomy_get_tree() it still loads the entire vocabulary into memory. I'm not sure if that issue is 100% viable yet, but if it is then that'd mean we don't have to remove the last four words of this issue title ;)

JeremyFrench’s picture

I have uploaded a patch to this #106015: [performance] DB caching for \Drupal\taxonomy\TermStorageController::loadTree() which may help the DB side of this bug, but it still won't help a user with a 100,000 item select box.

janusman’s picture

Status: Active » Closed (duplicate)

IIUC, this is about building a too-large form element in content/admin and search/node/.

If so this is no longer an issue due to #693362: taxonomy_form_all() is dangerous removing the filter-by-term form elements.

Marking as duplicate.

janusman’s picture

OR: If we want to do something like replace with an autocomplete or other widget, then change the issue title and reopen.

catch’s picture

Title: taxonomy_get_tree() should never be called for an entire vocabulary » taxonomy_get_tree() memory issues and mis-use for select lists
Status: Closed (duplicate) » Active

We still call this for the term parents when editing nodes, and in taxonomy_field_allowed_values(). Retitling for the remaining issues.

Anonymous’s picture

I can take responsibility for working on a batax-ish solution for this issue. I did the Field API conversion of taxonomy, so if this is a sane solution we can agree with, I am equipped to do it. I can't start until Thursday, but I can give the whole day. I'll assign the issue to me when I start.

My initial concerns:

A new widget for this problem sounds.. oblique. Can't I just force the widget to be the term reference autocomplete widget?

This is a problem of thresholds. What is the threshold? Memory? Time? Number of allowed values in any given term field? What's the right moment to determine when this threshold is crossed? Or do we trust the user to know when her vocab is too big by recognizing the symptoms of white screens and locked up web browsers?

The widget for selecting a parent term is a FAPI element while the one for selecting a term reference field value is a Field API widget. This means I'm looking at two places to intervene in two different ways.

My reading of batax is that it represents the ancestry of the terms suggested by the autocomplete. Is this true? My guess is that in a massive list with deep hierarchies, there will be name collisions. The only way to distinguish to terms is to know their ancestry.

catch’s picture

Forcing the widget sounds good.

Threshold, I think this should be a variable (although not necessarily an exposed one). We could have a taxonomy_autocomplete_threshold and a taxonomy_vocabulary_term_counts variable -the latter updated on taxonomy_term_insert. When the latter crosses the formerf or any particular vocabulary, update instances?

I like the idea of showing the ancestry. Another option if that's hard, would be the [tid: 123] approach used by node and user reference.

JeremyFrench’s picture

Usability becomes an issue way before WSOD and slow browsers. A country select box has < 200 items in it and is not particularly usable. So I would suggest a default cut off around that sort of value.

catch’s picture

Yeah 150-200 seems like a sensible threshold to me too.

Anonymous’s picture

My priority will be to develop the widget switch-a-roo based on the threshold. JeremyFrench is right... the usability threshold is crossed well before the system resources threshold.

On the point of usability, displaying the ancestry in the term suggestions is much more useful than a term ID. If I can do this as well in the time I can contribute, I will make it an option on the autocomplete widget.

Anonymous’s picture

Assigned: Unassigned »

Here we go!

Anonymous’s picture

Status: Active » Needs review
FileSize
7.57 KB

This is my first incomplete attempt. I will definitely work on this more, but I want some reactions to the unfortunate stuff that is going on.

#412518: Convert taxonomy_node_* related code to use field API + upgrade path removed all the old nodeapi code from taxonomy, but it also removed the simple and versatile Form API element code. The taxonomy term elements for nodes were also used for the taxonomy term edit form. taxonomy_autocomplete took just a vocabulary ID. Things were simple.

Under Field API, we're now loading field info to determine how to validate input. However, some of the code I'm going to target for element/widget switch-a-roo to avoid calls to taxonomy_get_tree use plain Form API elements. This makes it hard to reuse the autocomplete callback and element validate functions.

So guess what? I brought some of this code back.

Each time a term is added, I'm checking to see if the autocomplete threshold has already been crossed for its vocab. If the threshold isn't crossed yet, I count the terms in the vocab. If it's now beyond the threshold, I store the machine name of the vocab in a variable (probably should be the vid) and I never check again. I don't actually store the count because it would never be valid for vocabs which have crossed the threshold.

I'm marking needs review because I want tests to run. There's still plenty of unfinished work here.

moshe weitzman’s picture

H,,. I wonder if we should check that count during cron instead of adding a query on every save. That sort of query is painful when creating terms en masse like a data migration. At least that auto-check for vocab size should be able to be disabled.

janusman’s picture

@bangpound: did quick scan of your code, looks ok so far.

Re: what moshe said in #29, how about a static flag so the check is only done once? This would cover the mass import use case.

 +++ modules/taxonomy/taxonomy.module 
 @@ -1540,3 +1586,27 @@ function taxonomy_taxonomy_term_delete($term) {
 +function taxonomy_taxonomy_term_insert($term) {
++  static $vid_checked = array();
++  if (isset($vid_checked[$term->vocabulary_machine_name])) {
++    return;
++  }
 +  $counts = variable_get('taxonomy_vocabulary_term_counts', array());
 +
 +  // Counts array stores vocabulary machine name when the vocabulary has
 +  // exceeded the autocomplete threshold.
 +  if (empty($counts[$term->vocabulary_machine_name])) {
 +    $count = db_select('taxonomy_term_data', 't')
 +      ->condition('t.vid', $term->vid)
 +      ->countQuery()
 +      ->execute()
 +      ->fetchField();
 +    if ($count > variable_get('taxonomy_autocomplete_threshold', 150)) {
 +      $counts[$term->vocabulary_machine_name] = $term->vocabulary_machine_name;
 +      variable_set('taxonomy_vocabulary_term_counts', $counts);
 +    }
 +  }
++  $vid_checked[$term->vocabulary_machine_name] = TRUE;
 +}

EDIT: Not versed enough in drupal_static() to implement that, the above code probably should be using it.

catch’s picture

Status: Needs review » Needs work

@Moshe: this is a no return theshold- it's checked until the threshold is crossed, then never checked again. So for migrations it would only affect the first 200 terms inserted into a vocabulary - once those are in, the threshold is passed, and it shouldn't be checked again.

For migrations, it'd be enough to set $conf['taxonomy_autocomplete_threshold'] = 1; - then you'd get only one COUNT query per vocabulary. Although that's a bit obscure, I'm not sure we need any other special handling here. Would prefer we didn't do it on cron, since then you're counting for a 100 term vocabulary that never changes in perpetuity.
However, variable_set('taxonomy_vocabulary_term_counts', $counts); - this variable should be renamed to taxonomy_vocabulary_threshold_exceeded or something - it's not easy to tell what's going on there.

@bangpound - what's left to do here otherwise?

Anonymous’s picture

There's plenty left to do. I can get back on it on Thursday and Friday this week and hopefully push it across the "needs review" threshold.

moshe weitzman’s picture

@catch. thanks. my concerns are gone. i'd be happy to see this move ahead.

Anonymous’s picture

Ok. Unless anyone gets very eager before Thursday and does some work on it, I'll pick up where I left off then.

Catch's explanation of the plan vis à vis the threshold and SQL counts is accurate.

Francewhoa’s picture

@all: If you want to test patches. Using Drupal 7 alpha 3 or more recent the steps to 'assigning vocabularies to content types' has changed.

Before you can use a new vocabulary to classify your content, a new Taxonomy term field must be added to a content type on its manage fields page. When adding a taxonomy field, you choose a widget to use to enter the taxonomy information on the content editing page: a select list, checkboxes, radio buttons, or an auto-complete field (to build a free-tagging vocabulary). After choosing the field type and widget, on the subsequent field settings page you can choose the desired vocabulary, whether one or multiple terms can be chosen from the vocabulary, and other settings. The same vocabulary can be added to multiple content types, by using the "Add existing field" section on the manage fields page.

Source: Drupal 7 online help button. #overlay=admin/help/taxonomy

I'm currently testing patch #28. Will post my result here.

Francewhoa’s picture

Thanks for the patch bangpound :)

Here is my testing results with patch in #28. The patch applied successfully. I tested only the front end. I'm not a coder so I haven't look at the code. I have assigned a vocabulary containing 50,000+ terms to the content types 'Basic Page'.

I didn't notice any change before and after the patch. Is it normal? Maybe since bangpound said in #28 that there is plenty of unfinished work. Find attached screenshots to clarify before after patch.

Tested with.
-Drupal 7 alpha 3 fresh install.
-Garland theme.
-Firefox 3.
-Server OS: Ubuntu Server 8.04 LTS
-End user OS: Windows XP

Francewhoa’s picture

The following is a note to myself for future patch testing.

New steps to create large vocabulary. Then attach it to a content type.

Create new vocabularies with 'Devel Generate taxonomy'.

Go to admin/structure/types

Click on 'manage fields' link beside 'Basic page'.

Fill the 'Add new field' row.

For 'Field type' select 'Term reference'.

For 'Widget' select 'Select list'.

Click on save button. Find attached screenshot to clarify.

On the next screen FIELD SETTING. Select the Vocabulary 'Tags'. Note that the current 'Devel Generate taxonomy (devel-7.x-1.0-beta1.tar)' module as a glitch. All new taxonomy generated with 'Devel Generate taxonomy' are added to the 'Tags' vocabulary. I have open another issue about that at #750842: 'Devel Generate taxonomy' does not generate expected number of terms

On the next screen leave all to default. Click on SAVE SETTINGS button.

On the next screen click on SAVE CONTENT TYPE button.

Go to /user/1#overlay=node/add/page

Follow instructions on screen to create a new page.

Francewhoa’s picture

Attaching screenshots for #37.

Anonymous’s picture

Onopoc: The patch is not finished yet. I don't think there's anything to profile yet, and I will likely not do anything for taxonomy term reference fields. I'm targeting taxonomy term parents in the term edit form. Otherwise, Drupal needs to urge the user to use autocomplete widgets in taxonomy term reference fields when the total number of allowed values exceeds the threshold.

Francewhoa’s picture

@bangpound: Thanks for the information. I'll be happy to test other patches. I'll wait for the 'Status' to read 'Needs review'. Take all the time you need ;)

andypost’s picture

Wim Leers’s picture

Great work here. Hopefully I'll get Hierarchical Select in D8 core.

lotyrin’s picture

Priority: Critical » Normal

"Critical: When a bug renders a module, a core system, or a popular function unusable. ... to be fixed immediately because the software is not usable at all."

This isn't critical. The software works in a majority of cases.

Not to dismiss it, this issue should be fixed.

dejbar’s picture

Firstly I'd like to lend my support to this being critical. It can cause your server to run out of memory and crash. Even worse for some, a whole section of your site can silently fail when the number of tags used forces the memory usage above the php limit. If you allow free tagging and have a large amount of content then hitting this problem is virtually inevitable. Just because this critical bug existed in 5 and 6 doesn't make it any less essential to fix in my view.

Secondly. Isn't it usually vocabularies with no hierarchy such as tags that end up having enough terms to cause this problem? Wouldn't it be sensible to check if the vocab even has a hierarchy before doing the work and using the memory required to find the non-existent tree of a term? I think this would solve about 95% of the problems with this - probably not good enough to mark this as totally fixed it would surely justify a downgrade from critical.

Anyway here's my the Drupal 5 code for my solution

(at the start of taxonomy_get_tree)

+ if($parent > 0) {
+ $res = taxonomy_get_vocabulary($vid);
+ if($res->hierarchy < 1) {
+ return array();
+ }
+ }

If anyone knows a better place to post this or sees an error please let me know.

lotyrin’s picture

Priority: Normal » Major

We have major priority now.

mh86’s picture

I was taking a closer look at the taxonomy_get_tree function. when having a bigger taxonomy with some multi-parents, taxonomy_get_tree results in many, many recursions, which is not really fast in php. But what really takes time is the array_merge function call for merging in the results of the recursion in the tree array. with a little bit of refactoring the looping over the children can be separated into an own helper function an the $tree can be passed by reference, which avoids the array_merge and speeds at the code.
anyway, at the moment I think the whole function is broken, see #870528: taxonomy_get_parents(), taxonomy_get_children(), and taxonomy_get_tree() do not return a full term objects

mh86’s picture

Status: Needs work » Needs review
FileSize
3.37 KB

I was working on improving the performance of the taxonomy_get_tree function, which is used in many places (e.g. for loading the options of a select list or for the term overview page). This patch does not affect any form elements, which try to render too many terms at a time, so not definitely sure if it's the right place to post the patch.

As I pointed out in my previous comment, the recursion and the array_merge calls take up a long time. Thus I replaced the whole recursion with iterations, which really speeds up the complete function. Performance improvements can be easily seen with bigger taxonomies and multi-parents. On the downside, the code gets a bit more difficult to read.

colan’s picture

Subscribing.

mh86’s picture

Priority: Major » Critical
Status: Needs review » Needs work

I tried to use my taxonomy with D7 and was facing heavy performance problems (on a server with an Intel Core i7 920 and 12 GB RAM). The vocabulary has around 14.000 terms and some multi-parents.

First, the term overview crashes -> fatal error: memory size exhausted: this is caused by taxonomy_get_tree and the huge table.
Second, the term edit form takes around 12 seconds just for loading the form, same problem here: taxonomy_get_tree
In general, every time taxonomy_get_tree gets called, I have to wait for a long time. Since taxonomy_get_tree fetches full term objects, the function is much slower than in D6 (full entity loading has been added in #870528: taxonomy_get_parents(), taxonomy_get_children(), and taxonomy_get_tree() do not return a full term objects). When switching back to the old behaviour of taxonomy_get_tree, without the entity loading, the execution time goes back to 1 second (still bad).

marking this issue as critical, because the performance is much worse than in D6 and the problem will affect other sites as well.

overview of the execution time of taxonomy_get_tree:
with entity loading: 12sec
without entity loading: 1,1 sec
without entity loading and without recursions (patch comment #47): 0,45sec (additionally removing the cloning of terms reduces the PHP memory peak from 160mb to 62mb)

not sure what the best solution is. having incomplete objects is not really a good idea, but on the other side we need a fast function for building the forms and overviews.

further, multi parents are currently broken, see #899258: Multi parents broken

chx’s picture

Priority: Critical » Major

demoting from critical because there is nothing new here, this always was so.

fago’s picture

Priority: Major » Critical

>demoting from critical because there is nothing new here, this always was so.
I'm not happy about another critical, but I disagree. As mh86 noted above, the same taxonomy was working fine in previous drupal versions - so that's a regression :(

A that bad performing taxonomy_get_tree() seriously breaks dealing with any term, as the function is used during term editing + overview. Improving it to only entity-load the terms really needed would be a first start. However I doubt that will suffice as its often called for all vocabs too.

Apart from that, the taxonomy_get_tree() does not really return a tree. Moreover it looks like it is basically used for generating a flat list of the tree for overviews / selects. So a separate function optimized for generating that list (= without full entity loading) would make sense to me - maybe it should even return arrays instead of objects to make a clear distinction to full entity objects. taxonomy_get_tree() could then be fixed up in core to really return a tree, however I think there is no use-case for that in core.

mh86’s picture

I was working on the performance problems. As fago suggested in the previous comment, I added a separate function taxonomy_get_entries, which is used in places, where a full term object is not needed, e.g. to generate the options array for select lists. This function works much more efficiently than taxonomy_get_tree: when using it with my vocabulary the processing time goes down from 12 seconds (taxonomy_get_tree) to 0,45 seconds (taxonomy_get_entries) and uses less memory. As soon as multi-parents are fixed in taxonomy_get_tree (#899258: Multi parents broken), it will even be slower.

taxonomy_get_entries is also faster than taxonomy_get_tree in D6, because the recursion has been replaced with iterations.

Furthermore, it is clearly documented, that taxonomy_get_entries will return incomplete term objects. If someone needs full terms for further processing, taxonomy_get_tree has to be used. In core, all occurrences of taxonomy_get_tree can be replaced with the new function.

Concerning the terms overview page, the new function helps, but we still face problems with deeper hierarchies and long child lists, since it tries to render all child terms of the first parent until it reaches the next root level term (same behaviour as in D6). Maybe a threshold of terms to render might help. This would make it impossible to move the child terms to the next root level term with drag&drop, but still better than a white screen.

mh86’s picture

Status: Needs work » Needs review
mh86’s picture

fixed trailing whitespaces at line endings

klausi’s picture

Status: Needs review » Needs work
+++ modules/taxonomy/taxonomy.module	3 Sep 2010 14:31:38 -0000
@@ -858,6 +860,99 @@ function taxonomy_get_tree($vid, $parent
+ * Creates a flat list of term entries, usually used for generating forms or ¶
...
+  $process_parents = array(); ¶
...
+          $process_parents[] = $parent; ¶
+          $process_parents[] = $child->tid; ¶
...
+          reset($children[$vid][$child->tid]); ¶
...
+          next($children[$vid][$parent]); ¶

trailing white spaces

+++ modules/taxonomy/taxonomy.module	3 Sep 2010 14:31:38 -0000
@@ -858,6 +860,99 @@ function taxonomy_get_tree($vid, $parent
+ * Creates a flat list of term entries, usually used for generating forms or ¶
...
+  $process_parents = array(); ¶

trailing white spaces

Otherwise I think this is a good idea. Changing the comments on taxonomy_get_tree() would be good, to indicate that people should not use it anymore for the hierarchy listing use case. And a hint like @see taxonomy_get_entries().

And we should deal with the threshold for the term overview page in a separate issue.

Powered by Dreditor.

tstoeckler’s picture

Status: Needs work » Needs review

Crosspost. Resetting status

klausi’s picture

Status: Needs review » Needs work

Changing the comments on taxonomy_get_tree() would be good, to indicate that people should not use it anymore for the hierarchy listing use case. And a hint like @see taxonomy_get_entries().

chx’s picture

Priority: Critical » Major
Status: Needs work » Needs review

Yes. The issue is major but there is nothing new. It's just that more users will meet this because taxonomy_term is now a fielded entity eating up more RAM. Retrieveing a whole vocabulary was always impossible for any bigger site anyways. Tell me if there is a real regression here, a new bug. One user meeting this the first time is not one.

colan’s picture

Priority: Major » Critical
Status: Needs review » Needs work

As stated in #49,

Since taxonomy_get_tree fetches full term objects, the function is much slower than in D6...

...it is therefore a new bug, a regression as it now takes longer than D6. The results are different, and worse.

mh86’s picture

Status: Needs work » Needs review
FileSize
9.97 KB

updated patch contains additional comments for taxonomy_get_tree, as klausi recommended in #57

furthermore I did some performance testing for another use case: a flat free tagging vocabulary with 10.000 terms.
just calling the functions for this voc took following execution times:
taxonomy_get_tree(): 6.01729798317 sec
taxonomy_get_entries() 0.0657458305359 sec

marcingy’s picture

Priority: Critical » Major

In agreement with chx this is not critical vocabularies have always behaved in this way.

jamonation’s picture

Subscribing. As mh86 and fago are pointing out the "same taxonomy was working fine in previous Drupal versions - so that's a regression". It seems pretty clear that regardless of what re-factoring or new/different calls are involved, what used to work with the same set of terms in D6 now doesn't.

Even if the same problem exists in D6 with many taxonomy terms, the fact is, there is still a problem. Consequently, using taxonomies in D7 will be difficult for many users and developers.

marcingy’s picture

Major does not mean something won't be fixed. This bug does anot render the system unusable, however it does have significant repercussions therefore according to this http://drupal.org/node/45111 the issue is major not critical.

janusman’s picture

Just my $.02...

Also agree it's not critical; performance is relative and "many" terms would cause problems in D6 as well as in D7, or if you change the number that "many terms" means.

This *will* be fixed, but not critical (meaning "hold back D7 release").

I'd suggest focusing on marking issues "critical" for much more common use cases that do cause slowness/memory consumption problems.

Having said the above... it'd be nice to have a memory/performance benchmark with/without the patch to ease review =)

Anonymous’s picture

Please don't get anxious about the priority. This particular issue is a hard one to classify. It never stopped Drupal 5 or Drupal 6 from being released, so why stop Drupal 7? Meanwhile, it's clear that Drupal 7 fanciness exacerbates the bad situation.

As of April in San Francisco, Webchick and I (and others) agreed that this issue is major and not critical. However, real world scenarios with the upgrade path could obligate us to escalate the priority of the issue.

Will this be a critical issue with no functioning patch? Or a major issue with a golden patch?

According to the evidence presented, mh86's function taxonomy_get_entries() is much faster. He says it's harder to understand, but taxonomy_get_tree() is ALREADY opaque to me and taxonomy_get_entries() is a bit clearer to understand. So this is a good accomplishment!

Drupal core does not need debris and legacy functions. taxonomy_get_entries() is not going to get a pass while taxonomy_get_tree() also gets to stick around. Ultimately, the RTBC solution is one that FIXES or REPLACES taxonomy_get_tree().

The array returned by taxonomy_get_tree() is a real trick to use. It's not a flat array. Is it exceptional enough that we don't need to think about entities here? Parent term is not a field,
so the way hierarchy is implemented in Drupal 7 is already a bit of a relic.

Do all our tests pass if the body of mh86's taxonomy_get_entries() is pasted over the body of taxonomy_get_tree()? This is where I'd like to start.

I'll come back to the issue in about a day. I need to do more research about how taxonomy_get_tree() is used in core. I don't oppose having it return a lightweight representation of term names, IDs and hierarchy only. But I also don't know if that's possible. More tomorrow.

chx’s picture

I am a bit hesitant here because this approach skips hook_entity_load and hook_taxonomy_term_entity_load. Does anyone want to mangle taxonomy term names on load? Another approach would be to have a helper that returns the select options because that's what get tree is usually called for... and accept that term names can't be mangled on load. Other uses of taxonomy_get_tree I suspect would be with smaller vocabs anyways -- if you have 10 000 terms then loading that into memory is infeasible, one way or another. Also, killing recursive in get_tree is just awesome, thanks for the work! We totally should do that in get tree. I doubt it's hard to read. We have comments for that. You would do an awesome service to the Drupal world if you could convert drupal_render and drupal_render_children in a similar way. Please!

mh86’s picture

The patch from #60 replaces all occurrences of taxonomy_get_tree with the new taxonomy_get_entries function and all tests pass. Full entity loading was added after alpha6 (see #870528: taxonomy_get_parents(), taxonomy_get_children(), and taxonomy_get_tree() do not return a full term objects).

I'm still not sure, how the best solution should look like. Dealing with incomplete terms is not a good idea. On the other hand, the way core uses it for building forms, full terms mean a lot of overhead. This was the reason, why I added a separate function for this use case, but still with the tradeoff of no possibility to change terms on hook load, as chx mentioned.

Furthermore, if we return a lightweight representation of terms, we should think about returning term arrays (like $term['tid']...) instead of incomplete term objects. This would make it clearer to developers, that they are not dealing with real terms. On the other hand this would break much code in core and contrib.

Anonymous’s picture

Think of this in two phases: LIST gives us the rows from taxonomy_term_data and taxonomy_term_hierarchy tables. LOAD turns that into fattened entities. Can we design taxonomy_get_tree() to optionally return just the LIST? And if we want to LOAD, can the function take the outcome of the list and efficiently return entities based on that?

LIST is taxonomy_get_entries(), LOAD is taxonomy_get_tree() in this patch. Can we do this all in taxonomy_get_tree?

function taxonomy_get_tree($vid, $parent = 0, $max_depth = NULL, $only_list = TRUE) {
  [... mh86's taxonomy_get_entries() ...]
  if ($only_list) { return $tree; }
  [... call taxonomy_term_load_multiple on the tree ...]
  return $tree;
}
mh86’s picture

Designing it this way sounds reasonable to me. The code stays simple and it is close to the previous implementation (except the replacement of the recursion). Combining List+Load in one function just required a few additional lines of code and both profit from the iteration implementation.

To give some performance numbers, tested the function on my local machine with 5.000 terms, simple hierarchy:

  • original taxonomy_get_tree, with full entities: 2,7sec
  • updated taxonomy_get_tree, with full entities: 2,3sec
  • updated taxonomy_get_tree, just list: 0,2sec

At the moment core never requires full entities, therefore this change really improves performance.
I'm still thinking of the variable naming $list_only, which wasn't clear for me at the first glance. Maybe we should use something like $fetch_full_term_entities or $fetch_structure_only, what would you say?

Anonymous’s picture

Do we need an additional static cache for $term_entities? The entity controller has a static cache. Why not just rely on that?

I cannot wait to try out this patch! Thanks mh86.

mh86’s picture

You are right, there is no need for another static variable. I can update the patch tomorrow.

Anonymous’s picture

"List only" mode may be a bad name, but the default for the function should be to load entities. It's important to keep the current behavior as the default.

After you post a new patch, I'll test it and go through existing usage of taxonomy_get_tree() to find places we can use "list only" mode. taxonomy_overview_terms_submit() and taxonomy_check_vocabulary_hierarchy() I think to start...

mh86’s picture

here the updated patch:

  • removes static var for $term_entities and reuses the cache from the entity controller
  • renamed $list_only in $load_entities (seems more intuitive to me) and made it default. Still I hope the list mode is sufficient for most places in core, but we'll see.

In my previous posts I forgot to mention, that this patch also fixes multi parents, which are currently broken in taxonomy_get_tree (term with multi parents only added once to $tree array).

moshe weitzman’s picture

This looks awesome. The @return in the doxygen needs updating as well. After that, it is RTBC IMO. Would be good to hear from bangpound on this latest revision.

catch’s picture

Looks great to me as well.

This issue was the first mention of #870528: taxonomy_get_parents(), taxonomy_get_children(), and taxonomy_get_tree() do not return a full term objects I've seen, I would've argued strongly against that going in given the existence of this issue, but at least this alleviates the worst of it.

moshe weitzman’s picture

Status: Needs review » Needs work

Also, don't you need to change all the calls to taxonomy_get_tree() in core so they they set $load_entities to FALSE? I'd be OK with just changing default behavior of taxonomy_get_tree() to FALSE but bangpound seems opposed to that.

yhahn’s picture

I ran into this as well. It's worth noting that this is not merely an edge case caused by very large vocabularies. If you attach fields to your taxonomy_term entities you can quickly reach 20, 30+ second node edit form page loads with vocabularies containing relatively few terms (e.g. 100 or 200 terms). The more fields, the slower the loads.

The patch in #73 looks good but it seems like a one-off for taxonomy_get_tree() when to me this seems like a problem for any large entity_load() operation that doesn't require full loading of fields. Is the recommendation here to not use the entity API when you know you don't need fields and will incur a serious performance hit for calling it? (Start looking around core and you will see many instances of taxonomy_term_load_multiple() and user_load_multiple() that will never need information beyond data from the entity base table...)

My first inclination was to add a new function entity_list() (and derivative wrapper functions) and corresponding flag to the entity controllers ->load() method that avoids calling ->attachLoad(). If this sounds reasonable I have the patch on hand.

catch’s picture

The main issue with not using the entity API is something like #557292: TF #3: Convert node title to fields - that patch was reverted though, a theoretical contrib module which alters entity properties on load would break with either this taxonomy_get_tree() code or a more generic entity_list(). I think it's worth its own issue for discussion though. Also while it won't help with the memory issues from many hundreds of entities, quick plug for http://drupal.org/project/entitycache which will mean neither the base table nor the field storage, nor non-field load hooks will get hit once your cache is warm.

mh86’s picture

@moshe
I'm in favour making the list mode as default option. I was reviewing the taxonomy_get_tree function calls and I don't see any place where it's is necessary to have full entities, but waiting for bangpounds opinion.

Anonymous’s picture

Sorry for my delay here.

@mh86: taxonomy_get_tree() is called up in two ways. In taxonomy_overview_terms_submit() and taxonomy_check_vocabulary_hierarchy () it's called just to get the hierarchy. Term names don't matter. Whatever fields are attached to term entities don't matter either. This is where it's safe to use the "list only" mode.

In other places like taxonomy_allowed_values(), the term names are used as well as the hierarchy. If we don't load the terms as entities, are we going to be able to translate the term names? Will translating the term names need to be done two ways: (1) for entities and (2) for terms loaded in taxonomy_get_tree()'s list only mode?

The reason to make "list only" mode optional and not default is to match current expectations and documentation. The last argument presently on taxonomy_get_tree() [$depth] shouldn't be passed by any other function except taxonomy_get_tree() itself... although it is passed in some places. I prefer the current behavior to be the default because if someone never learns about this patch, they don't get bitten by it.

mh86’s picture

If the plan is to translate term names during term loading, it sounds reasonable to me to use full entities. In this case, we are not going to consider performance problems on the node edit form, as yhahn reported in #77. Nevertheless, a simple contrib module could override the allowed_values with a more efficient implementation.
If we just use the list mode in allowed values, translating should still be possible with some form altering (as it was done previous to D7). Compared to full entities, this is more efficient, but maybe requires more code?
So, both approaches have pros and cons, most important we get something like the efficient list mode for taxonomy_get_tree in, so that contrib modules can take advantage.
Concerning the expectations of the results returned by taxonomy_get_tree, I would rather assume, developers don't expect full entities, because this is completely new since alpha6.

Alan D.’s picture

Great work reducing the load on the taxonomy tree. Big +1 to this getting in.

In relation to the widget:

I know this is very late in the thread sorry, and I'm not trying to push my own project, but if you are still thinking about implementing the multiple drop-down widget have a quick look at the logic behind taxiselect. It uses a single non-recursive db_query() to generate the list of options from the tree, utilizing the power of the database rather than PHP to map the tree. On small vocabs, a single select is great. On moderate sized vocabs, multiple drop-downs work, but on really deep ones I haven't found anything else that comes close.

For example, using the geonames database, with ~6 million locations @ four levels deep, the Taxiselect autocomplete is comparable in performance to the user autocomplete, from an end users perspective and the memory load is trivial.

Using Hierarchical Select is not the best option as there are levels that have thousands of entities. For example, all towns and cities in Brazil
Using Big Autocomplete TAXonomy is not the best option as there are place names with 100's of duplicate entries.

I would NOT push taxiselect into core as the background processing logic is much more complicated, so maintenance would be an issue. Plus it is for a fairly specific use-case.

So rather than trying to get a specific implementation rushed into core, how about just focusing on refactoring the taxonomy select override so that users can have the option? This is a badly thought out flag thrown in without any real use case thought out.

Propose:
A new field on each taxonomy, 'parent_selector' vc(31) default 'taxonomy', and call the implementing mechanism for generating this FAPI element.

New hooks (thinking on the fly)

<?php
function hook_taxonomy_parent_list_selector() {
  return array(
    'taxonomy' => t('Single select list'),
  );
}
?>

New callbacks

<?php
function callback_parent_list_selector_widget(....) {
  return $element;
}
?>

New element of the vocab page to select the widget, #access == count(hook_taxonomy_parent_list_selector() results) > 1

Finally a new implementation of hook_disable() in taxonomy.module to default any callbacks to taxonomy in case the implementing module gets disabled.

This way we could have all of the above mentioned modules working in harmony and only core would have to handle the functionality of it's own widget.

fago’s picture

I just discussed this with mh86 and wondered whether it is an option to disable attaching fields to the term object by default?

I guess attaching fields is that what makes entity load that more expensive than a simple load operation. But I think they are usually not needed. So we could go and not attach them by default, but provide an API function to do so on demand.

phantomvish’s picture

+1 subscribing

mh86’s picture

I took a closer look at the entity loading and I found two main issues, which make loading of full terms that slow:

1. In the current implementation and in the patch we call:
$term_entities = taxonomy_term_load_multiple(array_keys($terms[$vid]));
This ends up in a query with "... WHERE tid IN ( all term ids)", where retrieving the result set was extremely slow (even though the query execution time isn't that bad). Changing this line to:
$term_entities = taxonomy_term_load_multiple(array(), array('vid' => $vid));
helps a lot.

2. indeed, attaching fields to a term takes a lot of time, even if the vocabulary has no fields associated. Attaching them on demand, as fago suggested, would make sense.

With fixing these issues we can get taxonomy_get_tree working in a reasonable amount of time, even with entities. I'm going to update the patch as soon as I find time.

catch’s picture

@mh86:

#1: I'd be interested to see where the difference is with this. One issue with using $conditions instead of $tids is that you don't get any benefit from modules like entitycache, but bulding IN() queries with thousands of terms is no fun either.

#2: if you don't attach fields, then you don't have an entity. I'd stick with the optimization (or more like revert of #870528: taxonomy_get_parents(), taxonomy_get_children(), and taxonomy_get_tree() do not return a full term objects, which should never have gone in) that's already here of not doing the entity loading. On-demand field loading is D8 material, and a separate issue from this one.

catch’s picture

Status: Needs work » Needs review
FileSize
5.19 KB

Patch no longer applied, re-rolled with the doxygen changes Moshe asked for, and defaulting the parameter to FALSE. Otherwise it's unchaged apart from updating to HEAD.

Part one of #85 is worth looking at, but I'm pretty happy with this overall as is.

moshe weitzman’s picture

Status: Needs review » Reviewed & tested by the community

Thanks for the changes. RTBC. Wait for green before commit.

Dave Reid’s picture

Status: Reviewed & tested by the community » Needs work

Subscribe

tstoeckler’s picture

Status: Needs work » Reviewed & tested by the community

Crosspost?

alex_b’s picture

Reviewed #87 - this fixes the performance problems I experienced on node edit pages that use vocabularies with many terms with fields.

FWIW, I agree w/ catch on #86/2 - local optimization seems to be preferable to introducing the concept of fieldless entities. #85/1 is an interesting point, but really material for a separate issue.

RTBC, here too.

catch’s picture

Priority: Major » Critical

I re-opened then re-closed #870528: taxonomy_get_parents(), taxonomy_get_children(), and taxonomy_get_tree() do not return a full term objects, am bumping this back to critical since it's fixing that regression now.

catch’s picture

FileSize
5.2 KB

Pre-emptive postgres un-breakage via webchick in irc.

webchick’s picture

Status: Reviewed & tested by the community » Needs work

So, a few things:

-function taxonomy_get_tree($vid, $parent = 0, $max_depth = NULL, $depth = -1) {
+function taxonomy_get_tree($vid, $parent = 0, $max_depth = NULL, $load_entities = FALSE) {

Technically, this is an API change but since $depth was previously an internal property only (and it always annoyed me to begin with), I think this is fine. Especially if we gain a performance boost from not doing recursion here, which is what it sounds like.

-    $query->orderBy('t.weight');
-    $query->orderBy('t.name');
...
+      ->orderBy('weight')
+      ->orderBy('name')

That seems like it's going to cause problems if someone hook_query_alter()s in the block table or something else with a "name" or "weight" column. (looks like catch took care of that in the last patch)

+  $process_parents = array();
+  $process_parents[] = $parent;
+  while (count($process_parents)) {
+    $parent = array_pop($process_parents);
+    $depth = count($process_parents);
+    if ($max_depth > $depth && !empty($children[$vid][$parent])) {
+      $has_children = FALSE;
+      $child = current($children[$vid][$parent]);
+      do {
+        if (empty($child)) {
+          break;
+        }
+        $term = $load_entities ? $term_entities[$child] : $terms[$vid][$child];
...
+        $term->depth = $depth;
+        unset($term->parent);
+        $term->parents = $parents[$vid][$term->tid];
+        $tree[] = $term;
+        if (!empty($children[$vid][$term->tid])) {
+          $has_children = TRUE;

Most of this function is documented pretty well (certainly much better than it was), but I'm wondering if we could flesh this part out with EVEN more comments, since I don't think we'll ever have the collective genius mindshare of taxonomy maintainers in one place again to clean it up after.

mh86’s picture

Status: Needs work » Needs review
FileSize
5.41 KB

Updated the patch from #93 with more comments on the loop.

Concerning #85/1: loading full entities from a voc with 2500 terms with current code took 1,2sec, with the $vid condition instead 0,85sec. But this can be addressed in a separate issue.

Status: Needs review » Needs work

The last submitted patch, taxonomy_get_tree_556842.patch, failed testing.

catch’s picture

Needs a re-roll with git --no-prefix.

Reviewing the patch:
This is my fault, but taxonomy_term_data should be {taxonomy_term_data} IMO, makes it obvious it's a db table. Not a blocker, may be making up that code style too...

Comment additions look good.

mh86’s picture

Status: Needs work » Needs review
FileSize
5.41 KB

this one should apply
further changed taxonomy_term_data to {taxonomy_term_data} in the comment

catch’s picture

Status: Needs review » Reviewed & tested by the community

Looks great.

webchick’s picture

Status: Reviewed & tested by the community » Fixed

Great! Thanks for taking the time to do that.

Committed to HEAD.

Francewhoa’s picture

Thanks all for your contributions. This is awesome :)

sfinerty’s picture

Is there any plan to roll this back to Drupal 6? We are having serious performance issues and have narrowed it down to Tagadelic tag cloud displaying terms and when displaying all nodes in a view organized by taxonomy terms. As Drupal 7 is only in beta, we're not ready to move our sites to it yet.

webchick’s picture

Title: taxonomy_get_tree() memory issues and mis-use for select lists » taxonomy_get_tree() memory issues
Version: 7.x-dev » 6.x-dev
Status: Fixed » Patch (to be ported)

Yeah, actually, the change as-is might be back-portable since it didn't end up affecting the UI. It does incur an API change, but to a property that contrib authors shouldn't be using. Up to Gábor, of course.

gianfrasoft’s picture

Subscribing.

yonailo’s picture

subscribing.

TCRobbert’s picture

subscribing

wqmeng’s picture

Status: Patch (to be ported) » Needs review

#98: taxonomy_get_tree_556842_1.patch queued for re-testing.

Alan D.’s picture

Status: Needs review » Needs work

#100 committed to D7. It will need work for D6 as the patch has heaps of D7 dependent code!

gianfrasoft’s picture

For those who are trying to solve the slow down problem of hierarchical select, take a look at this link:

http://drupal.org/node/1059676#comment-4095278

Hope this helps.

wqmeng’s picture

subscribing

wojtha’s picture

Status: Needs work » Needs review
Issue tags: +Needs backport to D6
FileSize
4.47 KB

Straight backport to D6, just stripped the entities stuff.

wojtha’s picture

dddave’s picture

phantomvish’s picture

Thanks a ton for the D6 patch ! works fine in the preliminarily tests with huge vocab of more than 10k items.

wojtha’s picture

Assigned: » Unassigned

@phantomvish if it is working for you please, RTBC this issue

phantomvish’s picture

Status: Needs review » Reviewed & tested by the community
deviantintegral’s picture

Here's an update that fixes some small issues with the phpdoc comment.

Testing this looks to be about 20% faster than the current taxonomy_get_tree() function in D6. However, I'm not seeing any difference in the memory profile with this patch. Reading the previous issues, it seems as if the expected improvement is only in CPU time, and not memory use. If that's the case, this is fine at RTBC, otherwise I think this needs work.

gielfeldt’s picture

I've created module that somewhat address this issue. It extends the taxonomy data model, providing faster lookups for hierarchies.

http://drupal.org/project/taxonomy_edge

It includes a core patch for optimizing taxonomy_get_tree(). Also includes a module for Views providing optimized queries for queries using depth on taxonomies, and reintroduces the /all modifier for D7.

EDIT: Clarification: Taxonomy Edge only optimizes the select queries for taxonomies. It does not fix the problem of trying to put 100000 terms into a drop down box.

Gábor Hojtsy’s picture

Status: Reviewed & tested by the community » Fixed

Committed, pushed, thanks.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

asb’s picture

Status: Closed (fixed) » Active

I have not followed this issue so far, and I didn't have any problems with "taxonomy_get_tree()" before (whatever this might be), but after updating to Drupal core 6.24 I'm getting a WSOD on one of my older sites. From Apache's error.log:

[Sat Feb 11 02:32:17 2012] [error] [client 123.456.789.123] PHP Fatal error: Allowed memory size of 262144000 bytes exhausted (tried to allocate 24 bytes) in /var/www/drupal/modules/taxonomy/taxonomy.module on line 908

I raised the PHP memory limit to 1024M (default: 250M) which did not resolve this issue. The higher I set PHP memory limit, the more other lines of taxonomy.module are referenced, especially 891, and 896. error.log reports no other issues except for a "PHP Fatal error" originating in taxonomy.module.

On this site, all pages result in a WSOD, including admin pages, and update.php for user #1. According to drush status, Drupal bootstrap is "successfull" (for Drush, at least), and I could run update.php via drush updatedb (without errors).

These problems were not there before the update, so this is clearly a regression, at least for this one site. However I don't know if it was introduced by this bugfix or something else from 6.24.

Environment: LAMP with Debian/stable, Apache/2.2.16, MySQL 5.1.49-3, and PHP 5.3.3-7+squeeze7 with Suhosin-Patch, Memcached, and APC. This is a quad core server with 24 Gig of memory, and roughly a dozend other Drupal sites run fine on this system with Drupal 6.24 (respectively Pressflow).

Francewhoa’s picture

@All: Thanks for your contributions. Confirming #119 works for all our Drupal 6.24 sites :)

@asb: Have you tried reproducing the issue in #122 with a fresh Drupal 6.24 install? The issue you are describing is maybe cause by something else. Like a contrib module or configurations.

asb’s picture

@Francewhoa: As I wrote in #121, "I don't know if it was introduced by this bugfix or something else from 6.24", I just know for a fact that the issue was introduced in Drupal 6.24. It is not there in Drupal core 6.23, or Pressflow 6.23, or Drupal core 6.22, or Pressflow 6.22.

wojtha’s picture

Priority: Critical » Major
Issue tags: -Needs backport to D6

@asb could you describe the characteristics of the page and taxonomy where you getting this error? (For instance type and size of the used taxonomy vocabulary and contrib modules related to the taxonomy module.)

asb’s picture

First, the issue still persits in Drupal core 6.25, and Pressflow 6.24 (the latest code available from Github as of today), so Drupal core 6.23 is the last version that does not break the site, and I'm still considering this issue as "critical" as defined ("render a system unusable").

As said in #121, I'm getting the WSOD on all pages, including admin pages, and update.php. There is no exception; as soon as I install 6.24 or 6.25, the complete site dies.

The site uses six vocabularies, and it has 2,422 + 1,893 + 857 + 640 + 49 + 680 terms (according to ./admin/reports/systeminfo/drupal). All vocabularies are configured as "Tags", "Multiple selection" is allowed, they are "Not required". Contrib modues related to taxonomy are, according to ./admin/build/modules: Edit term 6.x-1.3, Tagadelic 6.x-1.3, Taxonomy switch 6.x-1.x-dev, and a (so far disfunctional) Taxonomy Edge 6.x-1.3. Also (from other sections): Taxonomy VTN 6.x-1.11, Taxonomy redirect 6.x-1.3, Taxonomy breadcrumb 6.x-1.1, Taxonomy Defaults 6.x-2.1, Taxonomy Manager 6.x-2.2.

asb’s picture

Priority: Major » Critical

There seems to be a workaround for the regression that has been introduced with Drupal core 6.24; it's a contributed module called Taxonomy Edge.

Test results with an unaltered Drupal core and Taxonomy Edge (including core patches):

  • Drupal core 6.23 without Taxonomy Edge: site runs fine
  • Drupal core 6.23 with Taxonomy Edge: site runs fine
  • Drupal core 6.24 without Taxonomy Edge: WSOD
  • Drupal core 6.24 with Taxonomy Edge: site runs fine, including "Status report" page
  • Drupal core 6.25 without Taxonomy Edge: WSOD
  • Drupal core 6.25 with Taxonomy Edge: site runs fine, including "Status report" page

That proves, that a regression has been introduced with 6.24, and it strongly suggests that taxonomy_get_tree() is the root cause of the problem.

The core patch provided by 'Taxonomy Edge' "overrides for the core module Taxonomy in regards to taxonomy_get_tree() and taxonomy_select_nodes()" (quoted from the module's project page), so the changes introduced in 6.24 probably might "need work" or should be revoked.

bleen’s picture

asb ... that is a mighty list of taxonomy related modules to have installed on the same site. It is certainly possible that this change to taxonomy_get_tree caused on of those modules to begin causing a WSOD. Have you attempted to disable those modules one-by-one to see which might be the culprit?

I did a quick install of plain ol' D6.25 and generated ~3000 terms in each of 2 vocabs without any issue. This suggests that it is one of those contrib modules you mentioned.

asb’s picture

I'm using exactly the same set of modules on roughly a dozend of other D6 sites where I don't get WSOD, neither with 6.24, 6.25, now with current Pressflow versions. This does not exclude a remote possibility that some contrib modules have problems with the changes applied to taxonomy_get_tree(), but it doesn't give me a workable hint where to look, either.

However, I agree that probably there is something site specific that drives exactly this site over the edge with Drupal core 6.24/6.25, and doesn't harm others (except that I had to raise the PHP memory limit on most sites, starting with 6.24). At least my gut feeling tells me that this specific "something" is neither a contrib module, nor the size, but the way a non-trivial taxonomy is handled in Drupal.

Size issues probably could be excluded by letting devel generate >60k nodes and tag them with >8k different terms, each node with an average of 10 terms. I am pretty confident that Drupal can handle 600k tags. But the structure of the taxonomy looks/feels unhealthy in the UI (or the UI is unhealthy ;). Some vocabularies have hierarchical structures, and some (root) terms have multiple parents; editing nested terms via ./admin/content/taxonomy/edit/term/… and expanding the "Extended options" fieldset takes minutes while freezing the browser; and what is being displayed in the "Parent terms"/"Related terms" boxes looks like endless recursions and insane nestings("------------------------------------------------------------------term").

Maybe the UI is simply bad and not capable to display complex term relationships, or something has corrupted the taxonomy structures, and maybe the behaviour can be replicated by creating a couple of large hierarchies, and then ginving those multiple parent terms. At least I have been working for years under the assumption that Drupal can not handle hierarchies and multiple parents well, and maybe that's why my newer sites do not show these issues. Or maybe the younger sites don't have issues because I have stopped to use core's taxonomy UI to create/edit terms and moved to 'Taxonomy Manager' (where the vocabularie's structures look fine, and editing a term ony takes seconds). Or maybe the site is simply too old (started around 4.5/4.6), bugs might have been fixed in the code, but not in the site's datem and the taxonomy handling has been enhanced in the meantime, so that younger sites are not affected.

gielfeldt’s picture

@asb If your term_* tables don't contain too sensitive data, you can try to post it here, so we can replicate the issue.

asb’s picture

FileSize
928.97 KB

@gielfeldt: No, I don't think that there is too sensitive data as the vocabularies have been built with community tagging anyway. Attaching a dump of the following tables: mysqldump --add-drop-table drupal term_counter term_data term_edge term_hierarchy term_node term_relation term_synonym. Does this suffice?

gielfeldt’s picture

Ok. I'll take a look at it in a couple of days when my hangovers go away.

gielfeldt’s picture

@asb: "Good" news ... I've been unable to reproduce the WSOD using your dump.

You've got a cyclic reference:

mysql> select * from term_hierarchy where tid = 53;
+-----+--------+
| tid | parent |
+-----+--------+
|  53 |     39 |
|  53 |    723 |
|  53 |   1324 |
|  53 |   1623 |
+-----+--------+
4 rows in set (0.00 sec)

mysql> select * from term_hierarchy where tid = 723;
+-----+--------+
| tid | parent |
+-----+--------+
| 723 |     53 |
| 723 |    116 |
+-----+--------+

If you delete it, it works again.

mysql> delete from term_hiearchy where tid = 723 and parent = 53;

The old taxonomy_get_tree() apparently was immune to this. So I guess it's a question of definition whether or not the new taxonomy_get_tree() is flawed...

If the new taxonomy_get_tree() is deemed valid, then there should be an upgrade note, that >= D6.24 will break on cyclic references (if there isn't already, I haven't checked).

EDIT: unable = able of course ...

bleen’s picture

Thanks @gielfeldt for helping to get to the bottom of this...

I dont think taxonomy_get_tree() should be babysitting circular references. Tempted to mark this as fixed, but Ill leave open for now to allow others to disagree

gielfeldt’s picture

@bleen18: I "slightly" disagree. While I agree with you for reasons of principle, it is not uncommon for a function to handle bad input in way that it doesn't croak.

Especially when considering Drupal's data-integrity (i.e. lack thereof). For me it comes down to how impractical it would be to support for ignoring errors in the term hierarchy. If it's too much trouble to implement, then I suggest a release note explaining this potential issue. If the problem can be averted with a simple if-statement, then that would be preferable.

asb’s picture

@gielfeldt: I deleted the circular reference, as you suggested, and after that I could install the current Pressflow 6.25 without WSOD. Thank you very, very much for your analysis.

Obviously, from the user's point of view I have to strongly disagree with the statement from #133 that taxonomy_get_tree() "should not be babysitting circular references". I am more than worried that, quite often, the integrity of the user's data seems to come last in Drupal's priorities.

Since circular references, as discovered by gielfeldt in #132, can (or could) be created by Drupal core's Taxonomy UI, they must not be allowed to harm a site, under no circumstances, especially not with a especially hard to debug WSOD; for example, 'Taxonomy Manager' checks for circular references, and prevents them to be created in the first place. If the Taxononomy UI should have been fixed in the past years, the update scripts - IMHO - would need to check the user's data, and fix the issues introduced by issues in prior versions.

There is quite a bunch of other examples where developers do not feel responsible for the user's data that is created and stored through Drupal, e.g. by duplicate strings in the search index that break the restorability of backups, or the painful disaster with the book module and broken book hierarchies during the D5/D6 upgrade; it's pretty consistent that Drupal core developers tend to push responsibility for data integrity away from Drupal; e.g. in case of the duplicates in Drupal's search index, it's presumably the responsibility of the database not to accept corrupted data from the application. IMHO it's the application's responsibility not to store corrupt data.

I believe this irresponsible (by denying responsibility of Drupal) point of view is a major problem for a Content Management System. What else is the essential requirement for Content Management but to protect data integrity by all means possible? So please consider to re-add the mechanisms that have been in place prior to Drupal core 6.24. Alternatively, either a release note, as suggested in #132, or something like a "Data Integrity Advisory" (similar to Drupal's Security Advisories) might be a fair compromise to deal with issues like this. A Security Advisory, warning about potential attacks from the outside, isn't worth much, if Drupal itself corrupts it's data from inside.

bleen’s picture

Just to clarify my statement in #133... I agree that drupal's core taxonomy module should not allow circular references to occur. I just dont think that the taxonomy_get_tree() function should be enforcing this. I believe this should be handled on term save, and THAT sort of change should be handled in a different issue. This issue is dedicated specifically (and exclusively) to improving the memory usage by taxonomy_get_tree().

Alan D.’s picture

circular references, as discovered by gielfeldt in #132, can (or could) be created by Drupal core's Taxonomy UI

I would have thought that this would have been introduced by a contrib module. Can any one actually replicate this in core? I've done this myself in the past but never from the core UI.

I think that it is worth splitting this into a new issue with a clear description of the problem and a link back to this thread for reference.

gielfeldt’s picture

@Alan D.: Using the example about, I believe it would be possible for Drupal Core to create a circular reference, if term 53 and 723 were saved concurrently. But I also agree that this problem belongs in another ticket, as it is not a memory issue.

That aside, and oddly enough "on topic", I think taxonomy_get_tree() still has memory issues. The static caching is a double edged sword. While it improves performance for multiple calls, it also in many cases increases the memory footprint unnecessarily. Also the fact that it always fetches every single term from the vocabulary specified, is inherently not very scalable.

Another thing is that modules tend to use taxonomy_get_tree() in situations where less could do it. A good example is the taxonomy_select_nodes() which first gets the entire tree, just to get a list of tids, which can then be used to select nids. This could be done in a single sql query instead. The forum module also does some unnecessary stuff. I'm still flabbergasted by forum.module:186-196.

wojtha’s picture

@gielfeldt

That aside, and oddly enough "on topic", I think taxonomy_get_tree() still has memory issues. The static caching is a double edged sword. While it improves performance for multiple calls, it also in many cases increases the memory footprint unnecessarily. Also the fact that it always fetches every single term from the vocabulary specified, is inherently not very scalable.

I have one project with more than 100.000 terms and I had to rewrite all these taxonomy functions (including the form element, core tries to render vocabulary of these size in the selectbox!) in a way like the Khalid did in
http://2bits.com/articles/bottleneck-replacing-taxonomy-term-count-nodes.... In my case I do static caching but only for the terms which were used once. However this case is the edge case and for that kind of issues the Taxonomy Edge and Term Node Count were developed. The current taxonomy system was apparently designed to vocabularies of thausends terms not hundreds thausends or more. For vocabularies of thausends terms the static caching in the current system is very effective and reduce chance of DB to become the bottleneck.

Another thing is that modules tend to use taxonomy_get_tree() in situations where less could do it. A good example is the taxonomy_select_nodes() which first gets the entire tree, just to get a list of tids, which can then be used to select nids. This could be done in a single sql query instead. The forum module also does some unnecessary stuff. I'm still flabbergasted by forum.module:186-196.

Thats a valid argument, however I'm not sure if it isnt waste of energy to modify general Drupal 6 API these days. Drupal 6 will be past in approx. 1.5 year. Might be we could integrate these changes to taxonomy edge or create "Big Taxonomy" project which will have different implementation of the standard taxonomy functions and will be designed to handle 100.000+ terms taxonomies.

PS: great work to tracing the circular reference issue

wojtha’s picture

To summarize my thoughts above I think that there is no good solution we could deliver for Drupal 6 which resolves perfomance issues for both subsets of the possible taxonomy vocabulary sizes. For smaller solutions the static caching of the whole vocabulary tree makes sense. It is simple stupid and just works. However for big vocabularies its a waste of memory when we use just several terms from set of 100.000 terms and advanced solution should be developed. I can imagine things like caching of partial trees (like menu system does) and such things to reduce need of processing of the whole taxonomy hierarchy tree during every page load and even for the flat vocabularies like Tags.... But this solution is needed in special cases like 1% of sites or so and might be even less.

gielfeldt’s picture

@wojtha: Thanks for the tip about taxonomy_term_count_nodes(). This is a perfect candidate for Taxonomy Edge.

asb’s picture

From my experience as maintainer of two dozend D6 sites, the memory requirements significantly increased with Drupal core 6.24; I noticed this because I have a rather tight PHP memory limit, configured indidually per site. As a side effect of this upgrade, I had to increase the PHP memory limit on about 1/3 of those sites with 30-40 MB per site. However, I don't know for sure that this is caused by the changes in taxonomy_get_tree().

@bleen18 (#136): I can not comment on the right place for this kind of safeguards; this decision is up to capable developers that know the code. If there is a better place where issues like this are handled, I'd agree to close this one.

@wojtha (#139): Please do not bury D6 before it's dead. Considering the amount of essential contributed modules without full migration paths (like image attachments from legacy 'image' module), and the current state of the upgrade path from CCK to Field API, many of us can not even consider site updates. If D8 would be actually released in 1.5 years, and the missing/incomplete- upgrade- path- situation wouldn't change fundamentally, there might be lots D6 sites past the end of it's official life cycle, and many of them might depend on "workarounds" like the 'Taxonomy Edge' module, or custom code rewrites like you mentioned in #139.

Gábor Hojtsy’s picture

If it was "better before", sounds like we want to roll this back?

asb’s picture

I'm not sure if the issues discussed in the past days warrant a full roll back of this patch; it's more than two years of work, and maybe a minor adjustment, like a "simple if-statement" (as suggested in #134) would suffice?

However, this issue is about "taxonomy_get_tree() memory issues", and the patch committed was probably #117, where deviantintegral states, that the new code is "about 20% faster", but it doesn't consume less memory. Maybe someone is able to provide some tests that allow to quantify what the new code actually does in regard to the initial issue's problem?

gielfeldt’s picture

IMHO, and to (partially?) repeat myself, I don't believe the "memory issues" will ever be solved as long as taxonomy_get_tree() is used for the tasks that's it's used for, and as long it loads every term for a vocabulary and storing it in a static cache. It inherently doesn't scale. However if this is by design, as previously suggested by wojtha, then this issue should be marked as "won't fix".

I also agree with asb that this issue has been derailed somewhere along it's path, as the patch to taxonomy_get_tree() doesn't address the memory consumption, but optimizes the function in another way (eliminates recursion). But ..... I suspect the gain of eliminating the recursion is not seen until you have large and/or deep trees, and then it seems to only serve as a micro-optimization since the memory issue at this point will be more of a problem?

While taxonomy_term_count_nodes() and taxonomy_select_nodes() have been eliminated in D7, the taxonomy_get_tree() memory issues are still present in D8 AFAIK.

Francewhoa’s picture

@All: Confirming that asb in #123 is correct. The patch in #117 does not fix the memory consumption issue. We double checked, the patch in #117 works well and is faster/optimized, but it does not fix the memory consumption issue.

gpk’s picture

#117 also replaced drupal_clone() with clone, which is not in PHP 4 (which Drupal 6 apparently supports). #1429376: Use of clone not allowed in PHP4 resulting in WSOD

marthinal’s picture

#118 fix my problems thanks very much @gielfeldt

seabear’s picture

40000 terms add term page (admin/content/taxonomy/1/add/term)
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 40 bytes) in /modules/taxonomy/taxonomy.module on line 1136

gielfeldt’s picture

@seabear: The problem here is not so much taxonomy_get_tree() as what it's used for (e.g. drop-down list with parents on add/edit page). You might be able to fix this problem using the Big Autocomplete TAXonomy module, which uses an autocomplete box for selecting a parent, instead of a drop-down box with e.g. 40000 terms.

vlad.dancer’s picture

HI, folks. I'll just leave a word here. In my case I had corrupted data in the taxonomy_term_hierarchy table, two terms had a circular hierarchy. So I removed them and all worked fine then.

Status: Active » Closed (outdated)

Automatically closed because Drupal 6 is no longer supported. If the issue verifiably applies to later versions, please reopen with details and update the version.

RAWDESK’s picture

@vlad.dancer Thanks man.. a life saver. had a similar issue in a custom feed module, but didn't check for duplicate entries coming from my remote source.