| Project: | Drupal core |
| Version: | 6.x-dev |
| Component: | taxonomy.module |
| Category: | bug report |
| Priority: | critical |
| Assigned: | Unassigned |
| Status: | closed (fixed) |
| Issue tags: | needs backport to D6, Performance, taxonomy_get_tree |
Issue Summary
This usability issue is that when large taxonomies are apply to a content type the end users wait between 15 to 60 seconds before some pages are fully loaded and display.
STEPS TO REPRODUCE THE ISSUE
Download & install Drupal 7.x-dev http://drupal.org/project/drupal
Download & install Devel module 7.x-1.x-dev http://drupal.org/project/drupal
Activate Devel & 'Devel generate' modules admin/config/modules
Go to admin/config/development/generate/taxonomy to generate a new taxonomy. Using the following values. Feel free to use bigger values if your server has more resources.
-How many vocabularies would you like to generate? 10
-How many terms would you like to generate? 1000
-Max word length of term/vocab names: 12
-Leave unchecked "Delete existing terms and vocabularies before generating new content" option.
Generate 14 more taxonomies using the same values as above. So you'll get 15 vocabularies with a total of 150,000 terms. The important part is a total of 150,000 term. The number of vocabularies doesn't make a big difference. Note that to avoid error, each time you create a new taxonomy you must go to admin/config/development/generate/taxonomy Do NOT use the browser back button. Otherwise your Drupal installation might no longer work.
(Alternative: If none of the above works try creating large taxonomy on a local site instead of remote site. Then import the database to your remote site.)
(Note: Before generating a large vocabulary your PHP and MySQL limits might need to be increase appropriately. Read more at http://drupal.org/node/29268 Another option is to use a local site to create your large vocabulary.)
Go to admin/structure/taxonomy. Under the column 'Content types' make sure that all your vocabularies are apply to the content type 'Page'. NOT apply with Blog, Forum or Story.
Once your large vocabularies have been created and applied you'll notice that the following pages are now very slow to load:
-'Find content' at admin/content. Find below two screenshots 'admin-content-1-during-loading.png' and 'admin-content-2-after-loading.png' to clarify.
-'Create Page' at node/add/page
-And all other pages displaying large taxonomies.
The usability issue is that the end users wait between 15 seconds to 60 seconds before the page is fully loaded and display. I have set this issue to 'critical' because most end users give up after waiting only a few seconds. Thinking that Drupal is too slow or falsely that Drupal 'Bug'. Advanced end users will be annoy. But will know that the page is still loading. This usability issue is present in Drupal 5, 6 & 7 dev. Large taxonomies are often use for various lists such as 'Countries, Cities, Accent Cities' or Book numbers.
Expected result is waiting 1 to 3 seconds before the page is fully loaded and display. I will post a few suggested solutions below. Hope this help.
I'm not a developer but I would be happy to contribute patch testing.
| Attachment | Size | Status | Test result | Operations |
|---|---|---|---|---|
| admin-content-1-during-loading.png | 61.4 KB | Ignored: Check issue status. | None | None |
| admin-content-2-after-loading.png | 43.85 KB | Ignored: Check issue status. | None | None |
Comments
#1
Here are a few suggested solutions. Hope this help.
My vote goes for the first solution. Because it seems it would answer all type of user needs. The drop down menu would be the default value. But user could set it to change if a taxonomy is bigger than a predefined number of terms. Plus user could set a number of terms to trigger an autocomplete.
#2
Here is another suggestion that might help fix this issue. Overriding core taxonomy module with custom code. One way of doing this is using
taxonomy_override_selectorto reduce load overheads.Alan did that to improve memory usage for another module call 'TaxiSelect'. And had great results. Read more at http://drupal.org/node/544070
Any volunteers?
#3
This can still get in as a usability/performance improvement in Drupal 7.
#4
Moving it from my queue as it doesn't seem to be a real usability bug.
#5
I'd say that we should fold into core the http://drupal.org/project/batax module
#6
+1
#7
This issue exists since like ever.
#8
The fact that this is old is not too relevant IMO. If this bug showed up today, we would mark it critical.
#9
I like the idea to fold the batax module into core.
Subscribing!
#10
We're not blocking the release of 7 on a bug that's present in every prior version of Drupal, and represents a big edge case. Please use the 'critical' status judiciously.
#11
Crashing people's sites just because they happen to have a decent amount of content is a bug, and a critical one. If we were loading the entire contents of the node table every time you go to search/node, admin/content, node/n/edit and taxonomy/term/n/edit this wouldn't have been downgraded. But I won't play status pong.
#12
Could you clarify? Is what's happening in 7.x worse than what's happening in 6.x and below? If not, I agree it is a "major" bug, but it is not a release-blocking "critical" one. If so, then restore the status.
#13
Whats happenning is the same as what happens in 6. I disagree with the criteria that "if it is in 6, it is not critical". Bugs should be evaluated as if they are new. catch is saying that if we had this same bug for nodes, it would be critical. taxonomy is a similar fundamental system. Looks at the paths that catch posted. Those are fundamental paths that won't even load if you have a lot of terms.
Anyway, the point is not to argue about status. I propose that the best fix is to move code from http://drupal.org/project/batax into core. We'd just be adding a new field api widget. Please comment on the feasibility of committing such a patch. If it is plausible (i know there are no guarantees), we'll prepare a patch.
#14
It's no worse than D6, except that D7 memory usage is way up, so it increases the likelihood of hitting memory limits (especially now we have menu_rebuild() a bit more under control).
However I would say that a lot more sites use large taxonomies now, compared to when D6 was released, which is why there's several D7 issues trying to deal with this (many of which have been bumped to D8 because they require huge API changes and refactoring), and a growing number of contrib modules to work around it.
We added variable_get('taxonomy_override_selector') at the very last minute in the Drupal 6 cycle, with the hope of getting hierarchical select (or some other scaleable solution to select lists and hierarchy), into core in Drupal 7, which we haven't done. So all that's left at this point is damage control.
I'm moving this back to critical for those reasons. At the worst case, we need to ensure that variable_get('taxonomy_override_selector') is called anywhere taxonomy_get_tree() might get triggered in core. Where it isn't, that's either a critical leftover from the D6 patch or a regression.
This is currently done by taxonomy_form_all() - see #693362: taxonomy_form_all() is dangerous
Field allowed values (no issue yet)
Taxonomy parent selection (still has the variable iirc)
edit: or like Moshe says, try to actually fix it - with something along the lines of batax being the lowest impact solution.
#15
subscribe
#16
that status change was accidental, i believe. jeremy is sufferring from this bug badly on economist.com
#17
Also opened #698918: Reduce memory usage from taxonomy_get_tree() when parent is specified when I remembered that even if you specify $parent in taxonomy_get_tree() it still loads the entire vocabulary into memory. I'm not sure if that issue is 100% viable yet, but if it is then that'd mean we don't have to remove the last four words of this issue title ;)
#18
I have uploaded a patch to this #106015: [performance] DB caching for taxonomy_get_tree() which may help the DB side of this bug, but it still won't help a user with a 100,000 item select box.
#19
IIUC, this is about building a too-large form element in content/admin and search/node/.
If so this is no longer an issue due to #693362: taxonomy_form_all() is dangerous removing the filter-by-term form elements.
Marking as duplicate.
#20
OR: If we want to do something like replace with an autocomplete or other widget, then change the issue title and reopen.
#21
We still call this for the term parents when editing nodes, and in taxonomy_field_allowed_values(). Retitling for the remaining issues.
#22
I can take responsibility for working on a batax-ish solution for this issue. I did the Field API conversion of taxonomy, so if this is a sane solution we can agree with, I am equipped to do it. I can't start until Thursday, but I can give the whole day. I'll assign the issue to me when I start.
My initial concerns:
A new widget for this problem sounds.. oblique. Can't I just force the widget to be the term reference autocomplete widget?
This is a problem of thresholds. What is the threshold? Memory? Time? Number of allowed values in any given term field? What's the right moment to determine when this threshold is crossed? Or do we trust the user to know when her vocab is too big by recognizing the symptoms of white screens and locked up web browsers?
The widget for selecting a parent term is a FAPI element while the one for selecting a term reference field value is a Field API widget. This means I'm looking at two places to intervene in two different ways.
My reading of batax is that it represents the ancestry of the terms suggested by the autocomplete. Is this true? My guess is that in a massive list with deep hierarchies, there will be name collisions. The only way to distinguish to terms is to know their ancestry.
#23
Forcing the widget sounds good.
Threshold, I think this should be a variable (although not necessarily an exposed one). We could have a taxonomy_autocomplete_threshold and a taxonomy_vocabulary_term_counts variable -the latter updated on taxonomy_term_insert. When the latter crosses the formerf or any particular vocabulary, update instances?
I like the idea of showing the ancestry. Another option if that's hard, would be the [tid: 123] approach used by node and user reference.
#24
Usability becomes an issue way before WSOD and slow browsers. A country select box has < 200 items in it and is not particularly usable. So I would suggest a default cut off around that sort of value.
#25
Yeah 150-200 seems like a sensible threshold to me too.
#26
My priority will be to develop the widget switch-a-roo based on the threshold. JeremyFrench is right... the usability threshold is crossed well before the system resources threshold.
On the point of usability, displaying the ancestry in the term suggestions is much more useful than a term ID. If I can do this as well in the time I can contribute, I will make it an option on the autocomplete widget.
#27
Here we go!
#28
This is my first incomplete attempt. I will definitely work on this more, but I want some reactions to the unfortunate stuff that is going on.
#412518: Convert taxonomy_node_* related code to use field API + upgrade path removed all the old nodeapi code from taxonomy, but it also removed the simple and versatile Form API element code. The taxonomy term elements for nodes were also used for the taxonomy term edit form. taxonomy_autocomplete took just a vocabulary ID. Things were simple.
Under Field API, we're now loading field info to determine how to validate input. However, some of the code I'm going to target for element/widget switch-a-roo to avoid calls to taxonomy_get_tree use plain Form API elements. This makes it hard to reuse the autocomplete callback and element validate functions.
So guess what? I brought some of this code back.
Each time a term is added, I'm checking to see if the autocomplete threshold has already been crossed for its vocab. If the threshold isn't crossed yet, I count the terms in the vocab. If it's now beyond the threshold, I store the machine name of the vocab in a variable (probably should be the vid) and I never check again. I don't actually store the count because it would never be valid for vocabs which have crossed the threshold.
I'm marking needs review because I want tests to run. There's still plenty of unfinished work here.
#29
H,,. I wonder if we should check that count during cron instead of adding a query on every save. That sort of query is painful when creating terms en masse like a data migration. At least that auto-check for vocab size should be able to be disabled.
#30
@bangpound: did quick scan of your code, looks ok so far.
Re: what moshe said in #29, how about a static flag so the check is only done once? This would cover the mass import use case.
+++ modules/taxonomy/taxonomy.module@@ -1540,3 +1586,27 @@ function taxonomy_taxonomy_term_delete($term) {
+function taxonomy_taxonomy_term_insert($term) {
++ static $vid_checked = array();
++ if (isset($vid_checked[$term->vocabulary_machine_name])) {
++ return;
++ }
+ $counts = variable_get('taxonomy_vocabulary_term_counts', array());
+
+ // Counts array stores vocabulary machine name when the vocabulary has
+ // exceeded the autocomplete threshold.
+ if (empty($counts[$term->vocabulary_machine_name])) {
+ $count = db_select('taxonomy_term_data', 't')
+ ->condition('t.vid', $term->vid)
+ ->countQuery()
+ ->execute()
+ ->fetchField();
+ if ($count > variable_get('taxonomy_autocomplete_threshold', 150)) {
+ $counts[$term->vocabulary_machine_name] = $term->vocabulary_machine_name;
+ variable_set('taxonomy_vocabulary_term_counts', $counts);
+ }
+ }
++ $vid_checked[$term->vocabulary_machine_name] = TRUE;
+}
EDIT: Not versed enough in drupal_static() to implement that, the above code probably should be using it.
#31
@Moshe: this is a no return theshold- it's checked until the threshold is crossed, then never checked again. So for migrations it would only affect the first 200 terms inserted into a vocabulary - once those are in, the threshold is passed, and it shouldn't be checked again.
For migrations, it'd be enough to set $conf['taxonomy_autocomplete_threshold'] = 1; - then you'd get only one COUNT query per vocabulary. Although that's a bit obscure, I'm not sure we need any other special handling here. Would prefer we didn't do it on cron, since then you're counting for a 100 term vocabulary that never changes in perpetuity.
However, variable_set('taxonomy_vocabulary_term_counts', $counts); - this variable should be renamed to taxonomy_vocabulary_threshold_exceeded or something - it's not easy to tell what's going on there.
@bangpound - what's left to do here otherwise?
#32
There's plenty left to do. I can get back on it on Thursday and Friday this week and hopefully push it across the "needs review" threshold.
#33
@catch. thanks. my concerns are gone. i'd be happy to see this move ahead.
#34
Ok. Unless anyone gets very eager before Thursday and does some work on it, I'll pick up where I left off then.
Catch's explanation of the plan vis à vis the threshold and SQL counts is accurate.
#35
@all: If you want to test patches. Using Drupal 7 alpha 3 or more recent the steps to 'assigning vocabularies to content types' has changed.
Source: Drupal 7 online help button.
#overlay=admin/help/taxonomyI'm currently testing patch #28. Will post my result here.
#36
Thanks for the patch bangpound :)
Here is my testing results with patch in #28. The patch applied successfully. I tested only the front end. I'm not a coder so I haven't look at the code. I have assigned a vocabulary containing 50,000+ terms to the content types 'Basic Page'.
I didn't notice any change before and after the patch. Is it normal? Maybe since bangpound said in #28 that there is plenty of unfinished work. Find attached screenshots to clarify before after patch.
Tested with.
-Drupal 7 alpha 3 fresh install.
-Garland theme.
-Firefox 3.
-Server OS: Ubuntu Server 8.04 LTS
-End user OS: Windows XP
#37
The following is a note to myself for future patch testing.
New steps to create large vocabulary. Then attach it to a content type.
Create new vocabularies with 'Devel Generate taxonomy'.
Go to admin/structure/types
Click on 'manage fields' link beside 'Basic page'.
Fill the 'Add new field' row.
For 'Field type' select 'Term reference'.
For 'Widget' select 'Select list'.
Click on save button. Find attached screenshot to clarify.
On the next screen FIELD SETTING. Select the Vocabulary 'Tags'. Note that the current 'Devel Generate taxonomy (devel-7.x-1.0-beta1.tar)' module as a glitch. All new taxonomy generated with 'Devel Generate taxonomy' are added to the 'Tags' vocabulary. I have open another issue about that at #750842: 'Devel Generate taxonomy' does not generate expected number of terms
On the next screen leave all to default. Click on SAVE SETTINGS button.
On the next screen click on SAVE CONTENT TYPE button.
Go to /user/1#overlay=node/add/page
Follow instructions on screen to create a new page.
#38
Attaching screenshots for #37.
#39
Onopoc: The patch is not finished yet. I don't think there's anything to profile yet, and I will likely not do anything for taxonomy term reference fields. I'm targeting taxonomy term parents in the term edit form. Otherwise, Drupal needs to urge the user to use autocomplete widgets in taxonomy term reference fields when the total number of allowed values exceeds the threshold.
#40
@bangpound: Thanks for the information. I'll be happy to test other patches. I'll wait for the 'Status' to read 'Needs review'. Take all the time you need ;)
#41
Subscribe, similar approach recommended for #763380: Do not use taxonomy_get_tree() in taxonomy_overview_terms()
#42
Great work here. Hopefully I'll get Hierarchical Select in D8 core.
#43
"Critical: When a bug renders a module, a core system, or a popular function unusable. ... to be fixed immediately because the software is not usable at all."
This isn't critical. The software works in a majority of cases.
Not to dismiss it, this issue should be fixed.
#44
Firstly I'd like to lend my support to this being critical. It can cause your server to run out of memory and crash. Even worse for some, a whole section of your site can silently fail when the number of tags used forces the memory usage above the php limit. If you allow free tagging and have a large amount of content then hitting this problem is virtually inevitable. Just because this critical bug existed in 5 and 6 doesn't make it any less essential to fix in my view.
Secondly. Isn't it usually vocabularies with no hierarchy such as tags that end up having enough terms to cause this problem? Wouldn't it be sensible to check if the vocab even has a hierarchy before doing the work and using the memory required to find the non-existent tree of a term? I think this would solve about 95% of the problems with this - probably not good enough to mark this as totally fixed it would surely justify a downgrade from critical.
Anyway here's my the Drupal 5 code for my solution
(at the start of taxonomy_get_tree)
+ if($parent > 0) {
+ $res = taxonomy_get_vocabulary($vid);
+ if($res->hierarchy < 1) {
+ return array();
+ }
+ }
If anyone knows a better place to post this or sees an error please let me know.
#45
We have major priority now.
#46
I was taking a closer look at the taxonomy_get_tree function. when having a bigger taxonomy with some multi-parents, taxonomy_get_tree results in many, many recursions, which is not really fast in php. But what really takes time is the array_merge function call for merging in the results of the recursion in the tree array. with a little bit of refactoring the looping over the children can be separated into an own helper function an the $tree can be passed by reference, which avoids the array_merge and speeds at the code.
anyway, at the moment I think the whole function is broken, see #870528: taxonomy_get_parents(), taxonomy_get_children(), and taxonomy_get_tree() do not return a full term objects
#47
I was working on improving the performance of the taxonomy_get_tree function, which is used in many places (e.g. for loading the options of a select list or for the term overview page). This patch does not affect any form elements, which try to render too many terms at a time, so not definitely sure if it's the right place to post the patch.
As I pointed out in my previous comment, the recursion and the array_merge calls take up a long time. Thus I replaced the whole recursion with iterations, which really speeds up the complete function. Performance improvements can be easily seen with bigger taxonomies and multi-parents. On the downside, the code gets a bit more difficult to read.
#48
Subscribing.
#49
I tried to use my taxonomy with D7 and was facing heavy performance problems (on a server with an Intel Core i7 920 and 12 GB RAM). The vocabulary has around 14.000 terms and some multi-parents.
First, the term overview crashes -> fatal error: memory size exhausted: this is caused by taxonomy_get_tree and the huge table.
Second, the term edit form takes around 12 seconds just for loading the form, same problem here: taxonomy_get_tree
In general, every time taxonomy_get_tree gets called, I have to wait for a long time. Since taxonomy_get_tree fetches full term objects, the function is much slower than in D6 (full entity loading has been added in #870528: taxonomy_get_parents(), taxonomy_get_children(), and taxonomy_get_tree() do not return a full term objects). When switching back to the old behaviour of taxonomy_get_tree, without the entity loading, the execution time goes back to 1 second (still bad).
marking this issue as critical, because the performance is much worse than in D6 and the problem will affect other sites as well.
overview of the execution time of taxonomy_get_tree:
with entity loading: 12sec
without entity loading: 1,1 sec
without entity loading and without recursions (patch comment #47): 0,45sec (additionally removing the cloning of terms reduces the PHP memory peak from 160mb to 62mb)
not sure what the best solution is. having incomplete objects is not really a good idea, but on the other side we need a fast function for building the forms and overviews.
further, multi parents are currently broken, see #899258: Multi parents broken
#50
demoting from critical because there is nothing new here, this always was so.
#51
>demoting from critical because there is nothing new here, this always was so.
I'm not happy about another critical, but I disagree. As mh86 noted above, the same taxonomy was working fine in previous drupal versions - so that's a regression :(
A that bad performing taxonomy_get_tree() seriously breaks dealing with any term, as the function is used during term editing + overview. Improving it to only entity-load the terms really needed would be a first start. However I doubt that will suffice as its often called for all vocabs too.
Apart from that, the taxonomy_get_tree() does not really return a tree. Moreover it looks like it is basically used for generating a flat list of the tree for overviews / selects. So a separate function optimized for generating that list (= without full entity loading) would make sense to me - maybe it should even return arrays instead of objects to make a clear distinction to full entity objects. taxonomy_get_tree() could then be fixed up in core to really return a tree, however I think there is no use-case for that in core.
#52
I was working on the performance problems. As fago suggested in the previous comment, I added a separate function taxonomy_get_entries, which is used in places, where a full term object is not needed, e.g. to generate the options array for select lists. This function works much more efficiently than taxonomy_get_tree: when using it with my vocabulary the processing time goes down from 12 seconds (taxonomy_get_tree) to 0,45 seconds (taxonomy_get_entries) and uses less memory. As soon as multi-parents are fixed in taxonomy_get_tree (#899258: Multi parents broken), it will even be slower.
taxonomy_get_entries is also faster than taxonomy_get_tree in D6, because the recursion has been replaced with iterations.
Furthermore, it is clearly documented, that taxonomy_get_entries will return incomplete term objects. If someone needs full terms for further processing, taxonomy_get_tree has to be used. In core, all occurrences of taxonomy_get_tree can be replaced with the new function.
Concerning the terms overview page, the new function helps, but we still face problems with deeper hierarchies and long child lists, since it tries to render all child terms of the first parent until it reaches the next root level term (same behaviour as in D6). Maybe a threshold of terms to render might help. This would make it impossible to move the child terms to the next root level term with drag&drop, but still better than a white screen.
#53
#54
fixed trailing whitespaces at line endings
#55
+++ modules/taxonomy/taxonomy.module 3 Sep 2010 14:31:38 -0000@@ -858,6 +860,99 @@ function taxonomy_get_tree($vid, $parent
+ * Creates a flat list of term entries, usually used for generating forms or ¶
...
+ $process_parents = array(); ¶
...
+ $process_parents[] = $parent; ¶
+ $process_parents[] = $child->tid; ¶
...
+ reset($children[$vid][$child->tid]); ¶
...
+ next($children[$vid][$parent]); ¶
trailing white spaces
+++ modules/taxonomy/taxonomy.module 3 Sep 2010 14:31:38 -0000@@ -858,6 +860,99 @@ function taxonomy_get_tree($vid, $parent
+ * Creates a flat list of term entries, usually used for generating forms or ¶
...
+ $process_parents = array(); ¶
trailing white spaces
Otherwise I think this is a good idea. Changing the comments on taxonomy_get_tree() would be good, to indicate that people should not use it anymore for the hierarchy listing use case. And a hint like @see taxonomy_get_entries().
And we should deal with the threshold for the term overview page in a separate issue.
Powered by Dreditor.
#56
Crosspost. Resetting status
#57
#58
Yes. The issue is major but there is nothing new. It's just that more users will meet this because taxonomy_term is now a fielded entity eating up more RAM. Retrieveing a whole vocabulary was always impossible for any bigger site anyways. Tell me if there is a real regression here, a new bug. One user meeting this the first time is not one.
#59
As stated in #49,
...it is therefore a new bug, a regression as it now takes longer than D6. The results are different, and worse.
#60
updated patch contains additional comments for taxonomy_get_tree, as klausi recommended in #57
furthermore I did some performance testing for another use case: a flat free tagging vocabulary with 10.000 terms.
just calling the functions for this voc took following execution times:
taxonomy_get_tree(): 6.01729798317 sec
taxonomy_get_entries() 0.0657458305359 sec
#61
In agreement with chx this is not critical vocabularies have always behaved in this way.
#62
Subscribing. As mh86 and fago are pointing out the "same taxonomy was working fine in previous Drupal versions - so that's a regression". It seems pretty clear that regardless of what re-factoring or new/different calls are involved, what used to work with the same set of terms in D6 now doesn't.
Even if the same problem exists in D6 with many taxonomy terms, the fact is, there is still a problem. Consequently, using taxonomies in D7 will be difficult for many users and developers.
#63
Major does not mean something won't be fixed. This bug does anot render the system unusable, however it does have significant repercussions therefore according to this http://drupal.org/node/45111 the issue is major not critical.
#64
Just my $.02...
Also agree it's not critical; performance is relative and "many" terms would cause problems in D6 as well as in D7, or if you change the number that "many terms" means.
This *will* be fixed, but not critical (meaning "hold back D7 release").
I'd suggest focusing on marking issues "critical" for much more common use cases that do cause slowness/memory consumption problems.
Having said the above... it'd be nice to have a memory/performance benchmark with/without the patch to ease review =)
#65
Please don't get anxious about the priority. This particular issue is a hard one to classify. It never stopped Drupal 5 or Drupal 6 from being released, so why stop Drupal 7? Meanwhile, it's clear that Drupal 7 fanciness exacerbates the bad situation.
As of April in San Francisco, Webchick and I (and others) agreed that this issue is major and not critical. However, real world scenarios with the upgrade path could obligate us to escalate the priority of the issue.
Will this be a critical issue with no functioning patch? Or a major issue with a golden patch?
According to the evidence presented, mh86's function taxonomy_get_entries() is much faster. He says it's harder to understand, but taxonomy_get_tree() is ALREADY opaque to me and taxonomy_get_entries() is a bit clearer to understand. So this is a good accomplishment!
Drupal core does not need debris and legacy functions. taxonomy_get_entries() is not going to get a pass while taxonomy_get_tree() also gets to stick around. Ultimately, the RTBC solution is one that FIXES or REPLACES taxonomy_get_tree().
The array returned by taxonomy_get_tree() is a real trick to use. It's not a flat array. Is it exceptional enough that we don't need to think about entities here? Parent term is not a field,
so the way hierarchy is implemented in Drupal 7 is already a bit of a relic.
Do all our tests pass if the body of mh86's taxonomy_get_entries() is pasted over the body of taxonomy_get_tree()? This is where I'd like to start.
I'll come back to the issue in about a day. I need to do more research about how taxonomy_get_tree() is used in core. I don't oppose having it return a lightweight representation of term names, IDs and hierarchy only. But I also don't know if that's possible. More tomorrow.
#66
I am a bit hesitant here because this approach skips hook_entity_load and hook_taxonomy_term_entity_load. Does anyone want to mangle taxonomy term names on load? Another approach would be to have a helper that returns the select options because that's what get tree is usually called for... and accept that term names can't be mangled on load. Other uses of taxonomy_get_tree I suspect would be with smaller vocabs anyways -- if you have 10 000 terms then loading that into memory is infeasible, one way or another. Also, killing recursive in get_tree is just awesome, thanks for the work! We totally should do that in get tree. I doubt it's hard to read. We have comments for that. You would do an awesome service to the Drupal world if you could convert drupal_render and drupal_render_children in a similar way. Please!
#67
The patch from #60 replaces all occurrences of taxonomy_get_tree with the new taxonomy_get_entries function and all tests pass. Full entity loading was added after alpha6 (see #870528: taxonomy_get_parents(), taxonomy_get_children(), and taxonomy_get_tree() do not return a full term objects).
I'm still not sure, how the best solution should look like. Dealing with incomplete terms is not a good idea. On the other hand, the way core uses it for building forms, full terms mean a lot of overhead. This was the reason, why I added a separate function for this use case, but still with the tradeoff of no possibility to change terms on hook load, as chx mentioned.
Furthermore, if we return a lightweight representation of terms, we should think about returning term arrays (like $term['tid']...) instead of incomplete term objects. This would make it clearer to developers, that they are not dealing with real terms. On the other hand this would break much code in core and contrib.
#68
Think of this in two phases: LIST gives us the rows from taxonomy_term_data and taxonomy_term_hierarchy tables. LOAD turns that into fattened entities. Can we design taxonomy_get_tree() to optionally return just the LIST? And if we want to LOAD, can the function take the outcome of the list and efficiently return entities based on that?
LIST is taxonomy_get_entries(), LOAD is taxonomy_get_tree() in this patch. Can we do this all in taxonomy_get_tree?
function taxonomy_get_tree($vid, $parent = 0, $max_depth = NULL, $only_list = TRUE) {[... mh86's taxonomy_get_entries() ...]
if ($only_list) { return $tree; }
[... call taxonomy_term_load_multiple on the tree ...]
return $tree;
}
#69
Designing it this way sounds reasonable to me. The code stays simple and it is close to the previous implementation (except the replacement of the recursion). Combining List+Load in one function just required a few additional lines of code and both profit from the iteration implementation.
To give some performance numbers, tested the function on my local machine with 5.000 terms, simple hierarchy:
At the moment core never requires full entities, therefore this change really improves performance.
I'm still thinking of the variable naming $list_only, which wasn't clear for me at the first glance. Maybe we should use something like $fetch_full_term_entities or $fetch_structure_only, what would you say?
#70
Do we need an additional static cache for $term_entities? The entity controller has a static cache. Why not just rely on that?
I cannot wait to try out this patch! Thanks mh86.
#71
You are right, there is no need for another static variable. I can update the patch tomorrow.
#72
"List only" mode may be a bad name, but the default for the function should be to load entities. It's important to keep the current behavior as the default.
After you post a new patch, I'll test it and go through existing usage of taxonomy_get_tree() to find places we can use "list only" mode. taxonomy_overview_terms_submit() and taxonomy_check_vocabulary_hierarchy() I think to start...
#73
here the updated patch:
In my previous posts I forgot to mention, that this patch also fixes multi parents, which are currently broken in taxonomy_get_tree (term with multi parents only added once to $tree array).
#74
This looks awesome. The @return in the doxygen needs updating as well. After that, it is RTBC IMO. Would be good to hear from bangpound on this latest revision.
#75
Looks great to me as well.
This issue was the first mention of #870528: taxonomy_get_parents(), taxonomy_get_children(), and taxonomy_get_tree() do not return a full term objects I've seen, I would've argued strongly against that going in given the existence of this issue, but at least this alleviates the worst of it.
#76
Also, don't you need to change all the calls to taxonomy_get_tree() in core so they they set $load_entities to FALSE? I'd be OK with just changing default behavior of taxonomy_get_tree() to FALSE but bangpound seems opposed to that.
#77
I ran into this as well. It's worth noting that this is not merely an edge case caused by very large vocabularies. If you attach fields to your
taxonomy_termentities you can quickly reach 20, 30+ second node edit form page loads with vocabularies containing relatively few terms (e.g. 100 or 200 terms). The more fields, the slower the loads.The patch in #73 looks good but it seems like a one-off for
taxonomy_get_tree()when to me this seems like a problem for any largeentity_load()operation that doesn't require full loading of fields. Is the recommendation here to not use the entity API when you know you don't need fields and will incur a serious performance hit for calling it? (Start looking around core and you will see many instances oftaxonomy_term_load_multiple()anduser_load_multiple()that will never need information beyond data from the entity base table...)My first inclination was to add a new function
entity_list()(and derivative wrapper functions) and corresponding flag to the entity controllers->load()method that avoids calling->attachLoad(). If this sounds reasonable I have the patch on hand.#78
The main issue with not using the entity API is something like #557292: TF #3: Convert node title to fields - that patch was reverted though, a theoretical contrib module which alters entity properties on load would break with either this taxonomy_get_tree() code or a more generic entity_list(). I think it's worth its own issue for discussion though. Also while it won't help with the memory issues from many hundreds of entities, quick plug for http://drupal.org/project/entitycache which will mean neither the base table nor the field storage, nor non-field load hooks will get hit once your cache is warm.
#79
@moshe
I'm in favour making the list mode as default option. I was reviewing the taxonomy_get_tree function calls and I don't see any place where it's is necessary to have full entities, but waiting for bangpounds opinion.
#80
Sorry for my delay here.
@mh86: taxonomy_get_tree() is called up in two ways. In taxonomy_overview_terms_submit() and taxonomy_check_vocabulary_hierarchy () it's called just to get the hierarchy. Term names don't matter. Whatever fields are attached to term entities don't matter either. This is where it's safe to use the "list only" mode.
In other places like taxonomy_allowed_values(), the term names are used as well as the hierarchy. If we don't load the terms as entities, are we going to be able to translate the term names? Will translating the term names need to be done two ways: (1) for entities and (2) for terms loaded in taxonomy_get_tree()'s list only mode?
The reason to make "list only" mode optional and not default is to match current expectations and documentation. The last argument presently on taxonomy_get_tree() [$depth] shouldn't be passed by any other function except taxonomy_get_tree() itself... although it is passed in some places. I prefer the current behavior to be the default because if someone never learns about this patch, they don't get bitten by it.
#81
If the plan is to translate term names during term loading, it sounds reasonable to me to use full entities. In this case, we are not going to consider performance problems on the node edit form, as yhahn reported in #77. Nevertheless, a simple contrib module could override the allowed_values with a more efficient implementation.
If we just use the list mode in allowed values, translating should still be possible with some form altering (as it was done previous to D7). Compared to full entities, this is more efficient, but maybe requires more code?
So, both approaches have pros and cons, most important we get something like the efficient list mode for taxonomy_get_tree in, so that contrib modules can take advantage.
Concerning the expectations of the results returned by taxonomy_get_tree, I would rather assume, developers don't expect full entities, because this is completely new since alpha6.
#82
Great work reducing the load on the taxonomy tree. Big +1 to this getting in.
In relation to the widget:
I know this is very late in the thread sorry, and I'm not trying to push my own project, but if you are still thinking about implementing the multiple drop-down widget have a quick look at the logic behind taxiselect. It uses a single non-recursive db_query() to generate the list of options from the tree, utilizing the power of the database rather than PHP to map the tree. On small vocabs, a single select is great. On moderate sized vocabs, multiple drop-downs work, but on really deep ones I haven't found anything else that comes close.
For example, using the geonames database, with ~6 million locations @ four levels deep, the Taxiselect autocomplete is comparable in performance to the user autocomplete, from an end users perspective and the memory load is trivial.
Using Hierarchical Select is not the best option as there are levels that have thousands of entities. For example, all towns and cities in Brazil
Using Big Autocomplete TAXonomy is not the best option as there are place names with 100's of duplicate entries.
I would NOT push taxiselect into core as the background processing logic is much more complicated, so maintenance would be an issue. Plus it is for a fairly specific use-case.
So rather than trying to get a specific implementation rushed into core, how about just focusing on refactoring the taxonomy select override so that users can have the option? This is a badly thought out flag thrown in without any real use case thought out.
Propose:
A new field on each taxonomy, 'parent_selector' vc(31) default 'taxonomy', and call the implementing mechanism for generating this FAPI element.
New hooks (thinking on the fly)
<?phpfunction hook_taxonomy_parent_list_selector() {
return array(
'taxonomy' => t('Single select list'),
);
}
?>
New callbacks
<?phpfunction callback_parent_list_selector_widget(....) {
return $element;
}
?>
New element of the vocab page to select the widget, #access == count(hook_taxonomy_parent_list_selector() results) > 1
Finally a new implementation of hook_disable() in taxonomy.module to default any callbacks to taxonomy in case the implementing module gets disabled.
This way we could have all of the above mentioned modules working in harmony and only core would have to handle the functionality of it's own widget.
#83
I just discussed this with mh86 and wondered whether it is an option to disable attaching fields to the term object by default?
I guess attaching fields is that what makes entity load that more expensive than a simple load operation. But I think they are usually not needed. So we could go and not attach them by default, but provide an API function to do so on demand.
#84
+1 subscribing
#85
I took a closer look at the entity loading and I found two main issues, which make loading of full terms that slow:
1. In the current implementation and in the patch we call:
$term_entities = taxonomy_term_load_multiple(array_keys($terms[$vid]));This ends up in a query with "... WHERE tid IN ( all term ids)", where retrieving the result set was extremely slow (even though the query execution time isn't that bad). Changing this line to:
$term_entities = taxonomy_term_load_multiple(array(), array('vid' => $vid));helps a lot.
2. indeed, attaching fields to a term takes a lot of time, even if the vocabulary has no fields associated. Attaching them on demand, as fago suggested, would make sense.
With fixing these issues we can get taxonomy_get_tree working in a reasonable amount of time, even with entities. I'm going to update the patch as soon as I find time.
#86
@mh86:
#1: I'd be interested to see where the difference is with this. One issue with using $conditions instead of $tids is that you don't get any benefit from modules like entitycache, but bulding IN() queries with thousands of terms is no fun either.
#2: if you don't attach fields, then you don't have an entity. I'd stick with the optimization (or more like revert of #870528: taxonomy_get_parents(), taxonomy_get_children(), and taxonomy_get_tree() do not return a full term objects, which should never have gone in) that's already here of not doing the entity loading. On-demand field loading is D8 material, and a separate issue from this one.
#87
Patch no longer applied, re-rolled with the doxygen changes Moshe asked for, and defaulting the parameter to FALSE. Otherwise it's unchaged apart from updating to HEAD.
Part one of #85 is worth looking at, but I'm pretty happy with this overall as is.
#88
Thanks for the changes. RTBC. Wait for green before commit.
#89
Subscribe
#90
Crosspost?
#91
Reviewed #87 - this fixes the performance problems I experienced on node edit pages that use vocabularies with many terms with fields.
FWIW, I agree w/ catch on #86/2 - local optimization seems to be preferable to introducing the concept of fieldless entities. #85/1 is an interesting point, but really material for a separate issue.
RTBC, here too.
#92
I re-opened then re-closed #870528: taxonomy_get_parents(), taxonomy_get_children(), and taxonomy_get_tree() do not return a full term objects, am bumping this back to critical since it's fixing that regression now.
#93
Pre-emptive postgres un-breakage via webchick in irc.
#94
So, a few things:
<?php-function taxonomy_get_tree($vid, $parent = 0, $max_depth = NULL, $depth = -1) {
+function taxonomy_get_tree($vid, $parent = 0, $max_depth = NULL, $load_entities = FALSE) {
?>
Technically, this is an API change but since $depth was previously an internal property only (and it always annoyed me to begin with), I think this is fine. Especially if we gain a performance boost from not doing recursion here, which is what it sounds like.
<?php- $query->orderBy('t.weight');
- $query->orderBy('t.name');
...
+ ->orderBy('weight')
+ ->orderBy('name')
?>
That seems like it's going to cause problems if someone hook_query_alter()s in the block table or something else with a "name" or "weight" column. (looks like catch took care of that in the last patch)
<?php+ $process_parents = array();
+ $process_parents[] = $parent;
+ while (count($process_parents)) {
+ $parent = array_pop($process_parents);
+ $depth = count($process_parents);
+ if ($max_depth > $depth && !empty($children[$vid][$parent])) {
+ $has_children = FALSE;
+ $child = current($children[$vid][$parent]);
+ do {
+ if (empty($child)) {
+ break;
+ }
+ $term = $load_entities ? $term_entities[$child] : $terms[$vid][$child];
...
+ $term->depth = $depth;
+ unset($term->parent);
+ $term->parents = $parents[$vid][$term->tid];
+ $tree[] = $term;
+ if (!empty($children[$vid][$term->tid])) {
+ $has_children = TRUE;
?>
Most of this function is documented pretty well (certainly much better than it was), but I'm wondering if we could flesh this part out with EVEN more comments, since I don't think we'll ever have the collective genius mindshare of taxonomy maintainers in one place again to clean it up after.
#95
Updated the patch from #93 with more comments on the loop.
Concerning #85/1: loading full entities from a voc with 2500 terms with current code took 1,2sec, with the $vid condition instead 0,85sec. But this can be addressed in a separate issue.
#96
The last submitted patch, taxonomy_get_tree_556842.patch, failed testing.
#97
Needs a re-roll with git --no-prefix.
Reviewing the patch:
This is my fault, but taxonomy_term_data should be {taxonomy_term_data} IMO, makes it obvious it's a db table. Not a blocker, may be making up that code style too...
Comment additions look good.
#98
this one should apply
further changed taxonomy_term_data to {taxonomy_term_data} in the comment
#99
Looks great.
#100
Great! Thanks for taking the time to do that.
Committed to HEAD.
#101
Thanks all for your contributions. This is awesome :)
#102
Is there any plan to roll this back to Drupal 6? We are having serious performance issues and have narrowed it down to Tagadelic tag cloud displaying terms and when displaying all nodes in a view organized by taxonomy terms. As Drupal 7 is only in beta, we're not ready to move our sites to it yet.
#103
Yeah, actually, the change as-is might be back-portable since it didn't end up affecting the UI. It does incur an API change, but to a property that contrib authors shouldn't be using. Up to Gábor, of course.
#104
Subscribing.
#105
subscribing.
#106
subscribing
#107
#98: taxonomy_get_tree_556842_1.patch queued for re-testing.
#108
#100 committed to D7. It will need work for D6 as the patch has heaps of D7 dependent code!
#109
For those who are trying to solve the slow down problem of hierarchical select, take a look at this link:
http://drupal.org/node/1059676#comment-4095278
Hope this helps.
#110
subscribing
#111
Straight backport to D6, just stripped the entities stuff.
#112
#919122: Fatal error: Allowed memory size of 134217728 bytes exhausted... in /modules/taxonomy/taxonomy.module on line 866 marked as duplicate of this issue.
#113
#286103: WSOD (blank page) on admin/content/node with freetagging taxonomy marked as dupe.
#114
Thanks a ton for the D6 patch ! works fine in the preliminarily tests with huge vocab of more than 10k items.
#115
@phantomvish if it is working for you please, RTBC this issue
#116
#117
Here's an update that fixes some small issues with the phpdoc comment.
Testing this looks to be about 20% faster than the current taxonomy_get_tree() function in D6. However, I'm not seeing any difference in the memory profile with this patch. Reading the previous issues, it seems as if the expected improvement is only in CPU time, and not memory use. If that's the case, this is fine at RTBC, otherwise I think this needs work.
#118
I've created module that somewhat address this issue. It extends the taxonomy data model, providing faster lookups for hierarchies.
http://drupal.org/project/taxonomy_edge
It includes a core patch for optimizing taxonomy_get_tree(). Also includes a module for Views providing optimized queries for queries using depth on taxonomies, and reintroduces the /all modifier for D7.
EDIT: Clarification: Taxonomy Edge only optimizes the select queries for taxonomies. It does not fix the problem of trying to put 100000 terms into a drop down box.
#119
Committed, pushed, thanks.
#120
Automatically closed -- issue fixed for 2 weeks with no activity.