Problem/Motivation
- The
{taxonomy_index}
table contains an index of terms that are set on each published node. (For background on this table and why it is maintained only for published nodes, see #8). - Currently, the index is maintained in taxonomy field hooks, despite that terms are indexed per entity rather than per field. This leads to several problems:
- Duplicate index records are inserted when a node has the same term in different fields.
- Enforcing uniqueness at the field level would result in redundancy and overhead.
- Index entries may be "lost" for fields that are unchanged and not present in the entity object. (See #1126000: [meta] hook_field_*() called on fields not set in the $entity object (partial updates).)
- The current implementation also does not check if
$node->status
is set before using it, which can cause errors whennode_save()
is used to save nodes programmatically. (See #1110636: Notice: Undefined property: stdClass::$status in taxonomy_field_update.)
Proposed resolution
- Maintain the term index at the entity level rather than the field level. This allows all terms on an entity to be collected and indexed uniquely.
- Check for properties and field values before using them.
- Add test coverage for cases where multiple fields include the same terms.
Remaining tasks
- #58 was previously RTBC.
- #67 needs review. It includes cleanup and removes a separate bugfix. (#1343822: Autocreated taxonomy terms duplicated if they are added to multiple fields has been opened as a separate issue.)
User interface changes
- None.
API changes
- The taxonomy index is now maintained at the entity level.
- Two new API functions are available,
taxonomy_build_node_index()
andtaxonomy_delete_node_index()
.
Original report by rhayun
Hi all
i think i find a major bug in taxonomy module maybe it cause by the field module (i dont know)
i have a content type that have two fields in term reference type
once its a select list and the other is a autocomplete
when i save the node
the function 'taxonomy_field_update' call
but as i can see its not call once per field or object its call more then twice
and this is why the duplicate is come
// We don't maintain data for old revisions, so clear all previous values
// from the table. Since this hook runs once per field, per object, make
// sure we only wipe values once.
if (!isset($first_call[$entity->nid])) {
$first_call[$entity->nid] = FALSE;
db_delete('taxonomy_index')->condition('nid', $entity->nid)->execute();
}
well its work its clean all terms from taxonomy_index but its will work only if this function called twice and not more
cause when its called more then twice
the code:
if ($entity->status && !isset($first_call[$field[field_name]])) {
$query = db_insert('taxonomy_index')->fields(array('nid', 'tid', 'sticky', 'created'));
foreach ($items as $item) {
$query->values(array(
'nid' => $entity->nid,
'tid' => $item['tid'],
'sticky' => $entity->sticky,
'created' => $entity->created,
));
}
$query->execute();
}
keep insert the same terms into taxonomy_index
and its cause a duplicate
please test it and confirm this bug
BTW:
i made a solution but i dont know if it elegant
i just did this change and its fix the problem but dont know if its ok to do it:
/**
* Implements hook_field_update().
*/
function taxonomy_field_update($entity_type, $entity, $field, $instance, $langcode, &$items) {
if (variable_get('taxonomy_maintain_index_table', TRUE) && $field['storage']['type'] == 'field_sql_storage' && $entity_type == 'node') {
$first_call = &drupal_static(__FUNCTION__, array());
// We don't maintain data for old revisions, so clear all previous values
// from the table. Since this hook runs once per field, per object, make
// sure we only wipe values once.
if (!isset($first_call[$entity->nid])) {
$first_call[$entity->nid] = FALSE;
db_delete('taxonomy_index')->condition('nid', $entity->nid)->execute();
}
// Only save data to the table if the node is published.
if ($entity->status && !isset($first_call[$field[field_name]])) {
$query = db_insert('taxonomy_index')->fields(array('nid', 'tid', 'sticky', 'created'));
foreach ($items as $item) {
$query->values(array(
'nid' => $entity->nid,
'tid' => $item['tid'],
'sticky' => $entity->sticky,
'created' => $entity->created,
));
}
$query->execute();
}
$first_call[$field[field_name]] = FALSE;
}
}
Comment | File | Size | Author |
---|---|---|---|
#71 | taxonomy-index-1050466-71.patch | 13.34 KB | makara |
#67 | taxonomy-index-1050466-67.patch | 13.38 KB | xjm |
#63 | interdiff-60-61.txt | 750 bytes | xjm |
#61 | taxonomy-index-1050466-61.patch | 17.95 KB | xjm |
#60 | taxonomy-index-1050466-60.patch | 17.94 KB | xjm |
Comments
Comment #1
catchThis is a duplicate of #610076: Add a primary key to the {taxonomy_index} table and prevent duplicates being inserted.
Comment #2
Damien Tournoud CreditAttribution: Damien Tournoud commentedDe-duplicating. The current code *should* work, if I read it correctly. But it is very very very ugly. The
{taxonomy_index}
table is an entity table (it lists all the terms that are related to a *node*), as a consequence it should be maintained in an entity-level hook (here hook_node_update()), not in a field level hook.Comment #3
rhayun CreditAttribution: rhayun commentedIts can be fix also like this:
Comment #4
sunComment #5
catchThis is at least major due to #1126000: [meta] hook_field_*() called on fields not set in the $entity object (partial updates).
Comment #6
makara CreditAttribution: makara commentedSubscribe.
Comment #7
makara CreditAttribution: makara commentedHmm.. the
taxonomy_index
table is a bit strange to me. It only indexes tid to nid, means it only indexes for nodes but not for other entity types.Another thing is that I don't understand why do we need to index "sticky".
Comment #8
catchThe table has very limited functionality by design, it was added only to allow previous taxonomy features (like /taxonomy/term/n pages - which respect the 'sticky' flag, hence why it's there) to work in Drupal 7 without a serious performance regression compared to Drupal 6.
The proper solution for dealing with slow queries due to core's default field storage is either alternative field storage engines (such as mongodb or something like PBS), or something like the materialized view API. Neither of these were anywhere near core inclusion during the Drupal 7 release cycle, and {taxonomy_index} was put in as a compromise between postponing terms as fields on vapourware, vs. joining on all the possible field tables which would've been a significant performance regression. Similar tables exist in both tracker and forum modules to deal with similar problems.
If you just want this table, but for any entity type, there is http://drupal.org/project/taxonomy_entity_index - I would've objected to including something like this in Drupal 7 core since it is too generic to justify the overhead for core, but not generic enough to solve the wider problem of listing query performance against field values or other normalized data.
I'm hopeful that the work on materialized view API as part of gsoc, or Damien's proposed document storage model for Drupal 8, would allow us to ship a proper system for this stuff in Drupal 8 rather than these one-off hacks.
Comment #9
makara CreditAttribution: makara commentedThen this issue is for Drupal 7 I think. And let's get this hack fixed so that:
1) We don't get duplicate index if a node has two fields reference to a same term (
taxonomy_field_insert
andtaxonomy_field_update
).2) We don't get two terms with the same name if a user input the same thing in two free tagging fields that reference to a same vocabulary (
taxonomy_field_presave
).I'm trying to fix the first point with the patch. Also added a new test case.
I'm using node hooks since the index is only for node. I changed the title as well.
Comment #11
makara CreditAttribution: makara commentedFailed to apply..
Comment #12
xjmComment #14
makara CreditAttribution: makara commentedOK. We need to support cases that node_save() is called directly.
I added tests for that as well.
Comment #15
xjmTagging.
Comment #16
xjmTagging.
Comment #17
makara CreditAttribution: makara commentedHi,
What does the tag "Issue summary initiative" mean? What do we need to do for it?
Thanks.
Comment #18
xjmHi makara,
The tag is to help contributors find issues where an issue summary might be useful, but it's not a requirement, just something that's meant to be helpful. (You can read the post about the tag for more information if you're interested.)
Comment #19
mat. CreditAttribution: mat. commentedHi,
I have the same trouble as described above.
I have duplicates in taxonomy_index table
http://imageshack.us/photo/my-images/829/moodleszskmczphpmyadmin.png/
and this causes the duplicates items in Views (for example "Related posts", etc.).
I am only partly experienced Drupal user, that is why it is not clear for me.
Is this patch the solution of the problem?
Is this trouble of core which will be ported?
How to solve it..?
Comment #20
catch@makara, thanks for the patch!
Couple of things:
1. Could you post just the tests by themselves so the test bot shows them as a fail without the accompanying fix? It's good to have an issue that the tests fail in the issue itself.
2. Do we want to look at adding an update to locate the duplicate entries and remove them (this could be in a major follow-up issue I think).
3. Pending #2 we should also add an update/schema change to add a primary key on nid/tid to {taxonomy_index}.
The patch itself looks great and I think #2/3 can be done in followups, so there may not be anything more to do here.
Comment #21
makara CreditAttribution: makara commentedThanks, catch.
Attaching the tests without the fix.
For point 2 and 3 in #20, there's another case: http://drupal.org/node/610076.
And for this case, we still have one field hook to be converted:
taxonomy_field_presave()
. It can also generate duplicates. Please refer to #9.Comment #22
makara CreditAttribution: makara commented#21: 1050466_taxonomy_field_hook_tests.patch queued for re-testing.
Comment #24
fietserwin@catch: a patch to remove duplicates can be found in #610076-13: Add a primary key to the {taxonomy_index} table and prevent duplicates being inserted but rebuilding the table could be an option as well, it is, after all, duplicate data.
I got the idea that adding language to the table AND primary index will largely solve this problem, but am unsure about how this table is used and thus if it should contain language at all. My first response would be: of course it needs language, otherwise this table may present non-existing relationships within a given language context.
However adding language will not solve all problems as we still can have the case of 2 separate fields referring to the same term.
Comment #25
makara CreditAttribution: makara commentedAdding the test that tests when there are multiple free-tagging fields referencing to a same vocabulary.
Comment #27
catchThis needs to go into 8.x with a backport to d7. Code should be identical at the moment though.
Comment #28
makara CreditAttribution: makara commentedOK. The test only patch.
Comment #30
makara CreditAttribution: makara commentedAnd with the fix.
For the function
taxonomy_field_presave
, I think we don't need to convert it to a node hook. It doesn't maintain anything that is node or entity specified. We just need a test there to check if a term with the same name has been created.Comment #31
andypostCode looks good but should we measure performance impact of this change? Probably for sites with a lot of tags!
I think adding this to taxonomy_field_presave() really bad idea
At least this hunk should be optimized for direct query db_query('SELECT tid FROM {taxonomy_term_data} WHERE name = :name AND vid = :vid'....
hook_field_insert() and hook_field_insert() in taxonomy.api.php examples should be updated in follow-up issue (don't forget we should not have bad examples in core)
10 days to next Drupal core point release.
EDIT: Please add a reference to taxonomy_index
Comment #32
podaroksubscribe
Comment #33
makara CreditAttribution: makara commentedThanks andypost.
I thought it could benefit from static cache of entity load, but it actually resets the cache after each term saves.
I updated the patch.
Comment #34
andypostAnother idea to prevent race conditions is using db_merge() query. When _presave() is fired there's a probability that DB in transaction so depending on transaction isolation level we still could have duplicates so we need to add primary key first from #610076: Add a primary key to the {taxonomy_index} table and prevent duplicates being inserted
Another way is to drop insert in _presave() and use current implementation on field level with db_merge but still we should care about transaction, probably a savepoint support in core could help.
Also I very confused by performance of current implementation
Now it's much better but I think this leads to race conditions
Not sure about converting status to int because core should not prohibit a usage of values other then [0-1]. Same for sticky
10 days to next Drupal core point release.
Comment #35
andypostThe only possible situation when duplicates happen - 2 fields with the same vocabulary attached to node, it seems that this case is rarely used and there's no reason to make queries against taxonomy_index because fieldEntityQuery() is for this corner cases
Comment #36
makara CreditAttribution: makara commentedThanks andypost.
But sorry I don't quite understand your points.
What are you suggesting in #34 and #35?
Comment #37
larowlanThis solves the problem of trying to save a node using node_save that has been programatically manipulated.
+1 from this end.
Comment #38
xjmAttached just rerolls #33 for core/.
Comment #39
andypostIn case of node unpublished the index is not changed?
Comment #40
xjmThat's correct. Unpublished nodes are currently excluded from the index. There's another issue about that: #962664: Taxonomy Index for unpublished entities
Comment #41
andypostThey can be in original node but can be changed so this node should be excluded from index. Current approach looks tricky - if node has no status after save - this is a erroneous case so empty() is useless here.
Powered by Dreditor.
Comment #42
makara CreditAttribution: makara commented1. When updating a node, the index is removed before it is inserted again.
2. People can invoke node_save() directly with a node object that omits some properties, and the omitted properties will not be updated (except for the timestamps). So if you want to change the status, you need to say $node->status = 0, but not unset($node->status).
Comment #43
figureone CreditAttribution: figureone commented#38: taxonomy_index_hook-1050466-38.patch queued for re-testing.
Comment #44
chx CreditAttribution: chx commentedI like where this is going but it needs a few comments. Like, why are we using node_insert and not field_insert so that noone will revert the issue two years down the road. Also, a short comment on why are we calling taxonomy_node_delete and node_insert directly. I mean, I get it but they are hook implementations so it's a slight WTF needing a comment.
Comment #45
xjmThe test class could also use a somewhat better description. It doesn't mean much out of context. Also, generally, it's better to use (e.g.) "node ID" rather than "nid" in text.
A few of the one-line summaries also need to be adjusted slightly to standards; see http://drupal.org/node/1354#functions and http://drupal.org/node/1354#classes. In general, they should start with a third-person verb and describe in 80 characters or less precisely what the thing does:
Tests the blah for blah blah blah.
orTests to verify that the blah blah is blah.
Comment #46
makara CreditAttribution: makara commentedThanks all.
I'm not good at documentations. I'm trying my best.
Comment #47
xjmThanks for your continued work on this! It will be great to have this fixed. I had a few more suggestions on the comments, but I accidentally closed my window and lost what I'd written. :P So, instead, I'll just roll the patch with a few suggested changes.
Comment #48
xjmComment #49
xjmClicked save by accident. :P Just polished the docs a bit.
Comment #50
xjmOops, looks like I transposed a character. (Darn transpose hotkey being next to yank.) Attached fixes that.
Comment #51
xjmAaaaand diffing against the right branch. #50 is just an interdiff and will fail to apply. Sorry for the noise.
Comment #52
xjmComment #53
xjmFixing typo.
Comment #55
xjmI actually noticed this before reading chx's comment, and sort of wavered on whether helpers should be factored out of the insert and delete helpers for this reason. On the one hand, it's weird. On the other, helpers would add superfluous code, since this works fine really.
Edit: Also, looks like the bot is sick.
Comment #57
xjmTestbot, testbot.
Comment #58
chx CreditAttribution: chx commentedLooks good.
Comment #59
xjmAlright I dreamed about this and we do need to factor these out... what if something gets added to taxonomy's
hook_node_insert()
sometime? No one would expect they'd need to change this. I can reroll with that change.Comment #60
xjmFixed #59 and also cleaned up one of the test cases, since it was a bit unclear.
Comment #61
xjmWith a grammatical fix for a comment that was bothering me.
Comment #63
xjmComment #64
xjmAlso, after talking to sun and Dave Reid about this issue yesterday, I think it probably needs a summary.
Comment #65
xjmIn the process of writing the summary, I noticed that some of #34 has not been addressed, specifically the remark:
We still add this query in
taxonomy_field_presave()
:Is this a concern?
Maybe this fix and its test should not be a part of this issue at all.
Comment #65.0
xjmUpdated issue summary.
Comment #65.1
xjmUpdated issue summary.
Comment #66
xjmMarked #1110636: Notice: Undefined property: stdClass::$status in taxonomy_field_update as duplicate since that has to be fixed in order to fix this.
Comment #67
xjmAttached removes the hunk in #65 and its tests. Opened #1343822: Autocreated taxonomy terms duplicated if they are added to multiple fields, since it is a separate issue.
Comment #67.0
xjmUpdated issue summary.
Comment #68
xjmSummary has been added.
Comment #68.0
xjmUpdated issue summary.
Comment #68.1
xjmUpdated issue summary.
Comment #68.2
xjmUpdated issue summary.
Comment #69
chx CreditAttribution: chx commentedI liked it previously it just got better since.
Comment #70
catchLooks great. Agree with handling the duplicates in a follow-up (I think there's another issue dealing with that as well somewhere, but all the more reason not to bundle it in here). Committed/pushed to 8.x, will need a re-roll for 7.x.
Comment #71
makara CreditAttribution: makara commentedA simple re-roll. Nothing changed.
Comment #72
makara CreditAttribution: makara commentedComment #73
xjmThe reroll looks good.
Comment #74
webchickHm. This looks a bit scary for D7. :) But it's a legitimate bug fix, is passing existing tests without changes, and adds a bunch of new ones to prove that things are functioning properly. I like that the code got cleaned up here pretty significantly in the process of fixing this bug. Sounds like the duplicates issue is being handled over at #610076: Add a primary key to the {taxonomy_index} table and prevent duplicates being inserted (which I'll go un-postpone now).
Committed and pushed #71 to 7.x. Thanks!
Comment #75
fietserwinDoes this patch prevent duplicates being inserted (in multi lingual sites as well) or should we still check for that? The answer can be used to create a patch for the now unblocked #610076: Add a primary key to the {taxonomy_index} table and prevent duplicates being inserted.
Comment #76
TripleEmcoder CreditAttribution: TripleEmcoder commentedWe have applied this patch in our deployment and found that it fails with i18_taxonomy. The following would fix it.
This is needed because the latest i18n_taxonomy replaces the field type with hook_field_storage_details_alter (see #1281704: Taxonomy terms are not translated in content create/edit form) to hijack taxonomy term rendering. I know this is beyond the scope of core, but I would like to ask for you guidance on how this should be fixed?
Comment #77
webchickWell, we definitely can't add code like that to Drupal core. Re-opening and marking CNW to see if the authors of this patch have any ideas. Else, I'll need to roll back for D7.
Comment #78
catchi18n would have to add its own version of taxonomy_build_node_index().
I can't off the top of my head think of a bc way to fix this, but it's late here and been a long day so there might well be something we could do that's a bit more agnostic.
If there isn't one, then I think we need to live with the small bc break here since there are at least a couple of major bugs linked to this now.
Comment #79
TripleEmcoder CreditAttribution: TripleEmcoder commentedWell, it wouldn't be the first time i18n would be copy-pasting a core function to change a small aspect like above. By the way, do you guys know if there are any plans for moving i18n into core?
Comment #80
xjm@memfis: Check out the Drupal 8 multilingual initiative: #1260534: META-META: Make language support awesome in Drupal 8.
Comment #81
makara CreditAttribution: makara commentedI agree, and possibly another index table to index language. We cannot just support every module that implements
hook_field_storage_details_alter
, because we won't know if it wants the taxonomy index. But from what I see, I18N had to do it because the core lacks some flexibilities. I think we should point out what it lacks and fix it, not doing hijack for another hijack.Comment #82
fietserwinIs there already an issue in i18n?
Comment #83
catchThis has been fixed in D7 for four months, so I'm going to move it back there - we definitely won't revert it any more.
Comment #84
xjmI believe it has also been resolved for the affected contrib modules, so we are good.
Comment #85.0
(not verified) CreditAttribution: commentedUpdated issue summary.