Support for Drupal 7 is ending on 5 January 2025—it’s time to migrate to Drupal 10! Learn about the many benefits of Drupal 10 and find migration tools in our resource center.
The way taxonomy terms and vocabularies are cached in functions like taxonomy_get_term()
allows later functions to alter the cached data. This is because PHP (at least in version 5) doesn't copy objects, but passes them by reference.
So, the solution to the problem is to clone the objects in cache before returning them.
example (note the last line: drupal_clone()):
function taxonomy_get_term($tid) {
static $terms = array();
if (!isset($terms[$tid])) {
$terms[$tid] = db_fetch_object(db_query('SELECT * FROM {term_data} WHERE tid = %d', $tid));
}
return drupal_clone($terms[$tid]);
}
Comment | File | Size | Author |
---|---|---|---|
#25 | 154859-nocache.diff.txt | 4.15 KB | pwolanin |
#21 | drupal_clone.patch | 1.99 KB | fago |
#3 | taxonomy_42.patch | 1.08 KB | Wim Leers |
Comments
Comment #1
drummComment #2
Wim LeersThis problem still exists in D6.
Comment #3
Wim LeersI've quickly gone through the entire taxonomy.module and found 2 more places where we're returning cached objects (thus effectively returning references to these objects, thus allowing for corruption).
Note that performance implications might have to be tested, I have no idea how heavy drupal_clone() is.
Comment #4
Wim LeersBump! Marking this as critical because it can potentially cause very weird errors.
Comment #5
Dries CreditAttribution: Dries commentedI wonder whether the taxonomy module needs to clone these or not. Throughout Drupal, we cache a lot of object, and I don't think we ever clone them.
It's rare that a consumer will alter a cached object. If they do, it's probably a good idea to have _them_ clone the object. It should be rare, not?
Comment #6
Wim LeersI agree that it should be rare. And I also know we're cloning almost nowherein core. But many people have been running Drupal on servers with PHP 4, while this problem only occurs with PHP 5. When Drupal 7 will be PHP 5 only, it's almost certain that more of these weird "bugs" will appear.
IMVHO, it's better to have the solution in core.
A quick google search gave me these results:
-article by Steven: http://acko.net/node/54
-issues: http://drupal.org/node/10490, http://drupal.org/node/63239, http://drupal.org/node/57884
Comment #7
moonray CreditAttribution: moonray commentedPart of the problem is that somewhere down the road, you don't realize you have to clone the object before using it. If you get it directly from the DB, it's a fresh object anyway, so why bother, right?
Except, things are inconsistent. Sometimes objects are cached, sometimes they come straight from the DB.
When you're working with an API (i.e. getting taxonomy terms), you don't expect to be altering the cached version of objects (IMO).
Comment #8
Gábor HojtsyI see the problem with the inconsistent object behavior. Some people (http://drupal4hu.com/node/52 - although this had lots of comments before chx moved the site) say we should solve this by using arrays and that would give us lots of array functionality which is nonexistent for objects, and we get the on-demand cloning (AKA lazy copy) implemented by PHP.
Anyway, we can approach this problem by fitting the results with cloning, or rethinking where we use objects. The latter is obviously not a Drupal 6 matter, the former could be.
Comment #9
chx CreditAttribution: chx commentedThis would need a much more through patching -- why just taxonomy -- and I do not think that is a 6.x matter. Goba, I never throw away content, if you re-read my blogpost, the comments are summed up in there.
Comment #10
Wim LeersOf course you're right chx. But you have to start somewhere and get feedback on that ;) And yes, we're close to a beta for D6, so it's definitely something for D7.
Comment #11
pwolanin CreditAttribution: pwolanin commentedbump - for 7.x and or possible backport?
Comment #12
Anonymous (not verified) CreditAttribution: Anonymous commentedsubscribing
Comment #13
catchLooks like this one got lost, moving out of the taxonomy queue since it's not specific to there.
Comment #14
neochief CreditAttribution: neochief commentedHere's a related issue #359746: node_view() modifies the node object stored in the menu (candidate for duplicate)
Comment #15
catchDuplicate #424602: Unsaved node changes are reflected on subsequent calls to node_load()
Comment #16
Damien Tournoud CreditAttribution: Damien Tournoud commentedObviously cloning everything before returning is not practical (as it essentially double the memory usage), and switching all our data structures to arrays seems to go in the wrong direction. It think this is essentially a documentation issue.
Most of the objects used throughout the process of a request are shared between all the modules (typical example: objects returned by menu_get_object()). We simply need to document that. I'm with Dries on that, it's the caller responsibility to do the cloning if it needs to.
Comment #17
Damien Tournoud CreditAttribution: Damien Tournoud commentedErm.
Comment #18
AmrMostafa CreditAttribution: AmrMostafa commentedI just duplicated this here #453406: noad_load_multiple() should return cloned objects?, thanks for pointing me here Damien.
It doesn't seem right that it's the "caller responsibility". I think that everyone agrees that conceptually it's not the right thing to do, but we will do it because we are also practical. On that basis, I would say that the most practical argument that holds against this solution is that it means Drupal core will have to do that too, things like the node building which change the node object[1] will have to do that for node objects, and it's one of the largest objects in drupal. I wonder where else core would need to do that, and consequently, if in that case the "caller responsibility" would still maintain its practicality.
[1] and this currently hits Poll module for example, check #3 at #372650: Poll node is not displayed when poll block is displayed as well and see how ugly the current possible fix is.
Comment #19
AmrMostafa CreditAttribution: AmrMostafa commentedComment #20
sunsubscribing
Comment #21
fagoWell I ran over this issue and also thought cloning is the right way to do it, so I just did a patch for it -> attached.
>When you're working with an API (i.e. getting taxonomy terms), you don't expect to be altering the cached version of objects (IMO).
While I see the possible performance increase I must agree with that: When using a _load() function you expect a fresh object fetched from database. This is the way it always worked until d6.
Thus when we change that for d7 due to performance, I think we need numbers to see how big the differences are really.
Comment #22
catchThis needs memory profiling, also I disagree that it's critical, so downgrading.
Comment #23
pwolanin CreditAttribution: pwolanin commentedI agree this is kind of important - as fago says, when you call node_load() you expect to be able to mess with the returned object.
Comment #24
pwolanin CreditAttribution: pwolanin commentedThinking about this - Drupal 6 allowed you to specify in node_load() that the node should not be in the static cache.
Looks like that's missing in Drupal 7 - that alone may be a more serious memory issue. How can I load 1000 nodes in batches and not have them stay in memory?
Comment #25
pwolanin CreditAttribution: pwolanin commentedHere's a quick start adding a cachable flag.
Drupal 6 in node_load derived this in a opaque way from the revision argument, but to me that seems like a bad idea. Still need to consider whether to force this to FALSE if revisions are being loaded.
Comment #26
sun.core CreditAttribution: sun.core commentedNot critical. Please read and understand Priority levels of Issues, thanks.
Comment #27
pwolanin CreditAttribution: pwolanin commented@sun - if I can't update my modules to D7 I think it's critical.
Comment #28
catchFrom entity.inc
So no regression from D6. I disagree with the API change here, and the batch issue should be solved by #375494: Restrict the number of nodes held in the node_load() static cache.
Comment #29
pwolanin CreditAttribution: pwolanin commentedIf this issue is "normal" or "won't fix" than http://drupal.org/node/375494 needs to be marked critical
@catch - is there some way I can pass a dummy revision condition in to prevent caching?
Comment #30
catch@pwolanin, no, you'd need to actually pass the revision ID in. I see what you mean now, in D6 you could pass revision as FALSE and skip caching, without it querying for a revision ID. That looks like a side-effect to me which is not documented in the phpdoc, so a bit hard to mark as a critical regression.
I still dislike either an automatic clone, or an extra param on node_load(). The caller should clone if it needs to do something freaky, and can also reset the static if it needs to.
Happy for the LRU cache issue to be marked critical but I expect it'll get marked back down again in the "reduce the critical issues by reclassifying" sprint which is currently occurring
Comment #31
David_Rothstein CreditAttribution: David_Rothstein commentedSubscribing - the current behavior is just totally crazy.
Hm, I don't really understand what the patch in #25 has to do with this issue... Maybe it lets you optionally bypass the caching, but it doesn't seem to do anything to either fix or document the bug in the normal use case.
So it sounds like the current options are either #21 or better documentation?
Comment #32
catchYes it's either #21 or better docs. IMO #21 is won't fix unless someone can prove that it won't make our significant memory problems even worse.
Until a few months ago, we used to do three distinct node_load() on every single comment rendered in a thread. i.e betwen 30 and 900 node_load() depending on comment settings. If we went with #21, that would translate to 900 distinct copies stored in memory. That particular example has been fixed, but just a couple of mistakes in contrib and you have major memory leaks.
Comment #33
sunThe issue title already suggests better docs. So what's wrong with better docs?
Comment #34
catchWe can document it in entity_load() - would prefer there than adding the same line to 4-5 different functions.
Comment #35
nevergone CreditAttribution: nevergone commentedsubscribe
http://drupal.org/node/714248#comment-2705710
Comment #36
mikl CreditAttribution: mikl commentedWell, objects are not passed by reference, in fact they are not passed at all. The variable contains a handle for the object, not the object itself.
This behavior is PHP5 OOP 101, and is actually documented, but I suppose it would be a good thing to hammer it home…
Comment #37
pwolanin CreditAttribution: pwolanin commented@catch - I don't see why you say #21 will cause greater memory issues. If an object goes out of scope, the memory it occupies should be recovered.
Id I call X_load, I expect to be able to arbitrarily manipulate the returned object. Omitting the clone seems like a very unexpected change between D6 and D7.
Comment #38
Damien Tournoud CreditAttribution: Damien Tournoud commentedI still believe it's best not to clone the returned object. It sounds dumb that some process in the requests modify one object, and all other process still see the outdated version.
Comment #39
pwolanin CreditAttribution: pwolanin commented@Damien - that's a totally different problem I think. If I call node_save() the static cache should be expired. If I call node_load() I want to be sure I am getting the canonical version of the object, not something some contrib module messed with.
Comment #40
catchpwolanin: see http://drupal.org/cvs?commit=333880 - PHP garbage collection is not to be trusted.
Also this would only be a change between D6 and D7 if you were using PHP4... The only difference in D7 is that we don't explicitly pass objects by reference any more because there's no need to.
Comment #41
pwolanin CreditAttribution: pwolanin commented@catch - Drupal 6 *always* clones, e.g.: http://api.drupal.org/api/function/node_load/6
So Drupal 7 is a change (or regression) from the Drupal 6 behavior.
Comment #42
catchNot in the case of these two:
http://api.drupal.org/api/function/taxonomy_get_term
http://api.drupal.org/api/function/user_load
Comment #43
pwolanin CreditAttribution: pwolanin commentedWell Drupal 6 use_load() doesn't cache, so it's far worse than calling clone().
Comment #44
catchSo my main issues here are:
1. We don't know what the memory implications are. Given that devel issue (memory leak just from assigning an array), I'd rather err on the side of caution.
2. Where, in core or contrib, does a module call $node = node_load($nid); $node->foo = 'bar'; unless it's going to save the node again? I don't remember ever seeing this anywhere. Unless someone can point to this, and its actually causing an issue, and there's some reason that module can't do the cloning itself, I don't see why we need to support it explicitly in core.
Comment #45
fagoI agree that having one shared object instance makes totally sense.
@catch:
ad) 2: I agree that usually that should be fine. However a problem I ran into is that this change makes it rather hard to determine the changes a save() operations is about to save as it's impossible to get the clean, unchanged object from database. Also see #651240: Allow modules to react to changes to an entity.
Comment #46
pwolanin CreditAttribution: pwolanin commentedA indeed - I actually mention this in my "cover developer" slides I threw together - our API doesn't have a good way to let you know what's changed when a node is saved. So - for some cases like seeing if a node is getting published, I guess I have to query the node table to compare?
For non-SQL storage it becomes even harder.
Comment #47
David_Rothstein CreditAttribution: David_Rothstein commentedI was under the impression that many of the issues which were marked duplicate of this one or otherwise linked to above are (or at least at one point were) examples of this?
More generally, any code along the lines of this would certainly look reasonable to most people but would lead to an issue, right?
(I think the core entities do protect against that kind of thing in various ways, though, but not so sure that others will.)
It's true that any of the above can be fixed by having the calling code do the cloning, and that would probably make more sense if Drupal were fully object-oriented or something, but we tend to use arrays and objects fully interchangeably - so unfortunately I don't think this requirement is going to make a lot of sense for most people, even if it's well-documented :(
Comment #48
catch@pwolanin: file_field_update() is checking changes to field values: see #644338: file_field_update() triggers a full entity load durung entity_save() and #733756: image_field_insert() makes no sense and causes a memory leak with devel_generate(). That works regardless of field storage module. It's not good as an API but it does work, preferable to getting $form_state['values'] IMO.
@David: that poll issue highlights some of my memory concerns about this - at what point are the $node objects from poll_view() and poll_block_view() etc. taken out of scope? Surely they get passed around until we get to templates, so effectively for the entire request. Would be good if someone could roll a patch so we can actually see what difference this makes.
Comment #49
fago@file_field_update(): This approach of detecting changes is broken now as the updated and the not updated object might be the same as the objects are shared now. It's working for node forms as the updated object is generated out of the form values - however if I do a form where the updated object is loaded through entity_load() and just gets the form values incorporated, then it will fail. Also in case of simple programmatic updates it would fail as I mentioned in http://drupal.org/node/651240#comment-2721004.
Comment #50
pwolanin CreditAttribution: pwolanin commentedIs the patch in #21 not correct?
Comment #51
pwolanin CreditAttribution: pwolanin commentedI think this is more than normal - this affects e.g. every module that calls node_load()
Comment #52
sun.core CreditAttribution: sun.core commentedTrying to make any sense of the major queue. All suggested API changes in this issue are out of scope for D7 in the meantime, so the only possible todo is to document the situation (as the issue title suggests). Thus, demoting to normal.
Comment #53
pwolanin CreditAttribution: pwolanin commentedI'm still considering this a possible regression from D6. Moving up to critical since we need more opinions on whether this needs to be changed. There will be tons of ugly bugs in D7 contrib .
Comment #54
chx CreditAttribution: chx commentedhttp://drupal.org/node/363802#comment-1219547
I was not allowed to help people with adding comments to DBTNG methods based mostly on this argument. This issue is hardly different.
(Also note I have tried to help understanding how PHP5 objects work in http://drupal4hu.com/node/224 )
Comment #55
fagoI agree with #52 that the only way right now is to document it. I don't think we that there will be lots of bugs, but of course we need to document it.
ad #54:
Yes, but this issue is different. Indeed PHP5 always worked that way, but with d6 any object retrieved from e.g. node_load() was cloned in order to prevent this behaviour. But with d7, we stopped doing that - so this is a significant change, which needs documentation.
Comment #56
EvanDonovan CreditAttribution: EvanDonovan commented@fago: In what functions would this need to be documented? Functions that load nodes, taxonomy terms, users...what else?
Comment #57
catchI think this should go in the 6-7 update guide rather than dozens of comments littered around core.
Comment #58
jpstrikesback CreditAttribution: jpstrikesback commented+1 to #57 - Document in the 6 - 7 update guide
Comment #59
chx CreditAttribution: chx commentedComment #60
EvanDonovan CreditAttribution: EvanDonovan commentedAgreed with catch in #57.
Comment #61
Crell CreditAttribution: Crell commentedDocumenting how the PHP language works is not critical. Put a note in the upgrade guide per #57 and call it a day. There's nothing else needed here.
Comment #62
David_Rothstein CreditAttribution: David_Rothstein commentedPutting a note in the upgrade guide definitely makes sense, but doesn't help people who are writing new code rather than upgrading it.
We need to document it in the individual functions also. The fact that this code:
affects all future callers of node_load() in the same page request is extremely nonintuitive. Especially given what the word "load" typically means.
And especially given that other similar functions, for example comment_load(), do NOT show the same behavior. For example, this code:
has no effect on future callers of comment_load().
Given that, it is certainly a pure Drupalism - so if we can't fix it, it should be documented for each function that behaves that way. There is no simple way for anyone to figure out which functions are affected otherwise.
Comment #63
jpstrikesback CreditAttribution: jpstrikesback commentedSorry to ask, but is it per entity controller that the cloning is/isn't done(in core-currently)?
Comment #64
pwolanin CreditAttribution: pwolanin commentedIf different X_load() functions variously return a reference to the cached object or a clone that's a serious wtf for anyone using them. That argues that the behavior shoudl be indicated as part of the @return docs for each such function.
Comment #65
catch@David: @pwolanin, @jpstrikesback no that's not what they do, this issue is getting very frustrating and I wish people would actually look at the code or write the documentation instead of further muddying things.
So comment_load() called twice builds the object completely from scratch, that's got absolutely nothing to do with cloning or not cloning it.
Comment #66
catchAlso, I don't think this is remotely a wtf because you should never be changing the properties of these objects unless you're about to call node_save() or using the node api which provides ample oppurtunities to do so.
Comment #67
David_Rothstein CreditAttribution: David_Rothstein commented@catch, I've already looked at the code and know exactly what it does. What did I say in #62 that is incorrect?
The point is that unless someone has looked deeply into this, they won't know or understand why some functions behave this way and others don't. (Just look at all the duplicate issues of this one.) There are in general many legitimate reasons to take the return value of a function, put it in a variable, and then modify that variable for internal use only... except for these functions, where doing that doesn't work.
I'm happy to write documentation for this issue, but I don't understand the resistance to having it be in the API docs. I think we should write something for the API docs and get it committed... In any case, I think the proposal of documenting this in some depth for entity_load(), and then having a much shorter warning in node_load(), user_load(), etc that points to the entity_load() docs for more detail, is probably a good one.
Comment #68
catchThe resistance to having this in the API docs, at least for me, is because once we start documenting what is standard PHP5 behaviour in core, not just in one place where there's a phpwtf, but in quite normal functions that everyone uses and standard behaviour in regards to objects, that sets a precedent for duplicating the php documentation itself in the rest of Drupal.
For example there have been several issues that wanted to change all the hook_foo_bar($object) documentation to explain why the object is neither explicitly passed by reference, and/or why there is no return value. This was a change from Drupal 6, it's also again completely standard PHP5 behaviour in regards to objects.
There are many, many of those functions, I'm sure if we can looked we could find other places where this behaviour could be documented too. All this does is add additional text to the actual API documentation, text that doesn't help to explain the purpose of those functions, what they should be used for etc. - i.e. it's not actually about documenting the API any more.
node_load() and user_load() etc. already have an @see entity_load(). If this really must be dealt with in core, can we rely on that for the reference and only add this to entity_load()'s documentation? At least it's in one central place then instead of spilling out all over core.
What annoyed me about #62 was "Given that, it is certainly a pure Drupalism", and the fact it caused so much confusion in subsequent comments. Should we document drupal_load(), path_load(), poll_load() and all the $entity_load() functions to point out that some are wrappers around entity_load() and some aren't too? And that none of the $entity_load() functions, nor entity_load() itself are implementations of hook_load()? There are so, so many ridiculous wtfs in Drupal and acting on objects by resource is the least of them.
Comment #69
fagoI agree that this should be added to the d6-d7 update guide. Maybe documenting it once at entity_load() too, but that's difficult as it really depends on the controller.
Comment #70
David_Rothstein CreditAttribution: David_Rothstein commented@catch, maybe we were both missing each other's point then. I don't want to document how PHP works; I want to document how Drupal works :) Currently the documentation does not do that. I agree we don't need to go into the technical details of why it behaves the way it does.
For example, I am thinking something along these lines.
For comment_load():
For node_load():
It's more complicated than that due to revisions, but something along those lines. Then entity_load() can explain in a little more detail why some entities behave one way and others behave another and how that is related to the static-caching properties in hook_entity_info(), etc.
Comment #71
jhodgdonI like the idea in #70. I've been bitten by the node_load caching in the past, and I think saying whether something is a reference or a cached copy in the @return of these types of functions is a great idea. What's the down side?
Comment #72
catch#70 isn't too bad in principle, except the controllers are pluggable, so the comment could be loaded from memcache, or you could just hook_entity_alter() it to static cache, and then the phpdoc doesn't match.
Comment #73
jhodgdonAh yes, that's true. Maybe it could say "by default", and then have an @see at the bottom of the function pointing to entity_load (or whatever the function is that documents the situation with a note that entity controllers sometimes use caches, sometimes not, and are pluggable).
Comment #74
David_Rothstein CreditAttribution: David_Rothstein commentedHm, that does make things a bit more confusing to document, doesn't it...
I'm adding this to my list of things to work on after the RC1 release. If someone wants to work on documenting it before then, feel free :)
Comment #75
chx CreditAttribution: chx commentedOf course, it's not a reference. http://drupal4hu.com/node/224
Comment #76
jhodgdonSeems like a good project for the upcoming documentation sprint.
Comment #77
jhodgdonNew tagging scheme.
This issue... It looks like we need some mention in the 6/7 update guide -- see comments around #28 - as this is a change from D6.
It also looks like we need to add some documentation to a few core functions to talk about caching.
Comment #78
mlncn CreditAttribution: mlncn commentedI'm sure there's many places that "passed by reference" language from Drupal 6 has to be revised, but over at #1003958: Correct _node_save_revision documentation to not refer to passing $node by reference, since it does not is a patch for one place, and i think it makes clear what Drupal is doing without being incorrect about what PHP is doing!
Comment #79
jhodgdonI am not in favor of the wording in that patch, and in any case, this issue is not really about documenting that objects are passed by reference in general, but specifically about functions that return cached objects that you shouldn't change except carefully.
Comment #80
hansfn CreditAttribution: hansfn commentedOK, I have read this thread and one of the many duplicates - #453406: noad_load_multiple() should return cloned objects? - where Damien Tournoud says "It's the caller responsibility to clone the object if it needs to be modified". But what if I can't see that it needs to be modified? I'm probably blind (and I'm definitely a Drupal noob), but here is an example from the diff module - see #997464: #builder_function:
Some times this gives no changes because node_load($form_state['values']['nid']) returns the previewed node. (If you select to view changes again, it works.) The following code seems to work, but I still don't see why it's necessary.
To me, such examples indicate that node_load always should create a clone, but that discussion is dead, right?
Comment #81
Steven Jones CreditAttribution: Steven Jones commentedSubscribe.
Comment #82
Steven Jones CreditAttribution: Steven Jones commentedSo I was caught out by this in a slightly strange use case: If during the process of a node_view a field does another node_view, on the same node, then you get a PHP fatal error. This is because node_view changes the contents of the node, and doesn't do a clone, because 99.9% of the time that is fine.
To set this situation up, add a node reference field to your node and reference yourself.
This needs documenting somewhere, that any node_load effectively returns a reference to a singleton and you can break other people's code very easily, or you can make core break very easily if you don't clone the node.
Comment #83
jhodgdonI added a note about this to the update guide:
http://drupal.org/update/modules/6/7#load_cache
Also changed the issue title, as it was confusing me as it was.
As discussed extensively above, we probably can't do anything else definite in the doc, so I'm marking this fixed. If you disagree, please mark back to active and describe what else should be documented.