The way taxonomy terms and vocabularies are cached in functions like taxonomy_get_term() allows later functions to alter the cached data. This is because PHP (at least in version 5) doesn't copy objects, but passes them by reference.

So, the solution to the problem is to clone the objects in cache before returning them.

example (note the last line: drupal_clone()):

function taxonomy_get_term($tid) {
  static $terms = array();

  if (!isset($terms[$tid])) {
    $terms[$tid] = db_fetch_object(db_query('SELECT * FROM {term_data} WHERE tid = %d', $tid));
  }

  return drupal_clone($terms[$tid]);
}
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

drumm’s picture

Priority: Critical » Normal
Status: Needs work » Active
Wim Leers’s picture

Version: 5.x-dev » 6.x-dev

This problem still exists in D6.

Wim Leers’s picture

Title: Caching of taxonomy terms allows corruption » Caching of taxonomy terms and vocabularies allows corruption
Assigned: Unassigned » Wim Leers
Status: Active » Needs review
FileSize
1.08 KB

I've quickly gone through the entire taxonomy.module and found 2 more places where we're returning cached objects (thus effectively returning references to these objects, thus allowing for corruption).

Note that performance implications might have to be tested, I have no idea how heavy drupal_clone() is.

Wim Leers’s picture

Priority: Normal » Critical

Bump! Marking this as critical because it can potentially cause very weird errors.

Dries’s picture

I wonder whether the taxonomy module needs to clone these or not. Throughout Drupal, we cache a lot of object, and I don't think we ever clone them.

It's rare that a consumer will alter a cached object. If they do, it's probably a good idea to have _them_ clone the object. It should be rare, not?

Wim Leers’s picture

I agree that it should be rare. And I also know we're cloning almost nowherein core. But many people have been running Drupal on servers with PHP 4, while this problem only occurs with PHP 5. When Drupal 7 will be PHP 5 only, it's almost certain that more of these weird "bugs" will appear.
IMVHO, it's better to have the solution in core.

A quick google search gave me these results:
-article by Steven: http://acko.net/node/54
-issues: http://drupal.org/node/10490, http://drupal.org/node/63239, http://drupal.org/node/57884

moonray’s picture

Part of the problem is that somewhere down the road, you don't realize you have to clone the object before using it. If you get it directly from the DB, it's a fresh object anyway, so why bother, right?
Except, things are inconsistent. Sometimes objects are cached, sometimes they come straight from the DB.

When you're working with an API (i.e. getting taxonomy terms), you don't expect to be altering the cached version of objects (IMO).

Gábor Hojtsy’s picture

I see the problem with the inconsistent object behavior. Some people (http://drupal4hu.com/node/52 - although this had lots of comments before chx moved the site) say we should solve this by using arrays and that would give us lots of array functionality which is nonexistent for objects, and we get the on-demand cloning (AKA lazy copy) implemented by PHP.

Anyway, we can approach this problem by fitting the results with cloning, or rethinking where we use objects. The latter is obviously not a Drupal 6 matter, the former could be.

chx’s picture

Title: Caching of taxonomy terms and vocabularies allows corruption » Clone stored objects
Version: 6.x-dev » 7.x-dev
Category: bug » task
Status: Needs review » Needs work

This would need a much more through patching -- why just taxonomy -- and I do not think that is a 6.x matter. Goba, I never throw away content, if you re-read my blogpost, the comments are summed up in there.

Wim Leers’s picture

Of course you're right chx. But you have to start somewhere and get feedback on that ;) And yes, we're close to a beta for D6, so it's definitely something for D7.

pwolanin’s picture

bump - for 7.x and or possible backport?

Anonymous’s picture

subscribing

catch’s picture

Component: taxonomy.module » base system

Looks like this one got lost, moving out of the taxonomy queue since it's not specific to there.

neochief’s picture

Here's a related issue #359746: node_view() modifies the node object stored in the menu (candidate for duplicate)

catch’s picture

Damien Tournoud’s picture

Title: Clone stored objects » Documents that objects are shared between modules

Obviously cloning everything before returning is not practical (as it essentially double the memory usage), and switching all our data structures to arrays seems to go in the wrong direction. It think this is essentially a documentation issue.

Most of the objects used throughout the process of a request are shared between all the modules (typical example: objects returned by menu_get_object()). We simply need to document that. I'm with Dries on that, it's the caller responsibility to do the cloning if it needs to.

Damien Tournoud’s picture

Title: Documents that objects are shared between modules » Document that objects are shared between modules

Erm.

AmrMostafa’s picture

I just duplicated this here #453406: noad_load_multiple() should return cloned objects?, thanks for pointing me here Damien.

It doesn't seem right that it's the "caller responsibility". I think that everyone agrees that conceptually it's not the right thing to do, but we will do it because we are also practical. On that basis, I would say that the most practical argument that holds against this solution is that it means Drupal core will have to do that too, things like the node building which change the node object[1] will have to do that for node objects, and it's one of the largest objects in drupal. I wonder where else core would need to do that, and consequently, if in that case the "caller responsibility" would still maintain its practicality.

[1] and this currently hits Poll module for example, check #3 at #372650: Poll node is not displayed when poll block is displayed as well and see how ugly the current possible fix is.

AmrMostafa’s picture

Status: Needs work » Active
sun’s picture

Assigned: Wim Leers » Unassigned
fago’s picture

Status: Active » Needs review
FileSize
1.99 KB

Well I ran over this issue and also thought cloning is the right way to do it, so I just did a patch for it -> attached.

>When you're working with an API (i.e. getting taxonomy terms), you don't expect to be altering the cached version of objects (IMO).
While I see the possible performance increase I must agree with that: When using a _load() function you expect a fresh object fetched from database. This is the way it always worked until d6.

Thus when we change that for d7 due to performance, I think we need numbers to see how big the differences are really.

catch’s picture

Priority: Critical » Normal

This needs memory profiling, also I disagree that it's critical, so downgrading.

pwolanin’s picture

I agree this is kind of important - as fago says, when you call node_load() you expect to be able to mess with the returned object.

pwolanin’s picture

Priority: Normal » Critical

Thinking about this - Drupal 6 allowed you to specify in node_load() that the node should not be in the static cache.

Looks like that's missing in Drupal 7 - that alone may be a more serious memory issue. How can I load 1000 nodes in batches and not have them stay in memory?

pwolanin’s picture

FileSize
4.15 KB

Here's a quick start adding a cachable flag.

Drupal 6 in node_load derived this in a opaque way from the revision argument, but to me that seems like a bad idea. Still need to consider whether to force this to FALSE if revisions are being loaded.

sun.core’s picture

Priority: Critical » Normal

Not critical. Please read and understand Priority levels of Issues, thanks.

pwolanin’s picture

@sun - if I can't update my modules to D7 I think it's critical.

catch’s picture

Status: Needs review » Needs work

From entity.inc

 // Revisions are not statically cached, and require a different query to
    // other conditions, so separate the revision id into its own variable.
    if ($this->revisionKey && isset($this->conditions[$this->revisionKey])) {
      $this->revisionId = $this->conditions[$this->revisionKey];
      unset($this->conditions[$this->revisionKey]);
    }

So no regression from D6. I disagree with the API change here, and the batch issue should be solved by #375494: Restrict the number of nodes held in the node_load() static cache.

pwolanin’s picture

If this issue is "normal" or "won't fix" than http://drupal.org/node/375494 needs to be marked critical

@catch - is there some way I can pass a dummy revision condition in to prevent caching?

catch’s picture

@pwolanin, no, you'd need to actually pass the revision ID in. I see what you mean now, in D6 you could pass revision as FALSE and skip caching, without it querying for a revision ID. That looks like a side-effect to me which is not documented in the phpdoc, so a bit hard to mark as a critical regression.

I still dislike either an automatic clone, or an extra param on node_load(). The caller should clone if it needs to do something freaky, and can also reset the static if it needs to.

Happy for the LRU cache issue to be marked critical but I expect it'll get marked back down again in the "reduce the critical issues by reclassifying" sprint which is currently occurring

David_Rothstein’s picture

Subscribing - the current behavior is just totally crazy.

Hm, I don't really understand what the patch in #25 has to do with this issue... Maybe it lets you optionally bypass the caching, but it doesn't seem to do anything to either fix or document the bug in the normal use case.

So it sounds like the current options are either #21 or better documentation?

catch’s picture

Yes it's either #21 or better docs. IMO #21 is won't fix unless someone can prove that it won't make our significant memory problems even worse.

Until a few months ago, we used to do three distinct node_load() on every single comment rendered in a thread. i.e betwen 30 and 900 node_load() depending on comment settings. If we went with #21, that would translate to 900 distinct copies stored in memory. That particular example has been fixed, but just a couple of mistakes in contrib and you have major memory leaks.

sun’s picture

The issue title already suggests better docs. So what's wrong with better docs?

catch’s picture

We can document it in entity_load() - would prefer there than adding the same line to 4-5 different functions.

nevergone’s picture

mikl’s picture

Well, objects are not passed by reference, in fact they are not passed at all. The variable contains a handle for the object, not the object itself.

This behavior is PHP5 OOP 101, and is actually documented, but I suppose it would be a good thing to hammer it home…

pwolanin’s picture

@catch - I don't see why you say #21 will cause greater memory issues. If an object goes out of scope, the memory it occupies should be recovered.

Id I call X_load, I expect to be able to arbitrarily manipulate the returned object. Omitting the clone seems like a very unexpected change between D6 and D7.

Damien Tournoud’s picture

I still believe it's best not to clone the returned object. It sounds dumb that some process in the requests modify one object, and all other process still see the outdated version.

pwolanin’s picture

@Damien - that's a totally different problem I think. If I call node_save() the static cache should be expired. If I call node_load() I want to be sure I am getting the canonical version of the object, not something some contrib module messed with.

catch’s picture

pwolanin: see http://drupal.org/cvs?commit=333880 - PHP garbage collection is not to be trusted.

Also this would only be a change between D6 and D7 if you were using PHP4... The only difference in D7 is that we don't explicitly pass objects by reference any more because there's no need to.

pwolanin’s picture

@catch - Drupal 6 *always* clones, e.g.: http://api.drupal.org/api/function/node_load/6

    if ($cachable) {
      // Is the node statically cached?
      if (isset($nodes[$param])) {
        return is_object($nodes[$param]) ? drupal_clone($nodes[$param]) : $nodes[$param];
      }
    }

So Drupal 7 is a change (or regression) from the Drupal 6 behavior.

catch’s picture

pwolanin’s picture

Well Drupal 6 use_load() doesn't cache, so it's far worse than calling clone().

catch’s picture

So my main issues here are:

1. We don't know what the memory implications are. Given that devel issue (memory leak just from assigning an array), I'd rather err on the side of caution.

2. Where, in core or contrib, does a module call $node = node_load($nid); $node->foo = 'bar'; unless it's going to save the node again? I don't remember ever seeing this anywhere. Unless someone can point to this, and its actually causing an issue, and there's some reason that module can't do the cloning itself, I don't see why we need to support it explicitly in core.

fago’s picture

I agree that having one shared object instance makes totally sense.

@catch:
ad) 2: I agree that usually that should be fine. However a problem I ran into is that this change makes it rather hard to determine the changes a save() operations is about to save as it's impossible to get the clean, unchanged object from database. Also see #651240: Allow modules to react to changes to an entity.

pwolanin’s picture

A indeed - I actually mention this in my "cover developer" slides I threw together - our API doesn't have a good way to let you know what's changed when a node is saved. So - for some cases like seeing if a node is getting published, I guess I have to query the node table to compare?

For non-SQL storage it becomes even harder.

David_Rothstein’s picture

2. Where, in core or contrib, does a module call $node = node_load($nid); $node->foo = 'bar'; unless it's going to save the node again? I don't remember ever seeing this anywhere. Unless someone can point to this, and its actually causing an issue, and there's some reason that module can't do the cloning itself, I don't see why we need to support it explicitly in core.

I was under the impression that many of the issues which were marked duplicate of this one or otherwise linked to above are (or at least at one point were) examples of this?

  1. #372650: Poll node is not displayed when poll block is displayed as well
  2. #359746: node_view() modifies the node object stored in the menu
  3. There's also #221081: Entity cache out of sync and inconsistent when saving/deleting entities, in which the current (shared object) behavior masks - but does not actually fix - the bug there.

More generally, any code along the lines of this would certainly look reasonable to most people but would lead to an issue, right?

$entity = my_entity_load($entity_id);
$html = drupal_render($entity->content);

(I think the core entities do protect against that kind of thing in various ways, though, but not so sure that others will.)

It's true that any of the above can be fixed by having the calling code do the cloning, and that would probably make more sense if Drupal were fully object-oriented or something, but we tend to use arrays and objects fully interchangeably - so unfortunately I don't think this requirement is going to make a lot of sense for most people, even if it's well-documented :(

catch’s picture

@pwolanin: file_field_update() is checking changes to field values: see #644338: file_field_update() triggers a full entity load durung entity_save() and #733756: image_field_insert() makes no sense and causes a memory leak with devel_generate(). That works regardless of field storage module. It's not good as an API but it does work, preferable to getting $form_state['values'] IMO.

@David: that poll issue highlights some of my memory concerns about this - at what point are the $node objects from poll_view() and poll_block_view() etc. taken out of scope? Surely they get passed around until we get to templates, so effectively for the entire request. Would be good if someone could roll a patch so we can actually see what difference this makes.

fago’s picture

@file_field_update(): This approach of detecting changes is broken now as the updated and the not updated object might be the same as the objects are shared now. It's working for node forms as the updated object is generated out of the form values - however if I do a form where the updated object is loaded through entity_load() and just gets the form values incorporated, then it will fail. Also in case of simple programmatic updates it would fail as I mentioned in http://drupal.org/node/651240#comment-2721004.

pwolanin’s picture

Is the patch in #21 not correct?

pwolanin’s picture

Priority: Normal » Major

I think this is more than normal - this affects e.g. every module that calls node_load()

sun.core’s picture

Priority: Major » Normal

Trying to make any sense of the major queue. All suggested API changes in this issue are out of scope for D7 in the meantime, so the only possible todo is to document the situation (as the issue title suggests). Thus, demoting to normal.

pwolanin’s picture

Priority: Normal » Critical

I'm still considering this a possible regression from D6. Moving up to critical since we need more opinions on whether this needs to be changed. There will be tons of ugly bugs in D7 contrib .

chx’s picture

Status: Needs work » Closed (won't fix)

http://drupal.org/node/363802#comment-1219547

We'd be providing documentation on how the language syntax works everywhere we use it. We don't do that for any other part of PHP

I was not allowed to help people with adding comments to DBTNG methods based mostly on this argument. This issue is hardly different.

(Also note I have tried to help understanding how PHP5 objects work in http://drupal4hu.com/node/224 )

fago’s picture

Status: Closed (won't fix) » Active
Issue tags: +Needs documentation

I agree with #52 that the only way right now is to document it. I don't think we that there will be lots of bugs, but of course we need to document it.

ad #54:
Yes, but this issue is different. Indeed PHP5 always worked that way, but with d6 any object retrieved from e.g. node_load() was cloned in order to prevent this behaviour. But with d7, we stopped doing that - so this is a significant change, which needs documentation.

EvanDonovan’s picture

@fago: In what functions would this need to be documented? Functions that load nodes, taxonomy terms, users...what else?

catch’s picture

I think this should go in the 6-7 update guide rather than dozens of comments littered around core.

jpstrikesback’s picture

+1 to #57 - Document in the 6 - 7 update guide

chx’s picture

Component: base system » documentation
EvanDonovan’s picture

Agreed with catch in #57.

Crell’s picture

Priority: Critical » Normal

Documenting how the PHP language works is not critical. Put a note in the upgrade guide per #57 and call it a day. There's nothing else needed here.

David_Rothstein’s picture

Putting a note in the upgrade guide definitely makes sense, but doesn't help people who are writing new code rather than upgrading it.

We need to document it in the individual functions also. The fact that this code:

$node = node_load($nid);
$node->title = 'Whatever';

affects all future callers of node_load() in the same page request is extremely nonintuitive. Especially given what the word "load" typically means.

And especially given that other similar functions, for example comment_load(), do NOT show the same behavior. For example, this code:

$comment = comment_load($cid);
$comment->subject = 'Whatever';

has no effect on future callers of comment_load().

Given that, it is certainly a pure Drupalism - so if we can't fix it, it should be documented for each function that behaves that way. There is no simple way for anyone to figure out which functions are affected otherwise.

jpstrikesback’s picture

Sorry to ask, but is it per entity controller that the cloning is/isn't done(in core-currently)?

pwolanin’s picture

If different X_load() functions variously return a reference to the cached object or a clone that's a serious wtf for anyone using them. That argues that the behavior shoudl be indicated as part of the @return docs for each such function.

catch’s picture

@David: @pwolanin, @jpstrikesback no that's not what they do, this issue is getting very frustrating and I wish people would actually look at the code or write the documentation instead of further muddying things.

function comment_entity_info() {
 ...
      'static cache' => FALSE,
    ),
  );

So comment_load() called twice builds the object completely from scratch, that's got absolutely nothing to do with cloning or not cloning it.

catch’s picture

Also, I don't think this is remotely a wtf because you should never be changing the properties of these objects unless you're about to call node_save() or using the node api which provides ample oppurtunities to do so.

David_Rothstein’s picture

@catch, I've already looked at the code and know exactly what it does. What did I say in #62 that is incorrect?

The point is that unless someone has looked deeply into this, they won't know or understand why some functions behave this way and others don't. (Just look at all the duplicate issues of this one.) There are in general many legitimate reasons to take the return value of a function, put it in a variable, and then modify that variable for internal use only... except for these functions, where doing that doesn't work.

I'm happy to write documentation for this issue, but I don't understand the resistance to having it be in the API docs. I think we should write something for the API docs and get it committed... In any case, I think the proposal of documenting this in some depth for entity_load(), and then having a much shorter warning in node_load(), user_load(), etc that points to the entity_load() docs for more detail, is probably a good one.

catch’s picture

The resistance to having this in the API docs, at least for me, is because once we start documenting what is standard PHP5 behaviour in core, not just in one place where there's a phpwtf, but in quite normal functions that everyone uses and standard behaviour in regards to objects, that sets a precedent for duplicating the php documentation itself in the rest of Drupal.

For example there have been several issues that wanted to change all the hook_foo_bar($object) documentation to explain why the object is neither explicitly passed by reference, and/or why there is no return value. This was a change from Drupal 6, it's also again completely standard PHP5 behaviour in regards to objects.

There are many, many of those functions, I'm sure if we can looked we could find other places where this behaviour could be documented too. All this does is add additional text to the actual API documentation, text that doesn't help to explain the purpose of those functions, what they should be used for etc. - i.e. it's not actually about documenting the API any more.

node_load() and user_load() etc. already have an @see entity_load(). If this really must be dealt with in core, can we rely on that for the reference and only add this to entity_load()'s documentation? At least it's in one central place then instead of spilling out all over core.

What annoyed me about #62 was "Given that, it is certainly a pure Drupalism", and the fact it caused so much confusion in subsequent comments. Should we document drupal_load(), path_load(), poll_load() and all the $entity_load() functions to point out that some are wrappers around entity_load() and some aren't too? And that none of the $entity_load() functions, nor entity_load() itself are implementations of hook_load()? There are so, so many ridiculous wtfs in Drupal and acting on objects by resource is the least of them.

fago’s picture

I agree that this should be added to the d6-d7 update guide. Maybe documenting it once at entity_load() too, but that's difficult as it really depends on the controller.

David_Rothstein’s picture

@catch, maybe we were both missing each other's point then. I don't want to document how PHP works; I want to document how Drupal works :) Currently the documentation does not do that. I agree we don't need to go into the technical details of why it behaves the way it does.

For example, I am thinking something along these lines.

For comment_load():

  * @return
- *   The comment object.
+ *   A fresh copy of the comment object, loaded from the database.
 

For node_load():

  * @return
- *   A fully-populated node object.
+ *   A reference to the loaded node object.
 

It's more complicated than that due to revisions, but something along those lines. Then entity_load() can explain in a little more detail why some entities behave one way and others behave another and how that is related to the static-caching properties in hook_entity_info(), etc.

jhodgdon’s picture

I like the idea in #70. I've been bitten by the node_load caching in the past, and I think saying whether something is a reference or a cached copy in the @return of these types of functions is a great idea. What's the down side?

catch’s picture

#70 isn't too bad in principle, except the controllers are pluggable, so the comment could be loaded from memcache, or you could just hook_entity_alter() it to static cache, and then the phpdoc doesn't match.

jhodgdon’s picture

Ah yes, that's true. Maybe it could say "by default", and then have an @see at the bottom of the function pointing to entity_load (or whatever the function is that documents the situation with a note that entity controllers sometimes use caches, sometimes not, and are pluggable).

David_Rothstein’s picture

Hm, that does make things a bit more confusing to document, doesn't it...

I'm adding this to my list of things to work on after the RC1 release. If someone wants to work on documenting it before then, feel free :)

chx’s picture

Of course, it's not a reference. http://drupal4hu.com/node/224

jhodgdon’s picture

Seems like a good project for the upcoming documentation sprint.

jhodgdon’s picture

New tagging scheme.

This issue... It looks like we need some mention in the 6/7 update guide -- see comments around #28 - as this is a change from D6.

It also looks like we need to add some documentation to a few core functions to talk about caching.

mlncn’s picture

I'm sure there's many places that "passed by reference" language from Drupal 6 has to be revised, but over at #1003958: Correct _node_save_revision documentation to not refer to passing $node by reference, since it does not is a patch for one place, and i think it makes clear what Drupal is doing without being incorrect about what PHP is doing!

jhodgdon’s picture

I am not in favor of the wording in that patch, and in any case, this issue is not really about documenting that objects are passed by reference in general, but specifically about functions that return cached objects that you shouldn't change except carefully.

hansfn’s picture

OK, I have read this thread and one of the many duplicates - #453406: noad_load_multiple() should return cloned objects? - where Damien Tournoud says "It's the caller responsibility to clone the object if it needs to be modified". But what if I can't see that it needs to be modified? I'm probably blind (and I'm definitely a Drupal noob), but here is an example from the diff module - see #997464: #builder_function:

$node = node_form_submit_build_node($form, $form_state);
$rows = _diff_body_rows(node_load($form_state['values']['nid']), $node);

Some times this gives no changes because node_load($form_state['values']['nid']) returns the previewed node. (If you select to view changes again, it works.) The following code seems to work, but I still don't see why it's necessary.

$old_node = clone node_load($form_state['values']['nid']);
$node = node_form_submit_build_node($form, $form_state);
$rows = _diff_body_rows($old_node, $node);

To me, such examples indicate that node_load always should create a clone, but that discussion is dead, right?

Steven Jones’s picture

Subscribe.

Steven Jones’s picture

So I was caught out by this in a slightly strange use case: If during the process of a node_view a field does another node_view, on the same node, then you get a PHP fatal error. This is because node_view changes the contents of the node, and doesn't do a clone, because 99.9% of the time that is fine.
To set this situation up, add a node reference field to your node and reference yourself.

This needs documenting somewhere, that any node_load effectively returns a reference to a singleton and you can break other people's code very easily, or you can make core break very easily if you don't clone the node.

jhodgdon’s picture

Title: Document that objects are shared between modules » Document that cached entity objects are not necessarily cloned any more
Status: Active » Fixed
Issue tags: -Needs documentation updates

I added a note about this to the update guide:
http://drupal.org/update/modules/6/7#load_cache

Also changed the issue title, as it was confusing me as it was.

As discussed extensively above, we probably can't do anything else definite in the doc, so I'm marking this fixed. If you disagree, please mark back to active and describe what else should be documented.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.