Steps to reproduce:
Say you have two content types, 'type1' and 'type2', and a field 'my_field' with one instance in each content type.
Node 1 is of type 'type1'.
Node 2 is of type 'type2'.
Remove the field instance from type 'type2'.
The table for the field still contains the values for node 2.
If you add the field back to 'type2', node 2 recovers its previous value for 'my_field'.
I'm not sure this 'data-recovery' thing is really a feature: the message on the confirmation form for a field deletion reads "If you have any content left in this field, it will be lost. This action cannot be undone."
Comments
Comment #1
karens commented:
I wonder if this is really the same thing as http://drupal.org/node/89351. In that case, though, it looked like the field value got deleted, but not the value in the content table. Looks like we need to do some investigation of what is happening on deletions...
Comment #2
yched commented:
Yes, I saw your other issue (and experienced it too), but I thought this was unrelated, since "yours" deals with the deletion of a node, whereas this one deals with the deletion of a field instance.
But, yes, we might want to investigate deletions...
I think I saw some strange things on a _type_ deletion too :-) (in HEAD, at least). I did not have the time to isolate this, though.
Comment #3
yched commented:
This issue is embarrassing because the field data that stays in the tables is "dead": unreachable, undeletable. It is doomed to stay there and clutter the table. Plus, external data (files - think imagefield) is not deleted.
This happens on field deletion and on content-type deletion, since both basically consist of field instance deletions.
Comment #4
karens commented:
I've been digging into this and am beginning to see what is happening, but I'm still working out the best way to fix it.
The problem I'm looking at affects shared fields that live in their own tables. When you remove a field instance, none of the values for that content type are removed from the shared table, which ends up with invalid values in it. When you remove the next-to-last instance, the storage type changes from per-field back to per-content-type, *all* values in that shared field table are copied to the surviving per-content-type table, and the shared table is dropped. Now you have lots of invalid values in the surviving content type's table, possibly values that originated in several other content types. If you happen to add the original field back to all the original content types it was in, all those invalid values get moved back to the per-field table and all is well again. If not, well, you have a mess. Clearly we cannot even pretend this is a 'feature'; it is definitely a bug that needs to be fixed.
I'm still trying to figure out if there are similar problems when you delete a field from a per-content-type table, but I don't think so. In that case the unused columns are dropped, which deletes their values.
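To make the failure mode concrete, here is a hypothetical illustration (field, table, and value names are made up, following CCK's usual naming). Suppose 'field_foo' is shared between 'type1' and 'type2' and stored in its own table:

  content_field_foo:
    vid | nid | field_foo_value
      1 |   1 | A     (node 1 is 'type1')
      2 |   2 | B     (node 2 is 'type2')

After the 'type2' instance is removed, the nid 2 row stays but is unreachable. If the 'type1' instance is then removed as well, *both* rows are migrated into the surviving per-content-type table, so the invalid value B ends up in the 'type1' table.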
Comment #5
karens commented:
BTW, this behavior is the same in both HEAD and 4.7, so it has nothing to do with the 5.x changes.
Comment #6
dopry commented:
Here is a patch for 4.7. Initial testing says it may do the job.
Comment #7
yched commented:
We should probably also clear the cached content for the affected nodes.
(but this requires actually retrieving the nids)
I've been wondering whether this is more complex than just deleting the data rows: I think field modules should also be informed of the entries that are to be deleted:
- external resources (imagefield's files...) can be deleted
- I think original CCK design allows for field modules to manage db storage themselves (even if I'm not aware of any contrib field module that actually does that). In that case, the module must have an opportunity to delete the data as well.
I had been playing with some code in content_field_instance_delete() along these lines:
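(The original snippet was not preserved in this thread. A minimal sketch of the idea, with hypothetical details - it assumes CCK's content_database_info() to locate the field's table, and the 'delete' op of the field hook:)

<?php
// Hypothetical sketch: give the field module a chance to react to every
// row that is about to be deleted, so external resources (imagefield's
// files, for instance) can be cleaned up as well.
$db_info = content_database_info($field);
$result = db_query('SELECT DISTINCT nid FROM {'. $db_info['table'] .'}');
while ($row = db_fetch_object($result)) {
  // This is the costly part: every affected node has to be loaded.
  $node = node_load($row->nid);
  module_invoke($field['module'], 'field', 'delete', $node, $field, $node->{$field['field_name']});
}
?>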
The problem is that it can become quite time-consuming, and maybe there's a chance of hitting the PHP execution time limit?
Comment #8
karens commented:
@yched, I think you have a point about allowing modules to react to the deletion; I hadn't thought about that. I'm worried about execution time, though. I have instances where there are thousands of nodes for a content type. There is a routine you can use in install files that allows you to do updates in stages so they don't time out. We would need something like that. I wonder if we can include install.inc and use that multi-part routine somehow (see http://drupal.org/node/51220), or at least dig into the code and borrow the method.
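For reference, the staged-update pattern in question looks roughly like this (a sketch only, assuming Drupal 5's '#finished' convention for update functions; function and variable names are hypothetical):

<?php
// Hypothetical sketch of a multi-pass update: process a slice of rows per
// call and report progress via '#finished' until everything is done.
function mymodule_update_1() {
  $ret = array();
  $limit = 500;
  // Remember how far we got between passes.
  $last_nid = variable_get('mymodule_cleanup_last_nid', 0);
  $result = db_query_range('SELECT nid FROM {node} WHERE nid > %d ORDER BY nid', $last_nid, 0, $limit);
  $count = 0;
  while ($row = db_fetch_object($result)) {
    // ... clean up one node's orphan field data here ...
    $last_nid = $row->nid;
    $count++;
  }
  variable_set('mymodule_cleanup_last_nid', $last_nid);
  // Anything less than 1 tells update.php to call this function again.
  $ret['#finished'] = ($count < $limit) ? 1 : 0.5;
  if ($ret['#finished'] == 1) {
    variable_del('mymodule_cleanup_last_nid');
  }
  return $ret;
}
?>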
Comment #9
yched commented:
I'm worried too :-)
Yes, we can use the asynchronous progress bar; I did that once in a custom module for a project of mine.
It worked quite well, but it required mimicking a good portion of install.php (or .inc, I can't remember).
Core provides progress.js, but does not provide any ready-to-use API to use it for heavy operations. And, as a matter of fact, updates (and now installation in 5.0) are the only parts of core where this is used.
Maybe this could be a good opportunity to build such an API and:
- submit it for Drupal 6
- keep it in CCK or in an external contrib for 4.7 / 5.0?
PS: I originally did this to deal with mass CSV data import - that was pre-Import/Export API. Maybe Jaza did something we could steal from his Import/Export API module?
Comment #10
karens commented:
I don't have time to work on this right now. Does anyone else have time to try to get something like this figured out?
Comment #11
yched commented:
Sure, I will.
Comment #12
yched commented:
"Sure, I will", he said.
Well, took me some time, but I submitted a core patch for 'update.php-style progressive operations' at http://drupal.org/node/127539.
Still very rough...
Comment #13
karens commented:
Doug provided a patch to make sure the field tables are dropped when there are no more instances, at http://drupal.org/node/181144. That is part of the solution for cleaning things up.
I'm working on the D6 port right now which is uncovering another related issue -- what happens when a field module is disabled or uninstalled? In D5, there's no way to clean its data out because we don't have a way to know when that happens. In D6 we're adding a content_notify() function so we'll know when that happens, and we now have a batch update function we can use to clean data out, but there are other issues.
One question is *when* we should remove the data. When you remove a field using the field settings, data stored in the per-content-type tables will be lost because the columns are removed, so that's a given. But we need to do the same thing for data stored in field tables that still contain good data from other content types. (And we need to be sure to drop the field table, too, once it's no longer being used, as Doug points out.) But we don't always warn people when they're doing something that will cause them to lose data, so we need to identify exactly when data is going to be lost, handle it consistently, and warn people appropriately.
And what should happen when you disable or uninstall a module? We should definitely delete the data when it's uninstalled, but we should not remove the data when it's disabled, because sometimes you need to temporarily disable modules for administrative reasons, like when doing upgrades from one version to another, and you don't want to lose your data when that happens. But if you come back later and uninstall the module (which has to have been disabled to even show up on the uninstall list), there's no longer any way to get any information about its database schema or fields so we can clean up after it. So we need to store information about disabled modules in the database so we can get back to it. Or maybe we need to store field and db information in the database for all modules, to make cleanup easier.
What a tangled web...
Comment #16
karens commented:
After lots of research and trying various solutions out, I think the best way to get this working correctly is to store the module name and column info in the field settings table. With that info available in the database, we can use the CRUD functions to alter and delete even fields that are disabled. I reworked the D6 port to do this, and I also added a content_module_delete() function that uses that information to clean out all field settings and data for fields created by a module that is being uninstalled.
This involved a lot of changes -- none of the database manipulation functions can use module_invoke() to get column info, since that isn't reliable, so many functions needed a rewrite to ensure they were getting their info directly from what is stored in the database instead.
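A rough sketch of the shape of such a function (hypothetical; the actual code in the D6 port may differ, but CCK's settings tables {content_node_field} and {content_node_field_instance} are assumed to carry the stored module info):

<?php
// Hypothetical sketch: when a field module is uninstalled, find every field
// it created from the stored settings (module_invoke() is useless here,
// since the module's code is already gone) and delete each instance.
function content_module_delete($module) {
  $result = db_query("SELECT nfi.field_name, nfi.type_name
                      FROM {content_node_field_instance} nfi
                      INNER JOIN {content_node_field} nf ON nf.field_name = nfi.field_name
                      WHERE nf.module = '%s'", $module);
  while ($row = db_fetch_object($result)) {
    // Deleting the last instance of a field also removes the field itself,
    // its settings, and its data table.
    content_field_instance_delete($row->field_name, $row->type_name);
  }
}
?>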
Once the D6 port is finished, we can try to backport some of these changes to D5, but that won't happen right away.
Comment #17
Island Usurper commented:
Just to make sure: are you talking about modules with node types at the end of #13? That particular issue came to my attention and I wanted to make sure it was on someone's radar.
Comment #18
karens commented:
@Island Usurper, I'm not sure what you're talking about.
The work done in D6 is to fix problems when a CCK field module is uninstalled after it was used to create data, which left orphan field data records that did not get removed. Now, when a field module is uninstalled, we can also clean up any node it attached its field to and get rid of all its field data.
Comment #19
Island Usurper commented:
I guess I'm really talking about the opposite problem, data being deleted when it shouldn't be, but I think it'll be based on the same code.
I'll make a new issue about it.
Comment #20
jefbak2 commented:
Sorry, related but different issue.
Comment #21
yched commented:
jefbak2: this is not what this thread is about. Please file a separate issue for this.
Comment #22
karens commented:
This is active again now that we removed the batching from content_alter_schema(), since we are no longer cleaning up data when we alter the schema. We still need to figure out a way to use batching to clean up field data when fields are removed, so we don't end up with orphan values in the database.
I've been thinking about other ways to do this, and my best idea right now is to add a batch step into the UI processing, before we alter the schema, that empties the values that are going to get deleted and then does a node_save() on each affected node. After all the values have been emptied, we go in and alter the schema. The alter_db() processing will still drop the tables and fields it does now, but they should all contain empty values by that time. I am pretty sure this will work for UI field changes; anyone making changes via the API will have to take care of that step themselves.
Thoughts?
Comment #23
karens commented:
That last post was not too clear. Let me try again.
Example:
A multiple value field that we're going to change to a single value.
Current data looks like:
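(The original example data was not preserved in this thread; with illustrative names, the per-field table might hold something like:)

  content_field_foo:
    vid | nid | delta | field_foo_value
     10 |   5 |     0 | A
     10 |   5 |     1 | B
     10 |   5 |     2 | C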
First, we create a batch that loads one node at a time, changes the values to the following, and saves it:
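(Again illustrative: everything beyond the first value is emptied, but the rows themselves remain until the schema change:)

  content_field_foo:
    vid | nid | delta | field_foo_value
     10 |   5 |     0 | A
     10 |   5 |     1 | NULL
     10 |   5 |     2 | NULL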
Then we do the content_db_alter() on the table, which will move the zero delta to the content table and drop the other two values. But the other two values are already empty, so that's OK.
Comment #24
karens commented:
Bumping this forward. I can't decide if we still have this issue or not.
Comment #25
yched commented:
We probably do - we don't do anything about it in or around content_field_instance_delete().
That's one of those would-be batch things...
Comment #26
markus_petrux commented:
Hmm... I opened another issue that is a dup of this one: #445436: Dropping shared field from one content type leaves orphan records.
I've been reading all the conversation between KarenS and yched. But, I'm not sure what's really left. Is it possible to enumerate the cases where CCK leaves orphans? ...or is it just when a shared field is deleted from one type (which is the case I noticed on our installation)? Could we add code to CCK to warn the user when that happens?
If it were possible to perform automatic cleanup, would it be viable to introduce this in CCK, or should we rely on custom/contrib code? Now, I realize that it is not only necessary to remove records from the DB, but also to trigger CCK fields to let them do their own cleanup.
The problem is real in D6, so what would be the best practice at this moment in time?
Comment #27
yched commented:
I'm actually less and less sure we really need to trigger field-type modules and let them do the cleanups. There are many cases where data currently gets silently dropped:
- when a non-shared multiple field is deleted: table is dropped
- when a non-shared single field is deleted: columns in the per-type table are dropped
- when the second-to-last instance of a field is deleted (field gets unshared): only relevant rows in the per-field table are copied to the per-type table, then the per-field table is dropped
- simply when a node is updated and field values change: previous values are dropped...
In fact, the only case in which we currently call hook_field_delete() is node/revision deletion.
So obviously field types like filefield do not rely on this for their cleanup. Adding one more case of 'drop data without notification' wouldn't be that serious.
So if we find a way to remove the data from the tables with no chance of timeouts, we could go for it, IMO.
Contrary to what I wrote in #445436: Dropping shared field from one content type leaves orphan records, MySQL supports JOINs in DELETE queries (see dopry's patch in #6 - for 4.7!). We'd need to check that PgSQL does too.
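For reference, such a multi-table DELETE might look like this (a sketch; MySQL syntax, with illustrative table and variable names):

<?php
// Sketch of a JOINed DELETE (MySQL-specific syntax - PgSQL support still to
// be checked): remove the per-field rows belonging to nodes of the content
// type that just lost its instance of the field.
db_query("DELETE f FROM {content_field_foo} f
          INNER JOIN {node} n ON n.vid = f.vid
          WHERE n.type = '%s'", $type_name);
?>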
Now, I'm not sure about the amount of time needed to run 20+ such queries on a db with tens of thousands of nodes...
On the one hand, the current 'data migrations' (per-field to per-type or back) can also raise timeouts on large dbs, and/or when cascaded because of a node-type delete.
On the other hand, and unlike the above where we need the data migration done "now, not at some point later", cleaning the data in a per-field table can happen on later cron runs without any worry...
Comment #28
markus_petrux commented:
So we could try to introduce a new cron task to deal with this? Such a cron task would be triggered by a variable that contains all the relevant information and that would be removed when the task is completed.
...or is it viable to try to introduce the use of Batch API to perform all these heavy tasks? I would assume it is too late in the D6 cycle for something like that, though. Batch API could solve the timeouts, but then affected content types would have to be locked somehow until the changes are completed.
So... is hook_cron() the way to go in D6 to deal with the orphans that remain when a shared field is removed from some types while it still uses per-field storage (still shared, or multiple values)? If so, I may try to write a patch so it can be tested, etc.
Comment #29
markus_petrux commented:
Hmm... maybe we can resolve this remaining cleanup from content_field_instance_delete() as follows?
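(The snippet itself was not preserved in this thread; reconstructed from the follow-up comments, the approach was roughly:)

<?php
// Hypothetical reconstruction (the original snippet is lost): after the
// instance is removed, collect the nids of the affected content type and
// delete their rows from the per-field table in one statement.
$db_info = content_database_info($field);
$nids = array();
$result = db_query("SELECT nid FROM {node} WHERE type = '%s'", $type_name);
while ($row = db_fetch_object($result)) {
  $nids[] = $row->nid;
}
if (!empty($nids)) {
  db_query('DELETE FROM {'. $db_info['table'] .'} WHERE nid IN ('. implode(',', $nids) .')');
}
?>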
Comment #30
markus_petrux commented:
The patch.
Comment #31
markus_petrux commented:
Oops, the process ought to be dealing with nids, rather than vids. More efficient, and the SELECT does not use filesorts.
Comment #32
markus_petrux commented:
Now, instead of building one big DELETE statement, it could be split into one for each (say) 10,000 nids. If so, please let me know and I'll update the patch. This can easily be done while building the $nids array, for example.
Comment #33
markus_petrux commented:
Here's a code snippet that can be used to check how many orphans exist in per-field tables:
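(The snippet was not preserved in this thread; a hypothetical sketch of such a check - it assumes CCK D6's settings tables, content_database_info(), and MySQL subquery support:)

<?php
// Hypothetical sketch: for each field stored in its own table, count the
// rows that belong to nodes whose type no longer has an instance of it.
$orphans = array();
$result = db_query("SELECT DISTINCT field_name FROM {content_node_field_instance}");
while ($row = db_fetch_object($result)) {
  $field = content_fields($row->field_name);
  $db_info = content_database_info($field);
  // Per-type tables are named content_type_<type>; anything else here is a
  // per-field table, which is where orphans can hide.
  if (strpos($db_info['table'], 'content_type_') !== 0) {
    $count = db_result(db_query("SELECT COUNT(*) FROM {". $db_info['table'] ."} f
      INNER JOIN {node} n ON n.vid = f.vid
      WHERE n.type NOT IN (SELECT type_name FROM {content_node_field_instance} WHERE field_name = '%s')", $row->field_name));
    if ($count) {
      $orphans[$row->field_name] = $count;
    }
  }
}
print_r($orphans);
?>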
This kind of information is quite important for our project because we'll have to deal with thousands of nodes when the site is launched. I'll be compiling similar snippets into a custom module for us, so that we can check data integrity over time. This is not only relevant while building the site; it could also be used to detect inconsistencies caused by bugs, crashes, etc.
However, I'm wondering how many installations out there will see a few records when the $orphans array is printed (snippet above). Maybe something like this could live as a contrib add-on? ...or maybe be included in CCK itself?
Comment #34
yched commented:
Thanks for looking into this, markus.
The problem with the Batch API (even though my initial aim with it was to address CCK housecleaning issues) is that it is a UI-only solution. It plays well with form-triggered operations, but not with API-triggered ones. And content_field_instance_delete() is an API function; it can potentially be called in contexts where batching is not allowed. Karen implemented Batch API cleanups earlier in the D6 cycle, but we had to roll them back because of that :-(
A node-type deletion can potentially trigger tens of content_field_instance_delete() calls, so doing the cleanup in there seems prone to timeouts. That's why we haven't tackled this so far. A cron-based solution, as you mention in #28, seems more bulletproof. That's more or less how it will work in D7.
I'm not sure the functions in #33 would be useful per se in content.module. The mechanism could, however, be used by an update function to identify existing orphans and push them for cron cleanup.
Comment #35
markus_petrux commented:
Hmm... how many nodes would it be reasonable to delete per cron task execution? CCK does not have a generic settings panel where a setting for this could live. Maybe it could be a constant at the top of content.module, 'CONTENT_NODES_PER_CRON_STEP', equal to something... maybe 1000? 10000?
I'll work out a patch implementing hook_cron() for this... OK? Or do you think there's enough material here that you can implement it yourself, polishing these fine details?
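A minimal sketch of what such a cron-based cleanup might look like (everything here is hypothetical: the constant, the queue variable, and the assumption that content_field_instance_delete() queues a task per deleted instance):

<?php
// Hypothetical number of orphan rows to delete per cron run.
define('CONTENT_NODES_PER_CRON_STEP', 1000);

// Hypothetical sketch: content_field_instance_delete() would record the
// per-field table and type name in a queue variable; cron drains the queue
// in slices, so a large site never hits the execution time limit.
function content_cron() {
  $queue = variable_get('content_orphan_cleanup', array());
  foreach ($queue as $key => $task) {
    $result = db_query_range("SELECT DISTINCT f.nid FROM {". $task['table'] ."} f
      INNER JOIN {node} n ON n.nid = f.nid WHERE n.type = '%s'",
      $task['type_name'], 0, CONTENT_NODES_PER_CRON_STEP);
    $nids = array();
    while ($row = db_fetch_object($result)) {
      $nids[] = $row->nid;
    }
    if ($nids) {
      db_query('DELETE FROM {'. $task['table'] .'} WHERE nid IN ('. implode(',', $nids) .')');
    }
    else {
      // Nothing left to clean for this task: drop it from the queue.
      unset($queue[$key]);
    }
  }
  if ($queue) {
    variable_set('content_orphan_cleanup', $queue);
  }
  else {
    variable_del('content_orphan_cleanup');
  }
}
?>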