Download & Extend

Offer options on how to handle uri collisions

Project:Migrate
Version:7.x-2.x-dev
Component:Code
Category:feature request
Priority:normal
Assigned:mikeryan
Status:closed (duplicate)
Issue tags:Migrate 2.4

Issue Summary

Something is wrong with the code in MigrateDestinationFile::import() for file saving.

<?php
   
if (!file_load_multiple(array(), array('uri' => $file->uri))) {
?>

$file->uri is NOT the primary key for files!
The correct primary key would be $file->fid.

Effect:
The migration will create new files, instead of updating existing ones.
In some cases this might not matter, but it does matter if you create stub files (such as, for user avatars).

The following things have to be checked:
- Is the $file->fid set, or are we creating a new file?
- If $file->fid is set, does this file actually exist?
- Does another file exist with the same filepath (uri)? Is this our friend that has the same fid, or is this something else?
- If $file->fid is set, and the file entity with that fid does exist already, do we have a modified uri?

Patch going to follow.

-----------

Disclaimer:
To be honest, my head is currently not in migrate stuff. This is all copy+paste from
#1314748-6: Files need to be resaved on update ff
This post is the result of some research, and I don't want it to get lost. Issue summary can be improved over time.
I'm going to re-upload the patch from there.

Also related:
#1358318: dedupe() overdedupes on --update

Comments

#1

Here is the patch from
#1314748-7: Files need to be resaved on update.

Amazing, now this does actually work :)

Here is a very basic patch.

EDIT:
There is probably a lot of redundancy with other parts of code, where things are already being checked.

Also, the dedupe stuff that I talk about in the comments, needs to happen somewhere. This all depends how the $file objects are prefilled with uri etc, stuff that is happening before this function.
Honestly, I did not study any of this.

AttachmentSize
migrate.file_save.7-x-2-x.1314748-7.patch 2.97 KB

#2

Assigned to:Anonymous» mikeryan
Status:active» needs review

Thanks for opening the distinct issue.

When submitting a patch, be sure to set the status to "needs review" to make sure it gets reviewed (I'm cleaning things up in anticipation of a Migrate 2.3 release, searching for the "needs review" status among others...).

There is no confusion over what the primary key is - the point of searching for an existing uri is the presumption that a matching uri means the file in question already exists, and thus we want to point to it. This has been the preferable approach for the file migrations I've done, but if I'm not mistaken your point here is you want to handle pre-existing uris differently? Your assumption is that it's a different file with the same name, so you want to dedupe it?

The file destination definitely needs some refactoring, which I want to address in #1240928: META: Refactoring of file destination/field handlers. I doubt I'll commit your patch by itself (need to look at it closely), but I'll take it into consideration in the refactoring.

Thanks!

#3

"needs review"

thanks :)

There is no confusion over what the primary key is - the point of searching for an existing uri is the presumption that a matching uri means the file in question already exists, and thus we want to point to it. [..] Your assumption is that it's a different file with the same name, so you want to dedupe it?

It could be a different file, which has nothing to do with our migration.
Or it could be the very same file, which we are now supposed to update.

The little difficulty I have is that I have not studied what happens before the import() method.
In the critical situation, we have a $file object with a not-NULL $file->fid, and a $file->uri which points to an already-existing file.
How is this uri determined? And what does it tell us? Please, copy the new file here? Or, the new file already has been copied here?
What is the intended behavior, if the import() method is called with such a $file object?

The critical thing happens here, where $file->fid is set to NULL. This will turn an update into an insert.
http://drupalcode.org/project/migrate.git/blob/refs/heads/7.x-2.x:/plugi...

<?php
  migrate_instrument_start
('file_save');
  if (!
file_load_multiple(array(), array('uri' => $file->uri))) {
   
// Get this orphaned file into the file table.
   
$file->fid = NULL;
   
$file->status |= FILE_STATUS_PERMANENT; // Save a write in file_field_presave().
 
}
 
$file = file_save($file);
?>

Actually this happens if a file record with the same uri is not found.

Some situations where this can break:

Scenario 1)
Source has an id for each file and, of course, a filename, path or uri, associated with each file id.
Our custom migration class has some conversion logic, to suggest a filename and path for the migrated file.
Finally, the file gets a destination fid, and a uri.

This all works great for the first import.
migrate_map_userpic is now filled with mappings between destination fid and source file id.

Now, second time import with --update.
In the meantime, some of the files in the source system have been renamed or moved around. Also, the logic to suggest a new filename for each migrated file, has changed.

Looking files with the same uri fails, because the uri is not the same as in the first import.
The evil $file->fid = NULL happens, and what should be an UPDATE is turned into an INSERT.

Intended behavior would be to delete the previously migrated files, but keep the respective records in file_managed, and then copy the original files to the new location.

Scenario 2)
Similar to the above, but now we create stub files.
We need this for user pictures.

The stub files are are just stupid blank txt files, that we create in a location where it doesn't hurt.

The second time we want to update these files to something real. But again, the uri check fails, and turns the UPDATE into an INSERT on file_managed.

The intended behavior would be to delete the stub file, keep the record in file_managed, copy the source file to the new destination uri location.

-------

So, the lesson learned:
If $file->fid is not NULL, then we want to update, not insert. If the $file->uri looks like something new, this is probably because the original file has been renamed, or there are other reasons why we want to save the file in a different path.

Question is, at which point in the pipeline should this be dealt with?

#4

Title:MigrateDestinationFile: primary key should be fid, not uri.» Offer options on how to handle uri collisions
Category:bug report» feature request
Status:needs review» postponed

This will be dealt with as part of the overall file refactoring for Migrate 2.4: #1240928: META: Refactoring of file destination/field handlers.

#5

Status:postponed» active

#6

Status:active» closed (duplicate)

This is being dealt with in #1240928: META: Refactoring of file destination/field handlers - in addition to the core FILE_EXISTS_RENAME and FILE_EXISTS_REPLACE options, we now have a custom MigrateFile::FILE_EXISTS_REUSE to explicitly reuse an existing file.

nobody click here