From #1481026: Get servers in place for Drupal 7 staging/dev environments: the files table is a mess. For example, 484 rows have the filepath "files/issues/". An example is #629528: Error, where the file actually does exist at http://drupal.org/files/issues/Error_3.JPG. There really isn't a good way to clean this up. Doing it well would mean matching every row to its actual file, maybe using file size to help.

To get a current list:
SELECT count(DISTINCT f.fid) c, f.filepath, group_concat(DISTINCT u.nid) FROM files f LEFT JOIN upload u ON u.fid = f.fid GROUP BY cast(f.filepath AS BINARY) HAVING c > 1;
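
For anyone wanting to try the query away from production, here is a minimal sketch that runs the same grouping against an in-memory SQLite database. The toy rows and the `duplicate_filepaths` helper are made up for illustration. SQLite compares TEXT byte-for-byte by default, so the MySQL `cast(... AS BINARY)` is left out here.

```python
import sqlite3

# Toy stand-in for the real tables; only the columns the query touches.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (fid INTEGER PRIMARY KEY, filepath TEXT);
    CREATE TABLE upload (fid INTEGER, nid INTEGER);
    INSERT INTO files VALUES
        (1, 'files/issues/'),
        (2, 'files/issues/'),
        (3, 'files/issues/Error_3.JPG'),
        (4, 'files/issues/error_3.jpg');
    INSERT INTO upload VALUES (1, 100), (2, 200), (3, 300);
""")

def duplicate_filepaths(conn):
    """Return (row count, filepath, nids) for each filepath used by more than one fid."""
    return conn.execute("""
        SELECT count(DISTINCT f.fid), f.filepath, group_concat(DISTINCT u.nid)
        FROM files f
        LEFT JOIN upload u ON u.fid = f.fid
        GROUP BY f.filepath
        HAVING count(DISTINCT f.fid) > 1
    """).fetchall()

rows = duplicate_filepaths(conn)
for count, filepath, nids in rows:
    print(count, repr(filepath), nids)
```

Only 'files/issues/' is reported. The two Error_3 paths differ in case, so they are grouped separately, which is what the binary cast is for in the MySQL version.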

Comment  File             Size       Author
#5       orphans.tsv_.gz  412.86 KB  drumm
         duplicates.txt   11.93 KB   drumm

Comments

drumm’s picture

Issue tags: +drupal.org D7

tag

drumm’s picture

For future reference, the following tables reference fids. A lot of the duplicates may be orphans.

comment_upload.fid
comment_upload.legacy_fid
content_field_images.field_images_fid
content_type_casestudy.field_mainimage_fid
content_type_organization.field_logo_fid
content_field_project_images.field_project_images_fid
image.fid
project_release_file.fid
project_releases.fid
project_releases_backup.fid
upload.fid
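
A rough sketch of how the list above could be turned into an orphan check: a files row counts as an orphan when none of the listed columns holds its fid. This runs against a toy in-memory SQLite schema, not the real database. The `orphan_fids` helper and the sample rows are hypothetical; only the table and column names come from the list.

```python
import sqlite3

# Columns that reference files.fid, copied from the list above.
FID_COLUMNS = [
    ("comment_upload", "fid"),
    ("comment_upload", "legacy_fid"),
    ("content_field_images", "field_images_fid"),
    ("content_type_casestudy", "field_mainimage_fid"),
    ("content_type_organization", "field_logo_fid"),
    ("content_field_project_images", "field_project_images_fid"),
    ("image", "fid"),
    ("project_release_file", "fid"),
    ("project_releases", "fid"),
    ("project_releases_backup", "fid"),
    ("upload", "fid"),
]

def orphan_fids(conn, fid_columns=FID_COLUMNS):
    """Return fids in files that no listed column references."""
    referenced = " UNION ".join(
        f"SELECT {col} AS fid FROM {table} WHERE {col} IS NOT NULL"
        for table, col in fid_columns
    )
    sql = f"SELECT f.fid FROM files f WHERE f.fid NOT IN ({referenced}) ORDER BY f.fid"
    return [row[0] for row in conn.execute(sql)]

# Toy schema: each referencing table gets only its fid column(s).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (fid INTEGER PRIMARY KEY, filepath TEXT)")
columns_by_table = {}
for table, col in FID_COLUMNS:
    columns_by_table.setdefault(table, []).append(col)
for table, cols in columns_by_table.items():
    conn.execute(f"CREATE TABLE {table} ({', '.join(c + ' INTEGER' for c in cols)})")

conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [(1, "files/issues/"), (2, "files/issues/"), (3, "files/images/a.png")],
)
conn.execute("INSERT INTO upload (fid) VALUES (1)")
conn.execute("INSERT INTO comment_upload (fid, legacy_fid) VALUES (NULL, 3)")

orphans = orphan_fids(conn)
print(orphans)
```

The `IS NOT NULL` filter matters: a single NULL inside a `NOT IN` list would make the comparison unknown for every row and hide all orphans.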
drumm’s picture

Issue tags: +porting

tag

drumm’s picture

Assigned: Unassigned » drumm
drumm’s picture

Status  File             Size
new     orphans.tsv_.gz  412.86 KB

Attached are the 43836 rows from the files table that I'm deleting. Their names duplicate other rows, and they are not referenced from any of the fields listed in #2.

drumm’s picture

That took care of all the duplicated files in files/images/*. Only 85 duplicated filenames are left. Most have 2 rows, sometimes 3-4, but 'files/issues/' has 93.

drumm’s picture

The 15 with filepath like 'files/releases/%' all had bad release nodes creating the duplicates, such as one node each for a CVS tag and a Git tag. I deleted the bad ones, and we are down to 70 duplicated filepaths.

drumm’s picture

I removed the duplicated files rows which had the wrong size. We just don't seem to have those files, so they are essentially bad uploads. In many cases, people had already re-uploaded them. All the issues were quite old; I even saw one of mine from 2004.

That gets us down to 31. I'll try going by timestamps next. filepath = 'files/issues/' has 91 rows.
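
The wrong-size check described above could look roughly like this: compare the size recorded in each files row with the size of the file on disk, and flag rows where they differ or the file is missing. The `wrong_size_fids` helper, the temporary directory, and the sample rows are all made up for illustration; the actual cleanup may have been done differently.

```python
import os
import tempfile

def wrong_size_fids(rows, root):
    """Return fids whose recorded filesize does not match the file under root.

    rows is an iterable of (fid, filepath, filesize) tuples.
    """
    bad = []
    for fid, filepath, filesize in rows:
        full_path = os.path.join(root, filepath)
        if not os.path.isfile(full_path) or os.path.getsize(full_path) != filesize:
            bad.append(fid)
    return bad

# Toy demo: one file on disk, two rows claiming the same filepath.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "files", "issues"))
    with open(os.path.join(root, "files", "issues", "patch.txt"), "wb") as handle:
        handle.write(b"x" * 120)

    rows = [
        (10, "files/issues/patch.txt", 120),  # matches the file on disk
        (11, "files/issues/patch.txt", 95),   # recorded size is wrong
    ]
    bad = wrong_size_fids(rows, root)

print(bad)
```

A row with an empty or directory-only filepath such as 'files/issues/' is always flagged, since it does not point at a regular file.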

senpai’s picture

Issue tags: +sprint 2, +sprint 3

Tagging for sprint 3.

drumm’s picture

Status: Active » Fixed

Done!

Automatically closed -- issue fixed for 2 weeks with no activity.