I have a project for work where we needed to take an existing site built with Ruby on Rails using a MySQL database and convert it to use Drupal as the framework. It is for a regional debate league, and it has events, schools, news (general, middle school, and high school), galleries, contact forms, a google map showing all schools, and several static pages. It should be straight forward, except there are several hundred events, three times as many photo galleries, and 33,000+(!) images weighing in at 16 GB.

In the existing database, galleries were linked to images using a 1 to many relationship and foreign keys in the images database. The images database had 4 different sizes for each image, and everything but the original image was linked to the id of the main image through a parent id. Galleries were associated with events on a 1:1 basis, though about 2/3rds of the galleries were not associated with any event. News was either general, middle, or high school. Schools and events both had location information associated with them. I set up my content types to match the 'content types' on the old system using CCK fields and location where applicable.

I reviewed several options for the photo galleries with the client, and they liked the gallery included with the images module the best, as it supported mass import from FTP and was straight forward to use and set up.

To facilitate the migration, I decided on the node auto term module as a way to use taxonomy to tie together foreign key relationships in the previous database. The problem was that the module didn't automatically associate new nodes with the terms created, and though there was a patch to make this work, it was for the Drupal 5 version of the module. I ported the patch to the D6 version of the module, and fixed/worked around a bug that was causing a WSOD when editing a NAT node.

With that out of the way, I could start the import process, but I ran into a couple of show stoppers with the node import module. First and foremost was importing dates using the date CCK field, it would belch errors about null values in the date module. After a little research, I found that the import only works on date CCK fields if they use the select list widget, not the JS popup. Modifying the widget worked for the news, which went back to the end of 2007.

However, when I tried to import the events, I received endless "illegal value detected. please contact the administrator" messages, on some, but not all of the events. After some investigation, I found that the events went back to 2001, and the date CCK field defaults to -/+ 3 years for the available time. Changing this to 10 years fixed the problem.

Now, for the tricky part - to maintain referential integrity on the events, which relate to galleries, which relate to images, you need to push up the original ID from the old database on the import. It was easy enough - I made an integer CCK field called "original_system_id_value", and mapped the id from the database CSV dump here. Everything else from the events, including locative information and dates (after fixing the date range) went in without a hitch. Because of using node auto term and my modifications to make it automatically associate the terms with the nodes, every event node was set up to create a new term in the image galleries vocabulary, so there would be 1 gallery for each event, as there is in the old system.

The next step was to pull the term ID's mapped to the original system ID's from the Drupal database so I could insert them back into the original Ruby database for the purposes of having term ID's for inserting the images into galleries. This was easy enough, since I added the CCK field with the original ID's. Image galleries have a vocabulary ID of 3 - a little SQL query later:

SELECT n.tid, n.vid, ev.field_original_system_id_value 
FROM nat n LEFT JOIN content_type_events ev ON (ev.nid = n.nid) 
WHERE ev.field_original_system_id_value IS NOT null
AND n.vid = 3;

And I had it. On the old database, I added a new table called vocab_sync, added tid columns to the image_galleries and uploaded_images tables, and imported the 3 columns from the query above as tid, vid, and oid(original ID). I then updated the uploaded images associated with galleries that were associated with events using the following query:

UPDATE image_galleries ig, vocab_sync v 
SET ig.tid = v.tid 
WHERE ig.gallery_owner_id = v.oid;

UPDATE uploaded_images ui, image_galleries ig, events e, vocab_sync v 
SET ui.tid = v.tid 
WHERE ui.image_gallery_id = ig.id
AND ig.gallery_owner_id = e.id;

Now every image associated with a gallery associated with an event on the old system had the taxonomy term id (tid) from the new system. I still needed to import the other galleries that were not associated with events, so I ran a query like so on the old database:

SELECT id, name FROM image_galleries 
WHERE gallery_owner_id IS NULL;

That pulled the remaining galleries into a list, which I exported as CSV, and used node import to import it as terms for the image galleries vocabulary. I made a new parent term called Unassociated Galleries, which all imported terms would reside under, and performed the import using node import, setting the id from the original system as the description field in the vocabulary. I made note of the highest tid before the import in the term_data table on the Drupal database, then performed the following to grab a mapping for the rest of the galleries:

select tid, vid, description from term_data where vid = 3 and tid > 628 order by tid;

I imported the result into the Ruby database as another table called vocab_sync_unassociated, and did the following to associate the non-event galleries in the Ruby database:

UPDATE image_galleries ig, vocab_sync_unassocated v
SET ig.tid = v.tid 
WHERE ig.id = v.oid;

UPDATE uploaded_images ui, image_galleries ig
SET ui.tid = ig.tid 
WHERE ui.image_gallery_id = ig.id;

The next step was creating proper paths for the images as in the Ruby application, the images were stored and retrieved based on a hash of their ID in the database in a deep folder structure, so for instance say an image's ID was 135422 in the database, it would be contained in the folder 0013/5422/image.jpg. To make this into a path that Drupal would understand, I wrote a small php script like so :


$host="localhost";
$user="root";
$password="";
$db="rordb";

$full_filename = Array();

$conn = mysql_connect($host,$user,$password)
   or die(mysql_error());

mysql_select_db($db, $conn)
   or die(mysql_error());

$sql = "SELECT id, parent_id, filename FROM uploaded_images";

    $result = mysql_query($sql)
       or die(mysql_error());
    while ($row = mysql_fetch_assoc($result)) {
      if($row['parent_id'] > 0)
        $id_field = $row['parent_id'];
      else
        $id_field = $row['id'];
      $id_path = 100000000 + $id_field;
      $id_path = substr($id_path, 1);
      $id_path = str_split($id_path, 4);
      $id_path = 'uploaded_images/'.$id_path[0].'/'.$id_path[1].'/'.$row['filename'];
      $full_filename[$row['id']] = $id_path;
    }
    
    foreach($full_filename as $k => $full_file){
        $sql = "UPDATE uploaded_images SET full_path='".$full_file."' WHERE id = ".$k;
        $result = mysql_query($sql)
           or die(mysql_error());        
    }

And on 177,000 records, that took about 3 hours to complete. I'm sure I could have done it in SQL orders of magnitude faster, but it gave me a good excuse to play GTA4 for a while.

I untarred the tarball containing 177,000 images and derivative sizes stuffed in the ridiculous directory structure to an 'uploaded_images' folder under sites/default/files on the Drupal installation.

Now everything was ready to do, except one last problem - node import doesn't support image or file attachments to nodes, only CCK images and files. After playing around inside of Drupal's schema for a bit, I came up with a solution - import the full image, preview, and thumbnail images from the Ruby database as CCK fields, then update the images table in the Drupal database afterwards to point to the CCK files, and delete the CCK fields from the image data type.

I created 3 CCK fields on the image content type for full, preview, and thumbnail for the import. To facilitate the import using node import, I created a query on the images table and exported it as a CSV file, like so (the thumbnail column defined what type of image it was):

SELECT u.tid, u.created_at, u.name, u.content_type, 
u.caption,
 u.full_path as fullsize_full_path, 
 utn.full_path as thumbnail_full_path,
 uf.full_path as preview_full_path
FROM uploaded_images u 
LEFT JOIN uploaded_images utn ON (utn.parent_id = u.id) 
LEFT JOIN uploaded_images uf ON (uf.parent_id = u.id)

WHERE u.tid is not null
AND utn.thumbnail = 'thumbnail'
AND uf.thumbnail = 'preview';

I ended up with a CSV file containing 33,410 records, each of which would be turned into a node upon import. I knew this would take a while to import, but the basics were to import to the image content type, set the Image Gallery term to the tid column, the file path for the thumbnail, preview, and original image CCK fields to their respective columns on the CSV file, and run the import. It looks like the import will end up taking around 22 hours - here's a shot of it's current progress.

The final step (I tested these steps with the image import on a small subset of the image table from the Ruby database before going through the trouble of importing the entire thing) is a little transformation inside the Drupal database. Note that all of the file names, depending on the type of file, had '_small', '_large', or nothing appended to the end:


UPDATE files set filename = '_original' 
	WHERE filename NOT LIKE '%_large%' AND filename NOT LIKE '%_small%' AND fid > 1109;

UPDATE files set filename = 'thumbnail' 
	WHERE filename  LIKE '%_small%' AND fid > 1109;
UPDATE files set filename = 'preview' 
	WHERE filename LIKE '%_large%' AND fid > 1109;

INSERT INTO image (nid, fid, image_size) 
	SELECT nid, field_image_import_full_fid, '_original'  FROM content_type_image;

INSERT INTO image (nid, fid, image_size) 
	SELECT nid, field_image_import_preview_fid, 'preview'  FROM content_type_image;

INSERT INTO image (nid, fid, image_size) 
	SELECT nid, field_image_import_thumbnail_fid, 'thumbnail'  FROM content_type_image;

That takes care of turning the CCK imported fields into image attachments for the image nodes. After the import is completed, all that is left to do is delete the CCK fields.

I know that was quite long, and specific to a one-off schema for the source data, but the concepts should be applicable to any import where there is relational data involved, and I hope that helps someone else. Thanks!

-Rich

Comments

WorldFallz’s picture

WOW-- this is an amazing write up. Thanks for taking the time to document this. You may wish to consider adding some images (possibly before and after?) and requesting a front page post. There usually so much marketing fluff, a technical piece would me nice for a change.

hotspoons’s picture

Thanks for the props! Half of the reason I wrote this in so much detail is that I know I'll probably have to do something similar again sooner than later, so I took my mercurial commit logs for the project and formatted them into a narrative for quick reference; the other half of the reason is that I hope that it may help someone else.

Regarding images, I would love to include some graphics, but I don't see anywhere to attach a file in the forums, and I don't see <img> as one of the permitted tags. I'd probably like to complete the project before providing before and after pictures anyhow. Any ideas?

WorldFallz’s picture

The instructions for images are linked off the link above. iirc, you add the img tags to your post, create a webmaster issue to promote the post to the front page, then attach the images to the issue (which does accept attachments).

hotspoons’s picture

I don't know why I didn't see that before when I read the linked page. Thanks! I'm clearing it with my client to see if I can use the likeness of their website (i.e. screenshots of their site with their logo) in the write up before moving forward.