The problem: When migrating content into a Drupal site, no contributed modules provide a way to preserve old NID's during content import when consolidating several Drupal sites into a single installation. Please post counterexamples to this thread, if you know any.

Therefore, it is impossible to preserve old URLs with those modules. Without a way of mapping the old NIDs to their new NIDs, it is impossible to ensure that inbound links from other sites will be preserved.

My particular challenge: This is a write up of how to merge three sites into one while preserving NIDs. When a request comes from oldsite2.com/node/23 it forwards to that nodes new location on the new site, newsite.com/node/2023, likewise from oldsite1.com/node/23 to newsite.com/node/4023. See the problem this solves? I imagine that most people won't find this entire post particularly useful, but you can take parts of it as you might need.

This was a requirement for our client, and so I underwent a somewhat lengthy process to do this. There are probably several steps of this process that can be handled in a different way (for instance, the awk script could have been an SQL or PHP script). Feel free to post those to this thread.

I post this as a conceptual overview and some of my code so that others might benefit. I owe much to the Aquia migration blog post on this subject.

The solution overview: Migrate your old NIDs in a programmatic fashion. For instance, if an old site's node had a NID of 215, make it 2215, or , if it had a NID of 3, make it 2003. Add 2000 to the NIDs in the old database, then export them, and import them into the new database (using PHPmyAdmin or Drush or something). We also added 2000 to all comments and terms and things and imported them. We also mapped old input formats to new ones. The scripts below are just SAMPLES, and cannot be run verbatim.

Also export the url_alias table, and iterate through it with the language of your choice, adding 2000 to the NIDs in that file. I used awk.

Next, any URL that comes from the old site's address, check it for paths /node/nid, and rewrite it with mod_rewrite in the .htaccess file.

At the end of this process, any URL coming from oldsite.org/node/1 will be redirected and rewritten to newsite.org/node/2001. You can migrate in two sites, modifying these scripts to add 6000 to the url of the second site, so that oldsite2.org/node/1 will be redirected to oldsite.org/node/6001.

As you may have guessed, this tutorial is not for the faint of heart.

First, upgrade all your sites to the current version of Drupal, in my case Drupal 6. If you're doing this in Drupal 7, you'll have to modify this tutorial accordingly. You want to migrate from one table to another preserving structure. I ran into this issue when upgrading from 4.7 with Spanish characters and solved it like this.

NEXT: CREATE A COPY OF YOUR OLD DATABASE. RUN THESE SCRIPTS ON A COPY, NOT ON YOUR ORIGINAL.

2. Run your SQL scripts. Here are mine:


# add 2000 to nids, vids, and uids in node and node_revisions
UPDATE node SET nid = nid+2000;
UPDATE node SET vid = vid+2000;
UPDATE node SET uid = uid+2000;
UPDATE node_revisions SET nid = nid+2000;
UPDATE node_revisions SET vid = vid+2000;
UPDATE node_revisions SET uid = uid+2000;

# map input formats/filters to new formats/filters
UPDATE node_revisions SET format = replace( format, 2, 6 ) ;
UPDATE node_revisions SET format = replace( format, 3, 2 ) ;
UPDATE node_revisions SET format = replace( format, 5, 3 ) ;
UPDATE node_revisions SET format = replace( format, 4, 5 ) ;
UPDATE node_revisions SET format = replace( format, 6, 3 ) ;

## correct a mistake I made!!!  (in case you make a mistake)
UPDATE node_revisions SET format = replace( format, 1, 1 ) WHERE nid > 1999;

##these are so that attachments work.  
UPDATE files SET fid = fid+2000;

UPDATE upload SET fid = fid+2000;
UPDATE upload SET nid = nid+2000;
UPDATE upload SET vid = vid+2000;

##make the user ids that created the files match.  Otherwise you'll probably get an access denied error
UPDATE files SET uid = uid+2000;

## or, on another note, the following code can be used alternately to change UID's from one to another.  This has been useful to me in other projects.

UPDATE files SET uid = replace( uid, 8, 2 ) ;
UPDATE files SET uid = replace( uid, 4, 55 ) ;

##this is to update the new location of the files, if it's changed.  Maybe you'll need it, maybe not.
UPDATE files SET filepath = replace( filepath, 'sites/default/files', 'sites/newsite/files' ) ;


This next part was written by my co-worker, Josue.

## term_data 
UPDATE term_data SET tid = tid+2000;
UPDATE term_data SET vid = replace( vid, 3, 15 );
UPDATE term_data SET vid = replace( vid, 4, 16 );
UPDATE term_data SET vid = replace( vid, 5, 2 );

##vocabulary
UPDATE vocabulary SET vid = replace( vid, 3, 15 );
UPDATE vocabulary SET vid = replace( vid, 4, 16 );
UPDATE vocabulary SET vid = replace( vid, 5, 2 );

##vocabulary_node_types
UPDATE vocabulary_node_types SET vid = replace( vid, 3, 15 );
UPDATE vocabulary_node_types SET vid = replace( vid, 4, 16 );
UPDATE vocabulary_node_types SET vid = replace( vid, 5, 2 );

## term_hierarchy
UPDATE term_hierarchy SET tid = tid+2000;

## term_node
UPDATE term_node SET nid = nid+2000;
UPDATE term_node SET tid = tid+2000;
UPDATE term_node SET vid = vid+2000;

#users
UPDATE users SET uid = uid + 2000;

#users_roles
UPDATE users_roles SET uid = uid + 2000;

#comments
UPDATE comments SET cid = cid + 2000;
UPDATE comments SET nid = nid + 2000;
UPDATE comments SET uid = uid + 2000;

After running your (MODIFIED) scripts, export each of the modified tables, and then import them using phpmyadmin or mysql commands or drush or whatever.

Now of the url_alias table. This was tricky because the path looks something like /node/23. You can add 2000 to a string. You have to pull out the number, add 2000, then rebuild it. Nice, huh?

So, export your old url_alias table, as is. With a good text editor, chop the first part off the file so that it's just the data, and not the table definitions.

Then, modify this awk shell script and run it. I used gawk in linux, that is, gnu awk. Manual here. Yes, PHP could be used, too, to modify the tables in the database. Awk, however, is designed to regard text files as databases and provide complex string manipulation. While this script worked for me, it should be tested and tweaked.

This script must be executed on a csv export of a database (I exported from phpmyadmin) fields terminated by |. The quotation marks enclosing a field must be searched for and deleted. The data should look like this:

4407|taxonomy/term/855/0/feed|tag/vegetarianism/feed|
4408|taxonomy/term/856|tag/soul-force|
4409|taxonomy/term/856/0/feed|tag/soul-force/feed|

You can modify it afterwards to make it into an SQL file with inserts by combining it with an SQL export done the normal way.

Execute it like #script.sh url_aliases_export.sql > your_output_file.sql

#!/bin/bash


# this requires a csv file with a fields terminated by | and the fields enclosed with '  Export using phpmyadimin

TEMPFILE=/tmp/pd.key.tempfile.$$
TEMPFILE2=/tmp/pd.key.tempfile2.$$
NID=/tmp/pd.key.nid.$$
NID2=/tmp/pd.key.nid2.$$

trap "exit 1"				HUP INT PIPE QUIT TERM
trap "rm -f $NID" EXIT
#sed -e 's/\|\|/\|\'\'\|/g'
   sed -e '/^$/d' < $1 > $TEMPFILE
a=(blank blank2)
awk  -v Q="'" -v NID2=0 'BEGIN { FS = "|"; OFS = "," }  { gsub(Q, "", $1) ; gsub(Q, "", $2) ; $1 = $1 + 200000; match($2, /[0-9][0-9]?[0-9]?[0-9]?[0-9]?/, a) ; NID2 = a[0] + 2000 ; gsub(a[0], NID2, $2) } { print   "(" Q $1 Q , Q $2 Q ,Q $3 Q , Q "en" Q "),"}' $TEMPFILE  

Check over this file. You may need to add a semicolon on the end, and clean it up a bit. Make sure it's right. Compare it to your other file. Paste the old table definitions stuff back in. Import it into your new database. It may catch on some URLS that have already been defined. Just delete the first half of the file (the part already imported) minus the field definitions, and delete the offending line in your import sql.

See if your nodes and url aliases work as they are. You'll have to rebuild site permissions now or you'll get "access denied" errors over at Content -> Post Settings: admin/content/node-settings.

Now for the mod_rewrite magic. For more info, check out the migration blog and the mod_rewrite online documentation, a nice introduction, tons of examples, or this cheat sheet.

Finally, the most important thing to remember is that the RewriteCond is checking internal server variables. the RewriteRule is just checking and rewriting the URL. This doesn't change the internal server variables.

Paste the following into your .htaccess file.


# check each url and see if it is from the old sites.  If oldsite1.org, add 6000 to the node.  If oldsite2.net, add 2000.  

RewriteCond %{HTTP_HOST} ^oldsite1\.org$ [NC]
RewriteCond %{QUERY_STRING} ^q=node/([0-9]{3})$ [NC]
RewriteRule ^.*$ http://newsite.org/index.php?q=node/6%1 [R=301,L]

RewriteCond %{HTTP_HOST} ^oldsite1\.org$ [NC]
RewriteCond %{QUERY_STRING} ^q=node/([0-9]{2})$ [NC]
RewriteRule ^.*$ http://newsite.org/index.php?q=node/60%1 [R=301,L]

RewriteCond %{HTTP_HOST} ^oldsite1\.org$ [NC]
RewriteCond %{QUERY_STRING} ^q=node/([0-9]{1})$ [NC]
RewriteRule ^.*$ http://newsite.org/index.php?q=node/600%1 [R=301,L]

 RewriteCond %{HTTP_HOST} ^oldsite2\.net$ [NC]
 RewriteCond %{QUERY_STRING} ^q=node/([0-9]{3})$ [NC]
 RewriteRule ^.*$ http://newsite.org/index.php?q=node/2%1 [R=301,L]

RewriteCond %{HTTP_HOST} ^oldsite2\.net$ [NC]
RewriteCond %{QUERY_STRING} ^q=node/([0-9]{2})$ [NC]
RewriteRule ^.*$ http://newsite.org/index.php?q=node/20%1 [R=301,L]

RewriteCond %{HTTP_HOST} ^oldsite2\.net$ [NC]
RewriteCond %{QUERY_STRING} ^q=node/([0-9]{1})$ [NC]
RewriteRule ^.*$ http://newsite.org/index.php?q=node/200%1 [R=301,L]

RewriteRule !^http://newsite.org$ '-' [C]
RewriteRule ^http://[.*]/([.*])$ '-' [C]
RewriteRule ^(.*)$ http://newsite.org/$1 [R=301,L]

# rewrite all www's to the base name.  Necessary to not write exceptions to the above rules and make things nice and clean

   RewriteCond %{HTTP_HOST} ^www\.oldsite1\.org$ [NC]
   RewriteRule ^(.*)$ http://oldsite1.org/$1 [L,R=301]

   RewriteCond %{HTTP_HOST} ^www\.oldsite2\.net$ [NC]
   RewriteRule ^(.*)$ http://oldsite2.net/$1 [L,R=301]

   RewriteCond %{HTTP_HOST} ^www\.newsite\.org$ [NC]
   RewriteRule ^(.*)$ http://newsite.org/$1 [L,R=301]

 #rewrites anything from the old sites that that don't have node in them (all old urls) to newsite.org/whatever_the_old_one_was 

 RewriteCond %{HTTP_HOST} ^(oldsite2\.net)$ [NC]
 RewriteRule !^.*node.*$ '-' [C]
 RewriteRule ^(.*)$ http://newsite.org/$1 [R=301,L]

 RewriteCond %{HTTP_HOST} ^(oldsite1\.org)$ [NC]
 RewriteRule !^.*node.*$ '-' [C]
 RewriteRule ^(.*)$ http://newsite.org/$1 [R=301,L]

[C] means execute the following if and only if the current rule succeeds.
[NC] means capitalization is irrelevant.
[R=301] means permanent redirect.
[L] Means last rule, quit processing.
() captures the enclosed and creates a variable that can be accessed later using $1 or %1

To troubleshoot this (surely it will all work perfectly!), enable devel module's execute php block and execute print_r($GLOBALS); This will spit out all the internal apache server variables, which you can use to check to see if the RewriteCond can see the right stuff. Make sure you do this temporarily, don't print your variables anywhere in your theme or anything. That would be really bad, as the code contains your main database password and username.

Other than that, it's all probably working without a hitch now, and I suggest a cold beer. You deserve it.

Comments

bailey86’s picture

Hi,

I've been building a data export/import module which may be of help.

It's currently in the sandbox area but I'm hoping to get it accepted soon. It's at:

http://drupal.org/sandbox/bailey86/1278830