Hey Drupal guys and gals :D

I'm working on migrating a large amount of content to the new Drupal-powered version of my site neverside.com. I have about 21,000 users, 95,000 forum threads (nodes)... and 711,000 replies (comments). I imported users using User Import, and the nodes manually with SQL. So far I've successfully imported the users and nodes, as well as set correct taxonomy terms for the nodes to appear in the correct forum categories. I'm a bit stuck on getting the comments imported correctly though.

I have this query written to add my old system's posts to Drupal's comments table (I have a temp table that contains all the relevant data I have available for importing into Drupal... the comments/replies of my old system are sorted by timestamp, no threading at all):

INSERT INTO comments (cid, pid, nid, uid, subject, comment, hostname, timestamp, score, status, format, thread, users, name, mail, homepage)
SELECT
	p_thread_id,
	'0',
	p_user_id,
	p_title,
	p_post,
	'',
	p_time,
	'',
	'0',
	'1',
	'',
	'a:1:{i:0;i:0;}',
	'',
	''
FROM neverside_replies;

The main thing I'm unsure about is how to correctly populate the "thread" column. I've looked at how Drupal fills this column by default using the "int2vancode and vancode2int functions" (which I don't understand completely). Even when comments are not threaded, it seems to store this value in a special format. I noticed that using Devel module's content generation tool, comments are generated with a value of "0" for thread, and it seems to work fine. However I'm not sure if this is the best way to do this for my actual content (e.g. might it break if someone adds new comments to an old node?). When I'm done I know I need to also update the sequences table as well as rebuild node_comment_statistics.

In addition to the "thread" column, if there are also any other tips on importing comments correctly I'd greatly appreciate it :D

Thanks a bunch guys! :D

- David

Comments

alihammad’s picture

just out of curiosity, why are you not using comment_save() (drupal api) from within a drupal node ?

Ali Hammad Raza
WordsValley

dnewkerk’s picture

Hi Ali -

Thanks, I'll have a look at it on the API docs to see if it helps me understand better... however I'm importing everything directly via SQL, because I am not any good with PHP yet, and because it is such a huge amount of content doing it over the database is faster and less prone to page timeouts/etc. Also as not all my pages and their comments have sequential IDs (e.g. some deleted over time) this is the best way I can think of to ensure that everything stays properly connected to each other, maintaining their original IDs.

- If anyone knows, I'd still love some pointers in the right direction on how to turn what I have (posts grouped by node ID and ordered by timestamp) into correct values for the comments "thread" field. Devel sets it just to
- I also do have the comment IP addresses I could import, though does anyone know if this can be left blank? The IPs are useless at this point (up to 8 years old).

I've also corrected some of my SQL statement:

INSERT INTO comments (cid, pid, nid, uid, subject, comment, hostname, timestamp, score, status, format, thread, users, name, mail, homepage)
SELECT
	p_id,
	'0',
	'p_thread_id',
	p_user_id,
	p_title,
	p_post,
	'',
	p_time,
	'0',
	'0',
	'1',
	thread,
	'a:1:{i:0;i:0;}',
	'username',
	'',
	''
FROM neverside_replies;
superjacent’s picture

For my Wordpress to Drupal converter, this is how I've handled the 'thread' field of the comments table.

After each post/node is added (copied across from Wordrpess) an SQL Select query (recordset) retrieves all the comments, if any, for the particular post ordered by comment date (ascending - oldest first).

The thread field merely starts at 1 for the first and increments. The thread field is formatted as a string "00" so 1 is "01". Further there is a slash at the end of the string. So your thread field values will look like these:

01/
02/
03/

Remember, for the next post/node, start all over again with the comments 01/, 02/ and so on.

For Wordpress, comments are not threaded, one follows the other, so it was basically easy to implement.

Hope this helps.

___________________________

Steven Taylor
http://prime357.org

dnewkerk’s picture

Thanks Steven...

One thing I'm curious about... I noticed that "real" comments generated as normal by Drupal start off in that format, 01/, 02/, etc... though when it gets to the tenth comment for a given node, it instead uses 0a/, 0b/, etc (it continues adding alphabet letters in this format past comment 20, which is set to 0k/... after it gets to 0z/ it then switches to 110/ on comment 37 and continues... not sure what happens after that). If it were just using concat to add a 0 and slash to the CID (per node) I'd be good to go, though the alphabet letters throw a wrench into the mix for me. I haven't yet figured out how to accommodate this sort of thing in my SQL statement.

I tested adding my own threaded comment reply to a comment generated by Devel module (which just sets all thread values to 0), and the result was a thread value of 0.00/ for my new reply comment, and the comment appearing visually on the comment listing associated with a different comment than I had pressed the reply link for. So that answers one question for me... doing things Devel module's way may allow comments to function normally so long as I disable threading on future comments (or enable it and just know that threading will break on all old nodes if new comments are added). To keep the option for threaded comments open if desired though, it looks like I'm going to have to correctly populate the thread field. All my nodes are forum nodes, so revived old forum threads are a definite possibility, as tends to happen on forums.

I'm definitely in over my head on this one haha...

Thanks everyone for your help! :D

dejbar’s picture

If you're going to do it the raw SQL way (as I did) then you'll need to use the 'int2vancode' and possibly 'vancode2int'. A 'Vancode' must be a word to describe what is supposed to be in the 'thread' column.

http://api.drupal.org/api/function/int2vancode/5
http://api.drupal.org/api/function/vancode2int/5

Since these functions don't use any other drupal code you could just copy and paste them into any scripts of your own. Worked for me.

superjacent’s picture

Thanks, very helpful. I'll be creating the necessary functions for my next release of the Wordpress converter.

___________________________

Steven Taylor
http://prime357.org

dnewkerk’s picture

Hi dejbar -

Thanks :) Did you, after importing the rest of the fields of your comments as best you could using SQL, process the thread column using a PHP script so you could make use of those functions? If you did and still have it, I'd love to take a look at that script. If not that's ok, I have a friend who's great with PHP who may be able to recreate a script based on those functions (unfortunately I'm not very good at PHP yet myself). I definitely wouldn't have been able to import the whole site with a PHP script (the CSV files are hundreds of MB), though a script to process a single column about a million times may be ok haha :D

Thanks!

- David

natrio’s picture

I know this is an old thread, but I just imported and generated the thread field using SQL, so here's the SELECT query to generate the vancode for the PARENT POSTs (level 1 posts, which have threads like '01/', '02/', etc) only (I added 'isroot' field to the table just for testing, so you should change that WHERE part :D )

select concat(cast(char(length(conv(rank, 10, 36)) + ord('0') -1) as char), cast(conv(rank,10,36) as char)), nid, pid, cid, subject, comment, thread
from (
  select if (@nid=nid, @num := @num + 1, @num:=1 ) rank, @nid :=nid,
  nid, pid, cid, thread, subject,
  comment from comments e, (select @num:=0) d, (select @nid:=0) h
  where isroot=1
  order by nid asc, pid asc
  ) u ;

with a bit modification you could run it gradually from level 1 to level n

Hope that helps for a starting point

dnewkerk’s picture

Thanks very much Safri :) I will give it a try. The site in question is still under development (had a long break) so I'm sure the info will still be very helpful.

Again, thanks!

-- David
davidnewkerk.com | absolutecross.com
View my Drupal lessons & guides

seanreiser’s picture

I know it's been a while but I ran into a similar situation recently after merging a few sites and importing comments from disqus. At the end of the day I cobbled together this script to both rebuild the node_comments_statistics table and the comment.thread. As there's no clear cut solution posted here, I figured I'd post it incase someone else comes a looking. (Cross posted on my site http://seanreiser.com/projects/2009/06/06/rebuilding-drupal-comment-stat... )

db_query("UPDATE {comments} set thread = ''");

db_query("DELETE FROM {node_comment_statistics}");

$nodes = db_query("SELECT nid,uid,changed FROM {node} WHERE status=1");
while ($m = db_fetch_object($nodes)) {
    $nid = $m->nid;

    // Calculate node_comment_statistics

    $comments=db_query("SELECT nid,max(cid) FROM {comments} WHERE status=0 AND nid=$nid GROUP by nid");
    if($l = db_fetch_object($comments)) {
        $cid = $l->{"max(cid)"};

        $lastcomments = db_query("SELECT uid,name,timestamp FROM {comments} WHERE cid=$cid");
        $n = db_fetch_object($lastcomments);
        $uid = $n->uid;
        $name = $n->name;
        $timestamp = $n->timestamp;
 
        $counts = db_query("SELECT count(*) FROM {comments} WHERE status=0 AND nid=$nid");
        $n = db_fetch_object($counts);
        $count = $n->{"count(*)"};
    } else {
        $timestamp = $m->changed;
        $name = NULL;
        $uid = $m->uid;
        $count = 0;
    }
    db_query("INSERT INTO {node_comment_statistics} (nid,last_comment_timestamp,last_comment_name,last_comment_uid,comment_count) VALUES ('$nid','$timestamp','$name','$uid','$count')");



    // Build Comment Threads for Node


    $comments=db_query("SELECT cid, pid FROM {comments} WHERE nid=$nid ORDER BY timestamp");
    
    $nodethread = 0;
    while ($comment = db_fetch_object($comments)){
                
        if ($comment->pid == '0'){
            $thread = int2vancode(++$nodethread) . "/";
        }
        else {
            
            $parent = _comment_load($comment->pid);
            // This is comment with a parent comment: we increase
            // the part of the thread value at the proper depth.

            // Get the parent comment

            // Strip the "/" from the end of the parent thread.
            $parent->thread = (string) rtrim((string) $parent->thread, '/');

            // Get the max value in _this_ thread.
            $max = db_result(db_query("SELECT MAX(thread) FROM {comments} WHERE thread LIKE '%s.%%' AND nid = %d", $parent->thread, $edit['nid']));

            if ($max == '') {
                // First child of this parent.
                $thread = $parent->thread .'.'. int2vancode(0) .'/';
              }
            else {
                // Strip the "/" at the end of the thread.
                $max = rtrim($max, '/');

                // We need to get the value at the correct depth.
                $parts = explode('.', $max);
                $parent_depth = count(explode('.', $parent->thread));
                $last = $parts[$parent_depth];

                // Finally, build the thread field for this new comment.
                $thread = $parent->thread .'.'. int2vancode(vancode2int($last) + 1) .'/';
            
            }
        }
        db_query("UPDATE {comments} set thread = '$thread' where cid = $comment->cid");
    }

}
sinasalek’s picture

Thanks sreiser, very useful

sina.salek.ws, Software Manager & Lead developer
Feel freedom with open source softwares

GreenReaper’s picture

I had 4000 threaded comments to import from PostNuke for the new version of flayrah.com. Would have taken me an age to figure this out - so thanks a lot.

I did have to change line 18 to:

$name = mysql_real_escape_string($n->name);

to handle the case where a name had a quotation mark in it, though.

ziobudda’s picture

Hi, the correct function name to use is: db_escape_string()

M.

Freelancer Senior Drupal Developer -- http://www.ziobuddalabs.it

yngens’s picture

Sean,

Your script saved me from lot's of headaches, thank you very much!

yngens’s picture

Hi Sean,

I am trying to use your excellent script to correct thread and comment statistics for a huge website with more than 17000 nodes being imported to Drupal.

It started to work very smoothly in the beginning and correctly showed number of comments for the first nodes. After it would went on for very long time (couple hours maybe) it had stuck at node #15945, for which and all nodes after which it did not show correct number of comments. I ran script again and it stuck exactly at the same node.

So my question, is it possible to and if yes how exactly I modify the script so that it started to crawl from the node #15945 and not from the very beginning if I start it again?

Thanks,

Yngens

seanreiser’s picture

Hi Yngens-

I know you've been persistant trying to get in touch with me today about this. I'm sorry I haven't been available sooner, I was taking a day off. I really haven't thought about this code in about a year (I wrote it, it did what I needed, I moved on), but looking at the code I don't see why it would stop at 15945 (or any other node).

First off, where the code hasn't been written to limit the run to a subset of nodes you could add a "where nid => XXXXX" to the first 3 lines in order to run against a certain group of nodes.

To debug this I'd start off starting a run at node 15945. If it hangs up again I'd start at 15946 and see if it runs through to the end. If it does there is some corruption in the comment chain for 15495 (example 2 nodes pointing at each other as their parent).

I will repost this the other places you've left this comment and reply to your email.

yngens’s picture

Sean,

I just deleted all the comments and the problematic node itself and ran script without any interruption again. It worked great this time, thank you!

l33tdawg’s picture

Been working on migrating comments from Postnuke->Drupal 7 and had to modify Sean's script above - use at your own risk!

// Simple script to rebuild comment threads in Drupal D7 by l33tdawg@hackinthebox.org
// Run this using the 'Execute PHP Code' function inside the Devel Module 

	// Flush the comment_statistics table and update the comment thread to '' for all comments 

	db_query("UPDATE {comment} set thread = ''");
	db_query("DELETE FROM {node_comment_statistics}");

	echo ("SUCCESS! Table node_comment_statistics truncated and thread comments have been reset\n"); 
	
	// Grab the node ID, the user ID and the UNIX timestamp 
 
	$nodes = db_query("SELECT nid, uid, changed FROM {node} WHERE status=1");
	foreach ($nodes as $record) { 
	    $nid = $record->nid;
	 // Calculate node_comment_statistics

    	$comments=db_query("SELECT nid, max(cid) FROM {comment} WHERE status=1 AND nid=$nid GROUP by nid");
    	$l = $comments->fetchObject();  
		if ($l != NULL) {
 			    	$cid = $l->{"max(cid)"};
	        		$lastcomments = db_query("SELECT uid, name, created FROM {comment} WHERE cid=$cid");
		    	    $n = $lastcomments->fetchObject(); 
					$uid = $n->uid;
				    $name = $n->name;
			    	$timestamp = $n->created;
					$count_comment = db_query("SELECT status FROM {comment} WHERE status=1 AND nid=$nid");
					$count = $count_comment->rowCount(); 
        			}
	    else { 
					$cid = 0; 
	   		     	$timestamp = $record->changed;
			      	$name = NULL;
        			$uid = $record->uid;
        			$count = 0;
     	      } 		

	 // Start updating the node_comment_statistics table 
		
		$update_node_comment_statistics = db_insert('node_comment_statistics')->fields(array(
	    	'nid' => $nid,
    		'last_comment_timestamp' => $record->changed, 
    		'last_comment_name' => $name,
	    	'last_comment_uid' => $uid, 
    		'comment_count' => $count, 
    		))->execute();    			
	}
 	
	// Start building the comment threads for the nodes 
    
	$nodes = db_query("SELECT nid, uid, changed FROM {node} WHERE status=1");
	foreach ($nodes as $record) {
			$nid = $record->nid; 
		    $comments=db_query("SELECT cid, pid FROM {comment} WHERE nid=$nid ORDER BY created");
		    $nodethread = 0;
		    foreach ($comments as $comment) { 

    	    	   if ($comment->pid == '0'){
		       $thread = int2vancode(++$nodethread) . "/";
        		} else {

            		// This is comment with a parent comment so we need to increate the thread value at the proper depth 
		         $parent = comment_load($comment->pid); 

	             	// Get the parent comment for the thread start 
    	        	// Strip the "/" from the end of the parent thread.

			 $parent->thread = (string) rtrim((string) $parent->thread, '/');
    	    	    	$edit['nid'] = $record->nid; 

        	    	// Get the MAX VALUE in the current thread 
            
 $max = db_query("SELECT MAX(thread) FROM {comment} WHERE thread LIKE :string AND nid = :nid", array(':string' => $parent->thread,':nid' => $nid))->fetchField(); 
			
	                // Find the first child of this parent.
                         
			if ($max == '') {
              			$thread = $parent->thread .'.'. int2vancode(0) .'/';
              			}
            	   else {
                
	                // Strip the "/" at the end of the thread.
                
    		            $max = rtrim($max, '/');

            	    // We need to get the value at the correct depth.
                
                	 $parts = explode('.', $max);
		          $parent_depth = count(explode('.', $parent->thread));
		          $last = $parts[$parent_depth];

        	        // Finally, build the thread field for this new comment
                
            		  $thread = $parent->thread .'.'. int2vancode(vancode2int($last) + 1) .'/';
            
           		 }

        	     }	
		$update_thread = db_update('comment') 
		->fields(array('thread' => $thread)) 
		->condition('cid', $comment->cid, '=')
		->execute(); 
		
   	    	}
  } 
  	echo "SUCCESS! node_comment_statistics table has been rebuilt!\n";     	
	echo "SUCCESS! Comment threads have been rebuilt!\n\n"; 
	echo "ALL DONE!\n"; 	

// END-OF-FILE

Regards,
L33tdawg@hackinthebox.org

matthensley’s picture

After beating my head against this a bit myself, I found that the import script above was close but wasn't quite getting the desired result, at least for a thread with multiple levels of nesting. I looked at the equivalent chunk of code in comment_save and updated the thread setting behavior for comments without parents.

This script should correctly update comment threads:

<?php
db_query("UPDATE {comment} set thread = ''");
    
	$nodes = db_query("SELECT nid, uid, changed FROM {node} WHERE status=1");
	foreach ($nodes as $record) {
		$nid = $record->nid; 
		$comments=db_query("SELECT cid, pid, nid FROM {comment} WHERE nid=$nid ORDER BY created");
		$nodethread = 0;
		foreach ($comments as $comment) { 
			if ($comment->pid == '0'){
				 // This is a comment with no parent comment (depth 0): we start
				// by retrieving the maximum thread level.
				$max = db_query('SELECT MAX(thread) FROM {comment} WHERE nid = :nid', array(':nid' => $comment->nid))->fetchField();
				// Strip the "/" from the end of the thread.
				$max = rtrim($max, '/');
				// We need to get the value at the correct depth.
				$parts = explode('.', $max);
				$firstsegment = $parts[0];
				// Finally, build the thread field for this new comment.
				$thread = int2vancode(vancode2int($firstsegment) + 1) . '/';
			} else {
				// This is a comment with a parent comment, so increase the part of the
				// thread value at the proper depth.
				// Get the parent comment:
				$parent = comment_load($comment->pid);
				// Strip the "/" from the end of the parent thread.
				$parent->thread = (string) rtrim((string) $parent->thread, '/');
				// Get the max value in *this* thread.
				$max = db_query("SELECT MAX(thread) FROM {comment} WHERE thread LIKE :thread AND nid = :nid", array(
				  ':thread' => $parent->thread . '.%',
				  ':nid' => $comment->nid,
				))->fetchField();
				if ($max == '') {
				  // First child of this parent.
				  $thread = $parent->thread . '.' . int2vancode(0) . '/';
				}
				else {
				  // Strip the "/" at the end of the thread.
				  $max = rtrim($max, '/');
				  // Get the value at the correct depth.
				  $parts = explode('.', $max);
				  $parent_depth = count(explode('.', $parent->thread));
				  $last = $parts[$parent_depth];
				  // Finally, build the thread field for this new comment.
				  $thread = $parent->thread . '.' . int2vancode(vancode2int($last) + 1) . '/';
				}
			}	
			$update_thread = db_update('comment') 
			->fields(array('thread' => $thread)) 
			->condition('cid', $comment->cid, '=')
			->execute(); 
		}
	}  	
?>

After that, I'm relying on devel_rebuild_node_comment_statistics(); to update counts and the stats table.

Hope this helps someone else!

jwessels’s picture

The above update scripts did a partial job for me, so I had to run these two commands to finish the job for the counts and user names:

update node_comment_statistics, users set last_comment_name=users.name where node_comment_statistics.last_comment_uid=users.uid;
update node_comment_statistics set comment_count=(SELECT COUNT(comment.nid) FROM comment WHERE node_comment_statistics.nid=comment.nid GROUP BY comment.nid);

I didn't update the cid field though yet in the comment table, but everything is appearing ok in Advanced Forum/Forum lists and counts.