I have tested this only once, and it worked. Use it at your own risk.

Based on code from Robert Douglass.

I have added a few lines to make it copy-and-paste-able into a module.

Simply copy this to dedupe.module and create a module named "dedupe". More info about creating modules can be found here.


<?php
/**
 * @file
 * Remove Duplicate Nodes.
 */

/**
 * Implementation of hook_menu().
 */
function dedupe_menu() {
    $items['admin/content/dedupe'] = array(
        'title' => 'Dedupe',
        'description' => 'Delete Duplicate Nodes',
        'page callback' => 'drupal_get_form',
        'page arguments' => array('dedupe_content_command'),
        'access arguments' => array('administer site configuration'),
        'type' => MENU_NORMAL_ITEM,
    );
    return $items;
}
/**
 * Build Form
 */
function dedupe_content_command() {
    $options = node_get_types('names');
    $form['dedupe_node_types'] = array(
        '#type' => 'select',
        '#title' => t('You can select content type from which duplicates are removed.'),
        '#options' => $options,
        '#default_value' => variable_get('dedupe_node_types', array('page')),
        '#description' => t('Duplicate content from the selected content types will be deleted.'),
    );
    $form['submit'] = array(
        '#type' => 'submit',
        '#value' => t('Dedupe'),
    );
    return $form;
}
/**
 * Call delete function and set message.
 */
function dedupe_content_command_submit($form, &$form_state) {
    $type = $form_state['values']['dedupe_node_types'];
    dedupe_delete($type);
    // Pass a literal string to t() and let the @ placeholder escape the value.
    drupal_set_message(t('Duplicates deleted from content type @type', array('@type' => $type)));
}
/**
 * Delete duplicates from selected content type, based on title. By Robert Douglass.
 * @see http://robshouse.net/blog-post/remove-duplicate-nodes-dedupe-based-title
 */
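function dedupe_delete($type) {
    $previous = NULL;
    // Restrict the outer query to the selected type as well; otherwise nodes of
    // other types that happen to share a title would also be returned.
    $result = db_query("SELECT nid, title FROM {node}
    WHERE type = '%s' AND title IN
      (SELECT title FROM {node}
        WHERE type = '%s'
        GROUP BY title HAVING count(*) > 1)
    ORDER BY title, created DESC", $type, $type);
    while ($row = db_fetch_array($result)) {
        // Rows arrive newest first within each title, so every row except the
        // last (oldest) one of a title is deleted.
        if ($previous !== NULL && $row['title'] === $previous['title']) {
            node_delete($previous['nid']);
        }
        $previous = $row;
    }
}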

Comments

bfodeke’s picture

You should make this an actual module. I just used it and it works like a charm. It would be nice if it gave you a list of the nodes it is about to delete, with a 'delete' button at the bottom of the list.

I had a huge list of about 2500 nodes, with about 900 of them being duplicates, so this saved me a huge amount of time. Thanks man!
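A preview like the one requested above could start from a function that only collects the rows the loop would delete. This is a minimal sketch in plain PHP, not part of the posted module: dedupe_preview_rows() and the sample rows are hypothetical, and the rows are assumed to arrive in the same order as the query (title, then created DESC).

```php
<?php
/**
 * Hypothetical helper: given rows sorted by title (then by age), return the
 * rows that the dedupe loop would delete, without deleting anything.
 */
function dedupe_preview_rows(array $rows) {
  $to_delete = array();
  $previous = NULL;
  foreach ($rows as $row) {
    // Same rule as the delete loop: when two neighbours share a title,
    // the earlier of the two in the sorted list is the one removed.
    if ($previous !== NULL && $row['title'] === $previous['title']) {
      $to_delete[] = $previous;
    }
    $previous = $row;
  }
  return $to_delete;
}

// Sample rows (made up) in the order the query would return them.
$rows = array(
  array('nid' => 7, 'title' => 'About us'),
  array('nid' => 3, 'title' => 'About us'),
  array('nid' => 9, 'title' => 'Contact'),
  array('nid' => 8, 'title' => 'News'),
  array('nid' => 5, 'title' => 'News'),
  array('nid' => 2, 'title' => 'News'),
);

foreach (dedupe_preview_rows($rows) as $row) {
  print "Would delete nid {$row['nid']} ({$row['title']})\n";
}
```

In a real module the returned rows could feed a confirmation form, and node_delete() would only run after the 'delete' button is pressed.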

deryck.henson’s picture

If you're just starting out, you should incorporate the Unique Field module and minimize this problem from the start. So many ways to configure too.

I replied to the wrong comment, but whatever.

zazinteractive’s picture

The module works except that it seems to be deleting the first instance of the node. How can I make it delete the second one based on creation date?

ruloweb’s picture

It should work, though I haven't tested it yet.

I just changed the ORDER BY criterion from 'created' to 'nid' :)

<?php
/**
 * @file
 * Remove Duplicate Nodes.
 */

/**
 * Implementation of hook_menu().
 */
function dedupe_menu() {
    $items['admin/content/dedupe'] = array(
        'title' => 'Dedupe',
        'description' => 'Delete Duplicate Nodes',
        'page callback' => 'drupal_get_form',
        'page arguments' => array('dedupe_content_command'),
        'access arguments' => array('administer site configuration'),
        'type' => MENU_NORMAL_ITEM,
    );
    return $items;
}
/**
 * Build Form
 */
function dedupe_content_command() {
    $options = node_get_types('names');
    $form['dedupe_node_types'] = array(
        '#type' => 'select',
        '#title' => t('You can select content type from which duplicates are removed.'),
        '#options' => $options,
        '#default_value' => variable_get('dedupe_node_types', array('page')),
        '#description' => t('Duplicate content from the selected content types will be deleted.'),
    );
    $form['submit'] = array(
        '#type' => 'submit',
        '#value' => t('Dedupe'),
    );
    return $form;
}
/**
 * Call delete function and set message.
 */
function dedupe_content_command_submit($form, &$form_state) {
    $type = $form_state['values']['dedupe_node_types'];
    dedupe_delete($type);
    // Pass a literal string to t() and let the @ placeholder escape the value.
    drupal_set_message(t('Duplicates deleted from content type @type', array('@type' => $type)));
}
/**
 * Delete duplicates from selected content type, based on title. By Robert Douglass.
 * @see http://robshouse.net/blog-post/remove-duplicate-nodes-dedupe-based-title
 */
function dedupe_delete($type) {
    $previous = NULL;
    // Restrict the outer query to the selected type as well; otherwise nodes of
    // other types that happen to share a title would also be returned.
    $result = db_query("SELECT nid, title FROM {node}
    WHERE type = '%s' AND title IN
      (SELECT title FROM {node}
        WHERE type = '%s'
        GROUP BY title HAVING count(*) > 1)
    ORDER BY title, nid DESC", $type, $type);
    while ($row = db_fetch_array($result)) {
        // Rows arrive highest nid first within each title, so every row except
        // the last (lowest nid) one of a title is deleted.
        if ($previous !== NULL && $row['title'] === $previous['title']) {
            node_delete($previous['nid']);
        }
        $previous = $row;
    }
}

ruloweb

zazinteractive’s picture

You just change DESC to ASC in the code
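Since the loop deletes the previous row whenever the next row has the same title, the sort direction decides which copy survives. Here is a small plain-PHP sketch to check that on sample data before touching a real site. dedupe_survivors() and the rows are hypothetical; it only mimics the ORDER BY title, created ASC/DESC clause and the loop from the code above.

```php
<?php
/**
 * Hypothetical helper: return the surviving nid for each title.
 * $direction is 'DESC' or 'ASC', mirroring the ORDER BY clause.
 */
function dedupe_survivors(array $rows, $direction) {
  usort($rows, function ($a, $b) use ($direction) {
    $by_title = strcmp($a['title'], $b['title']);
    if ($by_title !== 0) {
      return $by_title;
    }
    $by_created = $a['created'] - $b['created'];
    return $direction === 'DESC' ? -$by_created : $by_created;
  });
  $survivors = array();
  foreach ($rows as $row) {
    // The loop deletes the previous row on a title match, so the last row
    // of each run of equal titles is the one that is kept.
    $survivors[$row['title']] = $row['nid'];
  }
  return $survivors;
}

// Two made-up copies of the same node: nid 1 is older than nid 2.
$rows = array(
  array('nid' => 1, 'title' => 'Hello', 'created' => 100),
  array('nid' => 2, 'title' => 'Hello', 'created' => 200),
);

print_r(dedupe_survivors($rows, 'DESC')); // keeps the older copy (nid 1)
print_r(dedupe_survivors($rows, 'ASC'));  // keeps the newer copy (nid 2)
```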

rahulwelcome’s picture

I have used the "Remove duplicate node" code. It works for small data sets, but I have more than 6 lakh (600,000) nodes, and on that data it does not work. Please help me if you have a solution.

Thanks in advance

giorgio79’s picture

This would make a great VBO action.

Edward.H’s picture

It does not work for me; I get an internal 500 error. Maybe my MySQL database is too big and the process runs too long.
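Timeouts like this usually mean too much work in one request. One possible way around it is to delete in small chunks. The sketch below is plain PHP and only shows the chunking idea; dedupe_delete_in_chunks() is a hypothetical name, and the callback stands in for node_delete(). In a real module each chunk would more likely be one Batch API operation or one cron run.

```php
<?php
/**
 * Hypothetical helper: run $delete on the given nids in chunks of $size.
 * Returns the number of chunks processed.
 */
function dedupe_delete_in_chunks(array $nids, $size, $delete) {
  $chunks = array_chunk($nids, $size);
  foreach ($chunks as $chunk) {
    foreach ($chunk as $nid) {
      call_user_func($delete, $nid);
    }
    // A real module would end the request here and continue with the next
    // chunk later, instead of looping over all chunks in one go.
  }
  return count($chunks);
}

// Stand-in for node_delete(): just record which nids were "deleted".
$deleted = array();
$count = dedupe_delete_in_chunks(range(1, 7), 3, function ($nid) use (&$deleted) {
  $deleted[] = $nid;
});

print "Processed $count chunks, " . count($deleted) . " nodes\n";
```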

Yorgg’s picture

Does anyone have a working solution for Drupal 7?

Kind Regards,
Jorge

blazindrop’s picture

Came across this post and needed a D7 version also, so I spent some time converting it. I have done basic testing, but you're encouraged to do testing of your own! :)

Some things could be improved, but it works. I did change the form a bit so you can select what content types you want to dedupe, which is handy if you want to avoid deleting everything at once because of memory constraints.

The SQL logic is almost the same, except if the node created times are the same, the highest nid will survive the delete.

<?php
/**
 * @file
 * Remove Duplicate Nodes.
 */

/**
* Implementation of hook_menu().
*/
function nodededupe_menu() {
    $items['admin/content/dedupe'] = array(
        'title' => 'Dedupe',
        'description' => 'Delete Duplicate Nodes',
        'page callback' => 'drupal_get_form',
        'page arguments' => array('nodededupe_content_command'),
        'access arguments' => array('administer site configuration'),
        'type' => MENU_NORMAL_ITEM,
        'file' => NULL,
    );
    return $items;
}
/**
* Build Form
*/
function nodededupe_content_command() {
    $options = array();
    $types = node_type_get_types();
    foreach ($types as $type) {
      $options[$type->type] = $type->name;
    }
    $form['nodededupe_node_types'] = array(
        '#type' => 'checkboxes',
        '#title' => t('You can select content type from which duplicates are removed.'),
        '#options' => $options,
        //'#default_value' => variable_get('nodededupe_node_types', array('page')),
        '#description' => t('Duplicate content from the selected content types will be deleted.'),
    );
    $form['submit'] = array(
        '#type' => 'submit',
        '#value' => t('Dedupe'),
    );
    return $form;
}
/**
 * Call delete function and set message.
 */
function nodededupe_content_command_submit($form, &$form_state) {
    // Unchecked checkboxes come back as 0; array_filter() keeps only the checked types.
    $types_del = array_values(array_filter($form_state['values']['nodededupe_node_types']));
    if (empty($types_del)) {
        // An empty list cannot be used in the IN (...) clause of the query.
        drupal_set_message(t('No content types selected.'), 'warning');
        return;
    }
    nodededupe_delete($types_del);
    drupal_set_message(t('Duplicates deleted from content types: @types', array('@types' => implode(', ', $types_del))));
}
/**
 * Delete duplicates from selected content type, based on title. By Robert Douglass.
 * @see http://robshouse.net/blog-post/remove-duplicate-nodes-dedupe-based-title
 */
function nodededupe_delete($types_del) {
    $prevobj = NULL;
    // The outer query is limited to the selected types too, so nodes of other
    // types that share a title are left alone.
    $result = db_query("SELECT n.nid, n.title, n.created FROM node n
        INNER JOIN
        (SELECT title FROM node
            WHERE type IN (:types)
            GROUP BY title HAVING count(*) > 1) n2
        ON n.title = n2.title
    WHERE n.type IN (:outer_types)
    ORDER BY n.title, n.created, n.nid DESC", array(':types' => $types_del, ':outer_types' => $types_del));

    foreach ($result as $obj) {
        // Within each title the oldest row comes first and is kept; every
        // later row with the same title is deleted.
        if ($prevobj !== NULL && $obj->title === $prevobj->title) {
            node_delete((int) $obj->nid);
            drupal_set_message(t('Deleted node nid=@nid, title=@title, created=@created', array('@nid' => $obj->nid, '@title' => $obj->title, '@created' => $obj->created)));
        }
        $prevobj = $obj;
    }
}

... and here is the .info file I was using:

name = Dedupe
description = "Node dedupe module taken from http://drupal.org/node/720190"
core = 7.x
package = Custom

files[] = nodededupe.module

Yuri’s picture

I confirm that this D7 version works. Created nodededupe.module and nodededupe.info. Form available at admin/content/dedupe.
Thanks blazindrop!

mgifford’s picture

I just used the D7 version too. This really should get project status even though it's a pretty simple module.

I do think a confirmation stage would be a nice option, though it seemed to work fine.

Working with tools like Migrate, there are times when you can get multiple versions of the same content pretty easily.

trazom’s picture

For those unsure about how to install:

--Create a document with the code above and call it dedupe.module (or nodededupe.module for the Drupal 7 version, so that the file name matches the function prefix).

--Create a basic .info file with the same base name (dedupe.info or nodededupe.info). An install file is not needed.

--Create a directory in sites/all/modules with that same name.

--Upload both files to that folder.

--Activate the module (in the "Other" category).

--Run cron so that the module is detected.

--Go to admin/content/dedupe and select the content type.

ilbeppe’s picture

How can I use this with Rules?

Yorgg’s picture

Thanks blazindrop.
This works for me.

Anonymous’s picture

This is awesome.

I was plodding through the manual method of deleting duplicates (not for the first time) when this saved me a whole load of ballache.

My compliments to the author(s)

westis’s picture

I get a notice and a warning on the /admin/content/dedupe page and no form displays.

Notice: Undefined index: dedupe_content_command in drupal_retrieve_form() (line 760 of /var/www/nnn/includes/form.inc).
Warning: call_user_func_array() expects parameter 1 to be a valid callback, function 'dedupe_content_command' not found or invalid function name in drupal_retrieve_form() (line 795 of /var/www/nnn/includes/form.inc).

Not being a coder, I'm not sure what's wrong. Have followed the steps explained in these comments to create the two files, activate the module and run cron.

ericbobson’s picture

Just wanted to add a note in case anybody is using this code and gets an error which states "table does not exist" - you need to make sure you add any table prefix you have set to the SQL query. E.g. if you have a table prefix "dr_" (so that the table "node" does not exist, but the table "dr_node" does), you must make the code reflect this. It stumped me for a few minutes! Perhaps in the long run it would be better to use Entity Queries.

dman’s picture

That's why all uses of db_query() are expected to use "{tablename}" syntax instead of just "tablename".
If the example used that syntax, the table name "node" would have been correctly prefixed on the fly thanks to the db rewrite layer.
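To illustrate the point about curly braces, here is a toy stand-in for the prefixing step. It is not Drupal's own implementation, only a simplified sketch of the idea: example_prefix_tables() is a hypothetical name, and "dr_" is the prefix from the comment above.

```php
<?php
/**
 * Simplified illustration of table prefixing; NOT Drupal's actual code.
 * Every {tablename} in the query is replaced by prefix + tablename.
 */
function example_prefix_tables($sql, $prefix) {
  return preg_replace_callback('/\{([A-Za-z0-9_]+)\}/', function ($match) use ($prefix) {
    return $prefix . $match[1];
  }, $sql);
}

// With braces the table name picks up the prefix.
print example_prefix_tables('SELECT nid FROM {node} n', 'dr_') . "\n";
// Without braces the name is left alone and the query hits a missing table.
print example_prefix_tables('SELECT nid FROM node n', 'dr_') . "\n";
```

So in the D7 code it should be enough to write {node} in both places where the query says node, rather than hard-coding the prefix.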

mcfilms’s picture

Does anyone know if the D6 version of this module will delete the duplicate versions of MANY nodes with the same content? I had a bulk import that went wild and now I have dozens of instances of 25 duplicate nodes. Will this eliminate all but one?

Also, will it run as a batch? My database is pretty big with close to 100,000 nodes. Will this be a problem?

A list of some of the Drupal sites I have designed and/or developed can be viewed at motioncity.com

dman’s picture

Thanks for this.
It provided a good starting point, but I found I had to do things the long way in the end.
So here is a sandbox project that takes this and runs (a long way) with it.
http://drupal.org/sandbox/dman/1422586

That nested SQL just couldn't even run for me. Though I've still referred to it as a starting point.
I had to clean up 400 nodes x 800 copies and even listing all of them was impossible from a MySQL console, never mind actually deleting.

So here is a mega-scalable version that does it all in multiple batches.
Anyone want to join in to make it a full project? I can promote it, but don't want to maintain yet another new one on my own. As I don't plan to need to use it a lot (it's really a one-off repair case) - if someone else thinks it's worth helping look after in the future that would be cool. Co-maintainer volunteers here?
D7 branch possible and basic.

queenvictoria’s picture

Thanks dman. Good stuff.

A few notes that might trip people up.

1. clone the repo
$ git clone http://git.drupal.org/sandbox/dman/1422586.git deduplicate_nodes

2. the repo is empty!?!
$ cd deduplicate_nodes; git branch -a

Oh no it has two branches. Checkout the right one
$ git checkout -b 7.x-1.x origin/7.x-1.x

3. visit the admin page http://example.com/admin/content/dedupe

Oh no white screen of death!?! (Only me perhaps).
$ touch node.admin.inc

asb’s picture

I have a different use case; I need to find "similar" nodes (based on node title, and/or a CCK text field), and I only want to list them (not bulk delete them).

Example: movie titles like "My Movie (1991)", and "My Movie (1995)".

Ideally I'm looking for something like a VBO action that works like Unique field or Uniqueness, but doesn't require 100% identical strings, and plugs into Views. Ideally, the action (optionally) would work with a "fuzzy" algorithm like soundex to be able to locate similar titles like "photography" and "fotography" as well.

Has anyone seen something like this?

dman’s picture

Non-trivial problem.
Worthy of a masters thesis if anyone can solve it for you...

OTOH, perhaps it's not so silly. I've never used soundex for real, but .. you may have a chance
http://www.madirish.net/node/85
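For only listing similar titles, a rough starting point could be PHP's built-in levenshtein() function. Note that this swaps in edit distance instead of soundex: soundex keeps the first letter of a word, so "photography" and "fotography" would get different codes. The function name, the sample titles and the threshold of 2 are assumptions for illustration, and comparing every pair will be slow on a large site.

```php
<?php
/**
 * Hypothetical helper: return all pairs of titles whose edit distance
 * (ignoring case) is at most $max_distance. Nothing is deleted.
 */
function example_similar_titles(array $titles, $max_distance) {
  $pairs = array();
  $count = count($titles);
  for ($i = 0; $i < $count; $i++) {
    for ($j = $i + 1; $j < $count; $j++) {
      $a = strtolower($titles[$i]);
      $b = strtolower($titles[$j]);
      if (levenshtein($a, $b) <= $max_distance) {
        $pairs[] = array($titles[$i], $titles[$j]);
      }
    }
  }
  return $pairs;
}

$titles = array(
  'My Movie (1991)',
  'My Movie (1995)',
  'photography',
  'fotography',
  'Gardening',
);

foreach (example_similar_titles($titles, 2) as $pair) {
  print "Similar: {$pair[0]} / {$pair[1]}\n";
}
```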

ramirojoaquin’s picture

Hi everyone. I recently imported 73,000 Ubercart products (books), split across 29 CSV files, with nodeimport. I had some errors in the process and need to re-run some of the files (it takes me an entire week!). Every product has an SKU field (I put the ISBN there).
In this case some titles are the same, and that is OK. The field that indicates the uniqueness of the node is the SKU field provided by Ubercart.
So, is there a way to adapt this code to work with other fields?
I am not an expert in MySQL queries, so I need some help!

This is the query of a view that selects the product node type and shows the SKU field:

SELECT node.nid AS nid, uc_products.model AS uc_products_model FROM node node LEFT JOIN uc_products uc_products ON node.vid = uc_products.vid WHERE node.type in ('product')

Thanks!
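One way to approach this is to keep the same idea but compare on the SKU column instead of the title. The sketch below is plain PHP and only shows the comparison step. dedupe_duplicates_by_field() and the sample rows are hypothetical; the uc_products_model key is taken from the view query above. It keeps the first node seen for each SKU and reports the rest.

```php
<?php
/**
 * Hypothetical helper: return the nids of rows whose $field value has
 * already been seen on an earlier row. The first row per value is kept.
 */
function dedupe_duplicates_by_field(array $rows, $field) {
  $seen = array();
  $duplicates = array();
  foreach ($rows as $row) {
    $value = $row[$field];
    if (isset($seen[$value])) {
      $duplicates[] = $row['nid'];
    }
    else {
      $seen[$value] = $row['nid'];
    }
  }
  return $duplicates;
}

// Made-up rows shaped like the result of the view query.
$rows = array(
  array('nid' => 10, 'uc_products_model' => '9780140449136'),
  array('nid' => 11, 'uc_products_model' => '9780140449136'),
  array('nid' => 12, 'uc_products_model' => '9780140268867'),
);

print implode(', ', dedupe_duplicates_by_field($rows, 'uc_products_model')) . "\n";
```

Nodes 10 and 12 could share a title and would still both be kept, because only the SKU is compared.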

rahulwelcome’s picture

I have used the "Remove duplicate node" code. It works for small data sets, but I have more than 6 lakh (600,000) nodes, and on that data it does not work. Please help me if you have a solution.

Thanks in advance

dman’s picture

Did Deduplicate Nodes above not work for you? That adds batch processing to this sample code.

ambereyes’s picture

I tried the initial code and it works like a charm. Thanks for this, it saved me a bunch of time.
Katrina

agudivad’s picture

Nice way to detect duplicate content

pindaman’s picture

Is there a way to make this run with a cron?

Yorgg’s picture

Implement hook_cron() and call the delete function with an array of content types as the argument:

function yourmodulename_cron() {
    nodededupe_delete(array('content-type'));
}

Something like that.

ressa’s picture

Seems to be Remove Duplicates.

tanzeel’s picture

Suggest Similar Titles can be used in the first place to avoid duplicate titles.

Cheers

Thank you,
Tanzeel