Remove Duplicate Nodes based on title

I have tested this only once, and it worked, but use it at your own risk.

Based on code from Robert Douglass.

I have added a few lines so that it can be copied and pasted into a module.

Simply copy this into dedupe.module and create a module named "dedupe". More information about creating modules can be found here.


/**
 * @file
 * Remove Duplicate Nodes.
 */

/**
 * Implementation of hook_menu().
 */
function dedupe_menu() {
    $items['admin/content/dedupe'] = array(
        'title' => 'Dedupe',
        'description' => 'Delete Duplicate Nodes',
        'page callback' => 'drupal_get_form',
        'page arguments' => array('dedupe_content_command'),
        'access arguments' => array('administer site configuration'),
        'type' => MENU_NORMAL_ITEM,
    );
    return $items;
}
/**
 * Build Form
 */
function dedupe_content_command() {
    $options = node_get_types('names');
    $form['dedupe_node_types'] = array(
        '#type' => 'select',
        '#title' => t('You can select content type from which duplicates are removed.'),
        '#options' => $options,
        '#default_value' => variable_get('dedupe_node_types', 'page'),
        '#description' => t('Duplicate content from the selected content types will be deleted.'),
    );
    $form['submit'] = array(
        '#type' => 'submit',
        '#value' => t('Dedupe'),
    );
    return $form;
}
/**
 * Call delete function and set message.
 */
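function dedupe_content_command_submit($form, &$form_state) {
    $type = $form_state['values']['dedupe_node_types'];
    dedupe_delete($type);
    // Pass the type as a placeholder: t() needs a literal string to be
    // translatable, and @type is escaped for output automatically.
    drupal_set_message(t('Duplicates deleted from content type @type.', array('@type' => $type)));
}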
/**
 * Delete duplicates from selected content type, based on title. By Robert Douglass.
 * @see http://robshouse.net/blog-post/remove-duplicate-nodes-dedupe-based-title
 */
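function dedupe_delete($type) {
    $previous = NULL;
    // Restrict the outer query to the selected type as well, so that nodes of
    // other types that happen to share a title are left alone.
    $result = db_query("SELECT nid, title FROM {node}
    WHERE type = '%s' AND title IN
      (SELECT title FROM {node}
        WHERE type = '%s'
        GROUP BY title HAVING count(*) > 1)
    ORDER BY title, created DESC", $type, $type);
    while ($row = db_fetch_array($result)) {
        // Rows arrive newest first, so within each title every copy except
        // the last (oldest) one is deleted.
        if ($previous && $row['title'] == $previous['title']) {
            node_delete($previous['nid']);
        }
        $previous = $row;
    }
}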

Comments

You should make this an

bfodeke commented 5 January 2011 at 15:25

You should make this an actual module. I just used it and it works like a charm. It would be nice if it gave you a list of the nodes it is about to delete and provided a 'delete' button at the bottom of the list.

I had a huge list of about 2500 nodes, with about 900 of them being duplicates, so this saved me a huge amount of time. Thanks man!

Unique Field

deryck.henson commented 4 December 2011 at 22:42

If you're just starting out, you should incorporate the Unique Field module and minimize this problem from the start. There are many ways to configure it, too.

I replied to the wrong comment, but whatever.

The module works except that

zazinteractive commented 3 February 2011 at 00:44

The module works except that it seems to be deleting the first instance of the node. How can I make it delete the second one based on creation date?

It should work, havent tested

ruloweb commented 24 February 2011 at 09:18

It should work; I haven't tested it yet.

I just replaced 'created' with 'nid' in the ORDER BY clause :)

/**
 * @file
 * Remove Duplicate Nodes.
 */

/**
 * Implementation of hook_menu().
 */
function dedupe_menu() {
    $items['admin/content/dedupe'] = array(
        'title' => 'Dedupe',
        'description' => 'Delete Duplicate Nodes',
        'page callback' => 'drupal_get_form',
        'page arguments' => array('dedupe_content_command'),
        'access arguments' => array('administer site configuration'),
        'type' => MENU_NORMAL_ITEM,
    );
    return $items;
}
/**
 * Build Form
 */
function dedupe_content_command() {
    $options = node_get_types('names');
    $form['dedupe_node_types'] = array(
        '#type' => 'select',
        '#title' => t('You can select content type from which duplicates are removed.'),
        '#options' => $options,
        '#default_value' => variable_get('dedupe_node_types', array('page')),
        '#description' => t('Duplicate content from the selected content types will be deleted.'),
    );
    $form['submit'] = array(
        '#type' => 'submit',
        '#value' => t('Dedupe'),
    );
    return $form;
}
/**
 * Call delete function and set message.
 */
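function dedupe_content_command_submit($form, &$form_state) {
    $type = $form_state['values']['dedupe_node_types'];
    dedupe_delete($type);
    // Pass the type as a placeholder: t() needs a literal string to be
    // translatable, and @type is escaped for output automatically.
    drupal_set_message(t('Duplicates deleted from content type @type.', array('@type' => $type)));
}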
/**
 * Delete duplicates from selected content type, based on title. By Robert Douglass.
 * @see http://robshouse.net/blog-post/remove-duplicate-nodes-dedupe-based-title
 */
function dedupe_delete($type) {
    $previous = array();
    $result = db_query("SELECT nid, title FROM {node}
    WHERE title IN
      (SELECT title FROM {node}
        WHERE type = '%s'
        GROUP BY title HAVING count(*) > 1)
    ORDER BY title, nid DESC", $type);
    while ($row = db_fetch_array($result)) {
        if ($row['title'] == $previous['title']) {
            node_delete($previous['nid']);
        }
        $previous = $row;
    }
}

ruloweb

You just change DESC to ASC

zazinteractive commented 3 May 2011 at 05:06

You just change DESC to ASC in the code
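
To make the effect of the sort direction concrete, here is a minimal stand-alone sketch that runs without Drupal. dedupe_example_nids_to_delete() is a hypothetical helper, not part of the module; it repeats the comparison used in the loop of dedupe_delete() on rows that are assumed to be sorted already, and returns the nids that would be passed to node_delete().

```php
<?php
/**
 * Hypothetical helper (not part of the module): given rows already sorted by
 * title and then by nid, return the nids the loop would delete.
 */
function dedupe_example_nids_to_delete(array $rows) {
  $delete = array();
  $previous = NULL;
  foreach ($rows as $row) {
    // Same test as in dedupe_delete(): when the title repeats, the row that
    // came earlier in the sort order is the one deleted.
    if ($previous !== NULL && $row['title'] == $previous['title']) {
      $delete[] = $previous['nid'];
    }
    $previous = $row;
  }
  return $delete;
}

// Three copies of "A" (nids 1, 2, 3) plus an unrelated "B" for contrast.
$desc = array(
  array('nid' => 3, 'title' => 'A'),
  array('nid' => 2, 'title' => 'A'),
  array('nid' => 1, 'title' => 'A'),
  array('nid' => 4, 'title' => 'B'),
);
$asc = array(
  array('nid' => 1, 'title' => 'A'),
  array('nid' => 2, 'title' => 'A'),
  array('nid' => 3, 'title' => 'A'),
  array('nid' => 4, 'title' => 'B'),
);

// DESC order deletes nids 3 and 2, so the lowest nid (1) survives.
$deleted_desc = dedupe_example_nids_to_delete($desc);
// ASC order deletes nids 1 and 2, so the highest nid (3) survives.
$deleted_asc = dedupe_example_nids_to_delete($asc);
```

In both cases all copies but one are removed; the sort direction only decides which copy is left.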

Need help

rahulwelcome commented 18 October 2012 at 05:48

I have used the "Remove duplicate nodes" code. It works for small data sets, but I have more than 6 lakh (600,000) nodes, and when I run it on those it does not work. Please help me if you have a solution.

Thanks in advance

This would make a great VBO

giorgio79 commented 19 June 2011 at 15:13

This would make a great VBO action.

not work for me and get

Edward.H commented 7 July 2011 at 10:47

It does not work for me; I get an internal 500 error. Maybe it is because my MySQL database is too big and the process takes too long.

Anyone knows this in D7

Yorgg commented 6 October 2011 at 14:51

Does anyone have a working solution for Drupal 7?

Kind Regards,
Jorge

Came across this post and

blazindrop commented 19 October 2011 at 19:28

Came across this post and needed a D7 version also, so I spent some time converting it. I have done basic testing, but you're encouraged to do testing of your own! :)

Some things could be improved, but it works. I did change the form a bit so you can select what content types you want to dedupe, which is handy if you want to avoid deleting everything at once because of memory constraints.

The SQL logic is almost the same, except if the node created times are the same, the highest nid will survive the delete.

/**
* @file
* Remove Duplicate Nodes.
*/

/**
* Implementation of hook_menu().
*/
function nodededupe_menu() {
    $items['admin/content/dedupe'] = array(
        'title' => 'Dedupe',
        'description' => 'Delete Duplicate Nodes',
        'page callback' => 'drupal_get_form',
        'page arguments' => array('nodededupe_content_command'),
        'access arguments' => array('administer site configuration'),
        'type' => MENU_NORMAL_ITEM,
        'file' => NULL,
    );
    return $items;
}
/**
* Build Form
*/
function nodededupe_content_command() {
    $options = array();
    $types = node_type_get_types();
    foreach ($types as $type) {
      $options[$type->type] = $type->name;
    }
    $form['nodededupe_node_types'] = array(
        '#type' => 'checkboxes',
        '#title' => t('You can select content type from which duplicates are removed.'),
        '#options' => $options,
        //'#default_value' => variable_get('nodededupe_node_types', array('page')),
        '#description' => t('Duplicate content from the selected content types will be deleted.'),
    );
    $form['submit'] = array(
        '#type' => 'submit',
        '#value' => t('Dedupe'),
    );
    return $form;
}
/**
* Call delete function and set message.
*/
function nodededupe_content_command_submit($form, $form_state) {
    $types_del = array();
    $types = $form_state['values']['nodededupe_node_types'];
    foreach ($types as $idx => $type) {
      if ($type === 0) {
        continue;
      }
      $types_del[] = $type;
    }

    $nodededupe_m = 'Duplicates deleted from content types: '. implode(", ",$types_del);
    nodededupe_delete($types_del);
    drupal_set_message(check_plain($nodededupe_m));
}
/**
* Delete duplicates from selected content type, based on title. By Robert Douglass.
* @see http://robshouse.net/blog-post/remove-duplicate-nodes-dedupe-based-title
*/
function nodededupe_delete($types_del) {
    $prevobj = array();
    $result = db_query("SELECT n.nid, n.title, n.created FROM node n
    	inner join
      	(SELECT title FROM node
        	WHERE type in (:types)
        	GROUP BY title HAVING count(*) > 1) n2
     	on n.title = n2.title
    ORDER BY title, created, nid DESC", array(':types' => $types_del));
    
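    foreach ($result as $obj) {
        // $prevobj is empty on the first row, so test it before comparing.
        if ($prevobj && $obj->title == $prevobj->title) {
            node_delete((int)$obj->nid);
            // Use placeholders so the node title is escaped in the message.
            drupal_set_message(t('Deleted node nid=@nid, title=@title, created=@created', array('@nid' => $obj->nid, '@title' => $obj->title, '@created' => $obj->created)));
        }
        $prevobj = $obj;
    }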
}

... and here is the .info file I was using:

name = Dedupe
description = "Node dedupe module taken from http://drupal.org/node/720190"
core = 7.x
package = Custom

files[] = nodededupe.module

I confirm that this D7

Yuri commented 10 November 2011 at 03:24

I confirm that this D7 version works. Created nodededupe.module and nodededupe.info. Form available at admin/content/dedupe.
Thanks blazindrop!

Turn into a module

mgifford commented 6 February 2012 at 12:01

I just used the D7 version too. This really should get project status even though it's a pretty simple module.

I do think a confirmation stage would be a nice option, though it seemed to work fine.

Working with tools like Migrate, there are times when you can get multiple versions of the same content pretty easily.

Got it working

trazom commented 14 February 2012 at 18:13

For those unsure about how to install:

--Create a document with the code above, call it dedupe.module

--Create a basic dedupe.info file (an install file is not needed)

--Create a directory in sites/all/modules called dedupe

--Upload both files to that folder

--Activate the module (in the "Other" category)

--Run cron so that the module is detected

--Go to admin/content/dedupe and select the content type

How i can use this with

ilbeppe commented 15 May 2012 at 11:17

How can I use this with Rules?

Working

Yorgg commented 19 July 2012 at 18:42

Thanks blazindrop.
This works for me.

Nodding in agreement

Anonymous (not verified) commented 29 August 2012 at 14:05

This is awesome.

I was plodding through the manual method of deleting duplicates (not for the first time) when this saved me a whole load of ballache.

My compliments to the author(s)

Error in D7

westis commented 24 October 2012 at 10:20

I get a notice and a warning on the /admin/content/dedupe page and no form displays.

Notice: Undefined index: dedupe_content_command in drupal_retrieve_form() (line 760 of /var/www/nnn/includes/form.inc).
Warning: call_user_func_array() expects parameter 1 to be a valid callback, function 'dedupe_content_command' not found or invalid function name in drupal_retrieve_form() (line 795 of /var/www/nnn/includes/form.inc).

Not being a coder, I'm not sure what's wrong. Have followed the steps explained in these comments to create the two files, activate the module and run cron.

Beware of table prefixes

ericbobson commented 24 May 2013 at 12:20

Just wanted to add a note in case anybody is using this code and gets an error which states "table does not exist" - you need to make sure you add any table prefix you have set to the SQL query. E.g. if you have a table prefix "dr_" (so that the table "node" does not exist, but the table "dr_node" does), you must make the code reflect this. It stumped me for a few minutes! Perhaps in the long run it would be better to use Entity Queries.

That's why all uses of

dman commented 24 May 2013 at 13:36

That's why all uses of db_query() are expected to use "{tablename}" syntax instead of just "tablename".
If the example used that syntax, the table name "node" would have been correctly prefixed on the fly thanks to the db rewrite layer.

.dan. is the New Zealand Drupal Developer working on Government Web Standards
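
The effect of the "{tablename}" syntax can be illustrated with a small stand-alone sketch. example_prefix_tables() is a hypothetical, much simplified stand-in for the rewriting that Drupal's database layer performs (the real one also supports per-table prefixes); the dr_ prefix is the one from the comment above.

```php
<?php
/**
 * Hypothetical stand-in for Drupal's table prefixing, for illustration only.
 * It replaces each "{table}" token with the prefixed table name.
 */
function example_prefix_tables($sql, $prefix) {
  return strtr($sql, array('{' => $prefix, '}' => ''));
}

$with_braces = 'SELECT nid, title FROM {node} WHERE type IN (:types)';
$without_braces = 'SELECT nid, title FROM node WHERE type IN (:types)';

// With braces, the table name picks up the prefix: "... FROM dr_node ...".
$prefixed = example_prefix_tables($with_braces, 'dr_');

// Without braces there is nothing to rewrite, so the query still reads
// "FROM node", a table that does not exist on a prefixed site.
$unchanged = example_prefix_tables($without_braces, 'dr_');
```

In the D7 code above, that would mean writing {node} in both places where the query currently names the node table directly.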

So many questions

mcfilms commented 2 December 2011 at 18:01

Does anyone know if the D6 version of this module will delete the duplicate versions of MANY nodes with the same content? I had a bulk import that went wild and now I have dozens of instances of 25 duplicate nodes. Will this eliminate all but one?

Also, will it run as a batch? My database is pretty big with close to 100,000 nodes. Will this be a problem?

A list of some of the Drupal sites I have designed and/or developed can be viewed at motioncity.com

OK, here is a replacement full module

dman commented 31 January 2012 at 11:15

Thanks for this.
It provided a good starting point, but I found I had to do things the long way in the end.
So here is a sandbox project that takes this and runs (a long way) with it.
http://drupal.org/sandbox/dman/1422586

That nested SQL just couldn't even run for me. Though I've still referred to it as a starting point.
I had to clean up 400 nodes x 800 copies and even listing all of them was impossible from a MySQL console, never mind actually deleting.

So here is a mega-scalable version that does it all in multiple batches.
Anyone want to join in to make it a full project? I can promote it, but don't want to maintain yet another new one on my own. As I don't plan to need to use it a lot (it's really a one-off repair case) - if someone else thinks it's worth helping look after in the future that would be cool. Co-maintainer volunteers here?
D7 branch possible and basic.

.dan. is the New Zealand Drupal Developer working on Government Web Standards
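
The general idea of working through a large list in pieces can be sketched in plain PHP. This is not the code from the sandbox project; example_process_in_chunks() and the recording callback are hypothetical names used only for illustration, and the chunk size of 3 is arbitrary.

```php
<?php
/**
 * Hypothetical helper: hand a list of node IDs to a callback in fixed-size
 * chunks, so that no single step has to handle the whole list.
 */
function example_process_in_chunks(array $nids, $chunk_size, $callback) {
  $processed = 0;
  foreach (array_chunk($nids, $chunk_size) as $chunk) {
    call_user_func($callback, $chunk);
    $processed += count($chunk);
  }
  return $processed;
}

// Stand-in callback that only records what it was given. In a Drupal 7
// module, this is where a call such as node_delete_multiple($chunk) would go.
$seen = array();
$record = function (array $chunk) use (&$seen) {
  $seen[] = $chunk;
};

// Seven IDs in chunks of three: (1, 2, 3), (4, 5, 6) and (7).
$total = example_process_in_chunks(range(1, 7), 3, $record);
```

Chunking inside a single page request still runs into the same time limit. Drupal's Batch API (batch_set()) exists to spread such chunks over several requests.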

Notes for the uninitiated

queenvictoria commented 2 November 2012 at 17:30

Thanks dman. Good stuff.

A few notes that might trip people up.

1. clone the repo
$ git clone http://git.drupal.org/sandbox/dman/1422586.git deduplicate_nodes

2. the repo is empty!?!
$ cd deduplicate_nodes; git branch -a

Oh no it has two branches. Checkout the right one
$ git checkout -b 7.x-1.x origin/7.x-1.x

3. visit the admin page http://example.com/admin/content/dedupe

Oh no white screen of death!?! (Only me perhaps).
$ touch node.admin.inc

Different use case: Find similar nodes, don't delete them

asb commented 1 May 2012 at 06:58

I have a different use case; I need to find "similar" nodes (based on node title, and/or a CCK text field), and I only want to list them (not bulk delete them).

Example: movie titles like "My Movie (1991)", and "My Movie (1995)".

Ideally I'm looking for something like a VBO action that works like Unique field or Uniqueness, but doesn't require 100% identical strings, and plugs into Views. Ideally, the action (optionally) would work with a "fuzzy" algorithm like soundex to be able to locate similar titles like "photography" and "fotography" as well.

Has anyone seen something like this?

Non-trivial problem. Worthy

dman commented 1 May 2012 at 07:40

Non-trivial problem.
Worthy of a masters thesis if anyone can solve it for you...

OTOH, perhaps it's not so silly. I've never used soundex for real, but .. you may have a chance
http://www.madirish.net/node/85

.dan. is the New Zealand Drupal Developer working on Government Web Standards
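
For anyone who wants to experiment with the "photography" / "fotography" example, here is a small sketch using only PHP's built-in string functions. The threshold of 2 and the helper name example_titles_are_similar() are assumptions made for illustration, not a recommendation.

```php
<?php
// soundex() keeps the first letter of the word, so these two spellings get
// different codes even though they sound alike.
$soundex_a = soundex('photography'); // "P326"
$soundex_b = soundex('fotography');  // "F326"

// levenshtein() counts single-character edits: drop the "p" and turn the
// "h" into an "f", which makes a distance of 2.
$distance = levenshtein('photography', 'fotography');

/**
 * Hypothetical helper: treat two titles as "similar" when their edit
 * distance is at or below a chosen threshold.
 */
function example_titles_are_similar($a, $b, $max_distance = 2) {
  return levenshtein(strtolower($a), strtolower($b)) <= $max_distance;
}

$similar = example_titles_are_similar('Photography', 'fotography');
```

A plain edit distance would also call "My Movie (1991)" and "My Movie (1995)" similar, since they differ by one character, so the threshold needs tuning for the titles at hand.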

have you looked at

tomhung commented 7 September 2012 at 20:51

Have you looked at http://drupal.org/project/fuzzysearch yet?

how can i make a CCK field version ?

ramirojoaquin commented 14 September 2012 at 16:37

Hi everyone. I recently imported 73,000 Ubercart products (books), split across 29 CSV files, with nodeimport. I had some errors in the process and need to re-run some of the files (it takes me an entire week!). Every product has an SKU field (I put the ISBN there).
In this case some titles are the same, and that is OK. The field that indicates the uniqueness of the node is the SKU field provided by Ubercart.
So, is there a way to adapt this code to work with other fields?
I am not an expert in MySQL queries, so I need some help!

This is the query of a view that selects the product node type and shows the SKU field:

SELECT node.nid AS nid, uc_products.model AS uc_products_model FROM node node LEFT JOIN uc_products uc_products ON node.vid = uc_products.vid WHERE node.type in ('product')

Thanks!
