I have tested this only once, and it worked, but use it at your own risk.
Based on code from Robert Douglass.
I have added a few lines to make it copy-paste-able into a module.
Simply copy this to dedupe.module, and create a module named "dedupe". More info about creating modules can be found here.
<?php

/**
 * @file
 * Remove Duplicate Nodes.
 */

/**
 * Implementation of hook_menu().
 */
function dedupe_menu() {
  $items = array();
  $items['admin/content/dedupe'] = array(
    'title' => 'Dedupe',
    'description' => 'Delete Duplicate Nodes',
    'page callback' => 'drupal_get_form',
    'page arguments' => array('dedupe_content_command'),
    'access arguments' => array('administer site configuration'),
    'type' => MENU_NORMAL_ITEM,
  );
  return $items;
}

/**
 * Build the form.
 */
function dedupe_content_command() {
  // Map of machine name => human-readable name for every content type.
  $options = node_get_types('names');
  $form['dedupe_node_types'] = array(
    '#type' => 'select',
    '#title' => t('Select the content type from which duplicates are removed.'),
    '#options' => $options,
    // A single-value select expects a scalar default, not an array.
    '#default_value' => variable_get('dedupe_node_types', 'page'),
    '#description' => t('Duplicate content from the selected content type will be deleted.'),
  );
  $form['submit'] = array(
    '#type' => 'submit',
    '#value' => t('Dedupe'),
  );
  return $form;
}

/**
 * Call the delete function and set a message.
 */
function dedupe_content_command_submit($form, &$form_state) {
  $type = $form_state['values']['dedupe_node_types'];
  dedupe_delete($type);
  // Pass the type as a placeholder so the string stays translatable and the
  // value is escaped by t().
  drupal_set_message(t('Duplicates deleted from content type @type.', array('@type' => $type)));
}
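
/**
 * Delete duplicates from the selected content type, based on title.
 *
 * By Robert Douglass.
 *
 * @see http://robshouse.net/blog-post/remove-duplicate-nodes-dedupe-based-title
 */
function dedupe_delete($type) {
  // Seed both keys so the first comparison does not raise an undefined-index
  // notice.
  $previous = array('nid' => NULL, 'title' => NULL);
  // The outer query is restricted to the same type as the subquery, so nodes
  // of other types that happen to share a title are left alone.
  $result = db_query("SELECT nid, title FROM {node}
    WHERE type = '%s' AND title IN
      (SELECT title FROM {node}
        WHERE type = '%s'
        GROUP BY title HAVING count(*) > 1)
    ORDER BY title, created DESC", $type, $type);
  while ($row = db_fetch_array($result)) {
    // Rows arrive newest first, so deleting the previous row keeps the
    // oldest copy of each title.
    if ($row['title'] === $previous['title']) {
      node_delete($previous['nid']);
    }
    $previous = $row;
  }
}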
Comments
You should make this an
You should make this an actual module. I just used it and it works like a charm. It would be nice if it gave you a list of the nodes it is about to delete, with a 'delete' button at the bottom of the list.
I had a huge list of about 2500 nodes, with about 900 of them being duplicates, so this saved me a huge amount of time. Thanks man!
Unique Field
If you're just starting out, you should use the Unique Field module and minimize this problem from the start. It has many configuration options too.
I replied to the wrong comment, but whatever.
The module works except that
The module works except that it seems to be deleting the first instance of the node. How can I make it delete the second one based on creation date?
It should work, havent tested
It should work, though I haven't tested it yet.
Just replace 'created' with 'nid' in the ORDER BY clause :)
ruloweb
You just change DESC to ASC
You just change DESC to ASC in the code
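Which copy survives depends only on the sort order and on which row the loop deletes. A minimal sketch in plain PHP, outside Drupal, of the same walk over sorted rows: the function name `dedupe_pick_deletions` and its `$keep` parameter are hypothetical, and the `$rows` array stands in for what the query would return. It sorts by title, then by `created`, then by `nid` as a tie-breaker (useful when a bulk import gives every copy the same timestamp), and it only returns the nids that would be deleted.

```php
<?php
/**
 * Return the nids that would be deleted, keeping one node per title.
 *
 * Hypothetical helper for illustration; it does not touch the database.
 *
 * @param array $rows
 *   Rows with 'nid', 'title' and 'created' keys.
 * @param string $keep
 *   'oldest' or 'newest'.
 */
function dedupe_pick_deletions(array $rows, $keep = 'oldest') {
  // Sort so that the row to keep comes first within each title.
  usort($rows, function ($a, $b) use ($keep) {
    if ($a['title'] !== $b['title']) {
      return strcmp($a['title'], $b['title']);
    }
    // Compare by creation time, then by nid so ties are deterministic.
    $cmp = ($a['created'] - $b['created']) ?: ($a['nid'] - $b['nid']);
    return $keep === 'oldest' ? $cmp : -$cmp;
  });

  $delete = array();
  $previous_title = NULL;
  foreach ($rows as $row) {
    if ($row['title'] === $previous_title) {
      // Not the first row for this title, so it is a duplicate.
      $delete[] = $row['nid'];
    }
    $previous_title = $row['title'];
  }
  return $delete;
}

$rows = array(
  array('nid' => 1, 'title' => 'About', 'created' => 100),
  array('nid' => 2, 'title' => 'About', 'created' => 200),
  array('nid' => 3, 'title' => 'Contact', 'created' => 150),
  array('nid' => 4, 'title' => 'About', 'created' => 200),
);
$delete_newer = dedupe_pick_deletions($rows, 'oldest');
$delete_older = dedupe_pick_deletions($rows, 'newest');
```

Because the function returns a list instead of deleting, the result can be reviewed before anything is passed to `node_delete()`.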
Need help
I used the "Remove duplicate nodes" code. It works for small data sets, but I have more than 6 lakh (600,000) nodes, and it does not work on that many. Please help if you have a solution.
Thanks in advance
This would make a great VBO
This would make a great VBO action.
not work for me and get
It does not work for me; I get an internal 500 error. Maybe my MySQL database is too big and the process runs for too long.
Anyone knows this in D7
Anyone has a working solution for Drupal 7?
Kind Regards,
Jorge
Came across this post and
Came across this post and needed a D7 version also, so I spent some time converting it. I have done basic testing, but you're encouraged to do testing of your own! :)
Some things could be improved, but it works. I did change the form a bit so you can select what content types you want to dedupe, which is handy if you want to avoid deleting everything at once because of memory constraints.
The SQL logic is almost the same, except if the node created times are the same, the highest nid will survive the delete.
... and here is the .info file I was using:
I confirm that this D7
I confirm that this D7 version works. Created nodededupe.module and nodededupe.info. Form available at admin/content/dedupe.
Thanks blazindrop!
Turn into a module
I just used the D7 version too. This really should get project status even though it's a pretty simple module.
I do think a confirmation stage would be a nice option, though it seemed to work fine.
Working with tools like Migrate, there are times when you can get multiple versions of the same content pretty easily.
Got it working
For those unsure about how to install:
--Create a document with the code above and call it dedupe.module.
--Create a basic dedupe.info file. (An install file is not needed.)
--Create a directory in sites/all/modules called dedupe.
--Upload both files to that folder.
--Activate the module (in the "Other" category).
--Run cron so that the module is detected.
--Go to admin/content/dedupe and select the content type.
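For the "basic dedupe.info file" step, a minimal sketch for Drupal 7, assuming the module directory and files are named dedupe. The name and description strings are placeholders, not the values from the missing listing above:

```ini
name = Dedupe
description = Deletes duplicate nodes based on title.
core = 7.x
```

No `package` line is set, which is why the module shows up under "Other". For the Drupal 6 code at the top of the page, the last line would be `core = 6.x`.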
How i can use this with
How can I use this with Rules?
Working
Thanks blazindrop.
This works for me.
Nodding in agreement
This is awesome.
I was plodding through the manual method of deleting duplicates (not for the first time) when this saved me a whole load of ballache.
My compliments to the author(s)
Error in D7
I get a notice and a warning on the /admin/content/dedupe page and no form displays.
Not being a coder, I'm not sure what's wrong. Have followed the steps explained in these comments to create the two files, activate the module and run cron.
Beware of table prefixes
Just wanted to add a note in case anybody is using this code and gets an error which states "table does not exist" - you need to make sure you add any table prefix you have set to the SQL query. E.g. if you have a table prefix "dr_" (so that the table "node" does not exist, but the table "dr_node" does), you must make the code reflect this. It stumped me for a few minutes! Perhaps in the long run it would be better to use Entity Queries.
That's why all uses of
That's why all uses of db_query() are expected to use "{tablename}" syntax instead of just "tablename".
If the example used that syntax, the table name "node" would have been correctly prefixed on the fly thanks to the db rewrite layer.
.dan. is the New Zealand Drupal Developer working on Government Web Standards
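To make the prefixing point concrete, here is a toy sketch in plain PHP of what the curly-brace syntax buys you. The function `example_prefix_tables` is hypothetical and is not Drupal's implementation, which lives in the database layer and also handles per-table prefixes; this only shows the substitution idea.

```php
<?php
/**
 * Toy version of table-prefix expansion: turns {node} into dr_node.
 *
 * Hypothetical helper for illustration only.
 */
function example_prefix_tables($sql, $prefix) {
  // Replace every {tablename} with the prefix followed by the table name.
  return preg_replace('/\{([A-Za-z0-9_]+)\}/', $prefix . '$1', $sql);
}

$sql = example_prefix_tables("SELECT nid, title FROM {node} WHERE type = 'page'", 'dr_');
```

With the braces in place, the same query string works on sites with and without a prefix, so nothing needs to be hard-coded.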
So many questions
Does anyone know if the D6 version of this module will delete the duplicate versions of MANY nodes with the same content? I had a bulk import that went wild and now I have dozens of instances of 25 duplicate nodes. Will this eliminate all but one?
Also, will it run as a batch? My database is pretty big with close to 100,000 nodes. Will this be a problem?
A list of some of the Drupal sites I have designed and/or developed can be viewed at motioncity.com
OK, here is a replacement full module
Thanks for this.
It provided a good starting point, but I found I had to do things the long way in the end.
So here is a sandbox project that takes this and runs (a long way) with it.
http://drupal.org/sandbox/dman/1422586
That nested SQL just couldn't even run for me. Though I've still referred to it as a starting point.
I had to clean up 400 nodes x 800 copies and even listing all of them was impossible from a MySQL console, never mind actually deleting.
So here is a mega-scalable version that does it all in multiple batches.
Anyone want to join in to make it a full project? I can promote it, but don't want to maintain yet another new one on my own. As I don't plan to need to use it a lot (it's really a one-off repair case) - if someone else thinks it's worth helping look after in the future that would be cool. Co-maintainer volunteers here?
D7 branch possible and basic.
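The core idea behind processing in batches can be sketched in plain PHP. This is not code from the sandbox project: the function `example_delete_in_chunks` and its callback parameter are hypothetical, and the callback stands in for whatever actually deletes a group of nodes (for example a wrapper around `node_delete_multiple()` in D7).

```php
<?php
/**
 * Process a long list of nids in fixed-size chunks.
 *
 * Hypothetical helper for illustration; returns the number of chunks handled.
 */
function example_delete_in_chunks(array $nids, $chunk_size, $delete_callback) {
  $chunks_done = 0;
  foreach (array_chunk($nids, $chunk_size) as $chunk) {
    // Hand one manageable group of nids to the delete routine.
    call_user_func($delete_callback, $chunk);
    $chunks_done++;
  }
  return $chunks_done;
}

// Demo: record the nids instead of deleting anything.
$deleted = array();
$chunks = example_delete_in_chunks(range(1, 10), 4, function (array $chunk) use (&$deleted) {
  $deleted = array_merge($deleted, $chunk);
});
```

Chunking alone still runs inside a single request. Inside Drupal, the Batch API is the usual way to spread the chunks over several requests so that no single one hits the time or memory limit.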
Notes for the uninitiated
Thanks dman. Good stuff.
A few notes that might trip people up.
1. clone the repo
$ git clone http://git.drupal.org/sandbox/dman/1422586.git deduplicate_nodes
2. the repo is empty!?!
$ cd deduplicate_nodes; git branch -a
Oh no it has two branches. Checkout the right one
$ git checkout -b 7.x-1.x origin/7.x-1.x
3. visit the admin page http://example.com/admin/content/dedupe
Oh no white screen of death!?! (Only me perhaps).
$ touch node.admin.inc
Different use case: Find similar nodes, don't delete them
I have a different use case; I need to find "similar" nodes (based on node title, and/or a CCK text field), and I only want to list them (not bulk delete them).
Example: movie titles like "My Movie (1991)", and "My Movie (1995)".
Ideally I'm looking for something like a VBO action that works like Unique Field or Uniqueness, but doesn't require 100% identical strings, and plugs into Views. Ideally, the action would (optionally) work with a "fuzzy" algorithm like soundex to be able to locate similar titles like "photography" and "fotography" as well.
Has anyone seen something like this?
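As a starting point for the listing part, here is a minimal sketch in plain PHP that swaps in edit distance (PHP's built-in `levenshtein()`) for soundex. Soundex codes keep the first letter of the word, so "photography" and "fotography" get different codes, while their edit distance is only 2. The function `example_similar_titles` and the threshold of 2 are assumptions for illustration; nothing here is wired into Views or VBO.

```php
<?php
/**
 * Return pairs of titles whose edit distance is within $max_distance.
 *
 * Hypothetical helper for illustration. It compares every pair, so it is
 * only practical for small lists.
 */
function example_similar_titles(array $titles, $max_distance = 2) {
  $titles = array_values($titles);
  $pairs = array();
  $count = count($titles);
  for ($i = 0; $i < $count; $i++) {
    for ($j = $i + 1; $j < $count; $j++) {
      // Compare case-insensitively.
      $distance = levenshtein(strtolower($titles[$i]), strtolower($titles[$j]));
      if ($distance <= $max_distance) {
        $pairs[] = array($titles[$i], $titles[$j]);
      }
    }
  }
  return $pairs;
}

$pairs = example_similar_titles(array(
  'My Movie (1991)',
  'My Movie (1995)',
  'photography',
  'fotography',
  'Contact',
));
```

The pairwise loop grows quadratically with the number of titles, so a large site would need to narrow the candidates first, for example by content type or by a shared prefix.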
Non-trivial problem. Worthy
Non-trivial problem.
Worthy of a masters thesis if anyone can solve it for you...
OTOH, perhaps it's not so silly. I've never used soundex for real, but .. you may have a chance
http://www.madirish.net/node/85
have you looked at
Have you looked at http://drupal.org/project/fuzzysearch yet?
how can i make a CCK field version ?
Hi everyone. I recently imported 73,000 Ubercart products (books), split across 29 CSV files, with Node Import. I had some errors in the process and need to re-run some of the files (it took me an entire week!). Every product has an SKU field (I put the ISBN there).
In this case some titles are the same, and that is OK. The field that determines the uniqueness of a node is the SKU field provided by Ubercart.
So, is there a way to adapt this code to work with other fields?
I am not an expert in MySQL queries, so I need some help!
This is the query of a view that selects nodes of type product and shows the SKU field:
SELECT node.nid AS nid, uc_products.model AS uc_products_model FROM node node LEFT JOIN uc_products uc_products ON node.vid = uc_products.vid WHERE node.type in ('product')
Thanks!!
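One way to approach this is to group the rows by the SKU column instead of the title. A minimal sketch in plain PHP, assuming `$rows` holds the result of a query like the one above with `nid` and `uc_products_model` columns. The function `example_duplicates_by_field` is hypothetical, and keeping the lowest nid per SKU is an arbitrary choice.

```php
<?php
/**
 * Group nids by a key field and return the extra nids for each repeated key.
 *
 * Hypothetical helper for illustration; it does not touch the database.
 */
function example_duplicates_by_field(array $rows, $field) {
  $by_key = array();
  foreach ($rows as $row) {
    $by_key[$row[$field]][] = (int) $row['nid'];
  }
  $duplicates = array();
  foreach ($by_key as $key => $nids) {
    if (count($nids) > 1) {
      sort($nids);
      // Keep the first (lowest) nid; the rest are duplicates.
      $duplicates[$key] = array_slice($nids, 1);
    }
  }
  return $duplicates;
}

$rows = array(
  array('nid' => 10, 'uc_products_model' => 'sku-1'),
  array('nid' => 11, 'uc_products_model' => 'sku-2'),
  array('nid' => 13, 'uc_products_model' => 'sku-1'),
  array('nid' => 12, 'uc_products_model' => 'sku-1'),
);
$duplicates = example_duplicates_by_field($rows, 'uc_products_model');
```

The result is keyed by SKU, so it can be printed as a review list before any of the nids are handed to `node_delete()`.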
How to remove huge amount of duplicate data
I used the "Remove duplicate nodes" code. It works for small data sets, but I have more than 6 lakh (600,000) nodes, and it does not work on that many. Please help if you have a solution.
Thanks in advance
4 posts up...
Did Deduplicate Nodes above not work for you? It adds batch processing to this sample code.
Thanks!
I tried the initial code and it works like a charm. Thanks for this, it saved me a bunch of time.
Katrina
Nice way to detect duplicate
Nice way to detect duplicate content
Is there a way to make this
Is there a way to make this run on cron?
Yes, by calling the cron hook
Implement hook_cron() in your own module and call the delete function, passing the content type as the argument:
function yourmodulename_cron() {
  // Replace 'content-type' with the machine name of the type to dedupe.
  nodededupe_delete('content-type');
}
Something like that.
The best solution for now...
Seems to be Remove Duplicates.
Suggest similar titles can be
Suggest Similar Titles can be used in the first place to avoid duplicate titles.
Cheers
Thank you,
Tanzeel