This module helps the administrator to remove duplicate nodes according to node titles or CCK fields.

  • The project is designed for D7
  • The project uses batch processing

There is a similar sandbox project here : "Deduplicate nodes" : hxxps:\\drupal.org\sandbox\dman\1422586

But :

  • The project was designed for D6
  • The project is only designed to remove duplicates according to node title field
  • I exchanged mails with the project owner who doesn't intend to expend or maintain his project and, very nicely, offered to endorse the current project.

Link to project page : https://drupal.org/sandbox/virtuoworks/2005296

Link to git repository : http://drupalcode.org/sandbox/virtuoworks/2005296.git

Comments

cybernetikz’s picture

Status: Needs review » Needs work

Used check_plain on static messages, no need to do that.

sami radi’s picture

Status: Needs work » Needs review

check_plain removed on static messages

PA robot’s picture

Status: Needs review » Needs work

There are some errors reported by automated review tools, did you already check them? See http://ventral.org/pareview/httpgitdrupalorgsandboxdman1422586git

We are currently quite busy with all the project applications and we prefer projects with a review bonus. Please help reviewing and put yourself on the high priority list, then we will take a look at your project right away :-)

Also, you should get your friends, colleagues or other community members involved to review this application. Let them go through the review checklist and post a comment that sets this issue to "needs work" (they found some problems with the project) or "reviewed & tested by the community" (they found no major flaws).

I'm a robot and this is an automated message from Project Applications Scraper.

sami radi’s picture

Status: Needs work » Needs review

The robot checked the WRONG project. That's the problem with robots, they're not very smart... (that's why I changed the link to the old D6 project in application form, hoping that the robot won't recognize it)

labor b’s picture

Status: Needs review » Needs work

Hi, thanks for your contribution. I just reviewed your project:

First of all, you should define a correct git clone command. Took me some time to figure out which project to clone:
git clone --branch 7.x-1.x http://git.drupal.org/sandbox/virtuoworks/2005296.git

Line 172ff: No need to cast types on php variables.

UI suggestions:
* On the "Remove duplicates" form there should be a big message saying "Be careful, you might be losing data! I recommend doing a backup before using this module."
* Also there is no safety line, like a checkbox or a confirmation I have to click. Given the fact, that there are default values for all fields. If a user access the form and accidentaly hits enter, some of his nodes might be gone.
* Leading to the next issue: The module isn't really clear, which nodes will be deleted. Also it doesn't tell me what has been deleted afterwards.

labor b’s picture

I somehow posted this twice...

sami radi’s picture

Status: Needs work » Needs review

The git clone command is updated.

Line 172ff: Casts removed

UI suggestions:
* Warning message added
* Confirmation page added
* Preview of the nodes which will be deleted on the confirmation page added.

hatuhay’s picture

Status: Needs review » Needs work

Verify that suggestions on #6 were done.

Couple of lines that cannot be translated on following function, description and title:

<?php
function remove_duplicates_menu() {

  $items['admin/config/content/remove_duplicates'] = array(
    'title' => 'Remove Duplicates',
    'description' => 'Delete Duplicate Contents',
    'page callback' => 'drupal_get_form',
    'page arguments' => array('remove_duplicates_settings_form'),
    'access arguments' => array('administer remove_duplicates'),
    'type' => MENU_NORMAL_ITEM,
  );
  return $items;
}
?>

From users experience point of view I have a three suggestions:

  • A hint to diferentiate each node, I will suggest date of creation/last edition, author
  • The chance to delete the other duplicate.
  • Look if nodes could be display in table format.

I will also suggest to use the watchdog function to leave registration of action.

Finnally, I suggest to evaluate using flags, views and VBO to reduce code on nodes handling.

ayesh’s picture

Hello,
I tested the module and it works, but here are some points that I believe needs work.

First, it's not very clear "which duplicates" this module can remove. It could be comments, taxonomy terms, users, and other entities as well.

- Do not call theme_item_list function directly. Use theme('item_list', $variables) pattern.
- isset($form_state['storage']['confirm']) && $form_state['storage']['confirm'] can be replaced with empty($form_state['storage']['confirm'])
- I think it would better if we replace "cancel" buttons with a simple link.
- Consider replacing raw database calls to fetch nodes matching a specific field with EntityFieldQuery (doc)

I wish you all the best, and thank you for your hard work on this useful module!

sami radi’s picture

Status: Needs review » Needs work

Hi,

@hatuhay

Review :

  • Couple of lines that cannot be translated on following function, description and title

Irrelevant, Titles and Descriptions should no longer be wrapped in t(), See : https://drupal.org/node/140311

  • A hint to diferentiate each node, I will suggest date of creation/last edition, author

Done.

  • The chance to delete the other duplicate.

Done. Added a tableselect layout.

  • Look if nodes could be display in table format.

Done. Added a table layout.

Why multiple layouts ? On my test db i have more than 3500+ duplicates and around 2000 duplicate groups. With a tableselect, the PHP setting max post size must be increased to get all the data. So, i let the user choose if he wants an autoselection (no post data, no problem) or a manual selection (with a warning according to the post size matter).

  • I will also suggest to use the watchdog function to leave registration of action.

Done at multiple levels.

  • Finnally, I suggest to evaluate using flags, views and VBO to reduce code on nodes handling.

Not convinced :

  1. Node handling is 15~20% of the code, 80% is UI and batch
  2. This module is intended to work complementary to feeds not flags + view + VBO.
  3. Raw querying is much more efficient than multiple irrelevant DB calls that could be made by other modules (this is a delete batch, no need for caching).

@Ayesh

Review :

  • First, it's not very clear "which duplicates" this module can remove. It could be comments, taxonomy terms, users, and other entities as well.

I changed titles and descriptions to replace the word "contents" with the word nodes. That should make it.

  • Do not call theme_item_list function directly. Use theme('item_list', $variables) pattern.

Done.

  • isset($form_state['storage']['confirm']) && $form_state['storage']['confirm'] can be replaced with empty($form_state['storage']['confirm'])

Done. Replaced isset($var) && $var with !empty($var). cf. PHP.net : empty() is essentially the concise equivalent to !isset($var) || $var == false.

  • I think it would better if we replace "cancel" buttons with a simple link.

Done.

  • Consider replacing raw database calls to fetch nodes matching a specific field with EntityFieldQuery (doc)

EntityFieldQuery does not support GROUP BY nor DISTINCT. See : https://drupal.org/node/1565708. Using raw database calls instead.

Thanks for your reviews.

sami radi’s picture

Status: Needs work » Needs review
kscheirer’s picture

Status: Needs work » Postponed (maintainer needs more info)
PAReview: Individual user account
It seems you are using a non-individual account.
All user accounts are for individuals. Accounts created for more than one user or those using anonymous mail services will be blocked when discovered (see Get a Drupal.org account).

Can you confirm that the virtuoworks account is a single user? Please change your account information and enter your realname.

If you prefer, we can also promote this sandbox to a full project without granting you "git vetted user" status.

----
Top Shelf Modules - Crafted, Curated, Contributed.

sami radi’s picture

Status: Postponed (maintainer needs more info) » Needs review

Hi,

It's a single user account. I changed my username (virtuoworks) to my real name to prove it.

Best regards.

kscheirer’s picture

Status: Needs review » Needs work
  • Why are you clearing messages in remove_duplicates_settings_form()? If there are some present, they are likely to be useful. Also in remove_duplicates_build_confirm_settings_form().
  • The multiple comments about t() not being needed in watchdog is kind of excessive, maybe just the first time?
  • Also, you don't need the last 3 parameters for watchdog if they are array(), WATCHDOG_INFO, NULL - those are the defaults anyway.
  • The menu item has an access check for administer remove_duplicates but I don't see a matching hook_permission() declaration.
  • The functions are just a big wall of text - that's mostly my problem, but I'm having trouble evaluating those.

Most of the code seems sane though. Needs work for the missing permission.

----
Top Shelf Modules - Crafted, Curated, Contributed.

sami radi’s picture

Status: Needs work » Needs review

Hi,

  • Why are you clearing messages in remove_duplicates_settings_form()? If there are some present, they are likely to be useful. Also in remove_duplicates_build_confirm_settings_form().

I was trying to prevent the user from being overflowed with to many messages. I removed the message clearing.

  • The multiple comments about t() not being needed in watchdog is kind of excessive, maybe just the first time?

Indeed, I wanted to be sure that I wouldn't get a KO because I didn't wrap the text in the t() function as it happened on review #8... Whatever I removed the unnecessary comments.

  • Also, you don't need the last 3 parameters for watchdog if they are array(), WATCHDOG_INFO, NULL - those are the defaults anyway.

The default is WATCHDOG_NOTICE. I removed the last parameter (i.e NULL) from my watchdog function calls.

  • The menu item has an access check for administer remove_duplicates but I don't see a matching hook_permission() declaration.

Right ! I forgot to put the matching hook, thanks. That's done.

  • The functions are just a big wall of text - that's mostly my problem, but I'm having trouble evaluating those.

What do you mean ? Are there to much comments or are the functions too long ? However it's true that I had a lot of work on the UI as required from previous reviews and that the layout functions are massive...

Thanks for the review.

kscheirer’s picture

Status: Needs review » Reviewed & tested by the community

You don't need to list PHP >= 5.2.0, Drupal 7 already requires PHP 5.2.5.
Good project page.
Since this is a D7 module, you can probably remove the references to CCK fields.

Thanks for the updates, this looks good to me!

----
Top Shelf Modules - Crafted, Curated, Contributed.

sami radi’s picture

Hi,

I updated the project page, readme.txt and module comments according to your review.

Thanks,

kscheirer’s picture

Status: Reviewed & tested by the community » Fixed

It's been a month without any problems reported, so I'm promoting this myself as per https://drupal.org/node/1125818.

Thanks for your contribution, Sami Radi!

I updated your account to let you promote this to a full project and also create new projects as either a sandbox or a "full" project.

Here are some recommended readings to help with excellent maintainership:

You can find lots more contributors chatting on IRC in #drupal-contribute. So, come hang out and get involved!

Thanks, also, for your patience with the review process. Anyone is welcome to participate in the review process. Please consider reviewing other projects that are pending review. I encourage you to learn more about that process and join the group of reviewers.

Thanks to the dedicated reviewer(s) as well.

----
Top Shelf Modules - Crafted, Curated, Contributed.

sami radi’s picture

Status: Fixed » Closed (fixed)

@kscheirer & @reviewers :

Thanks for the reviews.

The full project is at :
https://drupal.org/project/remove_duplicates

sami radi’s picture

Issue summary: View changes

Changed the link to the old D6 project to avoid drupal robot wrong project check