Hi there,

I was wondering if anyone is seeing the same problem as me, and whether there's a fix for it.

When I get to the last stage of Node Import while importing products, the number of lines processed keeps growing past the actual row count, and the percentage complete goes up and down the whole time.

When I stop the import I find the first few products repeated over and over again. It duplicates the entries.

Any ideas?


Comments

myregistration’s picture

I, too, would like to know how to avoid duplicate product entries. I am using node_import, and when I import an inventory containing products that already exist I get duplicate products; it doesn't update the existing products. I'm not sure why it wasn't written to use SKU as the identifier, since it's a required field, correct? Even better would be a setting to choose which field is the unique identifier. As it stands, it just creates a new product even though the SKUs are the same. I don't see an easy way to mass-delete products either, just one at a time, which is very time-consuming when you are talking 15,000+ products. Please advise. Thanks! :)

ch_masson’s picture

I can't answer the duplicate entry issue, but I can certainly help you with how to delete all instances of a specific content type. For that you need the Devel module at http://drupal.org/project/devel.

It has the feature that you need. Be careful though to select ONLY the content type that you want to delete or else it will wipe out every single piece of content (The "Delete all" box is checked by default)!

Go to admin/content/delete_content.

Uncheck the "Delete All" box and check only the type of content that you want to delete! :)

Then click "Delete" and your 15,000+ products will be gone in no time!
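If you prefer doing it in code rather than through the Devel UI, something along these lines should also work (a hypothetical snippet using the Drupal 6 node API; the 'product' machine name is an assumption, check yours under admin/content/types):

// Delete every node of a given content type, one at a time.
$type = 'product';  // assumed machine name of your product content type
$result = db_query("SELECT nid FROM {node} WHERE type = '%s'", $type);
while ($row = db_fetch_object($result)) {
  node_delete($row->nid);
}

For 15,000+ products you would want to run this from a script or in batches, since calling node_delete() per node is slow and a single web request will likely time out.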

Christian

myregistration’s picture

Thank you for the info! By the way, I'm a total newbie at this point.

It would be nice if there were a checkbox beside each listed product with a delete button, or some alternative to the all-or-one-at-a-time options.

I found a module called node_import_update that is supposed to assist node_import by updating products that already exist, based on their SKU, instead of creating duplicates. Unfortunately, it's not working for me. It goes through all the steps of the import interface until step 8 (start import), where instead of starting it goes back to step 1. If I disable the module it works as before: it adds duplicates, but at least it processes. If anyone could debug this module it would be greatly appreciated. Thanks! :)

Answer: Going back to step 1 is an issue with IE and has nothing to do with the node_import_update module; when I use Firefox it works fine.

myregistration’s picture

The duplicate entries at the end of the import are still happening for me. The count of rows imported goes past the actual number of lines, and it duplicates nodes.

cherukan’s picture

Version: 6.x-1.x-dev » 6.x-1.0-rc4

Seeing this in RC4. Any suggestions on how to get around it? The progress bar goes past the number of records. If I download all imported rows it shows the correct number, but if I look at the data under Content Management > Content, I see duplicate nodes being created.

djevans’s picture

@myregistration, @cherukan:

Just had this error myself after upgrading to 6.x-1.1. Running update.php solved the problem for me - I don't have time to recreate the error but could you check if this works for you?

sveldkamp’s picture

Version: 6.x-1.0-rc4 » 6.x-1.1

update.php didn't seem to do it for me, and I'm on 6.x-1.1. I'll post back if I find anything. -Steve

deekayen’s picture

subscribe - this just cost my company some money

cYu’s picture

Using 1.1 as well and having this issue sporadically. Have a csv of about 2500 rows that imports fine most of the time, but in one instance it created 2 nodes for 50 of the lines in the csv. The pairs of nodes were non-sequential and seemingly unrelated.

cYu’s picture

In my case it looks like the duplicates may be occurring when the import overlaps with a cron run. In my watchdog log I start getting messages of '@type: added %title.' with a location of /cron.php doubled up with the normal node creation messages.

deekayen’s picture

We're going to do the opposite of Semiclean and set the cron semaphore variable during our next round of imports to prevent cron from succeeding. It's a bit of a hack, but instead of duplicate entries, we'll get "Attempting to re-run cron while it is already running." in watchdog.

deekayen’s picture

something like this excerpt...

// Pretend cron is already running so core's cron will bail out
// instead of kicking off the import task a second time.
variable_set('cron_semaphore', time());
...
// Load and run the import task ourselves.
$task = db_fetch_object(db_query('SELECT * FROM {node_import_tasks} WHERE taskid=%d', $task_id));
drupal_write_record('node_import_tasks', $data);
$task = node_import_load($data->taskid);
node_import_do_task($task);
...
// Clear the semaphore so normal cron runs can resume.
variable_del('cron_semaphore');
Bastlynn’s picture

I worked out the root cause. node_import has a hook_cron implementation to kick off tasks during cron. If a task is completed, it won't attempt to execute the task again. But if a task is still open (such as a very long-running import of a very large CSV file) it will attempt to kick it off a second time. At that point you get duplicated reading of the data and duplicated node creation.

The root cause is in the task locking mechanism.

/**
 * Acquire or release our node_import lock.
 *
 * @param $release
 *   Boolean. If TRUE, release the lock. If FALSE, acquire the
 *   lock.
 *
 * @return
 *   Boolean. Whether the lock was acquired.
 */
function node_import_lock_acquire($release = FALSE) {
  static $lock_id, $locked;

  if (!isset($lock_id)) {
    $lock_id = md5(uniqid());
    $locked = FALSE;
    register_shutdown_function('node_import_lock_release');
  }

  if ($release) {
    db_query("DELETE FROM {variable} WHERE name = '%s'", 'node_import:lock');
    $locked = FALSE;
  }
  else if (!$locked) {
    if (@db_query("INSERT INTO {variable} (name, value) VALUES ('%s', '%s')", 'node_import:lock', $lock_id)) {
      $locked = TRUE;
    }
  }

  return $locked;
}

The reason the system gets away with this is that the node_import_lock_acquire() function is not thread-safe. Static variables do not work that way; they are not shared across processes, so once the cron run kicks off, the file is being processed by two processes simultaneously.

Solution: Use variable_get and variable_set for locks.
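A rough sketch of what that could look like (my own untested illustration, not the attached patch; it keeps the 'node_import:lock' variable name from the existing code and checks the {variable} table directly so locks held by other processes are visible):

function node_import_lock_acquire($release = FALSE) {
  static $lock_id;

  if (!isset($lock_id)) {
    $lock_id = md5(uniqid());
    register_shutdown_function('node_import_lock_release');
  }

  if ($release) {
    // Drop the lock so the next process (UI or cron) can pick the task up.
    variable_del('node_import:lock');
    return FALSE;
  }

  // Read the current lock holder from the database rather than trusting a
  // per-process static, so a lock taken by cron is visible to the web request.
  $holder = db_result(db_query("SELECT value FROM {variable} WHERE name = '%s'", 'node_import:lock'));

  if (!$holder) {
    // Nobody holds the lock yet: claim it.
    variable_set('node_import:lock', $lock_id);
    return TRUE;
  }

  // The lock exists; we only hold it if the stored id is our own.
  return unserialize($holder) == $lock_id;
}

There is still a small race window between the SELECT and the variable_set(), which is presumably why the original code used an INSERT that fails if the row already exists.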

Bastlynn’s picture

Potential patch up for review. See attached.

deekayen’s picture

Status: Active » Needs work

Why set the uniqid at all? Why not just set the boolean? For that matter, why even keep it around in the shutdown function? I'd think you could variable_del() it at that point.

Bastlynn’s picture

Agreed, the shutdown function isn't the best way, or at least not the clearest way, to clear the lock after the task is complete. Removing it would mean retooling the locking logic to acquire the lock at the beginning of a task and clear it at the end (not necessarily a bad idea), but that would require more meddling in other functions.

At that point, though, there also needs to be a way to clear or delete locked tasks in case the system crashes in the middle of an import, so we don't end up with incomplete tasks lingering all over the place and blocking new tasks. It would then be worthwhile to use a lock per task, so the system could work on multiple imports without risking duplication or blocking itself.
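For illustration, a per-task lock could look something like this (purely hypothetical; the helper names and the 'node_import:lock:' variable prefix are my own assumptions, not from any patch):

function node_import_task_lock_acquire($task_id) {
  $name = 'node_import:lock:' . $task_id;

  // Look at the variables table directly so locks taken by other
  // processes (e.g. a cron run) are visible here.
  $existing = db_result(db_query("SELECT value FROM {variable} WHERE name = '%s'", $name));
  if ($existing) {
    return FALSE;
  }

  variable_set($name, time());
  return TRUE;
}

function node_import_task_lock_release($task_id) {
  // Clearing the variable lets a stuck or crashed task be re-run later.
  variable_del('node_import:lock:' . $task_id);
}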

deekayen’s picture

What's the point of the lock_id though?

Bastlynn’s picture

Right now there isn't one (I'm not convinced there was one in the original code either). Once I finish working up a patch to do individual task locking as described in my late-night rambling, there will be. ;)

Bastlynn’s picture

Updated patch - I saw the comment tracing the history of the locking mechanism as described here, but the local implementation of it for node_import missed a critical element by not pulling data in from the variables table when trying to gain a lock.

I'm debating updating the locking mechanism here to use the formal locking mechanisms in Drupal - thoughts, opinions, pros and cons of doing so?
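For comparison, Drupal core's lock API (includes/lock.inc, introduced in Drupal 7; a Drupal 6 site would need a backport) would be used roughly like this (only a sketch of the idea, not a proposed patch):

// Sketch only: lock_acquire()/lock_release() are the Drupal 7 lock.inc API;
// node_import 6.x would need a backported lock.inc for this to work.
if (lock_acquire('node_import_task_' . $task->taskid)) {
  node_import_do_task($task);
  lock_release('node_import_task_' . $task->taskid);
}
else {
  // Another process already holds the lock for this task; skip it this run.
  watchdog('node_import', 'Task @id is already being processed.', array('@id' => $task->taskid));
}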

Bastlynn’s picture

Status: Needs work » Needs review

Going? Going? Gone. I'm pretty content with not using the features in lock.inc for the moment. So this is ready for review and testing - please let me know if you spot issues with the logic being used here.

deekayen’s picture

The node_import_lock_acquire() function in #19 looks like it would never return FALSE to signify that a lock was already in place on the requested task. Here's an untested revision, possibly full of parse errors.

Bastlynn’s picture

re: #21 - The logic seems functionally the same as #19, but I like this better than trying to stick with the module's original approach. It reads more clearly and reduces the number of flags stored in the variables table. The patch doesn't have any parse errors to correct; it looks good to me.

Bastlynn’s picture

I tweaked the logic on this patch one last time to make sure the lock releases correctly. See attached.