Hi there,

I was wondering if anyone is seeing the same problem as me, and whether there's a fix for it.

When I get to the last stage of Node Import while importing products, the number of lines processed keeps growing past the actual row count, and the percentage complete goes up and down the whole time.

When I stop the import I find the first few products repeated over and over again. It duplicates the entries.

Any ideas?


Comments

myregistration’s picture

I, too, would like to know how to avoid duplicate product entries. I am using node_import, and when I import an inventory containing products that already exist I get duplicate products; it doesn't update the existing products. I'm not sure why it wasn't written to use SKU as the identifier, since it's a required field, correct? Even better would be a setting to choose which field is the unique identifier. As it stands, it just creates a new product even though the SKUs are the same. I don't see an easy way to mass-delete products either, just one at a time, which is very time-consuming when you are talking 15,000+ products. Please advise. Thanks! :)

ch_masson’s picture

I can't answer the duplicate entry issue, but I can certainly help you with how to delete all instances of a specific content type. For that you need the Devel module at http://drupal.org/project/devel.

It has the feature that you need. Be careful though to select ONLY the content type that you want to delete or else it will wipe out every single piece of content (The "Delete all" box is checked by default)!

Go to admin/content/delete_content.

Uncheck the "Delete All" box and check only the type of content that you want to delete! :)

Then click "Delete" and your 15,000+ products will be gone in no time!
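If you prefer doing it in code rather than through the Devel UI, something along these lines should also work (a hypothetical snippet using the Drupal 6 node API; the 'product' machine name is an assumption, check yours under admin/content/types):

// Delete every node of a given content type, one at a time.
$type = 'product';  // assumed machine name of your product content type
$result = db_query("SELECT nid FROM {node} WHERE type = '%s'", $type);
while ($row = db_fetch_object($result)) {
  node_delete($row->nid);
}

For 15,000+ products you would want to run this from a script or in batches, since calling node_delete() per node is slow and a single web request will likely time out.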

Christian

myregistration’s picture

Thank you for the info! By the way, I'm a total newbie at this point.

It would be nice if there were a checkbox beside each listed product with a delete button, or some alternative to the all-or-one-at-a-time options.

I found a module called node_import_update that is supposed to assist node_import by updating products that already exist, based on their SKU, instead of creating duplicates. Unfortunately, it's not working for me. It goes through all the steps of the import interface until step 8 (start import), where instead of starting it goes back to step 1. If I disable the module it works as before: it adds duplicates, but at least it processes. If anyone could debug this module it would be greatly appreciated. Thanks! :)

Answer: Going back to step 1 is an issue with IE and has nothing to do with the node_import_update module; when I use Firefox it works fine.

myregistration’s picture

The duplicate entries at the end of the import are still happening for me. The count of rows imported goes past the actual number of lines, and it duplicates nodes.

cherukan’s picture

Version: 6.x-1.x-dev » 6.x-1.0-rc4

Seeing this in RC4. Any suggestions on how to get around it? The progress bar goes past the number of records. If I download all imported rows it shows the correct number, but if I look at the data under Content Management > Content, I see duplicate nodes being created.

djevans’s picture

@myregistration, @cherukan:

Just had this error myself after upgrading to 6.x-1.1. Running update.php solved the problem for me - I don't have time to recreate the error but could you check if this works for you?

sveldkamp’s picture

Version: 6.x-1.0-rc4 » 6.x-1.1

update.php didn't seem to do it for me, and I'm on 6.x-1.1. I'll post back if I find anything. -Steve

deekayen’s picture

subscribe - this just cost my company some money

cYu’s picture

Using 1.1 as well and having this issue sporadically. Have a csv of about 2500 rows that imports fine most of the time, but in one instance it created 2 nodes for 50 of the lines in the csv. The pairs of nodes were non-sequential and seemingly unrelated.

cYu’s picture

In my case it looks like the duplicates may be occurring when the import overlaps with a cron run. In my watchdog log I start getting messages of '@type: added %title.' with a location of /cron.php doubled up with the normal node creation messages.

deekayen’s picture

We're going to do the opposite of Semiclean and set the cron semaphore variable during our next round of imports to prevent cron from succeeding. It's a bit of a hack, but instead of duplicate entries, we'll get "Attempting to re-run cron while it is already running." in watchdog.

deekayen’s picture

something like this excerpt...

// Pretend cron is already running so core's cron will bail out
// instead of kicking off the import task a second time.
variable_set('cron_semaphore', time());
...
// Load and run the import task ourselves.
$task = db_fetch_object(db_query('SELECT * FROM {node_import_tasks} WHERE taskid=%d', $task_id));
drupal_write_record('node_import_tasks', $data);
$task = node_import_load($data->taskid);
node_import_do_task($task);
...
// Clear the semaphore so normal cron runs can resume.
variable_del('cron_semaphore');
Bastlynn’s picture

I worked out the root cause. node_import has a hook_cron implementation to kick off tasks during cron. If a task is completed, it won't attempt to execute the task again. But if a task is still open (such as a very long-running import of a very large CSV file) it will attempt to kick it off a second time. At that point you get duplicated reading of the data and duplicated node creation.

The root cause is in the task locking mechanism.

/**
 * Acquire or release our node_import lock.
 *
 * @param $release
 *   Boolean. If TRUE, release the lock. If FALSE, acquire the
 *   lock.
 *
 * @return
 *   Boolean. Whether the lock was acquired.
 */
function node_import_lock_acquire($release = FALSE) {
  static $lock_id, $locked;

  if (!isset($lock_id)) {
    $lock_id = md5(uniqid());
    $locked = FALSE;
    register_shutdown_function('node_import_lock_release');
  }

  if ($release) {
    db_query("DELETE FROM {variable} WHERE name = '%s'", 'node_import:lock');
    $locked = FALSE;
  }
  else if (!$locked) {
    if (@db_query("INSERT INTO {variable} (name, value) VALUES ('%s', '%s')", 'node_import:lock', $lock_id)) {
      $locked = TRUE;
    }
  }

  return $locked;
}

The reason the system gets away with this is that the node_import_lock_acquire() function is not thread-safe. Static variables do not work that way; they are not shared across processes, so once the cron run kicks off, the file is being processed by two processes simultaneously.

Solution: Use variable_get and variable_set for locks.
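A rough sketch of what that could look like (my own untested illustration, not the attached patch; it keeps the 'node_import:lock' variable name from the existing code and checks the {variable} table directly so locks held by other processes are visible):

function node_import_lock_acquire($release = FALSE) {
  static $lock_id;

  if (!isset($lock_id)) {
    $lock_id = md5(uniqid());
    register_shutdown_function('node_import_lock_release');
  }

  if ($release) {
    // Drop the lock so the next process (UI or cron) can pick the task up.
    variable_del('node_import:lock');
    return FALSE;
  }

  // Read the current lock holder from the database rather than trusting a
  // per-process static, so a lock taken by cron is visible to the web request.
  $holder = db_result(db_query("SELECT value FROM {variable} WHERE name = '%s'", 'node_import:lock'));

  if (!$holder) {
    // Nobody holds the lock yet: claim it.
    variable_set('node_import:lock', $lock_id);
    return TRUE;
  }

  // The lock exists; we only hold it if the stored id is our own.
  return unserialize($holder) == $lock_id;
}

There is still a small race window between the SELECT and the variable_set(), which is presumably why the original code used an INSERT that fails if the row already exists.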

Bastlynn’s picture

Potential patch up for review. See attached.

deekayen’s picture

Status: Active » Needs work

Why set the uniqid at all? Why not just set the boolean? For that matter, why even keep it around in the shutdown function? I'd think you could variable_del() it at that point.

Bastlynn’s picture

Agreed, the shutdown function isn't the best way, or at least not the clearest way, to clear the lock after the task is complete. Removing it would mean retooling the locking logic to acquire the lock at the beginning of a task and clear it at the end (not necessarily a bad idea), but that would require more meddling in other functions.

At that point, though, there also needs to be a way to clear or delete locked tasks in case the system crashes in the middle of an import, so we don't end up with incomplete tasks lingering all over the place and blocking new tasks. It would then be worthwhile to use a lock per task, so the system could work on multiple imports without risking duplication or blocking itself.
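For illustration, a per-task lock could look something like this (purely hypothetical; the helper names and the 'node_import:lock:' variable prefix are my own assumptions, not from any patch):

function node_import_task_lock_acquire($task_id) {
  $name = 'node_import:lock:' . $task_id;

  // Look at the variables table directly so locks taken by other
  // processes (e.g. a cron run) are visible here.
  $existing = db_result(db_query("SELECT value FROM {variable} WHERE name = '%s'", $name));
  if ($existing) {
    return FALSE;
  }

  variable_set($name, time());
  return TRUE;
}

function node_import_task_lock_release($task_id) {
  // Clearing the variable lets a stuck or crashed task be re-run later.
  variable_del('node_import:lock:' . $task_id);
}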

deekayen’s picture

What's the point of the lock_id though?

Bastlynn’s picture

Right now there isn't one (I'm not convinced there was one in the original code either). Once I finish working up a patch to do individual task locking as described in my late-night rambling, there will be. ;)

Bastlynn’s picture

Updated patch - I saw the comment tracing the history of the locking mechanism as described here, but the local implementation of it for node_import missed a critical element by not pulling data in from the variables table when trying to gain a lock.

I'm debating updating the locking mechanism here to use the formal locking mechanisms in Drupal - thoughts, opinions, pros and cons of doing so?
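For comparison, Drupal core's lock API (includes/lock.inc, introduced in Drupal 7; a Drupal 6 site would need a backport) would be used roughly like this (only a sketch of the idea, not a proposed patch):

// Sketch only: lock_acquire()/lock_release() are the Drupal 7 lock.inc API;
// node_import 6.x would need a backported lock.inc for this to work.
if (lock_acquire('node_import_task_' . $task->taskid)) {
  node_import_do_task($task);
  lock_release('node_import_task_' . $task->taskid);
}
else {
  // Another process already holds the lock for this task; skip it this run.
  watchdog('node_import', 'Task @id is already being processed.', array('@id' => $task->taskid));
}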

Bastlynn’s picture

Status: Needs work » Needs review

Going? Going? Gone. I'm pretty content with not using the features in lock.inc for the moment. So this is ready for review and testing - please let me know if you spot issues with the logic being used here.

deekayen’s picture

The node_import_lock_acquire() function in #19 looks like it would never return FALSE to signify that a lock was already in place on the requested task. Here's an untested revision, possibly full of parse errors.

Bastlynn’s picture

re: #21 - The logic seems functionally the same as #19, but I like this better than trying to stick with the module's original approach. It reads more clearly and reduces the number of flags stored in the variables table. The patch doesn't have any parse errors to correct; it looks good to me.

Bastlynn’s picture

I tweaked the logic on this patch one last time to make sure the lock releases correctly. See attached.