Closed (fixed)
Project:
Import / Export API
Version:
4.7.x-1.x-dev
Component:
Code
Priority:
Critical
Category:
Bug report
Assigned:
Unassigned
Reporter:
Created:
6 Dec 2006 at 17:44 UTC
Updated:
31 Dec 2006 at 00:45 UTC
Attached is an XML test import of stories. I am successfully importing the nodes, but the terms associated with those nodes are not consistently being added. For this file, two nodes fail to import the term, while one node does. Any thoughts? I'm running Drupal 4.7.4.
| Comment | File | Size | Author |
|---|---|---|---|
| | node_data_keyless_test7.xml.txt | 5.14 KB | jamesJonas |
Comments
Comment #1
z.stolar commented
I have the same problem, I believe. By looking at your XML I see that all three stories have the exact same term(s). If the problem is indeed identical to the one I've experienced, only the LAST story got the terms. Am I right? If so, the change to the issue's title is warranted.
I also escalated the priority to critical, because, when importing large sets of data, it's practically impossible to go back and fix all the terms.
I'm using this (blessed) module to import hundreds of nodes in each chunk (actually thousands, but I have to split them...), and my colleague had to write an external script (Python) to go over the DB and fix the terms...
Comment #2
jamesJonas commented
It seems you are correct: only the last story got the terms. This is only a subset of a larger import, but it provides a nice test case. After encountering this issue I went back to node_import, fixed the error in the code (variables must be recast into arrays using (array)), and proceeded with my bulk import. I'm now reevaluating, since the import seems to run at around 3.6 seconds per node. I need to ramp it up to 100 to 1000 times faster.
What is the speed per node for the imports you are doing via importexportapi.module? If it is not in the range of 0.3 to 0.03 seconds per node, I will need to go directly to the database. Right now I'm starting to map out a list of tables I need to write SQL against. I only have nodes, locations and terms to update: [term_node, term_data, node_revisions, node, location]. Any others?
I'm also importing a fairly large data set (350 MB). Besides refining php.ini, are there any other areas that you have found helpful to customize during a large import?
Comment #3
z.stolar commented
Can you give more detail on how you "fixed the error in the code" (line number etc.)?
About the timing: I must admit I'm not as accurate as you in this matter, but I can put it this way: importing 400 users' data takes between 8 and 13 minutes (!), which is still a lot. Before I settled on 400, I had to go through a long (and painful) session of trial and error with unfinished runs of the scripts. For me, 400 was the maximum number of units to import per run, and my php.ini directives are very generous. So what I do is split my data into small XML files, but then the import becomes l o n g...
Comment #4
jamesJonas commented
array_merge
The details with regards to fixing issues for node_import are here:
http://drupal.org/node/101624
As of PHP 5.0, array_merge() enforces stricter type checking on its arguments. What I did was cast the variable inside the statement by placing '(array)' in front of it. Thus, for node_import.module, line 569:
Before: $errors = array_merge((array)$errors, $function($node, $preview > 0));
After: $errors = array_merge($errors, $function($node, $preview > 0));
I'm still waiting for verification of this change from the module owner.
Timing
I'm importing more than 1 million nodes. It sounds like importexportapi.module will also be too slow. Right now I'm digging into the Drupal database and mapping direct imports. Not very fun.
Added to my list of tables that must be updated is the 'sequences' table. A direct import into the Drupal tables requires you to map all modules that are impacted during a 'node_save()'. The sequences for nid, vid (node_revisions), and tid are all controlled via the 'sequences' table. A sequence is updated through the db_next_id() function, which locks the table, grabs the id, increments it and then unlocks the table. Unless you are working offline (like me), a direct import will need to lock several tables via SQL.
During a large import via node_import I experienced a creeping consumption of memory. It consumed 1 GB of main memory and 2 GB of virtual memory, then died after about 70k node imports. Sounds like a memory leak. I look forward to being able to stress test importexportapi.module after this non-population of terms issue is resolved.
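The id allocation described above (lock, read the current id, increment, unlock) can be sketched roughly as follows. This is not Drupal's db_next_id() implementation: an in-memory SQLite database accessed through PDO stands in for the real database, a transaction stands in for the table lock, and the function name example_next_id() and the sequence names are hypothetical.

```php
<?php
// Sketch only: SQLite in memory stands in for the real database, and a
// transaction stands in for the table lock described in the comment above.
function example_next_id(PDO $db, $name) {
  $db->beginTransaction();
  // Read the current value of the named sequence, if it exists yet.
  $select = $db->prepare('SELECT id FROM sequences WHERE name = ?');
  $select->execute(array($name));
  $current = $select->fetchColumn();
  $next = ($current === false) ? 1 : ((int)$current + 1);
  // Store the incremented value back before releasing the "lock".
  $write = $db->prepare('REPLACE INTO sequences (name, id) VALUES (?, ?)');
  $write->execute(array($name, $next));
  $db->commit();
  return $next;
}

$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE sequences (name TEXT PRIMARY KEY, id INTEGER NOT NULL)');

// Hypothetical sequence names, one counter per id type.
$nid  = example_next_id($db, 'node_nid');
$vid  = example_next_id($db, 'node_revisions_vid');
$nid2 = example_next_id($db, 'node_nid');
```

A direct SQL import that skips this step would leave the counters behind the highest imported ids, so the next node saved through Drupal itself could be handed an id that is already in use.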
Comment #5
jamesJonas commented
Correction: Reverse the order:
Before: $errors = array_merge($errors, $function($node, $preview > 0));
After: $errors = array_merge((array)$errors, $function($node, $preview > 0));
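To show what the cast does in isolation, here is a minimal sketch. The functions example_validate() and collect_errors() are hypothetical stand-ins and are not taken from node_import.module; the assumption is that $errors can still be NULL the first time the merge runs.

```php
<?php
// Hypothetical stand-in for a validation callback; returns an array of errors.
function example_validate($node, $preview) {
  return empty($node['title']) ? array('title' => 'Title is required.') : array();
}

// (array) turns NULL into array(), so array_merge() always receives arrays.
function collect_errors($errors, $function, $node, $preview) {
  return array_merge((array)$errors, $function($node, $preview > 0));
}

$errors = NULL; // nothing collected yet
$errors = collect_errors($errors, 'example_validate', array('title' => ''), 0);
$errors = collect_errors($errors, 'example_validate', array('title' => 'Story'), 0);
```

Without the cast, passing NULL as the first argument makes array_merge() fail, with a warning on PHP 5 and a TypeError on PHP 8.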
Comment #6
Jaza commented
A fix for the consequent terms import issue has been committed to HEAD and 4.7. Thanks.
The problem was that in the ['node']['taxonomy'] definition, the 'tid' field was a 'key_component', but the 'nid' field was not. Both of these fields are now 'key_component' fields.
Comment #7
(not verified) commented