I am using node_import to load 500,000 rows. The speed is quite slow: from some testing, I can only load about 19,000 rows per hour. In any case, the process stops after loading about 64,000 rows saying:
Fatal error: Allowed memory size of 532676608 bytes exhausted (tried to allocate 5001 bytes) in .../modules/node_import/node_import.module on line 819
It is maxing out the memory allocated to a PHP process in php.ini. Looking at the node_import.module code, it does an unset($node), so I would expect cleanup during each load...
I also came across this post which suggests that node_save could be causing this. See: http://drupal.org/node/208052
Does anyone have any ideas on how to solve this?
I am trying to avoid resorting to manual SQL updates... Thanks
Comments
Comment #1
zeezhao commented
Found a temporary solution...
1. The node_save() function in node.module clears the cache each time a node is saved by calling:
cache_clear_all(); // it's located towards the end of the function
This impacts speed when loading a huge number of rows and also eats up memory. So I created a new function called node_save2() without the cache clearing and used it in both files:
- node_import.module
- supported/cck/content.inc
Ideally node_save() should have a parameter to turn the cache clearing on/off. There is a patch for this that should make its way into Drupal core at some stage. I am still on Drupal 5.7. See:
http://drupal.org/node/115319
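A minimal sketch of that change, assuming Drupal 5's node.module (node_save2() is the commenter's own name, not a core function, and the elided body is the stock node_save() code, not reproduced here):

```php
/**
 * Copy of Drupal 5's node_save(), minus the per-node cache flush.
 * Sketch only, for bulk imports; the real body must be copied
 * verbatim from node.module.
 */
function node_save2(&$node) {
  // ... identical to node_save() in node.module ...

  // The one change: the stock function ends with cache_clear_all(),
  // which flushes the cached pages on every single saved node.
  // cache_clear_all();  // <-- omitted during bulk imports
}
```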
2. Also in node_import.module, I made a similar change to switch off caching for node_load(). See this line:
$node_current = node_load($node->nid, NULL, TRUE);
[I have not tested this yet, as it is more relevant for node updates]
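For context, a sketch against Drupal 5's node_load($param, $revision, $reset) signature: the TRUE third argument resets node_load()'s internal static cache, so node objects don't accumulate in memory across a long import.

```php
// During the import: reset node_load()'s static cache on every call,
// so previously loaded node objects are freed rather than retained.
$node_current = node_load($node->nid, NULL, TRUE);

// During a normal page request you would usually omit the flag and
// let the static cache avoid repeated database loads:
// $node_current = node_load($node->nid);
```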
3. I reduced the allocated buffer length, since I know the biggest row size I am loading. In node_import.module, I changed 10000 to 500 so that fgets() uses less memory in my situation:
$length = variable_get('node_import_csv_size', 500);
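As a standalone illustration of why this matters (not node_import code; the sample data here is made up), fgets($handle, $length) reads at most $length - 1 bytes per call, so a smaller buffer bounds the memory used per line:

```php
<?php
// Hypothetical demo of reading CSV lines with a bounded buffer,
// the same pattern node_import uses via fgets().
$length = 500; // was 10000; enough for the longest row in my data

$csv = "title,body\nNode one,Hello\nNode two,World\n";
$fp = fopen('php://memory', 'r+');
fwrite($fp, $csv);
rewind($fp);

$rows = array();
while (($line = fgets($fp, $length)) !== FALSE) {
  // fgets() returns at most $length - 1 bytes (or up to the newline).
  $rows[] = explode(',', rtrim($line, "\r\n"));
}
fclose($fp);

echo count($rows); // 3 rows parsed
```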
This seems to get over the memory issue, but loading speed is still slow...
Comment #2
Robrecht Jacques commented
I've committed the node_load() part. More performance tweaking (like node_save()) will need to wait until 6.x.
Comment #3
zeezhao commented
Ok, thanks.
A few more comments:
- I was also using the CCK patch to enable image loads. This works very well, but I noticed it also calls node_save(), which slows down the import. (The patch affects node_import\supported\cck\content.inc.) To improve performance, I disabled the patch and loaded my images manually and separately with some SQL - much faster...
- taxonomy.module also has a function, taxonomy_node_save($nid, $terms), which is called by node_import.
In this function, it deletes existing terms using taxonomy_node_delete($nid) before inserting new ones. On a new database with no products this is not necessary, so I amended it to only delete on updates... [I have a lot of terms, so this was important]
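A sketch of that local change, assuming Drupal 5's taxonomy.module ($is_new is a hypothetical extra parameter used here to label the idea; the stock function takes only $nid and $terms and always deletes first):

```php
/**
 * Amended taxonomy_node_save(): skip the delete for brand-new nodes,
 * which cannot have any existing {term_node} rows yet.
 */
function taxonomy_node_save($nid, $terms, $is_new = FALSE) {
  if (!$is_new) {
    // Only needed on updates; on a fresh import this is wasted work.
    taxonomy_node_delete($nid);
  }
  // ... insert the new {term_node} rows exactly as the stock function does ...
}
```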
Once I made all the changes, I got a 50% improvement in speed, from about 20k nodes per hour to about 30k on my laptop, so it should be much faster on a server... Still slow, but I have been able to load batches of 500k nodes in about 16 hours in one shot by leaving it running, with no memory issues...