I am using node_import to load 500,000 rows. It is quite slow: in my tests I could only load about 19,000 rows per hour. In any case, the process stops after loading about 64,000 rows with:

Fatal error: Allowed memory size of 532676608 bytes exhausted (tried to allocate 5001 bytes) in .../modules/node_import/node_import.module on line 819

It is maxing out the memory allocated to a PHP process in php.ini. The node_import.module code does an unset($node), so I expected memory to be cleaned up after each row is loaded...

I also came across this post which suggests that node_save could be causing this. See: http://drupal.org/node/208052

Does anyone have any ideas on how to solve this?

I am trying to avoid resorting to manual SQL updates... Thanks

Comments

zeezhao commented:

Found a temporary solution...

1. The node_save function in node.module clears the cache each time a node is saved by calling:
cache_clear_all(); // it's located towards the end of the function

This slows things down when loading a huge number of rows and also eats up memory. So I created a new function called node_save2 that skips the cache clear, and used it in both files:
- node_import.module
- supported/cck/content.inc

Ideally node_save should have a parameter to turn the cache clearing on/off. There is a patch for this that should make its way into Drupal core at some stage. I am still on Drupal 5.7. See:
http://drupal.org/node/115319

2. Also in node_import.module, I made a similar change to switch off caching for node_load by passing TRUE as the third argument, which resets node_load's internal cache. See this line:

$node_current = node_load($node->nid, NULL, TRUE);

[I have not tested this yet, as it is more relevant for node updates]

3. I reduced the maximum line length passed to fgets(), since I know the size of the biggest row I am loading. In node_import.module I changed 10000 to 500, so that fgets() uses less memory in my situation:

$length = variable_get('node_import_csv_size', 500);
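A caveat on that length, as a minimal sketch rather than anything taken from node_import itself: fgets($handle, $length) returns at most $length - 1 bytes per call, so a limit smaller than the longest row splits that row across several reads. The helper read_lines() and the in-memory sample data are my own assumptions for illustration.

```php
<?php
// Hypothetical helper (not part of node_import): read a stream line by line
// with a fixed maximum length, the way fgets($handle, $length) is used.
function read_lines($handle, $length) {
  $lines = array();
  while (($line = fgets($handle, $length)) !== FALSE) {
    $lines[] = rtrim($line, "\r\n");
  }
  return $lines;
}

// Build a small in-memory "CSV": a short header and one 30-character row.
function sample_stream() {
  $handle = fopen('php://memory', 'w+');
  fwrite($handle, "title,body\n" . str_repeat('x', 30) . "\n");
  rewind($handle);
  return $handle;
}

// A limit above the longest row returns each row whole.
$handle = sample_stream();
$whole = read_lines($handle, 64);
fclose($handle);

// A limit below the longest row breaks that row into several pieces.
$handle = sample_stream();
$split = read_lines($handle, 16);
fclose($handle);

print count($whole) . " lines with a limit of 64\n";
print count($split) . " pieces with a limit of 16\n";
```

So a small value such as 500 is only safe if no row in the file reaches that length.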

This seems to get over the memory issue, but loading speed is still slow...
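For anyone wanting to try the node_save2 idea from point 1, here is a rough sketch of the pattern, not the actual Drupal code. The stub_ functions are stand-ins I made up so it runs on its own, and clearing the cache once at the end of the batch is my own assumption, not something the change above does.

```php
<?php
// Stubs only: these stand in for Drupal's cache_clear_all() and node_save()
// so the sketch runs on its own. They are NOT the real implementations.
function stub_cache_clear_all(&$clear_count) {
  $clear_count++;
}

// Hypothetical "node_save2": does the save but skips the cache clear.
function stub_node_save2(array &$node, &$next_nid) {
  $node['nid'] = $next_nid++; // pretend the node was written to the database
}

// Import a batch, clearing the cache once at the end instead of per node.
function import_batch(array $titles) {
  $clear_count = 0;
  $next_nid = 1;
  $saved = 0;
  foreach ($titles as $title) {
    $node = array('title' => $title);
    stub_node_save2($node, $next_nid);
    unset($node); // release the node before the next row
    $saved++;
  }
  stub_cache_clear_all($clear_count); // one clear for the whole batch
  return array('saved' => $saved, 'cache_clears' => $clear_count);
}

$result = import_batch(array('first', 'second', 'third'));
print $result['saved'] . ' nodes saved, ' . $result['cache_clears'] . " cache clear(s)\n";
```

The point is only that the number of cache clears no longer grows with the number of rows.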

Robrecht Jacques commented:

Version: 5.x-1.6 » master
Status: Active » Postponed

I've committed the node_load() part. More performance tweaking (like node_save()) will need to wait until 6.x.

zeezhao commented:

Ok, thanks.

A few more comments:

- I was also using the CCK patch to enable image loads. This works very well, but I noticed it calls node_save() a second time, which slows down the import. (The patch affects node_import\supported\cck\content.inc.) So to improve performance I disabled the patch and loaded my images manually and separately with some SQL - much faster...

- taxonomy.module also has a function, taxonomy_node_save($nid, $terms), which is called by node_import.
This function deletes the node's existing terms using taxonomy_node_delete($nid) before inserting the new ones. On a new database with no products this is not necessary, so I amended it to delete only on updates... [I have a lot of terms so this was important]
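The taxonomy change can be sketched roughly like this. It is an illustration only, not the real taxonomy.module code: the stub_ functions, the $store array that stands in for the term tables, and the $is_update parameter are all assumptions of mine.

```php
<?php
// Stub for taxonomy_node_delete(): drops a node's terms and counts the call.
// $store is a plain array standing in for the database, not Drupal's schema.
function stub_taxonomy_node_delete(array &$store, $nid) {
  $store['deletes']++;
  unset($store['terms'][$nid]);
}

// Hypothetical variant of taxonomy_node_save(): $is_update is an assumed
// extra parameter, so the delete is skipped for brand-new nodes.
function stub_taxonomy_node_save(array &$store, $nid, array $terms, $is_update) {
  if ($is_update) {
    stub_taxonomy_node_delete($store, $nid);
  }
  foreach ($terms as $tid) {
    $store['terms'][$nid][] = $tid;
  }
}

$store = array('terms' => array(), 'deletes' => 0);
stub_taxonomy_node_save($store, 1, array(10, 11), FALSE); // new node: no delete
stub_taxonomy_node_save($store, 1, array(12), TRUE);      // update: delete first
print $store['deletes'] . " delete call(s) for two saves\n";
```

On a fresh import every save is an insert, so the delete query is skipped for every row.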

Once I made all the changes I got a 50% improvement in speed, from about 20k nodes per hour to about 30k nodes per hour on my laptop, so it should be much faster on a server... It is still slow, but I have been able to load batches of 500k nodes in about 16 hours in one shot by leaving it running, with no memory issue...
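As a quick sanity check on those figures, a small calculation (hours_to_import() is just a name made up for this example) gives the expected duration from the row count and the hourly rate:

```php
<?php
// Hypothetical helper: estimated hours to import $rows at a steady rate.
function hours_to_import($rows, $rows_per_hour) {
  return $rows / $rows_per_hour;
}

// 500k nodes at the old rate (about 20k/hour) and the new rate (about 30k/hour).
$before = round(hours_to_import(500000, 20000), 1);
$after = round(hours_to_import(500000, 30000), 1);
print "before: $before hours, after: $after hours\n";
```

That works out to 25 hours before the changes and a little under 17 hours after, which is in line with the roughly 16 hours observed.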