Ok - current setup... Development server...
PHP 5
Mysql 5
Athlon64 3.4
512Mb DDR
Fedora Core 5

I'm trying to migrate a whole bunch of articles from one site into Drupal. This site in particular has about 24,000 articles in total. I need to use CCK to reproduce these articles in Drupal. I also need to use Categories to organise everything.

I started off using a Drupal Module which was run through Apache and created nodes through drupal itself. Worked well... for about 50-100 nodes. Anything after that took a HELL of a long time. I very quickly realised that 24,000 articles was never gonna happen this way.

Method 2: A manual PHP script that gets run through command prompt on the server directly. Advantages:

  • Can be left going in the background without the concern of a browser timeout
  • I dont need to leave my machine on too :-)
  • So much quicker its not funny

Fantastic - so now I can import all 24,000 articles (plus about 300 categories) in about 40 minutes (whereas it took 20 mins to do 100 articles + 300 cats).

Thing is - if I import anything more than 5,000 articles (approx) I get PHP memory errors.

After enabling logging and faffing about with the server for a few days I have traced the error to the Cache.

After the import, I load the homepage and drupal seems to detect a difference between the cache and real life so regenerates the cache - including the menu. Now, as part of my initial plan, I decided to fully replicate the actions of Drupal whenever a node was created. This included creating one menu item for each node. When Drupal loads for the first time it tries to create a cache entry for the menu items.
It seems it does some very heavy duty PHP stuff (I see Apache killing my 3.4Ghz machine for a while) and then, according to the MySQL Log, dumps what looks like a serialised array into the database as a blob.

If this happened just one I'd be fine with it - part of the process! But it doesn't. This is whats getting me. After this is done (say I import 2000 articles as a test) and then I decide to create a new one. It then has to regenerate the entire menu system again.

Obviously there comes a point where Apache cannot do so anymore due to lack of resources. It could also mean daily downtime for the site while it sorts itself out.

What the hell can I do to get around this?

It seems stupid to me that Drupal tries to do this!

Btw: Caching is also disabled according to the settings.

I'd really appreciate any kind of feedback anyone can give here.

Comments

nicholasthompson’s picture

Just thought - it cant be an issue as big as I think, otherwise every new forum post on this site would cause a complete rebuild of the menu system.

What am I doing wrong?

nicholasthompson’s picture

Just a thought - it couldn't be caused by the category module could it?

nicholasthompson’s picture

Well I've done some debugging and it seems the Category Module is innocent and the issue seems to be core. Using xDebug I ran a few tests…

Using a sample of 2,000 articles and 300 categories, I got the following interesting results... the _menu_sort() hook was called 12,769 times and the Time Taken was 10.4929. If thats 10ms per call then thats DAMN slow...
The same goes for the menu_get_menu() function which gets called 12,502 times and a time taken of 0.25.

If it helps in the slightest, I have put the output from xDebug on the web...
http://www.pponline.co.uk/temp/test.htm

notabenem’s picture

This seems to be bogus. As far as I know, category_menu is the only module creating menus automatically.
I think if you disable the category_menu while doing the upload, you will be fine. i have just done a test: new item does not create a menu automatically, only when category_menu is told to do so ("Enabled", or "enabled (but disabled)", or "enabled (breadcrumbs only)" )

On the other side I noticed a serious impact, that I could not open the administrative page, because the log was that full (and there was no button to immediately delete it, or clean it up) because of some malfunctioned data import.

nicholasthompson’s picture

I will try this tomorrow - but I'm sure that I did this already.

So does this imply that if you have more than a few nodes on your site you cant use category_menu due to it regenerating the menu everytime?

nicholasthompson’s picture

Well tweaking that certainly does help a LOT. I can now actually browse the site - but I'm hitting other issues which i'll work on now as it seems my settings aren't perfect yet. The category module certainly could do with some optimising!

notabenem’s picture

"So does this imply that if you have more than a few nodes on your site you cant use category_menu due to it regenerating the menu everytime?"

My site has only a dozen nodes, I have just started to design it, but if such a performance penalty exists, it is a big showstopper.

As for disabling category_menu:
I haven't checked, but if I am correct and you disable this, you won't have your breadcrumbs...

nicholasthompson’s picture

I'm having partial sucess. If I setup the containers to only create menu items for categories and not nodes then any given pages takes 2-3 seconds to load and cache regeneration adds up to a few seconds on top that. Thats "acceptable" considering the test server I use at work is a dual 1Ghz (and it doesn't seem to be multi-threading, so effectively the server is 1Ghz).

My current issue is the category_node_get_categories command. It calls a query which produces (on my current setup) almost 8000 results which PHP then tries to turn into a 2-keyed array. This is what Devel has to say about it...

  • Time taken (ms): 1007.42
  • Where: category_node_get_categories
  • Query: SELECT n.nid, r.nid AS node_id, c.*, n.title FROM category c INNER JOIN category_node r ON c.cid = r.cid INNER JOIN category cn ON c.cnid = cn.cid INNER JOIN node n ON c.cid = n.nid INNER JOIN node cnn ON cn.cid = cnn.nid WHERE n.status = 1 AND n.moderate = 0 ORDER BY cn.weight, cnn.title, c.weight, n.title

Thats a LONG query. Geting MySQL to explain it produces...

id 	select_type 	table 	type 	possible_keys 	key 	key_len 	ref 	rows 	Extra
1 	SIMPLE 	c 	ALL 	PRIMARY,tid 	NULL 	NULL 	NULL 	343 	Using temporary; Using filesort
1 	SIMPLE 	cn 	eq_ref 	PRIMARY 	PRIMARY 	4 	drupal_sb.c.cnid 	1 	 
1 	SIMPLE 	n 	ref 	PRIMARY,status,node_moderate,node_status_type,nid 	PRIMARY 	4 	drupal_sb.c.cid 	1 	Using where
1 	SIMPLE 	cnn 	ref 	PRIMARY,nid 	PRIMARY 	4 	drupal_sb.cn.cid 	1 	 
1 	SIMPLE 	r 	ref 	PRIMARY,cid 	PRIMARY 	4 	drupal_sb.c.cid 	54 	Using index

(sorry if that comes out wierd - I'm not allowed to use HTML Table tags....)

Everything seems to have a key, except the first one - except that does have all appropriate fields indexed (either as index or primary).

I shall investigate further...

notabenem’s picture

Do you create your nodes as categories????
I can't imagine having more than a couple of dozen categories in a website. Why don't you create them as page/story/blog or any other special node, just not category???

Besides: have you installed ActiveSelect? If not, try it. It might help you a bit in distributing the load into several async queries.

nicholasthompson’s picture

I'll give ActiveSelect a go. Could be good!

The reason we have so many is the site has to categorise 24,000 news items (which grows by 5 a day, so about 1500 a year). I'm currently migrating a site (http://www.sportbusiness.com) from an old CMS into Drupal.

Each news item has its own CCK type and is assigned to categories which are split into 3 containers (Country, Sport and Sector). In total there are approximately 350 categories, about half of which are countries, almost half are sport and about 8 are sectors. Ideally I'd like to reduce that number of categories substantially as I think its OTT. I mean how many sport business related news event come from places like Christmas Island or Eritrea?! :-)

notabenem’s picture

What I don't get is, how is that category_node_get_categories returns 8000 categories if only 350 of them exist?

nicholasthompson’s picture

It doesn't just get a list of categories...

Its SOMETHING like a list of categories and the nodes within them...
The query is...
SELECT n.nid, r.nid AS node_id, c.*, n.title FROM {category} c INNER JOIN {category_node} r ON c.cid = r.cid INNER JOIN {category} cn ON c.cnid = cn.cid INNER JOIN {node} n ON c.cid = n.nid INNER JOIN {node} cnn ON cn.cid = cnn.nid WHERE n.status = 1 AND n.moderate = 0 ORDER BY cn.weight, cnn.title, c.weight, n.title

So we get a list of categories, then join in the nodes (makes sense) but then we join category back in again but on cnid rather than cid... I assume thats maybe parent or container category?

Whatever it is, I just tried a full import of all 24,00 articles + 350 cats and drupal dies. I'm gonna go ahead and blame categories again ;-)

This time I dont think its menu based - i think there is a HUGE inneficiency in the category module somewhere... Some repeated hitting on the database maybe?!

I know when I get to 24,000 articles + 350 cats I cant load a news item up... I run out of memory. I'd guess that this function trying to create an Über array (EVERY DAMN TIME!)

notabenem’s picture

I can't think of anything else atm.