ApacheSolr + NodeImport hitting a brick wall?

yountod - October 30, 2009 - 21:14

I've built a D6 + ApacheSolr + NodeImport server three times now, each more powerful than the last, in order to import about a half million nodes from a CSV file. I have everything running smoothly on one pretty powerful box. I seem to be able to bring in a few thousand records at a time without a hiccup, but any time I take a bigger bite of CSV records I get a total lockup.

More specifically, once I bring in about 20,000 rows the imports crash and cron stops running. If I clear the cron variables and flush the caches, cron runs once more but then locks up again. Additionally, I see the Apache2 process peg the CPU at 100%. I've let it run for days to no avail. I've also rebooted to see if it was just a rogue process, but it pegs again on first access. Whatever it's doing, it grinds the system to a halt. If I hit the web server from another station, a second Apache2 PID starts and also pegs the CPU.

Who's the culprit - D6? Apache2? MySQL? My main suspect is Solr optimizing the index or something, but it never seems to get over the hump. I got the same effect on a system with 4GB RAM as I did with 1.5GB, and both times I had approximately 20,000 newly-imported records not yet sent to Solr. Although it pegs the CPU, it's barely touching memory and disk. I noticed that the Solr index information page stops displaying index terms at 20,000 records - is this just a coincidence? I've read of people importing hundreds of thousands of records, perhaps millions; what invisible ceiling am I hitting, and how might I break through it?

...and I'm suspecting that

yountod - October 30, 2009 - 22:56

...and I'm suspecting that it's the import. I just figured out that I can delete the CSV file "behind the scenes" through FTP, which causes the in-progress import to fail, and then I can delete the import itself. After that, the Apache2 session doesn't lock up!!! So hmm... did I figure it out?

I guess that would explain why the cron job would pass the first time and fail the 2nd - something's re-initiating the in-progress import on the first cron pass?

Why is a failed & still-in-progress import seizing my system?

Hello me, I just thought I'd

yountod - November 3, 2009 - 22:01

Hello me, I just thought I'd update this issue. It seems to be corrupt or malformed input lines in the CSV sending this thing into orbit. I managed a kind of "resume" capability by deleting the offending record, re-uploading the file over the old one on the server, and refreshing the import progress screen. I can't tell why the import lines are broken, though. Sometimes an ampersand or a single quote will set it off; other times a line just seems to be "distasteful."
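For anyone hunting down the same kind of bad record, here is a minimal Python sketch that pre-checks a CSV before handing it to the importer. It is not part of node_import; the function name `find_suspect_rows` is made up for illustration, and it assumes a comma delimiter, a double-quote quote character, and a header row. It reports the first line that fails strict parsing, plus any record whose field count differs from the header.

```python
import csv
import io


def find_suspect_rows(csv_text, delimiter=",", quotechar='"'):
    """Return a list of (line_number, reason) for records that look malformed.

    Hypothetical helper: assumes the first record is a header and that every
    later record should have the same number of fields.
    """
    reader = csv.reader(
        io.StringIO(csv_text, newline=""),
        delimiter=delimiter,
        quotechar=quotechar,
        strict=True,  # raise csv.Error on bad quoting instead of guessing
    )
    problems = []
    try:
        header = next(reader)
    except StopIteration:
        return problems  # empty input, nothing to check
    expected = len(header)
    try:
        for row in reader:
            if not row:
                continue  # skip completely blank lines
            if len(row) != expected:
                problems.append(
                    (reader.line_num,
                     f"expected {expected} fields, found {len(row)}")
                )
    except csv.Error as exc:
        # Parsing cannot safely continue past a quoting error, so stop here.
        problems.append((reader.line_num, f"parse error: {exc}"))
    return problems
```

Because the scan stops at the first quoting error, it fits the fix-one-record-and-retry routine described above: fix the reported line, then run it again.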

Anyway, the troubleshooting continues.

Sincerely,

Me.

Hello again, me.

yountod - November 4, 2009 - 17:58

For other noobz out there:

Make SURE you are properly escaping your import files!!! My export from MySQL uses a backslash as the escape character, but the default on node_import is a double-quote, which sent my system into the nether reaches of the ether. If your imports are bombing, double-check this setting during the import definition screens!
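If changing the setting on the import screens isn't an option, another route is to rewrite the file so embedded quotes are doubled instead of backslash-escaped. A rough Python sketch, assuming comma-separated, double-quoted fields with `\"` for embedded quotes (the function name `reescape_csv` is made up for illustration, and your export may differ):

```python
import csv
import io


def reescape_csv(backslash_text):
    """Convert backslash-escaped CSV text to doubled-quote CSV text.

    Hypothetical helper: assumes comma delimiters and double-quoted fields.
    """
    src = io.StringIO(backslash_text, newline="")
    dst = io.StringIO(newline="")
    # Read using backslash as the escape character, with quote doubling off.
    reader = csv.reader(src, escapechar="\\", doublequote=False)
    # Write every field quoted; embedded quotes are doubled by default.
    writer = csv.writer(dst, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in reader:
        writer.writerow(row)
    return dst.getvalue()
```

For example, the input line `"it\"s","a & b"` comes out as `"it""s","a & b"`. Try it on a small sample of your own data first, since other backslash sequences in an export may need separate handling.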

It turned out to have nothing at all to do with ApacheSolr, the rock star mod. Node_Import is great too, don't get me wrong. As usual, it was more me vs. assumption/common-sense than it was a bug.

Did you start off with drupal

Jason Ruyle - November 19, 2009 - 21:48

Did you start off with drupal and export to CSV? I see you mention node_import. Does that module export as well? Did you do this so you could do bulk updates or start-overs?

------------------------------
i do stuff

 
 
