Allow batch imports using cron

cosmos00 - November 11, 2007 - 08:34
Project:Node import
Version:5.x-1.6
Component:Code
Category:feature request
Priority:normal
Assigned:Unassigned
Status:patch (code needs work)
Description

I have 54 csv files that have 700,000 records. The smallest file is 5MB and the largest is 35MB. I'm trying to import the 5MB file (which has about 10,000 rows of data) and Drupal crashes with page no found message. With the default settings it would import about 100 odd records.

I started with tweaking the php.ini file. With the following settings, I was able to import 3200 records out of the 10,000 odd records the file has, and then it crashed.

post_max_size = 40M
upload_max_filesize = 40M
memory_limit = 512M
max_execution_time = 1200
max_input_time = 1200

Have any of you imported files this big successfully with node_import? I would appreciate any advice or thoughts you have.

Thank you!

#1

germandrupaller - December 18, 2007 - 16:08

I get a fatal error while importing a ~ 4MB CSV

maximum execution time exceeded ... in pathauto on line ...

any solutions?

#2

philippejadin - January 10, 2008 - 10:37
Title:Very Large CSV does not import» Allow batch imports using cron
Category:support request» feature request

The only reliable way to do this would be to batch import nodes.

Node import could import a limited number of nodes on each cron run. User import does it : http://drupal.org/project/user_import it could give some inspiration on how to do it.

A setting page could be added before or after the preview step :

- Import [xx] nodes at a time using cron (each time cron is run, a batch of nodes will be imported, use this to avoid php timeout / memory limits)

#3

kingandy - January 11, 2008 - 10:29

I tend to manually split my import files into a workable number of rows. Table width appears to be a factor - I've had some narrow (few, small fields) files import more than 1000 rows at a time, but the one I'm working with at the moment (with about 50 fields, many of them large text beasts) will only do about 400 before timing out.

FWIW, of all the php settings posted above, I think the limiting factor will be max_execution_time - populating, validating and saving a node is a lengthy process, especially if there are many non-standard fields, so PHP just plain runs out of runtime before it's done. It would be possible to tweak the module to constantly extend its runtime (using the set_time_limit() function), though with a long processing time there is a risk that the browser would assume the request has failed and display a timeout message of its own.

With 700,000 rows, though, it might be worth taking a step back and looking at what you're trying to achieve. That would be - in my opinion - a huge number of nodes, you'd have a massive database and things could very well slow down. (In particular you'd want to avoid using pathauto or any kind of node cross-reference modules!) You might do better to keep the info in an external datasource, or at least in its own table, with some kind of custom code to fetch and display it ... I don't know.

That said - an automated batch import procedure would still be good for the module!

#4

philippejadin - January 11, 2008 - 14:36

I'm working on it, I hope to have a patch somewhat soon

#5

philippejadin - January 11, 2008 - 16:06

In order to complete this batch import thing, some things must be changed in the module.

I'd like to extend the node_import_mappings table.

I think the table should contain the filename of the file to be imported, the total number of rows in this file, the number of rows already imported, and if batch import is enabled.

This way, a cron hook could be created. It would check if there are curently files needed to be imported, check if some rows are not yet imported and how many rows can be imported in this cron run.

Using this, it would be easy to launch multiple import jobs, and see where we are using a status page.

Does it looks like a good battle plan? If it is, I will implement this. If not please give advices

#6

germandrupaller - January 14, 2008 - 10:00

sounds good to me.

#7

philippejadin - January 25, 2008 - 10:22
Assigned to:Anonymous» philippejadin

I'm banging my head on this one :-)

It's really complicated to decouple the existing code to allow importing per batch. The _node_import_get_nodes function has not been designed to allow this (and the rest of the module is not very batch import friendly)

This means I'll have to change a lot of the module code. I'd really apreciate suggestions from others who have deep knowledge of the decisions made in the existing module.

#8

earnie - January 25, 2008 - 20:18

If you use a cron hook then you should use that to spawn a process that continues to run in the background. Batching this type of thing takes a long time. If you're interested I can share a function to spawn a wget that continues to run in the background. You may want to have methods to check if the spawned process is continuing to run, has aborted, can restart, etc. There is nothing worse than getting partially through the file without some method of advancing to the point of the abort.

#9

philippejadin - February 1, 2008 - 10:07

The logic found in http://drupal.org/project/job_queue could be used as well. It checks if the time elapsed for this cron run is not more than half of the total amount of time available for one cron run.

#10

Robrecht Jacques - February 1, 2008 - 16:17

Node_import would need some rewriting to allow for this:

1. Create a table node_import_jobs which stores jobid, matches, globals, type, uid, ... (extending the node_import_mappings - I would put it in another table because of the globals may contain stuff not intended for all uid). So after you completed the form, one row would be added.

2. Create a table node_import_rows which stores jobid, row_number, row_data, ... After the form completes, you would completely read the file and store each row (without importing). There should be a column for "success" (NULL if not yet tried, nid if success, 0 if fails) and a "errors" column with the errors if it fails.

3. _node_import_get_nodes() would be rewritten to get the rows from the node_import_rows table instead of from the file directly.

4. Probably even better is to write a _node_import_get_node() which does the bulk of the work like in _node_import_get_nodes() but only imports one node. _node_import_get_nodes() would then call _node_import_get_node() a number of times. Something like:

<?php
function _node_import_get_node(matches, globals, row_data, uid) {}
?>

or maybe
<?php
function _node_import_get_node(jobid, rowid){}
?>

5. The end of the wizard form would redirect to a page that does the import with a processbar, calling a callback that does X nodes a time. If the user goes away from the form, the rest of the nodes would be imported by cron.

6. One needs to be carefull about not doing a node twice, eg once because the user is looking at the form (which calls the callback using js) and once because cron just happens to run on the same time. Probably you need to put some extra column in node_import_rows that tells it that the import has started? I dunno how to solve this race problem nicely.

7. There would need to be a page which shows a listing of all jobs, with a X of Y rows imported, Z errors status. Maybe an edit form to change the settings. Enable/disable jobs or enable/disable batch of jobs. Once a job is completed, there would be a delete operation. Maybe even allow the user to delete the imported nodes (because of too many errors). So the rows would stay in node_import_rows table until a job is deleted.

If you provide some patches, I'm willing to review them carefully.

#11

gagarine - February 6, 2008 - 13:16

subscribing

#12

philippejadin - February 6, 2008 - 13:27

Thank you for all the directions.

1. ok

2. I think we can simply store a pointer to where we are in the file. Using php fseek() is fast and (in theory) infenitely scalable. Importing in a temporary table would mean importing two times, and would still be endangered by php timeout with huge csv files. And I think the logic to get only part of the source material (in this case a csv file) should be abstracted in order to allow other import formats later.
That's how user import does it (fseek) and I think it's the right way to go at least with huge csv files.

3. or use the file pointer to know were to resume?

4. ok

5. I don't know if it's easy to do this kind of progress bar in Drupal 5. I think it's built in D6 though

This represents quite a lot of work, I don't expect this to be ready all in once.

#13

Robrecht Jacques - February 6, 2008 - 14:35

2. You probably still will need some place to store rows with errors though. That's why I was thinking about a rows-table. But you are right that that too may take some time (although a lot less then creating a node). fseek() would be fine too.

5. If you don't do a progress bar, then you don't have the conflict between cron() and viewing the page with the progress bar, so that solves 6. Still it would be nice if somewhere something is said about how far we are. Well, that's not critical.

It's a lot of work, but you can do it step by step and gradually convert to batch/cron enabled imports. So a lot of small patches instead of one big one. That will make it easier to review it too.

#14

philippejadin - February 22, 2008 - 08:56
Assigned to:philippejadin» Anonymous

Our project takes longer than expected, I'm not sure I'll be able to provide a patch in a raisonnable timeframe.

#15

vacilando - February 25, 2008 - 18:43

Subscribing and... philippejadin, you really switched on the light of hope here! Please keep on, Drupal sorely needs the ability to import large batches. If you need testing help don't hesitate to ask.

#16

earnie - February 25, 2008 - 21:03

For rays of hope, I'm processing many large files some more than 100K rows from a UNIX cron process, not a cron hook. I store the rows in a separate file and treat that processing as a third party application. I converted all tables to InnoDB for transactional control. I have added some indexes based on watching the queries and I have removed some indexes based on the fact that the same column is indexed more than once. I have another process that takes the control data table and creates the nodes using the node functions. I have pathauto turned on for SEO paths and each node has a minimum of four taxonomy terms and could have many more. This processing happens at night while I am asleep. My server has 4 CPU and 128M of RAM available.

I recently added some new files to process. One of the issues this brought forth was the search indexing updates. I've made a few modifications and have brought the number of nodes needing indexed from 124715 to 123215 in about four hours process 500 nodes at each cron run with each cron run scheduled every 5 minutes. I am trying to keep track of changes and be sure to submit what is needed. I have created a project to hopefully add this work to the community.

#17

vacilando - February 28, 2008 - 10:44

earnie, this sounds very impressive - can you provide more details so we can try that as well? What project have you created around this - is it http://drupal.org/project/publisher and what exactly does it do?

#18

earnie - February 28, 2008 - 13:50

Yes that is the module. At the moment it isn't much. It will contain the administration of data providers, data files, data column mapping, file process timing and node process timing. I've created a library that uses Drupal API but I treat as a third party instance which requires a database entry in the settings.php file. I did this because I plan to offer the datafeed portions as a service for hire and as a retailed package. On top of the library I've developed an elaborate set of cron scripts. The publisher module will hook_cron but will process the scripts in background rather than causing cron.php to wait for the completion.

Creating the node in batch is simply creating the node object and passing that object into node_save which passes back the nid in the object passed in. You then update the control row with the nid so that if you later need to update it, deactivate it or delete it you can.

Recently, besides the search index issue (which has been resolved as to why I it wasn't processing properly), I've been trying to determine why when I hit edit on one of my nodes for my datafeed content type it would just give a WSOD. I finally determined that it is due to the shear number of vocabulary terms that are in the database and trying to be used for a list box. I spent days on that one and finally wrote a batch script to delete nodes by term name and vocabulary name. I'll add a node edit function to the publisher package that will not include the taxonomy in a list and I can flag the data control of the node being disabled or deleted.

BTW: The search index issue ended up being the search_index update routine which does an update, check for affected rows, and if zero does an insert. The update didn't update and the insert was giving a duplicate key. To resolve I changed INSERT to REPLACE.

Also:

  1. the admin/content/node also gives WSOD because of the same reason; the number of taxonomy terms causing the form list box to consume memory.
  2. the "Items to index per cron run" on admin/settings/search has a maximum of 500 nodes which is too small for me; I've got to look at increasing that number. I am thinking that I need to do the search index update at the time I create the node; I'll be looking into the search API as well to see if I can update the node as it is created.
  3. My datafeed process is currently considered version 0.1, is being used in production and I soon will be looking at going to version 0.2 which will enhance the file reading and cleanup of my library for API that was and is no longer used plus finish documenting what is left. I've found a few of my nodes to be incorrectly formed due to the fact that not all of the data for one row being read and the remainder of that row being treated as a new row. One would think that 50K is enough for one row but ...
  4. If you're interested in helping with publisher, I'm interested in helping you with your datafeed needs. Contact me if you're willing to help.

#19

Summit - March 5, 2008 - 12:55

Subscribing, batch import would be great! +2 for this! Greetings, Martijn

#20

earnie - March 5, 2008 - 19:53

CAUTION: Large datasets have interesting impacts. I've been debugging a Out Of Memory condition with one of my recent SELECT statements. The effect is that mysql_query buffers the entire result giving a MySql errno 2008 which is an out of memory error. There may be possible other methods to use, I'll be testing a few of those.

#21

latte - March 14, 2008 - 01:03

Subscribing! This is awesome!!!! :)

#22

earnie - March 21, 2008 - 12:40

CAUTION
Be sure the Download method on admin/settings/file-system is set to Public and not to Private; even if you have nothing to offer for downloads.

If Download method is set to Private it will degrade the system to not usable. I haven't debugged this yet but my understanding is that when set to Private the system will execute the bootstrap routines with each node_load and maybe more. Once I heard that news I reset my site to Public and my nightmarish headache went away. When set to Private my batched file loads and batched node creations were being killed by the host provider. Now my log files are intact and give good status and the site actually works when MySql is busy inserting data.

I hope this helps everyone who find this to resolve some slowness issue trying to batch the node creations. It was really making me tense.

#23

gthing - May 15, 2008 - 20:37

another vote for this feature!

#24

randomuser - May 20, 2008 - 04:28

subscribing

#25

mennonot - May 28, 2008 - 16:15

+1 subscribing

#26

vacilando - May 30, 2008 - 14:04
Version:5.x-1.2» 5.x-1.6
Status:active» patch (code needs work)

Well, I don't have a solution for the cron solution, but here's a fix that has solved all my worries. Just now, as a test, I have upload 99,131 two-column entries from a 13.5 MB file in one go, where normally I could not upload more than 2,600 :-)

What to do - in version 5.x-1.6, in function _node_import_get_nodes look up this line:
while (($row = $get_row($filepath)) && ($j++ < $preview || $preview == 0)) { (line 615 in this version)
and place this command in a line just after it:
set_time_limit(180);

This basically assures that the timeout of the running PHP process gets re-set with each row of your imported file.

After you start the import, your page will not show the progress, but you can watch it at your /admin/content/node or in your database directly.

Obviously, a cron import is most desirable (and I am glad so much good work is going into it) but if you need to do your large import now, you can - without having to upload first, without compressing it, without having to break it into smaller chunks, etc.

I suggest this be installed in the code (I haven't made a patch as the change is so incredibly simple).

Obviously, I cannot assure this will work on other than Linux servers - that needs to be tested.

Enjoy!

PS In case your input file is large, you may want to increase also your upload_max_filesize and post_max_size in php.ini to values exceeding the size of your file.

PPS In fact you could consider providing a setting for values of set_time_limit, upload_max_filesize and post_max_size in the module settings.

#27

kingandy - May 30, 2008 - 14:17

FWIW set_time_limit() is a standard PHP function, so should be available on any installation of Drupal.

Note however that this is apparently one of the things that gets disabled when PHP is running in 'Safe Mode'. (This is not, as I once thought, the opposite of 'unsafe mode', but is more like Windows' Safe Mode where it runs with the bare minimum of drivers and functionality - it's the mode you use when something has been causing crashes and you don't know what.) AFAIK Drupal shouldn't be run on Safe Mode PHP, and probably wouldn't even work properly anyway, but I think it's worth being aware of.

See http://www.php.net/set_time_limit for more details.

#28

earnie - May 30, 2008 - 15:18

RE: set_time_limit()

If you're looping through a file and inserting records you should make use of set_time_limit() in the loop. I use a default of 5 seconds but you may need more if you have a complex set of updates to the DB. Make use of the variable_get() function to allow you to easily change the time limit.

#29

zeezhao - June 14, 2008 - 10:21

subscribing. thanks.

#30

grah - July 11, 2008 - 22:30

nice work so far

subd.

#31

vacilando - August 21, 2008 - 16:42

Further to my temporary solution as seen in comment #26 above, when importing 130,000 entries recently I ran into problems related to ovewhelmed MySQL and consequently server running out of memory.

The solution was to build in monitoring of server load and let the process sleep if the load was higher than some preset level. For maximum load of 2, after line
while (($row = $get_row($filepath)) && ($j++ < $preview || $preview == 0)) {
I used this:

$timeoutt = 1800;
set_time_limit($timeoutt);
# Don't run the backup if server load is too high.
$maxload = 2;
$load = sys_getloadavg();
$sleep = 20;
if ( $load[0] > $maxload ) {
     sleep( $sleep );
     #echo "Busy server - sleep $sleep seconds<br>";
     set_time_limit($timeoutt);
   }

This way I was able to let the server do the import very slowly - in fact it took several days in total - but the process was totally unattended and reliable.

I recommend adding set_time_limit and perhaps also the load monitoring into the configuration settings of this module.

#32

mcneelycorp - August 29, 2008 - 18:17

+1 subscribing

#33

jshuell - September 8, 2008 - 02:39

+1 Subscribing, very interested in an unattended solution to this. We definitely need something along these lines

 
 

Drupal is a registered trademark of Dries Buytaert.