Allow batch imports using cron [#191197]

Comment #1

germandrupaller commented 18 December 2007 at 16:08

I get a fatal error while importing a ~ 4MB CSV

maximum execution time exceeded ... in pathauto on line ...

any solutions?

Log in or register to post comments

Comment #2

philippejadin commented 10 January 2008 at 10:37

Title:	Very Large CSV does not import	» Allow batch imports using cron
Category:	support	» feature

The only reliable way to do this would be to batch import nodes.

Node import could import a limited number of nodes on each cron run. User import does it : http://drupal.org/project/user_import it could give some inspiration on how to do it.

A setting page could be added before or after the preview step :

- Import [xx] nodes at a time using cron (each time cron is run, a batch of nodes will be imported, use this to avoid php timeout / memory limits)

Log in or register to post comments

Comment #3

kingandy

he/him

commented 11 January 2008 at 10:29

I tend to manually split my import files into a workable number of rows. Table width appears to be a factor - I've had some narrow (few, small fields) files import more than 1000 rows at a time, but the one I'm working with at the moment (with about 50 fields, many of them large text beasts) will only do about 400 before timing out.

FWIW, of all the php settings posted above, I think the limiting factor will be max_execution_time - populating, validating and saving a node is a lengthy process, especially if there are many non-standard fields, so PHP just plain runs out of runtime before it's done. It would be possible to tweak the module to constantly extend its runtime (using the set_time_limit() function), though with a long processing time there is a risk that the browser would assume the request has failed and display a timeout message of its own.

With 700,000 rows, though, it might be worth taking a step back and looking at what you're trying to achieve. That would be - in my opinion - a huge number of nodes, you'd have a massive database and things could very well slow down. (In particular you'd want to avoid using pathauto or any kind of node cross-reference modules!) You might do better to keep the info in an external datasource, or at least in its own table, with some kind of custom code to fetch and display it ... I don't know.

That said - an automated batch import procedure would still be good for the module!

Log in or register to post comments

Comment #4

philippejadin commented 11 January 2008 at 14:36

I'm working on it, I hope to have a patch somewhat soon

Log in or register to post comments

Comment #5

philippejadin commented 11 January 2008 at 16:06

In order to complete this batch import thing, some things must be changed in the module.

I'd like to extend the node_import_mappings table.

I think the table should contain the filename of the file to be imported, the total number of rows in this file, the number of rows already imported, and if batch import is enabled.

This way, a cron hook could be created. It would check if there are curently files needed to be imported, check if some rows are not yet imported and how many rows can be imported in this cron run.

Using this, it would be easy to launch multiple import jobs, and see where we are using a status page.

Does it looks like a good battle plan? If it is, I will implement this. If not please give advices

Log in or register to post comments

Comment #6

germandrupaller commented 14 January 2008 at 10:00

sounds good to me.

Log in or register to post comments

Comment #7

philippejadin commented 25 January 2008 at 10:22

Assigned:

Unassigned

» philippejadin

I'm banging my head on this one :-)

It's really complicated to decouple the existing code to allow importing per batch. The _node_import_get_nodes function has not been designed to allow this (and the rest of the module is not very batch import friendly)

This means I'll have to change a lot of the module code. I'd really apreciate suggestions from others who have deep knowledge of the decisions made in the existing module.

Log in or register to post comments

Comment #8

Anonymous (not verified) commented 25 January 2008 at 20:18

If you use a cron hook then you should use that to spawn a process that continues to run in the background. Batching this type of thing takes a long time. If you're interested I can share a function to spawn a wget that continues to run in the background. You may want to have methods to check if the spawned process is continuing to run, has aborted, can restart, etc. There is nothing worse than getting partially through the file without some method of advancing to the point of the abort.

Log in or register to post comments

Comment #9

philippejadin commented 1 February 2008 at 10:07

The logic found in http://drupal.org/project/job_queue could be used as well. It checks if the time elapsed for this cron run is not more than half of the total amount of time available for one cron run.

Log in or register to post comments

Comment #10

Robrecht Jacques commented 1 February 2008 at 16:17

Node_import would need some rewriting to allow for this:

1. Create a table node_import_jobs which stores jobid, matches, globals, type, uid, ... (extending the node_import_mappings - I would put it in another table because of the globals may contain stuff not intended for all uid). So after you completed the form, one row would be added.

2. Create a table node_import_rows which stores jobid, row_number, row_data, ... After the form completes, you would completely read the file and store each row (without importing). There should be a column for "success" (NULL if not yet tried, nid if success, 0 if fails) and a "errors" column with the errors if it fails.

3. _node_import_get_nodes() would be rewritten to get the rows from the node_import_rows table instead of from the file directly.

4. Probably even better is to write a _node_import_get_node() which does the bulk of the work like in _node_import_get_nodes() but only imports one node. _node_import_get_nodes() would then call _node_import_get_node() a number of times. Something like:

function _node_import_get_node(matches, globals, row_data, uid) {}

or maybe

function _node_import_get_node(jobid, rowid){}

5. The end of the wizard form would redirect to a page that does the import with a processbar, calling a callback that does X nodes a time. If the user goes away from the form, the rest of the nodes would be imported by cron.

6. One needs to be carefull about not doing a node twice, eg once because the user is looking at the form (which calls the callback using js) and once because cron just happens to run on the same time. Probably you need to put some extra column in node_import_rows that tells it that the import has started? I dunno how to solve this race problem nicely.

7. There would need to be a page which shows a listing of all jobs, with a X of Y rows imported, Z errors status. Maybe an edit form to change the settings. Enable/disable jobs or enable/disable batch of jobs. Once a job is completed, there would be a delete operation. Maybe even allow the user to delete the imported nodes (because of too many errors). So the rows would stay in node_import_rows table until a job is deleted.

If you provide some patches, I'm willing to review them carefully.

Log in or register to post comments

Comment #11

gagarine commented 6 February 2008 at 13:16

subscribing

Log in or register to post comments

Comment #12

philippejadin commented 6 February 2008 at 13:27

Thank you for all the directions.

1. ok

2. I think we can simply store a pointer to where we are in the file. Using php fseek() is fast and (in theory) infenitely scalable. Importing in a temporary table would mean importing two times, and would still be endangered by php timeout with huge csv files. And I think the logic to get only part of the source material (in this case a csv file) should be abstracted in order to allow other import formats later.
That's how user import does it (fseek) and I think it's the right way to go at least with huge csv files.

3. or use the file pointer to know were to resume?

4. ok

5. I don't know if it's easy to do this kind of progress bar in Drupal 5. I think it's built in D6 though

This represents quite a lot of work, I don't expect this to be ready all in once.

Log in or register to post comments

Comment #13

Robrecht Jacques commented 6 February 2008 at 14:35

2. You probably still will need some place to store rows with errors though. That's why I was thinking about a rows-table. But you are right that that too may take some time (although a lot less then creating a node). fseek() would be fine too.

5. If you don't do a progress bar, then you don't have the conflict between cron() and viewing the page with the progress bar, so that solves 6. Still it would be nice if somewhere something is said about how far we are. Well, that's not critical.

It's a lot of work, but you can do it step by step and gradually convert to batch/cron enabled imports. So a lot of small patches instead of one big one. That will make it easier to review it too.

Log in or register to post comments

Comment #14

philippejadin commented 22 February 2008 at 08:56

Assigned:

philippejadin

» Unassigned

Our project takes longer than expected, I'm not sure I'll be able to provide a patch in a raisonnable timeframe.

Log in or register to post comments

Comment #15

vacilando commented 25 February 2008 at 18:43

Subscribing and... philippejadin, you really switched on the light of hope here! Please keep on, Drupal sorely needs the ability to import large batches. If you need testing help don't hesitate to ask.

Log in or register to post comments

Comment #16

Anonymous (not verified) commented 25 February 2008 at 21:03

For rays of hope, I'm processing many large files some more than 100K rows from a UNIX cron process, not a cron hook. I store the rows in a separate file and treat that processing as a third party application. I converted all tables to InnoDB for transactional control. I have added some indexes based on watching the queries and I have removed some indexes based on the fact that the same column is indexed more than once. I have another process that takes the control data table and creates the nodes using the node functions. I have pathauto turned on for SEO paths and each node has a minimum of four taxonomy terms and could have many more. This processing happens at night while I am asleep. My server has 4 CPU and 128M of RAM available.

I recently added some new files to process. One of the issues this brought forth was the search indexing updates. I've made a few modifications and have brought the number of nodes needing indexed from 124715 to 123215 in about four hours process 500 nodes at each cron run with each cron run scheduled every 5 minutes. I am trying to keep track of changes and be sure to submit what is needed. I have created a project to hopefully add this work to the community.

Log in or register to post comments

Comment #17

vacilando commented 28 February 2008 at 10:44

earnie, this sounds very impressive - can you provide more details so we can try that as well? What project have you created around this - is it http://drupal.org/project/publisher and what exactly does it do?

Log in or register to post comments

Comment #18

Anonymous (not verified) commented 28 February 2008 at 13:50

Yes that is the module. At the moment it isn't much. It will contain the administration of data providers, data files, data column mapping, file process timing and node process timing. I've created a library that uses Drupal API but I treat as a third party instance which requires a database entry in the settings.php file. I did this because I plan to offer the datafeed portions as a service for hire and as a retailed package. On top of the library I've developed an elaborate set of cron scripts. The publisher module will hook_cron but will process the scripts in background rather than causing cron.php to wait for the completion.

Creating the node in batch is simply creating the node object and passing that object into node_save which passes back the nid in the object passed in. You then update the control row with the nid so that if you later need to update it, deactivate it or delete it you can.

Recently, besides the search index issue (which has been resolved as to why I it wasn't processing properly), I've been trying to determine why when I hit edit on one of my nodes for my datafeed content type it would just give a WSOD. I finally determined that it is due to the shear number of vocabulary terms that are in the database and trying to be used for a list box. I spent days on that one and finally wrote a batch script to delete nodes by term name and vocabulary name. I'll add a node edit function to the publisher package that will not include the taxonomy in a list and I can flag the data control of the node being disabled or deleted.

BTW: The search index issue ended up being the search_index update routine which does an update, check for affected rows, and if zero does an insert. The update didn't update and the insert was giving a duplicate key. To resolve I changed INSERT to REPLACE.

Also:

the admin/content/node also gives WSOD because of the same reason; the number of taxonomy terms causing the form list box to consume memory.
the "Items to index per cron run" on admin/settings/search has a maximum of 500 nodes which is too small for me; I've got to look at increasing that number. I am thinking that I need to do the search index update at the time I create the node; I'll be looking into the search API as well to see if I can update the node as it is created.
My datafeed process is currently considered version 0.1, is being used in production and I soon will be looking at going to version 0.2 which will enhance the file reading and cleanup of my library for API that was and is no longer used plus finish documenting what is left. I've found a few of my nodes to be incorrectly formed due to the fact that not all of the data for one row being read and the remainder of that row being treated as a new row. One would think that 50K is enough for one row but ...
If you're interested in helping with publisher, I'm interested in helping you with your datafeed needs. Contact me if you're willing to help.

Log in or register to post comments

Comment #19

summit commented 5 March 2008 at 12:55

Subscribing, batch import would be great! +2 for this! Greetings, Martijn

Log in or register to post comments

Comment #20

Anonymous (not verified) commented 5 March 2008 at 19:53

CAUTION: Large datasets have interesting impacts. I've been debugging a Out Of Memory condition with one of my recent SELECT statements. The effect is that mysql_query buffers the entire result giving a MySql errno 2008 which is an out of memory error. There may be possible other methods to use, I'll be testing a few of those.

Log in or register to post comments

Comment #21

latte commented 14 March 2008 at 01:03

Subscribing! This is awesome!!!! :)

Log in or register to post comments

Comment #22

Anonymous (not verified) commented 21 March 2008 at 12:40

CAUTION
Be sure the Download method on admin/settings/file-system is set to Public and not to Private; even if you have nothing to offer for downloads.

If Download method is set to Private it will degrade the system to not usable. I haven't debugged this yet but my understanding is that when set to Private the system will execute the bootstrap routines with each node_load and maybe more. Once I heard that news I reset my site to Public and my nightmarish headache went away. When set to Private my batched file loads and batched node creations were being killed by the host provider. Now my log files are intact and give good status and the site actually works when MySql is busy inserting data.

I hope this helps everyone who find this to resolve some slowness issue trying to batch the node creations. It was really making me tense.

Log in or register to post comments

Comment #23

gthing commented 15 May 2008 at 20:37

another vote for this feature!

Log in or register to post comments

Comment #24

randomuser commented 20 May 2008 at 04:28

subscribing

Log in or register to post comments

Comment #25

mennonot commented 28 May 2008 at 16:15

+1 subscribing

Log in or register to post comments

Comment #26

vacilando commented 30 May 2008 at 14:04

Version:	5.x-1.2	» 5.x-1.6
Status:	Active	» Needs work

Well, I don't have a solution for the cron solution, but here's a fix that has solved all my worries. Just now, as a test, I have upload 99,131 two-column entries from a 13.5 MB file in one go, where normally I could not upload more than 2,600 :-)

What to do - in version 5.x-1.6, in function _node_import_get_nodes look up this line:
while (($row = $get_row($filepath)) && ($j++ < $preview || $preview == 0)) { (line 615 in this version)
and place this command in a line just after it:
set_time_limit(180);

This basically assures that the timeout of the running PHP process gets re-set with each row of your imported file.

After you start the import, your page will not show the progress, but you can watch it at your /admin/content/node or in your database directly.

Obviously, a cron import is most desirable (and I am glad so much good work is going into it) but if you need to do your large import now, you can - without having to upload first, without compressing it, without having to break it into smaller chunks, etc.

I suggest this be installed in the code (I haven't made a patch as the change is so incredibly simple).

Obviously, I cannot assure this will work on other than Linux servers - that needs to be tested.

Enjoy!

PS In case your input file is large, you may want to increase also your upload_max_filesize and post_max_size in php.ini to values exceeding the size of your file.

PPS In fact you could consider providing a setting for values of set_time_limit, upload_max_filesize and post_max_size in the module settings.

Log in or register to post comments

Comment #27

kingandy

he/him

commented 30 May 2008 at 14:17

FWIW set_time_limit() is a standard PHP function, so should be available on any installation of Drupal.

Note however that this is apparently one of the things that gets disabled when PHP is running in 'Safe Mode'. (This is not, as I once thought, the opposite of 'unsafe mode', but is more like Windows' Safe Mode where it runs with the bare minimum of drivers and functionality - it's the mode you use when something has been causing crashes and you don't know what.) AFAIK Drupal shouldn't be run on Safe Mode PHP, and probably wouldn't even work properly anyway, but I think it's worth being aware of.

See http://www.php.net/set_time_limit for more details.

Log in or register to post comments

Comment #28

Anonymous (not verified) commented 30 May 2008 at 15:18

RE: set_time_limit()

If you're looping through a file and inserting records you should make use of set_time_limit() in the loop. I use a default of 5 seconds but you may need more if you have a complex set of updates to the DB. Make use of the variable_get() function to allow you to easily change the time limit.

Log in or register to post comments

Comment #29

zeezhao commented 14 June 2008 at 10:21

subscribing. thanks.

Log in or register to post comments

Comment #30

grah commented 11 July 2008 at 22:30

nice work so far

subd.

Log in or register to post comments

Comment #31

vacilando commented 21 August 2008 at 16:42

Further to my temporary solution as seen in comment #26 above, when importing 130,000 entries recently I ran into problems related to ovewhelmed MySQL and consequently server running out of memory.

The solution was to build in monitoring of server load and let the process sleep if the load was higher than some preset level. For maximum load of 2, after line
while (($row = $get_row($filepath)) && ($j++ < $preview || $preview == 0)) {
I used this:

$timeoutt = 1800;
set_time_limit($timeoutt);
# Don't run the backup if server load is too high.
$maxload = 2;
$load = sys_getloadavg();
$sleep = 20;
if ( $load[0] > $maxload ) {
     sleep( $sleep );
     #echo "Busy server - sleep $sleep seconds<br>";
     set_time_limit($timeoutt);
   }

This way I was able to let the server do the import very slowly - in fact it took several days in total - but the process was totally unattended and reliable.

I recommend adding set_time_limit and perhaps also the load monitoring into the configuration settings of this module.

Log in or register to post comments

Comment #32

elvis2 commented 29 August 2008 at 18:17

+1 subscribing

Log in or register to post comments

Comment #33

jshuell commented 8 September 2008 at 02:39

+1 Subscribing, very interested in an unattended solution to this. We definitely need something along these lines

Log in or register to post comments

Comment #34

Vic96 commented 4 January 2009 at 07:28

subscribing

Log in or register to post comments

Comment #35

asak commented 11 January 2009 at 10:56

This thread is getting better every day....

Log in or register to post comments

Comment #36

okday commented 25 March 2009 at 01:59

Is there any solution for the 6.x version?

thank you.

Log in or register to post comments

Comment #37

elvis2 commented 26 March 2009 at 09:02

Wake up everyone! Let's get this thread moving again :)

I came upon this topic again with a current project. After rethinking this through, I am wondering if the following makes sense. Make a small module to basically pickup and process the file you want to import as nodes then handoff parts of the file to node import... We would be using node_import as an API... Saying that...

First, create a mapping of the file. So go to node_import and begin the process to import the file. But, only have one row or so within the csv file. The goal here is to map the fields.

Now, we let our "new" module code do the rest...

Run cron every 5 minutes or so.

- load file
- check module settings to see how many rows to process
- modify each row (cleaning content etc) -- optional
- hook into node_import to find our previous mapped file
- give node import our xx rows (set in new module settings)
- save file (minus the rows inserted)

Next round on cron, do the same. Each time the file size gets smaller and server resources are reserved.

The other thought was to do part of this in mysql, where we load the entire file into a generic table, and on cron check that table. Upon each node insert we remove the table row from the generic table then move on to the next row. I personally think PHP and fwrite will be faster than querying the db, but have not done benchmarks...

Thoughts?

Log in or register to post comments

Comment #38

summit commented 26 March 2009 at 12:51

Hi,
When we go this way, why not make the module do import AND export.
Until now I do not have a solution to import/export lots of nodes with their taxonomy relations. For migrating from D5 to D6 this would be great to have!
So yes, get new energy in this thread and may be together we can find a great import/export nodes solution.
Thinking about this there was the importexportapi module (www.drupal.org/project/importexportapi )
May be integrating this with node_import/export is the way to move forward for D5/D6/D7?
Greetings,
Martijn

Log in or register to post comments

Comment #39

Anonymous (not verified) commented 26 March 2009 at 13:44

What is needed is a data mapping module that can then be used to create an RSS feed of the data so that feedAPI can handle the node creations. The data comes to us in many different shapes but RSS is a standard format that is already used to create and control the nodes. The data from what ever source can be mapped to a table containing a control field along with creation and update timestamps. The data itself would be a serialized text field of either an object (my preference) or an array. During the feedAPI cron run the data mapper would check for updates to feed based on the URL given by feedAPI. And the serialized data formatted into RSS based on the parameters described about the data.

I given it thought, I just haven't had time to spend with it. I also think we need a group to work the data mapper and get off the Node import issue queue to design this since it is way off topic. Someone get the group rolling, I can help with the design, testing and debugging.

Log in or register to post comments

Comment #40

Anonymous (not verified) commented 26 March 2009 at 17:54

I've applied for http://groups.drupal.org/data-mapping where we can discuss the details.

Log in or register to post comments

Comment #41

hutch commented 26 March 2009 at 18:28

I like the idea of doing it bit by bit on a cron
However I would not save the file, just keep track of the pointer and work from there next time. The input file should be left alone and when it is done the administrator should be informed and the file marked as done. That will reduce the resources required somewhat.

Log in or register to post comments

Comment #42

Anonymous (not verified) commented 26 March 2009 at 19:20

Speaking of resource use see the comments here http://groups.drupal.org/node/20597#comments.

Log in or register to post comments

Comment #43

frank ralf commented 26 March 2009 at 19:40

I think this project is really a worthwhile endeavour.

@ Martijn #38
Import / Export API doesn't seem to be supported anymore. However, there's some valuable information about this module, which is a bit hard to find:

http://greenash.net.au/posts/thoughts/import-export-api-final-thoughts (server seems to be down at the moment)

And the module itself comes with exhaustive documentation (quick start guide, API reference).

hth
Frank

Log in or register to post comments

Comment #44

elvis2 commented 27 March 2009 at 08:39

@earnie, isn't RSS importing limited? If we are importing data that is 20 columns wide, and has it's own content type (each file column has a cck field in the content type), then your solution would not work for this situation. Is that correct?

@earnie, I agree on the datamapping module. I think the node_import can help with this. The one problem I will have, for my own use, is mapping taxonomy terms. I run an ecommerce site and it is a real pain to map each product to the terms I want, as the distributor terms and my terms are not the same. In the past I did this within the code but I believe it should be more dynamic.

@hutch, what do you mean by pointer? A file?

Log in or register to post comments

Comment #45

hutch commented 27 March 2009 at 09:20

I just pulled the importexportapi from CVS, it is much more substantial than I expected and might well be a good basis to start from (or continue with).
Quite a lot of work though, needs to be ported to D6 for starters so that it can be used for D5 -> D6 moves which can be problematic.

As we are indulging in a bit of blue-sky thinking, it might be apposite to re-iterate some points about data:

CSV files are great for shifting data but not so good at expressing the structure they must go into, although if the first line contains field definitions then which table.field a given column is to go into can be expressed there. This might be enough for many instances.

XML expresses structure very well but is a bit of a beast to import from especially if the dataset is large. Although XML *can* be parsed one record at a time it is in my experience not easy, the existing php SAX parsers such as google's gc_xmlparser assume you want the whole dataset in one go, which it does very well BTW, once you've got the glue code done.

hth

Log in or register to post comments

Comment #46

frank ralf commented 27 March 2009 at 10:05

Another potential candidate might be JSON (JavaScript Object Notation).

"JSON: The Fat-Free Alternative to XML"
http://www.json.org/xml.html

There's already some Drupal code around for JSON processing:
http://drupal.org/search/apachesolr_search/json

hth
Frank

Log in or register to post comments

Comment #47

Anonymous (not verified) commented 27 March 2009 at 13:24

I've commented at http://groups.drupal.org/node/20597#comments in reply to recent comments here including a possible solution for @mcneelycorp taxonomy problems.

Log in or register to post comments

Comment #48

danieldd commented 1 October 2009 at 13:52

Hello, does anyone know how to implement the solution in #26 in Drupal 6.x?

Edit: Since posting this comment I have successfully used this module (6.x-1.0-rc4 version), without implementing any patches or changes, to upload over 40,000 CCK nodes at a time (15 MB). I had no problems whatsoever. This problem appears to have been addressed and resolved in the Drupal 6 version. I found the process took about an hour for every 10,000 nodes.

I am still unable to import node reference and date fields. Although, even with this drawback, this module has been a lifesaver.

Log in or register to post comments

Comment #49

frank ralf commented 1 October 2009 at 15:19

Version:	5.x-1.6	» 6.x-1.0-rc4
Status:	Needs work	» Postponed (maintainer needs more info)

@danieldd

Just changed the issue settings to reflect your findings. Thanks!

You might have a look at #528726: Pictures to general node import? for some more pointers to import related resources.

hth
Frank

Log in or register to post comments

Comment #50

danieldd commented 1 October 2009 at 20:00

Thanks.

Without wanting to go off topic- Re node reference fields, I just excluded this column from the node import. And made sure there was another field in the table that linked each record to the content type being referenced. I then updated the newly imported table with the correct node references through SQL.

With date fields I just imported as text. These are not critical to my project so this is not (so far) a problem.

Log in or register to post comments

Comment #51

dealancer commented 26 November 2009 at 07:18

Hi everybody,

I have created new module Node import via cron, it requires Node import module.
User or bot may append content to the existent files for tasks and then it would be processed in cron.

I hope it will be useful for you.

Log in or register to post comments

Allow batch imports using cron

Comments