I've run into a problem using CCK with a large dataset. I have 13,000+ database entries, and a problem occurs whenever the database schema is updated (e.g. when I create a new instance of a field in a different content type, the field_? column is removed from the original table and a content_field_? table is created and populated). If I try to revert the change (remove the newly created instance so that the original field_? column is recreated and repopulated), the database times out because some of the queries take more than 30 seconds to run.

The underlying problem is that data integrity is lost if the timeout occurs: my old table isn't fully populated with the data it needs to continue functioning. Thankfully, the data isn't removed from the content_field_? table, since the timeout happens before the table drop. Still, this incomplete result is certainly not desirable.

While I certainly could change the timeout setting at the PHP level, it would be nice if the schema-altering process could temporarily change the timeout to something longer than 30 seconds so that data integrity can be maintained.

Thanks for any help here!

Comments

yched’s picture

"temporarily change the timeout to something longer than 30 seconds"
At first look, I don't think this is a clean / reasonable fix:
- Some hosting configurations won't allow this.
- I'm not sure the PHP timeout setting is meant for the code itself to decide when to fiddle with it.
- What value would we use anyway? 30s? Why? Any value will eventually prove insufficient at some point...

I think operations on large datasets are generally a delicate issue within CCK, or even Drupal itself.
Any ideas welcome...

moshe weitzman’s picture

The only solution we have for this is what update.php does: spread processing over multiple requests.
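
For reference, a minimal sketch of that multipass pattern under the D5/D6 update system, where an update function returning '#finished' below 1 gets called again in a fresh request. All function, table, and variable names here are hypothetical, not actual CCK code:

function content_update_5100() {
  $ret = array();
  // Hypothetical bookkeeping: remember where the previous request stopped.
  $last = variable_get('content_update_5100_last_nid', 0);
  $result = db_query_range('SELECT nid FROM {content_field_x} WHERE nid > %d ORDER BY nid', $last, 0, 200);
  while ($row = db_fetch_object($result)) {
    // ... migrate this row's data ...
    $last = $row->nid;
  }
  variable_set('content_update_5100_last_nid', $last);
  $total = db_result(db_query('SELECT COUNT(*) FROM {content_field_x}'));
  $done = db_result(db_query('SELECT COUNT(*) FROM {content_field_x} WHERE nid <= %d', $last));
  // Anything below 1 makes update.php invoke this function again.
  $ret['#finished'] = $total ? $done / $total : 1;
  return $ret;
}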

Also note that doing many node_load() operations in one request is very memory-consuming; use the syntax at http://drupal.org/node/123705. Not sure if that applies here.
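
(A sketch of one memory-friendly loading pattern, assuming the D5/D6 node_load() signature whose third parameter resets its static cache; this may or may not be the exact syntax the linked page describes, and the content type name is made up:)

$result = db_query("SELECT nid FROM {node} WHERE type = '%s'", 'my_type');
while ($row = db_fetch_object($result)) {
  // Passing TRUE as the third argument clears node_load()'s static cache,
  // so already-processed nodes can be garbage-collected.
  $node = node_load($row->nid, NULL, TRUE);
  // ... process $node ...
}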

yched’s picture

Yes, this was also raised in this issue: http://drupal.org/node/97861

I started some work a while ago on a general 'à la update.php' progressive-operations API, but never really got it polished, and I haven't had much time lately to get back to it...

karens’s picture

yched, why don't you post what you have in case someone else has time to work on this and move it forward. There are a number of places where this will be needed as we have more and more installations using thousands of CCK nodes.

yched’s picture

Yes, I should probably do that - it does require some minimal polishing / cleaning / explanation first, though.
Will try to do that ASAP.

yched’s picture

Work in progress submitted at http://drupal.org/node/127539

nicolash’s picture

This might well be a different scenario, but I was able to simply extend the time limit via PHP and Drupal was grinding away happily for however long it was set (in seconds).

set_time_limit(500);

This might not be allowed in a controlled environment, I guess.

yched’s picture

That's right: after I wrote "I'm not sure the PHP timeout setting is meant for the code itself to decide when to fiddle with it", I realized that Drupal core actually uses set_time_limit(240) in some places.

The code is wrapped in if (!ini_get('safe_mode')) {, so it won't run on most shared environments.
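
The core pattern looks roughly like this:

// set_time_limit() is disabled under PHP safe mode, so the time limit
// is only extended when safe mode is off.
if (!ini_get('safe_mode')) {
  set_time_limit(240);
}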

NaX’s picture

I don’t know what has been discussed in the past about things like this, or what the current implementation is, and many of you probably know about this already, but here are some of my experiences.

I ran into this same problem a while back with a bulk import feature. A very large amount of data was being imported from CSV files, and the imports were taking very long, up to 5 minutes sometimes. So I started looking at ways to improve this, and what I found was that if you reduce the number of queries, the time it takes decreases dramatically. In the end I got it down to under 20 seconds.

You do this by batching many rows into a single INSERT instead of executing the queries one at a time.

Example:

INSERT INTO x (a,b)
VALUES
 ('1', 'one'),
 ('2', 'two'),
 ('3', 'three');

This worked very well. I did run into a small problem, though, with MySQL's bulk_insert_buffer_size.

The way I got around this was by building many smaller bulk queries, flushing each one before it grows too large. (The sketch below assumes the rows come from an open CSV file handle, $handle.)

Example:

$loop_count = 0;
$sql = 'INSERT INTO {x} (a,b) VALUES ';
while ($row = fgetcsv($handle)) {
  // Escape raw values before interpolating them into the statement.
  $a = db_escape_string($row[0]);
  $b = db_escape_string($row[1]);

  if ($loop_count == 0) {
    $sql .= "('$a', '$b')";
  }
  else {
    $sql .= ", ('$a', '$b')";
  }
  $loop_count++;

  // Flush before the statement outgrows bulk_insert_buffer_size (32KB here).
  if (strlen($sql) >= 32 * 1024) {
    db_query($sql);
    $sql = 'INSERT INTO {x} (a,b) VALUES ';
    $loop_count = 0;
  }
}

// Flush the final partial batch, if any.
if ($loop_count > 0) {
  db_query($sql);
}

Now, I don’t know if this is part of the SQL standard or even possible in other databases, but it worked fine on a MySQL 4 installation.

The other option I have used when moving large amounts of data from one table to another is INSERT ... SELECT. Again, I don't know about other databases, but it also works really well.
Here are the docs for it: http://dev.mysql.com/doc/refman/5.0/en/insert-select.html

Example:

INSERT INTO x (a,b)
SELECT c, d FROM z;

Hope that helps.

yched’s picture

Title: Large dataset database timeout » use batch API
Version: 5.x-1.x-dev » 6.x-1.x-dev
Category: feature » task

The Batch API that is now in D6 should let us implement a workaround for these situations, so I'm turning this thread into a task.
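
(For illustration, a minimal sketch of the shape this could take with the D6 Batch API; every function and table name below is hypothetical:)

function content_migrate_field() {
  $batch = array(
    'title' => t('Migrating field data'),
    'operations' => array(
      array('content_migrate_field_step', array()),
    ),
    // Optional completion callback.
    'finished' => 'content_migrate_field_finished',
  );
  batch_set($batch);
  // Process the batch, then redirect.
  batch_process('admin/content/types');
}

// Batch API re-runs this in fresh requests until $context['finished'] hits 1.
function content_migrate_field_step(&$context) {
  if (!isset($context['sandbox']['progress'])) {
    $context['sandbox']['progress'] = 0;
    $context['sandbox']['current_nid'] = 0;
    $context['sandbox']['max'] = db_result(db_query('SELECT COUNT(*) FROM {content_field_x}'));
  }
  // Handle a small chunk of rows per HTTP request.
  $result = db_query_range('SELECT nid FROM {content_field_x} WHERE nid > %d ORDER BY nid', $context['sandbox']['current_nid'], 0, 50);
  while ($row = db_fetch_object($result)) {
    // ... move this row's data ...
    $context['sandbox']['current_nid'] = $row->nid;
    $context['sandbox']['progress']++;
  }
  $context['finished'] = $context['sandbox']['max'] ? $context['sandbox']['progress'] / $context['sandbox']['max'] : 1;
}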

karens’s picture

Status: Active » Fixed

This is now in HEAD. It still needs tweaking and debugging, but it's there.

Anonymous’s picture

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.