When selecting 'Bulk update node paths' under 'Node path settings', I get the following error. Of course, this is for a very large site. Maybe it would be useful to have it use the cron, limit its scope somehow, or do something more graceful than this. I assume that if I run this several times, it will eventually get all the nodes; I've run it three times so far, and it's still doing this.

Fatal error: Maximum execution time of 30 seconds exceeded in /home/vhosts/staging.tpmcafe.com/public_html/modules/pathauto/pathauto.module on line 376

Obviously, the timeout happens at whatever line execution has reached at that moment, not necessarily line 376.

Comment | File | Size | Author
#22 | bulkupdate.patch | 2.45 KB | gerhard killesreiter

Comments

lambert-1’s picture

This sounds like a PHP issue to me. Check your php.ini file for max_execution_time --

http://www.php.net/manual/en/ini.php#ini.list

aaron’s picture

Not sure that it is that. The search cron does just this, to prevent timeouts. (You can configure that at admin/settings/search.) If you have a large site with tens of thousands of posts (like the one I just installed pathauto on), you will get timeouts, because it simply takes a very long time to do it all.

Alternatively, I suppose I could set max_execution_time to 10 hours just to make sure... ;)

greggles’s picture

Category: bug » feature

Yeah, so I'd say this is a feature request to have pathauto model the bulk feed generation on the search index creation.

Personally this seems pretty low priority because it doesn't seem like you'd need it very often. Aaron - if you can provide a patch it's much more likely to get fixed. Otherwise, it will probably wait until someone else really needs it.

Aaron - how many nodes are in this site you are describing? And is it a dedicated or shared server? What specs?

aaron’s picture

Assigned: Unassigned » aaron
Priority: Normal » Minor

yeah, i'm going to work on it, because we have at least one large site that needs it, and probably more will follow as they see other sites using it. i'll post something when i get around to it. also a lower priority for us right now, but probably higher than for most other folks.

greggles’s picture

Just a note that I marked http://drupal.org/node/26550 as a duplicate of this issue.

greggles’s picture

@aaron - any progress?

Also, note that there is another issue that suggests a change to the queries to get better performance:
http://drupal.org/node/76172

Thanks.

acidcortex’s picture

Everything is explained at http://drupal.org/node/76172. Check it out, and patch it if you like.

I have a lot of other modifications for Drupal, but I can't post them before December; I need to work on some other stuff.

steph (acidcorteX)

greggles’s picture

Title: Time out with bulk update on large sites » perform bulkupdate in configurable sized chunks to improve scalability

@acidcortex - thanks for the link, but I see these as separate issues.

At some point it's possible to create a site that is so big it will break pathauto. Improving performance doesn't change the fact that the bulkupdate feature doesn't currently scale well to large sites.

So, the workarounds are things like:

  • increase max execution php/apache time in your server configuration
  • run the bulkupdate when the server is relatively quiet (so it isn't contending for resources with other processes)
  • try to improve the performance of your database
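The first workaround can be sketched concretely; the values below are illustrative examples, not recommendations:

```php
<?php
// Workaround sketch: raise PHP's execution limit for the bulk update
// request only, instead of editing php.ini globally. 600 seconds is an
// arbitrary example value.
ini_set('max_execution_time', 600);
set_time_limit(600);  // Also resets the running timer for this request.
```

The same effect can come from a `php_value max_execution_time 600` line in `.htaccess` or a `max_execution_time` setting in `php.ini`.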

Those are all workarounds; the real fix is making pathauto work in configurable sized chunks (so an admin can enter 1000, 2000, or 10,000) so that the bulk update is guaranteed to finish after several executions, whether the site is on amazing hardware or a shared server.

@aaron - if I've mis-stated your intention then feel free to restate what you see as the goal. This seems like a relatively difficult task because you need

  1. a guess at the number of updates that are going to succeed
  2. some means of keeping track of which aliases have been created and which still need to be done

It's the second item that's tougher, in my mind, but perhaps someone has a brilliantly simple idea on that point.

aaron’s picture

1. a guess at the number of updates that are going to succeed

my suggestion would be to have this be a user-configurable number, a la search. some sites will be able to handle larger batches than others, based on server memory, bandwidth usage, etc. maybe offer 100/500/1000/2500/5000/10000 or something.

2. some means of keeping track of which aliases have been created and which need to be done

use a variable to store the last node->nid processed, and on the next cron process the next nn nodes, marking the last node->nid for the next cron run. loop when we reach the end.

here's a brainstorm to get things rolling:

$process_limit = variable_get('pathauto_process_limit', 1000);
$last_nid = variable_get('pathauto_last_processed_nid', 0);
$count = 0;
$last_processed = 0;
$sql = "SELECT nid FROM {node} WHERE nid > %d ORDER BY nid LIMIT %d";
$results = db_query(db_rewrite_sql($sql), $last_nid, $process_limit);
while ($result = db_fetch_object($results)) {
  $count++;
  $last_processed = $result->nid;
  // Path processing code goes here.
}
// Loop back to zero for our next cron run.
if ($count < $process_limit) {
  $last_processed = 0;
}
variable_set('pathauto_last_processed_nid', $last_processed);
watchdog('pathauto bulk update', t('Processed %count paths, ending on %last.', array('%count' => $count, '%last' => $last_processed)));

aaron’s picture

obviously the prior solution needs work. this will just loop forever, reprocessing nn nodes every hour or so. we'll just want to add a flag in there to tell us we're done (in the loop-back loop), and when we submit for a new bulk update, we'll unset the flag and start our count back at zero. i think the only problem we'll have is if the server times out, in which case it will just keep starting over. in that case, we'll also want a pre-processing watchdog alert, so the admin can check to see if bulkupdates are starting but not completing. (maybe the alert can even have a built-in suggestion to set the limit lower if it's not completing).
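The chunk-and-flag bookkeeping described above can be sketched in a framework-free way. In this sketch the function name is hypothetical, a plain `$state` array stands in for Drupal's `variable_get`/`variable_set`, and `$nids` stands in for the node table query:

```php
<?php
// Simulate the chunked bulkupdate bookkeeping: process at most $limit
// nids per "cron run", remember the last nid handled, and set a
// completion flag once a run comes up short (i.e. we reached the end).
function bulkupdate_run($nids, &$state, $limit) {
  if (!empty($state['pathauto_bulkupdate_done'])) {
    return array();  // Done; wait until a new bulk update resets the flag.
  }
  $last = isset($state['pathauto_last_processed_nid']) ? $state['pathauto_last_processed_nid'] : 0;
  $batch = array();
  // Stand-in for: SELECT nid FROM {node} WHERE nid > %d ORDER BY nid LIMIT %d
  foreach ($nids as $nid) {
    if ($nid > $last && count($batch) < $limit) {
      $batch[] = $nid;
    }
  }
  foreach ($batch as $nid) {
    // Alias creation for $nid would happen here.
    $state['pathauto_last_processed_nid'] = $nid;
  }
  if (count($batch) < $limit) {
    $state['pathauto_bulkupdate_done'] = TRUE;  // Short batch: we finished.
    $state['pathauto_last_processed_nid'] = 0;  // Reset for the next bulk run.
  }
  return $batch;
}

$state = array();
$nids = range(1, 7);
print implode(',', bulkupdate_run($nids, $state, 3)) . "\n";  // 1,2,3
print implode(',', bulkupdate_run($nids, $state, 3)) . "\n";  // 4,5,6
print implode(',', bulkupdate_run($nids, $state, 3)) . "\n";  // 7
print implode(',', bulkupdate_run($nids, $state, 3)) . "\n";  // (empty: done)
```

The fourth call returns nothing because the done flag is set, which is exactly the "keeps starting over" failure mode the flag is meant to prevent.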

greggles’s picture

Yes - great ideas. Keep them coming!

One comment: the ID that we store will have to work for users, nodes, taxonomy, etc. Are there any other kinds of objects, now or in the foreseeable future, that it should be aware of?

chrisschaub’s picture

Hi. Is there any status on this issue? It's pretty important, and I'd like to help if needed.

greggles’s picture

I'm not working on it. Much of this work really needs to happen in core, so Drupal6 would be the timeframe to do that.

chrisschaub’s picture

Is pathauto going to be moved into core?

greggles’s picture

schaub123 - if you want to discuss random parts of pathauto please don't do it in the issue queue. The proper place to do that is in the group for paths discussion: http://groups.drupal.org/paths

chrisschaub’s picture

Ok, fair enough, sorry about that. But maybe this should be marked as "won't fix" since I thought it was an open issue. Otherwise I would have looked elsewhere. Thanks.

greggles’s picture

schaub123 - the issue is open regardless of whether or not I (or anyone else) am currently working on it.

I would only mark it "won't fix" if it was something that should never be fixed.

moshe weitzman’s picture

I recently did a bulk update on a client site and it crashed mysql because of too much memory or too many queries or something. So, I have two suggestions:

- $GLOBALS['conf']['dev_query'] = FALSE will temporarily disable devel.module query logging. That consumes lots of memory for no benefit.
- use the batch operations patch when it lands (very soon): http://drupal.org/node/127539
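The first suggestion could be wrapped up like this; the function name is hypothetical, and only the `$GLOBALS['conf']['dev_query']` flag itself comes from the comment above:

```php
<?php
// Sketch: temporarily disable devel.module's query log (which buffers
// every query in memory for no benefit during a bulk run) around a
// long-running callback, then restore the previous setting.
function with_dev_query_disabled($callback) {
  $saved = isset($GLOBALS['conf']['dev_query']) ? $GLOBALS['conf']['dev_query'] : FALSE;
  $GLOBALS['conf']['dev_query'] = FALSE;
  $result = call_user_func($callback);
  $GLOBALS['conf']['dev_query'] = $saved;  // Restore the site's normal setting.
  return $result;
}
```

Restoring the saved value afterwards means the admin's devel settings are untouched outside the bulk operation.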

greggles’s picture

I like the idea of optionally disabling the dev query log - mostly because it's easy to implement.

There's still a lot of hard work to be done to get this to work in configurable chunks even after the progress operations patch lands (which I'm not really interested in working on, personally).

@moshe - if you can try to reproduce the problem to see exactly what fell over and where, it would be interesting. There is a change in the call to node_load so that it resets the cache each time. If you were doing bulk operations on nodes then you can quite easily exhaust php memory if you are using pathauto from before that change went in.

moshe weitzman’s picture

sorry, i moved on and can't put more time into identifying the problem. i was using 1.44.2.5 2007/01/20 23:26:2. the update in question was an aliasing of terms. most terms were in a freetag vocab.

litwol’s picture

Assigned: aaron » litwol

subscribing.

gerhard killesreiter’s picture

Status: Active » Needs review
Status: new
File size: 2.45 KB

Here's a patch that splits node updates into batches of 100.

greggles’s picture

Version: 6.x-1.x-dev » 5.x-2.x-dev
Assigned: litwol » greggles
Priority: Minor » Normal
Status: Needs review » Fixed

This is now fixed - thanks Killes for the brainstorming. I didn't use the progress meter - just doing it in chunks that the user can specify. The default is 50. I'd love to get feedback about whether it should be 50 or 500 or 5000.

If someone wants to use the progress meter well that would of course be awesome, but at least this is now generally "fixed".

I also documented this (and some other features) at http://drupal.org/node/144904 in the handbook. I'd love a review or two there as well.

Anonymous’s picture

Status: Fixed » Closed (fixed)