every day whenever the packaging scripts run (midnight and noon GMT), d.o slows to a crawl for about 5 minutes. :( i'm guessing it's just that the packaging script does a ton of DB queries and file system access all at once. we should figure out exactly what's pounding so hard, and try to optimize it.

Comment  File           Size       Author
#6       package.patch  967 bytes  merlinofchaos

Comments

kbahey’s picture

Start with logging the queries (devel module?) and then run explain against them to see what is going on.

Posting the info here will get many people looking at the slow queries and figuring where the bottleneck is.

Another approach is to fire up mtop when the script is running. That will tell you what is slow on the spot, and how long it takes.

It may be as simple as one or two indexes in the right place.

dww’s picture

there is only 1 real query, which differs depending on whether we're packaging official releases or development snapshots:

official release:

mysql> explain SELECT pp.uri, prn.nid, prn.tag, prn.version, c.directory, c.rid FROM project_release_nodes prn
INNER JOIN project_projects pp ON prn.pid = pp.nid INNER JOIN node np ON prn.pid = np.nid
INNER JOIN project_release_projects prp ON prp.nid = prn.pid INNER JOIN cvs_projects c ON prn.pid = c.nid
WHERE np.status = 1 AND prp.releases = 1 AND (prn.rebuild = 0) AND (prn.file_path = '')
ORDER BY pp.uri;
+-------+--------+-------------------------------------------+----------------------+---------+---------+------+----------------------------------------------+
| table | type   | possible_keys                             | key                  | key_len | ref     | rows | Extra                                        |
+-------+--------+-------------------------------------------+----------------------+---------+---------+------+----------------------------------------------+
| prp   | ALL    | PRIMARY,project_release_projects_releases | NULL                 |    NULL | NULL    |  981 | Using where; Using temporary; Using filesort |
| prn   | ref    | project_releases_pid                      | project_releases_pid |       4 | prp.nid |    3 | Using where                                  |
| np    | ref    | PRIMARY,node_status_type                  | PRIMARY              |       4 | prn.pid |    1 | Using where                                  |
| pp    | eq_ref | PRIMARY                                   | PRIMARY              |       4 | prn.pid |    1 |                                              |
| c     | eq_ref | PRIMARY                                   | PRIMARY              |       4 | prn.pid |    1 |                                              |
+-------+--------+-------------------------------------------+----------------------+---------+---------+------+----------------------------------------------+
5 rows in set (0.00 sec)

dev snapshots:

mysql> explain SELECT pp.uri, prn.nid, prn.tag, prn.version, c.directory, c.rid FROM project_release_nodes prn
INNER JOIN node nr ON prn.nid = nr.nid
INNER JOIN project_projects pp ON prn.pid = pp.nid INNER JOIN node np ON prn.pid = np.nid
INNER JOIN project_release_projects prp ON prp.nid = prn.pid INNER JOIN cvs_projects c ON prn.pid = c.nid
WHERE np.status = 1 AND prp.releases = 1 AND (prn.rebuild = 1) AND ((prn.file_path = '') OR (nr.status = 1))
ORDER BY pp.uri;
+-------+--------+-------------------------------------------+----------------------+---------+---------+------+----------------------------------------------+
| table | type   | possible_keys                             | key                  | key_len | ref     | rows | Extra                                        |
+-------+--------+-------------------------------------------+----------------------+---------+---------+------+----------------------------------------------+
| prp   | ALL    | PRIMARY,project_release_projects_releases | NULL                 |    NULL | NULL    |  981 | Using where; Using temporary; Using filesort |
| prn   | ref    | PRIMARY,project_releases_pid              | project_releases_pid |       4 | prp.nid |    3 | Using where                                  |
| pp    | eq_ref | PRIMARY                                   | PRIMARY              |       4 | prn.pid |    1 |                                              |
| np    | ref    | PRIMARY,node_status_type                  | PRIMARY              |       4 | prn.pid |    1 | Using where                                  |
| nr    | ref    | PRIMARY,node_status_type                  | PRIMARY              |       4 | prn.nid |    1 | Using where                                  |
| c     | eq_ref | PRIMARY                                   | PRIMARY              |       4 | prn.pid |    1 |                                              |
+-------+--------+-------------------------------------------+----------------------+---------+---------+------+----------------------------------------------+
6 rows in set (0.02 sec)

so, it's obviously not ideal that it's doing a filesort here. but, a little old 981-row table shouldn't be causing *that* much grief.
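one idea for later (totally untested; the index name and the file_path prefix length here are made up): give mysql a way to filter {project_release_nodes} on the WHERE columns directly, so it doesn't have to drive the whole join off a full scan of prp:

```sql
-- hypothetical composite index; whether it actually avoids the full
-- scan / filesort would need re-checking with EXPLAIN afterwards.
ALTER TABLE project_release_nodes
  ADD INDEX project_release_rebuild_path (rebuild, file_path(4));
```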

other than that, all it's doing is a few tiny updates once each release is packaged:

  db_query("UPDATE {project_release_nodes} SET file_path = '%s', file_hash = '%s', file_date = %d WHERE nid = %d", $file_path, $file_hash, $file_date, $nid);
  db_query("UPDATE {node} SET status = 1 WHERE nid = %d", $nid);

those can't possibly be a problem, since both {node} and {project_release_nodes} use nid as the primary key.

however, thinking about this more closely, perhaps the real problem is in the organization of the code itself. the basic pseudo code for the packaging script is:

$query = db_query("// Big monstrous query explained above");
while ($release = db_fetch_object($query)) {
  // figure out what kind of release this is
  // do a bunch of CVS and filesystem operations to package it
  // try to update DB to save results
}

so, we've got that big nasty query still "open" while we're doing all these other expensive operations, including these 2 updates to {node} and {project_release_nodes}. i was always assuming that the db abstraction layer pre-fetches the results from the DB, and that db_fetch_object() just grabs the next object out of the same in-RAM data structure, but perhaps that's all false. maybe the real problem is that the db connection/query is still in progress, so a bunch of tables are locked, etc, while we're in the outer loop deciding what to do. compounding the pain is that the work inside the loop is queuing up a bunch more locking UPDATEs. :(

am i full of crap, or might that be a big problem? ;) is the solution (aside from avoiding the filesort above) just grabbing all the releases out of the huge query into RAM, and then looping over those once the query completes?
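to be concrete, the "fetch everything first" version would look something like this (db_query() and db_fetch_object() are the real 4.7-era API; the rest is just the pseudo code from above):

```php
// sketch: drain the big SELECT into RAM first, so the read query is
// completely finished before any CVS / filesystem / UPDATE work starts.
$releases = array();
$result = db_query("// Big monstrous query explained above");
while ($release = db_fetch_object($result)) {
  $releases[] = $release;
}
// now the SELECT is fully consumed; do the expensive work per release.
foreach ($releases as $release) {
  // figure out what kind of release this is
  // do a bunch of CVS and filesystem operations to package it
  // try to update DB to save results
}
```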

killes@www.drop.org’s picture

I have not seen any sql related spike at about midnight.

How about simply doing a little sleep() between calls as a quick fix (if you think you need one)?

I can't comment on the last paragraph, but it would be strange if that were a problem.

kbahey’s picture

If the problem is some locking, then adding sleep() is only going to compound the problem.

I think caching the result of the big query and iterating through the array to do your stuff should eliminate that, provided there are no concurrency issues (what happens if something is updated after you cache? it may not be a problem, but ...).

Before going that route, just run the script from the command line, run mtop in another window, and top in yet another. Watch the running queries and how long they take in the former, and the us% and wa% in the latter.

merlinofchaos’s picture

Subscribing. I would like to maybe help on this this weekend.

merlinofchaos’s picture

Status  FileSize
new     967 bytes

Here's a patch that should see if having the query open is the problem.

dww’s picture

i just applied that patch to the copy running on d.o. we'll have to see if the packaging run in 3.25 hours is any faster as a result. ;)

Steven’s picture

If the scripts run on drupal2, shouldn't moving them to drupal3 help out with making the site more responsive too? Having mail and cvs be a bit tardy is not as bad as our main web site.

dww’s picture

yeah, i thought about that a while ago. here are the problems:

1) the packaging scripts are doing a CLI drupal_bootstrap(). therefore, they need access to the official drupal installation on d.o. we could use another copy of the source code, with the same DB settings, etc, but i'd be worried about things getting out of sync, etc. i suppose we could setup another automatic rsync mirroring like we do for drupal1...

2) the packaging scripts need to write to the files/projects directory, and drupal3 doesn't see that filesystem. well, they used to, at least. now that we're actually serving everything from ftp.osuosl, i guess this isn't technically true anymore. but, we'd have to change how osuosl is doing the rsync to pull the tarballs. there's also the problem that the project_release.module code is doing filesize() on the tarballs for display purposes (so, even though the download links point to ftp.osuosl, the file sizes we display are coming directly from files/projects/*). if we move things completely away from files/projects, we'd also have to compute the filesize in the packaging script and store that in the DB. we currently store the file date and md5 hash in the DB, so storing the size would certainly make sense, and would actually solve some other problems, too.
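storing the size would just be one more filesize() call and one more column in the existing UPDATE. a sketch (note: file_size is a hypothetical new column here, it would need a schema update first):

```php
// sketch: record the tarball size next to the existing file_date /
// file_hash values. file_size is a hypothetical new column in
// {project_release_nodes}.
$file_size = filesize($file_path);
db_query("UPDATE {project_release_nodes} SET file_path = '%s', file_hash = '%s', file_date = %d, file_size = %d WHERE nid = %d",
  $file_path, $file_hash, $file_date, $file_size, $nid);
```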

also, i'm still primarily concerned that it's the db server getting slammed, not necessarily drupal2 itself, that's causing the trouble. i could be wrong, and i suppose moving everything to drupal3 would be a way to test this theory, but the above 2 points involve a lot of work, and i'd like to have a little more confidence they'd actually make things better before going down this road...

dww’s picture

Version: 4.7.x-2.x-dev » 5.x-1.x-dev
Priority: Normal » Minor

today, Bdragon and I decided to look at this again. he asked "wtf are you doing all those watchdog() calls for", which is a very good question. ;) we were doing 2 watchdogs for *every* release. in the early days, this was useful to figure out what's going on and ensure proper behavior, but now, it's just excessive... and expensive.

so, i commented out those 2 calls, and this evening's run (~ 2 hours ago as of this writing) was *much* better:

  • it shaved 4 minutes off the total time of the packaging run (from ~23 minutes to 19)
  • instead of d.o being unresponsive for most of the 19 minutes, it was totally snappy the whole time, except for a brief period at the very beginning during the "query of doom", where the script finds all the release nodes that might need to be repackaged

so, this is clearly a big win, and fairly damning evidence about the cost of doing lots of watchdog() in a row on a busy site like d.o... since this was obviously a good move, i committed this to HEAD and installed it for real on d.o: http://drupal.org/cvs?commit=64865

so, we could probably still optimize the initial query to be less of a killer, which is why i'm not just marking this issue "fixed", but this watchdog() change is a *huge* improvement with very little effort. ;)

kbahey’s picture

This is good news. It confirms that there are real performance gains to be had from the new hook_watchdog and the syslog module.

Perhaps on drupal 6, d.o can be switched to using syslog entirely, and save the watchdog table write altogether.
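For reference, a minimal sketch of what a Drupal 6 hook_watchdog() implementation that skips the {watchdog} table could look like (the module name "example" is made up, and the real syslog module does more than this):

```php
/**
 * Implementation of hook_watchdog() (Drupal 6 API) -- minimal sketch.
 */
function example_watchdog(array $log_entry) {
  // Substitute any message placeholders, then hand the entry to syslog
  // instead of doing an INSERT into the {watchdog} table.
  $message = $log_entry['variables']
    ? strtr($log_entry['message'], $log_entry['variables'])
    : $log_entry['message'];
  // Drupal's WATCHDOG_* severity constants map onto syslog priorities.
  syslog($log_entry['severity'], $log_entry['type'] . ': ' . strip_tags($message));
}
```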

senpai’s picture

Status: Active » Closed (won't fix)

This hasn't been a problem in six years, so I'm closing this issue.