so, once again, d.o was dying with "server not responding" errors. lots of folks were complaining in IRC, and it was dying for me, too, so i just restarted apache on both d1 and d2. i've had to do this quite a bit in the last few weeks. :( also, i notice that the webmaster issue queue and contact page are getting hit a lot with reports of internal server errors on the project download pages and when replying to issues. those problems also seem to go away after an apache restart.

i know it'd be nice to understand what's actually going wrong and fix the underlying problem(s). however, until that time, what about a cron job on d1 and d2 that just restarts apache every N hours (N==24? 12?) automatically? i have interest and ability to make this happen, i just need someone else (Dries, killes, etc) to agree it's a good idea.

thanks,
-derek

Comments

dww’s picture

update: d.o fell over *again* 15 minutes after i restarted apache. just kicked it again. we desperately need to figure out what's going on here. ;)

david strauss’s picture

How about a cron job that checks if Drupal.org is online and only restarts if it's not?

wget | grep -> apache restart if necessary
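A rough sketch of that check-then-restart idea, in Python rather than wget and grep. The URL, the marker text, and the restart command are all assumptions for illustration; nothing here is an agreed-upon script for d1 or d2.

```python
import subprocess
import urllib.request

# Assumptions (not from this thread): the URL to probe, the text that a
# healthy page should contain, and the command used to restart Apache.
SITE_URL = "https://drupal.org/"
MARKER = "Drupal"
RESTART_CMD = ["apachectl", "restart"]


def fetch_homepage(url=SITE_URL, timeout=30):
    """Download the page (the `wget` step) and return it as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def site_is_up(fetch, marker=MARKER):
    """Return True if fetch() succeeds and the body contains marker (the `grep` step)."""
    try:
        body = fetch()
    except OSError:
        # URLError, HTTPError (e.g. a 500), and socket timeouts are all OSError subclasses.
        return False
    return marker in body


def check_and_restart(fetch=fetch_homepage, restart=None):
    """Restart Apache only when the check fails; return True if a restart was triggered."""
    if site_is_up(fetch):
        return False
    if restart is None:
        restart = lambda: subprocess.run(RESTART_CMD, check=True)
    restart()
    return True
```

A cron entry on each web server would then just call check_and_restart() every few minutes. Passing fetch and restart as arguments keeps the logic testable without touching the network or Apache.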

dww’s picture

sure, that sounds good to me.

kbahey’s picture

Do we get a seg fault error in Apache's error.log?

Like so:

[Sun May 20 22:42:48 2007] [notice] child pid 12747 exit signal Segmentation fault (11)

If we do, then there is a more elegant solution than this: a script that checks for that error every 45 seconds (or 60 or whatever) and restarts Apache when it appears. The downtime is a minute or less.

You can grab the script (actually one .sh and one .php) from here.
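For anyone who wants the general shape of such a checker without the linked files, here is a minimal Python sketch. It is not the .sh/.php pair mentioned above; the function names, the polling loop, and the way the restart is passed in are assumptions made for illustration.

```python
import os
import re
import time

# Matches error.log lines like the one quoted above.
SEGFAULT_RE = re.compile(
    r"\[notice\] child pid (\d+) exit signal Segmentation fault \(11\)"
)


def segfault_pids(lines):
    """Return the child PIDs reported as segfaulted in the given log lines."""
    pids = []
    for line in lines:
        match = SEGFAULT_RE.search(line)
        if match:
            pids.append(int(match.group(1)))
    return pids


def read_new_lines(path, offset):
    """Read what was appended to path since byte offset; return (lines, new_offset)."""
    with open(path, "rb") as log:
        log.seek(0, os.SEEK_END)
        if log.tell() < offset:
            # The log was rotated or truncated, so start from the beginning.
            offset = 0
        log.seek(offset)
        data = log.read()
    return data.decode("utf-8", errors="replace").splitlines(), offset + len(data)


def watch(path, restart, interval=45, sleep=time.sleep):
    """Poll the log forever and call restart() whenever a new segfault line appears."""
    offset = os.path.getsize(path)  # skip segfaults logged before we started
    while True:
        lines, offset = read_new_lines(path, offset)
        if segfault_pids(lines):
            restart()
        sleep(interval)
```

Tracking a byte offset means each pass only looks at new lines, so one old segfault in the log does not trigger a restart on every check.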

emsearcy’s picture

On May 23, 2007, at 12:06 PM, Matt Rae wrote:

> If I remember right, the issue was that apache processes would start
> to segfault after a random time. The problem was attributed to APC
> which has become more stable since the workaround was implemented.
>
> Apache wasn't really going down, it would just throw a larger number
> of HTTP 500s.
>
> The osuosl has drupal.org monitored through nagios, but having a cron
> job to restart apache when it dies makes sense, especially if admins
> are asleep.
>
> Matt Rae

This would better be done through a cfengine `services' operation. I'll commit a change into our management repository to do this.

However, like you say Matt, Apache wasn't really going down of late (in the process sense). In my experience, restarting Apache when there *are* 500s is putting the cart before the horse: the 500s were the result of intended activity and went away on their own, and restarting the server meant leaving the important cron jobs that were causing them (like the search index rebuild) only half-done.

I'm glad that we've worked towards fixing the problem (table locks and long queries), rather than just hacking at the symptom.

I'd suggest we resolve the ticket?

--
Eric Searcy
OSU Open Source Lab

jlambert’s picture

Based on my experience, this is usually related to the accelerator taking a crap.

Install our script here:
http://fisheye.firebright.com/browse/firebright_public/logwatcher

This will watch the Apache logs and bounce Apache as necessary when a 500 occurs.

It's not a permanent fix (I'm still looking for one, let me know), but it works, and it means you don't have 14 minutes of downtime on a cron that runs every 15 minutes. Cron is not a viable solution here, unless you do something like this.
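To make the idea concrete, here is a small Python sketch of the "bounce on 500" decision. It is not the logwatcher script linked above; it assumes the access log uses the Common or Combined Log Format, and the threshold parameter is a hypothetical knob added for illustration.

```python
import re

# In Common/Combined Log Format the status code follows the quoted request:
#   ... "GET /project/drupal HTTP/1.1" 500 532
STATUS_RE = re.compile(r'" (\d{3}) (?:\d+|-)')


def count_500s(lines):
    """Count access-log lines whose HTTP status is 500."""
    count = 0
    for line in lines:
        match = STATUS_RE.search(line)
        if match and match.group(1) == "500":
            count += 1
    return count


def should_bounce(lines, threshold=1):
    """Return True when the batch of new log lines has at least `threshold` 500s."""
    return count_500s(lines) >= threshold
```

With the default threshold of 1, a single 500 triggers a bounce, as described above. Raising it would avoid restarting on an isolated error.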

Let me know if you have questions.

Jonathan Lambert
Principal, WorkHabit

http://www.WorkHabit.com/
A FireBright Company

emsearcy’s picture

Assigned: dww » emsearcy
Status: Active » Fixed

We have been running a similar Python script for the last few months, but thanks to APC tweaks these errors no longer occur. I'm assigning this to myself and resolving the task.

Anonymous’s picture

Status: Fixed » Closed (fixed)

Component: Webserver » Servers