I run very little on my web server, just:
- cpanel (I was thinking about hosting some friends)
- mediawiki
- drupal 5.1
I noticed that the server load sometimes shoots up: the load reported by /proc/loadavg can go from 0.02 to more than 10 in a couple of minutes.
The database becomes unresponsive when this happens.

Right now I have a script that simply tries to stop mysql when the server load goes over 10, but this isn't a solution, as I'll have the site down for a couple of minutes.
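
For reference, the load check in such a watchdog can be sketched roughly as follows. This is only an illustration, not the actual script: the function names are hypothetical, the threshold of 10 comes from the description above, and the command that stops MySQL is deliberately left out because the post does not say which one is used.

```python
LOAD_LIMIT = 10.0  # threshold mentioned in the post


def one_minute_load(loadavg_text: str) -> float:
    """Return the 1-minute load average from the contents of /proc/loadavg.

    The first whitespace-separated field of /proc/loadavg is the
    1-minute load average.
    """
    return float(loadavg_text.split()[0])


def should_stop_mysql(loadavg_text: str, limit: float = LOAD_LIMIT) -> bool:
    """True when the 1-minute load is strictly above the limit."""
    return one_minute_load(loadavg_text) > limit


# On the server the text would come from the real file, for example:
#   with open("/proc/loadavg") as fh:
#       if should_stop_mysql(fh.read()):
#           ...  # stop MySQL here (command not given in the post)
```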

I know my virtual server doesn't have much RAM (512 MB) and the database is quite big (about 100 MB), but I can't upgrade to 1024 MB any time soon and I need the site to stay up.
I could live with a site that becomes slow, but not with a MySQL server that simply hangs, sometimes without even logging anything in the "slow queries" log.

I can provide my my.cnf file if needed, and some more server configuration details.

Please help me solve this issue :(

Comments

kbahey’s picture

You need to narrow down which process is causing this.

Check the tools section in this series of articles on Drupal performance tuning and optimization for large web sites.

Start with vmstat and top, then use mtop or "SHOW PROCESSLIST" to monitor the database.

If you have a lot of apache processes for some reason, check if MaxClients is too high and your system is swapping excessively.

DElyMyth’s picture

The process is mysqld: the server load drops if I manage to stop it. CPU usage stays low, though; all I can see is %wa going up and swap usage growing like crazy.
Slow queries are logged only *after* MySQL blocks (server load above 10). Apache seems fine: the block happens with about 30 clients (sometimes fewer), and MaxClients is set to 150 (the default value).

I'm running vmstat every 5 minutes (every 2 right now), but I can't work out the problem from its output. I can post the vmstat output from a server lock-up anyway...

Here is some output from my script, which monitors the server load and tries to stop mysql if the load goes above 10.
Each snapshot contains the output of:
uptime
free
vmstat
ps -ef | grep -c httpd

Everything's fine:
20:35:01 up 11:17, 1 user, load average: 0.18, 0.14, 0.10
total used free shared buffers cached
Mem: 523008 433852 89156 0 38212 71960
-/+ buffers/cache: 323680 199328
Swap: 1048568 242172 806396
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 1 242172 88776 38212 71960 3 3 9 15 30 50 2 0 97 1
12

Always nice:
20:40:03 up 11:23, 1 user, load average: 0.46, 0.19, 0.11
total used free shared buffers cached
Mem: 523008 518700 4308 0 992 16944
-/+ buffers/cache: 500764 22244
Swap: 1048568 251316 797252
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 3 251316 3916 992 16920 4 4 10 15 30 50 2 0 97 1
22

Server's about to die:
20:45:06 up 11:28, 1 user, load average: 23.12, 11.24, 4.51
total used free shared buffers cached
Mem: 523008 519368 3640 0 988 12960
-/+ buffers/cache: 505420 17588
Swap: 1048568 617760 430808
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 24 617804 3524 992 12892 7 9 13 20 31 51 2 0 97 1
63

Getting worse:
20:50:24 up 11:33, 1 user, load average: 44.42, 32.42, 15.77
total used free shared buffers cached
Mem: 523008 518880 4128 0 536 7160
-/+ buffers/cache: 511184 11824
Swap: 1048568 749608 298960
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 46 749608 4120 536 7160 12 12 19 24 32 51 2 0 97 1
56

Last one before crash:
20:55:26 up 11:38, 1 user, load average: 53.40, 44.84, 25.70
total used free shared buffers cached
Mem: 523008 518760 4248 0 1024 16276
-/+ buffers/cache: 501460 21548
Swap: 1048568 960276 88292
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 56 960268 3948 1024 16456 17 15 24 26 33 52 2 0 97 1
62

Here's my /etc/my.cnf:
[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
# Default to using old password format for compatibility with mysql 3.x
# clients (those using the mysqlclient10 compatibility package).
old_passwords=1
log-slow-queries = /var/log/mysql-slow.log
thread_cache_size=50
key_buffer=512K
table_cache=64
sort_buffer_size=512K
read_buffer_size=256K
read_rnd_buffer_size=512K
thread_concurrency=2
query_cache_limit=1M
query_cache_size=32M
query_cache_type=1

[mysql.server]
user=mysql
basedir=/var/lib

[mysqld_safe]
err-log=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid

(some friends tried to help me with it, but nothing could really solve the problem, today I had two crashes...)
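
A rough way to see how much memory this my.cnf could let mysqld ask for is to add the global buffers to the per-connection buffers multiplied by the connection limit. The sketch below does just that arithmetic with the values from the file above. It is a simplification and not an exact model: it ignores thread stacks, temporary tables and any storage-engine buffers, and max_connections is not set in the file, so the value passed in is an assumption (100 is used here as the assumed default for this MySQL version).

```python
KIB = 1024
MIB = 1024 * KIB

# Values copied from the my.cnf above.
GLOBAL_BUFFERS = {
    "key_buffer": 512 * KIB,
    "query_cache_size": 32 * MIB,
}
PER_CONNECTION_BUFFERS = {
    "sort_buffer_size": 512 * KIB,
    "read_buffer_size": 256 * KIB,
    "read_rnd_buffer_size": 512 * KIB,
}


def worst_case_bytes(max_connections: int) -> int:
    """Rough upper bound: global buffers + connections * per-connection buffers."""
    per_connection = sum(PER_CONNECTION_BUFFERS.values())
    return sum(GLOBAL_BUFFERS.values()) + max_connections * per_connection


# max_connections is NOT in the posted file; 100 is an assumed value.
estimate_mib = worst_case_bytes(100) / MIB
print(f"rough worst case: {estimate_mib:.1f} MiB")
```

The per-connection buffers are only allocated when a query needs them, so real usage is normally well below this bound; the point is only to compare it with the 512 MB the server has.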

kbahey’s picture

It is hard to read the info when it is not lined up in columns. Please post the info again enclosed in <code> tags so it is in a fixed-width font.

Can you install mtop on this system? It gives good MySQL monitoring information.

If not, try SHOW FULL PROCESSLIST when the problem is about to happen.
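
Once the SHOW FULL PROCESSLIST rows are in hand, picking out the long-running queries is a simple filter. The sketch below assumes the rows have already been fetched by whatever client you use and arrive as tuples in the statement's column order (Id, User, Host, db, Command, Time, State, Info). The ProcessRow type, the long_running name and the 5-second threshold are hypothetical choices for illustration.

```python
from typing import Iterable, List, NamedTuple, Optional, Sequence


class ProcessRow(NamedTuple):
    """One row of SHOW FULL PROCESSLIST, in the statement's column order."""
    id: int
    user: str
    host: str
    db: Optional[str]
    command: str
    time: int            # seconds the thread has been in its current state
    state: Optional[str]
    info: Optional[str]  # the full query text, if any


def long_running(rows: Iterable[Sequence], min_seconds: int = 5) -> List[ProcessRow]:
    """Return non-sleeping threads running for at least min_seconds, slowest first."""
    parsed = [ProcessRow(*row) for row in rows]
    busy = [r for r in parsed if r.command != "Sleep" and r.time >= min_seconds]
    return sorted(busy, key=lambda r: r.time, reverse=True)
```

Idle connections show up with the command "Sleep", which is why they are skipped here.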

DElyMyth’s picture

 20:35:01 up 11:17, 1 user, load average: 0.18, 0.14, 0.10
total used free shared buffers cached
Mem: 523008 433852 89156 0 38212 71960
-/+ buffers/cache: 323680 199328
Swap: 1048568 242172 806396
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 1 242172 88776 38212 71960 3 3 9 15 30 50 2 0 97 1
12
20:40:03 up 11:23, 1 user, load average: 0.46, 0.19, 0.11
total used free shared buffers cached
Mem: 523008 518700 4308 0 992 16944
-/+ buffers/cache: 500764 22244
Swap: 1048568 251316 797252
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
1 3 251316 3916 992 16920 4 4 10 15 30 50 2 0 97 1
22
20:45:06 up 11:28, 1 user, load average: 23.12, 11.24, 4.51
total used free shared buffers cached
Mem: 523008 519368 3640 0 988 12960
-/+ buffers/cache: 505420 17588
Swap: 1048568 617760 430808
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 24 617804 3524 992 12892 7 9 13 20 31 51 2 0 97 1
63
20:50:24 up 11:33, 1 user, load average: 44.42, 32.42, 15.77
total used free shared buffers cached
Mem: 523008 518880 4128 0 536 7160
-/+ buffers/cache: 511184 11824
Swap: 1048568 749608 298960
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 46 749608 4120 536 7160 12 12 19 24 32 51 2 0 97 1
56
 20:55:26 up 11:38, 1 user, load average: 53.40, 44.84, 25.70
total used free shared buffers cached
Mem: 523008 518760 4248 0 1024 16276
-/+ buffers/cache: 501460 21548
Swap: 1048568 960276 88292
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 56 960268 3948 1024 16456 17 15 24 26 33 52 2 0 97 1
62

I tried to reduce the MaxClients from 150 to 50.
I installed mtop, but I guess it will only give me useful information when the server is about to crash (the same goes for SHOW PROCESSLIST). Anyway, here's the output right now:

load average: 0.16, 0.11, 0.09 mysqld 4.1.22-standard-log up 0 day(s), 13:28 hrs
1 threads: 1 running, 7 cached. Queries/slow: 1.1M/0 Cache Hit: 93.93%
Opened tables: 222  RRN: 7.2M  TLW: 21  SFJ: 1  SMP: 7  QPS: 12

ID       USER     HOST             DB           TIME   COMMAND STATE        INFO
10773    delymyth localhost                            Query                show full processlist
---

Btw, it seems that it didn't crash tonight.
Thanks for the help investigating this :)

Elena

kbahey’s picture

For mtop, yes, you have to see if there are queries that take a lot of time when the load average is high, and the system is about to hang.

Here is how the system looks when it is normal:

 20:35:01 up 11:17, 1 user, load average: 0.18, 0.14, 0.10
total used free shared buffers cached
Mem: 523008 433852 89156 0 38212 71960
-/+ buffers/cache: 323680 199328
Swap: 1048568 242172 806396
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd   free  buff  cache si so bi bo in cs us sy id wa
2 1 242172 88776 38212 71960 3  3  9  15 30 50 2  0  97 1 

Number of Apache processes: 12
Number of blocked processes: 1

Here it is when it is about to crash:

 20:55:26 up 11:38, 1 user, load average: 53.40, 44.84, 25.70
total used free shared buffers cached
Mem: 523008 518760 4248 0 1024 16276
-/+ buffers/cache: 501460 21548
Swap: 1048568 960276 88292
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r  b   swpd     free  buff   cache  si  so  bi bo in cs us sy id wa
0 56 960268 3948 1024 16456 17 15 24 26 33 52 2  0  97 1 

Number of Apache processes: 62

Number of blocked processes: 56

What this says is that more Apache processes are being created than can do useful work. They keep accumulating, waiting to be unblocked so they can serve the requests they were given.

I see that swap in and swap out have gone up (17/15 vs. 3/3). At that level I would not expect much trouble from thrashing. I would also expect the % wait for I/O to go up because of all the disk activity involved in swapping, but wa is still 1% and idle time is 97%, which does not make sense. Also, the cache is still high (16456), meaning some memory is still available. I would have expected that to be exhausted and the processes to be waiting for disk.

Reducing MaxClients to whatever fits in memory is a good solution, but I can't conclusively say from what you posted where exactly the problem is, apart from too many processes.
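
"Whatever fits in memory" can be estimated with simple arithmetic: take the total RAM, subtract what MySQL and the rest of the system need, and divide by the size of one Apache process. A minimal sketch, where the function name is hypothetical and every figure you feed it is something you have to measure yourself (for example from the resident size of the httpd processes in top):

```python
def max_clients_that_fit(total_ram_mb: float,
                         reserved_mb: float,
                         per_process_mb: float) -> int:
    """Estimate how many Apache processes fit in RAM without swapping.

    total_ram_mb:   physical memory on the server
    reserved_mb:    memory set aside for MySQL, the OS and everything else
    per_process_mb: typical resident size of one Apache process
    """
    if per_process_mb <= 0:
        raise ValueError("per_process_mb must be positive")
    available = total_ram_mb - reserved_mb
    return max(0, int(available // per_process_mb))
```

With purely hypothetical figures of 200 MB reserved and 20 MB per Apache process, a 512 MB server would fit 15 processes, far below the default of 150.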

kbahey’s picture

I know what is wrong with the data you gave us.

The vmstat command has this idiosyncrasy: the first line displayed is the stats since the system was booted, not the current stats.

So, when this happens again, use the following command: vmstat 1 2, and then post it again. The second line will be the one of interest.
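
If the monitoring script needs to pick that second line automatically, something along these lines would do it. This is a sketch with a hypothetical function name: it takes the text printed by vmstat 1 2, finds the column-name row (the one starting with "r b"), and returns the last sample as a dictionary keyed by column name.

```python
def last_vmstat_sample(output: str) -> dict:
    """Return the last sample of `vmstat <delay> <count>` output as {column: value}.

    The first sample holds averages since boot, so at least two samples
    are required and only the last one is returned.
    """
    rows = [line.split() for line in output.splitlines() if line.strip()]
    # The column-name row starts with "r" and "b" (runnable / blocked processes).
    header_index = next(i for i, tokens in enumerate(rows) if tokens[:2] == ["r", "b"])
    names = rows[header_index]
    samples = rows[header_index + 1:]
    if len(samples) < 2:
        raise ValueError("need at least two samples; the first is since boot")
    return dict(zip(names, (int(value) for value in samples[-1])))


# On the server the text could be captured with the standard library, e.g.:
#   import subprocess
#   text = subprocess.run(["vmstat", "1", "2"],
#                         capture_output=True, text=True).stdout
#   sample = last_vmstat_sample(text)
#   print(sample["b"], sample["si"], sample["so"], sample["wa"])
```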

I am sure you will see different values for cpu percentages, and maybe the swap in and out too.

DElyMyth’s picture

Ok, I corrected the monitoring script as you suggested for vmstat.
Yesterday I reduced MaxClients from 150 to 50, but I still think that's too many for 512 MB of RAM. I guess I'll try 30 and hope it isn't too small.
As far as I could read around the internet it's an issue with apache, but why does it stop when I stop mysql?

I also installed the devel module, but I can hardly believe its output on the main page, since it sometimes reports more than 500 queries :(
I re-enabled Drupal's cache to try to reduce the queries on the main page, and raised the crawl delay in robots.txt to 25 (it was 10).

Anyway, I noticed that the block happens when I'm hit by a comment spam run.
With Akismet enabled and the default 1-minute delay for spammers, I still get something like 30 spam comments a minute (this is what I read from watchdog, so there are really many more). To limit the damage, I have now enabled throttle for the comment module.

I'll post the new logs I think later today, as I'm pretty sure I'll have another crash :(

PS:
Would it help if I upgrade my VPS to 1024mb ram?
And by "help" I mean making it stop crashing once or more a day...

dpearcefl’s picture

Status: Active » Closed (won't fix)

Considering the age of this issue with no responses and that D5 is unsupported, I am closing this issue.