Problem

Every minute, cron runs /var/xdrago/second.sh, which loops every 10 seconds, checks the load levels, and kills tasks if the set thresholds are breached. With the default settings, a load of 18.88 causes PHP and Drush tasks to be killed, and a load of 14.44 causes the web server to be killed (this is my and chrisc's reading of the script).

Since these values are based around a server with 4 CPU cores, the default load limits do not work well on servers with more than 4 CPUs.

According to the Wikipedia page on UNIX-style load calculation, load has to be read relative to the number of CPUs, so 14.44 represents a per-core load of 14.44 / 4 = 3.61 before the web server is killed, assuming 4 cores.

However, on a server with, say, 14 cores, the calculation does not work so well: 14.44 / 14 = 1.03. This means even a light load can kill PHP, Nginx services, or Drush tasks.
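To make the arithmetic above easy to repeat for other core counts, here is a minimal sketch. The per_core_load helper is hypothetical (it is not part of BOA), and it assumes the thresholds are stored as load x 100, which is how 1444 maps to 14.44 in the script.

```shell
#!/bin/bash
# Hypothetical helper, not part of second.sh.
# $1 = threshold as stored in the script (load x 100), $2 = number of CPU cores.
per_core_load() {
  awk -v t="$1" -v c="$2" 'BEGIN { printf "%.2f\n", t / 100 / c }'
}

per_core_load 1444 4    # prints 3.61
per_core_load 1444 14   # prints 1.03
```

The same call with any other threshold and core count shows how much per-core headroom a given setting leaves.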

Example

The BOA virtual server running www.transitionnetwork.org has 14 CPUs, and we had to establish the best values for us by a process of elimination. It also means hacking second.sh with our preferred values, something that we have to redo after every BOA update.

Background issue: https://tech.transitionnetwork.org/trac/ticket/555

FYI our current values that work for the server with 14 cores are:

CTL_ONEX_SPIDER_LOAD=2716
CTL_FIVX_SPIDER_LOAD=2716
CTL_ONEX_LOAD=10108
CTL_FIVX_LOAD=6216
CTL_ONEX_LOAD_CRIT=13216
CTL_FIVX_LOAD_CRIT=10885

Proposed solution

There are two parts/alternatives for the proposed fix:

  1. Allow hard-coded variables to be overridden by those taken from /root/.barracuda.cnf
  2. AND/OR do some calculations based on the number of CPU cores during second.sh runtime, or during BOA updates to set the values in second.sh.

Allowing overrides (1, above) would mean altering second.sh like this (the example reads from a separate /root/.barracuda-overrides.cnf file):

...
load_limits()
{
  # Use site-specific limits when the overrides file exists.
  if [ -e "/root/.barracuda-overrides.cnf" ] ; then
    source /root/.barracuda-overrides.cnf
  else
    # Stock defaults (load average x 100).
    CTL_ONEX_SPIDER_LOAD=388
    CTL_FIVX_SPIDER_LOAD=388
    CTL_ONEX_LOAD=1444
    CTL_FIVX_LOAD=888
    CTL_ONEX_LOAD_CRIT=1888
    CTL_FIVX_LOAD_CRIT=1555
  fi
}
...
... near bottom ...
load_limits
control
sleep 10
... rest of file ...
...
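One possible variation, offered only as a sketch: set the stock defaults first and source the overrides file afterwards, so that an overrides file which sets only some of the variables still leaves the rest defined. The _OVERRIDES variable is a name I made up for illustration, and the file path is the same assumed one as in the snippet above.

```shell
#!/bin/bash
# Hypothetical variable; the real file name is up to the BOA maintainers.
_OVERRIDES="${_OVERRIDES:-/root/.barracuda-overrides.cnf}"

load_limits() {
  # Stock defaults (load average x 100), as hard-coded in second.sh today.
  CTL_ONEX_SPIDER_LOAD=388
  CTL_FIVX_SPIDER_LOAD=388
  CTL_ONEX_LOAD=1444
  CTL_FIVX_LOAD=888
  CTL_ONEX_LOAD_CRIT=1888
  CTL_FIVX_LOAD_CRIT=1555

  # Any variable set in the overrides file replaces its default;
  # variables the file does not mention keep the values above.
  if [ -r "${_OVERRIDES}" ]; then
    . "${_OVERRIDES}"
  fi
}

load_limits
echo "CTL_ONEX_LOAD=${CTL_ONEX_LOAD} CTL_FIVX_LOAD=${CTL_FIVX_LOAD}"
```

With this layout an overrides file containing just CTL_ONEX_LOAD=10108 would raise that one limit and leave the other five at their defaults.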

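Doing a calculation (2) would mean deriving the CTL_*_LOAD variables from the number of CPU cores using per-core multipliers (0.9, 3.6, 2.2, 4.8 and 4). Because the script stores load values multiplied by 100 (1444 means 14.44) and shell arithmetic is integer-only, the multipliers are scaled by 100 as well, e.g.:

CTL_ONEX_SPIDER_LOAD=$(( _CPU_CORES * 90 ))
CTL_FIVX_SPIDER_LOAD=$(( _CPU_CORES * 90 ))
CTL_ONEX_LOAD=$(( _CPU_CORES * 360 ))
CTL_FIVX_LOAD=$(( _CPU_CORES * 220 ))
CTL_ONEX_LOAD_CRIT=$(( _CPU_CORES * 480 ))
CTL_FIVX_LOAD_CRIT=$(( _CPU_CORES * 400 ))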

... where _CPU_CORES is set using your preferred method of getting the core count.
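For what it's worth, here is one way _CPU_CORES could be obtained. This is a sketch, not what BOA does: the cpu_cores function is hypothetical, and it tries nproc, then getconf, then /proc/cpuinfo, falling back to 1 if none of them work. The 3.6 per-core multiplier for CTL_ONEX_LOAD is written as 360 because the script stores load x 100.

```shell
#!/bin/bash
# Hypothetical helper: print the number of online CPU cores.
cpu_cores() {
  local n
  n=$(nproc 2>/dev/null) \
    || n=$(getconf _NPROCESSORS_ONLN 2>/dev/null) \
    || n=$(grep -c '^processor' /proc/cpuinfo 2>/dev/null) \
    || n=1
  echo "${n:-1}"
}

_CPU_CORES=$(cpu_cores)

# Integer arithmetic only: 3.6 per core becomes 360 on the x100 scale.
CTL_ONEX_LOAD=$(( _CPU_CORES * 360 ))
echo "cores=${_CPU_CORES} CTL_ONEX_LOAD=${CTL_ONEX_LOAD}"
```

On a 4-core box this gives 1440, close to the stock 1444; on 14 cores it gives 5040.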

Comments

Jim Kirkpatrick:

Issue summary updated.

omega8cc:

Title: Proposal: Replace hard-coded load thresholds for PHP-FPM kill; defaults are overly suicidal on servers with many CPU cores » Make hard-coded load thresholds configurable
Status: Active » Fixed

This commit should do the trick, I think: http://drupalcode.org/project/barracuda.git/commit/5c9e954

Thanks for bringing this to our attention.

Jim Kirkpatrick:

Lovely, thanks!

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.