So far we have seen the same issue on two servers that were upgraded to BOA-2.0.6-dev for unrelated reasons. One is our own hosted instance on Debian Squeeze and the other is remotely managed on Ubuntu Lucid (Linode). This means that the issue is not OS-specific, but directly related to the MariaDB-5.5.29 release, which is effectively a must-have because it fixes serious security issues.
The problem is mysterious, because MariaDB logs just cryptic/dummy errors, and only when visiting some URLs - on the hosted instance it was just the site's homepage, while *all* other URLs worked fine, and on the remotely managed instance there are some "random" URLs affected on a few sites, like /admin/content.
Feb 10 12:28:59 aegir mysqld: 130210 12:28:59 [Warning] Aborted connection 26545 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
Feb 10 12:41:32 aegir mysqld: 130210 12:41:32 [Warning] Aborted connection 27647 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
Feb 10 13:02:47 aegir mysqld: 130210 13:02:47 [Warning] Aborted connection 29378 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
Feb 10 13:07:22 aegir mysqld: 130210 13:07:22 [Warning] Aborted connection 29739 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
Feb 10 13:11:11 aegir mysqld: 130210 13:11:11 [Warning] Aborted connection 30079 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
Feb 10 13:18:51 aegir mysqld: 130210 13:18:51 [Warning] Aborted connection 30413 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
Feb 10 13:24:48 aegir mysqld: 130210 13:24:48 [Warning] Aborted connection 30604 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
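To see whether the aborted connections cluster on particular databases, warnings like the ones above can be tallied with a short script. This is only a minimal sketch in Python: the regular expression is written against the exact line shape shown above, and the names `ABORTED_RE`, `parse_aborted` and `count_by_db_user` are made up for illustration, not part of BOA or MariaDB.

```python
import re
from collections import Counter

# Matches the "[Warning] Aborted connection ..." part of the syslog lines
# quoted above. The pattern is an assumption based on that sample only.
ABORTED_RE = re.compile(
    r"\[Warning\] Aborted connection (?P<conn_id>\d+) "
    r"to db: '(?P<db>[^']*)' user: '(?P<user>[^']*)' "
    r"host: '(?P<host>[^']*)' \((?P<reason>[^)]*)\)"
)


def parse_aborted(lines):
    """Return one dict per aborted-connection warning found in lines."""
    records = []
    for line in lines:
        match = ABORTED_RE.search(line)
        if match:
            records.append(match.groupdict())
    return records


def count_by_db_user(lines):
    """Count aborted connections per (db, user) pair."""
    return Counter((r["db"], r["user"]) for r in parse_aborted(lines))
```

Feeding it the lines of the syslog file (for example `count_by_db_user(open(path))`, where `path` is wherever syslog is written on the server) gives a per-database count, which makes it easier to tell whether only a few sites are affected.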
Furthermore, the error is not consistent/permanent. It disappears completely after `service mysql restart`. Then it comes back, or not. On the hosted instance it disappeared and was seen only once (during a php-fpm restart) in the last 48 hours, while on the remotely managed instance it still appears on a few sites and only on a few "random" URLs, even after a few mysql restarts and a system reboot.
Worse yet, the remotely managed Linode crashed a few hours after the upgrade last night and the filesystem went into a read-only state, which required "Reboot into Rescue Mode", fsck etc. It is still not confirmed whether the crash was related to those MariaDB-5.5.29 issues or was just a coincidence caused by upstream (SAN) issues, but it did happen a few hours after the upgrade.
Since the problem is still mysterious and quite random, and therefore not really possible to reproduce, I'm marking this as major and not critical. It is still serious enough to be a BOA release blocker, because a release would obviously result in more systems being affected without any immediate fix or workaround.
Comments
Comment #1
deancrabb commented
Hi,
I'm the Linode user in this case. I did confirm with Linode, not a SAN problem. Their response "We do not use SAN, all of our storage is local, on the host, so we can certainly confirm it is not a SAN issue." Given you've had another reported instance I'd be leaning towards it not being related.
This is the first log of errors that appeared on the console when I logged in after all our sites went down.
http://screencast.com/t/9uoE220L
I also got this strange error when attempting to restart PHP after I'd completed the fsck and rebooted the server. Sites were still down at this point. Is this somehow consistent with the localhost issues related above?
http://screencast.com/t/4rxKZWuAl3sY
Here is another screenshot that may help
http://screencast.com/t/nMcju4q6O
We also have New Relic running on the server so I'm happy to give you the login to that, if it assists in getting a deeper look into the issue.
Anyway, I hope that is helpful.
Dean
Comment #2
omega8cc commented
Thanks for the additional information and screenshots.
Note that you should never use the `php5-fpm` init script, always the BOA-specific `php53-fpm`. Even then, the auto-healing may be faster than you type (it runs every 10 seconds) and may have already started php53-fpm before you attempted it, hence the error, so it is not an issue here at all.
The entire issue is also not related to New Relic, because the only other affected server (with only one incident) doesn't have New Relic installed.
My guess is that some contrib module triggers a MariaDB-5.5.29 bug. We just can't find it, because the issue is too random.
Comment #3
deancrabb commented
We have a consistent page on a site that always has the issue, so it's not random. It's repeatable every time. The problem is, why just these pages? Maybe they got corrupted on the first failure?
http://seacliffcoast.com.au/events
and
http://www.seacliffcoast.com.au/user#overlay=admin/content
It mustn't be auto-healing very well, because I can sit there and have the sites fail for several minutes before I manually restart php53-fpm and have them back up immediately afterwards. It's only on rare occasions (actually only twice in the last 24 hours, after many restarts) that I've had this binding issue appear. It would seem that auto-healing isn't working, because otherwise we wouldn't have these extended periods of downtime.
Also, on occasion restarting php53-fpm isn't enough to recover it; I've had to restart php-fpm as well to get everything back.
Cheers
Dean
Comment #4
omega8cc commented
The issue is random because it doesn't affect all sites, and because in the one/single case we have had on the hosted instance where it affected the site's homepage only, it disappeared completely without any intervention. The fact that it appears to be consistent on some URLs/sites for you doesn't make it less random in a broader/general sense. It just means that there may be even more ways (modules or configuration) which may trigger the issue.
Comment #5
omega8cc commented
I think we have found the culprit - and it seems completely unrelated to MariaDB-5.5.29
I will be able to confirm after another hour of testing.
Comment #6
omega8cc commented
On the one affected hosted instance it no longer happens, but I was not aware that two days ago one of our sysadmins disabled the GEOS extension there, which is included in BOA-2.0.6-dev by default - see #1874156: Add GEOS PHP extension in future release
So it looks like the real reason behind this cascade of errors was the GEOS PHP extension, which also indirectly affected the New Relic extension. Together they resulted in halted/frozen PHP-FPM workers and then in mysteriously dropped connections to the db server.
We are removing/disabling it, so it should be available only as an experimental install option with BIG RED warning that it may crash your system.
Comment #7
deancrabb commented
Okay, great to hear. Let's continue to monitor and see how we go. Glad I resisted the strong advice from support to upgrade our database from MariaDB to Percona. ;-)
About the time you did a downgrade to our platform we also removed a code update we'd made to a module on Friday. Our developer swears blue it isn't a problem, but then removed it and our problem went away. However, I then received an email 1 minute later from Omega8 saying you'd downgraded us, so it's possible you fixed it and our developer's code is fine.
Still, I'll wait a bit and maybe introduce the developer's code again and see what effect it has.
Thanks for your assistance. Let's continue to monitor.
Dean
Comment #8
omega8cc commented
As a follow-up: we have just noticed the same errors on that one affected hosted instance again:
But it was only because the GEOS PHP extension had been disabled only in the ini file for php53-fpm, while it was still active in the ini file for php-cli. It caused these errors during a cron task (which runs via drush/php-cli) when it started generating a (huge, in this case) xmlsitemap. This only confirms that GEOS was the real source of those problems.
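A mismatch like this, where an extension is off for one PHP SAPI but still loaded for another, can be caught by scanning each ini file for an uncommented `extension=` line. The following is a rough sketch in Python, not how BOA does it. It takes the ini contents as strings because the actual ini paths are not given in this thread, it assumes the extension is loaded as `geos.so`, and the function names are hypothetical.

```python
import re


def extension_enabled(ini_text, name):
    """Return True if ini_text has an active extension= line for name.

    Lines starting with ';' are treated as comments. The ".so" suffix,
    surrounding quotes and a leading directory path are all optional.
    """
    pattern = re.compile(
        r'^extension\s*=\s*"?(?:[^";]*/)?'
        + re.escape(name)
        + r'(?:\.so)?"?\s*(?:;.*)?$'
    )
    for raw in ini_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue
        if pattern.match(line):
            return True
    return False


def sapis_with_extension(ini_by_sapi, name):
    """Given {label: ini_text}, list the labels where name is still enabled."""
    return sorted(
        label
        for label, text in ini_by_sapi.items()
        if extension_enabled(text, name)
    )
```

Called with the contents of both ini files, for example `sapis_with_extension({"php53-fpm": fpm_ini, "php-cli": cli_ini}, "geos")`, it lists the SAPIs where the extension is still active, so a leftover entry in the php-cli ini would show up right away.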
Comment #9
omega8cc commented
Fixed in commit: http://drupalcode.org/project/barracuda.git/commit/f51e3b2
Comment #10
omega8cc commented
I have just confirmed that all affected sites we have checked so far have the geoPHP module enabled, which I guess explains why only those sites experienced problems.
Comment #11
omega8cc commented
To follow up, the mysql and VPS crash mentioned above most probably had a different reason - it was caused by critical bugs in APC 3.1.14: #1914294: APC 3.1.14 disappeared from PECL
Comment #12
omega8cc commented
Comment #13.0
(not verified) commented
Typos