So far we have seen the same issue on two servers that we upgraded to BOA-2.0.6-dev for unrelated reasons. One is our own hosted instance on Debian Squeeze and the other is a remotely managed instance on Ubuntu Lucid (Linode). This suggests that the issue is not OS-specific but directly related to the MariaDB-5.5.29 release, which is effectively a must-have because of the serious security issues it fixes.

The problem is mysterious because MariaDB logs only cryptic/dummy errors, and only when some URLs are visited. On the hosted instance it was just the site's homepage, while *all* other URLs worked fine; on the remotely managed instance some "random" URLs are affected on a few sites, like /admin/content.

Feb 10 12:28:59 aegir mysqld: 130210 12:28:59 [Warning] Aborted connection 26545 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
Feb 10 12:41:32 aegir mysqld: 130210 12:41:32 [Warning] Aborted connection 27647 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
Feb 10 13:02:47 aegir mysqld: 130210 13:02:47 [Warning] Aborted connection 29378 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
Feb 10 13:07:22 aegir mysqld: 130210 13:07:22 [Warning] Aborted connection 29739 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
Feb 10 13:11:11 aegir mysqld: 130210 13:11:11 [Warning] Aborted connection 30079 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
Feb 10 13:18:51 aegir mysqld: 130210 13:18:51 [Warning] Aborted connection 30413 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
Feb 10 13:24:48 aegir mysqld: 130210 13:24:48 [Warning] Aborted connection 30604 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
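
For anyone trying to see whether these warnings cluster around particular hours or databases, the lines above can be tallied with a small script. This is only a minimal sketch: the regular expression is inferred from the sample lines quoted in this issue, and the names `ABORTED_RE` and `count_aborted` are hypothetical helpers, not part of BOA or MariaDB.

```python
import re
from collections import Counter

# Pattern inferred from the syslog samples quoted in this issue only;
# other syslog formats may need a different expression (assumption).
ABORTED_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<hour>\d{2}):\d{2}:\d{2}\s+"
    r"(?P<server>\S+)\s+mysqld:.*\[Warning\] Aborted connection (?P<conn_id>\d+) "
    r"to db: '(?P<db>[^']*)' user: '(?P<user>[^']*)' host: '(?P<client>[^']*)' "
    r"\((?P<reason>[^)]*)\)"
)


def count_aborted(lines):
    """Count aborted-connection warnings per (server, db, hour bucket)."""
    counts = Counter()
    for line in lines:
        match = ABORTED_RE.match(line)
        if match is None:
            continue  # not an "Aborted connection" warning
        bucket = "{} {} {}h".format(match["month"], match["day"], match["hour"])
        counts[(match["server"], match["db"], bucket)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Feb 10 12:28:59 aegir mysqld: 130210 12:28:59 [Warning] Aborted "
        "connection 26545 to db: 'foobar' user: 'foobar' host: 'localhost' "
        "(Unknown error)",
    ]
    for key, total in sorted(count_aborted(sample).items()):
        print(key, total)
```

Passing the lines of whichever log file receives the mysqld messages gives a per-hour count for each server and database, which makes it easier to compare the bursts with cron runs or restarts.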

Furthermore, the error is not consistent/permanent. It disappears completely after `service mysql restart`, and then it may or may not come back. On the hosted instance it disappeared and has been seen only once (during a php-fpm restart) in the last 48 hours, while on the remotely managed instance it still appears on a few sites and only on a few "random" URLs, even after several mysql restarts and a system reboot.

Worse yet, the remotely managed Linode crashed a few hours after the upgrade last night and the filesystem went into a read-only state, which required "Reboot into Rescue Mode", fsck etc. It is still not confirmed whether the crash was related to those MariaDB-5.5.29 issues or was just a coincidence caused by upstream (SAN) issues, but it did happen a few hours after the upgrade.

Since the problem is still mysterious and quite random, and therefore not really possible to reproduce, I'm marking this as major and not critical. It is still serious enough to be a BOA release blocker, because a release would obviously result in more systems being affected without any immediate fix or workaround.

Comments

deancrabb’s picture

Hi,

I'm the Linode user in this case. I did confirm with Linode that it is not a SAN problem. Their response: "We do not use SAN, all of our storage is local, on the host, so we can certainly confirm it is not a SAN issue." Given you've had another reported instance, I'd be leaning towards it not being related.

These are the first errors that appeared on the console when I logged in after all our sites went down.
http://screencast.com/t/9uoE220L

I also got this strange error when attempting to restart PHP after I'd completed the fsck and rebooted the server. Sites were still down at this point. Is this somehow consistent with the localhost issues described above?
http://screencast.com/t/4rxKZWuAl3sY

Here is another screenshot that may help
http://screencast.com/t/nMcju4q6O

We also have New Relic running on the server so I'm happy to give you the login to that, if it assists in getting a deeper look into the issue.

Anyway, I hope that is helpful.

Dean

omega8cc’s picture

Thanks for the additional information and screenshots.

Note that you should never use the php5-fpm init script, always the BOA-specific php53-fpm. Even then, the auto-healing (it runs every 10 seconds) may be faster than you can type and may have already started php53-fpm before you attempted it, hence the error. So it is not an issue here at all.

The entire issue is also not related to New Relic, because the only other affected server (with only one incident) doesn't have New Relic installed.

My guess is that some contrib module triggers a MariaDB-5.5.29 bug. We just can't find it, because the issue is too random.

deancrabb’s picture

> My guess is that some contrib module triggers a MariaDB-5.5.29 bug. We just can't find it, because the issue is too random.

We have a page on a site that always has the issue, so it's not random. It's repeatable every time. The question is, why just these pages? Maybe they were corrupted on the first failure?

http://seacliffcoast.com.au/events
and
http://www.seacliffcoast.com.au/user#overlay=admin/content

> always BOA-specific php53-fpm and even then the auto-healing may be faster than you type (it runs every 10 seconds) and may already start php53-fpm before you have attempted it

It mustn't be auto-healing very well, because I can sit there and have the sites fail for several minutes before I manually restart php53-fpm and have them back up immediately afterwards. It's only on rarer occasions (actually only twice in the last 24 hours, after many restarts) that I've had this binding issue appear. It would seem that auto-healing isn't working, because otherwise we wouldn't have these extended periods of downtime.

Also, on occasion restarting php53-fpm isn't enough to recover; I've also had to restart php-fpm to get everything back.

Cheers
Dean

omega8cc’s picture

The issue is random because it doesn't affect all sites, and because in the one/single case we have had on the hosted instance where it affected the site's homepage only, it disappeared completely without any intervention. The fact that it appears to be consistent on some URLs/sites for you doesn't make it less random in a broader/general sense. It just means that there may be even more ways (modules or configuration) which may trigger the issue.

omega8cc’s picture

I think we have found the culprit - and it seems completely unrelated to MariaDB-5.5.29.

I will be able to confirm after another hour of testing.

omega8cc’s picture

Title: MariaDB-5.5.29 Aborted connections with (Unknown error) reported on some URLs only cause 502 Bad Gateway » GEOS PHP extension causes Aborted connections with (Unknown error) reported on some URLs only
Priority: Major » Critical

On the one affected hosted instance it no longer happens, but I was not aware that two days ago one of our sysadmins disabled the GEOS extension there, which is included in BOA-2.0.6-dev by default - see #1874156: Add GEOS PHP extension in future release

So it looks like the real reason behind this cascade of errors was the GEOS PHP extension, which also indirectly affected the New Relic extension. Together they resulted in halted/frozen PHP-FPM workers and then mysteriously dropped connections to the db server.

We are removing/disabling it, so it will be available only as an experimental install option with a BIG RED warning that it may crash your system.

deancrabb’s picture

Okay, great to hear. Let's continue to monitor and see how we go. Glad I resisted the strong advice from support to upgrade our database from MariaDB to Percona. ;-)

About the time you downgraded our platform we also removed a code update we'd made to a module on Friday. Our developer swears blue it isn't a problem, but then removed it and our problem went away. However, I then received an email 1 minute later from Omega8 saying you'd downgraded us, so it's possible you fixed it and our developer's code is fine.

Still, I'll wait a bit and maybe introduce the developer's code again and see what effect it has.

Thanks for your assistance. Let's continue to monitor.

Dean

omega8cc’s picture

As a follow-up: we have just noticed the same errors on that one affected hosted instance again:

Feb 11 05:34:40 v233c mysqld: 130211  5:34:40 [Warning] Aborted connection 142 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
Feb 11 05:40:24 v233c mysqld: 130211  5:40:24 [Warning] Aborted connection 241 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
Feb 11 05:40:29 v233c mysqld: 130211  5:40:29 [Warning] Aborted connection 252 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
Feb 11 05:40:31 v233c mysqld: 130211  5:40:31 [Warning] Aborted connection 253 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
Feb 11 05:40:36 v233c mysqld: 130211  5:40:36 [Warning] Aborted connection 255 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
Feb 11 05:50:01 v233c mysqld: 130211  5:50:01 [Warning] Aborted connection 148 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
Feb 11 05:50:04 v233c mysqld: 130211  5:50:04 [Warning] Aborted connection 149 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
Feb 11 05:50:14 v233c mysqld: 130211  5:50:14 [Warning] Aborted connection 198 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
Feb 11 05:50:16 v233c mysqld: 130211  5:50:16 [Warning] Aborted connection 197 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
Feb 11 05:50:37 v233c mysqld: 130211  5:50:37 [Warning] Aborted connection 316 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
Feb 11 05:50:38 v233c mysqld: 130211  5:50:38 [Warning] Aborted connection 315 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)
Feb 11 05:57:40 v233c mysqld: 130211  5:57:40 [Warning] Aborted connection 180 to db: 'foobar' user: 'foobar' host: 'localhost' (Unknown error)

But it was only because the GEOS PHP extension had been disabled only in the ini file for php53-fpm, while it was still active in the ini file for php-cli. It caused these errors during the cron task (which runs via drush/php-cli) when it started generating the xmlsitemap (huge in this case). This only confirms that GEOS was the real source of those problems.
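
Since the extension was switched off for one PHP SAPI but left on for another, it may help to check every ini file in one go. The following is a rough sketch only: `extension_enabled` and `find_active` are hypothetical helper names, the ini file paths are whatever you pass on the command line (no BOA paths are assumed here), and it looks only at plain `extension=` lines, so an extension loaded from an additional ini directory would need those files passed in as well.

```python
import re
import sys


def extension_enabled(ini_text, name):
    """Return True if ini_text has an uncommented extension= line for name."""
    pattern = re.compile(
        r'^\s*extension\s*=\s*"?(?:.*/)?' + re.escape(name)
        + r'(?:\.so)?"?\s*(?:;.*)?$'
    )
    # Lines starting with ';' are ini comments and never match the pattern.
    return any(pattern.match(line) for line in ini_text.splitlines())


def find_active(ini_paths, name="geos"):
    """Return the subset of ini_paths that still load the named extension."""
    active = []
    for path in ini_paths:
        try:
            with open(path, encoding="utf-8", errors="replace") as handle:
                if extension_enabled(handle.read(), name):
                    active.append(path)
        except OSError:
            continue  # unreadable or missing file: skip it
    return active


if __name__ == "__main__":
    # Usage: pass the ini files used by php53-fpm and php-cli as arguments.
    for ini_path in find_active(sys.argv[1:]):
        print("geos still enabled in", ini_path)
```

Running it against both the php53-fpm and the php-cli ini files would have shown that GEOS was still being loaded for drush/cron.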

omega8cc’s picture

Status: Active » Fixed

omega8cc’s picture

I have just confirmed that all affected sites we have checked so far have the geoPHP module enabled, which would explain why only those sites experienced problems, I guess.

omega8cc’s picture

To follow up, the mysql and VPS crash mentioned above most probably had a different cause: critical bugs in APC 3.1.14. See #1914294: APC 3.1.14 disappeared from PECL

omega8cc’s picture

Assigned: omega8cc » Unassigned

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Anonymous’s picture

Issue summary: View changes

Typos