Today I lost an important database. I asked in #aegir, and both omega8cc and ergonlogic have experienced the same thing.

omega8cc told me that it can happen in edge cases where a half-baked clone is deleted while its settings.php file still has the old credentials. That can make the site being cloned lose its database. It happened to me, and it has also happened to both omega8cc and ergonlogic.

Might I propose that the settings.php is altered before being put into the cloned site?

Comments

Anonymous’s picture

Sorry to hear this.

It's a bit of a catch-22: we need to unpack the archive we took from the original site in order to then write its settings.php. So we'd have to save a new version of the settings.php before making the archive, and I'm not sure if that would be a problem when rolling back (note that clone functionality uses the underlying 'deploy' functionality, same as the Migrate task does, and same as when importing a site from an archive).

Anyway, there's a chance we could put _provision_drupal_create_settings_file() into drush_provision_drupal_provision_deploy_validate() maybe. I need to test it.

Another option might be to not drop the database in the db service's rollback hook, for safety, and let the user figure it out for themselves.

Meanwhile, I highly recommend you implement daily database backups outside of the backups Aegir makes when moving your site around, if you aren't already!

lsolesen’s picture

@mig5 I just implemented hosting_backup_queue and hosting_backup_gc, which will make a backup. It is probably not a good idea to skip the cleanup after a rollback. However, IMHO this should be high on the priority list, because of the extent of the data loss.

Anonymous’s picture

Note also that a backup of your database was taken in the first place in order to make a clone from. So you could restore from the database.sql located in that tarball in /var/aegir/backups/ and it would've been the most recent (only seconds or minutes old) version of your database of your original site, without modifications. And you'd probably have to add the GRANT yourself too, but easier than starting from scratch or a really old backup.
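
Pulling the dump out of such a backup tarball could look roughly like the sketch below. It is Python and purely illustrative: the function name `extract_database_dump` is made up, and it assumes the dump is stored in the tarball under the file name database.sql, as described above.

```python
import tarfile
from pathlib import Path


def extract_database_dump(tarball, dest_dir, member_name="database.sql"):
    """Extract only the SQL dump from a backup tarball and return its path.

    Hypothetical helper: the name and the layout assumption (a member whose
    base name is database.sql) are illustrative, not part of Aegir.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball, "r:*") as archive:
        for member in archive.getmembers():
            # Match on the base name so "database.sql" and "./database.sql" both work.
            if member.isfile() and Path(member.name).name == member_name:
                target = dest / member_name
                target.write_bytes(archive.extractfile(member).read())
                return target
    raise FileNotFoundError(f"{member_name} not found in {tarball}")
```

The extracted file would then be loaded with your usual MySQL client, and the GRANT re-added by hand as noted above.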

Anonymous’s picture

And I agree it's a critical issue. That said, we'll have to reproduce the conditions: 3 years of Aegir and it's never happened to me :) Do you happen to have the task log that might show us where in the process it started to roll back and delete the wrong database?

lsolesen’s picture

@mig5 Where do I find the task log? I also know it happened to @omega8cc a couple of times, so he should be of even better help, I think.

lsolesen’s picture

The task had been hanging for a while without me noticing, so the backup was even older than the one Linode provided.

Anonymous’s picture

The task log would be in your Task Queue in the right sidebar of your Aegir interface. Presumably you have a failed (red) clone task there, so if you still have it, click view, then copy-paste the whole output into pastie.org and link it here.

lsolesen’s picture

Sorry, that is not available anymore.

steven jones’s picture

Status: Active » Postponed (maintainer needs more info)

If someone wants to add details of how to reproduce this bug we can start work on a fix.

dman’s picture

This *has* happened to me a long while ago, though I can't recall what caused it at the time.

And just today, I got a horrible scare again. Not precisely from 'cloning' - but from installing a site from scratch.
I tried out a new distribution of a very advanced install profile, and when creating the first new site on it (via Aegir web UI), *something* failed towards the end (message was about a duplicate key in the theme settings table).

The master hostmaster database was dropped! Aegir committed suicide and the hostmaster site went offline.

I was left with half a site where the new one should be. The new site database was built and left behind. The new site folder was *not* there and the new site vhost was not enabled. Though the alias drushrc file for the new host had been made.
Other sites were still running.
But my hostmaster site had been nuked!

Thankfully I had a nightly backup to restore from. Though it took some time to believe that it had really gone away like that.

It feels like Provision tried to roll back the broken site creation, but decided to 'drop database' in the wrong context.

Freaking scary.
Not a fun thing to reproduce either, so I'll have to clean-room a whole new aegir system, far away from our real live one before I try to replicate it again.

steven jones’s picture

Status: Postponed (maintainer needs more info) » Active

Really sorry that that happened. Thanks for the detailed report. We should be able to cause an artificial error at various parts of the process and work out why the error causes the HM db to be dropped.

lsolesen’s picture

It happened again today. This time I had created a site on a platform (stage.site.dk). However, the site would not install correctly. I removed the task, because it just kept spinning. Then I tried enabling the site, but decided to delete it instead. Everything seemed all right. But this nuked the database for my main site (site.dk) on the same platform, so I had to find an old backup.

No migration or cloning was involved this time, so neither can be the cause.

dman’s picture

I can replicate the hostmaster database drop consistently when trying out http://drupal.org/project/commerce_kickstart

I built the 'commerce' install profile (from drush dl, then drush make, as suggested as an alternative build method at http://drupal.org/node/1291122, as I preferred my contrib modules in the normal place).
Verified the platform via the Aegir UI fine.
Adding a site via Aegir UI produced the results described above, dropping my master hostmaster database.

My workaround for now is to:
Create a site using the 'minimal' install profile but on the commerce platform.
Manually drop and re-create an empty version of the newly created database.
Access the new site's install.php and go through the interactive install process.

I know that somewhere in my site settings, there may still be a reference to the 'minimal' profile and that can later screw up module installations, but right now the process works enough for demos.

Ultimately, it's the super-nice install process that commerce has done that probably is incompatible with a provision-automated setup, and that needs to be addressed over there BUT the hostmaster-death is a serious failure state. Install profiles are hard, we know that, but I can't leave these platforms on the aegir system if this will happen :-/

lsolesen’s picture

In #12 I also used commerce_kickstart.

omega8cc’s picture

We are using commerce_kickstart (in BOA) without issues and I have never seen anything like that (deleted hostmaster db), however we are using Aegir 2.x

lsolesen’s picture

In my case it was not the hostmaster db that got deleted, but the db for the main site on the same platform.

omega8cc’s picture

Then my comment relates only to #13.

Note that commerce_kickstart does have a built-in db self-destruction feature (site re-install), and while it will break the site in BOA (and should never be used in Aegir), it will not delete its database (in BOA), let alone the hostmaster database.

dman’s picture

I'm pretty sure it's not from an intentional self-destruct specific to commerce.

My report in #10 was from an early aGov install profile. http://agov.com.au/
This week I got precisely the same result from the commerce profile.
And long ago I did it to myself somehow when trying to make my own profile.

There is something about the failure states (which are of course somewhat unexpected and unpredictable) where the fallback (dropping the database) really is just running against the wrong database.

lsolesen’s picture

And today it happened again. It seems that tasks will spin forever if two octopuses run fairly big migrates at the same time. And somehow the database for the main site got destroyed after the migrate task was deleted. And there was no backup from just prior to the migrate task, as it seems that the task stalled there (so I had to use a day-old backup).

I do not know how I am supposed to avoid this, as I tried it on a dev-site and the migration was successful.

@mig5 Were you able to get further with your comments in #1?

dman’s picture

It's hard to trace because any logging of just what it was up to when it failed (I presume a hard crash on the part of the child site) may have gone into the logs of the master site, which then immediately deletes itself. Next time I'm feeling like causing myself pain maybe I'll try switching to syslogging and see if that throws up any clues. And back up master first :-{. Or revoke the daemon's 'DROP DATABASE' rights for a while.
It may even be that trying to *switch* into the context of the new child is the bit that fails (is that what it does?), which would explain why the current DB is the one that gets 'rolled back' and deleted. This is just guesswork, because I'm not personally too clear on all the internals of what happens there. I'd have thought that the install process was a completely different drush-driven thread.
*guesswork*

But I've learnt not to use aegir to bootstrap the advanced install profiles like Commerce. Instead I build them manually, then import the built DB over top of a 'standard' aegir-built site.

omega8cc’s picture

Really weird. I have never seen anything like that (hostmaster database trashed) on any existing BOA system. But then, BOA forces by default only 1 task per cron run and we don't use the daemon to speed up the task queue. So maybe it is a result of some race condition plus context messing here?

helmo’s picture

Version: 6.x-1.9 » 7.x-2.x-dev

I just had such an event, luckily on a dev box. And I was even able to recover from the backup Aegir made.

What happened:
I had aegir on e.g. aegir.example.com, and a regular site on example.com
I tried to install panopoly.dev.example.com, whose install task failed (why is a different issue, which I'll look into later)
I deleted the panopoly.dev.example.com site, and bam.... gone was the database of my example.com site.

After all this the panopoly.dev.example.com site did have a directory in 'sites', but no settings.php, just a drushrc.php. Therefore the settings.php from example.com was mistakenly used.

It looks like the sites/panopoly.dev.example.com directory was created again by the "Temporarily uncloaking database credentials for backup" code in drush_provision_drupal_provision_backup.

How can we prevent this from happening again? Using a different site's settings.php is bad...
Just skip the backup if we notice the sites/panopoly.dev.example.com is not there?
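
The fallback described above can be illustrated with a simplified Python model of how Drupal picks a sites/ directory for a hostname: it tries progressively shorter domain suffixes and takes the first directory that contains a settings.php. This is only a sketch of Drupal's conf_path() behaviour as I understand it, not Provision code; it ignores ports and sub-directory installs, and all function names are hypothetical. The last function shows the kind of guard being asked about: refuse to proceed when the resolved settings.php is not the site's own.

```python
from pathlib import Path
from typing import List


def candidate_site_dirs(hostname: str) -> List[str]:
    """List sites/ directory names from most to least specific, ending in 'default'."""
    labels = hostname.rstrip(".").split(".")
    return [".".join(labels[i:]) for i in range(len(labels))] + ["default"]


def resolve_settings_dir(sites_root: Path, hostname: str) -> str:
    """Return the first candidate directory that actually contains a settings.php."""
    for name in candidate_site_dirs(hostname):
        # A directory holding only drushrc.php is skipped, so lookup falls through.
        if (sites_root / name / "settings.php").is_file():
            return name
    raise FileNotFoundError(f"no settings.php found for {hostname}")


def settings_belong_to_site(sites_root: Path, hostname: str) -> bool:
    """Guard (hypothetical): True only when the resolved settings.php is the site's own."""
    try:
        return resolve_settings_dir(sites_root, hostname) == hostname
    except FileNotFoundError:
        return False
```

With sites/example.com/settings.php present and sites/panopoly.dev.example.com holding only a drushrc.php, this model resolves panopoly.dev.example.com to the example.com directory, which matches the symptom above.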

dman’s picture

"After all this the panopoly.dev.example.com site did have a directory in 'sites', but no settings.php, just a drushrc.php. Therefore the settings.php from example.com was mistakenly used."

THIS may be a clue!

It also has some similarity with my setup, and would have been a quirk that other installations may not have had.

I have Wildcard DNS set up on my aegir server, so for quick site creation I can just name sites under that domain and they become fully available.

eg, the Aegir master is http://hostmaster.mysite.com/
My DNS says to route *.hostmaster.mysite.com to the hostmaster server IP
So when I tell aegir to create newsite.hostmaster.mysite.com or sprint2.projectname.hostmaster.mysite.com they all just work instantly.
So far so clever.

But YES, maybe if newsite.hostmaster.mysite.com was broken, then Drupal's domain resolution would find hostmaster.mysite.com and it would get picked up - and deleted - instead?
But NO, I can understand this being a potential threat if everything was in the same platform, but hostmaster and the demo platform are different vhosts and everything.
So ... hm. Maybe what helmo got is a new symptom, but not the same one that's been killing the aegir hostmaster main DB outright.

anarcat’s picture

Version: 7.x-2.x-dev » 6.x-1.9

@helmo - did you mean to change the version to 2.x? If this affects both 1.9 and 2.x, we should keep this assigned to 1.x so that we fix this critical in the stable branch, I believe.

helmo’s picture

@anarcat: Yes, as with core I think we should first fix in the latest dev and then backport to stable.

I see you've also created a separate issue (#1930740: provision-delete leaves a drushrc.php lying around) related to my comment in #22.

helmo’s picture

Status: Active » Needs review
Attached patch (new, 1.15 KB)

I have a solution: in the rollback of an install we should also delete the drush site alias.

The patch works on both 1.x and 2.x

anarcat’s picture

Status: Needs review » Fixed

i have committed this on both 1.x and 2.x branches, please confirm if the issue is fixed.

thanks.

lsolesen’s picture

Will this prevent the database deletion, or only remove the leftover drush alias file?

helmo’s picture

Yes and no.

The case that I describe in #22 was caused by the dangling drush alias.

I hope that this was also the root cause of the initially reported problem. But as those were never reproducible we can't be sure.

If you can reproduce a scenario where this fails on the latest code, then please describe here or in a new issue.

lsolesen’s picture

However, omega8cc thought the issue is caused by a dangling settings.php, not drush alias.

dman’s picture

It was reproducible just by trying to install the commerce_kickstart install profile last time I tried, though that wasn't on the 'latest' code but on 6.x-1.9 stable.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

lsolesen’s picture

Status: Closed (fixed) » Active

Should this issue be closed when the answer in #29 is "yes and no"?

helmo’s picture

Status: Active » Fixed

Well, this could stay open for years waiting for someone to confirm that this is not fixed since #27.

So unless you can reproduce a problem with the latest code I prefer to keep this closed.

anarcat’s picture

Status: Fixed » Closed (fixed)

anarcat’s picture

Version: 6.x-1.9 » 6.x-1.x-dev

  • Commit 010f6b2 on dev-drupal-8, 6.x-2.x, dev-ssl-ip-allocation-refactor, dev-1205458-move_sites_out_of_platforms, 7.x-3.x, dev-subdir-multiserver, 6.x-2.x-backports, dev-helmo-3.x by anarcat:
    Issue #1678528 by helmo: Fixed Database deleted on edge cases.