Database deleted on edge cases [#1678528]

Comment	File	Size	Author
#26	provision.code_.1678528-26.patch	1.15 KB	helmo
#22	provision.code_.1678528-22-delete_log.txt	8.92 KB	helmo
#21	1-task-per-cron-run.jpg	168.83 KB	omega8cc

Comment #1

Anonymous (not verified) commented 11 July 2012 at 07:12

Sorry to hear this.

It's a bit of a catch 22 - we need to unpack the archive we took from the original site in order to then write its settings.php. So we'd have to save a new version of the settings.php before making the archive - and I'm not sure if that would be a problem when rolling back (note that clone functionality uses the underlying 'deploy' functionality, same as the Migrate task does, and same as when importing a site from an archive)

Anyway, there's a chance we could put _provision_drupal_create_settings_file() into drush_provision_drupal_provision_deploy_validate() maybe. I need to test it.

Another option might be to not drop the database in the db service's rollback hook, for safety - and let the user figure it out for themselves..

Meanwhile, I highly recommend you implement daily database backups outside of the backups Aegir makes when moving your site around, if you aren't already!

Comment #2

lsolesen commented 11 July 2012 at 07:31

@mig5 I just implemented hosting_backup_queue and hosting_backup_gc which will make a backup. It is probably not a good idea not to clean up after rollback. However, IMHO this should be high on the priority list, because of the extent of the data loss.

Comment #3

Anonymous (not verified) commented 11 July 2012 at 07:32

Note also that a backup of your database was taken in the first place in order to make a clone from. So you could restore from the database.sql located in that tarball in /var/aegir/backups/, and it would've been the most recent (only seconds or minutes old) version of the database of your original site, without modifications. You'd probably have to add the GRANT yourself too, but that's easier than starting from scratch or from a really old backup.
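
As a rough illustration of the recovery path described above, here is a minimal Python sketch that pulls only the SQL dump out of a backup tarball so it can be re-imported by hand. The function name `extract_database_dump` and the example paths are hypothetical; the only details taken from this thread are the `database.sql` file name and the /var/aegir/backups/ location. Re-importing the dump and re-adding the GRANT are left to the operator.

```python
import shutil
import tarfile
from pathlib import Path


def extract_database_dump(tarball: str, dest_dir: str,
                          member_name: str = "database.sql") -> Path:
    """Copy the first file named `member_name` out of `tarball` into `dest_dir`.

    Hypothetical helper: it only extracts the dump, it does not import it.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball, "r:*") as archive:
        for member in archive.getmembers():
            # Match on the base name, so "database.sql" and "./database.sql"
            # (or a copy nested in a subdirectory) are all accepted.
            if member.isfile() and Path(member.name).name == member_name:
                source = archive.extractfile(member)
                target = dest / member_name
                with source, open(target, "wb") as out:
                    shutil.copyfileobj(source, out)
                return target
    raise FileNotFoundError(f"{member_name} not found in {tarball}")


# Example call (the tarball name below is made up):
# dump = extract_database_dump("/var/aegir/backups/example.com.tar.gz", "/tmp/restore")
```

Writing the dump to a scratch directory first, rather than piping it straight into the database, leaves room to inspect it before anything is overwritten.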

Comment #4

Anonymous (not verified) commented 11 July 2012 at 07:33

And I agree it's a critical issue - that said, we'll have to reproduce the conditions - 3 years of Aegir and it's never happened to me :) Do you happen to have the task log that might show us where in the process it started to roll back and delete the wrong database?

Comment #5

lsolesen commented 11 July 2012 at 07:46

@mig5 Where do I find the task log? I also know it happened to @omega8cc a couple of times. His logs might be an even better help, I think.

Comment #6

lsolesen commented 11 July 2012 at 07:47

The task had been hanging for a while without me noticing, so the backup was even older than the one Linode provided.

Comment #7

Anonymous (not verified) commented 11 July 2012 at 07:47

The task log would be in your Task Queue in the right sidebar of your Aegir interface. Presumably you have a failed (red) clone task there, so if you still have it, click view and then copy-paste the whole output into pastie.org and link it here.

Comment #8

lsolesen commented 11 July 2012 at 08:22

Sorry, that is not available anymore.

Comment #9

steven jones commented 18 August 2012 at 09:21

Status:

Active

» Postponed (maintainer needs more info)

If someone wants to add details of how to reproduce this bug we can start work on a fix.

Comment #10

dman commented 21 October 2012 at 13:57

This *has* happened to me a long while ago, though I can't recall what caused it at the time.

And just today, I got a horrible scare again. Not precisely from 'cloning' - but from installing a site from scratch.
I tried out a new distribution of a very advanced install profile, and when creating the first new site on it (via Aegir web UI), *something* failed towards the end (message was about a duplicate key in the theme settings table).

The master hostmaster database was dropped! Aegir committed suicide and the hostmaster site went offline.

I was left with half a site where the new one should be. The new site database was built and left behind. The new site folder was *not* there and the new site vhost was not enabled. Though the alias drushrc file for the new host had been made.
Other sites were still running.
But my hostmaster site had been nuked!

Thankfully I had a nightly backup to restore from. Though it took some time to believe that it had really gone away like that.

It feels like Provision tried to roll back the broken site creation, but decided to 'drop database' in the wrong context.

Freaking scary.
Not a fun thing to reproduce either, so I'll have to clean-room a whole new aegir system, far away from our real live one before I try to replicate it again.

Comment #11

steven jones commented 26 October 2012 at 18:46

Status:

Postponed (maintainer needs more info)

» Active

Really sorry that that happened. Thanks for the detailed report. We should be able to cause an artificial error at various parts of the process and work out why the error causes the HM db to be dropped.

Comment #12

lsolesen commented 19 November 2012 at 12:00

It happened again today. This time I had created a site on a platform (stage.site.dk). However, the site would not install correctly. I removed the task, because it just kept spinning. Then I tried enabling the site, but decided to delete it instead. Everything seemed all right. But this nuked the database for my main site (site.dk) on the same platform, so I had to find an old backup.

Note that no migration or cloning was involved this time, so neither can be the cause.

Comment #13

dman commented 22 November 2012 at 21:12

I can replicate the hostmaster database drop consistently when trying out http://drupal.org/project/commerce_kickstart

I built the 'commerce' install profile (from drush dl, then drush make, as suggested as an alternative build method at http://drupal.org/node/1291122 , as I preferred my contrib modules in the normal place)
Verified the platform via the Aegir UI fine.
Adding a site via Aegir UI produced the results described above, dropping my master hostmaster database.

My workaround for now is to:
Create a site using the 'minimal' install profile, but on the commerce platform.
Manually drop and re-create an empty version of the newly created database.
Access the new site's install.php and go through the interactive install process.

I know that somewhere in my site settings, there may still be a reference to the 'minimal' profile and that can later screw up module installations, but right now the process works enough for demos.

Ultimately, it's the super-nice install process that commerce has done that probably is incompatible with a provision-automated setup, and that needs to be addressed over there BUT the hostmaster-death is a serious failure state. Install profiles are hard, we know that, but I can't leave these platforms on the aegir system if this will happen :-/

Comment #14

lsolesen commented 23 November 2012 at 06:34

In #12 I also used commerce_kickstart.

Comment #15

omega8cc commented 23 November 2012 at 12:53

We are using commerce_kickstart (in BOA) without issues and I have never seen anything like that (deleted hostmaster db), however we are using Aegir 2.x

Comment #16

lsolesen commented 23 November 2012 at 16:32

In my case it was not the hostmaster db that was deleted, but the db for the main site on the same platform.

Comment #17

omega8cc commented 23 November 2012 at 16:59

My comment relates only to #13, then.

Note that commerce_kickstart does have built-in db self-destruction feature (site re-install), and while it will break the site in BOA (and should never be used in Aegir), it will not delete its database (in BOA). Not to mention hostmaster database.

Comment #18

dman commented 23 November 2012 at 23:58

I'm pretty sure it's not from an intentional self-destruct specific to commerce.

My report in #10 was from an early aGov install profile. http://agov.com.au/
This week I got precisely the same result from the commerce profile.
And long ago I did it to myself somehow when trying to make my own profile.

There is something about the failure states (which are of course somewhat unexpected and unpredictable) where the fallback (dropping the database) really is just running against the wrong database.

Comment #19

lsolesen commented 21 February 2013 at 22:11

And today it happened again. It seems that tasks will spin forever if two octopuses run fairly big migrates at the same time. And somehow the database for the main site got destroyed after the migrate task was deleted. And there was no backup from just prior to the migrate task, as it seems that the task stalled there (so I had to use a day-old backup).

I do not know how I am supposed to avoid this, as I tried it on a dev-site and the migration was successful.

@mig5 Were you able to get further with your comments in #1?

Comment #20

dman commented 21 February 2013 at 23:45

It's hard to trace because any logging of just what it was up to when it failed (I presume a hard crash on the part of the child site) may have gone into the logs of the master site, which then immediately deletes itself. Next time I'm feeling like causing myself pain maybe I'll try switching to syslogging and see if that throws up any clues. And backup master first :-{. Or revoke the daemon's 'DROP DATABASE' rights for a while.
It may even be that trying to *switch* into the context of the new child is the bit that fails (is that what it does?), which would explain why the current DB is the one that gets 'rolled back' and deleted. This is just guesswork, because I'm not personally too clear on all the internals of what happens there. I'd have thought that the install process was a completely different drush-driven thread.
*guesswork*

But I've learnt not to use aegir to bootstrap the advanced install profiles like Commerce. Instead I build them manually, then import the built DB over top of a 'standard' aegir-built site.

Comment #21

omega8cc commented 22 February 2013 at 01:13

Status	File	Size
new	1-task-per-cron-run.jpg	168.83 KB

Really weird. I have never seen anything like that (hostmaster database trashed) on any existing BOA system. But then, BOA forces by default only 1 task per cron run and we don't use the daemon to speed up the task queue. So maybe it is a result of some race condition plus context messing here?

1 task

Comment #22

helmo commented 28 February 2013 at 11:33

Version:

6.x-1.9

» 7.x-2.x-dev

Status	File	Size
new	provision.code_.1678528-22-delete_log.txt	8.92 KB

I just had such an event, luckily on a dev box. And I was even able to recover from the backup Aegir made.

What happened:
I had aegir on e.g. aegir.example.com, and a regular site on example.com
I tried to install panopoly.dev.example.com, whose install task failed (why is a different issue, which I'll look into later)
I deleted the panopoly.dev.example.com site, and bam.... gone was the database of my example.com site.

After all this the panopoly.dev.example.com site did have a directory in 'sites', but no settings.php, just a drushrc.php. Therefore the settings.php from example.com was mistakenly used.

It looks like the sites/panopoly.dev.example.com was created again by the "Temporarily uncloaking database credentials for backup" code in drush_provision_drupal_provision_backup.

How can we prevent this from happening again? Using a different settings.php is bad...
Just skip the backup if we notice the sites/panopoly.dev.example.com is not there?
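
The guard suggested above (do nothing destructive when the site has no settings.php of its own) could look roughly like this. It is a Python sketch rather than Provision's PHP, and `site_settings_path` and `safe_to_touch_database` are hypothetical names; the only thing taken from this thread is the layout of sites/<site name>/settings.php under the platform root.

```python
from pathlib import Path


def site_settings_path(platform_root: str, site_name: str) -> Path:
    """Location of the site's own settings.php under the platform root."""
    return Path(platform_root) / "sites" / site_name / "settings.php"


def safe_to_touch_database(platform_root: str, site_name: str) -> bool:
    """Return True only if the site has its own settings.php.

    A directory holding nothing but drushrc.php does not count, because
    database credentials would then have to come from some other site.
    """
    return site_settings_path(platform_root, site_name).is_file()


def backup_or_skip(platform_root: str, site_name: str) -> str:
    """Decide what a backup or delete step should do for this site."""
    if not safe_to_touch_database(platform_root, site_name):
        return "skip"  # refuse rather than fall back to another site's settings
    return "proceed"
```

The point of the sketch is the refusal: a missing settings.php is treated as a reason to stop, not as a cue to look for credentials elsewhere.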

Comment #23

dman commented 28 February 2013 at 13:32

After all this the panopoly.dev.example.com site did have a directory in 'sites', but no settings.php, just a drushrc.php. Therefore the settings.php from example.com was mistakenly used.

THIS may be a clue!

It also has some similarity with my setup, and would have been a quirk that other installations may not have had.

I have Wildcard DNS set up on my aegir server, so for quick site creation I can just name sites under that domain and they become fully available.

eg, the Aegir master is http://hostmaster.mysite.com/
My DNS says to route *.hostmaster.mysite.com to the hostmaster server IP
So when I tell aegir to create newsite.hostmaster.mysite.com or sprint2.projectname.hostmaster.mysite.com they all just work instantly.
So far so clever.

But YES, maybe if newsite.hostmaster.mysite.com was broken, then Drupal's domain resolution would find hostmaster.mysite.com and it would get picked up - and deleted - instead?
But NO, I can understand this being a potential threat if everything was in the same platform, but hostmaster and the demo platform are different vhosts and everything.
So ... hm. Maybe what helmo got is a new symptom, but not the same one that's been killing the aegir hostmaster main DB outright.
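
To make the fallback being discussed concrete, here is a simplified Python model of how Drupal 6/7 pick a sites/ directory for a hostname: try the full hostname, then drop leading labels one at a time, and take the first directory that actually contains a settings.php. This is a sketch from memory of `conf_path()`, not a port of it; it ignores ports, URL paths and sites.php aliases, and the function names are made up for illustration.

```python
from pathlib import Path
from typing import List


def candidate_site_dirs(hostname: str) -> List[str]:
    """Directory names to try, most specific first, ending with 'default'."""
    labels = hostname.lower().split(".")
    return [".".join(labels[i:]) for i in range(len(labels))] + ["default"]


def resolve_site_dir(platform_root: str, hostname: str) -> str:
    """Return the first candidate directory that contains a settings.php."""
    sites = Path(platform_root) / "sites"
    for name in candidate_site_dirs(hostname):
        # A directory without settings.php is skipped, not treated as a match.
        if (sites / name / "settings.php").is_file():
            return name
    return "default"


print(candidate_site_dirs("panopoly.dev.example.com"))
# ['panopoly.dev.example.com', 'dev.example.com', 'example.com', 'com', 'default']
```

Under this model, a half-created sites/panopoly.dev.example.com containing only drushrc.php is skipped and the lookup lands on sites/example.com, which matches the symptom in #22. As noted above, it only applies when both directories live under the same platform, so it would not by itself explain a hostmaster database on a separate platform being dropped.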

Comment #24

anarcat commented 28 February 2013 at 21:51

Version:

7.x-2.x-dev

» 6.x-1.9

@helmo - did you mean to change the version to 2.x? If this affects both 1.9 and 2.x, we should keep this assigned to 1.x so that we fix this critical in the stable branch, I believe.

Comment #25

helmo commented 1 March 2013 at 09:36

@anarcat: Yes, as with core I think we should first fix in the latest dev and then backport to stable.

I see you've also created a separate issue (#1930740: provision-delete leaves a drushrc.php lying around) related to my comment in #22.

Comment #26

helmo commented 5 March 2013 at 12:28

Status:

Active

» Needs review

Status	File	Size
new	provision.code_.1678528-26.patch	1.15 KB

I have a solution: in the rollback of an install we should also delete the drush site alias.

The patch works on both 1.x and 2.x
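
The idea can be sketched as follows. This is not the attached patch, which changes Provision's PHP; it is a small Python illustration of "remove the alias file during rollback". The `<site>.alias.drushrc.php` naming and the helper names `alias_file` and `rollback_install` are assumptions made for the example and should be checked against a real installation.

```python
from pathlib import Path


def alias_file(drush_dir: str, site_name: str) -> Path:
    """Assumed location of a site's drush alias file (naming is an assumption)."""
    return Path(drush_dir) / f"{site_name}.alias.drushrc.php"


def rollback_install(drush_dir: str, site_name: str) -> bool:
    """Remove the alias left behind by a failed install.

    Returns True if an alias file was deleted, False if there was none.
    """
    path = alias_file(drush_dir, site_name)
    if path.is_file():
        path.unlink()
        return True
    return False
```

With the alias gone, a later delete task has no stale context pointing at the half-installed site, so there is nothing left that can lead it to another site's settings.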

Comment #27

anarcat commented 5 March 2013 at 17:56

Status:

Needs review

» Fixed

I have committed this on both 1.x and 2.x branches, please confirm if the issue is fixed.

thanks.

Comment #28

lsolesen commented 5 March 2013 at 19:02

Will this prevent the database deletion, or only remove the leftover drush alias file?

Comment #29

helmo commented 6 March 2013 at 10:45

Yes and no.

The case that I describe in #22 was caused by the dangling drush alias.

I hope that this was also the root cause of the initially reported problem. But as those were never reproducible we can't be sure.

If you can reproduce a scenario where this fails on the latest code, then please describe here or in a new issue.

Comment #30

lsolesen commented 6 March 2013 at 16:12

However, omega8cc thought the issue is caused by a dangling settings.php, not drush alias.

Comment #31

dman commented 6 March 2013 at 19:33

Reproducible just by trying to install the commerce_kickstart install profile last time I tried, though that was on 6.x-1.9 stable, not the 'latest' code.

Comment #32

20 March 2013 at 19:40

Status:

Fixed

» Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Comment #33

lsolesen commented 21 March 2013 at 12:14

Status:

Closed (fixed)

» Active

Should this issue be closed when the answer in #29 is "yes and no"?

Comment #34

helmo commented 1 April 2013 at 13:06

Status:

Active

» Fixed

Well, this could stay open for years waiting for someone to confirm that this is not fixed since #27.

So unless you can reproduce a problem with the latest code I prefer to keep this closed.

Comment #35

anarcat commented 11 April 2013 at 19:21

Status:

Fixed

» Closed (fixed)

Comment #36

anarcat commented 17 July 2013 at 17:40

Version:

6.x-1.9

» 6.x-1.x-dev

actually, this broke the build, see #2044251: drush command '@none provision-save' could not be found.

Comment #37

12 June 2014 at 08:41

Commit 010f6b2 on dev-drupal-8, 6.x-2.x, dev-ssl-ip-allocation-refactor, dev-1205458-move_sites_out_of_platforms, 7.x-3.x, dev-subdir-multiserver, 6.x-2.x-backports, dev-helmo-3.x by anarcat:
```
Issue #1678528 by helmo: Fixed Database deleted on edge cases.
```
