We have seen this many times: a site is modified, some contrib modules are added or replaced, and it still verifies without issues. But any attempt to clone or migrate it fails with many errors that are very hard to debug, because it is not possible to see which module(s) cause the problem.

From the end user's point of view it looks like an epic fail, since Aegir is expected to do just this: make cloning, upgrading, migrating etc. easier. Yet none of those work, because some contrib module silently breaks everything.

It has always been good practice to disable all contrib modules when upgrading a Drupal site. It should (probably) also be common practice in Aegir, so the workflow would look like:

1. The site is temporarily taken offline before the clone/migrate task starts.
2. All contrib modules are disabled.
3. Now the clone or migrate task runs.
4. All contrib modules are enabled again.
5. Internal verify task is fired up.
6. On success, the site is put back online.
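The six steps above can be sketched roughly as follows. This is only an illustration of the proposed ordering: the function name and the `run_task` and `verify` callbacks are hypothetical and are not part of Aegir, Provision or Drush.

```python
from typing import Callable, List


def clone_with_contrib_disabled(
    contrib_modules: List[str],
    run_task: Callable[[], bool],
    verify: Callable[[], bool],
    log: List[str],
) -> bool:
    """Illustrative only: walk through the six proposed steps in order.

    run_task and verify are hypothetical callbacks standing in for the
    clone/migrate task and the internal verify task.
    """
    log.append("site offline")                              # step 1
    log.append("disabled: " + ", ".join(contrib_modules))   # step 2
    task_ok = run_task()                                    # step 3
    # Step 4 naively re-enables modules in their original order. A real
    # implementation would have to respect module dependencies.
    log.append("enabled: " + ", ".join(contrib_modules))    # step 4
    ok = task_ok and verify()                               # step 5
    if ok:
        log.append("site online")                           # step 6
    return ok
```

Note that the site stays offline if either the task or the verify step fails, which is one of the costs of this ordering.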

In fact, one could suppose this is how Aegir works, but apparently it doesn't care and just runs the clone/migrate on the site with all contrib modules enabled, which causes unexpected issues.

Comments

adrian’s picture

Status: Active » Closed (won't fix)

no. because then those disabled modules don't get upgraded.

and it's the upgrades that are killing it.

can you imagine an open atrium upgrade where none of the contrib upgrades have happened?
you also can't re-enable them because they depend on each other and there's no way to figure out the proper chain.

those modules need to be fixed.

adrian’s picture

Status: Closed (won't fix) » Closed (works as designed)

or more accurately.

omega8cc’s picture

Title: All contrib modules should be disabled before running clone tasks (probably) to avoid mysterious fails » All contrib modules should be disabled before running clone or migrate tasks (probably) to avoid mysterious fails
Status: Closed (works as designed) » Closed (won't fix)

I agree, but only when it comes to the migrate task, which is a common way to perform upgrades (I was wrong in this case, of course).

But what if I simply want to clone the site? Would disabling contrib modules hurt the clone task? Drush is smart enough to use dependencies and enable what should be enabled.

I don't think it is enough to say "please fix your module", since Aegir just fails and doesn't help to find why it fails.

omega8cc’s picture

Sorry, the Status got changed because you changed it when I had the page not reloaded.

omega8cc’s picture

Title: All contrib modules should be disabled before running clone or migrate tasks (probably) to avoid mysterious fails » All contrib modules should be disabled before running clone tasks (probably) to avoid mysterious fails

Removing the migrate task from title because of #3

omega8cc’s picture

Title: All contrib modules should be disabled before running clone or migrate tasks (probably) to avoid mysterious fails » All contrib modules should be disabled before running clone tasks (probably) to avoid mysterious fails
Status: Closed (won't fix) » Closed (works as designed)

To explain it a bit: I mentioned the migrate task because it is also used for site rename, which has nothing to do with upgrades etc., and it fails in this case too.

Anonymous’s picture

I actually agree that simply 'renaming' the site by the use of Migrate ought not to initiate an upgrade - that's a bit sneaky without making it obvious.

The problem is that migrate (and thus, by implication, rename) and clone both make use of provision-deploy to reduce code duplication, and provision-deploy is hard-coded to

a) compare the schemas of enabled packages (it fetches this from the site's drushrc.php)
b) run drush updatedb in the post hook.
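To illustrate what the schema comparison in (a) amounts to, here is a minimal sketch. The data shape (a mapping of module name to schema version) and the function names are assumptions made for the example; Provision reads this information from the site's drushrc.php and its real code differs.

```python
from typing import Dict, List


def modules_needing_update(
    site_schemas: Dict[str, int],
    platform_schemas: Dict[str, int],
) -> List[str]:
    """Return enabled modules whose schema on the target platform is newer."""
    return sorted(
        name
        for name, installed in site_schemas.items()
        # A module absent from the platform is handled separately below.
        if platform_schemas.get(name, installed) > installed
    )


def modules_missing_on_platform(
    site_schemas: Dict[str, int],
    platform_schemas: Dict[str, int],
) -> List[str]:
    """Return enabled modules that the target platform does not provide."""
    return sorted(set(site_schemas) - set(platform_schemas))
```

If the first list is non-empty, step (b) has pending updates to run, and that is where a misbehaving update hook can fail the whole task. The version numbers in any call to these functions would be made up for the example.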

I can't think of any sane way to fix this. Aegir can only work with what your site is telling it, and if the site is doing something wrong, Aegir will fail.

A good practice ought to be to clone first before Migrate. If this same problem happens for Clone, then you should be able to use it as a testing mechanism to anticipate what the results will be on a Migration of the original site.

I would also like to concretely get an example of how to reproduce this sort of thing - what do we know can break a poorly-behaving contrib module on update.php?

We won't avoid being a scapegoat for contrib problems (it's easier to demand more magic from Aegir than expect modules to behave in their own right..), but at least we can perhaps FAQ this to help people avoid the issue entirely.

skwashd’s picture

I agree with Adrian that disabling contrib is going to lead us into a world of pain. The whole idea of aegir is that it handles all of this stuff automagically.

mig5 is right that aegir will get blamed because of broken contrib modules, but if we document the problems (or even patch the modules) then we can kick it down the line.

Omega8cc's proposal, in my mind, goes against the whole idea of aegir.

butler360’s picture

I've run into this a number of times. Migrate fails because of module upgrades. The way I fix it usually is by upgrading the module on the old platform and running updb in drush. Often times the first pass of updb will have some errors, but some updates will take place. So I run updb again and it usually runs the last few upgrades and then says everything is fine. Re-verify, migrate, now it works.

So is it just because Aegir only gives it one shot? Would it help to re-run updb in the event that there are errors the first time?
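The "give it another shot" idea can be sketched like this. `run_updatedb` is a hypothetical callback standing in for however `drush updatedb` would be invoked and judged successful; nothing here describes what Aegir or Drush actually do.

```python
from typing import Callable


def run_updates_with_retry(
    run_updatedb: Callable[[], bool],
    max_attempts: int = 3,
) -> int:
    """Call run_updatedb until it reports success.

    Returns the number of attempts used, or raises RuntimeError if every
    attempt fails. run_updatedb is a hypothetical callback.
    """
    for attempt in range(1, max_attempts + 1):
        if run_updatedb():
            return attempt
    raise RuntimeError(f"updates still failing after {max_attempts} attempts")
```

A retry like this only helps when a failed pass leaves the database in a state that a second pass can continue from, which is what the manual re-runs described above rely on.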

Anonymous’s picture

It should also be noted (at the risk of sounding like I'm whining 'don't blame us') that Aegir is mostly at the mercy of Drush here. Some people have seen Drush be a bit more 'conservative' when it encounters errors, such as during drush updatedb, that a couple of refreshes or re-runs of /update.php in a browser might otherwise push through.

Once Drush encounters what it sees as permanent errors, the show's over, and the rollback hook functions are thrown into action. In that sense, yes, it is a one-shot thing.

I am *convinced* that were you to run 'drush updatedb' manually against the site, you'd see exactly the same behaviour. Aegir is a bit of a scapegoat here by actually caring about those errors and having rollback hooks to try and restore as much sanity as it can.

butler360’s picture

I actually said, or at least meant to say, the exact same thing. I run it manually against the site in order to migrate it through Aegir and I see the same thing, except when I do it manually I can run it again.

I was just curious if Aegir could tell drush, "Hey, do it again!" I'm not blaming Aegir. Although a note/documentation about this somewhere would be helpful.

Anonymous’s picture

No I know, I think we're on the same page here, and that wasn't directly a response to you but to everyone else subscribed.

My point is, trying to work out ways to make Aegir work around this is not the correct approach. Fixing the module so it doesn't do bad things in the first place, is the correct approach, despite it being a nuisance.

We've seen this case many times before. Only once did someone point out 'the Date module is screwing up', I searched and found non-Aegir users reporting the same problem, an issue was filed with the Date module, and it turned out to be poor practice in the hook_install or something like that. Fixed. That's the definition of awesome.

It's easy to expect Aegir to 'apologise for' these modules because that's the tool at hand, when more often than not it (well, along with Drush) is the only tool actually obeying the laws (finding errors, and reporting errors!)

I'd sooner have no errors being reported. Maybe it's just me.

I'd rather have a ticket for each time this occurred, with a full debug log of what happened, identifying which module caused the problem and which task was being run.

Even if it turns out to be an Aegir problem, then we can fix it. There are too many tickets being reported that, instead of saying 'here is the error I got', say 'I don't like this being a problem for me - I now want Aegir to do something totally different so it doesn't happen again', and that won't help the project improve.

Anonymous’s picture

I am also in my usual dry humour, happy to point out that I am not attacking the contrib modules for being the only things not written to handle upgrade paths properly.

Drupal core is just as flabbergastingly irreverent. Perfect case of what we're all talking about here: #818144: Add creation of semaphore table to update_fix_d6_requirements.

We worked around that one, because we could, but we shouldn't have had to. Aegir has an ugly hack in it from that time as a result. I may be the only one not happy about that.

butler360’s picture

To be clear, you are suggesting reporting the error in the Aegir (or really provision) queue? Or the module's queue? Or both?

What you're saying makes sense, so I just want to know for myself what to do to help out.

Anonymous’s picture

In our queue (Provision), so we can investigate. Tickets are cheap - we can always close or move it if we think it's not our problem :)

This is the same issue as distributions that have install profiles that can't be automated and thus fail in Aegir. If it gets reported in the contrib queue, you will find the maintainer will say 'I can't reproduce, I think it's Aegir..'.

P.S thanks for putting up with my rants :) despite the cynicism, I actually welcome more tickets in our queues and fixing these issues if we can :)

butler360’s picture

Hah, not a problem.

"If it gets reported in the contrib queue, you will find the maintainer will say 'I can't reproduce, I think it's Aegir..'."
-This is exactly why I usually don't report those.

I just ran into this issue two days ago, next time I'll copy the info and put it into the queue. Perhaps a note about this somewhere would be useful, because from an uninformed user's view if "Aegir" can't do the upgrade, then it's failing to do what it's designed to do.

omega8cc’s picture

The reason I didn't post the task log here is that it didn't include anything helpful for debugging. I'm not blaming Aegir, and I didn't intend to suggest anything against Aegir core logic etc. I just have enough support requests to tell you that Aegir could do something to not fail when there is no reason to fail, as is the case with the clone and rename tasks, where Aegir should never run any module comparisons or db updates etc. It should be as simple as cp -a site site2 and mv site site2.
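For illustration, the file-level part of such a lightweight clone and rename could look like the sketch below. The function names are hypothetical, and this covers only the site directory: a real clone would also need to copy the database and adjust the site's settings, which is not shown.

```python
import shutil
from pathlib import Path


def clone_site_dir(src: Path, dst: Path) -> None:
    """Roughly `cp -a src dst`: copy the tree, keeping symlinks as symlinks.

    shutil.copytree fails if dst already exists, so an existing site is
    never silently overwritten.
    """
    shutil.copytree(src, dst, symlinks=True)


def rename_site_dir(src: Path, dst: Path) -> None:
    """Roughly `mv src dst`, but refuse to touch an existing target."""
    if dst.exists():
        raise FileExistsError(dst)
    shutil.move(str(src), str(dst))
```

Neither function compares module versions or runs database updates, which is the whole point of the request.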

The error in this case was just:

Undefined index: profiles deploy.provision.inc:84
array_keys(): The first argument should be an array deploy.provision.inc:84
Undefined index: deploy.provision.inc:88
array_merge(): Argument #2 is not an array deploy.provision.inc:88
Undefined index: modules deploy.provision.inc:89
Invalid argument supplied for foreach() deploy.provision.inc:89
The external command could not be executed due to an application error.
Drush command could not be completed.
Drush was not able to start (bootstrap) Drupal. Hint: This error can only o...
Output from failed command :
Undefined index: driver environment.inc:939
Undefined index: driver environment.inc:939
PDO support available, but the driver has not been installed. Assuming success.
An error occurred at function : drush_provision_drupal_post_provision_deploy

To debug it, I would need to disable contrib modules one by one, however my experience is that Aegir can fail completely then and this is something wrong in the Aegir internal logic (I think), see: http://drupal.org/node/936390#comment-3555690

I wrote about it already in the past: I don't buy the developer/code oriented logic, like "we need it to keep our code beautiful and optimized and..." etc. Unless you are writing this code for your own pleasure only, of course, but I believe we are working on this to make it better for the real world users, so we have to be prepared to sacrifice at least part of our dev oriented preferences instead of telling people "hey, it is by design!"

To make it even more funny: it is not just about broken modules or install profiles etc. Aegir failed for me many times on some well known modules like ctools, and the point is: it can happen to *any* module, not just to some poorly written one. And while we can't avoid problems on site migrate task, when we have to compare everything, run db updates etc. we should be able to: clone and rename (very often used by real world users in the development workflow) without issues, and to display something really useful for debugging in the task log, when migration fails.

Of course, this is my personal opinion only. It is, however, based on many, many support requests I have seen in our internal queue.

omega8cc’s picture

Category: bug » feature
Status: Closed (works as designed) » Needs work

And here is a hot response we just received from the Client:

Hi. I didn't have time to try cloning after turning off each module in sequence so I just turned them all off. Cloned the site and then turned them all on again. Not ideal really but the site had to go live. Sorry I can't be more help.

As you can see, it is not always true, that turning modules off/on on cloning/renaming the site can't help. It can.

Anonymous’s picture

Title: All contrib modules should be disabled before running clone tasks (probably) to avoid mysterious fails » rename (and perhaps clone) should not invoke provision-deploy (or avoid invoking drush updatedb)

The issue was never that it can't help but that it defeats the purpose of trying to upgrade sites if you turn off the modules first.

Renaming this issue to 'rename should not invoke provision-deploy or avoid drush updatedb' then.

Aegir has not been programmed by terrorists to randomly break on well-behaving modules. The P in Provision is not for Pixie Dust. We should get to the bottom of that separately. You had a Drush error whereby it couldn't bootstrap a site. If it couldn't bootstrap a site, it couldn't disable any modules, because it would not have been able to talk to the database. So something else did that. But let's focus on the other issue here for now.

P.S and the logic of disabling modules (which occurs when a schema version didn't match, or a module was missing, or perhaps when a hook_update failed), is done in Drush, or core I think.

You would have seen the same problem had you run drush updatedb on the site, on a non Aegir host. But I'll drop this now.

Anonymous’s picture

So,

notes to self,

If we are to avoid provision-deploy on rename, we should wrap that stuff in a conditional based on whether the target name has changed.

This is tricky since we only set the one target_name; it is only earlier on that it is decided what that will be (whether the site is just being renamed or not).

Also I am unsure at this stage whether the rollback and post hook stuff would need to be modified. When trying to clean up, they assume that a tarball has been extracted. That would not be relevant here.
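The conditional described in these notes might be sketched as follows. Everything here is hypothetical, including the `SiteTarget` type; the check on the platform is an added assumption of this sketch (a rename that also changes platform would still need the full deploy), not something taken from Provision's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteTarget:
    """Hypothetical description of where a site lives."""
    name: str      # site name, e.g. its domain
    platform: str  # identifier of the platform hosting it


def is_pure_rename(source: SiteTarget, target: SiteTarget) -> bool:
    """True when only the name changes and the platform stays the same."""
    return source.platform == target.platform and source.name != target.name


def needs_deploy(source: SiteTarget, target: SiteTarget) -> bool:
    """Assumption: the schema comparison and updatedb steps only matter
    when the codebase changes, i.e. when the platform differs."""
    return source.platform != target.platform
```

With a split like this, the rollback and post hooks would also need a branch for the pure-rename case, since no tarball is extracted there.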

adrian’s picture

if you want to make rename, it means it can only do it on the same platform. ever.

omega8cc’s picture

That is expected. It is generally a bad idea to use rename while changing the platform (migrate).

The same goes for clone: it is a bad idea to clone directly to a different target platform. It causes a lot of issues, support requests and FAQ entries, because it is not reliable. You should always re-verify everything first (which is not obvious, btw), then clone within the current platform and then try to migrate. Never directly, so this (current) ability to clone to a different platform (or rename & migrate in one batch) could be safely removed, since it rarely works as expected (because people never remember to verify the source and target platforms before trying the migrate task).

Both clone and rename could be standalone tasks, not mixed with the possibility of choosing a different target platform. It is maybe safe for experienced users, but every first-time user fails at that point (at least this is what I have already seen 100 times).

Anonymous’s picture

We could probably easily silently invoke a Verify of the current and target platform on all those tasks (migrate, clone, rename). We do so already with invoking a backup.

If that would cut out 99% of the failures, I'd happily put that in.
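A rough sketch of that idea: verify every involved site and platform first, and only run the task if all of them pass. The `verify` and `task` callbacks and the function name are hypothetical stand-ins, not Aegir's task API.

```python
from typing import Callable, List


def run_with_preverify(
    verify: Callable[[str], bool],
    targets: List[str],
    task: Callable[[], bool],
) -> bool:
    """Verify each target in order and stop at the first failure.

    The task (migrate, clone or rename) runs only if every verify passes.
    """
    for target in targets:
        if not verify(target):
            return False
    return task()
```

Failing early on a stale platform gives a clearer error than letting the task itself fail halfway through.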

If we remove that functionality altogether (cloning, renaming to a different target platform), we'll have 500+ support requests in about 3 seconds: 'this used to be one step, now it is two, I hate you guys, you are evil, it's not fair'.

P.S cloning to the same platform then migrating the clone off is only delaying the schema comparison logic between two platforms, that would otherwise be done all at the same time in one go - it's not a solution, if what you're saying happens all the time, you'd see it happen again, only that it took you two steps instead of one.

I'd rather have a power tool for power users. I'd rather fix these problems than remove the functionality just because paying clients are putting you under pressure. (but I can say that, I have no clients :) )

omega8cc’s picture

Adding silent Verify there will help, for sure! But it will not help with the main problem here. I really believe there is no reason for clone/rename to use all this stuff under the hood which is designed for migration task (comparing modules/versions, running db updates etc). I understand it could introduce some duplicated code, but again: do we prefer to avoid duplicated code and keep this 'epic fail case' possible (this is how it looks for people, no matter how perfect Aegir is in other areas), or do we prefer to improve things for people?

hadsie’s picture

I'm not really sure I understand this ticket...

... it looks like an epic fail, since Aegir is expected to do just this: make easier cloning/upgrade/migration ...

In fact, what aegir is doing here is more important than making something easier: it's preventing an upgrade because there are bugs in the code base. The way one manages sites with aegir is quite different from managing drupal sites without aegir. For example, you shouldn't be making changes to a live site. Once a platform is in use, its codebase should remain static. If you want to add new modules or upgrade modules, create a new platform and then migrate your site(s) over to that.

The same with clone - it is a bad idea to clone directly to different target platform

One of my primary use cases for clone is testing on a new platform. Prior to migrating a production site to a new platform I clone a copy of it to the new platform for testing, if anything goes wrong (in terms of updates) then aegir has just saved me (the opposite of an epic fail :) ). If it works, great. Then I test my cloned site and if all is good I migrate my production site to the new platform.
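That test-first workflow can be written down as a small sketch. All three callbacks and the function name are hypothetical; they stand in for the Clone task, whatever testing is done on the clone, and the Migrate task.

```python
from typing import Callable, Optional


def staged_migration(
    site: str,
    new_platform: str,
    clone_to: Callable[[str, str], Optional[str]],
    smoke_test: Callable[[str], bool],
    migrate_to: Callable[[str, str], bool],
) -> str:
    """Clone to the new platform, test the clone, then migrate production."""
    clone = clone_to(site, new_platform)  # returns the clone's name or None
    if clone is None:
        return "clone failed; production untouched"
    if not smoke_test(clone):
        return "clone failed testing; production untouched"
    if migrate_to(site, new_platform):
        return "migrated"
    return "migration failed"
```

The production site is only touched in the last step, after the same updates have already run once against the clone.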

To me it sounds like your users want to use aegir but keep their existing workflow for managing a drupal site, which doesn't actually make any sense in the context of aegir. If your users are making updates to their code on live/production sites, then they're really making their lives a lot more difficult and not taking advantage of what aegir gives them at all.

adrian’s picture

clone = copy, migrate = move.
you're suggesting a copy command that can only copy files in the same directory, or a move that can only do the same.

additionally, this thread still has no honest to god mechanism to reproduce exactly what you are talking about, with the ancillary logs.

omega8cc’s picture

@hadsie

This thread is not about secure (with rollback) upgrades. It is about Aegir failing when there is *no reason* to fail, because there is nothing broken with the site and it just fails to clone or rename, because it is using overly heavy tasks (under the hood) designed for migration. That comes from reusing code to avoid duplication, which is itself a good development practice, but in this case it causes a lot of issues we could avoid if clone and rename did just that: clone and rename.

There is nothing wrong with introducing very small changes to the site without cloning an entire platform. Furthermore, cloning the platform would be even more work to get the same result: fail.

Clone and rename can be used as shortcuts to test new stuff on the site's copy without creating a full cloned platform for it. It should work. Or at least it should never fail on tasks like clone and rename - this is the issue here.

omega8cc’s picture

@adrian

It is hard to reproduce, which is rather obvious. Aegir is too cryptic about it (maybe it is not even possible to make it more verbose, I don't know). Of course I don't suggest that clone should just copy files. No, it should copy files and database without any checks for versions etc., like the ones it does on migration, to avoid those hard to debug issues. Similarly with rename - it should just rename the site's domain! What is the point of using migrate checks under the hood to do such a simple job? I know: to keep the code compact, well written, etc. But users don't buy this, they just experience that it fails on a very simple task.

Hitby’s picture

Subscribing...

steven jones’s picture

Version: » 6.x-2.x-dev
Status: Needs work » Postponed (maintainer needs more info)

@omega8cc Reading through this issue I can't really work out what's being asked for here, could you explain or close please?

omega8cc’s picture

Status: Postponed (maintainer needs more info) » Closed (won't fix)

If I still understand this old issue correctly, it was a request to make both the Clone and Rename tasks much lighter, so they don't perform all that stuff normally required on Migrate (typically used to upgrade the site), which can cause unexpected failures. The problem is that there is no longer a separate Rename task (it is a hidden feature of the Migrate task), while Clone needs(?) to also support cloning to a different target platform (which is almost always a bad idea, imo), so none of my requests were realistic, probably.

That is why I wrote The best recipes for disaster and then added (in my Provision fork) extra Verify tasks for source and target platform and for the site (all invoked via hostmaster, because it doesn't work when invoked via backend/provision only - there is a separate issue about it) as a part of every Clone and Migrate/Rename task - basically to automate the good habits people never remember about.

omega8cc’s picture

While my initial idea here (disabling modules on the fly) was obviously totally wrong, I'm adding here a follow-up for reference:

Our solution in BOA is to automatically run extra verify tasks for both source and target platform and the site itself as a part of migration task, so before it effectively proceeds with migration, Aegir internal database related to both platforms and the site gets updated. It also solves other issues.

See for reference: http://drupalcode.org/project/barracuda.git/blob/HEAD:/CHANGELOG.txt#l417

The patch: http://drupal.org/node/1004526#comment-5843080