As documented in this thread, we were lucky enough to migrate to git before everybody else here, so the migration back to git for us is just a matter of doing a git clone, with no need to convert from CVS. sdboyer already took care of that here, but since we're going to do the switch, which will rewrite part of the history to match commit authorship anyways, we figure we might as well use that opportunity to clean up all of our historical commits.

I was just chatting with sdboyer about this and he said it was okay to go ahead with this, and asked me to open this issue. :)

Our problem is that we have a quite messy history:

hostmaster$ git rev-list --all --format='%an <%ae>' | grep -v '^commit'| sort -u | wc -l
28
provision$ git rev-list --all --format='%an <%ae>' | grep -v '^commit'| sort -u | wc -l
22

... whereas in fact there are 10 authors in provision and 11 (maybe 12 or 13) in hostmaster. The listings pick up more because the history contains a variety of author emails and names for the same people. A few problematic examples:

hostmaster$ git rev-list --all --format='%an <%ae>' | grep -v '^commit'| sort | uniq -c
    305 anarcat <anarcat>
    134 anarcat <anarcat@koumbit.org>
     10 Antoine Beaupre <anarcat@ceres.koumbit.net>
     11 Antoine Beaupre <anarcat@hostmaster.koumbit.net>
     18 Antoine Beaupre <anarcat@koumbit.org>
      2 Antoine Beaupré <anarcat@koumbit.org>

The first is from our original CVS migration. The second is a misconfigured "real name" part. The third and fourth are misconfigured emails for hotfixes I did in production and testing. The fifth is without the accent and the last one is the right one, although it seems to be the least used one.

Another type of issue:

hostmaster$ git rev-list --all --format='%an <%ae>' | grep -v '^commit'| sort | uniq -c
      1 aegir <aegir@gambit.greenbeedigital.com.au>
      1 aegir <aegir@li94-211.members.linode.com>
      2 Miguel Jacq <miguel@crunchbang.(none)>
     78 Neil Drumm <drumm@drumm-x200.(none)>
      6 root <root@r11567.ovh.net>

We have no freaking clue who the first two are. Since it's only two commits: tough luck, we'll just ignore that, I guess. But the next two are more problematic: they're not actually valid email addresses (yay OSX...). And the last one is a bit like the first two.
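To spot the addresses that are plain invalid, something like the following could work. It's only a rough sketch: `is_suspect_email` and `list_suspect_emails` are names I made up, and the rule (flag anything containing `(none)` or lacking a `user@host.domain` shape) is my assumption about what counts as broken, not a real address validator.

```shell
#!/bin/sh
# Hypothetical helper: succeed (return 0) when an author email looks broken.
is_suspect_email() {
    case "$1" in
        *"(none)"*) return 0 ;;  # git could not work out a hostname
        *@*.*)      return 1 ;;  # looks like user@host.domain, assume it is fine
        *)          return 0 ;;  # no "@", or no dot after it (e.g. "anarcat")
    esac
}

# Print the suspect addresses among those read from stdin, one per line.
list_suspect_emails() {
    while read -r email; do
        if is_suspect_email "$email"; then
            echo "$email"
        fi
    done
}
```

It could be fed from the same kind of listing as above, for example: git rev-list --all --format='%ae' | grep -v '^commit' | sort -u | list_suspect_emails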

Now I understand it's not your job to take care of cleaning up our messy repo. Maybe we can do a pass on our own before the final migration to clean it up first, in which case we would just need to sync with the final migration. But if you're going to remap the commit authors anyways, maybe this could all be done at the same time? Anything we need to do for this? I could provide mapping for the accounts I know of, for example...
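For the accounts I know of, the mapping could be expressed as an --env-filter body along these lines. This is a sketch under assumptions, not something I have run on the real repositories: the variable name `AUTHOR_FILTER` and the helper `is_alias` are made up for illustration, and it only covers my own aliases from the listing above.

```shell
#!/bin/sh
# Sketch of an --env-filter body. git filter-branch evals it once per commit,
# with GIT_AUTHOR_* and GIT_COMMITTER_* set from the commit being rewritten.
AUTHOR_FILTER='
# Hypothetical helper: succeed if "name <email>" is one of the known aliases.
is_alias() {
    case "$1 <$2>" in
        "anarcat <anarcat>") return 0 ;;
        "anarcat <anarcat@koumbit.org>") return 0 ;;
        "Antoine Beaupre <anarcat@ceres.koumbit.net>") return 0 ;;
        "Antoine Beaupre <anarcat@hostmaster.koumbit.net>") return 0 ;;
        "Antoine Beaupre <anarcat@koumbit.org>") return 0 ;;
    esac
    return 1
}
if is_alias "$GIT_AUTHOR_NAME" "$GIT_AUTHOR_EMAIL"; then
    export GIT_AUTHOR_NAME="Antoine Beaupré"
    export GIT_AUTHOR_EMAIL="anarcat@koumbit.org"
fi
if is_alias "$GIT_COMMITTER_NAME" "$GIT_COMMITTER_EMAIL"; then
    export GIT_COMMITTER_NAME="Antoine Beaupré"
    export GIT_COMMITTER_EMAIL="anarcat@koumbit.org"
fi
'
```

On a fresh clone it would then be applied with: git filter-branch --env-filter "$AUTHOR_FILTER" -- --all. Identities that are not listed pass through untouched, so other people's mappings can be added one case line at a time.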

One thing I want to experiment with too is to understand what will be the effect of rewriting the history like this - if we need to reclone from scratch, that could be problematic. I was told by sdboyer that this could be worked around by git-rebase(1), as it could figure out those are only metadata changes and ditch the old commits... I guess only testing will tell there!

Thank you for your amazing work on this epic migration!

Comments

sdboyer

Great. On further reflection though, small issue. If I do those transforms directly, then here's the place/way I'd need to tuck them in:

  // Now handle special-case exceptions, typically filter-branches.
  switch ($project->nid) {
    // ubercart_marketplace
    case 277418:
      $job['operation']['passthru'] = array("filter-branch -f --prune-empty --tree-filter " . escapeshellarg("'rm -rf mp_tokens'") . " -- --all", TRUE);
      break;

    // adaptive_context
    case 176635:
      $job['operation']['passthru'] = array("filter-branch -f --prune-empty --tree-filter " . escapeshellarg("'rm -rf ac_access ac_group jqselect'") . " -- --all", TRUE);
      break;

    // ecommerce
    case 5841:
      $job['operation']['passthru'] = array("filter-branch -f --prune-empty --tree-filter " . escapeshellarg("'rm -rf contrib/inventorymangement contrib/worldpay'") . " -- --all", TRUE);
      break;

    // user_board
    case 471518:
      $job['operation']['passthru'] = array("filter-branch -f --prune-empty --tree-filter " . escapeshellarg("'rm -rf user_board_activity user_board_userpoints user_board_views'") . " -- --all", TRUE);
      break;

    // idthemes cluster of bullshit
    case 525938:
      $job['operation']['passthru'] = array("filter-branch -f --prune-empty --tree-filter " . escapeshellarg("'rm -rf idt001 idt002 idt011 idt012'") . " -- --all", TRUE);
      break;
    case 525904:
    case 525938: // NOTE: already matched by the case above, so never reached here
    case 526216:
    case 526532:
      $job['operation']['passthru'] = array("filter-branch -f --prune-empty --tree-filter " . escapeshellarg("'rm -rf branches'") . " -- --all", TRUE);
      break;
  }

Keep in mind that these are job payloads being built, passed to beanstalk, unserialized, then run by a drush worker on the other side. So, three problems here:

  • Those passthrus are already not working, likely something to do with escaping the sh script that's embedded in the git command being sent. I'll be debugging those this afternoon and fixing that, but...
  • The type of command required to do a filter-branch that rejiggers commit authorship is much more complicated, with lots more icky characters, e.g. http://help.github.com/changing-author-info/ . In the long run I'd like our system to be able to support stuff like that, but it's just not that robust yet and probably won't be before launch.
  • Even if I did figure out escaping, Beanstalk's got a cap on the size of the job payloads that can be passed through. I can raise that, but I hear it starts misbehaving if jobs exceed 10K. I could amortize it across multiple jobs...but really, we're getting into territory where I'm taking too much of the very limited time that I've got left, especially given that there's an easier solution.
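On the first problem, my working guess (not yet checked against the worker, so treat it as an assumption) is that the filter ends up quoted twice: the string already carries its own single quotes and then gets wrapped in another layer, so the eval that filter-branch does on the filter sees the quotes as part of the command. The effect is easy to reproduce in plain sh. Here `run_filter` is only a hypothetical stand-in for that eval step, and the filters use echo instead of a bare rm so this is safe to run:

```shell
#!/bin/sh
# Hypothetical stand-in for the eval that filter-branch performs on a filter.
run_filter() {
    ( eval "$1" ) 2>/dev/null
}

# The single quotes are part of the string, so eval sees one long command word.
double_quoted="'echo rm -rf mp_tokens'"
# With only one layer of quoting, eval sees an ordinary command line.
single_quoted="echo rm -rf mp_tokens"

run_filter "$double_quoted" || echo "double-quoted filter failed"
run_filter "$single_quoted"
```

If that is what's happening, dropping the inner quotes and keeping only the escaping layer should be enough, but I still need to confirm it on the real jobs.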

If you guys were to do these filter-branch transformations yourselves, to your satisfaction, and then publish the result somewhere, all I'd have to do is update the URI the scripts are cloning from. That would be a LOT easier - plus, you could then experiment with the rebasing and see if you can make it cleanly through.