As documented in this thread, we were lucky enough to migrate to git before everybody else here, so the migration back to git for us is just a matter of doing a git clone, with no need to convert from CVS. sdboyer already took care of that here, but since we're going to do the switch, which will rewrite part of the history to match commit authorship anyway, we figure we might as well use that opportunity to clean up all of our historical commits.
I was just chatting with sdboyer about this and he said it was okay to go ahead with this, and asked me to open this issue. :)
Our problem is that we have a quite messy history:
hostmaster$ git rev-list --all --format='%an <%ae>' | grep -v '^commit'| sort -u | wc -l
28
provision$ git rev-list --all --format='%an <%ae>' | grep -v '^commit'| sort -u | wc -l
22
... whereas in fact there are 10 authors in provision and 11 (maybe 12 or 13) in hostmaster. The listings pick up more because the same people show up under a variety of author names and emails in this history. A few problematic examples:
hostmaster$ git rev-list --all --format='%an <%ae>' | grep -v '^commit'| sort | uniq -c
305 anarcat <anarcat>
134 anarcat <anarcat@koumbit.org>
10 Antoine Beaupre <anarcat@ceres.koumbit.net>
11 Antoine Beaupre <anarcat@hostmaster.koumbit.net>
18 Antoine Beaupre <anarcat@koumbit.org>
2 Antoine Beaupré <anarcat@koumbit.org>
The first is from our original CVS migration. The second is a misconfigured "real name" part. The third and fourth are misconfigured emails for hotfixes I did in production and testing. The fifth is without the accent and the last one is the right one, although it seems to be the least used one.
Another type of issue:
hostmaster$ git rev-list --all --format='%an <%ae>' | grep -v '^commit'| sort | uniq -c
1 aegir <aegir@gambit.greenbeedigital.com.au>
1 aegir <aegir@li94-211.members.linode.com>
2 Miguel Jacq <miguel@crunchbang.(none)>
78 Neil Drumm <drumm@drumm-x200.(none)>
6 root <root@r11567.ovh.net>
We have no freaking clue who the first two are. Since it's only two commits: tough luck, we'll just ignore them, I guess. But the next two are more problematic: they're not actually valid email addresses (yay OSX...). And the last one is a bit like the first two.
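To spot identities like these without reading the whole listing, a small filter can help. This is only a sketch: the function name suspect_identities is made up for illustration, and the two patterns (no "@" at all, or a host ending in ".(none)") only cover the cases shown above, not every possible invalid address.

```shell
# Hypothetical helper: read "Name <email>" lines on stdin and print those
# whose email part has no "@" at all, or ends in ".(none)".
suspect_identities() {
    grep -E '<[^@>]*>$|\.\(none\)>$'
}

# Intended use, from inside a repository (same listing command as above):
#   git rev-list --all --format='%an <%ae>' | grep -v '^commit' | sort -u | suspect_identities
```

Identities such as root@r11567.ovh.net look syntactically fine and would not be flagged; those still need a human to figure out who they belong to.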
Now I understand it's not your job to take care of cleaning up our messy repo. Maybe we can do a pass on our own before the final migration to clean it up first, in which case we would just need to sync with the final migration. But if you're going to remap the commit authors anyway, maybe this could all be done at the same time? Is there anything we need to do for this? I could provide a mapping for the accounts I know of, for example...
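For the accounts I know of, such a mapping could be expressed as a git filter-branch --env-filter. The following is a minimal sketch, not the migration scripts' actual mechanism: the function name fix_anarcat_identity is hypothetical, and it only covers my own variants from the first listing, folding them into the one I described as the right one.

```shell
# Hypothetical helper: rewrite every known variant of one identity to the
# canonical one, for both author and committer, on all branches and tags.
# Run from inside a clone; filter-branch keeps the old refs under refs/original/.
fix_anarcat_identity() {
    git filter-branch -f --env-filter '
        canonical_name="Antoine Beaupré"
        canonical_email="anarcat@koumbit.org"
        case "$GIT_AUTHOR_EMAIL" in
            anarcat|anarcat@koumbit.org|anarcat@ceres.koumbit.net|anarcat@hostmaster.koumbit.net)
                GIT_AUTHOR_NAME="$canonical_name"
                GIT_AUTHOR_EMAIL="$canonical_email"
                export GIT_AUTHOR_NAME GIT_AUTHOR_EMAIL
                ;;
        esac
        case "$GIT_COMMITTER_EMAIL" in
            anarcat|anarcat@koumbit.org|anarcat@ceres.koumbit.net|anarcat@hostmaster.koumbit.net)
                GIT_COMMITTER_NAME="$canonical_name"
                GIT_COMMITTER_EMAIL="$canonical_email"
                export GIT_COMMITTER_NAME GIT_COMMITTER_EMAIL
                ;;
        esac
    ' --tag-name-filter cat -- --all
}
```

Other people would each get their own case branches. Every rewritten commit gets a new ID, so this should be tried on a scratch clone first.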
One thing I also want to experiment with is the effect of rewriting the history like this: if we need to reclone from scratch, that could be problematic. I was told by sdboyer that this could be worked around with git-rebase(1), as it could figure out those are only metadata changes and ditch the old commits... I guess only testing will tell there!
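One cheap check before testing the rebase itself is to confirm that a rewrite really changed nothing but metadata, by comparing the tree IDs of the old and new histories. A rough sketch, assuming the rewrite was done with filter-branch (which keeps the old refs under refs/original/); the helper names trees_of and same_content are made up here:

```shell
# Hypothetical helper: print the tree ID of every commit reachable from a ref,
# in rev-list order. Trees depend only on file content, not on authorship.
trees_of() {
    git rev-list --format='%T' "$1" | grep -v '^commit'
}

# Hypothetical helper: succeed if two refs walk through identical trees.
same_content() {
    [ "$(trees_of "$1")" = "$(trees_of "$2")" ]
}

# Intended use after a filter-branch run on master:
#   same_content refs/original/refs/heads/master master && echo "metadata-only rewrite"
```

This says nothing about whether git-rebase(1) copes with the rewritten history; it only confirms that the file contents match commit for commit.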
Thank you for your amazing work on this epic migration!
Comments
Comment #1
sdboyer commented
Great. On further reflection though, small issue. If I do those transforms directly, then here's the place/way I'd need to tuck them in:
Keep in mind that these are job payloads being built, passed to beanstalk, unserialized, then run by a drush worker on the other side. So, three problems here:
If you guys were to do these filter-branch transformations yourself, and to your satisfaction, then publish them somewhere, all I have to do is update the URI the scripts are cloning from. That would be a LOT easier - plus, you could then experiment with the rebasing and see if you can make it cleanly through.