Closed (fixed)
Project:
The Great Git Migration
Component:
Decisions
Priority:
Critical
Category:
Task
Assigned:
Issue tags:
Reporter:
Created:
14 Feb 2010 at 13:39 UTC
Updated:
8 Jan 2011 at 13:17 UTC
For phase 2 of the GIT roadmap, we need to determine and deploy a solution for access control and repository life-cycle management.
Read-only cloning will be possible for all repositories over git:// and http:// protocols.
For read-write access (pushing), the standard practice is to use a single shared shell account, an approach introduced by gitosis and later rewritten in Perl as gitolite.
What we need is:
Comments
Comment #1
CorniI commented:
It's unfortunately not that easy, but there's at least an outline of what we need at http://groups.drupal.org/node/23464
I'm not sure how much of that has been implemented yet (though looking at the repo, it seems to be).
The current code for this is in http://github.com/haxney/versioncontrol_git/ which hasn't been merged into the versioncontrol_git mainline yet, because it duplicates part of the log parser over there. We would need to merge both parsers together in the versioncontrol_git 2.x branch to take advantage of the work marvil07 did during his GSoC project.
Comment #2
CorniI commented:
I've talked further with DamZ about this:
- We want (and can use) gitolite on d.o
- When we keep the gitolite user names and d.o user names in sync, we can use the GL_USER env variable to determine the pusher
- gitolite does access checking per-repo and per-branch for us
- versioncontrol_git will be extended to log the pusher and to log the commits to the database
- We need the pusher for accountability
- versioncontrol_git needs to be extended to generate a valid gitolite config for all managed repositories.
I think that's all ;)
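The "log the pusher" piece could be sketched as a tiny hook like this (purely illustrative - the hook choice and message format are our assumptions, as is GL_USER being visible in the hook environment; this is not a confirmed d.o design):

```shell
#!/bin/sh
# post-receive hook sketch: report who pushed which refs.
# gitolite exports GL_USER as the name of the authenticated (virtual) user.
while read oldrev newrev refname; do
    echo "push by ${GL_USER:-unknown} to $refname ($oldrev..$newrev)"
done
```

A real integration would write to the database via versioncontrol_git rather than just echoing, but the accountability data (who pushed what) is all available at this point.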
Comment #3
anarcat commented:
If we're switching from passwords to SSH keys, there's going to be some user education required here. Setting up SSH keys on your machine isn't quite trivial for the average user (and in some cases, like Windows, I just wouldn't know how to do that).
Could we still consider a password-based solution? Or is gitolite just inevitable for some other reason?
Comment #4
anarcat commented:
Oh, and another thing: has anyone seriously considered reusing code from gitorious? How do they deal with access control?
Comment #5
pwolanin commented:
You can browse their source tree - it also looks like an SSH-based solution, but it's all Ruby versus Perl, and it's not obvious how to reuse just a segment of it.
http://gitorious.com/gitorious/mainline/blobs/master/lib/gitorious/ssh/c...
github already has docs for Windows: http://help.github.com/msysgit-key-setup/
There is no reason we can't eventually also have read/clone access over http, however.
Comment #6
anarcat commented:
So it seems everybody is doing the SSH keys hack. I would be pretty happy with that myself, provided we can have multiple SSH keys per user though.
If gitorious is ruby/rails, then we probably can't reuse that code. Then again, gitolite is perl... ;)
Comment #7
webchick commented:
The standard way to do SSH key generation on Windows (at least back in 2006 when I last used it) is with a tool called PuTTYgen, available at http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html. It's ugly, but it works.
This is definitely a big barrier to entry for new contributors, though. I'm willing to write docs for this for all three platforms if it's not possible to do standard username/password authentication, but making it seamless would be my preference.
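For reference, on Linux and Mac the same task is a couple of shell commands (the key path and comment below are just examples, not a prescribed d.o convention):

```shell
# Generate an RSA keypair; you'll be prompted for an optional passphrase.
ssh-keygen -t rsa -C "you@example.com" -f ~/.ssh/id_rsa_drupal

# The public half is what would get pasted into a profile form:
cat ~/.ssh/id_rsa_drupal.pub
```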
Comment #8
gerhard killesreiter commented:
Sorry, no, we cannot have username/password authentication, at least according to all the people I spoke to. It wouldn't have worked with BZR either. ;)
Comment #9
anarcat commented:
Well, this is a problem only for module and theme maintainers. "patch-only" contributors should still be able to pull the code anonymously (through git:// or http://, although I prefer the former) and submit "real patches" to the issue queue.
For module maintainers, this *is* a bit more complicated than "just setting up a password in your profile", but not much... we'll have to write some glue there to manage those keys though...
How does it work right now? There's a cron job that pulls the CVS passwords from the database?
Comment #10
gerhard killesreiter commented:
Yes.
Comment #11
pwolanin commented:
I think both git and bzr can (in theory) push over http or https, but I think it would make more sense to use a secure SSH method that we can apply to both core and contrib rather than hacking around with uncommon solutions.
Comment #12
damien tournoud commented:
Just for the record: this discussion is concentrating on push / read-write access. Read-only access for all repositories will still be possible over git:// and http://. I updated the original post above.
Comment #13
dww commented:
Pardon my ignorance about gitolite, but after reading through the project page for about 15 minutes, I still haven't found a technical summary of how it works. Can someone more familiar with it explain what's going on? Like, in UNIX sysadmin terms. ;) Here's my rough (probably totally wrong) guess based on what I've read in this issue and what I saw on the project page:
Someone please tell me all the ways this summary is wrong. ;)
Based on this (probably flawed) understanding, it seems there are a few assumptions:
A) Everyone can connect to the gitolite user, but presumably it's locked down Real Tight(tm).
B) The gitolite user can push commits to all its configured repos and masquerade as any specific git user it wants.
I guess that's reasonable, and certainly vastly easier to manage than 4000 LDAP accounts, etc. But, I just want to try to understand the implications of going this route.
Thanks!
-Derek
Comment #14
CorniI commented:
@13: yep, that sounds good.
4.) This is some ssh magic in ~/.ssh/authorized_keys, which does most of the work. Just look at an example file, then it's obvious.
A) The gitolite user can't do anything (no usable login shell), because a forced command is executed as soon as someone logs in with a given pubkey. That login can't do anything except run gitolite.
B) git has no real concept of 'users'; there's just one user who owns the files (UNIX file permissions), and that is the gitolite user.
The concept of git users in gitolite is just there so that you can setup multiple pubkeys and access rules for repositories in a sane way. For d.o, we want to map these git users to the d.o user ids.
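For the curious, each entry in the gitolite user's ~/.ssh/authorized_keys looks roughly like this (the install path, user name, and key material here are illustrative and abbreviated):

```
command="/home/git/gitolite/src/gl-auth-command alice",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty ssh-rsa AAAAB3... alice@example.com
```

sshd matches the incoming key, then runs the forced command with the gitolite user name baked in - that's the whole trick: the key identifies the "virtual" user, and gl-auth-command does the access checking.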
Comment #15
dww commented:
Re: "git has no real concept about 'users',"
Huh? That can't possibly be true, since git knows the author of each commit. I guess I haven't looked at the innards enough to know how that data is recorded by git, but it's obviously there. In my previous experience with git, we all had separate shell users that could push to a master repo, so I assumed the authorship was recorded by means of which shell user was pushing the commit. That's why I'm confused about how it works with gitolite...
Thanks,
-Derek
Comment #16
anarcat commented:
Git has no concept of "users"; it has a concept of "authors". Those are stored in the git history, so when you push, regardless of which *user* actually pushes the changes, the author recorded is the one that committed the change.
So if I have a shared repo on my machine where users A, B and C make commits and I, as user D, push those commits to Drupal.org, I won't even show up in the commit log; only A, B and C will.
This is a fundamental change in the way things work compared to CVS. In git, the changes appear to be from whoever did the actual commit, not the push. This will have significant implications for the way we track commits on drupal.org.
Comment #17
CorniI commented:
git records an author and a committer for each commit. Each is an arbitrary, user-provided string, so nothing you can authenticate or access-check against. These are normally stored in ~/.gitconfig, so that when you commit (on your local workstation) the right committer and author are recorded in the git database.
gitolite users though are connected to ssh pubkeys and thus provide a 'hard' way to identify someone, in a way which allows access-checking etc needed for pushes.
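The author/committer split is easy to see in any local repository (a quick sketch; "Alice" is just an example name):

```shell
# Commit on Alice's behalf, then inspect the recorded metadata.
git commit --allow-empty --author="Alice <alice@example.com>" -m "Alice's change"
git log -1 --format='author: %an <%ae> / committer: %cn <%ce>'
# The author is Alice; the committer is whoever ran `git commit`.
```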
Comment #18
marvil07 commented:
gitolite is described as:
I have not tried it yet (I did try gitosis, which rocks, but does less), but it seems like gitolite is the tool we want.
I'm still not sure what the interaction between the commit-restrictions module and tools like gitolite should look like, so any suggestions are welcome :-)
Comment #19
sitaramc commented:
@13: the following links might help. They (especially the first link below) were written for much more novice users, but have useful info even for people who already know a bit about ssh.
- http://sitaramc.github.com/0-installing/9-gitolite-basics.html#IMPORTANT...
- http://github.com/sitaramc/gitolite/blob/pu/doc/3-faq-tips-etc.mkd#two_l...
- http://github.com/sitaramc/gitolite/blob/pu/doc/6-ssh-troubleshooting.mk...
regards,
sitaram
ps: I'm the author of gitolite and if there is anything I can do to help, I'd be glad to. Feel free to send me an email either at sitaramc@gmail.com or sitaram.chamarty@tcs.com. The latter gets to me faster, at least during the daytime (I live in India, so that's UTC+0530)
Comment #20
dww commented:
@sitaramc: thanks for the offer, nice to see you here! ;) Yes, the first link, with the explanation of launching gl-auth-command with a specific "gitolite/git user" (do you have terminology you use for this that we should adopt here?) for each pub key via a command directive in authorized_keys, makes perfect sense...
Comment #21
sitaramc commented:
@dww: thank Andrew Burcin for pointing me here ;-)
Terminology to distinguish the "fake" (or "virtual"!) gitolite user from the real (unix) user? No, I don't believe there is any; the context usually makes it clear in most cases. Once we get past the fact that at the Unix level there is only one user, if you say "user" you can only be talking about the "virtual" users.
By the way, may I ask how many users you expect to have? And how many repos?
Comment #22
pwolanin commented:
Well - we will likely start with > 4000 repos if we have one per contrib project, plus possibly tens of thousands of per-user sandbox or project fork repos.
Seems like it might make sense to at least have separate unix users (and hence gitolite setups) for core, contrib, and sandbox repos?
That way we minimize the risk of a config error allowing people to start committing to core...
Comment #23
pwolanin commented:
Looks like github has in the range of tens of thousands of repositories - not sure what they are using to manage per-repo access.
Also, for sandboxes at least, seems like we really don't need the per-branch controls.
Comment #24
dww commented:
pwolanin: yes, #23 (separate gitolite users and configurations for core vs. contrib vs. sandbox) makes a ton of sense to me.
Comment #25
sitaramc commented:
@23: I believe everyone uses the same ssh trick, but github has a custom sshd that keeps the authkeys info in a database... a stock sshd does a linear scan of authorized_keys, so that would *not* scale. See below for more on this...
----
The Fedora folks (who're currently testing gitolite for their own server access) have about 10000+ users and 10000+ repos.
I had to create a whole separate branch ("big-config") for them, for two reasons. The first was that mainline gitolite dies with an out-of-memory error due to expanding all the user and repo groups they have in their gitolite config. The "big-config" branch therefore does not expand repo groups and user groups up front, and does it later (at the time of actual access) instead. [The only real side effect is a slight functionality change if you're using "deny" rules, but nothing major.]
The second reason was that they wanted each user to actually have a Unix userid (although they would still be forced to go through gitolite, not get a shell). Think of this as splitting up the single authorized_keys file into 10000 individual authkeys files in that many user accounts. For the same reason github uses a custom sshd (sshd doing a linear scan), this is probably a good thing!
This means the "big-config" branch of gitolite does not even bother with ssh keys; it can't write those keys anyway due to permissions issues. I believe Fedora already has a database-backed system that actually contains the permissions, and a script reads that database and puts out those authkeys files etc. All they had to do was prefix their users' authkeys lines with the correct "command=..." incantation.
Now, while gitolite can be as flexible as you want it to be, the sshd piece is sorta out of my control. I'll try and run a benchmark of some kind in a day or two with a few thousand keys and report back but at some point it *will* become an issue.
Comment #26
pwolanin commented:
Here's one such sshd patch: http://github.com/wuputah/openssh-for-git
Any idea if that's what github is using?
Seems like this might be a good use for sqlite3 rather than needing a real DB server running?
Comment #27
david strauss commented:
Launchpad uses a custom SSHd, too. AFAIK, it is open-source and Python-based.
Comment #28
sitaramc commented:
@#26: wonderful! From the readme, it sounds like exactly what we'd need here.
I'm also wondering if even sqlite is needed; perhaps just good old DBM will do, simply because it's guaranteed to already be available along with perl.
A modified gitolite "compile" script would simply write to this DBM file instead of the normal authkeys file; nothing else needs to change within gitolite itself. The rest of it would happen according to that README there (script to take incoming key, read DBM file and find a match, return valid "command=..." options and exit).
I am very optimistic that this will work, but I probably won't have time for at least a couple of days to test it.
Also, I'm ignoring the whole "do this only within a chroot" warning in that README for now ;-)
Comment #29
andb commented:
About the need for branch-level write permission in contrib:
I'm still not convinced of the value of individual sandboxes with respect to working with contrib. If a user wants to "branch" (let's avoid "fork" - while semantically correct, it has negative connotations), then it's far better to branch in the original repo. Further, there are more than enough alternatives out there for individuals to use for their own coding work.
So I would suggest using branch-level control: let the repo owner / project owner make branches in his repo as he sees fit and assign write access to whichever d.o accounts he wishes. The owner would also be the one who could delete branches he sees as superfluous. This will allow other users to directly contribute code to the actual project space as opposed to a sandbox, making it easier to find all the branches. It will also require dramatically less storage space - compare multiple personal sandbox repos to many branches in a single project repo. An additional benefit is that it will be much easier for beginning git users, especially those using GUIs, to track commits and merge in changes when everything is in a single repo.
In summary, if a solution keeping project branches together in a repo would be adopted, branch level write access will be required, in addition to a simple mechanism to manage it.
Comment #30
webchick commented:
If I understand this proposal properly, I'm pretty sure that would be untenable for something like Drupal core. There you have two "core maintainers" with commit access to the repo, and about 800+ people working on random stuff.
If I as one of the two core maintainers had to go and create a branch manually for any feature request that more than one person wanted to collaborate on, and manage ACLs on a per-feature request basis, dealing with 100 people a day pinging me on IRC or e-mail or Twitter or ... asking me to add them to this, that, or the other branch... I think I would totally lose my marbles. :)
A big strength of our current "system" is that you need no one's permission to start working on something. Check out code from CVS, hack it to bits, upload a patch. If there was a "gatekeeper" around contributing to core, that would really harm our community's collaboration model.
Comment #31
andb commented:
The term "gatekeeper" can have two meanings here which should be clarified - letting people in to start working, and controlling the quality of the changes. The first kind of gatekeeping should be discouraged; the second is obviously critical.
Your point is well taken. I was suggesting branch access for contrib, not core.
That said, branch level access could be beneficial allowing you to delegate core commits to more people, and you then acting as the primary commit gatekeeper. Imagine a small number of "feature" branches for each core component, for example, with you integrating into the main development branch and with you again making the merge to the stable branch. Further, imagine assigning a bugfix branch to a trusted developer. You clearly wouldn't want to have to make hundreds of branches, but having branch level access could really make your work both during development and then during maintenance much easier.
Keep in mind: with git, one can follow the same CVS workflow of pulling, hacking, patching. But one can also pull, hack and publish (making the changed repo publicly available) - allowing you as the core maintainer to make a branch, perhaps an "all incoming" branch for yourself, and merge in the latest changes from other external repos. If the branch doesn't merge in, you inform the developer, who can easily pull your changes into his repo, making him responsible for getting the merge to work. Patches seem to be a flawed idea in the world of dvcs - a changeset rolled against a certain version, doomed to fail soon when the version changes, resulting in multiple artifacts lying around without any real way to track the relationship between them. With dvcs you get not only the changeset but the whole history of commits, capturing the thought process behind the development.
Comment #32
dww commented:
@webchick: #29 from andb started with the qualifier: "About the need for branch level write permission in contrib"...
I agree, it's completely untenable for our core workflow (although this same mechanism could be used by Dries to actually enforce that webchick can only commit to the DRUPAL-7 branch, Gabor to the DRUPAL-6 branch, XXX to the DRUPAL-8 branch, etc...). However, mostly this is about how we want to handle repos and branches for contrib projects. I tend to agree with andb: we should make it possible for contrib project owners to do more than just grant blanket write access to all branches of the canonical repo for their projects (what we do now via CVS) - they should be able to grant write access only for the branches of specific release series. It'd be nice if merlinofchaos could give specific write access to different branches of views or panels to the specific teams working on those branches, and delegate authority for those users to push changes into the canonical repo for those projects (but only for the branches they maintain). Everyone is always free at any time to clone a git repo and have their own place (doesn't have to be on d.o) to hack things to pieces and generate patches. And we could certainly *allow* individual users to clone project repos into their sandboxes if they wanted to publicly share their "fork" to add certain features, such that project maintainers could push those commits "upstream" without just applying a patch and committing that directly. But it'd be nice not to *force* that workflow in cases where there really is a group of people working on a module.
Furthermore, beyond the "official" maintainers for specific release-able branches, it'd be nice to extend the model further to let merlinofchaos do stuff like add a "6.x-3.x-dww-hacks-for-project-issues" experimental feature branch off the views 6.x-3.x branch. Maybe only me and a few other crazies have write access to this branch, and none of us could commit to the "mainline" branches of views. But I could directly commit to my own branch in the "main" views repo, do my own stuff, and it'd be a lot easier for the project page to list this as an available branch to checkout/clone (so others can see what I'm doing, build on it, whatever), and it'd be a lot easier for earl to cherry-pick useful stuff from this branch into his mainline branch(es). (Okay, it's a slightly weird example, since I actually do have commit access to all of views right now, but pretend I didn't - and I certainly don't have blessings to commit random stuff I think would be useful for project* without talking to earl first.)
I agree, this last paragraph doesn't make as much sense for core. In that case, it's pretty much all going to be either the mainline canonical release branches maintained by the "core committers", and sandbox clones (exact format/structure TBD). In this case, I don't think we need the core project to do ACLs for special-case branches for coordinating specific efforts. The only feature we'd need so we can do reasonable things for efforts like fields-in-core or dbtng would be to allow ACLs on individual branches of sandbox repos (and/or, allow multiple sandbox repos per user, and do per-repo ACLs, etc). Basically, someone (e.g. crell in the dbtng case) would "host" the sandbox clone of core for the work of adding dbtng. he'd give out write access to this sandbox repo for all the other "trusted" collaborators who were going to co-maintain this branch. Again, anyone could still clone this branch at any time and do their own work -- they wouldn't necessarily have to ping crell to get added to the ACL. But, if crell wanted to delegate the authority for them to actually commit to the "main" feature branch, he could...
Comment #33
damien tournoud commented:
Please. We don't need branch-level access at all. Let's just create separate repositories. It will all be more reliable and simpler to implement.
Comment #34
meba commented:
@Damien: while I agree with both, don't you think that the question isn't "do we need branch-level access" but "do we take our CVS approach and move it to GIT, or do we create a whole new experience while maintaining a level of simplicity for normal users?" The world is moving from the CVS approach to the GIT branch-style approach very quickly. Do we hop in or not?
In my ideal situation, I would, as a maintainer of some simple module, use git as a simple tool - checkout, commit, create release. But at the same time, I would like to be able to do more when my project grows.
Your solution with separate repos does the same as the branch-level solution. But it's a significant step away from how GIT works towards how CVS works. So you created a new repo and worked on some feature for 2 months. Now somebody takes it and commits it to the main repo. Magically, some feature appeared in the system. Where did it come from? If you have branches, you know what is happening, and you make it much easier to merge changes. All development is happening in ONE place. You want to work on some project? Simply browse all branches, take some nice one you like, and help.
Further, your system obviously works. But it's dramatically more complicated for beginners.
Comment #35
andb commented:
For all those interested in this thread, I highly recommend installing gitosis to see how it supports various important features of git. I understand that many people have been using CVS and SVN for years; however, it's important to embrace best practices for git to get the most out of this change for the upcoming decade.
The idea behind sandboxes is great. But sandboxes shouldn't be unique filesystem space. See how gitosis implements personal branches. Anyone can make them. The great thing about making a sandbox just a branch of an existing repo is that when a beginner pulls a repo with a GUI, he can then see all the branches and merge the changes he wants into his own personal development branch (aka sandbox).
If sandboxes are done separately, which would be the norm for CVS, there are numerous problems. For example, in my CVS sandbox, I'd make a new directory for each new module I wanted to work on. I'd have a media mover directory, a workflow directory, a rules directory. I'd generate patches to send to the maintainers as I implemented changes. This isn't the way to use git. Each sandbox would be a separate git repo. So instead of having 1 sandbox, I would have 1 per project. Already this will likely require more complexity than the branching solution.
In the separate-repo-as-sandbox model there are 2 problems: 1. the need to track (track in terms of git checkouts) multiple branches from multiple repos and 2. the need to document or list all the repos which relate to a given project.
1. For a beginner, he'd have to know how to create a new branch and track different branches from a different repo inside his local repo. Not the easiest thing for some people to grasp.
2. I want to do some work on the workflow module. In the sandbox-as-separate-filespace model, how do I track down everyone else who has been working on it? Patches will cease to exist in the future - people will pull commits from git. So we'll also have to track each and every user sandbox that relates to a module. The branching model eliminates this need completely.
The UI for the branch-as-a-sandbox would be amazingly simple. In the case of contrib, I'd go to the project I want a sandbox for and press a button: "Make a sandbox of this for me". A branch is made with my UID or username as the name. My key would be automatically associated with this repo, giving me write access. When I go to my page, I see all the branches I've created - essentially my "sandbox". Then on my development machine I can merge from any other existing branches and submit posts to the issue queues asking for review of my changes by referring to my commits instead of the patch files as done today. After I finish my work and the module maintainer has incorporated my changes, I go to my user page and press a delete button next to the now-unneeded branch name.
At the heart of git is branching and merging. Please don't try to implement git as a drop-in replacement for CVS; implement git as git.
Because of this, I still strongly feel that branch level access is an important feature to make sure we support.
Comment #36
damien tournoud commented:
@meba: either you don't understand how GIT works, or you misunderstand my proposal. Everywhere on the planet, people create clones of GIT repositories and host them as separate repositories (when you click "Fork" on github, a new clone repository is created for you). That's how we will make it work.
I'm only strongly arguing *against* storing branches from unrelated people in the same repository.
Comment #37
david strauss commented:
If we don't do branch-level access, then we lose a nice git capability (one that Bazaar notably lacks without a relatively uncommon plugin): the ability to pull all branches for a project in one go.
Comment #38
david strauss commented:
Actually, nix my previous comment. If projects each get their own git repositories, we'll have exactly the same access-control resolution we have now, with a fairly easy implementation. Are people arguing that we should put the branches for multiple projects in the same git repo?
Comment #39
andb commented:
David - all thread participants seem to agree that each contrib project should have its own git repo; in fact, for contrib nothing else makes sense. It's the only logical way to go. The discussion about sandboxes, or personal branches, is the main issue - if I want to help contribute to an existing module project, should my work go into a completely separate repository, or should it be my own branch in a single repository for that module project? From this stems the critical question - do we need to address branch-level access in the d.o implementation of git?
Your comment #37 was exactly on point - by keeping all people's work in a single repository for each module, you can pull and track all the work done on it in one easy step.
Comment #40
fago commented:
I think per-user, per-project repositories plus assisting users in keeping their repositories up to date is the way to go.
See the discussion at g.d.o. about the organization of repositories: http://groups.drupal.org/node/50438 - let's keep it there pls!
Comment #41
webchick commented:
I think I understand what andb and meba are talking about now, and the idea of all major development on a contrib module being tracked in a single place is nice, but just to throw another wrench into the works... ;)
This workflow (developer(s) ask maintainer for a feature branch, maintainer creates it and assigns developer(s) to ACL, developer(s) commit stuff with impunity to special branch, maintainer cherry-picks what they like into the "proper" branch) would work fine for Views or any other module that has an active, responsive maintainer to set up the branch with an ACL. But bear in mind that modules with active, responsive maintainers is maybe 5% of our total. :P And those maintainers that are active and responsive often have 50,000 other things that they're doing, so see above my concerns about managing core's ACL.
Can we have the best of both worlds? Anyone can branch to their sandbox, but the project page is aware of this and says "Other branches you might be interested in..."
Comment #42
andb commented:
Webchick, I responded in the g.d.o discussion fago linked to; it's the better place for this discussion.
Please note that the implementation of the access control completely depends on the outcome of that discussion. Any choices about access made prior to agreeing on the architecture of the final solution would be grossly premature.
Comment #43
sdboyer commented:
First... @Damien#33: YES. Most of what I've been reading about what people want to use branches for is a BAD IDEA, and is seriously swimming upstream. Git is designed to have repositories interact, NOT branches, so that's the way we need to build the architecture. The UI and workflow experience we build on top of it can be variable, but I think everyone should focus a lot more on what they want workflows to look like, and do less hand-waving over what underlying architecture they think best matches it.
@webchick#41: Yes, we totally can have the best of both worlds. Under no circumstances should we be mixing unrelated branches into a single repository.
Comment #44
sdboyer commented:
Lemme try to sum it up, and bring it back to the original ACL point:
1) There are clearly legitimate reasons for having per-branch access controls. For those who aren't used to DVCSes, remember that those ACLs refer _only_ to who has write access to the definitive branch. e.g., the one at http://git.drupal.org/core/drupal.git - anyone can clone it and commit locally at their leisure. Gitolite looks like it will handle this requirement adeptly.
2) It's also reasonable to believe that these branch ACLs would be useful for contrib as well as core.
3) We do not need complex ACLs for sandboxes - if it's yours, you can push to it. Otherwise, no. Remember that this does NOT limit collaborative potential - it only makes sandboxes useless as a place for collaborative publishing of code. Which is fine.
4) There are scaling concerns. These can be substantially mitigated by using different UNIX users, with different gitolite configs and probably a replacement SSH daemon, for the different purposes: core@, contribution@, sandbox@. It also means that, if necessary, we can implement the former two (more crucial ones) first, without needing to solve the considerably bigger scaling challenge that sandbox@ may be.
...I was gonna write more, but realized we need to come to a definitive answer on the sandbox question first, so I'll just leave it at that nice little summary until then :)
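As a sketch, point (1) in gitolite's config language might look something like this (the repo path, branch names, and user names here are purely illustrative):

```
repo core/drupal
    RW+ 7.x     =   webchick
    RW+ 6.x     =   gabor
    R           =   @all
```

Each `RW+` line grants push (and rewind) rights only for branches matching the given pattern, while `R = @all` leaves everyone with read access - which is the per-branch write control this summary describes.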
Comment #45
webchick commented:
Thanks for the nice summary! Question about:
So where does this collaborative publishing take place on drupal.org, then? This is a critical one for me, because of all this crap going on in the core issue queue at the moment, with Initiative A on GitHub and Initiative B on Git Mirror Over There, all of which is totally opaque to patch reviewers and core committers.
Copy/paste/tweak from http://groups.drupal.org/node/50438#comment-135708:
My workflow concerns are on the big patches. Fields in core. New database abstraction layer. D7UX. New core theme. These aren't 1-25 lines of changes done by a single author, these are 500-1000+ line changes done by 5+ authors. These are the changes where CVS and our patch workflow have completely failed us, and result in people moving off-site to places like Launchpad, GitHub, etc. to collaborate.
When this happens, the community loses all transparency into what's going on, and misses out on a huge learning opportunity to see the changes and discussions as they evolve. Patch reviewers need to learn new tools in order to review, so they don't, and end up getting the patch in a more-or-less finished state, when it is too big to review. People who want to use the same core workflow they use for 95% of their other patches need to learn new tools, and get logins on new websites/services, in order to help out with the efforts. As a result, they don't, and a huge portion of our contributor base is blocked from helping. And now only that original team of 5-6 developers has any fricking clue what that code does, and is on the hook for maintaining it for all eternity.
Comment #46
webchick(or should this question be moved to the group? I'm terribly confused about what this issue is for now. :P)
Comment #47
CorniI commentedit should've been on g.d.o :P
And that's what the per-project, per-issue repo solution over there is for, but nobody seems to like it, though it's the _easiest_ solution to the problem of the big patches.
Btw, I think this issue is finished, as we decided to go with gitolite (including first-hand support from the author, very cool!) because it provides everything we need or may need. Where we allow fine-grained ACLs has to be discussed as part of the post fago linked to. Discussing that here is imho off-topic and duplicates arguments. Once that has been discussed, we can start implementing it; this will include the generation of one or more gitolite config files, but that's then an issue for a drupal_git_integration or vcs_git module.
Comment #48
gerhard killesreiter commentedCan I get a definition of "sandbox" in the shiny new scheme of things?
When you write "sandbox", I usually think of http://cvs.drupal.org/viewvc.py/drupal/contributions/sandbox
and that is not where I want to go.
Comment #49
eclipsegc commentedI'm not sure "sandbox" really means ANYTHING. Are we really going to require a logical delineation between "Oh I think this might eventually turn into a real project so... I should put it in location A" vs "This is probably never going to amount to anything, so I'll put it in location B"?
I know it should be rather simple to clone something out of that repo and into a new location if we DID want to make it a "project", but wouldn't it be simpler/more straightforward for us to eliminate the sandbox entirely? By this I mean that when you make a new project page, you're just going to either spawn off a new repo for it, or point it at an existing one. Beyond a pretty naming scheme for that repository, I'm not sure it really buys us anything to have a separate sandbox area with completely different (and potentially confusing) rules. New repositories (for whatever use) could be spawned from the user's page, or whatever the mechanism becomes.
That being said, it does bring up the question for me of how we treat core repositories. I honestly don't see how core is different from any other "blessed" repo: its project page refers to one repo, that's Drupal's repo, and only the project maintainers have access, end of story. All other development for it would, presumably, happen within individuals' clones of it, from which patches would be cherry-picked back into Drupal's repo.
To elaborate on webchick's earlier question: I've always envisioned a listing of clones of your project as being integral to properly managing your project. Whether that's tackled on its own or per issue is definitely OT for this issue, but I thought I'd at least mention it since it was asked.
Eclipse
Comment #50
eclipsegc commentedI have no clue why that got moved to db component, sorry.
Comment #51
sdboyer commentedYes, it does. In fact, it means quite a lot. But discussing it - and anything else related to repo structure - further here doesn't make sense; that's what the g.d.o thread is for.
Perhaps more importantly, we can punt on exactly what sandboxes will be used for until phase 2.
Comment #52
avpadernoThe idea was to give every user who applies for a CVS account access to a sandbox, and not allow them to create projects until the code they provide in the sandbox gets approved. There is another proposal about how new projects are created, and who can create them.
In either case, the sandbox would be used differently than it is now.
Comment #53
rcross commentedMaybe it's past being discussed, but I figured I would point out http://www.indefero.net, a PHP-based implementation of systems like gitosis and gitolite. It has additional functionality (an issue tracker, etc.), but I would imagine the main pieces could be extracted to work for us. It is also GPL'd, so there shouldn't be any license conflicts.
Also, I'm not sure where to post this - but would it be helpful for any of the implementors/infrastructure team to talk with some of the various sysadmins/architects who are managing the GitHub service?
Comment #54
gordon commentedJust looking through this and most of it is going in the wrong direction.
This is actually quite simple, and I have done this for one of my services. Basically, if you read man authorized_keys you will see that you can force a command that gets called no matter what is requested, where additional checking can be done. I am actually using this to check that the drupal user (based on the key that they uploaded) is able to access a project which is joined to an organic group. Basically I end up with a line like so in my authorized_keys file
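For illustration, such a forced-command entry typically looks something like this (the script path, uid, and key are made up; the restriction options disable everything except the forced command):

```
# One entry per key in ~/.ssh/authorized_keys (wrapped here for readability)
command="/usr/local/bin/git-access-check.php 12345",no-port-forwarding,
  no-X11-forwarding,no-agent-forwarding,no-pty ssh-rsa AAAAB3Nza... user@example.com
```

The forced command can inspect the SSH_ORIGINAL_COMMAND environment variable to see which git operation the client actually requested.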
This passes the user id to a PHP script which checks whether they have access. This script can also sanitize the git command and make sure they are not doing bad things.
Checking access to entire repositories is going to be quite easy; in the case of core, where people have write access only to certain branches, it will be more difficult, but I think we could look into some of the git hooks to prevent this.
Comment #55
sdboyer commentedPlease "look through" a little harder. None of what you shared is new, and some of it is a step backwards. Though some of the later bits are a bit off track, the core of the discussion is not.
We're well aware of the capability to inject an intermediary script in authorized_keys. The issues that are more complicated have to do with scaling, which plain authorized_keys will not do, and (as you mention) using hooks to do more complex branch-level ACLs.
Comment #56
sdboyer commentedFirst, just to note - I've been through the git hooks again, and am confident that doing per-branch ACLs will be pretty trivial.
Just realized that the model of core@/contrib@/sandbox@ as unix-account-git-gateways doesn't work. Either it has to be project@/issue@/sandbox@ (simpler and more elegant), or it has to be core@/contrib@/issue@/sandbox@ (hackier, but isolates core from any contrib screw-ups). I think the reason is somewhat self-evident in the former schema's naming, but lemme splain.
Given that there are these three discrete sets of logic, the only real question is whether there should be an artificial split introduced in the project repos to make it less likely that core ACLs get mucked up, as pwolanin said in #714034-22: Determine the access control solution for git. My current thinking is that's probably overkill, and we can do fine with just the project@/issue@/sandbox@ model.
Also, it occurs to me that a persistent NoSQL backend (say, MongoDB) could be an ideal data store for most or all of the user ACL information - not including SSH keys (so probably no need to think about gitolite integration), though those keys might be used for lookups. I'm thinking this because 1) git hooks are going to need a variable bunch of data (depending on which of the three logic sets above is being exercised) on a single user all at once, 2) we're going to want to update that data incrementally, so a straight key/value store isn't that effective, and 3) we want it to be fast and cheap to access, but the data won't be hit often enough to justify in-memory storage a la memcache.
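As a rough sketch of the kind of per-branch check a git update hook could do (the repo/user names, ACL data, and the GL_USER convention borrowed from gitolite are all assumptions here, not a design):

```python
#!/usr/bin/env python
# Hypothetical git "update" hook sketch: per-branch ACL enforcement.
# git invokes the hook as: update <refname> <old-sha> <new-sha>
import os
import sys

# Illustrative ACL: refs not listed here are open to any authenticated pusher.
BRANCH_ACL = {
    "refs/heads/master": {"dries", "webchick"},
}

def push_allowed(acl, refname, pusher):
    """Return True if `pusher` may update `refname`."""
    allowed = acl.get(refname)
    return allowed is None or pusher in allowed

if __name__ == "__main__" and len(sys.argv) > 1:
    refname = sys.argv[1]
    pusher = os.environ.get("GL_USER", "")  # gitolite exports the pusher here
    if not push_allowed(BRANCH_ACL, refname, pusher):
        sys.stderr.write("denied: %s may not push to %s\n" % (pusher, refname))
        sys.exit(1)  # non-zero exit makes git reject the ref update
```

A non-zero exit from the update hook rejects just that one ref, which is exactly the per-branch granularity being discussed.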
Comment #57
sdboyer commentedIgnore the bit about per-branch ACLs, at least - seems like that should be quite adequately covered by gitolite.
Comment #58
dww@sdboyer: For your benefit (and anyone else who wasn't at the Git sprint on Thursday @ DCSF), we discussed the likelihood that the gitolite config wouldn't scale well at all as a flat config file. That's what "-- another scalability problem with 10000s of lines in the gitolite config file" from the Git phase 2 list is about. We kicked around the idea of making gitolite's config system pluggable, precisely so we could use something like MongoDB (or SQLite) as the backend if needed. Given that the gitolite author/maintainer has shown up in this issue (see @sitaramc in #19 and beyond), we speculated that he'd be very open to working with us on improving the flexibility + scalability of gitolite to handle a setup like the one we'll be building... We certainly have a lot of experience building systems with pluggable backends for various things. ;)
Comment #59
sdboyer commentedAhh, that's what that was about. Well I'm going to give myself a little pat on the back for independently coming to a similar conclusion :)
And hey, it looks like mongo even has a perl driver! http://search.cpan.org/dist/MongoDB/lib/MongoDB/Tutorial.pod
Comment #60
sdboyer commentedAdding tag
Comment #61
sdboyer commentedWe've mostly just been talking about the ACL approach here, but just a note that we should keep it that way - I've opened #782764: Ensure gitolite can scale to 10k, 100k+ users as a place to talk about the implications of scaling gitolite.
Comment #62
adrinux commentedOne thought about SSH keys and novice users/windows users – is it workable to generate keys for them on d.org? Generating the keys is easy enough, but there's an obvious issue of securely moving the keys to their local machines and getting them in the right place.
Thoughts?
Comment #63
david straussOf course it's workable, but it defeats the entire purpose of the public/private key model.
Comment #64
gordon commentedGenerating a key is *NOT* that hard, and we can provide step-by-step documentation on how to generate SSH keys for those who don't know how.
I am assuming that we will also be providing read-only access over git://git.drupal.org so people can clone a repository without requiring a user account or having to provide a public key.
Comment #65
sdboyer commented@gordon: yep, no need for ssh for cloning. ssh will only be necessary for write operations.
Comment #66
sdboyer commentedUpdating title so it doesn't bug me anymore :)
Comment #67
wasare commentedWhat do you think about considering native, well-tested resources like WebDAV?
Another alternative now is to use the native "Smart HTTP Transport" supported by Git.
In both cases, I believe the ACL and auth method can be provided by the HTTP server (Apache) using any supported backend (databases, LDAP, OpenID).
We can also replace the default git-http-backend wrapper with a customized script.
In my opinion this is a good alternative, because we could have git write access without SSH keys, using common ports (80 and/or 443) that are normally open.
(I started some tests but didn't finish.)
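To make this concrete, here is a sketch of what the Apache side of smart HTTP could look like; the paths are assumptions, and the actual auth backend would be site-specific:

```
# Hypothetical Apache sketch for git "smart HTTP" with authenticated pushes
SetEnv GIT_PROJECT_ROOT /var/git
SetEnv GIT_HTTP_EXPORT_ALL
ScriptAlias /git/ /usr/lib/git-core/git-http-backend/

# Anonymous read-only clones/fetches; authentication required only for pushes.
<LocationMatch "^/git/.*/git-receive-pack$">
    AuthType Basic
    AuthName "drupal.org git push access"
    # Auth backend (mod_auth_mysql, LDAP, ...) is site-specific.
    Require valid-user
</LocationMatch>
```

Matching only git-receive-pack is what lets clones stay anonymous while pushes hit the authentication layer.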
Some references:
http://progit.org/2010/03/04/smart-http.html
http://github.com/blog/642-smart-http-support
http://www.kernel.org/pub/software/scm/git/docs/git-http-backend.html
http://xrunhprof.wordpress.com/2010/06/01/trac-and-git-http-backend/
http://github.com/schacon/grack
http://www.toofishes.net/blog/git-smart-http-transport-lighttpd/
http://kaeso.wordpress.com/2008/02/02/git-repository-with-apache-via-web...
http://www.kernel.org/pub/software/scm/git/docs/howto/setup-git-server-o...
http://www.jedi.be/blog/2009/05/06/8-ways-to-share-your-git-repository/
Comment #68
sdboyer commented@wasare - yknow, looking back through the thread, i see anarcat asked this question before:
And really, I'm not sure. I can't recall any reasons why SSH is a must. So we should at least step back and discuss it, especially given that SSH is a not-inconsiderable extra barrier to entry for some folks, and can be awfully inconvenient on Windows.
...Y'know what, on further consideration & reading, I'm actually very +1 to scrapping the SSH approach and going for HTTP. We do need to make sure we can take care of all our ACL needs, of course, but it does sound like that won't be a problem either. I want some more input from other folks who've been active on this thread before we definitively say that's the new direction, though.
Comment #69
drummFor HTTP-authenticated SVN we used http://drupal.org/project/export_users_dbm on infrastructure.drupal.org; much smaller userbase, though.
Comment #70
anarcat commentedHey, here I am quoted... I guess that warrants a response. ;)
I have done a good few experiments with various authentication mechanisms, and the reason I was questioning the use of SSH keys is that Apache (and SSH, through PAM) has mechanisms to talk directly with third-party authentication backends, most interestingly MySQL.
Regardless of the protocol used (SSH or HTTP), there should therefore be a way to authenticate directly against the Drupal user table. I've done it before for PAM (so for SSH), and it's nothing really complicated. I assume it would be similarly trivial for Apache (there's a howto for mod-auth-mysql in Debian). I assume there's a special role that says "this user has CVS access", stored in the MySQL database right now, and this could be factored into the authentication system. Similarly, there could be access control for projects based on MySQL.
For me, this seems to be a very clean solution for authentication, again regardless of the protocol chosen, which should be chosen to accommodate the average user as much as possible (as opposed to authentication, which should be chosen to be simpler to manage and more secure).
I would favor SSH, personally, as I'm more familiar with its interoperability with git, but I have also looked into WebDAV-like (HTTP-based) solutions for git, and those could also work. I would highly recommend using a secure transport in any case: it's time that we start enforcing trust paths in the Drupal codebase...
I would try to avoid exporting the (huge) committers list to a separate database; it sounds to me like a maintenance nightmare.
And if you are worried about the security of that sacrosanct drupal.org database (I sure hope you are! :), we could use a special MySQL account with privileges only on some tables (select on user, and those magic vcs tables) and columns (username and password...), etc.
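For the SSH/PAM side, a sketch of what that could look like with the pam_mysql module; the credentials, table, and column names are hypothetical, and the crypt= setting must match however the passwords are actually hashed:

```
# Hypothetical /etc/pam.d/sshd fragment authenticating against MySQL
auth    required pam_mysql.so host=localhost db=drupal user=gitauth \
        passwd=secret table=users usercolumn=name passwdcolumn=pass crypt=3
account required pam_mysql.so host=localhost db=drupal user=gitauth \
        passwd=secret table=users usercolumn=name passwdcolumn=pass crypt=3
```

Combined with a restricted MySQL account as suggested above, the PAM side never needs more than read access to the relevant columns.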
Comment #71
pwolanin commentedA few thoughts - a password-based security mechanism is almost always weaker in practice than key-based, since people choose memorable (i.e. guessable) passwords.
If I understand, Drupal core commits are currently done only over ssh for security.
Thus, any proposal to try a password-auth-based system means that we will still have to maintain different authentication mechanisms for core and contrib. To me it seems like we should be standardizing as much as possible instead.
Also, we are worried about users shifting over to github (and the like, e.g. launchpad) - which requires ssh keys, so clearly this isn't that big a barrier.
Comment #72
damien tournoud commentedI'm slightly torn on this. The "smart" HTTP client is only available starting with Git 1.6.6, and requires a server-side daemon. On the other hand, it allows us to completely skip gitolite and rely on the tools we designed for SVN for repository-level access control.
What would be difficult in that set up is the branch-level access control. Do we need that at all?
Comment #73
pwolanin commentedBranch-level access control would be really nice - e.g. for core or contrib to have defined (and enforced) branch maintainers.
Yes, in 95% of the cases it is not needed, but I can imagine situations where it might be useful to give more people access to push to a certain branch. For example, during an initial update of a module for Drupal 7, you might allow several people to push to the master branch (or a temporary branch) during a sprint?
Comment #74
webchickI wonder if we could push per-branch access control to "phase next", since it's not something we currently have, and it sounds like it's going to complicate the process by quite a bit.
Comment #75
avpadernoIt would be really nice to have per-branch access control, but I agree we can wait for that.
I can see many cases where a project has a co-maintainer who works on a new (somewhat experimental) branch, and who should not touch the stable branch.
Comment #76
wasare commented@sdboyer Git over SSH is really perfect from a security point of view, but password-based security enforced over HTTPS is the best choice for most people and platforms, and excellent for security too.
@anarcat OK, so skipping the SSH keys is possible!
@pwolanin I believe some users are shifting over to GitHub not because setting up SSH keys is easy, but because CVS is very outdated and insecure (in 2010).
@Damien Tournoud, @pwolanin, I'm no expert (as a matter of fact I'm just getting started with Git), but I think with a "distributed" revision control system we need to pay attention to topic #2 of the migration roadmap (http://groups.drupal.org/node/80029). The new workflow would be the first thing to decide, in my opinion.
With Git, all users will have a whole repository on their local computer, and the freedom to branch and merge locally. Two examples of workflows with Git: http://progit.org/book/ch5-1.html
@webchick, @kiamlaluno: to create a new branch with differential access I would do a simple "git clone"; this new full copy of the project would have the correct ACL.
Comment #77
sdboyer commentedREALLY need to decide on this.
Comment #78
sdboyer commentedAt Drupalcon, David, Narayan and I discussed using launchpad's twisted-based custom sshd to take care of everything we've been so hung up on here. Narayan's writing something up with more details.
Short version is that we should be able to do password and key-based auth, and do it over ssh, with a fairly small amount of custom code.
Comment #79
chrisstrahl commentedWe're going to have nnewton publish this and then open separate issues for the implementation.
sdboyer will be in touch for this Git migration sprint.
Comment #80
nnewton commentedhttp://drupal.org/node/910562
Linking to an issue created to discuss here the option we discussed in person at CPH2010.
-N
Comment #81
sdboyer commentedWe're gonna wait till the end of this week to see if serious issues are raised with #910562: Implement a git-shell wrapper with python-twisted ; if not, we'll mark this fixed since we've officially figured out the solution we'll be using.
Comment #82
miklI’d say this sounds like a very good solution. Python and Twisted are very powerful for this kind of stuff :)
Comment #83
sdboyer commentedThanks, Mikl.
OK, no bellyaching, so I'll assume we're good!