We are currently attempting to find a way to get module checkout statistics from cvs pserver. This is non-trivial, as pserver doesn't keep track of checkout statistics at all, and all we have in the logs are connects from xinetd. We have figured out several ways to go about this, each with a varying degree of 'hackish'-ness.
One idea is to have a wrapper script around cvs pserver. The pserver appears to build a directory structure of the module it is checking out in /tmp/cvs-serv(pid). The directory structure for a checkout includes only the CVS files, but it should be enough to recognize which module it's checking out. So, to make this work, we would need a wrapper script around pserver. This script would be called by xinetd and would call pserver itself. It would then look up the correct dir in /tmp and identify the module being checked out. We could then insert this into a log or into a db table. Everyone here hates this idea, me included, and I even thought of it.
A less "disgusting" way of doing this would be to write a packet sniffer to identify the beginning of cvs pserver commands and get the module name out of the TCP connection. This would be much cleaner, but also more difficult to implement.
The third and preferred way of doing this would be to move Drupal to a modern revision control system, such as subversion or git. Subversion in particular would be very easy to collect stats from, as the logs would be written by Apache and could just be parsed with awstats. There are other advantages to this, mainly security. The fact that we are running cvs pserver has always been a concern at the lab; you only need to look at the security track record and the quality of the code to understand why.
I realize there has been resistance to moving to subversion (or any other modern revision control system) in the past, but I'd like to bring this topic up again to understand exactly why that is and if we can find a workable solution here. We need these statistics, but I'd like to use getting them as an opportunity to enhance the security and viability of Drupal's infrastructure.
Comments
Comment #1
kbahey commented
You have my +1 on moving to a more modern version control system. CVS is long in the tooth. The work of jpetso in the Google SoC should be a good platform to build on from there, since it abstracts the Drupal and project module integration.
I don't want this thread to devolve into a "my favorite is such and such" or "no, this one sucks".
So, perhaps a separate thread specifically for which VCS is in order (svn, bzr, git, etc.).
Comment #2
moshe weitzman commented
We never moved to Subversion and friends because no one delivered the software development required to do so. Occasionally folks would volunteer, but no one sat down and rewrote cvs.module and the cvs scripts. Dries has never been philosophically opposed to switching, and has said so several times.
The problem remains that we don't have the software nor the volunteers to write it. As far as I know, jpetso has received no help from the community writing a subversion backend for project. I could be wrong there though.
Even if we were technically ready to convert, we now have an installed base of over a thousand contributors who have painfully learned CVS and I cannot imagine the chaos if they all had to learn new practices for tagging releases and such. I witnessed the rainfall of questions that dww endured the last time we made major VCS changes and it wasn't pretty. Documentation was no match for the deluge.
But if we could get over those two hurdles, switching is a tasty treat. Every Drupal consulting firm I know uses Subversion in house, despite working on Drupal projects all day long.
Comment #3
kbahey commented
Subversion is no longer the only game in town.
I have used git and find it has some really cool features (shared by bzr), namely its distributed nature and the fact that no single version of the truth is required, which makes it neat for distributing custom patched versions.
Jakob Petsovits' summer of code project that I mentioned has an overview here http://jakob.petsovits.at/rcs-abstraction-for-project-module
He wrote some comparisons of VCS software here http://groups.drupal.org/taxonomy/term/2353
And here is the GSoC wrapup http://groups.drupal.org/node/5827
And see his post-GSoC progress here http://groups.drupal.org/node/6295
The code is in cvs.d.o already.
We can raise some funds to productize this abstraction if needed, but we need to decide on which backend to use before that, by having a discussion going, with dww and hunmonk in it of course.
Comment #4
killes@www.drop.org commented
I am opposed to moving away from CVS for all the reasons people already mentioned, and then there is the nice fact that you can check Drupal modules into SVN including the CVS files. Then you can do "svn up" and "cvs up" at the same time. "cvs diff" is especially convenient if you have to modify a module.
Also, svn has no easy replacement for "cvs diff -up" (the "p" part). This option makes reviewing patches much easier.
Everything but svn is totally out of the question because there's no Windows GUI client.
Comment #5
chx commented
Subversion is the only alternative as there is no GUI / IDE integration for other tools. However, http://bazaar-vcs.org/BzrForeignBranches/Subversion and http://www.kernel.org/pub/software/scm/git/docs/git-svn.html let you use more advanced tools against a central SVN repository. Also, it seems we could use SVK per http://drupal.org/node/16631#comment-33732 .
http://groups.drupal.org/node/6180 calls for the port of versioncontrol* to Drupal 6, so it will be a lot easier to write SVN for Project. There was little incentive to write that. However, you might find me writing it... more and more often we find ourselves in situations where two people work on the same contrib (we have grown!) and svn merge is badly needed. Also I would like to point out http://www.arune.se/tech:svnwithmysqlauth and http://modauthmysql.sourceforge.net/ -- these could provide a much simpler version for access control. SVN is no silver bullet but it seems now there are arguments to move, while previously all we had was "tags/branches sucks thus CVS sucks. Let's move to SVN" to which the answer was "young grasshopper, SVN does not solve that".
Comment #6
chx commented
killes, I think your needs will be covered by svk/git/bzr, and could even be covered by svn itself -- you work on your branch and merge with HEAD/master/whateveritiscalled, which is upped from the central repository. I know bzr and git can do it. It mirrors current practice -- as a programmer, you use another, more enhanced version control system.
I filed an issue in the svn tracker asking for a -p option http://subversion.tigris.org/issues/show_bug.cgi?id=2995 . If the Drupal community decides on SVN, I am ready to offer a buck or two to the SVN community to get this issue out of our way.
Comment #7
chx commented
Further research solved another difference between cvs and svn -- while usually (without -F) cvs tag is not modifiable, svn 'tag' is just a copy of the repo which can be modified. Something like http://subversion.tigris.org/tools_contrib.html#svnperms_py can solve this.
Comment #8
aclight commented
That's actually not quite true. Early this summer I posted in the Project tracking and releases group on g.d.o (http://groups.drupal.org/node/4272) about the direct port of cvs.module I have written to work with SVN and which I'm running on my site. However, given that jpetso had already started or was just about to start his SoC project with the VCS abstraction stuff, it was decided (for good reason) to wait until that was done and then consider whether d.o should use SVN or stick with CVS. If anyone wants my code they are free to contact me, since it's changed a bit since when I last uploaded it on the thread mentioned above.
jpetso posted over at http://groups.drupal.org/node/6295 (see "Goodness #2") that he's writing the subversion backend and getting school credit for doing so, so I'm not sure how much help he could/would take with writing it. In one of the comments on that post he said he hoped to have it written by October, but he hasn't been back with an update more recently.
Hopefully he's on track with the subversion backend. Several people have already volunteered to help him test it once he gets it written.
I, for one, am all for moving to SVN.
Comment #9
greggles
Regarding the point killes raised in #4 about
cvs diff -up
I believe that is solved with the slightly more long-winded svn version:
svn diff --diff-cmd /usr/bin/diff -x "-up"
If you don't like typing that out all the time (who would?), shell tricks can alias ddiff (Drupal diff) to be that command:
alias ddiff='svn diff --diff-cmd /usr/bin/diff -x "-up"'
Comment #10
jeremy commented
Even if a migration off CVS finally happens, I imagine it will take a little while. Adding to Narayan's original comment, I was looking through the pserver code today -- it's very simple and thus would be easy to add some logging directly into src/server.c. All commands pass through the function do_cvs_command(), so it should be relatively trivial to add support for pattern matching, making it configurable which commands to log / not log.
Alternatively, if you're only interested in checkout statistics, you could add logging in src/checkout.c.
Finally, there's also already tracing logic available. However, it would dump a lot more information than you probably want, and thus would likely slow things down. If I create a patch for enabling pserver logging, I would utilize buffers and make it as efficient as possible (I've written C logging libraries in the past for a previous day job). Let me know if you're at all interested, and I could provide you a patch perhaps later next week.
Comment #11
chx commented
http://www.red-bean.com/fitz/presentations/2007-07-27-OSCON-svn-worst-pr...
Comment #12
jeremy commented
Not really true. For example, git comes with git-gui... [screenshots] Git support for Windows was originally limited to Cygwin, but there's now also a native port available, which includes the beginnings of a TortoiseCVS clone.
There are a couple of other GUI tools available too, though I'm not sure how portable they are. I'm also not sure what integration you're looking for, but it's clear that git has gone mainstream, so it's only a matter of time before the missing tools are created.
If we are going through the pain of changing everything in the Drupal source control infrastructure, moving to a distributed tool would offer a lot of benefits to a project like Drupal, with developers all over the world and different maintainers for each major release. Claiming that "svn" is the only solution is a rather limited view.
Comment #13
Crell commented
Just as important as the technical details are the usability details. Look at all the trouble poor Derek has to go through trying to teach people to use CVS properly. SVN is at least conceptually similar, if not identical. A distributed system would be a lot more to wrap one's head around. Do we really want to do that to our contrib authors? (I could probably learn a distributed system if given enough time. I don't know that's true for a large percentage of contrib authors. The learning curve there could be quite messy.)
Comment #14
nnewton commented
Clearly work would have to be done to switch revision control systems, considering how deeply tied CVS is to how drupal.org works (scripts, modules, etc). However, I would comment that at some point we are going to have to do this. CVS is (sadly) not dead yet, but it's definitely on its way to the grave.
Your first point seems to have been the blocker for this before. (checking out a cvs tree into subversion) It seems that this is a very common practice and we don't want to break it. However, I refuse to believe we can't find some sort of work-around for this (possibly simply not using svn). It would be a shame for a largely social issue (this practice) to stand in the way forever, but on the other hand we can't simply break it without seriously impacting business use of drupal and that is exactly what we don't want to impact negatively.
I think there was a solution to the diff problem later on in this thread and I would be very surprised if we couldn't fix that some other way if the proposed solution doesn't work (and again..assumes a usage of svn).
As for a Windows GUI client, git has at least two; I will have to check their quality. Mercurial also has several. The quality question is still open, but one of them interfaces with tortoise.
Jeremy, if you have an interest in doing this that would be great. I think there is another developer willing to work on this if you want to point him/her in the right direction. I would require my sign-off on the patch before putting it into production though. I can't put into words how little I like the pserver code, and a one-off patch against it makes my skin itch. However, it's _far_ better than the hack of a script that seems like the temp solution at the moment. Any tips or work you want to put into this would be greatly appreciated, by at least me :).
Here is, in my opinion, the biggest problem. Technical issues can always be solved in some way; it's this social issue that will be huge. I have done retraining before and it's a huge pain. However:
A. I'd be willing to help any way I can with this as I _really_ want pserver gone.
B. You're going to have to do this someday, and moving to something with a long shelf-life (i.e. git, as it's not going anywhere) may make retraining easier long-term. I'd also note that the difficulty of something like git is incremental. What I mean by this is that using git as if it were cvs is not that difficult, and advanced users could use its advanced aspects if they so choose. Not that this will help much, but I imagine if we do this with good documentation and a team of people to help developers it wouldn't be so catastrophic.
Comment #15
jeremy commented
Evidently git offers a limited CVS emulation mode allowing it to be accessed with any standard CVS client. This could be useful for a transition period, however it's got some serious limitations that could make it useful for little more than checking out the source tree with familiar tools...
I use both Mercurial and Git on my local networks, and find them both to be excellent, fast, and reliable source control tools.
Sure, I'd be happy to put in some effort on this when I get a little free time. Tag1 Consulting will sponsor my time. However, it would be later next week at the earliest, depending on how long some other jobs take. I'd rather do it right than make it an ugly hack, in which case I'd probably submit it to the CVS developers as a potential new feature -- that would hopefully at least get it a proper review before you put it into production. Are you using version 1.11.22?
If someone else gets to it before me, that would of course be great. Here's my plan of attack, should that prove helpful to the other developer you mention: add some simple debug output to do_cvs_command(), then launch a local instance of pserver and do some transactions. The goal would be to learn exactly what information is available at that point in the process. Once known, then I'd add some simple logging routines and a new configuration option for enabling/configuring the logging. Once that was working, optionally the next step would be to add support for more advanced configurations, supporting regex patterns for specifying what should be logged... it should be simple enough, and seems like it would be quite useful.
BTW: A simpler method would be to just add some counters that get incremented each time various actions happen. You wouldn't have any historical information, but you would have an overall idea as to what is happening. Perhaps that's really all you're looking for?
Comment #16
chx commented
It's not just GUIs but IDE and tools as well. As I pointed out, SVN can be used both with git and bzr clients if you so wish and those solve killes' problem.
Comment #17
jpetso commented
Subscribing, there might be some interesting news happening in here.
Let me finish up the last pieces of Version Control API minus release node integration (1.0 of the API module is planned for tomorrow), and I'll come up with a status update afterwards.
Short version for the impatient: the SVN backend hasn't yet been started, but is scheduled as next task for me. There were no contributions from the community by now, although halkeye seemed to play a bit with the Version Control API, but I don't know if something evolved from this and what his current plans are. I'd appreciate helping hands (never mind countability for my course, that'll work out sufficiently), and otherwise delay a first working version of the SVN backend to mid-November or so. Notoriously bad in keeping schedules, you know.
Comment #18
aclight commented
@jpetso -- I hadn't done anything with a svn backend because I thought you would want to do it yourself since you were getting credit for it. But if that's not a worry for you, I'm in a good position time-wise to help out with this. I'll follow up with you directly.
Comment #19
dww
A) This thread was started (and titled) about checkout stats from CVS, and is now the N+1th iteration of "XXX is better than CVS, let's switch". If it weren't for the 2 comments in here concretely about getting checkout stats out of CVS (#10, #15), I'd probably mark it duplicate or postponed and ignore it.
B) Nowhere in the thread is there discussion about why logging these checkouts is critical in the first place, so I'm downgrading the priority until someone can explain why this is so important.
C) pserver is terrible, no question. It makes me squirm that we use it. CVS ext: over ssh is vastly superior, but basically impossible for drupal to use without a massive headache for OSUOSL (jailed shell accounts with ssh keys, etc, for everyone with a CVS account). Ugh.
D) I haven't yet played with git, so I don't have direct personal experience. However, my sense is that given how completely many (most?) drupal developers fail to even understand simple, centralized version control, if we throw them into the ocean of distributed version control, we'll never (ever) see the end of the support load and "wow, i really screwed up git, can you help me?" issues.
E) Unless there's an avalanche of help, nothing is going to happen on any attempt to move off of CVS until everything listed here is completed: http://groups.drupal.org/node/6180 -- Getting 6.0 core out is blocked on getting d.o upgraded to 6.x, and that's blocked on that punch list. Releasing 6.0 core is more critical than resolving a logging problem, or even getting rid of pserver itself (which, no doubt, would be nice). jpetso's SoC project was certainly the first "avalanche of help", and aclight (and of course, hunmonk) have all been a gift from heaven when it comes to project* work. But, we're still at the very early stages here.
F) The good news is that my plan outlined in #6180 is to not port cvs.module itself to 6.x, but to port jpetso's versioncontrol* code, and finish all the integration work with project*, get all that code ready for prime-time, etc. So, that's at least the next step towards potentially moving away from CVS, but we're talking weeks or months before that step is done, and months before the dust settles and we're ready for the next step after that. [Note, @jpetso: I appreciate your enthusiasm, and I understand you're getting credit or other benefits for the SVN backend, but it seems like "the next task" for versioncontrol* is to finish the CVS backend, finish the project* integration, get it deployed on d.o, and port it to 6.x -- in terms of the long-term health of this API and suite of modules, those all need to happen before this code is really going to get adopted widely and used, even if you (and aclight, and whomever else wants to help) can implement the SVN backend in reasonably short time.]
G) Dries's recent survey of d.o users showed that module ratings/sorting/classification/browsing stuff is the #1 feature request for the community. So, honestly, I'd probably put most of the efforts outlined at http://groups.drupal.org/node/6186 ahead of moving off CVS in terms of a prioritized list of tasks.
In summary, we either need a healthy dose of patience, or a massive influx of help (and/or cash) to even get closer to being able to leave CVS.
[Note: if anyone happens to still be reading, I should point out that 6.0 core is going to be blocked for a while without that same influx of help and/or cash, regardless of the (IMHO disastrous) proposal to include ditching CVS in the mix of all the crap we have to get done on the d.o infra and in project* in the next few weeks/months.]
Every time this thread has come up before, no one actually has a good reason to leave CVS, they just don't understand it and think it can't accomplish something their new-favorite-tool can do. Finally we have a decent reason (pserver is internally lame and hard to get logging out of) from someone who actually understands, which is definitely a refreshing change from the past. ;) That's the other reason I'm willing to keep this thread alive instead of just killing it. ;) However, unless we switch to a distributed system, I don't see what we'd really gain by moving. CVS might be end-of-lifed someday, in which case even switching to SVN would be worth doing, but until that happens, that's not really a reason to call this a critical task. And, given the software development level of the bulk of the drupal dev community, I can't imagine recommending a switch to a distributed system, unless we decide to seriously raise the bar on contributing to drupal (which might be a good thing, I'm not sure).
Finally, the social problems and documentation task are literally monumental -- please don't underestimate what would be necessary in this regard (@moshe and crell: thanks for bringing this up earlier in the thread already -- much appreciated). And, you're going to have to find someone else to be the project* + git expert to take over my position as the resident CVS guru. I'm already fighting off burn-out as it is, and if we go through another major version control transition again, I'm either going to have to not be in the middle of it or I'll just run away screaming from the Drupal community. For perspective, the last change wasn't even that major of a change: same backend system, everyone got to use the same tools, etc. All I did [sic] was introduce the notion of tags (which were already known via core) to the rest of contrib. That's it. Well, and I guess I cleaned up and expanded the notion of branches (which people were already using). However, it's been basically a year, and people still screw it up, and I still want to go crazy and shoot people who send me personal emails asking about CVS questions which are answered in the handbook pages webchick and I spent many hours (days) working on. This isn't my job, no one's paying me to do this crap, and it's definitely not fun anymore. All I wanted (nearly 2 years ago when I first started advocating for my changes) was to fix a horribly lame, broken system, and introduce something that would improve my own ability to maintain and use Drupal code. To do so, I had to do a lot of heavy lifting, and I'm glad I did it, and the Drupal project is better off for it. But, I got seriously burned in the process, and I'm not going to do it again.
p.s. @Narayan: sorry if it seems like I'm angry at you, personally, for starting this thread -- I'm not. As you can see, there's a lot of history here, which is why this is somewhat emotionally charged for me. No hard feelings, ok? :)
Comment #20
nnewton commented
Indeed, and I think we have a solution from Jeremy. It is appreciated, and it would be _great_ to get stats pushed into cvs and not have to use a hack.
I never meant to set this as critical, sorry about that. It's for some work amazon is doing. Thanks for downgrading.
No question, although I may look into it at some point.
My thought here was that git can be used as a central revision control system if you so choose. It is actually quite flexible in how it can be used. However...ya, even I have seen the CVS questions in #drupal.
Just to be clear here, I 'subverted' this issue from a cvs stats only thread into a SCM thread to get the conversation started. I was in no way saying, let's do this next week and preempt drupal6.
From the OSL side, we have the patience...I just wanted to see what the actual problems with a move would be and get people talking about it. Again, not saying we do this now...but I'd personally like it on the list somewhere.
Feature-wise CVS fits your needs right now, and our major (and really only) concern is the security of pserver. We may work out something to fix that as well. From an 'interested observer' view, you guys will have to change eventually and I'd _really_ recommend not going to subversion. As you say, it's easy to not understand the full capabilities of CVS (and subversion for that matter), and to be honest the work-flow between the two is... the same. Subversion is a more secure, somewhat technically better CVS. It allows for better branching technically because of cheap copies, but its merge ability is laughable.
I'm not pushing git either (it's just what I know). We are moving to that at the OSL, but that's not really shocking. My main point is, this will be a long transition... it will be painful and it will be "sometime in the future." It would be good to transition to something that is worth the transition and is not just CVS 'part 2'. Anyone object to me opening an issue about planning a future SCM and cherry-picking comments out of this thread? I shouldn't have hijacked the discussion and it would be nice to not have to repeat this exact conversation down the road (again).
SCM is a rather important part of an open source project and should not be the domain of one person IMO. Partly because it's that important, partly because you get burn-out like this. Thanks for handling this aspect of the project for so long, and by bringing this up I in no way meant to say 'let's change now and let him handle it'.
Nah, I didn't think you were angry at me. I knew there was a lot of history, I just didn't particularly know what it was.
Thanks for the detailed response and of course...no hard feelings. I'm rather difficult to upset, right up until you tell me to compile with -fomit-all-instructions.
Comment #21
killes@www.drop.org commented
Ok, a short summary of my opinion:
1) The cvs stats are decidedly on the "nice to have but not vital" list of things. If Jeremy or somebody else wants to work on it, that's great, but moving to a different RCS only because of this won't happen.
2) I'd like to see people interested in switching RCS to get together and prepare a list of options with pros and cons and a suggested route for moving.
I personally don't see us moving to a different RCS before spring 2008, if at all.
Comment #22
Amazon commented
Checkout stats from CVS are critical for marketing, but I agree not critical for infrastructure.
Drupal has had about 800 000 Downloads of core in 2007. However, it has 600 000 checkouts from CVS each month! We are trying to determine what percentage of the 600 000 CVS Checkouts are Drupal core. This will help with Marketing to be able to give an idea of how popular Drupal is. We know from Dries's survey that most admins run 3-5 Drupal sites on average. We also know that developers and professional Drupal shops tend to manage their code using version control and usually start by checking CVS into their repository. This could help with making estimates of how many Drupal sites there are.
With Drupal 6 and the update module we can also start to draw a relationship between how many copies of Drupal are downloaded and how many sites are checking for updates.
Comment #23
drumm
I thought we get far more accurate numbers from the update status system.
Comment #24
jeremy commented
I finally had some time to look into this. In getting CVS pserver up and running in my sandbox, I noticed that it already logs everything you need in order to provide the information dumped by running "cvs history". You'll just need to write a script to extract the information you need from these existing logs.
To enable/configure what CVS logs, edit CVSROOT/config and edit the LogHistory section. If you just want to log checkouts, set it to "LogHistory=O". A complete list of what can be logged is found in src/history.c, so to log checkouts and exports you'd set it to "LogHistory=OE":
The log file is written to CVSROOT/history. Here's an example of my cvs log:
You'll note that includes one modified file, and two checkouts. A full understanding of what is logged can be found in a comment at the top of src/history.c. Briefly, the first character denotes the action, then comes a date ("a fixed length 8-char hex representation of a Unix time_t"). After the first "|" is the user that performed the action. The next section shows where the file was checked out to (<remote> in this case). Then come action-specific fields, but you can see in the first line I added revision 1.2 of one.c, in the second line I checked out the "example" directory, and in the third line I checked out the "example/one" subdirectory.
Comment #25
nnewton commented
Thanks for finding that, I am apparently retarded. That is exactly what we need and those files have the info in them at least back to Oct 30th, so I can just write a script to parse them.
Comment #26
nnewton commented
Here is a short sample of the history file:
O439bcd91|anonymous|*15|contributions/modules/simple_access||simple_access
O439bf917|anonymous|/*0|drupal/modules||drupal/modules
O439bfb6f|anonymous|/*0|drupal|HEAD|drupal
O439c0d3a|anonymous|/*0|drupal||drupal
O439c1b35|anonymous|/*0|drupal||drupal
O439c1b54|anonymous|/*0|drupal||drupal
O439c3612|anonymous|/*0|drupal||drupal
O439c36c5|anonymous|/*0|drupal||drupal
O439c3cd8|anonymous|/*0|drupal||drupal
O439c437d|anonymous|/*0|drupal||drupal
O439c4780|anonymous|/*0|drupal||drupal
O439c4d48|anonymous|/*0|contributions/sandbox/nedjo/modules/tabs||contributions/sandbox/nedjo/modules/tabs
O439c4de0|anonymous|/*0|drupal|DRUPAL-4-6|drupal
O439c4f0b|anonymous|/*0|drupal|DRUPAL-4-6|drupal
O439c4fcd|anonymous|/*0|drupal|DRUPAL-4-6|drupal
O439c6a55|anonymous|/*0|drupal|DRUPAL-4-7|drupal
O439c6b4c|anonymous|/*0|drupal||drupal
O439c6c0c|anonymous|/*0|drupal|HEAD|drupal
O439c6e9c|anonymous|/*0|drupal||drupal
O439c7e01|anonymous|/*0|drupal||drupal
O439c895f|anonymous|/*0|drupal||drupal
O439c8e48|anonymous|/*0|drupal|DRUPAL-4-6|drupal
O439c8e98|anonymous|/*0|drupal|DRUPAL-4-6|drupal
O439c97cb|anonymous|/*0|drupal||drupal
O439ca087|anonymous|/*0|drupal||drupal
O439ca3a6|anonymous|/*0|drupal||drupal
O439cacc6|anonymous|/*0|drupal|HEAD|drupal
O439cb57c|anonymous|/*0|drupal||drupal
O439d1aa0|anonymous|/*0|drupal||drupal
O439d211f|anonymous|/D*1|drupal||Drupal
Above are only checkouts, the real file has those interspersed with sections like this:
P439bc6c5|anonymous||drupal|1.68|.htaccess
P439bc6c5|anonymous||drupal|1.3|INSTALL.mysql.txt
P439bc6c5|anonymous||drupal|1.4|INSTALL.pgsql.txt
P439bc6c5|anonymous||drupal|1.26|INSTALL.txt
P439bc6c5|anonymous||drupal|1.2|UPGRADE.txt
P439bc6c5|anonymous||drupal|1.33|cron.php
P439bc6c5|anonymous||drupal|1.89|index.php
C439bc6c5|anonymous||drupal|1.166|update.php
P439bc6c5|anonymous||drupal|1.15|xmlrpc.php
P439bc6c5|anonymous|*6|drupal/database|1.160|updates.inc
P439bc6c5|anonymous|*6|drupal/includes|1.77|bootstrap.inc
C439bc6c5|anonymous|*6|drupal/includes|1.497|common.inc
P439bc6c5|anonymous|*6|drupal/includes|1.48|database.inc
P439bc6c5|anonymous|*6|drupal/includes|1.47|database.mysql.inc
P439bc6c5|anonymous|*6|drupal/includes|1.10|database.mysqli.inc
P439bc6c5|anonymous|*6|drupal/includes|1.23|database.pgsql.inc
P439bc6c5|anonymous|*6|drupal/includes|1.56|file.inc
C439bc6c5|anonymous|*6|drupal/includes|1.30|form.inc
P439bc6c5|anonymous|*6|drupal/includes|1.11|image.inc
C439bc6c5|anonymous|*6|drupal/includes|1.4|install.inc
P439bc6c5|anonymous|*6|drupal/includes|1.60|locale.inc
P439bc6c5|anonymous|*6|drupal/includes|1.94|menu.inc
P439bc6c5|anonymous|*6|drupal/includes|1.70|module.inc
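For reference, each record above seems to follow one pattern: a one-letter record type fused with an eight-digit hex timestamp, followed by five '|'-separated fields. Here is a minimal Python sketch of splitting one record. It assumes the hex value is seconds since the Unix epoch, and the names used (HistoryRecord, workdir, repo_path, rev_or_tag, name) are my own labels for illustration, not official CVS terminology.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class HistoryRecord(NamedTuple):
    # Field names are illustrative labels, not official CVS terminology.
    kind: str          # record type letter, e.g. "O" or "P"
    when: datetime     # decoded from the hex timestamp (assumed Unix seconds)
    user: str
    workdir: str       # encoded working directory, e.g. "/*0" (may be empty)
    repo_path: str     # 4th column: path inside the repository
    rev_or_tag: str    # revision for file records, tag (or empty) for checkouts
    name: str          # file name, or the name used for the checkout


def parse_history_line(line: str) -> HistoryRecord:
    """Split one history line on '|' into its six columns."""
    # maxsplit=5 keeps any stray '|' inside the last column intact.
    head, user, workdir, repo_path, rev_or_tag, name = line.rstrip("\n").split("|", 5)
    kind, hex_time = head[0], head[1:]
    when = datetime.fromtimestamp(int(hex_time, 16), tz=timezone.utc)
    return HistoryRecord(kind, when, user, workdir, repo_path, rev_or_tag, name)


rec = parse_history_line("O439c4de0|anonymous|/*0|drupal|DRUPAL-4-6|drupal")
print(rec.kind, rec.user, rec.repo_path, rec.rev_or_tag, rec.when.isoformat())
```

A line with fewer than six columns raises ValueError here, which is probably what you want while checking whether the format assumption holds for the whole file.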
Comment #27
Amazon commented
I have the CVS history file. I am going to try to import it into this tool: http://cvshist.sourceforge.net/
Comment #28
Amazon commented
I tried to use cvshistory. The problems are twofold.
First, the parsing is failing. From update_cvs_history_db.sh:
echo "generate database-importable data from the cvs history file"
awk -f gen_mysqlimport_data.awk $CVS_HISTORY_OUTPUT_FILE > $MYSQL_HISTORY_DATA_FILE
Which yields the format below. As you can see, the | separators are not being used to split the data into columns. This means we can't query the data and use it to look for patterns in the usage of Drupal modules. Tracking CVS checkouts to look for the most popular modules and combinations of modules would be valuable for some upcoming pivot work we are doing.
The AWK script creates a 1.7GB file which is a single query.
INSERT INTO history (code, action_date, cvs_user, revision, file_name, path_in_repo, working_name) VALUES
('P439bc6c5|anonymous||drupal|1.68|.htaccess',' :00','','','','','')
,
('P439bc6c5|anonymous||drupal|1.3|INSTALL.mysql.txt',' :00','','','','','')
,
('P439bc6c5|anonymous||drupal|1.3|INSTALL.mysql.txt',' :00','','','','','')
,
('P439bc6c5|anonymous||drupal|1.4|INSTALL.pgsql.txt',' :00','','','','','')
,
('P439bc6c5|anonymous||drupal|1.4|INSTALL.pgsql.txt',' :00','','','','','')
On my 32-bit, 1GB development server, this is too much to ingest into MySQL. Even with max_allowed_packet set to 500MB it fails with "ERROR 1153 (08S01) at line 1: Got a packet bigger than 'max_allowed_packet' bytes".
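One way around the packet limit might be to send many small inserts instead of one 1.7GB statement. Below is a rough Python sketch of that idea. It uses sqlite3 purely as a stand-in so it runs without a MySQL server; the six-column table layout, the BATCH_SIZE value, and the helper names (rows, batches, load) are assumptions for illustration, not what cvshistory actually does.

```python
import sqlite3
from itertools import islice

BATCH_SIZE = 1000  # hypothetical; keeps every statement far below any packet limit


def rows(lines):
    """Turn raw history lines into 6-column tuples, skipping malformed lines."""
    for line in lines:
        fields = line.rstrip("\n").split("|", 5)
        if len(fields) == 6:
            yield tuple(fields)


def batches(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk


def load(conn, lines, batch_size=BATCH_SIZE):
    """Insert history rows in small batches; returns the number of rows loaded."""
    total = 0
    for chunk in batches(rows(lines), batch_size):
        conn.executemany("INSERT INTO history VALUES (?, ?, ?, ?, ?, ?)", chunk)
        conn.commit()
        total += len(chunk)
    return total


# sqlite3 stands in for MySQL; the column names are a simplified guess.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE history (code, cvs_user, workdir, path_in_repo, revision, file_name)"
)
sample = [
    "P439bc6c5|anonymous||drupal|1.68|.htaccess",
    "P439bc6c5|anonymous||drupal|1.3|INSTALL.mysql.txt",
    "P439bc6c5|anonymous||drupal|1.4|INSTALL.pgsql.txt",
]
loaded = load(conn, sample, batch_size=2)
print(loaded)
```

Because the rows are generated lazily and committed per batch, memory use stays roughly constant no matter how large the history file is.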
So we switched back to a more general parsing.
Checkouts from CVS
cat /home/amazon/CVSROOT/history | grep -P "^O" | wc -l
538190
Core checkouts from CVS
cat /home/amazon/CVSROOT/history | grep -P "^O" | grep -P "drupal$"|wc -l
53923
This means that about 10% of Drupal checkouts are Drupal core. If we are doing 600,000 checkouts per month, then we have an additional distribution of 60K Drupal instances. I'll need to confirm some of these numbers. It's not clear, for example, how many history lines are recorded for a single checkout. In some cases every top-level directory counts as a checkout.
In short, having a script that parses CVS history into a database table would be good for making module recommendations and for keeping track of the total distribution of Drupal software from Drupal.org.
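As a rough illustration of the per-module tally this would enable, here is a small Python sketch that counts "O" records by the 4th column. The function name checkout_counts is made up for the example, and the sample lines are copied from the history excerpt earlier in this thread.

```python
from collections import Counter


def checkout_counts(lines):
    """Count 'O' (checkout) records per repository path (4th column)."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) >= 4 and fields[0].startswith("O"):
            counts[fields[3]] += 1
    return counts


sample = [
    "O439c4d48|anonymous|/*0|contributions/sandbox/nedjo/modules/tabs||contributions/sandbox/nedjo/modules/tabs",
    "O439c4de0|anonymous|/*0|drupal|DRUPAL-4-6|drupal",
    "O439c6b4c|anonymous|/*0|drupal||drupal",
    "P439bc6c5|anonymous||drupal|1.68|.htaccess",
]
counts = checkout_counts(sample)
for path, n in counts.most_common():
    print(n, path)
```

To run it over the real file, pass an open file handle instead of the sample list, since the function only needs something that yields lines.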
Comment #29
jeremy commented
Mail me a small section of the log and I'll write you a PHP script to dump the data into the database, if that's what you want to do. I don't want the full log, just enough to be sure my script is working, maybe 5,000 lines (i.e., "tail -5000 history > jeremy.txt"). Once you get all the data into the database, you can manually craft queries to get the information you want.
(BTW: Let me know if you only want checkouts, or if you want other data too. Be specific, as it's quickest to only grab the data you want, not more.)
Comment #30
dww commented
Oh yes, this issue again... ;)
@Amazon, I think it's highly misleading to call this "additional distribution of 60K Drupal instances".
A) We don't really have a good way to know, but I'd bet good money that a sizeable fraction, if not the overwhelming majority, of those core checkouts are test sites from developers. I know I re-checkout all of core from CVS quite frequently when bringing up and tearing down different test sites in various places.
B) Even though CVS sees them as different commands, and records them as such in the history file, if you do a "cvs checkout" on top of an existing workspace, it's functionally the same as an update. I also frequently "re-checkout" core on top of a test site. I just confirmed (watched the /cvs/drupal/CVSROOT/history in one window while I ran checkouts on my laptop in another) and these sorts of updates get logged as another checkout ("O"), even though, under the covers, they're just a plain old update.
C) Your regexps are also wrong. Behold:
(checking out into a directory called something other than "drupal" is rather common, I'd assume...)
That command yields the following history file entry:
(Wow, I sort of forgot just how much info CVS records -- Big Brother indeed!) ;)
Point being, your regexp is going to miss this, since the -P "drupal$" isn't going to fire, since "drupal" isn't the end of the line. You really need to split on '|' and inspect the 4th column to be safe.
Personally, I'd be really afraid about parsing the history file and storing that in the DB. It's generating a *ton* of noise, and I'm not sure what kinds of meaningful statements we can make about the statistics we'd be trying to summarize.
Fundamentally, I'm not convinced that checkout statistics from CVS tell us much, and I'd be incredibly hesitant to make any definitive claims, especially "marketing" claims, based on them. Although, I guess that's exactly what marketing is all about. ;) Oh well, it's not my job to tell the marketing team how to do their job, but from the technical standpoint, I think you're standing on very thin ice if you see the above stats and claim "60K new drupal instances a month, just from CVS!".
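To make the regexp point under C) concrete, here is a small Python sketch comparing the end-of-line match with the "split on '|' and inspect the 4th column" approach. The function names are invented for the example, and treating an exact "drupal" in the 4th column as "core" is an assumption. The sample lines come from the history excerpt posted earlier in this thread, including one whose last column is "Drupal" rather than "drupal".

```python
import re

CORE_SUFFIX = re.compile(r"drupal$")  # same pattern as grep -P "drupal$"


def is_core_by_regex(line):
    """Checkout record whose line ends in 'drupal' (the original approach)."""
    return line.startswith("O") and CORE_SUFFIX.search(line) is not None


def is_core_by_column(line):
    """Checkout record whose 4th column is exactly 'drupal'."""
    fields = line.rstrip("\n").split("|")
    return len(fields) >= 4 and fields[0].startswith("O") and fields[3] == "drupal"


lines = [
    "O439d1aa0|anonymous|/*0|drupal||drupal",
    "O439d211f|anonymous|/D*1|drupal||Drupal",
    "O439c4d48|anonymous|/*0|contributions/sandbox/nedjo/modules/tabs||contributions/sandbox/nedjo/modules/tabs",
]
by_regex = sum(is_core_by_regex(line) for line in lines)
by_column = sum(is_core_by_column(line) for line in lines)
print(by_regex, by_column)
```

On these three lines the regexp finds one core checkout and the column check finds two, so the regexp-based count is an undercount. None of this addresses points A) and B), which are about what a checkout record means rather than how to count it.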
Comment #31
Amazon commented
You are absolutely correct. Assuming new instances would be wrong. We should state the basic facts: there are X core checkouts per month. The reason I wanted to calculate CVS checkouts is that all the professional developers I know either check out from CVS directly or check out of CVS into their own repository. If we didn't include the professional developers in our distribution numbers, we would be doing ourselves a disservice.
I'll be careful not to state how many new sites this leads to.
Comment #32
gerhard killesreiter commented