G'day.

I assume the infra team knows me. If not, please comment and I'll list the boring crap.

I'd like, ideally, just access to the sanitized Drupal.org daily database. At this point in time, I do not intend to write code "for" drupal.org, as any code I'm currently writing is D7 specific, and d.o has a ways to go there. My request for the drupal.org daily database (which, according to these deprecated instructions, requires BZR - is that still correct?) is for a data-mining project that can best be summed up as "Drupal.org achievements".

There are various reputation-based and points and badges initiatives being proposed (here and here). My work would involve data-mining Drupal.org and creating sample achievements based on this data, in eventual hopes and plans of having it integrated in a far-future version of d.o (i.e., after a D7 upgrade). Until said integration occurred, I would likely create a secondary site (maybe on druplicon.info, maybe somewhere else) that would take the results of this data-mining + achievements and display them. Because of this secondary site + D7 focus, a your-server-provided environment would not be ideal (annoying reimaging daily for the latest database, largely ignored D6 codebase, etc.). Note that the database dump I would require would presume, at the least, that the UIDs and usernames from d.o are retained - I, of course, assume that email addresses and passwords have been sanitized away, but I don't see any rationale or need for them, so that's perfectly fine.

Note: a denial of this request is "OK", per se, if it's going to cause undue stress on any of the processes. I can, of course, simply spider drupal.org through HTTP for the same data.

Comments

greggles’s picture

I'm in favor of this application and, in general, allowing access to a sanitized version of our data.

@Morbus Iff - since you are one of a) the first people to request this b) a group of people who care about security/privacy - would you be willing to help make sure the sanitizing script is catching everything it should? That script is in git at http://drupalcode.org/project/infrastructure.git/blob/refs/heads/master:...

Morbus Iff’s picture

@greggles: Yep, I'd be fine with that.

Some initial concerns based on just looking at the script:

  • We're not sanitizing user.pass - even though it's MD5'd, we should not provide the hashes to end-users.
    • Potential downside is that the downloader wouldn't be able to login to their own account to start.
    • Possible solution is to just MD5('example password') for uid1 and instruct folks to edit account passwords.
    • Another would be to just assign all passwords to 'example password' for ease of use. It is devel after all.
  • We don't seem to be doing anything else to the users table, which contains:
    • Their current email in user.mail (might be irrelevant due to profile usage, but...)
    • We're not removing values in user.init, which contains the first email they ever signed up with.
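
For illustration only, here is a minimal Python sketch of the "assign all passwords to 'example password'" option combined with blanking both email columns. It is not the actual infrastructure script: it assumes a `users` table with `uid`, `name`, `pass`, `mail` and `init` columns, uses an in-memory SQLite database as a stand-in for the real MySQL dump, and the `sanitize_users` and `demo` names are hypothetical.

```python
import hashlib
import sqlite3

# Placeholder password taken from the suggestion above.
PLACEHOLDER_PASSWORD = "example password"


def sanitize_users(conn: sqlite3.Connection) -> None:
    """Replace every password hash and blank both email columns; uid and name are kept."""
    placeholder_hash = hashlib.md5(PLACEHOLDER_PASSWORD.encode("utf-8")).hexdigest()
    conn.execute(
        "UPDATE users SET pass = ?, mail = '', init = ''",
        (placeholder_hash,),
    )
    conn.commit()


def demo() -> list:
    """Build a tiny stand-in users table (made-up rows), sanitize it, return the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users ("
        "uid INTEGER PRIMARY KEY, name TEXT, pass TEXT, mail TEXT, init TEXT)"
    )
    conn.executemany(
        "INSERT INTO users VALUES (?, ?, ?, ?, ?)",
        [
            (1, "admin", "realhash1", "admin@example.com", "first@example.com"),
            (2, "someone", "realhash2", "someone@example.com", "old@example.com"),
        ],
    )
    sanitize_users(conn)
    return conn.execute(
        "SELECT uid, name, pass, mail, init FROM users ORDER BY uid"
    ).fetchall()
```

After this runs, every account shares one known hash, so a downloader can log in as anyone on their local copy, and no real hash or email address leaves the server.
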
greggles’s picture

It's possible that the script I linked to isn't the only one being run. I suggest asking drumm for the complete set.

greggles’s picture

I think the scripts are currently written to make the files useful on our staging sites and not necessarily for distribution.

There are also some scripts in http://drupalcode.org/project/infrastructure.git/tree/refs/heads/master:... that run after the one I linked to which may do some of the things you mentioned (though I didn't immediately notice any that fix the problems you mention).

Thanks for the feedback already!

Morbus Iff’s picture

Is a DB-only approach entirely unsupported now? Based on the (deprecated) http://drupal.org/node/1018070, it would seem that, at one time, there were daily database dumps already, but based on what I'm reading above, they were never used by any end-users ("[you're] the first [person] to request this")? Based on the diff history of that doc, it would seem this stuff was in place a year ago (February 2011), but was only recently doc'd as deprecated (March 2012). I *can* do what I'd like to experiment with using good ol' HTML scraping, but it'd need to be rewritten entirely should it ever become an official d.o improvement (which is the grandest, costliest downside).

drumm’s picture

We do get this request every couple of months. What I want to see is development going into providing APIs for everyone to use. This is more transparent for users: the data we distribute is clearly shown in API documentation. And it is more fair to others wanting data, since we don't hand out our DB dump to everyone. An example API is https://association.drupal.org/api. We already have a big need for a basic profile data API since bakery is not covering that any more, #1232870: Profile data synchronization.

The sanitized DBs are available with shell access to util. During the great redesign completed in 2010, we did provide them over HTTP, with a password, but this has been discontinued to simplify infrastructure. In general, it is much quicker and easier to spin up a dev site than to walk a contributor through downloading everything and getting a local dev site working properly.

I don't think we are ready to make these publicly available. We may have sanitization done well at points in time for a given site, but I don't think we are ready to say it is always perfect. We are adding new things more often now, and reviewing for privacy isn't something we do every time. I think, after every schema change, we would want to stop automated DB dumps until reviewed and reactivated.

The sanitization scripts run in stages:

  1. common.staging.sql
  2. [db].staging.sql
  3. → minimally sanitized dump for staging.devdrupal.org
  4. common.sql
  5. [db].sql (these two should really have dev in their name)
  6. → fully sanitized dump for redesign.devdrupal.org
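
As a rough illustration of the ordering above, the following Python sketch lists which scripts have been applied by the time each dump is taken. It assumes the stages are cumulative (the fully sanitized dump has also been through the staging scripts), which is how the list reads but is not confirmed here. The `STAGES` table, the `scripts_for` function, and the example database name `drupal` are hypothetical.

```python
# Each entry: (site that receives the dump, scripts run in that stage).
# "{db}" stands in for the per-database script name, e.g. [db].staging.sql.
STAGES = [
    ("staging.devdrupal.org", ["common.staging.sql", "{db}.staging.sql"]),
    ("redesign.devdrupal.org", ["common.sql", "{db}.sql"]),
]


def scripts_for(target: str, db: str) -> list:
    """List every script applied, in order, before the dump for `target` is taken."""
    applied = []
    for site, templates in STAGES:
        applied.extend(template.format(db=db) for template in templates)
        if site == target:
            return applied
    raise ValueError(f"unknown target: {target}")


if __name__ == "__main__":
    for site, _ in STAGES:
        print(site, "<-", ", ".join(scripts_for(site, "drupal")))
```
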

In the next few weeks we will be ramping up work on Drupal.org's upgrade to Drupal 7. A first part of that is building out codebases, staging sites, and dev sites on Drupal 7. We do want to focus on getting the Drupal 7 upgrade done without scope creep, but that would provide the infrastructure for developing for Drupal.org directly. If ready, we can deploy it soon after the basic upgrade is done.

Morbus Iff’s picture

So, @drumm, what are you saying?

The data and logic I am looking for is unlikely to ever be available in a digestible API, because it is humorously or aggressively decided (such as, off the top of my head, "Responded to an issue that had 150 comments" or "Was the 100th post" or "Was the 50th unique poster" or "Posted every day of the week for three weeks" or "Committed after midnight" or "Fed the trolls [changed something from "by design" to "active"]", et cetera, et cetera, et cetera). All stuff I can find out through dumpster diving (HTTP spidering/HTML slurping) and all stuff that is unlikely to appear in a publicly accessible API. Thus, me helping out on development of an API for everyone to use doesn't really help me out - I see no huge advantage to a JSON or XML version of a node vs. the HTML version of a node (and, for my achievement based needs, "everything on a node and all its comments" is desirable output).
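
To make a few of those examples concrete, here is a small Python sketch of how such checks might be computed once the comments on an issue are available as plain rows. It assumes each comment is a `(uid, date)` tuple in posting order; that shape and all function names are hypothetical, not the d.o schema or the Achievements module API.

```python
from datetime import date, timedelta


def nth_poster(comments, n):
    """Return the uid that made the n-th comment (1-based), or None if there are fewer."""
    return comments[n - 1][0] if len(comments) >= n else None


def nth_unique_poster(comments, n):
    """Return the uid of the n-th distinct person to comment, or None if never reached."""
    seen = set()
    for uid, _posted_on in comments:
        if uid not in seen:
            seen.add(uid)
            if len(seen) == n:
                return uid
    return None


def longest_daily_streak(days):
    """Return the longest run of consecutive calendar days found in `days`."""
    best = run = 0
    previous = None
    for day in sorted(set(days)):
        if previous is not None and day - previous == timedelta(days=1):
            run += 1
        else:
            run = 1
        best = max(best, run)
        previous = day
    return best
```

With these, "Was the 100th post" is `nth_poster(comments, 100)`, "Was the 50th unique poster" is `nth_unique_poster(comments, 50)`, and "Posted every day of the week for three weeks" becomes a check that `longest_daily_streak` over one user's posting dates is at least 21.
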

As indicated above, the ultimate plan is to provide an Achievements/Badge-based system for drupal.org ON drupal.org. Using the direct database tables in the proposed/submitted code is going to be more palatable than consuming d.o's public API on d.o's own servers. However, as Achievements does not work on Drupal 6, there's little point in me starting development using a Drupal 6 codebase (as provided by the current sandbox infrastructure). Nor do I really want to wait until Q3 (per webchick's d.o roadmap session) to start work using a Drupal 7 sandbox infrastructure. Quite simply: even given Q3 being far away, I don't think the Achievements work would be happily "complete" by that time, so I'm not looking to launch this alongside a d.o D7 upgrade, which matches your "scope creep" sensibilities.

My comments above about providing a secondary site during this experimental process are meant both to more clearly offer a commentable demo of the proposed functionality and to satisfy the "6+ months of on/off coding and fiddling" I expect for such a project, with minimal code rewriting (with full code rewrites being required if I were to HTML scrape, or even if I were to use a fully-my-needs public API). If you're ultimately concerned about me attempting to create a CTR alternative using "privileged" information (which is a fallacy as I won't be acting on anything that is not already public), then say so and I'll just proof of concept everything using an HTML scraper, and I'll just have to rewrite it all when D7 is available on d.o. I was, ideally, going to do that anyways until I found out about d.o infrastructure sandboxes. But, let's not be confused: the goal here is to bring achievements/badges to d.o on D7, not to provide a third-party, offsite-hosted, CTR-competing, d.o achievements thingy.

I am perfectly fine with retrieving the data over SSH, assuming it is updated on a semi-regular basis. I have no interest in setting up a local d.o dev site - I will have a fresh Drupal 7 dev site with its own database, and then the d.o database is merely a secondary database that is there for data only - not to run or serve anything from. I wouldn't need any further instruction besides "here's where it is, now go do your exports and imports yourself".

On an unrelated-note-brought-up-by-greggles, I'll look at the sanitizing workflow you've outlined above shortly.
Of course, without seeing the database itself, I can only rely on probable ideas about what data remains in the d.o dumps.

greggles’s picture

"since we don't hand out our DB dump to everyone."

IMO we really really should. As Morbus has pointed out, API is far less useful and most of the data is already available via HTML scraping.

@morbus - you misquoted me and drew an inaccurate conclusion ;) I said you're one of the first to request and care about security/privacy. Especially in terms of people who are asking to download it for a purpose that is not immediately related to improving drupal.org (i.e. the improvement's timeline is more than a month in the future).

Morbus Iff’s picture

"@morbus - you misquoted me and drew an inaccurate conclusion ;)"

Holy crap, I really did badly misquote you. Sigh. I don't think the *conclusion* I was going for is any different though: that, at one time, new database dumps were available on a regular basis, but they were deprecated for either ease-of-maintenance, security, or lack-of-interest. Drumm suggests it was because of ease-of-maintenance ("to simplify infrastructure") and security ("reviewing for privacy isn't something we do every time").

My understanding is that I would get access to a sanitized DB if I went with a full-on d.o development sandbox. And, from there, I presume I could make a dump of it and import it on my own development server for use as a secondary datasource from a Drupal 7 installation. My only concern with this process is that I would need to test ongoing changes to the underlying data, which would mean new reimaging requests. If those reimaging requests are automated in a shiny UI, then I don't have to bother anyone else. If they're not automated, then my reimaging requests will likely become annoying. Similarly, due to the length of this project, the sandbox likely wouldn't be able to be reaped anytime soon, and based on other sandbox issues in this queue, the server is always hovering near full disk space, and I wouldn't want to really be a detriment to that. Thus, the "oOoh, there's a deprecated doc here saying there's database exports available, so I'll go that route instead!"

Also, a note of clarity: I'm not asking for the database "just" for the schema - if that's all I cared about, one could simply emulate a d.o data structure and get a representative schema. I do, in fact, need the data, for "percentage completion". That is: we know any badge/achievements system would have to respect/unlock based on the legacy contributions (code and non-code) of a user. If achievements/badges were launched and we found everyone had 90% badge completion, That'd Be Very Bad - it would seem like there's nothing left to achieve, which would make the entire project nearly pointless and certainly uninspiring. Thus, the available data will help determine the difficulty, breadth, and number of achievements available.
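
A rough Python sketch of that "percentage completion" check follows. It assumes we already have a count of unlocked achievements per uid from the legacy data; the function names and sample numbers are hypothetical and not real d.o figures.

```python
def completion_percent(unlocked: int, total: int) -> float:
    """Percentage of the available achievements that one user has unlocked."""
    return 100.0 * unlocked / total


def share_at_or_above(unlocked_by_user: dict, total: int, threshold: float) -> float:
    """Fraction of users whose completion is at least `threshold` percent."""
    if not unlocked_by_user:
        return 0.0
    hits = sum(
        1
        for unlocked in unlocked_by_user.values()
        if completion_percent(unlocked, total) >= threshold
    )
    return hits / len(unlocked_by_user)


if __name__ == "__main__":
    # Made-up sample: uid -> number of achievements unlocked, out of 10 available.
    sample = {1: 9, 2: 10, 3: 1, 4: 0}
    print(share_at_or_above(sample, total=10, threshold=90.0))
```

If a proposed set of achievements puts most long-time users at or above 90% on day one, that is the signal to add harder or broader achievements before launch.
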

webchick’s picture

While in general I'm very +1 to creating more robust APIs around Drupal.org data, I really can't envision another way to work on this feature (currently ranked #12 of 107 at http://drupal-association.ideascale.com/) aside from sanitized DB dump access. Morbus has been with the community long enough (and has access to crap-tons of other data should he want it via the Druplicon bot), that I would trust him with this level of access. He's also the maintainer of the Achievements module, and based on comments at http://drupal-association.ideascale.com/a/dtd/Add-a-reputation-badges-sy... and elsewhere, it seems like the Achievements-esque way of approaching this has the most support (as opposed to, say, User Points), making him likely the best person to run point on this.

Long-winded way of saying "I would support this request."

rfay’s picture

I should note that anybody who has a sandbox already has access to a sanitized database, and could dump it and use it for various purposes. So someone could just request a sandbox... Is that a violation of something I don't remember?

greggles’s picture

Yes, it's generally inappropriate for someone with a sandbox to dump the db and send it to someone else.

rfay’s picture

@greggles, I would agree with that. But I was suggesting that anybody who needs a sanitized database for themselves can get one by requesting a sandbox site and discreetly using the database as they see fit (in their own dev methodology). Would you also frown on that? If so, do we make that clear?

Morbus Iff’s picture

@rfay, as mentioned above, the one bad thing about sandbox environments is that they must be reimaged/rebuilt to get a new copy of the database and, last I knew, this was a manual affair that the developer could not initiate themselves. I would need a new copy of the database on a regular basis, so a sandbox environment would cause undue strain on the infra team.

rfay’s picture

@Morbus Iff, agreed. But... just FYI... it's not actually a strain to push a single button in Jenkins. I, who strenuously avoid new commitments, would take this on (and I have the privs). It's still not the best solution. Just responding.

webchick’s picture

Yeah, I can also easily destroy/re-deploy a dev environment. It's literally a 30-second (+30-45 mins wait while it rebuilds) process.

webchick’s picture

BTW, #1182144: Dev site owners cannot create/destroy environments is the feature request to allow this.

drumm’s picture

Title: I want a drupal.org development site for … » I want a drupal.org development site for achievements

#112805: JSON menu callback for project issues is an example of an API being worked on for project issues. Looks like it was actually deployed, http://drupal.org/node/1502810/project-issue/json. From that you can certainly find some of your examples. You would want to keep a local copy of the interesting parts in a convenient place. However, I don't think an API is actually a solution here, since we want this on Drupal.org itself.

What we did for sanitized DB dumps was provide them with an HTTP password which we handed out to people working on the Drupal.org redesign launched in 2010. The best way to get to them today is have SSH access to util. They are regenerated daily.

Dev sites come with full access, so you can drush sql-dump and work locally. As long as you don't distribute it, have a secure server, and don't host it publicly, that's okay.

We do have destroying and creating dev sites automated. People with Jenkins access can do this quickly. #1182144: Dev site owners cannot create/destroy environments is for making that more accessible.

I recommend:

  • Get a dev site
  • Turn off all modules
  • Upgrade to Drupal 7
  • Install achievements
  • Work on the server, and keep a copy of the code elsewhere so the dev site can be rebuilt

We are focusing on upgrading Drupal.org to 7 this month, so things will rapidly get more-working on Drupal 7. You won't have everything, but you should have enough to get started. For example, project issues are mostly regular comments and nodes, that should upgrade just fine without project modules. It is as close to the production environment as you can get for this, no messing about with multiple DBs.

A small deployment turning some custom code into achievements would be a good thing to do first. This gives us some time to make sure things are working well in production. And less custom code is always nice. An example is "Documentation Over 10 edits" on profile pages. Another is replacing the giant list of checkboxes like "I contributed Drupal modules".

If the dev sites don't work, then you want to request ssh to util to scp the sanitized DBs.

Morbus Iff’s picture

Let's go ahead and proceed with "Get Morbus a dev site" then.

rfay’s picture

Status: Active » Reviewed & tested by the community

Morbus is a trusted contributor with a clear purpose. I think this is a no-brainer.

webchick’s picture

Status: Reviewed & tested by the community » Fixed

Ok, on IRC Morbus confirmed he'd like a dev site to start on this, so I added his user/ssh key to the stagingvm box, and kicked off a build of achievements-drupal.redesign.devdrupal.org. Should be done in 45 mins or so!

Thanks for working on this Morbus! :D

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

tvn’s picture

Status: Closed (fixed) » Active

Is this dev site still needed? Seems it was never logged into.

Morbus Iff’s picture

If you need the space, take it down :(

About a week after it was set up, life got screwy and I never got to start what I wanted to.

drumm’s picture

Assigned: Unassigned » drumm
Status: Active » Fixed

Destroyed. If you, or anyone else, find time in the future, we can spin up a fresh dev site, already on D7.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.