I think we should provide a scripted export of the database (or databases) behind drupal.org, with all private data removed and perhaps some or all of the content removed as well.

Why?

  • It can be useful for performance tuning specifically for drupal.org or benchmarking for patches for Drupal itself.
  • People working on the redesign and/or patches for the modules often need dummy data to test with. (I know we have the d.o testing profile, which makes this easier, but more real data is often better.)
  • People can analyze the database more easily than by scraping the HTML, storing the results, and then analyzing them

Why not?

  1. Spammers might get it and set up sites that contain just this content
  2. Private data is contained in this database

Solutions to the reasons against

  1. Spammers are already setting up sites based on HTML crawling. And, if necessary to make this generally available, we can remove the data most important to spammers (i.e. node bodies and comment bodies). There are times when the node bodies and comment bodies are still important to people, but that can be addressed later or as a separate issue.
  2. We can overwrite or remove private data.

How?

We already have a script that removes most of the private data; it is used when copies of the database are shared (this happens very infrequently, but it happens and we have a process for it). We should review it a bit more to make sure it handles unpublished content and IP addresses. We should also make sure it removes unnecessary tables such as the caches and the search index, which can be recalculated on the consuming side rather than included in the download file. Then we only need some work (which I'm willing to do) to automate creating the file and pushing it to the FTP site for download.
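That automation could be as small as a scheduled job that restores a copy, sanitizes it, and re-dumps it. A minimal sketch (Python standing in for the eventual cron/shell glue; every path and database name below is an assumption except the sanitize script discussed in this issue):

```python
# Hypothetical sketch of the automated export pipeline. The scratch
# database name, dump paths, and FTP directory are all assumptions;
# only drupal-sanitize-dump.sql is the script this issue discusses.
def build_export_steps(db="drupal",
                       out="/var/ftp/pub/drupal-sanitized.sql.gz"):
    """Return the shell steps a scheduled job would run, in order."""
    return [
        # 1. Restore the nightly dump into a scratch copy...
        f"mysql {db}_scratch < /var/dumps/mysql/{db}.sql",
        # 2. ...sanitize the scratch copy, never the live database.
        f"mysql {db}_scratch < /usr/local/sbin/drupal-sanitize-dump.sql",
        # 3. Re-dump the sanitized copy, compressed, to the download area.
        f"mysqldump {db}_scratch | gzip > {out}",
    ]
```

The important property is that sanitization only ever runs against a scratch copy, never the live database.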

Comments

gerhard killesreiter’s picture

Why not?

  1. It is just a bad idea. I do not want to read anything like "the Drupal Association makes the database of drupal.org publicly available". There are a lot of people who would take exception to such news. The fact that we'd of course remove private information would probably be conveniently omitted.
joshk’s picture

How about an interim step of having a sanitized database we can provide to community developers. We've been getting inquiries around this for g.d.o, and I'd like to facilitate more active/interesting development from people besides the small group who currently has direct access. A safe/sanitized database dump would help that along quite a bit.

joshk’s picture

Just as a followup, I think this is pretty analogous to what we already do in terms of letting people access VMs for the redesign. There's some vetting and then a sort of "deputization." This is quite a bit different from releasing the data publicly.

greggles’s picture

Title: provide generally available, regular exports of the drupal.org database » provide selectively available, regular exports of the drupal.org database

#3 pushes this in a different direction - changing title.

I'm interested in seeing my original intent fulfilled, but would take selectively available.

The thing I don't like about selectively available dumps is that it's somewhat hard to know who should be on the list of people who get them.

avpaderno’s picture

I think this is pretty analogous to what we already do in terms of letting people access VMs for the redesign.

I am not sure about the audience of this task, but the VMs for the redesign are not accessible to everybody, AFAIK.

mlncn’s picture

The ideal policy and export provision would be publicly available, rather than making the question of who should get access a recurring, timewasting, and likely contentious issue.

All private data should be excluded or scrubbed, but all information that is otherwise publicly viewable should be retained.

The privacy policy should therefore note that public information (posts, profile information like country, etc.) will be made available for the purpose of creating aggregate statistics. If we can get the language right here, it can be rolled into (and maybe jumpstart) #178776: Create a privacy policy for drupal.org

benjamin, agaric

greggles’s picture

Status: Active » Needs work

Here's the current script that's being run from hudson and stored on util in /usr/local/sbin/drupal-sanitize-dump.sql

-- CAUTION: DO NOT RUN THIS ON DATABASE WHERE YOU CARE ABOUT THE INFORMATION!!!

-- Munge emails for security.
UPDATE users SET mail = CONCAT(name, '@localhost'), init = CONCAT(name, '@localhost');
UPDATE comments SET mail = CONCAT(name, '@localhost');
UPDATE directory SET mail = CONCAT(name, '@localhost');
UPDATE mailhandler SET mail = CONCAT(name, '@localhost');
UPDATE authmap SET authname = CONCAT(aid, '@localhost');
UPDATE client SET mail = CONCAT(cid, '@localhost');
UPDATE donations SET mail = CONCAT(did, '@localhost');
UPDATE project_issue_projects SET mail_digest = 'foo@localhost', mail_copy = 'foo@localhost';
UPDATE projects SET mail = CONCAT("empty", '@localhost');
UPDATE simplenews_subscriptions SET mail = CONCAT(snid, '@localhost');

-- Get rid of irrelevant data.
TRUNCATE accesslog;
TRUNCATE cache;
TRUNCATE cache_filter;
TRUNCATE cache_menu;
TRUNCATE cache_page;
-- TRUNCATE cache_project_release;
TRUNCATE devel_queries;
TRUNCATE devel_times;
TRUNCATE flood;
TRUNCATE history;
TRUNCATE search_dataset;
TRUNCATE search_index;
TRUNCATE search_total;
TRUNCATE sessions;
TRUNCATE watchdog;

-- Remove sensitive variables
DELETE FROM variable WHERE name = 'drupal_private_key';

Based on comments on this blog post I think it's missing munging of passwords, any other keys, and private profile values:

UPDATE users SET mail = CONCAT(name, '@localhost'), init = CONCAT(name, '@localhost'), pass = MD5(CONCAT('MILDSECRET', name));

DELETE FROM variable WHERE name LIKE '%key%';

DELETE FROM profile_values WHERE fid IN (SELECT fid FROM profile_fields WHERE visibility IN (1, 4));
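A toy illustration of that password munge (in Python rather than SQL; 'MILDSECRET' is the placeholder salt from the query above, not a real secret): every account ends up with a deterministic but unusable password hash.

```python
import hashlib

def placeholder_pass(name, salt="MILDSECRET"):
    """Replace a real password hash with a salted hash of the username,
    mirroring MD5(CONCAT('MILDSECRET', name)) in the UPDATE above."""
    return hashlib.md5((salt + name).encode()).hexdigest()
```

As long as the salt used for the real export stays out of the published dump, nobody can log in with these hashes, yet the column keeps a realistic shape for testing.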
greggles’s picture

Also, this script is pretty specific to drupal.org or project* based sites. It won't be sufficient for a site like groups.drupal.org.

The access, authmap, client tables seem interesting.

cache_form should probably be truncated as well.

I'd like to review this more but don't have time right now. It seems reasonable to say that the script needs updating and another review before we share this with anyone.

mikey_p’s picture

I wanted to point out the documentation from http://staging6.drupal.org/download-drupal-org-database-solr-index, which lists the following SQL. I believe it also helps shrink the dump to a size that is more manageable for the average redesign developer/implementer, or whoever is using it:

DELETE FROM node WHERE type IN ('forum','project_issue') AND created < 1204040047;
DELETE node, project_issues FROM node INNER JOIN project_issues WHERE node.nid = project_issues.nid AND project_issues.sid IN (7,14,2);
DELETE node FROM node LEFT JOIN comments ON node.nid = comments.nid WHERE node.type = 'forum' AND comments.nid IS NULL;
DELETE node_revisions FROM node_revisions LEFT JOIN node ON node.nid = node_revisions.nid WHERE node.nid IS NULL;
DELETE node_comment_statistics FROM node_comment_statistics LEFT JOIN node ON node.nid = node_comment_statistics.nid WHERE node.nid IS NULL;
DELETE node_counter FROM node_counter LEFT JOIN node ON node.nid = node_counter.nid WHERE node.nid IS NULL;
DELETE comments FROM comments LEFT JOIN node ON node.nid = comments.nid AND node.nid IS NULL;
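The orphan-cleanup statements above all share one shape: remove rows whose parent node no longer exists. A hypothetical generator for that pattern (not part of the actual scripts) would make it easy to cover further node-keyed tables consistently:

```python
def orphan_cleanup_sql(table, key="nid"):
    """Generate a MySQL multi-table DELETE that drops rows in `table`
    whose parent node has already been deleted. Hypothetical helper,
    not part of the actual drupal.org scripts."""
    return (
        f"DELETE {table} FROM {table} "
        f"LEFT JOIN node ON node.nid = {table}.{key} "
        f"WHERE node.nid IS NULL;"
    )
```

Note that the `IS NULL` test belongs in the WHERE clause, not the ON clause: in the ON clause of a LEFT JOIN it would never match the parent and every row would be deleted.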
rfay’s picture

Status: Needs work » Fixed

Data on the not-fully-sanitized snapshots that are currently being taken (for authorized users only):
https://infrastructure.drupal.org/rsync-database-snapshots

drumm’s picture

Status: Fixed » Needs review

Still not fixed, we need to put #7 into /usr/local/sbin/drupal-sanitize-dump.sql on util.

And a separate script for #9 to make a subsequent smaller dump.

drumm’s picture

Also, generally look for email addresses and other private data, like drop the clients table.

basic’s picture

I've added #7 to the existing /usr/local/sbin/drupal-sanitize-dump.sql, and created a /usr/local/sbin/drupal-reduce-dump.sql for #9. We had talked about possibly putting these scripts in bzr to prevent the blocks that cfengine adds?

drumm’s picture

These are now in bzr+ssh://util.drupal.org/bzr/scripts/. Reduce gets the DB from 860MB to 267MB.

We still need to review the db dumps for anything we don't want to disclose.

drumm’s picture

We need to remove unpublished nodes, comments, and others.

coltrane’s picture

subscribe

drumm’s picture

I did another pass on this yesterday, still could use some review. /var/dumps/mysql/drupal_redesign_database_snapshot-current.sql.gz on util is what to review.

mlncn’s picture

Attachment: new, 392 bytes

Could i get access to util for this? My public key is attached.

Amazon’s picture

Let's keep this restricted to people working on the redesign or infrastructure right now. Access to this data should be a privilege, and represents a significant effort to make it available. I'd prefer we keep the reward of getting access to this data restricted to those who contribute to the overall effort to improve drupal.org.

If there were specific proposals to do research on the databases, we could evaluate them on their own merits.

drumm’s picture

I did another pass and found more to sanitize and reduce. Inactive node revisions older than 100 days are now stripped. This gets us down to 289MB compressed.

Changes committed to bzr and the further sanitization will be seen in hudson-drupal_redesign_database_snapshot-20.sql.gz when hudson builds it.

@Benjamin Melançon - we aren't providing ssh access to util for this. See http://groups.drupal.org/drupalorg-redesign-implementers.

greggles’s picture

Attachment: new, 3.41 KB

I'm assuming that most of the tables in #862242: Drop old tables we no longer need on drupal.org database will get dropped before we really start sharing this.

  1. The access table contains several rules for blocking users with certain mail domains. I think we should truncate that.
  2. I added a bunch of delete statements to keep records associated with deleted content consistent
  3. Ideally someone familiar with the project module would write delete statements for the project_* tables
  4. I did not review the 668 variables in the variables table. Many of these we could delete (there are 11 for xapian, for example)

I tried importing the db to a scratch place and it was going to take hours, so the attached untested patch is more suggestions than firm ideas. In particular the deletes and joined tables should be tested to make sure they work and don't delete too much.

drumm’s picture

Changed one of the comments queries to work and am running it again.

greggles’s picture

Great, thanks drumm.

We've somewhat mixed the purposes of this script and drupal-reduce-dump.sql. I think the queries at the end of drupal-sanitize-dump.sql could be re-run after drupal-reduce-dump.sql to maintain internal consistency.

Maybe we should have a third script for consistency between tables that can be the one version of those queries?

drumm’s picture

Notes from our call:
- Greg says this is "basically" ready to go; good enough for me, but a stronger answer would be better
- Access to this should be consistent with bzr; maybe stash the HTTP password in bzr/redesign.drupal.org
- Chad noted he wanted to double-check project's storage of emails. We already do:

UPDATE project_issue_projects SET mail_digest = 'foo@localhost', mail_copy = 'foo@localhost';
UPDATE projects SET mail = CONCAT("empty", '@localhost');
hunmonk’s picture

UPDATE project_issue_projects SET mail_digest = 'foo@localhost', mail_copy = 'foo@localhost';

this should take care of everything in project_issue module.

UPDATE projects SET mail = CONCAT("empty", '@localhost');

i'm pretty sure the 'projects' table is another one of those old tables that needs to die -- it's not in the current project schema.

drumm’s picture

For dropping, see #862242: Drop old tables we no longer need on drupal.org database. We already drop that list during sanitization.

greg.1.anderson’s picture

[Slightly OT]: Note that if you use drush to move your Drupal database, sql-sync now supports a --sanitize option which is hookable to extend to your own sanitization scheme. See #861822: Add an option to sanitize email addresses and passwords from user table post sql-sync. and drush.api.php in drush-HEAD.

drumm’s picture

Status: Needs review » Fixed

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

drumm’s picture

FYI, the DB size has crept up. I am lowering the threshold to remove old content from 100 to 80 days. This affects forum posts, issues, and non-active node revisions.

vacilando’s picture

+1

greggles’s picture

Title: provide selectively available, regular exports of the drupal.org database » Provide generally available, regular exports of data from the drupal.org database
Status: Closed (fixed) » Active

Narayan mentioned that he's open to the idea of creating these if we only include data we know to be OK to include.

That makes a ton of sense to me.

For example, node titles, bodies, and revision history for published nodes are all available to web scrapers. If we export that in some format it should be OK to distribute the resulting file.

This has the benefit of being something we can start small and add to without worrying that a schema change suddenly exposes private data.

I think all nodes, comments, project data, maintainer data, and project usage data are pretty interesting to folks, so I plan to work towards that and post something here that would export just the information from those sources that either is or should be publicly available.

webchick’s picture

Oh that's FANTASTIC!

I'll provide a run-down of specific tables/columns I routinely pull in reports for general community health-checking later tonight, that should all be safe to include in said data dump. If I forget, please ping me. :D

drumm’s picture

I filed #1884780: Make sure schema changes are handled in snapshot as an interim step. With either blacklist or whitelist, it is good to review our sanitization whenever the schema changes. Implementing that issue will force us to do so.

drumm’s picture

I think this should be implemented as a near-final phase in snapshot.sh, http://drupalcode.org/project/infrastructure.git/blob/HEAD:/snapshot/sna....

What is a good way to implement this? Maybe use a second DB, loaded up with only the schema, then some INSERT into whitelist SELECT from less-sanitized-db queries?
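That second-database idea can be shown in miniature (SQLite standing in for MySQL, with a made-up toy schema): only whitelisted columns of published rows ever reach the clean database, so a newly added column is private by default.

```python
import sqlite3

# Toy source database with one extra, non-whitelisted column.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE node (nid INTEGER, title TEXT, status INTEGER, secret TEXT)")
src.executemany("INSERT INTO node VALUES (?, ?, ?, ?)",
                [(1, "public post", 1, "x"), (2, "unpublished", 0, "y")])

# The "second DB, loaded up with only the schema" from the comment above.
src.execute("ATTACH DATABASE ':memory:' AS clean")
src.execute("CREATE TABLE clean.node (nid INTEGER, title TEXT)")

# "INSERT into whitelist SELECT from less-sanitized-db": named columns
# only, published rows only. The secret column and the unpublished row
# never reach the clean database.
src.execute("INSERT INTO clean.node SELECT nid, title FROM node WHERE status = 1")
rows = src.execute("SELECT * FROM clean.node").fetchall()
```
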

greggles’s picture

Title: Provide generally available, regular exports of data from the drupal.org database » Provide generally available, regular exports of already public data from the drupal.org database

Including an important keyword in the title as seen at #1068432: Provide regular exports of public data which feels like a duplicate.

eliza411’s picture

greggles: In coordination with nnewton, drumm, and others, we're set up to look into this issue intensely for the next week. Is it okay to assign it to halstead while he's working on it?

There's an outline of the plan at https://groups.drupal.org/node/367308. We're specifically focused on Drupal.org for the moment and would welcome the chance to collaborate, should you be available.

greggles’s picture

Assigned: greggles » Unassigned
Issue summary: View changes

Yes, absolutely. I didn't remember I'd assigned this to myself - removing that since I'm not actively working on it.

halstead’s picture

I have a start that generates SQL to make clean databases. You can see the schema is defined in an object, one table at a time: http://drupalcode.org/project/infrastructure.git/blob/refs/heads/whiteli.... If anything is added to the schema, it will not be included in the destination database and a warning will be printed. Adding to the whitelisted schema can be as simple as copying the error message into the whitelist.
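That warning mechanism is the key safety property of a whitelist: anything new is excluded until someone opts it in. A rough sketch of such a check (the table and column names here are invented, not the real whitelist):

```python
# Invented example whitelist: table name -> allowed columns.
WHITELIST = {"node": ["nid", "title"], "users": ["uid", "name"]}

def check_schema(live_schema):
    """Return a warning for every table or column that exists in the
    live schema but is absent from the whitelist, so nothing new is
    exported silently."""
    warnings = []
    for table, columns in live_schema.items():
        if table not in WHITELIST:
            warnings.append(f"table {table!r} not whitelisted; it will be skipped")
            continue
        for col in columns:
            if col not in WHITELIST[table]:
                warnings.append(f"column {table}.{col} not whitelisted; it will be skipped")
    return warnings
```

Each warning message doubles as instructions: copying the named table or column into the whitelist is all it takes to include it.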

Different profiles can be overlaid with a -p switch. This overlay makes http://drupalcode.org/project/infrastructure.git/blob/refs/heads/whiteli... for a version of the database to be used internally, and preserves some data that is normally discarded.

Custom handlers for specific tables do the job of generating SELECTs that grab only the data we want: http://drupalcode.org/project/infrastructure.git/blob/refs/heads/whiteli.... Avoiding copying data we plan to delete is ideal, but the statements can become difficult to read/maintain.

So I hacked in a final set of cleanup statements that discard additional unwanted data at the end. I added it to the skeleton profile as a simple list of statements. The method can be cleaned up in the future. These are more about trimming data than sanitizing private data. http://drupalcode.org/project/infrastructure.git/blob/refs/heads/whiteli...

Does this seem like a good approach?

halstead’s picture

Assigned: Unassigned » halstead

After a great discussion with sdboyer this morning I've separated the handling of columns from the overall table statement generation. I thought of columns as fields since columns are used for all the variable names. The field work is now done here: http://drupalcode.org/project/infrastructure.git/blob/refs/heads/whiteli.... While the selects are done in individual files if needed at http://drupalcode.org/project/infrastructure.git/tree/refs/heads/whiteli....

This is much easier to read and probably better to maintain. ;)

Thanks Sam.

moshe weitzman’s picture

Thanks for working on this.

It would be ideal to build this on top of well known software like drush sql-sync. That command does a mysqldump from "live", imports into a new DB, and then sanitizes. Some helpful features:

  1. The dump can include/exclude tables based on wildcards, whitelist, blacklist, etc. There can be multiple profiles as well - see $options['structure-tables'] in example.drushrc.php
  2. Post-import operations for sanitize and trim can be done in sanitize hooks

It is true that this method does not allow selecting only some data from whitelisted tables. Perhaps that's a nice-to-have feature we can skip.

If sql-sync gets awkward, you can still use its building blocks like sql-dump, sql-sanitize, etc.

nnewton’s picture

I actually like that this is in Python, as it's likely to run on a server that we won't install PHP on. Reviewing atm; so far I like it.

-N

eliza411’s picture

Adding a connection to webchick's issue suggesting practical ways the DB image can be further trimmed.

greg.1.anderson’s picture

The nice thing about sanitize hooks is that the responsibility for cleaning up private data can live inside a Drush extension that is part of the contrib module that defines the private data. The advantage of doing this is that the sanitization work done for drupal.org could also benefit other projects and sites that also need to do sanitization.

That's the theory, anyway. The reality is that so far, no one has ever written a sanitize hook that cleans up contrib data. Paranoia defines a sanitize hook, but it only sanitizes tables defined in core. I can therefore easily understand if time constraints and convenience (existing code) lead this to be done using python or other non-Drush tools. I would hope, though, that installing php on the target server wouldn't be a non-starter; if Drush were at least an option, someone might convert the initial tools to sanitize hooks later. In the long run, this would benefit more people.

joachim’s picture

> And, if necessary to make this generally available, we can remove the data most important to spammers (i.e. node bodies and comment bodies). There are times when the node bodies and comment bodies are still important to people

Could node and comment bodies be replaced with a roughly equivalent quantity of lorem ipsum?

For things such as working on design or d.org functionality that would be perfectly adequate.

drumm’s picture

No, we do want real content. If it is public on the internet, it should be in this snapshot.

The work that remains is deploying halstead's work.

mgifford’s picture

This is an interesting & important idea.

I've been recently pushing for A/B testing to allow us to try out some new ideas & evaluate the impact:
https://groups.drupal.org/node/400558

I'm just not sure how we'd get the data to determine what the impact actually is. It would be possible to scrape some information from the existing site to determine whether a certain change affected more odd or even uids, but it would be much nicer to be able to monitor that data somehow.

basic’s picture

digiv and coltrane are working on a security review for the sql produced by the whitelist sanitization scripts before we can finish deploying the new system.

coltrane’s picture

I reviewed the drupal_export database on db4-old today and found some private data that needs cleaning, and some other data I think should be cleaned to limit results from data mining. Additionally, there are some tables with unnecessary data that could be cleaned up to make the DB smaller.

I'm also confused now on what is running to make this export db, is it drush or a python script? I'd like to review what's running to make the export db.

Needs sanitizing

  1. table multiple_email - truncate it all?
  2. table sshkey
  3. table contact, column recipients - set to empty?

Could be sanitized?

Contains somewhat personal/private data OR is somewhat risky to allow mining it:

  1. authmap
  2. profile_value
  3. flag_content
  4. field_budget
  5. field_data_field_user_geolocation
  6. field_revision_field_user_geolocation
  7. drupalorg_ind_civimembership
  8. drupalorg_org_civimembership
  9. drupalorg_crosssite_ind_civimembership
  10. drupalorg_crosssite_org_civimembership
  11. role_activity

Data not necessary

Tables could be truncated to lower DB size:

  • user_restrictions
  • queue
  • project_release_package_errors
  • httbl
basic’s picture

Needs sanitizing -- done
Data not necessary -- done

Could be sanitized: I'm not sure about these; it seems like they could just be truncated like the rest of the tables. I went ahead and truncated them, but I'm not sure of the repercussions of doing that. There should be a new copy of the database to review tomorrow morning.

Component: Database » Servers
mitchell’s picture

This issue is very important to me. I'd happily put up a round or two of beers to those who take this through to the finish line.

yesct’s picture

irinaz’s picture

@mitchell, is this issue still relevant? If yes, could you give more details? If not, could we close this issue?

PS - I like your comment about beers :)

greggles’s picture

Status: Active » Closed (won't fix)

I think the path for this is via the API.

If folks need a database, then it would be good to articulate why the API doesn't work for them.