Drupal.org

provide selectively available, regular exports of the drupal.org database

Project:Drupal.org infrastructure
Component:Database
Category:task
Priority:normal
Assigned:greggles
Status:closed (fixed)

Issue Summary

I think that we should provide a scripted export of the database (or databases) behind drupal.org that has all private data removed and maybe some or all of the content removed.

Why?

  • It can be useful for performance tuning specifically for drupal.org or benchmarking for patches for Drupal itself.
  • People working on the redesign and/or patches for the modules often need dummy data to test with. (I know we have d.o testing profile which makes this easier, but more real data is oftenbetter)
  • People can analyze the database more easily compared to scraping the html, storing the results, and then analyzing

Why not?

  1. Spammers might get it and set up sites that contain just this content
  2. Private data is contained in this database

Solutions to the reasons against

  1. Spammers are already setting up sites based on HTML crawling. And, if necessary to make this generally available, we can remove the data most important to spammers (i.e. node bodies and comment bodies). There are times when the node bodies and comment bodies are still important to people, but that can be addressed later or as a separate issue.
  2. We can overwrite or remove private data.

How?

We already have a script to remove most of the private data and it is used when copies of the database are shared (this happens very infrequently, but it happens and we have a process for it). We should review that a bit more to make sure it handles unpublished content and IP addresses. We should also make sure it removes unnecessary tables like caches and search index which can be calculated on the consuming side rather than included in the download file. Then we really only need some work (which I'm willing to do) to automate the creation of the file and pushing it over to the FTP site for download.

Comments

#1

Why not?

  1. It is just a bad idea. I do not want to read anything like "the Drupal association makes the database of drupal.org publicly avaliable". There are a lot of people who would take exception on such news. The fact that we'd of course remove private informaton would probably cnveniently omitted.

#2

How about an interim step of having a sanitized database we can provide to community developers. We've been getting inquiries around this for g.d.o, and I'd like to facilitate more active/interesting development from people besides the small group who currently has direct access. A safe/sanitized database dump would help that along quite a bit.

#3

Just as a followup, I think this is pretty analogous to what we already do in terms of letting people access VMs for the redesign. There's some vetting and then a sort of "deputization." This is quite a bit different from releasing the data publicly.

#4

Title:provide generally available, regular exports of the drupal.org database» provide selectively available, regular exports of the drupal.org database

#3 pushes this in a different direction - changing title.

I'm interested in seeing my original intent fulfilled, but would take selectively available.

The thing I don't like about selectively available dumps is that it's somewhat hard to know who should be on the list of people who get them.

#5

I think this is pretty analogous to what we already do in terms of letting people access VMs for the redesign.

I am not sure about the audience of this task, but VMs for redesign are not accessible to everybody, AFAIK.

#6

The ideal policy and export provision would be publicly available, rather than making the question of who should get access a recurring, timewasting, and likely contentious issue.

All private data should be excluded or scrubbed, but all information that is otherwise publicly viewable should be retained.

The privacy policy should therefore note that public information (posts, profile information like country etc) will be made available for the purpose of creating aggregate statistics. If we can get the language correctly here it can be rolled into (and maybe jumpstart) #178776: Create a privacy policy for drupal.org

benjamin, agaric

#7

Status:active» needs work

Here's the current script that's being run from hudson and stored on util in /usr/local/sbin/drupal-sanitize-dump.sql

-- CAUTION: DO NOT RUN THIS ON DATABASE WHERE YOU CARE ABOUT THE INFORMATION!!!

-- Munge emails for security.
UPDATE users SET mail = CONCAT(name, '@localhost'), init = CONCAT(name, '@localhost');
UPDATE comments SET mail = CONCAT(name, '@localhost');
UPDATE directory SET mail = CONCAT(name, '@localhost');
UPDATE mailhandler SET mail = CONCAT(name, '@localhost');
UPDATE authmap SET authname = CONCAT(aid, '@localhost');
UPDATE client SET mail = CONCAT(cid, '@localhost');
UPDATE donations SET mail = CONCAT(did, '@localhost');
UPDATE project_issue_projects SET mail_digest = 'foo@localhost', mail_copy = 'foo@localhost';
UPDATE projects SET mail = CONCAT("empty", '@localhost');
UPDATE simplenews_subscriptions SET mail = CONCAT(snid, '@localhost');

-- Get rid of irrelevant data.
TRUNCATE accesslog;
TRUNCATE cache;
TRUNCATE cache_filter;
TRUNCATE cache_menu;
TRUNCATE cache_page;
-- TRUNCATE cache_project_release;
TRUNCATE devel_queries;
TRUNCATE devel_times;
TRUNCATE flood;
TRUNCATE history;
TRUNCATE search_dataset;
TRUNCATE search_index;
TRUNCATE search_total;
TRUNCATE sessions;
TRUNCATE watchdog;

-- Remove sensitive variables
DELETE FROM variable WHERE name = 'drupal_private_key';

Based on comments on this blog post I think it's missing munging of passwords, any other keys, and private profile values:

UPDATE users SET mail = CONCAT(name, '@localhost'), init = CONCAT(name, '@localhost'), pass = MD5(CONCAT('MILDSECRET', name));

DELETE FROM variable WHERE name LIKE '%key%';

DELETE FROM profile_values WHERE fid IN (select fid from profile_fields where visibility in (1, 4));

#8

Also, this script is pretty specific to drupal.org or project* based sites. It won't be sufficient for a site like groups.drupal.org.

The access, authmap, client tables seem interesting.

cache_form should probably be truncated as well.

I'd like to review this more but don't have time right now. It seems reasonable to say that the script needs updating and another review before we share this with anyone.

#9

I wanted to point out the documentation from http://staging6.drupal.org/download-drupal-org-database-solr-index which lists the following SQL which I believe also helps shrink the dump to a size that is more managable for the average redesign developer/implementor, or whoever is using it:

DELETE FROM node WHERE type IN ('forum','project_issue') AND created < 1204040047;
DELETE node, project_issues FROM node INNER JOIN project_issues WHERE node.nid = project_issues.nid AND project_issues.sid IN (7,14,2);
DELETE node FROM node LEFT JOIN comments ON node.nid = comments.nid WHERE node.type = 'forum' AND comments.nid IS NULL;
DELETE node_revisions FROM node_revisions LEFT JOIN node ON node.nid = node_revisions.nid AND node.nid IS NULL;
DELETE node_comment_statistics FROM node_comment_statistics LEFT JOIN node ON node.nid = node_comment_statistics.nid AND node.nid IS NULL;
DELETE node_counter FROM node_counter LEFT JOIN node ON node.nid = node_counter.nid AND node.nid IS NULL;
DELETE comments FROM comments LEFT JOIN node ON node.nid = comments.nid AND node.nid IS NULL;

#10

Status:needs work» fixed

Data on the not-fully-sanitized snapshots that are currently being taken (for authorized users only):
https://infrastructure.drupal.org/rsync-database-snapshots

#11

Status:fixed» needs review

Still not fixed, we need to put #7 into /usr/local/sbin/drupal-sanitize-dump.sql on util.

And a separate script for #9 to make a subsequent smaller dump.

#12

Also, generally look for email addresses and other private data, like drop the clients table.

#13

I've added #7 to the existing /usr/local/sbin/drupal-sanitize-dump.sql, and created a /usr/local/sbin/drupal-reduce-dump.sql for #9. We had talked about possibly putting these scripts in bzr to prevent the blocks that cfengine adds?

#14

These are now in bzr+ssh://util.drupal.org/bzr/scripts/. Reduce gets the DB from 860MB to 267MB.

We still need to review the db dumps for anything we don't want to disclose.

#15

We need to remove unpublished nodes, comments, and others.

#16

subscribe

#17

I did another pass on this yesterday, still could use some review. /var/dumps/mysql/drupal_redesign_database_snapshot-current.sql.gz on util is what to review.

#18

Could i get access to util for this? My public key is attached.

AttachmentSize
id_rsa_ben_pub.txt 392 bytes

#19

Let's keep this restricted to people working on the redesign or infrastructure right now. Access to this data should be a privilege, and represents a significant effort to make it available. I'd prefer we keep the reward of getting access to this data restricted to those who contribute to the overall effort to improve drupal.org.

If there was a specific proposal to do research on the databases we could evaluate on their own merits.

#20

I did another pass and found more to sanitize and reduce. Inactive node revisions older than 100 days are now stripped. This gets us down to 289MB compressed.

Changes committed to bzr and the further sanitization will be seen in hudson-drupal_redesign_database_snapshot-20.sql.gz when hudson builds it.

@Benjamin Melançon - we aren't providing ssh access to util for this. See http://groups.drupal.org/drupalorg-redesign-implementers.

#21

I'm assuming that most of the tables in #862242: Drop old tables we no longer need on drupal.org database will get dropped before we really start sharing this.

  1. The access table contains several rules for blocking users with certain mail domains. I think we should truncate that.
  2. I added a bunch of delete statements to keep records associated with deleted content consistent
  3. Ideally someone familiar with the project module would write delete statements for the project_* tables
  4. I did not review the 668 variables in the variables table. Many of these we could delete (there are 11 for xapian, for example)

I tried importing the db to a scratch place and it was going to take hours, so the attached untested patch is more suggestions than firm ideas. In particular the deletes and joined tables should be tested to make sure they work and don't delete too much.

AttachmentSize
636340_suggested_tweaks.patch 3.41 KB

#22

Changed one of the comments queriesto work and running again.

#23

Great, thanks drumm.

We've somewhat mixed the purposes of this script and drupal-reduce-dump.sql - I think the queries at the end of drupal-sanitize-dump.sql could be re-run after drupal-reduce-dump.sql to maintain internal consistency.

Maybe we should have a third script for consistency between tables that can be the one version of those queries?

#24

Notes from our call
- Greg says this is "basically" ready to go, good enough for me, but a stronger answer would be better
- Access to this should be consistent with BZR, maybe stash the http password in bzr/redesign.drupal.org
- Chad noted he wanted to double check project's storage of emails. We do already do

UPDATE project_issue_projects SET mail_digest = 'foo@localhost', mail_copy = 'foo@localhost';
UPDATE projects SET mail = CONCAT("empty", '@localhost');

#25

UPDATE project_issue_projects SET mail_digest = 'foo@localhost', mail_copy = 'foo@localhost';

this should take care of everything in project_issue module.

UPDATE projects SET mail = CONCAT("empty", '@localhost');

i'm pretty sure the 'projects' table is another one of those old tables that needs to die -- it's not in the current project schema.

#26

For dropping, #862242: Drop old tables we no longer need on drupal.org database. We do already drop that list for sanitization.

#27

[Slightly OT]: Note that if you use drush to move your Drupal database, sql-sync now supports a --sanitize option which is hookable to extend to your own sanitization scheme. See #861822: Add an option to sanitize email addresses and passwords from user table post sql-sync. and drush.api.php in drush-HEAD.

#28

Status:needs review» fixed

#29

Status:fixed» closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

#30

FYI, the DB size has crept up. I am lowering the threshold to remove old content from 100 to 80 days. This affects forum posts, issues, and non-active node revisions.

#31

+1

nobody click here