I think that we should provide a scripted export of the database (or databases) behind drupal.org that has all private data removed and maybe some or all of the content removed.
Why?
- It can be useful for performance tuning specifically for drupal.org or benchmarking for patches for Drupal itself.
- People working on the redesign and/or patches for the modules often need dummy data to test with. (I know we have the d.o testing profile which makes this easier, but more real data is often better.)
- People can analyze the database more easily than by scraping the HTML, storing the results, and then analyzing them
Why not?
- Spammers might get it and set up sites that contain just this content
- Private data is contained in this database
Solutions to the reasons against
- Spammers are already setting up sites based on HTML crawling. And, if necessary to make this generally available, we can remove the data most important to spammers (i.e. node bodies and comment bodies). There are times when the node bodies and comment bodies are still important to people, but that can be addressed later or as a separate issue.
- We can overwrite or remove private data.
How?
We already have a script to remove most of the private data and it is used when copies of the database are shared (this happens very infrequently, but it happens and we have a process for it). We should review that a bit more to make sure it handles unpublished content and IP addresses. We should also make sure it removes unnecessary tables like caches and search index which can be calculated on the consuming side rather than included in the download file. Then we really only need some work (which I'm willing to do) to automate the creation of the file and pushing it over to the FTP site for download.
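The sanitize-then-publish flow described above could be sketched roughly like this. This is a minimal illustration using SQLite and invented table/column names, not the real drupal.org schema or the actual drupal-sanitize-dump.sql (which runs against MySQL):

```python
import sqlite3

# Sketch of the sanitize step: overwrite private data and empty tables
# that the consumer can recalculate. All names here are illustrative.
def sanitize(conn):
    cur = conn.cursor()
    # Replace emails and password hashes with per-uid placeholder values.
    cur.execute("UPDATE users SET mail = uid || '@example.com', pass = 'scrubbed'")
    # Caches and the search index can be rebuilt on the consuming side,
    # so empty them rather than shipping their contents in the dump.
    for table in ("cache", "cache_form", "search_index"):
        cur.execute(f"DELETE FROM {table}")
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (uid INTEGER PRIMARY KEY, mail TEXT, pass TEXT);
CREATE TABLE cache (cid TEXT, data BLOB);
CREATE TABLE cache_form (cid TEXT, data BLOB);
CREATE TABLE search_index (word TEXT, sid INTEGER);
INSERT INTO users VALUES (1, 'real@person.org', 'secret-hash');
INSERT INTO cache VALUES ('variables', x'00');
""")
sanitize(conn)
print(conn.execute("SELECT mail, pass FROM users").fetchall())
```

The same idea applies regardless of database engine: run the scrubbing statements against a copy, then dump and publish the copy.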
Comment | File | Size | Author |
---|---|---|---|
#21 | 636340_suggested_tweaks.patch | 3.41 KB | greggles |
#18 | id_rsa_ben_pub.txt | 392 bytes | mlncn |
Comments
Comment #1
Gerhard Killesreiter commented:
Why not?
Comment #2
joshk commented:
How about an interim step of having a sanitized database we can provide to community developers? We've been getting inquiries around this for g.d.o, and I'd like to facilitate more active/interesting development from people besides the small group who currently has direct access. A safe/sanitized database dump would help that along quite a bit.
Comment #3
joshk commented:
Just as a followup, I think this is pretty analogous to what we already do in terms of letting people access VMs for the redesign. There's some vetting and then a sort of "deputization." This is quite a bit different from releasing the data publicly.
Comment #4
greggles commented:
#3 pushes this in a different direction - changing title.
I'm interested in seeing my original intent fulfilled, but would take selectively available.
The thing I don't like about selectively available dumps is that it's somewhat hard to know who should be on the list of people who get them.
Comment #5
apaderno commented:
I am not sure about the audience of this task, but VMs for redesign are not accessible to everybody, AFAIK.
Comment #6
mlncn commented:
The ideal policy and export provision would be publicly available, rather than making the question of who should get access a recurring, time-wasting, and likely contentious issue.
All private data should be excluded or scrubbed, but all information that is otherwise publicly viewable should be retained.
The privacy policy should therefore note that public information (posts, profile information like country, etc.) will be made available for the purpose of creating aggregate statistics. If we can get the language right here it can be rolled into (and maybe jumpstart) #178776: Create a privacy policy for drupal.org.
benjamin, agaric
Comment #7
greggles commented:
Here's the current script that's being run from hudson and stored on util in /usr/local/sbin/drupal-sanitize-dump.sql
Based on comments on this blog post I think it's missing munging of passwords, any other keys, and private profile values:
Comment #8
greggles commented:
Also, this script is pretty specific to drupal.org or project*-based sites. It won't be sufficient for a site like groups.drupal.org.
The access, authmap, client tables seem interesting.
cache_form should probably be truncated as well.
I'd like to review this more but don't have time right now. It seems reasonable to say that the script needs updating and another review before we share this with anyone.
Comment #9
mikey_p commented:
I wanted to point out the documentation from http://staging6.drupal.org/download-drupal-org-database-solr-index which lists the following SQL, which I believe also helps shrink the dump to a size that is more manageable for the average redesign developer/implementer, or whoever is using it:
Comment #10
rfay commented:
Data on the not-fully-sanitized snapshots that are currently being taken (for authorized users only):
https://infrastructure.drupal.org/rsync-database-snapshots
Comment #11
drumm commented:
Still not fixed; we need to put #7 into /usr/local/sbin/drupal-sanitize-dump.sql on util.
And a separate script for #9 to make a subsequent smaller dump.
Comment #12
drumm commented:
Also, generally look for email addresses and other private data, like dropping the clients table.
Comment #13
basic commented:
I've added #7 to the existing /usr/local/sbin/drupal-sanitize-dump.sql, and created a /usr/local/sbin/drupal-reduce-dump.sql for #9. We had talked about possibly putting these scripts in bzr to prevent the blocks that cfengine adds.
Comment #14
drumm commented:
These are now in bzr+ssh://util.drupal.org/bzr/scripts/. Reduce gets the DB from 860MB to 267MB.
We still need to review the db dumps for anything we don't want to disclose.
Comment #15
drumm commented:
We need to remove unpublished nodes, comments, and others.
Comment #16
coltrane commented:
subscribe
Comment #17
drumm commented:
I did another pass on this yesterday; it could still use some review. /var/dumps/mysql/drupal_redesign_database_snapshot-current.sql.gz on util is what to review.
Comment #18
mlncn commented:
Could I get access to util for this? My public key is attached.
Comment #19
Amazon commented:
Let's keep this restricted to people working on the redesign or infrastructure right now. Access to this data should be a privilege, and it represents a significant effort to make it available. I'd prefer we keep the reward of getting access to this data restricted to those who contribute to the overall effort to improve drupal.org.
If there were specific proposals to do research on the databases, we could evaluate them on their own merits.
Comment #20
drumm commented:
I did another pass and found more to sanitize and reduce. Inactive node revisions older than 100 days are now stripped. This gets us down to 289MB compressed.
Changes committed to bzr and the further sanitization will be seen in hudson-drupal_redesign_database_snapshot-20.sql.gz when hudson builds it.
@Benjamin Melançon - we aren't providing ssh access to util for this. See http://groups.drupal.org/drupalorg-redesign-implementers.
Comment #21
greggles commented:
I'm assuming that most of the tables in #862242: Drop old tables we no longer need on drupal.org database will get dropped before we really start sharing this.
I tried importing the db to a scratch place and it was going to take hours, so the attached untested patch is more suggestions than firm ideas. In particular the deletes and joined tables should be tested to make sure they work and don't delete too much.
Comment #22
drumm commented:
Changed one of the comments queries to work and running again.
Comment #23
greggles commented:
Great, thanks drumm.
We've somewhat mixed the purposes of this script and drupal-reduce-dump.sql - I think the queries at the end of drupal-sanitize-dump.sql could be re-run after drupal-reduce-dump.sql to maintain internal consistency.
Maybe we should have a third script for consistency between tables that can be the one version of those queries?
Comment #24
drumm commented:
Notes from our call:
- Greg says this is "basically" ready to go, good enough for me, but a stronger answer would be better
- Access to this should be consistent with BZR, maybe stash the http password in bzr/redesign.drupal.org
- Chad noted he wanted to double check project's storage of emails. We do already do
Comment #25
hunmonk commented:
This should take care of everything in the project_issue module.
I'm pretty sure the 'projects' table is another one of those old tables that needs to die -- it's not in the current project schema.
Comment #26
drumm commented:
For dropping, see #862242: Drop old tables we no longer need on drupal.org database. We do already drop that list for sanitization.
Comment #27
greg.1.anderson commented:
[Slightly OT]: Note that if you use drush to move your Drupal database, sql-sync now supports a --sanitize option which is hookable to extend to your own sanitization scheme. See #861822: Add an option to sanitize email addresses and passwords from user table post sql-sync. and drush.api.php in drush-HEAD.
Comment #28
drumm
Comment #30
drumm commented:
FYI, the DB size has crept up. I am lowering the threshold to remove old content from 100 to 80 days. This affects forum posts, issues, and non-active node revisions.
Comment #31
Vacilando commented:
+1
Comment #32
greggles commented:
Narayan mentioned that he's open to the idea of creating these if we only include data we know to be OK to include.
That makes a ton of sense to me.
For example, node titles, bodies, and revision history for published nodes are all available to web scrapers. If we export that in some format it should be OK to distribute the resulting file.
This has the benefit of being something we can start small and add to without worrying that a schema change suddenly exposes private data.
I think nodes, comments, project data, maintainer data, and project usage data are all pretty interesting to folks, so I plan to work toward that and post something here that would export just the information from those that either is or should be publicly available.
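The "only include data we know to be OK" idea amounts to exporting from an explicit whitelist of tables and columns. A minimal sketch, with an invented schema (the table and column names are assumptions, not the real drupal.org tables):

```python
import sqlite3

# Whitelist of (table, columns) pairs known to be publicly available.
# Anything not listed here is simply never copied into the export, so a
# schema change cannot accidentally expose new private data.
PUBLIC_COLUMNS = {
    "node": ["nid", "title", "created"],
}

def export_public(conn):
    rows = {}
    for table, cols in PUBLIC_COLUMNS.items():
        # Only published content (status = 1) is already visible to web
        # scrapers, so it is safe to include in the export.
        query = f"SELECT {', '.join(cols)} FROM {table} WHERE status = 1"
        rows[table] = conn.execute(query).fetchall()
    return rows

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (nid INTEGER, title TEXT, created INTEGER,
                   status INTEGER, private_note TEXT);
INSERT INTO node VALUES (1, 'Public post', 100, 1, 'internal');
INSERT INTO node VALUES (2, 'Draft', 200, 0, 'internal');
""")
print(export_public(conn))
```

Starting small and adding columns over time, as the comment suggests, just means growing the PUBLIC_COLUMNS mapping.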
Comment #33
webchick commented:
Oh, that's FANTASTIC!
I'll provide a run-down of specific tables/columns I routinely pull in reports for general community health-checking later tonight, that should all be safe to include in said data dump. If I forget, please ping me. :D
Comment #34
drumm commented:
I filed #1884780: Make sure schema changes are handled in snapshot as an interim step. With either a blacklist or a whitelist, it is good to review our sanitization whenever the schema changes. Implementing that issue will force us to do so.
Comment #35
drumm commented:
For this, I think this should be implemented as a near-final phase of snapshot.sh, http://drupalcode.org/project/infrastructure.git/blob/HEAD:/snapshot/sna....
What is a good way to implement this? Maybe use a second DB, loaded up with only the schema, then some
INSERT into whitelist SELECT from less-sanitized-db
queries?
Comment #36
greggles commented:
Including an important keyword in the title as seen at #1068432: Provide regular exports of public data, which feels like a duplicate.
Comment #37
eliza411 commented:
greggles: In coordination with nnewton, drumm, and others, we're set up to look into this issue intensely for the next week. Is it okay to assign it to halstead while he's working on it?
There's an outline of the plan at https://groups.drupal.org/node/367308. We're specifically focused on Drupal.org for the moment and would welcome the chance to collaborate should you be available.
Comment #38
greggles commented:
Yes, absolutely. I didn't remember I'd assigned this to myself - removing that since I'm not actively working on it.
Comment #39
halstead commented:
I have a start that generates SQL to make clean databases. You can see the schema is defined in an object one table at a time: http://drupalcode.org/project/infrastructure.git/blob/refs/heads/whiteli.... If anything is added to the schema it will not be included in the destination database and a warning will be printed. Adding to the whitelisted schema can be as simple as copying the error message into the whitelist.
Different profiles can be overlaid with a -p switch. This overlay makes http://drupalcode.org/project/infrastructure.git/blob/refs/heads/whiteli... for a version of the database to be used internally and preserves some data that is normally discarded.
Custom handlers for specific tables do the job of generating selects that only grab the data we want. http://drupalcode.org/project/infrastructure.git/blob/refs/heads/whiteli.... Avoiding copying data we plan to delete is ideal but the statements can become difficult to read/maintain.
So I hacked in a final set of cleanup statements that discard additional unwanted data at the end. I added it to the skeleton profile as a simple list of statements. The method can be cleaned up in the future. These are more about trimming data than sanitizing private data. http://drupalcode.org/project/infrastructure.git/blob/refs/heads/whiteli...
Does this seem like a good approach?
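The warn-on-new-columns behavior described above might look something like the following sketch. The function and variable names are hypothetical; the real tool lives in the whitelist branch of infrastructure.git:

```python
# Sketch of a schema whitelist: columns present in the source table but
# absent from the whitelist are skipped with a warning, so a schema
# change can never silently leak new data. Names are illustrative.
WHITELIST = {"users": ["uid", "name", "created"]}

def plan_copy(table, live_columns):
    allowed = WHITELIST.get(table, [])
    copied = [c for c in live_columns if c in allowed]
    skipped = [c for c in live_columns if c not in allowed]
    for col in skipped:
        # Whitelisting a new column can be as simple as pasting the
        # name from this warning into WHITELIST.
        print(f"warning: {table}.{col} is not whitelisted, skipping")
    # Generate the INSERT ... SELECT that copies only approved columns
    # from the less-sanitized source DB into the destination DB.
    return f"INSERT INTO dest.{table} SELECT {', '.join(copied)} FROM src.{table}"

print(plan_copy("users", ["uid", "name", "mail", "created"]))
```

This combines the second-database INSERT...SELECT idea from #35 with the per-table whitelist from #39.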
Comment #40
halstead commented:
After a great discussion with sdboyer this morning I've separated the handling of columns from the overall table statement generation. I thought of columns as fields since columns are used for all the variable names. The field work is now done here: http://drupalcode.org/project/infrastructure.git/blob/refs/heads/whiteli..., while the selects are done in individual files if needed at http://drupalcode.org/project/infrastructure.git/tree/refs/heads/whiteli....
This is much easier to read and probably better to maintain. ;)
Thanks Sam.
Comment #41
moshe weitzman commented:
Thanks for working on this.
It would be ideal to build this on top of well known software like drush sql-sync. That command does a mysqldump from "live", imports into a new DB, and then sanitizes. Some helpful features:
$options['structure-tables']
in example.drushrc.php.
It is true that this method does not allow for selecting only some columns from whitelisted tables. Perhaps that's a nice-to-have feature we can skip.
If sql-sync gets awkward, you can still use its building blocks like sql-dump, sql-sanitize, etc.
Comment #42
nnewton commented:
I actually like that this is in Python, as it's likely to run on a server that we won't install PHP on. Reviewing atm; so far I like it.
-N
Comment #43
eliza411 commented:
Adding a connection to webchick's issue suggesting practical ways the DB image can be further trimmed.
Comment #44
greg.1.anderson commented:
The nice thing about sanitize hooks is that the responsibility for cleaning up private data can live inside a Drush extension that is part of the contrib module that defines the private data. The advantage of doing this is that the sanitization work done for drupal.org could also benefit other projects and sites that also need to do sanitization.
That's the theory, anyway. The reality is that so far, no one has ever written a sanitize hook that cleans up contrib data. Paranoia defines a sanitize hook, but it only sanitizes tables defined in core. I can therefore easily understand if time constraints and convenience (existing code) lead this to be done using python or other non-Drush tools. I would hope, though, that installing php on the target server wouldn't be a non-starter; if Drush were at least an option, someone might convert the initial tools to sanitize hooks later. In the long run, this would benefit more people.
Comment #45
joachim commented:
> And, if necessary to make this generally available, we can remove the data most important to spammers (i.e. node bodies and comment bodies). There are times when the node bodies and comment bodies are still important to people
Could node and comment bodies be replaced with a roughly equivalent quantity of lorem ipsum?
For things such as working on design or d.org functionality, that would be perfectly adequate.
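As a rough sketch of this suggestion, each body could be overwritten with filler text padded or trimmed to roughly the original length, preserving layout-relevant sizes without distributing the content:

```python
# Sketch only: replace a text body with lorem ipsum of equal length,
# so designers still see realistically sized content.
LOREM = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit, "
         "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. ")

def lipsum_like(text):
    # Repeat the filler until it covers the original length, then trim,
    # so the replacement is exactly the same size as the input.
    n = len(text)
    return (LOREM * (n // len(LOREM) + 1))[:n]

body = "Some node body that should not be redistributed verbatim."
print(len(lipsum_like(body)) == len(body))
```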
Comment #46
drumm commented:
No, we do want real content. If it is public on the internet, it should be in this snapshot.
The work that remains is deploying halstead's work.
Comment #47
mgifford commented:
This is an interesting & important idea.
I've been recently pushing for A/B testing to allow us to try out some new ideas & evaluate the impact:
https://groups.drupal.org/node/400558
I'm just not sure how we'd get the data to determine what the impact actually is. It would be possible to scrape some information from the existing site to determine if a certain change affected more odd or even uids, but it would be much nicer to be able to monitor that data somehow.
Comment #48
basic commented:
digiv and coltrane are working on a security review of the SQL produced by the whitelist sanitization scripts before we can finish deploying the new system.
Comment #49
coltrane commented:
I reviewed the drupal_export database on db4-old today and found some private data that needs cleaning, and some other data I think should be cleaned to limit results from data mining. Additionally, there are some tables with unnecessary data that could be cleaned up to make the DB smaller.
I'm also confused now about what is running to make this export db -- is it drush or a python script? I'd like to review what's running to make the export db.
Needs sanitizing
Could be sanitized?
Contains somewhat personal/private data OR is somewhat risky to allow mining it:
Data not necessary
Tables could be truncated to lower DB size:
Comment #50
basic commented:
Needs sanitizing -- done
Data not necessary -- done
Could be sanitized: I'm not sure about these; it seems like they could just be truncated like the rest of the tables. I went ahead and truncated them, but I'm not sure of the repercussions of doing that. There should be a new copy of the database to review tomorrow morning.
Comment #52
mitchell commented:
This issue is very important to me. I'd happily put up a round or two of beers for those who take this through to the finish line.
Comment #53
YesCT
Comment #54
YesCT
Comment #55
irinaz (as a volunteer and at Fibonacci Web Studio) commented:
@mitchell, is this issue still relevant? If yes, could you give more details? If not, could we close this issue?
PS - I like your comment about beers :)
Comment #56
greggles commented:
I think the path for this is via the API.
If folks need a database then it would be good to articulate why the API doesn't work.