I think that we should provide a scripted export of the database (or databases) behind drupal.org that has all private data removed and maybe some or all of the content removed.
Why?
- It can be useful for performance tuning specifically for drupal.org or benchmarking for patches for Drupal itself.
- People working on the redesign and/or patches for the modules often need dummy data to test with. (I know we have the d.o testing profile which makes this easier, but more real data is often better.)
- People can analyze the database more easily than by scraping the HTML, storing the results, and then analyzing them
Why not?
- Spammers might get it and set up sites that contain just this content
- Private data is contained in this database
Solutions to the reasons against
- Spammers are already setting up sites based on HTML crawling. And, if necessary to make this generally available, we can remove the data most important to spammers (i.e. node bodies and comment bodies). There are times when the node bodies and comment bodies are still important to people, but that can be addressed later or as a separate issue.
- We can overwrite or remove private data.
How?
We already have a script to remove most of the private data and it is used when copies of the database are shared (this happens very infrequently, but it happens and we have a process for it). We should review that a bit more to make sure it handles unpublished content and IP addresses. We should also make sure it removes unnecessary tables like caches and search index which can be calculated on the consuming side rather than included in the download file. Then we really only need some work (which I'm willing to do) to automate the creation of the file and pushing it over to the FTP site for download.
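The sanitize-then-publish flow described above could be sketched roughly like this. This is a minimal illustration using SQLite and invented table/column names, not the real drupal.org schema or the actual drupal-sanitize-dump.sql (which runs against MySQL):

```python
import sqlite3

# Sketch of the sanitize step: overwrite private data and empty tables
# that the consumer can recalculate. All names here are illustrative.
def sanitize(conn):
    cur = conn.cursor()
    # Replace emails and password hashes with per-uid placeholder values.
    cur.execute("UPDATE users SET mail = uid || '@example.com', pass = 'scrubbed'")
    # Caches and the search index can be rebuilt on the consuming side,
    # so empty them rather than shipping their contents in the dump.
    for table in ("cache", "cache_form", "search_index"):
        cur.execute(f"DELETE FROM {table}")
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (uid INTEGER PRIMARY KEY, mail TEXT, pass TEXT);
CREATE TABLE cache (cid TEXT, data BLOB);
CREATE TABLE cache_form (cid TEXT, data BLOB);
CREATE TABLE search_index (word TEXT, sid INTEGER);
INSERT INTO users VALUES (1, 'real@person.org', 'secret-hash');
INSERT INTO cache VALUES ('variables', x'00');
""")
sanitize(conn)
print(conn.execute("SELECT mail, pass FROM users").fetchall())
```

The same idea applies regardless of database engine: run the scrubbing statements against a copy, then dump and publish the copy.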
Comment | File | Size | Author |
---|---|---|---|
#21 | 636340_suggested_tweaks.patch | 3.41 KB | greggles |
#18 | id_rsa_ben_pub.txt | 392 bytes | mlncn |
Comments
Comment #1
Gerhard Killesreiter commented:
Why not?
Comment #2
joshk commented:
How about an interim step of having a sanitized database we can provide to community developers? We've been getting inquiries around this for g.d.o, and I'd like to facilitate more active/interesting development from people besides the small group who currently has direct access. A safe/sanitized database dump would help that along quite a bit.
Comment #3
joshk commented:
Just as a followup, I think this is pretty analogous to what we already do in terms of letting people access VMs for the redesign. There's some vetting and then a sort of "deputization." This is quite a bit different from releasing the data publicly.
Comment #4
greggles commented:
#3 pushes this in a different direction - changing title.
I'm interested in seeing my original intent fulfilled, but would take selectively available.
The thing I don't like about selectively available dumps is that it's somewhat hard to know who should be on the list of people who get them.
Comment #5
apaderno commented:
I am not sure about the audience of this task, but VMs for redesign are not accessible to everybody, AFAIK.
Comment #6
mlncn commented:
The ideal policy and export provision would be publicly available, rather than making the question of who should get access a recurring, time-wasting, and likely contentious issue.
All private data should be excluded or scrubbed, but all information that is otherwise publicly viewable should be retained.
The privacy policy should therefore note that public information (posts, profile information like country, etc.) will be made available for the purpose of creating aggregate statistics. If we can get the language right here it can be rolled into (and maybe jumpstart) #178776: Create a privacy policy for drupal.org.
benjamin, agaric
Comment #7
greggles commented:
Here's the current script that's being run from hudson and stored on util in /usr/local/sbin/drupal-sanitize-dump.sql
Based on comments on this blog post I think it's missing munging of passwords, any other keys, and private profile values:
Comment #8
greggles commented:
Also, this script is pretty specific to drupal.org or project*-based sites. It won't be sufficient for a site like groups.drupal.org.
The access, authmap, client tables seem interesting.
cache_form should probably be truncated as well.
I'd like to review this more but don't have time right now. It seems reasonable to say that the script needs updating and another review before we share this with anyone.
Comment #9
mikey_p commented:
I wanted to point out the documentation from http://staging6.drupal.org/download-drupal-org-database-solr-index which lists the following SQL, which I believe also helps shrink the dump to a size that is more manageable for the average redesign developer/implementer, or whoever is using it:
Comment #10
rfay commented:
Data on the not-fully-sanitized snapshots that are currently being taken (for authorized users only):
https://infrastructure.drupal.org/rsync-database-snapshots
Comment #11
drumm commented:
Still not fixed; we need to put #7 into /usr/local/sbin/drupal-sanitize-dump.sql on util.
And a separate script for #9 to make a subsequent smaller dump.
Comment #12
drumm commented:
Also, generally look for email addresses and other private data, like dropping the clients table.
Comment #13
basic commented:
I've added #7 to the existing /usr/local/sbin/drupal-sanitize-dump.sql, and created a /usr/local/sbin/drupal-reduce-dump.sql for #9. We had talked about possibly putting these scripts in bzr to prevent the blocks that cfengine adds.
Comment #14
drumm commented:
These are now in bzr+ssh://util.drupal.org/bzr/scripts/. Reduce gets the DB from 860MB to 267MB.
We still need to review the db dumps for anything we don't want to disclose.
Comment #15
drumm commented:
We need to remove unpublished nodes, comments, and others.
Comment #16
coltrane commented:
subscribe
Comment #17
drumm commented:
I did another pass on this yesterday; it could still use some review. /var/dumps/mysql/drupal_redesign_database_snapshot-current.sql.gz on util is what to review.
Comment #18
mlncn commented:
Could I get access to util for this? My public key is attached.
Comment #19
Amazon commented:
Let's keep this restricted to people working on the redesign or infrastructure right now. Access to this data should be a privilege, and it represents a significant effort to make it available. I'd prefer we keep the reward of getting access to this data restricted to those who contribute to the overall effort to improve drupal.org.
If there were specific proposals to do research on the databases, we could evaluate them on their own merits.
Comment #20
drumm commented:
I did another pass and found more to sanitize and reduce. Inactive node revisions older than 100 days are now stripped. This gets us down to 289MB compressed.
Changes committed to bzr and the further sanitization will be seen in hudson-drupal_redesign_database_snapshot-20.sql.gz when hudson builds it.
@Benjamin Melançon - we aren't providing ssh access to util for this. See http://groups.drupal.org/drupalorg-redesign-implementers.
Comment #21
greggles commented:
I'm assuming that most of the tables in #862242: Drop old tables we no longer need on drupal.org database will get dropped before we really start sharing this.
I tried importing the db to a scratch place and it was going to take hours, so the attached untested patch is more suggestions than firm ideas. In particular the deletes and joined tables should be tested to make sure they work and don't delete too much.
Comment #22
drumm commented:
Changed one of the comments queries to work and running again.
Comment #23
greggles commented:
Great, thanks drumm.
We've somewhat mixed the purposes of this script and drupal-reduce-dump.sql - I think the queries at the end of drupal-sanitize-dump.sql could be re-run after drupal-reduce-dump.sql to maintain internal consistency.
Maybe we should have a third script for consistency between tables that can be the one version of those queries?
Comment #24
drumm commented:
Notes from our call:
- Greg says this is "basically" ready to go, good enough for me, but a stronger answer would be better
- Access to this should be consistent with BZR, maybe stash the http password in bzr/redesign.drupal.org
- Chad noted he wanted to double check project's storage of emails. We do already do
Comment #25
hunmonk commented:
This should take care of everything in the project_issue module.
I'm pretty sure the 'projects' table is another one of those old tables that needs to die -- it's not in the current project schema.
Comment #26
drumm commented:
For dropping, see #862242: Drop old tables we no longer need on drupal.org database. We do already drop that list for sanitization.
Comment #27
greg.1.anderson commented:
[Slightly OT]: Note that if you use drush to move your Drupal database, sql-sync now supports a --sanitize option which is hookable to extend to your own sanitization scheme. See #861822: Add an option to sanitize email addresses and passwords from user table post sql-sync. and drush.api.php in drush-HEAD.
Comment #28
drumm
Comment #30
drumm commented:
FYI, the DB size has crept up. I am lowering the threshold to remove old content from 100 to 80 days. This affects forum posts, issues, and non-active node revisions.
Comment #31
Vacilando commented:
+1
Comment #32
greggles commented:
Narayan mentioned that he's open to the idea of creating these if we only include data we know to be OK to include.
That makes a ton of sense to me.
For example, node titles, bodies, and revision history for published nodes are all available to web scrapers. If we export that in some format it should be OK to distribute the resulting file.
This has the benefit of being something we can start small and add to without worrying that a schema change suddenly exposes private data.
I think nodes, comments, project data, maintainer data, and project usage data are all pretty interesting to folks, so I plan to work toward that and post something here that would export just the information from those that either is or should be publicly available.
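The "only include data we know to be OK" idea amounts to exporting from an explicit whitelist of tables and columns. A minimal sketch, with an invented schema (the table and column names are assumptions, not the real drupal.org tables):

```python
import sqlite3

# Whitelist of (table, columns) pairs known to be publicly available.
# Anything not listed here is simply never copied into the export, so a
# schema change cannot accidentally expose new private data.
PUBLIC_COLUMNS = {
    "node": ["nid", "title", "created"],
}

def export_public(conn):
    rows = {}
    for table, cols in PUBLIC_COLUMNS.items():
        # Only published content (status = 1) is already visible to web
        # scrapers, so it is safe to include in the export.
        query = f"SELECT {', '.join(cols)} FROM {table} WHERE status = 1"
        rows[table] = conn.execute(query).fetchall()
    return rows

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (nid INTEGER, title TEXT, created INTEGER,
                   status INTEGER, private_note TEXT);
INSERT INTO node VALUES (1, 'Public post', 100, 1, 'internal');
INSERT INTO node VALUES (2, 'Draft', 200, 0, 'internal');
""")
print(export_public(conn))
```

Starting small and adding columns over time, as the comment suggests, just means growing the PUBLIC_COLUMNS mapping.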
Comment #33
webchick commented:
Oh, that's FANTASTIC!
I'll provide a run-down of specific tables/columns I routinely pull in reports for general community health-checking later tonight, that should all be safe to include in said data dump. If I forget, please ping me. :D
Comment #34
drumm commented:
I filed #1884780: Make sure schema changes are handled in snapshot as an interim step. With either a blacklist or a whitelist, it is good to review our sanitization whenever the schema changes. Implementing that issue will force us to do so.
Comment #35
drumm commented:
For this, I think this should be implemented as a near-final phase of snapshot.sh, http://drupalcode.org/project/infrastructure.git/blob/HEAD:/snapshot/sna....
What is a good way to implement this? Maybe use a second DB, loaded up with only the schema, then some
INSERT into whitelist SELECT from less-sanitized-db
queries?
Comment #36
greggles commented:
Including an important keyword in the title as seen at #1068432: Provide regular exports of public data, which feels like a duplicate.
Comment #37
eliza411 commented:
greggles: In coordination with nnewton, drumm, and others, we're set up to look into this issue intensely for the next week. Is it okay to assign it to halstead while he's working on it?
There's an outline of the plan at https://groups.drupal.org/node/367308. We're specifically focused on Drupal.org for the moment and would welcome the chance to collaborate should you be available.
Comment #38
greggles commented:
Yes, absolutely. I didn't remember I'd assigned this to myself - removing that since I'm not actively working on it.
Comment #39
halstead commented:
I have a start that generates SQL to make clean databases. You can see the schema is defined in an object one table at a time: http://drupalcode.org/project/infrastructure.git/blob/refs/heads/whiteli.... If anything is added to the schema it will not be included in the destination database and a warning will be printed. Adding to the whitelisted schema can be as simple as copying the error message into the whitelist.
Different profiles can be overlaid with a -p switch. This overlay makes http://drupalcode.org/project/infrastructure.git/blob/refs/heads/whiteli... for a version of the database to be used internally and preserves some data that is normally discarded.
Custom handlers for specific tables do the job of generating selects that only grab the data we want. http://drupalcode.org/project/infrastructure.git/blob/refs/heads/whiteli.... Avoiding copying data we plan to delete is ideal but the statements can become difficult to read/maintain.
So I hacked in a final set of cleanup statements that discard additional unwanted data at the end. I added it to the skeleton profile as a simple list of statements. The method can be cleaned up in the future. These are more about trimming data than sanitizing private data. http://drupalcode.org/project/infrastructure.git/blob/refs/heads/whiteli...
Does this seem like a good approach?
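The warn-on-new-columns behavior described above might look something like the following sketch. The function and variable names are hypothetical; the real tool lives in the whitelist branch of infrastructure.git:

```python
# Sketch of a schema whitelist: columns present in the source table but
# absent from the whitelist are skipped with a warning, so a schema
# change can never silently leak new data. Names are illustrative.
WHITELIST = {"users": ["uid", "name", "created"]}

def plan_copy(table, live_columns):
    allowed = WHITELIST.get(table, [])
    copied = [c for c in live_columns if c in allowed]
    skipped = [c for c in live_columns if c not in allowed]
    for col in skipped:
        # Whitelisting a new column can be as simple as pasting the
        # name from this warning into WHITELIST.
        print(f"warning: {table}.{col} is not whitelisted, skipping")
    # Generate the INSERT ... SELECT that copies only approved columns
    # from the less-sanitized source DB into the destination DB.
    return f"INSERT INTO dest.{table} SELECT {', '.join(copied)} FROM src.{table}"

print(plan_copy("users", ["uid", "name", "mail", "created"]))
```

This combines the second-database INSERT...SELECT idea from #35 with the per-table whitelist from #39.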
Comment #40
halstead commented:
After a great discussion with sdboyer this morning I've separated the handling of columns from the overall table statement generation. I thought of columns as fields since columns are used for all the variable names. The field work is now done here: http://drupalcode.org/project/infrastructure.git/blob/refs/heads/whiteli..., while the selects are done in individual files if needed at http://drupalcode.org/project/infrastructure.git/tree/refs/heads/whiteli....
This is much easier to read and probably better to maintain. ;)
Thanks Sam.
Comment #41
moshe weitzman commented:
Thanks for working on this.
It would be ideal to build this on top of well known software like drush sql-sync. That command does a mysqldump from "live", imports into a new DB, and then sanitizes. Some helpful features:
$options['structure-tables']
in example.drushrc.php.
It is true that this method does not allow for selecting only some columns from whitelisted tables. Perhaps that's a nice-to-have feature we can skip.
If sql-sync gets awkward, you can still use its building blocks like sql-dump, sql-sanitize, etc.
Comment #42
nnewton commented:
I actually like that this is in Python, as it's likely to run on a server that we won't install PHP on. Reviewing atm; so far I like it.
-N
Comment #43
eliza411 commented:
Adding a connection to webchick's issue suggesting practical ways the DB image can be further trimmed.
Comment #44
greg.1.anderson commented:
The nice thing about sanitize hooks is that the responsibility for cleaning up private data can live inside a Drush extension that is part of the contrib module that defines the private data. The advantage of doing this is that the sanitization work done for drupal.org could also benefit other projects and sites that also need to do sanitization.
That's the theory, anyway. The reality is that so far, no one has ever written a sanitize hook that cleans up contrib data. Paranoia defines a sanitize hook, but it only sanitizes tables defined in core. I can therefore easily understand if time constraints and convenience (existing code) lead this to be done using python or other non-Drush tools. I would hope, though, that installing php on the target server wouldn't be a non-starter; if Drush were at least an option, someone might convert the initial tools to sanitize hooks later. In the long run, this would benefit more people.
Comment #45
joachim commented:
> And, if necessary to make this generally available, we can remove the data most important to spammers (i.e. node bodies and comment bodies). There are times when the node bodies and comment bodies are still important to people
Could node and comment bodies be replaced with a roughly equivalent quantity of lorem ipsum?
For things such as working on design or d.org functionality, that would be perfectly adequate.
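As a rough sketch of this suggestion, each body could be overwritten with filler text padded or trimmed to roughly the original length, preserving layout-relevant sizes without distributing the content:

```python
# Sketch only: replace a text body with lorem ipsum of equal length,
# so designers still see realistically sized content.
LOREM = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit, "
         "sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. ")

def lipsum_like(text):
    # Repeat the filler until it covers the original length, then trim,
    # so the replacement is exactly the same size as the input.
    n = len(text)
    return (LOREM * (n // len(LOREM) + 1))[:n]

body = "Some node body that should not be redistributed verbatim."
print(len(lipsum_like(body)) == len(body))
```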
Comment #46
drumm commented:
No, we do want real content. If it is public on the internet, it should be in this snapshot.
The work that remains is deploying halstead's work.
Comment #47
mgifford commented:
This is an interesting & important idea.
I've been recently pushing for A/B testing to allow us to try out some new ideas & evaluate the impact:
https://groups.drupal.org/node/400558
I'm just not sure how we'd get the data to determine what the impact actually is. It would be possible to scrape some information from the existing site to determine if a certain change affected more odd or even uids, but it would be much nicer to be able to monitor that data somehow.
Comment #48
basic commented:
digiv and coltrane are working on a security review of the SQL produced by the whitelist sanitization scripts before we can finish deploying the new system.
Comment #49
coltrane commented:
I reviewed the drupal_export database on db4-old today and found some private data that needs cleaning, and some other data I think should be cleaned to limit results from data mining. Additionally, there are some tables with unnecessary data that could be cleaned up to make the DB smaller.
I'm also confused now about what is running to make this export db -- is it drush or a python script? I'd like to review what's running to make the export db.
Needs sanitizing
Could be sanitized?
Contains somewhat personal/private data OR is somewhat risky to allow mining it:
Data not necessary
Tables could be truncated to lower DB size:
Comment #50
basic commented:
Needs sanitizing -- done
Data not necessary -- done
Could be sanitized: I'm not sure about these; it seems like they could just be truncated like the rest of the tables. I went ahead and truncated them, but I'm not sure of the repercussions of doing that. There should be a new copy of the database to review tomorrow morning.
Comment #52
mitchell commented:
This issue is very important to me. I'd happily put up a round or two of beers for those who take this through to the finish line.
Comment #53
YesCT
Comment #54
YesCT
Comment #55
irinaz (as a volunteer and at Fibonacci Web Studio) commented:
@mitchell, is this issue still relevant? If yes, could you give more details? If not, could we close this issue?
PS - I like your comment about beers :)
Comment #56
greggles commented:
I think the path for this is via the API.
If folks need a database then it would be good to articulate why the API doesn't work.