I think we should provide a scripted export of the database (or databases) behind drupal.org with all private data removed and possibly some or all of the content removed.
Reasons for
- It would be useful for performance tuning of drupal.org itself and for benchmarking patches against Drupal core.
- People working on the redesign and/or on patches for modules often need dummy data to test with. (I know the d.o testing profile makes this easier, but more realistic data is often better.)
- People could analyze the database directly, which is much easier than scraping the HTML, storing the results, and then analyzing those.
Reasons against
- Spammers might get it and set up sites that contain just this content.
- The database contains private data.
Solutions to the reasons against
- Spammers are already setting up sites based on HTML crawling. If necessary to make this generally available, we can remove the data most valuable to spammers (i.e. node bodies and comment bodies). There are times when node and comment bodies are still important to people, but that can be addressed later or as a separate issue.
- We can overwrite or remove private data.
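To make "overwrite or remove" concrete, here is a minimal sketch of a scrub pass. The table and column names follow the Drupal 6 core schema (users.mail, comments.hostname, accesslog, node.status); the real d.o scrub script may target more tables. It writes the SQL to a file so it can be reviewed before being run against a copy of the database, never production.

```shell
#!/bin/sh
# Generate scrub SQL for a sanitized copy of the database.
# Table/column names assume the Drupal 6 core schema.
cat > scrub.sql <<'SQL'
-- Overwrite per-user private data with placeholder values.
UPDATE users SET
  mail = CONCAT('user', uid, '@example.com'),
  init = CONCAT('user', uid, '@example.com'),
  pass = MD5(CONCAT('reset', uid));
-- Drop stored IP addresses.
UPDATE comments SET hostname = '127.0.0.1';
TRUNCATE accesslog;
-- Remove unpublished content entirely.
DELETE FROM node WHERE status = 0;
SQL
echo "wrote $(wc -l < scrub.sql) lines of scrub SQL"
# To apply against a sanitized copy (never production):
#   mysql sanitized_copy < scrub.sql
```

Overwriting (rather than deleting) user rows keeps uids and authorship intact, which matters for anyone using the dump to study contribution patterns.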
We already have a script that removes most of the private data, and it is used when copies of the database are shared (this happens very infrequently, but it happens and we have a process for it). We should review that script to make sure it handles unpublished content and IP addresses. We should also make sure it removes unnecessary tables like the caches and the search index, which can be rebuilt on the consuming side rather than included in the download file. Then we really only need some work (which I'm willing to do) to automate creating the file and pushing it over to the FTP site for download.
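The automation step could be as small as generating a cron-able script. A sketch, assuming placeholder names throughout (the database name, the excluded table list, and the publish target are not the real d.o values); mysqldump's --ignore-table flag handles skipping the cache and search tables:

```shell
#!/bin/sh
# Generate a nightly export script for the sanitized database copy.
# DB name, table list, and publish target are placeholders.
DB=drupal_org
OUT=drupal-org-sanitized.sql.gz

# Caches and the search index can be rebuilt on the consuming side,
# so exclude them from the dump.
IGNORE=""
for t in cache cache_filter cache_menu cache_page search_index search_total; do
  IGNORE="$IGNORE --ignore-table=$DB.$t"
done

cat > export.sh <<EOF
#!/bin/sh
# Nightly job: dump the scrubbed copy, compress, publish for download.
mysqldump --single-transaction$IGNORE $DB | gzip > $OUT
scp $OUT ftp.example.org:/pub/
EOF
chmod +x export.sh
echo "generated export.sh"
```

Running the dump against the already-scrubbed copy (not the live database) keeps the scrub and the export as separate, independently reviewable steps.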