hide scratch.d.o (and any others) from search engines
greggles - April 3, 2008 - 19:05
| Project: | Drupal.org infrastructure |
| Component: | Webserver |
| Category: | task |
| Priority: | normal |
| Assigned: | Narayan Newton |
| Status: | active |
Description
I just noticed that scratch.d.o is getting into Google - http://www.google.com/search?q=site%3Ascratch.drupal.org
Can that be prevented at the webserver level? Perhaps basic htpasswd protection with a well-known password (drupal/drupal)?
We have the same problem with groupsbeta.d.o as well http://www.google.com/search?q=site%3Ahttp%3A%2F%2Fgroupsbeta.drupal.org...
Are there any other test sites that we need to hide?
The motivation is that duplicate content confuses the search engines, which makes their results for our sites worse, which in turn makes it harder for project members and new people alike to find stuff on *.d.o.

#1
also project.drupal.org?
#2
Can't robots.txt tell google not to index it? What's wrong with:
User-agent: *
Disallow: /
??
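As a rough check of what that two-line file would do, the sketch below feeds it to Python's standard `urllib.robotparser` and asks whether a crawler may fetch a URL. The example paths and the `Googlebot` agent string are illustrative assumptions, and this only shows how a compliant parser reads the rules, not how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# The global "deny everything" rules suggested above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if a compliant crawler may fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Illustrative URLs only; any path on the host is covered by "Disallow: /".
    for url in ("http://scratch.drupal.org/", "http://scratch.drupal.org/node/1"):
        print(url, is_allowed(ROBOTS_TXT, "Googlebot", url))
```

Note that robots.txt only asks crawlers to stay away; it does not by itself remove pages that are already in an index.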
#3
Seems good to me. I was leaning towards an Apache directive in case any of these are multisite (and therefore share a robots.txt).
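If several hostnames do share one docroot, one way around a shared robots.txt is to choose the response by the request's Host header. The sketch below shows that idea in plain Python; the `TEST_HOSTS` set, the `ALLOW_ALL` placeholder, and the `robots_txt_for_host` helper are hypothetical names for illustration, and the same effect could equally be had with a webserver rule.

```python
# Hypothetical list of test hosts that should stay out of search indexes.
TEST_HOSTS = {"scratch.drupal.org", "groupsbeta.drupal.org"}

# Body served to test hosts: ask all crawlers to stay away.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"

# Placeholder for the production rules (an empty Disallow permits everything);
# a real site would serve its existing robots.txt here instead.
ALLOW_ALL = "User-agent: *\nDisallow:\n"


def robots_txt_for_host(host_header: str) -> str:
    """Choose a robots.txt body from the Host header of the request."""
    # Drop an optional port and normalise case before comparing.
    host = host_header.split(":", 1)[0].strip().lower()
    return BLOCK_ALL if host in TEST_HOSTS else ALLOW_ALL
```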
#4
I've added a global Disallow to robots.txt and verified myself for scratch.d.o via the google webmaster tools. If it isn't removed from the index on the next crawl, I'll submit a request for its removal.
Thanks for noticing and reporting this.
#5
http://groupsbeta.drupal.org/robots.txt is not fixed.
#6
I think that Google has been ignoring robots.txt in recent months.
If this is the case, then perhaps hide the site behind HTTP auth with drupal/drupal as the user/password, and publish that on some page.
I have had people wrongly follow up on issues on scratch.d.o thinking it is the real site, so the above will serve two purposes: hide it from Google, and make users aware it is a test site.
#7
@scor
Forgot about that one, thanks.
@kbahey
I very much doubt that. If robots.txt were being ignored, load on the server would have spiked hugely with Google indexing pages it shouldn't. Also, that would have been pretty big news internet-wide. What makes you think this?
#8
I read about it several months ago. So it was news then.
Some clues here.
#9
This is fixed again now.
@kbahey: mostly bogus, I'd say.
#10
I just stumbled on google results from scratch.d.o (see screenshot).
http://www.google.com/search?hl=en&client=firefox-a&rls=org.mozilla%3Aen...
Google doesn't clean its index unless one asks. Can we do the above, or at least set up some basic HTTP auth with drupal/drupal as kbahey suggested?
#11
Really? We saw results from www2.d.o and www1.d.o fall out of the indexes pretty quickly when they stopped being served. I'm not sure we've waited long enough yet.
For tracking purposes - the data center I'm hitting right now has ~4,000 pages for groupsbeta.d.o and ~21,600 for scratch.d.o (the data center IP is 209.85.165.103). Let's watch this for another week or two and see what happens.
HTTP Authentication seems kind of rough to me - it will require coordinating with testers to get them the username/password (even if it is simple I think it will reduce the number of testers).
Yet another alternative is to just stop using these subdomains and instead standardize on some other "practice" subdomain. Then these subdomains would start giving 404s.
Narayan? Killes? Any feelings on that idea?
#12
From my tests, Google removes domains that don't resolve from its results very quickly, but not necessarily from its index for months. This also applies to URLs that are disallowed via robots.txt, but it takes even longer before they are fully removed.
The problem is quite simple to solve, though. As Google sees each subdomain as a separate site, just add them to a webmaster account as a new site and remove the entire domain; as long as this is combined with a robots.txt disallow, the problem is resolved. If no one has the time to do this, add a full list of the domains here and I will add them to mine and remove them, as long as someone can create the appropriate HTML files that I would need to authorise my account. I monitor stacks of sites daily in the webmaster console anyway, so a few more won't make a difference.
P.S. The link about Google ignoring robots.txt is a red herring. I have permanent tests set up for this and other issues, and it DOES adhere to robots.txt; sometimes it might take some time to clear its cached copy, but that's about it.
#13
Another one:
http://test.drupal.org/robots.txt
http://www.google.com/search?q=site:test.drupal.org
#14
How about what I suggested in #6: we hide it behind a .htaccess HTTP Auth user and password (drupal/drupal)?
Publish the password on some d.o page so it is public.
This will surely make it invisible to crawlers, and will make user confusion (mistaking it for "The Real d.o") far less likely.
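To make the HTTP auth idea concrete, here is a minimal Python sketch of the check a server performs on the `Authorization` header for Basic authentication. The `is_authorized` helper is a hypothetical name, and the credentials are simply the drupal/drupal pair proposed above; on the real sites this would presumably be done by Apache itself rather than by application code.

```python
import base64
import hmac

# The publicly announced credentials proposed in this thread.
USERNAME = b"drupal"
PASSWORD = b"drupal"


def is_authorized(authorization_header):
    """Return True if the header carries the expected Basic credentials."""
    if not authorization_header:
        return False
    scheme, _, encoded = authorization_header.partition(" ")
    if scheme.lower() != "basic":
        return False
    try:
        # Basic auth sends base64("user:password").
        decoded = base64.b64decode(encoded.strip(), validate=True)
    except ValueError:
        # Malformed base64: treat as unauthenticated.
        return False
    user, sep, password = decoded.partition(b":")
    if not sep:
        return False
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(user, USERNAME) and hmac.compare_digest(password, PASSWORD)
```

A request without valid credentials would normally receive a 401 response, so a crawler that does not know the password never sees the page content.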
#15
I've added test.d.o to my google webmaster account and done as Fintan suggested.