Using robots.txt To Block Crawling Of Similar Pages In Different Sub-Directories

sockah - November 16, 2008 - 14:49

Hello all,

On my site, I have a wiki-style encyclopedia. There is a revisions page that lists old versions of the wiki pages. Search bots have been crawling these and indexing a lot of duplicate content. I want to prevent the search engines from indexing those pages.

I was planning on using the robots.txt file to prevent the indexing of wiki revisions pages. I have over a hundred pages like this so I don't want to add each individually to the robots.txt.

However, I am having difficulty coming up with a "Disallow:" rule that covers all of them because they are not in the same subdirectory. Here are a couple of example links to the pages I am trying to hide from search engines:

http://example.com/node/41/revisions/108/view
http://example.com/node/38/revisions/77/view

Any suggestions? Thanks.

Poking around some more.

sockah - November 16, 2008 - 17:07

Poking around some more. Would this do the trick?

Disallow: */revisions/

Unfortunately you can't use

gpk - November 16, 2008 - 18:40

Unfortunately you can't use wildcards in a Disallow directive under the original robots.txt standard: http://www.robotstxt.org/orig.html
Except that Googlebot does respect some patterns: http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40360
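
To illustrate the difference, here is a minimal Python sketch, not any crawler's actual code. prefix_match follows the original standard's plain prefix comparison. wildcard_match assumes the extension semantics Google documents, where "*" matches any run of characters and a trailing "$" anchors the end of the URL. Both function names and the /node/*/revisions/ pattern are made up for illustration.

```python
import re

def prefix_match(rule: str, path: str) -> bool:
    """Original robots.txt behavior: the rule is a literal path prefix."""
    return path.startswith(rule)

def wildcard_match(rule: str, path: str) -> bool:
    """Assumed extension: '*' matches any characters, trailing '$' anchors the end."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape each literal piece, then join the pieces with '.*' where '*' stood.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    if anchored:
        pattern += "$"
    return re.match(pattern, path) is not None

path = "/node/41/revisions/108/view"
print(prefix_match("*/revisions/", path))          # the '*' is compared literally
print(wildcard_match("/node/*/revisions/", path))  # the '*' spans the node ID
```

So a rule like the one proposed above would only help with crawlers that implement the pattern extension; a strict prefix matcher would treat the "*" as an ordinary character.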

Another way would be to use Apache (or possibly Drupal's hook_init()) to deny selected bots based on their User-Agent string, using a regex to match the path.
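
The matching logic for that approach might look roughly like the sketch below. It is written in Python only to show the idea; in practice the check would live in the Apache configuration or a Drupal hook. The function name, the regex, and the list of bot markers are all assumptions for illustration, and real User-Agent strings vary.

```python
import re

# Matches /node/<id>/revisions and anything beneath it.
REVISIONS_PATH = re.compile(r"^/node/\d+/revisions(/|$)")

# Hypothetical substrings used to recognize bots; adjust to the bots you see.
BOT_MARKERS = ("googlebot", "slurp", "msnbot")

def should_deny(user_agent: str, path: str) -> bool:
    """Return True when a recognized bot requests a revisions page."""
    is_bot = any(marker in user_agent.lower() for marker in BOT_MARKERS)
    return is_bot and REVISIONS_PATH.match(path) is not None
```

Ordinary visitors and requests for the node pages themselves pass through untouched; only the combination of a recognized bot and a revisions path is denied.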

gpk
----
www.alexoria.co.uk

Ah, thanks GPK. I'm

sockah - November 16, 2008 - 22:03

Ah, thanks GPK.

I'm thinking of trying what's below. It's a less elegant solution, but I believe it will do the trick.

In the robots.txt, I'll add

# Paths (clean URLs)
Disallow: /node/38/revisions/
Disallow: /node/39/revisions/
Disallow: /node/40/revisions/
etc.

I believe that should block crawling of

http://example.com/node/38/revisions/108/view
http://example.com/node/38/revisions/77/view
http://example.com/node/39/revisions/77/view
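
With over a hundred nodes, the list could be generated rather than typed by hand. The following is a rough Python sketch under a few assumptions: build_rules is a made-up helper, the node IDs are placeholders, and the rules are placed under "User-agent: *". It uses the standard library's urllib.robotparser, which does plain prefix matching, to check the result.

```python
from urllib.robotparser import RobotFileParser

def build_rules(node_ids):
    """Return robots.txt lines with one Disallow per node's revisions path."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: /node/{nid}/revisions/" for nid in node_ids]
    return lines

rules = build_rules([38, 39, 40])  # placeholder node IDs
parser = RobotFileParser()
parser.parse(rules)

for url in (
    "http://example.com/node/38/revisions/108/view",
    "http://example.com/node/39/revisions/77/view",
    "http://example.com/node/41/revisions/108/view",  # node 41 is not listed
):
    print(url, "allowed" if parser.can_fetch("*", url) else "blocked")
```

One drawback of this approach is that any node missing from the list stays crawlable, so the file would need to be regenerated whenever new wiki pages are added.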

 
 
