Using robots.txt To Block Crawling Of Similar Pages In Different Sub-Directories
Hello all,
On my site, I have a wiki-style encyclopedia. There is a revisions page that lists old versions of the wiki pages. Search bots have been crawling these and indexing a lot of duplicate content. I want to prevent the search engines from indexing those pages.
I was planning on using the robots.txt file to prevent the indexing of wiki revisions pages. I have over a hundred pages like this so I don't want to add each individually to the robots.txt.
However, I am having difficulty coming up with a "Disallow:" command that covers all of them because they are not in the same subdirectory. Here are a couple of example links to the pages I am trying to hide from search engines:
http://example.com/node/41/revisions/108/view
http://example.com/node/38/revisions/77/view
Any suggestions? Thanks.

Poking around some more. Would this do the trick?
Disallow: */revisions/
Unfortunately you can't use wildcards in a Disallow directive under the original robots.txt standard: http://www.robotstxt.org/orig.html
Except that Googlebot does respect some patterns: http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=40360
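To make the pattern idea concrete, here is a rough Python sketch of how a crawler that understands `*` (any run of characters) and a trailing `$` (end of URL) might test a path against a rule. This is only an illustration, not Googlebot's actual code; the function names are made up, and writing the rule with a leading slash as `/node/*/revisions/` is my assumption about how it would be spelled.

```python
import re

def pattern_to_regex(pattern):
    """Turn a robots.txt path pattern using * and a trailing $ into a regex.

    Hypothetical helper for illustration only.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with ".*" for every "*".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

def is_blocked(pattern, path):
    """True if the path matches the pattern, anchored at the start of the path."""
    return pattern_to_regex(pattern).match(path) is not None

print(is_blocked("/node/*/revisions/", "/node/41/revisions/108/view"))
print(is_blocked("/node/*/revisions/", "/node/41"))
```

Crawlers that follow only the original standard would treat the `*` as a literal character, so a rule like this should not be relied on for every bot.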
Another way would be to use Apache (or possibly Drupal hook_init()) to deny selected bots based on their User-Agent string, using a regex to match the path.
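The decision logic for that approach could look something like the Python sketch below. It is not Apache configuration or Drupal code, just the check itself; the list of bot names and the path regex are assumptions chosen to fit the example URLs in this thread.

```python
import re

# Hypothetical list of crawler names to deny; adjust to the bots you care about.
BLOCKED_AGENTS = re.compile(r"googlebot|bingbot|slurp", re.IGNORECASE)

# Matches /node/<number>/revisions and anything beneath it.
REVISION_PATH = re.compile(r"^/node/\d+/revisions(/|$)")

def should_deny(user_agent, path):
    """Deny only when a listed bot requests a revisions page."""
    is_bot = BLOCKED_AGENTS.search(user_agent) is not None
    is_revision = REVISION_PATH.match(path) is not None
    return is_bot and is_revision

ua = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
print(should_deny(ua, "/node/41/revisions/108/view"))
print(should_deny(ua, "/node/41"))
```

A server-side check like this would typically answer with a 403 for denied requests, so it works even for bots that ignore robots.txt, but it depends on the bot sending an honest User-Agent.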
gpk
----
www.alexoria.co.uk
Ah, thanks GPK.
I'm thinking of trying what's below. It's a less elegant solution, but I believe it will do the trick.
In the robots.txt, I'll add:
# Paths (clean URLs)
Disallow: /node/38/revisions/
Disallow: /node/39/revisions/
Disallow: /node/40/revisions/
etc.
I believe that should block crawling of
http://example.com/node/38/revisions/108/view
http://example.com/node/38/revisions/77/view
http://example.com/node/39/revisions/77/view
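For anyone wanting to check rules like these before deploying, here is a small Python sketch that generates the Disallow lines for a range of node IDs and tests them with the standard library's `urllib.robotparser`, which does plain prefix matching. The `build_rules` helper and the `User-agent: *` line are my assumptions; the node IDs are the ones from the list above.

```python
from urllib.robotparser import RobotFileParser

def build_rules(node_ids):
    """Return robots.txt lines disallowing the revisions path of each node.

    Hypothetical helper; assumes the rules apply to all user agents.
    """
    lines = ["User-agent: *"]
    lines += [f"Disallow: /node/{nid}/revisions/" for nid in node_ids]
    return lines

rules = build_rules(range(38, 41))  # nodes 38, 39 and 40

parser = RobotFileParser()
parser.parse(rules)

for url in [
    "http://example.com/node/38/revisions/108/view",
    "http://example.com/node/39/revisions/77/view",
    "http://example.com/node/38",
    "http://example.com/node/41/revisions/108/view",
]:
    # can_fetch returns False when a Disallow prefix matches the URL path.
    print(url, parser.can_fetch("*", url))
```

Note that a node without its own Disallow line (node 41 here) is still crawlable, so the list has to be regenerated whenever new wiki pages are added.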