Controlling search engine indexing with robots.txt

The robots.txt file is the mechanism almost all search engines use to let website administrators tell bots which parts of a site they would like indexed. By adding this file to your web root, you can ask search engine bots not to crawl certain parts of your website. Example: see the drupal.org robots.txt.

A robots.txt file is included with Drupal 5.x and newer versions, though there are SEO problems with Drupal's default robots.txt file even in Drupal 7. If you want to create a custom robots.txt file, follow the instructions below. For more details, see http://www.robotstxt.org.

Create a file containing the content as shown below and call it "robots.txt". Lines beginning with the pound ("#") sign are comments and can be deleted.

# Small robots.txt
# More information about this file can be found at
# http://www.robotstxt.org/

# In case your Drupal site is in a subdirectory of your web root (e.g. /drupal)
# add the name of this directory before the / (slash) below
# example:  Disallow: /drupal/aggregator

# to stop a polite robot indexing an example dir
# add a line like:  user-agent: polite-bot 
# and:  Disallow: /example-dir/

# Paths (clean URLs)
User-agent: *
Crawl-Delay: 10
Disallow: /aggregator
Disallow: /tracker
Disallow: /comment/reply
Disallow: /node/add
Disallow: /search/
Disallow: /book/print
Disallow: /logout
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login
# Paths (no clean URLs)
Disallow: /?q=aggregator
Disallow: /?q=tracker
Disallow: /?q=comment/reply
Disallow: /?q=node/add
Disallow: /?q=user/register
Disallow: /?q=user/password
Disallow: /?q=user/login
Disallow: /?q=search/
Disallow: /?q=book/print

The code above instructs search engine bots to avoid pages that contain content that is meant only for users, for instance the search page, or the add comment pages.
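
One way to sanity-check rules like these before uploading them is to run them through a parser. The following is a minimal sketch using Python's standard urllib.robotparser module. The trimmed rule set, the sample paths, and the "ExampleBot" name are illustrative assumptions, and this module only does plain prefix matching, so it says nothing about how a particular search engine will behave.

```python
from urllib.robotparser import RobotFileParser

# A trimmed copy of the rules above (assumption: clean URLs, Drupal in the web root).
ROBOTS_TXT = """\
User-agent: *
Crawl-Delay: 10
Disallow: /aggregator
Disallow: /comment/reply
Disallow: /node/add
Disallow: /search/
Disallow: /user/login
"""


def blocked_paths(robots_txt, paths, agent="ExampleBot"):
    """Return the paths that a rule-abiding bot named `agent` may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [path for path in paths if not parser.can_fetch(agent, path)]


if __name__ == "__main__":
    # Hypothetical paths, chosen only to exercise the rules.
    sample = ["/node/1", "/node/add/story", "/search/node/drupal",
              "/comment/reply/42", "/about"]
    for path in blocked_paths(ROBOTS_TXT, sample):
        print("blocked:", path)
```

Content pages such as /node/1 stay crawlable, while the form and search pages are reported as blocked.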

A common SEO problem on Drupal sites is that search engines index URLs with query parameters that should not be indexed. The wildcard (*) and the end-of-URL anchor ($) were not part of the original robots.txt standard (they were later standardized in RFC 9309), but Google and Bing obey them. Most Drupal sites should include these rules in the robots.txt file, inside the "User-agent: *" group:

# Blocks user "track" pages
Disallow: /*/track$
# Blocks common URL parameters created by the Views module on tables
Disallow: /*sort=
Disallow: /*size=
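
To see what these wildcard rules would and would not match, here is a rough sketch of the matching logic in Python. It assumes the commonly documented semantics ("*" matches any run of characters, a trailing "$" anchors the end of the URL, everything else is a prefix match). It is not any search engine's actual implementation, and the function names and sample paths are made up for illustration.

```python
import re


def rule_to_regex(rule):
    """Translate a wildcard Disallow pattern into a compiled regex."""
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal parts and let each "*" match any run of characters.
    body = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(path, rules):
    """True if any rule matches the path (including its query string)."""
    return any(rule_to_regex(rule).match(path) for rule in rules)


RULES = ["/*/track$", "/*sort=", "/*size="]

if __name__ == "__main__":
    for path in ["/user/1/track", "/user/1/tracker",
                 "/news?sort=asc&order=title", "/node/5"]:
        print(path, "->", "blocked" if is_blocked(path, RULES) else "allowed")
```

Under these assumptions /user/1/track and /news?sort=asc&order=title are blocked, while /user/1/tracker and /node/5 are not.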

Some bots obey the "Crawl-delay:" directive. Drupal sites seem to be popular with search engines, and many sites get more traffic from aggressive bots than from human visitors, so it might be wise to slow the robots down by adding lines like these to your robots.txt:

User-Agent: *
Crawl-Delay: 10

10 is the delay in seconds between page requests.

Both "Slurp" (the bot used by Yahoo and AltaVista) and the Microsoft bots for Live Search obey this directive. Googlebot does not honor "Crawl-delay". (You can, however, control the crawl rate used by Googlebot via their Webmaster Tools home page.)
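
From the bot's side, honoring the delay amounts to reading the value and pausing between requests. The sketch below shows that idea with Python's standard urllib.robotparser. The "ExampleBot" name, the fallback delay, and the stand-in fetch function are assumptions for illustration, not how any real crawler is built.

```python
import time
from urllib.robotparser import RobotFileParser

SAMPLE = """\
User-agent: *
Crawl-Delay: 10
Disallow: /search/
"""


def polite_delay(robots_txt, agent, default=1.0):
    """Return the Crawl-delay for `agent` in seconds, or `default` if none is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(agent)
    return float(delay) if delay is not None else default


def crawl(paths, delay, fetch, sleep=time.sleep):
    """Fetch each path in turn, pausing `delay` seconds between requests."""
    results = []
    for index, path in enumerate(paths):
        if index:  # no pause before the first request
            sleep(delay)
        results.append(fetch(path))
    return results


if __name__ == "__main__":
    delay = polite_delay(SAMPLE, "ExampleBot")
    pauses = []  # record the pauses instead of really sleeping
    crawl(["/node/1", "/node/2", "/node/3"], delay,
          fetch=lambda path: path, sleep=pauses.append)
    print("delay:", delay, "pauses:", pauses)
```

With a delay of 10 seconds, three pages take at least 20 seconds to crawl, which is the throttling effect the directive is meant to have.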

Change the file as you wish and save it. Now upload it to your webserver and make sure you put it into your web root. If you have installed Drupal in a subdirectory (for example /drupal), then change the URLs in robots.txt, but place the file in your web root anyway and not in Drupal's root folder.
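
If Drupal lives in a subdirectory, the path changes can be made by hand or scripted. The following sketch prefixes every Allow/Disallow path with the subdirectory. The function name and the "drupal" directory are only examples, and the result should be reviewed before it is uploaded to the web root.

```python
def prefix_rules(robots_txt, subdir):
    """Prefix each Allow/Disallow path with the subdirectory Drupal lives in."""
    subdir = "/" + subdir.strip("/")
    output = []
    for line in robots_txt.splitlines():
        field, sep, value = line.partition(":")
        value = value.strip()
        is_rule = field.strip().lower() in ("allow", "disallow")
        if sep and is_rule and value.startswith("/"):
            output.append(f"{field}: {subdir}{value}")
        else:
            # Comments, User-agent and Crawl-delay lines are left untouched.
            output.append(line)
    return "\n".join(output) + "\n"


if __name__ == "__main__":
    original = "User-agent: *\nDisallow: /aggregator\nDisallow: /?q=aggregator\n"
    print(prefix_rules(original, "drupal"), end="")
```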

Now watch the robots visit your site and after some time, monitor your log files ("referrer log") to see how many visitors came from a search engine.
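
As a rough illustration of that monitoring step, the sketch below counts visits whose referrer is a search engine. It assumes the access log is in the common "combined" format with the referrer in the second-to-last quoted field. The list of search engine names and the sample log lines are made up for the example.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Request, status, size, referrer, user agent (combined log format assumed).
LOG_PATTERN = re.compile(r'"[^"]*" \d{3} \S+ "(?P<referrer>[^"]*)" "[^"]*"')

# Hypothetical mapping of host name labels to display names; extend as needed.
SEARCH_ENGINES = {"google": "Google", "bing": "Bing", "duckduckgo": "DuckDuckGo"}


def count_search_referrals(log_lines):
    """Count log entries whose referrer host belongs to a known search engine."""
    counts = Counter()
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if not match:
            continue
        host = urlparse(match.group("referrer")).hostname or ""
        labels = host.split(".")
        for label, name in SEARCH_ENGINES.items():
            if label in labels:
                counts[name] += 1
                break
    return counts


if __name__ == "__main__":
    # Invented sample entries using documentation IP addresses.
    sample = [
        '203.0.113.5 - - [24/Feb/2014:14:44:36 +0100] "GET /node/1 HTTP/1.1" '
        '200 8015 "https://www.google.com/search?q=drupal" "Mozilla/5.0"',
        '203.0.113.9 - - [24/Feb/2014:14:45:02 +0100] "GET /node/2 HTTP/1.1" '
        '200 5120 "-" "Mozilla/5.0"',
    ]
    print(dict(count_search_referrals(sample)))
```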

If you are using a multi-site setup and you want to control robot settings for each site individually, a single static robots.txt file will not work, because all sites share the same web root. Please use the RobotsTxt module instead.

Comments

Clarification of terms

tcheard commented 10 January 2011 at 00:41

I believe some terms should be clarified here, such as "Controlling search engine" and "forbid search engine bots".

A robots.txt file does not 'protect' a page from being viewed by a bot, as a bot does not have to follow the robots.txt rules.

Using an analogy, a robots.txt file is like a sign on a property that says 'do not enter'. This sign does not stop anyone from entering that property. All it does is tell people you don't want them entering your property. If you want to stop people from entering the property, you should build a wall around it as well (i.e. put things behind a login).

This is a common misconception and can lead web developers into problems, with sensitive information being captured.

There are in fact 'bad' bots that even specifically look at the robots.txt file and then view only pages that are disallowed.

Just my $.02 :)

CyberSymbol commented 16 June 2011 at 10:25

If you block /comment/reply, then for the same reasons you should also block /comment/nojs/reply.
Disallow: /comment/reply
Disallow: /comment/nojs/reply

google shows restricted paths

Roger34 commented 16 July 2011 at 12:43

I use the default robots.txt in Drupal 6.22, but the performance overview in Google Webmaster Central shows that prohibited directories are also accessed. I have the following listed in Webmaster Central under example page loading times:
/admin/content/add 1.9
/node/add/story 2.3
/node/add/article 3.1
/rss.xml 0.6
/node/15008/edit 2.2
/admin/settings 0.9
/admin/reports/status 1.6
/admin/reports/status/run-cron 120.01

In addition to Disallow: /admin/, do I also need to specify paths such as /admin/reports/status/run-cron?

Would appreciate a reply.
Thanks

Disallow:

Rakhi commented 8 February 2012 at 06:33

Disallow: /user/password/
Disallow: /user/password

Do the above two directives have different meanings? All my individual pages have duplicates created as node and taxonomy pages. Even though I have changed the settings and set aliases, many of these pages are crawled.

Please suggest whether I should disallow node and taxonomy using these directives:

Disallow: /node
Disallow: /taxonomy

You can check my website's links with site:http://www.open-source-development.com/

After browsing 6-7 pages you can find all the node and taxonomy pages. Please suggest.

Should I disallow search as well?

_

shamio commented 17 March 2012 at 12:23

If both of those links lead to the same page, then the two directives are equivalent and make no difference.
As for allowing or disallowing nodes and other folders, it is up to you whether search engines may crawl folders like node or taxonomy. If you don't want search engines to index them, then copy the lines above into your robots.txt file. The same goes for the search folder. If search engine crawling causes heavy traffic on your website and your hosting is shared and not powerful, it's better to disallow them from crawling the search folder.

Don't disallow /node or you

Z2222 commented 28 August 2013 at 19:43

Don't disallow /node or you will block unaliased content pages. I think that it's better to use the Global Redirect module to redirect /node URLs to their aliased versions.

For more information about Drupal's robots.txt file see this post.

Googlebot weirdness

asb commented 24 February 2014 at 14:36

As far as I can tell, neither Drupal's default robots.txt, nor the directives in the above robots.txt suffice.

In my robots.txt, I'm using:

# Disallow URLs with destination parameter
Disallow: /user/register?destination=node*
Disallow: /user/register?destination=comment*
Disallow: /user/register?destination=*
Disallow: /user/login?destination=image*
Disallow: /user/login?destination=*
Disallow: /user?destination=*

Googlebot still crawls these URLs:

66.249.66.72 - - [24/Feb/2014:14:44:36 +0100] "GET /user/login?destination=foo%2Fbar%2F35639 HTTP/1.1" 200 8015 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.66.72 - - [24/Feb/2014:14:44:55 +0100] "GET /user/login?destination=foo%2Fbaz%2F37317 HTTP/1.1" 200 8015 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.66.72 - - [24/Feb/2014:14:45:06 +0100] "GET /user/login?destination=foo%2Fbla%2F6233 HTTP/1.1" 200 8014 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

As it appears, the directive Disallow: /user/login?destination=* does not work for me. This is a problem because Googlebot accesses these URLs on a massive scale and just gets an "Access denied" (which is a waste of server bandwidth). Is the wildcard (*) really supported by Googlebot, or am I using it wrong?

Also I have:

Disallow: /comment/reply/
Disallow: /?q=comment/reply/
Disallow: /comment/reply
Disallow: /?q=comment/reply

Googlebot still crawls URLs like this:

66.249.66.72 - - [24/Feb/2014:14:45:16 +0100] "GET /comment/reply/44619 HTTP/1.1" 302 681 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.66.72 - - [24/Feb/2014:15:22:08 +0100] "GET /comment/reply/68460 HTTP/1.1" 302 679 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

This is even worse since this just gives a blank comment form with HTTP reply code 302 - also a waste of server bandwidth.

How do I tell Googlebot that it is not supposed to follow links to comment forms and access restricted content?