Controlling search engine indexing with robots.txt

Last updated September 4, 2009. Created by ezraw on May 7, 2005.
Edited by Rainy Day, JohnNoc, LeeHunter, Xano.

The robots.txt file is the mechanism that almost all search engines honor for letting website administrators tell bots which parts of a site they would like indexed. By adding this file to your web root, you can forbid search engine bots from indexing certain parts of your website. For an example, see the drupal.org robots.txt.

A robots.txt file is included with Drupal 5.x and newer versions. If you want to create a custom robots.txt file, follow the instructions below. For more details, see http://www.robotstxt.org.

Create a file with the content shown below and name it "robots.txt". Lines beginning with the pound sign ("#") are comments and can be deleted.

# Small robots.txt
# More information about this file can be found at
# http://www.robotstxt.org/

# If your Drupal site is in a subdirectory of your web root (e.g. /drupal),
# add the name of this directory before the / (slash) below.
# Example:  Disallow: /drupal/aggregator

# To stop a polite robot from indexing an example directory,
# add a line like:  User-agent: polite-bot
# and:  Disallow: /example-dir/

# Paths (clean URLs)
User-agent: *
Crawl-Delay: 10
Disallow: /aggregator
Disallow: /tracker
Disallow: /comment/reply
Disallow: /node/add
Disallow: /user
Disallow: /files
Disallow: /search
Disallow: /book/print

# Paths (no clean URLs)
User-agent: *
Crawl-Delay: 10
Disallow: /?q=aggregator
Disallow: /?q=tracker
Disallow: /?q=comment/reply
Disallow: /?q=node/add
Disallow: /?q=user
Disallow: /?q=files
Disallow: /?q=search
Disallow: /?q=book/print

The rules above instruct search engine bots to avoid pages that are meant only for human users, such as the search page or the add-comment pages.
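To see how a rule-abiding crawler would interpret such rules, here is a minimal sketch using Python's standard urllib.robotparser module. The host example.com, the bot name, and the sample paths are assumptions chosen for illustration, and only a subset of the clean-URL rules is included.

```python
from urllib.robotparser import RobotFileParser

# A subset of the clean-URL rules from the sample file above.
RULES = """\
User-agent: *
Disallow: /aggregator
Disallow: /comment/reply
Disallow: /node/add
Disallow: /user
Disallow: /search
"""


def is_allowed(path, agent="example-bot"):
    """Return True if a rule-abiding bot may fetch the given path."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    # example.com is a placeholder host; only the path is matched.
    return parser.can_fetch(agent, "http://example.com" + path)


if __name__ == "__main__":
    for path in ["/node/1", "/search/node/drupal", "/user/login"]:
        print(path, "allowed" if is_allowed(path) else "disallowed")
```

Because this parser matches by path prefix, a single line such as Disallow: /user also covers /user/login and everything else below it.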

Many bots obey the "Crawl-delay:" parameter. Drupal sites seem to be popular with search engines, and many site owners see more traffic from aggressive bots than from visitors, so it may be wise to slow the robots down by adding lines like these to your robots.txt:

User-Agent: *
Crawl-Delay: 10

10 is the delay in seconds between page requests.
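As a hedged illustration of how a polite crawler might honor this setting, the sketch below reads the delay with Python's urllib.robotparser (its crawl_delay method is available from Python 3.6) and pauses between requests. The fetch callback, the bot name, and the one-second fallback are hypothetical choices, not part of Drupal or of any particular search engine's crawler.

```python
import time
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Crawl-delay: 10
Disallow: /search
"""


def read_crawl_delay(rules, agent="example-bot", default=1.0):
    """Return the Crawl-delay (in seconds) that applies to `agent`."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    delay = parser.crawl_delay(agent)
    # Fall back to an assumed default when no delay is declared.
    return default if delay is None else float(delay)


def polite_crawl(paths, fetch, delay, sleep=time.sleep):
    """Call fetch(path) for each path, pausing `delay` seconds between requests."""
    results = []
    for index, path in enumerate(paths):
        if index > 0:
            sleep(delay)
        results.append(fetch(path))
    return results
```

Passing the sleep function as a parameter keeps the sketch easy to try out without actually waiting ten seconds per request.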

Both "Slurp" (the bot used by Yahoo and AltaVista) and the Microsoft bots for Live Search obey this parameter. Googlebot does not use the "Crawl-delay" parameter. (You can, however, control the crawl rate used by Googlebot via Google's Webmaster Tools home page.)

Change the file as you wish and save it. Then upload it to your web server, making sure to put it in your web root. If you have installed Drupal in a subdirectory (for example /drupal), prefix the paths in robots.txt with that directory, but still place the file in your web root and not in Drupal's root folder.
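The path adjustment for a subdirectory install can be sketched as a small generator. This is only an illustration: the function name build_robots_txt and the /drupal prefix are assumptions, and the path list is a subset of the sample file above.

```python
# Subset of the internal Drupal paths from the sample file above.
PATHS = ["aggregator", "tracker", "comment/reply", "node/add", "user", "search"]


def build_robots_txt(subdir="", crawl_delay=10):
    """Build robots.txt text for a Drupal site installed under `subdir`."""
    cleaned = subdir.strip("/")
    prefix = "/" + cleaned if cleaned else ""
    lines = ["User-agent: *", f"Crawl-delay: {crawl_delay}"]
    # Clean URLs, e.g. /drupal/aggregator
    lines += [f"Disallow: {prefix}/{path}" for path in PATHS]
    # Non-clean URLs, e.g. /drupal/?q=aggregator
    lines += [f"Disallow: {prefix}/?q={path}" for path in PATHS]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Save this output as robots.txt in the web root, not in /drupal.
    print(build_robots_txt("drupal"), end="")
```

The sketch emits one "User-agent: *" record covering both URL styles, since some parsers (Python's urllib.robotparser among them) use only the first record that matches a given user agent.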

Then wait for the robots to visit your site. After some time, check your log files (the "referrer log") to see how many visitors came from a search engine.
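The referrer check can be done with a short script. The sketch below assumes the Apache "combined" log format, in which each line ends with the quoted referrer followed by the quoted user agent. The list of search engine hostnames is an illustrative assumption and is not exhaustive.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hostname fragments treated as search engines (illustrative, not exhaustive).
SEARCH_ENGINES = ("google.", "bing.", "yahoo.", "duckduckgo.")

# The combined log format ends with: "referrer" "user-agent"
REFERRER_RE = re.compile(r'"([^"]*)" "[^"]*"\s*$')


def count_search_referrals(log_lines):
    """Count requests whose referrer host looks like a known search engine."""
    counts = Counter()
    for line in log_lines:
        match = REFERRER_RE.search(line)
        if not match:
            continue
        # A referrer of "-" has no hostname and is skipped.
        host = urlparse(match.group(1)).hostname or ""
        for engine in SEARCH_ENGINES:
            if engine in host:
                counts[engine.rstrip(".")] += 1
                break
    return counts
```

To try it on a real log, pass an open file object, for example count_search_referrals(open("access.log")), where access.log is a placeholder for your server's log path.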

If you are using a multi-site setup and want to control robot settings for each site individually, a single static robots.txt file will not work, because all of the sites share the same web root. Use the RobotsTxt module instead.

Comments

Clarification of terms

I believe some terms should be clarified here, such as "Controlling search engine" and "forbid search engine bots".

A robots.txt file does not 'protect' a page from being viewed by a bot, as a bot does not have to follow the robots.txt rules.

Using an analogy, a robots.txt file is like a sign on a property that says 'do not enter'. This sign does not stop anyone from entering that property. All it does is tell people you don't want them entering your property. If you want to stop people from entering the property, you should build a wall around it as well (i.e. put things behind a login).

This is a common misconception and can lead web developers into problems, with sensitive information being captured.

There are in fact 'bad' bots that even specifically look at the robots.txt file and then view only pages that are disallowed.

Just my $.02 :)

If you disallow /comment/reply, the same considerations apply to /comment/nojs/reply, so disallow it as well:
Disallow: /comment/reply
Disallow: /comment/nojs/reply

Google shows restricted paths

I use the default robots.txt in Drupal 6.22, but the performance overview in Google Webmaster Central shows that prohibited directories are also accessed. The following paths are listed in Webmaster Central under example page loading times:
/admin/content/add 1.9
/node/add/story 2.3
/node/add/article 3.1
/rss.xml 0.6
/node/15008/edit 2.2
/admin/settings 0.9
/admin/reports/status 1.6
/admin/reports/status/run-cron 120.01

In addition to Disallow: /admin/, do I also need to specify, for example, /admin/reports/status/run-cron?

Would appreciate a reply.
Thanks


Drupal’s online documentation is © 2000-2012 by the individual contributors and can be used in accordance with the Creative Commons License, Attribution-ShareAlike 2.0. PHP code is distributed under the GNU General Public License.