Controlling search engine indexing with robots.txt
The robots.txt file is the mechanism that almost all search engines honor for letting website administrators tell bots which parts of a site they would like indexed. By adding this file to your web root, you can ask search engine bots not to index certain parts of your website. Compliance is voluntary: well-behaved bots follow the file, but it does not technically block access. For an example, see the robots.txt file on drupal.org.
A robots.txt file is included with Drupal 5.x and newer versions. If you want to create a custom robots.txt file, follow the instructions below. For more details, see http://www.robotstxt.org.
Create a file containing the content as shown below and call it "robots.txt". Lines beginning with the pound ("#") sign are comments and can be deleted.
# Small robots.txt
# More information about this file can be found at
# http://www.robotstxt.org/
# If your Drupal site is in a subdirectory of your web root (e.g. /drupal)
# add the name of this directory before the / (slash) below
# example: Disallow: /drupal/aggregator
# to stop a polite robot indexing an example dir
# add a line like: user-agent: polite-bot
# and: Disallow: /example-dir/
# Paths (clean URLs)
User-agent: *
Crawl-Delay: 10
Disallow: /aggregator
Disallow: /tracker
Disallow: /comment/reply
Disallow: /node/add
Disallow: /user
Disallow: /files
Disallow: /search
Disallow: /book/print
# Paths (no clean URLs)
User-agent: *
Crawl-Delay: 10
Disallow: /?q=aggregator
Disallow: /?q=tracker
Disallow: /?q=comment/reply
Disallow: /?q=node/add
Disallow: /?q=user
Disallow: /?q=files
Disallow: /?q=search
Disallow: /?q=book/print
The code above instructs search engine bots to avoid pages that contain content that is meant only for users, for instance the search page, or the add comment pages.
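To check which pages a set of rules like the clean-URL group above would block, the rules can be fed to the robots.txt parser in Python's standard library. This is only a minimal sketch: the function name is_allowed, the agent name "ExampleBot", and the sample paths are made up for illustration, and real search engine bots may interpret the file slightly differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The clean-URL rules from the example file above.
ROBOTS_TXT = """\
User-agent: *
Crawl-Delay: 10
Disallow: /aggregator
Disallow: /tracker
Disallow: /comment/reply
Disallow: /node/add
Disallow: /user
Disallow: /files
Disallow: /search
Disallow: /book/print
"""


def is_allowed(path, agent="ExampleBot"):
    """Return True if a bot named `agent` (hypothetical) may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Sample paths are illustrative only.
    for path in ("/node/1", "/search", "/user/login", "/comment/reply/5"):
        print(path, "allowed" if is_allowed(path) else "blocked")
```

Because Disallow rules match by prefix, "Disallow: /user" also covers paths such as /user/login.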
Many bots obey the "Crawl-delay:" parameter. Because Drupal sites seem to be popular with search engines, and many sites receive more requests from aggressive bots than from human visitors, it may be wise to slow the robots down by adding lines like these to your robots.txt:
User-Agent: *
Crawl-Delay: 10
The value 10 is the delay in seconds between page requests.
Both "Slurp" (Yahoo's and AltaVista's bot) and the Microsoft bots for Live Search obey this parameter. Googlebot does not use the "crawl-delay" parameter. (You can, however, control the crawl rate used by Googlebot via the Google Webmaster Tools home page.)
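To get a feel for what a given crawl delay means in practice, the following sketch reads the delay from a robots.txt text with Python's standard-library parser and estimates the minimum time a compliant bot would need to fetch a number of pages. The function names and the page count are hypothetical, and the estimate ignores download time; it only counts the waits between requests.

```python
from urllib.robotparser import RobotFileParser

SAMPLE = """\
User-agent: *
Crawl-Delay: 10
Disallow: /search
"""


def crawl_delay_seconds(robots_txt, agent="*"):
    """Return the Crawl-delay that applies to `agent`, or None if unset."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.crawl_delay(agent)


def minimum_crawl_time(pages, delay):
    """Seconds spent waiting between `pages` requests at `delay` seconds each."""
    return max(pages - 1, 0) * delay


if __name__ == "__main__":
    delay = crawl_delay_seconds(SAMPLE)
    # 1000 pages is an arbitrary illustrative figure.
    print(delay, minimum_crawl_time(1000, delay))
```

With a delay of 10 seconds, a bot that honors the parameter needs at least 9990 seconds (about 2 hours 46 minutes) to request 1000 pages, so a large delay can noticeably slow down how quickly a big site is re-crawled.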
Change the file as you wish and save it. Now upload it to your web server and make sure you put it into your web root. If you have installed Drupal in a subdirectory (for example /drupal), then change the paths in robots.txt accordingly (for example, Disallow: /drupal/search), but still place the file in your web root and not in Drupal's root folder, because bots only request /robots.txt from the top level of the site.
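If Drupal lives in a subdirectory, prefixing every path by hand is error-prone. The sketch below generates the clean-URL rules with an optional subdirectory prefix. The function name build_robots_txt and its parameters are made up for this illustration and are not part of Drupal.

```python
# Paths taken from the clean-URL example earlier on this page.
DRUPAL_PATHS = [
    "aggregator", "tracker", "comment/reply", "node/add",
    "user", "files", "search", "book/print",
]


def build_robots_txt(paths, subdirectory="", crawl_delay=None):
    """Build robots.txt text, prefixing each path with `subdirectory`."""
    cleaned = subdirectory.strip("/")
    prefix = "/" + cleaned if cleaned else ""
    lines = ["User-agent: *"]
    if crawl_delay is not None:
        lines.append(f"Crawl-delay: {crawl_delay}")
    for path in paths:
        lines.append(f"Disallow: {prefix}/{path.lstrip('/')}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Drupal installed in /drupal; the result still belongs in the web root.
    print(build_robots_txt(DRUPAL_PATHS, subdirectory="drupal", crawl_delay=10))
```

The generated text would be saved as robots.txt in the web root, not in the /drupal folder.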
Now watch the robots visit your site, and after some time, check your log files ("referrer log") to see how many visitors came from a search engine.
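One rough way to count search engine visitors is to scan the access log for referrer URLs that point at a search engine. The sketch below assumes the common "combined" log format, in which the referrer is the quoted field after the status code and response size; your server's log format may differ. The domain fragments, function name, and sample log lines are illustrative assumptions, not real data.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical mapping of engine name to a fragment of its host name.
SEARCH_ENGINES = {
    "Google": "google.",
    "Yahoo": "yahoo.",
    "Live Search": "live.com",
    "AltaVista": "altavista.",
}

# "request" status size "referrer" (combined log format, assumed).
LOG_LINE = re.compile(r'"[^"]*" \d{3} \S+ "(?P<referrer>[^"]*)"')


def count_search_referrals(lines):
    """Count log lines whose referrer host looks like a known search engine."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue
        host = urlparse(match.group("referrer")).hostname or ""
        for name, fragment in SEARCH_ENGINES.items():
            if fragment in host:
                counts[name] += 1
                break
    return counts


# Invented sample lines for demonstration only.
SAMPLE_LOG = [
    '192.0.2.1 - - [10/Oct/2008:13:55:36 +0000] "GET /node/1 HTTP/1.1" 200 2326 "http://www.google.com/search?q=drupal" "Mozilla/5.0"',
    '192.0.2.2 - - [10/Oct/2008:13:56:01 +0000] "GET /node/2 HTTP/1.1" 200 1450 "http://search.yahoo.com/search?p=drupal" "Mozilla/5.0"',
    '192.0.2.3 - - [10/Oct/2008:13:57:12 +0000] "GET / HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    print(count_search_referrals(SAMPLE_LOG))
```

Substring matching on the host name is a deliberately simple heuristic; it can miscount hosts that merely contain one of the fragments, so treat the numbers as an estimate.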
If you are using a multi-site setup and you want to control robot settings for each site individually, you will not be able to use a single static robots.txt file, because all sites share the same web root. Please use the RobotsTxt module instead.
