The robots.txt file is the mechanism almost all search engines use to let website administrators tell bots what they would like indexed. By adding this file to your web root, you can forbid search engine bots from indexing certain parts of your website. Example: see the drupal.org robots.txt.
A robots.txt file is included with Drupal 5.x and newer versions, though there are SEO problems with Drupal's default robots.txt file even in Drupal 7. If you want to create a custom robots.txt file, follow the instructions below. For more details, see http://www.robotstxt.org.
Create a file containing the content as shown below and call it "robots.txt". Lines beginning with the pound ("#") sign are comments and can be deleted.
# Small robots.txt
# More information about this file can be found at
# http://www.robotstxt.org/
# In case your Drupal site is in a subdirectory of your web root (e.g. /drupal)
# add the name of this directory before the / (slash) below
# example: Disallow: /drupal/aggregator
# To stop a polite robot from indexing an example directory,
# add a line like: User-agent: polite-bot
# and: Disallow: /example-dir/
# Paths (clean URLs)
User-agent: *
Crawl-Delay: 10
Disallow: /aggregator
Disallow: /tracker
Disallow: /comment/reply
Disallow: /node/add
Disallow: /search/
Disallow: /book/print
Disallow: /logout
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login
# Paths (no clean URLs)
Disallow: /?q=aggregator
Disallow: /?q=tracker
Disallow: /?q=comment/reply
Disallow: /?q=node/add
Disallow: /?q=user/register
Disallow: /?q=user/password
Disallow: /?q=user/login
Disallow: /?q=search/
Disallow: /?q=book/print
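If you want to sanity-check how a crawler that follows the basic standard interprets these rules, Python's standard-library urllib.robotparser can parse a robots.txt fragment and answer allow/deny queries. Note that it implements plain prefix matching, not Google's wildcard extensions; the example.com URLs and the bot name below are placeholders:

```python
import urllib.robotparser

# A fragment of the rules above, parsed directly from a string.
rules = """\
User-agent: *
Crawl-delay: 10
Disallow: /aggregator
Disallow: /node/add
Disallow: /user/login
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules.splitlines())

# Disallowed path: matches the /node/add rule by prefix.
print(rp.can_fetch("MyBot", "http://example.com/node/add"))  # False
# An ordinary node page is not covered by any rule.
print(rp.can_fetch("MyBot", "http://example.com/node/123"))  # True
# The crawl delay declared for the matching record.
print(rp.crawl_delay("MyBot"))                               # 10
```

Because "MyBot" matches no specific record, the `User-agent: *` record applies, exactly as it would for an unnamed crawler.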
The code above instructs search engine bots to avoid pages whose content is meant only for users, for instance the search page or the add-comment pages.
A common SEO problem on Drupal sites is that search engines index URLs with query parameters that should not be indexed. The wildcard (*) is not an official part of the robots.txt standard, but Google and Bing obey it. Most Drupal sites should include these rules in their robots.txt file:
# Blocks user "track" pages
Disallow: /*/track$
# Blocks common URL parameters created by the Views module on tables
Disallow: /*sort=
Disallow: /*size=
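Since the wildcard is an extension, its semantics are worth spelling out. The following Python sketch is an illustration of the documented Google-style behavior, not Google's actual code: '*' matches any run of characters, and a trailing '$' anchors the pattern to the end of the URL:

```python
import re

def robots_pattern_matches(pattern, path):
    """Google-style robots.txt matching: '*' is a wildcard,
    a trailing '$' anchors the pattern at the end of the URL."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything except '*', which becomes '.*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # robots.txt patterns match from the start of the path.
    return re.match(regex, path) is not None

print(robots_pattern_matches("/*/track$", "/node/42/track"))      # True
print(robots_pattern_matches("/*/track$", "/node/42/track/feed")) # False
print(robots_pattern_matches("/*sort=", "/forum?sort=asc"))       # True
```

So `Disallow: /*/track$` blocks only URLs that end in /track, while `Disallow: /*sort=` blocks any URL containing a sort= parameter.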
Some bots obey the "Crawl-delay:" parameter. Since Drupal sites seem to be popular with search engines, and many sites see more aggressive bots than human visitors, it might be wise to slow the robots down by adding a line like this to your robots.txt:
User-Agent: *
Crawl-Delay: 10
10 is the delay in seconds between page requests.
Both "Slurp" (the Yahoo! and AltaVista bot) and Microsoft's bots for Live Search obey this parameter. Googlebot does not use the "Crawl-delay" parameter; you can, however, control Googlebot's crawl rate via the Google Webmaster Tools home page.
Change the file as you wish and save it. Then upload it to your web server, making sure it goes into your web root. If you have installed Drupal in a subdirectory (for example /drupal), change the paths in robots.txt accordingly, but still place the file in your web root, not in Drupal's root folder.
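For example, with Drupal installed in a hypothetical /drupal subdirectory, the paths would carry the prefix, but the file itself still lives at the web root as /robots.txt:

```
# Drupal installed in the /drupal subdirectory;
# this file is still served from the web root as /robots.txt
User-agent: *
Disallow: /drupal/aggregator
Disallow: /drupal/tracker
Disallow: /drupal/node/add
```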
Now watch the robots visit your site and after some time, monitor your log files ("referrer log") to see how many visitors came from a search engine.
If you are using a multi-site setup and want to control robot settings for each site individually, you will not be able to use a shared robots.txt. Use the RobotsTxt module instead.
Comments
Clarification of terms
I believe some terms should be clarified here, such as "Controlling search engine" and "forbid search engine bots".
A robots.txt file does not 'protect' a page from being viewed by a bot, as a bot does not have to follow the robots.txt rules.
Using an analogy, a robots.txt file is like a sign on a property that says 'do not enter'. This sign does not stop anyone from entering that property. All it does is tell people you don't want them entering your property. If you want to stop people from entering the property, you should build a wall around it as well (ie. put things behind a login).
This is a common misconception and can lead web developers into problems, with sensitive information being captured.
There are in fact 'bad' bots that specifically read the robots.txt file and then visit exactly the pages that are disallowed.
Just my $.02 :)
If you disallow /comment/reply, the same considerations apply to /comment/nojs/reply, so disallow both:
Disallow: /comment/reply
Disallow: /comment/nojs/reply
google shows restricted paths
I use the default robots.txt in Drupal 6.22, but Google Webmaster Central's performance overview shows that prohibited directories are also being accessed. I have the following listed in Webmaster Central under example page loading times:
/admin/content/add 1.9
/node/add/story 2.3
/node/add/article 3.1
/rss.xml 0.6
/node/15008/edit 2.2
/admin/settings 0.9
/admin/reports/status 1.6
/admin/reports/status/run-cron 120.01
In addition to Disallow: /admin/, do I also need to specify, for example, /admin/reports/status/run-cron?
Would appreciate a reply.
Thanks
Disallow:
Disallow: /user/password/
Disallow: /user/password
Do the above two directives have different meanings? All my individual pages have duplicate pages created under node and taxonomy paths; even though I have changed the settings and set aliases, many pages are still crawled.
Please suggest whether I should disallow node and taxonomy using:
Disallow: /node
Disallow: /taxonomy
You can check my website's links via site:http://www.open-source-development.com/
After browsing 6-7 pages you will find all the node and taxonomy pages; please advise.
Should I also disallow search?
If both of those links lead to a single page, they are the same and don't differ.
As for allowing or disallowing nodes and other paths: it is up to you whether search engines should crawl special paths like node or taxonomy. If you don't want search engines to index them, copy the lines above into your robots.txt file. The same goes for limiting the search path: if search engine crawling creates heavy traffic on your website and your hosting is shared and not powerful, it is better to block crawlers from the search path.
Don't disallow /node
Don't disallow /node or you will block unaliased content pages. I think that it's better to use the Global Redirect module to redirect /node URLs to their aliased versions. For more information about Drupal's robots.txt file see this post.
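The prefix-matching behavior behind this warning can be checked with Python's standard-library urllib.robotparser (the example.com URLs are placeholders): because rules match by path prefix, Disallow: /node blocks every /node/123-style URL as well.

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse("User-agent: *\nDisallow: /node".splitlines())

# The rule blocks /node itself...
print(rp.can_fetch("*", "http://example.com/node"))     # False
# ...and, by prefix matching, every unaliased content page under it.
print(rp.can_fetch("*", "http://example.com/node/123")) # False
# Aliased pages are unaffected.
print(rp.can_fetch("*", "http://example.com/about-us")) # True
```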
Googlebot weirdness
As far as I can tell, neither Drupal's default robots.txt nor the directives in the above robots.txt suffice. In my robots.txt, I'm using:
Googlebot still crawls these URLs:
As it appears, the directive
Disallow: /user/login?destination=*
does not work for me. This is a problem, as Googlebot accesses these URLs on a massive scale and just gets an "Access denied" response (which is a waste of server bandwidth). Is the wildcard (*) really supported by Googlebot, or am I using it wrong? Also I have:
Googlebot still crawls URLs like this:
This is even worse since this just gives a blank comment form with HTTP reply code 302 - also a waste of server bandwidth.
How do I tell Googlebot that it is not supposed to follow links to comment forms and access restricted content?