Project: Drupal core
Version: 5.x-dev
Component: base system
Category: task
Priority: normal
Assigned: Unassigned
Status: closed (fixed)

Issue Summary

I've recently been using Nutch (a web crawler) to build a Drupal-specific search engine. I get to watch how web-crawlers behave when they look at Drupal sites. It is appalling. We let crawlers search our login page, our request new password page, different sorted views of our tables, and lots of other stuff that is just wasteful. Furthermore, if you ever sit and watch your Apache logs roll by, you'll know that search engine traffic is a very large percentage of all traffic for many sites. The situation could be fixed by including a robots.txt file.

Or could it? What about multisite configurations? Well, just putting a robots.txt file in the top level directory locks every site into using one file, which clearly won't suffice.

Thus I propose that we adopt the strategy I used for my robotstxt module for core. We add an alias "robots.txt", add a variable of the same name, and output it when robots.txt is asked for. You can edit the variable as an administrator in a textarea and we provide sensible defaults tailored to Drupal sites.
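
As a rough illustration of the per-site idea (a Python sketch, not Drupal code; `site_variables`, `DEFAULT_ROBOTS_TXT` and `robots_txt_for` are names made up for this example), each site stores its own robots.txt text and falls back to a shipped default when nothing is stored:

```python
# Hypothetical shipped default; the real default file is what this issue is about.
DEFAULT_ROBOTS_TXT = "User-agent: *\nDisallow: /admin/\n"

# Hypothetical per-site storage, standing in for a variable named "robots.txt"
# that an administrator edits in a textarea.
site_variables = {
    "example.com": {"robots.txt": "User-agent: *\nDisallow: /private/\n"},
    "blog.example.com": {},  # no override stored for this site
}


def robots_txt_for(host):
    """Return the robots.txt body for a host, falling back to the default."""
    return site_variables.get(host, {}).get("robots.txt", DEFAULT_ROBOTS_TXT)


print(robots_txt_for("example.com"))
print(robots_txt_for("blog.example.com"))
```

Because the lookup is keyed by host, two sites sharing one code base can answer a request for robots.txt differently, which a single file in the top-level directory cannot do.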

I'll roll a patch and it won't be more than about 10 LOC (minus the actual robots.txt we ship with), but I want to know from a core committer that there is interest.

Comments

#1

Change in thinking has occurred. I now think that Drupal should ship with a default robots.txt and let the robotstxt module suffice for people with multisite needs. The search is on for the optimal robots.txt file. Here's a start:

User-agent: *
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /admin/
Disallow: /user/register
Disallow: /user/password

#2

The kind reviewers of this "patch" will need to create the file with the above text rather than apply a patch. The file should be called robots.txt and be in the root directory.

#3

I think this is a very good idea.

Here is the robots.txt file I've been using with my 4.5 site. Obviously, some paths have changed in 4.6, 4.7 and 4.8. But perhaps it will give you a couple more ideas.

User-agent: *

Crawl-Delay: 10

Disallow: */add/
Disallow: /?q=admin
Disallow: /admin/
Disallow: /database/
Disallow: /includes/
Disallow: /modules/
Disallow: /scripts/
Disallow: /themes/
Disallow: /xmlrpc.php
Disallow: ?q=admin
Disallow: cron.php
Disallow: error.php
Disallow: xmlrpc.php

#4

Your patch assumes that clean URLs are enabled?

#5

I don't know much about robots.txt, but this is what I use, partly as a result of the threads on hiding feeds and print pages:
Disallow: /node/feed
Disallow: /blog/feed
Disallow: /aggregator/sources
Disallow: /tracker
Disallow: /comment/reply
Disallow: /node/add
Disallow: /user
Disallow: /files
Disallow: /search
Disallow: /book/print
Disallow: /archive
Disallow: /trackback

I added the recommended pieces above since I didn't have any of them in my file.
Maria

#6

It just goes to show that the above approach is all wrong. Dries, what do you think of building this into the menu system, so that modules can do this:

<?php
// Inside a module's hook_menu() implementation ...
$items[] = array(
  'path' => 'some/path',
  'crawl' => FALSE, // proposed new key: ask crawlers to skip this path
);
?>

That way we could generate robots.txt dynamically and take into account all modules' paths, as well as things like clean urls.
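
To make the idea concrete, here is a minimal Python sketch of such a generator. It is only an illustration under assumptions: the `crawl` key is the proposal above rather than an existing menu property, and `robots_txt_from_menu` and the sample `menu` list are invented for this example.

```python
def robots_txt_from_menu(items, clean_urls=True):
    """Build robots.txt text from menu items.

    Items without a 'crawl' key are treated as crawlable (an assumption).
    """
    # With clean URLs a path is served as /some/path, otherwise as /?q=some/path.
    prefix = "/" if clean_urls else "/?q="
    lines = ["User-agent: *"]
    for item in items:
        if item.get("crawl", True):
            continue  # crawlable: no Disallow line needed
        lines.append("Disallow: " + prefix + item["path"])
    return "\n".join(lines) + "\n"


menu = [
    {"path": "some/path", "crawl": False},
    {"path": "node"},  # no 'crawl' key, so it stays crawlable
]

print(robots_txt_from_menu(menu, clean_urls=True))
print(robots_txt_from_menu(menu, clean_urls=False))
```

The `clean_urls` switch is what a static file cannot offer: the same list of paths yields either form of rule depending on how the site is configured.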

#7

Please take a look at an old book page I wrote up at http://drupal.org/node/22265,

especially the option to use:

User-agent: *
Crawl-Delay: 10

Dries was against including robots.txt functionality in 2005 (http://drupal.org/node/14177), but I think it is very standard to ship with a default robots.txt, and we should. In fact, I would rather ship Drupal with a robots.txt than a favicon.

see also http://cvs.drupal.org/viewcvs/drupal/drupal/robots.txt?hideattic=0&rev=1...

#8

I'm OK with a _simple_ robots.txt.

1. Keep it short and simple.
2. Add some documentation so people can extend it as they see fit.

#9

how about:

# small robots.txt
# more information about this file can be found at
# http://www.robotstxt.org/wc/robots.html
#
# lines beginning with the pound ("#") sign are comments and can be deleted.

# in case your Drupal site is in a directory
# lower than your docroot (e.g. /drupal)
# please add this before the /-es below

# to stop a polite robot indexing an exampledir
# add a line like (without the #'s)
# user-agent: polite-bot
# Disallow: /exampledir/

# a list of known bots can be found at
# http://www.robotstxt.org/wc/active/html/index.html
#
# see http://www.sxw.org.uk/computing/robots/check.html
# for syntax checking

User-agent: *
Crawl-Delay: 10
Disallow: /comment/reply
Disallow: /node/add
Disallow: /files
Disallow: /search
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /admin/
Disallow: /user/register
Disallow: /user/password
Disallow: /?q=admin
Disallow: /xmlrpc.php
Disallow: /?q=admin
Disallow: /cron.php
Disallow: /error.php
Disallow: /xmlrpc.php

It might be a bit longish, but it covers most basic options and it is documented. Note that wildcards in robots.txt don't work, so lines like */add* won't work.
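
A small Python sketch of why a wildcard line has no effect under the base standard, where a rule is compared as a literal prefix of the requested path. The helper `is_disallowed` is made up for this illustration; some crawlers implement wildcard matching as an extension, and this sketch does not model that.

```python
def is_disallowed(path, rules):
    """Base-standard matching: a rule blocks every path that starts with it."""
    return any(rule and path.startswith(rule) for rule in rules)


# The asterisk is compared literally, so it never matches a real path.
print(is_disallowed("/node/add/page", ["*/add/"]))
# A plain prefix rule does what the wildcard line was meant to do.
print(is_disallowed("/node/add/page", ["/node/add"]))
```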

#10

You've got Disallow: /?q=admin in there twice.
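
Duplicates like this are easy to spot mechanically. A minimal Python sketch, assuming the file holds a single `User-agent` record (a file with several records can legitimately repeat a rule in each); `duplicate_rules` is a name invented here, and `sample` copies the tail of the list above:

```python
from collections import Counter


def duplicate_rules(robots_txt):
    """Return directive lines that appear more than once, skipping comments and blanks."""
    rules = []
    for raw in robots_txt.splitlines():
        line = raw.strip()
        if line and not line.startswith("#"):
            rules.append(line)
    counts = Counter(rules)
    return sorted(rule for rule, n in counts.items() if n > 1)


sample = """\
User-agent: *
Disallow: /?q=admin
Disallow: /xmlrpc.php
Disallow: /?q=admin
Disallow: /cron.php
Disallow: /error.php
Disallow: /xmlrpc.php
"""

print(duplicate_rules(sample))
# prints ['Disallow: /?q=admin', 'Disallow: /xmlrpc.php']
```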

#11

# robots.txt
#
# This file aims to prevent the crawling and idexing of certain parts of your site by
# webcrawlers and spiders run by sites like Yahoo! and Google. By telling
# these "robots" where not to go on your site, you save bandwidth and server
# resources, and the quality of their crawling and indexing is improved as well.
#
# For more information about the robots.txt standard, see:
#    http://www.robotstxt.org/wc/robots.html
#
# To stop a polite robot from indexing an exampledir,
# add a uncommented line (without the #'s), like the following:
#
# user-agent: polite-bot
# Disallow: /exampledir/

# A list of know 'bots can be found at:
#   http://www.robotstxt.org/wc/active/html/index.html
#

# See this site for syntax checking:
http://www.sxw.org.uk/computing/robots/check.html
User-agent: *
Crawl-Delay: 10

# Directories
Disallow: /files/
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/

# Files
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /CHANGELOG.txt
Disallow: /MAINTAINERS.txt
Disallow: /LICENSE.txt
Disallow: /UPGRADE.txt

# Paths (Clean URLs)
Disallow: /admin/
Disallow: /node/add/
Disallow: /search/
Disallow: /comment/reply/
Disallow: /contact
Disallow: /user/register
Disallow: /user/password
Disallow: /logout

# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=comment/reply/
Disallow: /?q=contact
Disallow: /?q=user/register
Disallow: /?q=user/password
Disallow: /?q=logout

#12

These are usually added as a standard:

# W3C Link checker
User-agent: W3C-checklink
Disallow:

# Exclude stress-testing tools
User-Agent: stress-agent
Disallow: /
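
For anyone unsure how per-agent records interact with the catch-all record, here is a short Python check using the standard library's `urllib.robotparser`. The trailing `User-agent: *` record and the `ExampleBot` name are assumptions added for the illustration; the first two records are the ones above.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# W3C Link checker
User-agent: W3C-checklink
Disallow:

# Exclude stress-testing tools
User-agent: stress-agent
Disallow: /

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(agent, path):
    """Ask the parsed rules whether the given agent may fetch the path."""
    return parser.can_fetch(agent, path)


print(may_fetch("W3C-checklink", "/admin/"))  # empty Disallow: nothing is blocked
print(may_fetch("stress-agent", "/node/1"))   # Disallow: / blocks everything
print(may_fetch("ExampleBot", "/admin/"))     # falls through to the * record
```

A robot that matches a named record uses only that record, so the link checker is not bound by the rules in the `*` record.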

#13

There are some typos in Robert's text.

#14

<code>
# robots.txt
#
# This file aims to prevent the crawling and indexing of certain parts of your site by
# webcrawlers and spiders run by sites like Yahoo! and Google. By telling
# these "robots" where not to go on your site, you save bandwidth and server
# resources, and the quality of their crawling and indexing is improved as well.
#
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/wc/robots.html
#
# To stop a polite robot from indexing an exampledir,
# add an uncommented line (without #), like the following:
#
# user-agent: polite-bot
# Disallow: /exampledir/

# A list of known 'bots can be found at:
# http://www.robotstxt.org/wc/active/html/index.html
#
# See this site for syntax checking:
# http://www.sxw.org.uk/computing/robots/check.html

User-agent: *
Crawl-Delay: 10

# Directories
Disallow: /files/
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/

# Files
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /CHANGELOG.txt
Disallow: /MAINTAINERS.txt
Disallow: /LICENSE.txt
Disallow: /UPGRADE.txt

# Paths (Clean URLs)
Disallow: /admin/
Disallow: /node/add/
Disallow: /search/
Disallow: /comment/reply/
Disallow: /contact
Disallow: /user/register
Disallow: /user/password
Disallow: /logout

# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=comment/reply/
Disallow: /?q=contact
Disallow: /?q=user/register
Disallow: /?q=user/password
Disallow: /?q=logout
</code>

Attachment: robots_0.txt (1.65 KB)

#15

I have been using this since 4.5 or so.

The ideas above are great (clean vs. regular URLs, excluding feeds, print pages, etc.).

User-agent: *
  Crawl-Delay: 10
  Disallow: /database
  Disallow: /includes
  Disallow: /modules
  Disallow: /scripts
  Disallow: /themes
  Disallow: /aggregator
  Disallow: /tracker
  Disallow: /comment/reply
  Disallow: /node/add
  Disallow: /search

#16

added aggregator

Attachment: robots_1.txt (1.7 KB)

#17

Status: needs review » needs work

Some lines end with a trailing slash while others don't. Is that intentional?

The robots.txt file doesn't validate at all. Test with http://www.sxw.org.uk/computing/robots/check.html.

#18

Crawl-delay is non-standard but obeyed by at least a couple of major spiders. I removed line breaks per the validator's suggestion. I also read that directories must be followed by a trailing slash, so I added that to both the clean and non-clean URL sections, though it is an open question (and probably not consistent) how spiders will handle the non-clean directives.
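
As one example of a client that reads the directive, Python's standard `urllib.robotparser` exposes it through `crawl_delay()` (available since Python 3.6). The sketch below is only an illustration; `ExampleBot` is a made-up agent name.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Crawl-delay: 10",
    "Disallow: /admin/",
])

# Seconds the parser reports for this agent, taken from the * record.
delay = parser.crawl_delay("ExampleBot")
print(delay)
```

A crawler that does not know the directive simply ignores the line, so including it costs nothing for robots that stick to the base standard.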

Attachment: robots_2.txt (1.71 KB)

#19

Status: needs work » needs review

#20

Status: needs review » fixed

Committed to CVS HEAD. Thanks.

#21

Should we also add these lines?
Disallow: /user/login
Disallow: /?q=user/login

#22

Status: fixed » active

Yeah, good catch. Dries, do you need a patch?

#23

If we protect *.TXT files, we don't need to list them here anymore.

see http://drupal.org/node/79018

#24

I agree with Robert (http://drupal.org/node/75916#comment-123192): it would be great if the robots.txt could be created automatically as part of the menu system.

For now, manually editing my robots.txt is just fine, but letting modules define crawl/no-crawl defaults for menu paths seems like a good idea.

Maybe the crawlability of all such paths could be administered on a special admin page if people don't like the defaults.

#25

Version: x.y.z » 5.x-dev

Patch adds lines suggested in #21

I have very little experience making patches. Hope it works :-)

Attachment: robots.txt_3.patch (602 bytes)

#26

Category: feature request » task
Status: active » needs review

#27

Status: needs review » fixed

Committed to CVS HEAD. Thanks.

#28

Status: fixed » needs review

What's the rationale for disallowing the aggregator? I consider that content, not administrative functions like the other items.

Attachment: allow-aggregator-robots-txt.patch.txt (462 bytes)

#29

Version: 5.x-dev » 6.x-dev

I would like to see this go in the development version first.

#30

I've just been using Google Webmaster Tools to test various aspects of the site, including the robots.txt file. I've come to a startling conclusion.

Disallow: /user/password != Disallow: /user/password/

and

Disallow: /user/password/ *does not include* Disallow: /user/password

I'm running a 5.1 site and I noticed that paths that shouldn't be indexed, such as /contact and /user/login, are being indexed anyway.

To properly protect certain paths it is necessary to:

Disallow: /admin
Disallow: /admin/
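
The observation can be reproduced with Python's standard `urllib.robotparser`, which matches each rule as a plain prefix of the requested path. This is a sketch of that one parser's behaviour, not a statement about every crawler; `allowed` and `ExampleBot` are names invented for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules, path):
    """Parse a one-record robots.txt with the given Disallow rules and test a path."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *"] + ["Disallow: " + rule for rule in rules])
    return parser.can_fetch("ExampleBot", path)


# With the trailing slash, the bare path is not covered.
print(allowed(["/user/password/"], "/user/password"))
# Without it, both the bare path and everything below it are covered.
print(allowed(["/user/password"], "/user/password"))
print(allowed(["/user/password"], "/user/password/"))
# Side effect of the shorter rule: unrelated paths sharing the prefix are blocked too.
print(allowed(["/admin"], "/administer"))
```

Under prefix matching the slash-less rule alone already covers both forms, so listing `/admin` and `/admin/` together is harmless but redundant. The cost of the shorter rule is that it also catches any other path that happens to begin with the same characters.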

#31

Status: needs review » patch (to be ported)

Well, aggregator could be content you would like to get indexed (like content gathered from your subsites), or foreign content you would not like to have indexed. I changed the default now to let it be indexed, as you suggest, but this decision is different from site to site. I am not entirely sure this should be ported back, but setting it to that state as drumm indicated.

#32

Removed all trailing slashes, see also:
http://groups.drupal.org/node/5391#comment-15648

Attachment: robots-patch.1.9.txt (1.36 KB)

#33

Removed all trailing slashes, see also:
http://groups.drupal.org/node/5391

Attachment: robots-patch.1.9_0.txt (1.36 KB)

#34

There is another patch for robots.txt here:
http://drupal.org/node/180379

Someone recommended that I open a new issue for it. It's my first submitted patch -- I hope I did it right...

#35

Version: 6.x-dev » 5.x-dev
Assigned to: robertDouglass » Anonymous
Status: patch (to be ported) » reviewed & tested by the community

The attached patch removes the aggregator entries from the robots.txt in Drupal 5. It would seem that the patch has, in all other respects, already been applied to Drupal 6, except for the trailing slashes issue, which I'd say is more at home in #180379: Fixing Robots.txt. This bug is about providing a default robots.txt, and that very robots.txt is now available in D5, D6, and D7. As soon as D5's robots.txt has been updated to match the D6 version this issue ended up with, please mark this fixed and/or closed.

(#28 still applies as well, though with a wee bit of fuzz.)

Edit: Updated patch. Had some old stuff in it.

Attachment: robots.txt.d5.allow_aggregator.patch (697 bytes)

#36

Status: reviewed & tested by the community » fixed

Committed to 5.x.

#37

Status: fixed » closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.