Include a default robots.txt [#75916]

Comment	File	Size	Author
#35	robots.txt.d5.allow_aggregator.patch	697 bytes	Freso
#33	robots-patch.1.9_0.txt	1.36 KB	yaph
#32	robots-patch.1.9.txt	1.36 KB	yaph
#28	allow-aggregator-robots-txt.patch.txt	462 bytes	sillygwailo
#25	robots.txt_3.patch	602 bytes	pcwick
#18	robots_2.txt	1.71 KB	robertdouglass
#16	robots_1.txt	1.7 KB	robertdouglass
#14	robots_0.txt	1.65 KB	robertdouglass

Comment #1

robertdouglass commented 7 August 2006 at 10:12

Change in thinking has occurred. I now think that Drupal should ship with a default robots.txt and let the robotstxt module suffice for people with multisite needs. The search is on for the optimal robots.txt file. Here's a start:

User-agent: *
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /admin/
Disallow: /user/register
Disallow: /user/password

Comment #2

robertdouglass commented 9 August 2006 at 20:15

The kind reviewers of this "patch" will need to create the file with the above text rather than apply a patch. The file should be called robots.txt and be in the root directory.

Comment #3

Chris Johnson commented 10 August 2006 at 14:33

I think this is a very good idea.

Here is the robots.txt file I've been using with my 4.5 site. Obviously, some paths have changed in 4.6, 4.7 and 4.8. But perhaps it will give you a couple more ideas.

User-agent: *

Crawl-Delay: 10

Disallow: */add/
Disallow: /?q=admin
Disallow: /admin/
Disallow: /database/
Disallow: /includes/
Disallow: /modules/
Disallow: /scripts/
Disallow: /themes/
Disallow: /xmlrpc.php
Disallow: ?q=admin
Disallow: cron.php
Disallow: error.php
Disallow: xmlrpc.php

Comment #4

dries commented 10 August 2006 at 17:59

Your patch assumes that clean URLs are enabled?

Comment #5

mariagwyn commented 10 August 2006 at 18:22

I don't know much about robots.txt, but this is what I use, partly as a result of the threads on hiding feeds and print pages:
Disallow: /node/feed
Disallow: /blog/feed
Disallow: /aggregator/sources
Disallow: /tracker
Disallow: /comment/reply
Disallow: /node/add
Disallow: /user
Disallow: /files
Disallow: /search
Disallow: /book/print
Disallow: /archive
Disallow: /trackback

I added the recommended pieces above since I didn't have any of them in my file.
Maria

Comment #6

robertdouglass commented 10 August 2006 at 18:44

It just goes to show that the above approach is all wrong. Dries, what do you think of building this into the menu system, so that modules can do this:


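// Inside a module's hook_menu() implementation.
// The 'crawl' key is the proposal here; it is not an existing menu API key.
function example_menu($may_cache) {
  $items = array();
  $items[] = array(
    'path' => 'some/path',
    'crawl' => FALSE,
  );
  return $items;
}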

That way we could generate robots.txt dynamically and take into account all modules' paths, as well as things like clean urls.
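
A rough sketch of the idea above, written in Python rather than as Drupal code. `MenuItem`, its `crawl` flag, and `build_robots_txt` are hypothetical names made up for illustration; nothing like this exists in the menu system. It only shows how a list of paths flagged as not crawlable could be turned into robots.txt lines for both clean and `?q=` URLs.

```python
from dataclasses import dataclass


@dataclass
class MenuItem:
    """Hypothetical stand-in for a menu entry carrying the proposed 'crawl' flag."""
    path: str
    crawl: bool = True


def build_robots_txt(items, clean_urls=True):
    """Return robots.txt text that disallows every item marked crawl=False."""
    lines = ["User-agent: *"]
    for item in items:
        if item.crawl:
            continue
        # Emit the clean form only when clean URLs are enabled. The ?q= form
        # is always emitted; that choice is an assumption of this sketch.
        if clean_urls:
            lines.append(f"Disallow: /{item.path}")
        lines.append(f"Disallow: /?q={item.path}")
    return "\n".join(lines) + "\n"


items = [
    MenuItem("admin", crawl=False),
    MenuItem("node"),
    MenuItem("user/register", crawl=False),
]
print(build_robots_txt(items))
```

Keeping the flag next to the path definition means a module that adds a new administrative path would not have to remember to edit a static file as well.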

Comment #7

bertboerland commented 10 August 2006 at 20:55

Please take a look at an old book page I wrote up at http://drupal.org/node/22265, especially the option to use:

User-agent: *
Crawl-Delay: 10

Dries was against including robots.txt functionality in 2005 ( http://drupal.org/node/14177 ), but I think shipping a default robots.txt is very standard, and we should do it. In fact, I would rather ship Drupal with a robots.txt than a favicon.

Comment #8

dries commented 11 August 2006 at 18:25

I'm OK with a _simple_ robots.txt.

1. Keep it short and simple.
2. Add some documentation so people can extend it as they see fit.

Comment #9

bertboerland commented 11 August 2006 at 18:44

how about:

# small robots.txt
# more information about this file can be found at
# http://www.robotstxt.org/wc/robots.html
# lines beginning with the pound ("#") sign are comments and can be deleted.

# in case your drupal site is in a directory
# lower than your docroot (e.g. /drupal)
# please prefix that directory to the paths below

# to stop a polite robot indexing an exampledir
# add a line like (without the #'s)
# user-agent: polite-bot
# Disallow: /exampledir/

# a list of known bots can be found at
# http://www.robotstxt.org/wc/active/html/index.html
# see http://www.sxw.org.uk/computing/robots/check.html
# for syntax checking

User-agent: *
Crawl-Delay: 10
Disallow: /comment/reply
Disallow: /node/add
Disallow: /files
Disallow: /search
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /admin/
Disallow: /user/register
Disallow: /user/password
Disallow: /?q=admin
Disallow: /xmlrpc.php
Disallow: /?q=admin
Disallow: /cron.php
Disallow: /error.php
Disallow: /xmlrpc.php

It might be a bit longish, but it covers most basic options and it is documented. Note that wildcards in robots.txt don't work, so lines like */add* won't work.
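
To illustrate the wildcard point: under the original robots.txt convention a Disallow value is compared as a literal prefix of the requested path. The helper below is a minimal sketch of that rule only, with a made-up name (`is_disallowed`) and a made-up example path; some crawlers implement wildcard extensions of their own, which this does not model.

```python
def is_disallowed(path, rules):
    """Original-standard matching: a rule blocks any path that starts with it."""
    return any(rule and path.startswith(rule) for rule in rules)


# The '*' is just another character, so this rule never matches a real path.
print(is_disallowed("/node/add/story", ["*/add/"]))     # False

# A plain prefix does what the wildcard line was meant to do for this path.
print(is_disallowed("/node/add/story", ["/node/add"]))  # True
```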

Comment #10

rewted commented 11 August 2006 at 18:59

You've got Disallow: /?q=admin in there twice.

Comment #11

robertdouglass commented 11 August 2006 at 21:26

# robots.txt
#
# This file aims to prevent the crawling and idexing of certain parts of your site by
# webcrawlers and spiders run by sites like Yahoo! and Google. By telling
# these "robots" where not to go on your site, you save bandwidth and server
# resources, and the quality of their crawling and indexing is improved as well.
# 
# For more information about the robots.txt standard, see:
#    http://www.robotstxt.org/wc/robots.html
#
# To stop a polite robot from indexing an exampledir,
# add a uncommented line (without the #'s), like the following:
#
# user-agent: polite-bot
# Disallow: /exampledir/

# A list of know 'bots can be found at:
#   http://www.robotstxt.org/wc/active/html/index.html
# 
# See this site for syntax checking:
#  http://www.sxw.org.uk/computing/robots/check.html

User-agent: *
Crawl-Delay: 10

# Directories
Disallow: /files/
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/

# Files
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /CHANGELOG.txt
Disallow: /MAINTAINERS.txt
Disallow: /LICENSE.txt
Disallow: /UPGRADE.txt

# Paths (Clean URLs)
Disallow: /admin/
Disallow: /node/add/
Disallow: /search/
Disallow: /comment/reply/
Disallow: /contact
Disallow: /user/register
Disallow: /user/password
Disallow: /logout

# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=comment/reply/
Disallow: /?q=contact
Disallow: /?q=user/register
Disallow: /?q=user/password
Disallow: /?q=logout

Comment #12

figaro commented 11 August 2006 at 21:55

These are usually added as a standard:

# W3C Link checker
User-agent: W3C-checklink
Disallow:

# Exclude stress-testing tools
User-Agent: stress-agent
Disallow: /

Comment #13

dries commented 12 August 2006 at 12:12

There are some typos in Robert's text.

Comment #14

robertdouglass commented 12 August 2006 at 12:45

Status	File	Size
new	robots_0.txt	1.65 KB

<code>
# robots.txt
#
# This file aims to prevent the crawling and indexing of certain parts of your site by
# webcrawlers and spiders run by sites like Yahoo! and Google. By telling
# these "robots" where not to go on your site, you save bandwidth and server
# resources, and the quality of their crawling and indexing is improved as well.
#
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/wc/robots.html
#
# To stop a polite robot from indexing an exampledir,
# add an uncommented line (without #), like the following:
#
# user-agent: polite-bot
# Disallow: /exampledir/

# A list of known 'bots can be found at:
# http://www.robotstxt.org/wc/active/html/index.html
#
# See this site for syntax checking:
# http://www.sxw.org.uk/computing/robots/check.html


User-agent: *
Crawl-Delay: 10

# Directories
Disallow: /files/
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/

# Files
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /CHANGELOG.txt
Disallow: /MAINTAINERS.txt
Disallow: /LICENSE.txt
Disallow: /UPGRADE.txt

# Paths (Clean URLs)
Disallow: /admin/
Disallow: /node/add/
Disallow: /search/
Disallow: /comment/reply/
Disallow: /contact
Disallow: /user/register
Disallow: /user/password
Disallow: /logout

# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=comment/reply/
Disallow: /?q=contact
Disallow: /?q=user/register
Disallow: /?q=user/password
Disallow: /?q=logout
</code>

Comment #15

kbahey commented 13 August 2006 at 04:41

I have been using this since 4.5 or so.

The ideas above are great (clean vs. regular URLs, excluding feeds, print pages, etc.)

User-agent: *
  Crawl-Delay: 10
  Disallow: /database
  Disallow: /includes
  Disallow: /modules
  Disallow: /scripts
  Disallow: /themes
  Disallow: /aggregator
  Disallow: /tracker
  Disallow: /comment/reply
  Disallow: /node/add
  Disallow: /search

Comment #16

robertdouglass commented 13 August 2006 at 06:06

Status	File	Size
new	robots_1.txt	1.7 KB

added aggregator

Comment #17

dries commented 13 August 2006 at 09:24

Status:	Needs review	» Needs work

Some lines end with a trailing slash while others don't. Is that intentional?

The robots.txt file doesn't validate at all. Test with http://www.sxw.org.uk/computing/robots/check.html.

Comment #18

robertdouglass commented 13 August 2006 at 09:55

Status	File	Size
new	robots_2.txt	1.71 KB

Crawl-delay is non-standard but obeyed by at least a couple of major spiders. I removed line breaks per the validator's suggestion. I also read that directories must be followed by a trailing slash, so I added that to both the clean and non-clean URL sections, though it is an open question (and probably not consistent) how spiders will handle the non-clean directives.
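
On the line-break point: in the original robots.txt convention, records are separated by blank lines, so a blank line between "User-agent: *" and the Disallow lines leaves those rules without a User-agent. The sketch below is a small check for that one problem, presumably the kind of thing the validator complained about. `find_orphan_rules` is a made-up name, and this is not a full validator.

```python
def find_orphan_rules(text):
    """Return 1-based line numbers of rule lines that follow a blank line
    without a new User-agent line opening a record first."""
    orphans = []
    in_record = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()
        if not line:
            # Only a truly blank line ends a record; comment-only lines do not.
            if not raw.strip():
                in_record = False
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field == "user-agent":
            in_record = True
        elif not in_record:
            orphans.append(number)
    return orphans


sample = """User-agent: *
Crawl-Delay: 10

# Directories
Disallow: /files/
"""
print(find_orphan_rules(sample))  # [5]
```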

Comment #19

robertdouglass commented 14 August 2006 at 09:53

Status:	Needs work	» Needs review

Comment #20

dries commented 14 August 2006 at 10:42

Status:	Needs review	» Fixed

Committed to CVS HEAD. Thanks.

Comment #21

ideviate commented 18 August 2006 at 09:23

Should we add the lines Disallow: /user/login and Disallow: /?q=user/login?

Comment #22

robertdouglass commented 18 August 2006 at 10:23

Status:	Fixed	» Active

Yeah, good catch. Dries, do you need a patch?

Comment #23

bertboerland commented 21 August 2006 at 14:45

If we protect *.TXT files, we don't need to list them here anymore.

see http://drupal.org/node/79018

Comment #24

knugar commented 3 October 2006 at 14:59

I agree with Robert ( http://drupal.org/node/75916#comment-123192 ): it would be great if robots.txt could be created automatically as part of the menu system.

For now, manually editing my robots.txt is just fine, but letting modules define crawl/no-crawl defaults for menu paths seems like a good idea.

Maybe the crawlability of all such paths could be administered on a special admin page if people don't like the defaults.

Comment #25

pcwick commented 7 January 2007 at 04:54

Version:	x.y.z	» 5.x-dev

Status	File	Size
new	robots.txt_3.patch	602 bytes

Patch adds lines suggested in #21

I have very little experience making patches. Hope it works :-)

Comment #26

pcwick commented 7 January 2007 at 20:46

Category:	feature	» task
Status:	Active	» Needs review

Comment #27

dries commented 8 January 2007 at 12:08

Status:	Needs review	» Fixed

Committed to CVS HEAD. Thanks.

Comment #28

sillygwailo commented 15 January 2007 at 05:55

Status:	Fixed	» Needs review

Status	File	Size
new	allow-aggregator-robots-txt.patch.txt	462 bytes

What's the rationale for disallowing the aggregator? I consider that content, not administrative functions like the other items.

Comment #29

drumm commented 27 June 2007 at 04:31

Version:	5.x-dev	» 6.x-dev

I would like to see this go in the development version first.

Comment #30

cooperaj commented 27 June 2007 at 12:56

I've just been using the google webmaster tools to test out various aspects of the site incl. the robots.txt file. I've come to a startling conclusion.

Disallow: /user/password != Disallow: /user/password/

and

Disallow: /user/password/ *does not include* Disallow: /user/password

I'm running a 5.1 site and I noticed that things that shouldn't be indexed are being indexed, e.g. /contact and /user/login.

To properly protect certain paths it is necessary to:

Disallow: /admin
Disallow: /admin/
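
The trailing-slash behaviour described above can be reproduced with the `urllib.robotparser` module from Python's standard library, which treats a Disallow value as a path prefix. This is only one parser's reading of the rules and other crawlers may differ. The `allowed` helper and the `/user/password/reset` path are invented for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed(rules, path):
    """Parse the given robots.txt lines and ask whether any robot may fetch path."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *"] + rules)
    return parser.can_fetch("*", path)


with_slash = ["Disallow: /user/password/"]
print(allowed(with_slash, "/user/password"))        # True: the page itself is not covered
print(allowed(with_slash, "/user/password/reset"))  # False

without_slash = ["Disallow: /user/password"]
print(allowed(without_slash, "/user/password"))     # False
print(allowed(without_slash, "/user/password/"))    # False
```

With plain prefix matching, the form without the trailing slash already covers the form with it, so listing both is harmless but redundant for a parser that works this way. The slashless form also blocks any other path that happens to begin with the same characters.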

Comment #31

gábor hojtsy commented 27 June 2007 at 22:39

Status:	Needs review	» Patch (to be ported)

Well, aggregator could be content you would like to get indexed (like content gathered from your subsites), or foreign content you would not like to have indexed. I changed the default now to let it be indexed, as you suggest, but this decision is different from site to site. I am not entirely sure this should be ported back, but setting it to that state as drumm indicated.

Comment #32

yaph commented 2 August 2007 at 21:37

Status	File	Size
new	robots-patch.1.9.txt	1.36 KB

Removed all trailing slashes, see also:
http://groups.drupal.org/node/5391#comment-15648

Comment #33

yaph commented 2 August 2007 at 21:38

Status	File	Size
new	robots-patch.1.9_0.txt	1.36 KB

Removed all trailing slashes, see also:
http://groups.drupal.org/node/5391

Comment #34

Drupalzilla.com commented 3 October 2007 at 00:44

There is another patch for robots.txt here:
http://drupal.org/node/180379

Someone recommended that I open a new issue for it. It's my first submitted patch -- I hope I did it right...

Comment #35

Freso commented 20 February 2008 at 21:20

Version:	6.x-dev	» 5.x-dev
Assigned:	robertdouglass	» Unassigned
Status:	Patch (to be ported)	» Reviewed & tested by the community

Status	File	Size
new	robots.txt.d5.allow_aggregator.patch	697 bytes

The attached patch removes the aggregator entries from the robots.txt in Drupal 5. It would seem that the patch has, in all other respects, already been applied to Drupal 6, except for the trailing slashes issue, which I'd say is more at home in #180379: Fix path matching in robots.txt. This bug is about providing a default robots.txt, and that very robots.txt is now available in D5, D6, and D7. As soon as D5's robots.txt has been updated to match the D6 version this issue ended up with, please mark this fixed and/or closed.

(#28 still applies as well, though with a wee bit of fuzz.)

Edit: Updated patch. Had some old stuff in it.

Comment #36

drumm commented 25 February 2008 at 02:18

Status:	Reviewed & tested by the community	» Fixed

Committed to 5.x.

Comment #37

Anonymous (not verified) commented 10 March 2008 at 02:23

Status:	Fixed	» Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.
