Though it is not part of the original (and itself unofficial) robots.txt specification, many bots obey the "Crawl-delay:" parameter. Since Drupal sites seem to be popular with search engines, and lots of people get more traffic from aggressive bots than from actual visitors, it might be wise to slow the robots down by adding a robots.txt entry like:

User-Agent: *
Crawl-Delay: 10

(the value is the time in seconds between page requests)
Slurp (the Yahoo!/AltaVista crawler) and Microsoft's bots obey this parameter; Googlebot does not yet, but most likely will in 2.1+.

Does it make sense to ship Drupal with a default robots.txt containing this parameter? If so, there should be something in the documentation about moving the file to the docroot in case Drupal is installed in a subdirectory.
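
For illustration only (a sketch, assuming Drupal lives in a /drupal subdirectory below the docroot - robots.txt always has to sit at the docroot and its rules are matched against the full request path):

User-agent: *
Crawl-Delay: 10
# illustrative paths: the subdirectory prefix is needed; adjust to your install
Disallow: /drupal/admin/
Disallow: /drupal/cron.php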

Comments

bertboerland’s picture

Assigned: Unassigned » bertboerland

Seems like there is no robots.txt in CVS anymore?

The old one was something like this (delay added):

# small robots.txt
# more information about this file can be found at
# http://www.robotstxt.org/wc/robots.html

# in case your drupal site is in a directory
# below your docroot (e.g. /drupal),
# please prepend that directory to the paths below

# to stop a polite robot indexing an exampledir
# add a line like
# user-agent: polite-bot
# Disallow: /exampledir/

# a list of known bots can be found at
# http://www.robotstxt.org/wc/active/html/index.html

# see http://www.sxw.org.uk/computing/robots/check.html
# for syntax checking

User-agent: *
Crawl-Delay: 10
Disallow: /?q=admin
Disallow: /admin/
Disallow: /cron.php
Disallow: /xmlrpc.php
Disallow: /database/
Disallow: /includes/
Disallow: /modules/
Disallow: /scripts/
Disallow: /themes/
Disallow: */add/

Morbus Iff’s picture

It appears that robots.txt was removed in 2002.

Uwe Hermann’s picture

There's no indication why it was removed; that would be interesting to know.

+1 for adding a default robots.txt file, if you ask me.

Morbus Iff’s picture

-1 from me - there's not enough "oh my god, it's soOOooO required" about this.

The [Files] rule in the .htaccess restricts ALL access to most of the files listed in your robots.txt (could you do some more research to find out where your copy came from? I can't find anything even remotely like it in the CVS repos). The only additional coverage your robots.txt gives is for /?q=admin and /admin/ (largely unspiderable already due to access errors), /cron.php and /xmlrpc.php (largely unindexable because they return no content), and the Crawl-delay line you've yourself described as even more unofficial than the already-unofficial spec.
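
(For reference, I mean the kind of rule sketched below - this is a paraphrase, not the exact block from the shipped .htaccess, and the extension list is only illustrative. It denies HTTP access to code files regardless of what robots.txt says.)

<Files ~ "\.(inc|module|theme|engine|profile|sql|sh)$">
  # illustrative extension list; the real rule differs in detail
  Order deny,allow
  Deny from all
</Files>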

Arguably, the only worthwhile addition smells too much of a hack to me.

TDobes’s picture

The robots.txt referenced here as the "old" one is not actually old... it's from the contrib CVS. Here's the old robots.txt. It disallows everything, so I wouldn't recommend it.

I also would have to say -1 to adding the robots.txt file from contrib (referenced in comment #1) to core, as it disallows some things I wouldn't want to hide from crawlers (i.e. the entire themes and modules directory structure... what about CSS for the WayBack Machine?). Additionally, it mentions several URLs I'd prefer not to make public (i.e. cron.php, xmlrpc.php, and scripts). Although they should be safeguarded in other ways, I'd prefer not to go advertising them in a robots.txt file. As long as they aren't linked anywhere on the site, crawlers won't encounter them anyway.

That said, I wouldn't mind distributing Drupal with a robots.txt, even if it's just an empty file. As part of every new install, I usually do a "touch robots.txt" just to avoid 404s from crawlers looking for the file. The crawl-delay seems like a reasonable plan as long as we know for certain that it has no negative effects on crawlers that do not support it. For sites that enable anonymous content creation, I wouldn't mind disallowing the node/add/* pages; indexing them makes little sense.
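
Concretely, a minimal robots.txt covering both points might look like this (just a sketch, not a tested proposal; the non-clean-URL line is my own addition):

User-agent: *
Crawl-Delay: 10
# keep anonymous content-creation forms out of the index
Disallow: /node/add/
Disallow: /?q=node/add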

Dries’s picture

Not going to commit this one. Maybe we should document this in the handbook though.

bertboerland’s picture

Jacob’s picture

Version: x.y.z » 5.5

bertboerland,

I'd like to report an error: the http://www.robotstxt.org/wc/active/html/index.html link is dead on http://drupal.org/node/22265.

Jacob.

bertboerland’s picture

Title: Introduce crawl delay in robots.txt » Introduce crawl delay in robots.txt in help pages
Status: Closed (won't fix) » Fixed

Deleted the 404 links. Thanks for reporting; solved.

Anonymous’s picture

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.