| Project: | Drupal core |
| Version: | 5.x-dev |
| Component: | base system |
| Category: | task |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | closed (fixed) |
Issue Summary
I've recently been using Nutch (a web crawler) to build a Drupal-specific search engine. I get to watch how web-crawlers behave when they look at Drupal sites. It is appalling. We let crawlers search our login page, our request new password page, different sorted views of our tables, and lots of other stuff that is just wasteful. Furthermore, if you ever sit and watch your Apache logs roll by, you'll know that search engine traffic is a very large percentage of all traffic for many sites. The situation could be fixed by including a robots.txt file.
Or could it? What about multisite configurations? Well, just putting a robots.txt file in the top level directory locks every site into using one file, which clearly won't suffice.
Thus I propose that we adopt the strategy I used for my robotstxt module for core. We add an alias "robots.txt", add a variable of the same name, and output it when robots.txt is asked for. You can edit the variable as an administrator in a textarea and we provide sensible defaults tailored to Drupal sites.
I'll roll a patch and it won't be more than about 10 LOC (minus the actual robots.txt we ship with), but I want to know from a core committer that there is interest.
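To make the per-site idea concrete, here is a minimal sketch outside Drupal, assuming each site keeps its own stored "robots.txt" variable and falls back to a shipped default. The names `robots_txt_for`, `DEFAULT_ROBOTS`, and the example hosts are hypothetical and are not Drupal APIs.

```python
# Illustrative only: these names are assumptions, not part of Drupal.
DEFAULT_ROBOTS = "User-agent: *\nDisallow: /admin/\n"


def robots_txt_for(host, site_variables):
    """Return the robots.txt body for one site in a multisite setup.

    site_variables maps a host name to that site's stored variables,
    mimicking a per-site "robots.txt" variable edited by an administrator.
    Sites without the variable get the shipped default.
    """
    variables = site_variables.get(host, {})
    return variables.get("robots.txt", DEFAULT_ROBOTS)


# Two hypothetical sites: one with a custom value, one without.
sites = {
    "shop.example.com": {"robots.txt": "User-agent: *\nDisallow: /cart/\n"},
    "blog.example.com": {},
}

print(robots_txt_for("shop.example.com", sites))
print(robots_txt_for("blog.example.com", sites))
```

A single static file in the top-level directory cannot vary by host, which is why the lookup here is keyed on the requested site.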
Comments
#1
Change in thinking has occurred. I now think that Drupal should ship with a default robots.txt and let the robotstxt module suffice for people with multisite needs. The search is on for the optimal robots.txt file. Here's a start:
User-agent: *
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /admin/
Disallow: /user/register
Disallow: /user/password
#2
The kind reviewers of this "patch" will need to create the file with the above text rather than apply a patch. The file should be called robots.txt and be in the root directory.
#3
I think this is a very good idea.
Here is the robots.txt file I've been using with my 4.5 site. Obviously, some paths have changed in 4.6, 4.7 and 4.8. But perhaps it will give you a couple more ideas.
User-agent: *
Crawl-Delay: 10
Disallow: */add/
Disallow: /?q=admin
Disallow: /admin/
Disallow: /database/
Disallow: /includes/
Disallow: /modules/
Disallow: /scripts/
Disallow: /themes/
Disallow: /xmlrpc.php
Disallow: ?q=admin
Disallow: cron.php
Disallow: error.php
Disallow: xmlrpc.php
#4
Your patch assumes that clean URLs are enabled?
#5
I don't know much about robots.txt, but this is what I use, partly as a result of the threads on hiding feeds and print pages:
Disallow: /node/feed
Disallow: /blog/feed
Disallow: /aggregator/sources
Disallow: /tracker
Disallow: /comment/reply
Disallow: /node/add
Disallow: /user
Disallow: /files
Disallow: /search
Disallow: /book/print
Disallow: /archive
Disallow: /trackback
I added the recommended pieces above since I didn't have any of them in my file.
Maria
#6
It just goes to show that the above approach is all wrong. Dries, what do you think of building this into the menu system, so that modules can do this:
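<?php
// Proposed: a new 'crawl' key on menu items ("example" is a placeholder module name).
function example_menu($may_cache) {
  $items = array();
  $items[] = array(
    'path' => 'some/path',
    'crawl' => FALSE,
  );
  return $items;
}
?>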
That way we could generate robots.txt dynamically and take into account all modules' paths, as well as things like clean urls.
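A rough sketch of what such dynamic generation could look like, written in Python rather than against the menu system. It assumes each menu item is a simple mapping with a `path` and an optional `crawl` flag; `build_robots_txt` is a hypothetical helper, and emitting both the clean and the `?q=` form is one possible way to account for clean URLs.

```python
def build_robots_txt(menu_items, crawl_delay=None):
    """Build a robots.txt body from menu items flagged 'crawl': False.

    Each blocked path is emitted twice: once in clean-URL form and once
    in the "?q=" form used when clean URLs are disabled.
    """
    lines = ["User-agent: *"]
    if crawl_delay is not None:
        lines.append("Crawl-delay: %d" % crawl_delay)
    # Items without a 'crawl' key are treated as crawlable.
    blocked = sorted(
        {item["path"] for item in menu_items if not item.get("crawl", True)}
    )
    for path in blocked:
        lines.append("Disallow: /%s" % path)
    for path in blocked:
        lines.append("Disallow: /?q=%s" % path)
    return "\n".join(lines) + "\n"


# Hypothetical items as modules might declare them.
items = [
    {"path": "admin", "crawl": False},
    {"path": "node/add", "crawl": False},
    {"path": "node"},
]
print(build_robots_txt(items))
```

Because the list is derived from what modules declare, a disabled module would simply contribute no lines.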
#7
please take a look at an old bookpage i wrote up at http://drupal.org/node/22265
esp the option to use:
User-agent: *
Crawl-Delay: 10
Dries was against including robots.txt functionality in 2005 ( http://drupal.org/node/14177 ) but I think it is very, very standard to ship with a default robots.txt and we should. In fact, I would rather ship Drupal with a robots.txt than a favicon.
see also http://cvs.drupal.org/viewcvs/drupal/drupal/robots.txt?hideattic=0&rev=1...
#8
I'm OK with a _simple_ robots.txt.
1. Keep it short and simple.
2. Add some documentation so people can extend it as they see fit.
#9
how about:
# small robots.txt
# more information about this file can be found at
# http://www.robotstxt.org/wc/robots.html
# lines beginning with the pound ("#") sign are comments and can be deleted.
# in case your drupal site is in a directory
# lower than your docroot (e.g. /drupal)
# please add this before the /-es below
# to stop a polite robot indexing an exampledir
# add a line like (without the #'s)
# user-agent: polite-bot
# Disallow: /exampledir/
# a list of known bots can be found at
# http://www.robotstxt.org/wc/active/html/index.html
# see http://www.sxw.org.uk/computing/robots/check.html
# for syntax checking
User-agent: *
Crawl-Delay: 10
Disallow: /comment/reply
Disallow: /node/add
Disallow: /files
Disallow: /search
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /admin/
Disallow: /user/register
Disallow: /user/password
Disallow: /?q=admin
Disallow: /xmlrpc.php
Disallow: /?q=admin
Disallow: /cron.php
Disallow: /error.php
Disallow: /xmlrpc.php
It might be a bit longish, but it covers most basic options and it is documented. Note that wildcards in robots.txt don't work, so lines like */add* won't work.
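A small check along these lines could flag such lines before a file ships. This is only a sketch, assuming crawlers do plain prefix matching against URL paths; `find_suspect_rules` is a hypothetical helper, not an existing tool.

```python
def find_suspect_rules(robots_txt):
    """Return (line_number, value, reason) for Disallow values that a
    prefix-matching crawler is unlikely to treat as intended."""
    problems = []
    for number, line in enumerate(robots_txt.splitlines(), start=1):
        # Drop trailing comments, then split "Field: value".
        line = line.split("#", 1)[0].strip()
        field, sep, value = line.partition(":")
        if not sep or field.strip().lower() != "disallow":
            continue
        value = value.strip()
        if "*" in value:
            problems.append((number, value, "wildcard is matched literally"))
        elif value and not value.startswith("/"):
            problems.append((number, value, "path does not start with '/'"))
    return problems


sample = (
    "User-agent: *\n"
    "Disallow: */add/\n"
    "Disallow: cron.php\n"
    "Disallow: /admin/\n"
)
for problem in find_suspect_rules(sample):
    print(problem)
```

The second check is there because a URL path always begins with a slash, so a value like `cron.php` can never be a prefix of one.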
#10
You've got Disallow: /?q=admin in there twice.
#11
# robots.txt
#
# This file aims to prevent the crawling and idexing of certain parts of your site by
# webcrawlers and spiders run by sites like Yahoo! and Google. By telling
# these "robots" where not to go on your site, you save bandwidth and server
# resources, and the quality of their crawling and indexing is improved as well.
#
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/wc/robots.html
#
# To stop a polite robot from indexing an exampledir,
# add a uncommented line (without the #'s), like the following:
#
# user-agent: polite-bot
# Disallow: /exampledir/
# A list of know 'bots can be found at:
# http://www.robotstxt.org/wc/active/html/index.html
#
# See this site for syntax checking:
# http://www.sxw.org.uk/computing/robots/check.html
User-agent: *
Crawl-Delay: 10
# Directories
Disallow: /files/
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/
# Files
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /CHANGELOG.txt
Disallow: /MAINTAINERS.txt
Disallow: /LICENSE.txt
Disallow: /UPGRADE.txt
# Paths (Clean URLs)
Disallow: /admin/
Disallow: /node/add/
Disallow: /search/
Disallow: /comment/reply/
Disallow: /contact
Disallow: /user/register
Disallow: /user/password
Disallow: /logout
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=comment/reply/
Disallow: /?q=contact
Disallow: /?q=user/register
Disallow: /?q=user/password
Disallow: /?q=logout
#12
These are usually added as a standard:
# W3C Link checker
User-agent: W3C-checklink
Disallow:
# Exclude stress-testing tools
User-Agent: stress-agent
Disallow: /
#13
There are some typos in Robert's text.
#14
# robots.txt
#
# This file aims to prevent the crawling and indexing of certain parts of your site by
# webcrawlers and spiders run by sites like Yahoo! and Google. By telling
# these "robots" where not to go on your site, you save bandwidth and server
# resources, and the quality of their crawling and indexing is improved as well.
#
# For more information about the robots.txt standard, see:
# http://www.robotstxt.org/wc/robots.html
#
# To stop a polite robot from indexing an exampledir,
# add an uncommented line (without #), like the following:
#
# user-agent: polite-bot
# Disallow: /exampledir/
# A list of known 'bots can be found at:
# http://www.robotstxt.org/wc/active/html/index.html
#
# See this site for syntax checking:
# http://www.sxw.org.uk/computing/robots/check.html
User-agent: *
Crawl-Delay: 10
# Directories
Disallow: /files/
Disallow: /database/
Disallow: /includes/
Disallow: /misc/
Disallow: /modules/
Disallow: /sites/
Disallow: /themes/
Disallow: /scripts/
Disallow: /updates/
Disallow: /profiles/
# Files
Disallow: /xmlrpc.php
Disallow: /cron.php
Disallow: /update.php
Disallow: /install.php
Disallow: /INSTALL.mysql.txt
Disallow: /INSTALL.pgsql.txt
Disallow: /CHANGELOG.txt
Disallow: /MAINTAINERS.txt
Disallow: /LICENSE.txt
Disallow: /UPGRADE.txt
# Paths (Clean URLs)
Disallow: /admin/
Disallow: /node/add/
Disallow: /search/
Disallow: /comment/reply/
Disallow: /contact
Disallow: /user/register
Disallow: /user/password
Disallow: /logout
# Paths (no clean URLs)
Disallow: /?q=admin/
Disallow: /?q=node/add/
Disallow: /?q=search/
Disallow: /?q=comment/reply/
Disallow: /?q=contact
Disallow: /?q=user/register
Disallow: /?q=user/password
Disallow: /?q=logout
#15
I have been using this since 4.5 or so.
The ideas above are great (clean vs. regular URLs, excluding feeds, print pages, etc.)
User-agent: *
Crawl-Delay: 10
Disallow: /database
Disallow: /includes
Disallow: /modules
Disallow: /scripts
Disallow: /themes
Disallow: /aggregator
Disallow: /tracker
Disallow: /comment/reply
Disallow: /node/add
Disallow: /search
#16
added aggregator
#17
Some lines end with a trailing slash while others don't. Is that intentional?
The robots.txt file doesn't validate at all. Test with http://www.sxw.org.uk/computing/robots/check.html.
#18
Crawl-delay is non-standard but obeyed by at least a couple of major spiders. I removed line breaks per the validator's suggestion. I also read that directories must be followed by a trailing slash, so I added that to both the clean and non-clean URLs sections, though it is a question (and probably not consistent) how spiders will handle the non-clean directives.
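As one example of a consumer that reads the directive, Python's standard-library `urllib.robotparser` exposes it through `crawl_delay()`. The sketch below feeds it a shortened three-line file; the agent name "ExampleBot" is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A shortened file: one record for all agents, with a crawl delay.
lines = [
    "User-agent: *",
    "Crawl-delay: 10",
    "Disallow: /admin/",
]

parser = RobotFileParser()
parser.parse(lines)

# "ExampleBot" has no record of its own, so the "*" record applies.
delay = parser.crawl_delay("ExampleBot")
print(delay)
```

Whether a given spider honours the value is up to that spider; this only shows that the line parses as a per-record setting.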
#19
#20
Committed to CVS HEAD. Thanks.
#21
should we add the lines
Disallow: /user/login and Disallow: /?q=user/login
#22
Yeah, good catch. Dries, do you need a patch?
#23
if we protect *.TXT files we don't need to list them here anymore
see http://drupal.org/node/79018
#24
I agree with Robert http://drupal.org/node/75916#comment-123192, it would be great if the robots.txt could be created automatically as part of the menu system.
For now, manually editing my robots.txt is just fine but letting modules define defaults crawl/not-crawl for menu paths seems like a good idea.
Maybe the crawlability of all such paths could be administered on a special admin page if people don't like the defaults.
#25
Patch adds lines suggested in #21
I have very little experience making patches. Hope it works :-)
#26
#27
Committed to CVS HEAD. Thanks.
#28
What's the rationale for disallowing the aggregator? I consider that content, not administrative functions like the other items.
#29
I would like to see this go in the development version first.
#30
I've just been using the google webmaster tools to test out various aspects of the site incl. the robots.txt file. I've come to a startling conclusion.
Disallow: /user/password != Disallow: /user/password/
and
Disallow: /user/password/ *does not include* Disallow: /user/password
I'm running a 5.1 site and I noticed that all the things that shouldn't be indexed are being indexed, e.g. /contact and /user/login.
To properly protect certain paths it is necessary to:
Disallow: /admin
Disallow: /admin/
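The observation is consistent with plain prefix matching, where a URL is blocked only if its path starts with the Disallow value. A minimal sketch under that assumption follows; `is_disallowed` is a hypothetical helper, and real crawlers may differ in details.

```python
def is_disallowed(url_path, disallow_values):
    """True if url_path starts with any non-empty Disallow value."""
    return any(
        bool(value) and url_path.startswith(value) for value in disallow_values
    )


with_slash = ["/user/password/"]
without_slash = ["/user/password"]

# The trailing-slash rule is longer than the bare path, so it cannot match it.
print(is_disallowed("/user/password", with_slash))
# The rule without a slash is a prefix of both forms.
print(is_disallowed("/user/password", without_slash))
print(is_disallowed("/user/password/", without_slash))
```

Under this model the form without a trailing slash already covers the form with one, although it also covers any other path that shares the prefix.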
#31
Well, aggregator could be content you would like to get indexed (like content gathered from your subsites), or foreign content you would not like to have indexed. I changed the default now to let it be indexed, as you suggest, but this decision is different from site to site. I am not entirely sure this should be ported back, but setting it to that state as drumm indicated.
#32
Removed all trailing slashes, see also:
http://groups.drupal.org/node/5391#comment-15648
#33
Removed all trailing slashes, see also:
http://groups.drupal.org/node/5391
#34
There is another patch for robots.txt here:
http://drupal.org/node/180379
Someone recommended that I open a new issue for it. It's my first submitted patch -- I hope I did it right...
#35
The attached patch removes the aggregator entries from the robots.txt in Drupal 5. It would seem that the patch has, in all other respects, already been applied to Drupal 6, except for the trailing slashes issue, which I'd say is more at home in #180379: Fixing Robots.txt. This bug is about providing a default robots.txt, and that very robots.txt is now available in D5, D6, and D7. As soon as D5 has been updated to be similar to the robots.txt of D6 this issue ended up with, please mark this fixed and/or closed.
(#28 still applies as well, though with a wee bit of fuzz.)
Edit: Updated patch. Had some old stuff in it.
#36
Committed to 5.x.
#37
Automatically closed -- issue fixed for two weeks with no activity.