Fixing Robots.txt
| Project: | Drupal |
| Version: | 7.x-dev |
| Component: | other |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | needs work |
This patch cuts down on the duplicate content that is spidered by search engine crawlers. It is difficult to make a single robots.txt file for all Drupal sites, but this one fixes some errors in the current default file and adds some new rules.
Wildcards (*) and end-of-string ($) characters are not part of the robots.txt standard, but they are accepted by Google, Yahoo, and MSN. You can test the rules in Google's Robots.txt tool in their Webmaster Tools: http://www.google.com/webmasters/tools
For example, you need to use a wildcard to block these tables that can be sorted in multiple ways:
http://drupal.org/forum/2?sort=asc&order=Last+reply&page=428
The rule is:
Disallow: /*sort=
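To see what the wildcard rule above would catch, here is a minimal sketch of the extended matching described in this post, where `*` matches any run of characters and a trailing `$` anchors the end of the URL. The function names `rule_to_regex` and `matches` are made up for illustration, and this is only an approximation of how the search engines named above interpret such rules, not their actual implementation.

```python
import re

def rule_to_regex(rule):
    """Translate an extended robots.txt path rule into a compiled regex.

    Assumption: '*' matches any sequence of characters and a trailing '$'
    anchors the end of the URL; everything else is matched literally
    from the start of the path.
    """
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape the literal pieces and join them with '.*' for each wildcard.
    pattern = ".*".join(re.escape(piece) for piece in body.split("*"))
    return re.compile(pattern + (r"\Z" if anchored else ""))

def matches(rule, url_path):
    """Return True if the rule applies to the given path (including query string)."""
    return rule_to_regex(rule).match(url_path) is not None

sortable = "/forum/2?sort=asc&order=Last+reply&page=428"
print(matches("/*sort=", sortable))    # the sorted forum table
print(matches("/*sort=", "/forum/2"))  # the plain forum page
```

Under these assumptions the sorted table URL is matched while the plain forum page is left alone.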
Pages like these do not need to get spidered because there are other ways for crawlers to access the content:
drupal.org/tracker/12322
drupal.org/tracker/%5Buserid?page=9
drupal.org/tracker/order/author/desc?page=55
Trailing slashes have been removed from some URLs in the patch, but not all. For example, the current rule to block /modules/ is:
Disallow: /modules/
You cannot access the URL example.com/modules from a browser -- the server will add the trailing slash for you because it is a directory. Also, if you remove the trailing slash in this case, it might block a page of regular content with a URL like: example.com/modules-are-good
It is generally better not to have duplicate content indexed by search engines unless your site has a lot of link popularity. The aggregator module can create a lot of duplicate content.
The path /aggregator is blocked without a trailing slash because the following URLs should probably be blocked:
/aggregator
/aggregator/sources
/aggregator/categories
/aggregator?page=3
Removing the trailing slash on /aggregator would also block someone's post named example.com/aggregator-module-tutorial. So it may be better to change the attached patch to:
Disallow: /aggregator/
Disallow: /aggregator?
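The trade-off between the two forms can be checked with plain prefix matching, which is how Disallow rules without wildcards are compared against a URL. This is a small sketch; the helper name `blocking_rules` is invented for the example.

```python
def blocking_rules(path, rules):
    """Return the rules that block the path under plain prefix matching."""
    return [rule for rule in rules if path.startswith(rule)]

PATCH_RULE = ["/aggregator"]
ALTERNATIVE = ["/aggregator/", "/aggregator?"]

paths = [
    "/aggregator",
    "/aggregator/sources",
    "/aggregator/categories",
    "/aggregator?page=3",
    "/aggregator-module-tutorial",
]

for path in paths:
    print(path, blocking_rules(path, PATCH_RULE), blocking_rules(path, ALTERNATIVE))
```

Note that with the two-rule alternative the bare /aggregator path is no longer matched by either rule, so it would need its own rule (for example the `/aggregator$` form, on crawlers that accept `$`) if it should stay blocked.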
The following rule prevents all feeds from being crawled, except the main one at /rss.xml.
Disallow: /*/feed$
The following rule prevents the user track pages from being crawled:
Disallow: /*/track$
The first rule below prevents /search (a 302 redirect) from being hit, while the second one blocks the search results:
Disallow: /search$
Disallow: /search/
If you did Disallow: /search it would also block pages of content like example.com/search-engine-optimization -- that is why there are two rules above.
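The `$`-terminated rules above can be tried out with a short sketch built on Python's `fnmatch` module. It assumes the extended semantics described earlier in this post (`*` as a wildcard, trailing `$` as end-of-URL); `is_blocked` is a hypothetical helper. Because `fnmatch` also treats `?` and `[` as special, this shortcut is only suitable for rules like these that contain neither.

```python
from fnmatch import fnmatchcase

RULES = ["/*/feed$", "/*/track$", "/search$", "/search/"]

def is_blocked(path, rules=RULES):
    """Check a path against the rules, assuming '*' and '$' are supported."""
    for rule in rules:
        if rule.endswith("$"):
            # '$' means the rule must match the whole path.
            if fnmatchcase(path, rule[:-1]):
                return True
        elif fnmatchcase(path, rule + "*"):
            # Without '$' a rule is a prefix: anything may follow it.
            return True
    return False

for path in ["/rss.xml", "/blog/feed", "/user/12/track", "/search",
             "/search/node/drupal", "/search-engine-optimization"]:
    print(path, is_blocked(path))
```

Under these assumptions /search and /search/node/drupal are both blocked, while /search-engine-optimization and /rss.xml remain crawlable.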
There is no way to make a single perfect robots.txt file for all sites, but this patch should help cut down on the server load from crawlers and improve general SEO of the basic Drupal installation.
Further explanations of the reasoning behind these proposed changes can be found on my Drupal robots.txt tutorial here:
http://drupalzilla.com/robots-txt
Each module that you add to a Drupal installation can require additional robots.txt rules. I left out robots.txt rules for modules in the attached patch, even though it might be a good idea to put some rules in for common modules. I'm building a database of Drupal modules with instructions on how to modify the robots.txt file for each extra module that is installed:
http://drupalzilla.com/module
| Attachment | Size | Status | Test result | Operations |
|---|---|---|---|---|
| robots_2.patch | 1.36 KB | Ignored | None | None |

#1
thanks for working on this ... please use unified diff format for diffs. see diffandpatch. we're so used to them that i can't recall which lines of yours are adds versus deletes.
#2
Sorry about that (my first time submitting a patch).
I've attached a new patch made with
cvs diff -up.
#3
setting back to review.
#4
Why do we want to do
+Disallow: /node$
and its q equivalent?
Do we really want to remove
-Disallow: /contact/
and add
+Disallow: /contact$
and others, or do we just want to add the $ ending ones? The three engines you mention aren't the only ones.
We should think about creating robots.txt on the fly in the module activation processes. Activate/deactivate module could add/remove robots.txt text.
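The idea of building robots.txt from the set of active modules could look roughly like the sketch below. It is written in Python only to illustrate the data flow, not as Drupal code; the `MODULE_RULES` mapping, the `BASE_RULES` list and `build_robots_txt` are all hypothetical, and the paths are just examples taken from this thread.

```python
# Hypothetical mapping of module name -> paths that module would contribute.
MODULE_RULES = {
    "aggregator": ["/aggregator/"],
    "contact": ["/contact/"],
    "search": ["/search/"],
}

# Hypothetical rules that are always present, regardless of modules.
BASE_RULES = ["/admin/", "/user/login/"]

def build_robots_txt(enabled_modules):
    """Assemble robots.txt text from the base rules plus each enabled module's rules."""
    paths = list(BASE_RULES)
    for module in sorted(enabled_modules):
        paths.extend(MODULE_RULES.get(module, []))

    lines = ["User-agent: *"]
    for path in paths:
        lines.append(f"Disallow: {path}")                   # clean URL form
        lines.append(f"Disallow: /?q={path.lstrip('/')}")   # non-clean URL form
    return "\n".join(lines) + "\n"

print(build_robots_txt({"search", "contact"}))
```

Enabling or disabling a module then only changes the set passed in; hand-written rules would still need a place to live alongside the generated ones.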
#5
Wouldn't that mean duplicate content between example.com/ and example.com/node if /node is the front page?
#6
Yes, I suppose. But then there is http://drupal.org/project/globalredirect which would correct that issue without needing to modify the robots.txt file. And adding http://drupal.org/project/gsitemap can help even further. However for the default install, I see your point.
#7
example.com/node is duplicate content of example.com/
example.com/node/1 shouldn't be blocked though.
It looked to me like the current rule,
Disallow: /contact/
was an attempt to block the contact forms. It doesn't block the default contact form because of trailing slash. But it might be best just to leave that rule in its current form.
That is a good idea because a lot of modules create duplicate content problems -- as long as you could still have precise control over the robots.txt rules by hand.
#8
dynamic robots.txt for modules could maybe be an addition to this: http://drupal.org/node/53579? - either way it's a very nice idea.
#9
– per default, this is true, but it can be easily changed, even by people who do not have access to the file system and thus won't be able to edit robots.txt.
Also, where does your use of "$" come from? I haven't been able to discern its function from anything I could find at robotstxt.org or Wikipedia...
It was also agreed upon in issue 75916 to have aggregator indexed by default, so that should be changed to "If you do not want your aggregator pages to be indexed, uncomment the following line".
To finish, I'd recommend reading through issue 75916, as it contains some hints and has some discussion on this.
#10
$ is an end of line character, not in the spec, but recognised by all major search engines (this is covered in the issue discussion and I almost cut and pasted).
example.com/node - I agree with though, a lot of sites don't use /node as the front page.
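On the point that `$` (and `*`) are outside the original spec: a crawler that only does the plain prefix comparison would read those characters as ordinary text. The sketch below assumes exactly that behaviour; `literal_match` is an invented name, and real crawlers may differ.

```python
def literal_match(rule, path):
    """Assumed strict reading: the rule is a plain prefix and '*' and '$'
    are ordinary characters with no special meaning."""
    return path.startswith(rule)

CASES = [
    ("/search$", "/search"),
    ("/*/feed$", "/blog/feed"),
    ("/*sort=", "/forum/2?sort=asc"),
    ("/search/", "/search/node/drupal"),
]

for rule, path in CASES:
    print(f"{rule!r} vs {path!r}: {literal_match(rule, path)}")
```

Under this assumption the extended rules simply never match for such a crawler, so they neither help nor over-block, while the plain /search/ rule keeps working everywhere.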
#11
Bumping to 7.x
#12
Okay, I've had some time to turn this over in my mind, and I'm feeling rather uneasy about using
* and $ in the robots.txt, as they're not standard. Google and co. might well support them, but I'll bet you that there are tons of (polite) robots out there that actually follow the standard and don't care for (or possibly don't even know of) the extensions Google et al. use. And to me they seem like they would confuse standards-compliant robots.
The attached patch removes some of the trailing slashes, namely /foo/bar/ to /foo/bar, leaving /foo/ alone (in case someone wants to make a /fooxyz node). Even if the non-standards approach is deemed a road worth continuing along, this patch will provide a temporary boost to robots.txt's effectiveness.
#13
Patch still applied.
#14
Actually, the patch adds new rules without affecting the existing ones... was that what you meant to do?
#15
Yes, this is what I meant to do.
#16
edit: Nevermind :\ Should read the actual patch before replying..
#17
Um... huh? Which rule, specifically, prevents http://drupal.org/forum/...etc from being indexed? These are all of the new ones:
+Disallow: /comment/reply
+Disallow: /node/add
+Disallow: /user/register
+Disallow: /user/password
+Disallow: /user/login
+Disallow: /?q=comment/reply
+Disallow: /?q=node/add
+Disallow: /?q=user/password
+Disallow: /?q=user/register
+Disallow: /?q=user/login
#18
The last submitted patch failed testing.
#19
See: #335122: Test clean HEAD after every commit and http://pastebin.ca/1258476
#20
The last submitted patch failed testing.
#21
Re-roll.
#22
Also: Marked #278775: Allow robots.txt to disallow URLs with "sort" and "filter" in them a duplicate of this.
#23
reposting for bot's sake.
#24
#25
I wonder why we need
/?q=user/logout/
-- can something follow the logout-part of the path?
#26
Isn't it because that leads you to a 403? Same as admin?
#27
It was added with #75916: Include a default robots.txt (commit), but that issue doesn't seem to mention why it is using the slash at the end of it. I think the safe thing to do is to keep it; should we find out it causes trouble, it can be removed later.
#28
Don't see any remaining issues here, unless we want to get rid of some of the trailing slashes.
#29
We could do with a comment at the top of this file that explains why the paths are repeated. Although I would love a reason better than "We don't know why the slashes are there" :P I'm wondering if we should just remove them, since the contents of this file with this patch are absolutely baffling.
Any SEO experts in the house?
#30
http://www.google.com/support/webmasters/bin/answer.py?answer=35237
#31
@earnie: Can you explain how that page explains why we need both trailing and not trailing slashes on every path? And if so, could you formulate that into a comment and re-roll the patch?
#32
I think it more says we need the ones with the slash more than we need the ones without it. See http://www.google.com/support/webmasters/bin/answer.py?answer=40360&ctx=... for examples.
In particular:
# To block a directory and everything in it, follow the directory name with a forward slash.
Disallow: /junk-directory/
#33
IISC there's something important missing here: wildcard paths for multilingual sites, as posted on #347515: robots.txt: add wildcarded paths for multilingual sites:
# For multi-language sites (wildcards supported at least
# by GoogleBot, MSNBot and Yahoo Slurp web spiders)
# Paths (clean URLs)
Disallow: /*/admin/
Disallow: /*/comment/reply/
Disallow: /*/contact/
Disallow: /*/logout/
Disallow: /*/node/add/
Disallow: /*/search/
Disallow: /*/user/register/
Disallow: /*/user/password/
Disallow: /*/user/login/
# Paths (no clean URLs)
Disallow: /*/?q=admin/
Disallow: /*/?q=comment/reply/
Disallow: /*/?q=contact/
Disallow: /*/?q=logout/
Disallow: /*/?q=node/add/
Disallow: /*/?q=search/
Disallow: /*/?q=user/password/
Disallow: /*/?q=user/register/
Disallow: /*/?q=user/login/
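Since the multilingual list above is the same set of paths repeated behind a `/*/` prefix, it could be generated rather than maintained by hand. A minimal sketch, with `BASE_PATHS` and `multilingual_rules` as made-up names:

```python
# Paths taken from the list above, without the leading slash.
BASE_PATHS = [
    "admin/", "comment/reply/", "contact/", "logout/", "node/add/",
    "search/", "user/register/", "user/password/", "user/login/",
]

def multilingual_rules(paths):
    """Build wildcard-prefixed Disallow lines for clean and non-clean URLs."""
    lines = ["# Paths (clean URLs)"]
    lines += [f"Disallow: /*/{path}" for path in paths]
    lines.append("# Paths (no clean URLs)")
    lines += [f"Disallow: /*/?q={path}" for path in paths]
    return lines

print("\n".join(multilingual_rules(BASE_PATHS)))
```

The `/*/` prefix only has an effect on crawlers that support wildcards, as the comment in the rules already notes.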
#34
Is there a reason why /user/ isn't blocked? The current robots.txt file blocks "/user/login" but nothing addresses "/user" (which routes to the same login page). Seems to me addition of the following would be required:
Disallow: /user/
Disallow: /?q=user/
Similarly, attempts to navigate to /system/ or /system/files/ result in a "page not found" error. This is good. But when files are attached to nodes, those files become available as /system/files/foo.txt (replace foo.txt with the appropriate filename+extension). I have seen said file attachments indexed by Google (not good, imho). Wouldn't the following additions to robots.txt prevent the indexing of any node-attached files?
Disallow: /system/files/
Disallow: /?q=system/files/
#35
Readjusting version and status, as there is already a patch in play that needs work according to webchick's comments in #29.
#36
Thanks @VeryMisunderstood, I'm working with 6.14 and didn't consider how changing the version from what was defaulted might be a problem. My apologies.
http://robotstxt.org is supposed to be the definitive source but they're currently generating a 503 server error. So I hit Wikipedia.
What that tells me is inclusion of the trailing slash obviates the need for additional entries including the same path. So 1-3 below will disallow indexing of content within specific directories subordinate to "http://foo.com/user/" while 4 accomplishes the same in addition to disallowing _everything else_ subordinate to "http://foo.com/user/":
As far as I can tell, the current approach (inclusion of the trailing slash) is correct. But it seems to me "Disallow: /user/" could replace "Disallow: /user/register/, /user/password/, /user/login/" (same for the non-clean URL equivalents).
Finally, in my initial comment (#34), I also suggested adding another dir path (/system/files/) so as to block indexing of any files attached to nodes. I think it might make sense to simply block /system/ but haven't looked at that carefully to prevent unintended consequences.
#37
.
#38
Following up on my last post (#36), http://robotstxt.org continues to be offline. Wikipedia isn't a bad source but also not the most reliable. So I checked W3C (w3.org) and they provide the same details on "robots.txt".
I believe this confirms my suspicion that "Disallow: /user/" would make "Disallow: /user/register/", "/user/password/", and "/user/login/" redundant (same for the non-clean URL equivalents). Additionally, "/user/" yields a login page that is currently not blocked from indexing. So adding "Disallow: /user/" (to replace the 3 /user paths listed in the current robots.txt) will block the one that is currently not getting blocked from indexing, while also doing what the three current entries attempt to accomplish.
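The claim that one shorter rule covers the longer ones follows from prefix matching and can be checked mechanically. A small sketch, where `covered_rules` is a hypothetical helper:

```python
def covered_rules(rules):
    """Map each rule to a shorter rule in the list that already covers it as a prefix."""
    covered = {}
    for rule in rules:
        for other in rules:
            if other != rule and rule.startswith(other):
                covered[rule] = other
                break
    return covered

rules = ["/user/", "/user/register/", "/user/password/", "/user/login/"]
print(covered_rules(rules))

# Prefix matching is strict about the trailing slash:
print("/user/".startswith("/user/"))  # the path with a slash is covered
print("/user".startswith("/user/"))   # the bare path is not
```

One caveat under plain prefix matching: "Disallow: /user/" covers "/user/" and everything below it, but not the bare "/user" path without the trailing slash.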
Something else I just thought of... how would all this work with a multi-site installation? I'd like to test this but to date have not been able to successfully complete a multi-site installation (yes, I've tried following the handbook references). If someone can coach me through a test multi-site installation, I'd be happy to look into this.
And as I mentioned in #36, I believe it makes sense to add "Disallow: /system/files/" (and the non-clean URL equiv). This would block the indexing of node file attachments, presuming the file system default of "/files" or "(/something)/files" is retained.
#39
When using the private file system, the files folder should be moved above the public root, which as far as I can tell prevents anon users from reaching the files. Bots index as anon users? A file system set as private but left in the public root is essentially public regardless of the setting?
I'd gladly help you with a multisite install. I've done a few. However, this thread isn't the place for those instructions. Feel free to use my contact tab or create a forum thread. May even want to do a search on the forums as I've posted my successful steps multiple times.