This patch cuts down on the duplicate content that is spidered by search engine crawlers. It is difficult to make a single robots.txt file for all Drupal sites, but this one fixes some errors in the current default file and adds some new rules.
Wildcards (*) and end-of-string ($) characters are not part of the robots.txt standard, but they are accepted by Google, Yahoo, and MSN. You can test the rules in Google's Robots.txt tool in their Webmaster Tools: http://www.google.com/webmasters/tools
For example, you need to use a wildcard to block these tables that can be sorted in multiple ways:
http://drupal.org/forum/2?sort=asc&order=Last+reply&page=428
The rule is:
Disallow: /*sort=
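As a sketch of how crawlers that support this extension interpret the rule, a Disallow pattern can be translated to a regular expression where * matches any run of characters (a hypothetical helper modeled on Google's documented matching behavior, not Drupal code):

```python
import re

def robots_match(pattern: str, path: str) -> bool:
    """Google-style robots.txt matching: '*' matches any characters,
    a trailing '$' anchors the end, otherwise it is a prefix match."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "^" + ".*".join(re.escape(part) for part in body.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

# The sortable-table URL above is caught by the wildcard rule...
print(robots_match("/*sort=", "/forum/2?sort=asc&order=Last+reply&page=428"))  # True
# ...while a plain forum page is left crawlable.
print(robots_match("/*sort=", "/forum/2"))  # False
```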
Pages like these do not need to get spidered because there are other ways for crawlers to access the content:
drupal.org/tracker/12322
drupal.org/tracker/%5Buserid?page=9
drupal.org/tracker/order/author/desc?page=55
Trailing slashes have been removed from some URLs in the patch, but not all. For example, the current rule to block /modules/ is:
Disallow: /modules/
You cannot access the URL example.com/modules from a browser -- the server will add the trailing slash for you because it is a directory. Also, if you remove the trailing slash in this case, it might block a page of regular content with a URL like: example.com/modules-are-good
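The difference is easy to check with plain prefix matching, which is how standard Disallow rules work (a small illustrative sketch; the paths are made up for the example):

```python
def blocked(rule: str, path: str) -> bool:
    # Standard robots.txt Disallow semantics: a simple prefix match.
    return path.startswith(rule)

# /modules/ blocks everything inside the directory...
print(blocked("/modules/", "/modules/aggregator/aggregator.module"))  # True
# ...but not a content page that merely begins with the same word.
print(blocked("/modules/", "/modules-are-good"))  # False
# Dropping the trailing slash would over-block that content page.
print(blocked("/modules", "/modules-are-good"))  # True
```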
It is generally better not to have duplicate content indexed by search engines unless your site has a lot of link popularity. The aggregator module can create a lot of duplicate content.
The path /aggregator is blocked without a trailing slash because the following URLs should probably be blocked:
/aggregator
/aggregator/sources
/aggregator/categories
/aggregator?page=3
Removing the trailing slash on /aggregator would also block someone's post named example.com/aggregator-module-tutorial. So it may be better to change the attached patch to:
Disallow: /aggregator/
Disallow: /aggregator?
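Under plain prefix matching, those two rules catch the listing pages and their query-string variants while leaving similarly named content alone. One caveat worth noting: the bare /aggregator path itself is no longer covered (illustrative sketch only):

```python
def blocked_by_any(rules, path):
    # A URL is disallowed if any Disallow rule is a prefix of it.
    return any(path.startswith(rule) for rule in rules)

rules = ["/aggregator/", "/aggregator?"]
print(blocked_by_any(rules, "/aggregator/sources"))          # True
print(blocked_by_any(rules, "/aggregator?page=3"))           # True
print(blocked_by_any(rules, "/aggregator-module-tutorial"))  # False
print(blocked_by_any(rules, "/aggregator"))                  # False -- not covered
```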
The following rule prevents all feeds from being crawled, except the main one at /rss.xml.
Disallow: /*/feed$
The following rule prevents the user track pages from being crawled:
Disallow: /*/track$
The first rule below prevents /search (a 302 redirect) from being hit, while the second one blocks the search results:
Disallow: /search$
Disallow: /search/
If you did Disallow: /search it would also block pages of content like example.com/search-engine-optimization -- that is why there are two rules above.
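With the (non-standard) $ extension, the pair of rules behaves as described: /search$ matches only the bare redirect, /search/ matches the result pages, and neither touches content whose alias merely starts with "search". A hypothetical matcher following Google's documented semantics:

```python
import re

def robots_match(pattern: str, path: str) -> bool:
    """'*' matches any characters, trailing '$' anchors the end,
    otherwise the pattern is a prefix match (Google-style extension)."""
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    regex = "^" + ".*".join(re.escape(part) for part in body.split("*"))
    if anchored:
        regex += "$"
    return re.match(regex, path) is not None

print(robots_match("/search$", "/search"))                      # True
print(robots_match("/search$", "/search-engine-optimization"))  # False
print(robots_match("/search/", "/search/node/drupal"))          # True
# The feed rule from above works the same way:
print(robots_match("/*/feed$", "/blog/feed"))                   # True
print(robots_match("/*/feed$", "/rss.xml"))                     # False
```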
There is no way to make a single perfect robots.txt file for all sites, but this patch should help cut down on the server load from crawlers and improve general SEO of the basic Drupal installation.
Further explanations of the reasoning behind these proposed changes can be found on my Drupal robots.txt tutorial here:
http://drupalzilla.com/robots-txt
Each module that you add to a Drupal installation can require additional robots.txt rules. I left out robots.txt rules for modules in the attached patch, even though it might be a good idea to put some rules in for common modules. I'm building a database of Drupal modules with instructions on how to modify the robots.txt file for each extra module that is installed:
http://drupalzilla.com/module
Comment | File | Size | Author
---|---|---|---
#45 | 180379-comment-url.patch | 568 bytes | andypost
#23 | 180379_fixing_robotstxt-21-d7.patch | 1.12 KB | chx
#21 | 180379_fixing_robotstxt-21-d7.patch | 1.12 KB | Freso
#12 | robots.txt.d7.trailing_slashes.patch | 1.06 KB | Freso
#2 | robots_3.patch | 1.75 KB | Drupalzilla.com
Comments
Comment #1
moshe weitzman CreditAttribution: moshe weitzman commented
thanks for working on this ... please use unified diff format for diffs. see diffandpatch. we're so used to them that i can't recall which lines of yours are adds versus deletes.
Comment #2
Drupalzilla.com CreditAttribution: Drupalzilla.com commented
Sorry about that (my first time submitting a patch).
I've attached a new patch made with cvs diff -up.
Comment #3
catch
setting back to review.
Comment #4
Anonymous (not verified) CreditAttribution: Anonymous commented
Why do we want to do
+Disallow: /node$
and its q equivalent? Do we really want to remove
-Disallow: /contact/
and add
+Disallow: /contact$
and others, or do we just want to add the $-ending ones? The three engines you mention aren't the only ones.
We should think about creating robots.txt on the fly in the module activation process. Activating/deactivating a module could add/remove robots.txt text.
Comment #5
catch
Wouldn't that mean duplicate content between example.com/ and example.com/node if /node is the front page?
Comment #6
Anonymous (not verified) CreditAttribution: Anonymous commented
Yes, I suppose. But then there is http://drupal.org/project/globalredirect which would correct that issue without needing to modify the robots.txt file. And adding http://drupal.org/project/gsitemap can help even further. However, for the default install, I see your point.
Comment #7
Drupalzilla.com CreditAttribution: Drupalzilla.com commented
example.com/node is duplicate content of example.com/
example.com/node/1 shouldn't be blocked though.
It looked to me like the current rule,
Disallow: /contact/
was an attempt to block the contact forms. It doesn't block the default contact form because of the trailing slash. But it might be best just to leave that rule in its current form.
That is a good idea, because a lot of modules create duplicate content problems -- as long as you could still have precise control over the robots.txt rules by hand.
Comment #8
catch
dynamic robots.txt for modules could maybe be an addition to this: http://drupal.org/node/53579? - either way it's a very nice idea.
Comment #9
Freso CreditAttribution: Freso commented
– per default, this is true, but it can be easily changed, even by people who do not have access to the file system and thus won't be able to edit robots.txt.
Also, where does your use of "$" come from? I haven't been able to discern its function from anything I could find at robotstxt.org or Wikipedia...
It was also agreed upon in issue 75916 to have aggregator indexed by default, so that should be changed to "If you do not want your aggregator pages to be indexed, uncomment the following line".
Finishing up, I'd recommend you read through issue 75916, as it contains some hints and has some discussion on this.
Comment #10
catch
$ is an end-of-line character; not in the spec, but recognised by all major search engines (this is covered in the issue discussion and I almost cut and pasted).
example.com/node - I agree with though, a lot of sites don't use /node as the front page.
Comment #11
catch
Bumping to 7.x
Comment #12
Freso CreditAttribution: Freso commented
Okay, I've had some time to turn this over in my mind, and I'm feeling rather uneasy about using * and $ in the robots.txt, as they're not standard. Google and co. might well support them, but I'll bet you that there are tons of (polite) robots out there that actually follow the standard and don't care for (or possibly don't even know of) the extensions Google et al. use. And to me they seem like they would confuse standards-compliant robots.
The attached patch removes some of the trailing slashes, namely /foo/bar/ to /foo/bar, leaving /foo/ alone (in case someone wants to make a /fooxyz node). Even if the non-standards approach is deemed a road worth continuing along, this patch will provide a temporary boost to robots.txt's effectiveness.
Comment #13
lilou CreditAttribution: lilou commented
Patch still applied.
Comment #14
cburschka
Actually, the patch adds new rules without affecting the existing ones... was that what you meant to do?
Comment #15
Freso CreditAttribution: Freso commented
Yes, this is what I meant to do.
Comment #16
BartVB CreditAttribution: BartVB commented
edit: Nevermind :\ Should read the actual patch before replying...
Comment #17
cburschka
Um... huh? Which rule, specifically, prevents http://drupal.org/forum/...etc from being indexed? These are all of the new ones:
Comment #19
lilou CreditAttribution: lilou commented
See: #335122: Test clean HEAD after every commit and http://pastebin.ca/1258476
Comment #21
Freso CreditAttribution: Freso commented
Re-roll.
Comment #22
Freso CreditAttribution: Freso commented
Also: Marked #278775: Allow robots.txt to disallow URLs with "sort" and "filter" in them a duplicate of this.
Comment #23
chx CreditAttribution: chx commented
reposting for bot's sake.
Comment #24
Anonymous (not verified) CreditAttribution: Anonymous commented
Comment #25
Dries CreditAttribution: Dries commented
I wonder why we need
/?q=user/logout/
-- can something follow the logout part of the path?
Comment #26
catch
Isn't it because that leads you to a 403? Same as admin?
Comment #27
Freso CreditAttribution: Freso commented
It was added with #75916: Include a default robots.txt (commit), but that issue doesn't seem to mention why it is using the slash at the end of it. I think the safe thing to do is to keep it; should we find out it causes trouble, it can be removed later.
Comment #28
cburschka
Don't see any remaining issues here, unless we want to get rid of some of the trailing slashes.
Comment #29
webchick
We could do with a comment at the top of this file that explains why the paths are repeated. Although I would love a reason better than "We don't know why the slashes are there" :P I'm wondering if we should just remove them, since the contents of this file with this patch are absolutely baffling.
Any SEO experts in the house?
Comment #30
Anonymous (not verified) CreditAttribution: Anonymous commented
http://www.google.com/support/webmasters/bin/answer.py?answer=35237
Comment #31
webchick
@earnie: Can you explain how that page explains why we need both trailing and non-trailing slashes on every path? And if so, could you formulate that into a comment and re-roll the patch?
Comment #32
Anonymous (not verified) CreditAttribution: Anonymous commented
I think it says that we need the ones with the slash more than we need the ones without it. See http://www.google.com/support/webmasters/bin/answer.py?answer=40360&ctx=... for examples.
In particular:
Comment #33
eMPee584 CreditAttribution: eMPee584 commented
IIUC there's something important missing here: wildcard paths for multilingual sites, as posted on #347515: robots.txt: add wildcarded paths for multilingual sites:
Comment #34
Anonymous (not verified) CreditAttribution: Anonymous commented
Is there a reason why /user/ isn't blocked? The current robots.txt file blocks "/user/logon" but nothing addresses "/user" (which routes to the same logon page). Seems to me addition of the following would be required:
Disallow: /user/
Disallow: /?q=user/
Similarly, attempts to navigate to /system/ or /system/files/ result in a "page not found" error. This is good. But when files are attached to nodes, those files become available as /system/files/foo.txt (replace foo.txt with the appropriate filename+extension). I have seen said file attachments indexed by Google (not good, imho). Wouldn't the following additions to robots.txt prevent the indexing of any node-attached files?
Disallow: /system/files/
Disallow: /?q=system/files/
Comment #35
VM CreditAttribution: VM commented
Readjusting version and status, as there is already a patch in play that needs work according to webchick's comments in #29.
Comment #36
Anonymous (not verified) CreditAttribution: Anonymous commented
Thanks @VeryMisunderstood, I'm working with 6.14 and didn't consider how changing the version from what was defaulted might be a problem. My apologies.
http://robotstxt.org is supposed to be the definitive source but they're currently generating a 503 server error. So I hit Wikipedia.
What that tells me is inclusion of the trailing slash obviates the need for additional entries including the same path. So 1-3 below will disallow indexing of content within specific directories subordinate to "http://foo.com/user/" while 4 accomplishes the same in addition to disallowing _everything else_ subordinate to "http://foo.com/user/":
As far as I can tell, the current approach (inclusion of the trailing slash) is correct. But it seems to me "Disallow: /user/" could replace "Disallow: /user/register/, /user/password/, /user/login/" (same for the non-clean URL equivalents).
Finally, in my initial comment (#34), I also suggested adding another dir path (/system/files/) so as to block indexing of any files attached to nodes. I think it might make sense to simply block /system/ but haven't looked at that carefully to prevent unintended consequences.
Comment #37
mattyoung CreditAttribution: mattyoung commented.
Comment #38
Anonymous (not verified) CreditAttribution: Anonymous commented
Following up on my last post (#36), http://robotstxt.org continues to be offline. Wikipedia isn't a bad source but also not the most reliable. So I checked W3C (w3.org) and they provide the same details on "robots.txt".
I believe this confirms my suspicion that "Disallow: /user/" would negate "Disallow: /user/register/", /user/password/", and "/user/login/" (same for the non-clean URL equivalents). Additionally, "/user/" yields a login page that is currently not blocked from indexing. So adding "Disallow: /user/" (to replace the 3 /user paths listed in the current robots.txt) will block the one that is currently not getting blocked from indexing, while also doing what the three current entries attempt to accomplish.
Something else I just thought of... how would all this work with a multi-site installation? I'd like to test this but to-date have not been able to successfully complete a multi-site installation (yes, I've tried following the handbook references). If someone can coach me through a test multi-site installation, I'd be happy to look into this.
And as I mentioned in #36, I believe it makes sense to add "Disallow: /system/files/" (and the non-clean URL equiv). This would block the indexing of node file attachments, presuming the file system default of "/files" or "(/something)/files" is retained.
Comment #39
VM CreditAttribution: VM commented
When using the private file system, the files folder should be moved above the public root, which as far as I can tell prevents anonymous users from reaching the files. Bots index as anonymous users? A file system set as private but left in the public root is essentially public regardless of the setting?
I'd gladly help you with a multisite install. I've done a few. However, this thread isn't the place for those instructions. Feel free to create a forum thread. May even want to do a search on the forums as I've posted my successful steps multiple times.
Comment #40
j0nathan CreditAttribution: j0nathan commented
subscribing
Comment #41
mlbrgl CreditAttribution: mlbrgl commented
@zacamjo - #38
Wouldn't "Disallow: /user/" also block "/user/[USER ID]" paths, which community sites might want to keep indexed when they are public?
http://www.google.com/search?q=site%3Adrupal.org%2Fuser%2F
Comment #42
YK85 CreditAttribution: YK85 commented
I was wondering if someone can help set up the robots.txt for a multilingual Drupal 6 site? Thank you!
Comment #43
j0nathan CreditAttribution: j0nathan commented
Hi, here is another example of a modified robots.txt file, for a multilingual Drupal 6 site:
https://wiki.koumbit.net/DrupalRobots
Comment #44
YK85 CreditAttribution: YK85 commented
It seems like the link in #43 is using the method in #33.
I'm still not clear if all 8 lines shown below need to be in the robots.txt for each URL:
Does anyone know for sure?
Comment #45
andypost
D7 introduced comment/% URLs for comments; this brings a huge problem with content duplication.
So the proposal is to totally disable /comment/
Comment #46
RobLoach
Related: #495608: Move parts of robotstxt module into core.
Comment #47
pillarsdotnet CreditAttribution: pillarsdotnet commented
#45: 180379-comment-url.patch queued for re-testing.
Comment #48
andypost
This trouble is mostly caused by the "Last comments" block, which points to comment/ID#comment-ID.
Another way to fix this is to change the block to display links for comments like /node/NID#comment-ID.
Comment #49
Ayesh CreditAttribution: Ayesh commented
Needless to say, a simple problem in robots.txt can be a fatal problem for sites that mainly depend on Google's traffic.
Keeping Google traffic in mind, I could set query params in Google Webmaster Central to index page, sort and order queries. GW has a really cool feature to set which query does what.
About /comment/ID URLs, I got content duplication warnings, and a robots.txt entry to disallow them worked great. But do we really need to give each comment a URL?
D6's comment URL pattern looks nice, but without the node ID, comment URLs are a little misleading.
Comment #51
kscheirer
#45: 180379-comment-url.patch queued for re-testing.
Comment #53
maciej.zgadzaj CreditAttribution: maciej.zgadzaj commented
Re #44:
Would block (for example) /administration-guide
Would block /?q=administration-guide
Would block /content/admin/
Would block /content/administration-guide
Comment #54
Anonymous (not verified) CreditAttribution: Anonymous commented
Re #53: And why do we want a robot accessing administration-guide anyway?
Comment #55
maciej.zgadzaj CreditAttribution: maciej.zgadzaj commented
Because that could be an article alias, the content of which someone might want to have indexed by a search engine?
Comment #56
andypost
The related discussion about links to comments: #2113323: Rename Comment::permalink() to not be ambiguous with ::uri()
Comment #57
DevElCuy CreditAttribution: DevElCuy commented
There is a broken link at line 14. The new link is: http://www.robotstxt.org/robotstxt.html
Comment #58
hass CreditAttribution: hass commented
The http://www.frobee.com/robots-txt-check link in robots.txt is broken.
Comment #59
ronaldmulero CreditAttribution: ronaldmulero commented
Related: #1137848 - /filter/tips page is listed by search engines
Comment #60
DevElCuy CreditAttribution: DevElCuy commented
Following #58: here is a good syntax checker that, unlike the Google one, requires no account creation: https://webmaster.yandex.com/robots.xml
Patch attached.
Comment #61
gbisht CreditAttribution: gbisht commented
@develCuy please set the issue to needs review after submitting the patch.
Comment #62
jonhattan
In general terms it would be more accurate to link to http://www.robotstxt.org/checker.html, Wikipedia or any other trusted source, but none of them provide a listing.
Comment #63
alexpott
Hmmm, the patch in #60 is completely unrelated to the issue summary. I think the fact that http://www.frobee.com/robots-txt-check is broken should be a new issue. That new issue should discuss whether or not we should link to a validator in robots.txt - to me this seems superfluous.
Comment #64
cilefen CreditAttribution: cilefen commented
#63 was fixed in #2446657: Dead link on robots.txt.
Comment #65
cilefen CreditAttribution: cilefen commented
Comment #66
Ayesh CreditAttribution: Ayesh commented
There's a lot to fix in the robots.txt file.
#2446657: Dead link on robots.txt
#1137848: /filter/tips page is listed by search engines
Still, it needs some rework, now that Google recommends not blocking CSS/JS folders for its mobile-friendly SEO rankings (1, 2). Of course we shouldn't focus on just Google, but I don't see the motivation behind blocking module and theme paths, or login pages (people do search for "facebook login", "facebook sign up", etc.).
Comment #67
deepakaryan1988
Removing sprint weekend tag, as suggested by @YesCT.
Comment #68
deepakaryan1988
Sorry, these issues were actually worked on during the 2015 Global Sprint Weekend: https://groups.drupal.org/node/447258
Comment #69
lpalgarvio CreditAttribution: lpalgarvio commented
Comment #74
salvis
Found this old issue...
According to https://developers.google.com/search/reference/robots_txt, /fish/ does not match /fish, i.e. /admin/ doesn't match /admin, so GoogleBot may try to access /admin (and hit 304) if some hacker links there.
If you have a "Log in" Link on your front page, you'll find that Google fully indexes /user/login, even though our robots.txt has
Disallow: /user/login/
Comment #77
apaderno
Comment #78
philsward CreditAttribution: philsward commented
Considering this problem has been around since Drupal 5 and this issue has been around for over a decade now, I don't see it ever getting committed.
Crazy how difficult it is to get a simple text file committed for Drupal.
Somebody may as well close the issue as "Won't Fix".
Comment #79
leymannx
I think there's just some fundamental (human) SEO expertise needed to bring this issue forward.
It definitely needs some more attention, yes.
Comment #81
cilefen CreditAttribution: cilefen commented
Google have just open-sourced their robots.txt parser.
Comment #86
longwave
I'm not sure what actionable tasks there are for this issue. There is a lot of discussion of different factors, but there doesn't seem to be anything concrete we can move forward with. I think all Drupal core can hope to do here is ship a simple robots.txt file that covers some basic paths used by core, as it does at present. Site owners can edit the file directly or install the robotstxt module if they wish to override the default settings.
I don't think we can use the * or $ operators; while these are supported by some search engines, they are almost certainly not accepted by all.
#74 was resolved in #3123285: Actually exclude user register, login, logout, and password pages from search results in robots.txt (current rules are broken)
I suggest that this issue should be closed, but if there are specific, actionable problems with any of the lines in the current robots.txt, those should be discussed in new issues.
Comment #88
catch
It's been a few months since #86, let's close this one.