This patch cuts down on the duplicate content that is spidered by search engine crawlers. It is difficult to make a single robots.txt file for all Drupal sites, but this one fixes some errors in the current default file and adds some new rules.

Wildcards (*) and end-of-string ($) characters are not part of the robots.txt standard, but they are accepted by Google, Yahoo, and MSN. You can test the rules in Google's Robots.txt tool in their Webmaster Tools: http://www.google.com/webmasters/tools

For example, you need to use a wildcard to block table pages like this one, which can be sorted in multiple ways:
http://drupal.org/forum/2?sort=asc&order=Last+reply&page=428

The rule is:
Disallow: /*sort=

Pages like these do not need to get spidered because there are other ways for crawlers to access the content:
drupal.org/tracker/12322
drupal.org/tracker/%5Buserid?page=9
drupal.org/tracker/order/author/desc?page=55
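
To make the matching behaviour concrete, here is a rough Python sketch of how a wildcard-aware crawler (Google-style) evaluates these patterns. This is an illustration only, not any engine's actual implementation:

import re

def blocks(pattern, path):
    """Illustrative Google-style matcher: '*' matches any run of
    characters and a trailing '$' anchors the match to the end of
    the path (query string included). Patterns match from the
    start of the path, as in the original standard."""
    anchored = pattern.endswith('$')
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and rejoin them with '.*' for each '*'.
    regex = '.*'.join(re.escape(piece) for piece in pattern.split('*'))
    if anchored:
        regex += '$'
    return re.match(regex, path) is not None

print(blocks('/*sort=', '/forum/2?sort=asc&order=Last+reply&page=428'))  # True
print(blocks('/*sort=', '/forum/2'))                                     # False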

Trailing slashes have been removed from some URLs in the patch, but not all. For example, the current rule to block /modules/ is:
Disallow: /modules/

You cannot access the URL example.com/modules from a browser -- the server will add the trailing slash for you because it is a directory. Also, if you remove the trailing slash in this case, it might block a page of regular content with a URL like: example.com/modules-are-good
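
A quick illustration of the difference, using Python's startswith to stand in for the robots.txt prefix rule (the paths are made up for the example):

print('/modules/node/node.module'.startswith('/modules/'))  # True: directory contents blocked
print('/modules-are-good'.startswith('/modules/'))          # False: content page stays crawlable
print('/modules-are-good'.startswith('/modules'))           # True: dropping the slash would block it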

It is generally better not to have duplicate content indexed by search engines unless your site has a lot of link popularity. The aggregator module can create a lot of duplicate content.

The path /aggregator is blocked without a trailing slash because the following URLs should probably be blocked:
/aggregator
/aggregator/sources
/aggregator/categories
/aggregator?page=3

However, blocking /aggregator without the trailing slash would also block someone's post named example.com/aggregator-module-tutorial. So it may be better to change the attached patch to:
Disallow: /aggregator/
Disallow: /aggregator?

The following rule prevents all feeds from being crawled, except the main one at /rss.xml.
Disallow: /*/feed$

The following rule prevents the user track pages from being crawled:
Disallow: /*/track$
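
Using the illustrative blocks() sketch from earlier, these two rules behave like this against some example paths (again assuming Google-style matching):

print(blocks('/*/feed$', '/blog/feed'))       # True: per-section feeds are blocked
print(blocks('/*/feed$', '/rss.xml'))         # False: the main feed stays crawlable
print(blocks('/*/track$', '/user/42/track'))  # True: user track pages are blocked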

The first rule below prevents /search (a 302 redirect) from being hit, while the second one blocks the search results:
Disallow: /search$
Disallow: /search/

If you did Disallow: /search it would also block pages of content like example.com/search-engine-optimization -- that is why there are two rules above.

There is no way to make a single perfect robots.txt file for all sites, but this patch should help cut down on the server load from crawlers and improve general SEO of the basic Drupal installation.

Further explanations of the reasoning behind these proposed changes can be found on my Drupal robots.txt tutorial here:
http://drupalzilla.com/robots-txt

Each module that you add to a Drupal installation can require additional robots.txt rules. I left out robots.txt rules for modules in the attached patch, even though it might be a good idea to put some rules in for common modules. I'm building a database of Drupal modules with instructions on how to modify the robots.txt file for each extra module that is installed:
http://drupalzilla.com/module

Comments

moshe weitzman’s picture

Status: Needs review » Needs work

Thanks for working on this... please use unified diff format for diffs; see diffandpatch. We're so used to them that I can't tell which lines of yours are adds versus deletes.

Drupalzilla.com’s picture

FileSize
1.75 KB

Sorry about that (my first time submitting a patch).

I've attached a new patch made with cvs diff -up.

catch’s picture

Status: Needs work » Needs review

setting back to review.

Anonymous’s picture

Why do we want to do +Disallow: /node$ and its q equivalent?

Do we really want to remove -Disallow: /contact/ and add +Disallow: /contact$ and others or do we just want to add the $ ending ones? The three engines you mention aren't the only ones.

We should think about creating robots.txt on the fly in the module activation process: activating or deactivating a module could add or remove its robots.txt rules.
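
A minimal sketch of that idea (the module names, rule fragments, and function are invented for illustration; Drupal's real implementation would live in PHP hooks, which this Python sketch only approximates):

# Hypothetical per-module robots.txt fragments.
MODULE_RULES = {
    'aggregator': ['Disallow: /aggregator/', 'Disallow: /?q=aggregator/'],
    'forum':      ['Disallow: /forum/',      'Disallow: /?q=forum/'],
}

def build_robots_txt(enabled_modules, base_rules):
    # Start from the core rules, then append each enabled module's rules.
    lines = ['User-agent: *'] + list(base_rules)
    for module in sorted(enabled_modules):
        lines += MODULE_RULES.get(module, [])
    return '\n'.join(lines) + '\n'

print(build_robots_txt({'aggregator'}, ['Disallow: /admin/']))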

catch’s picture

Why do we want to do +Disallow: /node$ and its q equivalent?

Wouldn't that mean duplicate content between example.com/ and example.com/node if /node is the front page?

Anonymous’s picture

Yes, I suppose. But then there is http://drupal.org/project/globalredirect which would correct that issue without needing to modify the robots.txt file. And adding http://drupal.org/project/gsitemap can help even further. However for the default install, I see your point.

Drupalzilla.com’s picture

Why do we want to do +Disallow: /node$ and its q equivalent?

example.com/node is duplicate content of example.com/
example.com/node/1 shouldn't be blocked though.

Do we really want to remove -Disallow: /contact/ and add +Disallow: /contact$ and others or do we just want to add the $ ending ones? The three engines you mention aren't the only ones.

It looked to me like the current rule, Disallow: /contact/, was an attempt to block the contact forms. It doesn't block the default contact form because of the trailing slash. But it might be best just to leave that rule in its current form.

We should think about creating robots.txt on the fly in the module activation processes. Activate/deactivate module could add/remove robots.txt text.

That is a good idea because a lot of modules create duplicate content problems -- as long as you could still have precise control over the robots.txt rules by hand.

catch’s picture

dynamic robots.txt for modules could maybe be an addition to this: http://drupal.org/node/53579? - either way it's a very nice idea.

Freso’s picture

Status: Needs review » Needs work

example.com/node is duplicate content of example.com/ – by default this is true, but it can easily be changed, even by people who do not have access to the file system and thus won't be able to edit robots.txt.

Also, where does your use of "$" come from? I haven't been able to discern its function from anything I could find at robotstxt.org or Wikipedia...

It was also agreed upon in issue 75916 to have aggregator indexed by default, so that should be changed to "If you do not want your aggregator pages to be indexed, uncomment the following line".

Finally, I'd recommend reading through issue 75916, as it contains some hints and some discussion on this.

catch’s picture

$ is an end-of-string character - not in the spec, but recognised by all major search engines (this is covered in the issue discussion above; I almost cut and pasted it).

example.com/node I agree with, though - a lot of sites don't use /node as the front page.

catch’s picture

Version: 6.x-dev » 7.x-dev

Bumping to 7.x

Freso’s picture

Status: Needs work » Needs review
FileSize
1.06 KB

Okay, I've had some time to turn this over in my mind, and I'm feeling rather uneasy about using * and $ in the robots.txt, as they're not standard. Google and co. might well support them, but I'll bet you that there are tons of (polite) robots out there that actually follow the standard and don't care for (or possibly don't even know of) the extensions Google et al. use. And to me these extensions seem like they would confuse standards-compliant robots.

The attached patch removes some of the trailing slashes, namely /foo/bar/ to /foo/bar, leaving /foo/ alone (in case someone wants to make a /fooxyz node). Even if the non-standards approach is deemed a road worth continuing along, this patch will provide a temporary boost to robots.txt's effectiveness.

lilou’s picture

Patch still applies.

cburschka’s picture

The attached patch removes some of the trailing slashes, namely /foo/bar/ to /foo/bar

Actually, the patch adds new rules without affecting the existing ones... was that what you meant to do?

Freso’s picture

Yes, this is what I meant to do.

BartVB’s picture

Edit: Never mind :\ I should read the actual patch before replying...

cburschka’s picture

Um... huh? Which rule, specifically, prevents http://drupal.org/forum/...etc from being indexed? These are all of the new ones:

+Disallow: /comment/reply
+Disallow: /node/add
+Disallow: /user/register
+Disallow: /user/password
+Disallow: /user/login
+Disallow: /?q=comment/reply
+Disallow: /?q=node/add
+Disallow: /?q=user/password
+Disallow: /?q=user/register
+Disallow: /?q=user/login

Status: Needs review » Needs work

The last submitted patch failed testing.

lilou’s picture

Status: Needs work » Needs review

Status: Needs review » Needs work

The last submitted patch failed testing.

Freso’s picture

Status: Needs work » Needs review
FileSize
1.12 KB

Re-roll.

Freso’s picture

chx’s picture

reposting for bot's sake.

Anonymous’s picture

Status: Needs review » Reviewed & tested by the community
Dries’s picture

I wonder why we need /?q=user/logout/ -- can something follow the logout-part of the path?

catch’s picture

Isn't it because that leads you to a 403? Same as admin?

Freso’s picture

It was added with #75916: Include a default robots.txt (commit), but that issue doesn't seem to mention why it is using the slash at the end of it. I think the safe thing to do is to keep it; should we find out it causes trouble, it can be removed later.

cburschka’s picture

Don't see any remaining issues here, unless we want to get rid of some of the trailing slashes.

webchick’s picture

Status: Reviewed & tested by the community » Needs work

We could do with a comment at the top of this file that explains why the paths are repeated. Although I would love a reason better than "We don't know why the slashes are there" :P I'm wondering if we should just remove them, since the contents of this file with this patch are absolutely baffling.

Any SEO experts in the house?

Anonymous’s picture

webchick’s picture

@earnie: Can you explain how that page explains why we need both trailing and not trailing slashes on every path? And if so, could you formulate that into a comment and re-roll the patch?

Anonymous’s picture

I think it says more that we need the ones with the slash than that we need the ones without. See http://www.google.com/support/webmasters/bin/answer.py?answer=40360&ctx=... for examples.

In particular:

# To block a directory and everything in it, follow the directory name with a forward slash.

Disallow: /junk-directory/ 
eMPee584’s picture

IIUC, there's something important missing here: wildcard paths for multilingual sites, as posted on #347515: robots.txt: add wildcarded paths for multilingual sites:

# For multi-language sites (wildcards supported at least
# by GoogleBot, MSNBot and Yahoo Slurp web spiders)
# Paths (clean URLs)
Disallow: /*/admin/
Disallow: /*/comment/reply/
Disallow: /*/contact/
Disallow: /*/logout/
Disallow: /*/node/add/
Disallow: /*/search/
Disallow: /*/user/register/
Disallow: /*/user/password/
Disallow: /*/user/login/
# Paths (no clean URLs)
Disallow: /*/?q=admin/
Disallow: /*/?q=comment/reply/
Disallow: /*/?q=contact/
Disallow: /*/?q=logout/
Disallow: /*/?q=node/add/
Disallow: /*/?q=search/
Disallow: /*/?q=user/password/
Disallow: /*/?q=user/register/
Disallow: /*/?q=user/login/
Anonymous’s picture

Version: 7.x-dev » 6.x-dev
Status: Needs work » Active

Is there a reason why /user/ isn't blocked? The current robots.txt file blocks "/user/login" but nothing addresses "/user" (which routes to the same login page). It seems to me the following additions would be required:

Disallow: /user/
Disallow: /?q=user/

Similarly, attempts to navigate to /system/ or /system/files/ result in a "page not found" error. This is good. But when files are attached to nodes, those files become available as /system/files/foo.txt (replace foo.txt with the appropriate filename+extension). I have seen said file attachments indexed by Google (not good, imho). Wouldn't the following additions to robots.txt prevent the indexing of any node-attached files?

Disallow: /system/files/
Disallow: /?q=system/files/

VM’s picture

Version: 6.x-dev » 7.x-dev
Status: Active » Needs work

Readjusting version and status, as there is already a patch in play that needs work according to webchick's comments in #29.

Anonymous’s picture

Thanks @VeryMisunderstood, I'm working with 6.14 and didn't consider how changing the version from what was defaulted might be a problem. My apologies.

http://robotstxt.org is supposed to be the definitive source but they're currently generating a 503 server error. So I hit Wikipedia.

The robots.txt patterns are matched by simple substring comparisons, so care should be taken to make sure that patterns matching directories have the final '/' character appended, otherwise all files with names starting with that substring will match, rather than just those in the directory intended.

What that tells me is that including the trailing slash obviates the need for additional entries sharing the same path prefix. So rules 1-3 below disallow indexing of content within specific directories under "http://foo.com/user/", while rule 4 accomplishes the same and additionally disallows _everything else_ under "http://foo.com/user/":

  1. Disallow: /user/register/ (blocks everything under "http://foo.com/user/register/")
  2. Disallow: /user/password/ (blocks everything under "http://foo.com/user/password/")
  3. Disallow: /user/login/ (blocks everything under "http://foo.com/user/login/")
  4. Disallow: /user/ (blocks everything under "http://foo.com/user/")

As far as I can tell, the current approach (inclusion of the trailing slash) is correct. But it seems to me "Disallow: /user/" could replace "Disallow: /user/register/, /user/password/, /user/login/" (same for the non-clean URL equivalents).
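
A minimal illustration of that subsumption, with Python's startswith standing in for the spec's substring comparison:

# Every path matched by the three longer rules necessarily starts
# with '/user/', so 'Disallow: /user/' covers them all.
for prefix in ['/user/register/', '/user/password/', '/user/login/']:
    print(prefix.startswith('/user/'))  # True for all three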

Finally, in my initial comment (#34), I also suggested adding another directory path (/system/files/) so as to block indexing of any files attached to nodes. It might make sense to simply block /system/, but I haven't looked at that carefully enough to rule out unintended consequences.

mattyoung’s picture

.

Anonymous’s picture

Following up on my last post (#36), http://robotstxt.org continues to be offline. Wikipedia isn't a bad source but also not the most reliable. So I checked W3C (w3.org) and they provide the same details on "robots.txt".

  • Disallow: /help disallows both /help.html and /help/index.html
  • Disallow: /help/ would disallow /help/index.html but allow /help.html

I believe this confirms my suspicion that "Disallow: /user/" would make "Disallow: /user/register/", "/user/password/", and "/user/login/" redundant (same for the non-clean URL equivalents). Additionally, "/user/" yields a login page that is currently not blocked from indexing. So adding "Disallow: /user/" (to replace the 3 /user paths listed in the current robots.txt) will block the one that is currently not getting blocked, while also doing what the three current entries attempt to accomplish.

Something else I just thought of: how would all this work with a multi-site installation? I'd like to test this but to date have not been able to successfully complete a multi-site installation (yes, I've tried following the handbook references). If someone can coach me through a test multi-site installation, I'd be happy to look into this.

And as I mentioned in #36, I believe it makes sense to add "Disallow: /system/files/" (and the non-clean URL equiv). This would block the indexing of node file attachments, presuming the file system default of "/files" or "(/something)/files" is retained.

VM’s picture

When using the private file system, the files folder should be moved above the public root, which as far as I can tell prevents anonymous users from reaching the files. Bots index as anonymous users, right? And a file system set to private but left in the public root is essentially public regardless of the setting?

I'd gladly help you with a multisite install. I've done a few. However, this thread isn't the place for those instructions. Feel free to create a forum thread. May even want to do a search on the forums as I've posted my successful steps multiple times.

j0nathan’s picture

subscribing

mlbrgl’s picture

@zacamjo - #38

Wouldn't "Disallow: /user/" also block "/user/[USER ID]" paths, that community sites might want to keep getting indexed, when they are public?

http://www.google.com/search?q=site%3Adrupal.org%2Fuser%2F

YK85’s picture

I was wondering if someone can help set up the robots.txt for a multilingual Drupal 6 site? Thank you!

j0nathan’s picture

Hi, here is another example of a modified robots.txt file, for a multilingual Drupal 6 site:
https://wiki.koumbit.net/DrupalRobots

YK85’s picture

It seems like the link in #43 is using the method in #33.
I'm still not clear whether all 8 lines shown below need to be in robots.txt for each URL.
Does anyone know for sure?

# Paths (no clean URLs)
Disallow: /?q=admin/

# Paths (clean URLs)
Disallow: /admin/

# Paths (clean URLs) no trailing
Disallow: /admin

# Paths (no clean URLs) no trailing
Disallow: /?q=admin

# Paths (clean URLs) multilingual
Disallow: /*/admin/

# Paths (no clean URLs) multilingual
Disallow: /*/?q=admin/

# Paths (clean URLs) multilingual, no trailing
Disallow: /*/admin

# Paths (no clean URLs) multilingual, no trailing
Disallow: /*/?q=admin
andypost’s picture

Version: 7.x-dev » 8.x-dev
Status: Needs work » Needs review
FileSize
568 bytes

D7 introduced comment/% URLs for comments; this brings a huge content duplication problem.

So the proposal is to disallow /comment/ entirely.
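
Presumably that means rules along these lines (illustrative only; the attached patch has the exact lines):

Disallow: /comment/
Disallow: /?q=comment/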

RobLoach’s picture

pillarsdotnet’s picture

#45: 180379-comment-url.patch queued for re-testing.

andypost’s picture

Issue tags: +SEO, +Drupal SEO

This trouble is mostly caused by the "Last comments" block, which points to comment/ID#comment-ID.

Another way to fix this is to change that block to display comment links like /node/NID#comment-ID.

Ayesh’s picture

Needless to say, a simple problem in robots.txt can be fatal for sites that mainly depend on Google traffic.
Keeping Google traffic in mind, I could tell Google Webmaster Central how to handle the page, sort, and order query parameters; it has a really cool feature for setting what each query parameter does.

About /comment/ID URLs: I got content duplication warnings, and a robots.txt entry to disallow them worked great. But do we really need to give each comment a URL?
D6's comment URL pattern looks nice, but without the node ID, comment URLs are a little misleading.

kscheirer’s picture

Issue tags: -SEO, -Drupal SEO

#45: 180379-comment-url.patch queued for re-testing.

Status: Needs review » Needs work
Issue tags: +SEO, +Drupal SEO

The last submitted patch, 180379-comment-url.patch, failed testing.

maciej.zgadzaj’s picture

Re #44:

# Paths (clean URLs) no trailing
Disallow: /admin

Would block (for example) /administration-guide

# Paths (no clean URLs) no trailing
Disallow: /?q=admin

Would block /?q=administration-guide

# Paths (clean URLs) multilingual
Disallow: /*/admin/

Would block /content/admin/

# Paths (clean URLs) multilingual, no trailing
Disallow: /*/admin

Would block /content/administration-guide

Anonymous’s picture

Re #53: And why do we want a robot accessing administration-guide anyway?

maciej.zgadzaj’s picture

Re #53: And why do we want a robot accessing administration-guide anyway?

Because that could be an article alias whose content someone might want indexed by a search engine?

andypost’s picture

The related discussion about links to comments is at #2113323: Rename Comment::permalink() to not be ambiguous with ::uri()

DevElCuy’s picture

There is a broken link at line 14. The new link is: http://www.robotstxt.org/robotstxt.html

hass’s picture

The http://www.frobee.com/robots-txt-check link in robots.txt is broken.

ronaldmulero’s picture

DevElCuy’s picture

Following #58: there is a good syntax checker that, unlike the Google one, requires no account creation: https://webmaster.yandex.com/robots.xml

Patch attached.

gbisht’s picture

Status: Needs work » Needs review
Issue tags: +SprintWeekend2015

@develCuy: please set the issue to "needs review" after submitting a patch.

jonhattan’s picture

Status: Needs review » Reviewed & tested by the community

In general it would be more accurate to link to http://www.robotstxt.org/checker.html, Wikipedia, or another trusted source, but none of them provides a listing.

alexpott’s picture

Status: Reviewed & tested by the community » Needs work

Hmmm, the patch in #60 is completely unrelated to the issue summary. I think the fact that http://www.frobee.com/robots-txt-check is broken should be a new issue. That new issue should discuss whether or not we should link to a validator in robots.txt at all - to me this seems superfluous.

cilefen’s picture

cilefen’s picture

Title: Fixing Robots.txt » Fix path matching in robots.txt
Ayesh’s picture

There's a lot to fix in the robots.txt file.
#2446657: Dead link on robots.txt
#1137848: /filter/tips page is listed by search engines

Still, it needs some rework now that Google recommends not blocking CSS/JS folders for its mobile-friendly SEO rankings (1, 2). Of course we shouldn't focus on just Google, but I do not see the motivation behind blocking module and theme paths, or login pages (people do search for "facebook login", "facebook sign up", etc.).

deepakaryan1988’s picture

Issue tags: -SprintWeekend2015

Removing the sprint weekend tag, as suggested by @YesCT.

deepakaryan1988’s picture

Issue tags: +SprintWeekend2015

Sorry, these issues were actually worked on during the 2015 Global Sprint Weekend: https://groups.drupal.org/node/447258

lpalgarvio’s picture

Version: 8.0.x-dev » 8.1.x-dev

Version: 8.1.x-dev » 8.2.x-dev

Version: 8.2.x-dev » 8.3.x-dev

Version: 8.3.x-dev » 8.4.x-dev

salvis’s picture

Found this old issue...

According to https://developers.google.com/search/reference/robots_txt, /fish/ does not match /fish, i.e. /admin/ doesn't match /admin, so GoogleBot may try to access /admin (and hit a 403) if some hacker links there.

If you have a "Log in" link on your front page, you'll find that Google fully indexes /user/login, even though our robots.txt has Disallow: /user/login/
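
A quick demonstration of that mismatch, with Python's startswith standing in for prefix matching:

print('/admin'.startswith('/admin/'))            # False: 'Disallow: /admin/' never matches /admin
print('/user/login'.startswith('/user/login/'))  # False: so /user/login itself stays crawlable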

Version: 8.4.x-dev » 8.5.x-dev

Version: 8.5.x-dev » 8.6.x-dev

apaderno’s picture

Version: 8.6.x-dev » 8.7.x-dev
philsward’s picture

Considering this problem has been around since Drupal 5 and this issue has been around for over a decade now, I don't see it ever getting committed.

Crazy how difficult it is to get a simple text file committed for Drupal.

Somebody may as well close the issue as "Won't Fix".

leymannx’s picture

I think there's just some fundamental (human) SEO expertise needed to bring this issue forward.

It definitely needs some more attention, yes.

Version: 8.7.x-dev » 8.8.x-dev

cilefen’s picture

Version: 8.8.x-dev » 8.9.x-dev

Version: 8.9.x-dev » 9.1.x-dev

Version: 9.1.x-dev » 9.2.x-dev

Version: 9.2.x-dev » 9.3.x-dev

longwave’s picture

Status: Needs work » Postponed (maintainer needs more info)

I'm not sure what actionable tasks there are for this issue. There is a lot of discussion of different factors but there doesn't seem to be anything concrete we can move forward with. I think all Drupal core can hope to do here is ship a simple robots.txt file that covers some basic paths used by core, as it does at present. Site owners can edit the file directly or install robotstxt module if they wish to override the default settings.

I don't think we can use the * or $ operators; while these are supported by some search engines, they are almost certainly not accepted by all.

#74 was resolved in #3123285: Actually exclude user register, login, logout, and password pages from search results in robots.txt (current rules are broken)

I suggest that this issue be closed, and that any specific, actionable problems with lines in the current robots.txt be discussed in new issues.

Version: 9.3.x-dev » 9.4.x-dev

catch’s picture

Status: Postponed (maintainer needs more info) » Closed (works as designed)

It's been a few months since #86, let's close this one.