Allow robots.txt to disallow URLs with "sort" and "filter" in them [#278775]

Here's a patch. It also tries to catch GET filter parameters, such as those used by Views and the Apache Solr integration. For these we can't omit the ? or &, since Views may put a number before the = sign (e.g. filter1=) and the Apache Solr integration currently uses ?filters=
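
To illustrate why the leading ? or & matters, here is a minimal Python sketch of wildcard matching as applied by crawlers that support the * extension. The patterns and URLs are assumptions chosen for illustration, not the contents of the patch, and robots_pattern_matches is a hypothetical helper, not part of Drupal or of any crawler.

```python
import re


def robots_pattern_matches(pattern: str, url: str) -> bool:
    """Hypothetical helper: test a robots.txt path pattern against a URL.

    Assumes the common wildcard extension: '*' matches any run of
    characters, a trailing '$' anchors the end of the URL, and everything
    else is matched literally from the start of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece so '?' and '.' are not treated as regex syntax.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    regex = "^" + body + ("$" if anchored else "")
    return re.match(regex, url) is not None


# Illustrative URLs (assumptions, not taken from a real site).
urls = [
    "/search/apachesolr_search/drupal?filters=type:page",
    "/some-view?page=1&filter1=news",
    "/forum?sort=asc&order=Topic",
    "/filter/tips",
]

# Illustrative patterns (assumptions about the kind of rule discussed).
patterns = ["/*?filter", "/*&filter", "/*?sort=", "/*filter"]

for pattern in patterns:
    blocked = [url for url in urls if robots_pattern_matches(pattern, url)]
    print(pattern, "->", blocked)
```

In this sketch a bare /*filter also catches an ordinary path such as /filter/tips, whereas /*?filter and /*&filter only catch query parameters. Because they do not depend on the = sign, they cover both filter1= and filters=.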

Comment #6

David_Rothstein commented 20 February 2009 at 03:51

I'm not sure it's a good idea for core to try to predict the way contrib modules are going to use certain URLs....

If a contrib module wants to exclude certain URLs from being indexed, can't they achieve that by putting <META> tags on the appropriate pages? (e.g., http://www.robotstxt.org/meta.html)

Comment #7

FiReaNGeL commented 20 February 2009 at 05:42

A better option is available: http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonica...

Comment #8

dries commented 20 February 2009 at 10:28

I agree with David in #6. Where does it end? :)

Comment #9

Z2222 commented 5 March 2009 at 17:20

Version:	7.x-dev	» 6.9

"I agree with David in #6. Where does it end? :)"

Robots.txt has to be customized for most sites so there is no end -- but there are some bugs with the default robots.txt even for core.

For example, removing the trailing slashes from some paths in robots.txt will reduce server load and prevent URLs like this one from being crawled and indexed:
example.com/user/login?destination=comment/reply/806%2523comment_form
(there's no trailing slash after /user/login in the above URL, so a rule that ends in a slash doesn't block it)

Disallow: /admin/
Disallow: /aggregator
Disallow: /comment/reply/
Disallow: /contact
Disallow: /logout
Disallow: /node/add
Disallow: /search/
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login

Blocking URLs like that can make a big difference with SEO...
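
The trailing-slash point can be sketched in a few lines of Python. This assumes crawlers treat each Disallow value as a literal prefix of the path plus query string; is_disallowed is a hypothetical helper written for this illustration, not the behaviour of any particular crawler.

```python
from typing import Iterable


def is_disallowed(url: str, disallow_rules: Iterable[str]) -> bool:
    """Hypothetical helper: plain robots.txt prefix matching, no wildcards.

    A URL (path plus query string) counts as blocked when it starts with
    any non-empty Disallow value.
    """
    return any(rule and url.startswith(rule) for rule in disallow_rules)


url = "/user/login?destination=comment/reply/806%2523comment_form"

print(is_disallowed(url, ["/user/login/"]))  # rule with a trailing slash
print(is_disallowed(url, ["/user/login"]))   # rule without a trailing slash
```

Under that assumption only the rule without the trailing slash blocks the URL. The trade-off is that a slash-less prefix such as /contact would also match any other path that starts with the same characters.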

Comment #10

dave reid commented 6 March 2009 at 01:11

Version:	6.9	» 7.x-dev

We also have a nice RobotsTxt module that provides a hook_robotstxt(), which other contrib modules can use to add custom lines to the robots.txt file.

Comment #11

Freso commented 7 March 2009 at 14:38

Status:	Needs review	» Closed (duplicate)

#180379: Fix path matching in robots.txt

Comment #12

giorgio79 commented 7 March 2009 at 15:38

"For example, removing the trailing slashes from some paths in robots.txt will reduce server load and prevent URLs like this one from being crawled and indexed:
example.com/user/login?destination=comment/reply/806%2523comment_form
(there's no trailing slash after /user/login in the above URL, so a rule that ends in a slash doesn't block it)"

Wow, I always wondered why my cache_form table grows to over 100 MB in size...

Comment #13

David_Rothstein commented 7 March 2009 at 16:03

Title:	robots.txt SEO feature	» Allow robots.txt to disallow URLs with "sort" and "filter" in them
Status:	Closed (duplicate)	» Needs work

I do not see how this is a duplicate. The original issue as described above was about disallowing /*sort= and similar things, whereas the latest patch at the other issue does not do anything like that. I think it's better to keep the issues separate.

Regarding the RobotsTxt module and hook_robotstxt(), is it ridiculous to suggest that this kind of functionality belongs in core? (robots.txt ships with core, after all, and core generally likes to allow everything to be modifiable...) Sites that care about performance and don't want robots.txt to trigger a Drupal bootstrap could easily override it by creating their own robots.txt as an actual file.

Comment #14

Freso commented 7 March 2009 at 16:19

Status:	Needs work	» Closed (duplicate)

@David_Rothstein: Of course, if you're only looking at the latest patch in the other issue, you're right. However, the latest patch is a long way from the initial issue specification, which included, for example, Disallow: /*sort=. Please read through that entire issue before removing the duplicate status of this one.

Comment #15

dave reid commented 7 March 2009 at 17:03

@David_Rothstein: I never, never implied that robotstxt.module should be in core; I was merely showing that there is an option available in contrib. If a search module provides searches at a certain URL that it does not want indexed, it can implement hook_robotstxt().
