Allow robots.txt to disallow URLs with "sort" and "filter" in them

giorgio79 - July 5, 2008 - 08:45
Project:Drupal
Version:7.x-dev
Component:base system
Category:bug report
Priority:normal
Assigned:Unassigned
Status:duplicate
Description

Hi,

I found a nice SEO recommendation here

http://drupalzilla.com/module/forum

Basically, sorting functionality creates duplicate content, so it is a good idea to add this to robots.txt:

Disallow: /*sort=

#1

J. Cohen - July 5, 2008 - 18:24

Hi,

I wrote that tutorial.

There is a longer version here:
http://drupalzilla.com/robots-txt

Drupal.org added similar rules to its robots.txt here:
http://drupal.org/robots.txt

Disallow: /*?sort*
Disallow: /*&sort*

but you can also do it in one line like this:

Disallow: /*sort=

A trailing asterisk isn't needed.

It's definitely a good idea to add that rule to your robots.txt file...as well as the other ones mentioned here:
http://drupalzilla.com/robots-txt
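
To illustrate how the patterns above behave, here is a minimal Python sketch of the wildcard matching that major crawlers such as Googlebot document: `*` matches any run of characters, rules match as prefixes, and a trailing `$` anchors the end. The function name `rule_matches` and the sample paths are made up for illustration. Wildcards are a crawler extension rather than part of the original robots.txt convention, so not every crawler will behave this way.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Return True if a wildcard-style Disallow pattern matches the path.

    Assumed semantics: '*' matches any run of characters, a trailing '$'
    anchors the end, and everything else is a literal prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start of the path, giving prefix behavior.
    return re.match(regex, path) is not None


if __name__ == "__main__":
    first_param = "/forum?sort=asc&order=Topic"   # hypothetical URL
    later_param = "/forum?order=Topic&sort=asc"   # hypothetical URL

    # The two-rule form needs one rule per position of the parameter.
    assert rule_matches("/*?sort*", first_param)
    assert not rule_matches("/*?sort*", later_param)
    assert rule_matches("/*&sort*", later_param)

    # The one-line form covers both positions.
    assert rule_matches("/*sort=", first_param)
    assert rule_matches("/*sort=", later_param)

    # A trailing asterisk changes nothing, because rules are prefix matches.
    assert rule_matches("/*sort=*", first_param)
    assert not rule_matches("/*sort=", "/forum")
```

One trade-off under these assumptions: `/*sort=` is broader than the two-rule form, since it also matches any parameter whose name merely ends in "sort" (for example a hypothetical `?resort=1`).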

#2

lilou - August 23, 2008 - 22:19
Version:5.x-dev» 7.x-dev

Feature requests go to CVS HEAD.

#3

smitty - February 6, 2009 - 18:38
Version:7.x-dev» 5.15
Category:feature request» bug report

Well, I would rather see this as a bug because it creates duplicate content and that is bad for the ranking in search engines, especially in Google.

So why not just take the changes described in http://tips.webdesign10.com/robots-txt-and-drupal and put them into the next release?

#4

pwolanin - February 20, 2009 - 02:44
Version:5.15» 7.x-dev

bugs also go to HEAD.

#5

pwolanin - February 20, 2009 - 03:06
Status:active» needs review

Here's a patch. It also tries to catch GET filter parameters, such as those used by Views and the Apache Solr integration. Here we can't omit the ? or & since Views may have a number before the = sign (e.g. filter1=) and the Apache Solr integration currently uses ?filters=.
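
The patch itself is not reproduced in this thread, so the following Python sketch only assumes it adds rules shaped like drupal.org's sort rules, e.g. `/*?filter*` and `/*&filter*`. It shows why a `/*filter=` rule would miss the Views and Apache Solr parameters mentioned above. The helper `matches` and all sample URLs are hypothetical, and the wildcard semantics are those documented by major crawlers, not something every robot honors.

```python
import re


def matches(pattern: str, path: str) -> bool:
    # Assumed semantics: '*' is any run of characters, the rest is literal,
    # and the rule matches as a prefix of the path plus query string.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, path) is not None


# Hypothetical URLs using the parameter shapes named in the comment.
views_url = "/content?filter1=story"
solr_url = "/search/site/drupal?filters=type:page"
later_url = "/content?page=2&filter0=story"

# A rule keyed on "filter=" misses both numbered and plural parameters.
assert not matches("/*filter=", views_url)
assert not matches("/*filter=", solr_url)

# Rules keyed on the leading "?" or "&" catch them.
assert matches("/*?filter*", views_url)
assert matches("/*?filter*", solr_url)
assert matches("/*&filter*", later_url)
```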

Attachment: avoid-spiders-278775-5.patch (484 bytes)
Testbed results: passed (10426 passes, 0 fails, 0 exceptions)

#6

David_Rothstein - February 20, 2009 - 03:51

I'm not sure it's a good idea for core to try to predict the way contrib modules are going to use certain URLs....

If a contrib module wants to exclude certain URLs from being indexed, can't they achieve that by putting <META> tags on the appropriate pages? (e.g., http://www.robotstxt.org/meta.html)
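
As a rough sketch of the META tag approach, the Python function below decides whether a page should carry a robots meta tag based on its query parameters. The function name `robots_meta_tag`, the prefix list, and the sample URLs are assumptions for illustration only; a real Drupal module would do this in PHP through its own page-building code.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical list of query-parameter prefixes that mark duplicate views.
NOINDEX_PREFIXES = ("sort", "filter")


def robots_meta_tag(url: str) -> str:
    """Return a robots meta tag for URLs with sort/filter parameters.

    Returns an empty string when the page should be indexed normally.
    """
    query = urlsplit(url).query
    keys = [key for key, _ in parse_qsl(query, keep_blank_values=True)]
    if any(key.startswith(NOINDEX_PREFIXES) for key in keys):
        return '<meta name="robots" content="noindex, follow">'
    return ""


if __name__ == "__main__":
    print(robots_meta_tag("http://example.com/forum?sort=asc&order=Topic"))
    print(repr(robots_meta_tag("http://example.com/forum")))
```

Note that a crawler has to fetch a page to see its meta tag, so this keeps pages out of the index but, unlike a robots.txt rule, does not stop them from being crawled.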

#7

FiReaNG3L - February 20, 2009 - 05:42

#8

Dries - February 20, 2009 - 10:28

I agree with David in #6. Where does it end? :)

#9

J. Cohen - March 5, 2009 - 17:20
Version:7.x-dev» 6.9

I agree with David in #6. Where does it end? :)

Robots.txt has to be customized for most sites so there is no end -- but there are some bugs with the default robots.txt even for core.

For example, removing trailing slashes on some paths in robots.txt will reduce server load and prevent URLs like this getting crawled and indexed:
example.com/user/login?destination=comment/reply/806%2523comment_form
(there's no trailing slash on the above URL, so it doesn't get blocked)

Disallow: /admin/
Disallow: /aggregator
Disallow: /comment/reply/
Disallow: /contact
Disallow: /logout
Disallow: /node/add
Disallow: /search/
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login

Blocking URLs like that can make a big difference with SEO...
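
The trailing-slash effect can be checked with Python's standard `urllib.robotparser`, which applies plain prefix matching to Disallow paths. This is a minimal sketch: the helper `is_blocked` is a made-up name, and the URL is the example from above with a scheme added.

```python
from urllib.robotparser import RobotFileParser

URL = ("http://example.com/user/login"
       "?destination=comment/reply/806%2523comment_form")


def is_blocked(disallow_path: str, url: str) -> bool:
    """Return True if a single 'Disallow' rule blocks the URL for all robots."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_path}"])
    return not parser.can_fetch("*", url)


if __name__ == "__main__":
    # With the trailing slash, the rule is not a prefix of "/user/login?...".
    assert not is_blocked("/user/login/", URL)
    # Without it, the rule is a prefix, so the URL is blocked.
    assert is_blocked("/user/login", URL)
```

The same prefix rule means that `Disallow: /user/login` still blocks everything under `/user/login/`, so dropping the slash only widens the match.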

#10

Dave Reid - March 6, 2009 - 01:11
Version:6.9» 7.x-dev

We also have a nice RobotsTxt module with a hook_robotstxt() that other contrib modules can implement to add custom lines to the robots.txt file.
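
The general idea of letting modules contribute lines can be sketched in Python as follows. This is not the RobotsTxt module's actual code or API (that module is written in PHP); the names `register`, `build_robots_txt`, `BASE_RULES`, and the sample rule are all hypothetical and only illustrate the pattern of collecting lines from several contributors.

```python
from typing import Callable, List

# Hypothetical base rules shipped with the site.
BASE_RULES: List[str] = ["User-agent: *", "Disallow: /admin/"]

# Registry of callables, each returning extra robots.txt lines.
_contributors: List[Callable[[], List[str]]] = []


def register(func: Callable[[], List[str]]) -> Callable[[], List[str]]:
    """Register a callable that contributes robots.txt lines."""
    _contributors.append(func)
    return func


@register
def search_module_rules() -> List[str]:
    # A hypothetical search module hiding its own filtered result pages.
    return ["Disallow: /*?filters="]


def build_robots_txt() -> str:
    """Combine the base rules with every contributor's lines, in order."""
    lines = list(BASE_RULES)
    for contributor in _contributors:
        lines.extend(contributor())
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt(), end="")
```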

#11

Freso - March 7, 2009 - 14:38
Status:needs review» duplicate

#180379: Fixing Robots.txt

#12

giorgio79 - March 7, 2009 - 15:38

For example, removing trailing slashes on some paths in robots.txt will reduce server load and prevent URLs like this getting crawled and indexed:
example.com/user/login?destination=comment/reply/806%2523comment_form
(there's no trailing slash on the above URL, so it doesn't get blocked)

Wow, I always wondered why my cache_form table grows to over 100 MB in size...

#13

David_Rothstein - March 7, 2009 - 16:03
Title:robots.txt SEO feature» Allow robots.txt to disallow URLs with "sort" and "filter" in them
Status:duplicate» needs work

I do not see how this is a duplicate. The original issue as described above was about disallowing /*sort= and similar things, whereas the latest patch at the other issue does not do anything like that. I think it's better to keep the issues separate.

Regarding the RobotsTxt module and hook_robotstxt(), is it ridiculous to suggest that that kind of functionality belongs in core? (Since robots.txt ships with core, after all, and core generally likes to allow everything to be modifiable...) For sites that care about performance and don't want robots.txt to trigger a Drupal bootstrap, they could easily override it by creating their own robots.txt as an actual file.

#14

Freso - March 7, 2009 - 16:19
Status:needs work» duplicate

@ David_Rothstein: Of course, if you're only looking at the latest patch in the other issue, you're right. However, the latest patch is a long way from the initial issue specification, which had e.g. Disallow: /*sort=. Please read up on that entire issue before removing the duplicate status of this one.

#15

Dave Reid - March 7, 2009 - 17:03

@David_Rothstein: I never, never implied that robotstxt.module should be in core; I was merely showing that there is an option available in contrib. If a search module provides searches at a certain URL that it does not want indexed, it can implement hook_robotstxt().

 
 
