Allow robots.txt to disallow URLs with "sort" and "filter" in them
giorgio79 - July 5, 2008 - 08:45
| Project: | Drupal |
| Version: | 7.x-dev |
| Component: | base system |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | duplicate |
Description
Hi,
I found a nice SEO recommendation here
http://drupalzilla.com/module/forum
Basically, sorting functionality creates duplicate content, so it is a good idea to add this to robots.txt:
Disallow: /*sort=

#1
Hi,
I wrote that tutorial.
There is a longer version here:
http://drupalzilla.com/robots-txt
Drupal.org added similar rules to robots.txt here:
http://drupal.org/robots.txt
Disallow: /*?sort*
Disallow: /*&sort*
but you can also do it in one line like this:
Disallow: /*sort=
A trailing asterisk isn't needed.
It's definitely a good idea to add that rule to your robots.txt file...as well as the other ones mentioned here:
http://drupalzilla.com/robots-txt
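To illustrate why the single rule covers both drupal.org rules, here is a minimal Python sketch of wildcard matching. It assumes the crawler honours the `*` and `$` pattern extensions (as Googlebot does); `robots_pattern_matches` is a hypothetical helper written for this illustration, not part of Drupal or any crawler.

```python
import re

def robots_pattern_matches(pattern: str, path: str) -> bool:
    """Return True if a wildcard-style Disallow pattern matches the URL path.

    Assumption: '*' matches any run of characters and a trailing '$'
    anchors the end of the URL; everything else is a literal prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then join them with '.*' for each '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start only, so the rule acts as a prefix match.
    return re.match(regex, path) is not None

# 'sort' as the first query parameter and as a later one.
first = "/forum?sort=asc"
later = "/forum?page=1&sort=asc"

print(robots_pattern_matches("/*?sort*", first))   # needs the '?' form
print(robots_pattern_matches("/*&sort*", later))   # needs the '&' form
print(robots_pattern_matches("/*sort=", first))    # one rule covers both
print(robots_pattern_matches("/*sort=", later))
```

Because an unanchored rule is already a prefix match, `/*sort=` and `/*sort=*` behave identically in this sketch.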
#2
Feature requests go to CVS HEAD.
#3
Well, I would rather see this as a bug because it creates duplicate content and that is bad for the ranking in search engines, especially in Google.
So why not just take the changes described in http://tips.webdesign10.com/robots-txt-and-drupal and put them into the next release?
#4
Bugs also go to HEAD.
#5
Here's a patch. It also tries to catch GET filter params, such as those used by Views and Apache Solr integration. Here we can't omit the ? or & since Views may have a number before the = sign (e.g. filter1=) and Apache Solr integration currently uses ?filters=
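The patch itself is not reproduced in this thread, so the rules below are hypothetical stand-ins chosen only to show the reasoning: a rule ending in `filter=` misses `filter1=` and `filters=`, while a bare `/*filter` would also catch ordinary paths. The sketch assumes the same `*` wildcard semantics that Googlebot honours; `blocked` and `rule_to_regex` are illustrative names.

```python
import re

def rule_to_regex(rule: str) -> re.Pattern:
    # Each '*' becomes '.*'; all other characters are matched literally.
    return re.compile(".*".join(re.escape(part) for part in rule.split("*")))

def blocked(path: str, rules) -> bool:
    # A path is blocked when any rule matches from the start of the path.
    return any(rule_to_regex(rule).match(path) for rule in rules)

# Hypothetical rules, NOT taken from the actual patch.
WITH_SEPARATOR = ["/*?filter", "/*&filter"]
WITH_EQUALS = ["/*filter="]
BARE = ["/*filter"]

views_url = "/view?page=2&filter1=foo"       # Views-style numbered filter
solr_url = "/search?filters=type:page"       # Apache Solr-style ?filters=
plain_url = "/node/filter-tips"              # ordinary path, no query string

for url in (views_url, solr_url, plain_url):
    print(url, blocked(url, WITH_SEPARATOR), blocked(url, WITH_EQUALS), blocked(url, BARE))
```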
#6
I'm not sure it's a good idea for core to try to predict the way contrib modules are going to use certain URLs....
If a contrib module wants to exclude certain URLs from being indexed, can't they achieve that by putting <META> tags on the appropriate pages? (e.g., http://www.robotstxt.org/meta.html)
#7
Better option available : http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonica...
#8
I agree with David in #6. Where does it end? :)
#9
Robots.txt has to be customized for most sites so there is no end -- but there are some bugs with the default robots.txt even for core.
For example, removing trailing slashes on some paths in robots.txt will reduce server load and prevent URLs like this getting crawled and indexed:
example.com/user/login?destination=comment/reply/806%2523comment_form
(there's no trailing slash on the above URL, so it doesn't get blocked)
Disallow: /admin/
Disallow: /aggregator
Disallow: /comment/reply/
Disallow: /contact
Disallow: /logout
Disallow: /node/add
Disallow: /search/
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login
Blocking URLs like that can make a big difference with SEO...
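The trailing-slash point follows from the fact that a plain Disallow value is compared against the start of the URL path. A minimal Python sketch, with `is_disallowed` as a hypothetical helper and wildcards deliberately left out:

```python
from typing import Iterable

def is_disallowed(path: str, disallow_prefixes: Iterable[str]) -> bool:
    """Plain robots.txt matching: a rule blocks every path that starts with it."""
    return any(path.startswith(prefix) for prefix in disallow_prefixes)

# Path and query string of the URL quoted in the comment above.
url = "/user/login?destination=comment/reply/806%2523comment_form"

# With the trailing slash the rule is not a prefix of the URL.
print(is_disallowed(url, ["/user/login/"]))

# Without the trailing slash the rule is a prefix, so the URL is blocked.
print(is_disallowed(url, ["/user/login"]))
```

The trade-off is that a rule without the trailing slash blocks any path that merely begins with the same characters, so `/user/login` would also cover a path such as `/user/login-help`.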
#10
We also have a nice RobotsTxt module that provides a hook_robotstxt(), which other contrib modules can use to add custom lines to the robots.txt file.
#11
#180379: Fixing Robots.txt
#12
For example, removing trailing slashes on some paths in robots.txt will reduce server load and prevent URLs like this getting crawled and indexed:
example.com/user/login?destination=comment/reply/806%2523comment_form
(there's no trailing slash on the above URL, so it doesn't get blocked)
Wow, I always wondered why my cache_form table grows to over 100 MB in size...
#13
I do not see how this is a duplicate. The original issue as described above was about disallowing /*sort= and similar things, whereas the latest patch at the other issue does not do anything like that. I think it's better to keep the issues separate.
Regarding the RobotsTxt module and hook_robotstxt(), is it ridiculous to suggest that that kind of functionality belongs in core? (Since robots.txt ships with core, after all, and core generally likes to allow everything to be modifiable...) For sites that care about performance and don't want robots.txt to trigger a Drupal bootstrap, they could easily override it by creating their own robots.txt as an actual file.
#14
@ David_Rothstein: Of course, if you're only looking at the latest patch in the other issue, you're right. However, the latest patch is a long way from the initial issue specification, which had e.g. . Please read up on that entire issue before removing the duplicate status of this one.
#15
@David_Rothstein: I never, never implied that robotstxt.module should be in core; I was merely showing that there is an option available in contrib. If a search module provides searches at a certain URL that it does not want indexed, it can implement hook_robotstxt().