Hi,
I found a nice SEO recommendation here
http://drupalzilla.com/module/forum
Basically, the sorting functionality creates duplicate content, so it is a good idea to add this to robots.txt:
Disallow: /*sort=
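To illustrate what that rule would catch, here is a minimal sketch of wildcard matching as major crawlers such as Googlebot interpret it (`*` matches any run of characters, a trailing `$` anchors the end). The helper names and the sample forum paths are made up for illustration; this is not how any particular crawler is implemented.

```python
import re

def robots_pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern using the '*' and '$' extensions into a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' where the pattern had '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))

def is_blocked(url_path: str, disallow_patterns: list) -> bool:
    """Return True if any Disallow pattern matches the path (query string included)."""
    return any(robots_pattern_to_regex(p).match(url_path) for p in disallow_patterns)

rules = ["/*sort="]
# Hypothetical forum URLs: the plain listing and two sorted variants of it.
for path in ["/forum/1",
             "/forum/1?sort=asc&order=Topic",
             "/forum/1?order=Topic&sort=desc"]:
    print(path, "->", "blocked" if is_blocked(path, rules) else "allowed")
```

Only the sorted variants are blocked, so the unsorted listing remains the single crawlable copy. Crawlers that do not support the wildcard extension would treat the `*` literally and the rule would have no effect for them.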
| Comment | File | Size | Author |
|---|---|---|---|
| #5 | avoid-spiders-278775-5.patch | 484 bytes | pwolanin |
Comments
Comment #1
Z2222 commented.
Comment #2
lilou commented
Feature requests go to CVS HEAD.
Comment #3
smitty commented
Well, I would rather see this as a bug, because it creates duplicate content, and that is bad for ranking in search engines, especially Google.
So why not just take the changes described in http://tips.webdesign10.com/robots-txt-and-drupal and put them into the next release?
Comment #4
pwolanin commented
Bugs also go to HEAD.
Comment #5
pwolanin commented
Here's a patch. It also tries to catch GET filter params, such as those used by Views and Apache Solr integration. Here we can't omit the ? or & since Views may have a number before the = sign (e.g. filter1=) and the Apache Solr integration currently uses ?filters=
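The reasoning about the ? and & can be sketched as follows. The patterns below are assumptions chosen to illustrate the argument and are not copied from the attached patch; the sample paths are hypothetical as well, apart from /filter/tips, which is a regular Drupal core page.

```python
import re

def blocks(pattern: str, url_path: str) -> bool:
    """Wildcard match in the style of major crawlers: '*' matches any run of characters."""
    regex = "^" + ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, url_path) is not None

views_url = "/some-view?filter1=news"        # hypothetical Views exposed filter
solr_url = "/search/foo?filters=type:page"   # hypothetical Apache Solr search URL
core_page = "/filter/tips"                   # ordinary page that should stay crawlable

# Keeping the '=' misses parameter names with a number or suffix before the '='.
print(blocks("/*filter=", views_url))   # False

# Dropping the '=' without a leading '?' or '&' also catches ordinary paths.
print(blocks("/*filter", core_page))    # True

# Anchoring on '?' or '&' limits the match to query parameters.
for pattern in ["/*?filter", "/*&filter"]:
    print(pattern, [blocks(pattern, u) for u in (views_url, solr_url, core_page)])
```

With the leading ? or & in place, both filtered URLs can be caught by one of the two patterns while /filter/tips is left alone.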
Comment #6
David_Rothstein commented
I'm not sure it's a good idea for core to try to predict the way contrib modules are going to use certain URLs...
If a contrib module wants to exclude certain URLs from being indexed, can't they achieve that by putting <META> tags on the appropriate pages? (e.g., http://www.robotstxt.org/meta.html)
Comment #7
FiReaNGeL commented
A better option is available: http://googlewebmastercentral.blogspot.com/2009/02/specify-your-canonica...
Comment #8
dries commented
I agree with David in #6. Where does it end? :)
Comment #9
Z2222 commented
Robots.txt has to be customized for most sites, so there is no end -- but there are some bugs in the default robots.txt, even for core.
For example, removing trailing slashes on some paths in robots.txt will reduce server load and prevent URLs like this from being crawled and indexed:
example.com/user/login?destination=comment/reply/806%2523comment_form
(there's no trailing slash on the above URL, so it doesn't get blocked)
Blocking URLs like that can make a big difference with SEO...
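The trailing-slash problem comes down to plain prefix matching, which is how Disallow rules are compared against URL paths. A minimal sketch, assuming the rule in question is written as `Disallow: /user/login/` (the helper name is made up for illustration):

```python
def prefix_blocked(url_path: str, disallow_prefixes: list) -> bool:
    """Return True if the path (query string included) starts with any Disallow value."""
    return any(url_path.startswith(prefix) for prefix in disallow_prefixes)

url = "/user/login?destination=comment/reply/806%2523comment_form"

# With the trailing slash, the rule only covers paths under /user/login/.
print(prefix_blocked(url, ["/user/login/"]))  # False

# Without it, the rule covers /user/login itself and anything that follows.
print(prefix_blocked(url, ["/user/login"]))   # True
```

The character after `/user/login` in the URL is `?`, not `/`, so the rule with the trailing slash never matches it.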
Comment #10
dave reid commented
We also have a nice RobotsTxt module with a hook_robotstxt() that other contrib modules can use to add custom lines to the robots.txt file.
Comment #11
Freso commented
#180379: Fix path matching in robots.txt
Comment #12
giorgio79 commented
Wow, I always wondered why my cache_form table grows to over 100 MB in size...
Comment #13
David_Rothstein commented
I do not see how this is a duplicate. The original issue as described above was about disallowing /*sort= and similar things, whereas the latest patch at the other issue does not do anything like that. I think it's better to keep the issues separate.
Regarding the RobotsTxt module and hook_robotstxt(), is it ridiculous to suggest that that kind of functionality belongs in core? (Since robots.txt ships with core, after all, and core generally likes to allow everything to be modifiable...) For sites that care about performance and don't want robots.txt to trigger a Drupal bootstrap, they could easily override it by creating their own robots.txt as an actual file.
Comment #14
Freso commented
@David_Rothstein: Of course, if you're only looking at the latest patch in the other issue, you're right. However, the latest patch is a long way from the initial issue specification, which had e.g. . Please read up on that entire issue before removing the duplicate status of this one.
Comment #15
dave reid commented
@David_Rothstein: I never, never implied that robotstxt.module should be in core; I was merely showing that there is an option available in contrib. If a search module provides searches at a certain URL that it does not want indexed, it can implement hook_robotstxt().