Hi,

I found a nice SEO recommendation here:

http://drupalzilla.com/module/forum

Basically, the sorting functionality creates duplicate content, so it is a good idea to add this to robots.txt:

Disallow: /*sort=
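
To illustrate what a rule like this is meant to catch, here is a minimal Python sketch of wildcard matching. It assumes the crawler supports the `*` (any run of characters) and trailing `$` (end of URL) extensions; not every crawler does. The function name `rule_matches` and the sample URLs are hypothetical, chosen only for illustration.

```python
import re


def rule_matches(pattern: str, url_path: str) -> bool:
    """Return True if a robots.txt path pattern matches url_path.

    Assumed semantics: '*' matches any run of characters, a trailing '$'
    anchors the end of the URL, and otherwise the pattern only has to
    match a prefix of the URL.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, which gives prefix matching.
    return re.match(regex, url_path) is not None


# Hypothetical URLs: sorted listings versus the plain listing.
print(rule_matches("/*sort=", "/forum/1?sort=asc&order=Topic"))
print(rule_matches("/*sort=", "/forum/1?order=Topic&sort=desc"))
print(rule_matches("/*sort=", "/forum/1"))
```

Under these assumptions, only the URLs carrying a `sort=` parameter match the rule, so the unsorted page stays crawlable.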

Comment    File                            Size         Author
#5         avoid-spiders-278775-5.patch    484 bytes    pwolanin

Comments

Z2222’s picture

.

lilou’s picture

Version: 5.x-dev » 7.x-dev

Feature requests go to CVS HEAD.

smitty’s picture

Version: 7.x-dev » 5.15
Category: feature » bug

Well, I would rather see this as a bug, because it creates duplicate content and that is bad for search engine rankings, especially in Google.

So why not just take the changes described in http://tips.webdesign10.com/robots-txt-and-drupal and put them into the next release?

pwolanin’s picture

Version: 5.15 » 7.x-dev

Bugs also go to HEAD.

pwolanin’s picture

Status: Active » Needs review
Status    File                            Size
new       avoid-spiders-278775-5.patch    484 bytes

Here's a patch. It also tries to catch GET filter parameters, such as those used by Views and the Apache Solr integration. Here we can't omit the ? or & since Views may have a number before the = sign (e.g. filter1=), and the Apache Solr integration currently uses ?filters=
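
The reasoning about the ? and & can be checked with a small Python sketch. The patch contents are not shown in this thread, so the rules below (`/*?filter`, `/*&filter`, `/*filter=`, `/*filter`) are assumptions made for illustration, as are the helper name `blocked` and the sample URLs. The matcher assumes a crawler that treats `*` as a wildcard.

```python
import re


def blocked(pattern: str, url: str) -> bool:
    """Prefix match where '*' stands for any run of characters (assumed semantics)."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, url) is not None


# Hypothetical rules; the real patch may differ.
anchored_rules = ["/*?filter", "/*&filter"]  # keep the ? or &
equals_rule = "/*filter="                    # same style as /*sort=
loose_rule = "/*filter"                      # no anchor at all

views_url = "/some-view?filter1=foo"         # a number sits before the = sign
solr_url = "/search/foo?filters=type:page"   # ?filters= style
normal_page = "/filter/tips"                 # ordinary path containing "filter"

for url in (views_url, solr_url, normal_page):
    print(
        url,
        any(blocked(rule, url) for rule in anchored_rules),
        blocked(equals_rule, url),
        blocked(loose_rule, url),
    )
```

In this sketch the `filter=` rule misses both filtered URLs, the unanchored rule also blocks the ordinary page, and only the rules that keep the ? or & catch the filter parameters without touching `/filter/tips`.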

David_Rothstein’s picture

I'm not sure it's a good idea for core to try to predict the way contrib modules are going to use certain URLs....

If a contrib module wants to exclude certain URLs from being indexed, can't they achieve that by putting <META> tags on the appropriate pages? (e.g., http://www.robotstxt.org/meta.html)

FiReaNGeL’s picture

dries’s picture

I agree with David in #6. Where does it end? :)

Z2222’s picture

Version: 7.x-dev » 6.9

I agree with David in #6. Where does it end? :)

Robots.txt has to be customized for most sites so there is no end -- but there are some bugs with the default robots.txt even for core.

For example, removing trailing slashes on some paths in robots.txt will reduce server load and prevent URLs like this getting crawled and indexed:
example.com/user/login?destination=comment/reply/806%2523comment_form
(there's no trailing slash on the above URL, so it doesn't get blocked)

Disallow: /admin/
Disallow: /aggregator
Disallow: /comment/reply/
Disallow: /contact
Disallow: /logout
Disallow: /node/add
Disallow: /search/
Disallow: /user/register
Disallow: /user/password
Disallow: /user/login

Blocking URLs like that can make a big difference with SEO...
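
The effect of the trailing slash can be reproduced with Python's standard-library robots.txt parser, which does plain prefix matching. This is only a sketch of how one parser behaves, not a statement about any particular search engine; the helper name `is_blocked` is hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(disallow_path: str, url: str) -> bool:
    """Return True if a single Disallow rule keeps all crawlers away from url."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {disallow_path}"])
    return not parser.can_fetch("*", url)


url = "http://example.com/user/login?destination=comment/reply/806%2523comment_form"

# With the trailing slash the rule is not a prefix of the URL path.
print(is_blocked("/user/login/", url))

# Without the trailing slash the rule is a prefix, so the URL is blocked.
print(is_blocked("/user/login", url))
```

Note that the shorter rule also blocks `/user/login` itself and any other path that merely starts with those characters, which is the intended trade-off here.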

dave reid’s picture

Version: 6.9 » 7.x-dev

We also have a nice RobotsTxt module with a hook_robotstxt() that other contrib modules can use to add custom lines to the robots.txt file.

Freso’s picture

Status: Needs review » Closed (duplicate)

giorgio79’s picture

For example, removing trailing slashes on some paths in robots.txt will reduce server load and prevent URLs like this getting crawled and indexed:
example.com/user/login?destination=comment/reply/806%2523comment_form
(there's no trailing slash on the above URL, so it doesn't get blocked)

Wow, I always wondered why my cache_form table grows to over 100 MB in size...

David_Rothstein’s picture

Title: robots.txt SEO feature » Allow robots.txt to disallow URLs with "sort" and "filter" in them
Status: Closed (duplicate) » Needs work

I do not see how this is a duplicate. The original issue as described above was about disallowing /*sort= and similar things, whereas the latest patch at the other issue does not do anything like that. I think it's better to keep the issues separate.

Regarding the RobotsTxt module and hook_robotstxt(), is it ridiculous to suggest that that kind of functionality belongs in core? (Since robots.txt ships with core, after all, and core generally likes to allow everything to be modifiable...) Sites that care about performance and don't want robots.txt to trigger a Drupal bootstrap could easily override it by creating their own robots.txt as an actual file.

Freso’s picture

Status: Needs work » Closed (duplicate)

@David_Rothstein: Of course, if you're only looking at the latest patch in the other issue, you're right. However, the latest patch is a long way from the initial issue specification, which included e.g. Disallow: /*sort=. Please read up on that entire issue before removing the duplicate status of this one.

dave reid’s picture

@David_Rothstein: I never, never implied that robotstxt.module should be in core; I was merely showing that there is an option available in contrib. If a search module provides searches at a certain URL that it does not want indexed, it can implement hook_robotstxt().