Fixing Robots.txt

Drupalzilla.com - October 3, 2007 - 00:43
Project: Drupal
Version: 7.x-dev
Component: other
Category: bug report
Priority: normal
Assigned: Unassigned
Status: patch (code needs review)
Description

This patch cuts down on the duplicate content that is spidered by search engine crawlers. It is difficult to make a single robots.txt file for all Drupal sites, but this one fixes some errors in the current default file and adds some new rules.

Wildcards (*) and end-of-string ($) characters are not part of the robots.txt standard, but they are accepted by Google, Yahoo, and MSN. You can test the rules in Google's Robots.txt tool in their Webmaster Tools: http://www.google.com/webmasters/tools

For example, you need to use a wildcard to block these tables that can be sorted in multiple ways:
http://drupal.org/forum/2?sort=asc&order=Last+reply&page=428

The rule is:
Disallow: /*sort=
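
As a rough illustration of how a crawler that supports these extensions might evaluate the rule, here is a minimal Python sketch. It assumes the commonly described semantics ('*' matches any run of characters, a trailing '$' marks the end of the URL, everything else is a literal prefix). The function names are made up for this example and are not part of Drupal or of any crawler.

```python
import re


def rule_to_regex(rule: str) -> "re.Pattern[str]":
    """Translate a wildcard-style robots.txt path rule into a regex.

    Assumed semantics: '*' matches any run of characters, a trailing '$'
    anchors the end of the URL, and the rest is matched literally.
    """
    anchored = rule.endswith("$")
    body = rule[:-1] if anchored else rule
    # Escape each literal piece and join the pieces with '.*' for the wildcard.
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_blocked(rule: str, path: str) -> bool:
    # re.match anchors at the start of the string, giving prefix behaviour.
    return rule_to_regex(rule).match(path) is not None


sortable = "/forum/2?sort=asc&order=Last+reply&page=428"
print(is_blocked("/*sort=", sortable))    # the sorted table view
print(is_blocked("/*sort=", "/forum/2"))  # the plain forum page
```

Under these assumptions the sorted view matches the rule and the plain forum page does not.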

Pages like these do not need to get spidered because there are other ways for crawlers to access the content:
drupal.org/tracker/12322
drupal.org/tracker/%5Buserid?page=9
drupal.org/tracker/order/author/desc?page=55

Trailing slashes have been removed from some URLs in the patch, but not all. For example, the current rule to block /modules/ is:
Disallow: /modules/

You cannot access the URL example.com/modules from a browser -- the server will add the trailing slash for you because it is a directory. Also, if you remove the trailing slash in this case, it might block a page of regular content with a URL like: example.com/modules-are-good
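
The effect of the trailing slash can be checked with Python's standard urllib.robotparser module, which applies plain prefix matching. This is only a sketch: the helper name and the file path under /modules/ are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def allowed(rule: str, url: str) -> bool:
    """Return True if `url` may be fetched under a single Disallow rule."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {rule}"])
    return parser.can_fetch("*", url)


# With the trailing slash, only URLs inside the directory are blocked.
print(allowed("/modules/", "http://example.com/modules/README.txt"))
print(allowed("/modules/", "http://example.com/modules-are-good"))

# Without it, the rule is a bare prefix and also catches the content page.
print(allowed("/modules", "http://example.com/modules-are-good"))
```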

It is generally better not to have duplicate content indexed by search engines unless your site has a lot of link popularity. The aggregator module can create a lot of duplicate content.

The path /aggregator is blocked without a trailing slash because the following URLs should probably be blocked:
/aggregator
/aggregator/sources
/aggregator/categories
/aggregator?page=3

Removing the trailing slash on /aggregator would also block someone's post named example.com/aggregator-module-tutorial. So it may be better to change the attached patch to:
Disallow: /aggregator/
Disallow: /aggregator?
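
A quick way to see what this pair of rules covers is to run the example URLs through a plain prefix check. The sketch below assumes simple prefix matching on the path plus query string; the helper name is hypothetical.

```python
def blocked_by(rules, path):
    """Plain prefix matching: a path is blocked if it starts with any rule."""
    return any(path.startswith(rule) for rule in rules)


rules = ["/aggregator/", "/aggregator?"]
paths = [
    "/aggregator",
    "/aggregator/sources",
    "/aggregator/categories",
    "/aggregator?page=3",
    "/aggregator-module-tutorial",
]

for path in paths:
    print(path, "blocked" if blocked_by(rules, path) else "allowed")
```

Under this assumption the tutorial post stays crawlable, but so does the bare /aggregator page. Covering that page as well would need a third rule such as Disallow: /aggregator$, which only works for crawlers that understand the $ extension.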

The following rule prevents all feeds from being crawled, except the main one at /rss.xml.
Disallow: /*/feed$

The following rule prevents the user track pages from being crawled:
Disallow: /*/track$

The first rule below prevents /search (a 302 redirect) from being hit, while the second one blocks the search results:
Disallow: /search$
Disallow: /search/

A single rule, Disallow: /search, would also block pages of content like example.com/search-engine-optimization -- that is why there are two rules above.
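
The feed, track and search rules above can be tried out together with a small matcher. This is a sketch under the assumption that '*' is a wildcard and a trailing '$' marks the end of the URL. The helper names and the sample paths (other than /search, /rss.xml and /search-engine-optimization) are invented for illustration.

```python
import re

RULES = ["/*/feed$", "/*/track$", "/search$", "/search/"]


def matches(rule: str, path: str) -> bool:
    """Wildcard-style match: '*' is any run of characters, trailing '$' ends the URL."""
    anchored = rule.endswith("$")
    parts = (rule[:-1] if anchored else rule).split("*")
    regex = ".*".join(re.escape(p) for p in parts) + ("$" if anchored else "")
    return re.match(regex, path) is not None


def blocked(path: str) -> bool:
    return any(matches(rule, path) for rule in RULES)


for path in [
    "/search",                      # the redirecting search page
    "/search/node/drupal",          # a search results page (hypothetical path)
    "/search-engine-optimization",  # regular content
    "/rss.xml",                     # the main feed
    "/taxonomy/term/1/feed",        # a per-term feed (hypothetical path)
    "/user/1/track",                # a user track page (hypothetical path)
]:
    print(path, "blocked" if blocked(path) else "allowed")
```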

There is no way to make a single perfect robots.txt file for all sites, but this patch should help cut down on the server load from crawlers and improve general SEO of the basic Drupal installation.

Further explanations of the reasoning behind these proposed changes can be found on my Drupal robots.txt tutorial here:
http://drupalzilla.com/robots-txt

Each module that you add to a Drupal installation can require additional robots.txt rules. I left out robots.txt rules for modules in the attached patch, even though it might be a good idea to put some rules in for common modules. I'm building a database of Drupal modules with instructions on how to modify the robots.txt file for each extra module that is installed:
http://drupalzilla.com/module

Attachment: robots_2.patch (1.36 KB)

#1

moshe weitzman - October 3, 2007 - 01:07
Status: patch (code needs review) » patch (code needs work)

thanks for working on this ... please use unified diff format for diffs. see diffandpatch. we're so used to them that i can't recall which lines of yours are adds versus deletes.

#2

Drupalzilla.com - October 4, 2007 - 00:33

Sorry about that (my first time submitting a patch).

I've attached a new patch made with cvs diff -up.

Attachment: robots_3.patch (1.75 KB)

#3

catch - October 4, 2007 - 11:32
Status: patch (code needs work) » patch (code needs review)

setting back to review.

#4

earnie - October 4, 2007 - 12:16

Why do we want to do +Disallow: /node$ and its q equivalent?

Do we really want to remove -Disallow: /contact/ and add +Disallow: /contact$ and others or do we just want to add the $ ending ones? The three engines you mention aren't the only ones.

We should think about creating robots.txt on the fly in the module activation processes. Activate/deactivate module could add/remove robots.txt text.
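
One way to picture this idea is a registry of per-module rules from which robots.txt is regenerated whenever the set of enabled modules changes. The sketch below is in Python purely for illustration (Drupal itself is PHP). The registry, the function name and the grouping of rules are all assumptions; the rule strings are taken from this issue.

```python
# Hypothetical registry: module name -> Disallow paths it contributes.
MODULE_RULES = {
    "aggregator": ["/aggregator/", "/aggregator?"],
    "search": ["/search$", "/search/"],
}

# Rules assumed to apply regardless of which modules are enabled.
CORE_RULES = ["/comment/reply", "/node/add", "/user/register"]


def build_robots_txt(enabled_modules):
    """Assemble robots.txt text from core rules plus enabled modules' rules."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {path}" for path in CORE_RULES]
    for module in sorted(enabled_modules):
        rules = MODULE_RULES.get(module, [])
        if rules:
            lines.append(f"# {module} module")
            lines += [f"Disallow: {path}" for path in rules]
    return "\n".join(lines) + "\n"


# Enabling or disabling a module just changes the input set.
print(build_robots_txt({"search"}))
print(build_robots_txt({"search", "aggregator"}))
```

A real implementation would also need a way to preserve rules edited by hand, which this sketch does not attempt.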

#5

catch - October 4, 2007 - 14:50

"Why do we want to do +Disallow: /node$ and its q equivalent?"

Wouldn't that mean duplicate content between example.com/ and example.com/node if /node is the front page?

#6

earnie - October 4, 2007 - 19:38

Yes, I suppose. But then there is http://drupal.org/project/globalredirect which would correct that issue without needing to modify the robots.txt file. And adding http://drupal.org/project/gsitemap can help even further. However for the default install, I see your point.

#7

Drupalzilla.com - October 4, 2007 - 22:47

"Why do we want to do +Disallow: /node$ and its q equivalent?"

example.com/node is duplicate content of example.com/
example.com/node/1 shouldn't be blocked though.

"Do we really want to remove -Disallow: /contact/ and add +Disallow: /contact$ and others or do we just want to add the $ ending ones? The three engines you mention aren't the only ones."

It looked to me like the current rule, Disallow: /contact/, was an attempt to block the contact forms. It doesn't block the default contact form because of the trailing slash. But it might be best just to leave that rule in its current form.

"We should think about creating robots.txt on the fly in the module activation processes. Activate/deactivate module could add/remove robots.txt text."

That is a good idea because a lot of modules create duplicate content problems -- as long as you could still have precise control over the robots.txt rules by hand.

#8

catch - October 4, 2007 - 23:51

dynamic robots.txt for modules could maybe be an addition to http://drupal.org/node/53579 (either way it's a very nice idea).

#9

Freso - October 26, 2007 - 09:01
Status: patch (code needs review) » patch (code needs work)

By default it is true that "example.com/node is duplicate content of example.com/", but this can easily be changed, even by people who do not have access to the file system and thus won't be able to edit robots.txt.

Also, where does your use of "$" come from? I haven't been able to discern its function from anything I could find at robotstxt.org or Wikipedia...

It was also agreed upon in issue 75916 to have aggregator indexed by default, so that should be changed to "If you do not want your aggregator pages to be indexed, uncomment the following line".

Finally, I'd recommend reading through issue 75916, as it contains some hints and some discussion on this.

#10

catch - October 26, 2007 - 10:22

$ is an end-of-string character (it marks the end of the URL), not in the spec, but recognised by all major search engines (this is covered in the issue discussion and I almost cut and pasted it).

On example.com/node I agree, though: a lot of sites don't use /node as the front page.

#11

catch - February 11, 2008 - 21:36
Version: 6.x-dev » 7.x-dev

Bumping to 7.x

#12

Freso - February 20, 2008 - 22:26
Status: patch (code needs work) » patch (code needs review)

Okay, I've had some time to turn this over in my mind, and I'm feeling rather uneasy about using * and $ in the robots.txt, as they're not standard. Google and co. might well support them, but I'll bet you that there are tons of (polite) robots out there that actually follow the standard and don't care for (or possibly don't even know of) the extensions Google et al. use. And to me these extensions seem like they would confuse standards-compliant robots.
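
To make the concern concrete, here is a small Python sketch of what a robot might do if it reads every rule as a literal path prefix, with no special meaning for * or $. That is an assumption about how a strictly standard-following robot behaves, not a description of any particular crawler, and the helper name is made up.

```python
def literal_prefix_blocked(rule: str, path: str) -> bool:
    """Treat the rule as a literal prefix; '*' and '$' are ordinary characters."""
    return path.startswith(rule)


# Rules that rely on the extensions never match a real path...
print(literal_prefix_blocked("/*sort=", "/forum/2?sort=asc"))
print(literal_prefix_blocked("/search$", "/search"))

# ...while plain prefix rules keep working as before.
print(literal_prefix_blocked("/search/", "/search/node"))
print(literal_prefix_blocked("/user/login", "/user/login"))
```

Under that reading, the extension-based rules would simply have no effect for such a robot, while the plain prefix rules would behave the same for every robot.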

The attached patch removes some of the trailing slashes, namely /foo/bar/ to /foo/bar, leaving /foo/ alone (in case someone wants to make a /fooxyz node). Even if the non-standards approach is deemed a road worth continuing along, this patch will provide a temporary boost to robots.txt's effectiveness.

Attachment: robots.txt.d7.trailing_slashes.patch (1.06 KB)
Testbed results: robots.txt.d7.trailing_slashes.patch passed (7252 passes, 0 fails, 0 exceptions)
Detailed results: http://testing.drupal.org/pifr/file/1/robots.txt.d7.trailing_slashes.patch

#13

lilou - August 23, 2008 - 17:33

Patch still applies.

#14

Arancaytar - November 13, 2008 - 02:31

"The attached patch removes some of the trailing slashes, namely /foo/bar/ to /foo/bar"

Actually, the patch adds new rules without affecting the existing ones... was that what you meant to do?

#15

Freso - November 13, 2008 - 10:27

Yes, this is what I meant to do.

#16

BartVB - November 14, 2008 - 23:14

edit: Nevermind :\ Should read the actual patch before replying..

#17

Arancaytar - November 14, 2008 - 21:30

Um... huh? Which rule, specifically, prevents http://drupal.org/forum/...etc from being indexed? These are all of the new ones:

+Disallow: /comment/reply
+Disallow: /node/add
+Disallow: /user/register
+Disallow: /user/password
+Disallow: /user/login
+Disallow: /?q=comment/reply
+Disallow: /?q=node/add
+Disallow: /?q=user/password
+Disallow: /?q=user/register
+Disallow: /?q=user/login

#18

System Message - November 16, 2008 - 21:40
Status:patch (code needs review)» patch (code needs work)

The last submitted patch failed testing.

#19

lilou - November 17, 2008 - 13:29
Status:patch (code needs work)» patch (code needs review)

See: #335122: Test clean HEAD after every commit and http://pastebin.ca/1258476

 
 
