Because duplicate content is considered harmful for search engine optimization (Drupal SEO Group), you should take steps to avoid duplication if you are using Views. For example, you should edit your robots.txt file (Google Webmasters/Site owners Help -> Dynamic pages, Google Webmasters/Site owners Help -> Duplicate content). If a view has any exposed filters, its URL is full of variables; this leads to duplicate content and should therefore be controlled via robots.txt (Google Webmasters/Site owners Help -> URL structure). In short, if you use Views, you need to use robots.txt! The following information will help in optimizing robots.txt for Google; other major search engines and robots are covered here, via the cited Wikipedia source.

I'm only discussing Google because it's the only search engine that lets you test your robots.txt file against any URL you want (Google Webmasters/Site owners Help -> Checking robots.txt).

URL variables

First things first: enable clean URLs! Let's say you have a view with an exposed title filter, so users can refine or search the view. This creates two duplicate versions of that view, which isn't good SEO:

www.example.com/mycoolview
www.example.com/mycoolview?&title=
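The clean URLs step matters because, without clean URLs, every Drupal path carries a query string like the one below, so the Disallow: /*? rule shown further down would block the entire site.

www.example.com/?q=mycoolview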

Let's say that view also uses a pager, so you need to allow the page URL variable and disallow every other URL variable. This can be accomplished by adding the following to the robots.txt file in the root of your web server.

# Disallow all URL variables except for page
Disallow: /*?
Allow: /*?page=
Disallow: /*?page=*&*
Disallow: /*?page=0*
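To illustrate how these rules interact (based on how Google's tester applies the wildcard and most-specific-match rules; other crawlers may behave differently), using the example view path from above:

www.example.com/mycoolview                    allowed (no URL variables)
www.example.com/mycoolview?page=1             allowed (matches Allow: /*?page=)
www.example.com/mycoolview?page=0             blocked (page 0 duplicates the base page)
www.example.com/mycoolview?title=foo          blocked (matches Disallow: /*?)
www.example.com/mycoolview?page=1&title=foo   blocked (matches Disallow: /*?page=*&*)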

I tested this using the robots.txt tool provided in Google's Webmaster Tools, so these rules may not work with other search engines.

Using Ajax in conjunction with the pager doesn't get rid of the problem, because Drupal will serve the non-Ajax page to clients that have JavaScript disabled.

Multiple views of content

I believe it is better to have only your nodes indexed, or only a single, all-encompassing view. Having both will lead to duplicate content and thus a lower overall page rank. For my site, I use Views in conjunction with Directory to display my multiple taxonomies. The directory will therefore contain duplicate content, so I need to disallow it from being indexed, while still letting my nodes be indexed. This is where XML Sitemap is key: I only have XML Sitemap, XML Sitemap: Engines and XML Sitemap: Node enabled, because I only want my nodes to be submitted. I use Pathauto for my taxonomies and nodes, putting my nodes in one directory and my taxonomy terms in another. Then in robots.txt all I have to do is disallow my taxonomy root directory, like this:

# No taxonomy 
Disallow: /taxonomy-dir-name

Submitting a sitemap is key; otherwise the search engine may never find everything. The main point is to put your nodes in a directory that can be easily separated from all your views: allow the nodes, disallow the views.
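As a rough sketch of that separation (assuming Drupal 7-style Pathauto tokens; the directory names here are placeholders you would replace with your own):

Node pattern:           content-dir-name/[node:title]
Taxonomy term pattern:  taxonomy-dir-name/[term:name]

With patterns like these, the single Disallow line above keeps every term page out of the index while all node paths remain crawlable.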

200 returned for a non-existent path

Live Example
http://drupal.org/project/modules
http://drupal.org/project/modules/google.com <- This should return a 404, but a 200 is returned instead. Potentially a duplicate content penalty!
The Views404 module is designed to handle this situation.

Notes

Here's my "eureka" thread to the views issue tracker
http://drupal.org/node/344708

Comments

Z2222’s picture

An easier way of writing this:

# Disallow all URL variables except for page
Disallow: /*?
Allow: /*?page=
Disallow: /*?page=*&*
Disallow: /*?page=0*

is this:

Disallow: /*&
Disallow: /*page=0

I would prefer to block non-page query strings individually (by module) rather than block everything and then 'allow' page query strings. I think it's better to have some extra query strings spidered than to accidentally block search engines from pages.
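A per-parameter version might look something like this (the parameter names below are only examples; list whichever query strings the modules on your site actually emit):

# Block specific URL variables instead of all of them
Disallow: /*?sort=
Disallow: /*&sort=
Disallow: /*?order=
Disallow: /*&order=

Everything not listed, including page=, stays crawlable by default.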

dunklea’s picture

You said that using Ajax will not solve the problem because Drupal will serve the non-Ajax page if JavaScript is disabled. However, wouldn't this only affect users and not search bots crawling your site?

I ask only because I have seen other posts that claim that using Ajax does fix the problem so I just wanted a little more clarification.

sovarn’s picture

I use Ajax for my exposed views on my taxonomy term pages.
However, Googlebot still crawls every single combination of pages, so switching to Ajax makes no difference, as Googlebot can still use the exposed filters block.

One way to stop Google from doing this is to use the robots.txt rules above. All this does is stop Google from indexing the filtered pages (reducing duplicate content).

However, this still may not be good SEO (see point 14 of http://www.stonetemple.com/blog/?p=514). Google suggests that even though you stop them from indexing the page, the link is still seen by them, reducing your 'crawl budget' - meaning fewer pages get crawled.

Does anyone have a way of stopping Google from using the exposed view?

mkeplinger’s picture

This seems to be the BIGGEST gap in SEO for views.

The page title is so important, but I cannot find resources or a how-to on setting the page title of a view page. I have numerous view pages, each of which uses data from just one node, and I want to be able to use the [page-title] token from the node to set the page title of the view.

Can someone PLEASE help me understand how to do this? TIA

Witch’s picture

You can do this via arguments! Just set an argument as the title.

acamposruiz’s picture

Hello, I have the same problem and I've seen your post, but I think this way only sets the content title, doesn't it? I need to configure the SEO title in the head of the HTML page. Can you help me, please?

Thank you very much,

fietserwin’s picture

According to the latest version of the mentioned document about duplication, Google no longer considers duplicate content within a site harmful and advises against using robots.txt or similar means to hide duplicate content. This seems to make this page largely superfluous.

Ryan258’s picture

You just saved a lot of people a whole lot of time, fietserwin!

I dug into it and found the Google documentation to back up your claim.

rooby’s picture

Even so, sometimes a bot will come along and submit exposed filter form values that are invalid, giving you a watchdog log full of illegal choice errors. Very annoying.

nuiloa’s picture

Regardless of what Google says in their help pages, if your site is creating large amounts of duplicate content, "Google Panda" is very likely to obliterate your rankings. It's happened to a lot of sites. Do a Google/Bing search for "Google Panda" and read up on it.

mikeahuja’s picture

Lots of duplicate content is not good, but title tags are still very relevant for Google search, and somehow incorporating the title tags into the content as an h1 or h2 is good.

volkan@bd’s picture

The canonical URLs generated by the Metatag module solve this problem, don't they?
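For reference, a canonical link in the page head looks something like this (reusing the example view path from earlier); if the search engines honor the hint, the filtered and paged variants should consolidate onto the base URL:

<link rel="canonical" href="http://www.example.com/mycoolview" />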