Hello,

I have enabled the taxonomy/term view for one of my free-tagging vocabularies, and Google Webmaster Tools is now reporting duplicate content.
How?
Take the following URLs, for example:
/taxonomy/term/367/all
/taxonomy/term/367/0
/taxonomy/term/367/
/tags/myterm (this is the path with the Pathauto module)

These all return the same content.

I tried removing the all wildcard from the basic view, but Drupal still brings it up.

I also tried renaming the "all" wildcard to "0", to match what I understand is taxonomy's way of showing all nodes for a term ID.

Any ideas are appreciated.

Ideally, what I would like is to redirect
/taxonomy/term/367/all
to its pathautoed equivalent.

The way Google found it was through the feed urls, because at the bottom, rss was displayed like this:
/taxonomy/term/367/all/feed

Still, /taxonomy/term/367/all does not appear in my sitemaps.

Comments

giorgio79:

For further info, I have the following arguments (I also have a hierarchical vocabulary, hence I needed the depth modifier):

Taxonomy: Term ID (with depth)
Taxonomy: Term ID depth modifier

I realize pathauto takes care of redirecting from
/taxonomy/term/367/
to the pathauto equivalent

and
the Global Redirect module takes care of redirecting from
/taxonomy/term/367/0
to the pathautoed equivalent.

That leaves
/taxonomy/term/367/all

I tried removing the "all" wildcard as I mentioned and cleared the cache, but Drupal still displays the same content as it would for
/taxonomy/term/367/

merlinofchaos:

You can use a module (I think it's Global Redirect) to redirect Drupal paths to their aliased counterparts.

The only way to prevent /all and /0 from having the same content is to simply remove the depth option. If you're not really using depth, then really anything you put there will appear to return the same content. Actually I don't think that will really change much either.

giorgio79:

Thanks Merlin, I took care of /0 with the Global Redirect module, which can chop it off.

My only question is, can I remove the default wildcard for the taxonomy/term view, which is "all"?

I tried removing it from the textboxes, still it exists as a url. Even after flushing cache.

I also tried renaming "all" to "0" :P, but no luck.

I already get all the values at the taxonomy/term/[termid] page (which is pathautoed and redirecting fine), so I don't see any need for it.

merlinofchaos:

I don't think there's a way to do that. The thing is, this exists on almost all Drupal sites, regardless of Views. Changing that number doesn't always appear to do anything. Google doesn't seem to punish for that.

giorgio79:

Status: Active » Closed (works as designed)

Thanks Merlin, will put this as by design.
Have to think about it a bit more.
It sounds like the "all" wildcard originally comes from the taxonomy module, but with the taxonomy term ID argument and depth modifier I get all the nodes related to a tag fine, without needing it.

What I see in Google Webmaster Tools is that some terms got indexed like

tags/mypathautoedtag

and
taxonomy/term/45454/all

I suspect Google discovered the taxonomy/term/45454/all from the feed path which I left as default in the taxonomy/term view, and thus it shows as taxonomy/term/45454/all/feed.

I notice this is configurable though, and I can apply pathauto to the feeds as well, so I will give that a shot.

giorgio79:

For those that have this issue, I added this htaccess rewrite rule that fixes it for me.

It redirects all taxonomy/term/xxx/all to taxonomy/term/xxx

:)

# fix taxonomy term all dupe content issue
RewriteRule ^taxonomy/term/([0-9]+)/all$ /taxonomy/term/$1 [L,R=301]

HS:

Had the same issue. Yahoo is indexing duplicate pages with the same content.

First posted it under Global Redirect issue queue because I thought GR wasn't working.
http://drupal.org/node/657912

I see now that this is a views issue. giorgio79, thank you for the htaccess rewrite rule. Will give it a try.

Is it wise to chop the trailing 0 with the Global Redirect module? What if two different nodes share the same title and you use pathauto to auto generate aliases? For example, if I had news/apples and another user wrote a news story about apples and also titled it "apples", it would get the alias news/apples-0. Won't Global Redirect then chop off that trailing 0?

HS:

# fix taxonomy term all dupe content issue
RewriteRule ^taxonomy/term/([0-9]+)/all$ /taxonomy/term/$1 [L,R=301]

The above didn't work for me. I added the lines to htaccess and when I view 'taxonomy/term/39/all' it doesn't redirect. It does nothing.

The change in the htaccess should take effect immediately, right?
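
One possible explanation, offered as a sketch rather than a confirmed diagnosis: mod_rewrite applies rules in file order, and Drupal's own catch-all rule ends with [L]. A redirect rule placed after that catch-all never sees the original taxonomy path, because the request has already been rewritten to index.php. The fragment below assumes the stock Drupal 6 .htaccess and a site installed at the web root; the catch-all lines are reproduced only to show the relative position.

```apache
<IfModule mod_rewrite.c>
  RewriteEngine on

  # Fix taxonomy term "all" duplicate content issue.
  # Assumption: this must sit above Drupal's catch-all rule below,
  # otherwise the path has already been rewritten to index.php.
  RewriteRule ^taxonomy/term/([0-9]+)/all$ /taxonomy/term/$1 [L,R=301]

  # Stock Drupal 6 clean-URL rule, shown for position only.
  RewriteCond %{REQUEST_FILENAME} !-f
  RewriteCond %{REQUEST_FILENAME} !-d
  RewriteCond %{REQUEST_URI} !=/favicon.ico
  RewriteRule ^(.*)$ index.php?q=$1 [L,QSA]
</IfModule>
```

.htaccess files are read on every request, so no server restart is needed. Browsers do cache 301 responses, though, so it is worth testing in a fresh session. If Drupal lives in a subdirectory, the target /taxonomy/term/$1 would need that prefix added.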

giorgio79:

I ended up disallowing /taxonomy* completely in robots.txt.

HS:

Status: Closed (works as designed) » Active

Thanks for the update mate. So you don't allow any search engines to index your lists and taxonomy pages?

A lot of folks probably aren't aware of this issue. It was only when I discovered Yahoo indexing duplicate content and displaying the duplicates back to back in search results that I realized what was going on.

Do you know for certain that this is not a views issue? This surely can't be by design?

HS:

Version: 6.x-3.x-dev » 6.x-2.7

I'm on 6.x-2.7 so changed status/version. I'm hoping merlinofchaos would be kind enough to take another look at this.

HS:

@giorgio79

According to this: http://www.webstrategies.co.nz/2009/09/04/noindex-meta-tag-versus-robots...

A Disallow rule in your robots.txt blocks crawling, but the page can still end up indexed by Google.

Can you please confirm if disallowing taxonomy/* in robots.txt has helped you?

giorgio79:

Interesting, that may be why they are disappearing from the index so slowly :)

Taxonomy pages can be noindexed or disallowed, because I have the aliased versions like /tags/whatever...

esmerel:

Status: Active » Closed (fixed)

No activity for 6 months

RumpledElf:

This one is driving me nuts too. At least I know it's the feed causing it. I'm getting sick of thousands of duplicate content warnings in GWT.

remkovdz:

Is there any definitive solution to the fact that taxonomy pages allow anything to be put in the depth argument position?

So these URLs:
taxonomy/term/[termid]/all
taxonomy/term/[termid]/bla
taxonomy/term/[termid]/house
taxonomy/term/[termid]/tree

all show the content of the original page:
taxonomy/term/[termid]

From an SEO perspective, what we want is for these pages (/all, /house, etc.) not to exist at all, because they are duplicate content; they should 301 redirect to the original page. I'm not a coder, but I believe this .htaccess line is supposed to do the trick. However, it doesn't work for me (or for other people, from what I've read):

RewriteRule ^taxonomy/term/([0-9]+)/all$ /taxonomy/term/$1 [L,R=301]

(I assume this line should be copied and pasted into .htaccess as is, without anything else.)

Does anyone have a solution for this? Thanks!

remkovdz:

I found a solution to this problem, although it is not a fix for the module itself (I'm not a developer). I got it to work with this manual code in .htaccess:

RedirectMatch 301 /taxonomy/term/([0-9]+)/([0-9]+)/(.*) /taxonomy/term/$1/$2

This redirects everything that is put in the depth argument position, in my case: /taxonomy/term/[id]/[id]/this/this/also-this/etc
Please note I have two term IDs. If you only have one term ID, the code should be:

RedirectMatch 301 /taxonomy/term/([0-9]+)/(.*) /taxonomy/term/$1

To be honest I'm not sure if this is the best way to go, but it works like a charm here. For me, this is finally a solution to duplicate content caused by taxonomy.
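
One caveat worth sketching: RedirectMatch patterns are regular expressions matched against the URL path, and without anchors they can match in the middle of a longer path. A more conservative variant of the single-term rule, assuming a site at the web root, anchors both ends and requires at least one character after the slash:

```apache
# Single term ID: redirect /taxonomy/term/<id>/<anything> to /taxonomy/term/<id>.
# ^ and $ stop the pattern from matching inside a longer, unrelated path.
# .+ (instead of .*) requires at least one character after the slash.
RedirectMatch 301 ^/taxonomy/term/([0-9]+)/.+$ /taxonomy/term/$1
```

This still redirects every sub-path under the term, so it is only suitable if nothing you rely on lives below /taxonomy/term/<id>/.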

Another workaround to prevent/delete duplicate content is to block the taxonomy/term pages in robots.txt:
Disallow: /taxonomy/term

But this only works if your pages are not already indexed by Google. If they are already in the index, you also have to remove the URLs with Google Webmaster Tools, otherwise they will stay in Google's database. Personally, I don't like the robots.txt option, for three reasons: 1) Link juice should flow freely through a site; in my experience this is always the best solution in the end. 2) It causes problems when you have multiple IDs (like me), because as far as I know these can't be aliased. 3) You never know which other module will link to the unaliased version of a taxonomy URL for whatever reason, and that link value goes down the drain.

1mundus:

taxonomy/term/0 path can be redirected by using Global redirect's trailing zero option (taxonomy term pages only).

taxonomy/term/0/feed can be avoided by adding Global: null contextual filter in your taxonomy view and then adding validation.

Jon Pollard:

I found a problem with the RedirectMatch option from remkovdz: it doesn't allow taxonomy terms to be edited. Here is another solution, using a rewrite rule and condition:

RewriteCond %{REQUEST_URI} !^(.*)/edit [NC]
RewriteRule ^taxonomy/term/([0-9]+)/(.*) /taxonomy/term/$1 [L,R=301]

It looks to me like it works - let me know if you find a problem
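
A slightly stricter variant of the same idea, as a sketch under assumptions rather than a tested fix: the condition above matches "/edit" anywhere in the URI, and the rule also redirects feed URLs such as /taxonomy/term/367/all/feed. If the feeds should keep working (an assumption about the site), both cases can be excluded explicitly and the patterns anchored at the end:

```apache
# Leave term edit forms and feeds alone. Consecutive RewriteCond lines
# are ANDed, so the rule fires only when neither suffix is present.
RewriteCond %{REQUEST_URI} !/edit$ [NC]
RewriteCond %{REQUEST_URI} !/feed$ [NC]
# Redirect anything else under the term ID back to the term page.
RewriteRule ^taxonomy/term/([0-9]+)/.*$ /taxonomy/term/$1 [L,R=301]
```

As with any such rule in Drupal's .htaccess, it would need to sit above the catch-all rule that rewrites requests to index.php.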