This is something that has been puzzling me over the past year or so, ever since we lost all our Google rankings due to a (suspected) algorithm update affecting relatively new websites. After my first experimental installation, all my sites now run Drupal and I have never regretted this decision.

The problem is that each page needs to be represented by a single URL and not by multiple ones. This problem exists for all new sites. Some may not consider it important, but one of my sites has suffered dearly over the past year. The examples below should make clearer what I am trying to say here.

Homepage

The homepage can be accessed at http://www.example.com, but also at http://www.example.com/node. Even though I can't figure out where Google discovers this (as no links to http://www.example.com/node exist), the issue is that you get a penalty for having this page available under two URLs.

The solution to this is to add a 301 redirect to your .htaccess file which sends all hits for http://www.example.com/node to http://www.example.com.
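
As a rough illustration of the rule itself (this is not .htaccess syntax), the redirect boils down to the decision sketched below. The function name homepage_redirect and the HOME constant are made up for this sketch, and it only looks at the clean-URL form of the path.

```python
from urllib.parse import urlsplit

# Assumed canonical homepage URL, taken from the example host in the post.
HOME = "http://www.example.com/"


def homepage_redirect(url):
    """Return (301, HOME) if url is the duplicate /node homepage, else None."""
    path = urlsplit(url).path
    if path == "/node":
        # /node shows the same content as the homepage, so send it there.
        return (301, HOME)
    # Anything else (including /node/50) is left alone by this rule.
    return None
```

Only the exact path /node is redirected here, so individual nodes such as /node/50 are not affected by this rule.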

Taxonomy Pages

On sites where taxonomy pages hold many stories and span multiple pages, each additional page is a duplicate. The reason for this is that each (next) page has the same title and taxonomy description (which is different if you have the nodewords module). It does not matter what the links on the page say.

The solution to this is to add the following lines to the robots.txt file of your site:

Disallow: *?page=
Disallow: *?from=
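
To see which URLs these two lines are aimed at, here is a small sketch. It assumes Google-style wildcard matching, where "*" stands for any run of characters, so each pattern matches any URL that contains the literal text after the star. The names BLOCKED_MARKERS and is_blocked_pager_url are only illustrative.

```python
# Literal parts of the two robots.txt patterns (the text after the "*").
BLOCKED_MARKERS = ("?page=", "?from=")


def is_blocked_pager_url(path_and_query):
    """True if the path plus query string contains one of the pager markers."""
    return any(marker in path_and_query for marker in BLOCKED_MARKERS)
```

Under that assumption, /taxonomy/term/5?page=1 is blocked and /taxonomy/term/5 is not. A URL such as /taxonomy/term/5?sort=asc&page=1 would slip through, because there the page parameter does not directly follow the question mark.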

I know that it is unimportant to many of you, but believe me... My new site suffered dearly because of this.

No matter what I say here, you need to read the thread "Duplicate Content - Get it right or perish" over at Webmaster World (not affiliated), especially the comments from g1smd (not me).

These issues have caused a significant investment to fail and it is definitely not an issue concerning just Drupal.

I think the robots.txt with some default instructions needs to be included in the Drupal core, as it may be too late when some realise.

It is worth noting that sites which exist for many years do not get penalised. I have a site which runs Drupal (after migrating from PostNuke) which is considered an "authority" in its field, having been linked from the BBC and other "authority" sites. This site never lost anything. However my new site was nowhere to be seen for relevant keyword searches on Google.

This issue only affects the Google search engine and not any other (like MSN and Yahoo), but given its market share you simply cannot afford not to pay attention to this.

Hope this helps...

Comments

TheWhippinpost’s picture

Yes, one needs to be very careful.

Have a look at Drupal Pages Getting Indexed Twice for additional comment and a "starter" robots.txt file.

Also, don't discount, in the case of your site being new, the possible effects of what some people claim as the "sandbox".

Generally, with dupe content, Google tries to score one page whilst consigning the other into its supplemental index. So it then becomes a matter of which page Google considers to be the "correct" page.

All that aside, just avoid dupe paths to content.

Mike

pavlos’s picture

Just made this post for those who don't know...

dman’s picture

Having followed a few of these sorta posts recently, and having come up with my own problem with aliased content (and relative links only being accurate half the time) I wondered whether the path alias module would be better off behaving more like an HTTP redirector than a duplicator.

Many folk offer instructions for manipulating the .htaccess rewrite engine, but shouldn't that be a massively old-fashioned and redundant way of doing things with a powerful CMS at hand?

don't alias - redirect

I think it would be possible to intercept all requests, and if it's a system path that has an alias assigned return a 301 bounce to the 'canonic' URL.

Thus, node/317 and ?q=node/317 and about.htm return 301 Redirect /about
and /about serves the content.

Surely that would cheer up the search engines?
This of course would be an add-on module, not a rewrite of path.module.
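
A minimal sketch of that idea, assuming a hard-coded alias table in place of Drupal's real one. The names ALIASES, LEGACY and resolve are invented for the example, and the legacy table is my own addition to cover the about.htm case.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical alias table: system path -> canonical alias.
ALIASES = {"node/317": "about"}
# Hypothetical table of legacy paths kept alive for old inbound links.
LEGACY = {"about.htm": "about"}


def resolve(url):
    """Return (status, path): a 301 to the alias, or 200 for the path itself."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    # Non-clean URLs carry the system path in the q parameter.
    path = query["q"][0] if "q" in query else parts.path
    path = path.strip("/")
    target = ALIASES.get(path) or LEGACY.get(path)
    if target is not None:
        # A known duplicate path: bounce to the single canonical URL.
        return (301, "/" + target)
    # Already canonical (or unknown): serve it as requested.
    return (200, "/" + path)
```

With this table, /node/317, /?q=node/317 and /about.htm all answer with a 301 to /about, and /about itself is served with a 200.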

It would also solve my import woes where the same page is called
/about (new clean url)
/about/index.htm (legacy support)
but contains a link to href="team.htm"
... which resolves either to team.htm or about/team.htm depending...

.dan.
How to troubleshoot Drupal | http://www.coders.co.nz/

pavlos’s picture

That module would be ideal.... But my PHP knowledge is not advanced enough to do this...

TheWhippinpost’s picture

Thinking off the top of my head; I think your 301 idea would perhaps be a good "failsafe" for existing sites porting-over to Drupal, or beginning to use clean URLs and pathauto.

For new sites, obviously there's no substitute for just sealing everything up airtight before launch.

I wouldn't be comfortable with having a system that pumps out lots of 301's dynamically, based on rules.

I think it should be a serious consideration for core to have a few options to say sommat like:

... and other common mistakes made by people linking-in from external sites - Theoretically, if you have an airtight site from scratch, then it's only the incoming links that can ruin the day.

Oh yeah, Drupal should spit-out, or at least provide the option of spitting-out, ALL internal links as absolute... leave no room for doubt.

Mike

pavlos’s picture

The redirect should not be to index.html. Directories should always be available as /directory and not as /directory/index.html (that's what I have concluded after reading miles of conversations on webmasterworld).

Also, I think it makes it easier for those working on this in the Drupal core...

It shouldn't be that difficult... I mean all that is required is to redirect nodes to the correct "alias" if there is one.

The taxonomy fix is needed because, if you check the pager's code, it is not possible to add a "nofollow" to the next pages... I am away from my main computer right now... Will post examples after Monday, when I will be back in London.

TheWhippinpost’s picture

The redirect should not be to index.html. Directories should always be available as /directory and not as /directory/index.html (that's what I have concluded after reading miles of conversations on webmasterworld).

I've had the debate elsewhere here and it'd be taking this off-topic to elaborate further but in the end, I'll just say that we'll have to differ. The main point of the suggestion was just to have it as an option to cater for both camps and scenarios.

Mike

dman’s picture

Yeah, I'm thinking of importing legacy sites. That's where a lot of my work is going.
ALSO, over the last most-of-a-decade that's where a lot of my effort has been spent - converting giant sites built in whatever source with hard links to /path/index.htm and list.asp and thing.php and /cgi-bin/formbuilder/custom-crap.pl

I believe it's best practice to link to /path rather than /path/index.htm BUT not many static html builder applications **ntpage, Dreamweaver or whatever (even local link checkers) support that without twisting their arm. That's a legacy thing however, so not an argument.

I don't think there is such a thing as an airtight site - the special cases enumerated in these threads point out lists of possible leaks.
I wouldn't worry about pumping out 301s dynamically based on 'rules'. The alias table currently pumps out content based on the same rules!

My relativity problem was from imported content, nothing to do with Drupal nav. I've restructured so many sites that I LIKE relative links to remain relative. Self-referential subsections SHOULD be able to be moved about within the hierarchy without being re-written. Because 2 years down the track the platform may change again and the new site structure will be different. But that's the long view. Done it too many times.

.dan.

TheWhippinpost’s picture

I take your point WRT relative paths - I see both sides - but with the SE's being so important to the blood-flow of sites these days, and with Matt Cutts himself (of Google) recommending absolute paths to minimise canonical problems; and with the problems thrown-up by modules themselves, I'm sold.

Leave no room for doubt... and that's my philosophy WRT the /index.htm debate too, in a nutshell :D

Mike

moshe weitzman’s picture

i think this redirect should be default behavior in path.module and not just an add-on. path does a good job of rewriting outbound links, and translating inbound requests with paths to normal drupal urls. what is lacking is redirecting unaliased paths as you are suggesting.

ednique’s picture

Very interesting...
I'm using pathauto to generate urls...
And while doing so, I try to remove all paths that contain node
even so, I translate all paths to dutch so "/image" becomes "/fotoboek"

And maybe it's a challenge for module developers to fix the problem paths...
Just look at the examples below:

The reason why google finds /node, is because of modules like print
They add a link like /node/50/print to get the printer friendly page...
Thus google assumes that /node/50 and also /node exist...
Other modules like forward use a link like /forward/50

It would be wiser for the module developer of print to use a url like /print/50
Or maybe pathauto could become more clever...
so that if he translates /node/50 to /my-first-category/my-first-post
he would also translate /node/50/print to /my-first-category/my-first-post/print

Same problem exists with the users...
My paths are translated like:
/users/ednique
and the blog gets translated as
/users/ednique/blog
Now I have xtracker and the contact form link...
these still remain
/user/1/xtrack and /user/1/contact
thus google finds /user/1
This should better be
xtrack/1 and contact/1
or via pathauto
/users/ednique/xtrack and /users/ednique/contact
ideally I would rather like to translate here too and get
/users/ednique/opvolgen and /users/ednique/contacteer
but maybe that's too much asked...

pavlos’s picture

Well you have problems related to the translations available for Drupal... I would open a support issue on the relevant section of drupal.org (e.g.: the Dutch translation). You are asking a lot though... That sounds like a lot of work...

I am using pathauto and have the following in my robots.txt...

Disallow: /node*

While to ensure that things are smooth I am 301 redirecting /node to / using htaccess. I've become paranoid since finding this out.

But I insist the taxonomy is the most important.... Imagine all our main sections were duplicates and of course badly penalised. Either a "nofollow" attribute to the page numbers and the next last previous first links or a noindex meta tag on all other pages...
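
The noindex option could look roughly like the sketch below. It is only an illustration of the rule "first page is indexable, every other pager page is not", with the page and from query parameters taken from the robots.txt lines earlier in the thread. The function name robots_meta is made up.

```python
from urllib.parse import urlsplit, parse_qs


def robots_meta(url):
    """Return a noindex meta tag for pager pages, or '' for the first page."""
    query = parse_qs(urlsplit(url).query)
    if "page" in query or "from" in query:
        # Any URL carrying a pager parameter is treated as a duplicate listing.
        return '<meta name="robots" content="noindex" />'
    return ""
```

A theme would print the returned string inside the page head, so /taxonomy/term/5 stays indexable while /taxonomy/term/5?page=1 carries the noindex tag.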

ednique’s picture

I don't have translation problems...
When I say that /node/50 gets translated into /my-first-cat/my-first-node
I mean that an alias /my-first-cat/my-first-node is created for /node/50

I'm simply pointing out that google knows of node, and is thus indexing it, because node is in the links created by modules... as they create links like /node/50/print

I pointed out a second solution that indeed requires more work...
work for module developers to alter their urls...
so that each url is /print/50 instead of /node/50/print...

Your solution probably works just fine, but to me it looks like a workaround rather than a solution...

I also mentioned that instead of all module developers adapting their urls, pathauto could deal with fixing their urls...
Just now I found another issue where a module creates links to my taxonomy terms
but for some reason it adds /9 to the url like
/taxonomy/term/23/9
This will list the 9 first entries of the term...

If Pathauto was indeed capable to do the following
all urls including the /node/50/print type ones would be fixed and no search engine would ever find the node links again:
if you make an alias called "/my-first-cat/my-first-node"
that points to "/node/50"
pathauto could also translate any url that start with "/node/50" thus
"/node/50/print" ==> "/my-first-cat/my-first-node/print"
"/node/50/edit" ==> "/my-first-cat/my-first-node/edit"

pavlos’s picture

I don't have translation problems...
When I say that /node/50 gets translated into /my-first-cat/my-first-node
I mean that an alias /my-first-cat/my-first-node is created for /node/50

Got confused by the "translated" there....

Your solution probably works just fine, but to me it looks like a workaround rather than a solution...

It is only a quick solution to stop Google from indexing the /node page. Fully agree with your suggestions... I'd love to see them implemented...

greggles’s picture

For core, pathauto handles lots of features on its own. For contrib, I feel it's up to the contrib modules to implement pathauto hooks to keep things working.

If Pathauto was indeed capable to do the following
all urls including the /node/50/print type ones would be fixed and no search engine would ever find the node links again:
if you make an alias called "/my-first-cat/my-first-node"
that points to "/node/50"
pathauto could also translate any url that start with "/node/50" thus
"/node/50/print" ==> "/my-first-cat/my-first-node/print"
"/node/50/edit" ==> "/my-first-cat/my-first-node/edit"

That would all be up to the print module, or whatever other modules create extra node/50/foo to implement the pathauto API to make this work.
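
Whichever module ends up owning it, the prefix translation being discussed can be sketched like this. The alias table is hard-coded and the name alias_with_suffix is hypothetical; nothing here is the pathauto API.

```python
# Hypothetical alias table: system path -> alias (no leading slashes).
ALIASES = {"node/50": "my-first-cat/my-first-node"}


def alias_with_suffix(path):
    """Translate a system path, or any sub-path of it, to its aliased form."""
    path = path.strip("/")
    # Try longer system paths first so the most specific alias wins.
    for system in sorted(ALIASES, key=len, reverse=True):
        if path == system or path.startswith(system + "/"):
            # Keep whatever follows the system path (e.g. "/print").
            return "/" + ALIASES[system] + path[len(system):]
    # No alias applies: return the path unchanged.
    return "/" + path
```

The check for system + "/" matters: without it, /node/500 would wrongly be treated as a sub-path of /node/50.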

I believe there is work in Drupal6 to make things work as you expect, but ask chx about that - I don't know.

Greg

--
Knaddison Family | mmm Free Range Burritos

TheWhippinpost’s picture

I am using pathauto and have the following in my robots.txt...

Disallow: /node*

I think you should take a squint at some of the "leaks" I found - Node isn't enough:

User-agent: *
Disallow: /cgi-bin/
Disallow: /imgs/
Disallow: /files/
Disallow: /flash/
Disallow: /js/
Disallow: /node/
Disallow: /book/
Disallow: /taxonomy/
Disallow: /comment/
Disallow: /user/
Disallow: /contact/
Disallow: /tracker/
Disallow: /themes/
Disallow: /scripts/
Disallow: /styles/
Disallow: /misc/
Disallow: /aggregator2/

Obviously, not all of those apply directly to this issue, nor to your particular set-up, but there are a few common denominators to most Drupal sites in there that can trip you up.
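
One way to sanity-check rules like these before relying on them is Python's standard robots.txt parser. The sketch below uses a shortened copy of the list and the example host from the original post. The helper name allowed is made up, and the standard parser does plain prefix matching only (no wildcards).

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the rules above, enough to show the matching behaviour.
RULES = """\
User-agent: *
Disallow: /node/
Disallow: /taxonomy/
Disallow: /user/
"""


def allowed(path):
    """True if a generic crawler may fetch the path under RULES."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch("*", "http://www.example.com" + path)
```

Here /node/50 is blocked and an aliased page such as /about is allowed. Note that /node without a trailing slash is still allowed, since "Disallow: /node/" only covers paths that start with /node/, so the bare /node page needs its own rule or a redirect.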

The best advice is to grab a copy of Xenu's Link Sleuth and spider your site for an appraisal of your own links 'cos, as ednique has highlighted, modules can add their own kinks to the pipework.

Also, on a wider note, another area worthy of investigation is to take the various error msgs Drupal will spit-out and Google for them (within quotes, ""). There may be cases where error pages are writing the wrong paths too ("Return home" for example).

Mike

pavlos’s picture

A friend just pointed me to this module... http://drupal.org/project/globalredirect

ednique’s picture

Superb...
This is VERY handy...

dman’s picture

TheWhippinpost’s picture

I've not tried this but do you know if it corrects, "incorrectly" written navigation paths within a page?

Else, this is surely just a sticky plaster solution.

Mike

ednique’s picture

Conclusion of my tests:
it redirects when it finds that the URL someone asks for actually has an alias...
so surfing to /node/50 would redirect you to /some-cat/some-page... Superb!
surfing to /taxonomy/term/50 would redirect you to /my-voca/my-lovely-term... Superb!

This fixes a lot, as google will not find duplicate pages any more...

yet URLs like /node/50/print are not redirected as long as there are no aliases available...
The same goes for links listing the 5 latest nodes of a taxonomy term: /taxonomy/term/50/5
The designers of pathauto are working on a solution, so I've been told...

When google finds /node/50/print it'll try /node/50 and get redirected as mentioned above... Superb!
When google finds /taxonomy/term/50/5 it'll try /taxonomy/term/50 and get redirected as mentioned above... Superb!

finally I've added an alias for /node to simply / ==> directs to home page

Is that what you mean by "incorrectly" written paths?

ednique’s picture

Dayum I just noticed that /node/50/ is not redirected...
thus when / is at the end...

pavlos’s picture

I just noticed that /node/50/ is not redirected...

This URL cannot be generated by Drupal or any modules (to my knowledge), so it does not need to be redirected... I don't think google would assume that it exists just because node/50 exists... If nothing links to that URL, it is highly unlikely to cause a problem.

TheWhippinpost’s picture

Is that what you mean by "incorrectly" written paths?

Sorry, I should've made the point a bit clearer.

I mean, if a module (or other code) is spitting-out un-aliased URL's onto a page.

For instance, I've just had to create an alias to convert:

node to index.htm

... to prevent the breadcrumb printing: www.example.com/node

We've probably all experienced, or read of instances where modules don't spit-out the intended URL alias onto the final page. To me, this is the real problem for, as I said above, if the site is airtight, no one should find alternative paths.

The Global Redirect module "appears" (don't forget, I've not used it) to be just a quick-fix to an underlying problem.

Maybe part of the problem is that mod-developers are confronted with several ways to print links?

As I understand it, using l() is the correct way to print aliased links... but there's also advice to use url() (for which the API usage description is rather loose). There's also code knocking about that uses $_GET['q'], drupal_get_alias etc...

(Note: I may be talkin out of my arse with some of those code excerpts - I'm still bending my head round the API, but I'm sure the point is clear: Mod-dev's may be confused as to which is the best way of printing the correct paths, universally)

Mike

ednique’s picture

I'm sorry but no...
redirect is not used for that... nor meant for that...

It gives a clean, nice and logical solution for dealing with people trying to access any URL that is aliased... (maybe an old bookmark for example)

Also for the /node/50/print type of links...
it will stop people/crawlers from going to /node/50...

And indeed my first post on the subject stated that
the developer of the print module should have used a link like /print/50 instead
or the developer of pathauto should have made his module clever enough to provide an alias for /node/50/what/ever/you/want to my-node-title/what/ever/you/want
which he will be dealing with soon...

Finally you note that developers are spitting out URLs that are not going through the url aliasing code...
where they are incorrectly printed on the screen...
Again, the real solution is quite simple...
it doesn't matter if the developer has only one way or a million ways to make a link...
If the link he puts on your page is wrong, then that is a bug... plain and simple...
make an issue on the module concerned and the developer will fix the bugs...

greggles’s picture

or the developer of pathauto should have made his module clever enough to provide an alias for /node/50/what/ever/you/want to my-node-title/what/ever/you/want
which he will be dealing with soon...

Unless I'm mistaken, I don't think this is the case. That kind of catching/redirecting would have to be done in the drupal core menu/path system - it has almost nothing to do with pathauto.


pavlos’s picture

Haven't yet tried this module. My sites are all safe from duplicates and node/xx urls, including the /node

I just published it for those interested or those who do not know how to 301 redirect.

The big problem is that once your site has been incorrectly indexed, it may take up to 1 year for all urls that should not be in the Google index to be dropped. This is another drama, and it is what's puzzling me (instead of looking at making my sites better).

Hopefully this (duplicates problem) will be fully resolved on the next Drupal release, as I see increased interest in the topic... Wish I could help with the development, but it could take me ages on getting it together...

niklp’s picture

For anyone reading this post as of the date I posted this, both the global redirect module and the path redirect module are now working fine in Drupal 5.1.

I am still looking for an easy solution to cover the problem of duplicate content delivered by the taxonomy module. It might be obvious, but I am! :)