Overall, Drupal has been great in terms of SEO. However, I'm running into a pretty serious problem with pages getting indexed twice in Google (which isn't good).

For example, typically one will find the following within the search engine results.

http://www.example.com/about
AND
http://www.example.com/about/

The problem probably has to do with the way URL paths get handled. Are there any workarounds or hacks that can prevent this from happening?

Ideally, accessing http://www.example.com/about would send a 301 redirect to the proper location of: http://www.example.com/about/ (notice trailing slash).

Comments

patricksettle’s picture

Didn't get a chance to look at this till tonight. This seems to do the job, but I haven't tested it across a ton of modules:

# Append a trailing slash to any path that contains no dot
# (i.e. no file extension) and doesn't already end in /
RewriteCond %{REQUEST_URI} ^/[^\.]+[^/]$
RewriteRule ^(.*)$ http://%{HTTP_HOST}/$1/ [R=301,L]

The only problem I've come across so far is when the user isn't logged in and attempts to log in via the login box, which appends ?destination=%PATH% to the end of the submit URL. The problem is that the URL gets encoded, so the %PATH% (which is now path/) ends up with %252F stuck on the end of it. If you go to /user to log in, it works as expected.

Now we just need to figure out a way around the URL encoding. Either patch user.module, or maybe write a more exotic rewrite rule?
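One avenue that might be worth testing (an untested suggestion, not something confirmed in this thread) is mod_rewrite's NE (noescape) flag: by default Apache escapes special characters in the redirect target, so an already-encoded destination=about%2F gets encoded a second time into %252F. NE suppresses that second pass:

```apache
# NE = noescape: don't re-encode the redirect target, so an already
# URL-encoded ?destination=about%2F is not mangled into about%252F
RewriteCond %{REQUEST_URI} ^/[^\.]+[^/]$
RewriteRule ^(.*)$ http://%{HTTP_HOST}/$1/ [R=301,L,NE]
```

Whether this covers the login-box case exactly would need testing, but NE is the standard tool for double-encoding problems on redirects.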

ivank’s picture

I just noticed your post today, thanks for your efforts. I'll be implementing it and testing it sometime this week. If anything comes up in regards to this, I'll post it here.

Thanks again.

TheWhippinpost’s picture

Yes, folks (and the dev-heads here) need to be careful with the potential dupe-content issues that can easily get thrown up unnoticed by modules.

In particular, watch out for the links they create.

Make sure that only one URL link-path exists for each page. That means checking every page and ensuring no module is outputting non-clean URLs where you've globally specified clean ones.

Grab a copy of Xenu's Link Sleuth (sorry, you'll have to Google it), which will give you a report of all your links and flag broken ones too.

Everyone should spider their own site before launching. You'll probably be surprised at the link-paths it spits out.

... this will help you with the next bit:

(Preferably) before you launch your site, throw up a well-thought-out robots.txt disallowing entry to directories that might either cause dupe-content problems or simply have no business in the search engines.

I'll whack mine here as a start/example:

User-agent: *
Disallow: /cgi-bin/
Disallow: /imgs/
Disallow: /files/
Disallow: /flash/
Disallow: /js/
Disallow: /node/
Disallow: /book/
Disallow: /taxonomy/
Disallow: /comment/
Disallow: /user/
Disallow: /contact/
Disallow: /tracker/
Disallow: /themes/
Disallow: /scripts/
Disallow: /styles/
Disallow: /misc/
Disallow: /aggregator2/

Shake or stir to suit.

NOTE: Not suitable for a site that doesn't have clean URLs enabled, as it will stop search-bots from entering.
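One reason the list above doesn't transfer to a non-clean-URL site: content then lives at paths like /?q=node/123, which a prefix line such as Disallow: /node/ never matches, so the rules won't behave as intended. Crawlers that honour wildcard patterns (a Google extension, not part of the original robots.txt standard) could be targeted with something like this sketch instead:

```
User-agent: Googlebot
Disallow: /*?q=node
Disallow: /*?q=user
Disallow: /*?q=admin
```

Other bots may ignore the wildcard syntax entirely, so test against each crawler you care about.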

Watch out for "printer" links too.

Mike

benthere’s picture

Be careful cut-and-pasting this into your robots.txt.

Many sites may not want to prevent spidering of /node/, where all their content is.


TheWhippinpost’s picture

Many sites may not want to prevent spidering of /node/, where all their content is.

Of course. If you don't have clean URLs enabled and your paths remain in the form of node/, then absolutely. Was that not clear?

Mike

benthere’s picture

Yes, but enabling Clean URLs doesn't automatically stop content from being accessible at node/ paths.

URL aliasing with the Pathauto module installed does, but not everyone has that.

Think of my post as a friendly disclaimer. I wouldn't want someone copy-pasting their way out of Google with no warning. ;-)


TheWhippinpost’s picture

Yes, I should've made clear the assumption that the pathauto module is at work here too, quite right.

Mike

Bevan’s picture

RewriteRule ^(.*)$ http://%{HTTP_HOST}/$1/ [R=301,L] will rewrite all URLs to have the trailing slash.

Why not use RewriteRule ^(.*)/$ http://%{HTTP_HOST}/$1 [R=301,L] to rewrite all URLs to not have the trailing slash, which is how Drupal is made to work?

Compare the slashes:

RewriteRule ^(.*)$ http://%{HTTP_HOST}/$1/ [R=301,L]
RewriteRule ^(.*)/$ http://%{HTTP_HOST}/$1 [R=301,L]
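For the no-trailing-slash form, here's a sketch of how the rule might sit in Drupal's .htaccess, placed before the stock rules that hand requests to index.php (the !-d guard is my own assumption, added to avoid fighting mod_dir's automatic slash redirect on real directories):

```apache
RewriteEngine on

# Redirect /about/ to /about, Drupal's canonical form, but leave
# real directories alone so mod_dir's slash handling doesn't loop
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)/$ http://%{HTTP_HOST}/$1 [R=301,L]

# ... Drupal's stock rewrite rules follow here ...
```

Ordering matters: once the stock rules rewrite the request to index.php, a later redirect rule never sees the original path.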

gemini’s picture

Thanks bevan!
This line
RewriteRule ^(.*)/$ http://%{HTTP_HOST}/$1 [R=301,L]
made my day :)

ivank’s picture

Thanks to everyone for giving this proper attention. There has been a lot of valuable commentary. I will address some of the more interesting comments brought up in this thread when I get a chance.

This is the final rewrite rule that, with the help of our engineer Erik, we launched to production. It's been almost a week and so far no problems have emerged; I'll post here if anything needs to be adjusted.

# Remove duplicate content by appending a slash
RewriteCond %{REQUEST_URI} ^/[^\.]+[^/]$
# Except for the admin, user, and node areas
RewriteCond %{REQUEST_URI} !^/(admin|user|node)
# We load mod_rewrite before mod_alias, so we are forced to
# exclude our aliased folders here as well (a RewriteCond only
# applies to the RewriteRule that follows it, so this condition
# has to come before the rule, not after it)
RewriteCond %{REQUEST_URI} !^/(images|tour|js)/
RewriteRule ^(.*)$ http://%{HTTP_HOST}$1/ [R=301,L]

Here are some links to where other parts of this conversation were conducted:

dutchie76’s picture

Sorry, but none of the above worked for us. With the previous rewrite I got a 401 for URLs without a trailing slash; for some reason it was not redirecting.

I've been trying to figure this out for a while now and thought I had a solution.

We moved our site over to Drupal. All URLs of the old site end in a trailing slash (/). The reason for not wanting to remove the trailing slash is that I don't want the site to have to be re-indexed by the search engines, so it's best not to change the file structure.

Now I need to force the trailing slash. Why? Because anyone can externally link to a page without the trailing slash. Result: duplication = bad! You can also split your internal PR, which is not good for SEO. Duplication rule 101: no more than one unique URL for each page; any more and you run the risk of tanking your site in the SERPs.

I have tried the examples above and none of them work for me (Drupal 6.3).

The following did work and it forced the URLs to append the trailing slash.

RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} !(\.[a-zA-Z0-9]{1,5}|/)$
RewriteRule (.*)$ http://www.mysite.com/$1/ [R=301,L]

Problem is, I am no longer able to change the status of modules and I can't upload images using the image module. I tried moving the rule around the .htaccess file, but no joy.

Any feedback would be great.

Thanks

Bevan’s picture

Global Redirect is a better solution most of the time.

Bevan/

dutchie76’s picture

Global Redirect might be the solution for D5 but not for D6. And as I can't test it out to see how it actually works without module conflicts, I will use the following solution for now.

The following .htaccess rules will force trailing slashes and exclude certain directories from the redirect rule.

# Force Trailing Slash
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} !(\.[a-zA-Z0-9]{1,5}|/)$
RewriteCond %{REQUEST_URI} !^/(admin|user|node|edit|filefield)
RewriteRule (.*)$ http://www.mygreatsite.com/$1/ [R=301,L]

If certain modules stop working, exclude the directory the module is trying to access. In the example of uploading images via /filefield/, you need to exclude filefield from the .htaccess rule. I have also excluded admin/, user/, node/, and edit/.
To find out which URL(s) are used for uploading an image (if they're not easily findable), you can use the "Live HTTP Headers" add-on for Firefox/Mozilla browsers: enable the recording screen, then upload an image, and the add-on will capture all transactions between your browser and the server.

It works like a charm.
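When rules misbehave like the module breakage above, mod_rewrite's own logging (if you have server access) shows exactly which conditions each request hits. On Apache 2.2.x the directives are RewriteLog / RewriteLogLevel, and they go in the server or vhost config, not in .htaccess; this is just a debugging sketch, with an assumed log path:

```apache
# Server/vhost config only (Apache 2.2.x) - not valid in .htaccess
RewriteLog /var/log/apache2/rewrite.log
# 0 disables logging; higher values give more detail (up to 9)
RewriteLogLevel 3
```

Remember to turn it off again afterwards, as high log levels slow the server down noticeably.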