Posted by WeRockYourWeb.com on February 24, 2006 at 7:19pm
I'm a big fan of the search-engine-friendly (SEF) URLs provided by Drupal's mod_rewrite support and URL aliases. One question, however: does using a URL alias result in duplicate links that will be penalized by search engines?
i.e., does
clean URL : http://example.com/node/1 with
alias : http://example.com/front-page
result in search engines indexing both these links and penalizing for duplicate content?
Thanks so much,
Alex
Comments
Yes...
If the SEs found the un-aliased pages then yes, you would no doubt expect dupe penalties and such, so I wrote this into my robots.txt file:
User-agent: *
Disallow: /node/
It is not necessarily the case that the robots will find the node links, but it is better to be safe than sorry.
Another cause for concern would be the print option on every page. I believed this might also cause duplication; however, in this case the entire template is removed and just the content is left, so the SEs may not see it as duplicate...but I excluded these pages in robots.txt as well, just to be safe. :)
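As a rough way to sanity-check a rule like this, the sketch below feeds the two robots.txt lines above to Python's standard urllib.robotparser and asks whether the aliased and un-aliased URLs from the original question may be crawled. The helper name is_allowed is made up for illustration, and real crawlers may interpret rules somewhat differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The rules quoted in the comment above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /node/
"""


def is_allowed(robots_txt: str, url: str, agent: str = "*") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for url in ("http://example.com/node/1", "http://example.com/front-page"):
        status = "allowed" if is_allowed(ROBOTS_TXT, url) else "blocked"
        print(url, "->", status)
```

Under this parser, the node/1 URL is reported as blocked while the alias remains crawlable, which is the outcome the rule is meant to produce.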
Cool, Thx!
Thx murph!
That appears to solve my problem :)
In regards to the print links - that's actually what I came across when I did a search for "duplicate links Drupal" in Google (since I can't seem to find stuff easily enough using Drupal's search), but since I'm still in the process of learning the system I haven't put up any printable content yet.
Once I do, how do you go about filtering that one out?
I just came across another question, not sure if this should be in a separate thread: the URL aliases don't contain an extension - ie. .HTM. Isn't this an SEO issue as well? Is there a way to append .HTM to the aliases?
Thanks so much for your time,
Alex
Alex
----------
We Rock Your Web
Same as before...
Just add:
Disallow: /book/print/
Not quite sure what you mean about the aliases not having extensions. If you are manually adding a path alias to each node, you can specify whatever extension you like: .html, .htm, .php, .asp, etc.
If you are using auto aliasing, then I don't know, as I have never used that module...
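To see both exclusions working together, here is a small sketch that renders a robots.txt from a list of path prefixes and then lists which sample URLs a generic crawler would be refused. The function names and the sample URLs (a print page with ID 12, an alias ending in .htm) are hypothetical; only the two Disallow prefixes come from this thread.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(disallowed_prefixes):
    """Render a robots.txt that blocks the given path prefixes for all agents."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: {prefix}" for prefix in disallowed_prefixes]
    return "\n".join(lines) + "\n"


def blocked_urls(robots_txt, urls):
    """Return the subset of `urls` that a generic crawler may not fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch("*", u)]


if __name__ == "__main__":
    robots = build_robots_txt(["/node/", "/book/print/"])
    print(robots)
    sample = [
        "http://example.com/book/print/12",   # hypothetical printer-friendly page
        "http://example.com/front-page.htm",  # hypothetical alias with an extension
    ]
    print(blocked_urls(robots, sample))
```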
Extensions fixed; front page content duplication?
I was trying to add the extension to the URL alias instead of the content path. Thanks for the tip - problem fixed!
Another duplicate content possibility just occurred to me - is the content from the front page seen as duplicate content? I.e. whatever snippets of content appear on the front page appear exactly the same on the full page (node) they link to. Do you think SEs see this as duplicate content? Is there a way around this?
Cheers and sorry for the delay in posting,
Alex
Well..
I don't use Drupal in that context (i.e. with teasers...), but I would have thought you'd be OK with that, as there would be a fair amount of difference between the frontpage and the full page. Besides, that is the default way to use Drupal, and Drupal is considered very SEF!
Good Call
Thanks for the quick reply! Sorry for the sporadic timing on my replies - I travel often and don't get around to replying as often as I'd like.
The overall search engine friendliness is definitely a primary reason I'm using Drupal right now. That makes sense - as long as the teasers are short enough there shouldn't be enough duplication to warrant penalties.
Just curious - do you currently have a Drupal site set up without teasers? Does this mean your front page is a full page (i.e. without the "read more" link)? I'd like to implement this as well. The only problem I'm having is that when I load the site directly, the front page saves under the page title, whereas when I click the "home" menu item to reach the front page, it saves under the path title - resulting in two different pages?
Cheers,
Alex
Disallowing of /node/ solves teaser problem?
Actually it just occurred to me that the disallowing of /node/ in robots.txt might also solve the "teaser duplication" issue. Since I have my front page set as "node" (in admin -> settings), isn't it true that search engines won't even be able to see the teaser (front) page?
Thanks so much for your time and help,
Alex
Not really following you...
With SEF aliases, your frontpage is just seen as root by the search engines (i.e. index.php or whatever), not as node/, so it will definitely get indexed! By setting an alias for every page, you won't have any pages with the node/ URL.
I have set my frontpage to an alternative node, but if I didn't want teasers on the frontpage and had my frontpage set to node, I would just switch them off at the node level (i.e. on every node edit page, uncheck "promote to frontpage") and the teasers wouldn't appear any more. :) Hope that helps.
Disallowing index.php?
That definitely helps - so you're saying if I don't want teasers I simply give the front page an alternate page name, like welcome-page.htm, and make this the default page under admin -> settings? But what happens if someone simply types in the domain name? Then the URL looks like domain.com/ instead of domain.com/welcome-page.htm, and when saving the page, it is saved under the page title ("welcome%20page.htm") and not welcome-page.htm. Does that matter? Just wondering if the SE sees both of these.
Now for the situation with teasers - would it work to disallow index.php? Would that effectively prevent the SE's from seeing the teasers page so they only see the full pages? Or do you think it's a bad idea to disallow a domain's default index page in robots.txt?
Update: Just came across this: http://forums.digitalpoint.com/showthread.php?t=48460 That might solve the problem...
Cheers,
Alex
AFAIK
If you don't want teasers, you simply don't check the box that says 'promote to frontpage' on any node that you create. If the boxes are checked AND you use the default frontpage (I think it is just node), then you will see teasers on your frontpage.
I would make your frontpage node/1 or something like that, set it to node/1 in the settings, and then disallow node/ in your robots.txt file. This is what I have done on 11+ sites.
I have never seen a case for disallowing bots from indexing your root page...this page carries (in theory) your highest PR.
Thanks!
Ok, so let me see if I understand you correctly: for a "no teasers" setup:
1) don't promote any pages to front page, and don't give my front page a path (so it defaults to node/1)?
2) Then by disallowing /node/ in robots.txt search engines will not see the page twice - now they will only see it as the root page for the domain? This is true even if it's not named "index.*"?
On a sidenote: Right now I have it set up like this:
1) my front page is "promoted to frontpage," I then set the "break" tag all the way at the bottom, and the path is "node"
2) then in settings I have "node" as the default front page
3) in robots.txt I redirect index.php -> root
It appears to me that this has more or less the same effect. What do you think?
Thanks so much,
Alex
...
The default frontpage is node. You don't want that, as that is for teasers.
You WANT to set the frontpage as (e.g.) node/1, but it could be any page name, e.g. home.html.
The thing is, when you set a page as the frontpage in settings, it becomes root (/).
I set mine to node/1, as this means that when I disallow node/ in robots.txt, I am doubly sure that there won't be a duplicate frontpage - the last thing you want!
Hopefully I have been a bit clearer :)
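One detail worth checking in this setup is how prefix matching treats the different front-page URLs. The sketch below uses Python's urllib.robotparser as a stand-in for a crawler (an assumption; real search engines may behave differently) and reports, for each path, whether it may be fetched. The function name crawl_report and the example.com host are illustrative only.

```python
from urllib.robotparser import RobotFileParser

# Same rule as discussed in this thread, as a list of robots.txt lines.
RULES = ["User-agent: *", "Disallow: /node/"]


def crawl_report(rules, paths, host="http://example.com"):
    """Map each path to True (crawlable) or False (blocked) under `rules`."""
    parser = RobotFileParser()
    parser.parse(rules)
    return {path: parser.can_fetch("*", host + path) for path in paths}


if __name__ == "__main__":
    report = crawl_report(RULES, ["/", "/node/1", "/node", "/home.html"])
    for path, allowed in report.items():
        print(f"{path:12} {'allowed' if allowed else 'blocked'}")
```

Under this parser the root URL stays crawlable and /node/1 is blocked, but the bare /node path (the default teaser listing) is not matched by Disallow: /node/ because it lacks the trailing slash. A rule written as Disallow: /node would match it under the same prefix semantics, at the cost of also matching any alias that begins with /node.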
Linking to front page and node access?
Cool - I think I understand how it works now. The only thing left is finding a way for me to link to the front page in the main menu. I've found this thread ->
http://drupal.org/node/36465
which contains a patch with a way to link to the front page from the navigation menu by typing "front" in the menu path. What I don't understand is, why don't people just link to "node/1" to get to the front page? I notice that when I do this, the URL shows "domain.com/node/1" instead of simply "domain.com", which I get when clicking on "home" in the breadcrumb navigation (the url in the breadcrumb is "./" but I tried this in the path and it won't work). According to our disallow of /node/ this shouldn't matter since the SE's won't see "node/1", so I wonder why they are bothering to make a patch?
One thing I'm concerned about is that the URL "node/1" in the navigation menu may be linked externally, or that the SE will see it and wonder why it's not supposed to follow it. In other words, is it a bad idea to disallow spidering of pages and then link to them?
Finally - with the current setup when I type domain.com/node I get the Drupal "intro" page asking me if I want to create the first account. How do I prevent this?
Thanks - you've been a terrific help!
Alex
Upgrade to 4.7
I've upgraded to 4.7 finally, now that the modules I needed are supported, and that solved all my problems.
Alex