From http://drupalzilla.com/tutorial/seo/drupal-org-seo-3
Visit http://www2.drupal.org and you stay there
Visit http://www.drupal.org and you are forwarded to http://drupal.org
We should tweak some settings (like in the .htaccess) to include www2.drupal.org in the 301 forwarding to drupal.org (at least I imagine that's where it's done).
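For reference, a minimal sketch of what such a rule might look like, assuming mod_rewrite is enabled and the rules live in the .htaccess at the shared webroot. The hostnames come from this issue; the rule itself is an illustration, not the actual drupal.org configuration.

```apache
RewriteEngine on

# Match requests whose Host header is www1.drupal.org or www2.drupal.org
# (case-insensitive). Assumes clients do not send an explicit port.
RewriteCond %{HTTP_HOST} ^www[12]\.drupal\.org$ [NC]

# Permanently redirect to the same path on drupal.org.
# The query string is passed through unchanged by default.
RewriteRule ^(.*)$ http://drupal.org/$1 [L,R=301]
```

Redirecting to the matching path rather than to the front page means each indexed wwwN URL points at its drupal.org equivalent.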
Comments
Comment #1
emsearcy commented
This is presently our only way of reaching a specific web node in order to 1) check whether it's up, so our paging system can check d.o, www1.d.o, and www2.d.o separately, and 2) resolve errors that may be specific to one website. Adding the redirect would make the URL no longer serve its purpose.
With regards to robots.txt, we cannot (easily) have a separate one for www1 and www2 since the webroot is the same. We could however have redirects for a different robots.txt based on the SERVER_NAME using RewriteCond, which adds additional processing time to compute a regex on each load, but could be done.
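A rough sketch of the per-host robots.txt idea, under a few assumptions: mod_rewrite is enabled, the rules sit in the .htaccess at the shared webroot, and a second file exists there (called robots-nodes.txt here, a made-up name) containing "User-agent: *" and "Disallow: /". It tests HTTP_HOST instead of SERVER_NAME, because SERVER_NAME can depend on the UseCanonicalName setting, and it uses an internal rewrite rather than an external redirect.

```apache
RewriteEngine on

# Only requests addressed to the per-node hostnames are affected.
RewriteCond %{HTTP_HOST} ^www[12]\.drupal\.org$ [NC]

# Serve the restrictive file in place of the normal robots.txt.
# robots-nodes.txt is a hypothetical file name, not something from this issue.
RewriteRule ^robots\.txt$ /robots-nodes.txt [L]
```

The condition is evaluated only for requests whose path is robots.txt, since mod_rewrite matches the RewriteRule pattern before it evaluates the conditions, so the extra regex cost on ordinary page loads should be small.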
Can't we just use Google's webmaster tools to prevent them from crawling this URL? This seems like the most obvious solution.
Comment #2
gerhard killesreiter commented
I've added www1.d.o to my webmaster tools but I don't see how I could stop Google from crawling the subdomain...
Comment #3
greggles commented
@emsearcy - your explanation makes sense.
@killes - I have no idea ;) I didn't think it was possible.
Comment #4
greggles commented
There is a "select canonical url" option... perhaps that's what emsearcy is referring to:
From the site overview page
Tools > Set preferred domain > { radio buttons which may include the www2, though may not }
I'm not sure that helps with the originally stated problem of duplicate content, though.
Comment #5
ramereth commented
@emsearcy Can we maybe just allow www1/2 from the OSL net ranges, since we like it for testing purposes? Then we can have a redirect for everyone else. Not sure if that's the clean way to do it, but it seems to fix the issue.
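One way this could be sketched in .htaccess, assuming mod_rewrite is enabled. The address range below is a documentation-only placeholder (192.0.2.0/24), not the real OSL network, and would need to be replaced with the actual ranges.

```apache
RewriteEngine on

# Only requests addressed to the per-node hostnames are considered.
RewriteCond %{HTTP_HOST} ^www[12]\.drupal\.org$ [NC]

# Skip the redirect for clients inside the trusted range.
# 192.0.2.x is a placeholder, NOT the real OSL range.
RewriteCond %{REMOTE_ADDR} !^192\.0\.2\.

# Everyone else gets a permanent redirect to the same path on drupal.org.
RewriteRule ^(.*)$ http://drupal.org/$1 [L,R=301]
```

Consecutive RewriteCond lines are ANDed by default, so the redirect fires only when the host is a wwwN name and the client is outside the allowed range.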
Comment #6
greggles commented
I marked http://drupal.org/node/193527 as a duplicate of this.
Also, I'm starting to see weird search results, e.g. this one, which for me only shows a single result from www2.d.o
The two reasons for doing this were 1) monitoring 2) debugging.
Can't we use IP-based monitoring? Or use a different URL on those hosts that is not quite as important as www.d.o? E.g. association.d.o or infrastructure.d.o or groups.d.o?
About debugging - couldn't people who need to debug add entries to their hosts file to achieve the same result?
Comment #7
emsearcy commented
Well, I've figured out how to handle paging/monitoring, by overriding the IP address while still making the check for `drupal.org'. The issue with typical IP address-based checks is that no site would be returned, because the site served is chosen by the domain in the request, and when you use the domain it normally determines the IP address. But now I'm explicitly handling both.
Is there any reason to keep these domains active? Because we've been renaming the boxes so they no longer need `drupal' in the hostname (like drupaldb.drupal.org -> db1.drupal.org), I've been wanting to update the hosts to switch drupal1.drupal.org -> www1.drupal.org. This would mean some diagnostic, non-world-accessible content that I keep on the drupal1 vhost would be nice to switch over to www1 ... this would mean I wouldn't be redirecting www1 to drupal.org. My reason for thinking this may be acceptable is that no one should be linking to our monitoring domain anyhow. Sure, I know they must have because it's showing up on Google, but the question is whether or not those links were legitimate in the first place. So, let me know if there's a concern about link rot or if the domain can, as far as the public is concerned, disappear.
Note: I haven't found a good way to leave around an Internet-accessible, per-host site that other infrastructure maintainers can use for the aforementioned debugging purposes, because it will show up on Google. Hopefully since the hosts are basically equivalent this won't be a concern.
Comment #8
seanr commented
Can't you just use robots.txt to keep Google from indexing it?
Comment #9
gerhard killesreiter commented
seanr: no, he can't.
Eric: Please make the www[1,2] domains disappear. They aren't needed.
Comment #10
greggles commented
I believe that dropping www1 and www2 instead of doing a 301 redirect is hurting us pretty badly in the short term. I know that in the long run we'll get pages re-indexed, but these changes have confused the poor little Googlebot, which makes it harder for people to find us and harder for regular community members to find content as well. I know we have the drupal.org search but the fact remains that (for various reasons) it is not the preferred way to get this information for many people.
Eric - I've re-read your question about keeping them active and I'm sad to say I don't get it. If it's not possible/easy to set up a 301 redirect then I'm ok letting this drop and being patient. But if it's possible/reasonable then it would be *awesome* to get 301s.
Comment #11
emsearcy commented
Could you first explain what ``these changes have confused the poor little googlebot'' means?
Comment #12
greggles commented
One example: there are 109,000 pages that it thinks are on www2.drupal.org.
Any search that returns one of those pages is now going to send the user to a 404. That's a lot of incoming traffic for us to drop on the floor. We know that we can simply remove the www2 from the URL, but not everybody does. Especially at the reduced crawl rate that we have instructed Google to use via our cache patch on drupal.org, it could be a while before we see an improvement.
Comment #13
greggles commented
Pardon my premature submittal and followup to my own post, but I just saw that there are only 64,000 pages in the index for drupal.org, so by not using a 301 to redirect this we are reducing the number of valuable pages in the index by 2/3. That feels like a big enough problem to me that we should look into it.
I know that the "result number 1-10 of about XYZ" numbers are not supremely accurate, but even if they are off by 50% I think they show the problem we have clearly enough.
Comment #14
killes@www.drop.org commented
1) The reduced Googlebot crawl rate expired two weeks ago. It is still much lower than it used to be, though.
2) I suggest we temporarily switch www2 back on and I tell Google through the webmaster tools that it is synonymous with d.o. I guess that we can drop it again after a while.
Comment #15
emsearcy commented
Unfortunately, there is no feature in the webmaster tools to link the two domains, killes. It is only possible to link two domains www.N <-> N. If we turn wwwN into a permanent redirect, I don't know how GoogleBot will choose to handle that. Preferably, it would cache the redirect target, meaning these wwwN pages will be reindexed. If not, we wouldn't be able to get rid of them.
Greggles, do you think page rankings would be significantly hurt if we were to, at this point, use the webmaster tools to remove the wwwN sites? This would remove the pages that supposedly work but lead to 404s. Furthermore, this may (presumably) allow these pages to be reindexed as d.o pages on the next crawl of d.o.
It baffles me that enough people would link to wwwN that it would have more Google entries than d.o ... there must be something fishy going on like Google isn't showing us ones from d.o that it does have cached, perhaps because of the ``we have omitted some entries very similar to the YYY already displayed'' feature. So maybe it is a possibility that disabling the wwwN sites in the webmaster tools may cause hidden d.o entries to appear?
Comment #16
greggles commented
We are still running your patch that sets the updated header to a more accurate (older) number though, right? Even if not, we were for months, which has "trained" Googlebot about how often it should visit our site. That will mean that for many of those 109,000 pages Google thinks it doesn't need to come back very often since they hadn't been updated in 3 months, 3 years, etc.
I agree with emsearcy about the webmaster tools idea not being feasible.
emsearcy - if we just tell it that those pages no longer exist then I imagine we lose the pagerank they had but, more importantly, now Google has to rediscover them all. Also, afaik there is no such thing as "the next crawl of d.o". It's an ongoing crawl of different parts of the site weighted to their ranks and frequency of update (based on header data and sitemap if available). So it would take some time (see above about how often it crawls pages that haven't been updated for months/years).
I take it from people posting alternate solutions that a 301 redirect isn't simple to do *for the long term*. What about Gerhard's proposal that we go back and start accepting incoming www2.drupal.org requests, but then instead of using webmaster tools we use a 301 redirect to forward them to drupal.org? We could then monitor the number of pages that are listed for the www2 subdomain and when it gets reasonably small we could stop serving them up again. That mechanism has worked quite effectively for www.d.o.
I agree with you, emsearcy, that it probably does have data about drupal.org pages somewhere in the bowels of its cloud, but the trick is how to get those back to being *the* authoritative version and I believe that a 301 is the best way. The information for webmasters page and dealing with duplicate content pages suggest a couple alternatives. We've done most of the things on the duplicate content ticklist, but haven't done the 301s.
At the risk of opening a new can of worms - another potential solution is to combine emsearcy's idea of removing all wwwN.d.o content with the sitemap module. Of course that requires a security review and performance testing which are beasts of their own.
I decided to check Yahoo! and Live to see what they think - both of them are actually doing relatively well on at least this problem.
Comment #17
stroobl commented
It might be useful for you to know that I already have some experience with a similar situation.
On one of the sites I maintain and that also appeared in Google with different urls, we configured:
RewriteCond %{HTTP_HOST} !^www\.tik\.be$ [NC]
RewriteRule (.*) http://www.tik.be/$1 [L,R=301]
in the .htaccess (so not just a redirect to / , but redirect to the right page).
We didn't notice any negative side-effects for the number of visitors that came through Google and most of the search results already show the right url now (few months later).
Compare:
http://www.google.com/search?q=site%3Apub.telenet.be
http://www.google.com/search?q=site%3Awww.tik.be
Comment #18
greggles commented
Just an update -
Now down to 58,000.
Now up to 424,000
So, the process is moving although not super-fast.
I had also noticed in the past that new pages on d.o were slow to get indexed/cached. This seems to have gone away since pages are now being crawled and included with the index within hours/minutes of being created. Not sure if that's related to the crawl-speed change being lifted or the duplicate page situation or what.
Comment #19
greggles commented
It appears that someone has fixed this since my update on the 22nd.
Many thanks to whoever did that.
I just tested again and the numbers continue to improve. This change to a 301 should help as well so I think it's time to mark it fixed.
Comment #20
(not verified) commented
Automatically closed -- issue fixed for two weeks with no activity.