I've recently transitioned my free sample site from http://id.chilleddreams.com/freebies to http://freesampleforager.com . For the past week or so, drupal is inexplicably giving the googlebot crawler (identified as a 66.249.xx.xx visitor) Access Denied on the home page (it is attempting to crawl http://freesampleforager.com/ ).
I am absolutely certain this is a drupal issue due to the Access Denied being logged within drupal's logs, and google webmaster tools validates the robots.txt as 'Allowed' for its bot.
I've checked (and then rebuilt) node_access, there were no issues there. Anon users have full access to content, as anyone not logged in can verify. There are no Access Rules set for the IP, and googlebot hits to the entire site number 4-5k per day, so it is crawling the rest of the domain just fine. I'm not throttling anything, and google is free to INDEX FOLLOW the normal directories.
Does anyone have any suggestions for what on earth might be causing this? I don't see Access Denied occurring for any other visitors on the home page, just the googlebots. I know this sounds like a .htaccess problem, despite drupal logging the denial, so I did try removing the .htaccess and reloading apache to test: googlebot was denied anyway.
I'm at my wits' end on this one; it's cut my site traffic to a trickle. Any help would be very much appreciated.
Comments
Using the User Agent
Using the User Agent Switcher for FireFox, I switched to Googlebot 2.1 and I didn't get Access Denied. You sure it isn't some type of false positive?
_______________________________________________________________________________
http://www.hagrin.com - Just my little slice of the Interweb
Certain
The home page has dropped out of 3-4 google SERPs ('free samples', 'free sample') where it was ranking top 20 and rising (the previous domain had been top 4 in those keywords prior to the transition), and had survived the domain switchover. 99% of the SERPs that were returning my site's home page on the first/second page just dropped it off the map one evening a few days ago. I've been searching for reasons why, given that I hadn't knowingly violated any google TOS, and other pages from the domain were still ranking ok for other keywords.
cache:freesampleforager.com returns nothing, and google webmaster tools says the last successful index of my home page was January 4. The home page had been getting crawled every 1-2 days prior to that.
I researched it and finally found drupal logging Access Denied for the googlebot attempting to reach the home page. It's the only lead I've got at the moment.
Example from drupal logs:
I only keep logs around for 2 days, but I've found 25 occurrences in the last 48 hours.
Attempting
What I've attempted to do since the discovery of this drupal logging of ADs is twofold:
I've added an access rule to allow by host 66.249.% hoping this might override whatever setting is denying.
I've also set the 403 AD page to bounce to /node, which may or may not help to overcome this.
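For anyone wondering whether an allow rule such as 66.249.% can override a deny rule: as far as I understand it, Drupal 5/6 matches host masks with SQL LIKE (% matches any run of characters, _ matches one character), and a matching allow rule takes precedence over any matching deny rule. Below is a rough Python sketch of that logic, not Drupal's actual code; the function names and the (mask, allowed) rule format are made up for illustration.

```python
import re

def mask_to_regex(mask):
    """Translate a SQL LIKE-style mask into a compiled regex.

    '%' matches any run of characters, '_' matches exactly one character.
    Matching is done case-insensitively by lowercasing both sides.
    """
    parts = []
    for ch in mask.lower():
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)

def host_matches(mask, host):
    """Return True if the whole host string matches the mask."""
    return mask_to_regex(mask).fullmatch(host.lower()) is not None

def is_denied(host, rules):
    """Decide whether a host is denied.

    rules is a list of (mask, allowed) pairs (a hypothetical format).
    Assumption: any matching allow rule wins; the host is denied only
    if every matching rule is a deny rule. No matching rule means allowed.
    """
    matching = [allowed for mask, allowed in rules if host_matches(mask, host)]
    if not matching:
        return False
    return not any(matching)
```

Under that assumption, adding an allow rule for 66.249.% would cancel a deny rule on a single Googlebot address, but it would do nothing if the denial comes from somewhere other than the access rules.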
An SEO note: if you want to
An SEO note: if you want to maintain pagerank you need to set up a 301 redirect from http://id.chilleddreams.com/freebies to http://freesampleforager.com (ideally from http://id.chilleddreams.com/freebies/specific_page to http://freesampleforager.com/specific_page). If your old pages are indexed in the search engine then this is always worth doing.
Also the forwarding page is not an HTML document. It's just text with an HTML link in it. Maybe this affects how Googlebot tries to traverse the link. Until you set up the 301 redirect, change it to something like this:
Finally, and related to the 2nd point above, check your server logs and the Drupal accesslog (turn on the statistics module if necessary) to try to glean more about what is going on. What is the actual request that Googlebot is making that is resulting in an error? Do all visits result in a 403 or only a proportion? Can you correlate the different logs? etc.
gpk
----
www.alexoria.co.uk
Already Done
Hi gpk, thanks for the tips. The transition period between the domains is already complete. I did run a 301 that handled everything very well (that was what I meant when I said the freesampleforager.com domain had survived the transition), and it did exactly as expected to help bolster the PR of the new domain. I removed the 301 only after the hits to the original domain had fallen to a very low number and the PR juice of the old domain was all but gone. Search engines, including google, had already adjusted SERPs to use the new domain at that point; just old bookmarks and a few random outdated backlinks were still sending visitors there. The issues surrounding this appear to have zero to do with the past redirection of the old domain.
I have already checked all logs as mentioned in my posts above:
Apache is not doing this and is not yielding any useful errors. This is a drupal issue completely. This is a basic request to load the home page, nothing more. All other requests to all other allowed scripts and pages are working fine, and google is crawling 4-5 thousand pages a day from my site.
Ok, well I had a few other
Ok, well I had a few other people try accessing your site with a changed user agent and nothing. No access denied error. Therefore, under your Access Rules admin page, are you sure you didn't inadvertently block the IP?
Your robots.txt file is fine. I can't check your .htaccess file, but I assume that is fine as well. Outside of that and the suggestion above, I really don't know.
_______________________________________________________________________________
http://www.hagrin.com - Just my little slice of the Interweb
Cheers and thanks. I also
Cheers and thanks. I also tested with the UA from firefox and never got AD at all.
I did also verify that there are no access rules barring the bot, both in the cms and in the db table.
I currently have drupal error reporting passing a 403 response to /node and this seems to have stopped the googlebot from getting Access Denied (haven't logged a new one since the change last night). Doesn't solve whatever is causing the issue, but hopefully it will allow google indexing to resume on the home page. Still stumped as to the cause of this...
All very strange. What is
All very strange. What is specified for Default front page at admin/settings/site-information?
Don't know how googlebot will respond to a 403 that contains the full home page content (hopefully it will like it :-) ) ... but if googlebot is being given a 403 it should still show up as such in the logs. So maybe the original problem is sorted, somehow?
Also re. logs, I was meaning the Apache logs - is there a 1 to 1 correspondence between 403s shown in watchdog and in the Apache logs? Did all googlebot requests for the home page result in 403s, or only some? What do the referring URLs, if any, suggest? The Drupal accesslog table (admin/logs/hits) may have some of this info if statistics module enabled.
Glad to know you were running 301s anyway.
gpk
----
www.alexoria.co.uk
How long did it take for your ranking to come back?
I have a similar problem. I had a site on MS Frontpage and developed a drupal site with version 5.7. When I pointed my domain to the new server on March 31, 2008, my visits popped and then have steadily gone south. Lots of 404 errors, which I have fixed via 301 redirects. Still heading south and down about 65%.
I get those access denied errors from two bots: the googlebot and oyoy.eu 72.0.207.229. Maybe I will try what you did. Any comments or advice would be deeply appreciated. Thanks
Various and random 404
Various and random 404 errors are not unusual. The 403s are more interesting though. Maybe put some PHP in your 403 page to track what's happening... these specific tests may not bring up anything very interesting though ... (untested):
If the bots are indeed being denied as a result of _menu_item_is_accessible() returning FALSE then dig further in there... Also worth checking in the detail of the watchdog log and the Apache logs (inc. error log) that it was really the home page that was requested.
gpk
----
www.alexoria.co.uk
Same troubles: Googlebot gets a 403!
I'm having the same troubles at the moment, but I can narrow it down for a bit:
Starting from the 18th of November, the Googlebot has been stopping by our website less frequently. We migrated from Drupal 4.7 to Drupal 6.4 in that month; by switching servers behind a load balancer, we had no downtime. We kept the url_alias table intact, so all urls still exist.
Starting in December, we could see a tremendous drop in search engine traffic. However, it was only Google that had stopped visiting us. Looking at the trend, the traffic drop was really organic, so this doesn't involve a penalty or something.
When checking out Google Webmaster Tools, I see that the crawler is getting a shitload of HTTP (4xx) errors. At that moment, I couldn't find our content in Google anymore. MSN.com and Yahoo Search were working fine.
Now it's getting strange. The HTTP errors are telling me I have a technical problem. But, how can it be that MSN & Yahoo are indexing us just fine? Hopefully some logfiles will tell us more.
So, by playing around for a bit with grepping on Googlebot, HTTP codes and IP addresses, I can conclude the following:
- 80% of the 403s our server throws go to a Googlebot UA.
- When it does, the Googlebot is visiting from IP 66.249.65.240.
- There is a weird exception: when the Googlebot Imagecrawler (different UA) visits from 66.249.65.240, it gets an HTTP 200!
- When spoofing the UA with Firefox I have no problems; Drupal gives me an HTTP 200 just fine. I'm then visiting from my own IP.
What is happening here? Is Drupal throwing 403s when IP + UA are matching 66.249.65.240 + Googlebot? I can't reproduce this exactly, but it is a bit strange. And why is it only happening from this IP? I see some requests from a Googlebot from a different IP, and getting a HTTP 200. Unfortunately this bot isn't visiting so often.
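The kind of grepping described above can also be done with a short script, which makes the IP + UA + status combinations easier to compare side by side. This is only a sketch: it assumes the access log is in the common "combined" format (lighttpd and Apache can both write it), and the names LINE_RE, classify_ua and tally are made up for this example.

```python
import re
from collections import Counter

# Assumed "combined" log format:
# ip ident user [date] "METHOD path PROTO" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def classify_ua(ua):
    """Bucket a user-agent string; the image crawler is checked first
    because its UA also contains the word 'googlebot'."""
    ua = ua.lower()
    if "googlebot-image" in ua:
        return "googlebot-image"
    if "googlebot" in ua:
        return "googlebot"
    return "other"

def tally(lines):
    """Count requests per (ip, ua_class, status); unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue
        counts[(m["ip"], classify_ua(m["ua"]), int(m["status"]))] += 1
    return counts
```

Calling tally() on the open log file and printing counts.most_common() should show at a glance whether the 403s are tied to one address, one UA class, or the combination of the two.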
I'm using lighttpd/1.4.20 with the standard drupal.lua, Drupal 6.4 plus modules pathauto, Feedburner, Global Redirect 1.1 and login destination. I searched for the UA string and ip-address in my entire codebase and database, but with no results. Why the hell is Drupal throwing 403s in this particular situation? Am I missing something? What happens when a Googlebot comes by- should be getting a 200...
Please help, I have no idea how we can fix this.
Thanks.
Fixed
I just found out our site was the victim of an exploit.
I found a bunch of access rules in Drupal, telling Googlebots coming from host 66.249.65.240 to sod off. This explains my whole story, the organic decrease of search engine traffic, etc.
The exploit was patched in Drupal 6.4, but the information was already (unnoticed) in our database. Fixed now.
Glad you found it, I guess
Glad you found it, I guess this emphasises to all of us the importance of upgrading promptly!
gpk
----
www.alexoria.co.uk
Drupal Giving Googlebot Access Denied on Home Page
This solved the issue of the 403 error on the homepage (Googlebot):
I gave unauthenticated users the "access frontpage" permission at /admin/user/permissions/. It is listed under the front_page module.