First reported after upgrading to 2.0.3

In Google Webmaster Tools we are seeing one URL pattern within our site that Googlebot is not able to crawl.

Judging from the Fetch as Google test page, the request is being stopped at the Nginx layer.

Looking in my Nginx access.log I can see:

"66.249.72.66" windhorsetour.com [25/May/2012:09:41:51 +0800] "GET /Yangtze/Cruise-Sailing-Calendar-Aug HTTP/1.1" 403 134 266 336 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" 0.000 "1.32"

I have also checked /data/disk/dev/config/server_master/nginx/post.d/nginx_vhost_include.conf and see no entry that relates to this path.

I have attached the Google Report for reference.

Fetch as Google
This is how Googlebot fetched the page.
URL: http://windhorsetour.com/Yangtze/Cruise-Sailing-Calendar-Aug
Date: Thursday, May 24, 2012 6:41:50 PM PDT
Googlebot Type: Web
Download Time (in milliseconds): 148
HTTP/1.1 403 Forbidden
Server: nginx
Date: Fri, 25 May 2012 01:41:51 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Content-Encoding: gzip

<html>
<head><title>403 Forbidden</title></head>
<body bgcolor="white">
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx</center>
</body>
</html>

Thank you.

Comments

omega8cc’s picture

It is "by design" denied in the config: http://drupalcode.org/project/barracuda.git/blob/HEAD:/aegir/conf/nginx_...

However, we should probably improve this so that it does not affect URLs like this one, and only block bots where it is really required.

hyperglide’s picture

Thank you for the reply.

Can this be overridden by following the hint here:
http://drupalcode.org/project/barracuda.git/blob/HEAD:/docs/HINTS.txt#l16

We would want to allow for "calendar" in our case.

omega8cc’s picture

Title: Google Bot Error -- 403 -- On 1 URL Path Sturuture Only (NgingX Layer) » Any URI with 'calendar' is denied for all known bots/crawlers

It can be overridden in that custom config only if you define a more specific location path there, that is, something more than just 'calendar'.
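
As an illustration of what a "more specific location path" could look like for the URL in the original report, here is a minimal sketch. It assumes the custom include file mentioned earlier in the thread (nginx_vhost_include.conf) is read inside the server block, and that the @cache named location used elsewhere in this thread is defined by the stock config. The /Yangtze/ prefix is simply taken from the reported URL.

```nginx
# Hypothetical override for nginx_vhost_include.conf (a sketch, not the official fix).
# "^~" marks a prefix match; when it is the longest matching prefix, nginx
# skips all regex locations, so a bot-blocking regex on "calendar" is never
# evaluated for URIs under /Yangtze/.
location ^~ /Yangtze/ {
  try_files $uri @cache;  # @cache is assumed to exist in the stock config
}
```

Note that this also bypasses every other regex location for URIs under that prefix, so it is worth keeping the prefix as narrow as possible.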

jtbayly’s picture

I've got about a dozen posts that all include the word "event" in the title. It's the same problem. I tried fixing it by putting the following into nginx_vhost_include.conf:

location ~* /blog/.*event {
  try_files $uri @cache;
}

It's not working. Can anybody explain to me how to fix it so that it will actually allow Google to index these posts?

Thanks,
-Joseph

omega8cc’s picture

You would need to use a parent location with literal string (prefix) matching to terminate the search at this point, so it should be:

location ^~ /blog/ {
  location ~* /blog/.*event {
    try_files $uri @cache;
  }
  try_files $uri @cache;
}

omega8cc’s picture

Or just:

location ^~ /blog/ {
  try_files $uri @cache;
}
zkrebs’s picture

The same thing is happening to me: Googlebot can't fetch my /calendar-list page, which shows events from 2012. Is there a quick way to fix this?

jtbayly’s picture

You need to edit the nginx_vhost_include.conf file and add an exception to allow the pages you want. I'm guessing, based on what you wrote, that /calendar-list is the only page you need to fix. If so, try the following:

location ^~ /calendar-list {
  try_files $uri @cache;
}

The point is to make a positive match on the URL (the first line) and tell nginx to serve the page to everybody (the second line). The first line might have to be different, depending on exactly what the URL structure is.
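
To see why the form of the first line matters, it may help to recall how nginx selects a location: it first finds the longest matching prefix location; if that one is marked with ^~ it is used immediately, otherwise regex locations are tried in the order they appear in the config and the first match wins. The sketch below is illustrative only. The deny block and the $is_crawler variable are assumptions standing in for whatever the stock config actually contained, not a copy of it, and it assumes the custom include is read after the stock rules.

```nginx
# Illustrative only; not the actual Barracuda config.

# Hypothetical stock rule: a regex location that rejects crawlers.
# $is_crawler is an assumed variable (e.g. set by a map on $http_user_agent).
location ~* (calendar|event) {
  if ($is_crawler) {
    return 403;
  }
  try_files $uri @cache;
}

# A regex location added later in the config does not help for /calendar-list:
# the regex above appears first and already matches, so it wins.
location ~* /calendar-list {
  try_files $uri @cache;
}

# A ^~ prefix location is chosen before any regex is evaluated
# (when it is the longest matching prefix), so this one takes effect.
location ^~ /calendar-list {
  try_files $uri @cache;
}
```

This would also explain why the earlier regex-only attempt for /blog/.*event did not work, while wrapping it in a ^~ parent location did.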

You can read about how to construct that line to make a positive match at the link omega8cc posted above: http://wiki.nginx.org/HttpCoreModule#location

-Joseph

P.S. You'll need to reload nginx after saving your changes to the nginx_vhost_include.conf file.
P.P.S. Grace, thanks for your help. I got my problem fixed, obviously.

omega8cc’s picture

Status: Active » Fixed

You can still add the override (optionally), but you no longer need to do anything to remove the restriction: we have removed this filtering in HEAD: http://drupalcode.org/project/barracuda.git/commit/cbe894a

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Anonymous’s picture

Issue summary: View changes

added version note at top of issue