Should I stop Google from indexing all daily calendar pages since 1970?

jstarek - November 11, 2008 - 21:44

Hello everyone,

This may not be the best forum for this question, but I don't really know where else to ask...

I run the Event module, which, among other things, provides a sort of "calendar sheet" for each day. Normally, users would use those to check which events were entered for a specific day. These pages have "previous day" and "next day" links. Since yesterday, however, I have been watching Google try to index those pages. All of them. The crawler is still stuck in November of 1970, so there will be a *lot* to follow. Over 10,000 pages, in fact, the vast majority of which will be empty...

I excluded the /event/ path in my robots.txt, but apparently the crawler does not re-fetch that file once it has started crawling. Should I try to be nice to Google and my hosting provider and block the crawler's IP address? Or would that throw me out of Google's index?
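For anyone wanting to double-check such a rule before waiting for the crawler to pick it up, here is a minimal sketch using Python's standard urllib.robotparser. Only the /event/ exclusion comes from the post above; the "User-agent: *" line, the sample paths, and the is_allowed helper are assumptions made for illustration, and the parser only approximates how Google itself interprets the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. Only the /event/ exclusion is taken
# from the post; applying it to all user agents is an assumption.
ROBOTS_TXT = """\
User-agent: *
Disallow: /event/
"""


def is_allowed(path, user_agent="Googlebot"):
    """Return True if the rules above let user_agent fetch the given path."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, path)


# Hypothetical paths: one daily calendar page, one ordinary content page.
print(is_allowed("/event/1970/11/11/day"))
print(is_allowed("/node/42"))
```

The first path falls under the excluded prefix and the second does not, so a well-behaved crawler that has read the current file should skip the former and still fetch the latter.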

Many thanks in advance

Jürgen

It annoys me too, that

themphill - November 12, 2008 - 02:18

It annoys me too, that Google tries to index calendar events from way in the past and in the future. I've come to the conclusion that it doesn't matter. Their crawler hasn't caused my site any performance or bandwidth problems that I'm aware of. And on the plus side, Google seems to find useful content in my current calendar entries.

Google's crawler comes from multiple IP addresses. I'll leave it as an exercise to the reader to figure out what they are, but to block them would certainly make your site invisible to Google.

Solved

jstarek - November 12, 2008 - 23:52

I may have phrased that a bit ambiguously. You write:

"Google's crawler comes from multiple IP addresses. I'll leave it as an exercise to the reader to figure out what they are, but to block them would certainly make your site invisible to Google."

I did not intend to block the IP ranges assigned to Google, of course. The idea was to block that one crawler - I thought (but was unsure) that doing so would cause another crawler to be sent to my site, which would in turn read the updated robots.txt file.

However, as ScoutBaker said (thanks for that information), the file seems to be re-fetched periodically: the crawler has meanwhile stopped accessing the event pages.

You also write that "Google seems to find useful content in my current calendar entries" - I suggest that you also block /event/ in robots.txt and provide crawlable links to all your relevant calendar entries, either through a view somewhere on your site or through a sitemap. That way, you'll get fewer hits on empty calendar pages and Google still sees all the relevant content. I have now set it up that way, and hope that it will work as intended.
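To illustrate the sitemap half of that suggestion, here is a rough sketch that writes a minimal sitemap in the sitemaps.org format using only Python's standard library. The example.com URLs and the build_sitemap helper are hypothetical, and the sketch assumes the individual event entries are reachable at URLs outside the blocked /event/ path; on a real Drupal site a sitemap module would probably be the more practical route.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a minimal sitemap XML string with one <url> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical URLs of calendar entries that actually have content.
event_urls = [
    "https://example.com/node/101",
    "https://example.com/node/102",
]
sitemap_xml = build_sitemap(event_urls)
print(sitemap_xml)
```

Listing only the entries that have content gives the crawler a direct route to them without it having to walk through the empty daily pages.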

Google does reevaluate the robots.txt

ScoutBaker - November 12, 2008 - 04:07

They download the robots.txt file on a regular basis. This obviously leads to delays between when you make a change and when you see the results. There's plenty of information on Google about how their crawlers work if you want to know more.

---
"Nice to meet you Rose...run for your life." - The Doctor
My first public Drupal site - EyeOnThe503

 
 
