Should I stop Google from indexing all daily calendar pages since 1970?

By jstarek on 11 Nov 2008 at 21:44 UTC

Hello everyone,

This may not be the best forum for this question, but I don't really know where else to ask...

I run the Event module, which, among other things, provides a kind of "calendar sheet" for each day. Normally, users open those pages to check which events were entered for a specific day. Each page has "previous day" and "next day" links. Since yesterday, however, I have been seeing Google try to index those pages. All of them. The crawler is still stuck in November 1970, so there will be a *lot* left to follow: over 10,000 pages, in fact, the vast majority of which will be empty...

I excluded the /event/ path in my robots.txt, but apparently the crawler does not re-fetch that file once it has started crawling. Should I try to be nice to Google and my hosting provider and block the crawler's IP address? Or would that get me thrown out of Google's index?
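For readers who want to check such an exclusion before waiting for the crawler to pick it up, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content and the example URLs are assumptions for illustration; the post only states that the /event/ path was excluded, not how the rule was written.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content (hypothetical; the post does not quote the file).
ROBOTS_TXT = """\
User-agent: *
Disallow: /event/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URLs: one calendar page and one ordinary content page.
    for url in (
        "https://example.com/event/1970/11/12",
        "https://example.com/node/42",
    ):
        print(url, "allowed" if is_allowed(ROBOTS_TXT, "Googlebot", url) else "blocked")
```

This only confirms that the rule matches the intended paths; it says nothing about when a crawler will next re-read the file.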

Many thanks in advance

Jürgen

Comments

It annoys me too, that

themphill commented 12 November 2008 at 02:18

It annoys me too that Google tries to index calendar events from way in the past and in the future. I've come to the conclusion that it doesn't matter. Their crawler hasn't caused my site any performance or bandwidth problems that I'm aware of. And on the plus side, Google seems to find useful content in my current calendar entries.

Google's crawler comes from multiple IP addresses. I'll leave it as an exercise to the reader to figure out what they are, but to block them would certainly make your site invisible to Google.
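As a rough sketch of that exercise: rather than collecting IP addresses, a request claiming to be Googlebot can be checked with a reverse DNS lookup followed by a confirming forward lookup. The function name `is_googlebot`, the injectable `reverse` and `forward` parameters, and the suffix list are assumptions made for this illustration, not something taken from the thread.

```python
import socket
from typing import Callable, List, Optional

# Assumed hostname suffixes for Google's crawlers (illustrative, not exhaustive).
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_googlebot(
    ip: str,
    reverse: Optional[Callable[[str], str]] = None,
    forward: Optional[Callable[[str], List[str]]] = None,
) -> bool:
    """Reverse-resolve ip, check the hostname suffix, then forward-confirm it."""
    # The lookups are injectable so the logic can be tried without network access.
    reverse = reverse or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward = forward or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse(ip)
        if not host.lower().endswith(GOOGLE_SUFFIXES):
            return False
        # The hostname must resolve back to the same address.
        return ip in forward(host)
    except OSError:
        # Covers socket.herror and socket.gaierror: treat lookup failures as "no".
        return False
```

Calling `is_googlebot(ip)` with the defaults performs real DNS lookups, so it is better suited to occasional log analysis than to a check on every request.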

Solved

jstarek commented 12 November 2008 at 23:52

I may have phrased that a bit ambiguously. You write:

Google's crawler comes from multiple IP addresses. I'll leave it as an exercise to the reader to figure out what they are, but to block them would certainly make your site invisible to Google.

I did not intend to block the IP ranges assigned to Google, of course. The idea was to block that one crawler - I thought (but was unsure) that this would cause another crawler to be sent to my site, which would in turn read the updated robots.txt file.

However, as ScoutBaker was saying (thanks for that information), the file seems to be reloaded periodically: the crawler has since stopped accessing the events list.

You also write that "Google seems to find useful content in my current calendar entries" - I suggest that you also block /event/ in robots.txt and provide crawlable links to all your relevant calendar entries through a view somewhere on your site, or through a sitemap. That way, you'll get fewer hits on empty calendar pages and Google still sees all the relevant content. I have now set it up that way, and hope that it will work as intended.
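To illustrate the sitemap half of that suggestion, here is a minimal sketch that builds a sitemap listing only the pages that actually hold event content. The node URLs are hypothetical, and a Drupal site would more likely generate this with a contributed module; the sketch only shows the shape of the file.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls: Iterable[str]) -> str:
    """Return a sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical event nodes that live outside the blocked /event/ path.
    event_pages = [
        "https://example.com/node/101",
        "https://example.com/node/102",
    ]
    print(build_sitemap(event_pages))
```

The listed URLs must not fall under the blocked /event/ path themselves, otherwise the sitemap would point the crawler at pages that robots.txt tells it to skip.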

Google does reevaluate the robots.txt

scoutbaker commented 12 November 2008 at 04:07

They download the robots.txt file on a regular basis. This obviously leads to delays between when you make a change and when you see the results. There's plenty of information on Google about how their crawlers work if you want to know more.

---
"Nice to meet you Rose...run for your life." - The Doctor
My first public Drupal site - EyeOnThe503
