Hi,

The bandwidth consumption of my site has grown rapidly as the site grows, both in the number of stories (nodes) and in the number of taxonomy categories.

Suppose I have 1000 stories and 400 categories, and 60% of the stories fall within two or more categories. I have menu links pointing to every category. In addition, 30% of the stories contain a category menu linking to other related stories (populated by a custom module).

A visiting spider therefore loads the same story under multiple different links. The number of unique pages on the site may be 1000, but the number of links a search engine sees is more like 1000 x 10 = 10,000. If I add a separate vocabulary with the categories commercial / non-commercial, that count could double to 20,000 linked pages. And so on.

When a spider hits the site it loads the 1000 stories under many categories and subcategories, consuming about 540 MB per visit. Some spiders are brain-dead and burn through 1.87 GB per visit!

I am forced to move to a high-bandwidth hosting company. However, there must be another solution, since I am ultimately only delaying the crisis until my site reaches 50,000 stories. Searching the forums for "bandwidth", I found a lot of posts about high bandwidth but little that really addresses this issue. I cannot see how a robots.txt file would help, other than to exclude a spider completely (as I have done with the 1.87 GB culprit).

I have removed the calendar module since it also inflates the number of links on the site. The tagadelic module remains enabled because of the functionality it provides, but it too increases the number of links to pages.

I have a lot of questions:

1) Would the (Google) sitemap module help with this problem? Would it stop spiders from traversing the whole site and instead let them load only the individual pages, or does the sitemap module just add another 1000 links to the site?

2) I have the Path module enabled, which provides custom paths for nodes, e.g. sitename/christmas instead of sitename/node/31. Does the Path module aggravate the problem in that every story is indexed under two URLs? I note that some search engines list both sitename/node/31 and sitename/christmas for the same story. This could be a problem in any case, since it is bad policy as far as SEO is concerned - the site may be penalised for serving essentially the same information under different URLs.

3) Is there a way of customising robots.txt to avoid this? I rely on search engines for traffic to my sites, so I cannot exclude them all.

4) Is there any other solution? Am I missing something?

Any help will be appreciated.

Casper Labuschagne

Comments

dries’s picture

  1. Modify your theme so it does not generate as many links (i.e. hide parts of your site). This might not be an option.
  2. Install a robots.txt for well-behaved crawlers; many crawlers will adhere to it. For example, you might want to block 'user/*' - depending on your site, the user profiles may not be worth indexing (see the sketch after this list).
  3. Automatically block excessive crawlers. I don't think we have a module for that, but the forthcoming Drupal 4.7.0 helps you identify "heavy bandwidth consumers" and lets you ban them manually. Quite convenient.
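A minimal robots.txt along the lines of point 2 - just a sketch; the paths and the bot name are examples, so check your own logs and URL structure before using it (the file must sit at the web root):

  # Well-behaved crawlers: skip user profile pages
  User-agent: *
  Disallow: /user/

  # Example: shut out one particularly hungry crawler completely
  User-agent: SomeHungryBot
  Disallow: /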
casperl’s picture

Casper Labuschagne
*** Drupal Status : Rank Beginner ***

I will implement those suggestions.

Maybe what is really needed is a structure whereby search engines are told to ignore the main site and are instead directed to a flat, unthemed list linking to every node on the site. I will give it some thought.

Drupal 4.7 will help in identifying heavy consumers, but Google (at 500 MB a visit) is an essential partner in the Internet strategy.

Ultimately the problem is just being delayed by moving to higher bandwidth providers.

Thanks for the info

Casper

chx’s picture

Please elaborate; I do not yet understand the problem. Nodes usually have links to pages like taxonomy/term/1245, but there are only as many such pages as there are term IDs. Or do you autogenerate links like taxonomy/term/1245+1246? If you do, then you really should ban robots from taxonomy/term. Please give us more information on what kind of URLs are causing the problem and we'll help.
--
Read my developer blog on Drupal4hu. | The news is Now Public


alexmc’s picture

I think that what you want is still possible with robots.txt.

You can tell robots not to scan parts of your site - that may help.

PS: I haven't checked whether Drupal responds to HEAD requests properly. Have the pages concerned actually changed between visits?

A good spider should say "Give me this web page *IF* it has changed since last time I visited."
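A quick way to test this - a sketch assuming curl is available and using a made-up URL - is to send a conditional request and see whether the server answers 304 Not Modified instead of resending the whole page:

  # See whether the page advertises a Last-Modified date at all
  curl -sI http://example.com/node/31 | grep -i 'Last-Modified'

  # Ask for it again only if it changed since that date; a server that
  # supports conditional GETs replies "304 Not Modified" with no body
  curl -sI -H 'If-Modified-Since: Sat, 01 Apr 2006 00:00:00 GMT' \
       http://example.com/node/31 | head -n 1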

varunvnair’s picture

The issue you are facing can be tackled to some extent by using one or more of the following approaches.

  1. Disallow specific URIs in robots.txt: Well-behaved bots generally follow robots.txt quite meticulously. Since you are using the path (and possibly pathauto) module, you can Disallow crawling of certain URIs. For example, on my blog I have a vocabulary called categories; all terms in this vocabulary get a URI of the form http://example.com/categories/some-category-name. I could put '/categories/' in a Disallow statement and ask bots not to crawl it. Ditto for other URI patterns such as '/user/', '/archive/' and '/comment/'.
  2. Disallow '/node/' using robots.txt: Even if you are using the path module, search engines still manage to find the default URIs. Do a Google search for 'site:your-site.com inurl:node' and you will see that nodes are probably being crawled under at least two URIs. Disallowing '/node/' might also be good for SEO, because each node is then crawled under only one URI and search listings do not show your non-aliased URIs.
  3. Use Crawl-delay in your robots.txt: Take a look at Slashdot's robots.txt for an example. The value you specify is the number of seconds a well-behaved bot will wait between successive requests to your site.
  4. Disallow crawling of images: You could Disallow '/files/' in robots.txt to prevent crawling of images and other files.
  5. Use a proper sitemap: Having a correctly generated sitemap might help you. For example, a sitemap specifies (using changefreq) how frequently the page at a given URI changes, which might cause bots that use the sitemap to reduce the frequency of their crawls. (Rough sketches of the robots.txt and a sitemap entry follow this list.)
  6. Ban bad bots: This is an extreme measure and I am not sure how to go about it exactly. I think there is a Drupal module (Bad Behavior) that might help you.
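
To make points 1-5 concrete, here are two rough sketches. The aliases ('/categories/'), the Crawl-delay value and the changefreq/priority values are placeholders - tune them to your own site:

  # robots.txt (at the web root)
  User-agent: *
  # point 2: non-aliased node URIs
  Disallow: /node/
  # point 1: aliased taxonomy pages
  Disallow: /categories/
  Disallow: /user/
  Disallow: /comment/
  # point 4: images and attachments
  Disallow: /files/
  # point 3: seconds between requests; honoured by some bots
  Crawl-delay: 10

  <!-- one entry in a sitemap.xml (point 5) -->
  <url>
    <loc>http://example.com/christmas</loc>
    <lastmod>2006-04-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.5</priority>
  </url>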

Please note that banning bots from crawling taxonomy term URIs might negatively affect SEO.

Do let us know what measures you actually used and how useful they were in dealing with this problem.

My Drupal-powered Blog: ThoughtfulChaos

pamphile’s picture

casperl

1. Does your host supply you with enough bandwidth?
A good "growing" website needs at least 60 GB per month. Don't settle for anything less than that. Get 60 GB even if you only consume 10 or 30 GB. You get what you pay for.

2. What is the total bandwidth being eaten up per month?

>Does the Path module aggravate the problem further in that
>every story is indexed under two URLs? I note that some
>search engines would list both sitename/node/31 as well as
>sitename/christmas for the same story.

Two different URLs are a BIG no-no. This will hurt your SEO ranking, especially if your site was already indexed. Your site could eventually suffer penalties. I've experienced it and fixed it.

I had this problem on one of my sites (not a Drupal site) and fixed it by doing 301 redirects to the new URLs. Google recommends the use of 301 redirects.
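
For what it is worth, on an Apache host the redirect itself is a one-liner in .htaccess; a sketch with made-up paths (how this interacts with Drupal's own rewrite rules is a separate question):

  # .htaccess: tell browsers and search engines the page moved permanently
  Redirect 301 /old-page.html http://example.com/new-page.html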

Interlinking between related pages on your site is a VERY good upselling technique. It's worth sacrificing bandwidth for. If your host cannot supply good, cheap bandwidth, find a new one.

Marcel
http://www.macminiforums.com/forums/
http://01wholesale.com
http://businessletters.com

casperl’s picture

Casper Labuschagne
*** Drupal Status : Rank Beginner ***

No, I changed hosts and that solved the issue. South Africa has a parastatal monopoly telco that supplies the most expensive (and somewhat inferior) telecoms service in the world, and ISPs have no option but to use that expensive infrastructure. My bandwidth on a premium-type account was therefore 1 GB per month, later pushed up to 5 GB per month, which was still being exceeded.

I moved the site to the USA, where for $6.95 I get disk space and bandwidth that would have cost $7,733 (USD) or R48,000 (ZAR) in South Africa.

The one thing that was wrong with the site was that the SiteMenu module was enabled while I also had a second menu navigation system in place. Some search engines would traverse both sets of links: first the links in my custom menu system, then the same links as provided by SiteMenu, and lastly, since I had (and still have) the Path module installed, the URLs for the same content would differ as well. In the worst case the same page would be loaded four or five times by a search engine. By cutting out the SiteMenu module (which was only a temporary arrangement in any case), search engine visits generated 30% less bandwidth.

pamphile’s picture

I don't think Drupal has 301 redirects built in to prevent URL duplication. I hope I am wrong.

Marcel
http://businessletters.com

dkaps’s picture

Due to this bandwidth load problem on their servers, Siteground.com kicked my site out! Man, it was a terrible experience - and I need to get my site back up right now!

Does anyone know of a good hosting service for a Drupal site? I am currently using a friend's server, which won't be there for very long!

so please let me know of a good host.

Cheers,
dkaps
www.Drishtikone.com

pamphile’s picture

modwest.com or autica.com

I would also suggest you turn off any modules that hit the database every time a spider visits.

Bandwidth is easy to get and cheap. Heavy server loads can get costly.

casperl’s picture

Casper Labuschagne
*** Drupal Status : Rank Beginner ***

I solved my problems by moving to Bluehost on the recommendation of John of the Rocky Mountain Blog.

nomad411’s picture

They give us 800 GB a month of bandwidth. I suspected that was mostly BS, but how much were you using anyway?

casperl’s picture

Regarding the excessive bandwidth problem, also consider the changes to robots.txt outlined in the posts below:

http://drupal.org/node/45240

http://drupal.org/node/22265

Applying the correct robots.txt changes is also a crucial step in restricting excessive bandwidth!

Casper Labuschagne
Where am I on the Drupal map on Frapper?