There is a bot called TridentSpider3 (allegedly a search engine indexer, not a mail harvester) which is obviously badly written.
Today it reached our Drupal site, and when it got to Events (i.e. /events/ with the Rewrite Engine on), it:
- ate up 200+ MB of bandwidth indexing freaking events for more than 4 hours, and it is still at it!
- filled the cache table several times with 56 MB, 83 MB, 182 MB of data
Once it reaches the events, it indexes every single day in the event calendar, whether that day has any events or not.
Can you imagine the number of queries it produces?
Since it respects robots.txt, I believe it can be avoided by entering:
User-agent: TridentSpider3
Disallow: /event/
in robots.txt. It is also useful to block all bots from indexing your Events:
User-agent: *
Disallow: /event/
If the robot doesn't obey those rules, the same can be accomplished in .htaccess, but then every refused request will be logged as an error.
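For the .htaccess route, something along these lines might work. This is only a sketch, assuming Apache with mod_rewrite available (the Rewrite Engine is already on here) and assuming the bot sends "TridentSpider3" somewhere in its User-Agent header. The path mirrors the /event/ rule from robots.txt above, so adjust it if your calendar lives elsewhere.

```apache
# Sketch only: assumes mod_rewrite is enabled and that the bot
# identifies itself with "TridentSpider3" in its User-Agent header.
RewriteEngine On

# Match the bot's User-Agent, ignoring case.
RewriteCond %{HTTP_USER_AGENT} TridentSpider3 [NC]

# Refuse anything under event/ with 403 Forbidden.
# In .htaccess the pattern is matched without the leading slash.
RewriteRule ^event/ - [F,L]
```

These lines need to sit above Drupal's own rewrite rules in .htaccess, otherwise the request is handed to index.php before the block is ever checked.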
If anyone knows a better solution...
Comments
firewall
If it is a "good bad" bot, it will run from fixed IP addresses and not from zombie hosts. I drop these bots in my ipchains.
--
groets
bert boerland
Also in hosts.deny
Yep, it does run from a fixed IP. I have blocked it in hosts.deny and asked the admin of this bot to add my server's IPs to the exclusion list of this freaking bot.