Search engine bots still visiting deleted pages because of Search 404? (consequence: high load on server)

wwwoliondorcom - December 27, 2007 - 20:23
Project: Search 404
Version: 5.x-1.x-dev
Component: Miscellaneous
Category: support request
Priority: minor
Assigned: zyxware
Status: active (needs more info)
Description

Hello,

Googlebot and other search engine bots still visit hundreds of pages that were created by a translation script, even though I disabled the script and deleted the translated pages on the server.

Do you think it is because of Search 404 that the bots still visit the translated pages, causing high load on the server?

I do not use the jump feature.

Thanks.

#1

wwwoliondorcom - January 6, 2008 - 06:42

Any help? I still have the same problem. Thanks.

#2

vsr - January 22, 2008 - 18:05

If the translations were in a special directory, you could tell the bots not to go there using the robots.txt file in the root document directory. If you know mod_rewrite, you might be able to have it return a status like 410 - Gone, or at least a Forbidden message.
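As a rough sketch of the 410 idea, assuming the translated pages all lived under a directory called translated/ (a made-up name, use your real path) and that the rule goes into the .htaccess file in the document root, above Drupal's own rewrite rules:

```apache
# Hypothetical path: the deleted translations are assumed to live under /translated/
RewriteEngine On
# Match any URL starting with translated/ and answer 410 Gone ("-" means no substitution)
RewriteRule ^translated/ - [G]
```

If mod_rewrite is not available, mod_alias can probably do the same with a single line, Redirect gone /translated, again with the path being only an example.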

#3

wwwoliondorcom - January 23, 2008 - 14:13

Yes, but Chinese bots do not respect any robots.txt, so will they respect a mod_rewrite rule? (I haven't tried it yet.)

Thanks.

#4

vsr - January 23, 2008 - 17:48

mod_rewrite is part of Apache and does what the rules tell it to do. It runs on your server, so unlike robots.txt it does not depend on the bot cooperating. You are in control when you use your .htaccess file to control access. You can block by IP address or by user agent, and you can do a lot more. If you wanted to, you could redirect the bots from China back onto their own host using your .htaccess file. If you have a directory that you do not want accessed any more, you can just create a .htaccess file in that directory and put deny from all in it. Then any attempt to access that directory will give a 403 - Forbidden message. You just have to make sure that you have a 403 page, or Apache will spit out its default message.
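For the user-agent case, a minimal sketch with mod_rewrite might look like the following. ExampleBot is a made-up name, so substitute a string that actually appears in the bot's User-Agent header in your access log. It assumes mod_rewrite is enabled and that the rules sit in the .htaccess file in the document root:

```apache
RewriteEngine On
# "ExampleBot" is a placeholder, not a real crawler name; [NC] makes the match case-insensitive
RewriteCond %{HTTP_USER_AGENT} ExampleBot [NC]
# Answer every request from that agent with 403 Forbidden
RewriteRule ^ - [F]
```

Keep in mind that a bot can send any User-Agent it likes, so this only stops bots that identify themselves honestly.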

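If you know the IP addresses, you could do something like this in your server configuration or .htaccess file:

<Limit GET POST>
# With "allow,deny" a matching deny overrides the allow, so the listed address is blocked
order allow,deny
allow from all
# 192.0.2.123 is only a placeholder; put the real address here
deny from 192.0.2.123
</Limit>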

http://www.biyw.com/ has a nice little .htaccess cheat sheet and primer on this if you do not know much about it. I am not sure which page it is on, so look in the site's sitemap. http://apache.org/ also has a lot of information about this and more. Hope this helps you.

#5

zyxware - February 5, 2008 - 18:50
Component: Documentation » Code
Assigned to: Anonymous » zyxware

Can you please email me a URL that the bots are still visiting (use http://drupal.org/user/222163/contact)? Search 404 returns an "Error 404 page not found" response, which should take care of removing the page from the index. Perhaps you should wait a little longer for the indices to reflect the change.

#6

zyxware - April 1, 2008 - 18:43
Component: Code » Miscellaneous
Priority: normal » minor
Status: active » active (needs more info)

This module is working perfectly fine on zyxware.com. Unless we get the URL of the site where the problems are seen, we cannot say anything about this issue. Also, try upgrading to the latest version of the module.

 
 
