Closed (won't fix)
Project:
Global Redirect
Version:
7.x-1.x-dev
Component:
Code
Priority:
Normal
Category:
Feature request
Assigned:
Unassigned
Reporter:
Created:
14 May 2007 at 19:54 UTC
Updated:
8 Jul 2009 at 10:05 UTC
Would it be possible to add a feature to filter out URLs like the ones in the title? The genius Yahoo Slurp bot believes that it has found an unlimited number of identical nodes on my site, like these:
http://site/node/219%3C?page=1&from=675
http://site/node/452?page=8
Comments
Comment #1
nicholasthompson
It's possible, but VERY hard.
A better solution than redirecting would be to produce a 404 message for those non-existent dupe pages, for example:
EXISTS: node/123?page=1
DOESN'T EXIST: node/123?page=1&from=456
Is that what you mean?
The difficulty is that there is next to no way for Global Redirect to know whether this is a dupe page or whether it's actually intended to be there by another module.
Comment #2
nicholasthompson
A thought....
1) Do all these "dupe" URLs share the same argument?
2) Are you using Apache with mod_rewrite enabled?
You could do a redirect that would be SOMETHING like this (but not this):
I think your problem is too specific for Global Redirect - however it should be solvable somehow.
Comment #3
cybe commented
Well, actually node/123?page=123 does not exist, but just like on any? server (http://www.washingtonpost.com/wp-dyn/content/article/2007/05/11/AR200705...)
it doesn't say 404.
category?page=123 does exist though.
For some strange reason Yahoo Slurp thinks those pages do exist, perhaps they even differ in some minute way which is why Slurp keeps going through them?
I've been trying to figure out a mod_rewrite rule to redirect Slurp to a "410 Gone", but I've not been successful, even though I've already got a huge .htaccess full of tricks (some rules I've spent days figuring out).
It's probably the question mark that's causing trouble. The rule should probably be written using RewriteCond, but I'm not skillful enough and your example doesn't help me much, so please help if you are able to.
These two examples below do not work.
Comment #4
cybe commented
I've now banned pages like these in robots.txt, but Slurp seems to read it very seldom.
Comment #5
nicholasthompson
Try SOMETHING like this? (I say "something" with such emphasis as I'm not a Jedi Rewrite Master yet.)
That would need to go after the line which rewrites the URL into a neat one, which I think is one of the last things to happen...
Comment #6
cybe commented
Thanks for the suggestion, but you are not a Jedi yet - it didn't work, nor did the modifications I made. Looks like it's time to visit the mod_rewrite forum.
Comment #7
cybe commented
Wonderful! I got it from someone on the mod_rewrite forum.
Why not put this rewrite into this module?
Eat 410 Yahoo Slurp
Comment #8
cybe commented
What a weird bot Yahoo Slurp is, by the way: now it is accessing my robots.txt.old
Comment #9
cybe commented
I just noticed that the rule also rewrites http://site/?page=2 and http://site/node/add, so it still needs some modification.
Comment #10
cybe commented
I've added a
so it only applies for Yahoo Slurp. It will still find all the contents without visiting any "page=num" pages.
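For anyone following along, the combined check discussed in this thread (a node path, a page argument in the query string, and a Slurp user agent) can be modelled outside of mod_rewrite. This is only a rough Python sketch of the decision, not the rule from the mod_rewrite forum; the helper name `should_send_gone`, the sample user agent string, and the substring match on "slurp" are assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit

# Assumption: only plain node pages (node/<number>) should be affected.
NODE_PATH = re.compile(r"^/node/\d+$")
# Assumption: a "page=<number>" argument anywhere in the query string marks a dupe.
PAGE_ARG = re.compile(r"(?:^|&)page=\d+(?:&|$)")


def should_send_gone(url: str, user_agent: str) -> bool:
    """Return True when the request would get a "410 Gone" (hypothetical helper)."""
    parts = urlsplit(url)
    is_slurp = "slurp" in user_agent.lower()
    is_node = NODE_PATH.match(parts.path) is not None
    has_page_arg = PAGE_ARG.search(parts.query) is not None
    return is_slurp and is_node and has_page_arg


if __name__ == "__main__":
    slurp = "Mozilla/5.0 (compatible; Yahoo! Slurp)"  # sample user agent string
    for url in ("http://site/node/452?page=8",
                "http://site/?page=2",
                "http://site/node/add"):
        print(url, should_send_gone(url, slurp))
```

The path and the query string are tested separately here, which is why http://site/?page=2 and http://site/node/add are left alone. In mod_rewrite terms the same split would roughly correspond to the RewriteRule pattern for the path, plus RewriteCond tests on %{QUERY_STRING} and %{HTTP_USER_AGENT}.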
Comment #11
nicholasthompson
On this line:
You don't need most of that regex - it can be simplified to:
Now, seeing as it was applying itself on non "node/123" style pages, you need to apply more filtering... Something like this?
Put them all together and you get....
That will remove ALL page arguments from any node... this might break Book content types though.
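To make the effect concrete, here is a small Python sketch of what "remove ALL page arguments from any node" amounts to. It is not the rewrite rule itself; the helper name `strip_page_arg` is hypothetical, and the assumption is that only node/<number> paths are touched and that only the "page" argument is dropped.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: only plain node pages (node/<number>) are rewritten.
NODE_PATH = re.compile(r"^/node/\d+$")


def strip_page_arg(url: str) -> str:
    """Return the URL without its "page" argument (hypothetical helper)."""
    parts = urlsplit(url)
    if NODE_PATH.match(parts.path) is None:
        # Not a node page: leave front page, listings, node/add etc. untouched.
        return url
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if key != "page"]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), parts.fragment))


if __name__ == "__main__":
    print(strip_page_arg("http://site/node/452?page=8"))
    print(strip_page_arg("http://site/?page=2"))
```

The sketch has no idea which content type a node is, so it would strip the argument from Book pages too, which is exactly the caveat above.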
Comment #12
nicholasthompson
Comment #13
SlyK commented
I see many URLs with various parameters in my search engine index stats. Too bad you can't fix it.
It seems that if someone leaves a link to:
www.site.com/?page=1&some=sh*t
all the pages with that parameter will be indexed :(