Would it be possible to add a feature to filter out URLs like the one in the title? The genius Yahoo Slurp bot believes that it has found an unlimited number of identical nodes on my site, like these:

http://site/node/219%3C?page=1&from=675
http://site/node/452?page=8

Comments

nicholasthompson’s picture

It's possible, but VERY hard.

A better solution than redirecting would be to produce a 404 message for those non-existent dupe pages, for example:
EXISTS: node/123?page=1
DOESN'T EXIST: node/123?page=1&from=456

Is that what you mean?

The difficulty is that there is next to no way for Global Redirect to know whether this is a dupe page or whether it's actually intended to be there by another module.

nicholasthompson’s picture

A thought....
1) Do all these "dupe" URLs share the same argument?
2) Are you using Apache with mod_rewrite enabled?

You could do a redirect that would be SOMETHING like this (but not this):

  RewriteCond %{QUERY_STRING} from=([0-9]+)
  RewriteRule (.*) $1? [R=301,L]

I think your problem is too specific for Global Redirect - however it should be solvable somehow.

cybe’s picture

Well, actually node/123?page=123 does not exist, but, just like on seemingly any server (http://www.washingtonpost.com/wp-dyn/content/article/2007/05/11/AR200705...), it doesn't return a 404.

category?page=123 does exist though.

For some strange reason Yahoo Slurp thinks those pages do exist. Perhaps they even differ in some minute way, which is why Slurp keeps going through them?

I've been trying to figure out a mod_rewrite rule that sends Slurp a "410 Gone", but I haven't been successful, even though I already have a huge .htaccess full of tricks (some rules took me days to figure out).

It's probably the question mark that's causing trouble. The rule should probably be written using RewriteCond, but I'm not skillful enough and your example doesn't help me much, so please help if you are able to.
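A note on the question mark, offered as a sketch rather than a tested fix: as mod_rewrite is documented, a RewriteRule pattern is matched against the URL path only, so the "?" and everything after it never reach the pattern (and an unescaped "?" is a regex quantifier anyway). The query string is available separately as %{QUERY_STRING}, without the leading "?". The path and parameter names below are taken from this thread; placing the rules in a document-root .htaccess is an assumption.

```apache
# For a request to /node/123?page=4 in a document-root .htaccess:
#   RewriteRule sees:    node/123
#   %{QUERY_STRING} is:  page=4

RewriteEngine On

# Test the query string in a condition...
RewriteCond %{QUERY_STRING} (^|&)page=[0-9]+(&|$)
# ...and the path in the rule itself. "-" means no substitution, G sends 410 Gone.
RewriteRule ^node/[0-9]+$ - [G,L]
```

Note that a RewriteRule always needs both a pattern and a substitution before the flags, even when the substitution is just "-".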

These two examples below do not work.

  RewriteRule ^node/(.*)?page - [G,L]

  RewriteCond %{QUERY_STRING} node/([0-9]+)\?page
  RewriteRule - [G,L]

cybe’s picture

I've now banned pages like these in robots.txt but Slurp seems to read it very seldom.

nicholasthompson’s picture

Try SOMETHING like this? (I say something with such emphasis as I'm not a Jedi Rewrite Master yet)

  RewriteCond %{QUERY_STRING} q=node/([0-9]+)
  RewriteCond %{QUERY_STRING} page=([0-9]+)
  RewriteRule ^ - [G,L]

That would need to go after the line that rewrites the URL into a neat one, which I think is one of the last things to happen...

cybe’s picture

Thanks for the suggestion, but you are not a Jedi yet - it didn't work, and neither did the modifications I made. Looks like it's time to visit the mod_rewrite forum.

cybe’s picture

Wonderful! I got this from someone on the mod_rewrite forum:

Options +FollowSymLinks

RewriteEngine On

RewriteCond %{QUERY_STRING} ^(.*&)?page=[0-9]+(&.*)?$ [NC]
RewriteRule ^node(/.*)?$ - [G,L]

Why not add this rewrite to the module?

Eat 410, Yahoo Slurp!

cybe’s picture

What a weird bot Yahoo Slurp is, by the way: now it is accessing my robots.txt.old.

cybe’s picture

I just noticed that the rule also rewrites http://site/?page=2 and http://site/node/add so it still needs some modification.

cybe’s picture

I've added a

RewriteCond %{HTTP_USER_AGENT} Slurp

so it only applies to Yahoo Slurp. It will still find all the content without visiting any "page=num" pages.
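For reference, here is a sketch of how the pieces might fit together if the goal is to hit only Slurp. Consecutive RewriteCond lines are combined with AND by default, so no [OR] flag is needed for that. The tighter path pattern (numeric node IDs only, so node/add and the front page are left alone) is an assumption, as is placing this in a document-root .htaccess above Drupal's own rewrite rule.

```apache
RewriteEngine On

# All conditions must hold: the client is Slurp AND the query string has page=N.
RewriteCond %{HTTP_USER_AGENT} Slurp
RewriteCond %{QUERY_STRING} ^(.*&)?page=[0-9]+(&.*)?$ [NC]
# Only numeric node paths such as node/123; answer with 410 Gone.
RewriteRule ^node/[0-9]+$ - [G,L]
```

With [OR] on the user-agent line, the rule would instead fire when either condition is true, which is broader than "only Slurp".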

nicholasthompson’s picture

On this line:

RewriteCond %{QUERY_STRING} ^(.*&)?page=[0-9]+(&.*)?$ [NC]

You don't need most of that regex; it can be simplified to:

RewriteCond %{QUERY_STRING} page=[0-9]+ [NC]
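One caveat with the shorter pattern, as a hedged aside: unanchored, page=[0-9]+ also matches inside longer parameter names. A middle ground that stays short but still matches "page" only as a whole parameter name might look like this ("frontpage" is a made-up parameter name used purely for illustration):

```apache
# Matches page=N only at the start of the query string or right after "&".
#   page=2            -> match
#   from=675&page=2   -> match
#   frontpage=2       -> no match
RewriteCond %{QUERY_STRING} (^|&)page=[0-9]+(&|$) [NC]
```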

Now, seeing as it was also being applied to pages that are not of the "node/123" style, you need more filtering... Something like this?

RewriteCond %{REQUEST_URI} ^/node/[0-9]+$ [NC]

Put them all together and you get....

RewriteCond %{QUERY_STRING} page=[0-9]+ [NC]
RewriteCond %{REQUEST_URI} ^/node/([0-9]+)$ [NC]
RewriteRule .* /node/%1? [G,L]

That will remove ALL page arguments from any node... this might break Book content types though.
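The combined rule above mixes two ideas: the G flag answers with 410 Gone, while a substitution like /node/%1? describes a redirect to the node without its query string. Below is a sketch of each variant on its own; use one or the other, depending on whether the paged URLs should vanish or fold into the base node URL. Both assume Drupal is served from the site root, so that %{REQUEST_URI} looks like /node/123.

```apache
# Variant A: tell crawlers the paged node URL is gone (410).
RewriteCond %{QUERY_STRING} page=[0-9]+ [NC]
RewriteCond %{REQUEST_URI} ^/node/[0-9]+$
RewriteRule ^ - [G,L]

# Variant B: permanently redirect to the node URL without any query string.
# %1 is the node ID captured by the last matched RewriteCond; the trailing "?"
# drops the original query string.
RewriteCond %{QUERY_STRING} page=[0-9]+ [NC]
RewriteCond %{REQUEST_URI} ^/node/([0-9]+)$
RewriteRule ^ /node/%1? [R=301,L]
```

Either variant affects every node with a page argument, so the warning about paged content such as Book pages applies to both.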

nicholasthompson’s picture

Status: Active » Closed (won't fix)

SlyK’s picture

I see many URLs with various parameters in my search engine index stats. Too bad you can't fix it.
It seems that if someone leaves a link to:
www.site.com/?page=1&some=sh*t

all of the pages with that parameter get indexed :(