
Redirect url with a bad/spammed query string?

Project:Global Redirect
Version:6.x-1.x-dev
Component:Code
Category:feature request
Priority:normal
Assigned:Unassigned
Status:postponed

Issue Summary

Hi.

I would like to know whether it would be possible to redirect a URL that contains a bad GET page variable. For example, say we have a node:

mysite.com/a-test

If this page doesn't have pagination and we add a GET page variable:

mysite.com/a-test?page=4

The page is served as usual: there is no 404 and no redirect to mysite.com/a-test.

Is it possible to detect whether a page has pagination and, if it does not, or if the GET page value is not valid, respond with a 404 or a redirect?

Thanks.

Comments

#1

Title:Redirect url with a bad GET page variable?» Redirect url with a bad/spammed query string?
Version:6.x-1.2» 6.x-1.x-dev
Category:support request» feature request
Status:active» postponed

Technically this is possible.

The problem is: how do you define a "bad" entry in the query string? Maybe a module on the page requires it? It could be anything...
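One way to make "bad" concrete is an explicit allow-list: any parameter not on the list is dropped, and "page" is kept only when it points at a page that exists. The following is a minimal sketch of that idea in Python, not Global Redirect's code. The function name `canonical_url`, the `allowed_params` set, and the `page_count` argument are all hypothetical, and it assumes a zero-based page index where page 0 is the same as the bare URL.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def canonical_url(url, allowed_params=frozenset(), page_count=1):
    """Return the URL to redirect to, or None if `url` is already clean.

    allowed_params: query keys that some module on the page legitimately uses
                    (hypothetical; the site would have to supply this list).
    page_count:     number of pages the content really has (1 = no pagination).
    """
    parts = urlsplit(url)
    kept = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key == "page":
            # Keep "page" only if it names an existing page other than the first.
            if value.isdecimal() and 0 < int(value) < page_count:
                kept.append((key, value))
        elif key in allowed_params:
            kept.append((key, value))
        # Anything else is treated as a "bad" entry and dropped.
    cleaned = urlunsplit(parts._replace(query=urlencode(kept)))
    return None if cleaned == url else cleaned


if __name__ == "__main__":
    # A plain node with no pagination: the page variable is stripped.
    print(canonical_url("http://mysite.com/a-test?page=4"))
    # A listing with 5 pages: page=2 is valid, so no redirect is needed.
    print(canonical_url("http://mysite.com/list?page=2", page_count=5))
```

The hard part raised above remains: something has to supply `allowed_params` and `page_count` for each page, and that information is not generally available to a redirect module.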

#2

Good idea!

I just noticed that Google Webmaster Tools reported one of my simple node pages for duplicate title tags. When I checked, it looked like this:

mysite.com/mynode
mysite.com/mynode?page=1
mysite.com/mynode?page=2
mysite.com/mynode?page=1205

Weird. No clue how Google picked those up, as those page variables do not exist; the page is simply "mynode".

#3

Google Webmaster Tools will always report those pages as duplicates, whether or not the passed query string is used.

The only solution to that problem is to add a meta tag to those pages.
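The comment does not say which meta tag is meant. One possible reading is a robots meta tag that asks crawlers not to index the query-string variants. The sketch below assumes that reading; the function name `robots_meta_for` is hypothetical, and the rule "any query string means noindex" is deliberately crude, since it would also hide legitimately paginated pages.

```python
from urllib.parse import urlsplit


def robots_meta_for(url):
    """Return a robots meta tag for URLs that carry a query string, else ''.

    Assumption: every query-string variant is a duplicate of the bare URL.
    """
    if urlsplit(url).query:
        return '<meta name="robots" content="noindex, follow" />'
    return ""


if __name__ == "__main__":
    print(robots_meta_for("http://mysite.com/mynode?page=1205"))
    print(repr(robots_meta_for("http://mysite.com/mynode")))
```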

#4

Thanks Kiam, I think I understand, but the problem is that those ?page=xxx URLs don't exist!

I have no idea how Google picked those up, as they all show the same page!
mysite.com/mynode?page=1
mysite.com/mynode?page=2
mysite.com/mynode?page=1205

are all the equivalent of mysite.com/mynode

This is not a views page with paging, it is a simple node page :)

#5

That is really odd. Google should pick up links used in Drupal nodes, not attach random strings to the URLs.

#6

I see the same in the web server log files. Google is crawling pages with ?page=xxx added out of the blue, and so could theoretically keep crawling the same pages over and over indefinitely. What a bottomless mess! I wonder how this brain-dead Googlebot is/was picking these up.

For example, this very page can be called with any nonsensical query string, such as http://drupal.org/node/386928?page=123, and Drupal silently ignores it. This can lead to significant overhead and wasted bandwidth.

The issue is also discussed at http://drupal.org/node/309804

I was hoping that this module could do something about it, but I understand it is a much wider problem and not at all limited to Drupal. One can add ?page=123 to the URLs of perhaps most websites without any consequence at all.