A client of mine prompted me to look into a fair number of soft 404s being reported by Google Webmaster Tools. Of the 800 or so reported errors, all of them contained index.php in the URL (e.g. www.mysite.com/index.php/node/345; it was the same even with pathauto on).

That led me to try it on Drupal.org (and several other Drupal sites). If you go to http://drupal.org/index.php/about (notice the "index.php"), it actually returns you to the front page, when it should produce a full 404.

According to Google at http://www.google.com/support/webmasters/bin/answer.py?answer=181708:

Returning a code other than 404 or 410 for a non-existent page (or redirecting users to another page, such as the homepage, instead of returning a 404) can be problematic. Firstly, it tells search engines that there’s a real page at that URL. As a result, that URL may be crawled and its content indexed. Because of the time Googlebot spends on non-existent pages, your unique URLs may not be discovered as quickly or visited as frequently and your site’s crawl coverage may be impacted (also, you probably don’t want your site to rank well for the search query).
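For anyone who wants to reproduce the check, here is a minimal Python sketch of the rule quoted above: a URL that should not exist counts as a hard 404 only if the server answers 404 or 410. The function names `fetch_status` and `classify_missing_page` are made up for illustration, and the sketch only looks at the final status code after redirects, not at the page content.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch_status(url, timeout=10):
    """Return the final HTTP status code for url (redirects are followed)."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the code is on the exception.
        return err.code


def classify_missing_page(status):
    """Label the response to a URL that is known not to exist."""
    if status in (404, 410):
        return "hard 404"
    # Anything else (for example a 200 that shows the front page) is a soft 404.
    return "soft 404"


# Example usage (needs network access):
# print(classify_missing_page(fetch_status("http://drupal.org/index.php/about")))
```

Running it against a handful of the reported URLs should show quickly whether a given site is affected.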

Any thoughts on this?

Comments

geerlingguy’s picture

My Drupal 7 site doesn't seem to have this problem... A 404 is reported if I add in index.php to the path. Maybe this is a 6.x issue?

pixelsweatshop’s picture

Version: 7.8 » 6.22

You are right, Jeff. I have done some more testing and it is indeed only a D6 issue. Re-marking.

pixelsweatshop’s picture

Priority: Normal » Major

Due to the potential negative impact on SEO, the sheer number of Drupal sites that can be affected, and the fact that Google has flagged soft 404s as "problematic", I am moving this issue up to major.

dman’s picture

There are actually an unlimited number of ways you can fool Drupal into giving you something that is not really there.
http://drupal.org/node/1316128/not/a/path
http://drupal.org/user/693674/doesn't/exist

It's been noticed before, but we've survived this long ...

geerlingguy’s picture

Version: 6.22 » 6.x-dev

pixelsweatshop’s picture

dman: I see your point. However, try that on any D7 site and it returns a hard 404. So there is something in D7 that is properly controlling this, and it should be backported to D6.

dman’s picture

My guess - based on how path routing works - is that the something that deals with it in D7 is an enforcement of the rules where a module's hook_menu defines which and how many arguments it takes. Unless otherwise stated, the additional arguments get split into an array and passed to the callback function - which usually ignores anything it didn't ask for and returns what it thinks you asked for.

Changing that rule would affect ... pretty much every module in D6. :-/
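To illustrate the guess above, here is a toy Python model of the behaviour. It is not Drupal's menu router: the `ROUTES` table, the `dispatch` function and the `strict` flag are all hypothetical, and `strict=True` only stands in for the argument-count enforcement that is guessed to exist in D7.

```python
def node_view(nid=None, *extra):
    """Toy page callback: uses the first argument and ignores the rest."""
    return "node %s" % nid


# Hypothetical route table: path prefix -> (callback, expected argument count).
ROUTES = {("node",): (node_view, 1)}


def dispatch(path, strict=False):
    """Return (status, body) for a path, matching the longest known prefix."""
    parts = tuple(p for p in path.split("/") if p)
    for cut in range(len(parts), 0, -1):
        route = ROUTES.get(parts[:cut])
        if route is None:
            continue
        callback, expected = route
        args = parts[cut:]
        if strict and len(args) != expected:
            # Strict mode: unexpected trailing arguments mean "not found".
            return 404, "Not found"
        # Lenient mode: extra arguments are handed over and silently ignored.
        return 200, callback(*args)
    return 404, "Not found"


print(dispatch("node/1316128/not/a/path"))               # lenient: 200
print(dispatch("node/1316128/not/a/path", strict=True))  # strict: 404
```

In the lenient mode the made-up path still resolves to the node, which is the soft 404 described in this issue. The strict mode also shows why changing the rule is risky: any callback that relies on optional trailing arguments would start returning 404s.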

pixelsweatshop’s picture

What about adding this to robots.txt?

Disallow: /index.php
Disallow: /index.php/*

It won't stop the soft 404s, but it will keep search engines from indexing the problem.
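A quick way to sanity-check those rules is Python's standard `urllib.robotparser`. This is only a sketch: the `User-agent: *` line is an assumption (the rules need to sit in some user-agent group), and `is_crawlable` is a made-up helper name.

```python
from urllib.robotparser import RobotFileParser

# The two proposed rules, assumed to sit under a "User-agent: *" group.
RULES = """\
User-agent: *
Disallow: /index.php
Disallow: /index.php/*
"""


def is_crawlable(url, user_agent="*"):
    """Return True if the rules above allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(RULES.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_crawlable("http://www.mysite.com/index.php/node/345"))
print(is_crawlable("http://www.mysite.com/node/345"))
```

Note that robots.txt rules match by path prefix, so `Disallow: /index.php` on its own already covers everything under `/index.php/`. Python's parser does plain prefix matching and does not interpret the `*` wildcard, so the second rule is effectively redundant in this check.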

Status: Active » Closed (outdated)

Automatically closed because Drupal 6 is no longer supported. If the issue verifiably applies to later versions, please reopen with details and update the version.