A client of mine prompted me to look into a fair number of soft 404s being reported by Google Webmaster Tools. Of the 800 or so reported errors, all of them contained index.php in the URL (e.g. www.mysite.com/index.php/node/345; it was the same even with Pathauto on).
That led me to try it on Drupal.org (and several other Drupal sites). If you go to http://drupal.org/index.php/about (notice the "index.php"), it actually returns you to the front page, when it should produce a full 404.
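For anyone who wants to repeat the check on their own site, here is a minimal Python sketch. It assumes only the standard library; `fetch_status` and `looks_like_soft_404` are names made up for this example, and the check is deliberately crude: it only asks whether a URL that should not exist answers with something other than 404 or 410.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the final HTTP status code for url (redirects are followed)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the code is what we want here.
        return err.code


def looks_like_soft_404(status):
    """For a URL known not to exist, anything other than 404/410 is suspect."""
    return status not in (404, 410)


# Example use (performs a real request, so it is left commented out):
# status = fetch_status("http://drupal.org/index.php/about")
# print(status, looks_like_soft_404(status))
```

Because redirects are followed, a redirect to the front page shows up as a 200 and is flagged the same way as a page served directly.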
According to Google at http://www.google.com/support/webmasters/bin/answer.py?answer=181708:
Returning a code other than 404 or 410 for a non-existent page (or redirecting users to another page, such as the homepage, instead of returning a 404) can be problematic. Firstly, it tells search engines that there’s a real page at that URL. As a result, that URL may be crawled and its content indexed. Because of the time Googlebot spends on non-existent pages, your unique URLs may not be discovered as quickly or visited as frequently and your site’s crawl coverage may be impacted (also, you probably don’t want your site to rank well for the search query).
Any thoughts on this?
Comments
Comment #1
geerlingguy commented
My Drupal 7 site doesn't seem to have this problem... A 404 is reported if I add in index.php to the path. Maybe this is a 6.x issue?
Comment #2
pixelsweatshop commented
You are right, Jeff. I have done some more testing and it is indeed only a D6 issue. Remarking.
Comment #3
pixelsweatshop commented
Due to the potential negative impact on SEO, the sheer number of Drupal sites that can be affected, and the fact that Google has flagged soft 404s as "problematic", I am moving this issue up to major.
Comment #4
dman commented
There are actually an unlimited number of ways you can fool Drupal into giving you something that is not really there.
http://drupal.org/node/1316128/not/a/path
http://drupal.org/user/693674/doesn't/exist
It's been noticed before, but we've survived this long ...
Comment #5
geerlingguy commented
Comment #6
pixelsweatshop commented
dman: I see your point. However, try that on any D7 site and it returns a hard 404. So there is something in D7 that is properly controlling this and that should be backported to D6.
Comment #7
dman commented
My guess, based on how path routing works, is that the something that deals with it in D7 is an enforcement of the rules where a module's hook_menu defines which and how many arguments it takes. Unless otherwise stated, the additional arguments get split into an array and passed to the callback function, which usually ignores anything it didn't ask for and returns what it thinks you asked for.
Changing that rule would affect ... pretty much every module in D6. :-/
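To make that guess concrete, here is a toy model in Python. It is not Drupal's code and says nothing certain about what D7 actually does; `MENU`, `resolve` and `node_view` are hypothetical names invented for the illustration. It only shows how a router that hands leftover path parts to a callback produces a 200 for paths that were never defined, and how a strict argument count would turn those into a 404.

```python
# Toy model only: none of these names are Drupal APIs.

def node_view(nid, *extra):
    # Like many page callbacks, this ignores arguments it did not ask for.
    return f"node {nid}"


# Registered path pattern -> callback. "%" stands for a wildcard argument.
MENU = {("node", "%"): node_view}


def resolve(path, strict=False):
    """Return (status, body) for a request path."""
    parts = [p for p in path.split("/") if p]
    # Try the longest prefix first, then progressively shorter ones.
    for length in range(len(parts), 0, -1):
        head, extra = parts[:length], parts[length:]
        for pattern, callback in MENU.items():
            if len(pattern) != length:
                continue
            if all(p == "%" or p == h for p, h in zip(pattern, head)):
                if strict and extra:
                    # Enforce the declared argument count.
                    return 404, "Not found"
                args = [h for p, h in zip(pattern, head) if p == "%"]
                # Leftover path parts are passed along and silently ignored.
                return 200, callback(*args, *extra)
    return 404, "Not found"


print(resolve("node/1316128/not/a/path"))               # lenient routing
print(resolve("node/1316128/not/a/path", strict=True))  # strict routing
```

The lenient call serves the node as if the extra parts were not there, which is the soft 404 from the examples above; the strict call rejects it.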
Comment #8
pixelsweatshop commented
What about adding this to robots.txt?
Disallow: /index.php
Disallow: /index.php/*
It won't stop the soft 404s, but it will keep search engines from indexing the problem URLs.
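A quick way to sanity-check rules like these before deploying them is Python's standard `urllib.robotparser`. The sketch below is only that, a sketch: `ROBOTS_TXT` and `is_allowed` are names made up here, the rules are assumed to sit under a `User-agent: *` line, and the standard-library parser does plain prefix matching (it does not understand the `*` wildcard, which is an extension some crawlers support).

```python
from urllib.robotparser import RobotFileParser

# Assumed rule set: the Disallow line from above under a catch-all user agent.
ROBOTS_TXT = """\
User-agent: *
Disallow: /index.php
"""


def is_allowed(robots_txt, url, agent="*"):
    """Return True if the given robots.txt text lets agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(is_allowed(ROBOTS_TXT, "http://www.mysite.com/index.php/node/345"))
print(is_allowed(ROBOTS_TXT, "http://www.mysite.com/node/345"))
print(is_allowed(ROBOTS_TXT, "http://www.mysite.com/index.php?q=node/345"))
```

Since Disallow rules match by prefix, the first line on its own already covers everything under /index.php/. It also covers /index.php?q=... style URLs, so a site that is not using clean URLs would block its real pages this way.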