Any URL containing index.php or any other existing file/directory returns 200, not 404
bjraines - April 13, 2009 - 15:38
| Project: | Drupal |
| Version: | 7.x-dev |
| Component: | base system |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | active |
Description
This has been a major issue for me since I switched all my sites from Joomla to Drupal.
Any previously indexed URL that contains index.php returns a 200 status and serves the front page instead of returning 404 Page Not Found.
Let me give you an example (the links are deliberately broken here):
h t t p :// w w w.shareyourexpertise.com/doesnotexist --- returns Page Not Found
h t t p:// w w w .shareyourexpertise.com/index.php/doesnotexist --- returns 200 and the front page

#1
not the node system.
#2
I could not tell which component to post under, and when I checked similar rewrite threads they had been posted to the node system.
#3
no worries. I just want to try and help make sure the right eyes see this.
#4
Thanks. I hope the right eyes see it, because this is causing me a world of trouble and I know nothing about rewrite rules.
#5
This is not a bug, as it is the standard Apache behavior.
I suggest you try adding the following to your .htaccess, after Drupal's standard rewrite rules:
RewriteRule ^/index.php/(.*)$ index.php?q=$1 [L,QSA,R]
Do you have some /index.php/xxxxx URLs you want to save (for SEO reasons, for example)? If so, you might also want to define the corresponding URL aliases inside your Drupal site.
#6
There are no URLs containing index.php that I want to save; I would like to redirect them all to a proper 404 page.
Does it go just after the rewrite rules, like this:
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_URI} !=/favicon.ico
RewriteRule ^(.*)$ index.php?q=$1 [L,QSA]
RewriteRule ^/index.php/(.*)$ index.php?q=$1 [L,QSA,R]
or somewhere else? I thought the "L" flag means to stop processing the rewrite rules; do I need to remove it from the 4th line?
Thanks. Some Joomla pretty-URL mods were in the habit of including index.php in the URL string, and now I am stuck with tons of these URLs that do not exist anymore.
#7
If you just want to 404 those, I would suggest something a little bit stronger, like (untested):
RewriteRule ^/index.php/(.*)$ non-existing-404-page.html [L]
Preferably before Drupal's own rewrite rules.
#8
#9
But what I don't understand is how this rewrite rule delivers the headers that tell Google etc. that this is a 404 and not just a pointer to another page.
#10
If this is standard Apache behaviour, then how does WordPress get around this same issue?
Its clean URL function sends any URL that contains /index.php/notreallink to Page Not Found, unlike Drupal, which loads the front page.
#11
OK, I have now tested this with Joomla and WordPress. In both of these CMSs (and no, I do not want to switch), whenever a URL that does not exist and contains index.php is entered, a 404 Page Not Found is served.
However, in Drupal, if you request a URL that does not exist and contains index.php, it loads the front page.
#12
pushing to 6.x-dev as it would have to be fixed there first.
#13
at this point it can be switched to a bug report as well.
#14
The query string seems to be ignored; trying a URL like http://example.com/index.php?password=pass causes the front page to be shown.
#15
This doesn't look to me like a case of 'standard apache behaviour', and it has potential to cause more trouble than has been identified here so far. Consider this URL, which displays the same bug:
http:// drupal.org/LICENSE.txt/foobar
Apache is not just ignoring the bit of the path after the existing file. Rather, because the requested file does not exist, the Drupal-supplied clean-URL rewrite rule is triggered, which rewrites this to:
http:// drupal.org/index.php?q=LICENSE.txt/foobar
Drupal is then responsible for determining whether the URL is valid and what status code should be returned.
I was about to publish this comment without disabling the above URLs, and then I thought about the potential consequences. From the above URL, click on the 'Modules' link a few times and watch the URL get longer each time. This means that an infinite number of 'valid' URLs (ie HTTP status=200) can be spidered on the site, although they get long and most spiders are smart enough to give up. Depending on what URLs are on the front page of a site, though, the number of URLs of a given depth could be much higher, with a potential combinatorial explosion of directory paths.
eg if there are relative links to "about/contact", "products/specials" and "help/FAQ" on the front page, then a spider that got a link to http:// drupal.org/LICENSE.txt/foo would later reach URLs like http:// drupal.org/LICENSE.txt/about/help/help/about/products/help/FAQ, amongst 3^6 = 729 URLs of that length, and the site would still be returning status=200 for all of them.
#16
Thanks for the support here. I happen to be an analytical chemist and after testing this several times I thought it was a valid issue not just standard apache behaviour
#17
The standard Apache behavior is to ignore the part after the filename of an executable file, as in:
http://php.net/docs.php/test/toto/titi
http://php.net/support.php/toto/titi/tata
The part after the executable file is called the path information in PHP lingo, and is available in:
<?php
$_SERVER['PATH_INFO']
?>
So this is *not* a bug. If you don't want that, you can use mod_rewrite to redirect all the URLs with a path info to /. This report could be requalified as a feature request for Drupal 7, but for Drupal 6, it is not a bug.
#18
If Damien is right, then let's make it a feature request for D7 at least.
Whether it is a Drupal bug or just a feature request, I think we should have this fixed/solved (in D7, and if possible maybe backported to D6).
If you read #11, it seems that Joomla and WordPress don't replicate this behaviour (they send a correct 404 response in this case).
#19
Yes, I agree. I think this really defines what a *bug* is. It might not be an Apache *bug*, but as far as a properly working website is concerned it is a *bug*.
Other top CMS scripts are built on PHP and Apache and report proper 404s.
I strongly believe this should be fixed for Drupal 6.x, as modules and sites are still catching up to Drupal 6.
As for setting the status to "won't fix", I just do not see this as the right direction, since Google has made it known that they place great emphasis on proper error reporting.
#20
RE Damien's comment #17
Apache's PATHINFO type behaviour is not the cause of this bug.
You say that Apache passes PATHINFO in the case of an executable file. It has nothing to do with whether there's an executable file involved. See my examples in #15, where I used a non-executable file (LICENSE.txt); without Drupal's mod_rewrite rule, Apache would correctly produce a 404 response. Apache's behaviour in this case is overridden, and is relevant only insofar as it provides an example of what should happen. Note also that 'executable' in Apache's view is not necessarily related to the flags in the file system, but rather relates to whether there is an Apache handler for the file.
As I've pointed out, these URLs are being passed to drupal because drupal's .htaccess file has a rule saying that's what should happen when no file is found. In the case of /LICENSE.txt/foo, apache is not passing PATHINFO=/foo to /LICENSE.txt because the mod_rewrite rule tells it to pass that to drupal's index.php instead, and at that point it becomes Drupal's responsibility to handle the request sensibly.
Having implemented pretty much the whole of Drupal as a handler for file-not-found situations, if Drupal fails to deliver file-not-found responses where that's important, then Drupal has a bug. When combined with relative links on the generated page (in Drupal's case it serves its front page) to paths with one or more '/' in them, it becomes a serious bug.
If Drupal were to reimplement the functionality you refer to in Apache, passing control back for requests that Drupal cannot handle, then that would be a functionally good solution. It would mean that scripts that utilise the PATHINFO behaviour could function within Drupal's directory. Besides fixing the bug, this would be a win for anyone who wanted to run web apps other than Drupal on a site where Drupal runs the top level directory. Instead, Drupal seems to be producing extra disk accesses to check the file system for the presence of a file that might be supposed to handle the response, and then serving the front page instead.
If Drupal was to serve up a 404 error page directly, then that would be OK, and would mean that the number of disk reads could be reduced by not searching for a file that accounts for part of the requested path.
#21
Changing the title because this is not limited to index.php.
The following mod_rewrite rules are possible fixes in the case where you know the prefix that's causing problems. eg if you've got a spider reading through /index.php/foo type files this will help, but it won't change the more general case where any other existing file in drupal's file system causes the same thing. eg my earlier example of /LICENSE.txt/foo. I don't think this is a good fix, but it will help if someone's getting a lot of unwanted traffic from web crawlers.
Choose any one of the approaches below.
What you need is another rewrite rule, before the one that's already in your .htaccess file, that handles any URL path starting with "/index.php/". Here are some possible approaches:
This first one produces an ugly Apache error page. It doesn't use Drupal, so it doesn't load your server much.
It's a cacheable result also, which helps reduce repeat requests for the same URL.
RewriteRule ^index.php/.* non-existent-file [G,L]
This one rewrites the URL to something that Drupal will then give a 404 error for. Nicer display.
Drupal's 404 report isn't very useful here though, as it logs the rewritten URL.
RewriteRule ^index.php/.* non-existent-file [L]
If you care about Drupal recording what got rewritten, you could use this one:
RewriteRule ^(index.php/.*) 404bug/($1) [L]
Have a think, though, about whether a 404 error is really what you want to send.
Maybe you'd be better off getting those hits to where they are wanted.
Search engine traffic can be a considerable asset.
You could redirect the browser to the front page with a 301 redirect. The browser still sees the same page,
but search engines will get the message that the old page is no good,
people won't bookmark the wrong page, and relative links will point to the right place.
RewriteRule ^index.php/.* /index.php [R=301,L]
#22
Thanks for the great list of options.
Do you just place these before the current Drupal rewrite rules, or after them?
I am not sure what the final Drupal .htaccess should look like if, say, I used the last option:
RewriteRule ^index.php/.* /index.php [R=301,L]
Do I just paste it in there somewhere, or is there something else I need to do?
#23
Would it not be better if Drupal came with a .htaccess file that already contains a way to resolve such cases? That way, it would not be necessary to reapply the changes to that file each time it is changed by a commit to the Drupal code repository.
#24
#25
The change should go into D7 first, and only then be backported. Can someone come up with a patch?
#26
In reply to #22, whichever rewrite rule you choose must go after "RewriteEngine on" and it's better to be after the "RewriteBase" rule if you've enabled that. It must go before the RewriteCond / RewriteRule block that Drupal uses to enable clean URLs. Ie in drupal 6.10 which is what I'm using, it should go just before the comment that reads " # Rewrite URLs of the form 'x' to the form 'index.php?q=x'."
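To make the placement concrete, here is a rough sketch of how the mod_rewrite section of the .htaccess might look with the last option from #21 added. It is abridged, and the surrounding lines are an assumption about a stock Drupal 6 file, so check them against your own copy. The dot in the pattern is escaped here, a small tightening of the rule as posted. Note that in a .htaccess file the pattern is matched against the path without its leading slash, which is why it starts with "index" and not "/index".

```apache
<IfModule mod_rewrite.c>
  RewriteEngine on

  # RewriteBase /

  # Added rule, placed before Drupal's clean URL block:
  # permanently redirect any /index.php/... URL to the front page.
  RewriteRule ^index\.php/.* /index.php [R=301,L]

  # Rewrite URLs of the form 'x' to the form 'index.php?q=x'.
  RewriteCond %{REQUEST_FILENAME} !-f
  RewriteCond %{REQUEST_FILENAME} !-d
  RewriteCond %{REQUEST_URI} !=/favicon.ico
  RewriteRule ^(.*)$ index.php?q=$1 [L,QSA]
</IfModule>
```

The "L" flag on the existing Drupal rule can stay as it is: the added rule sits above it and carries its own "L", so a matching request is redirected before the clean URL block is reached.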
In reply to #23, htaccess rules are really not the right way for Drupal to fix the problem because it's not just index.php that has the problem, and writing rules for a large and unpredictable set of files isn't viable. The .htaccess approach might help someone with a problem relating to specific urls that are in fact being crawled by a search engine or such like.
Re #25, I don't have drupal 7 installed. I've attached a patch for 6.10 though, which should be easy enough to apply to 7 as it basically doesn't interact with anything else in drupal except drupal_not_found(). This turned out to be pretty easy to write because it turns out that the .htaccess directive "ErrorDocument 404 index.php" is part of the picture. It seems the 404 handler gets to set some stuff up before the RewriteRule modifies the path that gets to index.php. It bothers me that I don't fully understand how mod_rewrite and the ErrorDocument directive are interacting. Testing required, and insight solicited.
The same sort of issue (as seen externally, different code bugs) is widespread in Drupal. Many arbitrary and silly urls fail to generate errors, leading to potential spider pits. eg:
http://drupal.org/project/issues/statistics/drupal/foo/bar/foo/bar
http://drupal.org/aggregator/foo/bar/foo/bar/foo/bar
http://drupal.org/user/12356/foo/bar/foo/bar/foo/bar
http://drupal.org/comment/reply/432384/1525254/foo/bar/foo/bar
These bugs will need different code fixes, but the fact that these continue to function as before is a good sign in terms of my patch not screwing things up. It would be easy (5 minutes + testing) to write a program to find urls on a site which can be extended in this manner.
Here are some relative links that will trigger this situation. Unless Drupal does some sort of filtering here, this bit of user input will create an infinite set of linked pages. Given that such bugs do, and presumably will, exist, I looked around for people's fixes to the problems of relative links, and discovered that this is a very old discussion. #13148: Problems with using relative path names.
#27
I've attached a tidied patch.
menu.inc-404-D6.patch attached.
It's functionally the same as my last patch, but a tidier place to put it.
I'm still not sure quite how the apache processing works. While I doubt this patch breaks a typical apache/drupal install, I don't know how robust this is when using a different web server, or perhaps with a different configuration approach under apache.
While less consistent with drupal's usual approach, it might be more stable to have a separate top level file for 404 errors. I've attached such a file. To use it, leave everything else alone, except the "ErrorDocument 404" line in your .htaccess file:
ErrorDocument 404 404.php
#28
I second the idea of having a php file for just 404's. I worked around this issue in the boost module so it correctly sends out a 404.
#345484: 404 hits to /files directory cached as homepage with broken form actions