Problem/Motivation

Drupal handles 404s for certain files that are not Drupal content.

Examples are .js, .css, .jpg, .gif, ...etc.

This has two disadvantages:

- We do an expensive, slow bootstrap for every 404 for such files that cause server load and delay browser page rendering.

- We send far more bytes for that 404 than if we let the web server standard error page get sent, wasting bandwidth and further delaying browser page rendering.

From a scalability point of view 404s can be a huge problem for sites with old domain names that have gone through multiple re-structurings over the years. The 404 traffic simply from hotlinked files alone can be a significant draw on resources. The current setup is also a huge risk for active sites - if you add a 404 to a busy page, or perhaps accidentally move, chmod or otherwise make the files directory inaccessible, server load instantly skyrockets and can render the server unresponsive. An example of this is the (missing by default) favicon.ico file which, until we added a workaround for in .htaccess was basically doubling server load for all Drupal sites.

From an end user point of view this is also a major issue. Browsers will patiently sit around with the loading spinner going waiting for Drupal to fully bootstrap, generate a beautiful, fully themed page to represent an image or css file, that the browser then completely ignores as soon as it notes the 404 http code. Not only does this mean that we end up waiting for Drupal to bootstrap twice (or more!), but the response is at least a factor of ten larger than it needs to be. In some cases (e.g. whitehouse.gov or amnesty.org) it is over 100 times larger than necessary. Multiply this by all the 404s on all the Drupal 7 sites, and the end result is a lot of end user wait time, and a lot of bandwidth and cycles wasted.

Proposed resolution

The approach is to return a lightweight 200 byte page earlier in the page generation cycle. By default it is returned in drupal_deliver_html_page(), which at least cuts out some amount of page generation time. However we also have an option to return the fast 404 when settings.php is being loaded, which is extremely fast but has some tradeoffs to consider. There are settings in settings.php to disable this completely (and get pretty 404s back), to adjust the extensions that trigger a fast 404, and to enable the settings.php early return.

We spent the first zillion comments or so considering .htaccess based approaches, but these had enough tradeoffs to not work for core as a default, and issues with logging that most people would not want to opt in to.

Here is a table describing the status-quo and the 2 options added by this patch:

Configuration A- disabled Configuration B - called from drupal_deliver_html_page() Configuration C - called in settings.php
Heavy 404 page generation cost Smaller 404 page generation cost (~50% of A?) Minute 404 page generation cost (< fastpath cache)
Slow 404 page response time (e.g. 1-10 seconds) Faster 404 page response time (e.g. 0.5-5 seconds) Very fast 404 page response time (e.g. < 0.2 seconds)
Large 404 page size (2,000-30,000+ bytes, depending on site) Tiny 404 page size (~200 bytes) Tiny 404 page size (~200 bytes)
Inlined assets (images, JS, CSS) 404 as normal Inlined assets (images, JS, CSS) 404 as normal Inlined assets (images, JS, CSS) 404 as normal
Directly requested assets show fully themed 404 page Directly requested assets show simple 404 page Directly requested assets show simple 404 page
All 404s logged in watchdog as normal All 404s logged in watchdog as normal "Fast" 404s not logged in watchdog
All 404s logged in Apache logs as normal All 404s logged in Apache logs as normal All 404s logged in Apache logs as normal
Core image api works normally Core image api works normally Core image api works normally
404 based lazy loaders (e.g. D6 imagecache) work normally 404 based lazy loaders (e.g. D6 imagecache) work normally 404 based lazy loaders (e.g. D6 imagecache) don't work without tweaking the regexp
Pages with aliases that match regex load as normal Pages with aliases that match regex load as normal Pages with aliases that match regex 404 instead of load

Server side benchmarks

Using apachebench, and visting example.com/none (which is a 404), with 5 users, 1000 requests

1. Without patch
Requests per second: 49.71

2. With patch, default settings
Requests per second: 51.11

3. With patch, with function uncommented in settings.php
Requests per second: 51.15

When visiting example.com/none.jpg (which a 404 that triggers a fast 404), 5 users, 1000 requests

With patch, with function uncommented in settings.php
Requests per second: 352.31

Browser side benchmarks

Benchmarks show the effect of the current slow inline 404s on browser display latency for end users. These were done using http://www.webpagetest.org/ testing the same site (in different configurations, 404s introduced through a node body promoted to home page) using default settings, 10 test runs each time.

First View Summary (seconds)
  No patch Patch Patch (fast 404 enabled)
No 404 2.591
Image 404 2.944 2.881 2.525
CSS 404 3.571 2.890 2.501
Percentage Improvement
  Patch Patch (fast 404 enabled)
Image 404 2% 14%
CSS 404 19% 30%
Absolute Improvement (seconds)
  Patch Patch (fast 404 enabled)
Image 404 0.063 0.419
CSS 404 0.681 1.07
Repeat View Summary (seconds)
  No patch Patch Patch (fast 404 enabled)
No 404 0.886
Image 404 1.716 1.488 1.011
CSS 404 1.604 1.500 1.287
Percentage Improvement
  Patch Patch (fast 404 enabled)
Image 404 13% 41%
CSS 404 6% 20%
Absolute Improvement (seconds)
  Patch Patch (fast 404 enabled)
Image 404 0.228 0.705
CSS 404 0.104 0.317

See attached file for more detailed benchmarks.

The results are not insignificant - an inline 404 currently slows down the parent page display by anything from 500ms to a full second, depending on the type of 404. Committing the patch will shave something like 100ms-500ms off that time. If site builders enable the "fast 404" return in settings.php the pages display almost as fast as if the 404 wasn't there - anything from 500ms to a full second faster than the current performance. In other words, this patch clearly has enormous benefits for both end user responsiveness, in addition to the server utilization benefits described in depth above.

Remaining tasks

This has been through many months of revision and tweaking. One potential improvement that comes to mind is a test that the lightweight page is indeed returned (we already check for 404 status in several cases) - this would testing some pretty trivial code (one function call, one if), but is probably worth considering.

The previously submitted patch in #286 was incompatible with image styles and other generated files (such paths would return a 404). The latest patch in #316 addresses this issue by adding a configurable list of excluded paths that should not use the fast 404 feature. An interdiff showing changes between the two patches is available in #320. This revision of the patch has been tested by several people.

It may be possible to further optimize the regex; see #319 and #322.

User interface changes

None

API changes

None

Files: 
CommentFileSizeAuthor
#353 fast404-76824-351.patch5.06 KBShellingfox
PASSED: [[SimpleTest]]: [MySQL] 190 pass(es).
[ View ]
#351 fast404-76824-347.patch5.04 KBShellingfox
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch fast404-76824-347.patch. This may be a -p0 (old style) patch, which is no longer supported by the testbots.
[ View ]
#350 fast404file-76824-346.patch5.04 KBShellingfox
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch fast404file-76824-346_0.patch. This may be a -p0 (old style) patch, which is no longer supported by the testbots.
[ View ]
#347 fast404file-76824-346.patch5.04 KBShellingfox
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch fast404file-76824-346.patch. This may be a -p0 (old style) patch, which is no longer supported by the testbots.
[ View ]
#328 404_fast_paths_7x-76824-328.patch6.33 KBdroplet
PASSED: [[SimpleTest]]: [MySQL] 36,012 pass(es).
[ View ]
#328 404_fast_paths_8x-76824-328.patch6.33 KBdroplet
PASSED: [[SimpleTest]]: [MySQL] 32,866 pass(es).
[ View ]
#326 404_fast_paths_8x-76824-326.patch6.33 KBgeerlingguy
PASSED: [[SimpleTest]]: [MySQL] 32,861 pass(es).
[ View ]
#326 404_fast_paths_7x-76824-326.patch6.33 KBgeerlingguy
PASSED: [[SimpleTest]]: [MySQL] 32,860 pass(es).
[ View ]
#320 interdiff-286-316.txt3.17 KBxjm
#316 404_fast_paths_7x-76824-316.patch6.33 KBgeerlingguy
PASSED: [[SimpleTest]]: [MySQL] 33,643 pass(es).
[ View ]
#316 404_fast_paths_8x-76824-316.patch6.33 KBgeerlingguy
PASSED: [[SimpleTest]]: [MySQL] 33,642 pass(es).
[ View ]
#311 404_fast_paths_7x-76824-311.patch6.33 KBgeerlingguy
PASSED: [[SimpleTest]]: [MySQL] 33,632 pass(es).
[ View ]
#311 404_fast_paths_8x-76824-311.patch6.33 KBgeerlingguy
PASSED: [[SimpleTest]]: [MySQL] 33,647 pass(es).
[ View ]
#310 404_fast_paths_7x-76824-310.patch6.34 KBgeerlingguy
PASSED: [[SimpleTest]]: [MySQL] 33,641 pass(es).
[ View ]
#310 404_fast_paths_8x-76824-310.patch6.34 KBgeerlingguy
PASSED: [[SimpleTest]]: [MySQL] 33,640 pass(es).
[ View ]
#309 404_fast_paths_7x-76824-309.patch6.33 KBgeerlingguy
PASSED: [[SimpleTest]]: [MySQL] 33,649 pass(es).
[ View ]
#309 404_fast_paths_8x-76824-309.patch6.33 KBgeerlingguy
PASSED: [[SimpleTest]]: [MySQL] 33,634 pass(es).
[ View ]
#307 404_fast_paths_7x-76824-307.patch6.33 KBgeerlingguy
FAILED: [[SimpleTest]]: [MySQL] Invalid PHP syntax in includes/bootstrap.inc.
[ View ]
#307 404_fast_paths_8x-76824-307.patch6.33 KBgeerlingguy
FAILED: [[SimpleTest]]: [MySQL] Invalid PHP syntax in includes/bootstrap.inc.
[ View ]
#305 404_fast_paths_7x-76824-305.patch6.33 KBgeerlingguy
FAILED: [[SimpleTest]]: [MySQL] Invalid PHP syntax in includes/bootstrap.inc.
[ View ]
#305 404_fast_paths_8x-76824-305.patch6.33 KBgeerlingguy
FAILED: [[SimpleTest]]: [MySQL] Invalid PHP syntax in includes/bootstrap.inc.
[ View ]
#304 404_fast_paths_8x-76824-303.patch5.93 KBgeerlingguy
PASSED: [[SimpleTest]]: [MySQL] 33,638 pass(es).
[ View ]
#303 404_fast_paths_7x-76824-303.patch5.93 KBgeerlingguy
PASSED: [[SimpleTest]]: [MySQL] 33,620 pass(es).
[ View ]
#286 fast_404_handling-76824-286.patch5.77 KBgeerlingguy
PASSED: [[SimpleTest]]: [MySQL] 29,849 pass(es).
[ View ]
#283 76824-283.diff6.38 KBjoelstein
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 76824-283.diff.
[ View ]
#230 404.diff6.08 KBmoshe weitzman
PASSED: [[SimpleTest]]: [MySQL] 28,851 pass(es).
[ View ]
#217 404 benchmarks.ods6.14 KBOwen Barton
#221 76824-fast_404_221.patch6.05 KBscor
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 76824-fast_404_221.patch.
[ View ]
#215 76824-fast_404_215.patch6.03 KBscor
PASSED: [[SimpleTest]]: [MySQL] 22,749 pass(es).
[ View ]
#208 76824-fast_404_208_0.patch4.64 KBOwen Barton
PASSED: [[SimpleTest]]: [MySQL] 22,462 pass(es).
[ View ]
#206 76824-fast_404_206.patch6.03 KBscor
PASSED: [[SimpleTest]]: [MySQL] 20,688 pass(es).
[ View ]
#196 76824-18.patch6.18 KBkbahey
FAILED: [[SimpleTest]]: [MySQL] 20,258 pass(es), 1 fail(s), and 0 exception(es).
[ View ]
#194 76824-17.patch5.93 KBOwen Barton
FAILED: [[SimpleTest]]: [MySQL] 20,148 pass(es), 1 fail(s), and 0 exception(es).
[ View ]
#192 76824-16.patch6.1 KBkbahey
FAILED: [[SimpleTest]]: [MySQL] 20,158 pass(es), 1 fail(s), and 0 exception(es).
[ View ]
#189 76824-15.patch6.02 KBkbahey
FAILED: [[SimpleTest]]: [MySQL] 20,104 pass(es), 1 fail(s), and 0 exception(es).
[ View ]
#175 76824-14.patch5.35 KBOwen Barton
FAILED: [[SimpleTest]]: [MySQL] Drupal installation failed.
[ View ]
#167 76824-13.patch5.31 KBkbahey
PASSED: [[SimpleTest]]: [MySQL] 19,157 pass(es).
[ View ]
#163 76824-12.patch4.64 KBkbahey
PASSED: [[SimpleTest]]: [MySQL] 19,168 pass(es).
[ View ]
#157 76824-11.patch4.61 KBOwen Barton
PASSED: [[SimpleTest]]: [MySQL] 19,107 pass(es).
[ View ]
#154 76824-10.patch4.9 KBOwen Barton
PASSED: [[SimpleTest]]: [MySQL] 19,095 pass(es).
[ View ]
#153 76824-9.patch4.48 KBOwen Barton
PASSED: [[SimpleTest]]: [MySQL] 19,107 pass(es).
[ View ]
#151 76824-8.patch4.48 KBOwen Barton
PASSED: [[SimpleTest]]: [MySQL] 19,111 pass(es).
[ View ]
#149 76824-7.patch4.15 KBkbahey
PASSED: [[SimpleTest]]: [MySQL] 19,101 pass(es).
[ View ]
#148 76824-6.patch4.12 KBOwen Barton
PASSED: [[SimpleTest]]: [MySQL] 19,100 pass(es).
[ View ]
#141 76824-4.patch8.97 KBOwen Barton
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 76824-4.patch.
[ View ]
#114 76824-113.patch8.87 KBOwen Barton
Failed: 11124 passes, 1 fail, 0 exceptions
[ View ]
#110 76824-110.patch6.01 KBkbahey
Failed: 11098 passes, 18 fails, 0 exceptions
[ View ]
#107 patch12.patch2.29 KBm3avrck
Failed: Failed to apply patch.
[ View ]
#102 76824.patch4.5 KBOwen Barton
Failed: Failed to apply patch.
[ View ]
#92 htaccess.patch1.58 KBzeezooz
Passed on all environments.
[ View ]
#73 files-404-3.patch2.44 KBkbahey
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch files-404-3.patch.
[ View ]
#70 files-404-2.patch2.56 KBkbahey
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch files-404-2.patch.
[ View ]
#67 files-404.patch2.49 KBkbahey
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch files-404.patch.
[ View ]
#66 htaccess-404.patch2.15 KBkbahey
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch htaccess-404.patch.
[ View ]
#54 d_5.patch1.19 KBm3avrck
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch d_5.patch.
[ View ]
#51 htaccessjv.patch1.08 KBjvandyk
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch htaccessjv.patch.
[ View ]
#48 d_3.patch1010 bytesm3avrck
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch d_3.patch.
[ View ]
#41 404_2.patch1.2 KBm3avrck
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 404_2.patch.
[ View ]
#38 404_1.patch1.2 KBm3avrck
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 404_1.patch.
[ View ]
#36 block_9.patch4.05 KBm3avrck
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch block_9.patch.
[ View ]
#32 404_0.patch5.24 KBm3avrck
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 404_0.patch.
[ View ]
#26 htaccess-404.patch_0.txt1.19 KBrobertDouglass
#24 patch_953.35 KBmoshe weitzman
#13 htaccess-404.patch.txt1.97 KBkbahey
#11 htaccess-index.patch1.55 KBrobertDouglass
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch htaccess-index.patch.
[ View ]
#4 htaccess_new_0.txt2.59 KBagentrickard
#1 htaccess_7.patch947 byteskbahey
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch htaccess_7.patch.
[ View ]
htaccess.patch.txt939 byteskbahey

Comments

StatusFileSize
new947 bytes
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch htaccess_7.patch.
[ View ]

Here is another version.

It adds .ico to the list.

Quick question based on the mailing list discussion-

Assume I want to be able to set up path aliases ending in .html or .htm becuase I've replaced static site content with Drupal content. Will this patch prevent that from working? It seems that it will, so I'd suggest separating out these sorts of endings (shtml, html, htm) into a separate rewrite statement and making it commented out by default.

StatusFileSize
new2.59 KB

I opened another issue before getting to the end of the thread....

The original handbook page was written with reference to Drupal 4.6.x.

As discussed at http://drupal.org/node/56773, in 4.6, we made some edits to the default .htaccess file to allow the web server to handle bad file requests for most common file types.

Before this is standardized, some more testing is needed. Attached is out current .htaccess file.

@agentrickard - that attached file does not correspond to the current .htaccess. Did you add more to it?

Let me clarify:

That file is the .htaccess file that we use in production (Drupal 4.6.9), which inspired the original Handbook page. It has been modified since the original handbook page was written.

I added it for reference.

For what its worth, I am using the patch attached in #1 on my live sites, and there is no ill effect at all ...

It does not seem to catch them all though ...

mysql> select wid, from_unixtime(timestamp) as time,  location from watchdog where type = 'page not found' and location like '%gif' order by time desc limit 5;
| 178866 | 2006-08-04 08:48:20 | http://example.com/sites/default/themes/example/images/menu-expanded.gif   |
| 178865 | 2006-08-04 08:47:16 | http://example.com/sites/default/themes/example/images/menu-leaf.gif       |
| 178864 | 2006-08-04 08:12:02 | http://example.com/sites/default/themes/example/images/sidebar_bg_left.gif |
| 178860 | 2006-08-04 07:52:26 | http://example.com/sites/default/themes/example/images/menu-expanded.gif   |
| 178850 | 2006-08-04 05:44:12 | http://example.com/images/header_bg.gif                                       |

I changed it to this, but don't know if it works or not yet.

Index: .htaccess
===================================================================
RCS file: /cvs/drupal/drupal/.htaccess,v
retrieving revision 1.73
diff -u -r1.73 .htaccess
--- .htaccess   14 Apr 2006 09:08:26 -0000      1.73
+++ .htaccess   4 Aug 2006 14:20:13 -0000
@@ -8,6 +8,12 @@
   Deny from all
</FilesMatch>
+# Handle files that should not be handled by Drupal,
+# so a) no bootstrapping happens, and b) 404 error is less bytes
+<FilesMatch "\.(gif||jpg|jpeg|css|js|ico|cgi)$">
+  ErrorDocument 404 default
+</FilesMatch>
+
# Set some options.
Options -Indexes
Options +FollowSymLinks
@@ -82,6 +88,8 @@
   # Rewrite current-style URLs of the form 'index.php?q=x'.
   RewriteCond %{REQUEST_FILENAME} !-f
   RewriteCond %{REQUEST_FILENAME} !-d
+  # Handle 404s for non Drupal files. See above.
+  RewriteCond %{REQUEST_FILENAME} !\.(gif||jpg|jpeg|css|js|ico|cgi)$
   RewriteRule ^(.*)$ index.php?q=$1 [L,QSA]
</IfModule>

@agentrickard: I don't see how your .htaccess (from #4) is really applicable to Drupal core since you simply disable Drupal's 404 handling altogether:

# Customized error messages.
#ErrorDocument 404 /index.html
#ErrorDocument 403 /index.php?q=FileNotFound
# This overrides the Drupal 404 handler for files that should never be handled by Drupal
#<FilesMatch "\.(gif|jpe?g|png|s?html|css|js|dll|exe|asp)$">
#  ErrorDocument 404 "File Not Found
#</FilesMatch>

The original version of this patch (and that which is in the handbook) works for nonexistant files requested from directories that actually exist in the filesystem. It does something like this:

Request: http://site.com/misc/image.gif

1. Look for directory /misc/.
2. /misc/ is found, so look for image.gif inside of it.
3. image.gif is not found, and image.gif matches the FilesMatch regexp, so return Apache default 404.

Request: http://site.com/some/image.gif

1. Look for the directory /some/.
2. Directory not found, so the URL gets rewritten to http://site.com/index.php?q=some/image.gif
3. Bootstrap Drupal, look in menu, not found, return Drupal 404 page.

The only fix I'm able to think of for this would be to modify index.php to intercept files using the same regexp from the FilesMatch directive, and return 404 from index.php. I don't know if the powers that be will buy such a solution.

Oh, I think they will, given large sites having lots of 404 really suffer because of that false bootstrap, and too many of them stressing the system.

I don't have the time to devise a patch, but will be happy to test one on my sites, and perhaps a large client site too.

StatusFileSize
new1.55 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch htaccess-index.patch.
[ View ]

Here is what seems like an optimal solution. We keep the stuff in .htaccess as it is because having Apache handle things without invoking PHP is ideal. For the cases that Apache can't get to because of the rewrite rules, we run the same regular expression as the first line of index.php and return our own 404 if it matches. The results are as follows:

Request Result
/misc/blog.png Apache will serve the image because it exists.
/misc/blogxyz.png Apache will serve a 404 based on the .htaccess rules because the directory exists but the file doesn't.
/foo/bar.gif Drupal's index.php will serve a 404 based on the rules there because the request ends with .gif. The very fact that we are in index.php means Apache has already decided it is a 404 case, with or without the modification to .htaccess (so the modification to index.php is not an Apache specific hack).

As long as developers don't start making paths in the menu that end in .gif, .png, .swf etc., we're fine. Does anyone know of any cases where such paths are generated? Perhaps by CAPTCHA or watermarking?

So much for my nice table.

/misc/blog.png
Apache will serve the image because it exists.

/misc/blogxyz.png
Apache will serve a 404 based on the .htaccess rules because the directory exists but the file doesn't.

/foo/bar.gif
Drupal's index.php will serve a 404 based on the rules there because the request ends with .gif. The very fact that we are in index.php means Apache has already decided it is a 404 case, with or without the modification to .htaccess (so the modification to index.php is not an Apache specific hack).

StatusFileSize
new1.97 KB

Here is an updated patch.

It picks up from what Robert did, and changes the following:

- Adds .ico file as a type to be ignored.
- Makes .html as a commented out line with instructions. The reason is, some sites use the .html in URL aliases for various purposes (e.g. content migrated from older systems, making awstats give better stats by type, ...etc.)

I tested this and it works as expected. Should cut down resource usage on some busy sites with lots of flat files.

So, +1 from me.

What if I'm serving images from the database, using the 'private file method'?

What if, in the near future, we want to do access control on images associated with nodes? (Something we should do.)

This breaks private files. Ugh.

Of course, we could update the regexp to accommodate the private files path... but what about filemanager files? And ecommerce files?

Yuck.

I don't know how we're going to introduce the access permissions on files attached to nodes (filemanager/attachment already has a solution for this), and don't think that it should be part of this patch.

Is the best thing to do to make the change in .htaccess and leave it at that? I suspect that we're only covering 5-10% of the target cases if that's the route we go. The other option would be to build the check into the bootstrap process and hope that we could do a partial bootstrap for 404 images instead of a full bootstrap, but this is clearly inferior.

I'm writing something maybe related, http://drupal.org/node/77514. It performs a search for the missing URL with certain words stripped (eg html, php etc).

I wouldn't be surprised if we end up introducing a second entry point to the bootstrap process, something like files.php, that handles any URL that has a file ending. That file would initiate an alternate bootstrap process where one of the first checks would be "Is this really a file that exists?", and right after that would check "Is this a path that is handled by a module?".

Haven't Walkah and Co. been doing some thinking on the file bootstrap issue? What is their concensus?

i posted some thoughts on private files at http://www.tejasa.com/node/113. no code though.

IMO, the best we can do is to try and serve a lightweight page. Dropping the sidebars on 404 pages could save a fair amount of resources; both bandwidth as CPU cycles.

Assigned:Unassigned» moshe weitzman

i guess that will have to do for now ... i've been meaning to add a block-less page anyway. needed for panels.module ... if anyone has ideas on how to implement this, please chime in. assigning to self.

Introduce a new theme function? mini_page based on theme_page() in theme.inc?

theme_maintenance does something similar. That or provide a drupal_is_404() function for the template files? Food for thought. :)

Thats was easy. i just added minor conditionals in phptemplate.engine and chameleon.theme. Is used by 404 page, and will certainly be used by panels.module - http://angrydonuts.com/the_dashboard_revisited

To omit blocks, just pass FALSE as 3rd argument - for example theme('page', $output, FALSE)

StatusFileSize
new3.35 KB

oops - here is the patch

It begs the question: do we need to redo the theme_maintenace() page using this system? Can, and if so, should we merge them?

StatusFileSize
new1.19 KB

Here's the .htaccess portion of the patch. We should adopt this part regardless of the solution that we come to for the rest because it is the most effecient way to handle requests for files that don't exist.

i looked into maintainence_page and install_page but they are specialized and merit own function. They avoid calls to functions that haven't been loaded yet. I think we can leave them alone.

I'm ok with Moshe's patch. My original thought was to introduce a helper function (drupal_is_404()) but I guess this works too. A drupal_is_404() might be a little more powerful, and more transparent though.

Then again, the extra paramater passed to theme_page() might, for example, be useful to simply certain pages (like the welcome page after installation).

Status:Needs review» Reviewed & tested by the community

i'll take some more feedback from cvs commit team, or proceed with commit. the extra param on theme_page() is vital for panels.module (whichis super cool if you haven't looked yet - http://drupal.org/files/issues/htaccess-404.patch_0.txt)

Status:Reviewed & tested by the community» Needs review

Which patch are we talking about? I can't tell from moshe's review if he thought his own patch was ready for commit or Robert's.

i am talking about my patch. i have not tested robert's patch.

StatusFileSize
new5.24 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 404_0.patch.
[ View ]

Updated patch... includes Moshe's patch + Roberts + update for HEAD.

Tested and working well. I really like the fact that blocks go off when you get a 404, that really can save a lot of bandwith + CPU cycles.

Also, the Apache 404 for actual files is *very* necessary. If you have a link to http://www.example.com/somethingFunnyLooking.gif ... you are linking to a file directly--you are expecting either the file or a 404, *not* a full Drupal boostrap to give you a 404. 99% of the time that link is going to be in the theme or some link posted by someone and they'll never see that 404... so let's give them the absolute least amount of overhead with that 404.

I'd say this is ready to go, nothing new here and we solve 2 problems at once: 404s for content in Drupal (no blocks, less CPU, SQL, etc..) and 404s for files (*nothing* at all).

Also when/if this goes in, we need to update the theming docs about this new $show_blocks variable so themers can be sure to use it so save some CPU cycles on 404s.

Status:Needs review» Reviewed & tested by the community

Talked with Moshe in IRC, we both agreed this is RTBC. It was ready before Neil just didn't know what to commit (now it's both).

Status:Reviewed & tested by the community» Needs work

This is two separate issues. Make a separate issue for one half of this.

StatusFileSize
new4.05 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch block_9.patch.
[ View ]

Neil, ok here is the first patch then, this is for the blocks. Note, this will need a documentation update.

This patch is the same from before, just synced with HEAD.

Status:Needs work» Reviewed & tested by the community

StatusFileSize
new1.2 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 404_1.patch.
[ View ]

After the patch in #36 is commited (same as in #24 just updated for HEAD), here is the next patch to address the .htaccess and 404.

Status:Reviewed & tested by the community» Needs work

Please post different patches on /separate/ issues.

I moved the most recent version of Moshe's patch to a different issue since the title of this issue fits my patch the best:
http://drupal.org/node/80201

Status:Needs work» Reviewed & tested by the community
StatusFileSize
new1.2 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 404_2.patch.
[ View ]

Here's the 404 patch which is for this issue :-)

Thoroughly tested, this is ready to go. Like I said in #32, this is needed for files, it can save a ton of wasted Drupal bootstraps and server overheads.

Because of this (two) patches I think the need for http://drupal.org/node/76824 has increased, since the 404 page just "blanks" ATM. Most users wont configure a 404 page...

Status:Reviewed & tested by the community» Needs work

The nearly-identical commented lines next to the uncommented need a better explanation.

From what I can tell this still breaks private file downloads.

sad to see this one languishing. if it cam edown to it, i would add this and put some doc for private files tha tthis must be commented out.

Version:x.y.z» 6.x-dev

moving from x.y.z to 6.x-dev version queue. Hopefully this will also bump the feature back into people's attention.

color.module also does not work for private files and it is a quite prominent part of our default theme. see http://drupal.org/node/92059

i think private file incompatibility is acceptable for drupal 5. we will fix that in 6. anyone want to address drumm's last comment and resubmit?

-1 on this addition to the .htaccess, It will break any dynamic generation of the restricted file types & drupal based file access control.

If it is added I'd like to see it commented out by default, so people who have concerns about bandwidth and server load can uncomment them. We could also choose a special case for files/ or system/files.

StatusFileSize
new1010 bytes
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch d_3.patch.
[ View ]

Here's and updated patch which cleans up some stuff.

Known issues:

  • Regexes need to be updated to ignore files that live in files/ and system/files/ --- this will fix both private downloads and modules like imagecache which make special use of the files/
  • Directories that do not exist (e.g., themes/abc/something.gif) cause an Apache 404 error but then it says it can't find the proper 404 document. However, if use a directory that exists /themes/something.gif with an image that doesn't exist, it gets the proper 404

The problem with 'files' is that it can change and we don't have access to file_directory_path from the .htaccess file...

it is reasonable to force folks to have 'files' in the difrectory path, like we do for modules and themes. but if not, i htink this can go in commented out and ask folks to enter their files path.

StatusFileSize
new1.08 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch htaccessjv.patch.
[ View ]

Here's an updated patch. Moved definition of ErrorDocument below our existing definition (otherwise it'll just get stomped on by the later ErrorDocument 404 index.php line).

Changed the description a bit. Commented the lines out per #50, except the RewriteCond.

Don't have time to test this just now, but will not handling rewriting for .html files mess when you have node/2 aliased to about.html, for example?

Also, in those new comments it might help to have a reason why one might NOT want to uncomment those lines. For example, why does this come commented-out by default.

Version:6.x-dev» 5.x-dev
StatusFileSize
new1.19 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch d_5.patch.
[ View ]

Here's an updated patch with a regex from Steven on IRC to ignore those files when they exist in /files which can interfere with private downloads and imagecache module.

Regex needs some tweaking though, almost there.

Also uncommented the regex because if it can ignore anything in files/ it shouldn't break anything. Feel free to comment out though if I missed a use case.

m3vrck: FilesMatch does absolutely nothing with Directories. Scanning against files/ won't work. The only way forward on that avenue is to get the default handler into the .htaccess that is generated into the files/ directory.

And files isn't always named 'files'.

How about a new approach to this issue that has been going back and forth for so long, rolled back, ...etc?

Instead of doing things in .htaccess, requiring people to uncomment things, ..etc., can we move this into Drupal itself?

Can we do a light bootstrap (as early as possible), read a variable (perhaps in settings.php) that has a pattern, try to match it, then exit early?

This is of course more overhead than letting Apache handle the whole thing, but it would still be better that what we have now, provided we do it as early as possible.

Comments?

@kbahey - now is not the time to switch gears on this issue. thats a radical change and quite debateable.

@drumm - i think a code comment in htaccess and possibly where we set the files directory is enough to satisfy the rare case where files dir does not end in 'files'.

moshe, the reason I suggested an alternative is that this is dragging on, due to various issues that are pointed out by various people. So, getting consensus on it will be difficult.

So, in my view, it may not make it to 5.0.

My suggestion was to move it inside drupal, so we can manage the patterns to exclude, ...etc. from inside Drupal with a little bit of sacrifice on the overhead side.

If others think it will get in core via htaccess soon, then I am happy to go with that too.

anyone care to carry this forward?

Version:5.x-dev» 6.x-dev

Anyone willing to work on this performance enhancing patch?

The code freeze is a few weeks away ...

Can someone answer my question at http://drupal.org/node/76824#comment-173567, which is the same as #3 submitted by pwolanin? What if I have an alias for .html pointing to a missing node? Not that that is a huge deal, just would like clarification. Also, the hardcoding /files issue is still there. What about for multisite?

@kbahey, I agree that moving this inside Drupal would save us a lot as we could get the right files dir and some of the other issues might be alleviated. But, maybe that would soften the performance gain that this is trying to achieve. Since this is a bit funky, I think the easiest way is to put this in .htaccess and comment it out initially, allowing users to turn on this boost if they want. What do you think?

Many of my site has .html as the suffix, mainly so awstats can give good stats on them.

I think that this patch is for performance more than anything else. If we avoid a Drupal bootstrap, we waste less resources when we have a 404.

So, the options are:

1. We add the stuff in .htaccess and index.php and comment it out. This is the path of least resistance and would give us what we need for D6.

2. Have .htaccess generated automatically from Drupal like we do with the CSS cache stuff. A default setting provides which file extensions to use and the admin can override them (e.g. remove html or add xyz)

The .htaccess need not be writable. We can take a non-fancy approach and just display the .htaccess contents on the screen and ask the admin to copy/paste, or we can write to the tmp directory and ask the admin to move the file manually. Since this is an advanced feature, I think this is acceptable.

3. Move parts of that in a lighweight bootstrap, as I suggested earlier.

I think 1 will have the best performance. If Dries is amenable to this, I can work on this towards end of next week.

I agree that #1 is best for now. I hope kbahey or other have a chance to revive this.

Assigned:moshe weitzman» kbahey
Status:Needs work» Needs review
StatusFileSize
new2.15 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch htaccess-404.patch.
[ View ]

I hate to see Drupal 6 go out the door without this simple yet effective feature.

Here is a patch for current HEAD.

StatusFileSize
new2.49 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch files-404.patch.
[ View ]

Here is a patch that moves the index.php stuff to settings.php, and comment it out.

This way, we have no cruft in index.php, and those who need this can just uncomment it (albeit in 3 different places, but better than a whole Drupal bootstrap for a measly 404).

Status:Needs review» Reviewed & tested by the community

works as advertised. i think this is a good compromise, and all the lines are needed.

Status:Reviewed & tested by the community» Needs work

One of the chunks is double commented, the other is not.

Status:Needs work» Needs review
StatusFileSize
new2.56 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch files-404-2.patch.
[ View ]

Here it is, fresh from the microwave ... ;-)

I think this patch is important (I wrote the original handbook page).

Circling back, have you tested against http://drupal.org/node/56773#comment-243479? And against file uploads in general?

Reread the htaccess comments, there are typos and repeated stuff as mentioned by Steven. Also if the functionality in htaccess is effective, there would not be a way to get to the default settings.php change. Are three three marketed as an 'all or nothing' solution to uncomment all three to cover all bases?

StatusFileSize
new2.44 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch files-404-3.patch.
[ View ]

The 3 are needed but for different reasons. When a directory exists but a file in it does not, then Apache is able to handle the 404, and hence the .htaccess changes.

However, if the directory does not exist, then Drupal is still bootstrapped, so we exit as early as we can, and hence the settings.php change.

Rerolled the patch. Hopefully comments are OK now. I also changed it to use REQUEST_URI instead of QUERY_STRING, since that is what we are checking in .htaccess.

Also added a small "Page not found" message.

Status:Needs review» Needs work

* What's a "flat file"? I know what a flat file is. It's not what the documentation prefaces.
* None of the sentences that end with "settings.php" have a period at the end.
* None of these lines are commented (and they should be).
* "corresponding the RewriteCond" has an extra "the".
* I can agree that we'll save "server" resources, but not "bandwidth" resources.
* "corresponding FilesMatch line" - it's actually three line[s], not just one "line".
* Technically, if a number is less than 10, it should be written out ("three lines", not "3 lines").

I'm not sure I like this third check. I worry about dynamically generated images (ala CAPTCHA or various statistics) - I suspect, just off the top of my head, that one of the "make an image from text" modules will fail, like textimage.module. This third check also defeats a request for an image uploaded to a node when private downloads are enabled (URL path is ?q=system/files/1608.png, which triggers your check).

And note that #74 and the "private downloads" issue was brought up as far back as #14.

I have trouble believing that the performance gain is worth completely eliminating file extensions in the virtual file system. The whole simple beauty of the "no file named $1? let index.php?q=$1 handle it" method is that there is no difference from the outside between the Drupal handler and the files served directly.

What performance gain is this supposed to give, anyway - it seems to speed up only 404 errors. For a site without broken links and sensible, persistent URLs, shouldn't those be the exception rather than the norm?

Just my two cents there...

What performance gain is this supposed to give, anyway - it seems to speed up only 404 errors. For a site without broken links and sensible, persistent URLs, shouldn't those be the exception rather than the norm? - Not really. Given enough time, your site will be constantly hit by broken spiders, endless requests for favicon.ico (which may not exist), and various cracking attempts via known vulnerabilities in IIS and other webservers. My top WEEKLY 404s are 14045 /favicon.ico(has never existed and never will), 2296 /popular (never existed. broken bot), 1668 /ghostsites/images/pixel.gif (probably once existed.), 1414 /popular/alltime (never existed. broken bot), 1025 /images/pixel.gif (probably once existed), 467 /cgi-bin/isp-post/isp-post.cgi (last existed 7 years ago. bot attempting exploits), 409 /files/users/picture-11.jpg (existed once. user, and all his content, was deleted.), 178 /db/zone.htm (never existed. no idea where this comes from).

@Arancaytar

Just for background: This is a massive problem for sites that migrate to Drupal from a flat-file system. In the case of SavannahNOW, we had thousands of now dead links in Google, Yahoo, et. al. We were basically getting DOS'ed by 404 errors.

Plus, if someone requests a page that no longer exists, and that .html file contains two .css and 10 .gif files, that turns into 13 404 errors that "fall through" to Drupal. In effect, Drupal spends all of its time searching for files instaed of serving pages.

It is also a generic problem with bots and spiders -- particularly spambots and hackers. Our traffic logs show repeated requests for .dll and .cgi files that point to known or suspected exploits in IIS.

All that aside, the case may be that the best path here is a well-tested and documented handbook page, with a reference in the core INSTALL.txt that points to those instructions. I say that simply because edge-cases seem to be drowning this issue. If we can't fix all the edge cases, I doubt this will get into core.

Here is the wording, before I roll a patch again:

.htaccess part1:

The following lines prevents Drupal from handling 404 errors for certain
types of static files, such as images and css.
This can help reduce the server's load because 404s for such files do not
cause a full Drupal bootstrap.
You must uncomment the corresponding RewriteCond lines later in this file as
well as those in the settings.php file.

.htaccess part2:

The following lines prevents Drupal from handling 404 errors for certain
types of static files, such as images and css.
This can help reduce the server's load because 404s for such files do not
cause a full Drupal bootstrap.
You must uncomment the corresponding FilesMatch lines later in this file as
well as those in the settings.php file.

settings.php:

The following lines prevents Drupal from handling 404 errors for certain
types of static files, such as images and css.
You must uncomment the corresponding RewriteCond and FilesMatch lines in
the .htaccess file as well.

Regarding private download, we can add a comment that this does not work in such a case. How about that? Any wording suggestions Morbus?

CSS is a acronym and should be capitalized. I'd much rather these sentences be merged into one paragraph (for each block).

As for the settings.php, I can't support this patch with that block in - adding wording-to-the-wise about private downloads doesn't solve the issue of textimage (which creates dynamic images) or any other module that attempts to do something dynamic with normally static files. It's something that would cause more confusion and endless "your module is broken!" requests.

I support the .htaccess additions.

Assigned:kbahey» Unassigned

Morbus is right about private file download. This is something we need to think about. However, that should not stop the .htaccess parts.

Not sure what merging sentences in one block for each paragraph means.

I am unassigning myself from this issue. Someone else can take it and run with it.

In order not to break textimage and imagecache, I just added this condition in the settings.php if block:

&& strpos($_SERVER['REQUEST_URI'], 'files') === FALSE

i know that one could use a files dir that doesn't begin with 'files' but thats a bug in core IMO. we require modules dir to be modules, themes to be themes, and we have every right to require that files be files. i just need to submit a patch for that. if folks agree, i'll get started.

note that color.module does not work with private files and thats a far more important feature than these chunks of commented out code that are proposed.

Given that any module (or even custom site code) might attempt to "fake" a file extension in the URL, it seems like this fixes one issue (404 overload) at the cost of introducing an uncounted number of problems whose cause cannot not readily identified.

"Why is such-and-such a page suddenly returning 404 errors?" seems like a far more difficult support issue to solve than "Why is my Drupal site constantly running into my host's server load limits?". When the server overloads, you have a very obvious problem that will probably be handled by an admin who can poke around in .htaccess files. The other one is encountered by users who have no idea what happened.

Because of this, I'd propose that the relative addition/change in .htaccess be commented out with a description of how to activate it, and document this as a performance tweak in the handbook...

subscribe

Where has this issue been all my life? Subscribing.

Subscribing

Subscribing

Version:6.x-dev» 7.x-dev

Moving feature requests to D7 queue.

Version:7.x-dev» 5.7
Status:Needs work» Needs review

For handle file in non-existent directory I use rewrite to /404.ext in .htaccess. Put

  RewriteCond %{REQUEST_FILENAME} !-f
  RewriteCond %{REQUEST_URI} !^/files/
  RewriteCond %{REQUEST_URI} \.(png|gif|jpe?g|s?html?|css|js|cgi|ico|swf|flv|dll)$
  RewriteCond %{REQUEST_URI} !^404.%1$
  RewriteRule ^(.*)$ 404.%1 [L]

before last RewriteConds/RewriteRule and
<FilesMatch "\.(png|gif|jpe?g|s?html?|css|js|cgi|ico|swf|flv|dll)$">
  ErrorDocument 404 "<html><head><title>404 Not Found</title></head><body><h1>Not Found</h1><p>The requested URL was not found on this server.</p></body></html>
</FilesMatch>

just after
ErrorDocument 404 /index.php

It does not require to patch other files and seems to work well in 5.7 (including imagefield preview), but error message contain 404.ext filename so I use custom message.

Version:5.7» 7.x-dev
Status:Needs review» Needs work

@zeezooz, can you roll a patch?

New features are going into 7.x-dev, not 5.7.

The trickiest thing to work around here are modules like imagecache that rely on the 404 handling to auto-generate versions of files that don't yet exist. I wonder if we could get around this by having them add a querystring paramete (e.g. 404handler=drupal) that will cause .htaccess to use the Drupal 404 for these files, rather than the lightweight static 404 page. This way we could potentially have the rule enabled by default, and still allow these modules to work without having people hack the (fragile!) .htaccess regexi.

Anyone fancy taking a shot at this?

StatusFileSize
new1.58 KB
Passed on all environments.
[ View ]

patch for #89

Note: Favicon.ico just got such special treatment in Drupal 6.3.

patch #92 works great!

We are using this on MothersClick with imagecache & imagefield with *no* problems at all, test it yourself:
http://www.mothersclick.com/no-directory/this-really-works.css

The patch that went into D7 and D6 for favico should be reverted in favor of this patch which works great.

Category:feature» task
Status:Needs work» Needs review

Still applies with fuzz, back to needs review.

I agree 100% that a more generic solution would be great for this issue rather than special casing favicon.ico - it's a big performance issue on some sites (including one I run) that have old legacy linkrot for images etc.

Could we leave default behavior as it is? This will break functionality on a great number of sites - whether you are using a private file system, or any module that dynamically generates file downloads, or even if you are in the habit of adding ".html" to the URL aliases of your nodes.

It's a fine tweak, but I don't see that we should knowingly and without warning or explanation break the sites of one part of the user base in order to improve performance for another. These lines should be commented out initially, so you can uncomment them if you need the performance. We don't default to aggressive caching either, for the same reasons.

This patch DOES NOT break dynamically generated files like images with imagecache. Like I said, we're using it on MothersClick and we rely on imagecache and everything works a-ok. Before saying a patch doesn't work, test it. There is a line in there that says ^/files/ ...

I'm sorry, I often get mod_rewrite rules wrong. Does the following prevent the file being handled by Drupal, or is it an exception for files that are handled by Drupal?

+  RewriteCond %{REQUEST_URI} !^404.%1$
+  RewriteRule ^(.*)$ 404.%1 [L]

Related: #295456: Bug in .htaccess file.
Also the patch Gábor mentions in #93: #174940: Useless favicon.ico 404 messages in logs.

Is it proposed that *by default* Drupal should not handle 404s for the image file types? Two comments:
- if Drupal doesn't handle them then I lose an easy way of finding broken img tags etc. on my sites
- the comment sections e.g. # The following lines prevents [sic] ... # You must uncomment the corresponding RewriteCond and RewriteRule lines later in this file. suggests that by default these lines will be commented out??

@98, and for my own edification: What I think those 2 lines do is the following:
- first of all, %1 is a backreference to the file extension of the image file (as matched the line before in the RewriteCond)
- so RewriteCond %{REQUEST_URI} !^404.%1$ makes sure that the request was not for 404.jpg say (this is needed so that the next line doesn't create a loop)
- RewriteRule ^(.*)$ 404.%1 [L] then rewrites the current request (e.g. picture.jpg) to 404.jpg and ends ReWrite processing (so that the following rule - Drupal's normal one - doesn't get applied).
- only then does the FilesMatch directive get applied

Actually I'm not sure the request URI needs to be rewritten ... just need to make sure that Drupal's normal RewriteRule doesn't get run?

Status:Needs review» Needs work

The last submitted patch failed testing.

Status:Needs work» Needs review

#100 was the testbot problem #335122: Test clean HEAD after every commit.

StatusFileSize
new4.5 KB
Failed: Failed to apply patch.
[ View ]

Here is a rerolled patch that does the following:
- Reverts the favicon.ico rules, because they now are handled by these rules
- Adds some additional extensions for common 404s (mostly exploit-bots)
- Changes the 'files' rule so it does not need to be at the start of a line. This is needed so we can catch sites/default/files paths (the default since Drupal 6). This also fixes (AFAICS) the case where private files are enabled.
- Adds a help text and a validation rule to the file system settings to ensure the file system path contains the word 'files' somewhere (as suggested by moshe in #82).
- Removes handling for htm/html files by default, because migrating from a static html site is a pretty common use case. I have added comments and sample lines for people who know they don't/won't use these aliases, but by default you don't have to do this (and you still save a large number of CPU cycles).
- Makes this enabled by default. I don't think it's worth including this if it is going to be disabled by default, especially if it means uncommenting 2 far apart sections of code in a hard for newbies-to-grok file. If we are not going to enable it by default this could live just as happily in a documentation section for high traffic domains.

Some thoughts:
- This opens up the possibility of a contrib module to save static versions of the error pages as 404.whatever (a bit like boost module does for regular pages), or introduce some lightweight way of logging these errors (404-handler.php) with a minimal change to the rewrite rules.

this could live just as happily in a documentation section for high traffic domains.

+1, though leaving it in as a comment will make it far easier to enable - just tell them to remove the comment on the documentation page.

Again: This change will speed things up for some people and break the sites of everyone using private file downloads, since you're not even using files/ or system/ exceptions, as well as anyone generating dynamic images with extensions, which can't be fixed with any additional special exception. Is it too much to ask not to break people's site by default?

@Arancaytar - it has been stated multiple times in the thread that this patch does not break imagecache or any similar modules and has been used successfully on large production sites that also use imagecache. If you read my post you would also note that this should now work with private file downloads. If you would like to find something that it does break, then please by all means try the patch and report back, but repeating "this breaks imagecache" over and over is neither accurate nor helpful :)

Status:Needs review» Needs work

The last submitted patch failed testing.

It occurred to me that this would be a great candidate for some tests (which would help settle the what this does/does not break debates!) - any takers?

StatusFileSize
new2.29 KB
Failed: Failed to apply patch.
[ View ]

The issue I have with this patch is treats 404s for files differently than 404s for something in Drupal's menu system.

IMO these should be treated the same. The attached patch is how we handle on MothersClick.com we a single configurable file that both Apache and Drupal can use.

While not core worthy, hopefully this sparks proper direction for this patch.

What a cursed issue we have here.

Patch in #102 should not throw a form validation error. Instead, just use a note for the form element description, so site admins who need to use a path without "files" do not have to hack core.

Patch #107 looks interesting. Just an idea: Could we use a separate 404.php in the root directory, which performs a minimal bootstrap, checks the configured file directory path for the site in question, and short-circuits to a (optionally configurable) 404 output - or - if Drupal is supposed to handle the request, performs a full bootstrap?

Status:Needs work» Needs review
Issue tags:+Performance
StatusFileSize
new6.01 KB
Failed: 11098 passes, 18 fails, 0 exceptions
[ View ]

I hate that D7 would be released still handling all 404s via an expensive bootstrap. This is a big performance suck.

The attached patch solves this in two ways:

- Adds rules to .htaccess to bypass 404s for certain files.
- Adds a safeguard in settings.php to stop very early if we are handling a static file.

Please please let us have this in D7. We can refine later if needed, but let us ship with something that prevents Drupal from handling static files.

Status:Needs review» Needs work

The last submitted patch failed testing.

You want to write a test module which defines a callback in files, say 'test.gif' and check the return of it -- stuff a random string in a variable in the test and print variable_get in the test module.

Status:Needs work» Needs review

Here is an updated patch, with 3 new assertions to ensure that the web server is passing "fast 404" responses to specific file type requests outside the files directory, and that Drupal can still handle requests within the files directory (system_test module now registers 2 imagecache style handlers for this purpose) for both public and private file serving.

I have also tried to make the comment help as clear as possible, although it is hard to balance clarity against excess verbosity - feedback is welcome.

I also took the settings.php fallback out - I am happy to add this back in if someone can explain why we should. This would only be reached on non-Apache webservers. We generally don't duplicate .htaccess functionality in php (especially for "optional" functionality such as this) - for example the www/no-www redirection could easily be done in settings.php, but is not. I think it is reasonable to request that people set up equivalent functionality in their non-Apache webserver configurations, and/or add code to their settings.php to do the same - we have places in the manual to document these.

StatusFileSize
new8.87 KB
Failed: 11124 passes, 1 fail, 0 exceptions
[ View ]

Status:Needs review» Needs work

The last submitted patch failed testing.

Hmm, I can't reproduce the test fail locally - anyone else care to test?

GET http://default.head/s120547Pa34UmLhrr.ico returned 404 (139 bytes). Browser system.test 613 PageNotFoundTestCase->testPageNotFound()
Valid HTML found on "http://default.head/s120547Pa34UmLhrr.ico" Browser system.test 613 PageNotFoundTestCase->testPageNotFound()
404s for specific (fast 404) file types outside the files directory are returning without Drupal bootstrap. Other system.test 614 PageNotFoundTestCase->testPageNotFound()
GET http://default.head/sites/default/files/simpletest120547/system-test/404-handling.jpg returned 200 (38 bytes). Browser system.test 615 PageNotFoundTestCase->testPageNotFound()
Valid HTML found on "http://default.head/sites/default/files/simpletest120547/system-test/404-handling.jpg" Browser system.test 615 PageNotFoundTestCase->testPageNotFound()
Drupal is handling requests for specific (fast 404) filetypes when within a public files directory. Other system.test 616 PageNotFoundTestCase->testPageNotFound()
GET http://default.head/system/files/system-test/404-handling.jpg returned 200 (38 bytes). Browser system.test 618 PageNotFoundTestCase->testPageNotFound()
Valid HTML found on "http://default.head/system/files/system-test/404-handling.jpg" Browser system.test 618 PageNotFoundTestCase->testPageNotFound()
Drupal is handling requests for specific (fast 404) filetypes when within a private files directory. Other system.test 619 PageNotFoundTestCase->testPageNotFound()

Owen, I do not get this, the rewrite rule changes the request to 404.gif or similar? Or what is the 404.%1 wanted to be? And, why wont the filesmatch grab your request, even in the files dir before any rewrite?

The rewrite rule simply prevents the request from being rewritten and passed to Drupal (if the file does not exist, the file is outside */files/* and has a specified extension). This causes Apache to 404 the request and look up the appropriate ErrorDocument for the request, which has been previously set for files with this extension. The filesmatch does not do anything at all unless Apache looks up an ErrorDocument for the request.

There perhaps could be problems with this if clean URLs are disabled - kbahey, did you check this at all?

I am wondering if this is a good idea. While it might be good for performance, you will never notice that some files are missing if you do nto additionally look into the apache log.

Yeah, honestly, I'm not hugely fond of this functionality - it's pretty hackish, and severly cripple's Drupal's own log functionality (which, over the many years, people have probably come to expect to be a literal equivalent to server logfiles). I don't doubt that heavy 404s can cause performance issues - I just don't feel this is the answer.

I'm wondering if we should just do either extension matching or file directory matching in drupal_not_found() - so log the 404 for site admins, but then return straight away if it's a file rather than a page. That'd save all the page building and rendering for the 404, but still give us logs.

It simply is not Drupal's job to catch all possible requests under its tree and log them. I like this patch a lot. catch's idea still requires a bootstrap which is a big problem. some contrib module can implement an apache log viewer if people need to view these 404's inside of drupal. or make a contrib module which lets you add a simple link to awstats/webalizer the reports block. web hosts could even make their own distro with that link pre-configured.

It simply is not Drupal's job to catch all possible requests under its tree and log them.

It has done so for years and considering everything this behaviour has caused very little problems.

I like this patch a lot. catch's idea still requires a bootstrap which is a big problem. some contrib module can implement an apache log viewer if people need to view these 404's inside of drupal. or make a contrib module which lets you add a simple link to awstats/webalizer the reports block. web hosts could even make their own distro with that link pre-configured.

I think it would be bad user experience to change this behaviour.

Catch's idea sounds interesting, but I am not sure how it could work.

Summary: I'd won't fix this if I were a core committer.

I don't like, at all, how the patch redirects the user to 404.$fileextension. This is *really* damaging - that's the actual entry that will get logged into the server log file, meaning that we are *crippling* a traditional error_log by filling it with garbage. How can I fix my broken links or embeds (etc.)., if Drupal is filling my log with "404.jpg" and "404.html"? This'll corrupt not just the error_log, but also anything that works on error_logs, such as Webalizer and so forth.

To amend my previous comment, here's what the error log looks like with this patch:

[Mon Jun 15 10:52:35 2009] [error] [client 75.67.50.34] File does not exist: /var/www/vhosts/disobey.com/httpdocs/404.jpg
[Mon Jun 15 10:52:37 2009] [error] [client 75.67.50.34] File does not exist: /var/www/vhosts/disobey.com/httpdocs/404.ico, referer: http://www.disobey.com/ahem.jpg

The first entry is the rewritten file - I had requested ahem.jpg. The second entry is a modern browser's standard attempt to get the favicon.ico (which, naturally, we can't really tell here, since it's been rewritten).

Huge -1.

Also has anyone tested this with imagecache?

catch: assuming that the Drupal 'files' directory has been set to 'files', then imagecache would be fine. If the 'files' directory was named something not 'files', then imagecache, private downloads, and other dynamic generations would fail. They could be made accessible again only if the .htaccess and Drupal source code was modified (fun, right?).

Should've read the patch again, thanks though Morbus.

@Morbus - could you recommend an approach then? Right now, a site can nearly crumble if a few bad invalid images get posted to the home page (for example)

Just a note I think we could do my approach via #444756: Add a page_not_found() hook to Drupal with a few modifications.

@moshe: I don't have a suggestion you'd like, no - I don't think it can be cleanly fixed at the .htaccess level (i.e., meaning we'd need to do /some/ sort of bootstrap; ala @catch's point). However, I also disagree entirely with your assertion that "a site can nearly crumble with a few invalid images" - that's ridiculous: as someone who has been running the same site for more than 10 years, and gets dozens of 404s every minute, (historical, rootkitish, and accidental, etc.) it's a bit FUDdish. To think that Drupal could get to where it is now, with such a easy 404 DoS, is inane.

I will agree, however, that certain shared hosts can have such a problem with a popular site and sliced resources. I just don't think, however, it's so incredibly widespread that Drupal sites everywhere are being crippled because of it. Do I think something should be done about it? Sure, I have no problem with that. Do I think it's critical? Nope. Do I think it's critical enough to a) strip an existing expectation from Drupal (404s in the reports) and b) break the equivalent functionality at the webserver/log level? Most definitely not.

Also, putting the notice about the missing images into the watchdog will allow site maintainers to create the missign images or to remove the broken node/template/whatever.

This issue is pure sillyness.

@catch - in this issue, we want to handle 404 without a bootstrap. really we want to handle a 404 without any php at all but definately no bootstrap. a 404 hook completely the wrong direction IMO. a 404 hook for the files directory might be fine and can be pursued in that issue.

Also we haven't been doing a full bootstrap for favicon.ico 404s since 28th June 2008, so this patch only fixes image files which aren't favicon.ico and aren't in sites/*/files. I think with something in drupal_not_found() we could skip rendering the page for files in sites/*/files as well.

>skip rendering the page
If page caching is on then the response will be from the page cache anyway, so there will only be a full bootstrap/rendering of the page if the request hasn't already been cached.

cool, so I need to check if favicon.ico exists myself? What a truly great idea. I'd expect something like this from WP or Joomla, but not from Drupal...

Comment 127 is not correct - the file directory path does not need to be set to "files", but simply needs to include the string "files" somewhere within it. There is a validation on the setting in the patch.

I agree that the current patch breaks Apache logging (404.ext) in an unacceptable way and is not committable. Possibly there is a way to fix this - I tried briefly to no avail. I think I am also leaning towards a php based solution, ideally with optional Drupal logging (and hence DB bootstrap). Either way, I DO think we need an improvement here - I have seen several site outages caused by 404s when assets on a single browser page view creates many Drupal hits, or the page has repeating AJAX callbacks that are 404ing.

Either way, I think 2 of the 3 tests I created are still worthwhile, since they ensure the imagecache style facility is functional (which I think could pick up some subtle breakages in the file path/url functions).

Owen - the RewriteCond specifically looks for "/files/" (unbounded), but the validation code only looks for "files". This would allow me to use "drupal_files" inside Drupal, and pass the validation, but lose the "fast" 404s.

I actually like to translate the default files path to use "dateien"...

If your system is prone to meltdown because of a few missing files I tend to think that this is not a Drupal problem per se.

You should evaluate your hardware and possibly buy more.

You also should test your site on a staging system to catch such 404s in advance.

@Morbus - correct, this was a bug - however the symptoms would be broken imagecache, rather than lost fast 404s.

Attached patch fixes this, and also fixes the issue with the tests failing when mod_rewrite is not enabled. It doesn't fix the Apache 404.ext logging issue however.

StatusFileSize
new8.97 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 76824-4.patch.
[ View ]

Sigh

Well, 5 broken images on your home page can be a nearly a 5x multiplier on your traffic. It leads to 5x on the number of bootstraps that your web and DB servers have to service.

Excellent reason to just fix the stupid images. Thanks to Drupal (as it is now) you'll at least be alerted to the problem.

You won't be alerted to the problem if the site is down due to 5x traffic

You can't solve the issue if your error log shows 5 entries for 404.jpg.

(See, we can do snarky replies too!) ;)

Snarky replies aside, I think everyone agrees that logging 404.ext for missing files is not acceptable.

What we need is either an alternate .htaccess approach (that logs 404s correctly, and works acceptably both with mod_rewrite enabled and disabled) - OR a PHP based approach that is sufficiently faster than a cached bootstrap to make it worth doing.

FWIW, I do this on my sites:

<?php
/**
* Generates a 404 error if the request can not be handled.
*/
function drupal_not_found() {
 
drupal_set_header('HTTP/1.1 404 Not Found');
 
watchdog('page not found', check_plain($_GET['q']), NULL, WATCHDOG_WARNING);
  global
$conf;
  include_once
$conf['404_file'];
  exit;
}
?>

Than I usually set the path to sites/whatever/theme/whatever/404.php and include my logic there.

This is in addition to setting the htaccess to point to this same 404 file as well (and having similar htaccess work).

Status:Needs work» Needs review
StatusFileSize
new4.12 KB
PASSED: [[SimpleTest]]: [MySQL] 19,100 pass(es).
[ View ]

OK, let's try a different approach. I think this one could even make everyone happy!

I think we have demonstrated that a .htaccess based approach is too problematic. So this is a simple PHP approach. It allows you to do send a lightweight 404 for matching URLs either:

a) just before the normal 404 would otherwise be sent (the default). This has minimal side effects (actual pages that happen to match still get served as normal, and 404s are logged correctly in both the Apache and system log). Of course, you still have a pretty good chunk of processing needed to reach this point (we manage to skip most of the theming work though), but users still benefit from bandwidth savings, and this should be safe enough to enable by default.

b) Alternatively (and not the default!) you can return the 404 directly during the configuration bootstrap. This has huge benefits for server load, but significant side effects (identical to the htaccess ones we saw) - even valid pages that happen to match the pattern will return 404s, rather than the actual page. It will also prevent the log entry. This could still be perfectly acceptable for large/active sites with well understood URLs looking to get an edge off their server load.

There are a few things I would like feedback on:
- Does this general approach work for folks? If there are fundamental objections please speak up.
- Naming - is "lightweight 404s" grokable? I didn't want to use "fast" since it is really not a super performance bonus by default. Should we say "not_found" instead of 404?
- Regular expressions in settings.php - this strikes me as a bit risky/unfriendly. drupal_match_path() could work in a pinch, but needs case lowering, and would mean we couldn't use the jp?eg style shortcuts. Other ideas?
- Calling drupal_lightweight_404() directly in setting.php (optionally, of course) is also kind of a new concept. Should we make this a $conf variable (lightweight_404_fastpath, perhaps), and check it in bootstrap.inc somewhere?
- Are there any clean up tasks (hook_exit etc) we really need to do in either case (a) or (b)?

StatusFileSize
new4.15 KB
PASSED: [[SimpleTest]]: [MySQL] 19,101 pass(es).
[ View ]

Owen

I like this. It is totally configurable and off by default. So no one should complain about it.

My modification is attached. The most important modifications are:

1. In drupal_fast_404(), you were missing this before the print statement:

header('HTTP/1.0 404 Not Found');

2. Added a pattern for cached css and js where we have a ? and then one letter after it.

I did some bikeshedding, please feel free to ignore. It is renaming "fast" instead of "lightweight". Shorter to type and remember. This should NOT derail the patch. I am happy to revert.

Please let us get this in in Drupal 7. It is way over due.

Can we get a review and an RTBC?

Good catch - I tweaked it to add the header with drupal_add_http_header() rather than header() and also changed the order of operations in drupal_deliver_html_page() which avoids a duplicate header (or now function call).

I also removed the css?. and js?. patterns - we are checking against $_GET['q'], which already excludes the querystring, so these aren't needed.

No objections to the name change.

StatusFileSize
new4.48 KB
PASSED: [[SimpleTest]]: [MySQL] 19,111 pass(es).
[ View ]

+++ sites/default/default.settings.php 9 Apr 2010 16:35:38 -0000
@@ -393,6 +393,35 @@ ini_set('session.cookie_lifetime', 20000
+ * By default, fast 404s is returned as part of the normal page request
+ * process, will serve valid pages that happen to match, and will also log any
+ * 404s in the Drupal log. Alternatively you can choose to return a 404 now by

I think some sentences got mashed together improperly here. Otherwise, is RTBC. Nice work.

StatusFileSize
new4.48 KB
PASSED: [[SimpleTest]]: [MySQL] 19,107 pass(es).
[ View ]

How about this?

StatusFileSize
new4.9 KB
PASSED: [[SimpleTest]]: [MySQL] 19,095 pass(es).
[ View ]

OK, that didn't parse either...try this

Status:Needs review» Reviewed & tested by the community

OK, lets give this poor issue a chance to shine.

Status:Reviewed & tested by the community» Needs work

+++ includes/bootstrap.inc 9 Apr 2010 20:54:00 -0000
@@ -2045,6 +2045,24 @@ function drupal_maintenance_theme() {
/**
+ * If fast 404 pages have been configured, and this is a matching page
+ * then return a simple 404 page.

This needs to be a proper function summary, stating what this function does. The current summary is rather part of the description.

"Returns a simple 404 Not Found page."

+++ sites/default/default.settings.php 9 Apr 2010 20:54:00 -0000
@@ -393,6 +393,36 @@ ini_set('session.cookie_lifetime', 20000
+ * Drupal can generate fully themed 404 pages. However, some of the time these
+ * responses are for image or other resource files that are never displayed
+ * to the end user, or from bots that will do nothing with the resulting page.
+ * This can waste bandwidth, and also generate server load.

"However, some of these responses are for images or other resources that are not displayed to the user. This can waste bandwidth and generate unnecessary server load."

+++ sites/default/default.settings.php 9 Apr 2010 20:54:00 -0000
@@ -393,6 +393,36 @@ ini_set('session.cookie_lifetime', 20000
+ * The options below return a simple, fast 404 page for URLs matching a
+ * specific pattern.
+ *
+ * - 404_fast_paths is a regular expression matching paths to return the
+ *   fast 404 on. If you don't have any aliases ending in htm or html you
+ *   can add '|s?html?' to the expression.
+ * - 404_fast_html is the html to return.

The preceding sentence should end with a colon (:), and the list should directly follow without blank line.

The array keys should be delimited with a colon, followed by a description, e.g.

- 404_fast_paths: A regular expression to match paths to return a simple 404 page on. [...]

+++ sites/default/default.settings.php 9 Apr 2010 20:54:00 -0000
@@ -393,6 +393,36 @@ ini_set('session.cookie_lifetime', 20000
+ * ¶

Trailing white-space here.

+++ sites/default/default.settings.php 9 Apr 2010 20:54:00 -0000
@@ -393,6 +393,36 @@ ini_set('session.cookie_lifetime', 20000
+/*

1) Should be /**

2) Missing blank line before this docblock.

116 critical left. Go review some!

Status:Needs work» Needs review
StatusFileSize
new4.61 KB
PASSED: [[SimpleTest]]: [MySQL] 19,107 pass(es).
[ View ]

Thanks for the review, Sun. This should resolve all these issues.

Status:Needs review» Reviewed & tested by the community

Thanks!

Webchiiiiiiiiiiiiick! Driiiiiiiiiiies!

Version:7.x-dev» 8.x-dev
Category:task» feature

This looks like a new feature, and should go into Drupal 8 at this point.

Dries I disagree that it is a new feature.

This is a performance improvement. We bootstrap the whole of Drupal when the request is a static file 404.

This patch does not change any behavior and is off by default, enabling it reduces resource waste when a lot of 404s for static files are happening.

Category:feature» task

I agree that this would be good to do for D7.

However,

This patch does not change any behavior and is off by default

is not entirely true, due to this change:

+++ includes/bootstrap.inc 9 Apr 2010 22:38:30 -0000
@@ -2045,6 +2045,26 @@ function drupal_maintenance_theme() {
+function drupal_fast_404() {
+  $fast_paths = variable_get('404_fast_paths', FALSE);
+++ includes/common.inc 9 Apr 2010 22:38:31 -0000
@@ -2329,11 +2329,14 @@ function drupal_deliver_html_page($page_
+        // Check for and return a fast 404 page if configured.
+        drupal_fast_404();
+++ sites/default/default.settings.php 9 Apr 2010 22:38:31 -0000
@@ -393,6 +393,36 @@ ini_set('session.cookie_lifetime', 20000
+$conf['404_fast_paths'] = '/.(txt|png|gif|jpe?g|css|js|ico|swf|flv|cgi|bat|pl|dll|exe|asp)$/i';

Since the pattern is always defined, you will get a fast 404 page instead of a full one, if the pattern matches.

Can we remove this change in behavior, for example, by introducing another variable to toggle the entire functionality?

113 critical left. Go review some!

Version:8.x-dev» 7.x-dev
StatusFileSize
new4.64 KB
PASSED: [[SimpleTest]]: [MySQL] 19,168 pass(es).
[ View ]

Fair point sun. Thanks for the catch.

Here is a version of the patch that is completely off by default. The user needs to uncomment stuff in settings.php if they want that on.

Dries, it REALLY does not change any behavior at all, and fixes a performance bug.

I'm not ready to commit this, and want to discuss this some more.

For example, I'm not a fan of the list of extensions. For site visitors, it doesn't make much sense to get a different 404 depending on file extension.

Maybe we should go back to the Drupal 6 solution where 404 pages where fast (because blocks where omitted)? We should try to make all 404s fast and avoid more complex solutions.

The 404s that are emitted because of static files are often invisible. For example, the site owner changes the name of a small PNG from CSS, or misspells it. The page renders normally for the end user, but this PNG (and perhaps others) triggers 404s that the user will never see. Despite these not being seen by the user, they do a full Drupal bootstrap with all the overhead. This solution avoids this unnecessary, yet invisible, overhead.

We are removing known overhead for known file types. The rest of 404s require a Drupal bootstrap to check the menu, ...etc. So we can't do much about them until we reach that stage. But why go that deep into Drupal only to send a 404 that the user will never see?

Even if the user sees the error, it is the site admin's decision to show them a full or brief error message. We just give them the means to do so, not force them to.

On all high traffic sites that 2bits.com manage, we have a variant of this in settings.php just to avoid this overhead. We look at the extension and emit a 404 and then exit right in settings.php.

As for the block-less 404, I personally don't like it since it looks inconsistent and odd, and only removes one overhead source (blocks) not the full bootstrap.

Drupal 6 only omitted the left and right sidebars, not all blocks, and many sites don't use block module at all, likely to be less in D7 now it's optional.

Also, omitting blocks isn't much of an optimization depending on the site, say 10/20% of total page generation. Skipping rendering altogether is a good 50% or more on D7 since so much happens in drupal_render_page(), the two can't really be compared.

Additionally the D6 optimization often meant that search boxes or the navigation menu were omitted on 404s . Quick search shows a tonne of issues complaining about that optimization, following list isn't exhaustive. It's also not a simpler solution - requires messing about with the rendering process, and there are still cleanup patches outstanding for it:
#221399: Allow administrator to decide if blocks should be displayed for site_404 pages
#423992: Remove show_blocks page variable
#141162: 403/404 errors should allow blocks
#423696: Allow display of menu links on 404 error page

On the patch here, I fully agree with Khalid that most 404s for files are invisible - on one site I run which has been going since 2003, 2005 on Drupal, the majority of 404s for images are very old ones which were embedded in pages on external sites - unless you look in the net tab on firebug you might not even realise those images were ever supposed to be displayed. Going to all the effort of a full bootstrap for those is a bit much.

Even if someone visits your site directly and gets a 404 for an image, they're very unlikely to see the full page render because they'll be viewing a page with that image embedded too. The only case where you'll actually see the 404 for a file, is if it's linked to directly somewhere, both within sites (which tend not to link to dead files so much) and external links, neither are really the common case - and even then a lot of those hits will be from crawlers.

However, I'd like to see this patch remove the special handling for favicon.ico in .htaccess, or at least disable it by default. Seems like that was a stopgap which mainly went in because this issue hadn't. I'm also not too sure about having the function defined twice - are we 100% sure that's never going to cause issues with opcode caching etc?

StatusFileSize
new5.31 KB
PASSED: [[SimpleTest]]: [MySQL] 19,157 pass(es).
[ View ]

Catch, thanks for the review.

The attached patch removes the special handling for favicon in .htaccess. Yes, that was a special stopgap and is no longer needed with this new approach.

I can't find anywhere that the function is defined twice. It is defined in bootstrap.inc once only.

$ grep -r drupal_fast . | grep -v patch
./sites/default/default.settings.php:# drupal_fast_404();
./includes/common.inc:        drupal_fast_404();
./includes/bootstrap.inc:function drupal_fast_404() {

That'll teach me to review patches on Sunday night, obviously it's not defined twice, just called in two different places.

Next question is what happens if you use image derivatives (or whatever imagecache is called now) with the settings.php version? I'm not up to speed with how it handles requests in D7 now, but if that's going to be affected, I think it's worth adding a particular note either way there that it's incompatible - given that's part of core now. Also if it /is/ incompatible, given that additional optimization is only in settings.php so doesn't require a core hack, it could be taken out without making the overall patch any less useful.

From an end user point of view, I actually like the fact that all pages do a full bootstrap because all 404s create an entry in my watchdog -- that is very convenient and allows me to actually fix 404s. I use that all the time.

@Dries - sure, that watchdog is very handy for some sites. Those sites will do nothing and still get the watchdog.

This new line in settings.php is for sites that can't afford a full bootstrap because of scaling constraints. This is a valuable weapon for those sysadmins. As you are hearing here, we've all been doing this in our own hackish way in settings.php for *years* and now we are trying to share that with the world. If this doesn't get committed, then only those in the know get the weapon. Owen, kbahey, catch and myself, don't need this patch. We're doing it to benefit everyone else.

I'm pretty sure that the sysadmins for Drupal Gardens and Buzzr would want this tool in their belt.

@catch, I may be mistaken, but this is what I could find from a quick search on image derivatives:

<?php
$success
= file_exists($destination) || image_style_create_derivative($style, $path, $destination);
?>

So it seems to check the physical file on the file system, not rely on a 404. So that part does not need further action for this patch. If someone finds otherwise, I am happy to add a comment.

@Dries, the 404s in the watchdog are mostly noise on big sites. Old links from other sites, old elements from themes that have been changed for a while, ...etc. It is rarely about real full pages. In that case, the user still has the option to skip the new fast 404 as Moshe said.

@Dries,

Here is what I use for several high traffic site, in settings.php.

So this patch comes out of real needs for real sites.

<?php
$error_404_header
= 'HTTP/1.0 404 Not Found';
$error_404_text = '<html><head><title>404 Not Found</title></head><body><h1>Not Found</h1><p>The requested URL was not found on this server.</p></body></html>';
if (
$_SERVER['QUERY_STRING'] != 'rss.xml' || preg_match("/\.(txt|png|gif|jpe?g|shtml?|css|js|ico|swf|flv|cgi|bat|pl|dll|exe|asp|xml)$/", $_SERVER['QUERY_STRING'])) {
 
header($error_404_header);
  print
$error_404_text;
  exit();
}
?>

@Dries, there's two optimizations here, one is to avoid the full bootstrap altogether, the other is to not do a full page render for those paths. With the latter, you still get a very large performance gain for those particular file extensions (at least 50%), but you also get the watchdog message. There's currently no way to do that without a core hack at the moment.

Here is a table describing the status-quo and the 2 options added by this patch:

Configuration A- disabled Configuration B - called from drupal_deliver_html_page() Configuration C - called in settings.php
Heavy 404 page generation cost Smaller 404 page generation cost (~50% of A?) Minute 404 page generation cost (< fastpath cache)
Slow 404 page response time (e.g. 1-10 seconds) Faster 404 page response time (e.g. 0.5-5 seconds) Very fast 404 page response time (e.g. < 0.2 seconds)
Large 404 page size (2,000-30,000+ bytes, depending on site) Tiny 404 page size (~200 bytes) Tiny 404 page size (~200 bytes)
Inlined assets (images, JS, CSS) 404 as normal Inlined assets (images, JS, CSS) 404 as normal Inlined assets (images, JS, CSS) 404 as normal
Directly requested assets show fully themed 404 page Directly requested assets show simple 404 page Directly requested assets show simple 404 page
All 404s logged in watchdog as normal All 404s logged in watchdog as normal "Fast" 404s not logged in watchdog
All 404s logged in Apache logs as normal All 404s logged in Apache logs as normal All 404s logged in Apache logs as normal
Core image api works normally Core image api works normally Core image api works normally
404 based lazy loaders (e.g. D6 imagecache) work normally 404 based lazy loaders (e.g. D6 imagecache) work normally 404 based lazy loaders (e.g. D6 imagecache) don't work without tweaking the regexp
Pages with aliases that match regex load as normal Pages with aliases that match regex load as normal Pages with aliases that match regex 404 instead of load

From a scalability point of view 404s can be a huge problem for sites with old domain names that have gone through multiple re-structurings over the years. The 404 traffic simply from hotlinked files alone can be a significant draw on resources. The current setup is also a huge risk for active sites - if you add a 404 to a busy page, or perhaps accidentally move, chmod or otherwise make the files directory inaccessible, server load instantly skyrockets and can render the server unresponsive.

From an end user point of view this is also a major issue. Browsers will patiently sit around with the loading spinner going waiting for Drupal to fully bootstrap, generate a beautiful, fully themed page to represent an image or css file, that the browser then completely ignores as soon as it notes the 404 http code. Not only does this mean that we end up waiting for Drupal to bootstrap twice (or more!), but the response is at least a factor of ten larger than it needs to be. In some cases (e.g. whitehouse.gov or amnesty.org) it is over 100 times larger than necessary. Multiply this by all the 404s on all the Drupal 7 sites, and the end result is a lot of end user wait time, and a lot of bandwidth and cycles wasted.

Currently the default is configuration A - while this is safe and conservative, it has no benefit - most sites and users would be left with the same issues we have now (and a re-broken default favicon 404 to boot). I would strongly argue that if you consider the table above it is clear that configuration B is actually a perfectly reasonable and safe default for the overwhelming majority of sites.

StatusFileSize
new5.35 KB
FAILED: [[SimpleTest]]: [MySQL] Drupal installation failed.
[ View ]

Here is an updated patch with configuration B as the default, which I think is the best choice by far. If the only way this can get committed is disabled by default then I think we need to go with the patch in 163, because we can't really remove the favicon.ico workaround if the fix is disabled.

The whole idea for the earlier version (#167) is reduce resource usage and improve performance and scalability. The patch in #175 is a step back. Although it is better than the status quo, it is still too late in the boot process and consumes a lot of unnecessary resources.

I would like us to seriously consider #167 for commit.

@kbahey - I think you have misread my patch - #167 is more conservative that #175 - the former is completely off by default, and identical to the status quo in terms of resource usage and scalability. #175 cuts out a significant chunk of process and bandwidth by default on all sites, yet retains almost identical behaviour and logging. If an admin decides they need more savings they can comment out the exact same settings.php line you have in #167 and return a 404, skipping the rest of the bootstrap.

In addition #167 means that favicon.ico is again broken by default, and will be creating a full bootstrap on every page (even cached pages) just to generate the 404 response to favicon.ico, which is completely unacceptable IMHO and why I suggested #163 as a fallback.

I prefer having configuration B as described in #174 as the default, so I endorse the patch in #175.

If configuration A must be what is commited, then I agree that #163 is preferable to #167.

I would have set the status to RTBC, but it's already the case.

@Owen Barton

My bad. I understand it now. It is better than the previous iterations, since it still preserves the watchdog. The early bailout functionality is optional and off by default and that is what I am looking for.

So +1 from me.

Would like to hear from moshe or catch on approach B. Seem like it is the only option given some of the regressions with option C. Great table Owen!

I'd be thrilled to have B as default. #175 gets +1 from me as well.

B sounds great to me as well, don't think we can get away with C as a default.

@kbahey / #172: I tried copying and pasting verbatim at the end of my settings.php (without the open/close PHP tags), but Drupal wouldn't load after doing so... is there something else you did to get the above code to work for 404s on files? [Edit: just noticed, there's no logic to check if we're at a 404 status here... thus all files will be served as 404]

My site gets about 500-1000 404s per minute for hotlinked images which have gone missing now, and this causes my D6.x installation to consume a constant background CPU usage of about 1 (for a 4-core server).

Also, subscribe.

I emailed you the code via the d.o contact form so as not to create further noise in this issue (and dries can commit it ...)

little bump

ddorian, that's really not necessary. Please don't do that.

#175: 76824-14.patch queued for re-testing.

+++ sites/default/default.settings.php 12 Apr 2010 17:54:10 -0000
@@ -393,6 +393,38 @@ ini_set('session.cookie_lifetime', 20000
+$conf['404_fast_html'] = '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>404 Not Found</title></head><body><h1>Not Found</h1><p>The requested URL /' . $_GET['q'] . ' was not found on this server.</p></body></html>';

Is $_GET['q'] set in D7 when settings.php is included?

85 critical left. Go review some!

StatusFileSize
new6.02 KB
FAILED: [[SimpleTest]]: [MySQL] 20,104 pass(es), 1 fail(s), and 0 exception(es).
[ View ]

Actually, yes, it works at this point. Just add a print there and you will see it.

I put a better path using base_path() and $_GET['q'] in common.inc. base_path() cannot be used in settings.php though, so just enclosed it in double quotes.

Also, I did some benchmarks on this last patch, and here are the results:

Using apachebench, and visting example.com/none (which is a 404), with 5 users, 1000 requests

1. Without patch
Requests per second: 49.71

2. With patch, default settings
Requests per second: 51.11

3. With patch, with function uncommented in settings.php
Requests per second: 51.15

When visiting example.com/none.jpg (which a 404 that triggers a fast 404), 5 users, 1000 requests

With patch, with function uncommented in settings.php
Requests per second: 352.31

Orders of magnitude faster, now with numbers ...

I wonder if the $_GET['q'] needs a check_plain - I think it might.

Status:Reviewed & tested by the community» Needs work

+++ includes/bootstrap.inc 1 May 2010 03:06:16 -0000
@@ -2063,6 +2063,26 @@ function drupal_maintenance_theme() {
+    print variable_get('404_fast_html', '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>404 Not Found</title></head><body><h1>Not Found</h1><p>The requested URL /' . $_GET['q'] . ' was not found on this server.</p></body></html>');
+++ sites/default/default.settings.php 1 May 2010 03:06:21 -0000
@@ -393,6 +393,38 @@
+$conf['404_fast_html'] = '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>404 Not Found</title></head><body><h1>Not Found</h1><p>The requested URL "' . $_GET['q'] . '" was not found on this server.</p></body></html>';

So effectively two issues:

1) check_plain().

2) The actual path should really be injected via strtr() during output, and not during variable assignment.

Powered by Dreditor.

Status:Needs work» Needs review
StatusFileSize
new6.1 KB
FAILED: [[SimpleTest]]: [MySQL] 20,158 pass(es), 1 fail(s), and 0 exception(es).
[ View ]

Both issues fixed now. The variable has a @path token that is replaced via strtr(), and also check_plain() is used.

Status:Needs review» Needs work

you can't run t() on dynamic strings

Status:Needs work» Needs review
StatusFileSize
new5.93 KB
FAILED: [[SimpleTest]]: [MySQL] 20,148 pass(es), 1 fail(s), and 0 exception(es).
[ View ]

This fixes the t() call, and also replaces the $_GET['q'] with request_uri() (tested in all 3 usages), which I think is exactly what we need, since we want to show everything the after the domain part of the URI, and $_GET['q'] doesn't have the base path, and also can get messed with by language prefixes and similar systems.

Status:Needs review» Needs work

The last submitted patch, 76824-17.patch, failed testing.

Status:Needs work» Needs review
StatusFileSize
new6.18 KB
FAILED: [[SimpleTest]]: [MySQL] 20,258 pass(es), 1 fail(s), and 0 exception(es).
[ View ]

I don't understand why this failure is happening after t() was fixed.

Rerolling for offsets against current HEAD.

Status:Needs review» Needs work

The last submitted patch, 76824-18.patch, failed testing.

The test appears flawed. It assumes that a language is always available but this is very early in the bootstrap so it is OK to call t() [this is new in d7], but only custom string translation will be performed.

The flaws of automated testing ...

I can't fix the test to detect early and normal though ...

Moshe, would you consider using strtr() here appropriate?

Just FYI, using this code (for me) has caused some problems with ImageCache; instead of IC getting wind of an attempt to access a .jpg, .png, etc. in one of the ImageCache directories, then creating the image, Drupal throws a 404 before ImageCache is called... something to watch out for.

Perhaps having a check on the path where imagecache stores the cached images will do the trick.

If the path of image requested matches a parttern ( say "cache") then don't do anything else send 404.

@geerlingguy and @ajayg - please see comment 174 this patch will not break imagecache type modules by default, and will not break the image generation ("imagecache in core") in Drupal 7 even when the optional settings.php "fast return" is enabled. If you use Khalid's code on a Drupal 6 site then it will break Drupal 6 imagecache if you don't add extra conditions. However, please do not discuss imagecache or Drupal 6 in this issue - this causes clutter and distraction.

subscribe

Status:Needs work» Needs review
Issue tags:-Performance

#196: 76824-18.patch queued for re-testing.

Status:Needs review» Needs work
Issue tags:+Performance

The last submitted patch, 76824-18.patch, failed testing.

Status:Needs work» Needs review
StatusFileSize
new6.03 KB
PASSED: [[SimpleTest]]: [MySQL] 20,688 pass(es).
[ View ]

The test fails because this hunk adds the 404 uri in the page

+++ includes/common.inc 1 May 2010 16:24:47 -0000
@@ -2362,7 +2365,7 @@ function drupal_deliver_html_page($page_
-          $return = t('The requested page could not be found.');
+          $return = t('The requested page "@path" could not be found.', array('@path' => request_uri()));

which will contain the $langcode, something that is not expected by locale.test. It appears to me that this test in locale.test (lin 263-265) is a hackish way of checking for a 404 anyways, so I've fixed that.

Status:Needs review» Reviewed & tested by the community

Gosh, I thought this was in already. Thanks scor.

StatusFileSize
new4.64 KB
PASSED: [[SimpleTest]]: [MySQL] 22,462 pass(es).
[ View ]

Rerolling for patch fuzz...

#208: 76824-fast_404_208_0.patch queued for re-testing.

Priority:Normal» Major

It passed again. Pretty please commit please.

Is there any intermediate 6.x solution, while we all wait to move to d7?

I currently hacked common.inc's drupal_not_found() to directly load a custom static 404.html with some extra htaccess rules to handle files 404's.

@morningtime - Here's code for Drupal 6 only (paste it at the bottom of settings.php):

<?php
/**
* 404 Handling, to conserve server resources upon missing image/text/non-html files.
* See: http://drupal.org/node/76824#comment-2834536
* Note: JPG, PNG, GIF don't work with ImageCache, so can only do non-image files here...
*/
if (preg_match("/\.(txt|shtml?|css|js|ico|swf|flv|cgi|bat|pl|dll|exe|asp)$/", $_SERVER['QUERY_STRING'])) {
header('HTTP/1.0 404 Not Found');
print
'<html><head><title>404 Not Found</title></head><body><h1>404 Not Found</h1><p>The requested URL was not found on this server. If you think you reached this page in error, please visit the <a href="http://archstl.org/">Archdiocese of Saint Louis home page</a> and search for the page or file for which you are looking.</p></body></html>';
exit();
}
?>

@morningtime
Don't hack common.inc. Kittens die horrible deaths. There is an easier and more maintainable solution, with only a change in settings.php.

@greelingguy
If you use imagecache, then you have to exclude its paths from being handled.

Here is the code I use for most sites: Reducing server resource utilization on busy sites by implementing fast 404s in Drupal.

StatusFileSize
new6.03 KB
PASSED: [[SimpleTest]]: [MySQL] 22,749 pass(es).
[ View ]

the default.settings.php hunk is missing from #208, rerolling.

@kbahey
Thanks! I learned a lot from your DrupalCon video!

Issue tags:+Performance
StatusFileSize
new6.14 KB

I did some benchmarks to show the effect of the current slow inline 404s on browser display latency for end users. These were done using http://www.webpagetest.org/ testing the same site (in different configurations, 404s introduced through a node body promoted to home page) using default settings, 10 test runs each time.

First View Summary (seconds)
  No patch Patch Patch (fast 404 enabled)
No 404 2.591
Image 404 2.944 2.881 2.525
CSS 404 3.571 2.890 2.501
Percentage Improvement
  Patch Patch (fast 404 enabled)
Image 404 2% 14%
CSS 404 19% 30%
Absolute Improvement (seconds)
  Patch Patch (fast 404 enabled)
Image 404 0.063 0.419
CSS 404 0.681 1.07
Repeat View Summary (seconds)
  No patch Patch Patch (fast 404 enabled)
No 404 0.886
Image 404 1.716 1.488 1.011
CSS 404 1.604 1.500 1.287
Percentage Improvement
  Patch Patch (fast 404 enabled)
Image 404 13% 41%
CSS 404 6% 20%
Absolute Improvement (seconds)
  Patch Patch (fast 404 enabled)
Image 404 0.228 0.705
CSS 404 0.104 0.317

See attached file for more detailed benchmarks.

The results are not insignificant - an inline 404 currently slows down the parent page display by anything from 500ms to a full second, depending on the type of 404. Committing the patch will shave something like 100ms-500ms off that time. If site builders enable the "fast 404" return in settings.php the pages display almost as fast as if the 404 wasn't there - anything from 500ms to a full second faster than the current performance. In other words, this patch clearly has enormous benefits for both end user responsiveness, in addition to the server utilisation benefits described in depth above.

Thanks for the benchmarks... very interesting indeed!

This has Dries's fingerprints all over it, so I'll leave it to him to either commit or D8 this.

Subscribe

StatusFileSize
new6.05 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 76824-fast_404_221.patch.
[ View ]

patch didn't apply any longer

Summary is at #174.

Server side benchmarks are in #189 and front-end benchmarks are in #217.

Issue tags:-Performance

#221: 76824-fast_404_221.patch queued for re-testing.

#221: 76824-fast_404_221.patch queued for re-testing.

Version:7.x-dev» 8.x-dev

Although badly needed, this is D8 material according to the rules (I had to learn today). It may be backported at a later point in time (though that's unlikely).

Version:8.x-dev» 7.x-dev

recover from temper tantrum.

This has been RTBC since June. Would love to get this in, or find out what is needed to get to that point! See meta-summary of summaries/benchmarks in #222

Issue tags:-Performance

#221: 76824-fast_404_221.patch queued for re-testing.

Status:Reviewed & tested by the community» Needs work
Issue tags:+Performance

The last submitted patch, 76824-fast_404_221.patch, failed testing.

Status:Needs work» Reviewed & tested by the community
StatusFileSize
new6.08 KB
PASSED: [[SimpleTest]]: [MySQL] 28,851 pass(es).
[ View ]

Reroll. Keeping at RTBC since there are no changes.

Status:Reviewed & tested by the community» Needs work
Issue tags:-Performance

The last submitted patch, 404.diff, failed testing.

Status:Needs work» Needs review
Issue tags:+Performance

#230: 404.diff queued for re-testing.

Status:Needs review» Reviewed & tested by the community

Bot finally recovered from its bender. Back to RTBC.

#230: 404.diff queued for re-testing.

Another one with Dries's fingerprints all over it.

Leaving for The Spikey Haired One to make the call.

Status:Reviewed & tested by the community» Needs work

Shoot, sorry for the interruption here... just noticed this.

Related issue: #177790: Late cache hit can keep cron.php from working in certain circumstances.

Dreamhost regularly kills off running cron processes, creating an artificial 404 on cron.php. Which Drupal then caches oh so disfunctionally.

It would be fantastic if cron.php and update.php and the other php files in the root directory could be added to this list to resolve that longstanding WTF tension between Dreamhost's bizarre process-killing and Drupal's tendency to page-cache the wrong thing.

If that's inappropriate, feel free to go back to RTBC.

@rfay - given that this is my oldest core issue, I'd hate to see it derailed due to one host's sloppiness.

Is there a way that the Dreamhost error can be fixed in contrib or config, or do we have to account for that here?

@agentrickard... I understand completely. I was just mentioning this weird and problematic Drupal behavior that would be fixed by adding 3 words to the patch (I think).

If I read the patch correctly, all it would take to fix a Drupal install with the weird caching-404 problem would be to add cron.php to the variable provided by this patch. So it doesn't even take contrib.

But since cron.php and update.php are clear cases that should *also* not be cached, I just thought it might be nice to get those in.

I understand completely if you just set this *right back* to RTBC. I was just stopping by, and happy to see this happening.

Status:Needs work» Reviewed & tested by the community

I don't think this is the right place to fix the cron.php issue - this issue is about a performance improvement that can optionally be disabled: if it was disabled the bug would reoccur. Also, a few contrib modules add their own .php entry points (for example ad module uses an adserve.php for javascript rotated ads), and a hardcoded "cron.php|update.php" in the variable means that these entry points would still be susceptible to that issue. It sounds to me like a better fix would be to test that the index.php entry point was used before caching a page, set the $conf['cache_inc'] cache.inc default in index.php (or set it to cache-install.inc in cron.php) or some similar approach.

#230: 404.diff queued for re-testing.

I see this makes changes to .htaccess

Will this work with other webservers like nginx, lighttpd and cherokee?

Apache is not the only one running out there.

@Fidelix / #241 - I know nginx doesn't support .htaccess files at all (a different mechanism is used for rewrites and such).

However, we have the web.config file for IIS (I think), and .htaccess for Apache. Those who use other server environments will need to do the work to translate the rewrite rules correctly for their own servers. There are hundreds of different web server projects out there—the htaccess hits the overwhelming majority of use cases for Drupal...

The only change to .htaccess here is the removal of the special casing of favicon.ico.

Version:7.x-dev» 8.x-dev

This did not make it to Drupal 7.x.

Sigh ... moving to 8.x ...

I guess I have to fulfill a vow I made to myself and kill the family cat

With the length of this issue, we might need to consider either pushing this in soon, or opening up a new summary issue for further discussion.

Version:8.x-dev» 7.x-dev

I don't think we should push this to 8.x until HEAD opens up. Edit - a new summary issue might help yeah.

This is a critical (in the layman's sense) performance patch and the RTBC patch works.

If only i'd come across this last night, i would have known what to do with my extra drink tickets and access to Dries. Please?

@catch

Do you think it has a chance to make it into Drupal 7.x?

I don't think so, personally, but if you think it would, then I would be pleasantly surprised.

That cat is doomed ...

Subscribe. Love to see it get in D7.

Subscribing.

Myself and others have created a module which achieves essentially what is discussed here, as well as very early URL checking (either right out of settings.php, or within hook_boot). Right now there is a D6 release and a D7 release is in the works.

http://drupal.org/project/fast_404

#241

Nginx doesn't need anything of this sort. It serves the static files first if they exist. AFAIK this is Apache specific.

Hop in to the Nginx group on g.d.o and check a few configs that do the right thing (tm).

@perusio (#252) this is incorrect - this patch has just the same benefits for Nginx and any other webserver set up so that Drupal is still responsible for serving 404s (which it almost certainly will be if clean URLs are configured). The patch means that Drupal does reduced processing (either massively, or somewhat - depending on the configuration), and also returns a lower bandwidth response. This all happens in the PHP - the webserver has no effect here, other than checking that an actual static file doesn't exist before handing control over to PHP - this behavior is the same on all webservers that support Clean URLs, and is unchanged by this patch.

@Fidelix (#241) - this does actually not modify .htaccess, other than removing a hardcoded check for non-existent favicon.ico that is no longer needed. Earlier patches did use .htaccess to return lightweight 404s, but that turned out to not be a good approach for most sites. As described above this change is completely independent of whatever webserver is used.

Not really if you setup the proper rules in your nginx config. Supposing you setup a static page for handling the 404s.

location ~* ^.+\.(?:jp[e]*g|png|ico)$ {
    try_files $uri /404.html;
}
location ~* ^.+\.(?:css|js)$ {
    try_files $uri =204; # this returns a 204 status: no content
    access_log off;
}

Of course if you relay it to Drupal then the patch applies. But for static files you can be out of harm's way. In the case of images the issue gets complicated because of image cache. If you don't use it than the above config avoids altogether touching the PHP layer.

Also the configs discussed at the g.d.o Nginx group handle the static files properly. The above is just a random example. A proper config is more complex.

which it almost certainly will be if clean URLs are configured

No: almost 'certainly' it isn't. This all happens because you're relaying all 404s to Drupal.

perusio, your nginx conf is the first attempt in the patch in #0 in 2006. Since then it has evoluted.

#255 I beg to differ. This all issue happens because of this:

# Make Drupal handle any 404 errors.
ErrorDocument 404 /index.php

If you remove it then this is patch is completely unnecessary. Now I'm afraid my mod_rewrite chops are quite limited and in that sense I don't know if you really need this for things like imagecache to work properly with Apache.

But I suppose that you can write a rewrite rule targeted to the imagecache URIs and act upon it. Of course the above "catch all" directive handles it out of the box. So no need to instruct people running imagecache for inserting such a rule in their .htaccess.

And it's not "my" config, but rather common shared knowledge between anyone that bothered to lookup on prior art before running drupal with nginx.

If you remove that (without access to server-level host configuration files), Drupal won't function.

Ok. I'm not privy to the inner workings of Apache and how all the container context delegation thing operates really. Which I suspect is what is being adressed here.

In Nginx there's no .htaccess equivalent. To reload a config you send a SIGHUP to the running instance and the config is read into memory where it stays unchanged until you send a new SIGHUP.

Can you elaborate on your comment? I would like to know how things are handled in Apache so that the ErrorDocument for 404 is needed.

Thanks

Drupal (and WordPress and Joomla) are giant 404 handlers. That is, they process any URL request that the webserver cannot. All Drupal requests are actually routed through example.com/index.php?q=path/of/request.

On shared hosts, most people don't have access to Apache configuration, so those rules have to be set in .htaccess. The side effect, of course, is this issue, which has been around for years.

For people with admin control over their hosting environment, be it Apache, Nginx or what have you, this can be remedied quite easily. For others -- including the non-technical user -- we need to provide sane defaults for them rather than the catch-all method of routing _all_ 404s through Drupal.

@agentrickard - could you document that procedure (at least for Apache) here, or maybe add a documentation page to the performance section of the handbook, for everyone's benefit? It'd be nice to be able to have a wiki page where we can point people to for a brief overview of this huge issue...

Issue tags:-Performance

Oops. Browser snafu made me cancel the submission but the form hand been already submitted. Sorry :(

See comment below.

Issue tags:+Performance

@agentrickard I already knew that Drupal et al are "giant 404 handlers". That's not the point I was trying to get at. Instead I was under the impression that Apache context delegation thing doesn't allow to route around the issue of setting Drupal as the single 404 handler. So let me recap, this is my take on this issue, you and others feel free to chime in.

  1. There's an issue with Drupal in that static files that are a subrequest of a page request are not served directly, in the sense that if they're not found the default 404 handler is index.php so you always perform a full bootstrap in order to signal the 404. That's expensive, hence the appropriate term of expensive 404s coined by kbahey, AFAIK.
  2. Back in 2006 when the awareness of this issue was first raised there were attempts to route around the issue by placing new directives in the .htaccess file so that static files would have a static 404 handler instead of index.php.
  3. Due to the complexities and openness of the Drupal way, there isn't any way to have an .htaccess file that can possibly contemplate all hypothetic situations. An obvious example is imagecache. If the image is not found then you should fallback to drupal in order to (re)generate the missing image.
  4. The above patch resulted from a compromise between letting the expensive 404s issue linger around and degrade performance, and instead inserting some code in the early bootstrap process so that the 404s would be much less expensive.
  5. This can be solved in both Nginx, Apache, Lighty and I presume IIS also, by defining a location at the vhost level so that static files are served directly and there's no need to have either this patch or the ErrorDocument 404 index.php directive.

Is that a fair assesment of the issue? If so then there should be an handbook page documenting this problem and the possible strategies to solve it. There's no inexorable need to either use the patch or the ErrorDocument directive. From what I gather that's a mere convenience that Drupal provides. It's nice for non-technically inclined people and also if you cannot have it configured as it should at the vhost level. There should be also accompanying documentation for Drupal when this patch gets merged in the regular Drupal core release.

@perusio: Your 2) and 5) are the same. AFAIK, there is no difference between the .htaccess and vhost level in routing a request/serving a 404 error.

Since very early, we don't want to solve it at the web server level as it is too restrictive (and not universal, maybe).

@jcisio not really. AFAIK you cannot specify locations at the directory context in Apache. The Location directive cannot be used in a directory context (.htaccess). This must be set at the Server or Virtual Host level.

What do you mean by "not universal, maybe"? I suppose all web servers let you control locations.

Three responses.

1) This new discussion is off-topic, and should be moved to a documentation issue (or jut to a docs page).

2) @perusio, you are ignoring the fact that a vast number of Drupal users have no server access to the settings you suggest we change. Or are you arguing that this should just be a performance documentation issue instead of a code-level fix?

3) I am not qualified to write the documentation page. Someone who knows more about Apache should take a stab at it.

Let's get back to actual reviews of the patch and its worthiness.

In this tangent, I think the original question is "Does this patch break Nginx?"

@agentrickard regarding 2). What I'm suggesting is that this issue be well documented and the tradeoffs explained.

Regarding 3) I'm not either. But I can seed it with the little I know. I no longer use Apache altogether. I can make a first draft and let others more knowledgeable then I improve it. I can take a stab at writing the Nginx related part, which is a nice segway into your last question.

This patch is unnecessary for Nginx with a decent config, unless you insist on letting index.php be the 404 handler. In most configs I've seen out there that's not the case. This patch introduces a performance penalty since you're checking if the file being requested is a static file or not. It may be small or not, it varies with the load and further investigation is required.

I'm not ignoring "vast number of Drupal users have no server access to the settings you suggest we change". What I'm saying is that if you have, like it's the case with anyone running Nginx, this patch is unnecessary.

I fail to see how it's "off-topic" since I believe the objective is to document as well as possible the tradeoffs involved. Althought I agree that a documentation page is now the proper venue to pursue this issue further.

EDIT: Regarding a documentation page in the handbook. The only thing related I could find is: http://drupal.org/node/56773

The handbook pages could benefit of a server related section IMHO. Does anyone here has power to start such a book page? I will gladly seed it with this issue as a section.

Hey Perusio;

I think your solution makes a heck of a lot of sense. Since (I imagine this may not be the case, but it usually is in my experience) Nginx has access to the static filesystem it can check too to see if the file exists. There is no need for it to pass every 404 off to Apache and then Drupal. Of course, by the same extension even if you don't have Nginx, you can do this in apache in a vhost.

If you give the nginx script essentially the same list of extensions as the patch and tell it to go to a static 404 if those files don't exist (heck make it a purty page and have your reverse proxy cache it) you'll save a ton of time in not going through 2+ other layers to server the 404 when nginx could have done it up front.

Definitely a not seeing the forest for the trees sort of thing.

Let me know what your doc page is when you have it rolling and I'll link to it from my Fast 404 module (still useful for folks who can't modify server configs).

@perusio - I am not following your logic here. Yes, if you use Ngnix configured to catch 404s for specific file extensions this patch is unnecessary. The same can be said if you use Varnish (which can easily accommodate these rules), various CDNs, or your custom built RFC1149 based protocol.

All of these approaches do have drawbacks (see above and below), are inaccessible to a great many users and are hence inappropriate solutions for Drupal core. Yes, of course they should be documented - but this issue is regarding a core patch, and discussion of documentation should focus on the patch and it's behavior. Trying to wedge documentation about the numerous alternative approaches that skilled site builders could employ is out of scope - please open a documentation issue to explore this, add comments to existing documentation pages, and/or apply to join the documentation team. By all means link to a documentation project issue from here.

I don't believe it is correct to say "with anyone running Nginx, this patch is unnecessary" - this is only true if the user has configured these rules, and that is not universally true - for example the first generic configuration on the Ngnix g.d.o. description (for MTecknology) does not have the required rules. In any case, even if this patch is unnecessary with a particular configuration, I don't see how that has any bearing on this patch.

Finally - dealing with these 404s in the webserver or proxy has a number of issues or side effects (that should be made clear in any documentation) - here is a summary:
- They requires root access on some web servers
- Imagecache style modules and/or non-default file directory paths each require a specific exclusion (that may require knowledge of regular expressions to implement)
- If a site uses Drupal's alias or redirect modules to handle legacy (non-404) paths ending in .html or similar, these may require specific exclusion (or vice versa if these are not used)
- Drupal no longer logs these 404s in the watchdog, which means many less skilled web admins will have no idea there is a static file 404, which is a problem in many situations
- Finally, the biggest issue with doing this in Apache (that Drupal supports, through it's default .htaccess file) is that the most effective rule prevents 404 paths from being logged at all (even in the webserver logs).

Can we please focus on the patch at hand which avoids all of these problem, whilst enabling a reasonable and simple high performance compromise in settings.php, and still doesn't prevent any of the web-server/proxy specific approaches from being implemented.

This patch introduces a performance penalty since you're checking if the file being requested is a static file or not. It may be small or not, it varies with the load and further investigation is required.

Can you point out where this performance penalty is? The only checking I see is in drupal_fast_404(), and with the current patch, unless you manually edit settings.php, that code only gets executed if you are already going to serve a 404 from Drupal.

@catch

Yes it's dependant on a variable being set. I was under the impression that the variable was being set from an DB update or upon installation of Drupal. Is it not something that it will be enabled by default on stock Drupal installs?

In settings.php there's the definition of the regex mask for the type of files being checked.

If that's not the case then there's no penalty. Otherwise since drupal_fast_404() is invoked in drupal_deliver_html_page it always get executed. And I suppose that it's always faster to perform that check at the server level instead of doing it at the application level.

If I request the URI /foobar and if there's not a menu entry for that path the drupal_fast_404() function get's executed.

@perusio, no that's deliberately left commented out in settings.php and it's a function call rather than a variable, so can't be affected by install or update (and there's nothing in the patch that affects install or update). That is purely an additional optimization over the main one this patch introduces, disabled by default.

otherwise since drupal_fast_404() is invoked in drupal_deliver_html_page it always get executed. And I suppose that it's always faster to perform that check at the server level instead of doing it at the application level.

Nope. drupal_deliver_html_page() will only execute that code if there is going to be a 404 anyway - since that whole hunk is inside a switch statement.

The only PHP being executed by default here is the assigning of two fairly short strings to $conf, which really is negligible by any standards (and that can be commented out by the same person that sets up the nginx config).

This leaves the fact we define the function at all in bootstrap.inc (which does have a small cost but barely measurable).

@catch Yes I know that. The function call in settings.php is for an even faster 404 since there isn't even a watchdog entry for the 404. The variable I'm talking about is in the condition of the if statement:

<?php
function drupal_fast_404() {
 
$fast_paths = variable_get('404_fast_paths', FALSE); // This is the variable I'm talking about.
 
if ($fast_paths && preg_match($fast_paths, $_GET['q']))
  (...)
?>

It the $fast_paths is set, then the check is always performed. I gave the example of requesting a URI for which there isn't a menu entry, hence the switch statement branch active is the one with drupal_fast_404().

My issue with performance is related to the regex matching in preg_match the remainder is negligible I agree.

I'm reading directly from the patch #230 by Moshe. Isn't this the one being discussed? I don't see any other below that comment.

@perusio - this regex in drupal_fast_404 is not "always" run. If you set 404_fast_paths to FALSE in settings.php then the regular expression never runs - php evaluates if statements in the normal approach - if the "left" test of an ANDed condition fails, it does not even run the "right" test. Hence, if you prefer to do this check at the webserver level (and accept the various issues I describe in #268), then you can disable this patches function completely.

@Owen Barton

Yes that's correct. But the patch enables it by default.

<?php
/**
   (...)
  * Add leading hash signs if you would like to disable this functionality.
*/
$conf['404_fast_paths'] = '/.(txt|png|gif|jpe?g|css|js|ico|swf|flv|cgi|bat|pl|dll|exe|asp)$/i'; // enabled by default
?>

Also I gather from what's being discussed above that the patch is for Drupal 8. Patching settings.php is potentially "dangerous". In the sense that this is a file specific to a given install and might have comments, code blocks, what have you, that will make the patch fail for this file: be it because of a too large offset, or because there are changes that make the patch tool not being able to find a match for inserting the new stuff.

@perusio - yes, we do enable it be default. I think it is very reasonable to ask people who would rather put this in their webserver/proxy configuration (and are able to understand the fairly complex tradeoffs involved in this situation) to place a single "#" in their settings.php. For the vast majority of Drupal users they will not have the desire or skills to do this in their webserver/proxy, and we should not expect them to need to change settings.php (this is why we have the installer, after all) - having a safe and faster default configuration is a huge benefit to them.

Note that we never "patch" running settings.php (for a start, core is unable to upgrade itself, and most users upgrade via wget). We would only change the default.settings.php, so that new sites would benefit. If this was committed to Drupal 8, then backported to Drupal 7, settings.php changes would go in the release notes for users to add themselves.

@perusio: it is enabled by default so that if you tune your server settings, you can then disable this feature.

I can't get what you mean by "patching settings.php". Once this patch get committed, all that you do is comment out that line. Nothing dangerous here. You do it manually, as instructed many other modules (caching modules, for instance) or Drupal core ("translation" of a few strings).

default.settings.php is copied to settings.php, so updates to that file don't affect existing settings.php on Drupal 7 sites that are already installed unless you copy the changes across manually.

@Owen Barton

Ok. I'll now move this debate to a documentation page and link it from here when available. Thank you.

Regarding the settings.php patching thing I was thinking on the case of upgrading core using drush. Or just even doing a simple:

tar zxvf drupal-7.x.tar.gz --strip-components=1 --exclude=sites/*

And if providing a patch. Which isn't the case. The update will be counseled through release notes.

subscribe

Subscribing, greetings, Martijn

Version:7.x-dev» 8.x-dev
Issue tags:+needs backport to D7

For anyone needing a temporary solution, here's what I have in my settings.php files for D7 sites. Also included here are paths to exclude, such as the "styles" paths (aka, "imagecache" in D6). I've included another exclusion, so you can see how multiple ones can be strung together.

<?php
/**
* Fast 404 pages, to conserve server resources upon missing non-Drupal files.
*/
$conf['404_fast_paths'] = '/\.(txt|png|gif|jpe?g|s?html?|css|js|ico|swf|flv|cgi|bat|pl|perl|dll|exe|asp|jsp|php|pdf)$/i';
$conf['404_fast_paths_exclude'] = '/^((sites\/default\/files\/styles)|(some\/other\/url))/i';
if (!
preg_match($conf['404_fast_paths_exclude'], $_GET['q']) and preg_match($conf['404_fast_paths'], $_GET['q'])) {
 
drupal_add_http_header('Status', '404 Not Found');
  print
'<html xmlns="http://www.w3.org/1999/xhtml"><head><title>404 Not Found</title></head><body><h1>Not Found</h1><p>The requested URL "' . check_plain(request_uri()) . '" was not found on this server.</p></body></html>';
  exit;
}
?>

For D6 sites, I use something slightly different:

<?php
/**
* Fast 404 pages, to conserve server resources upon missing non-Drupal files.
*/
$conf['404_fast_paths'] = '/\.(txt|png|gif|jpe?g|s?html?|css|js|ico|swf|flv|cgi|bat|pl|perl|dll|exe|asp|jsp|php|pdf)$/i';
$conf['404_fast_paths_exclude'] = '/^q=((sites\/default\/files\/imagecache)|(some\/other\/url))/i';
if (!
preg_match($conf['404_fast_paths_exclude'], $_SERVER['QUERY_STRING']) and preg_match($conf['404_fast_paths'], $_SERVER['QUERY_STRING'])) {
 
header('HTTP/1.0 404 Not Found');
  print
'<html xmlns="http://www.w3.org/1999/xhtml"><head><title>404 Not Found</title></head><body><h1>Not Found</h1><p>The requested URL "' . check_plain(request_uri()) . '" was not found on this server.</p></body></html>';
  exit;
}
?>

I hope this contributes in some way to the discussion.

EDIT: Adjusted to account for the note in #283.

Status:Reviewed & tested by the community» Needs work
StatusFileSize
new6.38 KB
FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 76824-283.diff.
[ View ]

The patch in #230 has a problem. The regex doesn't escape the period, so paths that end with any of the extensions will issue a 404.

The attached patch fixes this.

Status:Needs work» Needs review

Fixing status.

Status:Needs review» Needs work

The last submitted patch, 76824-283.diff, failed testing.

Status:Needs work» Needs review
StatusFileSize
new5.77 KB
PASSED: [[SimpleTest]]: [MySQL] 29,849 pass(es).
[ View ]

Updated git patch for 8.x-dev (looks like the other patch might've been against 7.x) straight from #283. Hopefully the testbot's happy...

Subscribe

Status:Needs review» Reviewed & tested by the community

YAY! Now that #1036132: Provide a mechanism for issue summaries is deployed, at 290 replies, this issue could really use a summary. :)

Could someone please edit the issue node using the issue summary template? That would really help both Dries and everyone else coming into this issue understand what's going on.

Assigned:Unassigned» Owen Barton

Working on this now

Assigned:Owen Barton» Unassigned

Summary is up - yay for issue summaries!

Assigned:Unassigned» Dries

Putting this into Dries's queue to review.

I applied the patch from #283 for D7. WARNING: enabling drupal_fast_404() prevents file styles images from being generated - it returns those requests as 404's! Has this been taken into account?

@morningtime - Read through the issue summary, especially the notes about where regex (some tweaking) is required to make fast 404s compatible with image styles. I had to add any paths to the regex where dynamic content is created (anything drupal's path system doesn't know about that ends in a file extension defined in the fast 404 function).

@geerlingguy My point is why isn't that regex part of the patch Dries is going to look at? In it's current state, it's a dangerous patch, because only 100% of all D7 sites uses file styles.

Status:Reviewed & tested by the community» Needs work

That sounds like a needs work.

Status:Needs work» Reviewed & tested by the community

@morningtime: by default, fast 404s are disabled, for precisely that reason. You have to know what you're doing to set this up, and its buried towards the bottom of settings.php with other advanced configuration options. You can turn on fast 404s if you want, at the settings.php level, by uncommenting the line with the function call. You can also enable them at a higher level, later in the bootstrap, for more safety when it comes to styles and such, but with a slight decrease in the performance impact.

Setting back to rtbc... Unless the contention is elsewhere.

+1 on the rtbc this is much needed advanced feature

Status:Reviewed & tested by the community» Needs work

Damn, I really don't want to push this over 300 replies, but...

Advanced or not, I don't think it's acceptable to make the default regex break a core feature. It should be possible to put an exception for those paths in the regex.

Pages