Problem/Motivation

Drupal handles 404s for certain files that are not Drupal content.

Examples are .js, .css, .jpg, .gif, ...etc.

This has two disadvantages:

- We do an expensive, slow bootstrap for every 404 for such files that cause server load and delay browser page rendering.

- We send far more bytes for that 404 than if we let the web server standard error page get sent, wasting bandwidth and further delaying browser page rendering.

From a scalability point of view 404s can be a huge problem for sites with old domain names that have gone through multiple re-structurings over the years. The 404 traffic simply from hotlinked files alone can be a significant draw on resources. The current setup is also a huge risk for active sites - if you add a 404 to a busy page, or perhaps accidentally move, chmod or otherwise make the files directory inaccessible, server load instantly skyrockets and can render the server unresponsive. An example of this is the (missing by default) favicon.ico file which, until we added a workaround for in .htaccess was basically doubling server load for all Drupal sites.

From an end user point of view this is also a major issue. Browsers will patiently sit around with the loading spinner going waiting for Drupal to fully bootstrap, generate a beautiful, fully themed page to represent an image or css file, that the browser then completely ignores as soon as it notes the 404 http code. Not only does this mean that we end up waiting for Drupal to bootstrap twice (or more!), but the response is at least a factor of ten larger than it needs to be. In some cases (e.g. whitehouse.gov or amnesty.org) it is over 100 times larger than necessary. Multiply this by all the 404s on all the Drupal 7 sites, and the end result is a lot of end user wait time, and a lot of bandwidth and cycles wasted.

Proposed resolution

The approach is to return a lightweight 200 byte page earlier in the page generation cycle. By default it is returned in drupal_deliver_html_page(), which at least cuts out some amount of page generation time. However we also have an option to return the fast 404 when settings.php is being loaded, which is extremely fast but has some tradeoffs to consider. There are settings in settings.php to disable this completely (and get pretty 404s back), to adjust the extensions that trigger a fast 404, and to enable the settings.php early return.

We spent the first zillion comments or so considering .htaccess based approaches, but these had enough tradeoffs to not work for core as a default, and issues with logging that most people would not want to opt in to.

Here is a table describing the status-quo and the 2 options added by this patch:

Configuration A- disabled Configuration B - called from drupal_deliver_html_page() Configuration C - called in settings.php
Heavy 404 page generation cost Smaller 404 page generation cost (~50% of A?) Minute 404 page generation cost (< fastpath cache)
Slow 404 page response time (e.g. 1-10 seconds) Faster 404 page response time (e.g. 0.5-5 seconds) Very fast 404 page response time (e.g. < 0.2 seconds)
Large 404 page size (2,000-30,000+ bytes, depending on site) Tiny 404 page size (~200 bytes) Tiny 404 page size (~200 bytes)
Inlined assets (images, JS, CSS) 404 as normal Inlined assets (images, JS, CSS) 404 as normal Inlined assets (images, JS, CSS) 404 as normal
Directly requested assets show fully themed 404 page Directly requested assets show simple 404 page Directly requested assets show simple 404 page
All 404s logged in watchdog as normal All 404s logged in watchdog as normal "Fast" 404s not logged in watchdog
All 404s logged in Apache logs as normal All 404s logged in Apache logs as normal All 404s logged in Apache logs as normal
Core image api works normally Core image api works normally Core image api works normally
404 based lazy loaders (e.g. D6 imagecache) work normally 404 based lazy loaders (e.g. D6 imagecache) work normally 404 based lazy loaders (e.g. D6 imagecache) don't work without tweaking the regexp
Pages with aliases that match regex load as normal Pages with aliases that match regex load as normal Pages with aliases that match regex 404 instead of load

Server side benchmarks

Using apachebench, and visting example.com/none (which is a 404), with 5 users, 1000 requests

1. Without patch
Requests per second: 49.71

2. With patch, default settings
Requests per second: 51.11

3. With patch, with function uncommented in settings.php
Requests per second: 51.15

When visiting example.com/none.jpg (which a 404 that triggers a fast 404), 5 users, 1000 requests

With patch, with function uncommented in settings.php
Requests per second: 352.31

Browser side benchmarks

Benchmarks show the effect of the current slow inline 404s on browser display latency for end users. These were done using http://www.webpagetest.org/ testing the same site (in different configurations, 404s introduced through a node body promoted to home page) using default settings, 10 test runs each time.

First View Summary (seconds)
  No patch Patch Patch (fast 404 enabled)
No 404 2.591
Image 404 2.944 2.881 2.525
CSS 404 3.571 2.890 2.501
Percentage Improvement
  Patch Patch (fast 404 enabled)
Image 404 2% 14%
CSS 404 19% 30%
Absolute Improvement (seconds)
  Patch Patch (fast 404 enabled)
Image 404 0.063 0.419
CSS 404 0.681 1.07
Repeat View Summary (seconds)
  No patch Patch Patch (fast 404 enabled)
No 404 0.886
Image 404 1.716 1.488 1.011
CSS 404 1.604 1.500 1.287
Percentage Improvement
  Patch Patch (fast 404 enabled)
Image 404 13% 41%
CSS 404 6% 20%
Absolute Improvement (seconds)
  Patch Patch (fast 404 enabled)
Image 404 0.228 0.705
CSS 404 0.104 0.317

See attached file for more detailed benchmarks.

The results are not insignificant - an inline 404 currently slows down the parent page display by anything from 500ms to a full second, depending on the type of 404. Committing the patch will shave something like 100ms-500ms off that time. If site builders enable the "fast 404" return in settings.php the pages display almost as fast as if the 404 wasn't there - anything from 500ms to a full second faster than the current performance. In other words, this patch clearly has enormous benefits for both end user responsiveness, in addition to the server utilization benefits described in depth above.

Remaining tasks

This has been through many months of revision and tweaking. One potential improvement that comes to mind is a test that the lightweight page is indeed returned (we already check for 404 status in several cases) - this would testing some pretty trivial code (one function call, one if), but is probably worth considering.

The previously submitted patch in #286 was incompatible with image styles and other generated files (such paths would return a 404). The latest patch in #316 addresses this issue by adding a configurable list of excluded paths that should not use the fast 404 feature. An interdiff showing changes between the two patches is available in #320. This revision of the patch has been tested by several people.

It may be possible to further optimize the regex; see #319 and #322.

User interface changes

None

API changes

None

CommentFileSizeAuthor
#353 fast404-76824-351.patch5.06 KBShellingfox
#351 fast404-76824-347.patch5.04 KBShellingfox
#350 fast404file-76824-346.patch5.04 KBShellingfox
#347 fast404file-76824-346.patch5.04 KBShellingfox
#328 404_fast_paths_7x-76824-328.patch6.33 KBdroplet
#328 404_fast_paths_8x-76824-328.patch6.33 KBdroplet
#326 404_fast_paths_8x-76824-326.patch6.33 KBgeerlingguy
#326 404_fast_paths_7x-76824-326.patch6.33 KBgeerlingguy
#320 interdiff-286-316.txt3.17 KBxjm
#316 404_fast_paths_7x-76824-316.patch6.33 KBgeerlingguy
#316 404_fast_paths_8x-76824-316.patch6.33 KBgeerlingguy
#311 404_fast_paths_7x-76824-311.patch6.33 KBgeerlingguy
#311 404_fast_paths_8x-76824-311.patch6.33 KBgeerlingguy
#310 404_fast_paths_7x-76824-310.patch6.34 KBgeerlingguy
#310 404_fast_paths_8x-76824-310.patch6.34 KBgeerlingguy
#309 404_fast_paths_7x-76824-309.patch6.33 KBgeerlingguy
#309 404_fast_paths_8x-76824-309.patch6.33 KBgeerlingguy
#307 404_fast_paths_7x-76824-307.patch6.33 KBgeerlingguy
#307 404_fast_paths_8x-76824-307.patch6.33 KBgeerlingguy
#305 404_fast_paths_7x-76824-305.patch6.33 KBgeerlingguy
#305 404_fast_paths_8x-76824-305.patch6.33 KBgeerlingguy
#304 404_fast_paths_8x-76824-303.patch5.93 KBgeerlingguy
#303 404_fast_paths_7x-76824-303.patch5.93 KBgeerlingguy
#286 fast_404_handling-76824-286.patch5.77 KBgeerlingguy
#283 76824-283.diff6.38 KBjoelstein
#230 404.diff6.08 KBmoshe weitzman
#217 404 benchmarks.ods6.14 KBOwen Barton
#221 76824-fast_404_221.patch6.05 KBscor
#215 76824-fast_404_215.patch6.03 KBscor
#208 76824-fast_404_208_0.patch4.64 KBOwen Barton
#206 76824-fast_404_206.patch6.03 KBscor
#196 76824-18.patch6.18 KBkbahey
#194 76824-17.patch5.93 KBOwen Barton
#192 76824-16.patch6.1 KBkbahey
#189 76824-15.patch6.02 KBkbahey
#175 76824-14.patch5.35 KBOwen Barton
#167 76824-13.patch5.31 KBkbahey
#163 76824-12.patch4.64 KBkbahey
#157 76824-11.patch4.61 KBOwen Barton
#154 76824-10.patch4.9 KBOwen Barton
#153 76824-9.patch4.48 KBOwen Barton
#151 76824-8.patch4.48 KBOwen Barton
#149 76824-7.patch4.15 KBkbahey
#148 76824-6.patch4.12 KBOwen Barton
#141 76824-4.patch8.97 KBOwen Barton
#114 76824-113.patch8.87 KBOwen Barton
#110 76824-110.patch6.01 KBkbahey
#107 patch12.patch2.29 KBm3avrck
#102 76824.patch4.5 KBOwen Barton
#92 htaccess.patch1.58 KBzeezooz
#73 files-404-3.patch2.44 KBkbahey
#70 files-404-2.patch2.56 KBkbahey
#67 files-404.patch2.49 KBkbahey
#66 htaccess-404.patch2.15 KBkbahey
#54 d_5.patch1.19 KBm3avrck
#51 htaccessjv.patch1.08 KBjvandyk
#48 d_3.patch1010 bytesm3avrck
#41 404_2.patch1.2 KBm3avrck
#38 404_1.patch1.2 KBm3avrck
#36 block_9.patch4.05 KBm3avrck
#32 404_0.patch5.24 KBm3avrck
#26 htaccess-404.patch_0.txt1.19 KBrobertDouglass
#24 patch_953.35 KBmoshe weitzman
#13 htaccess-404.patch.txt1.97 KBkbahey
#11 htaccess-index.patch1.55 KBrobertDouglass
#4 htaccess_new_0.txt2.59 KBagentrickard
#1 htaccess_7.patch947 byteskbahey
htaccess.patch.txt939 byteskbahey
Support from Acquia helps fund testing for Drupal Acquia logo

Comments

kbahey’s picture

FileSize
947 bytes

Here is another version.

It adds .ico to the list.

kbahey’s picture

pwolanin’s picture

Quick question based on the mailing list discussion-

Assume I want to be able to set up path aliases ending in .html or .htm becuase I've replaced static site content with Drupal content. Will this patch prevent that from working? It seems that it will, so I'd suggest separating out these sorts of endings (shtml, html, htm) into a separate rewrite statement and making it commented out by default.

agentrickard’s picture

FileSize
2.59 KB

I opened another issue before getting to the end of the thread....

The original handbook page was written with reference to Drupal 4.6.x.

As discussed at http://drupal.org/node/56773, in 4.6, we made some edits to the default .htaccess file to allow the web server to handle bad file requests for most common file types.

Before this is standardized, some more testing is needed. Attached is out current .htaccess file.

pwolanin’s picture

@agentrickard - that attached file does not correspond to the current .htaccess. Did you add more to it?

agentrickard’s picture

Let me clarify:

That file is the .htaccess file that we use in production (Drupal 4.6.9), which inspired the original Handbook page. It has been modified since the original handbook page was written.

I added it for reference.

kbahey’s picture

For what its worth, I am using the patch attached in #1 on my live sites, and there is no ill effect at all ...

It does not seem to catch them all though ...

mysql> select wid, from_unixtime(timestamp) as time,  location from watchdog where type = 'page not found' and location like '%gif' order by time desc limit 5;
| 178866 | 2006-08-04 08:48:20 | http://example.com/sites/default/themes/example/images/menu-expanded.gif   |
| 178865 | 2006-08-04 08:47:16 | http://example.com/sites/default/themes/example/images/menu-leaf.gif       |
| 178864 | 2006-08-04 08:12:02 | http://example.com/sites/default/themes/example/images/sidebar_bg_left.gif |
| 178860 | 2006-08-04 07:52:26 | http://example.com/sites/default/themes/example/images/menu-expanded.gif   |
| 178850 | 2006-08-04 05:44:12 | http://example.com/images/header_bg.gif                                       |

I changed it to this, but don't know if it works or not yet.

Index: .htaccess
===================================================================
RCS file: /cvs/drupal/drupal/.htaccess,v
retrieving revision 1.73
diff -u -r1.73 .htaccess
--- .htaccess   14 Apr 2006 09:08:26 -0000      1.73
+++ .htaccess   4 Aug 2006 14:20:13 -0000
@@ -8,6 +8,12 @@
   Deny from all
 </FilesMatch>

+# Handle files that should not be handled by Drupal,
+# so a) no bootstrapping happens, and b) 404 error is less bytes
+<FilesMatch "\.(gif||jpg|jpeg|css|js|ico|cgi)$">
+  ErrorDocument 404 default
+</FilesMatch>
+
 # Set some options.
 Options -Indexes
 Options +FollowSymLinks
@@ -82,6 +88,8 @@
   # Rewrite current-style URLs of the form 'index.php?q=x'.
   RewriteCond %{REQUEST_FILENAME} !-f
   RewriteCond %{REQUEST_FILENAME} !-d
+  # Handle 404s for non Drupal files. See above.
+  RewriteCond %{REQUEST_FILENAME} !\.(gif||jpg|jpeg|css|js|ico|cgi)$
   RewriteRule ^(.*)$ index.php?q=$1 [L,QSA]
 </IfModule>
robertDouglass’s picture

@agentrickard: I don't see how your .htaccess (from #4) is really applicable to Drupal core since you simply disable Drupal's 404 handling altogether:

# Customized error messages.
#ErrorDocument 404 /index.html
#ErrorDocument 403 /index.php?q=FileNotFound

# This overrides the Drupal 404 handler for files that should never be handled by Drupal
#<FilesMatch "\.(gif|jpe?g|png|s?html|css|js|dll|exe|asp)$">
#  ErrorDocument 404 "File Not Found
#</FilesMatch>
robertDouglass’s picture

The original version of this patch (and that which is in the handbook) works for nonexistant files requested from directories that actually exist in the filesystem. It does something like this:

Request: http://site.com/misc/image.gif

1. Look for directory /misc/.
2. /misc/ is found, so look for image.gif inside of it.
3. image.gif is not found, and image.gif matches the FilesMatch regexp, so return Apache default 404.

Request: http://site.com/some/image.gif

1. Look for the directory /some/.
2. Directory not found, so the URL gets rewritten to http://site.com/index.php?q=some/image.gif
3. Bootstrap Drupal, look in menu, not found, return Drupal 404 page.

The only fix I'm able to think of for this would be to modify index.php to intercept files using the same regexp from the FilesMatch directive, and return 404 from index.php. I don't know if the powers that be will buy such a solution.

kbahey’s picture

Oh, I think they will, given large sites having lots of 404 really suffer because of that false bootstrap, and too many of them stressing the system.

I don't have the time to devise a patch, but will be happy to test one on my sites, and perhaps a large client site too.

robertDouglass’s picture

FileSize
1.55 KB

Here is what seems like an optimal solution. We keep the stuff in .htaccess as it is because having Apache handle things without invoking PHP is ideal. For the cases that Apache can't get to because of the rewrite rules, we run the same regular expression as the first line of index.php and return our own 404 if it matches. The results are as follows:

Request Result
/misc/blog.png Apache will serve the image because it exists.
/misc/blogxyz.png Apache will serve a 404 based on the .htaccess rules because the directory exists but the file doesn't.
/foo/bar.gif Drupal's index.php will serve a 404 based on the rules there because the request ends with .gif. The very fact that we are in index.php means Apache has already decided it is a 404 case, with or without the modification to .htaccess (so the modification to index.php is not an Apache specific hack).

As long as developers don't start making paths in the menu that end in .gif, .png, .swf etc., we're fine. Does anyone know of any cases where such paths are generated? Perhaps by CAPTCHA or watermarking?

robertDouglass’s picture

So much for my nice table.

/misc/blog.png
Apache will serve the image because it exists.

/misc/blogxyz.png
Apache will serve a 404 based on the .htaccess rules because the directory exists but the file doesn't.

/foo/bar.gif
Drupal's index.php will serve a 404 based on the rules there because the request ends with .gif. The very fact that we are in index.php means Apache has already decided it is a 404 case, with or without the modification to .htaccess (so the modification to index.php is not an Apache specific hack).

kbahey’s picture

FileSize
1.97 KB

Here is an updated patch.

It picks up from what Robert did, and changes the following:

- Adds .ico file as a type to be ignored.
- Makes .html as a commented out line with instructions. The reason is, some sites use the .html in URL aliases for various purposes (e.g. content migrated from older systems, making awstats give better stats by type, ...etc.)

I tested this and it works as expected. Should cut down resource usage on some busy sites with lots of flat files.

So, +1 from me.

Dries’s picture

What if I'm serving images from the database, using the 'private file method'?

What if, in the near future, we want to do access control on images associated with nodes? (Something we should do.)

robertDouglass’s picture

This breaks private files. Ugh.

Of course, we could update the regexp to accommodate the private files path... but what about filemanager files? And ecommerce files?

Yuck.

I don't know how we're going to introduce the access permissions on files attached to nodes (filemanager/attachment already has a solution for this), and don't think that it should be part of this patch.

Is the best thing to do to make the change in .htaccess and leave it at that? I suspect that we're only covering 5-10% of the target cases if that's the route we go. The other option would be to build the check into the bootstrap process and hope that we could do a partial bootstrap for 404 images instead of a full bootstrap, but this is clearly inferior.

forngren’s picture

I'm writing something maybe related, http://drupal.org/node/77514. It performs a search for the missing URL with certain words stripped (eg html, php etc).

robertDouglass’s picture

I wouldn't be surprised if we end up introducing a second entry point to the bootstrap process, something like files.php, that handles any URL that has a file ending. That file would initiate an alternate bootstrap process where one of the first checks would be "Is this really a file that exists?", and right after that would check "Is this a path that is handled by a module?".

Haven't Walkah and Co. been doing some thinking on the file bootstrap issue? What is their concensus?

moshe weitzman’s picture

i posted some thoughts on private files at http://www.tejasa.com/node/113. no code though.

Dries’s picture

IMO, the best we can do is to try and serve a lightweight page. Dropping the sidebars on 404 pages could save a fair amount of resources; both bandwidth as CPU cycles.

moshe weitzman’s picture

Assigned: Unassigned » moshe weitzman

i guess that will have to do for now ... i've been meaning to add a block-less page anyway. needed for panels.module ... if anyone has ideas on how to implement this, please chime in. assigning to self.

kbahey’s picture

Introduce a new theme function? mini_page based on theme_page() in theme.inc?

Dries’s picture

theme_maintenance does something similar. That or provide a drupal_is_404() function for the template files? Food for thought. :)

moshe weitzman’s picture

Thats was easy. i just added minor conditionals in phptemplate.engine and chameleon.theme. Is used by 404 page, and will certainly be used by panels.module - http://angrydonuts.com/the_dashboard_revisited

To omit blocks, just pass FALSE as 3rd argument - for example theme('page', $output, FALSE)

moshe weitzman’s picture

FileSize
3.35 KB

oops - here is the patch

Dries’s picture

It begs the question: do we need to redo the theme_maintenace() page using this system? Can, and if so, should we merge them?

robertDouglass’s picture

FileSize
1.19 KB

Here's the .htaccess portion of the patch. We should adopt this part regardless of the solution that we come to for the rest because it is the most effecient way to handle requests for files that don't exist.

moshe weitzman’s picture

i looked into maintainence_page and install_page but they are specialized and merit own function. They avoid calls to functions that haven't been loaded yet. I think we can leave them alone.

Dries’s picture

I'm ok with Moshe's patch. My original thought was to introduce a helper function (drupal_is_404()) but I guess this works too. A drupal_is_404() might be a little more powerful, and more transparent though.

Then again, the extra paramater passed to theme_page() might, for example, be useful to simply certain pages (like the welcome page after installation).

moshe weitzman’s picture

Status: Needs review » Reviewed & tested by the community

i'll take some more feedback from cvs commit team, or proceed with commit. the extra param on theme_page() is vital for panels.module (whichis super cool if you haven't looked yet - http://drupal.org/files/issues/htaccess-404.patch_0.txt)

drumm’s picture

Status: Reviewed & tested by the community » Needs review

Which patch are we talking about? I can't tell from moshe's review if he thought his own patch was ready for commit or Robert's.

moshe weitzman’s picture

i am talking about my patch. i have not tested robert's patch.

m3avrck’s picture

FileSize
5.24 KB

Updated patch... includes Moshe's patch + Roberts + update for HEAD.

Tested and working well. I really like the fact that blocks go off when you get a 404, that really can save a lot of bandwith + CPU cycles.

Also, the Apache 404 for actual files is *very* necessary. If you have a link to http://www.example.com/somethingFunnyLooking.gif ... you are linking to a file directly--you are expecting either the file or a 404, *not* a full Drupal boostrap to give you a 404. 99% of the time that link is going to be in the theme or some link posted by someone and they'll never see that 404... so let's give them the absolute least amount of overhead with that 404.

I'd say this is ready to go, nothing new here and we solve 2 problems at once: 404s for content in Drupal (no blocks, less CPU, SQL, etc..) and 404s for files (*nothing* at all).

m3avrck’s picture

Also when/if this goes in, we need to update the theming docs about this new $show_blocks variable so themers can be sure to use it so save some CPU cycles on 404s.

m3avrck’s picture

Status: Needs review » Reviewed & tested by the community

Talked with Moshe in IRC, we both agreed this is RTBC. It was ready before Neil just didn't know what to commit (now it's both).

drumm’s picture

Status: Reviewed & tested by the community » Needs work

This is two separate issues. Make a separate issue for one half of this.

m3avrck’s picture

FileSize
4.05 KB

Neil, ok here is the first patch then, this is for the blocks. Note, this will need a documentation update.

This patch is the same from before, just synced with HEAD.

m3avrck’s picture

Status: Needs work » Reviewed & tested by the community
m3avrck’s picture

FileSize
1.2 KB

After the patch in #36 is commited (same as in #24 just updated for HEAD), here is the next patch to address the .htaccess and 404.

drumm’s picture

Status: Reviewed & tested by the community » Needs work

Please post different patches on /separate/ issues.

robertDouglass’s picture

I moved the most recent version of Moshe's patch to a different issue since the title of this issue fits my patch the best:
http://drupal.org/node/80201

m3avrck’s picture

Status: Needs work » Reviewed & tested by the community
FileSize
1.2 KB

Here's the 404 patch which is for this issue :-)

Thoroughly tested, this is ready to go. Like I said in #32, this is needed for files, it can save a ton of wasted Drupal bootstraps and server overheads.

forngren’s picture

Because of this (two) patches I think the need for http://drupal.org/node/76824 has increased, since the 404 page just "blanks" ATM. Most users wont configure a 404 page...

drumm’s picture

Status: Reviewed & tested by the community » Needs work

The nearly-identical commented lines next to the uncommented need a better explanation.

From what I can tell this still breaks private file downloads.

moshe weitzman’s picture

sad to see this one languishing. if it cam edown to it, i would add this and put some doc for private files tha tthis must be commented out.

coreb’s picture

Version: x.y.z » 6.x-dev

moving from x.y.z to 6.x-dev version queue. Hopefully this will also bump the feature back into people's attention.

moshe weitzman’s picture

color.module also does not work for private files and it is a quite prominent part of our default theme. see http://drupal.org/node/92059

i think private file incompatibility is acceptable for drupal 5. we will fix that in 6. anyone want to address drumm's last comment and resubmit?

dopry’s picture

-1 on this addition to the .htaccess, It will break any dynamic generation of the restricted file types & drupal based file access control.

If it is added I'd like to see it commented out by default, so people who have concerns about bandwidth and server load can uncomment them. We could also choose a special case for files/ or system/files.

m3avrck’s picture

FileSize
1010 bytes

Here's and updated patch which cleans up some stuff.

Known issues:

  • Regexes need to be updated to ignore files that live in files/ and system/files/ --- this will fix both private downloads and modules like imagecache which make special use of the files/
  • Directories that do not exist (e.g., themes/abc/something.gif) cause an Apache 404 error but then it says it can't find the proper 404 document. However, if use a directory that exists /themes/something.gif with an image that doesn't exist, it gets the proper 404
dopry’s picture

The problem with 'files' is that it can change and we don't have access to file_directory_path from the .htaccess file...

moshe weitzman’s picture

it is reasonable to force folks to have 'files' in the difrectory path, like we do for modules and themes. but if not, i htink this can go in commented out and ask folks to enter their files path.

jvandyk’s picture

FileSize
1.08 KB

Here's an updated patch. Moved definition of ErrorDocument below our existing definition (otherwise it'll just get stomped on by the later ErrorDocument 404 index.php line).

Changed the description a bit. Commented the lines out per #50, except the RewriteCond.

RobRoy’s picture

Don't have time to test this just now, but will not handling rewriting for .html files mess when you have node/2 aliased to about.html, for example?

RobRoy’s picture

Also, in those new comments it might help to have a reason why one might NOT want to uncomment those lines. For example, why does this come commented-out by default.

m3avrck’s picture

Version: 6.x-dev » 5.x-dev
FileSize
1.19 KB

Here's an updated patch with a regex from Steven on IRC to ignore those files when they exist in /files which can interfere with private downloads and imagecache module.

Regex needs some tweaking though, almost there.

m3avrck’s picture

Also uncommented the regex because if it can ignore anything in files/ it shouldn't break anything. Feel free to comment out though if I missed a use case.

Morbus Iff’s picture

m3vrck: FilesMatch does absolutely nothing with Directories. Scanning against files/ won't work. The only way forward on that avenue is to get the default handler into the .htaccess that is generated into the files/ directory.

drumm’s picture

And files isn't always named 'files'.

kbahey’s picture

How about a new approach to this issue that has been going back and forth for so long, rolled back, ...etc?

Instead of doing things in .htaccess, requiring people to uncomment things, ..etc., can we move this into Drupal itself?

Can we do a light bootstrap (as early as possible), read a variable (perhaps in settings.php) that has a pattern, try to match it, then exit early?

This is of course more overhead than letting Apache handle the whole thing, but it would still be better that what we have now, provided we do it as early as possible.

Comments?

moshe weitzman’s picture

@kbahey - now is not the time to switch gears on this issue. thats a radical change and quite debateable.

@drumm - i think a code comment in htaccess and possibly where we set the files directory is enough to satisfy the rare case where files dir does not end in 'files'.

kbahey’s picture

moshe, the reason I suggested an alternative is that this is dragging on, due to various issues that are pointed out by various people. So, getting consensus on it will be difficult.

So, in my view, it may not make it to 5.0.

My suggestion was to move it inside drupal, so we can manage the patterns to exclude, ...etc. from inside Drupal with a little bit of sacrifice on the overhead side.

If others think it will get in core via htaccess soon, then I am happy to go with that too.

moshe weitzman’s picture

anyone care to carry this forward?

kbahey’s picture

Version: 5.x-dev » 6.x-dev

Anyone willing to work on this performance enhancing patch?

The code freeze is a few weeks away ...

RobRoy’s picture

Can someone answer my question at http://drupal.org/node/76824#comment-173567, which is the same as #3 submitted by pwolanin? What if I have an alias for .html pointing to a missing node? Not that that is a huge deal, just would like clarification. Also, the hardcoding /files issue is still there. What about for multisite?

@kbahey, I agree that moving this inside Drupal would save us a lot as we could get the right files dir and some of the other issues might be alleviated. But, maybe that would soften the performance gain that this is trying to achieve. Since this is a bit funky, I think the easiest way is to put this in .htaccess and comment it out initially, allowing users to turn on this boost if they want. What do you think?

kbahey’s picture

Many of my site has .html as the suffix, mainly so awstats can give good stats on them.

I think that this patch is for performance more than anything else. If we avoid a Drupal bootstrap, we waste less resources when we have a 404.

So, the options are:

1. We add the stuff in .htaccess and index.php and comment it out. This is the path of least resistance and would give us what we need for D6.

2. Have .htaccess generated automatically from Drupal like we do with the CSS cache stuff. A default setting provides which file extensions to use and the admin can override them (e.g. remove html or add xyz)

The .htaccess need not be writable. We can take a non-fancy approach and just display the .htaccess contents on the screen and ask the admin to copy/paste, or we can write to the tmp directory and ask the admin to move the file manually. Since this is an advanced feature, I think this is acceptable.

3. Move parts of that in a lighweight bootstrap, as I suggested earlier.

I think 1 will have the best performance. If Dries is amenable to this, I can work on this towards end of next week.

moshe weitzman’s picture

I agree that #1 is best for now. I hope kbahey or other have a chance to revive this.

kbahey’s picture

Assigned: moshe weitzman » kbahey
Status: Needs work » Needs review
FileSize
2.15 KB

I hate to see Drupal 6 go out the door without this simple yet effective feature.

Here is a patch for current HEAD.

kbahey’s picture

FileSize
2.49 KB

Here is a patch that moves the index.php stuff to settings.php, and comment it out.

This way, we have no cruft in index.php, and those who need this can just uncomment it (albeit in 3 different places, but better than a whole Drupal bootstrap for a measly 404).

moshe weitzman’s picture

Status: Needs review » Reviewed & tested by the community

works as advertised. i think this is a good compromise, and all the lines are needed.

Steven’s picture

Status: Reviewed & tested by the community » Needs work

One of the chunks is double commented, the other is not.

kbahey’s picture

Status: Needs work » Needs review
FileSize
2.56 KB

Here it is, fresh from the microwave ... ;-)

agentrickard’s picture

I think this patch is important (I wrote the original handbook page).

Circling back, have you tested against http://drupal.org/node/56773#comment-243479? And against file uploads in general?

Gábor Hojtsy’s picture

Reread the htaccess comments, there are typos and repeated stuff as mentioned by Steven. Also if the functionality in htaccess is effective, there would not be a way to get to the default settings.php change. Are three three marketed as an 'all or nothing' solution to uncomment all three to cover all bases?

kbahey’s picture

FileSize
2.44 KB

The 3 are needed but for different reasons. When a directory exists but a file in it does not, then Apache is able to handle the 404, and hence the .htaccess changes.

However, if the directory does not exist, then Drupal is still bootstrapped, so we exit as early as we can, and hence the settings.php change.

Rerolled the patch. Hopefully comments are OK now. I also changed it to use REQUEST_URI instead of QUERY_STRING, since that is what we are checking in .htaccess.

Also added a small "Page not found" message.

Morbus Iff’s picture

Status: Needs review » Needs work

* What's a "flat file"? I know what a flat file is. It's not what the documentation prefaces.
* None of the sentences that end with "settings.php" have a period at the end.
* None of these lines are commented (and they should be).
* "corresponding the RewriteCond" has an extra "the".
* I can agree that we'll save "server" resources, but not "bandwidth" resources.
* "corresponding FilesMatch line" - it's actually three line[s], not just one "line".
* Technically, if a number is less than 10, it should be written out ("three lines", not "3 lines").

I'm not sure I like this third check. I worry about dynamically generated images (ala CAPTCHA or various statistics) - I suspect, just off the top of my head, that one of the "make an image from text" modules will fail, like textimage.module. This third check also defeats a request for an image uploaded to a node when private downloads are enabled (URL path is ?q=system/files/1608.png, which triggers your check).

Morbus Iff’s picture

And note that #74 and the "private downloads" issue was brought up as far back as #14.

cburschka’s picture

I have trouble believing that the performance gain is worth completely eliminating file extensions in the virtual file system. The whole simple beauty of the "no file named $1? let index.php?q=$1 handle it" method is that there is no difference from the outside between the Drupal handler and the files served directly.

What performance gain is this supposed to give, anyway - it seems to speed up only 404 errors. For a site without broken links and sensible, persistent URLs, shouldn't those be the exception rather than the norm?

Just my two cents there...

Morbus Iff’s picture

What performance gain is this supposed to give, anyway - it seems to speed up only 404 errors. For a site without broken links and sensible, persistent URLs, shouldn't those be the exception rather than the norm? - Not really. Given enough time, your site will be constantly hit by broken spiders, endless requests for favicon.ico (which may not exist), and various cracking attempts via known vulnerabilities in IIS and other webservers. My top WEEKLY 404s are 14045 /favicon.ico(has never existed and never will), 2296 /popular (never existed. broken bot), 1668 /ghostsites/images/pixel.gif (probably once existed.), 1414 /popular/alltime (never existed. broken bot), 1025 /images/pixel.gif (probably once existed), 467 /cgi-bin/isp-post/isp-post.cgi (last existed 7 years ago. bot attempting exploits), 409 /files/users/picture-11.jpg (existed once. user, and all his content, was deleted.), 178 /db/zone.htm (never existed. no idea where this comes from).

agentrickard’s picture

@Arancaytar

Just for background: This is a massive problem for sites that migrate to Drupal from a flat-file system. In the case of SavannahNOW, we had thousands of now dead links in Google, Yahoo, et. al. We were basically getting DOS'ed by 404 errors.

Plus, if someone requests a page that no longer exists, and that .html file contains two .css and 10 .gif files, that turns into 13 404 errors that "fall through" to Drupal. In effect, Drupal spends all of its time searching for files instaed of serving pages.

It is also a generic problem with bots and spiders -- particularly spambots and hackers. Our traffic logs show repeated requests for .dll and .cgi files that point to known or suspected exploits in IIS.

All that aside, the case may be that the best path here is a well-tested and documented handbook page, with a reference in the core INSTALL.txt that points to those instructions. I say that simply because edge-cases seem to be drowning this issue. If we can't fix all the edge cases, I doubt this will get into core.

kbahey’s picture

Here is the wording, before I roll a patch again:

.htaccess part1:

The following lines prevents Drupal from handling 404 errors for certain
types of static files, such as images and css.

This can help reduce the server's load because 404s for such files do not
cause a full Drupal bootstrap. 

You must uncomment the corresponding RewriteCond lines later in this file as 
well as those in the settings.php file.

.htaccess part2:

The following lines prevents Drupal from handling 404 errors for certain
types of static files, such as images and css.

This can help reduce the server's load because 404s for such files do not
cause a full Drupal bootstrap. 

You must uncomment the corresponding FilesMatch lines later in this file as
well as those in the settings.php file.

settings.php:

The following lines prevents Drupal from handling 404 errors for certain
types of static files, such as images and css.

You must uncomment the corresponding RewriteCond and FilesMatch lines in 
the .htaccess file as well.

Regarding private download, we can add a comment that this does not work in such a case. How about that? Any wording suggestions Morbus?

Morbus Iff’s picture

CSS is a acronym and should be capitalized. I'd much rather these sentences be merged into one paragraph (for each block).

As for the settings.php, I can't support this patch with that block in - adding wording-to-the-wise about private downloads doesn't solve the issue of textimage (which creates dynamic images) or any other module that attempts to do something dynamic with normally static files. It's something that would cause more confusion and endless "your module is broken!" requests.

I support the .htaccess additions.

kbahey’s picture

Assigned: kbahey » Unassigned

Morbus is right about private file download. This is something we need to think about. However, that should not stop the .htaccess parts.

Not sure what merging sentences in one block for each paragraph means.

I am unassigning myself from this issue. Someone else can take it and run with it.

moshe weitzman’s picture

In order not to break textimage and imagecache, I just added this condition in the settings.php if block:

&& strpos($_SERVER['REQUEST_URI'], 'files') === FALSE

i know that one could use a files dir that doesn't begin with 'files' but thats a bug in core IMO. we require modules dir to be modules, themes to be themes, and we have every right to require that files be files. i just need to submit a patch for that. if folks agree, i'll get started.

note that color.module does not work with private files and thats a far more important feature than these chunks of commented out code that are proposed.

cburschka’s picture

Given that any module (or even custom site code) might attempt to "fake" a file extension in the URL, it seems like this fixes one issue (404 overload) at the cost of introducing an uncounted number of problems whose cause cannot not readily identified.

"Why is such-and-such a page suddenly returning 404 errors?" seems like a far more difficult support issue to solve than "Why is my Drupal site constantly running into my host's server load limits?". When the server overloads, you have a very obvious problem that will probably be handled by an admin who can poke around in .htaccess files. The other one is encountered by users who have no idea what happened.

Because of this, I'd propose that the relative addition/change in .htaccess be commented out with a description of how to activate it, and document this as a performance tweak in the handbook...

mcurry’s picture

subscribe

Christefano-oldaccount’s picture

Where has this issue been all my life? Subscribing.

Owen Barton’s picture

Subscribing

juerg’s picture

Subscribing

Pancho’s picture

Version: 6.x-dev » 7.x-dev

Moving feature requests to D7 queue.

zeezooz’s picture

Version: 7.x-dev » 5.7
Status: Needs work » Needs review

For handle file in non-existent directory I use rewrite to /404.ext in .htaccess. Put

  RewriteCond %{REQUEST_FILENAME} !-f
  RewriteCond %{REQUEST_URI} !^/files/
  RewriteCond %{REQUEST_URI} \.(png|gif|jpe?g|s?html?|css|js|cgi|ico|swf|flv|dll)$
  RewriteCond %{REQUEST_URI} !^404.%1$
  RewriteRule ^(.*)$ 404.%1 [L]

before last RewriteConds/RewriteRule and

<FilesMatch "\.(png|gif|jpe?g|s?html?|css|js|cgi|ico|swf|flv|dll)$">
  ErrorDocument 404 "<html><head><title>404 Not Found</title></head><body><h1>Not Found</h1><p>The requested URL was not found on this server.</p></body></html>
</FilesMatch>

just after

ErrorDocument 404 /index.php

It does not require to patch other files and seems to work well in 5.7 (including imagefield preview), but error message contain 404.ext filename so I use custom message.

Christefano-oldaccount’s picture

Version: 5.7 » 7.x-dev
Status: Needs review » Needs work

@zeezooz, can you roll a patch?

New features are going into 7.x-dev, not 5.7.

Owen Barton’s picture

The trickiest thing to work around here are modules like imagecache that rely on the 404 handling to auto-generate versions of files that don't yet exist. I wonder if we could get around this by having them add a querystring paramete (e.g. 404handler=drupal) that will cause .htaccess to use the Drupal 404 for these files, rather than the lightweight static 404 page. This way we could potentially have the rule enabled by default, and still allow these modules to work without having people hack the (fragile!) .htaccess regexi.

Anyone fancy taking a shot at this?

zeezooz’s picture

FileSize
1.58 KB

patch for #89

Gábor Hojtsy’s picture

Note: Favicon.ico just got such special treatment in Drupal 6.3.

m3avrck’s picture

patch #92 works great!

We are using this on MothersClick with imagecache & imagefield with *no* problems at all, test it yourself:
http://www.mothersclick.com/no-directory/this-really-works.css

The patch that went into D7 and D6 for favico should be reverted in favor of this patch which works great.

catch’s picture

Category: feature » task
Status: Needs work » Needs review

Still applies with fuzz, back to needs review.

I agree 100% that a more generic solution would be great for this issue rather than special casing favicon.ico - it's a big performance issue on some sites (including one I run) that have old legacy linkrot for images etc.

cburschka’s picture

Could we leave default behavior as it is? This will break functionality on a great number of sites - whether you are using a private file system, or any module that dynamically generates file downloads, or even if you are in the habit of adding ".html" to the URL aliases of your nodes.

It's a fine tweak, but I don't see that we should knowingly and without warning or explanation break the sites of one part of the user base in order to improve performance for another. These lines should be commented out initially, so you can uncomment them if you need the performance. We don't default to aggressive caching either, for the same reasons.

m3avrck’s picture

This patch DOES NOT break dynamically generated files like images with imagecache. Like I said, we're using it on MothersClick and we rely on imagecache and everything works a-ok. Before saying a patch doesn't work, test it. There is a line in there that says ^/files/ ...

cburschka’s picture

I'm sorry, I often get mod_rewrite rules wrong. Does the following prevent the file being handled by Drupal, or is it an exception for files that are handled by Drupal?

+  RewriteCond %{REQUEST_URI} !^404.%1$
+  RewriteRule ^(.*)$ 404.%1 [L]
gpk’s picture

Related: #295456: Bug in .htaccess file.
Also the patch Gábor mentions in #93: #174940: Root /favicon.ico accounts for majority of 'Page not found' errors.

Is it proposed that *by default* Drupal should not handle 404s for the image file types? Two comments:
- if Drupal doesn't handle them then I lose an easy way of finding broken img tags etc. on my sites
- the comment sections e.g. # The following lines prevents [sic] ... # You must uncomment the corresponding RewriteCond and RewriteRule lines later in this file. suggests that by default these lines will be commented out??

@98, and for my own edification: What I think those 2 lines do is the following:
- first of all, %1 is a backreference to the file extension of the image file (as matched the line before in the RewriteCond)
- so RewriteCond %{REQUEST_URI} !^404.%1$ makes sure that the request was not for 404.jpg say (this is needed so that the next line doesn't create a loop)
- RewriteRule ^(.*)$ 404.%1 [L] then rewrites the current request (e.g. picture.jpg) to 404.jpg and ends ReWrite processing (so that the following rule - Drupal's normal one - doesn't get applied).
- only then does the FilesMatch directive get applied

Actually I'm not sure the request URI needs to be rewritten ... just need to make sure that Drupal's normal RewriteRule doesn't get run?

Status: Needs review » Needs work

The last submitted patch failed testing.

gpk’s picture

Status: Needs work » Needs review

#100 was the testbot problem #335122: Test clean HEAD after every commit.

Owen Barton’s picture

FileSize
4.5 KB

Here is a rerolled patch that does the following:
- Reverts the favicon.ico rules, because they now are handled by these rules
- Adds some additional extensions for common 404s (mostly exploit-bots)
- Changes the 'files' rule so it does not need to be at the start of a line. This is needed so we can catch sites/default/files paths (the default since Drupal 6). This also fixes (AFAICS) the case where private files are enabled.
- Adds a help text and a validation rule to the file system settings to ensure the file system path contains the word 'files' somewhere (as suggested by moshe in #82).
- Removes handling for htm/html files by default, because migrating from a static html site is a pretty common use case. I have added comments and sample lines for people who know they don't/won't use these aliases, but by default you don't have to do this (and you still save a large number of CPU cycles).
- Makes this enabled by default. I don't think it's worth including this if it is going to be disabled by default, especially if it means uncommenting 2 far apart sections of code in a hard for newbies-to-grok file. If we are not going to enable it by default this could live just as happily in a documentation section for high traffic domains.

Some thoughts:
- This opens up the possibility of a contrib module to save static versions of the error pages as 404.whatever (a bit like boost module does for regular pages), or introduce some lightweight way of logging these errors (404-handler.php) with a minimal change to the rewrite rules.

cburschka’s picture

this could live just as happily in a documentation section for high traffic domains.

+1, though leaving it in as a comment will make it far easier to enable - just tell them to remove the comment on the documentation page.

Again: This change will speed things up for some people and break the sites of everyone using private file downloads, since you're not even using files/ or system/ exceptions, as well as anyone generating dynamic images with extensions, which can't be fixed with any additional special exception. Is it too much to ask not to break people's site by default?

Owen Barton’s picture

@Arancaytar - it has been stated multiple times in the thread that this patch does not break imagecache or any similar modules and has been used successfully on large production sites that also use imagecache. If you read my post you would also note that this should now work with private file downloads. If you would like to find something that it does break, then please by all means try the patch and report back, but repeating "this breaks imagecache" over and over is neither accurate nor helpful :)

Status: Needs review » Needs work

The last submitted patch failed testing.

Owen Barton’s picture

It occurred to me that this would be a great candidate for some tests (which would help settle the what this does/does not break debates!) - any takers?

m3avrck’s picture

FileSize
2.29 KB

The issue I have with this patch is treats 404s for files differently than 404s for something in Drupal's menu system.

IMO these should be treated the same. The attached patch is how we handle on MothersClick.com we a single configurable file that both Apache and Drupal can use.

While not core worthy, hopefully this sparks proper direction for this patch.

moshe weitzman’s picture

What a cursed issue we have here.

sun’s picture

Patch in #102 should not throw a form validation error. Instead, just use a note for the form element description, so site admins who need to use a path without "files" do not have to hack core.

Patch #107 looks interesting. Just an idea: Could we use a separate 404.php in the root directory, which performs a minimal bootstrap, checks the configured file directory path for the site in question, and short-circuits to a (optionally configurable) 404 output - or - if Drupal is supposed to handle the request, performs a full bootstrap?

kbahey’s picture

Status: Needs work » Needs review
Issue tags: +Performance
FileSize
6.01 KB

I hate that D7 would be released still handling all 404s via an expensive bootstrap. This is a big performance suck.

The attached patch solves this in two ways:

- Adds rules to .htaccess to bypass 404s for certain files.
- Adds a safeguard in settings.php to stop very early if we are handling a static file.

Please please let us have this in D7. We can refine later if needed, but let us ship with something that prevents Drupal from handling static files.

Status: Needs review » Needs work

The last submitted patch failed testing.

chx’s picture

You want to write a test module which defines a callback in files, say 'test.gif' and check the return of it -- stuff a random string in a variable in the test and print variable_get in the test module.

Owen Barton’s picture

Status: Needs work » Needs review

Here is an updated patch, with 3 new assertions to ensure that the web server is passing "fast 404" responses to specific file type requests outside the files directory, and that Drupal can still handle requests within the files directory (system_test module now registers 2 imagecache style handlers for this purpose) for both public and private file serving.

I have also tried to make the comment help as clear as possible, although it is hard to balance clarity against excess verbosity - feedback is welcome.

I also took the settings.php fallback out - I am happy to add this back in if someone can explain why we should. This would only be reached on non-Apache webservers. We generally don't duplicate .htaccess functionality in php (especially for "optional" functionality such as this) - for example the www/no-www redirection could easily be done in settings.php, but is not. I think it is reasonable to request that people set up equivalent functionality in their non-Apache webserver configurations, and/or add code to their settings.php to do the same - we have places in the manual to document these.

Owen Barton’s picture

FileSize
8.87 KB

Status: Needs review » Needs work

The last submitted patch failed testing.

Owen Barton’s picture

Hmm, I can't reproduce the test fail locally - anyone else care to test?

GET http://default.head/s120547Pa34UmLhrr.ico returned 404 (139 bytes).	Browser	system.test	613	PageNotFoundTestCase->testPageNotFound()	
Valid HTML found on "http://default.head/s120547Pa34UmLhrr.ico"	Browser	system.test	613	PageNotFoundTestCase->testPageNotFound()	
404s for specific (fast 404) file types outside the files directory are returning without Drupal bootstrap.	Other	system.test	614	PageNotFoundTestCase->testPageNotFound()	
GET http://default.head/sites/default/files/simpletest120547/system-test/404-handling.jpg returned 200 (38 bytes).	Browser	system.test	615	PageNotFoundTestCase->testPageNotFound()	
Valid HTML found on "http://default.head/sites/default/files/simpletest120547/system-test/404-handling.jpg"	Browser	system.test	615	PageNotFoundTestCase->testPageNotFound()	
Drupal is handling requests for specific (fast 404) filetypes when within a public files directory.	Other	system.test	616	PageNotFoundTestCase->testPageNotFound()	
GET http://default.head/system/files/system-test/404-handling.jpg returned 200 (38 bytes).	Browser	system.test	618	PageNotFoundTestCase->testPageNotFound()	
Valid HTML found on "http://default.head/system/files/system-test/404-handling.jpg"	Browser	system.test	618	PageNotFoundTestCase->testPageNotFound()	
Drupal is handling requests for specific (fast 404) filetypes when within a private files directory.	Other	system.test	619	PageNotFoundTestCase->testPageNotFound()	
chx’s picture

Owen, I do not get this, the rewrite rule changes the request to 404.gif or similar? Or what is the 404.%1 wanted to be? And, why wont the filesmatch grab your request, even in the files dir before any rewrite?

Owen Barton’s picture

The rewrite rule simply prevents the request from being rewritten and passed to Drupal (if the file does not exist, the file is outside */files/* and has a specified extension). This causes Apache to 404 the request and look up the appropriate ErrorDocument for the request, which has been previously set for files with this extension. The filesmatch does not do anything at all unless Apache looks up an ErrorDocument for the request.

There perhaps could be problems with this if clean URLs are disabled - kbahey, did you check this at all?

killes@www.drop.org’s picture

I am wondering if this is a good idea. While it might be good for performance, you will never notice that some files are missing if you do nto additionally look into the apache log.

Morbus Iff’s picture

Yeah, honestly, I'm not hugely fond of this functionality - it's pretty hackish, and severly cripple's Drupal's own log functionality (which, over the many years, people have probably come to expect to be a literal equivalent to server logfiles). I don't doubt that heavy 404s can cause performance issues - I just don't feel this is the answer.

catch’s picture

I'm wondering if we should just do either extension matching or file directory matching in drupal_not_found() - so log the 404 for site admins, but then return straight away if it's a file rather than a page. That'd save all the page building and rendering for the 404, but still give us logs.

moshe weitzman’s picture

It simply is not Drupal's job to catch all possible requests under its tree and log them. I like this patch a lot. catch's idea still requires a bootstrap which is a big problem. some contrib module can implement an apache log viewer if people need to view these 404's inside of drupal. or make a contrib module which lets you add a simple link to awstats/webalizer the reports block. web hosts could even make their own distro with that link pre-configured.

killes@www.drop.org’s picture

It simply is not Drupal's job to catch all possible requests under its tree and log them.

It has done so for years and considering everything this behaviour has caused very little problems.

I like this patch a lot. catch's idea still requires a bootstrap which is a big problem. some contrib module can implement an apache log viewer if people need to view these 404's inside of drupal. or make a contrib module which lets you add a simple link to awstats/webalizer the reports block. web hosts could even make their own distro with that link pre-configured.

I think it would be bad user experience to change this behaviour.

Catch's idea sounds interesting, but I am not sure how it could work.

Summary: I'd won't fix this if I were a core committer.

Morbus Iff’s picture

I don't like, at all, how the patch redirects the user to 404.$fileextension. This is *really* damaging - that's the actual entry that will get logged into the server log file, meaning that we are *crippling* a traditional error_log by filling it with garbage. How can I fix my broken links or embeds (etc.)., if Drupal is filling my log with "404.jpg" and "404.html"? This'll corrupt not just the error_log, but also anything that works on error_logs, such as Webalizer and so forth.

Morbus Iff’s picture

To amend my previous comment, here's what the error log looks like with this patch:

[Mon Jun 15 10:52:35 2009] [error] [client 75.67.50.34] File does not exist: /var/www/vhosts/disobey.com/httpdocs/404.jpg
[Mon Jun 15 10:52:37 2009] [error] [client 75.67.50.34] File does not exist: /var/www/vhosts/disobey.com/httpdocs/404.ico, referer: http://www.disobey.com/ahem.jpg

The first entry is the rewritten file - I had requested ahem.jpg. The second entry is a modern browser's standard attempt to get the favicon.ico (which, naturally, we can't really tell here, since it's been rewritten).

Huge -1.

catch’s picture

Also has anyone tested this with imagecache?

Morbus Iff’s picture

catch: assuming that the Drupal 'files' directory has been set to 'files', then imagecache would be fine. If the 'files' directory was named something not 'files', then imagecache, private downloads, and other dynamic generations would fail. They could be made accessible again only if the .htaccess and Drupal source code was modified (fun, right?).

catch’s picture

Should've read the patch again, thanks though Morbus.

moshe weitzman’s picture

@Morbus - could you recommend an approach then? Right now, a site can nearly crumble if a few bad invalid images get posted to the home page (for example)

catch’s picture

Just a note I think we could do my approach via #444756: Add a page_not_found() hook to Drupal with a few modifications.

Morbus Iff’s picture

@moshe: I don't have a suggestion you'd like, no - I don't think it can be cleanly fixed at the .htaccess level (i.e., meaning we'd need to do /some/ sort of bootstrap; ala @catch's point). However, I also disagree entirely with your assertion that "a site can nearly crumble with a few invalid images" - that's ridiculous: as someone who has been running the same site for more than 10 years, and gets dozens of 404s every minute, (historical, rootkitish, and accidental, etc.) it's a bit FUDdish. To think that Drupal could get to where it is now, with such a easy 404 DoS, is inane.

I will agree, however, that certain shared hosts can have such a problem with a popular site and sliced resources. I just don't think, however, it's so incredibly widespread that Drupal sites everywhere are being crippled because of it. Do I think something should be done about it? Sure, I have no problem with that. Do I think it's critical? Nope. Do I think it's critical enough to a) strip an existing expectation from Drupal (404s in the reports) and b) break the equivalent functionality at the webserver/log level? Most definitely not.

killes@www.drop.org’s picture

Also, putting the notice about the missing images into the watchdog will allow site maintainers to create the missign images or to remove the broken node/template/whatever.

This issue is pure sillyness.

moshe weitzman’s picture

@catch - in this issue, we want to handle 404 without a bootstrap. really we want to handle a 404 without any php at all but definately no bootstrap. a 404 hook completely the wrong direction IMO. a 404 hook for the files directory might be fine and can be pursued in that issue.

catch’s picture

Also we haven't been doing a full bootstrap for favicon.ico 404s since 28th June 2008, so this patch only fixes image files which aren't favicon.ico and aren't in sites/*/files. I think with something in drupal_not_found() we could skip rendering the page for files in sites/*/files as well.

gpk’s picture

>skip rendering the page
If page caching is on then the response will be from the page cache anyway, so there will only be a full bootstrap/rendering of the page if the request hasn't already been cached.

killes@www.drop.org’s picture

cool, so I need to check if favicon.ico exists myself? What a truly great idea. I'd expect something like this from WP or Joomla, but not from Drupal...

Owen Barton’s picture

Comment 127 is not correct - the file directory path does not need to be set to "files", but simply needs to include the string "files" somewhere within it. There is a validation on the setting in the patch.

I agree that the current patch breaks Apache logging (404.ext) in an unacceptable way and is not committable. Possibly there is a way to fix this - I tried briefly to no avail. I think I am also leaning towards a php based solution, ideally with optional Drupal logging (and hence DB bootstrap). Either way, I DO think we need an improvement here - I have seen several site outages caused by 404s when assets on a single browser page view creates many Drupal hits, or the page has repeating AJAX callbacks that are 404ing.

Either way, I think 2 of the 3 tests I created are still worthwhile, since they ensure the imagecache style facility is functional (which I think could pick up some subtle breakages in the file path/url functions).

Morbus Iff’s picture

Owen - the RewriteCond specifically looks for "/files/" (unbounded), but the validation code only looks for "files". This would allow me to use "drupal_files" inside Drupal, and pass the validation, but lose the "fast" 404s.

killes@www.drop.org’s picture

I actually like to translate the default files path to use "dateien"...

If your system is prone to meltdown because of a few missing files I tend to think that this is not a Drupal problem per se.

You should evaluate your hardware and possibly buy more.

You also should test your site on a staging system to catch such 404s in advance.

Owen Barton’s picture

@Morbus - correct, this was a bug - however the symptoms would be broken imagecache, rather than lost fast 404s.

Attached patch fixes this, and also fixes the issue with the tests failing when mod_rewrite is not enabled. It doesn't fix the Apache 404.ext logging issue however.

Owen Barton’s picture

FileSize
8.97 KB

Sigh

moshe weitzman’s picture

Well, 5 broken images on your home page can be a nearly a 5x multiplier on your traffic. It leads to 5x on the number of bootstraps that your web and DB servers have to service.

killes@www.drop.org’s picture

Excellent reason to just fix the stupid images. Thanks to Drupal (as it is now) you'll at least be alerted to the problem.

Owen Barton’s picture

You won't be alerted to the problem if the site is down due to 5x traffic

Morbus Iff’s picture

You can't solve the issue if your error log shows 5 entries for 404.jpg.

(See, we can do snarky replies too!) ;)

Owen Barton’s picture

Snarky replies aside, I think everyone agrees that logging 404.ext for missing files is not acceptable.

What we need is either an alternate .htaccess approach (that logs 404s correctly, and works acceptably both with mod_rewrite enabled and disabled) - OR a PHP based approach that is sufficiently faster than a cached bootstrap to make it worth doing.

m3avrck’s picture

FWIW, I do this on my sites:

/**
 * Generates a 404 error if the request can not be handled.
 */
function drupal_not_found() {
  drupal_set_header('HTTP/1.1 404 Not Found');

  watchdog('page not found', check_plain($_GET['q']), NULL, WATCHDOG_WARNING);

  global $conf;
  include_once $conf['404_file'];
  exit;
}

Than I usually set the path to sites/whatever/theme/whatever/404.php and include my logic there.

This is in addition to setting the htaccess to point to this same 404 file as well (and having similar htaccess work).

Owen Barton’s picture

Status: Needs work » Needs review
FileSize
4.12 KB

OK, let's try a different approach. I think this one could even make everyone happy!

I think we have demonstrated that a .htaccess based approach is too problematic. So this is a simple PHP approach. It allows you to do send a lightweight 404 for matching URLs either:

a) just before the normal 404 would otherwise be sent (the default). This has minimal side effects (actual pages that happen to match still get served as normal, and 404s are logged correctly in both the Apache and system log). Of course, you still have a pretty good chunk of processing needed to reach this point (we manage to skip most of the theming work though), but users still benefit from bandwidth savings, and this should be safe enough to enable by default.

b) Alternatively (and not the default!) you can return the 404 directly during the configuration bootstrap. This has huge benefits for server load, but significant side effects (identical to the htaccess ones we saw) - even valid pages that happen to match the pattern will return 404s, rather than the actual page. It will also prevent the log entry. This could still be perfectly acceptable for large/active sites with well understood URLs looking to get an edge off their server load.

There are a few things I would like feedback on:
- Does this general approach work for folks? If there are fundamental objections please speak up.
- Naming - is "lightweight 404s" grokable? I didn't want to use "fast" since it is really not a super performance bonus by default. Should we say "not_found" instead of 404?
- Regular expressions in settings.php - this strikes me as a bit risky/unfriendly. drupal_match_path() could work in a pinch, but needs case lowering, and would mean we couldn't use the jp?eg style shortcuts. Other ideas?
- Calling drupal_lightweight_404() directly in setting.php (optionally, of course) is also kind of a new concept. Should we make this a $conf variable (lightweight_404_fastpath, perhaps), and check it in bootstrap.inc somewhere?
- Are there any clean up tasks (hook_exit etc) we really need to do in either case (a) or (b)?

kbahey’s picture

FileSize
4.15 KB

Owen

I like this. It is totally configurable and off by default. So no one should complain about it.

My modification is attached. The most important modifications are:

1. In drupal_fast_404(), you were missing this before the print statement:

header('HTTP/1.0 404 Not Found');

2. Added a pattern for cached css and js where we have a ? and then one letter after it.

I did some bikeshedding, please feel free to ignore. It is renaming "fast" instead of "lightweight". Shorter to type and remember. This should NOT derail the patch. I am happy to revert.

Please let us get this in in Drupal 7. It is way over due.

Can we get a review and an RTBC?

Owen Barton’s picture

Good catch - I tweaked it to add the header with drupal_add_http_header() rather than header() and also changed the order of operations in drupal_deliver_html_page() which avoids a duplicate header (or now function call).

I also removed the css?. and js?. patterns - we are checking against $_GET['q'], which already excludes the querystring, so these aren't needed.

No objections to the name change.

Owen Barton’s picture

FileSize
4.48 KB
moshe weitzman’s picture

+++ sites/default/default.settings.php	9 Apr 2010 16:35:38 -0000
@@ -393,6 +393,35 @@ ini_set('session.cookie_lifetime', 20000
+ * By default, fast 404s is returned as part of the normal page request
+ * process, will serve valid pages that happen to match, and will also log any
+ * 404s in the Drupal log. Alternatively you can choose to return a 404 now by

I think some sentences got mashed together improperly here. Otherwise, is RTBC. Nice work.

Owen Barton’s picture

FileSize
4.48 KB

How about this?

Owen Barton’s picture

FileSize
4.9 KB

OK, that didn't parse either...try this

moshe weitzman’s picture

Status: Needs review » Reviewed & tested by the community

OK, lets give this poor issue a chance to shine.

sun’s picture

Status: Reviewed & tested by the community » Needs work
+++ includes/bootstrap.inc	9 Apr 2010 20:54:00 -0000
@@ -2045,6 +2045,24 @@ function drupal_maintenance_theme() {
 /**
+ * If fast 404 pages have been configured, and this is a matching page
+ * then return a simple 404 page.

This needs to be a proper function summary, stating what this function does. The current summary is rather part of the description.

"Returns a simple 404 Not Found page."

+++ sites/default/default.settings.php	9 Apr 2010 20:54:00 -0000
@@ -393,6 +393,36 @@ ini_set('session.cookie_lifetime', 20000
+ * Drupal can generate fully themed 404 pages. However, some of the time these
+ * responses are for image or other resource files that are never displayed
+ * to the end user, or from bots that will do nothing with the resulting page.
+ * This can waste bandwidth, and also generate server load.

"However, some of these responses are for images or other resources that are not displayed to the user. This can waste bandwidth and generate unnecessary server load."

+++ sites/default/default.settings.php	9 Apr 2010 20:54:00 -0000
@@ -393,6 +393,36 @@ ini_set('session.cookie_lifetime', 20000
+ * The options below return a simple, fast 404 page for URLs matching a
+ * specific pattern.
+ *
+ * - 404_fast_paths is a regular expression matching paths to return the
+ *   fast 404 on. If you don't have any aliases ending in htm or html you
+ *   can add '|s?html?' to the expression.
+ * - 404_fast_html is the html to return.

The preceding sentence should end with a colon (:), and the list should directly follow without blank line.

The array keys should be delimited with a colon, followed by a description, e.g.

- 404_fast_paths: A regular expression to match paths to return a simple 404 page on. [...]

+++ sites/default/default.settings.php	9 Apr 2010 20:54:00 -0000
@@ -393,6 +393,36 @@ ini_set('session.cookie_lifetime', 20000
+ * ¶

Trailing white-space here.

+++ sites/default/default.settings.php	9 Apr 2010 20:54:00 -0000
@@ -393,6 +393,36 @@ ini_set('session.cookie_lifetime', 20000
+/*

1) Should be /**

2) Missing blank line before this docblock.

116 critical left. Go review some!

Owen Barton’s picture

Status: Needs work » Needs review
FileSize
4.61 KB

Thanks for the review, Sun. This should resolve all these issues.

sun’s picture

Status: Needs review » Reviewed & tested by the community

Thanks!

kbahey’s picture

Webchiiiiiiiiiiiiick! Driiiiiiiiiiies!

Dries’s picture

Version: 7.x-dev » 8.x-dev
Category: task » feature

This looks like a new feature, and should go into Drupal 8 at this point.

kbahey’s picture

Dries I disagree that it is a new feature.

This is a performance improvement. We bootstrap the whole of Drupal when the request is a static file 404.

This patch does not change any behavior and is off by default, enabling it reduces resource waste when a lot of 404s for static files are happening.

sun’s picture

Category: feature » task

I agree that this would be good to do for D7.

However,

This patch does not change any behavior and is off by default

is not entirely true, due to this change:

+++ includes/bootstrap.inc	9 Apr 2010 22:38:30 -0000
@@ -2045,6 +2045,26 @@ function drupal_maintenance_theme() {
+function drupal_fast_404() {
+  $fast_paths = variable_get('404_fast_paths', FALSE);

+++ includes/common.inc	9 Apr 2010 22:38:31 -0000
@@ -2329,11 +2329,14 @@ function drupal_deliver_html_page($page_
+        // Check for and return a fast 404 page if configured.
+        drupal_fast_404();

+++ sites/default/default.settings.php	9 Apr 2010 22:38:31 -0000
@@ -393,6 +393,36 @@ ini_set('session.cookie_lifetime', 20000
+$conf['404_fast_paths'] = '/.(txt|png|gif|jpe?g|css|js|ico|swf|flv|cgi|bat|pl|dll|exe|asp)$/i';

Since the pattern is always defined, you will get a fast 404 page instead of a full one, if the pattern matches.

Can we remove this change in behavior, for example, by introducing another variable to toggle the entire functionality?

113 critical left. Go review some!

kbahey’s picture

Version: 8.x-dev » 7.x-dev
FileSize
4.64 KB

Fair point sun. Thanks for the catch.

Here is a version of the patch that is completely off by default. The user needs to uncomment stuff in settings.php if they want that on.

Dries, it REALLY does not change any behavior at all, and fixes a performance bug.

Dries’s picture

I'm not ready to commit this, and want to discuss this some more.

For example, I'm not a fan of the list of extensions. For site visitors, it doesn't make much sense to get a different 404 depending on file extension.

Maybe we should go back to the Drupal 6 solution where 404 pages where fast (because blocks where omitted)? We should try to make all 404s fast and avoid more complex solutions.

kbahey’s picture

The 404s that are emitted because of static files are often invisible. For example, the site owner changes the name of a small PNG from CSS, or misspells it. The page renders normally for the end user, but this PNG (and perhaps others) triggers 404s that the user will never see. Despite these not being seen by the user, they do a full Drupal bootstrap with all the overhead. This solution avoids this unnecessary, yet invisible, overhead.

We are removing known overhead for known file types. The rest of 404s require a Drupal bootstrap to check the menu, ...etc. So we can't do much about them until we reach that stage. But why go that deep into Drupal only to send a 404 that the user will never see?

Even if the user sees the error, it is the site admin's decision to show them a full or brief error message. We just give them the means to do so, not force them to.

On all high traffic sites that 2bits.com manage, we have a variant of this in settings.php just to avoid this overhead. We look at the extension and emit a 404 and then exit right in settings.php.

As for the block-less 404, I personally don't like it since it looks inconsistent and odd, and only removes one overhead source (blocks) not the full bootstrap.

catch’s picture

Drupal 6 only omitted the left and right sidebars, not all blocks, and many sites don't use block module at all, likely to be less in D7 now it's optional.

Also, omitting blocks isn't much of an optimization depending on the site, say 10/20% of total page generation. Skipping rendering altogether is a good 50% or more on D7 since so much happens in drupal_render_page(), the two can't really be compared.

Additionally the D6 optimization often meant that search boxes or the navigation menu were omitted on 404s . Quick search shows a tonne of issues complaining about that optimization, following list isn't exhaustive. It's also not a simpler solution - requires messing about with the rendering process, and there are still cleanup patches outstanding for it:
#221399: Allow administrator to decide if blocks should be displayed for site_404 pages
#423992: Remove show_blocks page variable
#141162: 403/404 errors should allow blocks
#423696: Allow display of menu links on 404 error page

On the patch here, I fully agree with Khalid that most 404s for files are invisible - on one site I run which has been going since 2003, 2005 on Drupal, the majority of 404s for images are very old ones which were embedded in pages on external sites - unless you look in the net tab on firebug you might not even realise those images were ever supposed to be displayed. Going to all the effort of a full bootstrap for those is a bit much.

Even if someone visits your site directly and gets a 404 for an image, they're very unlikely to see the full page render because they'll be viewing a page with that image embedded too. The only case where you'll actually see the 404 for a file, is if it's linked to directly somewhere, both within sites (which tend not to link to dead files so much) and external links, neither are really the common case - and even then a lot of those hits will be from crawlers.

However, I'd like to see this patch remove the special handling for favicon.ico in .htaccess, or at least disable it by default. Seems like that was a stopgap which mainly went in because this issue hadn't. I'm also not too sure about having the function defined twice - are we 100% sure that's never going to cause issues with opcode caching etc?

kbahey’s picture

FileSize
5.31 KB

Catch, thanks for the review.

The attached patch removes the special handling for favicon in .htaccess. Yes, that was a special stopgap and is no longer needed with this new approach.

I can't find anywhere that the function is defined twice. It is defined in bootstrap.inc once only.

$ grep -r drupal_fast . | grep -v patch
./sites/default/default.settings.php:# drupal_fast_404();
./includes/common.inc:        drupal_fast_404();
./includes/bootstrap.inc:function drupal_fast_404() {
catch’s picture

That'll teach me to review patches on Sunday night, obviously it's not defined twice, just called in two different places.

Next question is what happens if you use image derivatives (or whatever imagecache is called now) with the settings.php version? I'm not up to speed with how it handles requests in D7 now, but if that's going to be affected, I think it's worth adding a particular note either way there that it's incompatible - given that's part of core now. Also if it /is/ incompatible, given that additional optimization is only in settings.php so doesn't require a core hack, it could be taken out without making the overall patch any less useful.

Dries’s picture

From an end user point of view, I actually like the fact that all pages do a full bootstrap because all 404s create an entry in my watchdog -- that is very convenient and allows me to actually fix 404s. I use that all the time.

moshe weitzman’s picture

@Dries - sure, that watchdog is very handy for some sites. Those sites will do nothing and still get the watchdog.

This new line in settings.php is for sites that can't afford a full bootstrap because of scaling constraints. This is a valuable weapon for those sysadmins. As you are hearing here, we've all been doing this in our own hackish way in settings.php for *years* and now we are trying to share that with the world. If this doesn't get committed, then only those in the know get the weapon. Owen, kbahey, catch and myself, don't need this patch. We're doing it to benefit everyone else.

I'm pretty sure that the sysadmins for Drupal Gardens and Buzzr would want this tool in their belt.

kbahey’s picture

@catch, I may be mistaken, but this is what I could find from a quick search on image derivatives:

$success = file_exists($destination) || image_style_create_derivative($style, $path, $destination);

So it seems to check the physical file on the file system, not rely on a 404. So that part does not need further action for this patch. If someone finds otherwise, I am happy to add a comment.

@Dries, the 404s in the watchdog are mostly noise on big sites. Old links from other sites, old elements from themes that have been changed for a while, ...etc. It is rarely about real full pages. In that case, the user still has the option to skip the new fast 404 as Moshe said.

kbahey’s picture

@Dries,

Here is what I use for several high traffic site, in settings.php.

So this patch comes out of real needs for real sites.


$error_404_header = 'HTTP/1.0 404 Not Found';
$error_404_text = '<html><head><title>404 Not Found</title></head><body><h1>Not Found</h1><p>The requested URL was not found on this server.</p></body></html>';

if ($_SERVER['QUERY_STRING'] != 'rss.xml' || preg_match("/\.(txt|png|gif|jpe?g|shtml?|css|js|ico|swf|flv|cgi|bat|pl|dll|exe|asp|xml)$/", $_SERVER['QUERY_STRING'])) {
  header($error_404_header);
  print $error_404_text;
  exit();
}
catch’s picture

@Dries, there's two optimizations here, one is to avoid the full bootstrap altogether, the other is to not do a full page render for those paths. With the latter, you still get a very large performance gain for those particular file extensions (at least 50%), but you also get the watchdog message. There's currently no way to do that without a core hack at the moment.

Owen Barton’s picture

Here is a table describing the status-quo and the 2 options added by this patch:

Configuration A- disabled Configuration B - called from drupal_deliver_html_page() Configuration C - called in settings.php
Heavy 404 page generation cost Smaller 404 page generation cost (~50% of A?) Minute 404 page generation cost (< fastpath cache)
Slow 404 page response time (e.g. 1-10 seconds) Faster 404 page response time (e.g. 0.5-5 seconds) Very fast 404 page response time (e.g. < 0.2 seconds)
Large 404 page size (2,000-30,000+ bytes, depending on site) Tiny 404 page size (~200 bytes) Tiny 404 page size (~200 bytes)
Inlined assets (images, JS, CSS) 404 as normal Inlined assets (images, JS, CSS) 404 as normal Inlined assets (images, JS, CSS) 404 as normal
Directly requested assets show fully themed 404 page Directly requested assets show simple 404 page Directly requested assets show simple 404 page
All 404s logged in watchdog as normal All 404s logged in watchdog as normal "Fast" 404s not logged in watchdog
All 404s logged in Apache logs as normal All 404s logged in Apache logs as normal All 404s logged in Apache logs as normal
Core image api works normally Core image api works normally Core image api works normally
404 based lazy loaders (e.g. D6 imagecache) work normally 404 based lazy loaders (e.g. D6 imagecache) work normally 404 based lazy loaders (e.g. D6 imagecache) don't work without tweaking the regexp
Pages with aliases that match regex load as normal Pages with aliases that match regex load as normal Pages with aliases that match regex 404 instead of load

From a scalability point of view 404s can be a huge problem for sites with old domain names that have gone through multiple re-structurings over the years. The 404 traffic simply from hotlinked files alone can be a significant draw on resources. The current setup is also a huge risk for active sites - if you add a 404 to a busy page, or perhaps accidentally move, chmod or otherwise make the files directory inaccessible, server load instantly skyrockets and can render the server unresponsive.

From an end user point of view this is also a major issue. Browsers will patiently sit around with the loading spinner going waiting for Drupal to fully bootstrap, generate a beautiful, fully themed page to represent an image or css file, that the browser then completely ignores as soon as it notes the 404 http code. Not only does this mean that we end up waiting for Drupal to bootstrap twice (or more!), but the response is at least a factor of ten larger than it needs to be. In some cases (e.g. whitehouse.gov or amnesty.org) it is over 100 times larger than necessary. Multiply this by all the 404s on all the Drupal 7 sites, and the end result is a lot of end user wait time, and a lot of bandwidth and cycles wasted.

Currently the default is configuration A - while this is safe and conservative, it has no benefit - most sites and users would be left with the same issues we have now (and a re-broken default favicon 404 to boot). I would strongly argue that if you consider the table above it is clear that configuration B is actually a perfectly reasonable and safe default for the overwhelming majority of sites.

Owen Barton’s picture

FileSize
5.35 KB

Here is an updated patch with configuration B as the default, which I think is the best choice by far. If the only way this can get committed is disabled by default then I think we need to go with the patch in 163, because we can't really remove the favicon.ico workaround if the fix is disabled.

kbahey’s picture

The whole idea for the earlier version (#167) is reduce resource usage and improve performance and scalability. The patch in #175 is a step back. Although it is better than the status quo, it is still too late in the boot process and consumes a lot of unnecessary resources.

I would like us to seriously consider #167 for commit.

Owen Barton’s picture

@kbahey - I think you have misread my patch - #167 is more conservative that #175 - the former is completely off by default, and identical to the status quo in terms of resource usage and scalability. #175 cuts out a significant chunk of process and bandwidth by default on all sites, yet retains almost identical behaviour and logging. If an admin decides they need more savings they can comment out the exact same settings.php line you have in #167 and return a 404, skipping the rest of the bootstrap.

In addition #167 means that favicon.ico is again broken by default, and will be creating a full bootstrap on every page (even cached pages) just to generate the 404 response to favicon.ico, which is completely unacceptable IMHO and why I suggested #163 as a fallback.

jpmckinney’s picture

I prefer having configuration B as described in #174 as the default, so I endorse the patch in #175.

If configuration A must be what is commited, then I agree that #163 is preferable to #167.

I would have set the status to RTBC, but it's already the case.

kbahey’s picture

@Owen Barton

My bad. I understand it now. It is better than the previous iterations, since it still preserves the watchdog. The early bailout functionality is optional and off by default and that is what I am looking for.

So +1 from me.

Dries’s picture

Would like to hear from moshe or catch on approach B. Seem like it is the only option given some of the regressions with option C. Great table Owen!

moshe weitzman’s picture

I'd be thrilled to have B as default. #175 gets +1 from me as well.

catch’s picture

B sounds great to me as well, don't think we can get away with C as a default.

geerlingguy’s picture

@kbahey / #172: I tried copying and pasting verbatim at the end of my settings.php (without the open/close PHP tags), but Drupal wouldn't load after doing so... is there something else you did to get the above code to work for 404s on files? [Edit: just noticed, there's no logic to check if we're at a 404 status here... thus all files will be served as 404]

My site gets about 500-1000 404s per minute for hotlinked images which have gone missing now, and this causes my D6.x installation to consume a constant background CPU usage of about 1 (for a 4-core server).

Also, subscribe.

kbahey’s picture

I emailed you the code via the d.o contact form so as not to create further noise in this issue (and dries can commit it ...)

ddorian’s picture

little bump

Christefano-oldaccount’s picture

ddorian, that's really not necessary. Please don't do that.

moshe weitzman’s picture

#175: 76824-14.patch queued for re-testing.

sun’s picture

+++ sites/default/default.settings.php	12 Apr 2010 17:54:10 -0000
@@ -393,6 +393,38 @@ ini_set('session.cookie_lifetime', 20000
+$conf['404_fast_html'] = '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>404 Not Found</title></head><body><h1>Not Found</h1><p>The requested URL /' . $_GET['q'] . ' was not found on this server.</p></body></html>';

Is $_GET['q'] set in D7 when settings.php is included?

85 critical left. Go review some!

kbahey’s picture

FileSize
6.02 KB

Actually, yes, it works at this point. Just add a print there and you will see it.

I put a better path using base_path() and $_GET['q'] in common.inc. base_path() cannot be used in settings.php though, so just enclosed it in double quotes.

Also, I did some benchmarks on this last patch, and here are the results:

Using apachebench, and visting example.com/none (which is a 404), with 5 users, 1000 requests

1. Without patch
Requests per second: 49.71

2. With patch, default settings
Requests per second: 51.11

3. With patch, with function uncommented in settings.php
Requests per second: 51.15

When visiting example.com/none.jpg (which a 404 that triggers a fast 404), 5 users, 1000 requests

With patch, with function uncommented in settings.php
Requests per second: 352.31

Orders of magnitude faster, now with numbers ...

Owen Barton’s picture

I wonder if the $_GET['q'] needs a check_plain - I think it might.

sun’s picture

Status: Reviewed & tested by the community » Needs work
+++ includes/bootstrap.inc	1 May 2010 03:06:16 -0000
@@ -2063,6 +2063,26 @@ function drupal_maintenance_theme() {
+    print variable_get('404_fast_html', '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>404 Not Found</title></head><body><h1>Not Found</h1><p>The requested URL /' . $_GET['q'] . ' was not found on this server.</p></body></html>');

+++ sites/default/default.settings.php	1 May 2010 03:06:21 -0000
@@ -393,6 +393,38 @@
+$conf['404_fast_html'] = '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>404 Not Found</title></head><body><h1>Not Found</h1><p>The requested URL "' . $_GET['q'] . '" was not found on this server.</p></body></html>';

So effectively two issues:

1) check_plain().

2) The actual path should really be injected via strtr() during output, and not during variable assignment.

Powered by Dreditor.

kbahey’s picture

Status: Needs work » Needs review
FileSize
6.1 KB

Both issues fixed now. The variable has a @path token that is replaced via strtr(), and also check_plain() is used.

moshe weitzman’s picture

Status: Needs review » Needs work

you can't run t() on dynamic strings

Owen Barton’s picture

Status: Needs work » Needs review
FileSize
5.93 KB

This fixes the t() call, and also replaces the $_GET['q'] with request_uri() (tested in all 3 usages), which I think is exactly what we need, since we want to show everything the after the domain part of the URI, and $_GET['q'] doesn't have the base path, and also can get messed with by language prefixes and similar systems.

Status: Needs review » Needs work

The last submitted patch, 76824-17.patch, failed testing.

kbahey’s picture

Status: Needs work » Needs review
FileSize
6.18 KB

I don't understand why this failure is happening after t() was fixed.

Rerolling for offsets against current HEAD.

Status: Needs review » Needs work

The last submitted patch, 76824-18.patch, failed testing.

moshe weitzman’s picture

The test appears flawed. It assumes that a language is always available but this is very early in the bootstrap so it is OK to call t() [this is new in d7], but only custom string translation will be performed.

kbahey’s picture

The flaws of automated testing ...

I can't fix the test to detect early and normal though ...

Moshe, would you consider using strtr() here appropriate?

geerlingguy’s picture

Just FYI, using this code (for me) has caused some problems with ImageCache; instead of IC getting wind of an attempt to access a .jpg, .png, etc. in one of the ImageCache directories, then creating the image, Drupal throws a 404 before ImageCache is called... something to watch out for.

ajayg’s picture

Perhaps having a check on the path where imagecache stores the cached images will do the trick.

If the path of image requested matches a parttern ( say "cache") then don't do anything else send 404.

Owen Barton’s picture

@geerlingguy and @ajayg - please see comment 174 this patch will not break imagecache type modules by default, and will not break the image generation ("imagecache in core") in Drupal 7 even when the optional settings.php "fast return" is enabled. If you use Khalid's code on a Drupal 6 site then it will break Drupal 6 imagecache if you don't add extra conditions. However, please do not discuss imagecache or Drupal 6 in this issue - this causes clutter and distraction.

WiredEscape’s picture

subscribe

alanburke’s picture

Status: Needs work » Needs review
Issue tags: -Performance

#196: 76824-18.patch queued for re-testing.

Status: Needs review » Needs work
Issue tags: +Performance

The last submitted patch, 76824-18.patch, failed testing.

scor’s picture

Status: Needs work » Needs review
FileSize
6.03 KB

The test fails because this hunk adds the 404 uri in the page

+++ includes/common.inc	1 May 2010 16:24:47 -0000
@@ -2362,7 +2365,7 @@ function drupal_deliver_html_page($page_
-          $return = t('The requested page could not be found.');
+          $return = t('The requested page "@path" could not be found.', array('@path' => request_uri()));

which will contain the $langcode, something that is not expected by locale.test. It appears to me that this test in locale.test (lin 263-265) is a hackish way of checking for a 404 anyways, so I've fixed that.

moshe weitzman’s picture

Status: Needs review » Reviewed & tested by the community

Gosh, I thought this was in already. Thanks scor.

Owen Barton’s picture

FileSize
4.64 KB

Rerolling for patch fuzz...

marcingy’s picture

#208: 76824-fast_404_208_0.patch queued for re-testing.

marcingy’s picture

Priority: Normal » Major
moshe weitzman’s picture

It passed again. Pretty please commit please.

Anonymous’s picture

Is there any intermediate 6.x solution, while we all wait to move to d7?

I currently hacked common.inc's drupal_not_found() to directly load a custom static 404.html with some extra htaccess rules to handle files 404's.

geerlingguy’s picture

@morningtime - Here's code for Drupal 6 only (paste it at the bottom of settings.php):

/**
 * 404 Handling, to conserve server resources upon missing image/text/non-html files.
 * See: http://drupal.org/node/76824#comment-2834536
 * Note: JPG, PNG, GIF don't work with ImageCache, so can only do non-image files here...
 */
if (preg_match("/\.(txt|shtml?|css|js|ico|swf|flv|cgi|bat|pl|dll|exe|asp)$/", $_SERVER['QUERY_STRING'])) {
 header('HTTP/1.0 404 Not Found');
 print '<html><head><title>404 Not Found</title></head><body><h1>404 Not Found</h1><p>The requested URL was not found on this server. If you think you reached this page in error, please visit the <a href="http://archstl.org/">Archdiocese of Saint Louis home page</a> and search for the page or file for which you are looking.</p></body></html>';
 exit();
}
kbahey’s picture

@morningtime
Don't hack common.inc. Kittens die horrible deaths. There is an easier and more maintainable solution, with only a change in settings.php.

@greelingguy
If you use imagecache, then you have to exclude its paths from being handled.

Here is the code I use for most sites: Reducing server resource utilization on busy sites by implementing fast 404s in Drupal.

scor’s picture

FileSize
6.03 KB

the default.settings.php hunk is missing from #208, rerolling.

Anonymous’s picture

@kbahey
Thanks! I learned a lot from your DrupalCon video!

Owen Barton’s picture

Issue tags: +Performance
FileSize
6.14 KB

I did some benchmarks to show the effect of the current slow inline 404s on browser display latency for end users. These were done using http://www.webpagetest.org/ testing the same site (in different configurations, 404s introduced through a node body promoted to home page) using default settings, 10 test runs each time.

First View Summary (seconds)
  No patch Patch Patch (fast 404 enabled)
No 404 2.591
Image 404 2.944 2.881 2.525
CSS 404 3.571 2.890 2.501
Percentage Improvement
  Patch Patch (fast 404 enabled)
Image 404 2% 14%
CSS 404 19% 30%
Absolute Improvement (seconds)
  Patch Patch (fast 404 enabled)
Image 404 0.063 0.419
CSS 404 0.681 1.07
Repeat View Summary (seconds)
  No patch Patch Patch (fast 404 enabled)
No 404 0.886
Image 404 1.716 1.488 1.011
CSS 404 1.604 1.500 1.287
Percentage Improvement
  Patch Patch (fast 404 enabled)
Image 404 13% 41%
CSS 404 6% 20%
Absolute Improvement (seconds)
  Patch Patch (fast 404 enabled)
Image 404 0.228 0.705
CSS 404 0.104 0.317

See attached file for more detailed benchmarks.

The results are not insignificant - an inline 404 currently slows down the parent page display by anything from 500ms to a full second, depending on the type of 404. Committing the patch will shave something like 100ms-500ms off that time. If site builders enable the "fast 404" return in settings.php the pages display almost as fast as if the 404 wasn't there - anything from 500ms to a full second faster than the current performance. In other words, this patch clearly has enormous benefits for both end user responsiveness, in addition to the server utilisation benefits described in depth above.

BenK’s picture

Thanks for the benchmarks... very interesting indeed!

webchick’s picture

This has Dries's fingerprints all over it, so I'll leave it to him to either commit or D8 this.

naxoc’s picture

Subscribe

scor’s picture

FileSize
6.05 KB

patch didn't apply any longer

Owen Barton’s picture

Summary is at #174.

Server side benchmarks are in #189 and front-end benchmarks are in #217.

sbozhko’s picture

Issue tags: -Performance

#221: 76824-fast_404_221.patch queued for re-testing.

sun’s picture

#221: 76824-fast_404_221.patch queued for re-testing.

sun’s picture

Version: 7.x-dev » 8.x-dev

Although badly needed, this is D8 material according to the rules (I had to learn today). It may be backported at a later point in time (though that's unlikely).

moshe weitzman’s picture

Version: 8.x-dev » 7.x-dev

recover from temper tantrum.

Owen Barton’s picture

This has been RTBC since June. Would love to get this in, or find out what is needed to get to that point! See meta-summary of summaries/benchmarks in #222

moshe weitzman’s picture

Issue tags: -Performance

#221: 76824-fast_404_221.patch queued for re-testing.

Status: Reviewed & tested by the community » Needs work
Issue tags: +Performance

The last submitted patch, 76824-fast_404_221.patch, failed testing.

moshe weitzman’s picture

Status: Needs work » Reviewed & tested by the community
FileSize
6.08 KB

Reroll. Keeping at RTBC since there are no changes.

Status: Reviewed & tested by the community » Needs work
Issue tags: -Performance

The last submitted patch, 404.diff, failed testing.

moshe weitzman’s picture

Status: Needs work » Needs review
Issue tags: +Performance

#230: 404.diff queued for re-testing.

moshe weitzman’s picture

Status: Needs review » Reviewed & tested by the community

Bot finally recovered from its bender. Back to RTBC.

tom_o_t’s picture

#230: 404.diff queued for re-testing.

webchick’s picture

Another one with Dries's fingerprints all over it.

Leaving for The Spikey Haired One to make the call.

rfay’s picture

Status: Reviewed & tested by the community » Needs work

Shoot, sorry for the interruption here... just noticed this.

Related issue: #177790: Standalone scripts like cron.php can be erroneously cached if they go temporarily missing.

Dreamhost regularly kills off running cron processes, creating an artificial 404 on cron.php. Which Drupal then caches oh so disfunctionally.

It would be fantastic if cron.php and update.php and the other php files in the root directory could be added to this list to resolve that longstanding WTF tension between Dreamhost's bizarre process-killing and Drupal's tendency to page-cache the wrong thing.

If that's inappropriate, feel free to go back to RTBC.

agentrickard’s picture

@rfay - given that this is my oldest core issue, I'd hate to see it derailed due to one host's sloppiness.

Is there a way that the Dreamhost error can be fixed in contrib or config, or do we have to account for that here?

rfay’s picture

@agentrickard... I understand completely. I was just mentioning this weird and problematic Drupal behavior that would be fixed by adding 3 words to the patch (I think).

If I read the patch correctly, all it would take to fix a Drupal install with the weird caching-404 problem would be to add cron.php to the variable provided by this patch. So it doesn't even take contrib.

But since cron.php and update.php are clear cases that should *also* not be cached, I just thought it might be nice to get those in.

I understand completely if you just set this *right back* to RTBC. I was just stopping by, and happy to see this happening.

Owen Barton’s picture

Status: Needs work » Reviewed & tested by the community

I don't think this is the right place to fix the cron.php issue - this issue is about a performance improvement that can optionally be disabled: if it was disabled the bug would reoccur. Also, a few contrib modules add their own .php entry points (for example ad module uses an adserve.php for javascript rotated ads), and a hardcoded "cron.php|update.php" in the variable means that these entry points would still be susceptible to that issue. It sounds to me like a better fix would be to test that the index.php entry point was used before caching a page, set the $conf['cache_inc'] cache.inc default in index.php (or set it to cache-install.inc in cron.php) or some similar approach.

Owen Barton’s picture

#230: 404.diff queued for re-testing.

Fidelix’s picture

I see this makes changes to .htaccess

Will this work with other webservers like nginx, lighttpd and cherokee?

Apache is not the only one running out there.

geerlingguy’s picture

@Fidelix / #241 - I know nginx doesn't support .htaccess files at all (a different mechanism is used for rewrites and such).

However, we have the web.config file for IIS (I think), and .htaccess for Apache. Those who use other server environments will need to do the work to translate the rewrite rules correctly for their own servers. There are hundreds of different web server projects out there—the htaccess hits the overwhelming majority of use cases for Drupal...

catch’s picture

The only change to .htaccess here is the removal of the special casing of favicon.ico.

kbahey’s picture

Version: 7.x-dev » 8.x-dev

This did not make it to Drupal 7.x.

Sigh ... moving to 8.x ...

I guess I have to fulfill a vow I made to myself and kill the family cat

geerlingguy’s picture

With the length of this issue, we might need to consider either pushing this in soon, or opening up a new summary issue for further discussion.

catch’s picture

Version: 8.x-dev » 7.x-dev

I don't think we should push this to 8.x until HEAD opens up. Edit - a new summary issue might help yeah.

mlncn’s picture

This is a critical (in the layman's sense) performance patch and the RTBC patch works.

If only i'd come across this last night, i would have known what to do with my extra drink tickets and access to Dries. Please?

kbahey’s picture

@catch

Do you think it has a chance to make it into Drupal 7.x?

I don't think so, personally, but if you think it would, then I would be pleasantly surprised.

That cat is doomed ...

jcisio’s picture

Subscribe. Love to see it get in D7.

Vacilando’s picture

Subscribing.

soyarma’s picture

Myself and others have created a module which achieves essentially what is discussed here, as well as very early URL checking (either right out of settings.php, or within hook_boot). Right now there is a D6 release and a D7 release is in the works.

http://drupal.org/project/fast_404

perusio’s picture

#241

Nginx doesn't need anything of this sort. It serves the static files first if they exist. AFAIK this is Apache specific.

Hop in to the Nginx group on g.d.o and check a few configs that do the right thing (tm).

Owen Barton’s picture

@perusio (#252) this is incorrect - this patch has just the same benefits for Nginx and any other webserver set up so that Drupal is still responsible for serving 404s (which it almost certainly will be if clean URLs are configured). The patch means that Drupal does reduced processing (either massively, or somewhat - depending on the configuration), and also returns a lower bandwidth response. This all happens in the PHP - the webserver has no effect here, other than checking that an actual static file doesn't exist before handing control over to PHP - this behavior is the same on all webservers that support Clean URLs, and is unchanged by this patch.

@Fidelix (#241) - this does actually not modify .htaccess, other than removing a hardcoded check for non-existent favicon.ico that is no longer needed. Earlier patches did use .htaccess to return lightweight 404s, but that turned out to not be a good approach for most sites. As described above this change is completely independent of whatever webserver is used.

perusio’s picture

Not really if you setup the proper rules in your nginx config. Supposing you setup a static page for handling the 404s.

location ~* ^.+\.(?:jp[e]*g|png|ico)$ {
    try_files $uri /404.html;
}

location ~* ^.+\.(?:css|js)$ {
    try_files $uri =204; # this returns a 204 status: no content
    access_log off;
}

Of course if you relay it to Drupal then the patch applies. But for static files you can be out of harm's way. In the case of images the issue gets complicated because of image cache. If you don't use it than the above config avoids altogether touching the PHP layer.

Also the configs discussed at the g.d.o Nginx group handle the static files properly. The above is just a random example. A proper config is more complex.

which it almost certainly will be if clean URLs are configured

No: almost 'certainly' it isn't. This all happens because you're relaying all 404s to Drupal.

jcisio’s picture

perusio, your nginx conf is the first attempt in the patch in #0 in 2006. Since then it has evoluted.

perusio’s picture

#255 I beg to differ. This all issue happens because of this:

# Make Drupal handle any 404 errors.
ErrorDocument 404 /index.php

If you remove it then this is patch is completely unnecessary. Now I'm afraid my mod_rewrite chops are quite limited and in that sense I don't know if you really need this for things like imagecache to work properly with Apache.

But I suppose that you can write a rewrite rule targeted to the imagecache URIs and act upon it. Of course the above "catch all" directive handles it out of the box. So no need to instruct people running imagecache for inserting such a rule in their .htaccess.

And it's not "my" config, but rather common shared knowledge between anyone that bothered to lookup on prior art before running drupal with nginx.

agentrickard’s picture

If you remove that (without access to server-level host configuration files), Drupal won't function.

perusio’s picture

Ok. I'm not privy to the inner workings of Apache and how all the container context delegation thing operates really. Which I suspect is what is being adressed here.

In Nginx there's no .htaccess equivalent. To reload a config you send a SIGHUP to the running instance and the config is read into memory where it stays unchanged until you send a new SIGHUP.

Can you elaborate on your comment? I would like to know how things are handled in Apache so that the ErrorDocument for 404 is needed.

Thanks

agentrickard’s picture

Drupal (and WordPress and Joomla) are giant 404 handlers. That is, they process any URL request that the webserver cannot. All Drupal requests are actually routed through example.com/index.php?q=path/of/request.

On shared hosts, most people don't have access to Apache configuration, so those rules have to be set in .htaccess. The side effect, of course, is this issue, which has been around for years.

For people with admin control over their hosting environment, be it Apache, Nginx or what have you, this can be remedied quite easily. For others -- including the non-technical user -- we need to provide sane defaults for them rather than the catch-all method of routing _all_ 404s through Drupal.

geerlingguy’s picture

@agentrickard - could you document that procedure (at least for Apache) here, or maybe add a documentation page to the performance section of the handbook, for everyone's benefit? It'd be nice to be able to have a wiki page where we can point people to for a brief overview of this huge issue...

perusio’s picture

Issue tags: -Performance

Oops. Browser snafu made me cancel the submission but the form hand been already submitted. Sorry :(

See comment below.

perusio’s picture

Issue tags: +Performance

@agentrickard I already knew that Drupal et al are "giant 404 handlers". That's not the point I was trying to get at. Instead I was under the impression that Apache context delegation thing doesn't allow to route around the issue of setting Drupal as the single 404 handler. So let me recap, this is my take on this issue, you and others feel free to chime in.

  1. There's an issue with Drupal in that static files that are a subrequest of a page request are not served directly, in the sense that if they're not found the default 404 handler is index.php so you always perform a full bootstrap in order to signal the 404. That's expensive, hence the appropriate term of expensive 404s coined by kbahey, AFAIK.
  2. Back in 2006 when the awareness of this issue was first raised there were attempts to route around the issue by placing new directives in the .htaccess file so that static files would have a static 404 handler instead of index.php.
  3. Due to the complexities and openness of the Drupal way, there isn't any way to have an .htaccess file that can possibly contemplate all hypothetic situations. An obvious example is imagecache. If the image is not found then you should fallback to drupal in order to (re)generate the missing image.
  4. The above patch resulted from a compromise between letting the expensive 404s issue linger around and degrade performance, and instead inserting some code in the early bootstrap process so that the 404s would be much less expensive.
  5. This can be solved in both Nginx, Apache, Lighty and I presume IIS also, by defining a location at the vhost level so that static files are served directly and there's no need to have either this patch or the ErrorDocument 404 index.php directive.

Is that a fair assesment of the issue? If so then there should be an handbook page documenting this problem and the possible strategies to solve it. There's no inexorable need to either use the patch or the ErrorDocument directive. From what I gather that's a mere convenience that Drupal provides. It's nice for non-technically inclined people and also if you cannot have it configured as it should at the vhost level. There should be also accompanying documentation for Drupal when this patch gets merged in the regular Drupal core release.

jcisio’s picture

@perusio: Your 2) and 5) are the same. AFAIK, there is no difference between the .htaccess and vhost level in routing a request/serving a 404 error.

Since very early, we don't want to solve it at the web server level as it is too restrictive (and not universal, maybe).

perusio’s picture

@jcisio not really. AFAIK you cannot specify locations at the directory context in Apache. The Location directive cannot be used in a directory context (.htaccess). This must be set at the Server or Virtual Host level.

What do you mean by "not universal, maybe"? I suppose all web servers let you control locations.

agentrickard’s picture

Three responses.

1) This new discussion is off-topic, and should be moved to a documentation issue (or jut to a docs page).

2) @perusio, you are ignoring the fact that a vast number of Drupal users have no server access to the settings you suggest we change. Or are you arguing that this should just be a performance documentation issue instead of a code-level fix?

3) I am not qualified to write the documentation page. Someone who knows more about Apache should take a stab at it.

Let's get back to actual reviews of the patch and its worthiness.

In this tangent, I think the original question is "Does this patch break Nginx?"

perusio’s picture

@agentrickard regarding 2). What I'm suggesting is that this issue be well documented and the tradeoffs explained.

Regarding 3) I'm not either. But I can seed it with the little I know. I no longer use Apache altogether. I can make a first draft and let others more knowledgeable then I improve it. I can take a stab at writing the Nginx related part, which is a nice segway into your last question.

This patch is unnecessary for Nginx with a decent config, unless you insist on letting index.php be the 404 handler. In most configs I've seen out there that's not the case. This patch introduces a performance penalty since you're checking if the file being requested is a static file or not. It may be small or not, it varies with the load and further investigation is required.

I'm not ignoring "vast number of Drupal users have no server access to the settings you suggest we change". What I'm saying is that if you have, like it's the case with anyone running Nginx, this patch is unnecessary.

I fail to see how it's "off-topic" since I believe the objective is to document as well as possible the tradeoffs involved. Althought I agree that a documentation page is now the proper venue to pursue this issue further.

EDIT: Regarding a documentation page in the handbook. The only thing related I could find is: http://drupal.org/node/56773

The handbook pages could benefit of a server related section IMHO. Does anyone here has power to start such a book page? I will gladly seed it with this issue as a section.

soyarma’s picture

Hey Perusio;

I think your solution makes a heck of a lot of sense. Since (I imagine this may not be the case, but it usually is in my experience) Nginx has access to the static filesystem it can check too to see if the file exists. There is no need for it to pass every 404 off to Apache and then Drupal. Of course, by the same extension even if you don't have Nginx, you can do this in apache in a vhost.

If you give the nginx script essentially the same list of extensions as the patch and tell it to go to a static 404 if those files don't exist (heck make it a purty page and have your reverse proxy cache it) you'll save a ton of time in not going through 2+ other layers to server the 404 when nginx could have done it up front.

Definitely a not seeing the forest for the trees sort of thing.

Let me know what your doc page is when you have it rolling and I'll link to it from my Fast 404 module (still useful for folks who can't modify server configs).

Owen Barton’s picture

@perusio - I am not following your logic here. Yes, if you use Ngnix configured to catch 404s for specific file extensions this patch is unnecessary. The same can be said if you use Varnish (which can easily accommodate these rules), various CDNs, or your custom built RFC1149 based protocol.

All of these approaches do have drawbacks (see above and below), are inaccessible to a great many users and are hence inappropriate solutions for Drupal core. Yes, of course they should be documented - but this issue is regarding a core patch, and discussion of documentation should focus on the patch and it's behavior. Trying to wedge documentation about the numerous alternative approaches that skilled site builders could employ is out of scope - please open a documentation issue to explore this, add comments to existing documentation pages, and/or apply to join the documentation team. By all means link to a documentation project issue from here.

I don't believe it is correct to say "with anyone running Nginx, this patch is unnecessary" - this is only true if the user has configured these rules, and that is not universally true - for example the first generic configuration on the Ngnix g.d.o. description (for MTecknology) does not have the required rules. In any case, even if this patch is unnecessary with a particular configuration, I don't see how that has any bearing on this patch.

Finally - dealing with these 404s in the webserver or proxy has a number of issues or side effects (that should be made clear in any documentation) - here is a summary:
- They requires root access on some web servers
- Imagecache style modules and/or non-default file directory paths each require a specific exclusion (that may require knowledge of regular expressions to implement)
- If a site uses Drupal's alias or redirect modules to handle legacy (non-404) paths ending in .html or similar, these may require specific exclusion (or vice versa if these are not used)
- Drupal no longer logs these 404s in the watchdog, which means many less skilled web admins will have no idea there is a static file 404, which is a problem in many situations
- Finally, the biggest issue with doing this in Apache (that Drupal supports, through it's default .htaccess file) is that the most effective rule prevents 404 paths from being logged at all (even in the webserver logs).

Can we please focus on the patch at hand which avoids all of these problem, whilst enabling a reasonable and simple high performance compromise in settings.php, and still doesn't prevent any of the web-server/proxy specific approaches from being implemented.

catch’s picture

This patch introduces a performance penalty since you're checking if the file being requested is a static file or not. It may be small or not, it varies with the load and further investigation is required.

Can you point out where this performance penalty is? The only checking I see is in drupal_fast_404(), and with the current patch, unless you manually edit settings.php, that code only gets executed if you are already going to serve a 404 from Drupal.

perusio’s picture

@catch

Yes it's dependant on a variable being set. I was under the impression that the variable was being set from an DB update or upon installation of Drupal. Is it not something that it will be enabled by default on stock Drupal installs?

In settings.php there's the definition of the regex mask for the type of files being checked.

If that's not the case then there's no penalty. Otherwise since drupal_fast_404() is invoked in drupal_deliver_html_page it always get executed. And I suppose that it's always faster to perform that check at the server level instead of doing it at the application level.

If I request the URI /foobar and if there's not a menu entry for that path the drupal_fast_404() function get's executed.

catch’s picture

@perusio, no that's deliberately left commented out in settings.php and it's a function call rather than a variable, so can't be affected by install or update (and there's nothing in the patch that affects install or update). That is purely an additional optimization over the main one this patch introduces, disabled by default.

otherwise since drupal_fast_404() is invoked in drupal_deliver_html_page it always get executed. And I suppose that it's always faster to perform that check at the server level instead of doing it at the application level.

Nope. drupal_deliver_html_page() will only execute that code if there is going to be a 404 anyway - since that whole hunk is inside a switch statement.

The only PHP being executed by default here is the assigning of two fairly short strings to $conf, which really is negligible by any standards (and that can be commented out by the same person that sets up the nginx config).

This leaves the fact we define the function at all in bootstrap.inc (which does have a small cost but barely measurable).

perusio’s picture

@catch Yes I know that. The function call in settings.php is for an even faster 404 since there isn't even a watchdog entry for the 404. The variable I'm talking about is in the condition of the if statement:

function drupal_fast_404() {
  $fast_paths = variable_get('404_fast_paths', FALSE); // This is the variable I'm talking about.
  if ($fast_paths && preg_match($fast_paths, $_GET['q']))
  (...)

It the $fast_paths is set, then the check is always performed. I gave the example of requesting a URI for which there isn't a menu entry, hence the switch statement branch active is the one with drupal_fast_404().

My issue with performance is related to the regex matching in preg_match the remainder is negligible I agree.

I'm reading directly from the patch #230 by Moshe. Isn't this the one being discussed? I don't see any other below that comment.

Owen Barton’s picture

@perusio - this regex in drupal_fast_404 is not "always" run. If you set 404_fast_paths to FALSE in settings.php then the regular expression never runs - php evaluates if statements in the normal approach - if the "left" test of an ANDed condition fails, it does not even run the "right" test. Hence, if you prefer to do this check at the webserver level (and accept the various issues I describe in #268), then you can disable this patches function completely.

perusio’s picture

@Owen Barton

Yes that's correct. But the patch enables it by default.

/**
   (...)
  * Add leading hash signs if you would like to disable this functionality.
*/
$conf['404_fast_paths'] = '/.(txt|png|gif|jpe?g|css|js|ico|swf|flv|cgi|bat|pl|dll|exe|asp)$/i'; // enabled by default

Also I gather from what's being discussed above that the patch is for Drupal 8. Patching settings.php is potentially "dangerous". In the sense that this is a file specific to a given install and might have comments, code blocks, what have you, that will make the patch fail for this file: be it because of a too large offset, or because there are changes that make the patch tool not being able to find a match for inserting the new stuff.

Owen Barton’s picture

@perusio - yes, we do enable it be default. I think it is very reasonable to ask people who would rather put this in their webserver/proxy configuration (and are able to understand the fairly complex tradeoffs involved in this situation) to place a single "#" in their settings.php. For the vast majority of Drupal users they will not have the desire or skills to do this in their webserver/proxy, and we should not expect them to need to change settings.php (this is why we have the installer, after all) - having a safe and faster default configuration is a huge benefit to them.

Note that we never "patch" running settings.php (for a start, core is unable to upgrade itself, and most users upgrade via wget). We would only change the default.settings.php, so that new sites would benefit. If this was committed to Drupal 8, then backported to Drupal 7, settings.php changes would go in the release notes for users to add themselves.

jcisio’s picture

@perusio: it is enabled by default so that if you tune your server settings, you can then disable this feature.

I can't get what you mean by "patching settings.php". Once this patch get committed, all that you do is comment out that line. Nothing dangerous here. You do it manually, as instructed many other modules (caching modules, for instance) or Drupal core ("translation" of a few strings).

catch’s picture

default.settings.php is copied to settings.php, so updates to that file don't affect existing settings.php on Drupal 7 sites that are already installed unless you copy the changes across manually.

perusio’s picture

@Owen Barton

Ok. I'll now move this debate to a documentation page and link it from here when available. Thank you.

Regarding the settings.php patching thing I was thinking on the case of upgrading core using drush. Or just even doing a simple:

tar zxvf drupal-7.x.tar.gz --strip-components=1 --exclude=sites/*

And if providing a patch. Which isn't the case. The update will be counseled through release notes.

ogi’s picture

subscribe

Summit’s picture

Subscribing, greetings, Martijn

catch’s picture

Version: 7.x-dev » 8.x-dev
Issue tags: +Needs backport to D7
joelstein’s picture

For anyone needing a temporary solution, here's what I have in my settings.php files for D7 sites. Also included here are paths to exclude, such as the "styles" paths (aka, "imagecache" in D6). I've included another exclusion, so you can see how multiple ones can be strung together.

/**
 * Fast 404 pages, to conserve server resources upon missing non-Drupal files.
 */
$conf['404_fast_paths'] = '/\.(txt|png|gif|jpe?g|s?html?|css|js|ico|swf|flv|cgi|bat|pl|perl|dll|exe|asp|jsp|php|pdf)$/i';
$conf['404_fast_paths_exclude'] = '/^((sites\/default\/files\/styles)|(some\/other\/url))/i';

if (!preg_match($conf['404_fast_paths_exclude'], $_GET['q']) and preg_match($conf['404_fast_paths'], $_GET['q'])) {
  drupal_add_http_header('Status', '404 Not Found');
  print '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>404 Not Found</title></head><body><h1>Not Found</h1><p>The requested URL "' . check_plain(request_uri()) . '" was not found on this server.</p></body></html>';
  exit;
}

For D6 sites, I use something slightly different:

/**
 * Fast 404 pages, to conserve server resources upon missing non-Drupal files.
 */
$conf['404_fast_paths'] = '/\.(txt|png|gif|jpe?g|s?html?|css|js|ico|swf|flv|cgi|bat|pl|perl|dll|exe|asp|jsp|php|pdf)$/i';
$conf['404_fast_paths_exclude'] = '/^q=((sites\/default\/files\/imagecache)|(some\/other\/url))/i';

if (!preg_match($conf['404_fast_paths_exclude'], $_SERVER['QUERY_STRING']) and preg_match($conf['404_fast_paths'], $_SERVER['QUERY_STRING'])) {
  header('HTTP/1.0 404 Not Found');
  print '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>404 Not Found</title></head><body><h1>Not Found</h1><p>The requested URL "' . check_plain(request_uri()) . '" was not found on this server.</p></body></html>';
  exit;
}

I hope this contributes in some way to the discussion.

EDIT: Adjusted to account for the note in #283.

joelstein’s picture

Status: Reviewed & tested by the community » Needs work
FileSize
6.38 KB

The patch in #230 has a problem. The regex doesn't escape the period, so paths that end with any of the extensions will issue a 404.

The attached patch fixes this.

jpmckinney’s picture

Status: Needs work » Needs review

Fixing status.

Status: Needs review » Needs work

The last submitted patch, 76824-283.diff, failed testing.

geerlingguy’s picture

Status: Needs work » Needs review
FileSize
5.77 KB

Updated git patch for 8.x-dev (looks like the other patch might've been against 7.x) straight from #283. Hopefully the testbot's happy...

anavarre’s picture

Subscribe

Tor Arne Thune’s picture

Owen Barton’s picture

Status: Needs review » Reviewed & tested by the community
webchick’s picture

YAY! Now that #1036132: Provide a mechanism for issue summaries is deployed, at 290 replies, this issue could really use a summary. :)

Could someone please edit the issue node using the issue summary template? That would really help both Dries and everyone else coming into this issue understand what's going on.

Owen Barton’s picture

Assigned: Unassigned » Owen Barton

Working on this now

Owen Barton’s picture

Assigned: Owen Barton » Unassigned

Summary is up - yay for issue summaries!

webchick’s picture

Assigned: Unassigned » Dries

Putting this into Dries's queue to review.

Anonymous’s picture

I applied the patch from #283 for D7. WARNING: enabling drupal_fast_404() prevents file styles images from being generated - it returns those requests as 404's! Has this been taken into account?

geerlingguy’s picture

@morningtime - Read through the issue summary, especially the notes about where regex (some tweaking) is required to make fast 404s compatible with image styles. I had to add any paths to the regex where dynamic content is created (anything drupal's path system doesn't know about that ends in a file extension defined in the fast 404 function).

Anonymous’s picture

@geerlingguy My point is why isn't that regex part of the patch Dries is going to look at? In it's current state, it's a dangerous patch, because only 100% of all D7 sites uses file styles.

webchick’s picture

Status: Reviewed & tested by the community » Needs work

That sounds like a needs work.

geerlingguy’s picture

Status: Needs work » Reviewed & tested by the community

@morningtime: by default, fast 404s are disabled, for precisely that reason. You have to know what you're doing to set this up, and its buried towards the bottom of settings.php with other advanced configuration options. You can turn on fast 404s if you want, at the settings.php level, by uncommenting the line with the function call. You can also enable them at a higher level, later in the bootstrap, for more safety when it comes to styles and such, but with a slight decrease in the performance impact.

Setting back to rtbc... Unless the contention is elsewhere.

marcingy’s picture

+1 on the rtbc this is much needed advanced feature

webchick’s picture

Status: Reviewed & tested by the community » Needs work

Damn, I really don't want to push this over 300 replies, but...

Advanced or not, I don't think it's acceptable to make the default regex break a core feature. It should be possible to put an exception for those paths in the regex.

Anonymous’s picture

@geerlingguy
>> You have to know what you're doing to set this up, and its buried towards the bottom of settings.php with other advanced configuration options.

I accept your logic. However, there should be an example regex included, or at least a warning it will break image styles. Neither was explained in the patch. So how are people going to figure that out? That's just my opinion.

By the way, settings.php is the first file any end-user looks at. You can't "bury" anything there :)

geerlingguy’s picture

I'll work on a new patch for D7 and D8 in a little bit. I actually need to set this up on a new site in a short bit, so it'll be good to test on.

geerlingguy’s picture

Patch against 7.x attached (I'll make an 8.x patch in the next comment). Leaving needs work until I can post the 8.x patch.

A little explanation of the one change I've made (after considering a few different methods): In drupal_fast_404(), I've added a check for the string 'styles' in the query string, as this should take care of the great majority of use cases, is still very performant (doesn't require any more regex, and strpos() is faster than another preg_match()), and is simple enough for someone that needs to add more paths (for example, for the Image Resize Filter's 'resize' directory) to be able to do so by adding another condition.

if ($fast_paths && preg_match($fast_paths, $_GET['q'] && !strpos($_GET['q'], 'styles'))) { ...

We could maybe do something where the user could set an array of options in settings.php that would be checked, but I think this method makes the best tradeoff between sensible defaults and performance... thoughts?

geerlingguy’s picture

Status: Needs work » Needs review
FileSize
5.93 KB

8.x patch attached.

The only way to make a configurable array of paths to check against the url for dynamic content in settings.php (so people wouldn't need to change bootstrap.inc to add more dynamic paths) would be to add a loop in drupal_fast_404() to check each string against the url using foreach... but maybe preg_match() would be quicker :-/

geerlingguy’s picture

Nevermind... I think this technique (everything's configurable in settings.php) should be a bit better. See attached patches, which also add an explanation of the dynamic paths configuration variable, for D7 and D8.

Status: Needs review » Needs work

The last submitted patch, 404_fast_paths_8x-76824-305.patch, failed testing.

geerlingguy’s picture

Status: Needs work » Needs review
FileSize
6.33 KB
6.33 KB

Yeah, sorry about that... ending parentheses:

Status: Needs review » Needs work

The last submitted patch, 404_fast_paths_8x-76824-307.patch, failed testing.

geerlingguy’s picture

Status: Needs work » Needs review
FileSize
6.33 KB
6.33 KB

Yeah, about those parentheses...

geerlingguy’s picture

Updated patch to use the more standardized variable name 404_fast_paths_exclude (instead of 404_dynamic_paths).

geerlingguy’s picture

joelstein’s picture

Status: Needs review » Reviewed & tested by the community

I like the idea of having "exclude" paths in addition to "include" paths, and I'm using the same technique with great success on one of my D7 websites. Patch works for me!

Anonymous’s picture

thumbs up, i think it's perfect

droplet’s picture

Status: Reviewed & tested by the community » Needs work
+++ b/sites/default/default.settings.phpundefined
@@ -428,6 +428,42 @@ ini_set('session.cookie_lifetime', 2000000);
+$conf['404_fast_paths_exclude'] = '/\b(styles)\w*\b/';

can anyone explain why use this regexp but not \b(styles)\b ?

/\b(styles)\w*\b/
Matched:
1. \stylesxxxxxxxxxxx\abc
2. \styles\abc
3. xxxx\styles.jpg
4. xxx\styles\xxx\xxx

/\b(styles)\b/
Only matched:
2. \styles\abc
3. xxxx\styles.jpg
4. xxx\styles\xxx\xxx

or consider this:
/\b\\(styles)\\\b/
Matched:
4. xxx\styles\xxx\xxx

20 days to next Drupal core point release.

geerlingguy’s picture

Status: Needs review » Needs work

I'm presuming you meant something more like:

  $conf['404_fast_paths_exclude'] = '/\b\/(styles)\/\b/';

(forward slashes instead of back slashes). I have the updated regex on my local checkout and will roll a patch in a few minutes. Pending testbot approval, I hope we can RTBC this asap and get it in.

I've tested with the following patterns:

  • http://www.example.com/example/styles - no match
  • http://www.example.com/example/styles/file.jpg - match (won't return a 404)
  • http://www.example.com/example/styles.jpg - no match

To add more 'dynamic' or 'safe' paths (for things like ctools-generated css, or other dynamically-created content), just add a pipe, and the string to match. For example, in settings.php, instead of the code at the top of this comment, put:

  $conf['404_fast_paths_exclude'] = '/\b\/(styles|ctools)\/\b/';
geerlingguy’s picture

Status: Needs work » Needs review
FileSize
6.33 KB
6.33 KB

Updated patches with regex from #315, tested thoroughly, hopefully ready for commit!

geerlingguy’s picture

Status: Needs work » Reviewed & tested by the community

Back to RTBC, as the simple line of regex is all that was changed, and it's been tested by two people.

rootwork’s picture

Subscribe (sorry, I know...)

markus_petrux’s picture

Capturing back references in regular expressions do not seem to be required. It could be optimized by adding ?: after the opening parenthesis.

xjm’s picture

FileSize
3.17 KB

Here's an interdiff against #286 to make it easier to see what's changed in #316.

xjm’s picture

Issue summary: View changes

First cut at issue summary

xjm’s picture

I updated the Remaining Tasks section of the summary to reflect the current status of the issue.

xjm’s picture

Issue summary: View changes

Updated issue summary.

droplet’s picture

@#315,
thanks geerlingguy, that's what i meant.

and we can simplify it to:

\/(?:styles)\/

(non-capturing, no word-boundary)

1. xxxxxxx/default/files/styles/large/public/field/image/xxxxxx.jpg - Matched
2. styles/default/files/xxxxxxx/large/public/field/image/xxxxxx.jpg
3. xxxxx/default/files/xxxxxx/large/public/field/image/styles.jpg

Yes, it can be more simplified to

\/styles\/

but first case, beginner users can just add a pipe:
\/(?:styles|ctools)\/

droplet’s picture

Issue summary: View changes

Removed unnecessary (and now inaccurate) phrase.

xjm’s picture

Setting to NR based on #319 and #322.

xjm’s picture

Status: Reviewed & tested by the community » Needs review

No, really. :)

geerlingguy’s picture

Will write up a new patch in the morning.

geerlingguy’s picture

I've set the new regex to:

  $conf['404_fast_paths_exclude'] = '/\/(?:styles)\//';

It was:

  $conf['404_fast_paths_exclude'] = '/\b\/(styles)\/\b/';

Tested with same paths as earlier, and D7/D8 patches are both attached. Nice to see profile module gone in D8!

xjm’s picture

Nice, hopefully we can get a couple people to review this version and RTBC it again.

droplet’s picture

Good Job! and this one too:

 $conf['404_fast_paths'] = '/\.(?:txt|png|gif|jpe?g|css|js|ico|swf|flv|cgi|bat|pl|dll|exe|asp)$/i';

It was:

  $conf['404_fast_paths'] = '/\.(txt|png|gif|jpe?g|css|js|ico|swf|flv|cgi|bat|pl|dll|exe|asp)$/i';

sorry I haven't mention it before.

Owen Barton’s picture

This looks great to me (tests not back yet though).

One further (very minor) tweak that comes to mind, is that if we used a regex delimiter other than '/' we wouldn't need to escape the slashes in the exclude expression, which might and reduce the overall "too many slashes!" fear factor and generally make the function of the slashes a bit more obvious to folks who are not regex geeks. Of course there is also a risk that a less typical delimiter would confuse people more than the slashes - feedback welcome :)

e.g. instead of:

$conf['404_fast_paths_exclude'] = '!/(?:styles)/!';
$conf['404_fast_paths'] = '!\.(?:txt|png|gif|jpe?g|css|js|ico|swf|flv|cgi|bat|pl|dll|exe|asp)$!i';

we could have:

$conf['404_fast_paths_exclude'] = '!/(?:styles)/!';
$conf['404_fast_paths'] = '!\.(?:txt|png|gif|jpe?g|css|js|ico|swf|flv|cgi|bat|pl|dll|exe|asp)$!i';

I went with '!' as it seems the most common in core (after '/') - other non-escape-needed delimiter options (also used elsewhere in core) include '@', '%', '~', '#'. What do people think?

geerlingguy’s picture

That's a simple style preference... but I would argue in favor of escaping things with slashes; exclamation points make it seem like something special is going on there (to a less-experienced regex guy), and would be more confusing, imo.

geerlingguy’s picture

Status: Needs review » Reviewed & tested by the community

Back to RTBC... I think most people are in agreement over everything, except possibly the style of regex used—semantically, it's about as concise as it can get while still allowing ease of end-user modification in settings.php (ref #322).

Owen Barton’s picture

FWIW, I am perfectly happy with the regex as is. I suggested changing the delimiters to avoid escaping, but I agree "special" looking delimiters may be similar for new-to-regex folks - don't feel strongly either way.

Dries’s picture

Version: 8.x-dev » 7.x-dev

Committed to 8.x. Oh yeah.

Sorry for being slow to commit this patch. For the longest time I wasn't convinced this was the right thing to do and was quite torn about it. I just spent time thinking about it, and now think it is actually proper. Thanks for hanging on there.

Moving to 7.x.

Dries’s picture

Assigned: Dries » Unassigned
geerlingguy’s picture

Still RTBC for Drupal 7 as well - and as noted above, there are no regressions introduced by this patch, and for it to have any effect at all, a site would need to overwrite its existing settings.php file (for existing sites). Might we be able to get it committed soon?

geerlingguy’s picture

#328: 404_fast_paths_7x-76824-328.patch queued for re-testing.

geerlingguy’s picture

@webchick - any chance of getting this in before the next point release? Been RTBC for a few weeks now...

webchick’s picture

Title: Drupal should not handle 404 for certain files » Change notice for Drupal should not handle 404 for certain files
Priority: Major » Critical
Status: Reviewed & tested by the community » Active

Sorry; I've been swamped lately and haven't had a chance to give the D7 queue a good love-fest.

This seems like a pretty harmless addition for D7, and passes Dries's muster, so I'm cool with committing it.

So! Committed and pushed to 7.x.

This needs a change notice so it can be mentioned in the 7.9 release notes.

geerlingguy’s picture

The link was relative, so didn't work - should be http://drupal.org/node/add/changenotice.

I glance through there, but this is the first issue I've ever dealt with that needs a Change Notice, and I'm unfamiliar with what exactly that entails. Specifically, what would I need to do in the 'Updates Done' section, and is there need to attach any particular file (like the committed patch)?

I'd love to do this (or at least see a couple examples of change notice nodes), but just need a little nudge—do any docs exist?

geerlingguy’s picture

Status: Active » Needs review

Here's the change notice I've posted: http://drupal.org/node/1296384 — please see my questions in the comment above, as this is the first time I've submitted a change notices node...

Thanks!

webchick’s picture

Title: Change notice for Drupal should not handle 404 for certain files » Drupal should not handle 404 for certain files
Priority: Critical » Major
Status: Needs review » Fixed

That looks perfect to me, thanks! Restoring status.

The "Updates" section is for the documentation team, etc. to note when they have updated the relevant sections.

If you have suggestions on how to make this feature more clear, please file some feedback under http://drupal.org/project/issues/drupalorg.

perusio’s picture

Status: Fixed » Needs work

@geerlinguy #339

Your change notice is incomplete. It fails to mention that if you're "capturing" the handling of the static files at the server level there's no need to patch anything or fiddling with the settings.php file.

geerlingguy’s picture

@perusio - could you update the change notice, then? I'm not sure how I'd want to phrase that, as I thought it more or less self-explanatory that you bypass Drupal's handling if you handle 404s before Drupal got ahold of them.

perusio’s picture

Status: Needs work » Fixed

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Shellingfox’s picture

Version: 7.x-dev » 6.x-dev
Assigned: Unassigned » Shellingfox
Issue tags: +Needs backport to D6

I'm working on this to backport to D6

Shellingfox’s picture

Status: Active » Closed (fixed)
FileSize
5.04 KB

Here is my patch for #346

Shellingfox’s picture

Status: Closed (fixed) » Active
thehong’s picture

Status: Closed (fixed) » Needs review
Shellingfox’s picture

Status: Needs review » Active
FileSize
5.04 KB

Since i was change status, last file wasn't test.

Shellingfox’s picture

Status: Active » Needs review
FileSize
5.04 KB

Re-create patch via git diff

Status: Needs review » Needs work

The last submitted patch, fast404-76824-347.patch, failed testing.

Shellingfox’s picture

Status: Needs work » Needs review
FileSize
5.06 KB
droplet’s picture

we don't has such D6 backport policy? It will change the default behavior.

Nigeria’s picture

#353: fast404-76824-351.patch queued for re-testing.

jcisio’s picture

Status: Needs review » Reviewed & tested by the community
Issue tags: -Needs backport to D7

It was backported to D7, so it could be backported to D6. The default behavior does not change.

There are two trailing spaces for nitpickers. The backport is straightaway. Works as expected.

pwolanin’s picture

Is this already in Pressflow? D6 is much less open to new features than D7 is right now.

jcisio’s picture

No, it is not in Pressflow. Pressflow recently has not been well maintained, no backport has been recently merged into Pressflow. Pressflow 6.26 needed two months to be tagged after the release of Drupal 6.26.

That said, if we decide not to backport this one, we can simply remove tag and close the issue. There are not many new D6 sites, so a change in the default settings file is not much useful.

donquixote’s picture

A quick note:

If this is what we currently find in
admin/config/search/search404 > Advanced settings > "EXTENSIONS TO ABORT SEARCH"
then there is the "ico" missing there, for favicon.ico

pwolanin’s picture

@jcisio - it looks like 6.26 was merged in: https://github.com/pressflow/6

Here's a 6.26 tag: https://github.com/pressflow/6/commits/pressflow-6.26.109

what's missing?

jcisio’s picture

Version: 6.x-dev » 7.x-dev
Status: Reviewed & tested by the community » Fixed
Issue tags: -Needs backport to D6

@pwolanin it needed two months after the release of Drupal 6.26 to be tagged. Pressflow has many pull requests (https://github.com/pressflow/6/pulls) but it looks like none was accepted, except for official Drupal releases merge.

In short:
- it is not in Pressflow now
- there is no interest for it to be in Pressflow 6 (there is mostly no new P6 sites)
- there is no interest for it to be in Drupal 6 (there is mostly no new D6 sites)

I'm closing this issue as it won't be in D6, unless Gabor goes ahead and commits it. Restoring the 7.x tag.

Status: Fixed » Closed (fixed)
Issue tags: -Performance

Automatically closed -- issue fixed for 2 weeks with no activity.

Anonymous’s picture

Issue summary: View changes

Updated issue summary.

jelo’s picture

Issue summary: View changes

This issue states at the top: "The previously submitted patch in #286 was incompatible with image styles and other generated files (such paths would return a 404). The latest patch in #316 addresses this issue by adding a configurable list of excluded paths that should not use the fast 404 feature."

I am not sure if "other generated files" includes clashes with images that are loaded using private file download method. I just spent a couple of hours troubleshooting why all inline images had disappeared from my Drupal site before I realized that fast 404 had been enabled with the default settings, i.e. no additional paths had been added.

The result is that all private image files that are served through /system/files/filename.jpg break. All other documents still function because fast 404 only defines extensions png|gif|jpe?g, but not pdf, xls etc.

Given that private file setting options are in Drupal core and could be enabled, should we not add /system/files as path by default to be excluded in case someone enables fast 404?

jelo’s picture

Given that this issue is closed I created a new issue with a proposal to add "system/files" by default to the excluded paths at https://www.drupal.org/node/2455057

jvieille’s picture

I patched pressflow 6 with this one
https://www.drupal.org/node/76824?page=1#comment-5319424
Thanks