Problem/Motivation
Drupal handles 404s for certain files that are not Drupal content.
Examples are .js, .css, .jpg, .gif, ...etc.
This has two disadvantages:
- We do an expensive, slow bootstrap for every 404 for such files that cause server load and delay browser page rendering.
- We send far more bytes for that 404 than if we let the web server standard error page get sent, wasting bandwidth and further delaying browser page rendering.
From a scalability point of view 404s can be a huge problem for sites with old domain names that have gone through multiple re-structurings over the years. The 404 traffic simply from hotlinked files alone can be a significant draw on resources. The current setup is also a huge risk for active sites - if you add a 404 to a busy page, or perhaps accidentally move, chmod or otherwise make the files directory inaccessible, server load instantly skyrockets and can render the server unresponsive. An example of this is the (missing by default) favicon.ico file which, until we added a workaround for in .htaccess was basically doubling server load for all Drupal sites.
From an end user point of view this is also a major issue. Browsers will patiently sit around with the loading spinner going waiting for Drupal to fully bootstrap, generate a beautiful, fully themed page to represent an image or css file, that the browser then completely ignores as soon as it notes the 404 http code. Not only does this mean that we end up waiting for Drupal to bootstrap twice (or more!), but the response is at least a factor of ten larger than it needs to be. In some cases (e.g. whitehouse.gov or amnesty.org) it is over 100 times larger than necessary. Multiply this by all the 404s on all the Drupal 7 sites, and the end result is a lot of end user wait time, and a lot of bandwidth and cycles wasted.
Proposed resolution
The approach is to return a lightweight 200 byte page earlier in the page generation cycle. By default it is returned in drupal_deliver_html_page(), which at least cuts out some amount of page generation time. However we also have an option to return the fast 404 when settings.php is being loaded, which is extremely fast but has some tradeoffs to consider. There are settings in settings.php to disable this completely (and get pretty 404s back), to adjust the extensions that trigger a fast 404, and to enable the settings.php early return.
We spent the first zillion comments or so considering .htaccess based approaches, but these had enough tradeoffs to not work for core as a default, and issues with logging that most people would not want to opt in to.
Here is a table describing the status-quo and the 2 options added by this patch:
Configuration A- disabled | Configuration B - called from drupal_deliver_html_page() | Configuration C - called in settings.php |
Heavy 404 page generation cost | Smaller 404 page generation cost (~50% of A?) | Minute 404 page generation cost (< fastpath cache) |
Slow 404 page response time (e.g. 1-10 seconds) | Faster 404 page response time (e.g. 0.5-5 seconds) | Very fast 404 page response time (e.g. < 0.2 seconds) |
Large 404 page size (2,000-30,000+ bytes, depending on site) | Tiny 404 page size (~200 bytes) | Tiny 404 page size (~200 bytes) |
Inlined assets (images, JS, CSS) 404 as normal | Inlined assets (images, JS, CSS) 404 as normal | Inlined assets (images, JS, CSS) 404 as normal |
Directly requested assets show fully themed 404 page | Directly requested assets show simple 404 page | Directly requested assets show simple 404 page |
All 404s logged in watchdog as normal | All 404s logged in watchdog as normal | "Fast" 404s not logged in watchdog |
All 404s logged in Apache logs as normal | All 404s logged in Apache logs as normal | All 404s logged in Apache logs as normal |
Core image api works normally | Core image api works normally | Core image api works normally |
404 based lazy loaders (e.g. D6 imagecache) work normally | 404 based lazy loaders (e.g. D6 imagecache) work normally | 404 based lazy loaders (e.g. D6 imagecache) don't work without tweaking the regexp |
Pages with aliases that match regex load as normal | Pages with aliases that match regex load as normal | Pages with aliases that match regex 404 instead of load |
Server side benchmarks
Using apachebench, and visting example.com/none (which is a 404), with 5 users, 1000 requests
1. Without patch
Requests per second: 49.71
2. With patch, default settings
Requests per second: 51.11
3. With patch, with function uncommented in settings.php
Requests per second: 51.15
When visiting example.com/none.jpg (which a 404 that triggers a fast 404), 5 users, 1000 requests
With patch, with function uncommented in settings.php
Requests per second: 352.31
Browser side benchmarks
Benchmarks show the effect of the current slow inline 404s on browser display latency for end users. These were done using http://www.webpagetest.org/ testing the same site (in different configurations, 404s introduced through a node body promoted to home page) using default settings, 10 test runs each time.
First View Summary (seconds) | |||
No patch | Patch | Patch (fast 404 enabled) | |
No 404 | 2.591 | ||
Image 404 | 2.944 | 2.881 | 2.525 |
CSS 404 | 3.571 | 2.890 | 2.501 |
Percentage Improvement | |||
Patch | Patch (fast 404 enabled) | ||
Image 404 | 2% | 14% | |
CSS 404 | 19% | 30% | |
Absolute Improvement (seconds) | |||
Patch | Patch (fast 404 enabled) | ||
Image 404 | 0.063 | 0.419 | |
CSS 404 | 0.681 | 1.07 | |
Repeat View Summary (seconds) | |||
No patch | Patch | Patch (fast 404 enabled) | |
No 404 | 0.886 | ||
Image 404 | 1.716 | 1.488 | 1.011 |
CSS 404 | 1.604 | 1.500 | 1.287 |
Percentage Improvement | |||
Patch | Patch (fast 404 enabled) | ||
Image 404 | 13% | 41% | |
CSS 404 | 6% | 20% | |
Absolute Improvement (seconds) | |||
Patch | Patch (fast 404 enabled) | ||
Image 404 | 0.228 | 0.705 | |
CSS 404 | 0.104 | 0.317 |
See attached file for more detailed benchmarks.
The results are not insignificant - an inline 404 currently slows down the parent page display by anything from 500ms to a full second, depending on the type of 404. Committing the patch will shave something like 100ms-500ms off that time. If site builders enable the "fast 404" return in settings.php the pages display almost as fast as if the 404 wasn't there - anything from 500ms to a full second faster than the current performance. In other words, this patch clearly has enormous benefits for both end user responsiveness, in addition to the server utilization benefits described in depth above.
Remaining tasks
This has been through many months of revision and tweaking. One potential improvement that comes to mind is a test that the lightweight page is indeed returned (we already check for 404 status in several cases) - this would testing some pretty trivial code (one function call, one if), but is probably worth considering.
The previously submitted patch in #286 was incompatible with image styles and other generated files (such paths would return a 404). The latest patch in #316 addresses this issue by adding a configurable list of excluded paths that should not use the fast 404 feature. An interdiff showing changes between the two patches is available in #320. This revision of the patch has been tested by several people.
It may be possible to further optimize the regex; see #319 and #322.
User interface changes
None
API changes
None
Comment | File | Size | Author |
---|---|---|---|
#353 | fast404-76824-351.patch | 5.06 KB | Shellingfox |
#351 | fast404-76824-347.patch | 5.04 KB | Shellingfox |
#350 | fast404file-76824-346.patch | 5.04 KB | Shellingfox |
#347 | fast404file-76824-346.patch | 5.04 KB | Shellingfox |
#328 | 404_fast_paths_7x-76824-328.patch | 6.33 KB | droplet |
Comments
Comment #1
kbahey CreditAttribution: kbahey commentedHere is another version.
It adds .ico to the list.
Comment #2
kbahey CreditAttribution: kbahey commentedLinks to mailing list discussion on this, all +1:
http://lists.drupal.org/pipermail/development/2006-August/018180.html
http://lists.drupal.org/pipermail/development/2006-August/018197.html
Comment #3
pwolanin CreditAttribution: pwolanin commentedQuick question based on the mailing list discussion-
Assume I want to be able to set up path aliases ending in .html or .htm becuase I've replaced static site content with Drupal content. Will this patch prevent that from working? It seems that it will, so I'd suggest separating out these sorts of endings (shtml, html, htm) into a separate rewrite statement and making it commented out by default.
Comment #4
agentrickardI opened another issue before getting to the end of the thread....
The original handbook page was written with reference to Drupal 4.6.x.
As discussed at http://drupal.org/node/56773, in 4.6, we made some edits to the default .htaccess file to allow the web server to handle bad file requests for most common file types.
Before this is standardized, some more testing is needed. Attached is out current .htaccess file.
Comment #5
pwolanin CreditAttribution: pwolanin commented@agentrickard - that attached file does not correspond to the current .htaccess. Did you add more to it?
Comment #6
agentrickardLet me clarify:
That file is the .htaccess file that we use in production (Drupal 4.6.9), which inspired the original Handbook page. It has been modified since the original handbook page was written.
I added it for reference.
Comment #7
kbahey CreditAttribution: kbahey commentedFor what its worth, I am using the patch attached in #1 on my live sites, and there is no ill effect at all ...
It does not seem to catch them all though ...
I changed it to this, but don't know if it works or not yet.
Comment #8
robertDouglass CreditAttribution: robertDouglass commented@agentrickard: I don't see how your .htaccess (from #4) is really applicable to Drupal core since you simply disable Drupal's 404 handling altogether:
Comment #9
robertDouglass CreditAttribution: robertDouglass commentedThe original version of this patch (and that which is in the handbook) works for nonexistant files requested from directories that actually exist in the filesystem. It does something like this:
Request: http://site.com/misc/image.gif
1. Look for directory /misc/.
2. /misc/ is found, so look for image.gif inside of it.
3. image.gif is not found, and image.gif matches the FilesMatch regexp, so return Apache default 404.
Request: http://site.com/some/image.gif
1. Look for the directory /some/.
2. Directory not found, so the URL gets rewritten to http://site.com/index.php?q=some/image.gif
3. Bootstrap Drupal, look in menu, not found, return Drupal 404 page.
The only fix I'm able to think of for this would be to modify index.php to intercept files using the same regexp from the FilesMatch directive, and return 404 from index.php. I don't know if the powers that be will buy such a solution.
Comment #10
kbahey CreditAttribution: kbahey commentedOh, I think they will, given large sites having lots of 404 really suffer because of that false bootstrap, and too many of them stressing the system.
I don't have the time to devise a patch, but will be happy to test one on my sites, and perhaps a large client site too.
Comment #11
robertDouglass CreditAttribution: robertDouglass commentedHere is what seems like an optimal solution. We keep the stuff in .htaccess as it is because having Apache handle things without invoking PHP is ideal. For the cases that Apache can't get to because of the rewrite rules, we run the same regular expression as the first line of index.php and return our own 404 if it matches. The results are as follows:
As long as developers don't start making paths in the menu that end in .gif, .png, .swf etc., we're fine. Does anyone know of any cases where such paths are generated? Perhaps by CAPTCHA or watermarking?
Comment #12
robertDouglass CreditAttribution: robertDouglass commentedSo much for my nice table.
/misc/blog.png
Apache will serve the image because it exists.
/misc/blogxyz.png
Apache will serve a 404 based on the .htaccess rules because the directory exists but the file doesn't.
/foo/bar.gif
Drupal's index.php will serve a 404 based on the rules there because the request ends with .gif. The very fact that we are in index.php means Apache has already decided it is a 404 case, with or without the modification to .htaccess (so the modification to index.php is not an Apache specific hack).
Comment #13
kbahey CreditAttribution: kbahey commentedHere is an updated patch.
It picks up from what Robert did, and changes the following:
- Adds .ico file as a type to be ignored.
- Makes .html as a commented out line with instructions. The reason is, some sites use the .html in URL aliases for various purposes (e.g. content migrated from older systems, making awstats give better stats by type, ...etc.)
I tested this and it works as expected. Should cut down resource usage on some busy sites with lots of flat files.
So, +1 from me.
Comment #14
Dries CreditAttribution: Dries commentedWhat if I'm serving images from the database, using the 'private file method'?
What if, in the near future, we want to do access control on images associated with nodes? (Something we should do.)
Comment #15
robertDouglass CreditAttribution: robertDouglass commentedThis breaks private files. Ugh.
Of course, we could update the regexp to accommodate the private files path... but what about filemanager files? And ecommerce files?
Yuck.
I don't know how we're going to introduce the access permissions on files attached to nodes (filemanager/attachment already has a solution for this), and don't think that it should be part of this patch.
Is the best thing to do to make the change in .htaccess and leave it at that? I suspect that we're only covering 5-10% of the target cases if that's the route we go. The other option would be to build the check into the bootstrap process and hope that we could do a partial bootstrap for 404 images instead of a full bootstrap, but this is clearly inferior.
Comment #16
forngren CreditAttribution: forngren commentedI'm writing something maybe related, http://drupal.org/node/77514. It performs a search for the missing URL with certain words stripped (eg html, php etc).
Comment #17
robertDouglass CreditAttribution: robertDouglass commentedI wouldn't be surprised if we end up introducing a second entry point to the bootstrap process, something like files.php, that handles any URL that has a file ending. That file would initiate an alternate bootstrap process where one of the first checks would be "Is this really a file that exists?", and right after that would check "Is this a path that is handled by a module?".
Haven't Walkah and Co. been doing some thinking on the file bootstrap issue? What is their concensus?
Comment #18
moshe weitzman CreditAttribution: moshe weitzman commentedi posted some thoughts on private files at http://www.tejasa.com/node/113. no code though.
Comment #19
Dries CreditAttribution: Dries commentedIMO, the best we can do is to try and serve a lightweight page. Dropping the sidebars on 404 pages could save a fair amount of resources; both bandwidth as CPU cycles.
Comment #20
moshe weitzman CreditAttribution: moshe weitzman commentedi guess that will have to do for now ... i've been meaning to add a block-less page anyway. needed for panels.module ... if anyone has ideas on how to implement this, please chime in. assigning to self.
Comment #21
kbahey CreditAttribution: kbahey commentedIntroduce a new theme function? mini_page based on theme_page() in theme.inc?
Comment #22
Dries CreditAttribution: Dries commentedtheme_maintenance does something similar. That or provide a
drupal_is_404()
function for the template files? Food for thought. :)Comment #23
moshe weitzman CreditAttribution: moshe weitzman commentedThats was easy. i just added minor conditionals in phptemplate.engine and chameleon.theme. Is used by 404 page, and will certainly be used by panels.module - http://angrydonuts.com/the_dashboard_revisited
To omit blocks, just pass FALSE as 3rd argument - for example theme('page', $output, FALSE)
Comment #24
moshe weitzman CreditAttribution: moshe weitzman commentedoops - here is the patch
Comment #25
Dries CreditAttribution: Dries commentedIt begs the question: do we need to redo the theme_maintenace() page using this system? Can, and if so, should we merge them?
Comment #26
robertDouglass CreditAttribution: robertDouglass commentedHere's the .htaccess portion of the patch. We should adopt this part regardless of the solution that we come to for the rest because it is the most effecient way to handle requests for files that don't exist.
Comment #27
moshe weitzman CreditAttribution: moshe weitzman commentedi looked into maintainence_page and install_page but they are specialized and merit own function. They avoid calls to functions that haven't been loaded yet. I think we can leave them alone.
Comment #28
Dries CreditAttribution: Dries commentedI'm ok with Moshe's patch. My original thought was to introduce a helper function (drupal_is_404()) but I guess this works too. A drupal_is_404() might be a little more powerful, and more transparent though.
Then again, the extra paramater passed to theme_page() might, for example, be useful to simply certain pages (like the welcome page after installation).
Comment #29
moshe weitzman CreditAttribution: moshe weitzman commentedi'll take some more feedback from cvs commit team, or proceed with commit. the extra param on theme_page() is vital for panels.module (whichis super cool if you haven't looked yet - http://drupal.org/files/issues/htaccess-404.patch_0.txt)
Comment #30
drummWhich patch are we talking about? I can't tell from moshe's review if he thought his own patch was ready for commit or Robert's.
Comment #31
moshe weitzman CreditAttribution: moshe weitzman commentedi am talking about my patch. i have not tested robert's patch.
Comment #32
m3avrck CreditAttribution: m3avrck commentedUpdated patch... includes Moshe's patch + Roberts + update for HEAD.
Tested and working well. I really like the fact that blocks go off when you get a 404, that really can save a lot of bandwith + CPU cycles.
Also, the Apache 404 for actual files is *very* necessary. If you have a link to http://www.example.com/somethingFunnyLooking.gif ... you are linking to a file directly--you are expecting either the file or a 404, *not* a full Drupal boostrap to give you a 404. 99% of the time that link is going to be in the theme or some link posted by someone and they'll never see that 404... so let's give them the absolute least amount of overhead with that 404.
I'd say this is ready to go, nothing new here and we solve 2 problems at once: 404s for content in Drupal (no blocks, less CPU, SQL, etc..) and 404s for files (*nothing* at all).
Comment #33
m3avrck CreditAttribution: m3avrck commentedAlso when/if this goes in, we need to update the theming docs about this new $show_blocks variable so themers can be sure to use it so save some CPU cycles on 404s.
Comment #34
m3avrck CreditAttribution: m3avrck commentedTalked with Moshe in IRC, we both agreed this is RTBC. It was ready before Neil just didn't know what to commit (now it's both).
Comment #35
drummThis is two separate issues. Make a separate issue for one half of this.
Comment #36
m3avrck CreditAttribution: m3avrck commentedNeil, ok here is the first patch then, this is for the blocks. Note, this will need a documentation update.
This patch is the same from before, just synced with HEAD.
Comment #37
m3avrck CreditAttribution: m3avrck commentedComment #38
m3avrck CreditAttribution: m3avrck commentedAfter the patch in #36 is commited (same as in #24 just updated for HEAD), here is the next patch to address the .htaccess and 404.
Comment #39
drummPlease post different patches on /separate/ issues.
Comment #40
robertDouglass CreditAttribution: robertDouglass commentedI moved the most recent version of Moshe's patch to a different issue since the title of this issue fits my patch the best:
http://drupal.org/node/80201
Comment #41
m3avrck CreditAttribution: m3avrck commentedHere's the 404 patch which is for this issue :-)
Thoroughly tested, this is ready to go. Like I said in #32, this is needed for files, it can save a ton of wasted Drupal bootstraps and server overheads.
Comment #42
forngren CreditAttribution: forngren commentedBecause of this (two) patches I think the need for http://drupal.org/node/76824 has increased, since the 404 page just "blanks" ATM. Most users wont configure a 404 page...
Comment #43
drummThe nearly-identical commented lines next to the uncommented need a better explanation.
From what I can tell this still breaks private file downloads.
Comment #44
moshe weitzman CreditAttribution: moshe weitzman commentedsad to see this one languishing. if it cam edown to it, i would add this and put some doc for private files tha tthis must be commented out.
Comment #45
coreb CreditAttribution: coreb commentedmoving from x.y.z to 6.x-dev version queue. Hopefully this will also bump the feature back into people's attention.
Comment #46
moshe weitzman CreditAttribution: moshe weitzman commentedcolor.module also does not work for private files and it is a quite prominent part of our default theme. see http://drupal.org/node/92059
i think private file incompatibility is acceptable for drupal 5. we will fix that in 6. anyone want to address drumm's last comment and resubmit?
Comment #47
dopry CreditAttribution: dopry commented-1 on this addition to the .htaccess, It will break any dynamic generation of the restricted file types & drupal based file access control.
If it is added I'd like to see it commented out by default, so people who have concerns about bandwidth and server load can uncomment them. We could also choose a special case for files/ or system/files.
Comment #48
m3avrck CreditAttribution: m3avrck commentedHere's and updated patch which cleans up some stuff.
Known issues:
Comment #49
dopry CreditAttribution: dopry commentedThe problem with 'files' is that it can change and we don't have access to file_directory_path from the .htaccess file...
Comment #50
moshe weitzman CreditAttribution: moshe weitzman commentedit is reasonable to force folks to have 'files' in the difrectory path, like we do for modules and themes. but if not, i htink this can go in commented out and ask folks to enter their files path.
Comment #51
jvandyk CreditAttribution: jvandyk commentedHere's an updated patch. Moved definition of ErrorDocument below our existing definition (otherwise it'll just get stomped on by the later
ErrorDocument 404 index.php
line).Changed the description a bit. Commented the lines out per #50, except the RewriteCond.
Comment #52
RobRoy CreditAttribution: RobRoy commentedDon't have time to test this just now, but will not handling rewriting for .html files mess when you have node/2 aliased to about.html, for example?
Comment #53
RobRoy CreditAttribution: RobRoy commentedAlso, in those new comments it might help to have a reason why one might NOT want to uncomment those lines. For example, why does this come commented-out by default.
Comment #54
m3avrck CreditAttribution: m3avrck commentedHere's an updated patch with a regex from Steven on IRC to ignore those files when they exist in /files which can interfere with private downloads and imagecache module.
Regex needs some tweaking though, almost there.
Comment #55
m3avrck CreditAttribution: m3avrck commentedAlso uncommented the regex because if it can ignore anything in files/ it shouldn't break anything. Feel free to comment out though if I missed a use case.
Comment #56
Morbus Iffm3vrck: FilesMatch does absolutely nothing with Directories. Scanning against files/ won't work. The only way forward on that avenue is to get the default handler into the .htaccess that is generated into the files/ directory.
Comment #57
drummAnd files isn't always named 'files'.
Comment #58
kbahey CreditAttribution: kbahey commentedHow about a new approach to this issue that has been going back and forth for so long, rolled back, ...etc?
Instead of doing things in .htaccess, requiring people to uncomment things, ..etc., can we move this into Drupal itself?
Can we do a light bootstrap (as early as possible), read a variable (perhaps in settings.php) that has a pattern, try to match it, then exit early?
This is of course more overhead than letting Apache handle the whole thing, but it would still be better that what we have now, provided we do it as early as possible.
Comments?
Comment #59
moshe weitzman CreditAttribution: moshe weitzman commented@kbahey - now is not the time to switch gears on this issue. thats a radical change and quite debateable.
@drumm - i think a code comment in htaccess and possibly where we set the files directory is enough to satisfy the rare case where files dir does not end in 'files'.
Comment #60
kbahey CreditAttribution: kbahey commentedmoshe, the reason I suggested an alternative is that this is dragging on, due to various issues that are pointed out by various people. So, getting consensus on it will be difficult.
So, in my view, it may not make it to 5.0.
My suggestion was to move it inside drupal, so we can manage the patterns to exclude, ...etc. from inside Drupal with a little bit of sacrifice on the overhead side.
If others think it will get in core via htaccess soon, then I am happy to go with that too.
Comment #61
moshe weitzman CreditAttribution: moshe weitzman commentedanyone care to carry this forward?
Comment #62
kbahey CreditAttribution: kbahey commentedAnyone willing to work on this performance enhancing patch?
The code freeze is a few weeks away ...
Comment #63
RobRoy CreditAttribution: RobRoy commentedCan someone answer my question at http://drupal.org/node/76824#comment-173567, which is the same as #3 submitted by pwolanin? What if I have an alias for .html pointing to a missing node? Not that that is a huge deal, just would like clarification. Also, the hardcoding /files issue is still there. What about for multisite?
@kbahey, I agree that moving this inside Drupal would save us a lot as we could get the right files dir and some of the other issues might be alleviated. But, maybe that would soften the performance gain that this is trying to achieve. Since this is a bit funky, I think the easiest way is to put this in .htaccess and comment it out initially, allowing users to turn on this boost if they want. What do you think?
Comment #64
kbahey CreditAttribution: kbahey commentedMany of my site has .html as the suffix, mainly so awstats can give good stats on them.
I think that this patch is for performance more than anything else. If we avoid a Drupal bootstrap, we waste less resources when we have a 404.
So, the options are:
1. We add the stuff in .htaccess and index.php and comment it out. This is the path of least resistance and would give us what we need for D6.
2. Have .htaccess generated automatically from Drupal like we do with the CSS cache stuff. A default setting provides which file extensions to use and the admin can override them (e.g. remove html or add xyz)
The .htaccess need not be writable. We can take a non-fancy approach and just display the .htaccess contents on the screen and ask the admin to copy/paste, or we can write to the tmp directory and ask the admin to move the file manually. Since this is an advanced feature, I think this is acceptable.
3. Move parts of that in a lighweight bootstrap, as I suggested earlier.
I think 1 will have the best performance. If Dries is amenable to this, I can work on this towards end of next week.
Comment #65
moshe weitzman CreditAttribution: moshe weitzman commentedI agree that #1 is best for now. I hope kbahey or other have a chance to revive this.
Comment #66
kbahey CreditAttribution: kbahey commentedI hate to see Drupal 6 go out the door without this simple yet effective feature.
Here is a patch for current HEAD.
Comment #67
kbahey CreditAttribution: kbahey commentedHere is a patch that moves the index.php stuff to settings.php, and comment it out.
This way, we have no cruft in index.php, and those who need this can just uncomment it (albeit in 3 different places, but better than a whole Drupal bootstrap for a measly 404).
Comment #68
moshe weitzman CreditAttribution: moshe weitzman commentedworks as advertised. i think this is a good compromise, and all the lines are needed.
Comment #69
Steven CreditAttribution: Steven commentedOne of the chunks is double commented, the other is not.
Comment #70
kbahey CreditAttribution: kbahey commentedHere it is, fresh from the microwave ... ;-)
Comment #71
agentrickardI think this patch is important (I wrote the original handbook page).
Circling back, have you tested against http://drupal.org/node/56773#comment-243479? And against file uploads in general?
Comment #72
Gábor HojtsyReread the htaccess comments, there are typos and repeated stuff as mentioned by Steven. Also if the functionality in htaccess is effective, there would not be a way to get to the default settings.php change. Are three three marketed as an 'all or nothing' solution to uncomment all three to cover all bases?
Comment #73
kbahey CreditAttribution: kbahey commentedThe 3 are needed but for different reasons. When a directory exists but a file in it does not, then Apache is able to handle the 404, and hence the .htaccess changes.
However, if the directory does not exist, then Drupal is still bootstrapped, so we exit as early as we can, and hence the settings.php change.
Rerolled the patch. Hopefully comments are OK now. I also changed it to use REQUEST_URI instead of QUERY_STRING, since that is what we are checking in .htaccess.
Also added a small "Page not found" message.
Comment #74
Morbus Iff* What's a "flat file"? I know what a flat file is. It's not what the documentation prefaces.
* None of the sentences that end with "settings.php" have a period at the end.
* None of these lines are commented (and they should be).
* "corresponding the RewriteCond" has an extra "the".
* I can agree that we'll save "server" resources, but not "bandwidth" resources.
* "corresponding FilesMatch line" - it's actually three line[s], not just one "line".
* Technically, if a number is less than 10, it should be written out ("three lines", not "3 lines").
I'm not sure I like this third check. I worry about dynamically generated images (ala CAPTCHA or various statistics) - I suspect, just off the top of my head, that one of the "make an image from text" modules will fail, like textimage.module. This third check also defeats a request for an image uploaded to a node when private downloads are enabled (URL path is ?q=system/files/1608.png, which triggers your check).
Comment #75
Morbus IffAnd note that #74 and the "private downloads" issue was brought up as far back as #14.
Comment #76
cburschkaI have trouble believing that the performance gain is worth completely eliminating file extensions in the virtual file system. The whole simple beauty of the "no file named $1? let index.php?q=$1 handle it" method is that there is no difference from the outside between the Drupal handler and the files served directly.
What performance gain is this supposed to give, anyway - it seems to speed up only 404 errors. For a site without broken links and sensible, persistent URLs, shouldn't those be the exception rather than the norm?
Just my two cents there...
Comment #77
Morbus IffWhat performance gain is this supposed to give, anyway - it seems to speed up only 404 errors. For a site without broken links and sensible, persistent URLs, shouldn't those be the exception rather than the norm? - Not really. Given enough time, your site will be constantly hit by broken spiders, endless requests for favicon.ico (which may not exist), and various cracking attempts via known vulnerabilities in IIS and other webservers. My top WEEKLY 404s are 14045 /favicon.ico(has never existed and never will), 2296 /popular (never existed. broken bot), 1668 /ghostsites/images/pixel.gif (probably once existed.), 1414 /popular/alltime (never existed. broken bot), 1025 /images/pixel.gif (probably once existed), 467 /cgi-bin/isp-post/isp-post.cgi (last existed 7 years ago. bot attempting exploits), 409 /files/users/picture-11.jpg (existed once. user, and all his content, was deleted.), 178 /db/zone.htm (never existed. no idea where this comes from).
Comment #78
agentrickard@Arancaytar
Just for background: This is a massive problem for sites that migrate to Drupal from a flat-file system. In the case of SavannahNOW, we had thousands of now dead links in Google, Yahoo, et. al. We were basically getting DOS'ed by 404 errors.
Plus, if someone requests a page that no longer exists, and that .html file contains two .css and 10 .gif files, that turns into 13 404 errors that "fall through" to Drupal. In effect, Drupal spends all of its time searching for files instaed of serving pages.
It is also a generic problem with bots and spiders -- particularly spambots and hackers. Our traffic logs show repeated requests for .dll and .cgi files that point to known or suspected exploits in IIS.
All that aside, the case may be that the best path here is a well-tested and documented handbook page, with a reference in the core INSTALL.txt that points to those instructions. I say that simply because edge-cases seem to be drowning this issue. If we can't fix all the edge cases, I doubt this will get into core.
Comment #79
kbahey CreditAttribution: kbahey commentedHere is the wording, before I roll a patch again:
.htaccess part1:
.htaccess part2:
settings.php:
Regarding private download, we can add a comment that this does not work in such a case. How about that? Any wording suggestions Morbus?
Comment #80
Morbus IffCSS is a acronym and should be capitalized. I'd much rather these sentences be merged into one paragraph (for each block).
As for the settings.php, I can't support this patch with that block in - adding wording-to-the-wise about private downloads doesn't solve the issue of textimage (which creates dynamic images) or any other module that attempts to do something dynamic with normally static files. It's something that would cause more confusion and endless "your module is broken!" requests.
I support the .htaccess additions.
Comment #81
kbahey CreditAttribution: kbahey commentedMorbus is right about private file download. This is something we need to think about. However, that should not stop the .htaccess parts.
Not sure what merging sentences in one block for each paragraph means.
I am unassigning myself from this issue. Someone else can take it and run with it.
Comment #82
moshe weitzman CreditAttribution: moshe weitzman commentedIn order not to break textimage and imagecache, I just added this condition in the settings.php if block:
i know that one could use a files dir that doesn't begin with 'files' but thats a bug in core IMO. we require modules dir to be modules, themes to be themes, and we have every right to require that files be files. i just need to submit a patch for that. if folks agree, i'll get started.
note that color.module does not work with private files and thats a far more important feature than these chunks of commented out code that are proposed.
Comment #83
cburschkaGiven that any module (or even custom site code) might attempt to "fake" a file extension in the URL, it seems like this fixes one issue (404 overload) at the cost of introducing an uncounted number of problems whose cause cannot not readily identified.
"Why is such-and-such a page suddenly returning 404 errors?" seems like a far more difficult support issue to solve than "Why is my Drupal site constantly running into my host's server load limits?". When the server overloads, you have a very obvious problem that will probably be handled by an admin who can poke around in .htaccess files. The other one is encountered by users who have no idea what happened.
Because of this, I'd propose that the relative addition/change in .htaccess be commented out with a description of how to activate it, and document this as a performance tweak in the handbook...
Comment #84
mcurry CreditAttribution: mcurry commentedsubscribe
Comment #85
Christefano-oldaccount CreditAttribution: Christefano-oldaccount commentedWhere has this issue been all my life? Subscribing.
Comment #86
Owen Barton CreditAttribution: Owen Barton commentedSubscribing
Comment #87
juerg CreditAttribution: juerg commentedSubscribing
Comment #88
PanchoMoving feature requests to D7 queue.
Comment #89
zeezooz CreditAttribution: zeezooz commentedFor handle file in non-existent directory I use rewrite to /404.ext in .htaccess. Put
before last RewriteConds/RewriteRule and
just after
It does not require to patch other files and seems to work well in 5.7 (including imagefield preview), but error message contain 404.ext filename so I use custom message.
Comment #90
Christefano-oldaccount CreditAttribution: Christefano-oldaccount commented@zeezooz, can you roll a patch?
New features are going into 7.x-dev, not 5.7.
Comment #91
Owen Barton CreditAttribution: Owen Barton commentedThe trickiest thing to work around here are modules like imagecache that rely on the 404 handling to auto-generate versions of files that don't yet exist. I wonder if we could get around this by having them add a querystring paramete (e.g. 404handler=drupal) that will cause .htaccess to use the Drupal 404 for these files, rather than the lightweight static 404 page. This way we could potentially have the rule enabled by default, and still allow these modules to work without having people hack the (fragile!) .htaccess regexi.
Anyone fancy taking a shot at this?
Comment #92
zeezooz CreditAttribution: zeezooz commentedpatch for #89
Comment #93
Gábor HojtsyNote: Favicon.ico just got such special treatment in Drupal 6.3.
Comment #94
m3avrck CreditAttribution: m3avrck commentedpatch #92 works great!
We are using this on MothersClick with imagecache & imagefield with *no* problems at all, test it yourself:
http://www.mothersclick.com/no-directory/this-really-works.css
The patch that went into D7 and D6 for favico should be reverted in favor of this patch which works great.
Comment #95
catchStill applies with fuzz, back to needs review.
I agree 100% that a more generic solution would be great for this issue rather than special casing favicon.ico - it's a big performance issue on some sites (including one I run) that have old legacy linkrot for images etc.
Comment #96
cburschkaCould we leave default behavior as it is? This will break functionality on a great number of sites - whether you are using a private file system, or any module that dynamically generates file downloads, or even if you are in the habit of adding ".html" to the URL aliases of your nodes.
It's a fine tweak, but I don't see that we should knowingly and without warning or explanation break the sites of one part of the user base in order to improve performance for another. These lines should be commented out initially, so you can uncomment them if you need the performance. We don't default to aggressive caching either, for the same reasons.
Comment #97
m3avrck CreditAttribution: m3avrck commentedThis patch DOES NOT break dynamically generated files like images with imagecache. Like I said, we're using it on MothersClick and we rely on imagecache and everything works a-ok. Before saying a patch doesn't work, test it. There is a line in there that says ^/files/ ...
Comment #98
cburschkaI'm sorry, I often get mod_rewrite rules wrong. Does the following prevent the file being handled by Drupal, or is it an exception for files that are handled by Drupal?
Comment #99
gpk CreditAttribution: gpk commentedRelated: #295456: Bug in .htaccess file.
Also the patch Gábor mentions in #93: #174940: Root /favicon.ico accounts for majority of 'Page not found' errors.
Is it proposed that *by default* Drupal should not handle 404s for the image file types? Two comments:
- if Drupal doesn't handle them then I lose an easy way of finding broken img tags etc. on my sites
- the comment sections e.g.
# The following lines prevents [sic] ... # You must uncomment the corresponding RewriteCond and RewriteRule lines later in this file.
suggests that by default these lines will be commented out??@98, and for my own edification: What I think those 2 lines do is the following:
- first of all, %1 is a backreference to the file extension of the image file (as matched the line before in the RewriteCond)
- so
RewriteCond %{REQUEST_URI} !^404.%1$
makes sure that the request was not for 404.jpg say (this is needed so that the next line doesn't create a loop)-
RewriteRule ^(.*)$ 404.%1 [L]
then rewrites the current request (e.g. picture.jpg) to 404.jpg and ends ReWrite processing (so that the following rule - Drupal's normal one - doesn't get applied).- only then does the FilesMatch directive get applied
Actually I'm not sure the request URI needs to be rewritten ... just need to make sure that Drupal's normal RewriteRule doesn't get run?
Comment #101
gpk CreditAttribution: gpk commented#100 was the testbot problem #335122: Test clean HEAD after every commit.
Comment #102
Owen Barton CreditAttribution: Owen Barton commentedHere is a rerolled patch that does the following:
- Reverts the favicon.ico rules, because they now are handled by these rules
- Adds some additional extensions for common 404s (mostly exploit-bots)
- Changes the 'files' rule so it does not need to be at the start of a line. This is needed so we can catch sites/default/files paths (the default since Drupal 6). This also fixes (AFAICS) the case where private files are enabled.
- Adds a help text and a validation rule to the file system settings to ensure the file system path contains the word 'files' somewhere (as suggested by moshe in #82).
- Removes handling for htm/html files by default, because migrating from a static html site is a pretty common use case. I have added comments and sample lines for people who know they don't/won't use these aliases, but by default you don't have to do this (and you still save a large number of CPU cycles).
- Makes this enabled by default. I don't think it's worth including this if it is going to be disabled by default, especially if it means uncommenting 2 far apart sections of code in a hard for newbies-to-grok file. If we are not going to enable it by default this could live just as happily in a documentation section for high traffic domains.
Some thoughts:
- This opens up the possibility of a contrib module to save static versions of the error pages as 404.whatever (a bit like boost module does for regular pages), or introduce some lightweight way of logging these errors (404-handler.php) with a minimal change to the rewrite rules.
Comment #103
cburschka+1, though leaving it in as a comment will make it far easier to enable - just tell them to remove the comment on the documentation page.
Again: This change will speed things up for some people and break the sites of everyone using private file downloads, since you're not even using files/ or system/ exceptions, as well as anyone generating dynamic images with extensions, which can't be fixed with any additional special exception. Is it too much to ask not to break people's site by default?
Comment #104
Owen Barton CreditAttribution: Owen Barton commented@Arancaytar - it has been stated multiple times in the thread that this patch does not break imagecache or any similar modules and has been used successfully on large production sites that also use imagecache. If you read my post you would also note that this should now work with private file downloads. If you would like to find something that it does break, then please by all means try the patch and report back, but repeating "this breaks imagecache" over and over is neither accurate nor helpful :)
Comment #106
Owen Barton CreditAttribution: Owen Barton commentedIt occurred to me that this would be a great candidate for some tests (which would help settle the what this does/does not break debates!) - any takers?
Comment #107
m3avrck CreditAttribution: m3avrck commentedThe issue I have with this patch is treats 404s for files differently than 404s for something in Drupal's menu system.
IMO these should be treated the same. The attached patch is how we handle on MothersClick.com we a single configurable file that both Apache and Drupal can use.
While not core worthy, hopefully this sparks proper direction for this patch.
Comment #108
moshe weitzman CreditAttribution: moshe weitzman commentedWhat a cursed issue we have here.
Comment #109
sunPatch in #102 should not throw a form validation error. Instead, just use a note for the form element description, so site admins who need to use a path without "files" do not have to hack core.
Patch #107 looks interesting. Just an idea: Could we use a separate 404.php in the root directory, which performs a minimal bootstrap, checks the configured file directory path for the site in question, and short-circuits to a (optionally configurable) 404 output - or - if Drupal is supposed to handle the request, performs a full bootstrap?
Comment #110
kbahey CreditAttribution: kbahey commentedI hate that D7 would be released still handling all 404s via an expensive bootstrap. This is a big performance suck.
The attached patch solves this in two ways:
- Adds rules to .htaccess to bypass 404s for certain files.
- Adds a safeguard in settings.php to stop very early if we are handling a static file.
Please please let us have this in D7. We can refine later if needed, but let us ship with something that prevents Drupal from handling static files.
Comment #112
chx CreditAttribution: chx commentedYou want to write a test module which defines a callback in files, say 'test.gif' and check the return of it -- stuff a random string in a variable in the test and print variable_get in the test module.
Comment #113
Owen Barton CreditAttribution: Owen Barton commentedHere is an updated patch, with 3 new assertions to ensure that the web server is passing "fast 404" responses to specific file type requests outside the files directory, and that Drupal can still handle requests within the files directory (system_test module now registers 2 imagecache style handlers for this purpose) for both public and private file serving.
I have also tried to make the comment help as clear as possible, although it is hard to balance clarity against excess verbosity - feedback is welcome.
I also took the settings.php fallback out - I am happy to add this back in if someone can explain why we should. This would only be reached on non-Apache webservers. We generally don't duplicate .htaccess functionality in php (especially for "optional" functionality such as this) - for example the www/no-www redirection could easily be done in settings.php, but is not. I think it is reasonable to request that people set up equivalent functionality in their non-Apache webserver configurations, and/or add code to their settings.php to do the same - we have places in the manual to document these.
Comment #114
Owen Barton CreditAttribution: Owen Barton commentedComment #116
Owen Barton CreditAttribution: Owen Barton commentedHmm, I can't reproduce the test fail locally - anyone else care to test?
Comment #117
chx CreditAttribution: chx commentedOwen, I do not get this, the rewrite rule changes the request to 404.gif or similar? Or what is the 404.%1 wanted to be? And, why wont the filesmatch grab your request, even in the files dir before any rewrite?
Comment #118
Owen Barton CreditAttribution: Owen Barton commentedThe rewrite rule simply prevents the request from being rewritten and passed to Drupal (if the file does not exist, the file is outside */files/* and has a specified extension). This causes Apache to 404 the request and look up the appropriate ErrorDocument for the request, which has been previously set for files with this extension. The filesmatch does not do anything at all unless Apache looks up an ErrorDocument for the request.
There perhaps could be problems with this if clean URLs are disabled - kbahey, did you check this at all?
Comment #119
killes@www.drop.org CreditAttribution: killes@www.drop.org commentedI am wondering if this is a good idea. While it might be good for performance, you will never notice that some files are missing if you do nto additionally look into the apache log.
Comment #120
Morbus IffYeah, honestly, I'm not hugely fond of this functionality - it's pretty hackish, and severly cripple's Drupal's own log functionality (which, over the many years, people have probably come to expect to be a literal equivalent to server logfiles). I don't doubt that heavy 404s can cause performance issues - I just don't feel this is the answer.
Comment #121
catchI'm wondering if we should just do either extension matching or file directory matching in drupal_not_found() - so log the 404 for site admins, but then return straight away if it's a file rather than a page. That'd save all the page building and rendering for the 404, but still give us logs.
Comment #122
moshe weitzman CreditAttribution: moshe weitzman commentedIt simply is not Drupal's job to catch all possible requests under its tree and log them. I like this patch a lot. catch's idea still requires a bootstrap which is a big problem. some contrib module can implement an apache log viewer if people need to view these 404's inside of drupal. or make a contrib module which lets you add a simple link to awstats/webalizer the reports block. web hosts could even make their own distro with that link pre-configured.
Comment #123
killes@www.drop.org CreditAttribution: killes@www.drop.org commentedIt has done so for years and considering everything this behaviour has caused very little problems.
I think it would be bad user experience to change this behaviour.
Catch's idea sounds interesting, but I am not sure how it could work.
Summary: I'd won't fix this if I were a core committer.
Comment #124
Morbus IffI don't like, at all, how the patch redirects the user to 404.$fileextension. This is *really* damaging - that's the actual entry that will get logged into the server log file, meaning that we are *crippling* a traditional error_log by filling it with garbage. How can I fix my broken links or embeds (etc.)., if Drupal is filling my log with "404.jpg" and "404.html"? This'll corrupt not just the error_log, but also anything that works on error_logs, such as Webalizer and so forth.
Comment #125
Morbus IffTo amend my previous comment, here's what the error log looks like with this patch:
The first entry is the rewritten file - I had requested ahem.jpg. The second entry is a modern browser's standard attempt to get the favicon.ico (which, naturally, we can't really tell here, since it's been rewritten).
Huge -1.
Comment #126
catchAlso has anyone tested this with imagecache?
Comment #127
Morbus Iffcatch: assuming that the Drupal 'files' directory has been set to 'files', then imagecache would be fine. If the 'files' directory was named something not 'files', then imagecache, private downloads, and other dynamic generations would fail. They could be made accessible again only if the .htaccess and Drupal source code was modified (fun, right?).
Comment #128
catchShould've read the patch again, thanks though Morbus.
Comment #129
moshe weitzman CreditAttribution: moshe weitzman commented@Morbus - could you recommend an approach then? Right now, a site can nearly crumble if a few bad invalid images get posted to the home page (for example)
Comment #130
catchJust a note I think we could do my approach via #444756: Add a page_not_found() hook to Drupal with a few modifications.
Comment #131
Morbus Iff@moshe: I don't have a suggestion you'd like, no - I don't think it can be cleanly fixed at the .htaccess level (i.e., meaning we'd need to do /some/ sort of bootstrap; ala @catch's point). However, I also disagree entirely with your assertion that "a site can nearly crumble with a few invalid images" - that's ridiculous: as someone who has been running the same site for more than 10 years, and gets dozens of 404s every minute, (historical, rootkitish, and accidental, etc.) it's a bit FUDdish. To think that Drupal could get to where it is now, with such a easy 404 DoS, is inane.
I will agree, however, that certain shared hosts can have such a problem with a popular site and sliced resources. I just don't think, however, it's so incredibly widespread that Drupal sites everywhere are being crippled because of it. Do I think something should be done about it? Sure, I have no problem with that. Do I think it's critical? Nope. Do I think it's critical enough to a) strip an existing expectation from Drupal (404s in the reports) and b) break the equivalent functionality at the webserver/log level? Most definitely not.
Comment #132
killes@www.drop.org CreditAttribution: killes@www.drop.org commentedAlso, putting the notice about the missing images into the watchdog will allow site maintainers to create the missign images or to remove the broken node/template/whatever.
This issue is pure sillyness.
Comment #133
moshe weitzman CreditAttribution: moshe weitzman commented@catch - in this issue, we want to handle 404 without a bootstrap. really we want to handle a 404 without any php at all but definately no bootstrap. a 404 hook completely the wrong direction IMO. a 404 hook for the files directory might be fine and can be pursued in that issue.
Comment #134
catchAlso we haven't been doing a full bootstrap for favicon.ico 404s since 28th June 2008, so this patch only fixes image files which aren't favicon.ico and aren't in sites/*/files. I think with something in drupal_not_found() we could skip rendering the page for files in sites/*/files as well.
Comment #135
gpk CreditAttribution: gpk commented>skip rendering the page
If page caching is on then the response will be from the page cache anyway, so there will only be a full bootstrap/rendering of the page if the request hasn't already been cached.
Comment #136
killes@www.drop.org CreditAttribution: killes@www.drop.org commentedcool, so I need to check if favicon.ico exists myself? What a truly great idea. I'd expect something like this from WP or Joomla, but not from Drupal...
Comment #137
Owen Barton CreditAttribution: Owen Barton commentedComment 127 is not correct - the file directory path does not need to be set to "files", but simply needs to include the string "files" somewhere within it. There is a validation on the setting in the patch.
I agree that the current patch breaks Apache logging (404.ext) in an unacceptable way and is not committable. Possibly there is a way to fix this - I tried briefly to no avail. I think I am also leaning towards a php based solution, ideally with optional Drupal logging (and hence DB bootstrap). Either way, I DO think we need an improvement here - I have seen several site outages caused by 404s when assets on a single browser page view creates many Drupal hits, or the page has repeating AJAX callbacks that are 404ing.
Either way, I think 2 of the 3 tests I created are still worthwhile, since they ensure the imagecache style facility is functional (which I think could pick up some subtle breakages in the file path/url functions).
Comment #138
Morbus IffOwen - the RewriteCond specifically looks for "/files/" (unbounded), but the validation code only looks for "files". This would allow me to use "drupal_files" inside Drupal, and pass the validation, but lose the "fast" 404s.
Comment #139
killes@www.drop.org CreditAttribution: killes@www.drop.org commentedI actually like to translate the default files path to use "dateien"...
If your system is prone to meltdown because of a few missing files I tend to think that this is not a Drupal problem per se.
You should evaluate your hardware and possibly buy more.
You also should test your site on a staging system to catch such 404s in advance.
Comment #140
Owen Barton CreditAttribution: Owen Barton commented@Morbus - correct, this was a bug - however the symptoms would be broken imagecache, rather than lost fast 404s.
Attached patch fixes this, and also fixes the issue with the tests failing when mod_rewrite is not enabled. It doesn't fix the Apache 404.ext logging issue however.
Comment #141
Owen Barton CreditAttribution: Owen Barton commentedSigh
Comment #142
moshe weitzman CreditAttribution: moshe weitzman commentedWell, 5 broken images on your home page can be a nearly a 5x multiplier on your traffic. It leads to 5x on the number of bootstraps that your web and DB servers have to service.
Comment #143
killes@www.drop.org CreditAttribution: killes@www.drop.org commentedExcellent reason to just fix the stupid images. Thanks to Drupal (as it is now) you'll at least be alerted to the problem.
Comment #144
Owen Barton CreditAttribution: Owen Barton commentedYou won't be alerted to the problem if the site is down due to 5x traffic
Comment #145
Morbus IffYou can't solve the issue if your error log shows 5 entries for 404.jpg.
(See, we can do snarky replies too!) ;)
Comment #146
Owen Barton CreditAttribution: Owen Barton commentedSnarky replies aside, I think everyone agrees that logging 404.ext for missing files is not acceptable.
What we need is either an alternate .htaccess approach (that logs 404s correctly, and works acceptably both with mod_rewrite enabled and disabled) - OR a PHP based approach that is sufficiently faster than a cached bootstrap to make it worth doing.
Comment #147
m3avrck CreditAttribution: m3avrck commentedFWIW, I do this on my sites:
Than I usually set the path to sites/whatever/theme/whatever/404.php and include my logic there.
This is in addition to setting the htaccess to point to this same 404 file as well (and having similar htaccess work).
Comment #148
Owen Barton CreditAttribution: Owen Barton commentedOK, let's try a different approach. I think this one could even make everyone happy!
I think we have demonstrated that a .htaccess based approach is too problematic. So this is a simple PHP approach. It allows you to do send a lightweight 404 for matching URLs either:
a) just before the normal 404 would otherwise be sent (the default). This has minimal side effects (actual pages that happen to match still get served as normal, and 404s are logged correctly in both the Apache and system log). Of course, you still have a pretty good chunk of processing needed to reach this point (we manage to skip most of the theming work though), but users still benefit from bandwidth savings, and this should be safe enough to enable by default.
b) Alternatively (and not the default!) you can return the 404 directly during the configuration bootstrap. This has huge benefits for server load, but significant side effects (identical to the htaccess ones we saw) - even valid pages that happen to match the pattern will return 404s, rather than the actual page. It will also prevent the log entry. This could still be perfectly acceptable for large/active sites with well understood URLs looking to get an edge off their server load.
There are a few things I would like feedback on:
- Does this general approach work for folks? If there are fundamental objections please speak up.
- Naming - is "lightweight 404s" grokable? I didn't want to use "fast" since it is really not a super performance bonus by default. Should we say "not_found" instead of 404?
- Regular expressions in settings.php - this strikes me as a bit risky/unfriendly. drupal_match_path() could work in a pinch, but needs case lowering, and would mean we couldn't use the jp?eg style shortcuts. Other ideas?
- Calling drupal_lightweight_404() directly in setting.php (optionally, of course) is also kind of a new concept. Should we make this a $conf variable (lightweight_404_fastpath, perhaps), and check it in bootstrap.inc somewhere?
- Are there any clean up tasks (hook_exit etc) we really need to do in either case (a) or (b)?
Comment #149
kbahey CreditAttribution: kbahey commentedOwen
I like this. It is totally configurable and off by default. So no one should complain about it.
My modification is attached. The most important modifications are:
1. In drupal_fast_404(), you were missing this before the print statement:
2. Added a pattern for cached css and js where we have a ? and then one letter after it.
I did some bikeshedding, please feel free to ignore. It is renaming "fast" instead of "lightweight". Shorter to type and remember. This should NOT derail the patch. I am happy to revert.
Please let us get this in in Drupal 7. It is way over due.
Can we get a review and an RTBC?
Comment #150
Owen Barton CreditAttribution: Owen Barton commentedGood catch - I tweaked it to add the header with drupal_add_http_header() rather than header() and also changed the order of operations in drupal_deliver_html_page() which avoids a duplicate header (or now function call).
I also removed the css?. and js?. patterns - we are checking against $_GET['q'], which already excludes the querystring, so these aren't needed.
No objections to the name change.
Comment #151
Owen Barton CreditAttribution: Owen Barton commentedComment #152
moshe weitzman CreditAttribution: moshe weitzman commentedI think some sentences got mashed together improperly here. Otherwise, is RTBC. Nice work.
Comment #153
Owen Barton CreditAttribution: Owen Barton commentedHow about this?
Comment #154
Owen Barton CreditAttribution: Owen Barton commentedOK, that didn't parse either...try this
Comment #155
moshe weitzman CreditAttribution: moshe weitzman commentedOK, lets give this poor issue a chance to shine.
Comment #156
sunThis needs to be a proper function summary, stating what this function does. The current summary is rather part of the description.
"Returns a simple 404 Not Found page."
"However, some of these responses are for images or other resources that are not displayed to the user. This can waste bandwidth and generate unnecessary server load."
The preceding sentence should end with a colon (:), and the list should directly follow without blank line.
The array keys should be delimited with a colon, followed by a description, e.g.
- 404_fast_paths: A regular expression to match paths to return a simple 404 page on. [...]
Trailing white-space here.
1) Should be /**
2) Missing blank line before this docblock.
116 critical left. Go review some!
Comment #157
Owen Barton CreditAttribution: Owen Barton commentedThanks for the review, Sun. This should resolve all these issues.
Comment #158
sunThanks!
Comment #159
kbahey CreditAttribution: kbahey commentedWebchiiiiiiiiiiiiick! Driiiiiiiiiiies!
Comment #160
Dries CreditAttribution: Dries commentedThis looks like a new feature, and should go into Drupal 8 at this point.
Comment #161
kbahey CreditAttribution: kbahey commentedDries I disagree that it is a new feature.
This is a performance improvement. We bootstrap the whole of Drupal when the request is a static file 404.
This patch does not change any behavior and is off by default, enabling it reduces resource waste when a lot of 404s for static files are happening.
Comment #162
sunI agree that this would be good to do for D7.
However,
is not entirely true, due to this change:
Since the pattern is always defined, you will get a fast 404 page instead of a full one, if the pattern matches.
Can we remove this change in behavior, for example, by introducing another variable to toggle the entire functionality?
113 critical left. Go review some!
Comment #163
kbahey CreditAttribution: kbahey commentedFair point sun. Thanks for the catch.
Here is a version of the patch that is completely off by default. The user needs to uncomment stuff in settings.php if they want that on.
Dries, it REALLY does not change any behavior at all, and fixes a performance bug.
Comment #164
Dries CreditAttribution: Dries commentedI'm not ready to commit this, and want to discuss this some more.
For example, I'm not a fan of the list of extensions. For site visitors, it doesn't make much sense to get a different 404 depending on file extension.
Maybe we should go back to the Drupal 6 solution where 404 pages where fast (because blocks where omitted)? We should try to make all 404s fast and avoid more complex solutions.
Comment #165
kbahey CreditAttribution: kbahey commentedThe 404s that are emitted because of static files are often invisible. For example, the site owner changes the name of a small PNG from CSS, or misspells it. The page renders normally for the end user, but this PNG (and perhaps others) triggers 404s that the user will never see. Despite these not being seen by the user, they do a full Drupal bootstrap with all the overhead. This solution avoids this unnecessary, yet invisible, overhead.
We are removing known overhead for known file types. The rest of 404s require a Drupal bootstrap to check the menu, ...etc. So we can't do much about them until we reach that stage. But why go that deep into Drupal only to send a 404 that the user will never see?
Even if the user sees the error, it is the site admin's decision to show them a full or brief error message. We just give them the means to do so, not force them to.
On all high traffic sites that 2bits.com manage, we have a variant of this in settings.php just to avoid this overhead. We look at the extension and emit a 404 and then exit right in settings.php.
As for the block-less 404, I personally don't like it since it looks inconsistent and odd, and only removes one overhead source (blocks) not the full bootstrap.
Comment #166
catchDrupal 6 only omitted the left and right sidebars, not all blocks, and many sites don't use block module at all, likely to be less in D7 now it's optional.
Also, omitting blocks isn't much of an optimization depending on the site, say 10/20% of total page generation. Skipping rendering altogether is a good 50% or more on D7 since so much happens in drupal_render_page(), the two can't really be compared.
Additionally the D6 optimization often meant that search boxes or the navigation menu were omitted on 404s . Quick search shows a tonne of issues complaining about that optimization, following list isn't exhaustive. It's also not a simpler solution - requires messing about with the rendering process, and there are still cleanup patches outstanding for it:
#221399: Allow administrator to decide if blocks should be displayed for site_404 pages
#423992: Remove show_blocks page variable
#141162: 403/404 errors should allow blocks
#423696: Allow display of menu links on 404 error page
On the patch here, I fully agree with Khalid that most 404s for files are invisible - on one site I run which has been going since 2003, 2005 on Drupal, the majority of 404s for images are very old ones which were embedded in pages on external sites - unless you look in the net tab on firebug you might not even realise those images were ever supposed to be displayed. Going to all the effort of a full bootstrap for those is a bit much.
Even if someone visits your site directly and gets a 404 for an image, they're very unlikely to see the full page render because they'll be viewing a page with that image embedded too. The only case where you'll actually see the 404 for a file, is if it's linked to directly somewhere, both within sites (which tend not to link to dead files so much) and external links, neither are really the common case - and even then a lot of those hits will be from crawlers.
However, I'd like to see this patch remove the special handling for favicon.ico in .htaccess, or at least disable it by default. Seems like that was a stopgap which mainly went in because this issue hadn't. I'm also not too sure about having the function defined twice - are we 100% sure that's never going to cause issues with opcode caching etc?
Comment #167
kbahey CreditAttribution: kbahey commentedCatch, thanks for the review.
The attached patch removes the special handling for favicon in .htaccess. Yes, that was a special stopgap and is no longer needed with this new approach.
I can't find anywhere that the function is defined twice. It is defined in bootstrap.inc once only.
Comment #168
catchThat'll teach me to review patches on Sunday night, obviously it's not defined twice, just called in two different places.
Next question is what happens if you use image derivatives (or whatever imagecache is called now) with the settings.php version? I'm not up to speed with how it handles requests in D7 now, but if that's going to be affected, I think it's worth adding a particular note either way there that it's incompatible - given that's part of core now. Also if it /is/ incompatible, given that additional optimization is only in settings.php so doesn't require a core hack, it could be taken out without making the overall patch any less useful.
Comment #169
Dries CreditAttribution: Dries commentedFrom an end user point of view, I actually like the fact that all pages do a full bootstrap because all 404s create an entry in my watchdog -- that is very convenient and allows me to actually fix 404s. I use that all the time.
Comment #170
moshe weitzman CreditAttribution: moshe weitzman commented@Dries - sure, that watchdog is very handy for some sites. Those sites will do nothing and still get the watchdog.
This new line in settings.php is for sites that can't afford a full bootstrap because of scaling constraints. This is a valuable weapon for those sysadmins. As you are hearing here, we've all been doing this in our own hackish way in settings.php for *years* and now we are trying to share that with the world. If this doesn't get committed, then only those in the know get the weapon. Owen, kbahey, catch and myself, don't need this patch. We're doing it to benefit everyone else.
I'm pretty sure that the sysadmins for Drupal Gardens and Buzzr would want this tool in their belt.
Comment #171
kbahey CreditAttribution: kbahey commented@catch, I may be mistaken, but this is what I could find from a quick search on image derivatives:
So it seems to check the physical file on the file system, not rely on a 404. So that part does not need further action for this patch. If someone finds otherwise, I am happy to add a comment.
@Dries, the 404s in the watchdog are mostly noise on big sites. Old links from other sites, old elements from themes that have been changed for a while, ...etc. It is rarely about real full pages. In that case, the user still has the option to skip the new fast 404 as Moshe said.
Comment #172
kbahey CreditAttribution: kbahey commented@Dries,
Here is what I use for several high traffic site, in settings.php.
So this patch comes out of real needs for real sites.
Comment #173
catch@Dries, there's two optimizations here, one is to avoid the full bootstrap altogether, the other is to not do a full page render for those paths. With the latter, you still get a very large performance gain for those particular file extensions (at least 50%), but you also get the watchdog message. There's currently no way to do that without a core hack at the moment.
Comment #174
Owen Barton CreditAttribution: Owen Barton commentedHere is a table describing the status-quo and the 2 options added by this patch:
From a scalability point of view 404s can be a huge problem for sites with old domain names that have gone through multiple re-structurings over the years. The 404 traffic simply from hotlinked files alone can be a significant draw on resources. The current setup is also a huge risk for active sites - if you add a 404 to a busy page, or perhaps accidentally move, chmod or otherwise make the files directory inaccessible, server load instantly skyrockets and can render the server unresponsive.
From an end user point of view this is also a major issue. Browsers will patiently sit around with the loading spinner going waiting for Drupal to fully bootstrap, generate a beautiful, fully themed page to represent an image or css file, that the browser then completely ignores as soon as it notes the 404 http code. Not only does this mean that we end up waiting for Drupal to bootstrap twice (or more!), but the response is at least a factor of ten larger than it needs to be. In some cases (e.g. whitehouse.gov or amnesty.org) it is over 100 times larger than necessary. Multiply this by all the 404s on all the Drupal 7 sites, and the end result is a lot of end user wait time, and a lot of bandwidth and cycles wasted.
Currently the default is configuration A - while this is safe and conservative, it has no benefit - most sites and users would be left with the same issues we have now (and a re-broken default favicon 404 to boot). I would strongly argue that if you consider the table above it is clear that configuration B is actually a perfectly reasonable and safe default for the overwhelming majority of sites.
Comment #175
Owen Barton CreditAttribution: Owen Barton commentedHere is an updated patch with configuration B as the default, which I think is the best choice by far. If the only way this can get committed is disabled by default then I think we need to go with the patch in 163, because we can't really remove the favicon.ico workaround if the fix is disabled.
Comment #176
kbahey CreditAttribution: kbahey commentedThe whole idea for the earlier version (#167) is reduce resource usage and improve performance and scalability. The patch in #175 is a step back. Although it is better than the status quo, it is still too late in the boot process and consumes a lot of unnecessary resources.
I would like us to seriously consider #167 for commit.
Comment #177
Owen Barton CreditAttribution: Owen Barton commented@kbahey - I think you have misread my patch - #167 is more conservative that #175 - the former is completely off by default, and identical to the status quo in terms of resource usage and scalability. #175 cuts out a significant chunk of process and bandwidth by default on all sites, yet retains almost identical behaviour and logging. If an admin decides they need more savings they can comment out the exact same settings.php line you have in #167 and return a 404, skipping the rest of the bootstrap.
In addition #167 means that favicon.ico is again broken by default, and will be creating a full bootstrap on every page (even cached pages) just to generate the 404 response to favicon.ico, which is completely unacceptable IMHO and why I suggested #163 as a fallback.
Comment #178
jpmckinney CreditAttribution: jpmckinney commentedI prefer having configuration B as described in #174 as the default, so I endorse the patch in #175.
If configuration A must be what is commited, then I agree that #163 is preferable to #167.
I would have set the status to RTBC, but it's already the case.
Comment #179
kbahey CreditAttribution: kbahey commented@Owen Barton
My bad. I understand it now. It is better than the previous iterations, since it still preserves the watchdog. The early bailout functionality is optional and off by default and that is what I am looking for.
So +1 from me.
Comment #180
Dries CreditAttribution: Dries commentedWould like to hear from moshe or catch on approach B. Seem like it is the only option given some of the regressions with option C. Great table Owen!
Comment #181
moshe weitzman CreditAttribution: moshe weitzman commentedI'd be thrilled to have B as default. #175 gets +1 from me as well.
Comment #182
catchB sounds great to me as well, don't think we can get away with C as a default.
Comment #183
geerlingguy CreditAttribution: geerlingguy commented@kbahey / #172: I tried copying and pasting verbatim at the end of my settings.php (without the open/close PHP tags), but Drupal wouldn't load after doing so... is there something else you did to get the above code to work for 404s on files? [Edit: just noticed, there's no logic to check if we're at a 404 status here... thus all files will be served as 404]
My site gets about 500-1000 404s per minute for hotlinked images which have gone missing now, and this causes my D6.x installation to consume a constant background CPU usage of about 1 (for a 4-core server).
Also, subscribe.
Comment #184
kbahey CreditAttribution: kbahey commentedI emailed you the code via the d.o contact form so as not to create further noise in this issue (and dries can commit it ...)
Comment #185
ddorian CreditAttribution: ddorian commentedlittle bump
Comment #186
Christefano-oldaccount CreditAttribution: Christefano-oldaccount commentedddorian, that's really not necessary. Please don't do that.
Comment #187
moshe weitzman CreditAttribution: moshe weitzman commented#175: 76824-14.patch queued for re-testing.
Comment #188
sunIs $_GET['q'] set in D7 when settings.php is included?
85 critical left. Go review some!
Comment #189
kbahey CreditAttribution: kbahey commentedActually, yes, it works at this point. Just add a print there and you will see it.
I put a better path using base_path() and $_GET['q'] in common.inc. base_path() cannot be used in settings.php though, so just enclosed it in double quotes.
Also, I did some benchmarks on this last patch, and here are the results:
Using apachebench, and visting example.com/none (which is a 404), with 5 users, 1000 requests
1. Without patch
Requests per second: 49.71
2. With patch, default settings
Requests per second: 51.11
3. With patch, with function uncommented in settings.php
Requests per second: 51.15
When visiting example.com/none.jpg (which a 404 that triggers a fast 404), 5 users, 1000 requests
With patch, with function uncommented in settings.php
Requests per second: 352.31
Orders of magnitude faster, now with numbers ...
Comment #190
Owen Barton CreditAttribution: Owen Barton commentedI wonder if the $_GET['q'] needs a check_plain - I think it might.
Comment #191
sunSo effectively two issues:
1) check_plain().
2) The actual path should really be injected via strtr() during output, and not during variable assignment.
Powered by Dreditor.
Comment #192
kbahey CreditAttribution: kbahey commentedBoth issues fixed now. The variable has a @path token that is replaced via strtr(), and also check_plain() is used.
Comment #193
moshe weitzman CreditAttribution: moshe weitzman commentedyou can't run t() on dynamic strings
Comment #194
Owen Barton CreditAttribution: Owen Barton commentedThis fixes the t() call, and also replaces the $_GET['q'] with request_uri() (tested in all 3 usages), which I think is exactly what we need, since we want to show everything the after the domain part of the URI, and $_GET['q'] doesn't have the base path, and also can get messed with by language prefixes and similar systems.
Comment #196
kbahey CreditAttribution: kbahey commentedI don't understand why this failure is happening after t() was fixed.
Rerolling for offsets against current HEAD.
Comment #198
moshe weitzman CreditAttribution: moshe weitzman commentedThe test appears flawed. It assumes that a language is always available but this is very early in the bootstrap so it is OK to call t() [this is new in d7], but only custom string translation will be performed.
Comment #199
kbahey CreditAttribution: kbahey commentedThe flaws of automated testing ...
I can't fix the test to detect early and normal though ...
Moshe, would you consider using strtr() here appropriate?
Comment #200
geerlingguy CreditAttribution: geerlingguy commentedJust FYI, using this code (for me) has caused some problems with ImageCache; instead of IC getting wind of an attempt to access a .jpg, .png, etc. in one of the ImageCache directories, then creating the image, Drupal throws a 404 before ImageCache is called... something to watch out for.
Comment #201
ajayg CreditAttribution: ajayg commentedPerhaps having a check on the path where imagecache stores the cached images will do the trick.
If the path of image requested matches a parttern ( say "cache") then don't do anything else send 404.
Comment #202
Owen Barton CreditAttribution: Owen Barton commented@geerlingguy and @ajayg - please see comment 174 this patch will not break imagecache type modules by default, and will not break the image generation ("imagecache in core") in Drupal 7 even when the optional settings.php "fast return" is enabled. If you use Khalid's code on a Drupal 6 site then it will break Drupal 6 imagecache if you don't add extra conditions. However, please do not discuss imagecache or Drupal 6 in this issue - this causes clutter and distraction.
Comment #203
WiredEscape CreditAttribution: WiredEscape commentedsubscribe
Comment #204
alanburke CreditAttribution: alanburke commented#196: 76824-18.patch queued for re-testing.
Comment #206
scor CreditAttribution: scor commentedThe test fails because this hunk adds the 404 uri in the page
which will contain the $langcode, something that is not expected by locale.test. It appears to me that this test in locale.test (lin 263-265) is a hackish way of checking for a 404 anyways, so I've fixed that.
Comment #207
moshe weitzman CreditAttribution: moshe weitzman commentedGosh, I thought this was in already. Thanks scor.
Comment #208
Owen Barton CreditAttribution: Owen Barton commentedRerolling for patch fuzz...
Comment #209
marcingy CreditAttribution: marcingy commented#208: 76824-fast_404_208_0.patch queued for re-testing.
Comment #210
marcingy CreditAttribution: marcingy commentedComment #211
moshe weitzman CreditAttribution: moshe weitzman commentedIt passed again. Pretty please commit please.
Comment #212
Anonymous (not verified) CreditAttribution: Anonymous commentedIs there any intermediate 6.x solution, while we all wait to move to d7?
I currently hacked common.inc's drupal_not_found() to directly load a custom static 404.html with some extra htaccess rules to handle files 404's.
Comment #213
geerlingguy CreditAttribution: geerlingguy commented@morningtime - Here's code for Drupal 6 only (paste it at the bottom of settings.php):
Comment #214
kbahey CreditAttribution: kbahey commented@morningtime
Don't hack common.inc. Kittens die horrible deaths. There is an easier and more maintainable solution, with only a change in settings.php.
@greelingguy
If you use imagecache, then you have to exclude its paths from being handled.
Here is the code I use for most sites: Reducing server resource utilization on busy sites by implementing fast 404s in Drupal.
Comment #215
scor CreditAttribution: scor commentedthe default.settings.php hunk is missing from #208, rerolling.
Comment #216
Anonymous (not verified) CreditAttribution: Anonymous commented@kbahey
Thanks! I learned a lot from your DrupalCon video!
Comment #217
Owen Barton CreditAttribution: Owen Barton commentedI did some benchmarks to show the effect of the current slow inline 404s on browser display latency for end users. These were done using http://www.webpagetest.org/ testing the same site (in different configurations, 404s introduced through a node body promoted to home page) using default settings, 10 test runs each time.
See attached file for more detailed benchmarks.
The results are not insignificant - an inline 404 currently slows down the parent page display by anything from 500ms to a full second, depending on the type of 404. Committing the patch will shave something like 100ms-500ms off that time. If site builders enable the "fast 404" return in settings.php the pages display almost as fast as if the 404 wasn't there - anything from 500ms to a full second faster than the current performance. In other words, this patch clearly has enormous benefits for both end user responsiveness, in addition to the server utilisation benefits described in depth above.
Comment #218
BenK CreditAttribution: BenK commentedThanks for the benchmarks... very interesting indeed!
Comment #219
webchickThis has Dries's fingerprints all over it, so I'll leave it to him to either commit or D8 this.
Comment #220
naxoc CreditAttribution: naxoc commentedSubscribe
Comment #221
scor CreditAttribution: scor commentedpatch didn't apply any longer
Comment #222
Owen Barton CreditAttribution: Owen Barton commentedSummary is at #174.
Server side benchmarks are in #189 and front-end benchmarks are in #217.
Comment #223
sbozhko CreditAttribution: sbozhko commented#221: 76824-fast_404_221.patch queued for re-testing.
Comment #224
sun#221: 76824-fast_404_221.patch queued for re-testing.
Comment #225
sunAlthough badly needed, this is D8 material according to the rules (I had to learn today). It may be backported at a later point in time (though that's unlikely).
Comment #226
moshe weitzman CreditAttribution: moshe weitzman commentedrecover from temper tantrum.
Comment #227
Owen Barton CreditAttribution: Owen Barton commentedThis has been RTBC since June. Would love to get this in, or find out what is needed to get to that point! See meta-summary of summaries/benchmarks in #222
Comment #228
moshe weitzman CreditAttribution: moshe weitzman commented#221: 76824-fast_404_221.patch queued for re-testing.
Comment #230
moshe weitzman CreditAttribution: moshe weitzman commentedReroll. Keeping at RTBC since there are no changes.
Comment #232
moshe weitzman CreditAttribution: moshe weitzman commented#230: 404.diff queued for re-testing.
Comment #233
moshe weitzman CreditAttribution: moshe weitzman commentedBot finally recovered from its bender. Back to RTBC.
Comment #234
tom_o_t CreditAttribution: tom_o_t commented#230: 404.diff queued for re-testing.
Comment #235
webchickAnother one with Dries's fingerprints all over it.
Leaving for The Spikey Haired One to make the call.
Comment #236
rfayShoot, sorry for the interruption here... just noticed this.
Related issue: #177790: Standalone scripts like cron.php can be erroneously cached if they go temporarily missing.
Dreamhost regularly kills off running cron processes, creating an artificial 404 on cron.php. Which Drupal then caches oh so disfunctionally.
It would be fantastic if cron.php and update.php and the other php files in the root directory could be added to this list to resolve that longstanding WTF tension between Dreamhost's bizarre process-killing and Drupal's tendency to page-cache the wrong thing.
If that's inappropriate, feel free to go back to RTBC.
Comment #237
agentrickard@rfay - given that this is my oldest core issue, I'd hate to see it derailed due to one host's sloppiness.
Is there a way that the Dreamhost error can be fixed in contrib or config, or do we have to account for that here?
Comment #238
rfay@agentrickard... I understand completely. I was just mentioning this weird and problematic Drupal behavior that would be fixed by adding 3 words to the patch (I think).
If I read the patch correctly, all it would take to fix a Drupal install with the weird caching-404 problem would be to add cron.php to the variable provided by this patch. So it doesn't even take contrib.
But since cron.php and update.php are clear cases that should *also* not be cached, I just thought it might be nice to get those in.
I understand completely if you just set this *right back* to RTBC. I was just stopping by, and happy to see this happening.
Comment #239
Owen Barton CreditAttribution: Owen Barton commentedI don't think this is the right place to fix the cron.php issue - this issue is about a performance improvement that can optionally be disabled: if it was disabled the bug would reoccur. Also, a few contrib modules add their own .php entry points (for example ad module uses an adserve.php for javascript rotated ads), and a hardcoded "cron.php|update.php" in the variable means that these entry points would still be susceptible to that issue. It sounds to me like a better fix would be to test that the index.php entry point was used before caching a page, set the $conf['cache_inc'] cache.inc default in index.php (or set it to cache-install.inc in cron.php) or some similar approach.
Comment #240
Owen Barton CreditAttribution: Owen Barton commented#230: 404.diff queued for re-testing.
Comment #241
Fidelix CreditAttribution: Fidelix commentedI see this makes changes to .htaccess
Will this work with other webservers like nginx, lighttpd and cherokee?
Apache is not the only one running out there.
Comment #242
geerlingguy CreditAttribution: geerlingguy commented@Fidelix / #241 - I know nginx doesn't support .htaccess files at all (a different mechanism is used for rewrites and such).
However, we have the web.config file for IIS (I think), and .htaccess for Apache. Those who use other server environments will need to do the work to translate the rewrite rules correctly for their own servers. There are hundreds of different web server projects out there—the htaccess hits the overwhelming majority of use cases for Drupal...
Comment #243
catchThe only change to .htaccess here is the removal of the special casing of favicon.ico.
Comment #244
kbahey CreditAttribution: kbahey commentedThis did not make it to Drupal 7.x.
Sigh ... moving to 8.x ...
I guess I have to fulfill a vow I made to myself and kill the family cat
Comment #245
geerlingguy CreditAttribution: geerlingguy commentedWith the length of this issue, we might need to consider either pushing this in soon, or opening up a new summary issue for further discussion.
Comment #246
catchI don't think we should push this to 8.x until HEAD opens up. Edit - a new summary issue might help yeah.
Comment #247
mlncn CreditAttribution: mlncn commentedThis is a critical (in the layman's sense) performance patch and the RTBC patch works.
If only i'd come across this last night, i would have known what to do with my extra drink tickets and access to Dries. Please?
Comment #248
kbahey CreditAttribution: kbahey commented@catch
Do you think it has a chance to make it into Drupal 7.x?
I don't think so, personally, but if you think it would, then I would be pleasantly surprised.
That cat is doomed ...
Comment #249
jcisio CreditAttribution: jcisio commentedSubscribe. Love to see it get in D7.
Comment #250
Vacilando CreditAttribution: Vacilando commentedSubscribing.
Comment #251
soyarma CreditAttribution: soyarma commentedMyself and others have created a module which achieves essentially what is discussed here, as well as very early URL checking (either right out of settings.php, or within hook_boot). Right now there is a D6 release and a D7 release is in the works.
http://drupal.org/project/fast_404
Comment #252
perusio CreditAttribution: perusio commented#241
Nginx doesn't need anything of this sort. It serves the static files first if they exist. AFAIK this is Apache specific.
Hop in to the Nginx group on g.d.o and check a few configs that do the right thing (tm).
Comment #253
Owen Barton CreditAttribution: Owen Barton commented@perusio (#252) this is incorrect - this patch has just the same benefits for Nginx and any other webserver set up so that Drupal is still responsible for serving 404s (which it almost certainly will be if clean URLs are configured). The patch means that Drupal does reduced processing (either massively, or somewhat - depending on the configuration), and also returns a lower bandwidth response. This all happens in the PHP - the webserver has no effect here, other than checking that an actual static file doesn't exist before handing control over to PHP - this behavior is the same on all webservers that support Clean URLs, and is unchanged by this patch.
@Fidelix (#241) - this does actually not modify .htaccess, other than removing a hardcoded check for non-existent favicon.ico that is no longer needed. Earlier patches did use .htaccess to return lightweight 404s, but that turned out to not be a good approach for most sites. As described above this change is completely independent of whatever webserver is used.
Comment #254
perusio CreditAttribution: perusio commentedNot really if you setup the proper rules in your nginx config. Supposing you setup a static page for handling the 404s.
Of course if you relay it to Drupal then the patch applies. But for static files you can be out of harm's way. In the case of images the issue gets complicated because of image cache. If you don't use it than the above config avoids altogether touching the PHP layer.
Also the configs discussed at the g.d.o Nginx group handle the static files properly. The above is just a random example. A proper config is more complex.
No: almost 'certainly' it isn't. This all happens because you're relaying all 404s to Drupal.
Comment #255
jcisio CreditAttribution: jcisio commentedperusio, your nginx conf is the first attempt in the patch in #0 in 2006. Since then it has evoluted.
Comment #256
perusio CreditAttribution: perusio commented#255 I beg to differ. This all issue happens because of this:
If you remove it then this is patch is completely unnecessary. Now I'm afraid my mod_rewrite chops are quite limited and in that sense I don't know if you really need this for things like imagecache to work properly with Apache.
But I suppose that you can write a rewrite rule targeted to the imagecache URIs and act upon it. Of course the above "catch all" directive handles it out of the box. So no need to instruct people running imagecache for inserting such a rule in their .htaccess.
And it's not "my" config, but rather common shared knowledge between anyone that bothered to lookup on prior art before running drupal with nginx.
Comment #257
agentrickardIf you remove that (without access to server-level host configuration files), Drupal won't function.
Comment #258
perusio CreditAttribution: perusio commentedOk. I'm not privy to the inner workings of Apache and how all the container context delegation thing operates really. Which I suspect is what is being adressed here.
In Nginx there's no .htaccess equivalent. To reload a config you send a SIGHUP to the running instance and the config is read into memory where it stays unchanged until you send a new SIGHUP.
Can you elaborate on your comment? I would like to know how things are handled in Apache so that the
ErrorDocument
for 404 is needed.Thanks
Comment #259
agentrickardDrupal (and WordPress and Joomla) are giant 404 handlers. That is, they process any URL request that the webserver cannot. All Drupal requests are actually routed through example.com/index.php?q=path/of/request.
On shared hosts, most people don't have access to Apache configuration, so those rules have to be set in .htaccess. The side effect, of course, is this issue, which has been around for years.
For people with admin control over their hosting environment, be it Apache, Nginx or what have you, this can be remedied quite easily. For others -- including the non-technical user -- we need to provide sane defaults for them rather than the catch-all method of routing _all_ 404s through Drupal.
Comment #260
geerlingguy CreditAttribution: geerlingguy commented@agentrickard - could you document that procedure (at least for Apache) here, or maybe add a documentation page to the performance section of the handbook, for everyone's benefit? It'd be nice to be able to have a wiki page where we can point people to for a brief overview of this huge issue...
Comment #261
perusio CreditAttribution: perusio commentedOops. Browser snafu made me cancel the submission but the form hand been already submitted. Sorry :(
See comment below.
Comment #262
perusio CreditAttribution: perusio commented@agentrickard I already knew that Drupal et al are "giant 404 handlers". That's not the point I was trying to get at. Instead I was under the impression that Apache context delegation thing doesn't allow to route around the issue of setting Drupal as the single 404 handler. So let me recap, this is my take on this issue, you and others feel free to chime in.
ErrorDocument 404 index.php
directive.Is that a fair assesment of the issue? If so then there should be an handbook page documenting this problem and the possible strategies to solve it. There's no inexorable need to either use the patch or the ErrorDocument directive. From what I gather that's a mere convenience that Drupal provides. It's nice for non-technically inclined people and also if you cannot have it configured as it should at the vhost level. There should be also accompanying documentation for Drupal when this patch gets merged in the regular Drupal core release.
Comment #263
jcisio CreditAttribution: jcisio commented@perusio: Your 2) and 5) are the same. AFAIK, there is no difference between the .htaccess and vhost level in routing a request/serving a 404 error.
Since very early, we don't want to solve it at the web server level as it is too restrictive (and not universal, maybe).
Comment #264
perusio CreditAttribution: perusio commented@jcisio not really. AFAIK you cannot specify locations at the directory context in Apache. The
Location
directive cannot be used in a directory context (.htaccess). This must be set at the Server or Virtual Host level.What do you mean by "not universal, maybe"? I suppose all web servers let you control locations.
Comment #265
agentrickardThree responses.
1) This new discussion is off-topic, and should be moved to a documentation issue (or jut to a docs page).
2) @perusio, you are ignoring the fact that a vast number of Drupal users have no server access to the settings you suggest we change. Or are you arguing that this should just be a performance documentation issue instead of a code-level fix?
3) I am not qualified to write the documentation page. Someone who knows more about Apache should take a stab at it.
Let's get back to actual reviews of the patch and its worthiness.
In this tangent, I think the original question is "Does this patch break Nginx?"
Comment #266
perusio CreditAttribution: perusio commented@agentrickard regarding 2). What I'm suggesting is that this issue be well documented and the tradeoffs explained.
Regarding 3) I'm not either. But I can seed it with the little I know. I no longer use Apache altogether. I can make a first draft and let others more knowledgeable then I improve it. I can take a stab at writing the Nginx related part, which is a nice segway into your last question.
This patch is unnecessary for Nginx with a decent config, unless you insist on letting index.php be the 404 handler. In most configs I've seen out there that's not the case. This patch introduces a performance penalty since you're checking if the file being requested is a static file or not. It may be small or not, it varies with the load and further investigation is required.
I'm not ignoring "vast number of Drupal users have no server access to the settings you suggest we change". What I'm saying is that if you have, like it's the case with anyone running Nginx, this patch is unnecessary.
I fail to see how it's "off-topic" since I believe the objective is to document as well as possible the tradeoffs involved. Althought I agree that a documentation page is now the proper venue to pursue this issue further.
EDIT: Regarding a documentation page in the handbook. The only thing related I could find is: http://drupal.org/node/56773
The handbook pages could benefit of a server related section IMHO. Does anyone here has power to start such a book page? I will gladly seed it with this issue as a section.
Comment #267
soyarma CreditAttribution: soyarma commentedHey Perusio;
I think your solution makes a heck of a lot of sense. Since (I imagine this may not be the case, but it usually is in my experience) Nginx has access to the static filesystem it can check too to see if the file exists. There is no need for it to pass every 404 off to Apache and then Drupal. Of course, by the same extension even if you don't have Nginx, you can do this in apache in a vhost.
If you give the nginx script essentially the same list of extensions as the patch and tell it to go to a static 404 if those files don't exist (heck make it a purty page and have your reverse proxy cache it) you'll save a ton of time in not going through 2+ other layers to server the 404 when nginx could have done it up front.
Definitely a not seeing the forest for the trees sort of thing.
Let me know what your doc page is when you have it rolling and I'll link to it from my Fast 404 module (still useful for folks who can't modify server configs).
Comment #268
Owen Barton CreditAttribution: Owen Barton commented@perusio - I am not following your logic here. Yes, if you use Ngnix configured to catch 404s for specific file extensions this patch is unnecessary. The same can be said if you use Varnish (which can easily accommodate these rules), various CDNs, or your custom built RFC1149 based protocol.
All of these approaches do have drawbacks (see above and below), are inaccessible to a great many users and are hence inappropriate solutions for Drupal core. Yes, of course they should be documented - but this issue is regarding a core patch, and discussion of documentation should focus on the patch and it's behavior. Trying to wedge documentation about the numerous alternative approaches that skilled site builders could employ is out of scope - please open a documentation issue to explore this, add comments to existing documentation pages, and/or apply to join the documentation team. By all means link to a documentation project issue from here.
I don't believe it is correct to say "with anyone running Nginx, this patch is unnecessary" - this is only true if the user has configured these rules, and that is not universally true - for example the first generic configuration on the Ngnix g.d.o. description (for MTecknology) does not have the required rules. In any case, even if this patch is unnecessary with a particular configuration, I don't see how that has any bearing on this patch.
Finally - dealing with these 404s in the webserver or proxy has a number of issues or side effects (that should be made clear in any documentation) - here is a summary:
- They requires root access on some web servers
- Imagecache style modules and/or non-default file directory paths each require a specific exclusion (that may require knowledge of regular expressions to implement)
- If a site uses Drupal's alias or redirect modules to handle legacy (non-404) paths ending in .html or similar, these may require specific exclusion (or vice versa if these are not used)
- Drupal no longer logs these 404s in the watchdog, which means many less skilled web admins will have no idea there is a static file 404, which is a problem in many situations
- Finally, the biggest issue with doing this in Apache (that Drupal supports, through it's default .htaccess file) is that the most effective rule prevents 404 paths from being logged at all (even in the webserver logs).
Can we please focus on the patch at hand which avoids all of these problem, whilst enabling a reasonable and simple high performance compromise in settings.php, and still doesn't prevent any of the web-server/proxy specific approaches from being implemented.
Comment #269
catchCan you point out where this performance penalty is? The only checking I see is in drupal_fast_404(), and with the current patch, unless you manually edit settings.php, that code only gets executed if you are already going to serve a 404 from Drupal.
Comment #270
perusio CreditAttribution: perusio commented@catch
Yes it's dependant on a variable being set. I was under the impression that the variable was being set from an DB update or upon installation of Drupal. Is it not something that it will be enabled by default on stock Drupal installs?
In settings.php there's the definition of the regex mask for the type of files being checked.
If that's not the case then there's no penalty. Otherwise since drupal_fast_404() is invoked in drupal_deliver_html_page it always get executed. And I suppose that it's always faster to perform that check at the server level instead of doing it at the application level.
If I request the URI /foobar and if there's not a menu entry for that path the drupal_fast_404() function get's executed.
Comment #271
catch@perusio, no that's deliberately left commented out in settings.php and it's a function call rather than a variable, so can't be affected by install or update (and there's nothing in the patch that affects install or update). That is purely an additional optimization over the main one this patch introduces, disabled by default.
Nope. drupal_deliver_html_page() will only execute that code if there is going to be a 404 anyway - since that whole hunk is inside a switch statement.
The only PHP being executed by default here is the assigning of two fairly short strings to $conf, which really is negligible by any standards (and that can be commented out by the same person that sets up the nginx config).
This leaves the fact we define the function at all in bootstrap.inc (which does have a small cost but barely measurable).
Comment #272
perusio CreditAttribution: perusio commented@catch Yes I know that. The function call in settings.php is for an even faster 404 since there isn't even a watchdog entry for the 404. The variable I'm talking about is in the condition of the if statement:
It the
$fast_paths
is set, then the check is always performed. I gave the example of requesting a URI for which there isn't a menu entry, hence the switch statement branch active is the one withdrupal_fast_404()
.My issue with performance is related to the regex matching in
preg_match
the remainder is negligible I agree.I'm reading directly from the patch #230 by Moshe. Isn't this the one being discussed? I don't see any other below that comment.
Comment #273
Owen Barton CreditAttribution: Owen Barton commented@perusio - this regex in drupal_fast_404 is not "always" run. If you set 404_fast_paths to FALSE in settings.php then the regular expression never runs - php evaluates if statements in the normal approach - if the "left" test of an ANDed condition fails, it does not even run the "right" test. Hence, if you prefer to do this check at the webserver level (and accept the various issues I describe in #268), then you can disable this patches function completely.
Comment #274
perusio CreditAttribution: perusio commented@Owen Barton
Yes that's correct. But the patch enables it by default.
Also I gather from what's being discussed above that the patch is for Drupal 8. Patching settings.php is potentially "dangerous". In the sense that this is a file specific to a given install and might have comments, code blocks, what have you, that will make the patch fail for this file: be it because of a too large offset, or because there are changes that make the patch tool not being able to find a match for inserting the new stuff.
Comment #275
Owen Barton CreditAttribution: Owen Barton commented@perusio - yes, we do enable it be default. I think it is very reasonable to ask people who would rather put this in their webserver/proxy configuration (and are able to understand the fairly complex tradeoffs involved in this situation) to place a single "#" in their settings.php. For the vast majority of Drupal users they will not have the desire or skills to do this in their webserver/proxy, and we should not expect them to need to change settings.php (this is why we have the installer, after all) - having a safe and faster default configuration is a huge benefit to them.
Note that we never "patch" running settings.php (for a start, core is unable to upgrade itself, and most users upgrade via wget). We would only change the default.settings.php, so that new sites would benefit. If this was committed to Drupal 8, then backported to Drupal 7, settings.php changes would go in the release notes for users to add themselves.
Comment #276
jcisio CreditAttribution: jcisio commented@perusio: it is enabled by default so that if you tune your server settings, you can then disable this feature.
I can't get what you mean by "patching settings.php". Once this patch get committed, all that you do is comment out that line. Nothing dangerous here. You do it manually, as instructed many other modules (caching modules, for instance) or Drupal core ("translation" of a few strings).
Comment #277
catchdefault.settings.php is copied to settings.php, so updates to that file don't affect existing settings.php on Drupal 7 sites that are already installed unless you copy the changes across manually.
Comment #278
perusio CreditAttribution: perusio commented@Owen Barton
Ok. I'll now move this debate to a documentation page and link it from here when available. Thank you.
Regarding the settings.php patching thing I was thinking on the case of upgrading core using drush. Or just even doing a simple:
And if providing a patch. Which isn't the case. The update will be counseled through release notes.
Comment #279
ogi CreditAttribution: ogi commentedsubscribe
Comment #280
Summit CreditAttribution: Summit commentedSubscribing, greetings, Martijn
Comment #281
catchComment #282
joelstein CreditAttribution: joelstein commentedFor anyone needing a temporary solution, here's what I have in my settings.php files for D7 sites. Also included here are paths to exclude, such as the "styles" paths (aka, "imagecache" in D6). I've included another exclusion, so you can see how multiple ones can be strung together.
For D6 sites, I use something slightly different:
I hope this contributes in some way to the discussion.
EDIT: Adjusted to account for the note in #283.
Comment #283
joelstein CreditAttribution: joelstein commentedThe patch in #230 has a problem. The regex doesn't escape the period, so paths that end with any of the extensions will issue a 404.
The attached patch fixes this.
Comment #284
jpmckinney CreditAttribution: jpmckinney commentedFixing status.
Comment #286
geerlingguy CreditAttribution: geerlingguy commentedUpdated git patch for 8.x-dev (looks like the other patch might've been against 7.x) straight from #283. Hopefully the testbot's happy...
Comment #287
anavarreSubscribe
Comment #288
Tor Arne Thune CreditAttribution: Tor Arne Thune commentedSubscribing.
Comment #289
Owen Barton CreditAttribution: Owen Barton commentedComment #290
webchickYAY! Now that #1036132: Provide a mechanism for issue summaries is deployed, at 290 replies, this issue could really use a summary. :)
Could someone please edit the issue node using the issue summary template? That would really help both Dries and everyone else coming into this issue understand what's going on.
Comment #291
Owen Barton CreditAttribution: Owen Barton commentedWorking on this now
Comment #292
Owen Barton CreditAttribution: Owen Barton commentedSummary is up - yay for issue summaries!
Comment #293
webchickPutting this into Dries's queue to review.
Comment #294
Anonymous (not verified) CreditAttribution: Anonymous commentedI applied the patch from #283 for D7. WARNING: enabling
drupal_fast_404()
prevents file styles images from being generated - it returns those requests as 404's! Has this been taken into account?Comment #295
geerlingguy CreditAttribution: geerlingguy commented@morningtime - Read through the issue summary, especially the notes about where regex (some tweaking) is required to make fast 404s compatible with image styles. I had to add any paths to the regex where dynamic content is created (anything drupal's path system doesn't know about that ends in a file extension defined in the fast 404 function).
Comment #296
Anonymous (not verified) CreditAttribution: Anonymous commented@geerlingguy My point is why isn't that regex part of the patch Dries is going to look at? In it's current state, it's a dangerous patch, because only 100% of all D7 sites uses file styles.
Comment #297
webchickThat sounds like a needs work.
Comment #298
geerlingguy CreditAttribution: geerlingguy commented@morningtime: by default, fast 404s are disabled, for precisely that reason. You have to know what you're doing to set this up, and its buried towards the bottom of settings.php with other advanced configuration options. You can turn on fast 404s if you want, at the settings.php level, by uncommenting the line with the function call. You can also enable them at a higher level, later in the bootstrap, for more safety when it comes to styles and such, but with a slight decrease in the performance impact.
Setting back to rtbc... Unless the contention is elsewhere.
Comment #299
marcingy CreditAttribution: marcingy commented+1 on the rtbc this is much needed advanced feature
Comment #300
webchickDamn, I really don't want to push this over 300 replies, but...
Advanced or not, I don't think it's acceptable to make the default regex break a core feature. It should be possible to put an exception for those paths in the regex.
Comment #301
Anonymous (not verified) CreditAttribution: Anonymous commented@geerlingguy
>> You have to know what you're doing to set this up, and its buried towards the bottom of settings.php with other advanced configuration options.
I accept your logic. However, there should be an example regex included, or at least a warning it will break image styles. Neither was explained in the patch. So how are people going to figure that out? That's just my opinion.
By the way, settings.php is the first file any end-user looks at. You can't "bury" anything there :)
Comment #302
geerlingguy CreditAttribution: geerlingguy commentedI'll work on a new patch for D7 and D8 in a little bit. I actually need to set this up on a new site in a short bit, so it'll be good to test on.
Comment #303
geerlingguy CreditAttribution: geerlingguy commentedPatch against 7.x attached (I'll make an 8.x patch in the next comment). Leaving needs work until I can post the 8.x patch.
A little explanation of the one change I've made (after considering a few different methods): In drupal_fast_404(), I've added a check for the string 'styles' in the query string, as this should take care of the great majority of use cases, is still very performant (doesn't require any more regex, and strpos() is faster than another preg_match()), and is simple enough for someone that needs to add more paths (for example, for the Image Resize Filter's 'resize' directory) to be able to do so by adding another condition.
We could maybe do something where the user could set an array of options in settings.php that would be checked, but I think this method makes the best tradeoff between sensible defaults and performance... thoughts?
Comment #304
geerlingguy CreditAttribution: geerlingguy commented8.x patch attached.
The only way to make a configurable array of paths to check against the url for dynamic content in settings.php (so people wouldn't need to change bootstrap.inc to add more dynamic paths) would be to add a loop in drupal_fast_404() to check each string against the url using foreach... but maybe preg_match() would be quicker :-/
Comment #305
geerlingguy CreditAttribution: geerlingguy commentedNevermind... I think this technique (everything's configurable in settings.php) should be a bit better. See attached patches, which also add an explanation of the dynamic paths configuration variable, for D7 and D8.
Comment #307
geerlingguy CreditAttribution: geerlingguy commentedYeah, sorry about that... ending parentheses:
Comment #309
geerlingguy CreditAttribution: geerlingguy commentedYeah, about those parentheses...
Comment #310
geerlingguy CreditAttribution: geerlingguy commentedUpdated patch to use the more standardized variable name
404_fast_paths_exclude
(instead of404_dynamic_paths
).Comment #311
geerlingguy CreditAttribution: geerlingguy commentedFixed comments.
Comment #312
joelstein CreditAttribution: joelstein commentedI like the idea of having "exclude" paths in addition to "include" paths, and I'm using the same technique with great success on one of my D7 websites. Patch works for me!
Comment #313
Anonymous (not verified) CreditAttribution: Anonymous commentedthumbs up, i think it's perfect
Comment #314
droplet CreditAttribution: droplet commentedcan anyone explain why use this regexp but not
\b(styles)\b
?/\b(styles)\w*\b/
Matched:
1. \stylesxxxxxxxxxxx\abc
2. \styles\abc
3. xxxx\styles.jpg
4. xxx\styles\xxx\xxx
/\b(styles)\b/
Only matched:
2. \styles\abc
3. xxxx\styles.jpg
4. xxx\styles\xxx\xxx
or consider this:
/\b\\(styles)\\\b/
Matched:
4. xxx\styles\xxx\xxx
20 days to next Drupal core point release.
Comment #315
geerlingguy CreditAttribution: geerlingguy commentedI'm presuming you meant something more like:
(forward slashes instead of back slashes). I have the updated regex on my local checkout and will roll a patch in a few minutes. Pending testbot approval, I hope we can RTBC this asap and get it in.
I've tested with the following patterns:
http://www.example.com/example/styles
- no matchhttp://www.example.com/example/styles/file.jpg
- match (won't return a 404)http://www.example.com/example/styles.jpg
- no matchTo add more 'dynamic' or 'safe' paths (for things like ctools-generated css, or other dynamically-created content), just add a pipe, and the string to match. For example, in settings.php, instead of the code at the top of this comment, put:
Comment #316
geerlingguy CreditAttribution: geerlingguy commentedUpdated patches with regex from #315, tested thoroughly, hopefully ready for commit!
Comment #317
geerlingguy CreditAttribution: geerlingguy commentedBack to RTBC, as the simple line of regex is all that was changed, and it's been tested by two people.
Comment #318
rootworkSubscribe (sorry, I know...)
Comment #319
markus_petrux CreditAttribution: markus_petrux commentedCapturing back references in regular expressions do not seem to be required. It could be optimized by adding
?:
after the opening parenthesis.Comment #320
xjmHere's an interdiff against #286 to make it easier to see what's changed in #316.
Comment #320.0
xjmFirst cut at issue summary
Comment #321
xjmI updated the Remaining Tasks section of the summary to reflect the current status of the issue.
Comment #321.0
xjmUpdated issue summary.
Comment #322
droplet CreditAttribution: droplet commented@#315,
thanks geerlingguy, that's what i meant.
and we can simplify it to:
(non-capturing, no word-boundary)
Yes, it can be more simplified to
but first case, beginner users can just add a pipe:
\/(?:styles|ctools)\/
Comment #322.0
droplet CreditAttribution: droplet commentedRemoved unnecessary (and now inaccurate) phrase.
Comment #323
xjmSetting to NR based on #319 and #322.
Comment #324
xjmNo, really. :)
Comment #325
geerlingguy CreditAttribution: geerlingguy commentedWill write up a new patch in the morning.
Comment #326
geerlingguy CreditAttribution: geerlingguy commentedI've set the new regex to:
It was:
Tested with same paths as earlier, and D7/D8 patches are both attached. Nice to see profile module gone in D8!
Comment #327
xjmNice, hopefully we can get a couple people to review this version and RTBC it again.
Comment #328
droplet CreditAttribution: droplet commentedGood Job! and this one too:
It was:
sorry I haven't mention it before.
Comment #329
Owen Barton CreditAttribution: Owen Barton commentedThis looks great to me (tests not back yet though).
One further (very minor) tweak that comes to mind, is that if we used a regex delimiter other than '/' we wouldn't need to escape the slashes in the exclude expression, which might and reduce the overall "too many slashes!" fear factor and generally make the function of the slashes a bit more obvious to folks who are not regex geeks. Of course there is also a risk that a less typical delimiter would confuse people more than the slashes - feedback welcome :)
e.g. instead of:
we could have:
I went with '!' as it seems the most common in core (after '/') - other non-escape-needed delimiter options (also used elsewhere in core) include '@', '%', '~', '#'. What do people think?
Comment #330
geerlingguy CreditAttribution: geerlingguy commentedThat's a simple style preference... but I would argue in favor of escaping things with slashes; exclamation points make it seem like something special is going on there (to a less-experienced regex guy), and would be more confusing, imo.
Comment #331
geerlingguy CreditAttribution: geerlingguy commentedBack to RTBC... I think most people are in agreement over everything, except possibly the style of regex used—semantically, it's about as concise as it can get while still allowing ease of end-user modification in settings.php (ref #322).
Comment #332
Owen Barton CreditAttribution: Owen Barton commentedFWIW, I am perfectly happy with the regex as is. I suggested changing the delimiters to avoid escaping, but I agree "special" looking delimiters may be similar for new-to-regex folks - don't feel strongly either way.
Comment #333
Dries CreditAttribution: Dries commentedCommitted to 8.x. Oh yeah.
Sorry for being slow to commit this patch. For the longest time I wasn't convinced this was the right thing to do and was quite torn about it. I just spent time thinking about it, and now think it is actually proper. Thanks for hanging on there.
Moving to 7.x.
Comment #334
Dries CreditAttribution: Dries commentedComment #335
geerlingguy CreditAttribution: geerlingguy commentedStill RTBC for Drupal 7 as well - and as noted above, there are no regressions introduced by this patch, and for it to have any effect at all, a site would need to overwrite its existing settings.php file (for existing sites). Might we be able to get it committed soon?
Comment #336
geerlingguy CreditAttribution: geerlingguy commented#328: 404_fast_paths_7x-76824-328.patch queued for re-testing.
Comment #337
geerlingguy CreditAttribution: geerlingguy commented@webchick - any chance of getting this in before the next point release? Been RTBC for a few weeks now...
Comment #338
webchickSorry; I've been swamped lately and haven't had a chance to give the D7 queue a good love-fest.
This seems like a pretty harmless addition for D7, and passes Dries's muster, so I'm cool with committing it.
So! Committed and pushed to 7.x.
This needs a change notice so it can be mentioned in the 7.9 release notes.
Comment #339
geerlingguy CreditAttribution: geerlingguy commentedThe link was relative, so didn't work - should be http://drupal.org/node/add/changenotice.
I glance through there, but this is the first issue I've ever dealt with that needs a Change Notice, and I'm unfamiliar with what exactly that entails. Specifically, what would I need to do in the 'Updates Done' section, and is there need to attach any particular file (like the committed patch)?
I'd love to do this (or at least see a couple examples of change notice nodes), but just need a little nudge—do any docs exist?
Comment #340
geerlingguy CreditAttribution: geerlingguy commentedHere's the change notice I've posted: http://drupal.org/node/1296384 — please see my questions in the comment above, as this is the first time I've submitted a change notices node...
Thanks!
Comment #341
webchickThat looks perfect to me, thanks! Restoring status.
The "Updates" section is for the documentation team, etc. to note when they have updated the relevant sections.
If you have suggestions on how to make this feature more clear, please file some feedback under http://drupal.org/project/issues/drupalorg.
Comment #342
perusio CreditAttribution: perusio commented@geerlinguy #339
Your change notice is incomplete. It fails to mention that if you're "capturing" the handling of the static files at the server level there's no need to patch anything or fiddling with the settings.php file.
Comment #343
geerlingguy CreditAttribution: geerlingguy commented@perusio - could you update the change notice, then? I'm not sure how I'd want to phrase that, as I thought it more or less self-explanatory that you bypass Drupal's handling if you handle 404s before Drupal got ahold of them.
Comment #344
perusio CreditAttribution: perusio commented@geerlingguy Done here: http://drupal.org/node/1296384#comment-5070138
Comment #346
Shellingfox CreditAttribution: Shellingfox commentedI'm working on this to backport to D6
Comment #347
Shellingfox CreditAttribution: Shellingfox commentedHere is my patch for #346
Comment #348
Shellingfox CreditAttribution: Shellingfox commentedComment #349
thehong CreditAttribution: thehong commentedComment #350
Shellingfox CreditAttribution: Shellingfox commentedSince i was change status, last file wasn't test.
Comment #351
Shellingfox CreditAttribution: Shellingfox commentedRe-create patch via git diff
Comment #353
Shellingfox CreditAttribution: Shellingfox commentedComment #354
droplet CreditAttribution: droplet commentedwe don't has such D6 backport policy? It will change the default behavior.
Comment #355
Nigeria CreditAttribution: Nigeria commented#353: fast404-76824-351.patch queued for re-testing.
Comment #356
jcisio CreditAttribution: jcisio commentedIt was backported to D7, so it could be backported to D6. The default behavior does not change.
There are two trailing spaces for nitpickers. The backport is straightaway. Works as expected.
Comment #357
pwolanin CreditAttribution: pwolanin commentedIs this already in Pressflow? D6 is much less open to new features than D7 is right now.
Comment #358
jcisio CreditAttribution: jcisio commentedNo, it is not in Pressflow. Pressflow recently has not been well maintained, no backport has been recently merged into Pressflow. Pressflow 6.26 needed two months to be tagged after the release of Drupal 6.26.
That said, if we decide not to backport this one, we can simply remove tag and close the issue. There are not many new D6 sites, so a change in the default settings file is not much useful.
Comment #359
donquixote CreditAttribution: donquixote commentedA quick note:
If this is what we currently find in
admin/config/search/search404 > Advanced settings > "EXTENSIONS TO ABORT SEARCH"
then there is the "ico" missing there, for favicon.ico
Comment #360
pwolanin CreditAttribution: pwolanin commented@jcisio - it looks like 6.26 was merged in: https://github.com/pressflow/6
Here's a 6.26 tag: https://github.com/pressflow/6/commits/pressflow-6.26.109
what's missing?
Comment #361
jcisio CreditAttribution: jcisio commented@pwolanin it needed two months after the release of Drupal 6.26 to be tagged. Pressflow has many pull requests (https://github.com/pressflow/6/pulls) but it looks like none was accepted, except for official Drupal releases merge.
In short:
- it is not in Pressflow now
- there is no interest for it to be in Pressflow 6 (there is mostly no new P6 sites)
- there is no interest for it to be in Drupal 6 (there is mostly no new D6 sites)
I'm closing this issue as it won't be in D6, unless Gabor goes ahead and commits it. Restoring the 7.x tag.
Comment #362.0
(not verified) CreditAttribution: commentedUpdated issue summary.
Comment #364
jelo CreditAttribution: jelo commentedThis issue states at the top: "The previously submitted patch in #286 was incompatible with image styles and other generated files (such paths would return a 404). The latest patch in #316 addresses this issue by adding a configurable list of excluded paths that should not use the fast 404 feature."
I am not sure if "other generated files" includes clashes with images that are loaded using private file download method. I just spent a couple of hours troubleshooting why all inline images had disappeared from my Drupal site before I realized that fast 404 had been enabled with the default settings, i.e. no additional paths had been added.
The result is that all private image files that are served through /system/files/filename.jpg break. All other documents still function because fast 404 only defines extensions png|gif|jpe?g, but not pdf, xls etc.
Given that private file setting options are in Drupal core and could be enabled, should we not add /system/files as path by default to be excluded in case someone enables fast 404?
Comment #365
jelo CreditAttribution: jelo commentedGiven that this issue is closed I created a new issue with a proposal to add "system/files" by default to the excluded paths at https://www.drupal.org/node/2455057
Comment #366
jvieille CreditAttribution: jvieille commentedI patched pressflow 6 with this one
https://www.drupal.org/node/76824?page=1#comment-5319424
Thanks