Our current drupal_build_css_cache() and drupal_build_js_cache() logic works like this:

* There's a request to the page.
* If the bundle doesn't already exist, drupal_build_css/js_cache() will generate and save all of it.
* After this initial request is finished, the browser requests the saved files as normal and the page is served.

There are multiple issues with this, most of which are documented at #886488: Add stampede protection for css and js aggregation - to summarize:

* There is no logic in HEAD to prevent race conditions where the same file could be built multiple times by different requests to PHP; if you have numerous bundles and high traffic, this can cause i/o issues.
* On an i/o constrained setup, I'm seeing this process take around 400ms (for ten bundles in total between css and js), that's around 30% of the entire request as measured by webgrind.
* Attempts to introduce locking can lock up every PHP request to the site - for example, if a bundle is served on a large number of pages that all get requested during a cold-start situation, they all get stuck in lock_wait() while the first request builds and saves the file. So while locking can reduce the i/o impact, it can still take a site down due to hanging apache processes.
* Every bundle saved is also a call to variable_set(), which has its own issues with locking and race conditions - see #973436: variable_set() should rebuild the variable cache, not just clear it and #987768: Optimize variable caching for attempts to solve this.

So I'm thinking we should look instead at copying what imagecache does:

* Register menu callbacks for files/css and files/js
* Make sure it's possible to reverse-lookup aggregates by filename - we'll need a record of the hash stored alongside the files and the order used to create it.
* Normal page requests will only confirm that we have the hash stored (probably in variables), create it if it's not there, and generate the link to the aggregate.
* The callback will pick up the filename, generate the css/js aggregate, and serve it from PHP the first time it's requested.
* In the case of a stampede, we can check file_exists() both before building the file and before trying to actually save it. As soon as the file is saved, subsequent requests won't be hitting PHP anyway (afaik image module doesn't do this at the moment, but it probably could).

No code yet, just getting the issue written up for now.
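For illustration only - none of this code is part of any patch here - a minimal D7-style sketch of the kind of menu registration the above implies, loosely modeled on what image module does for styles. The function names are made up:

<?php
/**
 * Implements hook_menu().
 *
 * Hypothetical sketch: register callbacks under the public files directory,
 * so a missing aggregate gets built by PHP on its first request,
 * imagecache-style.
 */
function aggregate_menu() {
  $directory_path = file_stream_wrapper_get_instance_by_scheme('public')->getDirectoryPath();
  $items[$directory_path . '/css/%'] = array(
    'page callback' => 'aggregate_serve_css',
    // Pass the requested aggregate filename to the callback.
    'page arguments' => array(count(explode('/', $directory_path)) + 1),
    'access callback' => TRUE,
    'type' => MENU_CALLBACK,
  );
  // A matching files/js callback would be registered the same way.
  return $items;
}
?>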

Files:
#80: 1014086-80-css-imagecache.patch (9.32 KB, beejeebus) - FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch.
#77: 1014086-77-css-imagecache.patch (9.08 KB, beejeebus) - FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch.
#76: 1014086-76-css-imagecache.patch (7.61 KB, beejeebus) - PASSED: [[SimpleTest]]: [MySQL] 35,378 pass(es).
#74: lolz.what_.css_.hash_.patch (7.25 KB, beejeebus) - PASSED: [[SimpleTest]]: [MySQL] 35,631 pass(es).
#47: 1014086_aggregation.patch (19.9 KB, catch) - PASSED: [[SimpleTest]]: [MySQL] 29,765 pass(es).

Comments

Issue tags: +Performance

Tagging.

This approach would also remove file_exists() calls on every request that renders any js or css. On shared hosting there may well be i/o issues; on sites with more than one web head, css and javascript aggregates will nearly always be on NFS - and there's no file_exists() stat/realpath caching with NFS, so there will be i/o issues there too. I've seen those file_exists() calls take as much as 7% of some requests, and that was on a system with zero load from other requests. Due to this, bumping to major.

In IRC, sun suggested this should be a contrib module before a core patch so it can be tested in the wild easily first - that might be a decent option.

http://drupal.org/project/agrcache - 7.x contrib module.

Seems like a terrific initiative. +1

subscribing

subscribe.

the more i think about it, the more i like this.

one thing that occurs to me is that the hash is just as unique if built off the files *pre-regex* as after, so we can simply ditch all that in the page request, including skipping all the gumph in drupal_load_stylesheet_content().

using md5() on the contents looks like the right way to make the hash (rather than another totally braindead use of a strong crypto for no good reason), but i guess we've already lost that battle.

so, a snippet like this could work:

<?php
  $contents = '';
  foreach ($css as $stylesheet) {
    if ($stylesheet['type'] == 'file' && file_exists($stylesheet['data'])) {
      $contents .= file_get_contents($stylesheet['data']);
    }
  }
  $filename = 'css_' . md5($contents) . '.css';
  variable_set('css_cache_' . $filename, $css);
  return $filename;
?>

but if we're going to make such big changes, i think we should consider the increment-a-counter proposals that have floated around before. this would allow us to punt on hashing the contents entirely - we could just hash an ordered list of filenames.

subscribe

I'm a bit split on incrementing the counter. While it would save some mucking about here, css/js files change quite infrequently, and incrementing a counter on every update.php run or similar would run the risk of building new versions of files more often than we need to. However, we could possibly hash each css file that we know about when incrementing the counter, and only increment it if one has changed? That would make it part of a genuine rebuild operation rather than just happening on misses, and then the actual map could be based on the list of files.

I liked the idea of hashing based on contents prior to preprocessing, so committed to agrcache here: #1040492: Use unprocessed contents of files for css hashes.

Don't have any more ideas for that module at the moment, so will bring the code back to a core patch here fairly soon unless something crops up.

Also I saw #352951: Make JS & CSS Preprocessing Pluggable again recently, if we move to pluggable preprocessors (or the very interesting idea mooted in that issue of moving to a text format-style mechanism where you can chain different filters together), then that is going to massively increase the processing for each individual file, so it makes even more sense to do it outside the main page request.

reposting over here....

I was thinking a little further on #1. If we take a look at what espn.go.com does we see something like:

<link rel="stylesheet" charset="utf-8" media="screen" href="http://a.espncdn.com/combiner/c?v=201010121210&css=global_reset.r1.css,base.r228.css,modules.r424.css,global_header.r38.css,modules/global_nav.r47.css,/espn/espn/styles/frontpage_scoreboard_10,modules/insider_enhanced.200910131831.css,sn_icon_sprite.200907150955.css,mem2010/mem.r5.css,mem2010/mem_espn360.r2.css,universal_overlay/universal_overlay.css,universal_overlay/media_overlay.css,universal_overlay/video_overlay.css,universal_overlay/photo_overlay.css,universal_overlay/dyk_overlay.css,warvertical12a.css,twin.css" />
...
<script src="http://a.espncdn.com/combiner/c?v=201012011221&js=jquery-1.4.2.1.js,plugins/json2.r3.js,plugins/teacrypt.js,plugins/jquery.metadata.js,plugins/jquery.bgiframe.js,plugins/jquery.easing.1.3.js,plugins/jquery.hoverIntent.js,plugins/jquery.jcarousel.js,plugins/jquery.tinysort.r3.js,plugins/jquery.vticker.1.3.1.js,plugins/jquery.pubsub.r5.js,ui/1.8.2/jquery.ui.core.js,ui/1.8.2/jquery.ui.widget.js,ui/1.8.2/jquery.ui.tabs.js,plugins/ba-debug-0.4.js,espn.l10n.r8.js,swfobject/2.2/swfobject.js,flashObjWrapper.r7.js,plugins/jquery.colorbox.1.3.14.js,plugins/jquery.ba-postmessage.js,espn.core.duo.r50.js,espn.mem.r15.js,stub.search.r3.js,espn.nav.mega.r24.js,espn.storage.r6.js,espn.p13n.r9.js,espn.video.r33a.js,registration/staticLogin.r10-14.js,espn.universal.overlay.r1.1.js,espn.insider.r5.js,espn.espn360.stub.r9.js,espn.myHeadlines.stub.r12.js,espn.myfaves.stub.r3.js,espn.scoreboard.r4.js,%2Fforesee_v3%2Fforesee-alive.js&debug=false"></script>

We could do something like:

<script src="http://example.com/files/js/js_hash12342341234234?files=misc/jquery.js,misc/jquery.once.js,misc/drupal.js"></script>

The hash would be built like it is now and the combiner could know the files to use. This could be easily cached by the browser and we could test to make sure the files generate the right hash for security. I even like that it documents what files are included in the hash as it can be a pain to discover when you do need to know.

Issue tags: +i/o

mfer asked about the hash lookup at #1048316: [meta] CSS and Javascript aggregation improvement.

What I'm doing is keeping the same lookup array as core does with $hash => $filename (since the $hash is built from the css/js definitions, but the filename is built from the contents of the files).

But I also keep a lookup array of filename => css/js definitions as well.

When rendering links to css/js, we only need the hash => filename map. When building css/js aggregates, we only need the filename => css/js definition arrays.

When you hit the callback, it takes the filename, looks up the css/js definitions, and can then proceed to build the aggregate using that information.
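To make that concrete, here is a minimal sketch of those two lookup structures; the function and variable names are illustrative, not the actual agrcache code:

<?php
/**
 * Illustrative sketch only; names are hypothetical, not agrcache's code.
 */
function example_css_map_add(array $css, $filename) {
  // Forward map: hash of the css definitions => aggregate filename.
  // This is all a normal page request needs in order to render the link.
  $map = variable_get('example_css_hash_map', array());
  $hash = drupal_hash_base64(serialize($css));
  if (!isset($map[$hash])) {
    $map[$hash] = $filename;
    variable_set('example_css_hash_map', $map);
  }
  // Reverse map: filename => css definitions. Only the menu callback that
  // actually builds the aggregate needs to load this.
  $definitions = variable_get('example_css_definitions', array());
  if (!isset($definitions[$filename])) {
    $definitions[$filename] = $css;
    variable_set('example_css_definitions', $definitions);
  }
  return $hash;
}
?>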

One @todo is to not store the filename => css/js definition arrays in the variable system, since we don't need to load that apart from in the callbacks themselves, so it's a waste of memory having it there and pulling it back every page. This is a general issue with the variables system though, and could also be handled by #987768: Optimize variable caching - although need to write up why on that issue, will do that now too...

edit: current code http://drupalcode.org/viewvc/drupal/contributions/modules/agrcache/agrca...

When the function that creates the script tag is called, it could write something to a table that contains a list of the files for that hash. It would be a really fast lookup. But how do you keep that table cleaned up? If you just write to it forever it could grow to be big. You can't truncate it like a cache table, because there could be a cached page (in a browser) that tries pulling a file that no longer exists and we wouldn't have the data to generate it.

Basically, there are multiple race conditions that can occur (one is even noted in the agrcache code comments). We should build a sub-system without the race conditions.

If we did something like:

<script src="http://example.com/files/js/js_hash12342341234234?files=misc/jquery.js,misc/jquery.once.js,misc/drupal.js"></script>

We could generate the hash based on the file names + predictable details for the script tag. Then, when we go to generate the file we check the passed in hash against a new hash created based on the name + predictable details. If they match we generate the file. If not, we don't. This would stop misuse of this combiner.

A method like this could be used to remove the db lookup and some of the noted race conditions. The only race condition I currently see would be the file existing/generation between multiple requests at the same time.
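A rough sketch of that check; drupal_hmac_base64() and drupal_get_private_key() are existing D7 functions, while the function itself and its parameters are made up for illustration:

<?php
/**
 * Illustrative only: rebuild the expected hash from the requested file list
 * and compare it to the hash in the URL before generating anything.
 */
function example_combiner_access($requested_hash, $files_query) {
  $files = explode(',', $files_query);
  // The "predictable details" here are the private site key plus file names.
  $expected = drupal_hmac_base64(implode(',', $files), drupal_get_private_key());
  return $expected === $requested_hash;
}
?>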

Thoughts?

Never let your logic be controlled by GET parameters or URL generation logic. Aggregation and the menu callback should only expose server-side computed hashes as URLs. On a production site, the variable will never grow since all hashes will already have been generated on most pages.

If the URL building logic is predictable, you then open yourself to malicious hack attempts or DDoS attacks based on that.

There is nothing ugly about storing hashes with the associated files array server side, as long as you don't generate a different aggregated file on every page (which is quite a dumb thing to do). The whole goal of aggregation is to have as few files as possible, to ensure that a single client browsing the site won't fetch a new CSS and JS file on each new page it comes to.

I'm a little unclear on one thing: What's to keep the filesystem from being filled by thousands and thousands of rendered stylesheets from days of yore? On our site we regularly see hundreds of variants of an aggregated CSS file. As small tweaks are made to CSS, more and more unique hashes will be generated, and more and more files will be generated.

Against the need to prevent massive numbers of CSS/JS files from filling up the filesystem, there's the concern that we should not simply delete CSS/JS files with each cache clear. Why not? Because external caches (Akamai, Varnish, other people's proxy caches, etc) may keep their own copies of cached pages still referring to older versions of the stylesheets. Attempting to manage CSS/JS files by simply deleting old ones when a new one is created causes huge problems for large sites with distributed caches.

You don't have that many solutions. If you want to limit the number of generated files, then you need the files to be as static as possible.

If I were a horrible dictator, I would force the whole CSS/JS handling to be highly declarative (each module would have to expose its own CSS and JS files in a declarative way, either in the .info file or in a custom hook), then I would aggregate all of them on every page, and this problem would be solved.

Because sometimes, for some exceptions, some files should not be aggregated on every page (dynamic inclusion should be the exception in a static-oriented pattern; the current D7 core pattern is upside down), some options for controlling that should be declared in this file registry, statically again.

Plone works this way, as I understand it.

A static registry of every CSS and JS file would allow you to make the aggregated files fully predictable; the dynamic bits would then be the exceptions in the pattern, and you could treat them differently (maybe not aggregating them at all would be a simple solution). It would also allow sysadmins to override the aggregated files and tune the aggregation in the way that best fits their physical environment.

I like mfer's suggestion in #14. But we might need one more thing in the hash.

Say we have foo.js. In version 1.0 of the code that uses foo.js, foo.js calls bar(). In version 1.1 of the code, bar() is removed from foo.js.

Now, consider the case where we are using an external page cache (Varnish, Akamai, etc.). Because a Drupal cache clear cannot clear all of the nodes on Akamai's network -- or even on Varnish -- we will for some time have pages cached in the external cache that are still using version 1.0. The removal of bar() in 1.1 will cause all of those pages to break.

So we would probably need one more piece of information in the hash (at least) or elsewhere in the GET string.

If you update your lib to 1.1, and the hash is based on md5'ing the files, then the hash changes and the cache will be refreshed. The Akamai and Varnish caches will gracefully die after their TTL or when their memory is filled by new data. I don't see any problem with using nothing but a single hash; it depends on how you compute it.

And again, that's not a real problem; you don't update a production site every day. When you do update a production site, you are going to have problems with the reverse proxy cache anyway, whatever happens, at least for the duration of the page TTL.

EDIT: Again, we can afford md5'ing the files because it will happen only while the site is in development. Once the site is put into production mode, every hash will already have been computed. JS and CSS files are not something that evolves like your data; there is no dynamism at all.

Incidentally, how we solve the deletion of CSS/JS files is to run a cron that checks the atimes on aggregated files and deletes files that have not been accessed in over X hours/days.
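As a rough illustration of that approach only (hypothetical hook and names, not the actual cron job being described), it could look something like this:

<?php
/**
 * Implements hook_cron().
 *
 * Illustrative sketch: delete aggregates not read in the last 30 days,
 * relying on atime, which must actually be enabled on the filesystem.
 */
function example_cron() {
  $max_age = 30 * 24 * 60 * 60;
  foreach (array('public://css', 'public://js') as $directory) {
    foreach (file_scan_directory($directory, '/.*/') as $file) {
      if (REQUEST_TIME - fileatime($file->uri) > $max_age) {
        file_unmanaged_delete($file->uri);
      }
    }
  }
}
?>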

Or just let the site admin press a button to tell the system he manually updated a module.

subscribe

I know that Drupal core developers are emotionally attached to high dynamism, but in the CSS/JS case we are talking about pure static files that will never, ever be modified except when modules are updated. When modules are physically updated, then wipe out the aggregated files cache; that should be more than enough.

When you update production you should NOT have trouble in your reverse proxy. We need to serve both the small and large clients in this.

Creating the hash would use the files plus some "predictable details". This could include elements like a private site key and css_js_query_string. I'm not entirely sure what would be appropriate here as we need to explore a little more.

@Poundard

And again, that's not a real problem; you don't update a production site every day. When you do update a production site, you are going to have problems with the reverse proxy cache anyway, whatever happens, at least for the duration of the page TTL.

What I am talking about is completely missing JS and CSS. We regularly release production code updates that do not break because of external caches. This should be the norm for un-hacked Drupal.

When CSS/JS aggregated files go missing because of minor changes, the site goes unstyled, the ads don't show up, and features don't work. Unacceptable. Period.

When CSS/JS aggregated files go missing because of minor changes, the site goes unstyled, the ads don't show up, and features don't work. Unacceptable. Period.

Using an MD5 hash of the files themselves to create the CSS/JS file hashes leads to a cache miss the first time, but never ends up with a blank site.

Then again, serving a cached page with a 1 day TTL will give the oldest, outdated CSS file hash, which would have the same TTL as the page itself. You then have 2 different choices here: either you keep outdated files for some days (but mark them somewhere as being outdated so you can remove them later), or you just put a higher TTL on them to ensure that cached outdated pages live for less time than they do.

There are probably a lot of other solutions, and that's not the real problem here. You don't need anything other than a hash, you don't need GET parameters, and the problems we are currently arguing about are not resolved by the hash itself.

I'd prefer to speak about the original issue, which is not about file TTLs. We are talking about how to generate the files, not about how to remove them. The deletion algorithm can be implemented outside this issue, and can be über complex if you want; it does not belong to this thread IMHO.

When the hash is created it could be files + unique site key + query string (updated on each cache clear). That would make it easy to block people trying to use the combiner from the outside, and it would work with reverse proxies (I think).

A physical file's MD5 is almost as hard to guess as the answer to what is the universe. I wouldn't worry about this.

The only thing that really stops a DDoS is a server-side check to know whether the hash has been registered or not (like form tokens, for example); the registration must be a server-generated hash, some sort of unique id. If a user tries to DDoS you with an existing one, they'll just be DDoSing the CDN :) If a user tries with a non-existing hash, it's exactly the same as flooding your homepage; you just can't stop it.

The only thing I warned you about is that the server should not respond to client orders (what I mean here is that the file selection should not come from any specific client URL but should be pre-determined by the server itself).

EDIT: Better expression.

@mfer: Yeah. That would work. Though you might still need some version string or timestamp to prevent a new aggregated file from overwriting an old one (assuming you're still talking about file *names* in the hash).

Maybe we should look into a method that uses the most recent time a file was updated as part of the process. A file to be aggregated, that is.

@pounard Form tokens don't stop a ddos. If you want to stop a ddos get the IP (or range) doing the damage and block it at a level much lower than the application.

#31, #32 and a lot of other posts: We are not arguing about the right details here. Core's aggregated file naming is the only good part of the core aggregation mechanism and I really don't see why we would change it. I might have expressed myself wrongly at the start.

@mfer's #13. That is essentially what agrcache is doing now, it has the same strengths and limitations as the current core method more or less.

The race condition that's documented in agrcache is just documenting the workaround for the existing race condition in any code in Drupal 6 or 7 that uses variable_set(). All agrcache is doing there is reducing the window from several hundred milliseconds to one or two. Issues dealing with that are #987768: Optimize variable caching and #973436: variable_set() should rebuild the variable cache, not just clear it. I need to generalize the variable_get_from_storage() and _add_to_variable() functions into another issue somewhere, or one of these.

If we went with the filename appendage for hash lookups, then it looks like we can remove storing the information about the files to be aggregated. However I'm pretty sure we can't avoid storing the hash => filename lookup that core has already (and we'd need to be able to do reverse lookups efficiently, although array_search() on a cache miss might not be too bad). That suffers from the exact same race condition I'm trying to minimize in that code, where one process can 'delete' items from the array that's just been saved by a previous process. Locking here is dangerous since this is all in the critical path. So while it's relevant here, it's really something to be solved in the variables system itself.

Currently we hash filenames on every request to get the filename (not too expensive). And the filename is generated from a hash of the actual contents of those files (very expensive but only done once or twice). The other option would be a strict versioning system for every css and js file, or arbitrarily raising a counter each flush, or just relying on the filenames, but then you either risk not updating when the contents change, or updating too often by changing the hash too aggressively even though the files haven't changed. Whichever way there's going to be a trade-off, and this part of the current logic is actually really good. Also we can't add versioning for non-libraries to Drupal 7 and I'd like to get at least some of these improvements in there so that people's sites stop going down or taking extra seconds to load.

The way core works around deleted files being linked from cached pages, and which answers Matt Butcher's question in #16, is http://api.drupal.org/api/drupal/includes--common.inc/function/drupal_cl... and http://api.drupal.org/api/drupal/includes--common.inc/function/drupal_cl.... The default 'stale' age for an aggregate is 30 days - so that partially deals with cached pages requesting old file names, since they will often still be there (but not always). (This indirectly points out a bug in agrcache, since I don't have anything to delete the variable in there yet, and the variables are named differently to the core ones to avoid conflicts on enable/disable.) You could raise that limit a bit (say to 300 or so), but if you don't rely on existing files and/or a hash in the database, then you get precisely into the situation pounard describes: if you always guarantee a file will be available at that address under any circumstances, you're open to ddos or have to make other trade-offs. If we store hashes in the database (for some length of time), then there shouldn't be a ddos though, since if the hash doesn't exist you just throw a 404, and it's easy to trigger a 404 in Drupal.

There's a tonne of comments still left, and looks like more posted since I started typing this, so more later..

Now, consider the case where we are using an external page cache (Varnish, Akamai, etc.). Because a Drupal cache clear cannot clear all of the nodes on Akamai's network -- or even on Varnish -- we will for some time have pages cached in the external cache that are still using version 1.0. The removal of bar() in 1.1 will cause all of those pages to break.

Er, yes it can. http://drupal.org/project/varnish and http://drupal.org/project/akamai support doing exactly that.

If we really care about this issue beyond the problems that already have a solution, then we just need to consider whether other caches (local proxies, google cache, web archive) will also cache the css/js files along with the page, or call back to the site. Some might, some won't. Then it's a question of "should Drupal support indefinite caching of HTML by third parties we have no control over who expect to be able to link back to assets from the original site in perpetuity", which is a feature request. Like pounard says that's the wrong detail to be discussing here and if you really want to discuss that in depth please open a new issue, since it has nothing to do with changing the mechanism of aggregate generation which is what this issue is for.

When the hash is created it could be files + unique site key + query string (updated on each cache clear).

No arbitrary query strings please, see #9 - we'd then be invalidating CDN and browser caches every time someone sneezes. I could live with a counter that only gets incremented if files have changed, but that means keeping some record of if the files have changed or not over time, which is a whole separate issue again.

@catch thanks for the detailed explanation. It helped clarify where you come from and I like it. So, here is what I see:

  • drupal_get_css() stores the hash and what files are included + inserts script tag into page.
  • When the js/css file is generated it looks up the files from the hash in the db and generates the file. Future requests will hit the file instead.
  • For cleanup, when the css/js files are deleted after 30 days (or other set time) the record in the database is removed as well.

Does this sound about right?

@catch Regarding #35, I misspoke. I suppose I was *assuming* that we wouldn't globally invalidate a cache for each code update, which is pretty much necessary to avoid getting slammed with uncached page requests. So, yes, we *could* get Drupal to clear the entire Akamai/Varnish/... cache. But it is preferable to avoid external cache rebuilding for CSS/JS updates. Letting them expire is preferable. And the new staleness code in D7 probably fits the bill.

But you are right, the issue is only tangentially related to this, in that any structuring of the CSS/JS filename needs to be (loosely) version sensitive for the above to work. So as long as that's addressed, I'm fine.

@mfer: currently point three of #37 is a little bit different. In core we reset the variable on css/js clears, and at that point delete the files; the files won't be expired at any other time iirc. This means that if you never clear css/js on a site, then those files will sit in the file system indefinitely (although new files won't be added much either, so I don't think this is a big deal).

I don't think we want to do any scheduled cleanup, since it doesn't make sense to remove those files /unless/ there's a cache clear either - it's completely valid for an aggregate to be two years old if your site's theme has had no updates. We don't know for sure if those files are valid without the hash, and we don't know if the hashes are valid at all because they're created dynamically - unless there's an explicit flush in which case all current hashes are invalid until they're regenerated. It's also possible for a file to be valid even without a hash in the database - since you can clear the cache, leave files younger than 30 days in the system, and because the filenames are based on contents, they'll continue to be used rather than regenerated - this is why I don't like incrementing a counter since this would stop working, although a manually incremented counter might not do any harm as an option for people to use. Again, the current logic here is pretty good, I'm sure we can tweak it but other parts of aggregation are bigger wins.

I don't feel this issue should live by itself, and that's what I was thinking about yesterday:

We have to focus on the simple goals we want to achieve for the full aggregation part

Reduce bandwidth consumption
For this, the aggregation needs to be more static. If you look at some numbers, full CSS/JS aggregation into one file only seems to be a good thing to do. Each client will download one and only one file during its browsing. Again, this is not the only solution, but it's an easy and efficient one. Starting by implementing a declarative-oriented API for managing files seems to be a good point.
Remove all runtime file system I/O
Core is not a failsafe mechanism for poorly coded modules. Non-existing referenced files are a bug, not something that modules can afford. Even more than that, the site header itself should probably be cached as pure text and never be recomputed on each client hit.
Support concurrent access while rebuilding caches
catch's solution (this issue) seems nice here, and can easily be factored together with the image styles technical aspects.

But we also have secondary goals here

Integration with CDNs
For CDN integration, going the way of image styles is a good point; if this issue is solved for one, it's solved for both subsystems.
Make it pluggable
catch's proposal with backend chaining for file processing seems a good way to do it. I'd prefer a more static solution but this one is über cool. It'll need some OOP code I think, because a lot of existing design patterns already have really simple code solutions for this.
Let the sysadmin configure it through a single and simple configuration file
This one is the last, but not the least. Sysadmins will always have the last word about scaling, and they know their environment better than any living developer. Let them tune the system the way they want; do not, never, ever, assume that core has to be smarter than them. That's a design error and will lead you to a lot of useless runtime checks that cost *a lot*.

Considering all these points

I think this issue should probably be the last one to be fixed among all the others.

The first point you have to consider is that because you are starting to think about using the same mechanism as image styles, and talking about CDN integration, you have to merge this issue with the image style subsystem one itself, in order to find one and only one technical solution for this, with a common code base. Integrating this problem around the stream wrapper API seems the smarter way to do it. To be clear, even the menu callback itself should be a common menu entry over the files directory, for everything, and the derived running code would then live in specific stream wrapper implementations, for example. Factor the problem, factor the solution, and it will work as-is without any code duplication for all the files and CDN integration problems.

This issue does not belong to the CSS/JS aggregation problem alone; it should be a global pattern in core for all subsystems, tied to the lowest level API possible.

Because of this, the CSS/JS aggregation problem should only focus on the first three points I quoted above. Catch already has part of the solution in the agrcache module, and I have some of it in the core_library module; let's start thinking with some nice component diagrams and schemas, which will be way more understandable than any other potential written argument here.

The first thing to concentrate on is to tear down and throw away the current implementation, and build a new, API- and backend-oriented, static and declarative API, with no I/O and with the construction of aggregated files able to be delayed. Then, let the people who worked on the stream wrapper implementation solve the real CDN problem on their side. Once they find a solution to that problem, plugging the fresh new aggregation API on top of it would be something like 2 lines of code, if they did it right.

EDIT: Some typo, sorry for my poor english.
Re-EDIT: <strong>'ed some important parts.

@catch Let me clarify my 3rd bullet in #37. When we create hashes and store them in the database, we shouldn't do that forever. Whether they are in the variable table or in their own table, this is space we don't need to keep filling up. We need a method to remove stale hash entries from the database.

My suggestion (and there may well be a better way) is to remove the hash data from the database when drupal_clear_css_cache() or drupal_clear_js_cache() are run, and to use logic similar to drupal_delete_file_if_stale() to decide whether an item is stale and should be removed from the database.

How did you plan on removing stale hash entries?

@mfer: I'd keep it the same as now, the intention here isn't to fix everything about aggregation, just the aggregate generation method.

see http://api.drupal.org/api/drupal/includes--common.inc/function/drupal_cl...

catch +1 - this is a detail, and the solution will come with a new API. We can't solve details without having taken the time to reflect on the overall design. The current solution isn't the best, but at least it works, and it's not the main problem here.

Also generating the hashes is less expensive in terms of time than generating the files, at least on i/o constrained systems. The main thing is trying to avoid creating more aggregates than necessary. I don't see a nice way to change this with the variable storage, maybe a dedicated table but that's a lot of extra to manage.

Subscribing.

Subscribe

Status: Active » Needs review
Issue tags: +needs backport to D7
New attached file: 19.9 KB. PASSED: [[SimpleTest]]: [MySQL] 29,765 pass(es).

Here's a patch. Should work except for clean URL support. Haven't run tests yet, leaving that for the bot.

After a first reading it seems not bad, I'll take a deep look at it later.

6.x version that covers just about everything: http://drupal.org/project/advagg If you're wondering, I've been kicking this idea around in my head for about a year now: http://groups.drupal.org/node/53973. A couple of things this does that might interest this crowd and that are not listed in the readme:
- Will test the CDN and non CDN url path to see if the advagg 404 menu hook is called.
- Implements fast 404s on a complete miss.
- A request is sent asynchronously to the newly created bundle during page generation. That means the bundle is generated in the background while the page is being generated on the server... yes, multi-threading in short.
- In theory on demand generation will work with clean URLs disabled.

I'm sure I've missed some others... in short advagg rocks! Feedback is greatly appreciated.

Subscribe.

This code is starting to get quite large... Would love to make it all pluggable.

Pluggable is good, but I would love to get this into Drupal 7 (since the current logic can bring sites down), then make it all pluggable after that. Pluggable things still need sane defaults, and IMO this is the biggest win we have in terms of performance of this stuff in PHP.

I started a thread #1166756: Support for Google's Closure about using Google's Closure code to optimize jquery & then include those more optimized files rather than those that ship with Drupal.

For me that started with the idea of leveraging Google's http://code.google.com/apis/libraries/ so that we can all benefit from not having to load this common library with every site.

I'm in favour of encouraging this type of behaviour in the community (but not putting direct links to Google's files in core or anything). It should be an opt-in situation, but if it is, then where is the danger of hosting the files off of our own servers?

@mgifford This issue is separate from what you are describing. Separate enough it should be discussed in a separate place.

Hello again! It's been a long time since I posted here.

I have been doing some work on how the menu system could completely generate files on demand, based on a partial file path. I finally ended up with a simple schema that could actually work.

You can see the diagram right here; in order to fully understand it, you might want to read the rest of this post with the diagram open side by side.

Concept

The whole concept is oriented around a fully pluggable system based on interfaces.
@catch: Yes, fully "pluggable".

The main concept is split into those two statements:

  • The menu doesn't care about the specific implementations behind it.
  • Generated resources are static files, handled by Drupal, with or without a lifetime set.

Definitions

It's based on these definitions, which I chose arbitrarily:

Resource
A resource is a static (or not) asset that the client (web browser) can download at a single fixed URL.
Resource Provider
A single object (mostly a singleton, one per available implementation) that can determine whether a resource exists or not.
Resource Provider Registry
The key that makes the system pluggable. Basically, each component registers its available providers with it.

Overall design

The main idea is to allow any component (e.g. ImageStyle, the core aggregation system, maybe Views for generating XML, anything that is a module or a core API) to register its providers with the system (see note 1).

Each provider has a magic path, provided by the getDynamicPath() method (see note 2).

Runtime and error handling

The menu callback acquires a lock on the full file path (relative to the files path). This lock will be released on errors, or when the file handling has finished.

If the file is already locked, this callback shouldn't wait; it should throw either a:

  • 409 Conflict
  • 503 Service Unavailable

The first one seems best; 423 Locked also exists, but it's WebDAV-specific so we don't want to use it.

The dynamic resource provider registry can find which provider to use based on a set of (extremely basic) matching rules applied to the partial file path from the URL.

Each provider can spawn file instances using the partial path, relative to the files folder. It never relies on absolute URLs.

If the path means something, then it delivers the instance. If not, it throws an exception, caught in the menu callback, that triggers an HTTP error:

  • A real 404 this time, because the file really does not exist!
  • Or a 500 or 503 error if the exception is not a VoidResourcePathException but another technical error from an underlying package.

This model allows the lowest code indirection possible, and will be extremely fast as long as no module messes things up with hook_init(), which is the actual bottleneck of most Drupal 404 errors. Of course, Drupal shouldn't be bootstrapped on a second attempt, but that's up to the CDN/HTTPd/proxy configuration here, maybe.

Real file handling

The instance can be a non-physical file, in which case it implements the DynamicResourceInterface and is able to give back content.

If, on the contrary, it matches a file that should be statically created (non-volatile data for the most part), then it should be an instance of DynamicResourceFileInterface, which would actually also be a direct implementation of core's existing StreamWrapperInterface (in most cases a public-specific implementation, but it could be any other).
A file instance using this interface has a new method, rebuild(). This is purely internal, and should be lazily called by the instance itself, after checking with exists() (see note 3).

The menu callback would not be aware the distinction exists; it would just return the getContent() call, whatever happens (this is where to plug the lazy rebuild for the file-based implementation).

Each resource is able to give a lifetime (which can be forever); this will help the system send the right HTTP headers.
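A condensed sketch of the interfaces described above; getContent(), rebuild(), exists() and getDynamicPath() follow this post, while getLifetime() and getResource() are assumptions added to make the sketch self-contained:

<?php
/**
 * Sketch only, condensing the definitions above; not actual core_library code.
 */
interface DynamicResourceInterface {
  /** Returns the content to deliver to the client. */
  public function getContent();

  /** Lifetime in seconds used for HTTP headers, or NULL for "forever". */
  public function getLifetime();
}

interface DynamicResourceFileInterface extends DynamicResourceInterface {
  /** TRUE if the static file already exists on disk. */
  public function exists();

  /** Rebuilds the static file; lazily called by the instance itself. */
  public function rebuild();
}

interface ResourceProviderInterface {
  /** Partial path under the files directory that this provider handles. */
  public function getDynamicPath();

  /**
   * Spawns a resource from a partial path, or throws
   * VoidResourcePathException when the path means nothing.
   */
  public function getResource($partial_path);
}
?>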

Magic paths and ruleset

So I've described most of it; now there is something more to understand, a schema of the path ruleset being applied:

  http://mysite.com/sites/default/files/aggregate/css/SOMEUNIQUEID.css
       |                                   |          |
    We will ignore this.               Ruleset!    File identifier

Let's see that again:
http://mysite.com is the base URL, throw it away.
sites/default/files is the file path, throw it away.
aggregate designates our provider (registered partial path).
css/SOMEUNIQUEID.css is the file path given to the provider (identifier).

So we could have conflicts, basically. Imagine I make an image aggregator which registers aggregate/images (this is valid); it would conflict with the already registered aggregate partial path.

The solution here (what I call the ruleset) is basically to determine which one to use by parsing the partial path from the most specific towards the most generic.

Example: I give this URL to my menu callback:
http://mysite.com/sites/default/files/aggregate/images/foo.png
The discovery will be based on a pre-ordered list of potentially conflicting partial paths, built when registering the providers (we would probably need a cache then, but I'm not sure, it needs benchmarking):

  1. aggregate/images
  2. aggregate
  3. imagestyle

It will then test the URLs one by one (either using substrings or using regexes) in order to find the right one, in order (see note 4).

So if I give the URL quoted above, you will get this:
Is aggregate/images a substring? Yes! Spawn the provider.

Now, if I give:
http://mysite.com/sites/default/files/aggregate/css/foo.css
This would give:
Is aggregate/images a substring? No! Pass.
Is aggregate a substring? Yes! Spawn the provider.
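A minimal sketch of that most-specific-first matching, substring variant (the function name and return convention are illustrative only):

<?php
/**
 * Illustrative only: pick the provider whose registered partial path is the
 * most specific (longest) prefix of the requested path.
 */
function example_resolve_provider($partial_path, array $providers) {
  // Longest registered path first, so 'aggregate/images' wins over 'aggregate'.
  uksort($providers, function ($a, $b) {
    return strlen($b) - strlen($a);
  });
  foreach ($providers as $prefix => $provider) {
    if (strpos($partial_path, $prefix) === 0) {
      return $provider;
    }
  }
  // No provider claims this path: the caller can turn this into a 404.
  return FALSE;
}
?>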

Providers are magic? CSS and JS specifics

Absolutely not! Each provider should carry its own business code.

The simplest example is CSS and JS aggregation. Currently, when core derives a new JS/CSS file, it builds a hash of the new file.

The CSS/JS implementation should build a random and unique identifier (see note 5) and store it in a database table, along with the full file list. Then, when the menu callback is called, it can restore it and build the real file. The beauty of this model is that once the file is built, we will never query the database again, because the HTTPd will serve the real static file (see note 6).

These file entries must not be cached in a cache bin, but stored in a real table once the JS/CSS aggregation profile / set of files has been derived to a URL; this must be non-volatile in order to handle further requests from outdated browser or proxy caches.

And what now?

Give me 1 hour, and I'll make you the code for the original design.
Give me 2 extra hours, and I'll make the CSS/JS partially work.
Give me 3 extra hours, and I'll overload the current Image Style handling, do the menu alter, and backport it to D7 gracefully.
Then give me 5 hours of your time, and we all discuss this, fix bugs, evolve the API, or throw it away and move on to another solution.

Footnotes

  1. This can be done either using a hook_info(), or in the module's .info file, or any other discovery system; this is where the system becomes fully pluggable and could be integrated with the future plugin system discussed in the WSCCI initiative.
  2. It can also be provided by hook metadata; this is not fixed anyway and won't change the whole design.
  3. Indeed, we could get here because the HTTPd didn't find the file, while another thread could have created it (even with the locking framework, race conditions can happen).
  4. Substrings would probably be faster; that's why I want the schema to respect the KISS principle (Keep It Simple, Stupid) and not implement the whole matching ruleset using complex parameterizable regexes (a lot slower).
  5. Why not a UUID v4 here; this is exactly the kind of purpose they were created for.
  6. Actually the Image Style module (formerly ImageCache for D6) already works this way, we need to keep this!

Finally, some additional ideas to go further:

1. Create a files directory for assets (content-related static files) and a resources directory (at the same level) for resources (dynamic, not content-related, static files or not).

In the long term, separating the actual PHP code and removing it from the public web root dir would be a really good idea for security. Files (assets and resources) could then live outside the folder you commit into your git repositories when you do a project :)

2. Make modules define their CSS and JS in a declarative manner (the framework would be able either to move them at install time into the resources dir, or to do it on demand with a specific provider and dynamic resource implementation, such as DefaultModuleCssFileResourceInterface for example).

@pounard
What are your thoughts about my side project http://drupal.org/project/advagg?
In terms of matching the files directory and intercepting it before menu_execute_active_handler fires, check out http://drupal.org/project/files_proxy (runs in hook_init). These are done in D6, but once advagg reaches 1.0, I will be working on a 2.0 release for 7.x; aggregation changed a lot between D6 & D7, so having a 2.x release makes sense.

Some interesting thoughts so far - using a menu callback for generation is an interesting idea. @catch, have you tried this patch on any production sites?

@mikeytown #57, I just looked at the advagg code; I think the design I made shares a lot of ideas with it, but I think that's because catch originally expressed them. The main difference is that I created an OOP design, much more centralized, that generalizes it for anything: instead of having N menu items, my solution would only carry one for everything.

I might have looked at the wrong project branch, though.

Nevertheless, that's where my design takes the opposite approach from your module: it's interface-centric and highly pluggable. It totally decouples business logic from the way it is achieved, and gives a set of extremely basic interfaces to ensure pure business code won't have to bother with the middle complexity (which I solved partially), such as testing file existence, locking, etc.

@pounard
I never use master; here is the 6.x-1.x tree. advagg.missing.inc is most likely what you're interested in.

In terms of file existence, htaccess handles that; I use locks.inc for locking; it stalls, then 307's once the file is generated.

Nice thanks, I'll look into it.

subscribing

I have been working on http://drupalcode.org/project/core_library.git/tree/67dd838 (the Core Library module). I actually implemented the schema above, and wrote a totally useless sample implementation as well as the core CSS override (using the #aggregate_callback override from the style element info). It actually works quite well.

See the 'resource' sub-module in http://drupalcode.org/project/core_library.git/tree/67dd838:/modules/res... for the implementation details (the lib/resource.inc file is where you should find the really useful code; in the resource.module file you will find the deliver callback).

The implementation does not exactly fit the diagram I did the other day, but it looks pretty much like it, except that I put the dynamically generated file $uri at the center of the design (which provides the link to the stream wrapper API) and removed the direct inheritance link with stream wrappers (much cleaner).

@#60 Your module does *a lot* more than this, but I wanted to keep a simple implementation (while fully pluggable) that actually matches the issue's title.

Did an alpha2 release (that will come in 5 minutes), and updated class diagram to fit the code.

Pretty much the same discovery mechanism as above, except I put the URI and stream wrapper at the center of the design. Each public local stream wrapper has its base directory path handled by one registry (registries are singletons, on a per-scheme basis). Providers can basically work on all schemes; this really doesn't matter, it is decoupled and allowed by design. A shortcut method on the registry allows resources to be spawned directly (internally spawning the provider), which takes care of building and passing the full and final resource URI.
File-based resources no longer extend the stream wrapper API (that was a huge mistake in my first diagram), but as each resource (not only files, all resources) has a fully qualified URI, you can use the appropriate stream wrapper really easily. It makes file-based resources an almost unneeded abstraction, but it exists for code reuse.

The overall code weight is 1000 lines, including CSS and JS integration, as well as Image Style file generation (basically OOP rewrite of the actual core menu callback), file delivering, HTTP headers handling, some basic configuration screens and form alters, a custom stream wrapper implementation, and all is fully documented (documentation included in the line count).

If anyone has some time to check/review/hate what I've done, I'd be pleased to have some feedback.

EDIT: Just released an alpha3; this particular one solves the CSS and JS file modification problem. A version string is prepended to the URL (a method used by some other software). When caches are cleared, each file of a computed map is checked (once per hash only); if at least one of the files has an mtime newer than the database entry's creation time, the version number is incremented. If a race condition happens (it can, it's not thread safe), the version string will be incremented twice (which should make the whole thing work anyway).
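For illustration, a rough sketch of that mtime check (hypothetical names, not the actual core_library code):

<?php
/**
 * Illustrative only: bump the version string when any file in a computed map
 * has been modified since the map's database entry was created.
 */
function example_check_map_version(array $files, $created, $version) {
  // Guard against stale stat caches before reading mtimes.
  clearstatcache();
  foreach ($files as $path) {
    if (file_exists($path) && filemtime($path) > $created) {
      return $version + 1;
    }
  }
  return $version;
}
?>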

@pounard
You should also include an md5, if you haven't done so, as a way to test whether a file has changed. Not all filesystems keep an accurate mtime. Do you want to team up? Sounds like you're starting to port a lot of advagg's code into D7 & into your module. I already have my first request for a D7 version of advagg [#31171546]. We both have the same goal; we should work together. PS: you need to call clearstatcache() before using filemtime().

The problem with an md5 of the file is that it will be a lot slower. If the mtime is not accurate on the system, it should probably be overridable with an option (a variable?).

Your module does a lot more than mine, and actually doesn't do it the same way! I'm totally pro-merging efforts if you want to, but we need to discuss how first.

EDIT:
@mickeytown2, I'm going to look more deeply into your code in order to see where the patterns fit and how we could potentially merge. Don't hesitate to PM me on IRC or mail me via my contact page.

My code still remains a PoC of what such a system in core could be. I'm not sure the merge is something I want right now, but as promised, I'm going to study it.

A real 404 this time, because the file really does not exist!
Or a 500 or 503 error if the exception is not a VoidResourcePathException but another technical error from an underlying package.

I'd rather wait here - it's good to avoid stampedes, but we should go ahead and serve something to the unlucky requests.

And/or, generate the resource and serve it, but don't try to write to the filesystem on lock misses.

The solution here (what I call the ruleset) is basically to determine which one to use by parsing the partial path from the most specific towards the most generic.

The router system already handles this, so I'd rather use what it does there rather than re-implement it - there are not going to be dozens and dozens of resource callbacks; we have three potentials in core right now (css, js, image derivative).

Three different custom implementations of this (assuming we do this for css/js at all) is definitely enough to warrant adding an API for it and trying to centralize some of the code. I had a read through core_library and the general approach looks good.

@Owen, I haven't got agrcache running on any production sites, although there is one client that has it in their backlog and I'm hoping will report back.

@MikeyP: hoping to have some time to take a look at advagg soonish.

subscribing

@#67 Thanks for the feedback. I agree about the menu system; the design can definitely use it extensively instead of doing its own path split. I'm not sure it will fully remove the path split (how else could we handle file extensions and subpaths in a fully generic manner?). The idea behind the ruleset was to leave to the business implementation what belongs to the business implementation. This is arguable, though. It could at least use the menu one level deeper (the provider level).

The whole goal of this is to centralize all three different ways of doing it into one generic enough to cover all actual needs.

A note:
I once had some difficulties making imagecache work with lighttpd.
The reason was that lighty cannot be as easily configured to run php only if a file does not exist at a given location.

I did not try hard enough to say it is impossible (it was not really my job to get this running).
In fact it is supposed to be possible with mod_magnet. http://nordisch.org/2007/2/6/drupal-on-lighttpd-with-clean-urls

All I want to say here is that a lot of people will run into this issue if this goes into core, and maybe some of them will have difficulties setting up mod_magnet correctly.

Since image derivatives are in core already, this doesn't make that problem with lighty any worse.

Related, and will most likely land first: #865536: drupal_add_js() is missing the 'browsers' option

It looks like there has been some major discussion since the last patch. It would be helpful to incorporate agreed-on adjustments to the plan in the issue summary, and also compile and add a list of remaining discussion items to it (if any).

New attached file: 7.25 KB. PASSED: [[SimpleTest]]: [MySQL] 35,631 pass(es).

here's a dirty, dirty patch that implements a PoC of the "don't use hashes, just put the paths in the URL" approach.

Btw, in some modules I worked on that did something similar to imagecache (css sprites generation, gradient image generation), I found this pattern to be useful:
$menu['sites/%site_dir/files/...']
instead of
$menu['sites/whatever/files/...']

Then a wildcard loader function

<?php
function site_dir_load($fragment) {
  if (realpath("sites/$fragment") === realpath("sites/whatever")) {
    return TRUE;
  }
}
?>

Benefit: It does work for any alias of the sites/whatever dir, not just the "canonical" one.
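For completeness, a hedged sketch of a router item that could use that wildcard loader (the path depth and callback names are made up):

<?php
/**
 * Illustrative only: a router item using the %site_dir wildcard loader above,
 * so any alias of the site directory reaches the same delivery callback.
 */
function example_menu() {
  $items['sites/%site_dir/files/css/%'] = array(
    'page callback' => 'example_deliver_css',
    'page arguments' => array(4),
    'access callback' => TRUE,
    'type' => MENU_CALLBACK,
  );
  return $items;
}
?>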

----------

And re myself in #70:
I read that lighty has something new which allows checking for non-existent files, and this does not require mod_magnet.

New attached file: 7.61 KB. PASSED: [[SimpleTest]]: [MySQL] 35,378 pass(es).

updated #74 for latest D8 head. next up, will add js support.

New attached file: 9.08 KB. FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 1014086-77-css-imagecache.patch. See the log in the details link for more information.

this one is probably worth some real reviews:

- path to css folder is no longer hardcoded, uses configured public files dir

- better docblocks, could definitely be improved

- aggregate files get a query string that changes with cache clears

sun suggested in IRC not doing js until this patch has been accepted, so i've left that out for now.

#77: 1014086-77-css-imagecache.patch queued for re-testing.

Status: Needs review » Needs work
Issue tags: +Performance, +i/o, +needs backport to D7

The last submitted patch, 1014086-77-css-imagecache.patch, failed testing.

Status: Needs work » Needs review
New attached file: 9.32 KB. FAILED: [[SimpleTest]]: [MySQL] Unable to apply patch 1014086-80-css-imagecache.patch. See the log in the details link for more information.

rerolled to keep up with head.

I'm still thinking that making this a more generic API, one that would handle public files, images, css and js, would be a better way to go.

I see multiple issues in this patch:

  • Yet another specific piece of code for a more general generic file handling use case
  • It doesn't lock when building the CSS files, so we could have concurrent threads doing the same work at the same time, having potential conflicts when saving the files
  • It uses variables for storing the aggregated files [EDIT: the file LIST, not the file itself], which is an existing design flaw in the current stable release too: when CMI comes, this would trigger dual storage writes; we should use either a cache backend or a table in the database [EDIT: misread the patch for this one, sorry, thanks msonnabaum for pointing that out to me]
  • The new WSCCI patch, which is being stabilized at this time and introduces the request/response kernel from Symfony, will also bring a totally new way of handling this, which will make this code obsolete the second it is committed to core

Going the way I proposed in #55 would reconcile this with the WSCCI initiative. Of course my proposal is now obsolete too, but the design is much closer to what we'll have to implement in the end when the WSCCI patch lands.

Further, after some chats on IRC which reinforced my opinion, I think we should orient this issue towards a generic listener-based API approach for all file generation.

The idea would be to have the same mechanism as the Assetic library (thanks Crell for reminding me it exists). We have a listener on the kernel dispatch; it catches requests that match a files pattern, then does whatever it has to do, depending on internal logic, in order to return a file.

This internal logic is what I was proposing in #55, but it could be different; what we desperately need at some point are asset handlers and providers that can be resolved dynamically at the listener/controller level.

We then have two paths we could follow:

  • Use Assetic: we need someone to take a serious look at this library; it may or may not integrate well with our needs.
  • Implement our own internal logic (which doesn't sound that bad; the only difference from using a generic library is that the generic library will probably be full-featured and more flexible).

One of the good sides of this approach is that we can do it without having to register any router items. This sounds like a huge win to me.

I'd prefer to move forward with this patch and proposed solution as is.

For potential usage of Assetic, we already have #352951: Make JS & CSS Preprocessing Pluggable and some other issues (search for Assetic). A couple of limitations (or lacking features) in Assetic have been pointed out already.

Furthermore, for image styles, there's also #1561832: Replace Image toolkit API with Imagine. Imagine not only covers image toolkits, but also has a built-in concept of generating image derivatives.

In turn, there's a lot to research and figure out in this field, before we can even start to think about harmonization. The idea of harmonizing all of this is a competing proposal, which should live and be discussed in its own issue, in order to not block and derail this issue.

Therefore, I'd really prefer to make pragmatic progress here, which yields a concrete result and a guaranteed improvement over D7.

This change will be obsolete the second the WSCCI patch is committed. Plus, the patch as-is has serious flaws. [EDIT: fewer than I thought, due to a misreading]

@sun If the patch is a candidate for D7 backport, then just do the patch for D7 already and post it in a new issue set to 7.x. It indeed fixes real-world problems for a version used in the real world, but it doesn't make sense for D8, because D8 is not a usable product yet: it's unfinished, in code as in design, since it's still in development.

D8 is not out yet, and this can be solved in a much more elegant and efficient way; we can do a major cleanup at minimal cost by sharing the dynamic asset handling with image derivative generation (even if custom and simple; Assetic is yet another, more complex topic). This is what this issue drifted towards a year ago.

As long as D8 is not feature frozen, this must be designed first and coded later. Any code change that later has to be undone is a loss of time and energy for all the people who actually wrote and reviewed it. Now is the time to think for D8, not the time to code before the thinking is done.

So, as you proposed, I could open a new issue; you said you would then be happy to review it. I'm saying no: a lot of discussion has already happened in this one, and there is no way to reopen it somewhere else and start all over again from the beginning. You didn't review the first 60 comments of this thread, or at least you didn't comment on them, and I'm guessing you won't do it in another issue either.

I won't wait another year in a new issue that quotes the exact same posts that have been waiting here for a year.

EDIT: Sorry for being a bit nervous about this, but this is really something I put a lot of effort into. I'd be happy not to see that year thrown away, especially when my work tries (and succeeds, as a stable module) to solve the exact same "real world problems" nicely, with a more global approach and with additional features (concurrent-thread locking among other things) that this patch doesn't provide.

Yes, please let's just get this in. Locking is already an issue, but this patch decreases the race window, and we can add locking later.

I'm happy to do JS next in a similar style, then adapt with follow-up patches once the kernel stuff is in.

Locking has its own issue here: #886488: Add stampede protection for css and js aggregation. I deliberately opened that as a separate issue because 1. it's a separate issue, 2. the locking patch ended up very complex, and 3. a lot of the locks that were added to Drupal 7 (such as the read lock on variables) are likely making things worse rather than better for stampedes, so they need to be treated carefully.

@pounard, if an issue is not fixed in 8.x, we fix it there first before backporting to 7.x, even if we know that the 8.x code might change completely later. This is because there is zero guarantee that 8.x refactoring will eventually deal with this issue, so we could end up knowingly releasing 8.x with regressions compared to 7.x. See http://drupal.org/node/767608 for lots of reasoning.

In this case, it'd be quite possible to add a generic listener to 7.x and still have inline generation of css/js aggregates. On the other hand if the almost-RTBC patch here gets committed and backported, then we can enforce that it gets converted to a generic listener when one is added, since we'd presumably need to refactor file_transfer().

Also if you look at the patch itself, about 90% of it has nothing to do with actually interpreting the file requests, most of it is refactoring the code that generates the filenames in the first place and moving a few things around, which will be needed regardless.

Side note: concatenating the filenames together seems good to me; it's also an approach that's close to being compatible with things like https://github.com/perusio/nginx-http-concat (which I've not tried, but it looks interesting). However, pounard brought up a point very early on that this is vulnerable to DDOS attacks, which the current hash-based mechanism adopted in previous patches (and also used in http://drupal.org/project/agrcache) isn't. If we only generate aggregates for files that actually exist (i.e. don't try to aggregate something with a non-existing modules/foo/foo.css), then possibly this isn't an issue, since there'd be a high but finite number of aggregates (and D6/7 have enough issues creating lots of unnecessary aggregate files despite the hash), but it would be good to discuss.

I won't wait another year in a new issue that quotes the exact same posts that have been waiting here for a year.

I opened the stampede protection issue in August 2010, nearly two years ago. Comments like this aren't helpful and completely ignore the efforts other people have spent on dealing with this problem.

Agreed about the potential DDoS problem; I wasn't even speaking of it. I don't understand why we would switch to the concat method when the real-world problem pointed out in the issue's original post isn't caused by the original hash/list mechanism.

URLs have, per the HTTP specification, no length limit. But I'd like to highlight this: install a Drupal site with 100 modules, half of them shipping a CSS file, and you will have URLs concatenating potentially 50 CSS paths, each around 50 chars long (I took this URL as an example: sites/all/modules/contrib/admin_menu/admin_menu.css, which is not a long one). Without taking the base URL and other separators into consideration, this makes 50 * 50 = 2500 chars of URL length, which goes beyond the 2000-char recommended maximum for IE support. See http://stackoverflow.com/questions/417142/what-is-the-maximum-length-of-...

URLs are also keys for caching and routing throughout the HTTP stack (client, proxy, routers, load balancers, proxy caches, CDNs, HTTPd, PHP), and using extremely long keys may cause performance issues in any of those layers.

In some cases, the Apache rewrite rule that tests whether a file exists can hit file system limitations and fail. Even without rewrites, you can hit file name length limits with many OSes. This limit can be reached quite easily; see this kind of post for example: http://serverfault.com/questions/140852/rewritten-urls-with-parameter-le...

EDIT: See also http://en.wikipedia.org/wiki/Comparison_of_file_systems

I did some research, and it seems those kinds of problems come up quite often.

While I can see the sexy side of URL concat, I think this is a no-go: it doesn't scale, it exposes direct information about the file and directory structure of your PHP site, and it may lead to weird behaviors and potential DDoS (both arbitrary URLs stressing the PHP side and extremely long URLs stressing some browsers).

@catch I was a bit harsh with the "I won't wait another year", sorry again for that.

In reality this patch sounds better than that, since it doesn't just concatenate file names. It still seems a weird way to go. I tried generating 30 random module names, between 5 and 12 chars each, and used them as the CSS file names as well; the current algorithm gives me an approximately 1300-char URL. [EDIT: which causes file system problems on my box, and is still really long]

For the DDoS stuff:
If you look at the imagene module, it has permissions to protect it from being used as a free CDN:

To prevent others from using your server as a farm for gradient images, and totally wasting your filesystem with all possible permutations of colors, imagene comes with two permissions:

  • "see non-existing imagene images" can restrict people from using the server power to generate not-yet-existing images.
  • "save imagene images" will prevent generated images from being saved to disk, thus protecting your disk space from being filled with junk.

The idea is that only a person with access to the css files will ever want to generate new images, so this person can be given the necessary permissions. Or, you can grant the permissions to everyone, as long as the site is on a dev server, and restrict to admins when it goes to production.

If these permissions are restricted, a formatted version of an image needs to be visited by an admin/staff, before it can be viewed by a random visitor.

EDIT:
Ok, hashes are probably a more effective and desirable protection.
sites/mysite/files/styles/slider-image/public/slider/slide2.jpg?hash=öalkjüpohp9hoihjnöklajölknmövalj

#80: 1014086-80-css-imagecache.patch queued for re-testing.

Status:Needs review» Reviewed & tested by the community

As catch said, we don't have stampede protection in the current code either, so that's not relevant to this patch. The objection that 'WSCCI will soon change the code' is speculation (i.e. it might not land soon, or might never land). We don't block good patches for hopes and dreams, however sweet and likely those might be. Ultimately, catch will decide if he wants to delay this. So, back to RTBC, as tests are still passing.

Status:Reviewed & tested by the community» Needs work

We can't RTBC a patch that will attempt to write files with names longer than most file systems support. This patch is a no-go. For example, ext3 has a limit of 255 chars for a single filename; using this patch, when browsing the administration pages with the standard profile (no extra modules), the file name is already 125 chars. We can't commit this; if we do, we'll doom a lot of sites.

pounard@blaster:/var/www/d8-git/www/sites/default/files/css
[Tue May 15, 00:12] $ ls -al
total 44K
drwxrwxr-x  2 www-data www-data  4096 May 15 00:12 .
drwxrwxrwx 11 pounard  pounard   4096 May 15 00:05 ..
-rw-rw-r--  1 www-data www-data  1682 May 15 00:08 core;modules;comment=comment.theme,core;modules;field;theme=field,core;modules;search=search.theme,core;modules;user=user.css
-rw-rw-r--  1 www-data www-data  3864 May 15 00:08 core;modules;system=system.admin.css
-rw-rw-r--  1 www-data www-data  7764 May 15 00:08 core;modules;system=system.base~system.theme.css
-rw-rw-r--  1 www-data www-data  3960 May 15 00:08 core;modules;toolbar=toolbar,core;modules;shortcut=shortcut.theme~shortcut.base.css
-rw-rw-r--  1 www-data www-data 16383 May 15 00:08 core;themes;seven=reset~style.css
pounard@blaster:/var/www/d8-git/www/sites/default/files/css
[Tue May 15, 00:12] $ php -a
Interactive shell
php > echo strlen('core;modules;comment=comment.theme,core;modules;field;theme=field,core;modules;search=search.theme,core;modules;user=user.css');
125
php > ^Dpounard@blaster:/var/www/d8-git/www/sites/default/files/css
[Tue May 15, 00:12] $ sudo touch "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaacore;modules;comment=comment.theme,core;modules;field;theme=field,core;modules;search=search.theme,core;modules;user=user.css"
touch: cannot touch `aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaacore;modules;comment=comment.theme,core;modules;field;theme=field,core;modules;search=search.theme,core;modules;user=user.css': File name too long
pounard@blaster:/var/www/d8-git/www/sites/default/files/css
[Tue May 15, 00:12] $

Yep, I agree with pounard now; this is not going to work - 255 characters is going to be too short.

Good to hear. You can do a proper patch without changing the hash algorithm though. You could use a variable for storing the mapping, but personally I'd go for a database table, even though the change would be more invasive (I'm thinking about patch size).

EDIT: I'm still against patching right now, but if anyone wants to do it, it needs to be more minimal than this: using the core hash algorithm and storing the mapping in a variable as a first attempt would yield a patch with only a menu callback and no changes to the rest of the code.

Note that #47 is pretty close to this, although that patch is 15 months old now.

Note: There are 139 followers for the D7 patch of advagg #1171546: AdvAgg - D7 Port/Re-write and over 1,600 users of AdvAgg.

I still believe using a database table for CSS/JS aggregates is the correct way to do this. Getting a lot of what AdvAgg does in core is going to take some work if we want to go that route.

A better HTTP client via HTTPRL is the first step (AdvAgg generates the aggregates in the background with multi-process PHP). I'm pretty happy with what HTTPRL currently does, so creating a core patch to incorporate it is where I would start (HTTPRL was born out of AdvAgg code). The next step is to make it pluggable (#64866-37: Pluggable architecture for drupal_http_request()) so different backends can be put in place if cURL is desired.

The current limitation AdvAgg has is that it doesn't take advantage of multi-sites. If the same CSS aggregate is used across multiple sites on a server, taking advantage of that would make things a lot snappier (only one aggregate needs to be generated) and would save disk space. This would require a global files folder, one that doesn't depend on the current site being accessed. Global storage of the old files that need to be deleted could be bypassed if we had a REST API for querying whether an aggregate is old; this would require each site to be aware of all the other sites inside the multi-site install, so it knows which sites to query via HTTP.

Once a better HTTP client is in core and (optionally) a global files directory is available, I might have the D7 version of AdvAgg done, at which point we could create a patch for D8 core. This is my recommendation for core's CSS/JS aggregation. It's a big undertaking, but a lot of what AdvAgg does is what we're trying to do here: it uses the same code path for both CSS and JS, S3/CDN-friendly file naming, caching for faster page generation times, etc. It does a lot, so it would be a large patch. Gotta run...

Once a better HTTP client is in core ... I might have the D7 version of AdvAgg done

Do we need the better HTTP client in core, or is the hack in #1664784: drupal_http_request() has problems, so allow it to be overridden sufficient to unblock D7 AdvAgg? Please comment in that issue if more is needed there to satisfy AdvAgg.

at which point we could create a patch for D8 core

Are you recommending this as both the D7 and D8 solution, or D8 only? If D8 only, then is there any objection left to proceeding with an update of #47 for this issue, and doing #55 and/or AdvAgg in #352951: Make JS & CSS Preprocessing Pluggable or a spin-off of that?

D7 AdvAgg is not blocked, other than by me needing time to do it. The D7 version of AdvAgg will rely on HTTPRL. I'm just noting that there is a lot going on.

This thread is for a D8 solution; backporting to 7.x would be very hard to do. I've been working in #1447736: Adopt Guzzle library to replace drupal_http_request() and in Comparison of HTTP Client Libraries. I've concluded that Guzzle is the way forward; it offers what we're looking for in an HTTP client. The one thing it was lacking at the time, non-blocking requests, the author made happen. That's huge. If we come across other issues when integrating it, it's nice to know that mtdowling is willing to help.

At the core of AdvAgg is a whole lot of hooks to make it pluggable, plus the ability to generate a CSS/JS file based on the filename.

My objection to #47 is that everything is stored in a variable instead of a database table. Having things in two tables, as I do in AdvAgg, allows things like the bundler to work correctly and allows smart aggregate flushing. #55, from my understanding, deals more with the mechanics of how the hook_menu side of this works, which is covered by WSCCI.

I would be OK with #352951: Make JS & CSS Preprocessing Pluggable, but I think shipping with something like AdvAgg (it doesn't have to be AdvAgg itself) as the default would be good for the Drupal community.

To get around the filename length, we can keep a table with an autoincrement ID (or try to do something a bit niftier with hex, or with any allowable filename character, to have more to work with), simply mapping IDs to the files that have been added with drupal_add_*().

When files are added, we replace the filename with the ID. The request that generates the aggregate can then map the IDs back to the filenames. This is a lot less data to ship around than the current hashes when there are lots of aggregates, and it should be possible to add lots and lots of files before getting anywhere near the limit. A rough sketch of the mapping follows.
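A very rough sketch of that mapping, with hypothetical helper names, using variable_get()/variable_set() purely to keep the example short (a dedicated table, as suggested above, would be the real implementation):

<?php
/**
 * Map a CSS/JS file path to a short base-36 identifier (sketch only).
 */
function css_short_id($filename) {
  $map = variable_get('css_file_id_map', array());
  $id = array_search($filename, $map);
  if ($id === FALSE) {
    // Encode the next index in base 36 to keep the identifier tiny.
    $id = base_convert(count($map), 10, 36);
    $map[$id] = $filename;
    variable_set('css_file_id_map', $map);
  }
  return $id;
}

/**
 * Reverse lookup used by the request that actually builds the aggregate.
 */
function css_short_ids_to_filenames(array $ids) {
  $map = variable_get('css_file_id_map', array());
  $filenames = array();
  foreach ($ids as $id) {
    if (isset($map[$id])) {
      $filenames[] = $map[$id];
    }
  }
  return $filenames;
}
?>

Note this sketch ignores the race between concurrent variable_set() calls; that is one of the reasons a table (or locking) would be preferable.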

Also, for the DDOS issue: I spoke about this at some length with rupl, Mark Sonnabaum and greggles, and we've come up with a likely solution:

When the link to the aggregated CSS/JS is generated, we can hash the filenames + drupal_get_private_key(), and add that as a token to the request. The callback that assembles the aggregated file can then do the same and refuse to create the file if the token can't be validated.
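A minimal sketch of that token check; the helper names (css_aggregate_token(), css_aggregate_deliver(), css_aggregate_lookup_filenames()) are hypothetical, while drupal_hmac_base64(), drupal_get_private_key() and drupal_get_hash_salt() are real core functions:

<?php
/**
 * Sign the list of files that make up an aggregate (sketch).
 */
function css_aggregate_token(array $filenames) {
  return drupal_hmac_base64(implode(',', $filenames), drupal_get_private_key() . drupal_get_hash_salt());
}

/**
 * Page callback that only builds the aggregate for validated tokens (sketch).
 */
function css_aggregate_deliver($token, $aggregate_name) {
  // However the filenames are recovered (reverse lookup, IDs in the URL...),
  // the token must match before any work is done.
  $filenames = css_aggregate_lookup_filenames($aggregate_name);
  if ($filenames === FALSE || $token !== css_aggregate_token($filenames)) {
    return MENU_ACCESS_DENIED;
  }
  // ... build, save and serve the file here.
}
?>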

Here is what I actually do in the core_library module; this includes an SQL table too and is simpler IMO.

The table stores both the filenames array and the hash, along with a creation timestamp and an incremented version number.

Generated files are named hash-version so that older files are kept, ensuring that stale caches somewhere (browser, proxy, etc.) always get a matching file in response, corresponding to the older cached page state.

Each time a CSS or JS file is modified, the creation timestamp moves forward, the row is updated with the version incremented by one and the new creation time saved, resulting in a new CSS/JS file named hash-(version+1). This way, only one row per aggregated "set" exists, but multiple versions coexist, and there is no filename length problem.

The hash is the table index; lookups are quite fast since the table won't grow massively on a production site.

Storing the filenames in a blob field allows two things: first, as a blob it is supposedly not loaded into memory caches by the SQL backend (depending on size and backend of course; I'm thinking about TOASTed columns in PgSQL); second, it lets the file be fully rebuilt on demand when the URL is hit (no pre-aggregation is needed).

I came to this design after a lot of thinking, and it is the best way I could find to manage those files; it is efficient, and the only drawback is the extra SQL query when either checking the version or building the file on first hit.

EDIT: Moreover, with a bit of imagination the table could easily be replaced by any other backend (MongoDB, Redis, etc.), since there is one and only one index/primary key, which is the array hash. A rough hook_schema() sketch of this layout follows.
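As a hedged illustration only (the field and table names are guesses based on the description above, not the actual core_library schema), the table could look something like this in Drupal's Schema API:

<?php
/**
 * Hypothetical schema for the aggregate tracking table described above.
 */
function mymodule_schema() {
  $schema['asset_aggregate'] = array(
    'description' => 'Maps an aggregate hash to its member files and current version.',
    'fields' => array(
      // Hash of the ordered file list; the only lookup key.
      'hash' => array('type' => 'varchar', 'length' => 64, 'not null' => TRUE),
      // Incremented whenever a member file changes.
      'version' => array('type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE, 'default' => 0),
      // Creation timestamp of the latest version.
      'created' => array('type' => 'int', 'not null' => TRUE, 'default' => 0),
      // Serialized, ordered list of member files.
      'data' => array('type' => 'blob', 'size' => 'big', 'not null' => TRUE),
    ),
    'primary key' => array('hash'),
  );
  return $schema;
}
?>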

The #101 solution seems overly complicated IMHO.

@pounard the post in #101 is missing some background.

If we want to support real subrequests in core, then we need a way for the actual page to take assets added by those subrequests, put them into the header of the page, potentially aggregate them, etc.

With real subrequests that logic can't happen in PHP (since potentially no PHP at all will be executed), so we'd need either a JS loader that reads some embedded JSON and handles adding the assets, or apparently Twig has some kind of mechanism for this too (not sure if Twig handles the loader bit as well as the embedded assets bit). Either way this means being able to create an aggregate URL without writing out either files or database rows (assuming we don't want the loader making backend requests, which I'm sure we don't), which makes a solution where all the information required to produce an actual aggregate is in the aggregate URL itself much more important. That provoked a lot of discussion at the Munich code sprint and led to trying to develop the 'filenames in the URL' approach to a point where it was both performant and secure.

Now that hook_library_info() is mandatory for all JS and CSS, the whole system will become a lot more predictable. My opinion is that the aggregation mechanism should be tied to this hook's data instead of trying to do anything dynamically. Using it may help us do all of this without any database table or generated identifiers or hashes (although I'm not fully convinced, I think this path should be explored).

EDIT: This means that the system is a lot less dynamic now, and that may solve some of the subrequest problems.

Status:Needs work» Needs review
Issue tags:-Performance, -i/o, -needs backport to D7

#80: 1014086-80-css-imagecache.patch queued for re-testing.

Status:Needs review» Needs work
Issue tags:+Performance, +i/o, +needs backport to D7

The last submitted patch, 1014086-80-css-imagecache.patch, failed testing.

Is there any point in rerolling (http://drupal.org/patch/reroll) #80?

This could use an issue summary update. Tips on how: http://drupal.org/node/1427826

#886488-94: Add stampede protection for css and js aggregation says this issue might help that one.

(the slash in the i/o tag breaks the autocomplete from adding new tags)

Issue tags:+i-o

Sorry, fixing the tag.

Issue tags:-Needs reroll

Removing reroll tag

Heads up, the D7 version of AdvAgg is now in beta. It stores the files and the order used in creating the aggregate, a hash of all the files' content (for versioning), and any setting that affects the output of an aggregate (CSS/JS minification, CDN settings, embedded images, http/https, etc.). It also has stampede protection, backports the D8 JS code, and is written to reduce disk I/O.

Title:Consider using imagecache style generation of css and javascript aggregates» Consider using imagecache style generation of CSS and JS aggregates

Now that #352951: Make JS & CSS Preprocessing Pluggable has landed, I read the entire issue. Let's not fix everything here. For that, we have this meta issue: #1048316: [meta] CSS and Javascript aggregation improvement.

Let's just fix the way CSS/JS aggregates are generated here, and implement the same strategy that we use for imagecache. (Taking into account the mistakes that were made for imagecache generation: https://drupal.org/SA-CORE-2013-002 — even though that is only superficially similar to the situation here.)

Thanks to #352951: Make JS & CSS Preprocessing Pluggable, it will be much simpler for something like AdvAgg to be ported to Drupal 8. It'll be easy in Drupal 8 to implement any kind of aggregation logic.

But, for Drupal 8 core, with us having entered the post-API freeze period, I believe something like #101 might be the simplest possible approach:

When the link to the aggregated CSS/JS is generated, we can hash the filenames + drupal_get_private_key(), and add that as a token to the request. The callback that assembles the aggregated file can then do the same and refuse to create the file if the token can't be validated.

Let's get this going again!

@Wim, #101 is only relevant if we also do #100.

There's two options really:

1. Store the aggregate filenames + list of files in state(); the request that builds the file can then reverse-lookup the list from the filename. This is what http://drupal.org/project/agrcache does, and AdvAgg is similar in that regard. The disadvantage of this approach is that it's a fairly large array to be fetching on every request, and whenever a new aggregate is requested, the main process has to write to state(). If the filename isn't found in state, nothing happens, so there's no imagecache-esque DDOS vector. (A minimal sketch of this option follows below.)

2. Provide a representation of the filenames in the aggregate string itself (not the filenames themselves, because that'd be too long). We map each individual file to a one- or two-character string (in state()), then pass the list of tiny names in the filename (plus the hash), and the process generating the file can reverse-lookup the filenames from those identifiers. This still requires a write to state() in the parent process, but only when a new file is added (as opposed to a new aggregate, so less frequently), and it's a lot less data to fetch from state() on each request - just filename + two-character string for each unique file, instead of aggregate filename + list of files for each unique aggregate. There's a bit of complexity added because you need to handle the shortest possible filenames (a-z/A-Z/_-/0-9) and ensure that you don't run out, etc., but it ought to perform better.

If there's an option #3 someone please add it!

Regardless of #1 and #2, the route controller is going to look very similar, so unless there's a fatal flaw in one of the two approaches I'd be happy going with either for now; we can swap it out later if we decide the other was better (or contrib can).
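For option 1, a minimal sketch of the state-backed map, assuming the D8 state API (\Drupal::state()) and hypothetical function names; the route controller wiring is omitted:

<?php
/**
 * Record a newly generated aggregate filename and its member files (sketch).
 */
function css_aggregate_register($aggregate_filename, array $files) {
  $map = \Drupal::state()->get('css_aggregate_map', array());
  if (!isset($map[$aggregate_filename])) {
    $map[$aggregate_filename] = $files;
    \Drupal::state()->set('css_aggregate_map', $map);
  }
}

/**
 * Reverse lookup used by the build route (sketch).
 */
function css_aggregate_lookup($aggregate_filename) {
  $map = \Drupal::state()->get('css_aggregate_map', array());
  // Unknown filename: the controller responds 404 and nothing is built,
  // which is what keeps the imagecache-style DDOS vector closed.
  return isset($map[$aggregate_filename]) ? $map[$aggregate_filename] : FALSE;
}
?>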

I think there might be an option #3… :)

It's inspired by https://docs.google.com/presentation/d/1Q44eWLI2qvZnmCF5oD2jCw-FFql9dYg3... — also see https://github.com/google/module-server for a sample implementation plus some more text.


So, the alternative proposal:

3. We have a dependency graph. We leverage that.

  • Step 0: must finish the conversion to libraries.
  • Step 1: revamp drupalSettings.ajaxPageState to list libraries, rather than specific CSS or JS files.
  • Step 2: define an aggregate by its filename, in which we list the "top-level libraries", which are libraries that are not a dependency of any other library on the page. So, we would never see an aggregate like jquery,drupal,debounce,announce.js (because none of those do anything in and of themselves; they're all merely utilities); instead we would just see contextual.js, which implies that the "contextual" library plus all its dependencies are loaded.
  • Step 3: ensure that it works with multiple scopes, this is easy to do: if the header scope contains libraries foo and bar, and the footer contains libraries baz and qux, then the aggregate for the header would be foo,bar.js, the aggregate for the footer would be baz,qux-foo,bar.js: "libraries baz and qux plus their dependencies, minus libraries foo and bar and their dependencies". (This somewhat relates to #784626-59: Add explicit "default" scope for drupal_add_js() which defaults to "footer" so that pages render fast.)
  • Step 4: ensure that it is regenerated whenever the contents change: the hash that we use for the aggregate's file name today can be embedded in the path: if the aforementioned aggregates would be generated at some/path/aggregate/css/foo,bar.js, then we can do something like some/path/aggregate/css/<uniqueness token>/foo,bar.js. This would also make it rather trivial for a contrib module to override things to generate that uniqueness token — cfr. #678292: Make CSS and JS aggregation sensitive to file changes and be enabled by default.
  • Step 5: ensure we prevent DDoSsing: if the aforementioned aggregates would be generated at some/path/aggregate/css/<uniqueness token>/foo,bar.js, then we can do something like some/path/aggregate/css/<HMAC security token>/<uniqueness token>/foo,bar.js. This is exactly what the CDN module does today: $security_token = drupal_hmac_base64($uniqueness_token . $libraries_string, drupal_get_private_key() . drupal_get_hash_salt());. (A small sketch of steps 4 and 5 follows this list.)
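A hedged sketch of steps 4 and 5, reusing the CDN-module formula quoted above; the path layout and the aggregate_url() helper are illustrative only:

<?php
/**
 * Build the aggregate URL from the top-level library names plus both tokens
 * (sketch).
 */
function aggregate_url(array $top_level_libraries, $uniqueness_token) {
  $libraries_string = implode(',', $top_level_libraries);
  $security_token = drupal_hmac_base64($uniqueness_token . $libraries_string, drupal_get_private_key() . drupal_get_hash_salt());
  // e.g. some/path/aggregate/css/<HMAC token>/<uniqueness token>/foo,bar.css
  return 'some/path/aggregate/css/' . $security_token . '/' . $uniqueness_token . '/' . $libraries_string . '.css';
}
?>

The controller that builds the aggregate recomputes the HMAC from the requested path and refuses to generate anything when the two tokens don't match.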

#114 looks like a good plan and I can't immediately think of any showstoppers.

I see one small detail that could go wrong with #114: file name length will be proportional to the number of aggregated files. Aside from that, it sounds OK.

I still think that using the state system for keeping track of aggregated files would be a rather good and failsafe solution, instead of trying to dynamically look up file names from the incoming URL.

#116:

  1. It is true that the file name length will be proportional. But it won't be proportional to the number of aggregated files, only to the number of top-level libraries. That's a huge difference!
  2. I'm not sure if using the state system is so much better — as you already say, it requires the state to be tracked. And as we've seen earlier in this issue, that's a lot of state to track. Whereas for #114, there would be almost the exact same state tracking as we have today (or am I missing something?). Only when/if we do something like #678292: Make CSS and JS aggregation sensitive to file changes and be enabled by default, we would need to track additional state.

If we're passing a comma-separated list of libraries in the filename and then consulting hook_library_info() in the aggregate generation callback, it may not be necessary to have any state tracking at all.

It's definitely a consideration whether there might be too many libraries in a scope so that we run into file system limitations again, but that only takes us back to #100 (or splicing the aggregate into two past a certain length would be another option). A rough sketch of resolving the library names back to files follows.
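A rough sketch of that expansion; $all_libraries stands in for the collected hook_library_info() data, and the array structure here is simplified (plain library names and a flat 'css' list), so treat it as an illustration of the recursion rather than core's actual format:

<?php
/**
 * Expand a list of top-level library names into an ordered file list (sketch).
 */
function expand_libraries(array $names, array $all_libraries, array &$seen = array()) {
  $files = array();
  foreach ($names as $name) {
    if (isset($seen[$name]) || !isset($all_libraries[$name])) {
      continue;
    }
    $seen[$name] = TRUE;
    $library = $all_libraries[$name];
    // Dependencies first, so they end up earlier in the aggregate.
    if (!empty($library['dependencies'])) {
      $files = array_merge($files, expand_libraries($library['dependencies'], $all_libraries, $seen));
    }
    if (!empty($library['css'])) {
      $files = array_merge($files, array_keys($library['css']));
    }
  }
  return $files;
}
?>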

Any chance we can get this into D8?