Drupal needs to abstract the URL generation process for files in a standardized way, everywhere: from page.tpl.php to modules to drupal_add_css(). That allows us to serve files from different servers.

The benefit is that Drupal can then rewrite file URLs to be served from static file servers or CDNs, FTP servers, or even BitTorrent. However, in many cases there's a delay between a file being available on the Drupal site (i.e. the web server) and the synchronisation of that file to a file server. So we need to fall back automatically: try the CDN; if the file isn't available there, try the static file server. If none of the stand-alone file servers can serve the file, use Drupal's web server (which is *always* the file server currently, in Drupal 6 and earlier).

The first step is to allow pluggable file server types. This can be done through hook_file_server(). In short: each module that implements this hook should keep a local cache of "source file, destination file" pairs, return the destination URL if the file is in the cache, and otherwise return FALSE.
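As a rough illustration, a hook implementation might look like the following (the exact hook signature, the mymodule prefix and the {mymodule_files} cache table are assumptions made for this sketch, not part of the patch):

<?php
/**
 * Implementation of hook_file_server().
 *
 * Sketch only: look the file up in a locally maintained
 * "source path => destination URL" cache table and return the
 * destination URL, or FALSE when this server cannot (yet) serve it.
 */
function mymodule_file_server($path) {
  // {mymodule_files} is a hypothetical cache table, kept up to date by cron.
  $url = db_result(db_query("SELECT url FROM {mymodule_files} WHERE path = '%s'", $path));
  // Returning FALSE tells file_url() to try the next file server.
  return $url ? $url : FALSE;
}
?>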

The second step is to allow the user to configure the file server order (see the attached annotated screenshot): which file server should be preferred (i.e. tried first), which second, and so on. The web server Drupal is running on will always be the least preferred server, i.e. the file server that can always be fallen back on.

Step three, finally, is to abstract the URL generation and implement the fall-back mechanism. This is the file_url() function, which iterates over the available file server hook implementations, stopping when it receives a URL and trying the next file server when it receives a boolean FALSE. If none of the file server hook implementations return a URL, Drupal's web server is used.
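Conceptually, the fall-back loop boils down to something like this (a sketch, not the actual patch; ordering the hook implementations by the configured server preference is omitted):

<?php
/**
 * Sketch of file_url(): ask each file server for a URL and fall back
 * to Drupal's own web server when none of them can serve the file.
 */
function file_url($path) {
  foreach (module_implements('file_server') as $module) {
    $url = module_invoke($module, 'file_server', $path);
    if ($url) {
      // The first file server that can serve the file wins.
      return $url;
    }
  }
  // No stand-alone file server has the file (yet): serve it from the
  // web server Drupal is running on.
  return base_path() . $path;
}
?>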

This patch allows you to create scalability-enabling modules like the CDN integration module that I wrote.

Attached is a patch against Drupal 5 core. I want to get feedback before porting this patch to Drupal 6 and thus getting to the point where I have to maintain core patches for multiple Drupal versions.


Comments

pwolanin’s picture

um - the patch is 5.x but the issue is 7.x?

Wim Leers’s picture

"Attached is a patch against Drupal 5 core. I want to get feedback before porting this patch to Drupal 6 and thus getting to the point where I have to maintain core patches for multiple Drupal versions."

The point is to get feedback on my version for Drupal 5 first, before I move on to Drupal 7. I hope that's okay?

pwolanin’s picture

@Wim - sure, the code doesn't look too heavy in any case.

Wim Leers’s picture

More about the advantages this brings can be read in this section of an article I wrote about Drupal page loading performance.

moshe weitzman’s picture

In Drupal 6, you can now rewrite the whole URL, even base_url, using custom_url_rewrite_outbound(). So some of these tricks are possible without any core hacking.

Wim Leers’s picture

That only works for files that are linked to through url() calls. Functions that *don't* use url():
- drupal_add_css()
- drupal_add_js()
- file_create_url() (only uses url() when using private files)

So, as far as I can tell, it can't be used at all...

drewish’s picture

subscribing, if you don't want multiple versions then it seems like HEAD would be the best choice to chase ;)

dopry’s picture

I am actually working on a variation of this with the private files work I've been doing in my spare time. The approach I was taking was to add a public and a private files URL. I thought it would be a simple solution.
The hook in this instance may be overkill, imho. How often do you expect to need fall-through lists of servers for URLs?

What kind of system architecture do you see this working for? I envision a common use case of files.example.com being a domain name that can be pointed to a CDN, load balancer, or proxy/redirect server. I don't see how that would really benefit from this approach.

I don't really think much is gained from this flexibility for the average drupal user. Maybe I am being short sighted...

Wim Leers’s picture

This works above public/private files, conceptually at least. Although I must admit that it currently only supports public files, I haven't taken private files into consideration yet.

A whole list of servers may indeed be overkill. But it's the easiest way to work with in code and conceptually. IMO this use case should be pretty common:
- Primary file server: CDN, this is often an external domain (e.g. yoursite.yourcdn.com)
- Secondary file server: static file server, this is often on the same server, but served by a different process (e.g. lighttpd next to Apache), or another server in the same server park. This one is used when a file is not yet synced to the CDN (e.g. images generated by imagecache will not be synced yet on the first request).
- Tertiary (failsafe) file server: the web server that's serving Drupal.

So, in my opinion, the fallback mechanism should not be considered a complex tool. It should be seen as the mechanism underneath that makes it really easy to have multiple file servers.

Also, it might be necessary to slot other file servers in between any of the above. You could e.g. create an S3 file server or a BitTorrent server, for example to serve all .iso files from S3 instead of the CDN. Any combination becomes possible.

This is not for the average Drupal user, that's true. But I'm trying to make it easier to do, so more users can take advantage of it.

dopry’s picture

I like the objective... but we're talking about adding a lot of moving parts to create a URL here... The part that worries me most is checking to see if a file is available on each host for the fall-through... If we're talking about a site with a lot of file URLs to be created, we're talking about a lot of latency. Also, how do you specify that ISOs are on the CDN, etc.? A lot of that work can be done with a layer 7 load balancer.

I don't think your use case/infrastructure is very common. Most CDNs function as a proxy and will pass requests back to your website if the image is not available on them. If you are using a CDN where you have to manually sync files, I suggest you re-assess your service provider.

If high availability is an issue, you normally see failover or load-balanced file servers with a shared filesystem, NFS, DRBD, or rsync to keep them in sync. The tertiary file server argument doesn't really work in this case. Most sysadmins wouldn't even consider such an architecture.

The two most common use cases for an alternative file URL I think I've come across are:
1) redirecting requests for files to a server optimized for static file hosting.
2) using 3rd-party file storage services to store files (S3)

I tried addressing this problem in the fileapi.module I was working on, which I may get back to one day as a drop-in replacement for file.inc. The idea with it was to 'mount' different file stores for Drupal. So basically paths beginning with sites/default/files/myftp would be handled by an FTP-specific API. URL generation would be passed through to the driver, so the driver could return a completely different URL from the system path or URL.

I think we really need to understand the use cases and the larger Drupal implementations that are out there. Maybe we can get feedback from some lullabot/sony/advomatic/firebright people about what kind of infrastructure they run on, and how they would work with these features.

Wim Leers’s picture

I like the objective... but we're talking about adding a lot of moving parts to create a URL here... The part that worries me most is checking to see if a file is available on each host for the fall-through... If we're talking about a site with a lot of file URLs to be created, we're talking about a lot of latency. Also, how do you specify that ISOs are on the CDN, etc.? A lot of that work can be done with a layer 7 load balancer.

I failed to make this clear: the availability check should of course be cached in the database, i.e. store the current state after every cron synchronization.
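For illustration, the availability cache could be refreshed during cron roughly like this (a sketch under assumed names: the mymodule prefix, the {mymodule_files} table and the mymodule_sync_files() helper are all hypothetical):

<?php
/**
 * Implementation of hook_cron().
 *
 * Sketch: after pushing new files to the CDN, record which source paths
 * are available and under which URL, so that hook_file_server() only
 * needs a cheap database lookup at URL generation time.
 */
function mymodule_cron() {
  // mymodule_sync_files() would push new files to the CDN (scp/sftp/rsync)
  // and return an array of 'source path => CDN URL' pairs; not shown here.
  foreach (mymodule_sync_files() as $path => $url) {
    db_query("DELETE FROM {mymodule_files} WHERE path = '%s'", $path);
    db_query("INSERT INTO {mymodule_files} (path, url) VALUES ('%s', '%s')", $path, $url);
  }
}
?>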

I don't think your use case/infrastructure is very common. Most CDNs function as a proxy and will pass requests back to your website if the image is not available on them. If you are using a CDN where you have to manually sync files, I suggest you re-assess your service provider.

Well, I don't have experience with the "high-profile CDNs" such as Akamai. The only CDN I have hands-on experience with is CacheFly, and they only allow uploading files via scp/sftp/rsync. So "manual" synchronization is still necessary there. Depending on your setup, this may also be necessary for your static file server.

If high availability is an issue, you normally see failover or load-balanced file servers with a shared filesystem, NFS, DRBD, or rsync to keep them in sync. The tertiary file server argument doesn't really work in this case. Most sysadmins wouldn't even consider such an architecture.

The main target in my use cases so far was high performance, not high availability.

The two most common use cases for an alternative file URL I think I've come across are:
1) redirecting requests for files to a server optimized for static file hosting.
2) using 3rd-party file storage services to store files (S3)

Agreed.

I tried addressing this problem in the fileapi.module I was working on, which I may get back to one day as a drop-in replacement for file.inc. The idea with it was to 'mount' different file stores for Drupal. So basically paths beginning with sites/default/files/myftp would be handled by an FTP-specific API. URL generation would be passed through to the driver, so the driver could return a completely different URL from the system path or URL.

That's more advanced than my approach, but it is suitable. I assume you can have any number of instances of any mount?

I think we really need to understand the use cases and the larger Drupal implementations that are out there. Maybe we can get feedback from some lullabot/sony/advomatic/firebright people about what kind of infrastructure they run on, and how they would work with these features.

Agreed.

My goals with this patch were:
1) all files that are served (including CSS etc.) should go through a central file URL generation function; in my patch this is file_url(). Very important: this implies that URLs inside files (e.g. in CSS files) also have to be updated!
2) allow contrib modules to dynamically generate file URLs, and fall back if the file is not available on a specific server (yet); in my patch this is hook_file_server().

Use case: serving all the CSS/JS/image etc. files from a CDN. This requires some sort of mechanism to override the default server for these files. You may think that overriding the base URL would be enough, but it's not, for several reasons:
- for proper high-performance static file serving, you need to set far-future Expires headers (e.g. +10 years). This requires unique filenames, and thus dynamic URLs, based on the uniqueness of the file itself (e.g. md5 hash or modification time).
- certain files, such as generated images (Image Cache ;)), can't be available on a CDN instantly, by design (they have to be generated dynamically). So a fallback mechanism is needed.
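To illustrate the first point, a cache-busting URL could be derived from the file's modification time, along these lines (a sketch; the helper name and CDN domain are made up):

<?php
// Hypothetical helper: embed the file's modification time in the URL so a
// far-future Expires header can be set safely; any change to the file
// automatically produces a new URL.
function mymodule_unique_file_url($path) {
  $mtime = filemtime($path);
  return 'http://yoursite.yourcdn.com/' . $path . '?' . $mtime;
}
?>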

c960657’s picture

See also #207310: Allow rewriting of file URL's using custom_url_rewrite_outbound() for a less powerful but less complex way of rewriting URLs of uploaded files.

I like the idea of the suggested file_url() for creating URLs to JS, CSS etc. I suggest that this also passes the generated URLs through custom_url_rewrite_outbound().

Wim Leers’s picture

If that would be supported, it would be in a follow-up patch.

Robin Monks’s picture

Status: Needs review » Needs work

I'd love to see this rerolled for core. +1 for the idea!

Marking as code needs work since, technically, feature requests can't go into 5.x, and 5.x code can't go into 7.x.

Robin

Wim Leers’s picture

This issue was on hold until hook_file() got committed. Since that's done now, we can continue working on this.

Unfortunately, university is a huge time sponge, so I don't have time to work on this right now. I believe Jakub Suchy (meba) might jump in here :)

meba’s picture

subscribing :)

meba’s picture

Assigned: Wim Leers » meba
Status: Needs work » Needs review
FileSize
523 bytes
7.66 KB

Actually, this was pretty easy. Attaching a first version of a patch. I decided not to include file server weight settings for now because I hope there may be a more elegant solution.

What does this do?

  1. It creates a function called file_url() which needs to be called every time any file link is passed to a user (theme_image, drupal_get_css, file_create_url, etc.)
  2. Creates a new hook - hook_file_server(). Whenever file_url() is called, it loops through every module implementing this hook and waits for a new URL. If no URL is returned, it falls back to Drupal's default (a rough sketch of such a hook follows below).

Also attaching
fstest.tar.gz - untar this to sites/all/modules and install the File server testing module.
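In essence, a hook implementation like the one in the test module only has to return a rewritten URL. This is a guess at what fstest.module roughly does; the tarball is authoritative:

<?php
// Guessed sketch of the test module: unconditionally rewrite every file
// URL to point at the configured test server (trailing slash included).
define('FSTEST_SERVER', 'http://192.168.122.158/');

/**
 * Implementation of hook_file_server().
 */
function fstest_file_server($path) {
  // Always claim to be able to serve the file from the test server.
  return FSTEST_SERVER . $path;
}
?>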

How to test

  1. Apply patch
  2. Enable fstest module, enable upload module
  3. Change fstest.module variable FSTEST_SERVER to a proper URL. For example, my local virtual server is available through `http://webdevel` but also using `http://192.168.122.158`. My default is webdevel so I have set FSTEST_SERVER to `http://192.168.122.158/` (include a trailing slash)
  4. Create a node and upload some files. See a link for uploaded files - it should point to `http://192.168.122.158`
  5. Display a source for the page. All CSS files, logo and images should point to `http://192.168.122.158`

More to come

  • Test results
  • Benchmarks

Please, at least some reviews :-)

meba’s picture

All tests passed except:

File validation, Upload user picture: these tests are failing with and without the patch, ignoring

Simpletest tests:
Found assertion {"SimpleTest pass.", "Other", "Pass", "simpletest.test", "SimpleTestTestCase->stubTest()"}., line 132
Found assertion {"Created permissions: access content", "Role", "Pass", "simpletest.test", "SimpleTestTestCase->stubTest()"}., line 135
Found assertion {"This is nothing.", "Other", "Pass", "simpletest.test", "SimpleTestTestCase->stubTest()"}., line 142

These tests are failing only with the patch, don't know why yet.

Wim Leers’s picture

Maybe those tests check for certain pieces of HTML which include URLs? We'll have to update the tests as well to check for the proper new URLs.

Anonymous’s picture

Status: Needs review » Needs work

The last submitted patch failed testing.

c960657’s picture

I think it is relevant to make a distinction between "dead" files and "alive" files (these descriptions may be badly chosen, but please bear with me on this).

Dead files are files that are part of Drupal, e.g. all files in misc/, and all images and CSS files in themes.
Alive files are files that are uploaded by users or dynamically created (aggregated CSS and JS).

Dead files only change when you upgrade core or modules. If you upload these files to the CDN at the same time you upload them to the primary web server, there is no need to check at runtime whether they are available. Wouldn't it be reasonable to only check for alive files at runtime?

Dead files could be served in a simpler way, e.g. like this (please ignore the poor name of the function):

-        $settings['logo'] = base_path() . dirname($theme_object->filename) . '/logo.png';
+        $settings['logo'] = dead_url(dirname($theme_object->filename) . '/logo.png');

dead_url() could prepend a static hostname to the specified filepath (or just base_path() if there is no CDN involved), or perhaps distribute among a list of hostnames (static1.example.org, static2.example.org etc.) - but as a simple solution, without hooks or anything.
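A minimal sketch of such a helper, assuming a single configurable static hostname (the 'static_file_host' variable name is made up):

<?php
/**
 * Sketch of dead_url(): prepend a static file host to the path of a
 * "dead" file; with no host configured this degrades to base_path().
 */
function dead_url($path) {
  $host = variable_get('static_file_host', '');
  return $host ? $host . '/' . $path : base_path() . $path;
}
?>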

Isn't this sufficiently flexible for dead files?

Wim Leers’s picture

c960657: While your intention is good, it would be way too error-prone. What if a "dead file" was updated in a Drupal core update? What if you forgot to update this file on the CDN (which you'd have to do manually in the system you propose), which would consequently break the JS of the entire web site?
The overhead of those few extra checks won't make that big a difference. I think it's smarter to optimize at a later stage, i.e. in page caching.

Also, the proposed hooks in this patch don't prevent you from creating the behavior you suggest. It just allows for more. In your implementation, you could check if the first part of a file path matches "misc" for example and then just pick one of the static domains you mention at random. There would just be one extra layer of indirection due to the proposed hooks.
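For example, a hook implementation could special-case core assets without any availability check, roughly like this (a sketch; the module prefix and domain list are made up):

<?php
/**
 * Implementation of hook_file_server().
 *
 * Sketch: serve everything under misc/ from a randomly picked static
 * domain, and let file_url() fall back to other servers for the rest.
 */
function staticcore_file_server($path) {
  if (strpos($path, 'misc/') === 0) {
    $domains = array('http://static1.example.org', 'http://static2.example.org');
    return $domains[mt_rand(0, count($domains) - 1)] . '/' . $path;
  }
  return FALSE;
}
?>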

Wim Leers’s picture

Assigned: meba » Wim Leers
Status: Needs work » Closed (fixed)