Support PHP stream wrappers

c960657 - February 26, 2008 - 23:25
Project:Drupal
Version:7.x-dev
Component:file system
Category:feature request
Priority:normal
Assigned:Unassigned
Status:patch (code needs work)
Description

PHP allows users to implement their own stream wrappers. This makes it possible to access virtual or remote filesystems as if they were on the local disk using fopen(), unlink(), file_get_contents() etc. The files may be stored e.g. as real files on a central server, as virtual files in a database, on remote a storage service like Amazon S3.

Handling uploaded files with Drupal in a webserver cluster is not trivial if the parallel servers do not have a shared directory on disk, e.g. using NFS. Using a stream wrapper administrators can implement their own custom logic that implements a global filesystem in their server cluster. Other uses include the ability to push additional copies of files to e.g. a CDN.

Stream wrappers has a well-defined interface that isn't specific to Drupal, so implementations of such wrappers are useful for other purposes as well.

Stream wrappers work almost like regular files, but only almost. Certain functions do not support stream wrappers, including realpath(), glob(), chmod(), tempnam() etc. So you cannot just insert "mywrapper://mysite.example.com/files" instead of "files" in Admin → Site configuration → File system.

However, with only a few modifications to the Drupal core it works. We currently run several sites in production using stream wrappers for file_directory_path and file_directory_temp. We have done a few modifications to include/file.inc:

  • we added a new function, file_realpath(), that implements realpath() for stream wrappers, and then replaced all occurrences of realpath() with file_realpath()
  • we added @ in front of chmod()
  • we replaced tempnam(realpath(file_directory_temp()), 'tmp_') with file_directory_temp() .'/tmp_'. md5(getmypid() . microtime(true))

We also had to do a few minor tweaks to some contrib modules, but generally things have just worked.

I suggest that Drupal is made stream wrapper friendly. As mentioned, this requires a few changes to file.inc and that developers refrain from using certain filesystem functions.

If people like this idea, I will post patches. Our current solution requires some adjustments in order to be applied to Drupal CVS (e.g. microtime(true) is PHP 5 only).

#1

AjK - February 27, 2008 - 23:33

Just so you know, Drupal 7 will be the first PHP5 only release.

#2

c960657 - April 16, 2008 - 22:22

This patch shows our changes to the Drupal 6 core. I'll whip up a patch against Drupal 7 as well.

Comments are welcome.

AttachmentSize
streamwrapper-D6-1.patch6.05 KB

#3

c960657 - April 24, 2008 - 19:52
Status:active» patch (code needs review)

Patch against HEAD.

This has been tested using the Amazon S3 package in PEAR , and file_directory_path set to "s3://mybucket/files" (running this example requires an Amazon S3 account).

require_once 'Services/Amazon/S3/Stream.php';
Services_Amazon_S3_Stream::register('s3',
  array('access_key_id'     => '12345678901234567890',
        'secret_access_key' => '1234567890123456789012345678901234567890' ));

This runs extremely slowly due (a caching layer is needed), but it serves as a proof of concept.

AttachmentSize
streamwrapper-D7-1.patch7.76 KB

#4

Wim Leers - April 24, 2008 - 21:50

Another interesting patch, c960657. Thanks! And subscribing. Will review when/if I find the time.

P.S.: I wish you had an easier to remember and type nickname though ;)

#5

Susurrus - April 25, 2008 - 00:40

I would think this would be a perfect situation where a unittest would be useful. It seems like all you need to do is create a new class and then access it. Creating a class that just accesses bunk data that would be static between all instances could do this and allow testing for everyone, including users who don't have an S3 account.

I'll see if I can look into that this weekend.

#6

Dries - April 25, 2008 - 18:20

You have my interest with this patch, c960657. Cool stuff.

file_realpath() could use an example and/or some more verbose explanation.

It would be great if you could extend the PHPdoc to capture some of your experience and knowledge. It's probably useful to include some information about stream wrappers to help people understand certain decisions.

I agree with Susurrus that some tests would be key to avoid that we re-introduce changes that might break stream wrappers.

#7

moshe weitzman - April 26, 2008 - 03:27

Subscribe - nice stuff.

Where would you see caching being added for an S3 stream?

#8

c960657 - April 27, 2008 - 01:16

I expanded the comments for file_realpath().

I also added support for resolving "/../" etc. This isn't necessary in a setup like ours, because you cannot escape the S3 bucket using "/../". But in general it is required in order to make file_check_location() work reliably.

The resolver in _file_remove_dot_segments() is based on the pseudo-code in RFC 3986 (about URIs). It is probably possible to do something similar to this using fewer lines of code (e.g. using regular expressions), but it isn't as trivial as it may sound, so for now I decided go with something that is known to work in a secure and reliable way.

WRT testing it would be nice if the whole testsuite for all modules using files could be run both with and without file_directory_path set to a path using a custom stream wrapper. I don't know how to achieve that.

>Where would you see caching being added for an S3 stream?
We have implemented caching in separate stream wrappers. We have file_directory_path set to filecache://dbstatcache://s3://static.example.com/files, where static.example.com is the name of the S3 bucket. Filecache caches files on the local disk of the webserver, but passes all stat() calls on to the dbstatcache. Dbstatcache caches stat information in a database that is accessible by all webservers in the cluster (this is necessary in order to detect if an already cached file is deleted or updated), but passes all fopen() calls on to the s3-wrapper. This may sound overly complicated, and perhaps it is, but the initial idea was to create simple general-purpose components and perhaps release them as open source (so far we have only released the S3 package through PEAR, but we don't mind sharing the other).

So far we only use S3 as a SAN with local caches, but using custom_url_rewrite_outbound() we could serve files directly off S3.

Our solution works, but the speed is not impressive. The end user usually don't feel a difference because of the local cache, but uploading big files is rather sluggish. E.g. ImageCache is not optimized for high-latency filesystems. Improving that is the next step.

AttachmentSize
streamwrapper-D7-2.patch9.48 KB

#9

chx - April 27, 2008 - 06:47

There are code style issues like } else { also _file_remove_dot_segments is extremely complex, are we sure we need all this?

#10

Susurrus - April 27, 2008 - 16:42

Does realpath() not do the job for removing dots?

#11

c960657 - April 27, 2008 - 17:00

>are we sure we need all this?
Perhaps not. As mentioned, we haven't implemented it in our setup, and so far without problems, so perhaps nothing in core relies on the /../ resolving feature of realpath(). The same can easily be archived using dirname().

file_realpath() could be altered to simply return false on all stream wrapper paths containing /../ rather than trying to resolve them using realpath/file_realpath(). This would still allow file_check_location() to detect potentially dangerous paths.

>Does realpath() not do the job for removing dots?
Yes, but only for regular files that exist on disk. It does not work for stream wrappers or arbitrary strings containing dots. This is the reason for introducing file_realpath().

I have attached a regexp-based approach for removing dot segments, just to show an alternative implementation. It is fewer lines of code, but as mentioned, resolving of dot segments may not be necessary at all.

AttachmentSize
streamwrapper-D7-3.patch7.64 KB

#12

Dries - April 28, 2008 - 09:55

I suggest that we leave out the more complex dot-handling for now. It's probably better to focus on testing, and to introduce the complex dot-handling when we have a real need for it, and when we can actually write tests for it.

#13

c960657 - April 28, 2008 - 12:48

I updated the patch to simply return FALSE on paths containing "/../". AFAICT this is enough to prevent security exploits (assuming that the wrapper is relatively sane).

I have made file_realpath() less complicated by making it illegal to specify file_directory_path and file_directory_temp with a file:// prefix. I don't see a need to allow two equivalent syntaxes for specifying filesystem paths - it will only introduce another attack vector.

#14

Dries - April 28, 2008 - 17:43

You didn't upload your latest patch.

#15

c960657 - April 28, 2008 - 20:08

Sorry, here it is :-)

AttachmentSize
streamwrapper-D7-4.patch8.19 KB

#16

Dries - April 30, 2008 - 06:58

Thanks c960657. It looks good to me.

One more question -- could you write some SimpleTests for this (without having to pull in an Amazon S3 library)? It would help ensure this feature gets maintained properly going forward.

#17

c960657 - May 6, 2008 - 21:50

I have added a dummy stream wrapper that may be used for testing.

I added two test classess: FileTestCase and FileStreamWrapperTestCase. The latter inherits from the former and is basically just a way to make sure that all tests are run with and without the dummy wrapper.

I have added tests for some of the functions in file.inc. I'd like to hear some comments on whether this is the right direction before I write further tests. I am new to writing unit tests.

Also, I have made a few more changes to file.inc:

  • file_check_directory() now uses the $mode argument for mkdir() in addition to specifying in afterwords using chmod(), because with stream wrappers chmod() doesn't work. The chmod() call is still necessary, because the $mode parameter of mkdir() is modified by the current umask.
  • file_check_location() now verifies that dirname($source) exists (i.e. $source does not have a stream wrapper prefix and a ".." segment)
AttachmentSize
streamwrapper-D7-5.patch20.87 KB

#18

c960657 - May 7, 2008 - 17:50

A question about file_directory_temp:

At Administer → Site configuration → File system, the description for the "Temporary directory" text field says: A file system path where uploaded files will be stored during previews. However, AFAICT uploaded files end up in file_directory_path while being previewed. Is the description wrong?

Would it be reasonable to say that file_directory_temp shouldn't be used for saving files between requests? In that case file_directory_temp could be restricted to always point to the local disk, i.e. not use stream wrappers. The purpose is to supply a place on the local disk where native PHP functions that does not work with stream wrappers can write. E.g. imagegif() does not work with stream wrappers, so it would be convenient to write to a temporary file before copying the file to file_directory_path.

#19

moshe weitzman - May 7, 2008 - 18:02

I believe that we keep file uploads in temp dir until a node is saved (for example). thus, the file neds to persist as the user is previewing and improving his node.

#20

c960657 - May 7, 2008 - 18:26

I thought so too. But the upload module appears to save uploaded files directly in file_directory_path (using file_save_upload() in upload_node_form_submit()). The fact that the node hasn't been saved yet is represented using the FILE_STATUS_TEMPORARY flag in the "files" table. When the node is saved, the flag is changed to FILE_STATUS_PERMANENT using file_set_status().

The remaining places in core that use file_save_upload() either discard the temporary file after having read it (includes/locale.inc) or copy it to another location (modules/system/system.admin.inc and modules/user/user.module).

But perhaps I missed something when grep'ing around ... ?

#21

dopry - May 14, 2008 - 02:09

no file_directory_temp is no longer used for file uploads or storing previews. The description on the file system settings page is incorrect.

I conceptually like stream wrappers and the idea of using them. However, I don't see their utility with this implementation.

  • url generation is not addressed
  • access control is not addressed
  • error handling is not addressed.
  • hook_file can cover every use case mentioned except S3 storage.
  • FUSE or JungleDisk can do the same without having to hack realpath.
    for FUSE: http://code.google.com/p/s3fs/wiki/FuseOverAmazon

I mean it would be totally cool to be able to write fileapi as a stream wrapper, but really it's just another layer of abstraction in an already difficult and poorly understood API. It doesn't allow me to do anything I can't already do with more mature tools.

#22

dopry - May 14, 2008 - 02:20

As a systems administrator I could never recommend this as a solution to cluster file distribution when there are file systems like OCFS and NFS freely avaliable, as well as various SANS options that are orders of magnitude more mature.

#23

c960657 - May 14, 2008 - 07:31

I am not sure what you miss WRT url generation, access control and error handling?

The URLs generated by file_create_url() work fine as long as you use private files. Errors generated by custom stream wrappers are handled the same way in PHP as errors from the local filesystem.

The hook_file stuff and s3fs look interesting, though.

#24

dopry - May 15, 2008 - 05:41

@c960657:

re: file_create_url with private files only... so in the S3 use case it has to be read from the remote service and served by drupal... so the file has to move across 2 networks before getting to the client, and the process serving the request has to sit there waiting reducing the number of clients your webserver can support. It would be so much nicer if files could be served directly from the remote service... then access control integration becomes an issue.

re: error handling... right now a lot of file handling happens with the assumption that underlying errors are permanent... ie) if a file can't be stat'ed it doesn't exist... however with stream handlers to a remote service this could be a temporary error.... The api layer above the stream handlers will need to be tolerant of these temporary errors that do not happen with a real filesystem. I'm not sure if it's really an issue, but something that should be considered maybe a new class of soft error for file operations.

I really like the concept of stream wrappers, but I think the implementation has to go deeper to be worth while. If we're going to embrace streams we might as well fully embrace streams for the filesystem... not just one particular stream for the entire FS and we also need to address URL generation and access control, even more so than the error handling in my opinion.

It seems like a much cleaner and more native way of implementing the drivers for my mountpoints / fileapi concept. I already have the mount point management and path parsing layers written.. I haven't revisited it since late Drupal 4.7 or early D5 development so who knows what condition it is in...

I'll think more about and keep playing devil's advocate.... I know I already have my ideas and long term vision, alternatives are welcome... I'm usually pretty sleep deprived when I hop on here.

#25

c960657 - May 15, 2008 - 07:56

>so in the S3 use case it has to be read from the remote service and served by drupal...
>so the file has to move across 2 networks before getting to the client,
Yes, the first time, until the webserver has a copy in its local cache.

>It would be so much nicer if files could be served directly from the remote service...
One way of doing this is using custom_url_rewrite_outbound(). I don't have experience with that yet, though (the possibility of rewriting the outgoing hostname was added in Drupal 6, and our current setup was made before that).

re: error handling:
I don't know much about NFS, but I assume any networked storage may have temporary errors.

>It seems like a much cleaner and more native way of implementing the drivers for my mountpoints
>/ fileapi concept.
I tend to agree. However, using a custom-written stream wrapper, you have greater flexibility WRT to what happens when a file is written, moved etc., and you can write your own local rules in PHP rather than writing your own FUSE wrapper - a thing that most PHP programmers probably aren't familiar with.

>I'll think more about and keep playing devil's advocate....
I appreciate your comments.

We initially chose to use stream wrappers because it seemed like a non-intrusive and non-system-administration-intensive way of supporting multiple servers in a cluster. I am not sure it was the best solution, but at least it works, though not as fast as we would like during uploads (perhaps partly because we also map file_directory_temp to our S3 drive because of the current description as mentioned in #18). Also, the work connected with keeping contrib modules stream wrapper-aware turned out to be bigger than initially expected (though, if stream wrapper support is made official in Drupal core, we can push our module changes upstream).

#26

dopry - May 15, 2008 - 10:36

re: pulling from s3...
it's a cost and I understand caching locally, but it's a big waste of bandwidth, and as you noted the latency issues can be killer. thats why I halted prior S3 work the latency was proving to be a bigger challenge than I anticipated.

re: generating URL's.
if file_create_url were stream aware or had a way of hooking into URL handlers for different streams we could probably solve this in core. just a thought...

re: NFS... not nearly as frequently as S3 or as a result of latency, and it's all kind of fubar on a site when NFS goes awol in a cluster... like ugly bad :) like no document root bad...

re: fileapi/stream wrappers... so I ripped the mountpoint system from fileapi and put it in my sandbox
http://cvs.drupal.org/viewvc.py/drupal/contributions/sandbox/dopry/mount... (D5, should be trivial to port to D7) with a completely untested modification to use stream wrappers instead of the original driver system I was developing for file api... It would be completely dependent on the stream wrappers in core thing, but I see a good match in my head. The contents of mountpoint.module would probably move into file.inc and mountpoint_ui would probably be headed to system.module. With tokenized paths and a smarter file_create_url it could be a fun mix.

re: you'd probably be happier with OCFS or NFS, the administrative overhead is actually minimal if you're already running a cluster, but if it gets the job done... I do still have reservations about file manipulations
and the like... (imagemagick, GD, etc)... It seems like going this route we will be copy files to local tmp storage pretty often until the rest of php's coolness becomes stream aware... I should probably look into that since It will affect a lot of the development work I do.

/me mutters about getting all excited about things he didn't want to get involved in and puts his grumpy hat back in the closet for a bit.

#27

mikl - May 16, 2008 - 15:28

Well, since my esteemed colleague, c960657, is in Riga on vacation, I'd just like to say thanks for the input, dopry. It has certainly given us something to think about.

As c960657 mentioned, getting all this to work with contrib-modules is quite painful – I spend the greater part of wednesday this week trying to upgrade image(api|cache|field) to latest stable (since we have been running -dev releases for a while), and that wasn't all that fun.

In any case, we're not dead set on S3. In fact, http://drupal.org/project/cdn looks promising, but that also involves patching the core :(

#28

c960657 - May 25, 2008 - 20:14

Though NFS (or equivalent services) may be one solution for handling files in a cluster, I think the stream wrapper approach is still worth considering. In particular, I think it is convenient to be able to write your own logic in PHP that suits your particular needs.

NFS also has its issues, so I think it would be useful to have a choice between several different approaches.

I have looked at hook_file, and I fail to see how it covers the same use cases as a stream wrapper. I can see that you are able to hook into file load and save operations, but AFAICT you cannot e.g. use file_scan_directory() to enumerate files not present on the local machine.

Updated patch:

  • Only file_directory_path has stream wrapper support, i.e. we can still use tempnam(), imagegif() etc. on files in file_directory_temp.
  • The outdated description for the file system settings page is updated.
  • New utility function: file_is_wrapper() - useful for modules using native functions without stream wrapper support
  • image_gd_close() now supports stream wrappers
AttachmentSize
streamwrapper-D7-6.patch24.68 KB

#29

dopry - May 28, 2008 - 01:50

with hook_file but you could guarantee changes across the cluster on hook_file events... no you couldn't use file scan directory in the same way via hook_file, but any FUSE mounted file system could be used the same way.

#30

c960657 - July 7, 2008 - 16:59

Chasing HEAD.

AttachmentSize
streamwrapper-D7-7.patch24.69 KB

#31

c960657 - July 16, 2008 - 16:28

Chasing HEAD.

AttachmentSize
streamwrapper-D7-8.patch24.72 KB

#32

lilou - August 23, 2008 - 20:01
Status:patch (code needs review)» patch (code needs work)

Need to be re-rolled :

file.test must be moved to modules/simpletest/test/

#33

c960657 - September 18, 2008 - 22:11

Reroll. Still work in progress.

In order to test the patch, I added the following to my settings.php in order to be able to run all existing tests with file_directory_path pointing to a stream wrapper. :

require_once $_SERVER['DOCUMENT_ROOT'] . '/modules/simpletest/tests/file.dummystreamwrapper.inc';
stream_wrapper_register('dummy-stream-wrapper', 'FileDummyStreamWrapper');
$conf['file_directory_path'] = 'dummy-stream-wrapper://sites/example.org/files';

This is probably not the way to go, but at this stage it is probably a good way to ensure thorough testing. I should work on making a regular testcase that test various modules that use files.

There is a failed test here and there, but generally it doesn't look too bad.

AttachmentSize
streamwrapper-D7-9.patch29 KB

#34

drewish - September 19, 2008 - 07:14

subscribe.

i'm interested in reviewing this further but there's some obvious things that need cleanup. the phpdoc doesn't follow the coding standards and the naming of things seems odd, file_realpath should probably be drupal_realpath, etc.

#35

c960657 - September 19, 2008 - 22:08

The patch is more or less complete, but lacks tests. No ready for review yet.

Instead of the addition to settings.php mentioned in comment #33, I now hacked DrupalWebTestCase::setUp() to set file_directory_path to a path with a wrapper prefix, and file_downloads to FILE_DOWNLOADS_PRIVATE. At the top of boostrap.inc I added code to always register the dummy stream wrapper with PHP. These two changes are included in the patch but are not intended to be committed.

With these two non-standard settings, all existing tests passes except two: 5722 passes, 2 fails, 0 exceptions

Without the two non-standard settings, all existing tests passes.

All tests have been run with the patches from #299290: User tests exceptions and #310113: SimpleTest: assertFieldByXPath() doesn't work in some environments, and with the change from #298309: DatabaseStatement::execute with PDO::FETCH_CLASS calls __construct() after setting object attributes reverted.

I had to change some of the tests to make the compatiable with stream wrappers and/or private files. This may not be necessary, because the tests are expected to run from a default configuration, i.e. without stream wrappers and with public files. But on the other hand it may be convenient to be able to run all tests with these non-standard settings like I have been doing.

AttachmentSize
streamwrapper-D7-10.patch42.67 KB
 
 

Drupal is a registered trademark of Dries Buytaert.