My company's site was experiencing extremely poor performance when uploading images (30+ seconds for a 90k file), which I tracked down to a problem in AmazonS3 (though it was exacerbated by a problem in another module). The current AmazonS3 makes an S3 API call for every call to file_exists(), is_dir(), and the like for files that aren't cached (i.e. those that don't exist, and those which do but haven't been cached yet).
This is a huge problem when a site has dozens of image styles, because of the way that Drupal handles them. The core writers probably assumed that the speed of a file_exists() call would never be significant, so such calls happen all over the place (e.g. moving an image checks for the existence of every image style derivative which might exist, in order to delete them). And when every single one of those calls hits S3, just to be told "nope, it doesn't exist", it takes ages to upload and move image files.
So, I re-wrote AmazonS3 from the ground up, making it rely far more heavily upon the file metadata cache. Now, the only time that the AmazonS3StreamWrapper calls out to S3 is to read the contents of a file, write a file, or delete a file (and a few other less common actions, like retrieving presigned URLs). Queries for file metadata (e.g. file_exists(), is_dir(), etc) exclusively hit the cache.
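To make that concrete, here's a minimal sketch (not the module's actual code; the helper name is hypothetical) of the stream wrapper method behind those calls. PHP routes file_exists(), is_dir(), stat(), and the like through the wrapper's url_stat(), which under the new design consults only the local cache:

```php
// Sketch of AmazonS3StreamWrapper::url_stat() under the new design.
// _amazons3_get_cached_metadata() is a hypothetical helper that reads
// one row from the local metadata cache (sketched in comment #2 below).
public function url_stat($uri, $flags) {
  $metadata = _amazons3_get_cached_metadata($uri);
  if (!$metadata) {
    // A cache miss now means "no such file": no S3 round-trip at all.
    return FALSE;
  }
  // Build the stat array PHP expects from the cached metadata.
  $stat = array_fill(0, 13, 0);
  $stat['size'] = $stat[7] = $metadata['filesize'];
  $stat['mtime'] = $stat[9] = $metadata['timestamp'];
  // Directories report mode 0040777; regular files report 0100666.
  $stat['mode'] = $stat[2] = $metadata['dir'] ? 0040777 : 0100666;
  return $stat;
}
```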
Of course, this means that the cache has to be significantly more robust than it was originally designed to be. Every change made to the filesystem is now reflected in the cache, ensuring that once a refresh has been done, the cache will stay in sync with the S3 bucket (provided that the bucket isn't altered by a third party). In addition, I set up a mechanism to refresh the cache by querying S3 for every file in the bucket and filling out the cache with those files. You can trigger this refresh either from the config page (/admin/config/media/amazons3) or with the new drush command `drush amazons3-refresh-cache`. You should only ever need to do this once, though, right after installing this version of AmazonS3. The update hook will also do it for you, if you already had an earlier version installed.
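For the curious, that refresh is conceptually just a paging loop over the bucket listing. Here's a rough sketch against the AWS SDK for PHP 1.x that the module uses; `_amazons3_cache_set()` is a made-up name for the routine that writes one row into the metadata cache:

```php
function _amazons3_refresh_cache($bucket) {
  $s3 = new AmazonS3();
  $marker = '';
  do {
    // list_objects() returns at most 1000 keys per request, so we page
    // through the bucket, using the last key seen as the next marker.
    $response = $s3->list_objects($bucket, array('marker' => $marker));
    foreach ($response->body->Contents as $object) {
      $key = (string) $object->Key;
      _amazons3_cache_set('s3://' . $key, array(
        'filesize' => (int) $object->Size,
        'timestamp' => strtotime((string) $object->LastModified),
      ));
      $marker = $key;
    }
  } while ((string) $response->body->IsTruncated === 'true');
}
```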
Another new feature is that folders are no longer stored in S3; they exist only in the cache. This is because S3 uses a flat filesystem: it doesn't actually have folders, just files with slashes in their names, so creating a file that pretends to be a folder isn't useful. So, because folders don't really exist until a file gets placed inside them, they should cease to exist when the last file in them is deleted. The unlink() function doesn't do this, so I set up a cron hook to refresh the cached folders every time cron runs. This prevents cache bloat if there are a lot of files being created and deleted in different deeply nested folders.
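A sketch of what that cron hook amounts to, assuming the table and helper names used in the sketches above (they're illustrative, not necessarily the module's):

```php
function amazons3_cron() {
  // Re-derive the folder set from the files currently in the cache, so
  // that folders whose last file has been deleted drop out of the cache.
  // Assumes a 'dir' flag column distinguishing folders from files.
  $folders = array();
  $result = db_query("SELECT uri FROM {amazons3_file} WHERE dir = 0");
  foreach ($result as $record) {
    // Strip the scheme, then record every ancestor folder of this file.
    $parts = explode('/', substr($record->uri, strlen('s3://')));
    array_pop($parts); // Drop the filename itself.
    while ($parts) {
      $folders['s3://' . implode('/', $parts)] = TRUE;
      array_pop($parts);
    }
  }
  // _amazons3_set_cached_folders() is a hypothetical helper that replaces
  // the cached folder records with this freshly derived set.
  _amazons3_set_cached_folders(array_keys($folders));
}
```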
With these changes in place, image-related actions are dramatically more performant, and all other actions are also noticeably faster. However, I'm not personally familiar with the more advanced features of S3 (like torrent URLs and presigned URLs). I tried to make sure that my changes didn't alter those features, but I'm not sure I succeeded because my site doesn't use them. Please let me know if you have any problems with those features, or anything else, really.
This is such a fundamental change that a patch file would be more trouble than it's worth. Instead, I've attached the complete module to this post, and also uploaded it to a sandbox project of mine: AmazonS3 2.0.
To the maintainers of AmazonS3: if you like the changes I've made, I'd be more than happy to join this project as a co-maintainer. I'm now very familiar with the code, and will be happy to continue providing support. Since my site uses this module, my boss lets me spend work hours on improving it, much like the work I do on Date iCal.
| Comment | File | Size | Author |
|---|---|---|---|
| #11 | Added_option_to_serve_bucket_through_https_protocol_on_AmazonS3_v2.patch | 2.41 KB | aurimukas |
| #10 | AmazonS3_1.0dev-2.1-universal.patch | 71.8 KB | coredumperror |
| #10 | AmazonS3_2.0-2.1-universal.patch | 18.64 KB | coredumperror |
| #8 | amazons3 2.1.zip | 27.72 KB | coredumperror |
| #8 | AmazonS3_2.0-2.1.patch | 9.78 KB | coredumperror |
Comments
Comment #1
justafish

Hi,
Great! I don't have time to review this right now, but I just wanted to check that you're aware there's already a filesystem cache that does this in the latest dev release of this module? If so, how does it differ?
Thanks!
Comment #2
coredumperror commented

The way that my re-write differs from the existing AmazonS3 module is that it relies entirely on the filesystem cache for metadata purposes.
For example, let's say that Drupal core calls `file_exists('blahblah.jpg')`, and blahblah.jpg doesn't exist. The current version of AmazonS3StreamWrapper checks the cache, sees that the file isn't cached, and then calls out to S3 to ask if the file exists there, and returns S3's response. My re-written AmazonS3StreamWrapper checks the cache, sees that the file isn't cached, and returns FALSE.

This, of course, means that the cache needs to be populated in a totally different way than the existing module does it, which is why I created the "Refresh file metadata cache" button (and drush command).
Performance is vastly improved because the metadata for every file is stored locally, so calls for metadata need only query the database, rather than querying S3. S3 is only called when an operation needs to read or write actual data inside a file, or needs to move or delete that file.
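To illustrate the cost difference (a hypothetical sketch; this is the helper assumed in the url_stat() example in the original post), a metadata query is now a single indexed database read:

```php
// Returns the cached metadata row for a URI, or FALSE if the URI isn't
// cached, which the stream wrapper treats as "file does not exist".
function _amazons3_get_cached_metadata($uri) {
  return db_query('SELECT * FROM {amazons3_file} WHERE uri = :uri',
    array(':uri' => $uri))->fetchAssoc();
}
```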
Comment #3
coredumperror commented

I made a huge mistake while designing the folder system, which I've fixed in the attached version. I had mistakenly assumed that it was important to ensure that the database doesn't get bloated with leftover empty folders after the files inside them get deleted. So I wrote a folder refresh function which cleaned out all the folders in the cache and rebuilt them based on which files exist. This meant that empty folders got deleted, which, in retrospect, was really dumb.
I've gotten rid of the folder refresh function, and re-written the file refresh function to leave folders intact. That function also no longer ignores the "folder" objects in S3 (files with a / on the end), since they may be the only representation of existing empty folders for users who are updating from the original version of AmazonS3.
Comment #4
coredumperror commented

Have you had a chance to take a look at this yet? My team's been using it, and has given lots of praise for the performance increase. I'm sure other users of AmazonS3 would be interested.
Comment #5
torgospizza

Can you post a patch, rather than a zip? I'd like to compare them here. It seems that the "rebuild cache" mechanism might come in handy as a replacement for my function found in the patches at #1277152: Integration with Filefield Sources.
Comment #6
kirikintha commented

Hi all - I am really interested in this patch as well; I am working on a high-traffic website that we want to leverage S3 with.

I don't want to step on toes, @coredumperror - when you do get a patch up, I can surely test it in a practical environment. @torgosPizza, please let me know if I can be of any help!
Comment #7
coredumperror commented

Hey all, sorry for the late reply. torgos posted his patch request on the day that I left for a week-long hike in Yosemite.

Here's the patch for my re-write, which I made against the latest dev build of AmazonS3. I had originally posted a zip rather than a patch because every function is different (or removed, or entirely new); the patch is nearly twice the size of the original codebase.
Comment #8
coredumperror commented

Hi again! I've got some more updates for my re-write of AmazonS3. There is one significant change: the `amazons3_file` table's collation has been changed to `utf8_bin`.

By default, MySQL doesn't support string keys with the same words but different capitalization (e.g. `s3://panda.jpg` cannot be a key alongside `s3://PANDA.jpg`). However, nearly every filesystem out there (including S3) supports having two separate files with names like this, and AmazonS3's sync code was choking because of that. I had originally overcome this problem (partially) with a slow, clunky workaround, because I'd thought it was an OSX limitation rather than a MySQL issue. But when I finally installed my updated AmazonS3 module onto a Linux machine, the problem persisted, so I realized that I needed to find a different solution.

Unfortunately, MySQL doesn't directly support case-sensitive collation in UTF8. However, it is possible to make it collate by binary differences between characters, which is good enough for the `amazons3_file` table's needs. That's why the collation is now set to `utf8_bin` rather than the default, `utf8_general_ci`. I've implemented an update hook to fix this, and changed the `hook_schema()` implementation.

In addition, I made several comment changes, and added t() around all the remaining user-visible strings.
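The schema fix itself is a one-statement change; here's a sketch of the update hook (the actual hook number in the module will differ):

```php
/**
 * Convert the amazons3_file table to binary collation, so that URIs
 * differing only in case can coexist under a unique key.
 */
function amazons3_update_7101() {
  db_query('ALTER TABLE {amazons3_file} CONVERT TO CHARACTER SET utf8 COLLATE utf8_bin');
}
```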
Attached is the zip file with AmazonS3 v2.1, as well as a patch for v2.0 to v2.1, and one for v1.0-dev to v2.1.
Comment #9
dave reid

It would help if you made the patches in the 'universal' format using `diff -u`, which is the standard for providing patches for Drupal projects.
Comment #10
coredumperror commented

Oh, sorry about that. I usually just use git, but I couldn't readily do so in this situation, and I wasn't aware that diff could create git-like patches. I've attached new patches, made with `diff -u`.

Comment #11
aurimukas commented

My suggestion adds the possibility of getting files using the https protocol when CloudFront is configured to serve files over https.
Comment #12
coredumperror commented

Please post this suggestion to its own issue, rather than appending it to this one.

Also, please be aware that changing the title of an existing issue is strongly discouraged unless you're the module's maintainer or the issue's original poster.
Comment #13
aurimukas commented

Sorry for that...

BTW, has anyone else noticed a problem with image_cache generation on S3? Files are uploaded to S3 without any problems, but when I print out files with image_cache, I always get nothing; the URLs always point to a local address. Could the problem be that I'm using NGINX?
Comment #14
coredumperror commented

From reading the description of how ImageCache works, I believe your problem is an incompatibility between how AmazonS3 deals with image derivatives and how ImageCache does it.
I'll see if I can find a way to integrate with ImageCache to resolve this incompatibility, but I won't have a chance to look into this until next week at the very earliest.
Comment #15
aurimukas commented

That's great! Thanks!

P.S. I tried the `$conf['image_allow_insecure_derivatives'] = TRUE;` setting, but nothing changed. My NGINX vhost config is also, I think, valid; I searched a lot for a good config, and finally made one from lots of suggestions.
Comment #16
aurimukas commented

Got the solution in the NGINX vhost config:
```nginx
location ~* /system/files/styles/ {
  access_log off;
  expires 30d;
  try_files $uri @rewrite;
}
```
Comment #17
davidwbarratt commented

Hello!
Could you put this into a Sandbox so I can use it properly?
Also, what version of AWS SDK should it be used with?
Thanks!
Comment #18
coredumperror commented

I've decided to separate my re-written code into my own module, which I've called S3 File System. It's currently in sandbox mode until I get enough time to finish the cleanup, but it should be in a fully working state. It uses the same AWS SDK as AmazonS3, which is v1.x.
Comment #19
davidwbarratt commented

Awesome! Could you add your code to the repository on the sandbox?
Thanks!
Comment #20
davidwbarratt commented

Also, you might see these issues:
#1675968: Image styles not generating for S3 files
#1555086: Serve files over SSL (https://) protocol if page is served via SSL.
#1985148: Allow for Custom Hostnames
which are all needed for AmazonS3 to be in an even slightly usable state.
If you can put your code into the Sandbox repository, I can re-roll the patches for the re-written module. :)
Thanks!
Comment #21
coredumperror commented

Hah, whoops. I thought my initial commit had the unmodified code, but it looks like it didn't.
I went ahead and finished the initial cleanup and rename for S3 File System, but I haven't had a chance to test it yet. If you run into any issues, please post them to the S3 File System issue queue. justafish doesn't need any more spam from me. :)
Comment #22

justafish

Comment #23

coredumperror commented

Comment #24
YairHadari commented

Hi,

I'm using presigned URLs, configured like this: 60|yair.

I would like Drupal to issue a new expiration date each time someone refers to the S3 object.

Currently, the URL is issued an expiration date once, when the object is uploaded to S3, and after that first expiration passes, no one can get to the object.
Thanks,
Yair