My company's site was experiencing extremely poor performance when uploading images (30+ seconds for a 90k file), which I tracked down to a problem in AmazonS3 (though it was exacerbated by a problem in another module). The current AmazonS3 makes an S3 API call for every call to file_exists(), is_dir(), and the like for files that aren't cached (i.e. those that don't exist, and those which do but haven't been cached yet).

This is a huge problem when a site has dozens of image styles, because of the way that Drupal handles them. The core writers probably assumed that the speed of a file_exists() call would never be significant, so such calls happen all over the place (e.g. moving an image checks for the existence of every image style derivative which might exist, in order to delete them). And when every single one of those calls hits S3, just to be told "nope, it doesn't exist", it takes ages to upload and move image files.

So, I re-wrote AmazonS3 from the ground up, making it rely far more heavily upon the file metadata cache. Now, the only time that the AmazonS3StreamWrapper calls out to S3 is to read the contents of a file, write a file, or delete a file (and a few other less common actions, like retrieving presigned URLs). Queries for file metadata (e.g. file_exists(), is_dir(), etc) exclusively hit the cache.
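To make the design concrete, here's a minimal sketch of the cache-first approach in Python (the module itself is PHP; the class and method names here are hypothetical, with a dict standing in for the metadata table):

```python
# Sketch of a cache-first stream wrapper's metadata layer (hypothetical
# names; in the real module this is PHP and the cache is a database table).

class CachedS3Metadata:
    def __init__(self):
        # uri -> {'dir': bool, 'filesize': int}; stands in for the cache table.
        self.cache = {}

    def file_exists(self, uri):
        # Never calls S3: a cache miss simply means "the file doesn't exist".
        return uri in self.cache

    def is_dir(self, uri):
        entry = self.cache.get(uri)
        return entry is not None and entry['dir']

    def record_write(self, uri, size):
        # Every write to the bucket is mirrored into the cache, so the two
        # stay in sync after the initial refresh.
        self.cache[uri] = {'dir': False, 'filesize': size}

    def record_delete(self, uri):
        self.cache.pop(uri, None)
```

A file_exists() call on an uncached URI returns immediately with no network round-trip, which is where the upload speedup comes from.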

Of course, this means that the cache has to be significantly more robust than it was originally designed to be. Every change made to the filesystem is now reflected in the cache, ensuring that once a refresh has been done, the cache will stay in sync with the S3 bucket (provided that the bucket isn't altered by a third party). In addition, I set up a mechanism to refresh the cache by querying S3 for every file in the bucket and filling out the cache with those files. You can trigger this refresh either from the config page (/admin/config/media/amazons3) or with the new drush command, drush amazons3-refresh-cache. You should only ever need to do this once, right after installing this version of AmazonS3. And if you already had an earlier version installed, the update hook will do it for you.
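The refresh mechanism just described can be sketched like this (Python for illustration; the module is PHP, and `listing` is a stand-in for the paginated S3 list-objects results):

```python
def refresh_cache(listing):
    """Rebuild the file-metadata cache from a full bucket listing.

    `listing` stands in for the paginated results of an S3 list-objects
    call, as an iterable of (key, size) pairs; the real module pages
    through the S3 API and writes rows to a database table instead.
    """
    return {'s3://' + key: {'dir': False, 'filesize': size}
            for key, size in listing}
```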

Another new feature is that folders are no longer stored in S3; they exist only in the cache. This is because S3 uses a flat filesystem: it doesn't actually have folders, just files with slashes in their names, so creating a file that pretends to be a folder isn't useful. So, because folders don't really exist until a file gets placed inside them, they should cease to exist when the last file in them is deleted. The unlink() function doesn't do this, so I set up a cron hook to refresh the cached folders every time cron runs. This prevents cache bloat if there are a lot of files being created and deleted in different deeply nested folders.
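Because the namespace is flat, the set of folders is fully determined by the object keys that currently exist. Here's a small Python sketch (again, the module itself is PHP) of deriving those implicit folders:

```python
import posixpath

def folders_implied_by(keys):
    """Collect every 'folder' implied by a flat list of S3 object keys.

    S3 has no real directories: 'images/styles/thumb.jpg' is just a key
    containing slashes, so a folder 'exists' exactly when some key has
    it as an ancestor prefix.
    """
    folders = set()
    for key in keys:
        parent = posixpath.dirname(key)
        while parent:
            folders.add(parent)
            parent = posixpath.dirname(parent)
    return folders
```

Rebuilding the folder set from the current keys is the idea behind the cron-hook refresh: folders whose last file was deleted simply stop being implied, so they drop out of the cache.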

With these changes in place, image-related actions are dramatically more performant, and all other actions are also noticeably faster. However, I'm not personally familiar with the more advanced features of S3 (like torrent URLs and presigned URLs). I tried to make sure that my changes didn't alter those features, but I'm not sure I succeeded because my site doesn't use them. Please let me know if you have any problems with those features, or anything else, really.

This is such a fundamental change that a patch file would be more trouble than it's worth. Instead, I've attached the complete module to this post, and also uploaded it to a sandbox project of mine: AmazonS3 2.0.

To the maintainers of AmazonS3: if you like the changes I've made, I'd be more than happy to join this project as a co-maintainer. I'm now very familiar with the code, and will be happy to continue providing support. Since my site uses this module, my boss lets me spend work hours on improving it, much like the work I do on Date iCal.

Comments

justafish’s picture

Hi,

Great! I don't have time to review this right now, but I just wanted to check that you're aware there's already a filesystem cache that does this in the latest dev release of this module? If so, how does it differ?

Thanks!

coredumperror’s picture

The way that my re-write differs from the existing AmazonS3 module is that it relies entirely on the filesystem cache for metadata purposes.

For example, let's say that Drupal core calls file_exists('blahblah.jpg'), and blahblah.jpg doesn't exist. The current version of AmazonS3StreamWrapper checks the cache, sees that the file isn't cached, and then calls out to S3 to ask if the file exists there, and returns S3's response. My re-written AmazonS3StreamWrapper checks the cache, sees that the file isn't cached, and returns false.

This, of course, means that the cache needs to be populated in a totally different way than the existing module does it, which is why I created the "Refresh file metadata cache" button (and drush command).

Performance is vastly improved because the metadata for every file is stored locally, so calls for metadata need only query the database, rather than querying S3. S3 is only called when an operation needs to read or write actual data inside a file, or needs to move or delete that file.

coredumperror’s picture

Attachment: 28.53 KB (status: new)

I made a huge mistake while designing the folder system, which I've fixed in the attached version. I had mistakenly assumed that it was important to ensure that the database doesn't get bloated with leftover empty folders after the files inside them get deleted. So I wrote a folder refresh function which cleaned out all the folders in the cache and rebuilt them based on which files exist. This meant that empty folders got deleted, which, in retrospect, was really dumb.

I've gotten rid of the folder refresh function, and re-written the file refresh function to leave folders intact. That function also no longer ignores the "folder" objects in S3 (files with a / on the end), since they may be the only representation of existing empty folders for users who are updating from the original version of AmazonS3.

coredumperror’s picture

Have you had a chance to take a look at this yet? My team's been using it, and has given it lots of praise for the performance increase. I'm sure other users of AmazonS3 would be interested.

torgospizza’s picture

Can you post a patch, rather than a zip? I'd like to compare them here. It seems that the "rebuild cache" mechanism might come in handy as a replacement for my function found in patches at #1277152: Integration with Filefield Sources.

kirikintha’s picture

Hi all - I am really interested in this patch as well; I am working on a high-traffic website that we want to leverage S3 with.

I don't want to step on any toes, @coredumperror; when you do get a patch up, I can surely test it in a practical environment. @torgosPizza, please let me know if I can be of any help!

coredumperror’s picture

Attachment: 79.57 KB (status: new)

Hey all, sorry for the late reply. torgos posted his patch request on the day that I left for a week-long hike in Yosemite.

Here's the patch for my re-write, which I made against the latest dev build of AmazonS3. I had originally posted a zip rather than a patch because every function is either different, removed, or new. This patch is nearly twice the size of the original codebase.

coredumperror’s picture

Attachments: 60.16 KB, 9.78 KB, 27.72 KB (status: new)

Hi again! I've got some more updates for my re-write of AmazonS3. There is one significant change: the amazons3_file table's collation has been changed to utf8_bin.

By default, MySQL doesn't allow string keys that differ only in capitalization (e.g. s3://panda.jpg cannot be a key alongside s3://PANDA.jpg). However, nearly every filesystem out there (including S3) allows two separate files with names like this, and AmazonS3's sync code was choking on that. I had originally worked around the problem (partially) with a slow, clunky hack, because I'd thought it was an OS X limitation rather than a MySQL issue. But when I finally installed my updated AmazonS3 module on a Linux machine, the problem persisted, so I realized I needed to find a different solution.

Unfortunately, MySQL doesn't directly support a case-sensitive utf8 collation. However, it can collate by the binary differences between characters, which is good enough for the amazons3_file table's needs. That's why the collation is now set to utf8_bin rather than the default, utf8_general_ci. I've implemented an update hook to make this change, and updated the hook_schema() implementation to match.
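To illustrate the collation problem without a MySQL server, here's a Python/SQLite sketch: SQLite's NOCASE collation behaves like utf8_general_ci (case-insensitive keys collide), while its default binary collation behaves like utf8_bin (both rows fit). The table and column names are only for illustration.

```python
import sqlite3

def count_case_variant_rows(collation_clause):
    """Try to store s3://panda.jpg and s3://PANDA.jpg as primary keys."""
    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE amazons3_file '
                 '(uri TEXT PRIMARY KEY %s)' % collation_clause)
    for uri in ('s3://panda.jpg', 's3://PANDA.jpg'):
        try:
            conn.execute('INSERT INTO amazons3_file VALUES (?)', (uri,))
        except sqlite3.IntegrityError:
            pass  # duplicate key under a case-insensitive collation
    return conn.execute('SELECT COUNT(*) FROM amazons3_file').fetchone()[0]

# A case-insensitive collation keeps only one of the two names; binary
# comparison keeps both, which is what the filesystem semantics require.
```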

In addition, I made several comment changes, and added t() around all the remaining user-visible strings.

Attached is the zip file with AmazonS3 v2.1, as well as a patch from v2.0 to v2.1, and one from v1.0-dev to v2.1.

dave reid’s picture

It would help if you made the patches in the 'unified' format using diff -u, which is the standard for providing patches for Drupal projects.

coredumperror’s picture

Oh, sorry about that. I usually just use git, but I couldn't readily do so in this situation, and I wasn't aware that diff could create git-like patches. I've attached new patches, made with diff -u.

aurimukas’s picture

Title: I've done a complete re-write of AmazonS3, with maximal performance in mind. » Suggestion to use https protocol option to download files.
Category: task » feature
Priority: Normal » Minor
Attachment: 2.41 KB (status: new)

My suggestion is to add the option to download files over the https protocol when CloudFront is configured to serve files via https.

coredumperror’s picture

Title: Suggestion to use https protocol option to download files. » I've done a complete re-write of AmazonS3, with maximal performance in mind.

Please post this suggestion as its own issue in the queue, rather than appending it to this issue.

Also please be aware that changing the title of an existing issue is strongly discouraged unless you're the module's maintainer or the issue's original poster.

aurimukas’s picture

Sorry for that...

BTW, has anyone else noticed a problem with image_cache generation on S3? Files are uploaded to S3 without any problems, but when I print out files with image_cache, I always get nothing; the URLs always point to a local address. Could the problem be that I'm using NGINX?

coredumperror’s picture

From reading the description of how ImageCache works, I believe your problem is an incompatibility between how AmazonS3 deals with image derivatives and how ImageCache does it.

I'll see if I can find a way to integrate with ImageCache to resolve this incompatibility, but I won't have a chance to look into this until next week at the very earliest.

aurimukas’s picture

That's great! Thanks!
P.S. I tried the $conf['image_allow_insecure_derivatives'] = TRUE; setting, but nothing changed. My NGINX vhost config also, I think, is valid; I searched a lot for a good conf, and finally made one from lots of suggestions.

aurimukas’s picture

Got the solution from this part of my NGINX vhost config:
location ~* /system/files/styles/ {
access_log off;
expires 30d;
try_files $uri @rewrite;
}

davidwbarratt’s picture

Hello!

Could you put this into a Sandbox so I can use it properly?

Also, what version of AWS SDK should it be used with?

Thanks!

coredumperror’s picture

I've decided to separate my re-written code into my own module, which I've called S3 File System. It's currently in sandbox mode until I get enough time to finish the cleanup, but it should be in a fully working state. It uses the same AWS SDK as AmazonS3, which is v1.x.

davidwbarratt’s picture

Awesome!

Could you add your code to the repository on the sandbox?

Thanks!

davidwbarratt’s picture

Also, you might see these issues:

#1675968: Image styles not generating for S3 files
#1555086: Serve files over SSL (https://) protocol if page is served via SSL.
#1985148: Allow for Custom Hostnames

which are all needed for AmazonS3 to be in an even slightly usable state.

If you can put your code into the Sandbox repository, I can re-roll the patches for the re-written module. :)

Thanks!

coredumperror’s picture

Hah, whoops. I thought my initial commit had the unmodified code, but it looks like it didn't.

I went ahead and finished the initial cleanup and rename for S3 File System, but I haven't had a chance to test it yet. If you run into any issues, please post them to the S3 File System issue queue. justafish doesn't need any more spam from me. :)

coredumperror’s picture

Issue summary: View changes

fixed Date iCal link

justafish’s picture

Project: AmazonS3 » S3 File System
coredumperror’s picture

Status: Needs review » Closed (fixed)
YairHadari’s picture

Hi,

I'm using presigned URLs, and I configured them like this: 60|yair.
I would like Drupal to issue a new expiration date each time someone refers to the S3 object.
Currently, the URL is issued an expiration date once, when the object is uploaded to S3, and after that first expiration date no one can get to the object.

Thanks,
Yair