I'm doing some work for a client who would like to use EC2 for video transcoding. At first glance media_mover seemed like a good fit for this: It already has S3 support and is clearly intended to do things like video transcoding.
My thought was that I'd write a media_mover sub-module that did something like this:
Harvest: grab files from nodes, upload them to S3, put messages into SQS (Amazon Simple Queue Service) to let the EC2 instance know which files need to be transcoded
Process: Kick off an EC2 instance to chug through the SQS messages and do the transcoding. As each file was transcoded the EC2 instance would put a success message on the SQS queue.
Storage: (Do nothing - we already put the files on S3 in the harvest step)
Complete: Read success messages from the SQS queue and mark the files as fully processed and available in the local Drupal db.
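As a rough sketch of the four stages listed above, and not media_mover's actual API: the Python below uses a dict in place of the S3 bucket, two deques in place of the SQS queues, and a dict in place of the Drupal db. Every name in it is hypothetical, and the "transcoding" is a placeholder that runs synchronously, whereas a real EC2 worker would run asynchronously.

```python
from collections import deque

# In-memory stand-ins (all assumptions made for illustration only).
s3_bucket = {}            # stands in for the S3 bucket
work_queue = deque()      # stands in for SQS "needs transcoding" messages
success_queue = deque()   # stands in for SQS "transcoded" messages
drupal_status = {}        # stands in for file status in the local Drupal db


def harvest(node_files):
    """Upload each file to S3 and queue a transcoding request for it."""
    for name, data in node_files.items():
        s3_bucket[name] = data
        work_queue.append(name)
        drupal_status[name] = "pending"
    return list(node_files)


def process(files):
    """Stand-in for the EC2 worker: drain the work queue, report successes."""
    while work_queue:
        name = work_queue.popleft()
        # Placeholder for real transcoding work.
        s3_bucket[name + ".transcoded"] = s3_bucket[name].upper()
        success_queue.append(name)
    return files


def storage(files):
    """Nothing to do: harvest already put the files on S3."""
    return files


def complete(files):
    """Consume success messages and mark those files available locally."""
    marked = []
    while success_queue:
        name = success_queue.popleft()
        drupal_status[name] = "available"
        marked.append(name)
    return marked
```

Calling `complete(storage(process(harvest({...}))))` walks one batch through all four stages. With a real asynchronous worker, the success messages may not be on the queue yet when `complete` runs.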
The problem that I've run into (the first of several, I'm sure) is that it's entirely possible Drupal's cron job could time out before all of the videos in a given batch are transcoded. If this happens the Complete stage won't mark the files as available. Because the success messages are in the queue the files will be marked the next time a Complete stage is run, but in order for that to happen we'll have to wait around until another user uploads something. At low-traffic times this could mean an hour or more passes before the file is marked available in the Drupal db.
One possible solution is to modify media_mover to simply run all of the stages on every cron iteration, regardless of whether or not any files were harvested. If no files were harvested we just call process, storage and complete with a $files array of 0 length. Obviously this means invoking more function calls, but presumably the majority of them would simply do nothing and return. (Except for my Complete stage, which would consume the EC2 queue success messages and mark stuff available locally.)
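To make the idea concrete, here is a minimal Python sketch of a cron run that calls every stage even when nothing was harvested. The stage functions and the queue are hypothetical stand-ins, not media_mover code; the point is only that there is no early return on an empty file list.

```python
from collections import deque

# Hypothetical state: success messages left over from an earlier batch.
success_queue = deque(["a.flv", "b.flv"])
available = set()  # stands in for "marked available" in the Drupal db


def harvest():
    """Nothing new was uploaded during this cron run."""
    return []


def process(files):
    """With no files there is nothing to transcode, so just return."""
    return files


def storage(files):
    """Files are already on S3, so this stage is a no-op."""
    return files


def complete(files):
    """Drain the success queue regardless of what was harvested this run."""
    while success_queue:
        available.add(success_queue.popleft())
    return files


def run_all_stages():
    """Run every stage unconditionally, passing along a possibly empty list."""
    files = harvest()
    for stage in (process, storage, complete):
        files = stage(files)
    return files
```

Here `run_all_stages()` harvests nothing, yet the Complete stage still picks up the two leftover success messages.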
Another possible solution is for me to write a Harvest stage that grabs:
* files uploaded to nodes +
* files on S3 but still awaiting transcoding +
* files that have been transcoded but not marked complete in the Drupal db
...and passes that whole bundle of things down through the process/storage/complete chain. I then have to make my process stage smart enough to not kick off an EC2 instance if the only "files" harvested are those that have already been transcoded.
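A minimal Python sketch of this second approach, with hypothetical names and state labels: the harvest step tags each item with where it is in the pipeline, and the process step only starts an EC2 instance if at least one item still needs transcoding.

```python
def harvest(new_uploads, awaiting_transcode, transcoded_unmarked):
    """Bundle all three kinds of files, tagging each with its state."""
    items = [(name, "new") for name in new_uploads]
    items += [(name, "awaiting") for name in awaiting_transcode]
    items += [(name, "transcoded") for name in transcoded_unmarked]
    return items


def needs_ec2(items):
    """True only if some harvested item has not been transcoded yet."""
    return any(state != "transcoded" for _, state in items)


def process(items, start_instance):
    """Start an EC2 instance (via the supplied callable) only when needed."""
    if needs_ec2(items):
        start_instance()
    return items
```

`start_instance` is passed in as a callable so that the decision logic can be exercised without touching EC2 at all.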
What do people think about modifying media_mover to call every stage regardless of whether or not harvest actually harvests any files? Would it break a bunch of existing code? Is there some massive performance penalty involved? Or would it be a simple change that would make it easier to use media_mover with asynchronous distributed services?
Comments
Comment #1
arthurf commented:
Lots of great ideas here. One quick thought: I'm working right now on a harvest function that harvests from media mover files, selecting from a specified configuration.
So your setup might be:
Configuration 1: Harvest files from drupal, no process, store files on s3
Configuration 2: Harvested files from config 1, process via EC2 and push the files back to s3, no need to store I think
I'm making the changes to MM over this weekend to make this possible, and will probably respond a bit more once I have some more time.
Thanks so much for your interest; this could be a killer module for this.
Comment #2
noah10 commented:
I saw some checkins related to the idea you described above (harvesting from other mm modules). Are they in a state where I should try to use this new feature or are you still in the middle of it?
Comment #3
arthurf commented:
I think that you should be good to go now. We may need to make some API changes to make things work the way that you need them to. I'm completely behind getting EC2 functionality, so I'd be glad to help you in whatever ways I can. Let me know if I can be of any assistance.
Comment #4
alippai commented:
What's the status of this issue? Will it be available in Sept?
Comment #5
noah10 commented:
Right now there's a bit of debate going on as to whether we want to do this piece in Drupal at all or external to it. I'm not sure when that will be answered, but I wouldn't count on this being available in September. (October is probable, though, assuming that the debate comes down in favor of doing it in Drupal.)
Comment #6
noah10 commented:
Well, the debate was resolved in favor of doing it outside of Drupal. We have a separate desktop client app and some Java-based stuff on the server to support that, so we went with the lifeguard package (http://code.google.com/p/lifeguard) to manage all of the EC2 stuff and tied it into Drupal by having a REST API that a custom media_mover module calls out to.