Hello. I started using the module in the past few days and it does quite a bit - thanks for creating it. Caching is working well.

One issue I have is with feeds. Most of my image creation happens in batch as a Feeds import, so lots of nodes and photos are downloaded and created at the same time. I had pre-generation turned on with the default of 5 files sent to the async worker, using localhost to generate images (via ImageMagick's convert). During a Feeds import, I saw the number of "convert" processes spike to about 60-70 and the system load hit about 100. During the same import, I disabled the pre-generation; all of the convert processes went away and the system came back to life.

My thought is this: can we add a governor for this when used on localhost? I'm thinking of a sys_getloadavg() call; if the load is too high we either wait or bypass the image creation. I'm just trying to make use of the module without having my system die a horrible death by convert.
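
Rough sketch of the kind of check I mean (the variable name is just for illustration, not an existing setting):

list($load_1min) = sys_getloadavg();
$max_load = variable_get('imageinfo_cache_max_load', 8); // made-up setting name
if ($load_1min > $max_load) {
  // Too busy: skip (or delay) pre-generation and let imagecache build the
  // derivative on demand later.
  return FALSE;
}
// Otherwise proceed with the normal pre-generation / convert call.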

I don't currently have the option of setting up a dedicated system for async image creation.

Comments

mikeytown2’s picture

Sounds like I need a multi process queue/batching module. It's been bouncing around in my head for a couple of months now.
http://groups.drupal.org/node/126624

I need to get another RC of advagg out the door before I would start work on this.

rjbrown99’s picture

How about using the Queue API? Backport of the D7 queue. Just queue up everything that needs to generate and de-queue them as the machine becomes available. I'm using this extensively already with the Feeds module in D6 and it has been rock solid.
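
Rough sketch of what I mean, assuming the D6 drupal_queue backport mirrors the D7 DrupalQueue API (the queue name and payload shape are made up here):

// drupal_queue: D6 backport of the D7 queue API.
drupal_queue_include();
$queue = DrupalQueue::get('imagecache_pregenerate');

// Producer side: enqueue one item per derivative instead of generating inline.
$queue->createItem(array('fid' => $file->fid, 'preset' => 'product_thumb'));

// Consumer side (cron or a drush worker): only work while the box is idle.
list($load) = sys_getloadavg();
while ($load < 4 && ($item = $queue->claimItem(60))) {
  // Generate the derivative described by $item->data here, then acknowledge it.
  $queue->deleteItem($item);
  list($load) = sys_getloadavg();
}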

rjbrown99’s picture

Thinking more about this. I use a lot of Amazon Web Services, and they have a good deal on their tiny instances - up to 2 ECUs on a micro instance for $0.02 per hour. That's decent CPU horsepower for not a lot of money. It's also cheaper for CPUs than even their high CPU on demand instances. You could run 8 tiny instances (for 8-16 ECUs) for the cost of a single high CPU medium instance (5 ECUs).

I am working out an orchestration framework with puppet->nginx/pressflow on a tiny instance so I can fire them up quickly and easily with no effort. That pretty much works, so now I'm looking at this module and where to go from here.

There are two issues as I see them with the current state of this module as it applies to asynchronous image generation.

1) Too many images can be sent at once and the box falls over (per this issue). That could possibly be fixed with queueing.

2) Async images are currently only set to go to a single machine, and it may be beneficial to dequeue and process them across a few machines. I realize we are getting into somewhat complicated distributed computing issues here, but I'd like to keep it somewhat simple. Even something like accepting a textarea list of hosts, storing them in an array, and using round-robin to dequeue probably works.

What do you think about this? I'm thinking of a simple process of sending all requests to drupal_queue on the server end, then an async dequeue to one or more workers.

In the meantime, it may be easier for now to just sys_getloadavg() on the worker, and if the load is too high process it locally. Opportunistic image acceleration in that case with minimal code.

mikeytown2’s picture

Something interesting:

My async worker code just got a major overhaul in the advagg module today. It now uses streams instead of sockets when running on PHP 5. In testing I could asynchronously send out a thousand requests in a little over 0.1 seconds; the old way took around a second per request. This new way doesn't care about reading the response, but I could set it up to read it if I wanted to go that route. My Apache server tipped over and stopped processing requests past 250 concurrent connections. Anyway, what this means is that I can ping as many servers as I want at almost the same time. I believe this opens up some new doors.

For the curious, advagg_async_connect_http_request() & advagg_async_send_http_request() are modeled after 3 bits of code: drupal_http_request() from D7, guru-multiplexing, and one of my patches for memcache.
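
Very roughly, the stream trick looks something like this (a sketch of the idea, not the actual advagg code; the host and path are placeholders):

// Open a non-blocking socket; the connect happens in the background.
$stream = stream_socket_client('tcp://worker.example.com:80', $errno, $errstr, 1,
  STREAM_CLIENT_CONNECT | STREAM_CLIENT_ASYNC_CONNECT);
if ($stream) {
  stream_set_blocking($stream, 0);
  // Wait briefly until the socket is writable, then fire the request.
  $read = $except = array();
  $write = array($stream);
  if (stream_select($read, $write, $except, 0, 100000)) {
    fwrite($stream, "GET /generate?preset=foo HTTP/1.0\r\nHost: worker.example.com\r\n\r\n");
  }
  // "Don't care about the response" mode: just close the socket.
  fclose($stream);
}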

Anyway, with this, having some sort of process queue is ideal. Make both sides asynchronous: the request gets sent and the connection is closed. Once the work has been processed, send a request back saying it has been processed, removing that item from the queue. If both sides are on the same database backend, use the DB to "send" the message back. Another way of sending the message back is for the image file to show up, or something else; time will tell what's the best way to do this.

There is a lot to think about, as being able to send the request to start up 1000 threads in a tenth of a second, and continue processing php code opens up some doors...

rjbrown99’s picture

Yea I'm happy to help with this, my site is entirely based around photos.

Part of my goal is to be able to send the async generation off to different servers that are unrelated to the primary Drupal site. I am now to the point where I am getting close to sending it to another Drupal server that NFS mounts the files directory from the primary. Taking that one step further, it would be cool to ship them off to a lightweight non-Drupal server. Something like node.js could be used for the server side of the async server.

I'd also like to use async generation for imagecache actions, IE - if I go into Views Bulk Operations and regenerate. Right now it seems limited to when nodes are saved.

What do you think would be the next step?

mikeytown2’s picture

The latest dev of imagecache should have VBO actions (I helped to write the patch), but that only happens on your local box. It also has drush support (another patch I helped write). Being able to use both of these methods and send the generation off to another box sounds like a good idea. What we are talking about sounds a lot like a service; I might want to look into that.

rjbrown99’s picture

IMO there are really no good cloud based APIs or services that do this. The closest one I could find is http://www.picnik.com, which is owned by Google and has an API. But it's closer to their version of Flickr than it is to something like Twilio. What I really want is a fast, cheap API that can take basic ImageMagick-like options, do stuff to photos, and return them.

I was also looking in to the latest release of ImageMagick. There is a vague reference to "Heterogeneous Distributed Processing" in the architecture document here: http://www.imagemagick.org/script/architecture.php

It seems to use OpenCL which I am not terribly familiar with. Usually that's not a barrier but in this case I could not find a shred of documentation on how to set up a distributed ImageMagick setup. That would be ideal if you could just call convert normally and have the processing automagickally sent to distributed machines for processing (pun definitely intended.)

Thanks for the VBO heads up. I have been using that and it works well, the add on would be that hook to send the processing elsewhere.

rjbrown99’s picture

One other note, your VBO code is now in the release branch of the imagecache module. 6.x-2.0-beta12 came out two days ago with that code + the CDN patches and a bunch of other goodness.

mikeytown2’s picture

Reading about OpenCL, it appears that it doesn't support over-the-network parallel processing at the moment. I think what ImageMagick is referring to in the Heterogeneous Distributed Processing section has to do with processing cores on the same box.

Back to the point: I think we query all the worker boxes asking what their load is. Depending on the load, we then give them a set of images to process; we can pass the original image via POST, or as a URL reference. Using the secret key, we can then have the workers ping the server with a set of URLs for the processed files, which the server then downloads locally. This would get around the shared-filesystem issue.
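
The callback on the server side could look very roughly like this (the secret-key variable, POST format and destination handling are all hypothetical, just to show the shape):

function imageinfo_cache_worker_callback() {
  // Only accept pings that carry the shared secret (made-up setting name).
  if ($_POST['secret'] !== variable_get('imageinfo_cache_secret', '')) {
    drupal_access_denied();
    return;
  }
  foreach ((array) $_POST['urls'] as $preset_path => $url) {
    // Pull the processed derivative from the worker into the local files dir.
    $result = drupal_http_request($url);
    if ($result->code == 200) {
      $destination = file_directory_path() . '/' . $preset_path;
      file_put_contents($destination, $result->data);
    }
  }
}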

rjbrown99’s picture

When I see the word 'distributed' I always think multiple systems. It sounds like you are correct, distribute across CPUs+GPUs on one box. Probably not terribly useful in this case.

Very good idea about the URLs for processed files, I hadn't thought of that. The only assumption there is a public filesystem, and it seems like the majority of Drupal sites are public. So it could go something like this:

1) Drupal main site/server detects that an image needs to be created (via node save, batch action, programmatically, etc.) Perhaps there is a threshold where it just generates them itself in the event of low system load, or it drops them into a queue if it can't handle the work.

2) The main system polls worker servers and asks about load. Or perhaps the clients poll in periodically and say "I'm good". Not sure of the best approach here. Either way, an issue is response time - if the images are required immediately for a waiting client browser we may not have a lot of time to queue them, farm them out, get the results, and satisfy the request. Batched imports or creations would work well for the farmed-out images though. Thoughts on this one?

3) The least loaded system is sent a message via some type of transport/handoff. Perhaps an array that has the URL to the original image and preset information consisting of a set of instructions for ImageMagick. Or maybe we just auto-replicate existing imagecache presets from the master to the workers. (One security note, if the handoff is not encrypted we would want some kind of argument validation to make sure people don't inject stuff into the passed messages.) Either way, it's probably best to pass all of the required presets that need generation in a single message to a single worker. That will reduce bandwidth for downloading and writing the source image.

4) The system that received the direction grabs the original image, writes it out to disk, runs the presets, and returns something. Perhaps another array of URLs to the finished images.

5) The original caller system grabs the generated files and writes them out to disk, in whatever location they were supposed to be dropped.

6) Profit!

If we added a hook in step 2, it could also be used to spin up new cloud servers in the event of a high load and kill them when the load is done. That would work especially well in my case since my activity pattern consists of very low image creation volume and then a big surge in activity (via imported product feeds). My imports could then start slowly but pick up speed as new servers are spun up to handle the load.

Fabianx’s picture

Hey rjbrown99,

I have a kind of solution for this. I used it once to do a proof of concept of Drupal serving AJAX really fast on several servers.

You want a job queue server, and while Drupal's job queue is nice for this, I have the perfect candidate:

Gearman

Install the Gearman module and the Gearman PHP extension (http://drupal.org/project/gearman).

Then configure your EC2 worker instances to connect to your Gearman server (either from Drupal directly, as a kind of remote procedure call [watch out for memory leaks!], or as a simple bash worker running convert).

Then just queue the jobs up in Gearman.
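
A minimal sketch of the two sides using the pecl/gearman classes (server host, function name and payload are only examples):

// Drupal side: hand the job to Gearman and return immediately.
$client = new GearmanClient();
$client->addServer('gearman.example.com', 4730);
$client->doBackground('build_derivative', json_encode(array(
  'source' => 'http://example.com/sites/default/files/photo.jpg',
  'preset' => 'product_thumb',
)));

// Worker side: a standalone PHP CLI script on the EC2 instance.
function build_derivative($job) {
  $task = json_decode($job->workload(), TRUE);
  // Fetch the source image and shell out to convert here.
  return 'done';
}
$worker = new GearmanWorker();
$worker->addServer('gearman.example.com', 4730);
$worker->addFunction('build_derivative', 'build_derivative');
while ($worker->work());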

For me that was the easiest solution to set up, and it should work well and be stable.

Gearman also knows which workers are available, which are busy, etc.

More information here:

http://gearman.org/ - Developed originally for LiveJournal ...

Best Wishes,

Fabian

rjbrown99’s picture

Thanks for the note. What do you see as an advantage to that over Drupal Queue? My thought with Drupal Queue, at least for me, is twofold:

1) It's built in to Drupal 7, so work is forward-compatible by default.
2) I'm already using it with Feeds.

Gearman also seems to require additional third-party client->server software.

Not that any of this is bad, it just adds some additional moving pieces that mikeytown2 may or may not want to have in the module. I'd definitely like to hear why you like it over a generic queue.

Fabianx’s picture

Of course, if you can do all of this with standard Drupal tools, that is fine and definitely a long-term solution.

My idea was more specifically suited to your special case. Gearman is really easy to set up and does a fantastic job as a job server.

You just install it, set up a little worker, throw the to-be-generated files at it, and it'll tell you when it's done.

Yes, it is a third party server, but that is both the advantage and the disadvantage here at the same time.

For a more generic solution, having a shared DB queue and some machines waiting for work (triggered by cron), receiving it and processing it, is absolutely the right approach to integrate into the module, a little like the boost crawler does it.

But for short term solutions I prefer using the tools best suited for the job to have a quicker implementation.

Best Wishes,

Fabian

rjbrown99’s picture

Thanks Fabianx, I totally appreciate the advice and I might give it a whirl just to test out Gearman. It may work short term, and writing some code to do queue+cron+workers or whatever may be the option that I spend some time working on. Mikeytown2 seems interested and is prolific at writing code quickly so I'm more waiting to see what he has to say before I do anything. It's a problem for me, but more of one that is approaching vs one that is killing me now. I like to stay ahead of my performance curve if possible.

mikeytown2’s picture

Looking at this thread http://groups.drupal.org/node/151244#comment-505374

I think I should provide both the URL and the IP-based URL to the file, so if the worker doesn't have the file yet it can download it and process it.

rjbrown99’s picture

These two modules have some of the code written to do some of this:
http://drupal.org/project/imageeditor
http://drupal.org/project/picnik

I am talking about this from the standpoint of being able to send an image to a third-party server/service, have it do stuff, and send it back via a callback. They are doing it for editing and not caching, but there may be some ability to leverage the same process for this.

For example, the Picnik module sends the URL to the original image to the Picnik service, that service does stuff to the image and sends back a URL to the modified image to a callback, which in turn fetches it and saves it locally. Similar workflow to what we described.

mikeytown2’s picture

(file attached, 14.39 KB)

Got some prototype code working in terms of reading back 10+ requests at once. For me, asynchronous is almost 4x faster than synchronous. Not exactly sure what I'll be using this for, but it seems like a pretty good thing for sending lots of requests off in a "TCP" way (we do care about the response), whereas advagg's code does it in a "UDP" way (we do not care about the response). Not sure if you saw the UDP twitter jokes this weekend.

The great thing about TCP jokes is that you always get them.
The problem with a UDP joke is that you have no idea if people got it.

Anyway, this could probably use a rate limiter, better redirection code (it needs to be in the processing loop), a smarter usleep cycle, better timeout control, the ability to set requests to "UDP" or "TCP" mode, a fallback if stream_select() is disabled, and more code changes here and there.
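
The read side is basically a stream_select() loop over all the open connections; simplified, and assuming $streams is an array of connected, non-blocking stream resources:

$responses = array_fill_keys(array_keys($streams), '');
while (!empty($streams)) {
  $read = $streams;
  $write = $except = array();
  if (stream_select($read, $write, $except, 0, 200000) === FALSE) {
    break;
  }
  foreach ($read as $key => $stream) {
    $chunk = fread($stream, 8192);
    if ($chunk === '' || $chunk === FALSE) {
      // Remote side finished; close and stop watching this stream.
      fclose($stream);
      unset($streams[$key]);
      continue;
    }
    $responses[$key] .= $chunk;
  }
}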

rjbrown99’s picture

Not to be outdone...

I used to tell TCP jokes but could never get the parts in the right order and eventually gave up on the joke.
I tried to tell an IPV6 joke, but there was no one willing to listen.
There's like 20 other jokes behind my NAT joke.
The great thing about fragmentation jokes is

My wife just called us nerds. Yea! Proud of it!

But on a serious note, very cool about some test code. I'm currently working through some DB performance stuff and will get on to the imagecache soon after. For now I may check out dbtuner or just EXPLAIN my way to optimized queries, but within the next week or so I'll check this out. Thanks for rolling it up!

mikeytown2’s picture

rjbrown99’s picture

Thanks, already caught that on your twitter feed, and I'm already in the process of re-rolling the ImageMagick SRPM :) I'll let you know if it helps.

mikeytown2’s picture

Did it help?

rjbrown99’s picture

Haven't tried it yet although I do have the package. I may not get to it until next week as I need to test it out first to make sure it doesn't break anything else.

mikeytown2’s picture

I will say for us our load went from peaks of 800 down to peaks of 2-3. It's that big of an improvement.

mikeytown2’s picture

(file attached, 21.09 KB)

rjbrown99’s picture

Wow, that's significant. I might even be able to cut the machine size of a few instances if it's that big of a change. We see the same kind of load spikes and I know for a fact it's tied to imagemagick as we get the NewRelic graphs pointing directly at it. OK now I'm going to push this up and try to test sooner :)

mikeytown2’s picture

ping.

rjbrown99’s picture

The recompile definitely helps me, and may be worth a note somewhere on the imagecache project as a whole. I still want to do distributed image generation, it's just not quite as big of a deal now since the machine does not keel over when someone hits a page that requires images to be created. Very good find.

mikeytown2’s picture

(file attached, 17.32 KB)

Been doing work on the prototype code. It will now natively handle redirects (301), and requests can be URL pings or can wait for the response code. Having a solid async HTTP request library is key to making this fly.

The reason this code is needed is that, right now, the code still blocks when sending data to the worker. I would like to make this configurable so the HTTP request can be blocking or non-blocking. And while we are here, the code might as well handle 301s correctly.

mikeytown2’s picture

This module will soon rely on the HTTP Parallel Request Library. I've got most of the kinks worked out of it from my point of view, so I'm going to have this module offer blocking or non-blocking mode as options.

rjbrown99’s picture

This is starting to become an issue for me again. I suppose that's what happens when you have growth and more images and users :)

The new library looks good. I'm wondering if you think it's best to blast imagecache actions out via the library, or queue them (drupal_queue / queue API) first and then dequeue based on some concurrent process setting.

superfedya’s picture

Version: 6.x-1.x-dev » 6.x-2.0

Adding or deleting images in an ImageField is MUCH slower (about 10 times or even more) with this module :(

mikeytown2’s picture

I had the exact opposite issue; 1.x is a lot slower than 2.x for me.
What settings do you have at admin/settings/httprl?
Are non blocking requests working? see admin/reports/status
What does admin/settings/imageinfo-cache say under "Server/IP Address to send all image generation requests to:" & "HTTP Mode:"?

rjbrown99’s picture

So Mikeytown2 - what's your current thought about this? I need to start either distributing my image creation (which is going to cost me some more money for more instances) or queueing. My short-term thought for quick-and-dirty is to simply replace the 'convert' binary with a php script that:

1) Checks machine load (cat /proc/loadavg), if it's low enough just call the real convert binary with our passed args and create the file
2) If machine load is too high, then:
-- Dump a dummy 'image coming soon' image into place (at the original filepath so Drupal is happy.) This won't hold up page generation for the users or show broken images.
-- Queue the image in a FIFO drupal_queue and/or Amazon SQS queue. #11 also talks about Gearman which may be an interesting option.
-- Dequeue based on load and re-create the preset.

It's not a distributed answer, but it would fix my load spikes on batch image uploads. The CDN is an issue on re-creation, as the 'coming soon' image may be cached upstream. I look forward to your thoughts on this. In all cases I think a queue is needed somewhere, or even the distributed creation process is going to tip over the remote nodes.
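
A rough sketch of the wrapper script I have in mind (the placeholder image, spool file and convert.real paths are all made up):

#!/usr/bin/env php
<?php
// Installed in place of 'convert'; the real binary is renamed to convert.real.
$args = array_slice($_SERVER['argv'], 1);
$load = (float) strtok(file_get_contents('/proc/loadavg'), ' ');
$threshold = 4.0; // tune per machine

if ($load < $threshold) {
  // Low load: run the real convert with the original arguments.
  passthru('/usr/bin/convert.real ' . implode(' ', array_map('escapeshellarg', $args)), $rc);
  exit($rc);
}

// High load: drop a placeholder at the destination (last argument) and log
// the original arguments so a queue runner can redo the conversion later.
copy('/var/www/misc/image-coming-soon.jpg', end($args));
file_put_contents('/var/spool/imagecache-pending.log', implode(' ', $args) . "\n", FILE_APPEND);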

rjbrown99’s picture

As a follow-up, at the moment I gave up on the replace-the-binary idea and am playing around with the following:

1) Forked off imageapi_imagemagick.module (from ImageAPI) into a new module.
2) Added a form variable to capture a user-selectable threshold upon which to queue images. The threshold is the load average above which image generation would be queued. IE, if uptime shows a load of 2.42 and the threshold is 1, the convert action does not happen and instead it queues.
3) In my new _imageapi_imagemagick_convert() function, check the load before sending to _imageapi_imagemagick_convert_exec(). If the load is above the threshold, queue it with the same arguments it would have otherwise needed. In this case, I send it to a Drupal Queue API queue (via drupal_queue module in D6.)
4) De-queue currently on the cron call for drupal_queue. This could also happen via some type of persistent php script on the back-end or even via drush.

This all seems to work, but has a few issues that I have not solved yet.

1) I'd like a temp file to be placed at the location the user requested, so they don't get broken page loads. If this temp file has the same name (which it would have to have in this case), the upstream CDN is going to cache that instead of the proper image. Not sure yet what to do about this.
2) I'd like to have a bit of Javascript in the user's browser with an ajax callback. When the temp image is on a page, the JS would poll to see when the original image is ready and dynamically replace the temp image with the correct one.

For most of my CDN stuff, I use Filefield Paths with a custom token based on a hash of the file's MD5 checksum and the current timestamp. IE, when you save a file it becomes filename_fba994ebe58d1f8bf34a00953e475c89.jpg. This way whenever a user uploads a new file or replaces the filefield, it has a unique and new name so it's re-cached at the CDN. This approach may not work well here since we are sufficiently downstream from the actual filefield and can't really regenerate the master filename (or else it's also going to regen all of the other presets too.)

rjbrown99’s picture

Ok I have come full circle, and I'm just going to start by implementing a multi-machine imageinfo_cache setup. Here's what I am going to do...

Machine A: Primary Drupal site - web+db
Machine B: imageinfo_cache machine

In this case, machine B is basically a webhead. It has all of the same Drupal code as the primary site, but talks to the Machine A database. PHP files are rsync'd locally so they can be accessed directly by this server, and the sites/default/files directory is NFS mounted from Machine A.

There are no docs for this module so I'm assuming this is the correct approach. I'll continue to feel my way through it (and may write some docs as well - as you can see I am not short on words.)

rjbrown99’s picture

Here are my learnings so far. My setup is just as described in #35, primary machine and new webhead, with local php files rsync'd from the primary box and an NFS share of sites/default/files.

- Offloading any kind of image generation at all seems to require "Use imagecache pre-generation" to be enabled. Without that none of the convert processes are offloaded.

- Dynamic preset configuration per content type works, thank you for implementing that. IE, I can select one CCK field type and a few specific presets and they are the only ones generated.

- I use blocking mode, assuming this is a better approach to not overloading the image generation machine. I wouldn't want to blast it with 50 requests at once. My assumption is that blocking mode will slow it down, do N number at a time, finish, and then send more. Slower image creation, but no overload of the client. Am I correct here?

- I set the max number of files to send to each worker as 3. I only have one worker for now. What happens if I have more than 3 files to generate? Will it run them locally via a convert operation or wait?

- The offloading of the image creation seems to ONLY happen when new images are uploaded or changed. IE, the trigger for offload is new/replacement imagefield updates. It does not offload creation in the event that an existing imagefield is already populated but the imagecache preset has not been generated. For example, if I flush a single ImageCache preset it's not going to offload the convert process when a user happens to hit a page that shows images requiring that preset. Am I correct in this conclusion?

I'd appreciate a bit of feedback since I'm back around to just this module, it will help me prioritize any new changes that I may need. Thanks mikeytown2.

rjbrown99’s picture

Answering myself (partially)...

1) Yes, any offloading requires "Use imagecache pre-generation."

2) Still would like input on blocking mode and worker count.

3) Yes, image creation offload only happens on initial upload. It does not happen via dynamic request for missing imagecache presets.

OK, so that being said, my approach for implementing dynamic image offload from imagecache is as follows. I figure we can just hijack the callback path into our own function. Right now it looks identical to the imagecache function in this example.

/*
 * Implementation of hook_menu_alter()
 *   - Takeover the imagecache callback that dynamically generates an image upon request
 */
function imageinfo_cache_menu_alter(&$items) {
  $items[file_directory_path() .'/imagecache']['page callback'] = 'imageinfo_cache_imagecache_cache';
}

/*
 * Hijacked callback for imagecache_cache() function, from imagecache.module. This is the callback for
 * handling public files imagecache requests. This adds imageinfo_cache support and queues requests.
 */
function imageinfo_cache_imagecache_cache() {
  $args = func_get_args();
  $preset = check_plain(array_shift($args));
  $path = implode('/', $args);
  _imagecache_cache($preset, $path);
}

If we hijack at that point, I can add in queue support to the function prior to calling _imagecache_cache(). I am envisioning queue support using TWO drupal_queue (Queue API) queues:

Queue #1: VIP queue. This is a special queue, used for all dynamic requests for imagecache files that do not exist, such as those that are created via the imagecache_cache() function. Anything in here always has priority and is dequeued first.

Queue #2: Normal queue. Everything else goes here, including the generation requests currently served in this module via imagecache_generate_image(). These are all 'batch jobs' (in the non-Drupal meaning) to generate files from when a user uploaded the file.

I also thought of what might happen in the case when an item gets queued in the normal queue, then also gets put into the VIP queue when a user tries to hit that page. In this case, the image would be generated from the VIP queue. When it eventually comes back around to the Normal queue, the imagecache_generate_image() function will see that the file already exists and will just return TRUE. That would work out nicely.

I would also like to implement a new hook_menu path for a new callback. This callback would simply output sys_getloadavg() load average. This would be called from a queue runner on the image generation node prior to sending over httprl requests. The idea here is to keep the load low so any priority queue requests could be filled quickly. The user would set a threshold above which the images just stay in the queue.
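
Rough sketch of that callback (the path and function name are placeholders, hung off the module's hook_menu()):

// In hook_menu():
$items['imageinfo-cache/load'] = array(
  'page callback' => '_imageinfo_cache_report_load',
  'access callback' => TRUE,
  'type' => MENU_CALLBACK,
);

// Page callback: report the 1/5/15 minute load averages as JSON.
function _imageinfo_cache_report_load() {
  drupal_json(sys_getloadavg());
  exit;
}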

All that being said, I still am not sure about the best way to actually implement imageinfo_cache generation for realtime requests. I think we'd also have to hijack _imagecache_cache() which is a larger function and has more going on. We could then appropriately redirect things.

Hope that all makes sense. I'm at the point where I am starting to actually write code, so this is a good time for a you-are-spot-on or you-are-in-the-weeds comment.

mikeytown2’s picture

Been busy with the PNWDS this weekend; you can check out my slides here: http://pnwdrupalsummit.org/sessions/front-end-performance

Anyway, to answer the question about blocking mode: that will slow down the client side, because it waits to verify that the server got the request (TCP). Non-blocking mode sends it and does not verify (UDP).

If you wish to have that work on dynamic requests you would need to proxy all */imagecache/* traffic to your image server IF the file doesn't exist.

The load average callback seems like it would belong in the httprl module.

Let me know if I missed anything.

rjbrown99’s picture

Thanks for the reply. Do you have thoughts about the queue implementation? I'm actively going to be working on that as it's the only sane way I can see for scaling my image generation.

mikeytown2’s picture

Heard this might be something to look into: http://drupal.org/project/redis_queue, and if you're already using Redis you might as well put watchdog on it: http://drupal.org/project/redis_watchdog

mikeytown2’s picture

rjbrown99’s picture

Thanks Mikeytown2. My current thought is to stick with the Drupal Queue API and not venture off into Redis, Gearman, or other more specialized options. They may be more flexible, but they also introduce dependencies on external software. I think implementing the queue at the right points in imageinfo_cache would enable effective offload of image creation without tanking the offloaded machine. Do you see a downside to that approach?

gielfeldt’s picture

Ultimate Cron + Ultimate Cron Queue Scaler can process a single queue through multiple threads spread across different hosts. It's transparent to the application implementing the queue.

http://drupal.org/project/ultimate_cron
http://drupal.org/sandbox/gielfeldt/1313840

rjbrown99’s picture

Thanks for all of your suggestions.

I'll be done with the first take at queue support this weekend. I'm using the Drupal Queue API for queue storage, beanstalkd to offload the queue from the database, and a waiting-queue style approach for persistent connections to the queue and immediate dispatch of image creation.

This is turning out to be almost the complete reverse of the current imageinfo_cache design: instead of a centralized process that uses httprl to quickly dispatch work to webheads via HTTP, this uses a centralized queue with webheads that maintain a persistent connection to the queue to do work. So instead of a server->imagegen push, it's imagegen-client->server, where the client sits on the queue waiting for work. This has enabled me to implement load thresholds on the imagegen client machines, where they will not claim an item for processing unless their system load is below a certain value. This effectively allows for horizontal scaling, where you could, for example, spin up one or more Amazon tiny instances (with 2 ECUs per instance) at spot pricing for under $10 a month each to generate images.
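
To make that concrete, the worker loop (run under drush) looks roughly like this; the threshold variable is a placeholder and the actual generation call is omitted, so treat it as a sketch rather than the patch itself:

drupal_queue_include();
$vip       = DrupalQueue::get('imageinfo_cache_queue_vip');
$normal    = DrupalQueue::get('imageinfo_cache_queue_normal');
$threshold = variable_get('imageinfo_cache_worker_max_load', 2.0);

while (TRUE) {
  list($load) = sys_getloadavg();
  if ($load >= $threshold) {
    sleep(5); // Too busy: leave the items in the queue for another worker.
    continue;
  }
  // Always drain the VIP queue before touching the normal one.
  if ($item = $vip->claimItem(180)) {
    $queue = $vip;
  }
  elseif ($item = $normal->claimItem(180)) {
    $queue = $normal;
  }
  else {
    sleep(1);
    continue;
  }
  // Generate the derivative(s) described in $item->data, then acknowledge.
  $queue->deleteItem($item);
}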

It's mostly working now, hopefully I'll have it done soon. Then we need to figure out if other people like this, and if so should it be part of imageinfo_cache with the option to use either httprl or queue as an approach, be a 'sub-module' since it introduces a dependency on at least drupal_queue, or be a different project completely.

mikeytown2’s picture

Sub Module sounds like the right way to do this. Let me know if you want commit access to this project.

rjbrown99’s picture

Thanks, let me get through making it work first :) Then you and everyone else here can have a look, let me know if it's workable or has problems with design or implementation, and then I'll roll it up into a submodule or whatever other format works. Right now it's just inline within the main imageinfo_cache module but the functionality is relatively independent so not difficult to break out.

crea’s picture

Limiting requests can be done in the web server setup. I don't know about Apache, but with Nginx + PHP-FPM you can create a separate limit for imagecache URLs. With that setup, additional requests will simply be queued by Nginx itself, waiting for a PHP-FPM "slot".

rjbrown99’s picture

Yes, but I want to limit requests only by machine load, not the number of requests processed at a time. I'd like the machine to do as many as it can within the bounds of the system, and as fast as it can. The goal is to horizontally scale, where I can spin up N image generation hosts and maintain a consistent load across those systems. I also may want to spin up hosts of differing sizes, some smaller/lower CPU and some with larger CPU.

It's 90% done at this point and is working, soon to be complete when I finish working around #1319972: require_once fails when using register_shutdown_function.

rjbrown99’s picture

(patch attached, 32.02 KB)

Ok, here is my madness in the form of a patch. I did not perform extensive testing, but so far this is working for me. I added a bunch of documentation on what I was trying to accomplish with the various code changes. Patched against 6.x-2.0.

Per my new README, I am using this with drupal_queue, the core patch, and beanstalkd where the two queues in question were moved to beanstalkd. Here's my settings.php (which I should add to the README.) Note that I did not hijack all of the queues to go to beanstalk, only the two I used for this module.

$conf['beanstalk_default_queue'] = array(
  'host' => 'localhost', // Name of the host where beanstalkd is installed.
  'port' => '11300', // Port which beanstalkd is listening to.
  'fork' => FALSE, // Used in runqueue.sh to know if it should run the job in another process.
  'reserve_timeout' => 0, // How long you should wait when reserving a job.
  'ttr' => 60, // Seconds a job can be reserved for
  'release_delay' => 0, // Seconds to delay a job
  'priority' => 1024, // Sets the priority of the job
);
$conf['queue_class_imageinfo_cache_queue_vip'] = 'BeanstalkdQueue';
$conf['beanstalk_queue_imageinfo_cache_queue_vip'] = array(
  'host' => 'localhost',
  'port' => '11300',
  'fork' => FALSE,
  'reserve_timeout' => 0,
  'ttr' => 180,
  'release_delay' => 0,
  'priority' => 512,
);
$conf['queue_class_imageinfo_cache_queue_normal'] = 'BeanstalkdQueue';
$conf['beanstalk_queue_imageinfo_cache_queue_normal'] = array(
  'host' => 'localhost',
  'port' => '11300',
  'fork' => FALSE,
  'reserve_timeout' => 0,
  'ttr' => 180,
  'release_delay' => 0,
  'priority' => 1024,
);

One other note... right now you need to run 'drush imageinfo-cache-queue' to dequeue any of the jobs. This would need to be run under a supervisor daemon or something, which again I'll roll up after more testing.

mikeytown2’s picture

That's a big patch! I'm assuming that the re-working of the imagecache URL is optional... if not, it really should be.

Semi-off topic:
One thing that happened at the PNWDS is that I was given commit permission to imagecache & imageapi, so not all is lost in terms of getting code committed to those two modules; something to keep in mind. Short list of things I would like to get fixed:
#1250348: Warning: Division by zero in imageapi_image_scale_and_crop()
#1201914: Add in the ability to "nice" the convert process
#1189884: Allow for custom code to be inserted into the convert command before it is ran
#1243258: use lock.inc instead of a file lock
#587086: Drush command to flush presets and build cache
#813804: Rules integration

crea’s picture

"Yes, but I want to limit requests only by machine load, not the number of requests processed at a time. I'd like the machine to do as many as it can within the bounds of the system, and as fast as it can."

Limiting the number of parallel imagecache processes doesn't limit the performance of the machine if you have at least N processes for N CPU cores. Adding more processes above that limit will most likely only increase I/O load, not total generation throughput. Thus, you CAN push the machine to its limits even with a limited number of imagecache processes. Also, you still need to cap the number of processes, because processing large images requires a lot of RAM and you don't want to hit OOM.

"The goal is to horizontally scale, where I can spin up N image generation hosts and maintain a consistent load across those systems. I also may want to spin up hosts of differing sizes, some smaller/lower CPU and some with larger CPU."

It's true that for scaling between the machines some kind of queue or load-balancer is needed.

rjbrown99’s picture

With the queueing patch you don't even need to run an HTTP server on the image generation nodes :) Just the Drupal PHP files, a settings.php pointed at the running database and beanstalkd, an NFS mount of the files dir, and the running drush process. FWIW, my image processing hosts are Amazon micro instances, with limited CPU (up to 2 ECUs) and memory (613 MB RAM). With queueing they are doing quite well so far. This is day 1 of running with the patch...

rjbrown99’s picture

One improvement is needed in the queueing patch - it would be ideal for the worker to also use a register shutdown function. Right now it doesn't, which means that it's basically single-threaded. I will fix this, and there's another small error in one of the functions (continue vs. return) that I will fix. I'll post up a new patch after more testing.

rjbrown99’s picture

(patch attached, 35.63 KB)

Here is a new version of the patch. Changes:

1) Fixed a bug, continue should have been return. Caused breakage/errors in PHP.

2) Attempted to make my queue callback/worker function asynchronous in the same way the original function did, via a register shutdown function (in imageinfo_cache_generate_image).

3) Enhanced the README a bit, still more work to do on it.

I have this running on my prod site at the moment. I'll report back with any additional bugs or findings over the next few days.

mikeytown2’s picture

Sounds good. The register shutdown function doesn't make PHP asynchronous, if you were wondering... well, not since PHP 4.1. If you look at the httprl_background_processing() function and how it is used, that will make the PHP process work in the background, closing the connection to the requesting client.

rjbrown99’s picture

Hm, thanks, I think I need to revisit that part; it does appear to still be single-threaded. I can run multiple queue runners, but I'd like to have only one if I could. I'll keep working on it.

Edit: I also need to fix locking, right now it is a single lock across both queues, and it should be one lock per queue. Otherwise if something hits the 'normal' queue and is far down in the image generation queue, it can't also be placed in the VIP/priority queue for immediate generation because the lock is in place.

mikeytown2’s picture

@rjbrown99
It might be easier to follow if you create a sandbox; once you think it's ready, we can merge. I've released 2.1 today, so I'm expecting 2.2 to include your code.

rjbrown99’s picture

That would require me learning how to create a sandbox :) No problem, I'll start doing that.

rjbrown99’s picture

Sandbox here: http://drupal.org/sandbox/rjbrown99/1324340

At this very moment, it's just a copy of the 6.x-2.1 module. I'll merge in my patch with that tree so you can track any future commits as changes against 6.x-2.1.

I reviewed the background processing in httprl and I think I get how you are doing it. The main server handling the dispatch of the image generation makes a connection to the imageinfo_cache client node via HTTP. The client then runs the httprl_background_processing() function, which sends back a header to close the HTTP connection to the server. It also calls ignore_user_abort() so the PHP script on the client keeps executing until it is finished. So the server thinks everything is fine and execution is done, while the client keeps running a 'headless' PHP function and goes through the image processing work. You are using the HTTP protocol as the way to break/fork execution between the two systems.

That being said, with the queue worker I'm not sure that approach will work. The queue generation function is currently called from drush and is already running on the image generation client node, so we are starting with a 'self-contained' PHP interpreter running a function and looping through the queue. We would need a way to split execution within the same PHP process to make it work asynchronously. I don't think any of the shell_exec stuff will work well, since we are quite a bit above where the actual convert process is called. I was thinking that perhaps pcntl_fork() might work, but I'm not sure how that will play with MySQL connections and file descriptors.

I guess another option would be to convert the actual worker process into a drush script, and then exec the drush script with some options. But that seems hackish. Still thinking about this.
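
For reference, the pcntl_fork() shape I'm picturing (process_item() is a made-up stand-in, and the MySQL handle handling in the child is exactly the part I'm unsure about):

$pid = pcntl_fork();
if ($pid == -1) {
  // Fork failed: just process the item in this process.
  process_item($item); // hypothetical helper
}
elseif ($pid == 0) {
  // Child: must open its own MySQL connection rather than reuse the
  // parent's descriptor, process the item, then exit.
  process_item($item);
  exit(0);
}
else {
  // Parent: keep looping over the queue and reap finished children.
  pcntl_waitpid($pid, $status, WNOHANG);
}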

mikeytown2’s picture

Just a heads up that pcntl doesn't work on Windows, so making it optional is a good idea if you do use it.

rjbrown99’s picture

Yea saw that. I think that second drush hook may be the way to go. Not ideal, but I don't see another way to efficiently fork off another process to do the work.

rjbrown99’s picture

Quick update - as you can see from the commits in my sandbox this has been progressing. I am using queueing now in production and it's working quite well.

I had to add a manual exception for the imagecrop module as it has some strange reliance on creating images in realtime. I spent many hours on resolving that issue but it just didn't work well in batch mode (even when passed the identical arguments.)

Right now I have this working as follows:

1) Primary system, significant system resources. I am running the drush job to just flush the VIP queue on this host, and only up to the defined system threshold. This helps provide that 'realtime' feel when users trip across an uncreated image preset.

2) Secondary system, minimal system resources. This runs two drush jobs (still not threaded) that create images. They loop back and forth between the VIP and normal queue. This works nicely - in the event there are VIP images, it pitches in with the master worker and helps generate them and spread out the load. In the event there are no VIP images, it works off the normal queue. This takes hours to do all of the presets but it does finish them.

Scaling up simply means adding more worker machines. I'm currently working with Puppet's new cloud provisioner to fully orchestrate the startup and shutdown of image generation nodes.

This is at a reasonably good place if someone wants to give it a whirl. I have NOT tested http mode (since I don't use it.) There is one crossover issue I am not sure how to resolve: the menu callback for imagecache_cache is hijacked in all cases. That should probably only be the case for queue mode. Not sure of the best way to do that - perhaps set a variable and clear the menu cache when the mode is changed?