You have VCS write permissions as well as other permissions to this project:
http://drupal.org/project/httprl
Comments
Comment #1
gielfeldt CreditAttribution: gielfeldt commented
Thanks for the access to the project. I also got your mail about this.
I'll look into this when I get the time; however, other pressing matters have emerged, so it probably won't be in the near future :-(
I'll get back to you on this. I might create a new branch in HTTPRL, as I suspect my expectations/intentions for the API in HTTPRL might cause a major revision bump.
I've also received concerns about a "dependency hell". Background Process was made by abstracting reusable components from Ultimate Cron, so for Ultimate Cron the dependency list currently looks like:
Ultimate Cron -> Background Process
Background Process -> Progress
Which would then turn into:
Ultimate Cron -> Background Process
Background Process -> HTTPRL
Background Process -> Progress
Still well within the spirit of reusing modular components IMO, but I have to consider pragmatic requests.
Nevertheless, I'll look at putting my ideas into HTTPRL to make it more flexible and to gain advantage of the focus on the http requesting itself, which that project has.
Comment #2
mikeytown2 CreditAttribution: mikeytown2 commented
As for some ideas with HTTPRL, I was thinking of a callback parameter (for you) at the bottom and an alter hook at the top (for this #1325662: Usage and/or caching of gethostbyname).
I've done most of my planned dev for HTTPRL now; I just have this bug to fix (#1354282: HTTPS connections going missing) and I will be releasing a 1.3 version of HTTPRL. Also note that 6.x and 7.x are the same.
Comment #3
mikeytown2 CreditAttribution: mikeytown2 commented
More dev work has been done and HTTPRL looks a lot closer to this project now. It supports callbacks and background callbacks after retrieving an HTTP request; background callbacks can be blocking or non-blocking. I also support calling any function in a blocking or non-blocking manner; if using blocking, pass-by-reference works.
We should look into merging the 2 projects together as we have the same goals.
Comment #4
gielfeldt CreditAttribution: gielfeldt commented
I haven't looked at HTTPRL lately, but if they are nearly identical, I agree, we should join them.
Sounds nice with pass-by-reference when using blocking. This could allow callers to parallelize function calls in a very transparent manner.
Comment #5
mikeytown2 CreditAttribution: mikeytown2 commented
Yes it does :) http://groups.drupal.org/node/226054
Comment #6
mikeytown2 CreditAttribution: mikeytown2 commented
I would like to put all of httprl into your module. My ultimate goal is to have the http client code get into Drupal 8. httprl can operate at the database bootstrap level; supporting this in Background Process would be a good thing.
Some of the things httprl is currently missing are pool management and load balancing. I think both of these use cases can be solved in third-party code by using hook_httprl_request_alter(), which allows you to alter the request right before the connection is made.
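Since hook_httprl_request_alter() is the proposed extension point, here is a minimal sketch of how third-party load-balancing code might use it. The structure of $request (the 'url' key) and the pool variable name are assumptions for illustration, not HTTPRL's exact internals:

```php
/**
 * Implements hook_httprl_request_alter().
 *
 * Hypothetical sketch: spread requests round-robin over a pool of
 * backend hosts right before the connection is made.
 */
function mymodule_httprl_request_alter(&$request) {
  static $i = 0;
  // Pool of backend hosts; the variable name is illustrative.
  $pool = variable_get('mymodule_backend_pool', array('10.0.0.1', '10.0.0.2'));
  $host = $pool[$i++ % count($pool)];
  // Point the request at the chosen backend host.
  $request['url'] = preg_replace('@^(https?://)[^/]+@', '$1' . $host, $request['url']);
}
```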
Let me know how you wish to proceed. I've skimmed your code; if you can do the same that would help both of us out.
Comment #7
gielfeldt CreditAttribution: gielfeldt commented
I'll look through the HTTPRL code. I mainly use Background Process for Ultimate Cron, which was the reason Background Process was made. If we are to separate the pool management and load balancing from the main module, I might need to coordinate a bit first. hook_httprl_request_alter() sounds interesting though. I think I actually tried this approach to begin with, but then ran into problems, though I can't remember what. Perhaps I was just attacking it from the wrong angle.
I think it would be nice to see support for non-blocking requests in core. I would have begun looking into joining our modules sooner, but haven't really had the time. HTTPRL does everything I need for the http request part and seems better maintained there. I would, though, like the API to be a bit more like Background Process.
The HTTPRL code gave me some issues when nesting requests, due to the static variable storage of the requests, which makes it a bit global'ish.
Another thing is the locking. I see that HTTPRL sets a lock, but I can't seem to figure out what it's used for? I'll look more at the code later. I'm not sure yet, but perhaps the locking should be taken out of Background Process?
Anyways, I'll skim HTTPRL some more and get back to you.
Comment #8
gielfeldt CreditAttribution: gielfeldt commented
BTW, I just added you as a maintainer.
Comment #9
mikeytown2 CreditAttribution: mikeytown2 commented
I set a lock when making a background process request. I then check to make sure the lock is taken in the new process. I'm being extra paranoid, as this allows any function to run; I release the lock in httprl_async_function_callback() (see httprl.async.inc).
Switching over to object code is something that I plan on doing; that way I can get rid of the statics.
Comment #10
gielfeldt CreditAttribution: gielfeldt commented
I can't remember if I mentioned this before, but there might be a problem using the lock API across requests, even when using memcache. The reason is the removal of the lock in the shutdown handler.
Consider the following scenario:
1.) request "A" sets a lock and issues a non-blocking http request (request "B").
2.) after issuing request "B", request "A" finishes.
3.) request "A" has finished before request "B" could check the lock. The lock is gone, due to shutdown handler removal in request "A".
I've also recently been made aware of problems, if the lock is db-based and set inside a transaction in request "A". This will make the lock invisible to request "B" on MVCC systems such as InnoDB with isolation level "repeatable read" (which is more or less the default way to go now in D7).
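A minimal sketch of the scenario above; the option name passed to background_process_http_request() is illustrative, the point is the ordering of request "A"'s shutdown handler versus request "B"'s lock check:

```php
// Request "A" (hypothetical sketch; the options array is illustrative):
if (lock_acquire('background_process_' . $handle)) {
  // Fire off the non-blocking sub-request ("B") and return immediately.
  background_process_http_request($url, array('blocking' => FALSE));
}
// Request "A" now ends. Drupal's lock subsystem registered
// lock_release_all() as a shutdown function, so the lock is released
// here, possibly before request "B" has bootstrapped far enough to
// check that the lock is taken.
```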
Comment #11
mikeytown2 CreditAttribution: mikeytown2 commented
I take care of the shutdown function.
I wait ~2.5 seconds in the menu callback for locks to show up. If this is an issue with "repeatable read", then locking in Drupal is broken... this table might want to default to MyISAM in that case.
Comment #12
mikeytown2 CreditAttribution: mikeytown2 commented
Comment #13
gielfeldt CreditAttribution: gielfeldt commented
I agree the locking API could be deemed broken in transactional environments. But seeing as it is only lock_may_be_available() that doesn't work, and that the naming of the function implies some sort of unreliability, it might be considered "working as designed".
I ended up using my own table as a locking mechanism, due to missing features in the locking API. However, perhaps we should focus on using the locking API and fix the things there? There is currently some discussion on improving the locking API in D8: #1225404: Modern event based lock framework proposal.
Btw, why do you wait for up to 2.5 seconds for the lock? The lock is set before the http request is issued. What kinds of delay/latency have you experienced (or are you expecting)?
Comment #14
mikeytown2 CreditAttribution: mikeytown2 commented
Bummer... the way I'm using locking works with memcache; using it with core, it fails. Created an issue to fix this: #1555154: Fix locking with core's lock.inc.
Comment #15
mikeytown2 CreditAttribution: mikeytown2 commented
Core lock issue fixed. Did the 2.5 second delay as a just-in-case; I don't know if it is truly needed.
Comment #16
mikeytown2 CreditAttribution: mikeytown2 commented
HTTPRL can now use hook_boot to run functions or methods.
D7: hook_boot bootstrap level takes about 25ms. Full level takes close to 300ms.
Passing the session works in D7; going to see how hard it is in D6. Check out the last 2 examples in httprl's readme to see what it can do: http://drupalcode.org/project/httprl.git/blob/refs/heads/7.x-1.x:/README...
Comment #17
gielfeldt CreditAttribution: gielfeldt commented
Very nice. Calling background processes with only bootstrap level should indeed be available where possible.
Another idea I had was perhaps being able to use closures instead of array programming.
http://www.htmlist.com/development/extending-php-5-3-closures-with-seria...
For example in Background Process it would perhaps look like:
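The code sample that followed here did not survive; a hypothetical sketch of what a closure-based Background Process call might have looked like (the execute() method name and closure-serialization support are assumptions):

```php
// Hypothetical: hand Background Process a closure instead of the usual
// array('function_name', ...) callback style. This would require a
// closure serializer like the one in the linked article.
$process = new BackgroundProcess();
$process->execute(function ($nid) {
  $node = node_load($nid);
  // ... do the heavy lifting in the background ...
}, array(123));
```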
Regarding the core lock fix, doesn't the system module clean up the expired entries in the semaphore table?
Comment #18
gielfeldt CreditAttribution: gielfeldt commented
I also have plans for making Background Process postgres compatible in 2.x (which also involves making the Progress module postgres compatible). And perhaps even embedding the Progress module into the Background Process module, as I think it might be a bit over-engineered.
We could then at the same time merge HTTPRL into Background Process 2.x? What do you think?
Regarding the merge, I'm not sure where to begin. You? Perhaps we should outline the joined feature set, and then redesign based on the two modules?
Comment #19
mikeytown2 CreditAttribution: mikeytown2 commented
Just got back from vacation. Looking at the PHP closure example, this looks like it would need to use eval(), which I'm trying to stay away from at all costs. Having an eval example in the readme might be a good idea so people know it is possible if they still wish to use it.
For core locks I update the entry in the database; only if it's a background one and only if using core. I have a cron job that cleans up locks if the process happened to die at an unfortunate spot.
I think HTTPRL is already postgres compatible. What issues are you having? I don't see any bug reports in your issue queue.
Listing the features both have is a good start. Hopefully I'll do that later on today.
Comment #20
gielfeldt CreditAttribution: gielfeldt commented
Regarding postgres compatibility, it's not the Background Process module as such that's the problem. It's my Progress module, where I had a mild case of brainfart and thus created columns named "start" and "end". It propagates a little into the Background Process module, but it should actually be possible to just fix this in the Progress abstraction layer.
I try to stay away from eval() too. However, in this case the code used in eval() has already been parsed and accepted by the php compiler, so it "should" be safe :-). In any case, it's really just syntax sugarcoating, as it is just as easy to create a new regular function and hand that as the callback.
I'll try to work out a list as well.
Comment #21
mikeytown2 CreditAttribution: mikeytown2 commented
HTTPRL Feature List
Parallel HTTP Requests:
Parallel Function/Method Calls:
Other Features:
httprl_background_processing()
httprl_strlen()
httprl_glue_url()
httprl_get_server_schema()
httprl_pr()
httprl_fast403()
httprl_set_user()
Some notes:
Due to the way the code grew organically, the callback code is split up into 3 different ways of doing it. I would like to standardize on one way, using only httprl_run_array(), as this gives the most options and is not that complex.
Adding in more context helping functions is also a goal; so that background functions will have the correct context passed to them.
Comment #22
gielfeldt CreditAttribution: gielfeldt commented
Hi Mike
Sorry I haven't replied yet, I've been busy with other things. I'll try to make room for this.
Another thing I've been pondering is actually decoupling the HTTP stuff from Background Process (the exact opposite of merging our two projects). The reason for this is to abstract it in such a way that a background process can be started in several ways depending on your setup:
* Through HTTP request to a Drupal menu callback (like things are today)
* Through Drush
* Through Fast CGI server
* Other potential backends ...
If this is a way to go, then we would probably need to merge Background Process's http stuff into HTTPRL and extend HTTPRL to also implement a "backend" for Background Process.
What do you think of this idea?
Comment #23
mikeytown2 CreditAttribution: mikeytown2 commented
I've thought about this. stream_select() and proc_open() play together quite nicely; this gives us drush/shell parallel calls as well as HTTP parallelism (from stream_socket_client()).
A Fast CGI server seems like an HTTP request, or did I miss something? As for other back ends, shell commands would make it not super difficult to do.
Given this, the core of what we're looking for revolves around stream_select(). I would opt to bring support for proc_open() into HTTPRL's core. Not exactly sure how to implement it, but here is how drush is doing it: _drush_backend_proc_open().
Comment #24
gielfeldt CreditAttribution: gielfeldt commented
An overview of Background Process features:
* Execute arbitrary functions in background either through direct call (http request) or using queue.
* Near drop-in-replacement of drupal_http_request() for making parallel http requests supporting sliding window technique as well.
* Load balancing and pool management for background processes.
* Class API for starting more advanced background processes (such as setting user, session, etc.)
* Automatic cleanup of stale/dead processes using timeout and Apache Server Status.
* Drop-in-replacement for Drupal Batch API, making batch jobs run through background processes.
* Keepalive/restart support for processes, allowing creation of daemons.
Regarding Fast CGI, I must admit I don't know if it's an HTTP request as such. I was thinking more of the separation Fast CGI vs Drupal menu callback.
Comment #25
gielfeldt CreditAttribution: gielfeldt commented
Just thinking out loud:
Comment #26
gielfeldt CreditAttribution: gielfeldt commented
Just thinking out loud some more (this bit is not as "magical" as the previous post, but supports queueing and class/object calls):
Todo: How to fit in pass-by-reference vars in this scenario?
Comment #27
mikeytown2 CreditAttribution: mikeytown2 commented
HTTPRL does pass-by-reference for all variables. I have a couple bits of code that help with this process:
httprl_recursive_array_reference_extract() and the pass-by-reference trick for call_user_func_array(). In short, don't worry about pass-by-reference; that issue has been solved (it wasn't easy though).
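For readers unfamiliar with the trick: call_user_func_array() only passes by reference when the elements of the argument array are themselves references. A self-contained illustration:

```php
// A function that expects its argument by reference.
function add_one(&$x) {
  $x += 1;
}

$value = 41;
$args = array();
$args[0] = &$value;  // the array element must itself be a reference
call_user_func_array('add_one', $args);
// $value is now 42: the callee modified the caller's variable.
```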
If we wanted something magical we would have to use something like DOH, although we might be able to steal some of its magic and not require it. In the meantime, what is the difference between call and queue, and why might the second lock('test') call run or not run?
Can you chain DB ops? That's the advantage I see with how httprl_run_array() is used.
Comment #28
gielfeldt CreditAttribution: gielfeldt commented
Cool what you've done with the reference stuff.
For the example in post #25, I imagined using the __call() method for intercepting. But that would prevent us from calling class functions or object methods, and the only advantage of the __call() trick was syntax sugar. So I would still suggest something like post #26 for this.
Regarding the chain stuff ... well ... to be honest, it's a weird feature I think :-)
I mean why not just:
I think this is more readable/maintainable, and probably also faster because it can be compiled and opcode-cached.
Or (if we were to do the closure stuff, which is probably unnecessary):
Comment #29
mikeytown2 CreditAttribution: mikeytown2 commented
Main issue is that module code is not available at the hook_boot bootstrap level; that only happens once a full bootstrap has happened. We could use eval (closure stuff), but I try to avoid eval at all costs. The HTTPRL code could be refactored so it looks more like this.
With your code, what are the differences between call and queue, and why might the second lock('test') call run or not run? Do we need call() & lock()? Any way to make this as simple as possible? We should work on the ideal syntax :)
Also need to think about error handling and capturing printed output.
Comment #30
gielfeldt CreditAttribution: gielfeldt commented
call() sends an http request for starting the background process. queue() creates an item in a queue which is then processed by cron (very useful, if not necessary, when you have lots of background processes you want to start and need to keep the server from congesting). Actually a bit like the job_queue module.
For each background process, I create a unique handle, so the process can be identified and maintained for cleanup, etc. The parameter in lock() lets you specify the handle, thereby disallowing further execution of the same background process if it is already running (thundering-herd protection, basically).
We could make the lock() optional for the call(), I see no problem with that as such. call() could just do a lock() internally to make the inner workings more consistent.
About bootstrap. Yes, you're right. Unless the module implements hook_boot(). Then the code should be available I guess.
About the chaining: I really think the code has poor readability. I'm sad that serialization of closures isn't part of php, as this seems like something that should be default for such a feature. Regarding avoiding eval() at all costs, I partially agree. eval() shouldn't be used lightly. However, in this case, where the code has been parsed and validated by the compiler before it's stored, it should be safe. And seeing now that it serves a purpose rather than just syntax sugar, I might lean more toward this.
I'll think some more about it. Perhaps I'll turn around and consider the chaining method after I've given it some more thought.
Comment #31
gielfeldt CreditAttribution: gielfeldt commented
What about being able to specify which file the code resides in? (like $menu['file'] and $menu['file path'] or similar). I think the OOP chaining technique is easiest if we have a factory method. I'll use lock() again as a factory method in this example:
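The chaining example itself did not survive here; a hypothetical sketch of the shape being discussed (the method names follow the thread and are not an existing API):

```php
// lock() as a static factory method returning the process object,
// so the rest of the configuration can be chained off it.
BackgroundProcess::lock('mymodule_rebuild')
  ->phase(DRUPAL_BOOTSTRAP_DATABASE)  // bootstrap level to run at
  ->file(drupal_get_path('module', 'mymodule') . '/mymodule.workers.inc')
  ->call('mymodule_rebuild_cache', array($cid));
```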
Thoughts?
Comment #32
mikeytown2 CreditAttribution: mikeytown2 commented
At this point why not just allow multiple calls? We could still have a phase and include method, but internally it would use the call method.
Comment #33
gielfeldt CreditAttribution: gielfeldt commented
How does the pass-through of the results work? (Considering your previous example regarding db_select() + chaining.) ... And are we beginning to reinvent the Batch API? (the &$context stuff).
Comment #34
mikeytown2 CreditAttribution: mikeytown2 commented
In HTTPRL I use pass-by-reference to make it work. Hybrid example below.
So the call method takes 3 parameters in this case. The first one is the function name; the second is the function arguments; and the third is return values, errors thrown, and printed output.
Comment #35
Mithrandir CreditAttribution: Mithrandir commented
I don't understand this code. In which use case would you rather define a procedure to perform in a background process like this rather than with a wrapper function?
Advantages: Clear division of labour between the background process and the actual procedure performed.
I would hate to troubleshoot code that is coming from a list of function calls in the background process initiation. The whole idea of the background process is to take a simple input (what to call with what parameters) and offload it to a background process. The logic of "what to call" should (in my opinion - this is purely opinion-based arguing, I know) be a function or classname/method name and the arguments passed alongside it.
That way, it is easy to log, what is run in a background process, with which arguments, and easy to test run without a background process - just call the function with the same arguments.
I don't see any use cases for the BackgroundProcess::call('a')->method() calls above. In my opinion, they only add to confusion.
Comment #36
mikeytown2 CreditAttribution: mikeytown2 commented
Sounds good; following the KISS principle is almost never a bad idea. So we have 3 methods:
phase() // Bootstrap level.
include() // File to include.
call() // Function to run.
With this we should be able to accomplish everything that is needed. I'm assuming that we are in agreement that all parameters passed to the function will be handled like pass-by-reference variables. Going back to #26, how do we want to handle output (returned value, errors, printed output) and multiple calls?
Comment #37
gielfeldt CreditAttribution: gielfeldt commented
I suggest that the BackgroundProcess methods return an instance of an object.
I also suggest that the methods actually do what they imply, e.g. call() WILL perform an http-request. To this end we might need:
Initiate background process right away: $process->call();
Queue background process: $process->queue();
Initiate within the current request, but postpone until later to make use of e.g. sliding windows: $process->postpone();
For the phase() stuff, in order to support an arbitrary bootstrap level, we could perhaps create a script within the background_process module that does:
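The script body was lost here; a sketch of what such a script could look like, following the pattern of D7's own front controller (the file name, query parameter, and token check are assumptions):

```php
<?php
// Hypothetical background_process_start.php: bootstrap Drupal only to
// the level the job asked for, instead of always going to
// DRUPAL_BOOTSTRAP_FULL.
define('DRUPAL_ROOT', getcwd());
require_once DRUPAL_ROOT . '/includes/bootstrap.inc';

// The requested phase would arrive with the (verified) request.
$level = isset($_GET['phase']) ? (int) $_GET['phase'] : DRUPAL_BOOTSTRAP_FULL;
drupal_bootstrap($level);

// ... verify the server-side token, then dispatch the background function ...
```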
Comment #38
mikeytown2 CreditAttribution: mikeytown2 commented
What's the difference between queue and postpone? They seem the same to me for the most part. Is there ever a reason why we need call? I think queue is really all we need. Queue up some requests and run them. If some need to be run before others, we could add weights and have an options array that gets passed in if needed (for other things). Can you give me one good example where queue with weights won't accomplish the same goal?
The example that uses &$nodesref (load 5 nodes in background using pass-by-reference, but only two at a time) is not a good one. We will have multiple things trying to write to the same variable when all 5 background threads return. We need to document that in order for pass-by-reference to work, every reference variable needs to be unique.
Being able to execute code at the hook_boot() level (6 of 8) is a big step in the right direction. The lowest level we can boot into will be _drupal_bootstrap_variables() (level 4 of 8), because we need to make sure the function call is authorized. Given this, we should implement hook_boot. See httprl_boot() for an example; what I do is run the same codepath as if it was a menu callback, except it is running at a lower bootstrap level now.
I like where we're going with this thread :)
Comment #39
gielfeldt CreditAttribution: gielfeldt commented
queue() is meant to push the job into a Drupal queue and then let cron handle the processing. The use case would be: I want to defer the call to this function, but it doesn't have to be right away.
call(): I want to call this function right away, but I don't necessarily want to wait for it.
postpone(): I want to call this function "right away", but I want to pile up a bunch first.
call() and postpone() somewhat map 1-1 to the logic I currently have in background_process_http_request(), which is the drupal_http_request() replacement, i.e., just for calling an arbitrary url (it could be fetching 10 images from a remote server, 2 at a time). A use case for postpone() in a background process context could be that you wanted to call X number of functions and use the results, but still have some congestion prevention by limiting to Y number of functions running in parallel at any given time.
Yes, the sliding-window (5 nodes, 2 at a time) example is not a good one ... it could produce strange results when using pass-by-reference if the caller doesn't take precautions like your new example does.
I still think we could "authorize" the call for levels < 4, since we're using a server-side generated token.
Regarding whether we need call() and postpone(): perhaps not internally. I've also been thinking along the lines of perhaps reusing the Batch API. The Batch API takes a number of functions and adds them to a queue which is picked up by the Batch subsystem. We could do the same? The subsystem that picks these functions up could then just be something else, triggered by cron or a menu callback?
Comment #40
mikeytown2 CreditAttribution: mikeytown2 commented
I know postpone() is useful, as I use that same model in HTTPRL; what is the advantage of using call()? Is call() what I call non-blocking, as in: I run it and I don't care about the return value?
I do see the usefulness of queue() if the queue can be processed soon (like start up a background queue worker in a shutdown function).
In terms of sub 4 level bootstraps, how would you accomplish it?
configuration level 1
page cache level 2
database level 3
variables level 4
The best I can see is doing database (level 3) with the lock system from level 4. One could come up with more interesting ways of doing this (allow verified level 0 booting), but that would require a lot of custom code (using the filesystem for verification) IMHO. Using lock.inc makes this fairly simple to implement from my point of view.
Comment #41
gielfeldt CreditAttribution: gielfeldt commented
Yes, call() could be considered analogous to the non-blocking concept in HTTPRL. (Of course, technically postpone() will also perform "non-blocking" requests in order to execute them asynchronously.)
It might not be possible to execute a background process at bootstrap levels < 3 if we store the info for a process in the db (I don't know where else to store it). Regarding use of the lock subsystem, I'm not sure this is mature/good enough at this time (I know of the trick HTTPRL does regarding unsetting the global lock to avoid cleanup). Besides, if we are to store the process info in a table anyway, this could work as a lock.
I also currently have an issue when starting a background process within a transaction due to the tokens not being available in the subrequests (see this thread for more info).
To keep it simple, I was just thinking of putting it into a cron queue and then letting a cron queue worker handle it during the next cron. If speedier cron-queue processing is needed, I would suggest using something like Ultimate Cron or similar.
Comment #42
mikeytown2 CreditAttribution: mikeytown2 commented
Non-blocking in HTTPRL means it doesn't wait for the response, ever; it can send 1000 non-blocking requests in about 100ms, and 100ms is how long the script takes to run. If I were to do this in blocking mode, it would send requests off in parallel but it would wait (block) for all of them to return. Thus using non-blocking in HTTPRL, you can send off a job that will take 20 seconds and the current script will finish fairly quickly, as it doesn't care about what was returned. Do call() and postpone() both do this (non-blocking/no-waiting), or only one? It's still not 100% clear to me. Parallel is different from blocking/non-blocking. HTTPRL operates in parallel 100% of the time. Making sure we are speaking the same lingo here :)
I like the idea of using locking because it can then utilize memcache, but I'm not stuck on the idea. I do think using lock.inc is the right way forward though, as it's core Drupal and maps 100% to what we are trying to do (make sure only authorized requests are run). I would be in support of filing a core bug and modifying the semaphore table in the meantime if locking doesn't work how we need it to.
Looking forward, this is how things are looking in D8 #1599622: Run poormanscron via a non-blocking HTTP request. Having a small sub module that can fire off a background queue worker is something to look into as not everyone will want to install Ultimate Cron.
Comment #43
gielfeldt CreditAttribution: gielfeldt commented
Lingo/clarification:
non-blocking for me means what it does for stream_set_blocking(), namely that e.g. fread() doesn't wait for data.
In my example, background_process_wait($processes) waits for the $processes to finish and returns the results. It does this by polling. You can see the logic in background_process_http_request_process(). This is used for http requests, for example if you want to download 8 images in parallel (or asynchronously, I guess, is the proper term here). It is also possible to provide background_process_http_request() with a callback function, e.g.:
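The example that followed here was lost; a hypothetical sketch of the callback option (the option key name and the callback signature are assumptions):

```php
// Fetch an image in the background; the callback fires when the
// response arrives, instead of the caller polling with
// background_process_wait().
$options = array(
  'callback' => 'mymodule_image_fetched',  // option name is illustrative
);
background_process_http_request('http://example.com/image.jpg', $options);

function mymodule_image_fetched($response) {
  file_unmanaged_save_data($response->data, 'public://fetched.jpg');
}
```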
Stream-wise, it is possible to manually do a blocking http request using background_process_http_request(), as this is a replacement for drupal_http_request(). However, the background processing of specific php functions is always non-blocking.
The good thing about memcache as locking is that it is not volatile to transactions or requests. It is, however, volatile in ways that are out of our control (restart, garbage collection, etc.), which could potentially become a problem. Also, if we are to store an entry in the db with our process information, we essentially do double locking?
Btw, this might also be of interest: #962958: Run cron in a separate process to avoid a response delay
Comment #44
mikeytown2 CreditAttribution: mikeytown2 commented
In HTTPRL I set all streams to stream_set_blocking($stream, FALSE); so it can operate in parallel. In HTTPRL, what I call non-blocking is that after fwrite() is done it does an fclose() on the stream. This prevents that request from blocking the script as a whole.
stream_set_blocking($stream, FALSE) is a requirement if you want to do parallel streams, from my understanding, and HTTPRL assumes you want parallel 100% of the time.
Side note: I think stream_set_blocking() should be called stream_set_sync, as that is a better way of describing what is happening IMHO; but then there might be some confusion with the STREAM_CLIENT_ASYNC_CONNECT flag.
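A minimal self-contained sketch of the pattern described above: non-blocking sockets multiplexed with stream_select(), which is what lets every request run in parallel:

```php
$hosts = array('example.com', 'example.org');
$streams = array();
foreach ($hosts as $host) {
  // ASYNC_CONNECT lets the TCP handshake proceed in the background.
  $s = stream_socket_client("tcp://$host:80", $errno, $errstr, 5,
    STREAM_CLIENT_CONNECT | STREAM_CLIENT_ASYNC_CONNECT);
  stream_set_blocking($s, FALSE);  // never let one socket stall the others
  fwrite($s, "GET / HTTP/1.0\r\nHost: $host\r\n\r\n");
  $streams[(int) $s] = $s;
}
while ($streams) {
  $read = $streams;
  $write = $except = NULL;
  // Sleep until at least one socket has data (1 second timeout).
  if (stream_select($read, $write, $except, 1) === FALSE) {
    break;
  }
  foreach ($read as $s) {
    fread($s, 8192);  // consume whatever arrived
    if (feof($s)) {   // server closed the connection: response complete
      fclose($s);
      unset($streams[(int) $s]);
    }
  }
}
```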
Why would one want to do a non-parallel http request? Just issue a single request, or set the window size/limit to one if needed. IMHO having a method just for this is wasteful; we can simplify here :)
Comment #45
gielfeldt CreditAttribution: gielfeldt commented
At the time, it seemed more logical (simpler?) in my copy/paste of drupal_http_request() to implement support for non-blocking rather than to replace the current functionality.
One use case, though, could be the use of the callback option, i.e. a regular drupal_http_request() with a callback option.
However, this blocking/non-blocking part is completely decoupled from the high-level API in Background Process, which handles background processing of functions, so we don't necessarily have to take it into consideration. I'm actually thinking more of using the code from httprl_request() instead of background_process_http_request(), as you have given more attention to the actual IO part, so we don't need to implement the blocking part as such. We do, though, need to eliminate static vars in order to encapsulate it properly.
We also need to make a clear distinction between the API for actually issuing an http request, and the API for handling the execution of functions in the background via the http requests, e.g.:
1.) background_process_http_request(): Issue an http request.
2.) background_process_start(): Run an arbitrary function in the background (which "incidentally" uses a non-blocking http request to do so).
I don't know (yet) what the async flag does, but I think the stream_set_blocking() function is named properly. It does what it does; async operations are what you can use it for. Then again, we could probably talk terminology all day :-)
Do you watch the European Soccer Championship? We (Denmark) just lost a match :-(
Comment #46
mikeytown2 CreditAttribution: mikeytown2 commented
I do not watch it, but one of my coworkers is rooting for Italy. Sorry to hear that Denmark lost, good luck against Germany this weekend and hopefully Portugal will lose :)
In terms of workflow, I like the idea of a "local" job queue that gets populated and then a run function that will start to issue HTTP requests.
No HTTP requests are issued until run_job() is run. This removes the possibility of create_http_request() issuing an HTTP request by itself; it needs run_job() in order for it to work. Also note that create_thread() is a special wrapper around create_http_request(). The returned values from create_http_request() and create_thread() should be of the same type. If one wishes to use a callback for a single HTTP request, simply do this:
In terms of supporting core's queue, we can either pass in an argument to create_thread() and create_http_request(), or we can create a new function to do this. If we are creating a new function/method, this is how I see it working.
Here I have two new functions, queue_job() and run_drupal_queue(). queue_job() will put the $job array into the Drupal queue to be processed at a later date. run_drupal_queue() will issue a create_thread() that is special because it uses HTTPRL's non-blocking version of execution (fwrite and fclose with no fread). run_drupal_queue() will then process the queue. If we pass the $queue_id to run_drupal_queue(), it will only process that queued item and ignore all others. This allows certain high-priority things to be run first and lower-priority items to be run on cron. An example of a higher-priority queued item is building imagecache presets on file upload so they are ready by the time the node is saved. A lower-priority item would be pinging Google because your news sitemap has been updated.
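A minimal sketch of that workflow, with a plain array standing in for the Drupal queue; the function names follow the comment, but nothing here is an existing API:

```php
<?php
$jobs = array();  // stand-in for DrupalQueue

// Put a $job array into the queue; returns its queue id.
function queue_job(array &$jobs, array $job) {
  $id = count($jobs) + 1;
  $jobs[$id] = $job;
  return $id;
}

// Process the queue. With a $queue_id, only that item is run (high priority);
// without one, everything is drained (e.g. from cron).
function run_drupal_queue(array &$jobs, $queue_id = NULL) {
  $selected = isset($queue_id) ? array($queue_id => $jobs[$queue_id]) : $jobs;
  foreach ($selected as $id => $job) {
    // A real implementation would create_thread() a non-blocking http
    // request (fwrite/fclose, no fread); here we just invoke the callback.
    call_user_func_array($job['callback'], $job['arguments']);
    unset($jobs[$id]);
  }
}

// High-priority item (think imagecache presets) is run immediately by id;
// the low-priority item (sitemap ping) waits for the cron-time drain.
$presets = queue_job($jobs, array('callback' => 'strtoupper', 'arguments' => array('presets')));
queue_job($jobs, array('callback' => 'strtolower', 'arguments' => array('PING')));
run_drupal_queue($jobs, $presets);
```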
I use the STREAM_CLIENT_ASYNC_CONNECT flag in httprl. It means that once DNS has been looked up, it will establish the TCP connection in the background without blocking the script. If you connect with an IP address and set the Host header, one can issue a lot of HTTP requests in a short amount of time thanks to this flag and the absence of a DNS lookup :)
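That flag goes straight to PHP's stream layer. A minimal demonstration (203.0.113.10 is a documentation placeholder IP, and the Host header is set by hand because we connected by IP rather than hostname):

```php
<?php
$errno = 0;
$errstr = '';
// Connecting by IP skips the DNS lookup; with ASYNC_CONNECT the call returns
// at once and the TCP handshake completes in the background.
$fp = @stream_socket_client(
  'tcp://203.0.113.10:80', $errno, $errstr, 1,
  STREAM_CLIENT_CONNECT | STREAM_CLIENT_ASYNC_CONNECT
);
if ($fp) {
  stream_set_blocking($fp, 0);
  // The write may be buffered until the handshake finishes.
  @fwrite($fp, "GET / HTTP/1.0\r\nHost: example.com\r\nConnection: close\r\n\r\n");
  // Later, stream_select() tells us when the response is readable.
  fclose($fp);
}
```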
Comment #47
gielfeldt CreditAttribution: gielfeldt commented
I mostly agree, except for the naming of the functions, which makes me wonder if we're talking past each other here.
My idea concerns the abstraction level. The caller starting a job shouldn't necessarily concern itself with whether it's an http request or not. As I see it, the caller wants to start a job, either immediately or as soon as possible in the near future. In our case, we'll use an http request for starting the job immediately (immediately as in when run_job() is called), and a Drupal queue for the queueing part.
But it could easily be another mechanism that does the work (a drush call through proc_open() or whatever). The main thing is that we decouple the high-level API of starting a background process from the actual work needed to do so, ensuring that the right code is responsible for the right task.
Comment #48
gielfeldt CreditAttribution: gielfeldt commented
Should we create a new branch where we can begin some prototyping? Regarding naming, I think I prefer Background Process over HTTPRL, but I may be biased :-).
Comment #49
mikeytown2 CreditAttribution: mikeytown2 commented
Background Process is a better name than HTTPRL for this merged project.
But in terms of short names, here are some I've come up with:
BP - Background Process
BPRL - Background Process & Parallel Request Library
BPMRL - Background Process & Multi Request Library
PRBPL - Parallel Request & Background Process Library
TPRL - Threading & Parallel Request Library
TMRL - Threading & Multi Request Library
PRL - Parallel Process & Request Library
MPRL - Multi Process & Parallel Request Library
etc...
I'm also wondering if we should make the httprl client code a submodule or auto-included. I would lean towards submodules so both can stand on their own, and then start to roll in some other useful things that modules depending on Background Process do, like cron, etc., similar to AdvAgg: have it all in one module.
What other modules use Progress?
Comment #50
gielfeldt CreditAttribution: gielfeldt commented
I have no other modules that use Progress. The abstraction/separation is a bit over-engineered, so I definitely think it should be merged into Background Process. However, I do use Background Process for a couple of modules (Ultimate Cron, Cache Graceful, HTTP Accelerator). I don't believe in putting Ultimate Cron into Background Process; that's a bit too monolithic an approach for me. Keeping the http part in a submodule, we could do that. We could just start by merging the http stuff into HTTPRL and the background process stuff into Background Process?
Regarding short names, I don't see the need for them; I think they will only lead to confusion.
Comment #51
mikeytown2 CreditAttribution: mikeytown2 commented
Sounds good. We bring in Progress and HTTPRL as stand-alone submodules, keep the long names, and keep Ultimate Cron, Cache Graceful, and HTTP Accelerator as separate projects.
I'll see if I can get a new httprl version rolled out in the next day or two.
Comment #52
gielfeldt CreditAttribution: gielfeldt commented
Cool. I'll create a new branch for Background Process and merge in the Progress module. I mean to embed the Progress code into the Background Process module, as the Progress module doesn't really deserve a module of its own. Also, there's some duplication of data between the Progress and Background Process modules (start time), which speaks in favor of normalization.
So to roughly sum up:
1.) Background Process + Progress + HTTPRL background process code => Background Process
2.) Background Process http code + HTTPRL => HTTPRL
Is this what we mean?
Comment #53
mikeytown2 CreditAttribution: mikeytown2 commented
I was thinking:
Background Process + Progress + HTTPRL threading library & http client code => Background Process.
Thus the HTTP client is stand-alone inside of Background Process. Or do you want to rely on httprl? I was going to merge 100%, the reason being that if you're doing parallel HTTP requests, odds are you want to do a lot of "work" and can thus benefit from something like Background Process.
Comment #54
gielfeldt CreditAttribution: gielfeldt commented
Sure, we can add HTTPRL as a submodule to Background Process. But a separate Progress module still serves no purpose IMO. And it's just a matter of where the code is placed, so I think it's safe to "truly" merge Progress in, just to get rid of the unnecessary duplicate data in the tables. It would certainly make my life a bit simpler, I think, and I know of no one who uses the Progress module by itself.
I already have a couple of submodules in Background Process. If we are to put HTTPRL in as a submodule, perhaps it should be renamed to conform to proper standards? (Is it proper standards to prefix submodules with the parent module's name? I'm not so sure. I'm thinking of panels, panels_mini, panels_page, etc.)
Anyways, currently I have:
background_process
background_process/background_process_ass
background_process/background_batch
I think we can ignore the two existing submodules for now and port them later or move them out.
Whether or not we place HTTPRL inside Background Process makes no difference to me. We can always move it around (though we might have to rename the module), at least until we create the first release.
Comment #55
mikeytown2 CreditAttribution: mikeytown2 commented
background_process_http_client work for you?
Comment #56
gielfeldt CreditAttribution: gielfeldt commented
Yes, sounds fine.
Comment #57
gielfeldt CreditAttribution: gielfeldt commented
Should we branch background_process 7.x-1.x into a new branch, e.g. httprl or something, and work from there?
Comment #58
mikeytown2 CreditAttribution: mikeytown2 commented
background_process 7.x-2.x should be fine. Just hide the dev release till we get most of the code working together; once that's done, make 7.x-2.x-dev available. Luckily all the projects we wish to merge have had a lot of the bugs worked out of them, so this shouldn't be insanely hard to do.
Comment #59
gielfeldt CreditAttribution: gielfeldt commented
Ok. Done. 7.x-2.x is ready.
Comment #60
gielfeldt CreditAttribution: gielfeldt commented
Should we try to divide the workload somehow? I was thinking of beginning to merge in the Progress module.
Comment #61
mikeytown2 CreditAttribution: mikeytown2 commented
I'll add in the httprl http client code. You can merge in the Progress module. Once that is done, we can work on replacing the old http code with the new http code, pass-by-reference, API interfaces, and things of that nature.
Comment #62
mikeytown2 CreditAttribution: mikeytown2 commented
As a fallback, if http requests to self do not work, we could implement something like this: http://drupal.org/project/request_queue
Comment #63
gielfeldt CreditAttribution: gielfeldt commented
That would be the equivalent of doing a:
On that note, I've been asked whether it would make sense to design the API in such a way that the caller doesn't have to take into account whether the jobs should be executed using an http request or through a queue, but rather that this is a setting.
e.g.:
and then it's the setting that determines whether it runs through a queue or an http request is made right away.
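Such a setting-driven entry point might look roughly like this; the function name is hypothetical, and a plain parameter and array stand in for variable_get() and the Drupal queue so the sketch runs standalone:

```php
<?php
// One entry point; a site-wide setting (here the $method parameter, standing
// in for variable_get()) decides between queueing and immediate execution.
function background_process_dispatch($callback, array $arguments, $method, array &$queue) {
  if ($method === 'queue') {
    // Deferred: the job waits for a later queue worker; no result yet.
    $queue[] = compact('callback', 'arguments');
    return NULL;
  }
  // 'http': a real module would fire a non-blocking request right away;
  // here we just invoke the callback directly.
  return call_user_func_array($callback, $arguments);
}
```

The caller never names the mechanism; flipping the setting flips the behavior, which is exactly the trade-off in question.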
I'm not sure what I think of this. On one hand it makes sense, but on the other hand you could imagine that the caller actually wants to control it. Perhaps we could support both paradigms?
What do you think?
BTW, I haven't had much time to look into making new code, as I'm on vacation at the moment and will be for the next week or so.
Comment #64
mikeytown2 CreditAttribution: mikeytown2 commented
Tough question...
We have these different modes of operation from my understanding:
- queue the request (I want this done in the near future, but it doesn't have to be now)
- queue the request and start up background process to run queue (useful for node save)
- queue then execute the request in same process if doing a non-blocking request (useful for node save)
- queue then execute the request in same process (useful for cron)
I think we should add the request to the queue so that if it fails for some reason, it will be run in the near future. If there are too many worker processes preventing a queued item from being run, this will allow us to fail gracefully.
In terms of whether the different modes of operation are controlled via a setting or via how it's called, I would opt for how it's called (different methods).
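Sketched as distinct methods, the modes above might look like this; the class and method names are made up, and every mode queues first so a failed dispatch can be retried later, per the fail-gracefully point:

```php
<?php
class BackgroundJobDispatcher {
  public $queue = array();

  // Mode 1: queue only; a cron worker drains it later.
  public function enqueue($callback, array $arguments = array()) {
    $id = count($this->queue) + 1;
    $this->queue[$id] = compact('callback', 'arguments');
    return $id;
  }

  // Mode 2: queue, then poke a background worker (stubbed here) so the
  // item runs soon, e.g. right after node save.
  public function enqueueAndTrigger($callback, array $arguments = array()) {
    $id = $this->enqueue($callback, $arguments);
    // Real code: non-blocking http request to a queue-runner endpoint.
    return $id;
  }

  // Modes 3/4: queue, then execute in this process; the item is removed on
  // success so it can never be run twice.
  public function enqueueAndRun($callback, array $arguments = array()) {
    $id = $this->enqueue($callback, $arguments);
    $result = call_user_func_array($callback, $arguments);
    unset($this->queue[$id]);
    return $result;
  }
}
```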
Comment #65
gielfeldt CreditAttribution: gielfeldt commented
I think having a queue as a fallback sounds like a good idea. We might need a separate queue for this, and we need to make sure that the job isn't run twice.
I'm not sure about these 3... Can you elaborate a bit perhaps with some examples?
Comment #66
mikeytown2 CreditAttribution: mikeytown2 commented
"queue the request and start up background process to run queue (useful for node save)" Example: Notify users that a node has been updated. Needs to be done soon but if delayed, not a big deal.
"queue then execute the request in same process if doing a non-blocking request (useful for node save)" Example: Generate image cache presets for newly created node by requesting them in a non blocking manner. If this fails, no big deal as the browser will be requesting these very soon.
"queue then execute the request in same process (useful for cron)" Example: Think of something like the feeds module.
Also, there have been some developments in the direction of http clients in Drupal, and we should adopt Guzzle instead of httprl: #1447736: Adopt Guzzle library to replace drupal_http_request(). I've been working on stuff over here, if you've been wondering :)
Comment #67
gielfeldt CreditAttribution: gielfeldt commented
Hi Michael
I've committed a prototype of a new API to 7.x-2.x. Here's what I've done.
* Removed lots of old code and moved the db stuff to the Background Process class.
* Moved all http stuff to background_process.http.inc
* Refactored BackgroundProcess class and put it into background_process.inc (I'm not sure of the file convention here …)
* Added support for various dispatchers (see example).
* Added support for running in an arbitrary bootstrap phase (though below the DATABASE phase it might not make sense).
* Pass-by-reference results
* Partially moved progress handling to background process
I've tried to isolate the inner workings of Background Process, so that actually running the process is taken care of by some "dispatcher". Currently implemented are:
* call() - via non-blocking http request
* queue() - adds an item to the queue (cron worker not implemented yet)
* runNow() - proof-of-concept/stub, just runs the function and returns the result in the current request.
For the arbitrary bootstrap phase, I've added background_process.bootstrap.php … it may be a bit insecure at the moment :-)
As you can see from the API above, it's a bit explicit with the setCallback() and all … however, I think it's good to have this clean and visible distinction, and it also makes it easier to separate runNow(), call() and queue(). Later, perhaps we could shortcut the API, e.g. by passing $callback and $arguments directly to call() or so.
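A hypothetical reconstruction of that explicit, chainable shape; only runNow() is fleshed out here, with call() and queue() being the other dispatchers in the real prototype:

```php
<?php
class BackgroundProcess {
  protected $callback;
  protected $arguments = array();
  public $result;

  public function setCallback($callback) {
    $this->callback = $callback;
    return $this;                      // return $this to allow chaining
  }

  public function setArguments(array $arguments) {
    $this->arguments = $arguments;
    return $this;
  }

  // Proof-of-concept dispatcher: run in the current request.
  public function runNow() {
    $this->result = call_user_func_array($this->callback, $this->arguments);
    return $this;                      // keep the chain going
  }

  public function getResult() {
    return $this->result;
  }
}

// Usage:
$process = new BackgroundProcess();
$sum = $process->setCallback('array_sum')
               ->setArguments(array(array(1, 2, 3)))
               ->runNow()
               ->getResult();
```

Every setter returning $this is what makes the visible, step-by-step style possible without sacrificing brevity.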
EDIT: Changed codestyle for better viewing pleasure ...
Comment #68
mikeytown2 CreditAttribution: mikeytown2 commented
I like what I see :)
Below are some comments.
Do we want to support this? I think the resultStorage method is all we need.
In regards to waitForFinish: the current default is to wait 10 seconds, correct? Just making sure I'm reading the code correctly.
So if one cares about the results of call(), waitForFinish() is the answer. If one doesn't care, don't call waitForFinish() and we're good to go. Correct?
Comment #69
gielfeldt CreditAttribution: gielfeldt commented
Hi
1.) Perhaps not; it might be confusing with two ways of getting the variables, and runNow() not returning the object is inconsistent. runNow() or any other dispatcher should probably return the object to support further chaining. And then we could provide a getResult() method if needed:
2.) Correct, default wait is ten seconds. I'm not sure what a good default value is here nor with the polling interval.
3.) Correct again.
Comment #70
gielfeldt CreditAttribution: gielfeldt commented
Hi Mike
I've been working some more on this. Here's what I've been focusing on:
Known todo (from the top of my head):
Here's a "short" list of the Background Process API:
To see the effect of the NodeJS stuff, you can try a git checkout of Ultimate Cron 7.x-2.x, which utilizes this. You of course need to have NodeJS set up. Here's what I did (on Ubuntu, from http://drupal.org/node/1713530 and https://github.com/joyent/node/wiki/Installing-Node.js-via-package-manager):
Then I:
Then I started the nodejs server:
After this you should be ready to install Ultimate Cron and go to /admin/config/system/cron
What do you think? And how do you suggest we proceed?
Comment #71
gielfeldt CreditAttribution: gielfeldt commented
Did you have a chance to look at this yet?
Comment #72
gielfeldt CreditAttribution: gielfeldt commented
Hi Mike
I really want to take this further, and you're probably busy with other stuff, but do you have any immediate concerns or comments? Otherwise, my plan is to continue with the points I listed, perhaps leaving the merge of HTTPRL till the end, as that part of the functionality is somewhat isolated in the http part.
Comment #73
mikeytown2 CreditAttribution: mikeytown2 commented
Things look good from what I can see. Sorry for taking so long to get back to you.