Support for Drupal 7 is ending on 5 January 2025—it’s time to migrate to Drupal 10! Learn about the many benefits of Drupal 10 and find migration tools in our resource center.
For high performance link checking we need to add cURL support with non-blocking parallel link checks. It's planned to be able to check up to 1000 links in parallel if the hardware and external network connection can handle this :-).
Comments
Comment #1
hass CreditAttribution: hass commentedWhen I opened this case I hoped to have this implemented within a week with the approach from http://www.onlineaspect.com/2009/01/26/how-to-use-curl_multi-without-blo..., but the status code and error handling and auto-updating of 301 links seems to be a quite complex task and made me working on other tasks.
If someone likes to jump in here - go on... we definitively need this feature.
Comment #2
hass CreditAttribution: hass commentedhttp://www.jaisenmathai.com/blog/2008/05/29/asynchronous-parallel-http-r...
Comment #3
sinasalek CreditAttribution: sinasalek commentedsubscribed
Comment #4
janusman CreditAttribution: janusman commentedThis patch adds CURL support into the link checker.
Some non-scientific benchmarks for 195 links tested:
1 simultaneous connection
120 sec (timed out when only 134 links were checked)
1.12 links/sec
10 Connections
55 sec (195 links checked)
3.54 links/sec
50 Connections
36 sec (195 links checked)
5.41 links/sec
Comment #6
janusman CreditAttribution: janusman commentedLet's try that again.
Comment #8
janusman CreditAttribution: janusman commentedGah, had some non-UNIX line endings
Comment #10
janusman CreditAttribution: janusman commentedA note:: it *seems* the tests fail regardless of the patch, so this needs human review =)
Comment #11
hass CreditAttribution: hass commentedIf the tests pass without this patch and with not - there may be bugs :-)))
Comment #12
sinasalek CreditAttribution: sinasalek commentedThis module might be useful http://drupal.org/project/curl
Comment #13
hass CreditAttribution: hass commentedBy default we use HEAD, not GET. POST is not yet possible, we need to do some more stuff to get this working.
I'd like to move this into an extra function if not already available to reduce lines of code.
By this drupal_http_request() we have a blocking request as I know... :-(
Powered by Dreditor.
Comment #14
hass CreditAttribution: hass commentedThis curl modul from #12 should get a review if it provides us the same API/interface for installations with and without cURL.
Comment #15
janusman CreditAttribution: janusman commentedThe CURL module mentioned in #12 offers a fallback PHP-only version of CURL; however, it does not implement curl_multi commands which is the whole point of this patch =)
@hass: re #13:
1) could you clarify your first comment "By default we use HEAD, not GET. POST is not yet possible, we need to do some more stuff to get this working.".
2) Also, I will try to address your second comment... "I'd like to move this into an extra function if not already available to reduce lines of code."
3) I guess the third one is not a blocking issue (and doesn't really need addressing?)? "By this drupal_http_request() we have a blocking request as I know... :-("
Thanks
Comment #16
hass CreditAttribution: hass commented#1. Your else statement defaults to GET. This is logic wise not correct and need to default to HEAD.
#3. I'm not sure... but if it blocks we don't have "non blocking requests". We could address this later if there is no way to solve...
Aside, should we try to depend on http://drupal.org/project/curl? I don't know the code, but it may easify some things if people don't have curl!?
Comment #17
janusman CreditAttribution: janusman commentedAgain, http://drupal.org/project/curl does not support the parallel fetches from the "real" CURL library.
Will look into #1... maybe #3 could be attacked somehow... ideas welcome =)
Comment #18
hass CreditAttribution: hass commentedD7 first
Comment #19
hass CreditAttribution: hass commented#8: linkchecker_CURL-380052-4.patch queued for re-testing.
Comment #21
janusman CreditAttribution: janusman commentedHeh, yep, definitely will need a reroll since patch is from prehistoric CVS days =)
Comment #22
hass CreditAttribution: hass commentedhttp://drupal.org/project/httprl may also an alternative to curl. It's written as API compatible with core drupal_http_request(). In http://drupal.org/node/1272542 they plan to fallback to curl if stream wrappers are not available.
The good thing is we get all possibilities in one module and have a consistent API and we don't need to care about anything linkchecker internally. I love the ideas, but have no experience with httprl yet.
Comment #23
hass CreditAttribution: hass commented@janusman: I guess you spend a lot of time on the curl patch... I'm sorry for this, as the HTTPRL module was a chance find for me, too. But it looks really promising!
Just made a test with 1000 URLs (~3 minutes) and it's really cool and API conform to core.
Per #732626: Supporting checking link via Ajax in node view we should first move the checking logic out of hook_cron() and than we can make the HTTPRL module plug-able. hopefully #1268096: Implement a rate limiter get's solved as I do not like to commit a patch before. We can really overload any server with this feature.
Comment #24
hass CreditAttribution: hass commentedPatch attached is a prove of concept. We need an UI so users can decide to stay with Core or use the HTTPRL module (Experimental). Changing topic to be more general.
Comment #25
janusman CreditAttribution: janusman commentedHey! No problem. I'm all for using HTTPRL. =)
Comment #26
hass CreditAttribution: hass commentedPatch requires HTTPRL latest DEV or future v1.5. This is waiting for #1427958: Run callback after stream has been fetched as current logic is not that fast as expected because of some request timeouts that are blocking speed.
Comment #27
hass CreditAttribution: hass commentedNew patch
Comment #28
hass CreditAttribution: hass commentedMissed to remove one workaround leftover from previous broken HTTPRL versions.
Comment #29
kim.pepperAny update on this ticket? I has been 10 months without a review.
Comment #30
hass CreditAttribution: hass commentedComment #31
hass CreditAttribution: hass commentedFixed the link to httprl module.
Comment #32
mikeytown2 CreditAttribution: mikeytown2 commentedShould be -2 not 2
http://drupalcode.org/project/httprl.git/blob/22dba7cfa825cb24a1fe8b5326...
I haven't tested either patch yet. This should be the only change in this patch in comparison to the patch in #30.
Comment #33
mikeytown2 CreditAttribution: mikeytown2 commentedAlso I will hopefully be rolling out 1.8 of httprl before the new year. If you could test the latest dev to make sure it works as expected that would be appreciated. I've been running httprl 7.x-1.x-dev on prod with no issues but I'm one of many use cases.
Comment #34
hass CreditAttribution: hass commentedThanks a lot for the -2 hint. I missed this change. I'm currently testing the callback logic we discussed in #1589122: background_callback array with two parameters?. New patch will follow soon.
Comment #35
hass CreditAttribution: hass commentedLatest patch attached with callback logic. This is a memory killer. No idea why.
I pushed only 560 links in the queue and when cron exits after some time only 60 links have been checked with a thread limit of 4. CPU is 100% and 1.5GB of memory has been used by Apache (httpd) process what is not acceptable.
@task: add global httprl timeout. This need to be calculated.
Comment #36
mikeytown2 CreditAttribution: mikeytown2 commentedchange
background_callback
tocallback
as this will use less threads and thus less memory.background_callback
is really only useful if _linkchecker_status_handling eats up a lot of time (several seconds), if it is quick,callback
is all you need.@hass
Can you give me instructions on how to repo the test on a clean D7 install?
Comment #37
hass CreditAttribution: hass commentedI can give "callback" a try tomorrow. I thought the background one is the one with less memory usage as it free's up after it's completed. How exacly can I make a decission for one or the other callback/background?
It's very simple... Fill up the site with many links... I have 1500 test urls. Than run cron with web url from status page. That's really all.
I plan to create a final linkchecker release very soon. Would be great to get this done before.
Comment #38
mikeytown2 CreditAttribution: mikeytown2 commentedFresh install of drupal so I wasn't be able to repo the memory usage. But I was able to repo the CPU usage. Switching from background_callback to callback made all the difference. Attached patch is based off of #35 and it should fix both of the reported issues :) I also moved drupal_set_time_limit around as httprl only needs the extra time when httprl_send_request() is called.
background_callback can be very handy but in this case it is the wrong tool here. Internally background_callback is used with httprl_queue_background_callback(), which we could use here if you wish for linkchecked_cron() to have it's own php process; right now linkchecked_cron() is apart of the whole cron process. The main advantage of putting linkchecked_cron() in it's own php process is so it can fully utilize the 240 seconds available to cron as it is the only thing running.
Comment #39
mikeytown2 CreditAttribution: mikeytown2 commentedIn this patch we have linkchecked_cron() run in the background. It's a quick hack, creating a new function instead of reusing linkchecked_cron() would be ideal. What I'm doing is creating a new process and executing
linkchecker_cron(TRUE)
outside of cron.Both patches are running with the latest dev of httprl.
Comment #40
hass CreditAttribution: hass commentedI'm sorry, but I do not really understand what you are doing here or how this works. I thought we are already running the link checks with httprl_send_request() after we filled the queue. Can you explain this a bit, please?
I'm fine with this drupal_set_time_limit() change. But as I know core already set's 240 seconds and it's not set again to 240 seconds as I know. This is only there for the case that any other module may not set it. e.g. core set 240s and any other module require 60 seconds before linkchecker comes in the sequence. Than we only have 180s left up to 240s. With timer_read('page') we can calculate the remaining seconds and make sure that httprl is able to execute the _linkchecker_status_handling() before php times out. To be a bit safe, we should not use more than 220s seconds over all or only 180s max. We need to add the global timeout for httprl to e.g. "220 - (timer_read('page') / 1000)" in $options array to make httprl exit before php kills the process the hard way.
Comment #41
hass CreditAttribution: hass commentedAside, do you know why the CPU load is so heavy (not tested latest two patches)? Normally core drupal_http_request() is low on CPU with 0% load. I also think httprl need to be low on CPU, too. Most of the time we are waiting for the remote servers to answer... My best guess is that there are some very aggressive loops without a sleep that will fire the CPU to 100%. That's all guessing, but what httprl is doing while it runs - shouldn't be that CPU intensive and the size of the url array should also not eat sooo much memory. 32 socket shouldn't be a problem as a server should theoretically be able to open up to ~65536 ports, too. Ok, there are timeouts blocking and we'd still like to be able to be accessible from outside, but all is relative here... 32*240s = 7680 checks (I do not expect these many) and this is not that much :-)
Comment #42
mikeytown2 CreditAttribution: mikeytown2 commentedLong story short, this runs linkchecker_cron in a new thread.
Long story below:
This is not required, if you look at the the patch in #38 it does not use this bit of code. But, what this does in #39 is run linkchecker_cron() in a new thread/process that is not tied to the cron thread/process in any way; this allows for us to use as much time as we need and we don't have to worry about other cron hooks running out of time because linkchecker_cron() is now running in a new & independent thread/process.
At a lower level it sends off a non blocking http POST request to the path httprl_async_function_callback; inside the POST is the authentication information, the function to run, and the arguments to pass to that function. Most cron hooks don't return anything, and this is true for linkchecker_cron(); we can run it and ignore any return value. By doing this (no return value) HTTPRL will use a non-blocking request when making the POST to the path "httprl_async_function_callback"; we make it non-blocking by not defining a return key in the same array that the function key is defined.
Don't set
'return'
here and it will be a non blocking POST request to a special URL (httprl_async_function_callback) that allows for us to run any function, if you have the master key and have the correct lock.The best option would be to move drupal_set_time_limit() outside of the loop. I wasn't sure what you where doing here thus I tried to keep the normal code flow exactly the same while only modifying the HTTPRL codepath.
This has to do with the use of "background_callback" instead of "callback". background_callback issues a POST to the path "httprl_async_function_callback" and in our case runs the _linkchecker_status_handling function in a new thread. Doing this for every URL that is checked will eat up a lot of RAM and CPU because each new thread is a full drupal bootstrap. When callback is used (as in #38 & #39) the majority of the time is spent in usleep(); verified via cachegrind.
Comment #43
hass CreditAttribution: hass commentedThanks for all these details. I tried both of your patches. This is really heavy with 8 threads (<100% CPU, ~400MB httpd). Can we reduce CPU load?
I see httprl has a default global_timeout of 120s. Is this intentional? I would set it to 180s or so.
Moving the complete code outside of hook_cron() should be done for sure. There is some need to reuse the logic in other ways like checking one link with an ajax request. So we can do this for sure.
#39: ~1223 links with 8 treads in only one cron run... Dreams come true :-)))
Comment #44
hass CreditAttribution: hass commentedThis patch adds missing
global_connections
andglobal_timeout
. You may remember I asked already to allow setting these global values withhttprl_send_request()
. Here we see how this could help. Now we add 8000 links to the queue and also 8000 times the global settings. This requires a lot more memory than needed and I'm also not able to set theglobal_timeout
correctly. Is there a chance that this may change soon?Comment #45
hass CreditAttribution: hass commentedI've committed a cleanup patch to reduce the size of these patch and the stuff that is not really part of this new feature here. New patch attached that also moves the code outside of hook_cron().
Comment #47
hass CreditAttribution: hass commentedMissed to update the hook_cron() call in the drush file.
Comment #48
hass CreditAttribution: hass commentedSuxxx windows line feeds
Comment #50
hass CreditAttribution: hass commented#48: linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback5.patch queued for re-testing.
Comment #51
hass CreditAttribution: hass commentedLooking for feedback on how we can move this into
httprl_send_request($global_options);
. Maybe HTTPRL 1.8 can support this?Comment #52
mikeytown2 CreditAttribution: mikeytown2 commentedSadly that would require a 2.0 release. I looked into it and I have too many assumptions baked into the current version. The rational for 120 seconds is it is 1/2 of 240.
Comment #53
mikeytown2 CreditAttribution: mikeytown2 commentedUpdate to the patch in #48
Changes made:
- Moved drupal_set_time_limit out of the foreach loop.
- The $response variable in _linkchecker_status_handling is passed by reference.
- The $response variable is destroyed at the end of _linkchecker_status_handling in order to free memory. This should take care of the memory issues your reporting.
Things to do:
Would like to move link checking logic logic out of the cron hook so the new thread/process will call something other than a linkchecker_cron (having linkchecker_cron call linkchecker_cron is a quick hack). Make this new function have as little side effects as possible (don't call drupal_set_time_limit) so it can be used elsewhere without any time limits enforced (drush).
Edit: Looks like we do have a new function so I'll see what I can do in order to reduce the side effects.
Comment #54
hass CreditAttribution: hass commentedIt's already changed in the last patch or have I misunderstood this? The function is named
_linkchecker_check_links()
.What side effects? The high CPU load? :-)
Comment #55
hass CreditAttribution: hass commentedI fear to commit this patch :-(. The latest one brings httpd up to 1.2GB of memory, 100% CPU with permanent load (machine is non-responsive) and runs *only* with 8 concurrent HEAD connections. Without the reference it looked smarter on the memory and CPU side.
HTTPRL need to be made more performant and a lot less memory and CPU intensive first.
Aside, shouldn't we have a lock on the function call? Otherwise cron may be executed more than once and this will start up N background processes what will kill nearly every machine and will check the same urls N times. Is this something that HTTPRL need to handle or linkchecker? Any idea how we can add a lock?
Comment #56
hass CreditAttribution: hass commentedWe can use below:
http://api.drupal.org/api/drupal/includes%21lock.inc/function/lock_acqui...
http://api.drupal.org/api/drupal/includes%21lock.inc/function/lock_relea...
Comment #57
mikeytown2 CreditAttribution: mikeytown2 commentedI'll run this against a test server that I have with 30k links. The batch operation after clicking save on
admin/config/content/linkchecker
could take forever at this rate...Comment #58
mikeytown2 CreditAttribution: mikeytown2 commentedI'll roll a patch with locking. This would be something that linkchecker needs to do, make sure only one instance of _linkchecker_check_links() is running.
Comment #59
hass CreditAttribution: hass commentedI'm open minded to optimizations that work reliable... :-)
Comment #60
mikeytown2 CreditAttribution: mikeytown2 commentedHere is the patch that I have. I've tested it against a small set of links (50) and I can't find any memory leaks and/or excessive memory usage.
Still waiting for the batch scan to be done.
Comment #61
mikeytown2 CreditAttribution: mikeytown2 commentedUpdated patch so it will ignore function timeout errors.
This one reports memory usage as well.
Comment #62
mikeytown2 CreditAttribution: mikeytown2 commentedAdds in support for head_only
Comment #63
hass CreditAttribution: hass commentedWhat is head_only? The option for HEAD or GET is "method" only.
Comment #64
hass CreditAttribution: hass commentedWhy is there a need for two time limits? We know from $max_execution_time that this is the maximum time we can run a script, isn't it?
I'm going to add a watchdog in here later to warn about the locked process. Maybe useful information to know. :-)
This should be enough, isn't it?
Comment #65
hass CreditAttribution: hass commentedNew patch
Comment #67
hass CreditAttribution: hass commentedBased on #61.
$time_limit
Comment #68
hass CreditAttribution: hass commentedI do not know why, but the memory usage is not always logged on my machine.
Comment #69
hass CreditAttribution: hass commentedComment #70
mikeytown2 CreditAttribution: mikeytown2 commentedReason I made the $time_limit variable is for drush. In #62 when drush calls _linkchecker_check_links() it will force php to timeout after 240 seconds even if we have that limit set at a higher value. The next patch will be based off of #69 though :)
Answer for head_only is in this comment #1869002-11: Extreme memory and cpu usage
Comment #71
mikeytown2 CreditAttribution: mikeytown2 commentedAdded head_only back in. If you fully understand what this does and still do not want it in I will not add it back in future patches.
Added a check to see if the link->url is internal. If it is, limit HTTPRL to only 1 concurrent connection for that domain. I'm hoping this will take care of the memory and CPU usage issues you've seen.
Comment #72
hass CreditAttribution: hass commentedOk, we found my local problem. :-( I have disabled all links to
http://localhost
and now all run smooth. Memory usage is good (~80MB), CPU mostly low, short spikes with max 25%. Sorry for stressing on this side. But it clearly shows me that we can kill a remote server with linkchecker very quickly. I guess I will set this domain limit down to 2 threads how the RFC suggests originally and may allow overriding this via settings.php.Need to move this outside the foreach and set all to the same time. $max_execution_time - 30s or 60s if we know we are running in an extra process we have 100% of the time and there is no need to read the page timer.
Removal, we are using the
Range
header.I'm not a fan of committing this... this case just shows us a box may be quickly overwhelmed by 8 threads at the same time and on the end of the day the "per domain" limit may be too high with a thread limit of 8. :-)
I do not see both entries in the watchdog logs. Maybe httprl_send_request() does not come back and is also not able to remove the named lock. This needs further investigation.
Good catch. I planed to look into httprl to see if there may be more I have missed in past moths...
Comment #73
mikeytown2 CreditAttribution: mikeytown2 commentedI do not see both entries in the watchdog logs. Maybe httprl_send_request() does not come back and is also not able to remove the named lock. This needs further investigation.
My guess is something went fatal and killed PHP or we ran out of time. On windows boxes PHP has a lot less time to run due to how http://php.net/set-time-limit works. The other guess is safe mode. We can also try adding a call to drupal_set_time_limit() in the _linkchecker_status_handling() function. Putting lock_release and watchdog in a shutdown function might work as well.
Comment #74
hass CreditAttribution: hass commentedLast patch, this one will be committed if all lights are green.
Comment #75
hass CreditAttribution: hass commentedRemoved 1 from Number of simultaneous connections.
Comment #76
hass CreditAttribution: hass commentedBetter formatting of memory usage
Comment #77
hass CreditAttribution: hass commentedHTTP Parallel Request Library is now named HTTP Parallel Request & Threading Library
Comment #78
hass CreditAttribution: hass commented@mikeytown2: It was a pleasure to work with You on this feature. Thanks a lot for all your help!!!
D7: http://drupalcode.org/project/linkchecker.git/commit/b56dda1
Comment #79
mikeytown2 CreditAttribution: mikeytown2 commentedI like how thorough you are. I enjoyed working with you as well :)
Comment #80
hass CreditAttribution: hass commentedComment #81
hass CreditAttribution: hass commentedhttp://drupalcode.org/project/linkchecker.git/commit/52d44fe