Add support with non-blocking parallel link checking [#380052]

Comment	File	Size	Author
#77	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback12.patch	12.46 KB	hass
#76	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback11.patch	12.43 KB	hass
#75	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback10.patch	12.42 KB	hass
#74	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback9.patch	12.42 KB	hass
#71	linkchecker-380052-71-add-httprl.patch	11.82 KB	mikeytown2
#69	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback8.patch	11.32 KB	hass
#68	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback8.patch	11.33 KB	hass
#67	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback7.patch	11.25 KB	hass
#65	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback6.patch	11.25 KB	hass
#62	linkchecker-380052-62-add-httprl.patch	11.21 KB	mikeytown2
#61	linkchecker-380052-61-add-httprl.patch	11.18 KB	mikeytown2
#60	linkchecker-380052-60-add-httprl.patch	10.98 KB	mikeytown2
#53	linkchecker-380052-53-add-httprl.patch	10.66 KB	mikeytown2
#48	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback5.patch	10.16 KB	hass
#47	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback5.patch	10.16 KB	hass
#45	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback4.patch	9.87 KB	hass
#44	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback3.patch	12.38 KB	hass
#39	linkchecker-380052-39-add-httprl-background.patch	12.31 KB	mikeytown2
#38	linkchecker-380052-38-add-httprl.patch	11.53 KB	mikeytown2
#35	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback2.patch	11.31 KB	hass
#32	linkchecker-380052-31+Add+support+with+non-blocking+parallel+link+checking.patch	10.03 KB	mikeytown2
#31	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking.patch	10 KB	hass
#30	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking.patch	10 KB	hass
#28	380052_Add_support_with_non-blocking_parallel_link_checking_2012021203.patch	10.38 KB	hass
#27	380052_Add_support_with_non-blocking_parallel_link_checking_2012021202.patch	10.46 KB	hass
#26	380052_Add_support_with_non-blocking_parallel_link_checking_2012021201.patch	9.93 KB	hass
#24	380052_Add_support_with_non-blocking_parallel_link_checking_2012012801.patch	2.66 KB	hass
#8	linkchecker_CURL-380052-4.patch	10.88 KB	janusman
#6	linkchecker_CURL-380052-4.patch	11.12 KB	janusman
#4	linkchecker_CURL-380052-4.patch	10.66 KB	janusman

Comment #1

hass commented 28 June 2009 at 13:21

Assigned:

» Unassigned

When I opened this case I hoped to have this implemented within a week with the approach from http://www.onlineaspect.com/2009/01/26/how-to-use-curl_multi-without-blo..., but status code handling, error handling, and auto-updating of 301 links turned out to be quite a complex task, and I ended up working on other things.

If someone would like to jump in here, go ahead... we definitely need this feature.

Comment #2

hass commented 13 August 2009 at 17:22

http://www.jaisenmathai.com/blog/2008/05/29/asynchronous-parallel-http-r...

Comment #3

sinasalek commented 12 February 2010 at 11:42

subscribed

Comment #4

janusman commented 11 March 2010 at 23:37

Status:

Active

» Needs review

Status	File	Size
new	linkchecker_CURL-380052-4.patch	10.66 KB

This patch adds CURL support into the link checker.

Some non-scientific benchmarks for 195 links tested:

1 simultaneous connection
120 sec (timed out when only 134 links were checked)
1.12 links/sec

10 Connections
55 sec (195 links checked)
3.54 links/sec

50 Connections
36 sec (195 links checked)
5.41 links/sec

Comment #5

12 March 2010 at 00:00

Status:

Needs review

» Needs work

The last submitted patch, linkchecker_CURL-380052-4.patch, failed testing.

Comment #6

janusman commented 12 March 2010 at 17:36

Status:

Needs work

» Needs review

Status	File	Size
new	linkchecker_CURL-380052-4.patch	11.12 KB

Let's try that again.

Comment #7

12 March 2010 at 17:50

Status:

Needs review

» Needs work

The last submitted patch, linkchecker_CURL-380052-4.patch, failed testing.

Comment #8

janusman commented 12 March 2010 at 17:54

Status:

Needs work

» Needs review

Status	File	Size
new	linkchecker_CURL-380052-4.patch	10.88 KB

Gah, had some non-UNIX line endings

Comment #9

12 March 2010 at 18:10

Status:

Needs review

» Needs work

The last submitted patch, linkchecker_CURL-380052-4.patch, failed testing.

Comment #10

janusman commented 12 March 2010 at 19:16

Status:

Needs work

» Needs review

A note: it *seems* the tests fail regardless of the patch, so this needs human review =)

Comment #11

hass commented 12 March 2010 at 20:56

If the tests pass without this patch but not with it, there may be bugs :-)))

Comment #12

sinasalek commented 14 March 2010 at 07:33

This module might be useful http://drupal.org/project/curl

Comment #13

hass commented 14 March 2010 at 12:20

+++ linkchecker.module Locally Modified (Based On 1.7.2.138)
@@ -137,20 +139,187 @@
+    // GET or HEAD?
+    if ($request->method == 'HEAD') {
+      curl_setopt($curl_handles[$index], CURLOPT_NOBODY, TRUE);
+    }
+    else {
+      curl_setopt($curl_handles[$index], CURLOPT_HTTPGET, TRUE);
+    }

By default we use HEAD, not GET. POST is not yet possible, we need to do some more stuff to get this working.
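A minimal sketch of the default described above, assuming a hypothetical helper `example_curl_method_options()` that is not part of the patch: only an explicit GET switches away from HEAD, and anything else (including a missing method) falls back to HEAD. The option names are returned as strings so the sketch runs without the cURL extension; a caller would presumably map them with `constant()` before passing them to `curl_setopt()`.

```php
<?php
/**
 * Hypothetical helper (not from the patch): picks the cURL option for the
 * request method, defaulting to HEAD as hass asks for.
 */
function example_curl_method_options($method = NULL) {
  $method = strtoupper((string) $method);
  if ($method === 'GET') {
    // Only an explicit GET fetches the response body.
    return array('CURLOPT_HTTPGET' => TRUE);
  }
  // Everything else, including an empty method, is treated as HEAD.
  // POST is deliberately not handled, matching the "not yet possible" note.
  return array('CURLOPT_NOBODY' => TRUE);
}

$options = example_curl_method_options();
```

With a real handle, each entry could be applied as `curl_setopt($handle, constant($name), $value)`.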

+++ linkchecker.module Locally Modified (Based On 1.7.2.138)
@@ -137,20 +139,187 @@
+      $responses = array(
+        100 => 'Continue', 101 => 'Switching Protocols',
+        200 => 'OK', 201 => 'Created', 202 => 'Accepted', 203 => 'Non-Authoritative Information', 204 => 'No Content', 205 => 'Reset Content', 206 => 'Partial Content',
+        300 => 'Multiple Choices', 301 => 'Moved Permanently', 302 => 'Found', 303 => 'See Other', 304 => 'Not Modified', 305 => 'Use Proxy', 307 => 'Temporary Redirect',
+        400 => 'Bad Request', 401 => 'Unauthorized', 402 => 'Payment Required', 403 => 'Forbidden', 404 => 'Not Found', 405 => 'Method Not Allowed', 406 => 'Not Acceptable', 407 => 'Proxy Authentication Required', 408 => 'Request Time-out', 409 => 'Conflict', 410 => 'Gone', 411 => 'Length Required', 412 => 'Precondition Failed', 413 => 'Request Entity Too Large', 414 => 'Request-URI Too Large', 415 => 'Unsupported Media Type', 416 => 'Requested range not satisfiable', 417 => 'Expectation Failed',
+        500 => 'Internal Server Error', 501 => 'Not Implemented', 502 => 'Bad Gateway', 503 => 'Service Unavailable', 504 => 'Gateway Time-out', 505 => 'HTTP Version not supported'
+      );

I'd like to move this into an extra function if not already available to reduce lines of code.
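One way the lookup could be pulled into its own function is sketched below. The name `example_http_reason_phrase()` is hypothetical, the table is abbreviated to a few entries from the hunk above, and the fallback to the class base code (e.g. 299 to 200) is a design choice of this sketch, not something the patch is known to do.

```php
<?php
/**
 * Hypothetical helper: returns the reason phrase for an HTTP status code,
 * or NULL when neither the code nor its class base is in the table.
 */
function example_http_reason_phrase($code) {
  // Abbreviated table; the full list is in the hunk quoted above.
  static $responses = array(
    100 => 'Continue',
    200 => 'OK',
    300 => 'Multiple Choices',
    301 => 'Moved Permanently',
    302 => 'Found',
    400 => 'Bad Request',
    403 => 'Forbidden',
    404 => 'Not Found',
    500 => 'Internal Server Error',
    503 => 'Service Unavailable',
  );
  $code = (int) $code;
  if (!isset($responses[$code])) {
    // Fall back to the base code of the class, e.g. 299 becomes 200.
    $code = (int) floor($code / 100) * 100;
  }
  return isset($responses[$code]) ? $responses[$code] : NULL;
}
```

Keeping the table in a `static` variable means it is built once per request rather than on every call.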

+++ linkchecker.module Locally Modified (Based On 1.7.2.138)
@@ -137,20 +139,187 @@
+          // If retry is set, fall back to drupal_http_request()
+          if ($retry) {
+            $result = drupal_http_request($result->headers['Location'], $headers, $method, $data, --$retry);
+            $result->redirect_code = $result->code;
+          }
+          $result->redirect_url = $location;

By this drupal_http_request() we have a blocking request as I know... :-(

Comment #14

hass commented 14 March 2010 at 12:22

The curl module from #12 should be reviewed to see whether it provides the same API/interface for installations with and without cURL.

Comment #15

janusman commented 22 March 2010 at 21:34

Status:

Needs review

» Needs work

The CURL module mentioned in #12 offers a fallback PHP-only version of CURL; however, it does not implement curl_multi commands which is the whole point of this patch =)

@hass: re #13:
1) could you clarify your first comment "By default we use HEAD, not GET. POST is not yet possible, we need to do some more stuff to get this working.".

2) Also, I will try to address your second comment... "I'd like to move this into an extra function if not already available to reduce lines of code."

3) I guess the third one is not a blocking issue (and doesn't really need addressing?)? "By this drupal_http_request() we have a blocking request as I know... :-("

Thanks

Comment #16

hass commented 24 May 2010 at 14:29

#1. Your else statement defaults to GET. This is logically incorrect; it needs to default to HEAD.

#3. I'm not sure... but if it blocks we don't have "non blocking requests". We could address this later if there is no way to solve it now...

Aside, should we try to depend on http://drupal.org/project/curl? I don't know the code, but it may make some things easier if people don't have curl!?

Comment #17

janusman commented 24 May 2010 at 16:04

Again, http://drupal.org/project/curl does not support the parallel fetches from the "real" CURL library.

Will look into #1... maybe #3 could be attacked somehow... ideas welcome =)

Comment #18

hass commented 27 December 2011 at 20:44

Version:	6.x-2.x-dev	» 7.x-1.x-dev
Status:	Needs work	» Needs review

D7 first

Comment #19

hass commented 19 January 2012 at 09:41

#8: linkchecker_CURL-380052-4.patch queued for re-testing.

Comment #20

19 January 2012 at 09:43

Status:

Needs review

» Needs work

The last submitted patch, linkchecker_CURL-380052-4.patch, failed testing.

Comment #21

janusman commented 24 January 2012 at 22:41

Heh, yep, definitely will need a reroll since patch is from prehistoric CVS days =)

Comment #22

hass commented 28 January 2012 at 13:15

http://drupal.org/project/httprl may also be an alternative to curl. It is written to be API-compatible with core's drupal_http_request(). In http://drupal.org/node/1272542 they plan to fall back to curl if stream wrappers are not available.

The good thing is that we get all possibilities in one module, have a consistent API, and don't need to care about any of this inside linkchecker. I love the idea, but have no experience with httprl yet.

Comment #23

hass commented 28 January 2012 at 14:10

@janusman: I guess you spent a lot of time on the curl patch... I'm sorry about this, as the HTTPRL module was a chance find for me, too. But it looks really promising!

Just made a test with 1000 URLs (~3 minutes) and it's really cool and conforms to the core API.

Per #732626: Supporting checking link via Ajax in node view we should first move the checking logic out of hook_cron(), and then we can make the HTTPRL module pluggable. Hopefully #1268096: Implement a rate limiter gets solved first, as I would not like to commit a patch before that. We can really overload any server with this feature.

Comment #24

hass commented 28 January 2012 at 16:33

Title:

Add cURL support with non-blocking parallel link checking

» Add support with non-blocking parallel link checking

Status	File	Size
new	380052_Add_support_with_non-blocking_parallel_link_checking_2012012801.patch	2.66 KB

The attached patch is a proof of concept. We need a UI so users can decide to stay with core or use the HTTPRL module (experimental). Changing the title to be more general.

Comment #25

janusman commented 8 February 2012 at 22:40

Hey! No problem. I'm all for using HTTPRL. =)

Comment #26

hass commented 12 February 2012 at 18:01

Category:	task	» feature
Status:	Needs work	» Needs review

Status	File	Size
new	380052_Add_support_with_non-blocking_parallel_link_checking_2012021201.patch	9.93 KB

Patch requires the latest HTTPRL DEV or the future v1.5. This is waiting for #1427958: Run callback after stream has been fetched, as the current logic is not as fast as expected because some request timeouts are slowing it down.

Comment #27

hass commented 12 February 2012 at 19:17

Status	File	Size
new	380052_Add_support_with_non-blocking_parallel_link_checking_2012021202.patch	10.46 KB

New patch

Comment #28

hass commented 12 February 2012 at 19:24

Status	File	Size
new	380052_Add_support_with_non-blocking_parallel_link_checking_2012021203.patch	10.38 KB

Forgot to remove one leftover workaround for previous broken HTTPRL versions.

Comment #29

kim.pepper commented 1 November 2012 at 03:59

Any update on this ticket? It has been 10 months without a review.

Comment #30

hass commented 17 December 2012 at 23:45

Status	File	Size
new	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking.patch	10 KB

Comment #31

hass commented 17 December 2012 at 23:48

Status	File	Size
new	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking.patch	10 KB

Fixed the link to httprl module.

Comment #32

mikeytown2 commented 17 December 2012 at 23:55

Status	File	Size
new	linkchecker-380052-31+Add+support+with+non-blocking+parallel+link+checking.patch	10.03 KB

+++ b/linkchecker.module
@@ -462,6 +503,7 @@
+    case   2: // HTTPRL: maximum allowed redirects exhausted.

Should be -2 not 2
http://drupalcode.org/project/httprl.git/blob/22dba7cfa825cb24a1fe8b5326...

I haven't tested either patch yet. This should be the only change in this patch in comparison to the patch in #30.

Comment #33

mikeytown2 commented 18 December 2012 at 00:12

Also I will hopefully be rolling out 1.8 of httprl before the new year. If you could test the latest dev to make sure it works as expected that would be appreciated. I've been running httprl 7.x-1.x-dev on prod with no issues but I'm one of many use cases.

Comment #34

hass commented 18 December 2012 at 00:32

Thanks a lot for the -2 hint. I missed this change. I'm currently testing the callback logic we discussed in #1589122: background_callback array with two parameters?. New patch will follow soon.

Comment #35

hass commented 18 December 2012 at 01:56

Status	File	Size
new	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback2.patch	11.31 KB

Latest patch attached with callback logic. This is a memory killer. No idea why.

I pushed only 560 links into the queue, and when cron exits after some time only 60 links have been checked with a thread limit of 4. CPU is at 100% and 1.5GB of memory is used by the Apache (httpd) process, which is not acceptable.

@task: add global httprl timeout. This needs to be calculated.

Comment #36

mikeytown2 commented 18 December 2012 at 01:59

+++ b/linkchecker.module
@@ -404,31 +404,62 @@
+        'background_callback' => array(
...
+            'function' => '_linkchecker_status_handling',

Change background_callback to callback, as this will use fewer threads and thus less memory. background_callback is really only useful if _linkchecker_status_handling eats up a lot of time (several seconds); if it is quick, callback is all you need.

@hass
Can you give me instructions on how to reproduce the test on a clean D7 install?

Comment #37

hass commented 18 December 2012 at 02:15

I can give "callback" a try tomorrow. I thought the background one was the one with less memory usage, as it frees memory after it has completed. How exactly can I decide between callback and background_callback?

It's very simple... Fill up the site with many links... I have 1500 test URLs. Then run cron with the web URL from the status page. That's really all.

I plan to create a final linkchecker release very soon. It would be great to get this done before then.

Comment #38

mikeytown2 commented 18 December 2012 at 04:01

Status	File	Size
new	linkchecker-380052-38-add-httprl.patch	11.53 KB

This was a fresh install of Drupal, so I wasn't able to reproduce the memory usage. But I was able to reproduce the CPU usage. Switching from background_callback to callback made all the difference. The attached patch is based on #35 and it should fix both of the reported issues :) I also moved drupal_set_time_limit around, as httprl only needs the extra time when httprl_send_request() is called.

background_callback can be very handy, but in this case it is the wrong tool. Internally background_callback is used with httprl_queue_background_callback(), which we could use here if you wish for linkchecker_cron() to have its own PHP process; right now linkchecker_cron() is a part of the whole cron process. The main advantage of putting linkchecker_cron() in its own PHP process is that it can fully utilize the 240 seconds available to cron, as it is the only thing running.

Comment #39

mikeytown2 commented 18 December 2012 at 04:14

Status	File	Size
new	linkchecker-380052-39-add-httprl-background.patch	12.31 KB

In this patch we have linkchecker_cron() run in the background. It's a quick hack; creating a new function instead of reusing linkchecker_cron() would be ideal. What I'm doing is creating a new process and executing linkchecker_cron(TRUE) outside of cron.

Both patches are running with the latest dev of httprl.

Comment #40

hass commented 18 December 2012 at 10:49

+++ b/linkchecker.module
@@ -385,7 +385,25 @@ function _linkchecker_link_block_ids($link) {
+  if ($has_httprl && !$background) {
+    // Setup callback options array; call linkchecker_cron(TRUE) in the background.
+    $callback_options = array(
+      array(
+        'function' => 'linkchecker_cron',
+      ),
+      TRUE
+    );
+    // Queue up the request.
+    httprl_queue_background_callback($callback_options);
+
+    // Execute request.
+    httprl_send_request();
+    return;
+  }
+

I'm sorry, but I do not really understand what you are doing here or how this works. I thought we were already running the link checks with httprl_send_request() after we filled the queue. Can you explain this a bit, please?

+++ b/linkchecker.module
@@ -395,40 +413,70 @@ function linkchecker_cron() {
+      if ($links_remaining == 0) {
+        // Make sure we have enough time to validate all of the links.
+        drupal_set_time_limit(240);
+
+        httprl_send_request();
+      }
...
+      _linkchecker_status_handling($response, $link);
...
+      if ((timer_read('page') / 1000) > ($max_execution_time / 2)) {

I'm fine with this drupal_set_time_limit() change. But as far as I know core already sets 240 seconds, and it is not set to 240 seconds again. This call is only there for the case that another module does not set it. E.g. core sets 240s and another module requires 60 seconds before linkchecker comes up in the sequence. Then we only have 180s left of the 240s. With timer_read('page') we can calculate the remaining seconds and make sure that httprl is able to execute _linkchecker_status_handling() before PHP times out. To be a bit safe, we should not use more than 220 seconds overall, or only 180s max. We need to set the global timeout for httprl to e.g. "220 - (timer_read('page') / 1000)" in the $options array to make httprl exit before PHP kills the process the hard way.
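The arithmetic above can be sketched as a small pure function. This is only an illustration: `example_remaining_timeout()` is a hypothetical name, the elapsed time is passed in as milliseconds (the unit implied by the `/ 1000` in the formula) instead of calling timer_read('page'), and the clamp to zero is an addition of this sketch.

```php
<?php
/**
 * Hypothetical helper: seconds left for httprl out of an overall budget.
 *
 * @param float $elapsed_ms
 *   Milliseconds already spent in this request.
 * @param float $budget_seconds
 *   Overall budget; 220 is the safety value suggested above.
 */
function example_remaining_timeout($elapsed_ms, $budget_seconds = 220) {
  $remaining = $budget_seconds - ($elapsed_ms / 1000);
  // Never hand a negative timeout to the HTTP client.
  return max(0, $remaining);
}

// Another module already used 60 seconds before linkchecker started.
$global_timeout = example_remaining_timeout(60000);
```

In the example, 60 seconds of prior work leave 160 seconds of the 220 second budget.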

Comment #41

hass commented 18 December 2012 at 12:13

Aside, do you know why the CPU load is so heavy (I have not tested the latest two patches)? Normally core drupal_http_request() is low on CPU with 0% load. I think httprl needs to be low on CPU, too. Most of the time we are waiting for the remote servers to answer... My best guess is that there are some very aggressive loops without a sleep that drive the CPU to 100%. That's all guessing, but what httprl does while it runs shouldn't be that CPU intensive, and the size of the URL array should not eat sooo much memory either. 32 sockets shouldn't be a problem, as a server should theoretically be able to open up to ~65536 ports. OK, there are timeouts blocking and we'd still like to be accessible from outside, but all is relative here... 32*240s = 7680 checks (I do not expect that many) and this is not that much :-)

Comment #42

mikeytown2 commented 18 December 2012 at 19:26

Can you explain this a bit, please?

Long story short, this runs linkchecker_cron in a new thread.

Long story below:
This is not required; if you look at the patch in #38, it does not use this bit of code. But what this does in #39 is run linkchecker_cron() in a new thread/process that is not tied to the cron thread/process in any way; this allows us to use as much time as we need, and we don't have to worry about other cron hooks running out of time, because linkchecker_cron() is now running in a new and independent thread/process.

At a lower level it sends off a non blocking http POST request to the path httprl_async_function_callback; inside the POST is the authentication information, the function to run, and the arguments to pass to that function. Most cron hooks don't return anything, and this is true for linkchecker_cron(); we can run it and ignore any return value. By doing this (no return value) HTTPRL will use a non-blocking request when making the POST to the path "httprl_async_function_callback"; we make it non-blocking by not defining a return key in the same array that the function key is defined.

+++ b/linkchecker.module
@@ -385,7 +385,25 @@ function _linkchecker_link_block_ids($link) {
+        'function' => 'linkchecker_cron',

Don't set 'return' here and it will be a non blocking POST request to a special URL (httprl_async_function_callback) that allows for us to run any function, if you have the master key and have the correct lock.
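The array shape described here can be sketched with a hypothetical builder, `example_callback_options()`. It only assembles plain arrays in the layout shown in the #39 hunk (a spec array with 'function', followed by the arguments) and does not call HTTPRL. The NULL stored under 'return' is a placeholder; what HTTPRL expects as that value is not shown in this thread and is an assumption here.

```php
<?php
/**
 * Hypothetical helper: builds a callback options array in the shape used in
 * the #39 hunk: a spec array holding 'function', followed by the arguments.
 */
function example_callback_options($function, array $args = array(), $want_return = FALSE) {
  $spec = array('function' => $function);
  if ($want_return) {
    // Assumption: per the explanation above, the presence of a 'return' key
    // next to 'function' is what makes the request blocking. The NULL value
    // is only a placeholder.
    $spec['return'] = NULL;
  }
  // Spec first, then the positional arguments for the called function.
  return array_merge(array($spec), $args);
}

// Fire and forget: no 'return' key, so nothing is waited for.
$fire_and_forget = example_callback_options('linkchecker_cron', array(TRUE));
```

The resulting `$fire_and_forget` has the same layout as `$callback_options` in the hunk quoted in #40.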

I'm fine with this drupal_set_time_limit() change.

The best option would be to move drupal_set_time_limit() outside of the loop. I wasn't sure what you were doing here, so I tried to keep the normal code flow exactly the same while only modifying the HTTPRL codepath.

do you know why the CPU load is so heavy

This has to do with the use of "background_callback" instead of "callback". background_callback issues a POST to the path "httprl_async_function_callback" and in our case runs the _linkchecker_status_handling function in a new thread. Doing this for every URL that is checked will eat up a lot of RAM and CPU because each new thread is a full drupal bootstrap. When callback is used (as in #38 & #39) the majority of the time is spent in usleep(); verified via cachegrind.

Comment #43

hass commented 18 December 2012 at 22:48

Thanks for all these details. I tried both of your patches. This is really heavy with 8 threads (<100% CPU, ~400MB httpd). Can we reduce CPU load?

I see httprl has a default global_timeout of 120s. Is this intentional? I would set it to 180s or so.

Moving the complete code outside of hook_cron() should be done for sure. There is some need to reuse the logic in other ways like checking one link with an ajax request. So we can do this for sure.

#39: ~1223 links with 8 threads in only one cron run... Dreams come true :-)))

Comment #44

hass commented 18 December 2012 at 23:01

Status	File	Size
new	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback3.patch	12.38 KB

This patch adds missing global_connections and global_timeout. You may remember I asked already to allow setting these global values with httprl_send_request(). Here we see how this could help. Now we add 8000 links to the queue and also 8000 times the global settings. This requires a lot more memory than needed and I'm also not able to set the global_timeout correctly. Is there a chance that this may change soon?

Comment #45

hass commented 18 December 2012 at 23:53

Status	File	Size
new	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback4.patch	9.87 KB

I've committed a cleanup patch to reduce the size of this patch and remove the stuff that is not really part of this new feature. New patch attached that also moves the code outside of hook_cron().

Comment #46

18 December 2012 at 23:54

Status:

Needs review

» Needs work

The last submitted patch, linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback4.patch, failed testing.

Comment #47

hass commented 18 December 2012 at 23:58

Status:

Needs work

» Needs review

Status	File	Size
new	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback5.patch	10.16 KB

Forgot to update the hook_cron() call in the drush file.

Comment #48

hass commented 19 December 2012 at 00:00

Status	File	Size
new	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback5.patch	10.16 KB

Suxxx windows line feeds

Comment #49

19 December 2012 at 00:02

Status:

Needs review

» Needs work

The last submitted patch, linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback5.patch, failed testing.

Comment #50

hass commented 19 December 2012 at 00:03

Status:

Needs work

» Needs review

#48: linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback5.patch queued for re-testing.

Comment #51

hass commented 19 December 2012 at 00:10

+++ b/linkchecker.module
@@ -418,12 +448,48 @@
+        'global_connections' => $linkchecker_check_connections_max,
+        'global_timeout' => 180 - (timer_read('page') / 1000),
...
+        httprl_send_request();

Looking for feedback on how we can move this into httprl_send_request($global_options);. Maybe HTTPRL 1.8 can support this?

Comment #52

mikeytown2 commented 19 December 2012 at 00:26

Sadly that would require a 2.0 release. I looked into it and I have too many assumptions baked into the current version. The rationale for 120 seconds is that it is 1/2 of 240.

Comment #53

mikeytown2 commented 19 December 2012 at 21:49

Status	File	Size
new	linkchecker-380052-53-add-httprl.patch	10.66 KB

Update to the patch in #48

Changes made:
- Moved drupal_set_time_limit out of the foreach loop.
- The $response variable in _linkchecker_status_handling is passed by reference.
- The $response variable is destroyed at the end of _linkchecker_status_handling in order to free memory. This should take care of the memory issues you're reporting.

Things to do:
Would like to move link checking logic out of the cron hook so the new thread/process will call something other than linkchecker_cron (having linkchecker_cron call linkchecker_cron is a quick hack). Make this new function have as few side effects as possible (don't call drupal_set_time_limit) so it can be used elsewhere without any time limits enforced (drush).

Edit: Looks like we do have a new function so I'll see what I can do in order to reduce the side effects.

Comment #54

hass commented 19 December 2012 at 22:10

Would like to move link checking logic out of the cron hook so the new thread/process will call something other than linkchecker_cron (having linkchecker_cron call linkchecker_cron is a quick hack). Make this new function have as few side effects as possible (don't call drupal_set_time_limit) so it can be used elsewhere without any time limits enforced (drush).

It's already changed in the last patch or have I misunderstood this? The function is named _linkchecker_check_links().

Looks like we do have a new function so I'll see what I can do in order to reduce the side effects

What side effects? The high CPU load? :-)

Comment #55

hass commented 19 December 2012 at 23:04

I'm afraid to commit this patch :-(. The latest one brings httpd up to 1.2GB of memory and 100% CPU with permanent load (the machine is non-responsive), and it runs with *only* 8 concurrent HEAD connections. Without the reference it looked smarter on the memory and CPU side.

HTTPRL needs to be made more performant and a lot less memory and CPU intensive first.

Aside, shouldn't we have a lock on the function call? Otherwise cron may be executed more than once, and this will start up N background processes, which will kill nearly every machine and check the same URLs N times. Is this something that HTTPRL needs to handle, or linkchecker? Any idea how we can add a lock?

Comment #56

hass commented 19 December 2012 at 23:09

We can use below:

http://api.drupal.org/api/drupal/includes%21lock.inc/function/lock_acqui...
http://api.drupal.org/api/drupal/includes%21lock.inc/function/lock_relea...

Comment #57

mikeytown2 commented 19 December 2012 at 23:21

I'll run this against a test server that I have with 30k links. The batch operation after clicking save on admin/config/content/linkchecker could take forever at this rate...

Comment #58

mikeytown2 commented 19 December 2012 at 23:23

I'll roll a patch with locking. This would be something that linkchecker needs to do, make sure only one instance of _linkchecker_check_links() is running.

Comment #59

hass commented 19 December 2012 at 23:23

I'm open to optimizations that work reliably... :-)

Comment #60

mikeytown2 commented 20 December 2012 at 00:05

Status	File	Size
new	linkchecker-380052-60-add-httprl.patch	10.98 KB

Here is the patch that I have. I've tested it against a small set of links (50) and I can't find any memory leaks and/or excessive memory usage.

Still waiting for the batch scan to be done.

Comment #61

mikeytown2 commented 21 December 2012 at 00:36

Status	File	Size
new	linkchecker-380052-61-add-httprl.patch	11.18 KB

Updated patch so it will ignore function timeout errors.
This one reports memory usage as well.

Comment #62

mikeytown2 commented 21 December 2012 at 01:48

Status	File	Size
new	linkchecker-380052-62-add-httprl.patch	11.21 KB

Adds in support for head_only

Comment #63

hass commented 21 December 2012 at 08:16

Status:

Needs review

» Needs work

What is head_only? The option for HEAD or GET is "method" only.

Comment #64

hass commented 21 December 2012 at 09:17

+++ b/linkchecker.module
@@ -386,31 +386,65 @@ function _linkchecker_link_block_ids($link) {
+  $max_execution_time = ini_get('max_execution_time') == 0 ? 240 : ini_get('max_execution_time');
+  if ($time_limit > 0) {
+    // Make sure we have enough time to validate all of the links.
+    drupal_set_time_limit($time_limit);
+  }
+  // Make sure this is the only process trying to run this function.
+  if (!lock_acquire(__FUNCTION__, max($max_execution_time, $time_limit))) {
+    return FALSE;
+  }

Why is there a need for two time limits? We already know from $max_execution_time the maximum time the script can run, don't we?

+++ b/linkchecker.module
@@ -386,31 +386,65 @@ function _linkchecker_link_block_ids($link) {
+  if (!lock_acquire(__FUNCTION__, max($max_execution_time, $time_limit))) {
+    return FALSE;
+  }

I'm going to add a watchdog call in here later to warn about the locked process. That may be useful information to have. :-)

This should be enough, shouldn't it?

/**
 * Run link checks.
 */
function _linkchecker_check_links() {
  // Get max_execution_time from configuration, override 0 with 240 seconds.
  $max_execution_time = ini_get('max_execution_time') == 0 ? 240 : ini_get('max_execution_time');
  // Make sure we have enough time to validate all of the links.
  drupal_set_time_limit($max_execution_time);

  // Make sure this is the only process trying to run this function.
  if (!lock_acquire(__FUNCTION__, $max_execution_time)) {
    return FALSE;
  }

Comment #65

hass commented 21 December 2012 at 09:31

Status:

Needs work

» Needs review

Status	File	Size
new	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback6.patch	11.25 KB

New patch

Comment #66

21 December 2012 at 09:32

Status:

Needs review

» Needs work

The last submitted patch, linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback6.patch, failed testing.

Comment #67

hass commented 21 December 2012 at 09:37

Status:

Needs work

» Needs review

Status	File	Size
new	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback7.patch	11.25 KB

Based on #61.

Removed $time_limit
Added some watchdogs

Comment #68

hass commented 21 December 2012 at 09:47

Status	File	Size
new	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback8.patch	11.33 KB

I do not know why, but the memory usage is not always logged on my machine.

Comment #69

hass commented 21 December 2012 at 09:49

Status	File	Size
new	linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback8.patch	11.32 KB

Comment #70

mikeytown2 commented 21 December 2012 at 20:10

The reason I added the $time_limit variable is drush. In #62, when drush calls _linkchecker_check_links(), it will force PHP to time out after 240 seconds even if we have that limit set to a higher value. The next patch will be based on #69 though :)
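To make the drush case concrete, here is a minimal sketch of how the two limits combine. The helper name example_effective_time_limit() is made up for illustration and is not part of linkchecker; it only mirrors the ini_get() fallback and the max() call from the patch hunks quoted in #64.

```php
<?php
/**
 * Hypothetical helper (not part of linkchecker): combine the configured
 * max_execution_time with an optional caller-supplied $time_limit.
 */
function example_effective_time_limit($ini_value, $time_limit = 0) {
  // A max_execution_time of 0 means "unlimited"; the patch replaces it
  // with 240 seconds.
  $max_execution_time = ((int) $ini_value === 0) ? 240 : (int) $ini_value;
  // Use whichever limit is larger, as in the lock_acquire() call of the patch.
  return max($max_execution_time, (int) $time_limit);
}

// Typical call site: read the current PHP setting, no explicit limit.
$limit = example_effective_time_limit(ini_get('max_execution_time'));
```

The PHP CLI defaults to a max_execution_time of 0, so without a separate $time_limit a drush run would always fall back to 240 seconds. Passing a larger $time_limit, for example 600, keeps the higher value.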

The answer for head_only is in comment #1869002-11: Extreme memory and cpu usage

Comment #71

mikeytown2 commented 21 December 2012 at 22:06

Status	File	Size
new	linkchecker-380052-71-add-httprl.patch	11.82 KB

Added head_only back in. If you fully understand what this does and still do not want it in, I will not add it back in future patches.
Added a check to see if the link->url is internal. If it is, limit HTTPRL to only 1 concurrent connection for that domain. I'm hoping this will take care of the memory and CPU usage issues you've seen.

Comment #72

hass commented 22 December 2012 at 00:52

OK, we found my local problem. :-( I have disabled all links to http://localhost and now everything runs smoothly. Memory usage is good (~80MB), CPU is mostly low, with short spikes of at most 25%. Sorry for the stress on this. But it clearly shows me that we can kill a remote server with linkchecker very quickly. I guess I will set this domain limit down to 2 threads, as the RFC originally suggests, and may allow overriding this via settings.php.

+++ b/linkchecker.module
@@ -386,31 +386,66 @@ function _linkchecker_link_block_ids($link) {
+  $max_execution_time = ini_get('max_execution_time') == 0 ? 240 : ini_get('max_execution_time');

@@ -418,26 +453,69 @@ function linkchecker_cron() {
+        'global_timeout' => 180 - (timer_read('page') / 1000),

This needs to move outside the foreach so that all requests get the same timeout, e.g. $max_execution_time minus 30s or 60s. If we know we are running in a separate process, we have 100% of the time and there is no need to read the page timer.
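A rough sketch of that suggestion, assuming a reserve of 30 seconds: the timeout is computed once and then reused for every request. The helper example_global_timeout() and the example URLs are hypothetical; only the 'global_timeout' option key is taken from the patch hunk above.

```php
<?php
/**
 * Hypothetical helper: seconds left for all requests, computed once.
 *
 * $elapsed_ms stays 0 when the check runs in its own process and the
 * page timer does not need to be read.
 */
function example_global_timeout($max_execution_time, $reserve = 30, $elapsed_ms = 0) {
  // Never return a negative timeout.
  return max(0, $max_execution_time - $reserve - ($elapsed_ms / 1000));
}

// Compute the timeout outside the loop ...
$global_timeout = example_global_timeout(240, 30);

// ... and give every request the same value.
$requests = array();
foreach (array('http://example.com/a', 'http://example.com/b') as $url) {
  $requests[$url] = array('global_timeout' => $global_timeout);
}
```

Computing the value inside the loop, as the hunk does, hands later requests a slightly smaller timeout than earlier ones, because timer_read('page') keeps growing.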

+++ b/linkchecker.module
@@ -418,26 +453,69 @@ function linkchecker_cron() {
+        'head_only' => TRUE,

Remove this; we are using the Range header.

+++ b/linkchecker.module
@@ -418,26 +453,69 @@ function linkchecker_cron() {
+      // If connecting to this server, limit connections to 1.
+      if (strpos($link->url, $base_http) === 0 || strpos($link->url, $base_https) === 0) {
+        $options += array(
+          'domain_connections' => 1,
+        );
+      }

I'm not a fan of committing this... this case just shows us that a box may be quickly overwhelmed by 8 threads at the same time, and at the end of the day the "per domain" limit may be too high with a thread limit of 8. :-)

+++ b/linkchecker.module
@@ -418,26 +453,69 @@ function linkchecker_cron() {
+  watchdog('linkchecker', 'Link check run completed.', array(), WATCHDOG_NOTICE);
+  watchdog('linkchecker', 'Memory usage: @memory_get_usage byte, Peak memory usage: @memory_get_peak_usage byte.', array('@memory_get_peak_usage' => memory_get_peak_usage(), '@memory_get_usage' => memory_get_usage()), WATCHDOG_NOTICE);

I do not see both entries in the watchdog logs. Maybe httprl_send_request() does not come back and is also not able to remove the named lock. This needs further investigation.

+++ b/linkchecker.module
@@ -456,6 +534,11 @@ function _linkchecker_status_handling($link, $response) {
+    case -4: // HTTPRL: httprl_send_request timed out.
+      // Skip these and try them again next cron run.
+      break;

Good catch. I planned to look into httprl to see if there may be more I have missed in past months...

Comment #73

mikeytown2 commented 22 December 2012 at 02:15

I do not see both entries in the watchdog logs. Maybe httprl_send_request() does not come back and is also not able to remove the named lock. This needs further investigation.

My guess is that something went fatal and killed PHP, or we ran out of time. On Windows boxes PHP has a lot less time to run due to how http://php.net/set-time-limit works. The other guess is safe mode. We can also try adding a call to drupal_set_time_limit() in the _linkchecker_status_handling() function. Putting lock_release and watchdog in a shutdown function might work as well.
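
As a rough sketch of the shutdown-function idea: the helper below builds a cleanup callback that runs at most once, so it can be called on the normal path and also registered with PHP's register_shutdown_function(). The names example_make_cleanup(), $release_lock and $log are hypothetical; in the module the two callbacks would presumably wrap lock_release() and watchdog(), which are replaced by stand-ins here so the snippet runs without Drupal.

```php
<?php
/**
 * Hypothetical helper: build a cleanup callback that runs at most once.
 */
function example_make_cleanup($release_lock, $log) {
  $done = FALSE;
  return function () use (&$done, $release_lock, $log) {
    if ($done) {
      // Already cleaned up on the normal path; nothing left to do.
      return;
    }
    $done = TRUE;
    $release_lock();
    $log(sprintf(
      'Memory usage: %d byte, Peak memory usage: %d byte.',
      memory_get_usage(),
      memory_get_peak_usage()
    ));
  };
}

$cleanup = example_make_cleanup(
  // Stand-in for releasing the named lock.
  function () {},
  // Stand-in for a watchdog() call.
  function ($message) { error_log($message); }
);

// Runs at script end, including when the script stops on a fatal error.
register_shutdown_function($cleanup);

// ... link checking would happen here ...

// Normal path: clean up right away; the shutdown call then does nothing.
$cleanup();
```

This would not help if the PHP process is killed outright, since shutdown functions never run in that case.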