For high-performance link checking we need to add cURL support with non-blocking parallel link checks. The plan is to be able to check up to 1000 links in parallel, if the hardware and the external network connection can handle it :-).

Comment | File | Size | Author
#77 | linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback12.patch | 12.46 KB | hass
#76 | linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback11.patch | 12.43 KB | hass
#75 | linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback10.patch | 12.42 KB | hass
#74 | linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback9.patch | 12.42 KB | hass
#71 | linkchecker-380052-71-add-httprl.patch | 11.82 KB | mikeytown2
#69 | linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback8.patch | 11.32 KB | hass
#68 | linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback8.patch | 11.33 KB | hass
#67 | linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback7.patch | 11.25 KB | hass
#65 | linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback6.patch | 11.25 KB | hass
#62 | linkchecker-380052-62-add-httprl.patch | 11.21 KB | mikeytown2
#61 | linkchecker-380052-61-add-httprl.patch | 11.18 KB | mikeytown2
#60 | linkchecker-380052-60-add-httprl.patch | 10.98 KB | mikeytown2
#53 | linkchecker-380052-53-add-httprl.patch | 10.66 KB | mikeytown2
#48 | linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback5.patch | 10.16 KB | hass
#47 | linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback5.patch | 10.16 KB | hass
#45 | linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback4.patch | 9.87 KB | hass
#44 | linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback3.patch | 12.38 KB | hass
#39 | linkchecker-380052-39-add-httprl-background.patch | 12.31 KB | mikeytown2
#38 | linkchecker-380052-38-add-httprl.patch | 11.53 KB | mikeytown2
#35 | linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking_callback2.patch | 11.31 KB | hass
#32 | linkchecker-380052-31+Add+support+with+non-blocking+parallel+link+checking.patch | 10.03 KB | mikeytown2
#31 | linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking.patch | 10 KB | hass
#30 | linkchecker_380052+Add+support+with+non-blocking+parallel+link+checking.patch | 10 KB | hass
#28 | 380052_Add_support_with_non-blocking_parallel_link_checking_2012021203.patch | 10.38 KB | hass
#27 | 380052_Add_support_with_non-blocking_parallel_link_checking_2012021202.patch | 10.46 KB | hass
#26 | 380052_Add_support_with_non-blocking_parallel_link_checking_2012021201.patch | 9.93 KB | hass
#24 | 380052_Add_support_with_non-blocking_parallel_link_checking_2012012801.patch | 2.66 KB | hass
#8 | linkchecker_CURL-380052-4.patch | 10.88 KB | janusman
#6 | linkchecker_CURL-380052-4.patch | 11.12 KB | janusman
#4 | linkchecker_CURL-380052-4.patch | 10.66 KB | janusman

Comments

hass’s picture

Assigned: » Unassigned

When I opened this case I hoped to have this implemented within a week using the approach from http://www.onlineaspect.com/2009/01/26/how-to-use-curl_multi-without-blo..., but the status code handling, error handling, and auto-updating of 301 links turned out to be quite a complex task, so I moved on to other work.

If someone would like to jump in here, please do... we definitely need this feature.
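For reference, the curl_multi approach from that article boils down to a loop along these lines. This is a minimal illustrative sketch, not the patch itself; the URLs, options, and timeout values are placeholders:

```php
<?php
// Minimal non-blocking curl_multi loop (illustrative sketch, not the patch).
$urls = array('http://example.com/', 'http://example.org/');
$mh = curl_multi_init();
$handles = array();

foreach ($urls as $i => $url) {
  $ch = curl_init($url);
  curl_setopt($ch, CURLOPT_NOBODY, TRUE);          // HEAD request by default.
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, FALSE); // Handle 301s ourselves.
  curl_setopt($ch, CURLOPT_TIMEOUT, 30);
  curl_multi_add_handle($mh, $ch);
  $handles[$i] = $ch;
}

// Drive all transfers in parallel without blocking on any single one.
do {
  $status = curl_multi_exec($mh, $active);
  if ($active) {
    // Wait for activity on any handle instead of busy-looping.
    curl_multi_select($mh, 1.0);
  }
} while ($active && $status == CURLM_OK);

foreach ($handles as $i => $ch) {
  $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
  // ... record $code, queue 301 targets for auto-updating, etc.
  curl_multi_remove_handle($mh, $ch);
  curl_close($ch);
}
curl_multi_close($mh);
```

The hard parts this issue wrestles with (status and error handling, redirect auto-update) all live in the "record $code" step, which is why the sketch stops there.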

hass’s picture

sinasalek’s picture

subscribed

janusman’s picture

Status: Active » Needs review
FileSize
10.66 KB

This patch adds cURL support to the link checker.

Some non-scientific benchmarks for 195 links tested:

1 simultaneous connection
120 sec (timed out when only 134 links were checked)
1.12 links/sec

10 Connections
55 sec (195 links checked)
3.54 links/sec

50 Connections
36 sec (195 links checked)
5.41 links/sec

Status: Needs review » Needs work

The last submitted patch, linkchecker_CURL-380052-4.patch, failed testing.

janusman’s picture

Status: Needs work » Needs review
FileSize
11.12 KB

Let's try that again.

Status: Needs review » Needs work

The last submitted patch, linkchecker_CURL-380052-4.patch, failed testing.

janusman’s picture

Status: Needs work » Needs review
FileSize
10.88 KB

Gah, had some non-UNIX line endings

Status: Needs review » Needs work

The last submitted patch, linkchecker_CURL-380052-4.patch, failed testing.

janusman’s picture

Status: Needs work » Needs review

A note: it *seems* the tests fail regardless of the patch, so this needs human review =)

hass’s picture

If the tests pass without this patch but not with it, there may be bugs :-)))

sinasalek’s picture

This module might be useful: http://drupal.org/project/curl

hass’s picture

+++ linkchecker.module Locally Modified (Based On 1.7.2.138)
@@ -137,20 +139,187 @@
+    // GET or HEAD?
+    if ($request->method == 'HEAD') {
+      curl_setopt($curl_handles[$index], CURLOPT_NOBODY, TRUE);
+    }
+    else {
+      curl_setopt($curl_handles[$index], CURLOPT_HTTPGET, TRUE);
+    }

By default we use HEAD, not GET. POST is not yet possible; we need to do some more work to get that working.

+++ linkchecker.module Locally Modified (Based On 1.7.2.138)
@@ -137,20 +139,187 @@
+      $responses = array(
+        100 => 'Continue', 101 => 'Switching Protocols',
+        200 => 'OK', 201 => 'Created', 202 => 'Accepted', 203 => 'Non-Authoritative Information', 204 => 'No Content', 205 => 'Reset Content', 206 => 'Partial Content',
+        300 => 'Multiple Choices', 301 => 'Moved Permanently', 302 => 'Found', 303 => 'See Other', 304 => 'Not Modified', 305 => 'Use Proxy', 307 => 'Temporary Redirect',
+        400 => 'Bad Request', 401 => 'Unauthorized', 402 => 'Payment Required', 403 => 'Forbidden', 404 => 'Not Found', 405 => 'Method Not Allowed', 406 => 'Not Acceptable', 407 => 'Proxy Authentication Required', 408 => 'Request Time-out', 409 => 'Conflict', 410 => 'Gone', 411 => 'Length Required', 412 => 'Precondition Failed', 413 => 'Request Entity Too Large', 414 => 'Request-URI Too Large', 415 => 'Unsupported Media Type', 416 => 'Requested range not satisfiable', 417 => 'Expectation Failed',
+        500 => 'Internal Server Error', 501 => 'Not Implemented', 502 => 'Bad Gateway', 503 => 'Service Unavailable', 504 => 'Gateway Time-out', 505 => 'HTTP Version not supported'
+      );

I'd like to move this into a separate function, if one isn't already available, to reduce the line count.
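Such a helper might look like this (hypothetical function name; the map is abbreviated here but would carry the full list quoted above):

```php
<?php
/**
 * Returns the standard HTTP/1.1 reason phrase for a status code.
 *
 * Hypothetical helper; mirrors the $responses map quoted above.
 */
function _linkchecker_http_status_text($code) {
  static $responses = array(
    100 => 'Continue', 101 => 'Switching Protocols',
    200 => 'OK', 201 => 'Created', 202 => 'Accepted',
    301 => 'Moved Permanently', 302 => 'Found', 304 => 'Not Modified',
    // ... remaining RFC 2616 codes as in the map above ...
    404 => 'Not Found', 410 => 'Gone',
    500 => 'Internal Server Error', 503 => 'Service Unavailable',
  );
  return isset($responses[$code]) ? $responses[$code] : 'Unknown';
}
```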

+++ linkchecker.module Locally Modified (Based On 1.7.2.138)
@@ -137,20 +139,187 @@
+          // If retry is set, fall back to drupal_http_request()
+          if ($retry) {
+            $result = drupal_http_request($result->headers['Location'], $headers, $method, $data, --$retry);
+            $result->redirect_code = $result->code;
+          }
+          $result->redirect_url = $location;

As far as I know, with this drupal_http_request() call we have a blocking request... :-(


hass’s picture

The curl module from #12 should be reviewed to check whether it provides the same API/interface for installations with and without cURL.

janusman’s picture

Status: Needs review » Needs work

The CURL module mentioned in #12 offers a fallback PHP-only implementation of cURL; however, it does not implement the curl_multi functions, which are the whole point of this patch =)

@hass: re #13:
1) could you clarify your first comment "By default we use HEAD, not GET. POST is not yet possible, we need to do some more stuff to get this working.".

2) Also, I will try to address your second comment... "I'd like to move this into an extra function if not already available to reduce lines of code."

3) I guess the third one is not a blocking issue (and doesn't really need addressing?)? "By this drupal_http_request() we have a blocking request as I know... :-("

Thanks

hass’s picture

#1. Your else statement defaults to GET. That logic is not correct; it needs to default to HEAD.

#3. I'm not sure... but if it blocks, we don't have "non-blocking requests". We could address this later if there is no other way to solve it...

Aside, should we try to depend on http://drupal.org/project/curl? I don't know the code, but it may simplify some things for people who don't have cURL!?
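For #1 above, the fix could be as simple as inverting the condition so anything that is not an explicit GET stays a HEAD request (a sketch against the quoted hunk; variable names as in the patch):

```php
<?php
// Default to HEAD; only use GET when explicitly requested (sketch).
if ($request->method == 'GET') {
  curl_setopt($curl_handles[$index], CURLOPT_HTTPGET, TRUE);
}
else {
  // HEAD is the default; POST is not supported yet.
  curl_setopt($curl_handles[$index], CURLOPT_NOBODY, TRUE);
}
```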

janusman’s picture

Again, http://drupal.org/project/curl does not support the parallel fetches of the "real" cURL library.

Will look into #1... maybe #3 could be attacked somehow... ideas welcome =)

hass’s picture

Version: 6.x-2.x-dev » 7.x-1.x-dev
Status: Needs work » Needs review

D7 first

hass’s picture

#8: linkchecker_CURL-380052-4.patch queued for re-testing.

Status: Needs review » Needs work

The last submitted patch, linkchecker_CURL-380052-4.patch, failed testing.

janusman’s picture

Heh, yep, definitely will need a reroll since patch is from prehistoric CVS days =)

hass’s picture

http://drupal.org/project/httprl may also be an alternative to cURL. It's written to be API-compatible with core drupal_http_request(). In http://drupal.org/node/1272542 they plan to fall back to cURL if stream wrappers are not available.

The good thing is that we get all the options in one module with a consistent API, and we don't need to handle any of this inside linkchecker. I love the idea, but I have no experience with httprl yet.

hass’s picture

@janusman: I guess you spent a lot of time on the cURL patch... I'm sorry about that; the HTTPRL module was a chance find for me, too. But it looks really promising!

I just made a test with 1000 URLs (~3 minutes), and it's really cool and API-conformant with core.

Per #732626: Supporting checking link via Ajax in node view, we should first move the checking logic out of hook_cron(), and then we can make the HTTPRL module pluggable. Hopefully #1268096: Implement a rate limiter gets solved first, as I'd rather not commit a patch before it is. We can really overload any server with this feature.
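A pluggable fallback could be as simple as gating on module_exists(). This is an illustrative sketch only, not the committed approach:

```php
<?php
// Use HTTPRL when available, fall back to core (illustrative sketch).
if (module_exists('httprl')) {
  // Queue the link for a parallel, non-blocking check.
  httprl_request($link->url, array('method' => 'HEAD'));
  // ... queue more links, then send them all at once:
  httprl_send_request();
}
else {
  // Core: one blocking request at a time.
  $response = drupal_http_request($link->url, array('method' => 'HEAD'));
}
```

Because HTTPRL is API-compatible with drupal_http_request(), the response handling downstream can stay mostly shared between the two paths.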

hass’s picture

Title: Add cURL support with non-blocking parallel link checking » Add support with non-blocking parallel link checking
FileSize
2.66 KB

The attached patch is a proof of concept. We need a UI so users can decide whether to stay with core or use the HTTPRL module (experimental). Changing the title to be more general.

janusman’s picture

Hey! No problem. I'm all for using HTTPRL. =)

hass’s picture

Category: task » feature
Status: Needs work » Needs review
FileSize
9.93 KB

The patch requires the latest HTTPRL dev or the future v1.5. This is waiting on #1427958: Run callback after stream has been fetched, as the current logic is not as fast as expected because some request timeouts block throughput.

hass’s picture

hass’s picture

I forgot to remove one leftover workaround for previously broken HTTPRL versions.

kim.pepper’s picture

Any update on this ticket? It has been 10 months without a review.

hass’s picture

hass’s picture

mikeytown2’s picture

+++ b/linkchecker.module
@@ -462,6 +503,7 @@
+    case   2: // HTTPRL: maximum allowed redirects exhausted.

Should be -2 not 2
http://drupalcode.org/project/httprl.git/blob/22dba7cfa825cb24a1fe8b5326...

I haven't tested either patch yet. This should be the only change in this patch in comparison to the patch in #30.

mikeytown2’s picture

Also, I will hopefully be rolling out 1.8 of httprl before the new year. If you could test the latest dev to make sure it works as expected, that would be appreciated. I've been running httprl 7.x-1.x-dev on prod with no issues, but mine is only one of many use cases.

hass’s picture

Thanks a lot for the -2 hint. I missed that change. I'm currently testing the callback logic we discussed in #1589122: background_callback array with two parameters?. A new patch will follow soon.

hass’s picture

Latest patch attached with callback logic. This is a memory killer. No idea why.

I pushed only 560 links into the queue, and when cron exited after some time, only 60 links had been checked with a thread limit of 4. CPU was at 100%, and 1.5GB of memory was used by the Apache (httpd) process, which is not acceptable.

@task: add a global httprl timeout. This needs to be calculated.

mikeytown2’s picture

+++ b/linkchecker.module
@@ -404,31 +404,62 @@
+        'background_callback' => array(
...
+            'function' => '_linkchecker_status_handling',

Change background_callback to callback, as this will use fewer threads and thus less memory. background_callback is really only useful if _linkchecker_status_handling eats up a lot of time (several seconds); if it is quick, callback is all you need.

@hass
Can you give me instructions on how to reproduce the test on a clean D7 install?
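The suggested switch to 'callback' might look like this. A sketch only: the option structure mirrors the hunks quoted earlier in the thread, not the final patch:

```php
<?php
// Queue each link with an in-process callback (sketch).
foreach ($links as $link) {
  httprl_request($link->url, array(
    'method' => 'HEAD',
    // 'callback' runs _linkchecker_status_handling() in this same
    // process once the response arrives; no extra thread, no extra
    // full Drupal bootstrap per URL.
    'callback' => array(
      array('function' => '_linkchecker_status_handling'),
      $link,
    ),
  ));
}
// Send all queued requests in parallel.
httprl_send_request();
```

The trade-off named above: background_callback spawns a new bootstrap per response, which only pays off when the handler itself takes seconds.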

hass’s picture

I can give "callback" a try tomorrow. I thought the background one was the one with lower memory usage, as it frees memory after it completes. How exactly do I decide between callback and background_callback?

It's very simple... fill up the site with many links... I have 1500 test URLs. Then run cron via the web URL from the status page. That's really all.

I plan to create a final linkchecker release very soon. It would be great to get this done before then.

mikeytown2’s picture

On a fresh install of Drupal I wasn't able to reproduce the memory usage, but I was able to reproduce the CPU usage. Switching from background_callback to callback made all the difference. The attached patch is based on #35 and should fix both of the reported issues :) I also moved drupal_set_time_limit() around, as httprl only needs the extra time when httprl_send_request() is called.

background_callback can be very handy, but in this case it is the wrong tool. Internally, background_callback is used with httprl_queue_background_callback(), which we could use here if you want linkchecker_cron() to have its own PHP process; right now linkchecker_cron() is part of the whole cron process. The main advantage of putting linkchecker_cron() in its own PHP process is that it can fully utilize the 240 seconds available to cron, as it is then the only thing running.

mikeytown2’s picture

In this patch linkchecker_cron() runs in the background. It's a quick hack; creating a new function instead of reusing linkchecker_cron() would be ideal. What I'm doing is creating a new process and executing linkchecker_cron(TRUE) outside of cron.

Both patches are running with the latest dev of httprl.

hass’s picture

+++ b/linkchecker.module
@@ -385,7 +385,25 @@ function _linkchecker_link_block_ids($link) {
+  if ($has_httprl && !$background) {
+    // Setup callback options array; call linkchecker_cron(TRUE) in the background.
+    $callback_options = array(
+      array(
+        'function' => 'linkchecker_cron',
+      ),
+      TRUE
+    );
+    // Queue up the request.
+    httprl_queue_background_callback($callback_options);
+
+    // Execute request.
+    httprl_send_request();
+    return;
+  }
+

I'm sorry, but I do not really understand what you are doing here or how this works. I thought we were already running the link checks with httprl_send_request() after filling the queue. Can you explain this a bit, please?

+++ b/linkchecker.module
@@ -395,40 +413,70 @@ function linkchecker_cron() {
+      if ($links_remaining == 0) {
+        // Make sure we have enough time to validate all of the links.
+        drupal_set_time_limit(240);
+
+        httprl_send_request();
+      }
...
+      _linkchecker_status_handling($response, $link);
...
+      if ((timer_read('page') / 1000) > ($max_execution_time / 2)) {

I'm fine with this drupal_set_time_limit() change. But as far as I know, core already sets 240 seconds, and it is not set again to 240 seconds later. This call is only there in case another module has lowered it; e.g. core sets 240s, and another module uses up 60 seconds before linkchecker's turn in the sequence, so we only have 180s of the 240s left. With timer_read('page') we can calculate the remaining seconds and make sure httprl is able to execute _linkchecker_status_handling() before PHP times out. To be a bit safe, we should not use more than 220 seconds overall, or only 180s max. We need to add the global timeout for httprl, e.g. "220 - (timer_read('page') / 1000)", to the $options array so that httprl exits before PHP kills the process the hard way.
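The timeout budget described above can be sketched as follows; the 20-second safety margin is an illustrative choice, not a value from the patch:

```php
<?php
// Leave headroom so HTTPRL exits before PHP's hard limit (sketch).
$max_execution_time = ini_get('max_execution_time') == 0 ? 240 : ini_get('max_execution_time');
$elapsed = timer_read('page') / 1000;           // Seconds already spent on this request.
$safety_margin = 20;                            // Stay below the hard PHP kill.
$global_timeout = min($max_execution_time, 240) - $safety_margin - $elapsed;

$options = array(
  'global_timeout' => max($global_timeout, 1),  // Never pass a non-positive timeout.
);
```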

hass’s picture

Aside, do you know why the CPU load is so heavy (I have not tested the latest two patches)? Normally core drupal_http_request() is light on CPU, with close to 0% load. I'd expect httprl to be low on CPU, too; most of the time we are just waiting for the remote servers to answer... My best guess is that there are some very aggressive loops without a sleep that drive the CPU to 100%. That's all guessing, but what httprl does while it runs shouldn't be that CPU-intensive, and the size of the URL array should also not eat so much memory. 32 sockets shouldn't be a problem either, as a server should theoretically be able to open up to ~65536 ports. Ok, timeouts block sockets and we'd still like to be reachable from outside, but all is relative here... 32 * 240s = 7680 checks (I do not expect that many), and this is not that much :-)

mikeytown2’s picture

Can you explain this a bit, please?

Long story short, this runs linkchecker_cron in a new thread.

Long story below:
This is not required, if you look at the the patch in #38 it does not use this bit of code. But, what this does in #39 is run linkchecker_cron() in a new thread/process that is not tied to the cron thread/process in any way; this allows for us to use as much time as we need and we don't have to worry about other cron hooks running out of time because linkchecker_cron() is now running in a new & independent thread/process.

At a lower level it sends off a non blocking http POST request to the path httprl_async_function_callback; inside the POST is the authentication information, the function to run, and the arguments to pass to that function. Most cron hooks don't return anything, and this is true for linkchecker_cron(); we can run it and ignore any return value. By doing this (no return value) HTTPRL will use a non-blocking request when making the POST to the path "httprl_async_function_callback"; we make it non-blocking by not defining a return key in the same array that the function key is defined.

+++ b/linkchecker.module
@@ -385,7 +385,25 @@ function _linkchecker_link_block_ids($link) {
+        'function' => 'linkchecker_cron',

Don't set 'return' here, and it will be a non-blocking POST request to a special URL (httprl_async_function_callback) that allows us to run any function, if you have the master key and the correct lock.

I'm fine with this drupal_set_time_limit() change.

The best option would be to move drupal_set_time_limit() outside of the loop. I wasn't sure what you were doing here, so I tried to keep the normal code flow exactly the same while only modifying the HTTPRL code path.

do you know why the CPU load is so heavy

This has to do with the use of "background_callback" instead of "callback". background_callback issues a POST to the path "httprl_async_function_callback" and in our case runs the _linkchecker_status_handling function in a new thread. Doing this for every URL that is checked will eat up a lot of RAM and CPU because each new thread is a full drupal bootstrap. When callback is used (as in #38 & #39) the majority of the time is spent in usleep(); verified via cachegrind.

hass’s picture

Thanks for all these details. I tried both of your patches. This is really heavy with 8 threads (<100% CPU, ~400MB httpd). Can we reduce CPU load?

I see httprl has a default global_timeout of 120s. Is this intentional? I would set it to 180s or so.

Moving the complete code outside of hook_cron() should definitely be done. There is a need to reuse the logic in other ways, like checking a single link with an Ajax request. So we can do this for sure.

#39: ~1223 links with 8 threads in only one cron run... Dreams come true :-)))

hass’s picture

This patch adds the missing global_connections and global_timeout. You may remember I already asked to allow setting these global values with httprl_send_request(). Here we see how that could help: now we add 8000 links to the queue and also 8000 copies of the global settings. This requires a lot more memory than needed, and I'm also not able to set global_timeout correctly. Is there a chance this may change soon?

hass’s picture

I've committed a cleanup patch to reduce the size of this patch and strip the parts that are not really part of this new feature. A new patch is attached that also moves the code outside of hook_cron().

Status: Needs review » Needs work
hass’s picture

I forgot to update the hook_cron() call in the drush file.

hass’s picture

Status: Needs review » Needs work
hass’s picture

Status: Needs work » Needs review
hass’s picture

+++ b/linkchecker.module
@@ -418,12 +448,48 @@
+        'global_connections' => $linkchecker_check_connections_max,
+        'global_timeout' => 180 - (timer_read('page') / 1000),
...
+        httprl_send_request();

Looking for feedback on how we can move this into httprl_send_request($global_options);. Maybe HTTPRL 1.8 can support this?

mikeytown2’s picture

Sadly that would require a 2.0 release. I looked into it, and I have too many assumptions baked into the current version. The rationale for 120 seconds is that it is half of 240.

mikeytown2’s picture

Update to the patch in #48

Changes made:
- Moved drupal_set_time_limit out of the foreach loop.
- The $response variable in _linkchecker_status_handling is passed by reference.
- The $response variable is destroyed at the end of _linkchecker_status_handling in order to free memory. This should take care of the memory issues you're reporting.

Things to do:
Would like to move the link-checking logic out of the cron hook so the new thread/process will call something other than linkchecker_cron() (having linkchecker_cron() call linkchecker_cron() is a quick hack). Make this new function have as few side effects as possible (don't call drupal_set_time_limit()) so it can be used elsewhere without any time limits enforced (drush).

Edit: Looks like we do have a new function so I'll see what I can do in order to reduce the side effects.

hass’s picture

Would like to move the link-checking logic out of the cron hook so the new thread/process will call something other than linkchecker_cron() (having linkchecker_cron() call linkchecker_cron() is a quick hack). Make this new function have as few side effects as possible (don't call drupal_set_time_limit()) so it can be used elsewhere without any time limits enforced (drush).

It's already changed in the last patch, or have I misunderstood? The function is named _linkchecker_check_links().

Looks like we do have a new function so I'll see what I can do in order to reduce the side effects

What side effects? The high CPU load? :-)

hass’s picture

I'm afraid to commit this patch :-(. The latest one brings httpd up to 1.2GB of memory and 100% CPU with permanent load (the machine is non-responsive), and that with *only* 8 concurrent HEAD connections. Without the reference it looked better on the memory and CPU side.

HTTPRL needs to be made more performant and a lot less memory- and CPU-intensive first.

Aside, shouldn't we have a lock on the function call? Otherwise cron may be executed more than once, and this will start N background processes, which will kill nearly every machine and check the same URLs N times. Is this something HTTPRL needs to handle, or linkchecker? Any idea how we can add a lock?

mikeytown2’s picture

I'll run this against a test server that I have with 30k links. The batch operation after clicking save on admin/config/content/linkchecker could take forever at this rate...

mikeytown2’s picture

I'll roll a patch with locking. This is something that linkchecker needs to do: make sure only one instance of _linkchecker_check_links() is running.

hass’s picture

I'm open-minded to optimizations that work reliably... :-)

mikeytown2’s picture

Here is the patch that I have. I've tested it against a small set of links (50), and I can't find any memory leaks or excessive memory usage.

Still waiting for the batch scan to be done.

mikeytown2’s picture

Updated patch so it will ignore function timeout errors.
This one reports memory usage as well.

mikeytown2’s picture

Adds support for head_only.

hass’s picture

Status: Needs review » Needs work

What is head_only? The option for HEAD or GET is "method" only.

hass’s picture

+++ b/linkchecker.module
@@ -386,31 +386,65 @@ function _linkchecker_link_block_ids($link) {
+  $max_execution_time = ini_get('max_execution_time') == 0 ? 240 : ini_get('max_execution_time');
+  if ($time_limit > 0) {
+    // Make sure we have enough time to validate all of the links.
+    drupal_set_time_limit($time_limit);
+  }
+  // Make sure this is the only process trying to run this function.
+  if (!lock_acquire(__FUNCTION__, max($max_execution_time, $time_limit))) {
+    return FALSE;
+  }

Why is there a need for two time limits? We know from $max_execution_time that this is the maximum time we can run a script, don't we?

+++ b/linkchecker.module
@@ -386,31 +386,65 @@ function _linkchecker_link_block_ids($link) {
+  if (!lock_acquire(__FUNCTION__, max($max_execution_time, $time_limit))) {
+    return FALSE;
+  }

I'm going to add a watchdog in here later to warn about the locked process. Maybe useful information to know. :-)

This should be enough, shouldn't it?

/**
 * Run link checks.
 */
function _linkchecker_check_links() {
  // Get max_execution_time from configuration, override 0 with 240 seconds.
  $max_execution_time = ini_get('max_execution_time') == 0 ? 240 : ini_get('max_execution_time');
  // Make sure we have enough time to validate all of the links.
  drupal_set_time_limit($max_execution_time);

  // Make sure this is the only process trying to run this function.
  if (!lock_acquire(__FUNCTION__, $max_execution_time)) {
    return FALSE;
  }

hass’s picture

New patch

Status: Needs review » Needs work
hass’s picture

Based on #61.

  • Removed $time_limit
  • Added some watchdogs
hass’s picture

I do not know why, but the memory usage is not always logged on my machine.

hass’s picture

mikeytown2’s picture

The reason I made the $time_limit variable is drush. In #62, when drush calls _linkchecker_check_links(), it will force PHP to time out after 240 seconds even if we have the limit set to a higher value. The next patch will be based on #69 though :)

The answer for head_only is in comment #1869002-11: Extreme memory and cpu usage.

mikeytown2’s picture

Added head_only back in. If you fully understand what it does and still do not want it, I will not add it back in future patches.
Added a check to see if the link->url is internal. If it is, limit HTTPRL to only 1 concurrent connection for that domain. I'm hoping this will take care of the memory and CPU usage issues you've seen.

hass’s picture

Ok, we found my local problem. :-( I have disabled all links to http://localhost, and now everything runs smoothly. Memory usage is good (~80MB), CPU mostly low, with short spikes of max 25%. Sorry for stressing this point. But it clearly shows that we can kill a remote server with linkchecker very quickly. I guess I will lower this domain limit to 2 threads, as the RFC originally suggests, and may allow overriding it via settings.php.

+++ b/linkchecker.module
@@ -386,31 +386,66 @@ function _linkchecker_link_block_ids($link) {
+  $max_execution_time = ini_get('max_execution_time') == 0 ? 240 : ini_get('max_execution_time');

@@ -418,26 +453,69 @@ function linkchecker_cron() {
+        'global_timeout' => 180 - (timer_read('page') / 1000),

This needs to move outside the foreach so all requests get the same value: $max_execution_time minus 30s or 60s. If we know we are running in an extra process, we have 100% of the time and there is no need to read the page timer.

+++ b/linkchecker.module
@@ -418,26 +453,69 @@ function linkchecker_cron() {
+        'head_only' => TRUE,

Remove this; we are using the Range header.

+++ b/linkchecker.module
@@ -418,26 +453,69 @@ function linkchecker_cron() {
+      // If connecting to this server, limit connections to 1.
+      if (strpos($link->url, $base_http) === 0 || strpos($link->url, $base_https) === 0) {
+        $options += array(
+          'domain_connections' => 1,
+        );
+      }

I'm not a fan of committing this... this case just shows us that a box may be quickly overwhelmed by 8 threads at the same time, and at the end of the day the "per domain" limit may be too high with a thread limit of 8. :-)

+++ b/linkchecker.module
@@ -418,26 +453,69 @@ function linkchecker_cron() {
+  watchdog('linkchecker', 'Link check run completed.', array(), WATCHDOG_NOTICE);
+  watchdog('linkchecker', 'Memory usage: @memory_get_usage byte, Peak memory usage: @memory_get_peak_usage byte.', array('@memory_get_peak_usage' => memory_get_peak_usage(), '@memory_get_usage' => memory_get_usage()), WATCHDOG_NOTICE);

I do not see both entries in the watchdog logs. Maybe httprl_send_request() does not come back and is also not able to remove the named lock. This needs further investigation.

+++ b/linkchecker.module
@@ -456,6 +534,11 @@ function _linkchecker_status_handling($link, $response) {
+    case -4: // HTTPRL: httprl_send_request timed out.
+      // Skip these and try them again next cron run.
+      break;

Good catch. I planned to look into httprl to see if there is more I have missed in past months...

mikeytown2’s picture

I do not see both entries in the watchdog logs. Maybe httprl_send_request() does not come back and is also not able to remove the named lock. This needs further investigation.

My guess is something went fatal and killed PHP, or we ran out of time. On Windows boxes PHP has a lot less time to run due to how http://php.net/set-time-limit works. The other guess is safe mode. We can also try adding a call to drupal_set_time_limit() in the _linkchecker_status_handling() function. Putting lock_release() and watchdog() in a shutdown function might work as well.
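The shutdown-function idea could be sketched like this (hypothetical helper name; the lock name matches the __FUNCTION__ convention used in the patches above):

```php
<?php
/**
 * Shutdown handler: release the lock and log stats even on timeout (sketch).
 */
function _linkchecker_check_links_shutdown() {
  lock_release('_linkchecker_check_links');
  watchdog('linkchecker', 'Link check run ended. Peak memory usage: @peak bytes.',
    array('@peak' => memory_get_peak_usage()), WATCHDOG_NOTICE);
}

// Registered before the run starts, so it fires even if PHP hits its
// time limit or dies fatally mid-run.
drupal_register_shutdown_function('_linkchecker_check_links_shutdown');
```

This would explain the missing watchdog entries: code placed after httprl_send_request() never runs if the process is killed, while a shutdown handler still does.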

hass’s picture

Last patch, this one will be committed if all lights are green.

hass’s picture

Removed 1 from Number of simultaneous connections.

hass’s picture

hass’s picture

HTTP Parallel Request Library is now named HTTP Parallel Request & Threading Library

hass’s picture

Status: Needs review » Fixed

@mikeytown2: It was a pleasure to work with you on this feature. Thanks a lot for all your help!!!

D7: http://drupalcode.org/project/linkchecker.git/commit/b56dda1

mikeytown2’s picture

I like how thorough you are. I enjoyed working with you as well :)

hass’s picture

Version: 7.x-1.x-dev » 6.x-2.x-dev
Assigned: Unassigned »
Status: Fixed » Patch (to be ported)
hass’s picture

Status: Patch (to be ported) » Fixed

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.