I'm pushing 530 URLs into the queue and starting concurrent URL checks with background process calls. The thread limit is ONLY 4. CPU spikes up to 100%, the machine becomes unresponsive, and memory spikes to 400+MB, growing very fast up to 1.5GB... after some time cron finishes and only 60 links have been checked.

This is not what I expected and it's not usable :-(.

I'm posting the latest linkchecker patch in #380052: Add support for non-blocking parallel link checking.

Comments

HTTPRL uses usleep() and the $tv_usec parameter of stream_select() in its event loop in order to not eat up CPU time. Have you tried this with the latest dev? I cannot reproduce what you are reporting after processing 5k URLs (using the latest dev of HTTPRL).
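To illustrate the idea, here is a minimal sketch of such a loop (not HTTPRL's actual code; the URL and timeout values are made up). stream_select() blocks in the kernel until a socket becomes readable or the tv_sec/tv_usec timeout expires, so the loop does not spin at 100% CPU:

<?php
// Minimal sketch of a non-busy event loop (illustration only).
$socket = stream_socket_client('tcp://example.com:80', $errno, $errstr, 30);
if ($socket === FALSE) {
  die("Connection failed: $errstr");
}
stream_set_blocking($socket, 0);
fwrite($socket, "HEAD / HTTP/1.0\r\nHost: example.com\r\n\r\n");
$response = '';
while (TRUE) {
  $read = array($socket);
  $write = $except = NULL;
  // Wait up to half a second (tv_sec = 0, tv_usec = 500000).
  $ready = stream_select($read, $write, $except, 0, 500000);
  if ($ready === FALSE) {
    break;
  }
  if ($ready === 0) {
    // Nothing readable yet; back off briefly instead of busy-waiting.
    usleep(10000);
    continue;
  }
  $chunk = fread($socket, 8192);
  if ($chunk === '' || $chunk === FALSE) {
    // Remote side closed the connection.
    break;
  }
  $response .= $chunk;
}
fclose($socket);
?>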

Title: Possible memory leak » Extreme memory and cpu usage
Version: 7.x-1.7 » 7.x-1.x-dev

I've tested with DEV. After send_request has been executed, the load spikes. Just creating/queueing the array of links is fine, but once it starts checking, CPU load sits at 100% for some time. Then it goes down to ~25% until it finishes. Memory load is extreme. I'm not sure when I'll find time to debug the source of this in httprl. What's your memory load and CPU?

CPU usage bounces around between 0% and 5% (via top).
Memory usage is flat.
I was able to check 1,240 URLs in a single cron run when "Number of simultaneous connections" is set to 128. The function times out after that (hits the 3-minute limit).

I put this at the bottom of _linkchecker_check_links().

<?php
watchdog('mem-used', memory_get_peak_usage() . " <br>\n" . memory_get_usage());
?>

Output from that:
37525176 = 36MB (peak usage)
10451392 = 10MB (current usage)

What version of PHP and what OS are you on?

Status: Active » Needs review
File (new): 1.75 KB

You can try this patch out; it increases the waiting periods inside of the loop.

My only guess for the memory usage is a GET request that returns a lot of data. I could make an option that kills the connection once we have the headers if this is what is causing the memory usage.

Status: Needs review » Active
File (new): 1.55 KB

This patch has been committed to 6.x & 7.x. It adds a new option called head_only.

It allows us to get only the headers even when using methods like GET. I'm guessing that one of the links in your database points to a URL with a lot of bytes to download and it only supports GET. This addition should make your use case more efficient.
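For illustration, usage would look roughly like this (a sketch based on this comment; the URL is made up, and the exact option handling should be checked against httprl.module):

<?php
// Sketch: queue a GET request but have HTTPRL close the connection as soon
// as the headers have arrived. 'head_only' is the option described above;
// the other options mirror drupal_http_request(). Hypothetical URL.
httprl_request('http://example.com/large-download.iso', array(
  'method' => 'GET',
  'head_only' => TRUE,
));
$responses = httprl_send_request();
?>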

Please roll back. The option is "method", which sets HEAD or GET. If I request with GET, I also set "Range" to request only the first 1024 bytes. You may remember this from past discussions.

See linkchecker:

<?php
// Range: Only request the first 1024 bytes from remote server. This is
// required to prevent timeouts on URLs that are large downloads.
if ($link->method == 'GET') {
  $headers['Range'] = 'bytes=0-1024';
}
?>


Please note that, as far as I know, 100% of my tests have been HEAD requests.

EDIT: There have been only 3 GET requests in 1500 URLs.

Status: Active » Needs work

Windows 7 SP1 (x64)
PHP 5.3.1
MySQL 5.0.67

File (new): 9.13 KB

100% CPU on 4 cores:
2012-12-21_101029.png

I'll explain how head_only works; I don't think I need to roll this back. PS: did you look at the patch in #5?

head_only = FALSE:
A GET request goes out with a range request of 0-1024. The server decides to ignore the range request and send everything back. And yes, this is valid: http://stackoverflow.com/questions/720419/how-can-i-find-out-whether-a-s...

head_only = TRUE:
A GET request goes out with a range request of 0-1024. The server decides to ignore the range request and send everything back. HTTPRL closes the connection after we have gotten the HTTP headers.

The other idea on why your server's CPU usage is going up is that linkchecker is checking internal links (hitting your own server). You have a 4-core box; 4 concurrent requests to your box can in theory make it hit 100% CPU usage across all cores. If this is the case, we can set the domain connection limit for localhost to 1 instead of the default of 8. This would also explain the memory usage.
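As a sketch of that localhost override (assuming HTTPRL accepts a per-request 'domain_connections' option, the setting name used later in this thread; the URL is made up):

<?php
// Sketch: cap concurrent connections to the local host at 1 so that
// self-referencing links can not saturate all four cores. Verify the
// exact option name and location in httprl.module.
httprl_request('http://localhost/node/123', array(
  'method' => 'HEAD',
  'domain_connections' => 1,
));
httprl_send_request();
?>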

head_only = TRUE:
A GET request goes out with a range request of 0-1024. The server decides to ignore the range request and send everything back. HTTPRL closes the connection after we have gotten the HTTP headers.

This does not sound standards-conformant to me. Please see #1426854: Enforce Range headers on the client side. If I add a Range header, I set the byte range in linkchecker, and the HTTPRL module is not allowed to do anything I have not specified. HTTPRL will never receive more than 1024 bytes from the remote server if I set $headers['Range'] = 'bytes=0-1024'. There is no need to add any extra options like head_only to the module. If you'd like to add any extra logic, you need to look for the Range header and depend on that one; you should not create your own non-standard option.

Note that I could also specify a byte range of $headers['Range'] = 'bytes=2000-6543' in linkchecker if I need exactly these bytes to be downloaded. I hope this makes clear that we are not talking about the first 1024 bytes only. Range is a generic HTTP feature, mostly used by download accelerators to download packets in parallel and to resume broken downloads, to name a few examples. I'm only "abusing" it here to prevent a full file download. I could have written $headers['Range'] = 'bytes=0-1', too.

I would strongly suggest a rollback. I also see no byte range in patch #5. It may be better to document the Range header somewhere.

The other idea on why your server's CPU usage is going up is that linkchecker is checking internal links (hitting your own server).

You are correct. HTTPRL has hit my own server. I'm sorry for this. I have disabled all localhost links and now all is good on my box. HTTPRL runs with ~80MB memory and quite normal CPU load. However, I do not see the watchdog entries. I'm not sure if something exits in httprl so the process never comes back to linkchecker to save the watchdog entries and remove the named lock.

HTTPRL will never receive more than 1024 bytes from the remote server if I set $headers['Range'] = 'bytes=0-1024'

Sadly, this is not correct. Some servers do not implement Range, or have intentionally disabled it because of past security vulnerabilities.
http://serverfault.com/questions/304859/apache-disable-range-requests-di...

Parsing the Range header and killing the connection once the downloaded data exceeds that limit (when the response is not a 206) does sound like a better option; it just requires more work. I'll work on this :) Reading up on this to see what I need to do in order to parse the Range header: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35

Sounds more than perfect :-) looking forward to this...

Changing the default domain_connections from 8 to 2 might be a good idea. I got 8 by taking the highest value (IE 10) among modern browsers; all other modern browsers are at 6. This setting could be related to #1837776: Background requests going to the correct IP but wrong Host.

Range Header:
Looks like I need to be able to parse these and get the byte count (see the sketch after this list).

bytes=0-1024
// 1025 bytes (range ends are inclusive).
bytes=28-175,382-399,510-541,644-744,977-980
// 148 + 18 + 32 + 101 + 4 = 303 bytes.
bytes=500-
// All but the first 500 bytes (can not short circuit this case).
bytes=-500
// Only the last 500 bytes (can not short circuit this case).
bytes=0-0,-1
// First and last bytes of the request (can not short circuit this case).
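
A minimal sketch of that parsing (a hypothetical helper, not HTTPRL code): it returns the highest byte offset a multi-range header needs, or FALSE for the open-ended cases above that can not be short-circuited.

<?php
/**
 * Hypothetical helper: return the highest byte offset a Range header
 * needs, or FALSE if the download can not be short circuited (open-ended
 * "500-" ranges and suffix "-500" ranges depend on the total length).
 */
function range_header_last_byte($range) {
  if (strpos($range, 'bytes=') !== 0) {
    return FALSE;
  }
  $last = 0;
  foreach (explode(',', substr($range, 6)) as $spec) {
    // Only "first-last" specs with both ends given can be bounded.
    if (!preg_match('/^(\d+)-(\d+)$/', trim($spec), $matches)) {
      return FALSE;
    }
    $last = max($last, (int) $matches[2]);
  }
  return $last;
}

// 980: stop reading the body after byte 980 if the server sends a 200.
var_dump(range_header_last_byte('bytes=28-175,382-399,510-541,644-744,977-980'));
// FALSE: needs the total length, so it can not be short circuited.
var_dump(range_header_last_byte('bytes=500-'));
?>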

Response back from a request with lots of commas looks like this:
URL: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
Request modified via https://addons.mozilla.org/en-us/firefox/addon/modify-headers/
Range: bytes=28-175,382-399,510-541,644-744,977-980

Headers Back:

HTTP/1.1 206 Partial Content
Date: Sat, 22 Dec 2012 01:28:56 GMT
Server: Apache/2
Last-Modified: Wed, 01 Sep 2004 13:24:52 GMT
Etag: "1edec-3e3073913b100"
Accept-Ranges: bytes
Content-Length: 850
Cache-Control: max-age=21600
Expires: Sat, 22 Dec 2012 07:28:56 GMT
p3p: policyref="http://www.w3.org/2001/05/P3P/p3p.xml"
Connection: close
Content-Type: multipart/byteranges; boundary=4d166e345bc6752

Content:

--4d166c4bc2e655a
Content-type: text/html; charset=iso-8859-1
Content-range: bytes 28-175/126444
"-//W3C//DTD XHTML 1.0 Strict//EN"
     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns='http://www.w3.org/1999/xhtml'>
<head><titl
--4d166c4bc2e655a
Content-type: text/html; charset=iso-8859-1
Content-range: bytes 382-399/126444
d='sec14'>14</a> H
--4d166c4bc2e655a
Content-type: text/html; charset=iso-8859-1
Content-range: bytes 510-541/126444
header fields. For entity-heade
--4d166c4bc2e655a
Content-type: text/html; charset=iso-8859-1
Content-range: bytes 644-744/126444
sends and who receives the entity.
</p>
<h3><a id='sec14.1'>14.1</a> Accept</h3>
<p>
   The Accept
--4d166c4bc2e655a
Content-type: text/html; charset=iso-8859-1
Content-range: bytes 977-980/126444
n th
--4d166c4bc2e655a--

Being able to glue this back together (multipart decoding) will be a feature in the future. Luckily for us, none of this matters here: I just need to figure out the last byte I need and stop downloading past that point when I do a range request and get a 200 instead of a 206.

The thing with enforcing the byte range is that I will be taking a 200 and converting it to a 206. Something like this should require a setting to turn it on, as it might be unexpected behavior.
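A sketch of that enforcement (my illustration, not the committed code): once the body passes the last byte the Range header asked for, close the socket and truncate. $socket and $last_byte are assumed inputs here.

<?php
// Sketch: client-side range enforcement. $socket is the open connection,
// $last_byte comes from parsing the Range header (e.g. 1024 for
// 'bytes=0-1024'), and the server replied 200 instead of 206.
$body = '';
while (!feof($socket)) {
  $body .= fread($socket, 8192);
  if (strlen($body) >= $last_byte + 1) {
    // We have every byte the caller asked for; closing the connection
    // here is what effectively turns the 200 into a 206.
    $body = substr($body, 0, $last_byte + 1);
    break;
  }
}
fclose($socket);
?>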

Content-Range and Accept-Ranges may make it possible to identify whether the server allows ranges. I have never seen status code 416 (Requested Range Not Satisfiable), but it seems to exist.

See the second-to-last comment in http://stackoverflow.com/questions/2209204/parsing-http-range-header-in-php . Fully untested myself :-)

http://www.ietf.org/rfc/rfc2616.txt

Clients that use persistent connections SHOULD limit the number of
simultaneous connections that they maintain to a given server. A
single-user client SHOULD NOT maintain more than 2 connections with
any server or proxy. A proxy SHOULD use up to 2*N connections to
another server or proxy, where N is the number of simultaneously
active users. These guidelines are intended to improve HTTP response
times and avoid congestion.

Status: Needs work » Active
File (new): 2.43 KB

Going to re-open #1426854: Enforce Range headers on the client side and use that for the location of the range header.

This patch has been committed. It reverts the patch in #5 and fixes documentation for one of the error constants.

Status: Active » Fixed
File (new): 437 bytes

Committing this patch so I can close this issue. The patch changes domain_connections from 8 to 2.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.