If there isn't one already, it would be great to be able to supply an array of allowed content types, or a way to limit requests to only text/html, so that if the URL is a PDF or ZIP file, etc., it doesn't try to download it.

It would also be good to be able to specify a maximum content length, so it doesn't try to download a 500 MB file, for example.


Comments

mikeytown2’s picture

Being able to enforce a max content length is difficult due to chunked transfer encoding. The connection will time out, so I'm not overly worried about this one.

Having an allowed_content_types array is possible. Feel free to add it to the options under httprl_request(); be sure not to set a default in this case. It would get enforced in httprl_send_request().
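
As a sketch of how that could look from the caller's side ('allowed_content_types' is purely a hypothetical option name here; nothing like it exists in httprl yet, and the check would live in httprl_send_request() as noted above):

$urls = array('http://example.com/');
// Hypothetical usage: only keep responses whose content type is HTML.
httprl_request($urls, array(
  'method' => 'GET',
  'allowed_content_types' => array('text/html', 'application/xhtml+xml'),
));
$results = httprl_send_request();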

hass’s picture

In linkchecker I added 'Range' support a few weeks ago. I mostly use HEAD (no range required), but users are able to use GET, and then I force downloading only the first 0-1024 bytes... This works great. This is just a hint at how you might solve it for every content type and any server that supports ranges (which should be normal today).

Just keep in mind you will not get "200 OK"; it's "206 Partial Content".

mikeytown2’s picture

See httprl_send_request(). I already read 1024-byte chunks until I have all the headers. If I see a redirect, I kill the connection right there. This is a fairly easy problem to solve for HTTPRL.

hass’s picture

I haven't understood the details behind your chunk logic. When I implemented it I was not aware of httprl; I wrote it for core, to prevent 500 MB or 5 GB downloads with the GET method :-). But without a range limit httprl has to download everything, which is correct.

mikeytown2’s picture

In terms of limiting the number of bytes downloaded: that can be done. Now that I think about the requirements for a link checker, this would be a nice feature. I can limit the total bytes transferred, just not the message size, due to chunked transfer encoding.

hass’s picture

Nothing required from you... The 'Range' header is the way all link-checking modules should go if they need to limit the bytes transferred. It's the standard way web servers work. Why should we add any other stuff? :-)

mikeytown2’s picture

Ah nice, just set the Range header: http://stackoverflow.com/questions/716680/difference-between-content-ran....

The other option is the Accept header, but most web servers seem to ignore it; hence the need for the array of content types we wish to download.

I should add some of the more useful headers to the documentation of httprl_request().
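
For example, a ranged GET through httprl could look roughly like the following. This is only a sketch based on the options used elsewhere in this issue; the assumption that each result mirrors drupal_http_request() and exposes ->code is mine.

$urls = array('http://example.com/');
// Request only the first 1 KB of each URL; a server that honors the Range
// header should reply with "206 Partial Content" instead of "200 OK".
httprl_request($urls, array(
  'method' => 'GET',
  'headers' => array('Range' => 'bytes=0-1023'),
));
$results = httprl_send_request();
foreach ($results as $url => $result) {
  // Assumption: results expose ->code like drupal_http_request() objects do.
  if ($result->code == 206) {
    // The server honored the range; only the requested bytes were downloaded.
  }
}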

hass’s picture

     // Range: Only request the first 1024 bytes from remote server. This is
     // required to prevent timeouts on URLs that are large downloads.
     if ($link->method == 'GET') { $headers['Range'] = 'bytes=0-1024'; }
mikeytown2’s picture

Status: Active » Postponed (maintainer needs more info)

This has been committed. If servers do not respect the Accept header, let me know; I'll implement strict enforcement if needed. http://www.gethifi.com/blog/browser-rest-http-accept-headers
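
As a sketch, sending an Accept header is just another custom header on the request; whether the server respects it is up to the server, which is why strict enforcement may still be needed:

$urls = array('http://example.com/');
// Ask the server for HTML only. Many servers ignore this and send whatever
// content type the URL actually points at, hence the caveat above.
httprl_request($urls, array(
  'method' => 'GET',
  'headers' => array('Accept' => 'text/html'),
));
$results = httprl_send_request();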

mikeytown2’s picture

Status: Postponed (maintainer needs more info) » Closed (works as designed)

Closing this issue. Open to patches though.

mikeytown2’s picture

Status: Closed (works as designed) » Active

I will be creating a patch for servers that do not accept the Range header. This is useful because anything downloaded in httprl gets loaded into memory; requesting a URL that returns a lot of data could cause PHP to run out of memory. In this case, if the Range header is sent out and the server replies with a 200 instead of a 206, httprl will download up to the last byte of data needed to fulfill the range request and then close the connection, turning the 200 back into a 206.

Here are the functions I have so far for parsing the Range header:

/**
* Parse a range header into start and end byte ranges.
*
* @param $input
*   String in the form of bytes=0-1024 or bytes=0-1024,2048-4096
* @return array
*   Array of keyed arrays containing 'start' and 'end' values for the byte
*   ranges. Empty array if the string cannot be parsed.
*/
function httprl_get_ranges($input) {
  $ranges = array();
  // Make sure the input string matches the correct format.
  $string = preg_match('/^bytes=((\d*-\d*,? ?)+)$/', $input, $matches) ? $matches[1] : FALSE;
  if (!empty($string)) {
    // Handle multiple ranges.
    foreach (explode(',', $string) as $range) {
      // Get the start and end byte values for this range.
      $values = explode('-', trim($range));
      if (count($values) != 2) {
        // Malformed range; treat the whole string as unparsable.
        return array();
      }
      $ranges[] = array('start' => $values[0], 'end' => $values[1]);
    }
  }
  return $ranges;
}

/**
* Given an array of ranges, get the last byte we need to download.
*
* @param $ranges
*   Multidimensional array of ranges as returned by httprl_get_ranges().
* @return int or NULL
*   NULL: Get all values; int: last byte to download.
*/
function httprl_get_last_byte_from_range($ranges) {
  $max = 0;
  if (empty($ranges)) {
    return NULL;
  }
  foreach ($ranges as $range) {
    if (!is_numeric($range['start']) || !is_numeric($range['end'])) {
      return NULL;
    }
    $max = max($range['end'], $max);
  }
  return $max;
}
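
A quick sketch of how these two helpers could be combined to trim an over-long 200 response down to the requested range; $response_data here is just an illustrative stand-in for a downloaded body, not an httprl internal:

// Parse the Range header we sent and work out the last byte we need.
$ranges = httprl_get_ranges('bytes=0-1023,2048-4095');
$last_byte = httprl_get_last_byte_from_range($ranges);
// Stand-in for a full 200 response body from a server that ignored Range.
$response_data = str_repeat('x', 10000);
// Keep only the bytes up to the last one requested and drop the rest; at
// this point the connection can be closed and the 200 treated as a 206.
if (!is_null($last_byte) && strlen($response_data) > $last_byte + 1) {
  $response_data = substr($response_data, 0, $last_byte + 1);
}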
mikeytown2’s picture

This will go in the 1.9 release.

mikeytown2’s picture

Title: Specify allowed content types » Enforce Range headers on the client side
Status: Active » Fixed

Range headers are now strict. If a 200 is returned when a 206 was expected, httprl will turn the 200 into a 206 if that will allow us to cut the connection to the server sooner.

This patch has been committed.

mikeytown2’s picture

Forgot to make sure this only runs on a GET request.

This patch has been committed to 6.x & 7.x.

mikeytown2’s picture

Status: Fixed » Needs work

Tested and this breaks with chunked transfer encoding.

$urls = array(
  'http://www.jobrobot.de/',
);

httprl_request($urls, array(
  'version' => 1.1,
  'method' => 'GET',
  'max_redirects' => 0,
  'async_connect' => TRUE,
  'headers' => array(
    'Range' => 'bytes=0-1,2-3',
    'Transfer-Encoding' => 'chunked',
  ),
));
$request = httprl_send_request();
echo httprl_pr($request);

I need to decode the chunks first and then split up the data based on the byte range.
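
For reference, a minimal chunked-body decoder looks roughly like this; it is only a sketch of the standard chunked framing (hex size line, CRLF, payload, CRLF, terminating zero-size chunk), not the code that ended up in httprl:

/**
* Decode a chunked transfer-encoded body into a plain string.
*
* Minimal sketch: trailers are ignored and chunk extensions are only
* stripped from the size line.
*/
function example_decode_chunked($data) {
  $decoded = '';
  while (strlen($data)) {
    // Each chunk starts with its size in hex (optionally followed by
    // extensions after a ';'), terminated by CRLF.
    $pos = strpos($data, "\r\n");
    if ($pos === FALSE) {
      break;
    }
    $size_parts = explode(';', substr($data, 0, $pos));
    $size = hexdec(trim($size_parts[0]));
    if ($size == 0) {
      // A zero-length chunk marks the end of the body.
      break;
    }
    // Append the chunk payload, then skip past it and its trailing CRLF.
    $decoded .= substr($data, $pos + 2, $size);
    $data = substr($data, $pos + 2 + $size + 2);
  }
  return $decoded;
}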

mikeytown2’s picture

Thinking about it more, cutting the download off when Transfer-Encoding or Content-Encoding is in use would be a bad idea. I'll move forward with this in mind.

mikeytown2’s picture

Status: Needs work » Fixed

The following patch has been committed to 6.x & 7.x.

mikeytown2’s picture

Support for bytes=1024- and bytes=-1024 style ranges has been added.

This patch has been committed to 6.x & 7.x.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.