When you fetch feeds, you don't handle the 30X HTTP redirect codes, but redirect codes are totally legal.
You don't even seems to handle any HTTP return code at all by the way (20X && 30X are not error codes).

It seems that you rewrote your own drupal_http_request() equivalent, maybe for performances (I don't know, you surely did have a good reason to do it), but you should have looked at their code, they handle gracefully all the HTTP redirects code (with a system of counted retries which works perfectly, I used a copy of this function to make a PHP HTTP proxy for GIS clients).

Here is a sample workaround (realy ugly sample, this is not a patch):

/**
 * Get the content from the given URL.
 *
 * @param $url
 *  A valid URL (not only web URLs).
 * @param $username
 *  If the URL use authentication, here you can supply the username for this.
 * @param $password
 *  If the URL use authentication, here you can supply the password for this.
 * @return
 *  A stdClass object that describes the data downloaded from $url. The object's
 *  data property contains the actual document at the URL.
 */
function http_request_get($url, $username = NULL, $password = NULL, $accept_invalid_cert = FALSE, $retry = 3) {

  // ... Cuted some code here

  $result->code = isset($result->code) ? $result->code : 200;

  if ($retry > 0 && in_array($result->code, array(300, 301, 302, 303))) {
    $retry--;

    return http_request_get($url, $username, $password, $accept_invalid_cert, $retry);
  }

  // In case of 304 Not Modified try to return cached data.
  else if ($result->code == 304) {

    if (isset($last_result)) {
      $last_result->from_cache = TRUE;
      return $last_result;
    }
    else {
      // It's a tragedy, this file must exist and contain good data.
      // In this case, clear cache and repeat.
      cache_clear_all('feeds_http_download_'. md5($url), 'cache');
      return http_request_get($url, $username, $password);
    }
  }

  // Cuted some code here..

  return $result;
}

Did not tested it yet, I'm going to in the next minutes (note that my sample code is quite ugly and really a dirty quick fix).

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

pounard’s picture

PS: I'm sorry for not taking the time to make a real nice patch, I'll may do it when the time will allow me to do so, don't count on it until a few days.

pounard’s picture

Category: bug » feature

Changed this to feature request:

  • Handle HTTP 30X redirection codes
  • A failed to fetch feed should disable itself, and be reported into some kind of admin form

FYI I did put 450 feeds into a Drupal instance (using OPML importer, which lead to 15 000 nodes import in less than one day) I had numerous errors, I did report almost all of them. With this quite big number of feeds, the annoying thing is when a feed failed, next cron executions will try to fetch it back, and this cause the cron execution to be really slower.

pounard’s picture

Title: failed with code 302 » Handle 302 request and shutdown feeds that fail with other codes.

Retitled issue.

alex_b’s picture

Version: 6.x-1.0-alpha10 » 6.x-1.x-dev

Reassigning this to 6.x - we should fix this.

Dane Powell’s picture

Subscribing - I've got lots of feeds on my site that users have entered erroneously and thus are filling up the logs with 404 errors from the Feeds module. I'd love to see these feeds get disabled automatically when these types of errors occur.

nkschaefer’s picture

+1. The site I'm working on aggregates several thousand RSS feeds of health science journals, and I've slapped together a rough/sloppy admin report form by pulling messages out of the watchdog log and keeping track of which ones I've marked "fixed" through a (gigantic) array in the variable table.

I totally agree with the suggestions in #2; the vast majority of problems my feeds are encountering are redirects, and if these feeds could update themselves when they hit a 301, it would save me tons of time. Also, I just wrote a module that provides Views integration for the broken links report from the Linkchecker module (hopefully this will become incorporated into that module), and I think that would be a great way to go with this, too. If we could store errors in a new DB table and expose that table to Views, then site admins like me could compile any and all similar errors from different modules into a single, user-friendly report via Views.

nkschaefer’s picture

FileSize
13.18 KB

A solution? Check out the module I've attached to this.

I have a pretty immediate need for this (a lot of errors and redirects have been popping up daily), and my project uses Drupal 6. To solve my problem right away without hacking the feeds module, I wrote a custom module that does the following:

A) creates a new database to store log entries

B) supplies a new Feeds fetcher (called FeedsLogHTTPFetcher) - this is almost identical to the normal FeedsHTTPFetcher, but it overrides a couple of key things to implement changes. To use it, go to your Feeds Admin UI and override the fetcher for your preferred feed importer -- choose "Feeds Log HTTP Fetcher." You should be good to go.

C) When feeds are checked, if a 301 response is encountered, the module notes this, gets the final redirected URL, and updates the feed node to reflect the new URL. It logs the fact that this happened in Drupal's watchdog log. For now, since 302s are "temporary" redirects, they're being treated like other "bad" status codes (see below). I made it very easy to change the code to update 302 URLs as well, but for now I thought it would be safer not to do this.

D) If a bad status code is returned (negative numbers corresponding to problems a la drupal_http_request, as well as actual HTTP status codes like 40X, 50X), it will be logged in a database table. The table keeps track of the feed nid and URL associated with the error, the number of times it's been attempted, the status code of the error, and the last-checked timestamp. In addition to logging the error here, the offending node will be unpublished. If the user updates the node by changing the URL, the node will be automatically re-published.

E) Views integration is provided for the error DB table, and a default view will provide a pre-created "feeds error log" table. This can be overridden/modified/whatever. All fields in the error log table will appear under the group "Feeds Error Log" in the Views UI. I've also added a field called "Status Code Meaning," which gives a rough translation of the status codes that a human can be better able to interpret. I included translations for weird codes like -1003 from drupal_http_request, because seeing stuff like "feed failed with code -1003" means nothing to most people.

Now, some boring stuff:

Before I did anything, I noticed that feeds were behaving a little differently with and without CURL enabled/available. With CURL, if a feed redirected, the redirection would be followed and content imported, but the module would report an HTTP status code of 200 (this code was coming from the final URL, not the one stored in the database that caused the redirect). Without CURL enabled, redirects would just throw errors and importing would fail.

This module still follows and retrieves content from the redirect if CURL is enabled, but it reports the status code from the original response (this was being just thrown away) rather than the 200 status code from the final response. Without CURL enabled, it finds the redirected URL using some code I found online that seems to work. To view the site the code came from, look in the comments on the methods in FeedsLogHTTPFetcher.inc. I had to override feeds_http_request.inc as well, but I've clearly marked in my file (http_request_override.inc) everywhere I made changes (just two little places). There may be a good reason why the original HTTP status code was being thrown away, but from my perspective, it makes more sense to take the response code from where my code takes it.

To summarize: if anyone else wants what I wanted (auto-update feeds that fail with HTTP status 301, log other bad status codes, and provide Views integration and a nice way of looking at errors, so your site admin can see what he/she needs to see easily), and you're running Drupal 6, just download the attached file, drop the feeds_log folder into your module directory, and install it. Then go to your Feeds Admin UI and change your fetcher to "Feeds Log HTTP Fetcher." Voila. The report is available at yoursite/feeds_log.

nkschaefer’s picture

Sorry for writing even more...

I was going to write a patch to the feeds module as well, but this is my first shot at patching (or using CVS) and I am totally lost. My site runs Drupal 6, so I wanted to write a patch against the 6.x branch of feeds, but it looks like maybe you only accept patches for 7.x?

Anyway, I set up CVS on my computer and checked out the latest version of feeds but forgot to include the version tag, so I wound up with the 7.x version. I stupidly deleted the folder (not via cvs or command line) and tried to re-check out with the 6.x version of feeds (to the same directory), but it didn't appear there. I assume what's going on is that CVS is still keeping track of the version of the directory that's now in "Trash," but I can't figure out how to move it back, or replace it, or tell it where I want it to be -- and I'm not sure how to tell the difference between commands that I want to apply to my local copy of the module files rather than the version on Drupal's CVS repository. I'm also not sure if I can overwrite the files with a new checkout now, or even if my patch will be looked at if it's for the 6.x version of feeds.

If I'm overlooking something simple and someone can get me on my feet quickly, I can try to write a patch that does what the above module does (of course, we want might to keep some of it -- ie the part that installs, manages the log table -- in a separate module in case some users don't have many feeds or care about fancy logging). Otherwise, the changes needed to get this stuff to work should be really evident from that module I uploaded, if any experienced patchers have a few minutes to copy/paste some stuff.

Thanks - and if I can be of more help with this, I'd like to be. I'm eager to get these changes incorporated into Feeds.

nkschaefer’s picture

FileSize
13.17 KB

Agh - one more (non-important) thing

The module .info file wasn't change to make it show up in the "Feeds" package. This copy should show up in the right place and work fine. Download this one if you want to try it out.

nkschaefer’s picture

And sorry to be posting yet again on this thread, but I let my module run over the weekend and realized that I overlooked something: Feeds stores "source" configurations for individual fetchers. This means that, if you use the module I posted above, the fact that the fetcher is a different class than FeedsHTTPFetcher will cause some problems. If you go to edit a feed node, Feeds won't pull up the URL because it's mapped to the class name FeedsHTTPFetcher, not FeedsLogHTTPFetcher. Similarly, if you save a feed node with FeedsLogHTTPFetcher selected as the fetcher, the URL will store differently and will no longer be picked up if you switch back to FeedsHTTPFetcher.

I think what I did in the module above still should be good to get the ball rolling on implementing these changes, and if these changes were incorporated into the Feeds module and the FeedsHTTPFetcher class, then these problems stemming from using a different class wouldn't exist. But in the meantime, I wouldn't enable this module on a production site if you've got a lot of feeds to keep track of.

alex_b’s picture

I am confused - all that's needed here is to follow redirects, not a new module. Curl is actually set up to do that (see curl_setopt($download, CURLOPT_FOLLOWLOCATION, TRUE); in http_request_get()) so the challenge here is to find out

A whether the cases reported here actually do find redirects not to be working or sth else is going on
B whether the cases reported here use curl (there is a fallback to drupal_http_request())
C If A && B - why does curl not follow redirects?

pounard’s picture

Plus this is a quite old issue, might be resolved by the time. I don't have the same environment to test again on, but if I remember well, I had the same issues on many environments.

nkschaefer’s picture

Feeds with CURL available does follow redirects. One possible problem I see, though, is that it reports the HTTP status code from the final location (a 200) and retrieves content, but I think it would be better to note the status code from the initial location (i.e. 301) and update the feed node's URL, in addition to pulling in content from the final location. Since 301s represent permanent redirects, it would be better practice to go straight for the new address, rather than always checking the old URL (which may eventually disappear) first.

The site I'm working on has, for whatever reason, been set up without CURL available (and I'm not in a position to fix that). Without CURL enabled, feeds reports 301 status codes as errors and fails to fetch content. This is a problem for my site.

If CURL is available, the code I wrote above still follows the location, but it also notes 301 codes and updates the source URL to the final location and saves it.

If CURL is not available, it introduces a new (non-CURL) method to follow the location, and does the same thing (updates the source URL and saves the FeedsSource object).

Additionally, the code I wrote introduces a new table to log bad status codes (404, etc.) and exposes this table to Views so that admins can see a clean report of all recent failures, rather than hunting through the watchdog table. This is a big plus for my site, which aggregates about 3,000 feeds and there are lots of errors happening daily that need attention from the site admins.

I agree that a new module isn't needed, but I wanted to demonstrate what I was talking about via some working code, rather than just bothering people and asking for stuff. The problems I noted in the last post make it impractical to use, but it's still a good demonstration/proof-of-concept of what I was talking about. These changes may seem insignificant in a lot of cases, but they would be a big plus for my site. If there's anything else I can do to help or clarify, please let me know. Thanks.

Dane Powell’s picture

Status: Active » Needs review
FileSize
5.3 KB

I definitely think that this should be handled in Feeds, not in a separate module.

To that end, I wrote this patch that incorporates the basic functionality of nkschaefer's module- updating the URL for permanent redirects and disabling the feed for other errors. There is no extensive logging of data - I think the odds of this being accepted are better if the changes are minimal. This patch also cleans up the code considerably to adhere to Drupal's coding standards (although more work could still be done, especially on the three functions that were copied from elsewhere).

I have confirmed that this will not affect valid feeds, and will disable feeds returning a 404. I haven't tested other conditions.

Edit: I apologize that you'll probably need to edit the filename on the second line of the patch to get it to apply correctly

nkschaefer’s picture

Awesome - thanks for getting the ball rolling on this. I am ignorant about how patching/testing patches works, but I'll read up and help out with this if I can.

Also, if anyone wants some redirected feed URLs to use as test cases: here are a few I've recently encountered through my site:

301s:
http://www3.interscience.wiley.com/rss/journal/117968610
http://www3.interscience.wiley.com/rss/journal/37476
http://www3.interscience.wiley.com/rss/journal/118520573

302s:
http://www.extenza-eps.com/rss/etoc/suli
http://www.extenza-eps.com/rss/etoc/jscp
http://content.karger.com/produktedb/rss.asp?j=DNE
http://content.karger.com/produktedb/rss.asp?j=ORL

Thanks again. I'll look into helping more with this once I can.

-Nathan

Dane Powell’s picture

FileSize
5.52 KB

Here's an updated patch rolled with git

carlos8f’s picture

Status: Needs review » Needs work

Thanks for the patch, but as far as I can tell the implementation is rather broken :-/

First thing, when using cURL, $result->code is going to be the last HTTP code encountered after following redirects. So even though $headers['Location'] will have the last redirect URL, it will never hit the if() to save the new location, since if the server will likely return a 200 after redirects. To fix this we'll have to patch http_request_get() to parse the raw headers and store the original redirect code, so we can save the new feed URL if we got a 301 originally.

Second, cURL follows redirects on 301, which means the logic ($result->code == 302 && $curl) is misguided and should also continue if there is a 301.

Third, the non-cURL helper methods included in the patch are apparently redundant, because drupal_http_request() processes up to 3 redirects by default. We can use $result->redirect_url as the new feed URL. We might also want to bump up the retry count a little bit.

I also don't agree that the feed should be unpublished if a non-OK status is returned. Servers commonly crap out with a temporary 500 here and there and I don't want to be constantly re-publishing my feeds. So let's just unpublish on 400's and not on 500's.

Minor: t() should be removed from the watchdog() and "feeds_log" as the watchdog key should probably be replaced with "FeedsHTTPFetcher" or "feeds" to be consistent.

I have a site with thousands of feeds which have stopped importing because meetup recently changed their URL structure. I'll be working on a patch.

carlos8f’s picture

Category: feature » bug
Priority: Normal » Major
Status: Needs work » Needs review
FileSize
14.99 KB

This fixes issues I mentioned in #17, adds lots of tests, and other fixes:

- Fix problem with patch #16 that uses hard-coded importer ID "feed", which results in the new config not being saved correctly.
- Fixes cURL requests that Feeds triggers inside tests so they maintain the simpletest User-Agent, as drupal_http_request() does (required for included tests to work).
- Makes $result->code consistent across cURL and drupal_http_request(), i.e. code is always the first HTTP code encountered. On redirects, both "drivers" now populate $result->redirect_code and $result->redirect_url with the final code and URL.
- Bumped up max redirects for drupal_http_request() to 10 from 3.
- Handle 307 redirects.
- Standardized $headers to be associative.

Also, at the moment, you also need patch from #1203578: Wrong path in simpletest files. [test -> tests] to run any tests on the 6.x branch.

Reclassifying as bug. This affects non-cURL users. In drupal_http_request() $result->code is the first code encountered (301 or 302). cURL returns the last code, which allows it to pass the check on a redirect. So if you don't use cURL (open_basedir is one reason), Feeds won't parse redirected feeds at all. That's definitely an oversight/bug.

Cheers :D

carlos8f’s picture

Slight change to also test the importer with fallback drupal_http_request() if cURL is available.

c0ldfury’s picture

Is this dead? Feeds still doesn't handle redirects properly.

c0ldfury’s picture

Version: 6.x-1.x-dev » 7.x-2.x-dev
FileSize
10.03 KB

Well, I modified the above patch to work with 7.x-2.x-dev as most of the code it affects it the same, but it isn't working for me atm. I'll post it here in case someone else can do something with it.

c0ldfury’s picture

Status: Needs review » Needs work
FileSize
9.11 KB

I've got the redirect handling working for my use case (a site that redirects to a url with a session key), but I think the test case still needs some work. I don't know enough about drupal test cases to know what is wrong with it.

Redarion’s picture

This patch doesn't work with me: always the same 302 error.
does anyone keep on getting the same problem?