I just spent the past few hours trying to debug why my cron job was failing to run to completion on one of my customer's sites. I narrowed it down to Link Checker.

After further investigation, I found that Link Checker is having a terrible time trying to check any myspace.com pages. I ended up removing all the myspace.com links from the linkchecker_link table and configured Link Checker not to check the link status of URIs containing myspace.com.

I don't have time right now to debug this any further, so it's really just a heads up to you guys. At least things are working for me again.

Note that this is just a recent phenomenon. I've had the same version of Link Checker running for some time now and all those myspace.com links have been there since day one.

Cheers
Adrian.

Comments

hass’s picture

Category: bug » support
Status: Active » Fixed

If myspace blocks your server it's not a bug in Link Checker... Maybe you are affected by one of the known Drupal HTTP request bugs. Give HTTPRL a try, please.

Let us know your debug results, please.

age3141592’s picture

Yes, it does appear to be related to drupal_http_request as it never returns a response.

Using HTTPRL does work better for sure and it does complete. However, a sample URL is still not retrieved correctly:

https://myspace.com/theautumnwoodsman -10053 Connection timed out. Time to First Byte Timeout.

I've also noted in the logs this message associated with the link check:

LOCATION http://www.axeandyoushallreceive.com/httprl_async_function_callback?count=1
REFERRER
MESSAGE Warning: session_start(): Cannot send session cookie - headers already sent by (output started at sites/all/modules/httprl/httprl.module:2284) in drupal_session_start() (line 287

The strange thing is that myspace is not blocking my server as I can do a wget on that page.

A thought, which I have not tried, is that myspace is not responding to the default drupal user agent used in the request.
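
A minimal sketch of how that idea could be tested outside Drupal, in Python rather than PHP. It only builds the requests so the headers can be inspected before anything is sent. The helper names and the second agent string are hypothetical, and the first string is my assumption of what the default Drupal agent looks like, so check it against your own install.

```python
import urllib.request

# Agent strings to compare. Both are assumptions for illustration:
# the first is what I believe a stock Drupal 7 sends, the second is made up.
CANDIDATE_AGENTS = [
    "Drupal (+http://drupal.org/)",
    "Mozilla/5.0 (compatible; ExampleLinkChecker/1.0)",
]


def build_request(url, user_agent, method="HEAD"):
    """Build (but do not send) a request with an explicit User-Agent."""
    return urllib.request.Request(
        url, headers={"User-Agent": user_agent}, method=method
    )


def build_probes(url):
    """Return one unsent request per candidate agent for the same URL."""
    return [build_request(url, agent) for agent in CANDIDATE_AGENTS]
```

Each probe could then be passed to `urllib.request.urlopen(probe, timeout=10)` to see whether the response differs by agent.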

hass’s picture

Try GET please... Maybe they block HEAD requests...

age3141592’s picture

Yes, they are blocking HEAD requests. GET works as expected (mostly).

I say mostly because myspace is not responding with a 404 to missing pages when requested with GET. Instead they don't answer at all and we get:
-10060 Connection timed out. Read.
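
To make the behaviour concrete, here is a rough sketch in Python (not Link Checker's actual code) of a check that tries HEAD first and falls back to GET when HEAD times out or is rejected. The `fetch` callable and the `Timeout` class are hypothetical stand-ins for whatever HTTP layer is in use.

```python
class Timeout(Exception):
    """Stand-in for a connection or read timeout from the HTTP layer."""


def check_link(url, fetch):
    """Return the HTTP status for url, or None if every attempt timed out.

    fetch(method, url) is a hypothetical callable that returns an integer
    status code or raises Timeout.
    """
    for method in ("HEAD", "GET"):
        try:
            status = fetch(method, url)
        except Timeout:
            # The server stayed silent for this method; try the next one.
            continue
        if method == "HEAD" and status in (405, 501):
            # HEAD was refused cleanly; GET may still succeed.
            continue
        return status
    # Both HEAD and GET timed out, as with the missing myspace pages.
    return None
```

A `None` result is the case described above: the page may be missing, but the server never says so, so the checker cannot tell a 404 from a block.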

I'm perfectly happy to blacklist myspace.com from Link Checker on the customer site in question, as most of those links are in archived messages. However, the original issue can bite others in the future.

Alexander, thanks for your time in figuring things out here. I'm sure you'll have better ideas as to whether or not a link checker fix is warranted and how that would take shape. I do have a couple ideas if you wish to discuss it further.

hass’s picture

I'll try to debug this on my own again later... I believe they may really be blocking you or the Drupal user agent... If this is really true we will find out and find ways to solve it... :-)

We have #503040: Implement a trigger mechanism to call actions based on link checker test results on the roadmap for buggy sites like rapidshare and maybe myspace that do not return a 404 status code. The idea is to analyze the returned content with a string compare... But it looks like nobody is really interested or it would have already been done. I contacted rapidshare 5 times about this, but it looks like the developers over there, or at least their support, have extremely limited knowledge about how computers need to work.
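
As a rough illustration of the string compare idea (a sketch in Python, not the design of #503040), a check could treat a 200 response as missing when its body contains a known error phrase. The function name and the marker phrases are hypothetical; real markers would have to be collected per site.

```python
def is_missing(status, body, markers):
    """Guess whether a response means "page not found".

    status  -- integer HTTP status code
    body    -- response body as text
    markers -- phrases (hypothetical, per site) that appear on error pages
    """
    if status == 404:
        return True  # the server reported it properly
    if status != 200:
        return False  # some other outcome; not decided here
    text = body.lower()
    # Soft 404: the server says 200 but the page reads like an error page.
    return any(marker.lower() in text for marker in markers)
```

For example, `is_missing(200, "<h1>File not found</h1>", ["file not found"])` is true, while the same call with an ordinary page body is false.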

Please feel free to share your ideas :-)

hass’s picture

I get a "404" status code on random URLs with GET, and HEAD is not blocked cleanly. These servers should send "405 Method Not Allowed" instead.

Need to try other user agents.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.