One day after installing, with cron running hourly, I still see:

There are 2320 unchecked links of about 2320 links in the database. Please be patient until all links have been checked via cron.

When does something else happen?

Thanks

Comments

That's very strange... Every time cron runs, this number should decrease... Do you have any errors in the logs? What's your max execution time? Which check library have you selected?

Status: Active » Postponed (maintainer needs more info)

Tried to reproduce this with core and HTTPRL on the latest DEV, and everything works here.

Please also report your Drupal 7 version and PHP version and memory limit.

Thanks for the quick reply

Link checker 7.x-1.0
Drupal 7.18
PHP version 5.2.17
PHP memory limit 256
Check Library HTTP
HTTP Parallel Request Library 7.x-1.8
No errors on status page
Lots of page not found errors in logs
Cron is running successfully
What can I filter on to see messages specific to link checker?
Max execution time is set in php.ini to 300

Phew... This all sounds very good. Can you try the "Drupal core" check library, please? Maybe something is wrong with HTTPRL.

If this does not help - are you a developer? Can you try to debug the issue?

Changed to Drupal core and it's working now.
Thanks.

That's OK for now, and you don't have that many links. This is fine; checking should be complete after about 24 cron runs (24 hours).

Moving to the HTTPRL queue. It would be great if you could help the HTTPRL maintainer figure out the root cause. I've only tested with PHP 5.3.1 myself in the past weeks. Maybe mikey has some ideas. I'm sure it is PHP 5.2.x or stream related.

Project: Link checker » HTTP Parallel Request & Threading Library
Version: 7.x-1.0 » 7.x-1.8
Component: Miscellaneous » Code
Category: support » bug

Happy to help!
How?

Let's wait for mikey's feedback. He may have some ideas. You could give HTTPRL DEV a try for now.

Go to the status report admin/reports/status. Under HTTPRL are there any errors?

HTTPRL All the required functions are enabled and non blocking requests are working.

@hass
Linkchecker uses two features of HTTPRL: threads and parallel downloading. Do you think we could make these two features independent?
1. Run link checker in a new process, independent of cron
2. if ($has_httprl) {

That would help to narrow down what the possible issues are. The status report says that threading should be working so splitting this up should help to pinpoint where this is failing.

Isn't it only the part in hook_cron() that needs to be disabled? We could give him a patch or just explain how to disable the background process.

<?php
  // Run link checker in a new process, independent of cron.
  if (module_exists('httprl') && variable_get('linkchecker_check_library', 'core') == 'httprl') {
?>

Renaming 'httprl' in the module_exists() call, e.g. to 'httprl_disabled', stops link checking from running as a background process.

@Jwaxman: Can you try this, please?

Can/should I do this while it's currently working (quite well by the way) or should I wait a day to let it finish?

No, before it finishes. If it has finished, just press the clear all data button under maintenance and everything starts from scratch. Then switch back to the HTTPRL library.

Didn't get to it in time.
Real job got in the way.
Will do this test this weekend and report back.

No worries... Just press the clear all button in linkchecker and all starts from scratch.

Changed the linkchecker.module file as requested.
Changed configuration on the linkchecker admin page back to using HTTPRL.
Cleared and reanalyzed links.
Waited for the batch to finish.
Ran cron.
First bunch of links showed up on the broken links report.

Does that help?

You mean the number of unchecked links reported on the broken links page decreases with every cron run? That means the background process is not working properly or cannot be called correctly. If this is not a bug in the HTTPRL module itself, you could theoretically have firewall issues (blocked requests from localhost, forced SSL redirection, etc.) or loopback interface issues.

Can you check your Apache logs for a hit like the one below (with the original linkchecker code, not the altered one)? This is the background process call. If the status code is anything other than 200, or the hit is missing, something went wrong in the code or on the server/firewall side.

"POST /httprl_async_function_callback?count=1 HTTP/1.0" 200 38 "-" "Drupal (+http://drupal.org/)"

Code-wise, I would say the background tasks are not yet verified to work properly, but I could be wrong. 'admin/httprl-test' looks like such a test URL. In any case, this will not solve the bug; it only brings us a step closer to the root cause.
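
The access-log check described above can be sketched roughly as follows. This is only an illustration: demo_httprl_callback_status() is a hypothetical helper, not part of HTTPRL, and it assumes the log line has the same shape as the example hit quoted above.

```php
<?php
/**
 * Hypothetical helper: returns the HTTP status code of an HTTPRL background
 * callback hit found in an access-log line, or NULL if the line is not such
 * a hit. Assumes the request and status appear as in the example above.
 */
function demo_httprl_callback_status($log_line) {
  // \S* allows for a subdirectory prefix in front of the callback path.
  $pattern = '@"POST \S*/httprl_async_function_callback\?[^"]*" (\d{3}) @';
  if (preg_match($pattern, $log_line, $matches)) {
    return (int) $matches[1];
  }
  return NULL;
}

$line = '"POST /httprl_async_function_callback?count=1 HTTP/1.0" 200 38 "-" "Drupal (+http://drupal.org/)"';
$status = demo_httprl_callback_status($line);
print $status === 200 ? "callback OK\n" : "callback missing or failed\n";
```

A NULL result for every line of a cron run would correspond to the "hit is missing" case; any status other than 200 would point at the server or firewall side.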

@mikeytown2: Your comment #11 sounds a bit like you do not trust this feature to work reliably... why? I think we can make the features independent, but should this be a UI setting or rather a hidden variable that disables only the background feature?

>>You mean the number of unchecked links reported on the broken links page reduce with every cron run?
Yes

>>Probably not. I'm in a shared hosting environment (Hostmonster).
The Main Error Log (I assume that's where I'd look for this) is shared by multiple sites and gets overwritten very quickly.

Version: 7.x-1.8 » 7.x-1.x-dev
Status: Postponed (maintainer needs more info) » Active

Looks like httprl_build_url_self() isn't working correctly under the command line (cron). In case you're wondering, this function gets called here: http://drupalcode.org/project/httprl.git/blob/e02065fb2a2e0dc8cc573e0a0d...

@jwaxman
If you go to admin/config/development/httprl, can you fill in your server's IP address under "IP Address to send all self server requests to"? My guess is that 127.0.0.1 isn't getting routed back to Drupal for some reason. When testing this, use an unmodified version of linkchecker.

Status: Needs work » Active

The other issue might be with core or how cron.php is getting called depending on how you look at things.

The $base_path variable is set via drupal_settings_initialize() (D7) or conf_init() (D6). The code is the same in D6 and D7.

<?php
    // $_SERVER['SCRIPT_NAME'] can, in contrast to $_SERVER['PHP_SELF'], not
    // be modified by a visitor.
    if ($dir = rtrim(dirname($_SERVER['SCRIPT_NAME']), '\/')) {
      $base_path = $dir;
      $base_url .= $base_path;
      $base_path .= '/';
    }
    else {
      $base_path = '/';
    }
?>

When in the command line version of PHP, $_SERVER['SCRIPT_NAME'] is populated via the script name that is passed into php.
php /var/www/cron.php will result in $_SERVER['SCRIPT_NAME'] = "/var/www/cron.php"
php cron.php will result in $_SERVER['SCRIPT_NAME'] = "cron.php"
So instead of hitting http://127.0.0.1/, httprl could be hitting http://127.0.0.1/var/www/. This is an issue with core because the $base_path variable is incorrect. This could also be an issue with how you are calling cron.php as making it relative instead of absolute will fix it as well.
EDIT: looks like making it relative makes $base_path = "/./"
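
To make the effect concrete, here is a minimal standalone sketch of the derivation described above. demo_base_path() and demo_self_url() are hypothetical helpers written for illustration only; the second one assumes the self URL is simply the host followed by the base path, which may not match what httprl_build_url_self() actually does.

```php
<?php
/**
 * Hypothetical helper mirroring the core snippet: derives the base path
 * from a script name the same way (dirname + rtrim of slashes).
 */
function demo_base_path($script_name) {
  $dir = rtrim(dirname($script_name), '\/');
  return $dir ? $dir . '/' : '/';
}

/**
 * Hypothetical helper: builds a self-request URL from host and base path.
 */
function demo_self_url($script_name, $host = '127.0.0.1') {
  return 'http://' . $host . demo_base_path($script_name);
}

// Script name as a web server would typically report it.
print demo_self_url('/index.php') . "\n";        // http://127.0.0.1/
// Script name as reported for "php /var/www/cron.php".
print demo_self_url('/var/www/cron.php') . "\n"; // http://127.0.0.1/var/www/
```

The second call shows how a filesystem path can leak into the URL when cron.php is started from the command line with an absolute path.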

@jwaxman
How is cron being called on your server?

Attached file: status new, size 676 bytes

The following patch has been applied to httprl 6.x & 7.x. It makes sure that httprl_build_url_self() outputs the correct info even if core gives us a bad $base_path because Drupal is called from cron via the command line.

Leaving open till jwaxman reports back what worked in his case.

What is $_SERVER['PWD']?

PWD is part of the $_SERVER global when PHP is run from the command line. Try: php -r "print_r(get_defined_vars());"

Is this platform independent? I've never used it... Is getcwd() less safe?

Status: Active » Needs work

Hmmm, looks like PHP on Windows doesn't provide that variable. The issue is that getcwd() resolves symlinks, so my trick here needs more work...

Attached file: status new, size 1002 bytes

The following patch has been committed to 6.x & 7.x.
So if you're using symlinks on Windows and you invoke cron.php via the symlinked directory, this might not work. The workaround in that case is to set $base_url in settings.php.

Why are you trying to use PWD if getcwd works on all platforms the same way?

Let's say that /var/www/ is a symlink that points to /code
Running php /var/www/cron.php
getcwd() = "/code"
$_SERVER['SCRIPT_NAME'] = "/var/www/cron.php"
Thus $base_path will be set to /var/www but getcwd is set to /code.
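
The mismatch can be reproduced with a small standalone sketch. It assumes a POSIX system where the PHP process is allowed to create symlinks in the temp directory; demo_cwd_through_symlink() and the directory names are hypothetical and exist only for this demonstration.

```php
<?php
/**
 * Hypothetical demo: chdir() through a symlink and report what getcwd() sees.
 * Returns array(getcwd() result, symlink path, real target path).
 */
function demo_cwd_through_symlink() {
  // realpath() so the temp dir itself contains no symlinked components.
  $base = realpath(sys_get_temp_dir());
  $id = uniqid('httprl_demo_');
  $target = $base . '/' . $id . '_code';
  $link = $base . '/' . $id . '_www';
  mkdir($target);
  symlink($target, $link);

  $old = getcwd();
  chdir($link);
  $seen = getcwd();
  chdir($old);

  // Clean up the temporary symlink and directory.
  unlink($link);
  rmdir($target);
  return array($seen, $link, $target);
}

list($seen, $link, $target) = demo_cwd_through_symlink();
print "entered via: $link\n";
print "getcwd():    $seen\n";
```

On such a system getcwd() reports the real target directory rather than the symlink that was entered, which is why it cannot simply be compared against a $base_path derived from the invoked script name.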

Status: Active » Fixed

Going to mark this as fixed. Please re-open if this is not working when Drupal is run from cron.

Should I still add a disable feature for HTTPRL background process to linkchecker?

I would say yes. Getting a background process to work on all the different server configurations isn't easy. Sometimes the best option is to be able to not use a background process. Being able to check links in parallel is still a huge advantage :)

OK, but we should expect that users are not able to identify this type of issue on their own, and I'm not sure how we can explain to them in an easy way that background tasks are broken on their server and that they should fix their server for the highest possible performance. I believe it would be a lot easier if HTTPRL provided a variable_get() with TRUE/FALSE that tells other modules whether background tasks are working or not. Then we have only one "instance" every module can rely on.

This way we do not need to duplicate these runtime validation checks or UI settings in every module. Maybe add a checkbox to the HTTPRL settings page.

I could test the callback in httprl_cron only if cron is run from the command line (argc/argv). My status check test passed when run from a browser (#10); thus I need a different test for when httprl is run from the command line.
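
A rough sketch of such a command-line check, for illustration only: demo_is_command_line() is a hypothetical helper, and it combines the argc idea mentioned above with php_sapi_name(), which is a swapped-in heuristic rather than anything confirmed for HTTPRL.

```php
<?php
/**
 * Hypothetical helper: guesses whether PHP was started from the command line.
 *
 * @param string|null $sapi
 *   SAPI name to test; defaults to php_sapi_name().
 * @param int|null $argc
 *   Argument count to test; defaults to $_SERVER['argc'] if set, else 0.
 */
function demo_is_command_line($sapi = NULL, $argc = NULL) {
  if ($sapi === NULL) {
    $sapi = php_sapi_name();
  }
  if ($argc === NULL) {
    $argc = isset($_SERVER['argc']) ? $_SERVER['argc'] : 0;
  }
  // Either the CLI SAPI is in use, or command-line arguments were passed.
  return $sapi === 'cli' || (is_numeric($argc) && $argc > 0);
}

print demo_is_command_line() ? "command line\n" : "web request\n";
```

The argc part is only a heuristic, since some web configurations also populate argc; a callback test guarded by a check like this would only run for cron started from a shell or crontab.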

Why not call a callback function in a runtime check (.install) that outputs a drupal_set_message() saying the check was successful and sets the variable to TRUE, otherwise defaulting to FALSE (background not possible)? Or don't test at all and just add the disable checkbox to the HTTPRL page.

We could run into the same issues as in #965078: HTTP request checking is unreliable and should be removed in favor of watchdog() calls. Therefore I vote for just a checkbox on the HTTPRL page, defaulting to TRUE, so we can tell non-developers to disable it and try again.

So if the variable is disabled, have httprl_queue_background_callback() do nothing and return FALSE. Would that work for you?

This may complicate the code and make it unreadable... I was thinking more of variable_get('httprl_background_callback', TRUE).

Or httprl_is_background_callback_capable().

Working on a patch to add in the httprl_is_background_callback_capable() function.
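
For illustration, the two ideas discussed above (a variable defaulting to TRUE, and a capability function other modules can ask) could fit together roughly like this. It is a sketch under stated assumptions, not the committed patch: the demo_ functions are hypothetical, and variable_get() is stubbed here only so the example runs outside Drupal.

```php
<?php
// Stand-in for Drupal's variable_get(), defined only when run outside Drupal.
if (!function_exists('variable_get')) {
  function variable_get($name, $default = NULL) {
    return isset($GLOBALS['conf'][$name]) ? $GLOBALS['conf'][$name] : $default;
  }
}

/**
 * Hypothetical capability check: TRUE unless the site has switched
 * background callbacks off via the variable.
 */
function demo_is_background_callback_capable() {
  return (bool) variable_get('httprl_background_callback', TRUE);
}

/**
 * Hypothetical queue function: does nothing and returns FALSE when
 * background callbacks are disabled.
 */
function demo_queue_background_callback($callback) {
  if (!demo_is_background_callback_capable()) {
    return FALSE;
  }
  // A real implementation would issue the non-blocking self request here.
  return TRUE;
}

var_dump(demo_queue_background_callback('some_callback'));
```

A calling module such as linkchecker could then ask the capability function first and fall back to checking links inside the normal cron run when it returns FALSE.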

Status: Fixed » Needs work
Attached file: status new, size 3.21 KB

Got started on this patch.

Status: Needs work » Fixed
Attached file: status new, size 4.67 KB

The following patch has been committed.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Status: Closed (fixed) » Active

Reopening this since I'm having trouble getting links to be checked.

  • Drupal 7.18
  • Linkchecker 7.x-1.1
  • HTTPRL 7.x-1.11
  • Status page shows everything green
  • Unchecked background callbacks from the HTTPRL settings

Links do get checked properly when using Drupal core as the library. I don't see any errors being logged in watchdog, and manual cron runs complete successfully (run from the UI). What else can I check? Thanks in advance.

I'm trying to think of what could be causing your issues, and the best idea I have at the moment is to have you run some of the examples in the httprl readme and see which ones don't work as advertised (it's my mini test suite). I would skip the non-blocking examples; a couple of them will issue 1,000 requests in a short amount of time.

I stuck each example into a node body with PHP format and they all worked correctly. I did each simple example and the first non blocking one (10 URLs) and got output returned back. It seems to just not work when cron runs. I have a feeds task that runs during cron, and node indexing works during this time as well.

Any other ideas for things I can try? Thanks.

How are you starting cron; what is the command in crontab?

It doesn't work when starting cron from admin menu ('Run cron') or from crontab ('drush cron'). This is on a university install of Drupal running PHP 5.3.3 and it shows all functions available in the site status report. It DOES check links correctly from my local MAMP environment (PHP 5.3.14) when running cron manually from the UI. I'm wondering if something else is at issue here, like firewall, but the examples I ran did return output to the node body.

Same issue here. And we work together.

We recently switched over to a drush cron run using elysia cron, so I should be able to debug this now that I have a good repro case.

Status: Active » Fixed
Attached file: status new, size 19.38 KB

The following patch has been committed. It has some ugly code because running from the command line doesn't give me the correct context to hit the web server with the right request. When dealing with a subdirectory Drupal install, I need to auto-discover the correct subdirectory. I believe the code also handles symlinks better when going up the directory tree, because it uses the code from #1792310: chdir to DRUPAL_ROOT not working when core directory is symbolic link.

Attached file: status new, size 5.36 KB

The last patch was the wrong file. This is the correct one.

One more change. Have it use no subdirectories if the subdir test failed.

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.