One day after installing, with cron running hourly, I still see:
There are 2320 unchecked links of about 2320 links in the database. Please be patient until all links have been checked via cron.
When does something else happen?
Thanks
Comment | File | Size | Author |
---|---|---|---|
#54 | httprl-1878454-54-use-no-subdirs-on-fail.patch | 530 bytes | mikeytown2 |
#53 | httprl-1878454-51-cron-fix.patch | 5.36 KB | mikeytown2 |
#42 | httprl-1878454-42-background-setting.patch | 4.67 KB | mikeytown2 |
#41 | httprl-1878454-38-background-setting.patch | 3.21 KB | mikeytown2 |
#27 | httprl-1878454-27-fix-cmd-base-path.patch | 1002 bytes | mikeytown2 |
Comments
Comment #1
hass commented: That's very strange... Every time cron runs, this number should decrease. Do you have any errors in the logs? What is your max execution time? Which check library have you selected?
Comment #2
hass commented: I tried to reproduce this with core and httprl on the latest DEV, and everything works here.
Please also report your Drupal 7 version, PHP version, and memory limit.
Comment #3
jwaxman commented: Thanks for the quick reply.
Link checker 7.x-1.0
Drupal 7.18
PHP version 5.2.17
PHP memory limit 256
Check Library HTTP
HTTP Parallel Request Library 7.x-1.8
No errors on status page
Lots of page not found errors in logs
Cron is running successfully
What can I filter on to see messages specific to link checker?
Max execution time is set in php.ini to 300
Comment #4
hass commented: Phew... This all sounds very good. Can you try the "Drupal core" check library, please? Maybe something is wrong with HTTPRL.
If this does not help: are you a developer? Can you try to debug the issue?
Comment #5
jwaxman commented: Changed to Drupal core and it's working now.
Thanks.
Comment #6
hass commented: That's OK for now, and you do not have that many links. This is fine and should be completed after 24 cron runs (24 hours).
Moving queue. It would be great if you could help the httprl maintainer figure out the root cause. I've tested only with PHP 5.3.1 in the past weeks. Maybe mikey has some ideas. I'm sure it is PHP 5.2.x or stream related.
Comment #7
jwaxman commented: Happy to help!
How?
Comment #8
hass commented: Let's wait for mikey's feedback. He may have some ideas. You could give HTTPRL DEV a try for now.
Comment #9
mikeytown2 commented: Go to the status report (admin/reports/status). Under HTTPRL, are there any errors?
Comment #10
jwaxman commented: HTTPRL: All the required functions are enabled and non blocking requests are working.
Comment #11
mikeytown2 commented: @hass
Linkchecker uses 2 features of HTTPRL: threads and parallel downloading. Do you think we could make these 2 independent features?
1. Run link checker in a new process, independent of cron
2. if ($has_httprl) {
That would help to narrow down what the possible issues are. The status report says that threading should be working so splitting this up should help to pinpoint where this is failing.
Comment #12
hass commented: Isn't it just the part in hook_cron() that needs to be disabled? We could give him a patch or just explain how to disable the background process.
Renaming 'httprl' in the module_exists() call, e.g. to 'httprl_disabled', stops it from running as a background process.
@jwaxman: Can you try this, please?
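The effect of the rename suggested in #12 can be sketched as follows. This is a hypothetical illustration, not linkchecker's actual hook_cron(): module_exists() is stubbed so the snippet runs outside Drupal, and example_pick_check_mode() is an invented name.

```php
<?php
// Stand-in for Drupal's module_exists() so the sketch runs outside Drupal.
// Assumption: only 'httprl' is enabled on this imaginary site.
if (!function_exists('module_exists')) {
  function module_exists($module) {
    return in_array($module, array('httprl'), TRUE);
  }
}

// Hypothetical helper: decide where the link checks would run.
function example_pick_check_mode($module_name) {
  // A name that matches no enabled module makes the background branch unreachable.
  return module_exists($module_name) ? 'background' : 'in_cron';
}

echo example_pick_check_mode('httprl'), "\n";          // background
echo example_pick_check_mode('httprl_disabled'), "\n"; // in_cron
```

With the renamed string the check never succeeds, so the work stays inside the cron run itself.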
Comment #13
jwaxman commented: Can/should I do this while it's currently working (quite well, by the way), or should I wait a day to let it finish?
Comment #14
hass commented: No, before it finishes. If it has finished, just press the "clear all data" button under maintenance and everything starts from scratch. Switch back to the httprl library.
Comment #15
jwaxman commented: Didn't get to it in time.
Real job got in the way.
Will do this test this weekend and report back.
Comment #16
hass commented: No worries... Just press the "clear all" button in linkchecker and everything starts from scratch.
Comment #17
jwaxman commented: Changed the linkchecker.module file as requested.
Changed configuration on the linkchecker admin page back to using HTTPRL.
Cleared and reanalyzed links.
Waited for the batch to finish.
Ran cron.
First bunch of links showed up on the broken links report.
Does that help?
Comment #18
hass commented: You mean the number of unchecked links reported on the broken links page is reduced with every cron run? This means the background process is not working properly or cannot be called correctly. You could theoretically have firewall issues (blocked requests from localhost, forced SSL redirection, etc.) or loopback interface issues, if this is not a bug in the HTTPRL module itself.
Can you check your Apache logs for a hit like the following (with the original linkchecker code, not the altered one)? This is the background process call. If the status code is anything other than 200, or the hit is missing, something went wrong in the code or on the server/firewall side.
"POST /httprl_async_function_callback?count=1 HTTP/1.0" 200 38 "-" "Drupal (+http://drupal.org/)"
Code-wise, I would say the background tasks are not yet verified to work properly, but I could be wrong. 'admin/httprl-test' looks like such a test URL. In any case, this will not solve the bug; it only brings us a step closer to the root cause.
@mikeytown2: Your comment #11 sounds a bit like you do not trust this feature to work reliably. Why? I think we can make the features independent, but should this be a UI setting or rather a hidden variable that disables only the background feature?
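The access-log check described in #18 could be automated roughly like this. It is a minimal sketch that assumes log lines in the usual Apache common/combined layout; example_callback_statuses() is a hypothetical helper, not part of HTTPRL or linkchecker.

```php
<?php
// Hypothetical helper: collect the status codes of background-callback hits
// from access-log lines.
function example_callback_statuses(array $lines) {
  $statuses = array();
  foreach ($lines as $line) {
    // Only look at requests for the HTTPRL callback path.
    if (strpos($line, '/httprl_async_function_callback') === FALSE) {
      continue;
    }
    // The status code is the first field after the quoted request line.
    if (preg_match('/"[A-Z]+ \S+ HTTP\/[\d.]+" (\d{3}) /', $line, $m)) {
      $statuses[] = (int) $m[1];
    }
  }
  return $statuses;
}

$sample = array(
  '"POST /httprl_async_function_callback?count=1 HTTP/1.0" 200 38 "-" "Drupal (+http://drupal.org/)"',
  '"GET /node/1 HTTP/1.1" 404 512 "-" "Mozilla/5.0"',
);
print_r(example_callback_statuses($sample));
```

An empty result would mean the hit is missing entirely; any value other than 200 points at the server or firewall side. On a real site the lines could come from file() on the access log, whose location depends on the host.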
Comment #19
jwaxman commented:
>> You mean the number of unchecked links reported on the broken links page reduce with every cron run?
Yes.
As for checking the Apache logs: probably not. I'm in a shared hosting environment (Hostmonster).
The main error log (I assume that's where I'd look for this) is shared by multiple sites and gets overwritten very quickly.
Comment #20
mikeytown2 commented: Looks like httprl_build_url_self() isn't working correctly under the command line (cron). This function gets called here, if you're wondering: http://drupalcode.org/project/httprl.git/blob/e02065fb2a2e0dc8cc573e0a0d...
@jwaxman
If you go to admin/config/development/httprl, under "IP Address to send all self server requests to", can you fill in your server's IP address? My guess is that 127.0.0.1 isn't getting routed back to Drupal for some reason. When testing this, run it with an unmodified version of linkchecker.
Comment #21
mikeytown2 commented: The other issue might be with core, or with how cron.php is getting called, depending on how you look at things.
The $base_path variable is set via drupal_settings_initialize() (D7) or conf_init() (D6). The code is the same in D6 & D7.
When in the command line version of PHP, $_SERVER['SCRIPT_NAME'] is populated with the script name that is passed to php.
php /var/www/cron.php will result in $_SERVER['SCRIPT_NAME'] = "/var/www/cron.php"
php cron.php will result in $_SERVER['SCRIPT_NAME'] = "cron.php"
So instead of hitting http://127.0.0.1/, httprl could be hitting http://127.0.0.1/var/www/. This is an issue with core because the $base_path variable is incorrect. This could also be an issue with how you are calling cron.php, as making it relative instead of absolute will fix it as well.
EDIT: looks like making it relative makes $base_path = "/./"
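The values described in #21 can be reproduced with a simplified model of the derivation. This is a sketch only, not a copy of the core code; example_base_path() is a hypothetical name.

```php
<?php
// Simplified model of deriving a base path from SCRIPT_NAME.
// Not core's code; example_base_path() is a hypothetical helper.
function example_base_path($script_name) {
  // Strip the file name, then any leading or trailing slashes.
  $dir = trim(dirname($script_name), "\\/");
  return $dir === '' ? '/' : "/$dir/";
}

// A web request for /cron.php, then the two command-line invocations.
foreach (array('/cron.php', '/var/www/cron.php', 'cron.php') as $name) {
  printf("%-20s => %s\n", $name, example_base_path($name));
}
```

Only the first input, which is what a web request for a site at the document root would supply, yields the "/" that a self-request needs.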
@jwaxman
How is cron being called on your server?
Comment #22
mikeytown2 commented: The following patch has been applied to httprl 6.x & 7.x. It makes sure that httprl_build_url_self() outputs the correct info even if core gives us a bad $base_path because Drupal was called from cron via the command line.
Leaving this open until jwaxman reports back what worked in his case.
Comment #23
hass commented: What is $_SERVER['PWD']?
Comment #24
mikeytown2 commented: PWD is part of the $_SERVER global when PHP is run from the command line.
php -r "print_r(get_defined_vars());"
Comment #25
hass commented: Is this platform independent? I've never used it... Is getcwd() less safe?
Comment #26
mikeytown2 commented: Hmm, looks like PHP on Windows doesn't contain that variable. The issue is that getcwd() resolves symlinks, so my trick here needs more work...
Comment #27
mikeytown2 commented: The following patch has been committed to 6.x & 7.x.
So if you're using symlinks on Windows and you invoke cron.php via the symlinked directory, this might not work. The workaround in this case would be to set $base_url in settings.php.
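A settings.php fragment for that workaround might look like the following. The host name is a placeholder, and Drupal expects the value without a trailing slash.

```php
// sites/default/settings.php (fragment)
// Placeholder host: replace with the site's real address, no trailing slash.
$base_url = 'http://www.example.com';
```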
Comment #28
hass commented: Why are you trying to use PWD if getcwd() works the same way on all platforms?
Comment #29
mikeytown2 commented: Let's say that /var/www/ is a symlink that points to /code.
Running php /var/www/cron.php gives:
getcwd() = "/code"
$_SERVER['SCRIPT_NAME'] = "/var/www/cron.php"
Thus $base_path will be set to /var/www, but getcwd() returns /code.
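The symlink behaviour can be checked with a small throwaway script. This is a sketch that assumes a platform where symlink() is permitted (typical on Linux and macOS); the temporary directory names stand in for /code and /var/www, and example_cwd_via_symlink() is a hypothetical helper.

```php
<?php
// Demonstration: getcwd() reports the resolved directory, not the symlink path.
function example_cwd_via_symlink() {
  $real = sys_get_temp_dir() . '/example_code_' . uniqid();
  $link = sys_get_temp_dir() . '/example_www_' . uniqid();
  mkdir($real);
  symlink($real, $link);

  $previous = getcwd();
  chdir($link);      // enter through the symlink, like /var/www
  $cwd = getcwd();   // resolved path, like /code
  chdir($previous);

  $result = array('link' => $link, 'cwd' => $cwd, 'real' => realpath($real));

  // Clean up the temporary symlink and directory.
  unlink($link);
  rmdir($real);
  return $result;
}

$r = example_cwd_via_symlink();
echo "entered via: {$r['link']}\n";
echo "getcwd():    {$r['cwd']}\n";
```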
Comment #30
mikeytown2 commented: Going to mark this as fixed. Please re-open if this is not working when Drupal is run from cron.
Comment #31
hass commented: Should I still add a feature to linkchecker that disables the HTTPRL background process?
Comment #32
mikeytown2 commented: I would say yes. Getting a background process to work on all the different server configurations isn't easy. Sometimes the best option is to be able to not use a background process. Being able to check links in parallel is still a huge advantage :)
Comment #33
hass commented: OK, but we should expect that users are not able to identify this type of issue on their own, and I'm not sure how we can explain to them in an easy way that the background tasks are broken on their server and that they had better fix their server for the highest possible performance. I believe it would be a lot easier if HTTPRL could provide a variable_get() with TRUE/FALSE that tells other modules whether background tasks are working or not. Then we have only one "instance" every module can rely on.
This way we do not need to duplicate these runtime validation checks or UI settings in every module. Maybe add a checkbox to the HTTPRL settings page.
Comment #34
mikeytown2 commented: I could test the callback in httprl_cron only if cron is run from the command line (argc/argv). My status check test passed when run from a browser (#10); thus I need a different test when httprl is run from the command line.
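A command-line check of the kind mentioned in #34 could look roughly like this. It is a sketch, not HTTPRL's code: example_is_command_line() is a hypothetical helper, and the argv test is only a fallback hint alongside PHP_SAPI.

```php
<?php
// Hypothetical helper: guess whether PHP was started from the command line.
function example_is_command_line() {
  // PHP_SAPI is 'cli' for the command-line binary. As a fallback, treat a
  // request with argv set and no client address as a command-line run.
  return PHP_SAPI === 'cli'
    || (empty($_SERVER['REMOTE_ADDR']) && !empty($_SERVER['argv']));
}

echo example_is_command_line() ? "command line\n" : "web request\n";
```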
Comment #35
hass commented: Why not call a callback function in a runtime (.install) check that outputs a drupal_set_message() saying the check was successful and sets the variable to TRUE, otherwise defaulting to FALSE (background not possible)? Or don't do a test and just add the disable checkbox to the HTTPRL page.
Comment #36
hass commented: We could run into the same issues as in #965078: HTTP request checking is unreliable and should be removed in favor of watchdog() calls. Therefore I vote for a checkbox on the HTTPRL page only, defaulting to TRUE. That way we can tell non-developers to disable it and try again.
Comment #37
mikeytown2 commented: So if the variable is disabled, have httprl_queue_background_callback() do nothing and return FALSE. Would that work for you?
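The proposal in #37 amounts to a guard at the top of the queueing function. The sketch below is hypothetical and is not HTTPRL's implementation: variable_get() is stubbed so it runs outside Drupal, and the function and variable names prefixed with example_ are invented for illustration.

```php
<?php
// Stand-in for Drupal's variable_get() so the sketch runs outside Drupal.
if (!function_exists('variable_get')) {
  function variable_get($name, $default = NULL) {
    return isset($GLOBALS['conf'][$name]) ? $GLOBALS['conf'][$name] : $default;
  }
}

// Hypothetical wrapper: skip queueing when the site-wide setting is off,
// so callers can fall back to doing the work in the current request.
function example_queue_background_callback($callback) {
  if (!variable_get('example_background_callback', TRUE)) {
    return FALSE;
  }
  // A real implementation would issue the non-blocking request here.
  return TRUE;
}

$GLOBALS['conf']['example_background_callback'] = FALSE;
var_dump(example_queue_background_callback('example_check_links')); // bool(false)
```

A caller would then treat FALSE as "run the checks inline", which keeps the decision in one place.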
Comment #38
hass commented: This may complicate the code and make it unreadable... I was thinking more of variable_get('httprl_background_callback', TRUE).
Comment #39
hass commented: Or httprl_is_background_callback_capable().
Comment #40
mikeytown2 commented: Working on a patch to add the httprl_is_background_callback_capable() function.
Comment #41
mikeytown2 commented: Got started on this patch.
Comment #42
mikeytown2 commented: The following patch has been committed.
Comment #44
vinmassaro commented: Reopening this since I'm having trouble getting links to be checked.
Links do get checked properly when using Drupal core as the library. I don't see any errors being logged in watchdog, and manual cron runs complete successfully (run from the UI). What else can I check? Thanks in advance.
Comment #45
mikeytown2 commented: I'm trying to think of what could be causing your issues, and the best idea I have at the moment is to have you run some of the examples in the httprl readme and see which ones don't work as advertised (it's my mini test suite). I would skip the non-blocking examples; a couple of them will issue 1,000 requests in a short amount of time.
Comment #46
vinmassaro commented: I stuck each example into a node body with PHP format and they all worked correctly. I did each simple example and the first non-blocking one (10 URLs) and got output returned. It seems to just not work when cron runs. I have a feeds task that runs during cron, and node indexing works during this time as well.
Comment #47
vinmassaro commented: Any other ideas for things I can try? Thanks.
Comment #48
mikeytown2 commented: How are you starting cron? What is the command in crontab?
Comment #49
vinmassaro commented: It doesn't work when starting cron from the admin menu ('Run cron') or from crontab ('drush cron'). This is on a university install of Drupal running PHP 5.3.3, and it shows all functions available in the site status report. It DOES check links correctly from my local MAMP environment (PHP 5.3.14) when running cron manually from the UI. I'm wondering if something else is at issue here, like a firewall, but the examples I ran did return output to the node body.
Comment #50
noslokire commented: Same issue here. And we work together.
Comment #51
mikeytown2 commented: We recently switched over to a drush cron run using elysia cron, so I should be able to debug this now that I have a good repro case.
Comment #52
mikeytown2 commented: The following patch has been committed. It has some ugly code because running from the command line doesn't give me the correct context to hit the webserver with the right request. When dealing with a subdirectory Drupal install, I need to auto-discover the correct subdir. I believe the code also handles symlinks better because it uses the code from #1792310: Wrong DRUPAL_ROOT with non-standard code structure when going up the directory tree.
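The thread does not show the patch itself, so the following is only one plausible way to approach subdirectory discovery, not the committed code. The hypothetical helper example_candidate_base_paths() turns a filesystem path into a list of URL base paths that could then be probed one by one.

```php
<?php
// Hypothetical helper: list base paths to probe for a Drupal root on disk,
// from "no subdirectory" up to the full filesystem path.
function example_candidate_base_paths($drupal_root) {
  // Split the path into its non-empty segments.
  $parts = array_values(array_filter(explode('/', $drupal_root), 'strlen'));
  $candidates = array('/');
  // Add one more leading directory on each pass, starting from the last one.
  for ($i = count($parts) - 1; $i >= 0; $i--) {
    $candidates[] = '/' . implode('/', array_slice($parts, $i)) . '/';
  }
  return $candidates;
}

print_r(example_candidate_base_paths('/var/www/drupal'));
```

A caller could request each candidate against the local web server and keep the first one that answers like the site, falling back to "/" when none does.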
Comment #53
mikeytown2 commented: Last patch was the wrong file. This is the correct one.
Comment #54
mikeytown2 commented: One more change. Have it use no subdirectories if the subdir test failed.