i was noticing that my link checker was only running 30 links (or less) per hour, since that's what the max execution time was set to.
during peak traffic hours that is probably fine, but during off hours, why not let the thing run and try to churn through as many of the 10k+ links we have as it can?
so i came up with the following customization. it lets you set when your peak hours are (in 24 hour time), the max number of links to check per cron run off-peak, and how long the link checker is allowed to churn during the cron run before it shuts down.
replace lines 134-152 of linkchecker.module with the following:
// --- set your vars here (in 24hour time) //
$startPeaktime = 6;
$endPeakTime = 22;
$check_links_max_per_cron_run_offPeak = 500;
$timer_divisor_offPeak = 5000;
$timer_divisor_default = 1000;
// --- //
// get current time
$currentTimeHour = date("H");
// check if peak time
$peaktime = (($startPeaktime<=$currentTimeHour)&&($currentTimeHour<=$endPeakTime)?true:false);
$check_links_max_per_cron_run = ($peaktime?$max_execution_time:$check_links_max_per_cron_run_offPeak);
$timer_divisor = ($peaktime?$timer_divisor_default:$timer_divisor_offPeak);
//$check_links_max_per_cron_run = variable_get('linkchecker_check_links_max', 10);
$check_links_interval = variable_get('linkchecker_check_links_interval', 2419200);
$useragent = variable_get('linkchecker_check_useragent', 'Drupal (+http://drupal.org/)');
watchdog('linkchecker', '$max_execution_time:'.$max_execution_time, array(), WATCHDOG_NOTICE);
watchdog('linkchecker', '$check_links_max_per_cron_run:'.$check_links_max_per_cron_run, array(), WATCHDOG_NOTICE);
// Get URLs for checking.
$result = db_query_range("SELECT * FROM {linkchecker_links} WHERE last_checked < %d AND status = %d ORDER BY last_checked, lid ASC", time() - $check_links_interval, 1, 0, $check_links_max_per_cron_run);
watchdog('linkchecker', 'num pulled from db to check:'.mysql_num_rows($result), array(), WATCHDOG_NOTICE);
watchdog('linkchecker', 'peaktime?:'.($peaktime?'yes':'no'), array(), WATCHDOG_NOTICE);
$linkcheckercountertally=0;
while ($link = db_fetch_object($result)) {
  // Fetch URL.
  watchdog('linkchecker', 'checking: '.$link->url, array(), WATCHDOG_NOTICE);
  $response = drupal_http_request($link->url, array('User-Agent' => 'User-Agent: ' . $useragent), $link->method, NULL, 1);
  //watchdog('linkchecker', 'done checking: '.$link->url, array(), WATCHDOG_NOTICE);
  _linkchecker_status_handling($link, $response);
  $linkcheckercountertally++;
  if ((timer_read('page') / $timer_divisor) > ($max_execution_time / 2)) {
    watchdog('linkchecker', 'stop linkchecking, we hit the max: '.$linkcheckercountertally, array(), WATCHDOG_NOTICE);
    break; // Stop once we have used over half of the maximum execution time.
  }
}
watchdog('linkchecker', 'ran out of links to check', array(), WATCHDOG_NOTICE);
}
just set your vars in the top there and let it go.
feel free to comment/uncomment any of the watchdog lines to be able to see/hide what exactly it is doing in your watchdog recent log table page.
if anyone would want to take this and add it to the config page for linkchecker, that would be slick, but if you don't mind playing in the php, this should work fine for you, as is.
[standard legal disclaimer about how you should back up your db and files before tinkering with your site in any way, and how i bear no responsibility if inserting this code crashes your site, WSOD's your site, crashes your browser, deletes your database, makes your girlfriend leave you, sets your dog free, drinks all your beer, or anything else that happens after you use this code]
enjoy!
Comments
Comment #1
hass commented
This is a bit like #380052: Add support with non-blocking parallel link checking... otherwise why are you not calling cron every 5 minutes? :-)
To get all links checked quickly I'm also calling cron with wget in a loop on the command line to get everything checked as quickly as possible. As it's more of a one-time action I was fine with this solution... later you should not see so many new links... normally - if we are not on a high stress site... but for high stress we need cURL... and the maintainers of high stress sites often run cron every 5 minutes.
What I'm missing in your code is a time-out override. linkchecker tries its very best to check as many links as possible per run. It can be 3 links, but could also be ~120 links per run. This depends *heavily* on the speed of the remote servers. The slower the remote servers answer, the fewer links are checked, but half the cron time is always used. You cannot check 500 links within 120 seconds... This *must* end in cron failures/crashes. Your patch makes cron fail, the original code doesn't.
Comment #2
chadd commented
i thought of running cron every 5 minutes but i think that would put a bit too much stress on our already groaning servers. LOL
as far as not seeing many new links, well, that's not the case with us.
long story short, we get a lot of new links added to pages on a consistent basis.
basically i was just trying to throttle the link checker during peak time and then use some off peak time to churn through the links instead of just sitting idle.
we're paying for the CPU, might as well use it, eh?
and the numbers in the vars are just starters for tweaking. i put in 500, not thinking that it would check 500, but just to make it max out the timer and give me a place to start with tweaks. i still have a time-out override, i just modify it a bit during off-peak hours. so far it's looking like it checks about 200 per cron with those settings. but my goal is to tweak those until i get it as high as possible without crashing cron. obviously, everyone would have to tweak it to fit their site/server/situation.
so far i haven't had cron fail once with my new code inserted. (except for the issues brought up in the time out thread here: http://drupal.org/node/739524 but that was happening before i inserted this code, so i don't think the new code is the cause of that)
Comment #3
hass commented
The original code makes sure that cron cannot fail... this is why it stops checking after half of the max execution time. Verify that you have an SSL lib activated....
Comment #4
chadd commented
well, it was failing with the original code.
and unless i'm reading my own code wrong, it still has the check in it to make sure it stops checking after half max execution time. i'm just changing the divisor from your default 1000 to something else, which shortens the reported time it took to check the links, in effect giving it more time to check links.
now that i look at it, i should change the 2 instead of the 1000 divisor, so you would change half max execution time to 1/4 or 3/4 max execution time or whatever you deem best for your circumstance.
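To make the divisor-versus-fraction point concrete, here is a minimal standalone sketch of the stop condition from the loop above. The function names are made up for illustration and are not part of linkchecker, and the 240 second max_execution_time is only an assumed value. It relies on timer_read('page') reporting milliseconds.

```php
<?php
// Hypothetical helpers for illustration only; not part of linkchecker.module.

/**
 * Mirrors the loop's stop condition: elapsed milliseconds scaled by the
 * divisor, compared against a fraction of max_execution_time (in seconds).
 */
function example_should_stop($elapsed_ms, $max_execution_time, $timer_divisor = 1000, $fraction = 0.5) {
  return ($elapsed_ms / $timer_divisor) > ($max_execution_time * $fraction);
}

/**
 * Wall-clock seconds that have to pass before the condition becomes true.
 */
function example_budget_seconds($max_execution_time, $timer_divisor = 1000, $fraction = 0.5) {
  return $max_execution_time * $fraction * $timer_divisor / 1000;
}

$max = 240; // Assumed max_execution_time in seconds.

print example_budget_seconds($max) . "\n";             // Divisor 1000, fraction 1/2.
print example_budget_seconds($max, 5000) . "\n";       // Divisor 5000, fraction 1/2.
print example_budget_seconds($max, 1000, 0.75) . "\n"; // Divisor 1000, fraction 3/4.
```

Under these assumptions the three budgets come out to 120, 600 and 180 seconds. In other words, a divisor of 5000 with the 1/2 left alone is the same as allowing 2.5 times max_execution_time of wall-clock time, while changing the fraction keeps the budget expressed as a share of max_execution_time.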
Comment #5
chadd commented
modified to change the cron timeout fail stopper on line 147 of the original .module file (line 171 of the .module after you insert the modified code).
Comment #6
hass commented
I can only guess that timer_read('page') or ini_get('max_execution_time') may return some wrong numbers. Please try to debug all values and try to figure out why cron fails. It can only be a problem on your server... I have tested this functionality very intensively and it worked well in all cases. Slow servers, quick servers - sometimes a few links, sometimes very many (up to ~120 per cron run), but never any cron failures, except when openssl was disabled in PHP and I checked HTTPS links. I'm not sure if I'm able to catch the disabled SSL somehow...
Comment #7
chadd commented
[i'm guessing that #6 is meant as a reply to my other thread about specific links failing cron http://drupal.org/node/739524 ? ]
FYI: this code addition worked great over the weekend. cron never failed.
checking 180+ links per cron during off-peak traffic times and tuning down to check the linkchecker default during peak times.
really, all it's doing is checking what hour of the day it is and if it is "off-peak" it raises the max links that can be checked and raises the amount of time it will take to check links before breaking and exiting cron.
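As a rough sketch of that hour-of-day decision, pulled out of the module so it can be run on its own: the function names here are hypothetical, the 6 to 22 window, 500 link limit and 1000/5000 divisors are the starter values from the top of the thread, and the peak limit of 30 is just an example. The wrap-past-midnight branch is an addition that the original snippet does not have.

```php
<?php
// Hypothetical helpers for illustration only; not part of linkchecker.module.

/**
 * Returns TRUE when $hour (0-23) falls inside the peak window, inclusive.
 * Also handles windows that wrap past midnight (e.g. 22 to 6).
 */
function example_is_peak_hour($hour, $start_peak, $end_peak) {
  $hour = (int) $hour;
  if ($start_peak <= $end_peak) {
    return $start_peak <= $hour && $hour <= $end_peak;
  }
  // The window wraps around midnight.
  return $hour >= $start_peak || $hour <= $end_peak;
}

/**
 * Picks the per-run link limit and timer divisor for the given hour.
 */
function example_cron_limits($hour, $peak_max_links, $offpeak_max_links = 500) {
  $peak = example_is_peak_hour($hour, 6, 22);
  return array(
    'peak' => $peak,
    'max_links' => $peak ? $peak_max_links : $offpeak_max_links,
    'timer_divisor' => $peak ? 1000 : 5000,
  );
}

// Example: limits for the current hour, with an assumed peak limit of 30.
$limits = example_cron_limits(date('G'), 30);
print 'peak: ' . ($limits['peak'] ? 'yes' : 'no') . ', max links: ' . $limits['max_links'] . "\n";
```

With an end hour of 22 and an inclusive comparison, the peak window runs from 06:00 through 22:59, which matches how the original check treats the hour returned by date("H").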
Comment #8
hass commented
Oops, sorry - yes... confusing... :-)
Wouldn't it be easier to configure cron to run every ~5 or 15 minutes in off times and every hour during the day? It would not require any changes to the module code. What load is acceptable during peak time? I have never seen any other module that does such things, but I have also thought more than once about creating an external script that does the checks with curl and allows disabling the cron hook in that case... something that may come in the future :-)
Comment #9
chadd commented
our servers are highly loaded enough during the day that i don't want to chance running every 5-15 min during peak traffic times.
i guess i could just run cron every 15 min during non-peak hours, or i could do it this way... same end result i suppose.
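For anyone who prefers the scheduling route, a crontab along these lines would be one way to do it. This is only a sketch: example.com is a placeholder for your own site, and the hours assume the same 06:00 to 22:59 peak window as the code above.

```
# off-peak (23:00-05:59): run Drupal cron every 15 minutes
*/15 23,0-5 * * * wget -O - -q -t 1 http://example.com/cron.php
# peak (06:00-22:59): run Drupal cron once an hour
0 6-22 * * * wget -O - -q -t 1 http://example.com/cron.php
```

Note that this changes how often every module's cron hook runs, not just linkchecker's.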
Comment #10
hass commented
What are you doing with other modules that use cron? :-)
Comment #11
chadd commented
i don't catch your meaning...
Comment #12
hass commented
I believe you should have other modules installed. Other modules may use hook_cron, too. How do you handle peak times with other modules? :-)
Comment #13
chadd commented
sorry to resurrect such an old thread, but i realize i never really answered your final question. and tbh, i really can't. we have many other modules that are triggered by cron and run fine during peak times, but after extensive testing, including using supercron (before it was abandoned) to specify the order in which cron triggers modules, linkchecker was the only one that seemed to cause cron to crash.
using supercron, we put linkchecker as the last module that cron triggered, cron still crashed, but at least everything else ran successfully before the crash.
using the above peak time hack solved those crashes.
i've been using it for almost 2 years now in 6.x-2.4, and porting it now to 2.5 without any issues. now with supercron gone, we still have no crashes of cron.
just wanted to post this as an FYI for anyone else having issues or wanting to have the ability to limit linkchecker during specific times.
Comment #14
hass commented
Maybe you should run linkchecker first and then your other stuff. I have no idea why linkchecker cron should "crash" anything. It finishes cleanly after 120s. That's all it does. You should define what you mean by the word "crash".
Comment #15
chadd commented
crash == cron hangs and never finishes. it's been a while, so i can't remember exactly what happened.
either way, i like having the ability to set limits during specific hours of the day.
Comment #16
hass commented
Are you blocking ini_get() or timer_read('page') calls? You need to debug it. It should be very easy for a developer. It sounds like the timer_read condition does not stop the process. Something with your system is abnormal.
Comment #17
chadd commented
looking back, i think the crashing was more related to #156582: drupal_http_request() should support timeout setting. i also had a different thread to try and debug the cron hanging that i was experiencing.
but, as i said, i like having the ability to set limits based on hour of the day. that's why i started this thread in the first place as a possible feature request and providing the code that could be the start of that feature addition, or allow people to do it themselves if desired.
i apologize for re-opening this thread, i just wanted to let people know that the peak time code that i provided earlier still works with the 6.x-2.5 version of linkchecker