Since implementing the nagios module on several sites, we have been getting warning pages off and on with the line:
DRUPAL WARNING, ADMIN:WARNING=Drupal core update status, CRON:OK
This has been happening despite the fact that the sites involved are completely up to date. Looking at update_status_requirements() in update.module (line 200), it appears that this error is issued when Drupal fails to get info from drupal.org. This is somewhat undesirable behavior, as we're getting warning pages for sites that aren't actually broken. Would it be possible to optionally ignore this status if (for instance) it has been less than 48 hours since the update status was successfully checked? I'd be happy to generate a patch if you don't have time, but would like guidance on what would be deemed an acceptable solution before spending too much time on it.
| Comment | File | Size | Author |
|---|---|---|---|
| #16 | 615128-16-remove_additional_UNKNOWN_check.patch | 1.58 KB | greg.harvey |
| #7 | 615128-7-update_check_pending-D7.patch | 1.13 KB | greg.harvey |
| #3 | nagios-615128-3-update_cache_with_grace_time.patch | 962 bytes | helmo |
| #2 | update_cache_with_grace_time.patch | 935 bytes | morenstrat |
Comments
Comment #1
helmo: I've had this problem in the past; I had the idea that it had something to do with the execution of cron on Drupal.
Maybe you can use my solution.
This issue describes it: #531890: Run cron just before nagios check?
Comment #2
morenstrat: I think this happens when the update cache has expired and Nagios performs a check before Drupal cron has had a chance to refresh the update cache. Please see the attached patch for a possible solution. It adds a grace time (nagios_cron_duration) to the update cache expiration.
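The grace-time idea can be sketched roughly as follows. This is an illustrative Python sketch, not the code from the attached patch: the function name `cache_is_usable`, the constant names, and both durations are assumptions made up for the example; only the concept of adding a grace time (nagios_cron_duration in the patch) to the cache expiration comes from the comment above.

```python
import time

# Assumed values for illustration only; the module's real settings may differ.
UPDATE_CACHE_LIFETIME = 24 * 60 * 60  # seconds the fetched update data counts as fresh
CRON_GRACE_TIME = 60 * 60             # stand-in for the nagios_cron_duration grace time


def cache_is_usable(last_fetch, now=None,
                    lifetime=UPDATE_CACHE_LIFETIME, grace=CRON_GRACE_TIME):
    """Return True if cached update data may still be trusted.

    last_fetch is the timestamp of the last successful fetch, or None if
    no data was ever fetched. The cache is accepted until its normal
    lifetime plus the grace time has passed, which gives cron a chance to
    refresh it before the monitoring check starts complaining.
    """
    if now is None:
        now = time.time()
    if last_fetch is None:
        return False
    return (now - last_fetch) <= lifetime + grace
```

With these assumed values, a check that runs a few minutes after the cache expired still accepts the cached data, while a cache that is more than an hour overdue is rejected.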
Comment #3
helmo: Nice patch. I would suggest using two additional defines for the int values you use. I've updated the patch file.
Comment #4
morenstrat: Thanks. Using the constants is definitely an improvement.
Comment #5
greg.harvey: Going through R&TBC patches, this one looks good. Committed.
Comment #7
greg.harvey: Re-opening to get this into the D7 branch. Marked #1270188: Requirements error when update check is pending as duplicate. Attached patch from there here.
Comment #8
greg.harvey: Committed to the D7 branch. Will be in the next dev snapshot, later on. Thanks dunix@gmx.de! =)
Comment #10
morenstrat: With version 1.0 - I think - these lines were added to nagios_check_requirements (lines 413-416):
It seems like these lines prevent the patch from #7 from doing what it was supposed to do. Since version 1.0 I keep getting an "ADMIN UNKNOWN" warning once a day when the update cache has expired and an update check is pending.
I think these new lines and the patch need to be merged somehow. Or am I completely wrong?
Comment #11
greg.harvey: We always had the problem of UNKNOWN warnings when update cache expired - it never went away for us, so we're not seeing a regression. Possibly because it was unfixed almost as quickly as it was fixed?
Open to suggestion here. Need to remember when and why those lines were added though. =/
Comment #12
greg.harvey: Seems I introduced it here: #832288: return unknown rather than critical on wget failure
Stupidly, I didn't post the revised patch, but it was almost unchanged, bar some white space changes, from the original. The issue tries to address what happens if there's a timeout on wget, but also catch the case where "No information is available" and flag an UNKNOWN warning then too.
I guess that guy desired this behaviour and you do not. Neither do I, come to think of it.
So perhaps we can just remove the changes here to nagios.module and leave the exit states in check_drupal alone, as they're fine?
Comment #13
helmo: The unknown status in check_drupal is indeed not the issue here.
I'm not sure why we need to ignore the case where e.g. the release history server is unavailable.
I'm comfortable with an UNKNOWN status, I would expect nagios to retry the check before bothering me.
The grace period we had before seemed reasonable as an addition to this.
As an example, what would happen if some firewall prevented Drupal from reaching the release history server?
I would like to know about that when it's a persistent problem.
Comment #14
greg.harvey: @helmo Are you saying you like the behaviour "as is" then?
Comment #15
helmo: Not completely as is ... the current check indeed looks redundant.
Whether my firewall use case fits depends on how UPDATE_FETCH_PENDING is implemented.
If we indefinitely get this code, and core thus doesn't recognise when it's a permanent problem, then I would like nagios to let me know.
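One way to picture the behaviour asked for here is to stay quiet while a fetch has only recently been pending and to escalate once it has been pending for too long. The following Python sketch is purely hypothetical: `pending_state`, `MAX_PENDING_SECONDS`, and the six-hour threshold are invented for illustration and do not describe how Drupal core or the nagios module track UPDATE_FETCH_PENDING.

```python
# Hypothetical threshold: how long a pending update fetch is tolerated
# before the monitoring check should report it.
MAX_PENDING_SECONDS = 6 * 60 * 60


def pending_state(pending_since, now, max_pending=MAX_PENDING_SECONDS):
    """Map a 'fetch pending' condition to a Nagios-style state string.

    pending_since is the timestamp at which the fetch was first seen as
    pending, or None if no fetch is pending at all.
    """
    if pending_since is None:
        return "OK"
    if now - pending_since > max_pending:
        # Pending for too long: likely a persistent problem, e.g. a
        # firewall blocking access to the release history server.
        return "UNKNOWN"
    # Recently pending: treat as a transient condition and stay quiet.
    return "OK"
```

A short-lived pending fetch after cache expiry would then stay OK, while a fetch that has been blocked for longer than the threshold would surface as UNKNOWN.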
Comment #16
greg.harvey: OK, so here's a proposed patch.
Comment #17
greg.harvey: FYI, I've got this change being tested on a Drupal 6 multisite I've been using as a guinea pig. Any other testers of this patch, particularly for Drupal 7, your input would be great. =)
Comment #18
helmo: I've installed the patch from #17 on a D7.10 test site.
Let's see how it behaves.
EDIT: should have been #16
Comment #19
greg.harvey: So far the Drupal 6 change appears to have stopped it flip-flopping between UNKNOWN and OK - I'll give it a few more days and also need to fake a failure to make sure it still works at all! ;-)
Comment #20
greg.harvey: I'm going to apply this to D6 - it seems to be working.
@helmo, how's the D7 version going? And can you confirm you meant #16, not #17 or #7? ;-)
Comment #21
greg.harvey: I actually pushed this to the development snapshot on the basis it probably works fine - when @helmo confirms I can roll a small release.
So fixed in both branches for now.
Comment #22
helmo: @greg.harvey: Yes, I meant #16
The D7 site seems fine.
Comment #23
greg.harvey: \o/
Great. I'll try to roll some releases today then. Failing that, it'll be first week in January.
Happy holidays everyone! =)
Comment #25
greg.harvey: Bang on time - releases were rolled yesterday, dev snapshots are now in synch with latest point releases.