Greetings.
I just made a nagios change that causes it to send the very first alert
for something to _just_ irc.
If you are active and looking at a problem at this point, please go and
ack it on the web interface. This will stop escalations.
If nobody acks it, nagios will wait 10 minutes and the next alert (if
the problem hasn't recovered) will go to irc, email and pagers.
After that it will re-send every hour to irc, email, and pagers until
the problem is acked or solved.
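For anyone who wants the schedule spelled out, here is a rough sketch in
Python. It only models the timing described above and is not the actual
nagios configuration. The function name and constants are made up for
illustration, and it assumes the hourly repeats are counted from the
10 minute alert.

```python
from typing import List, Tuple

# Values taken from the description above; names are hypothetical.
FIRST_DELAY_MIN = 10   # wait between the first (irc-only) alert and the second
REPEAT_MIN = 60        # interval between all later alerts

IRC_ONLY = ("irc",)
EVERYONE = ("irc", "email", "pager")


def notification_plan(minutes_unresolved: int) -> List[Tuple[int, Tuple[str, ...]]]:
    """Return (minute, channels) for each alert sent while a problem
    stays unacked and unrecovered for `minutes_unresolved` minutes."""
    plan = []
    minute = 0
    channels = IRC_ONLY  # the very first alert goes to irc only
    while minute <= minutes_unresolved:
        plan.append((minute, channels))
        # Assumption: the hourly clock starts at the 10 minute alert.
        minute += FIRST_DELAY_MIN if minute == 0 else REPEAT_MIN
        channels = EVERYONE  # every later alert goes to irc, email and pagers
    return plan


if __name__ == "__main__":
    for minute, channels in notification_plan(130):
        print(f"t+{minute:>3} min -> {', '.join(channels)}")
```

Acking or recovery simply ends the schedule, so a problem acked within
the first 10 minutes never reaches email or pagers.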
Rationale:
* Much of the time now we have someone on irc who can look at and fix
issues (since we have sysadmin main folks in Europe). Paging everyone
is causing pager fatigue, especially when someone else is already
fixing it.
* We get a lot of alerts for short, network-caused blips that recover
in a few minutes. There's usually nothing we can do about them, our
users never notice them, and paging on them only to page OK right
after bothering people just adds to pager fatigue. Ideally we would
adjust these checks, and we should, but it's going to take a while to
get them all right.
* We often get a lot of alerts from one proxy or the like being
rebooted or having apache restarted. These usually only last a minute
or two and there's no need to page on them.
* We sometimes get alerts directly related to changes we are currently
making in something and then go fix them. There's no need to page
someone for this. Just keep an eye on irc when making playbook or host
changes and clean up anything you cause to alert.
I'd like to get back to the idea that if you get a page it's an
important thing you need to go look at, not "oh, nagios again".
This is all subject to adjustment, but hopefully it will make life a
bit easier for us sysadmin types and not cause any problems for anyone
else. ;)
kevin