Hi,
Hi,
For a few days the false notifications from Nagios were reduced, but they have increased again.
You sure?
Looking at /configs/system/nagios/services/template.cfg reveals that it is configured with max_check_attempts = 4 and retry_check_interval = 1 for hosts, and max_check_attempts = 3 and retry_check_interval = 1 for services.
So if a service or host is unreachable for 3 or 4 minutes, we get a notification. (However, in most cases it is a false positive, due to congestion or other causes.)
Looking through my email, from what I can recall there are no false positives. xen6 had to be power-cycled which caused all the other collateral notifications.
Just to put it into perspective...
1st notification: 0212UTC - Accounts down on .120-phx
...
5th notification: 0216UTC - UNKNOWN status on xen6 (NRPE: Unable to read output)
...
11/12th notifications: 0228UTC - Host Down - xen6/db2
& Starting 0233UTC - Host/service UP/Okay notifications
According to my IRC logs xen6 went a bit haywire and had to be rebooted, so TBH I don't see what is false here.
Yes, congestion can cause some problems, but isn't that also a sign that stuff may need to be balanced better or given more processing/networking capacity?
It's long enough to not detect every single VPN bloop, but it's also long enough to give an idea of problems.
How about finding a workable delay that we can afford if a service or host is really down? How about 10 mins (5 attempts x 2 mins)?
IMO this is too long. Also, it doesn't take that long for someone to SSH in and have a quick look. I don't speak for everyone, but I don't mind spending 2-5 minutes to check.
Also, we could list which services/hosts are critical and which are not. That would help define different notification periods for the different hosts/services.
I thought I would do it after the freeze, but it's becoming too annoying.
Personally, I don't think anything should be done at the moment.
- Nigel
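As a rough illustration of the timings discussed in the message above, here is a small Python sketch that multiplies the number of check attempts by the retry interval. This is the same back-of-the-envelope estimate used in the thread (3 or 4 minutes today, 10 minutes proposed). The function and dictionary names are hypothetical, and the calculation ignores the normal check interval and how Nagios actually schedules retries, so the numbers are approximate.

```python
def minutes_until_notification(max_check_attempts, retry_interval_min):
    """Rough estimate from the thread: attempts multiplied by the retry interval."""
    if max_check_attempts < 1 or retry_interval_min < 0:
        raise ValueError("attempts must be >= 1 and interval must be >= 0")
    return max_check_attempts * retry_interval_min


# (max_check_attempts, retry interval in minutes) as quoted in the thread.
# The labels are illustrative names, not anything from template.cfg.
SETTINGS = {
    "host (current)": (4, 1),
    "service (current)": (3, 1),
    "proposed": (5, 2),
}


def delay_table(settings):
    """Map each label to its approximate delay in minutes."""
    return {
        name: minutes_until_notification(attempts, interval)
        for name, (attempts, interval) in settings.items()
    }


if __name__ == "__main__":
    for name, minutes in delay_table(SETTINGS).items():
        print(f"{name}: about {minutes} min before a notification")
```

Under this estimate the proposal more than doubles the time before anyone hears about a real outage, which is the trade-off being argued over.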
So if a service or host is unreachable for 3 or 4 minutes, we get a notification. (However, in most cases it is a false positive, due to congestion or other causes.)
Looking through my email, from what I can recall there are no false positives. xen6 had to be power-cycled which caused all the other collateral notifications.
How long was it down? Why should a normal reboot send 23 mails? A reboot is not an exceptional thing, is it? An alert should be sent only when it's absolutely necessary... it should report only when xen6 comes up but a service does not. What do you think? Thanks.
On Sun, 27 Apr 2008, susmit shannigrahi wrote:
So if a service or host is unreachable for 3 or 4 minutes, we get a notification. (However, in most cases it is a false positive, due to congestion or other causes.)
Looking through my email, from what I can recall there are no false positives. xen6 had to be power-cycled which caused all the other collateral notifications.
How long was it down? Why should a normal reboot send 23 mails? A reboot is not an exceptional thing, is it? An alert should be sent only when it's absolutely necessary... it should report only when xen6 comes up but a service does not. What do you think? Thanks.
A normal reboot shouldn't, but when it's in a hung state, it takes a while before people can get to it.
-Mike
Nigel Jones wrote:
Looking through my email, from what I can recall there are no false positives. xen6 had to be power-cycled which caused all the other collateral notifications.
Collateral notifications can be caught using service dependencies and parent hosts. Do we currently use any?
Kind regards,
Jeroen van Meeuwen -kanarip
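To make the idea of parent hosts concrete, here is a minimal Python sketch of the reachability logic. A failed host whose parents have all failed too is treated as UNREACHABLE (collateral), while a failed host with no parents, or with at least one parent still up, is DOWN. The function name is hypothetical, and treating db2 as a child of xen6 is an assumption made for illustration only. The thread does not say how the hosts are actually related.

```python
def classify_failed_hosts(parents, failed):
    """Label each failed host as DOWN or UNREACHABLE.

    parents: dict mapping a host name to the list of its parent host names.
    failed:  set of host names that are currently not responding.
    """
    result = {}
    for host in sorted(failed):
        host_parents = parents.get(host, [])
        # Every parent has failed as well: the host is collateral damage.
        if host_parents and all(p in failed for p in host_parents):
            result[host] = "UNREACHABLE"
        else:
            # No parents, or at least one parent is still up.
            result[host] = "DOWN"
    return result


if __name__ == "__main__":
    # Assumed topology for illustration: db2 depends on xen6.
    topology = {"xen6": [], "db2": ["xen6"]}
    for host, state in classify_failed_hosts(topology, {"xen6", "db2"}).items():
        print(host, state)
```

With this topology only xen6 would be reported as DOWN, and the alert for db2 could be suppressed as a consequence of its parent failing.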
On Sun, April 27, 2008 11:01 pm, Jeroen van Meeuwen wrote:
Nigel Jones wrote:
Looking through my email, from what I can recall there are no false positives. xen6 had to be power-cycled which caused all the other collateral notifications.
Collateral notifications can be caught using service dependencies and parent hosts. Do we currently use any?
I believe we do, but it wouldn't have helped in this case (I've done a bit more digging).
Half the notifications came from the external nagios instance on noc2, while the xen6/db alerts came from the internal nagios instance. Another reason why I like the current setup and don't think we should change a thing :)
Also, the UNKNOWN alerts weren't that bad; they were a precursor to the box having to be restarted. Only in this case were the up/down alerts a little useless. However, I'd sooner keep them as is, because otherwise we run the risk of not noticing a box down immediately and getting everyone under the moon asking "why can't I access fedoraproject.org... it's down, your OS can't be that good".
- Nigel
On Mon, 28 Apr 2008, Nigel Jones wrote:
On Sun, April 27, 2008 11:01 pm, Jeroen van Meeuwen wrote:
Nigel Jones wrote:
Looking through my email, from what I can recall there are no false positives. xen6 had to be power-cycled which caused all the other collateral notifications.
Collateral notifications can be caught using service dependencies and parent hosts. Do we currently use any?
I believe we do, but it wouldn't have helped in this case (I've done a bit more digging).
Half the notifications came from the external nagios instance on noc2, while the xen6/db alerts came from the internal nagios instance. Another reason why I like the current setup and don't think we should change a thing :)
Also, the UNKNOWN alerts weren't that bad; they were a precursor to the box having to be restarted. Only in this case were the up/down alerts a little useless. However, I'd sooner keep them as is, because otherwise we run the risk of not noticing a box down immediately and getting everyone under the moon asking "why can't I access fedoraproject.org... it's down, your OS can't be that good".
One thing I would like implemented is event handlers. Some things (probably not this thing) could be handled automatically for us.
-Mike
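As a sketch of what an event handler could look like, the Python below decides whether to attempt an automatic restart based on the three values Nagios can pass to a service event handler: $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$. The function name, the choice of attempt 2, and the "restart" action are all hypothetical. Nothing here reflects an existing Fedora Infrastructure handler, and the script only prints its decision rather than restarting anything.

```python
import sys


def choose_action(state, state_type, attempt, restart_on_attempt=2):
    """Return "restart" or "none" for a service state change.

    state:      value of $SERVICESTATE$ (OK, WARNING, UNKNOWN, CRITICAL)
    state_type: value of $SERVICESTATETYPE$ (SOFT or HARD)
    attempt:    value of $SERVICEATTEMPT$ as an integer
    """
    if state != "CRITICAL":
        return "none"
    if state_type == "SOFT":
        # Try once, early, before the problem becomes HARD and notifies people.
        # Attempt 2 is an arbitrary illustrative choice.
        return "restart" if attempt == restart_on_attempt else "none"
    if state_type == "HARD":
        # Last resort once all check attempts have failed.
        return "restart"
    return "none"


if __name__ == "__main__":
    # Expected arguments: state, state type, attempt number.
    if len(sys.argv) == 4:
        print(choose_action(sys.argv[1], sys.argv[2], int(sys.argv[3])))
```

The point of acting on a SOFT state is that a successful restart can clear the problem before a notification is ever sent.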
On Sun, 27 Apr 2008, Jeroen van Meeuwen wrote:
Nigel Jones wrote:
Looking through my email, from what I can recall there are no false positives. xen6 had to be power-cycled which caused all the other collateral notifications.
Collateral notifications can be caught using service dependencies and parent hosts. Do we currently use any?
There's a ticket open but no progress.
-Mike
infrastructure@lists.fedoraproject.org