Re: How much downtime do we afford for nagios?

Sunday, 27 April 2008

On Mon, 28 Apr 2008, Nigel Jones wrote:

...
 On Sun, April 27, 2008 11:01 pm, Jeroen van Meeuwen wrote:
 > Nigel Jones wrote:
 >> Looking through my email, from what I can recall there are no false
 >> positives.  xen6 had to be power-cycled which caused all the other
 >> collateral notifications.
 >>
 >
 > Collateral notifications can be caught using service dependencies and
 > parent hosts. Do we currently use any?
 I believe we do, but it wouldn't have helped in this case (I've done a bit
 more digging).
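For readers unfamiliar with the parent-host mechanism Jeroen mentions: when a host fails its check, Nagios looks at that host's parents. If the host has no parents, or at least one parent is UP, the host is DOWN. If every parent is DOWN or UNREACHABLE, it is UNREACHABLE instead. The Python below is a simplified model of that rule, not Nagios code. The function name classify_hosts and the guest names db1 and app1 are hypothetical, used only to illustrate a xen6-style outage.

```python
"""Simplified model of how Nagios uses parent hosts to tell DOWN from UNREACHABLE.

Illustration only, not Nagios source. Host names used with it are hypothetical.
"""
from typing import Dict, List, Set

UP, DOWN, UNREACHABLE = "UP", "DOWN", "UNREACHABLE"


def classify_hosts(parents: Dict[str, List[str]], failing: Set[str]) -> Dict[str, str]:
    """Return a state for every host named in `parents`.

    A host that passes its check is UP. A failing host is DOWN if it has no
    parents or at least one parent is UP; if every parent is DOWN or
    UNREACHABLE, the failing host is UNREACHABLE.
    """
    states: Dict[str, str] = {}

    def state_of(host: str) -> str:
        # Memoised so each host is evaluated once (parent graph assumed acyclic).
        if host in states:
            return states[host]
        if host not in failing:
            states[host] = UP
        else:
            host_parents = parents.get(host, [])
            if not host_parents or any(state_of(p) == UP for p in host_parents):
                states[host] = DOWN
            else:
                states[host] = UNREACHABLE
        return states[host]

    for host in parents:
        state_of(host)
    return states
```

With xen6 and both hypothetical guests failing, `classify_hosts({"xen6": [], "db1": ["xen6"], "app1": ["xen6"]}, {"xen6", "db1", "app1"})` marks xen6 DOWN and the guests UNREACHABLE. Leaving `u` out of a host's `notification_options` is one way to keep those UNREACHABLE states from paging anyone, which is the kind of collateral suppression being asked about.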

 Half the notifications came from the external nagios instance on noc2,
 while the xen6/db alerts came from the internal nagios instance. Another
 reason why I like the current setup and don't think we should change a
 thing :)

 Also, the UNKNOWN alerts weren't that bad; they were a precursor to the
 box having to be restarted.  Only in this case were the up/down alerts a
 little useless.  However, I'd sooner keep them as is, because otherwise we
 run the risk of not noticing immediately that a box is down, and getting
 everyone under the moon asking "why can't I access fedoraproject.org...
 it's down, your OS can't be that good".

One thing I would like implemented is event handlers.  Some things
(probably not this thing) could be handled automatically for us.
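Event handlers are commands Nagios runs when a host or service changes state. The pattern shown in the Nagios documentation restarts a service on a SOFT CRITICAL retry, before the problem turns HARD and notifications go out. The following is a rough Python sketch of that pattern, not a tested handler. It assumes the handler is called with the $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$ macros as arguments and that max_check_attempts is 4. The restart command and the function names are hypothetical placeholders.

```python
#!/usr/bin/env python3
"""Sketch of a Nagios event handler following the restart-on-SOFT-CRITICAL pattern.

Assumptions: Nagios calls this with $SERVICESTATE$ $SERVICESTATETYPE$
$SERVICEATTEMPT$ as arguments, max_check_attempts is 4, and RESTART_CMD is
a hypothetical placeholder for whatever actually fixes the service.
"""
import subprocess
import sys

# Hypothetical remediation command; replace with the real fix for the service.
RESTART_CMD = ["/usr/bin/sudo", "/sbin/service", "httpd", "restart"]


def should_restart(state: str, state_type: str, attempt: int) -> bool:
    """Decide whether to attempt an automatic fix."""
    if state != "CRITICAL":
        return False  # OK, WARNING and UNKNOWN are left for humans
    if state_type == "SOFT":
        # Act on the third SOFT retry, one check before the state would go HARD.
        return attempt == 3
    # Went HARD without being fixed: try once more anyway.
    return state_type == "HARD"


def main(argv: list) -> int:
    """Parse the three macro arguments and run the fix if warranted."""
    if len(argv) != 4:
        print("usage: handler STATE STATETYPE ATTEMPT", file=sys.stderr)
        return 2
    state, state_type, attempt = argv[1], argv[2], int(argv[3])
    if should_restart(state, state_type, attempt):
        subprocess.run(RESTART_CMD, check=False)
    return 0


if __name__ == "__main__" and len(sys.argv) == 4:
    # Only act when invoked with exactly the three macro arguments.
    sys.exit(main(sys.argv))
```

To wire something like this up, a command definition would pass the three macros, the service's `event_handler` directive would name that command, and `enable_event_handlers=1` would need to be set in nagios.cfg. Keeping the decision in `should_restart` separate from the side effect makes it easy to check which states trigger a fix before letting it touch a production box.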

	-Mike
