On Fri, 2011-03-18 at 11:04 -0400, seth vidal wrote:
Hi folks,
Some thoughts have been slowly coalescing in my head about how we're
managing our boxes/services. I've passed some suggestions by various
folks, but I wanted to check them with everyone:
1. puppetd sucks..... memory. Right now we have puppetd running on every
box, and it wakes up every half hour and runs itself. This is fine, but
while it is idle it just eats memory for no good reason. I'd like to
suggest we move to a cron-driven model instead of a resident puppetd.
I'd write a simple cron job that fires every half hour and runs puppetd
once, provided no lock file is found. Pretty straightforward, of
course.
this is done.
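
For anyone who wants to picture the lock-file logic from item 1, here is a minimal sketch in Python. It is not the cron job that was actually deployed. The function name `run_unless_locked` and the lock path are made up for illustration, and creating the lock atomically with O_EXCL is my own choice rather than something described above.

```python
import os
import subprocess
import sys


def run_unless_locked(lock_path, command):
    """Run command unless lock_path already exists.

    Returns the command's exit status, or None if the run was skipped
    because another run still holds the lock.
    """
    try:
        # O_EXCL makes the open fail if the file exists, so checking for
        # the lock and creating it happen in a single atomic step.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None
    try:
        with os.fdopen(fd, "w") as lock_file:
            # Record our PID so a stale lock can be traced to a process.
            lock_file.write(str(os.getpid()))
        return subprocess.call(command)
    finally:
        # Always release the lock, even if the command could not start.
        os.remove(lock_path)
```

A wrapper script could call `run_unless_locked("/path/to/puppet-cron.lock", [...])` with whatever one-shot puppet command the box uses, and cron would invoke that wrapper every 30 minutes. Both the lock path and the command line are placeholders here.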
2. monitoring if puppetd has run properly:
two things we want to know about puppet runs:
a. when they last happened per-box
b. if they fell over in a horrible way.
(a) can be known by looking at the $nodename.yaml file which lives
on the puppetmaster. I've written a script to check if that file is
older than 1 hour and report the nodename if it is.
(b) can be done via the cron job - ie: taking error output from the
puppet run and mailing to people until we fix it! :)
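
The two checks in (a) and (b) could look roughly like the sketch below. This is not the script mentioned above. The function names are hypothetical, the yaml directory has to be supplied by the caller, and treating "non-zero exit status or anything on stderr" as a failed run is an assumption about how puppet reports problems.

```python
import os
import subprocess
import time


def stale_nodes(yaml_dir, max_age_seconds=3600, now=None):
    """Return node names whose <nodename>.yaml is older than max_age_seconds."""
    now = time.time() if now is None else now
    stale = []
    for name in sorted(os.listdir(yaml_dir)):
        if not name.endswith(".yaml"):
            continue
        age = now - os.path.getmtime(os.path.join(yaml_dir, name))
        if age > max_age_seconds:
            # Strip the extension to recover the node name.
            stale.append(name[:-len(".yaml")])
    return stale


def run_and_collect_errors(command):
    """Run command and return its error text, or "" if the run looked clean."""
    proc = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if proc.returncode != 0 or proc.stderr.strip():
        # Fall back to the exit status when the command failed silently.
        return proc.stderr or "exit status %d" % proc.returncode
    return ""
```

Since cron mails whatever a job prints, a wrapper only needs to print the non-empty result of `run_and_collect_errors` for the mail in (b) to go out.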
I've written this and it can now submit issues via nsca (via func). One
problem: it appears our puppet node names do not match our nagios host
names, A LOT. So we'll need to get some aliases in place so they work.
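
One way the aliasing could work, as a sketch only: keep a table from puppet node name to nagios host name and translate before building the passive check result. The table contents, the function names, and the service name are invented for illustration. The output line uses the tab-separated host, service, return code, output layout that send_nsca reads by default. The func transport is not shown.

```python
# Hypothetical alias table: puppet node name -> nagios host name.
NODE_ALIASES = {
    "app01.example.org": "app01",
}


def nagios_host(node_name, aliases=NODE_ALIASES):
    """Translate a puppet node name, falling back to the name unchanged."""
    return aliases.get(node_name, node_name)


def nsca_line(node_name, service, return_code, message, aliases=NODE_ALIASES):
    """Format one passive service check result as a tab-separated line."""
    host = nagios_host(node_name, aliases)
    # Tabs and newlines in the message would break the field layout,
    # so collapse all whitespace runs to single spaces.
    clean = " ".join(message.split())
    return "\t".join([host, service, str(return_code), clean]) + "\n"
```

For example, `nsca_line("app01.example.org", "puppet-run", 2, "yaml older than 1 hour")` yields a line reporting return code 2 (CRITICAL in nagios) against host `app01`. Nodes with no alias entry are passed through under their puppet name.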
-sv