On Fri, Mar 18, 2011 at 11:04:32AM -0400, seth vidal wrote:
Hi folks,
some thoughts have been slowly coalescing in my head about how we're
managing our boxes/services and I have some suggestions I've passed by
various folks but I wanted to check them out with everyone:
1. puppetd sucks..... memory. Right now we have puppetd running on every
box and it wakes up every half hour and runs itself. This is fine but in
the time where it is not doing anything it just eats memory for no good
reason. I'd like to suggest we move to a cron-driven model instead of
puppetd. I'd write a simple cron job that runs every half hour to run
puppetd, if a lock file is not found. Pretty straightforward, of
course.
+1
Might need to update kickstarts and/or the SOP pages:
http://fedoraproject.org/wiki/Kickstart_Infrastructure_SOP
http://fedoraproject.org/wiki/Puppet_Infrastructure_SOP
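
For reference, the cron-driven run from item 1 could look roughly like the sketch below. It is only an illustration, not the script Seth has in mind: the lock file path and the puppetd command line are assumptions, and run_if_unlocked is a made-up name.

```python
import os
import subprocess

# Assumed defaults; neither the lock path nor the command line comes from this thread.
LOCK_FILE = "/var/lib/puppet/state/puppetdlock"
PUPPET_CMD = ("puppetd", "--onetime", "--no-daemonize")


def run_if_unlocked(lock_file=LOCK_FILE, command=PUPPET_CMD):
    """Run `command` once unless `lock_file` exists.

    Returns None when the run is skipped because the lock file is present,
    otherwise a (returncode, stderr_text) tuple from the finished run.
    """
    if os.path.exists(lock_file):
        # Lock file found: skip this run, as described in item 1.
        return None
    result = subprocess.run(list(command), capture_output=True, text=True)
    return result.returncode, result.stderr
```

A cron entry firing every half hour would call this wrapper; whatever the wrapper prints on failure gets mailed by cron to the job's MAILTO address.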
2. monitoring if puppetd has run properly:
two things we want to know about puppet runs:
a. when they last happened per-box
b. if they fell over in a horrible way.
(a) can be known by looking at the $nodename.yaml file which lives
on the puppetmaster. I've written a script to check if that file is
older than 1 hour and report the nodename if it is.
(b) can be done via the cron job, i.e. taking error output from the
puppet run and mailing to people until we fix it! :)
+1
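
The freshness check in 2(a) could be sketched as below. This is not Seth's actual script: the directory holding the $nodename.yaml files is an assumption, and stale_nodes is a hypothetical name. Only the one-hour threshold comes from the mail above.

```python
import os
import time

# Assumed location of the per-node YAML files on the puppetmaster.
YAML_DIR = "/var/lib/puppet/yaml/node"
MAX_AGE_SECONDS = 60 * 60  # the one-hour threshold mentioned in 2(a)


def stale_nodes(yaml_dir=YAML_DIR, max_age=MAX_AGE_SECONDS, now=None):
    """Return the sorted node names whose YAML file is older than max_age seconds."""
    if now is None:
        now = time.time()
    stale = []
    for name in os.listdir(yaml_dir):
        if not name.endswith(".yaml"):
            continue  # ignore anything that is not a node file
        path = os.path.join(yaml_dir, name)
        if now - os.path.getmtime(path) > max_age:
            stale.append(name[:-len(".yaml")])
    return sorted(stale)
```

Printing the returned names from a cron job on the puppetmaster would be enough to get a report mailed whenever a node has gone quiet.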
3. sign** boxes. problems here:
a. These boxes are falling out of date, repeatedly, b/c they aren't
in our normal updating path.
b. these boxes don't email out to the same locations as the other
boxes
c. these boxes don't get FAS password updates properly
d. these boxes don't get config changes normally via puppet
(a) I'd like to suggest that they be put into a normal updating path
and/or we set up a nag mail to tell us about them
(b) obviously, fix their mail configs
(c) fasclient is failing b/c of a missing token b/c, most likely, of
(d)
I'm open to suggestions on those but it is a bit annoying b/c while I
understand their 'sensitivity' I think our way of treating them is
making the problem WORSE not better.
I agree with your assessment. I guess we need to tell releng our concerns
and figure out what needs to be done. For (a): perhaps have releng okay us,
or a specific subset of sysadmins, to run updates on these boxes along with
all the other updates.
-Toshio