Re: changing a few things in our host mgmt tools

Friday, 18 March 2011

On Fri, Mar 18, 2011 at 11:04:32AM -0400, seth vidal wrote:
...
 Hi folks,
  some thoughts have been slowly coalescing in my head about how we're
 managing our boxes/services and I have some suggestions I've passed by
 various folks but I wanted to check them out with everyone:

 1. puppetd sucks..... memory. Right now we have puppetd running on every
 box and it wakes up every half hour and runs itself. This is fine but in
 the time where it is not doing anything it just eats memory for no good
 reason. I'd like to suggest we move to a cron-driven model instead of
 puppetd. I'd write a simple cron job that runs every half hour to run
 puppetd, if a lock file is not found. Pretty straightforward, of
 course. 
+1

Might need to update kickstarts and/or the SOP pages:

http://fedoraproject.org/wiki/Kickstart_Infrastructure_SOP
http://fedoraproject.org/wiki/Puppet_Infrastructure_SOP
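
For what it's worth, the cron-driven model in point 1 could look roughly like the sketch below. This is only an illustration, not the script Seth has in mind: the lock file path and the function names are made up for the example, and the puppetd invocation (--onetime --no-daemonize) is an assumption about how a single run would be started. Whatever creates the lock file (an admin disabling runs, or a run already in progress) is outside the sketch.

```python
import os
import subprocess
import sys

# Hypothetical defaults: the thread names neither the lock file nor the exact command.
LOCK_FILE = "/var/run/puppet-cron.lock"
PUPPET_CMD = ["puppetd", "--onetime", "--no-daemonize"]


def run_if_unlocked(command=PUPPET_CMD, lock_file=LOCK_FILE):
    """Run command once unless lock_file exists.

    Returns the command's exit status, or None if the run was skipped.
    """
    if os.path.exists(lock_file):
        # A lock file is present, so skip this run entirely.
        return None
    # Output is left attached to our stdout/stderr so cron can handle it.
    return subprocess.run(command).returncode


def main():
    """Entry point for cron: exit 0 on a skipped run, else pass the status through."""
    status = run_if_unlocked()
    return 0 if status is None else status
```

The script would end with sys.exit(main()), and a cron.d entry such as "*/30 * * * * root /usr/local/sbin/puppet-cron" (path hypothetical) would give the half-hourly schedule. Nothing stays resident between runs, which is the point of the change.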

...
 2. monitoring if puppetd has run properly:
    two things we want to know about puppet runs:
    a. when they last happened per-box
    b. if they fell over in a horrible way.

     (a) can be known by looking at the $nodename.yaml file which lives
 on the puppetmaster. I've written a script to check if that file is
 older than 1 hour and report the nodename if it is.
     (b) can be done via the cron job - ie: taking error output from the
 puppet run and mailing to people until we fix it! :)
+1
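
The freshness check in (a) is easy to picture. A minimal sketch follows, assuming the per-node files sit in a single directory on the puppetmaster and are named $nodename.yaml; the directory path and the function name are placeholders for the example, and this is not Seth's actual script.

```python
import glob
import os
import time

# Hypothetical location: the thread only says the files live on the puppetmaster.
YAML_DIR = "/var/lib/puppet/yaml/node"
MAX_AGE_SECONDS = 60 * 60  # "older than 1 hour"


def stale_nodes(yaml_dir=YAML_DIR, max_age=MAX_AGE_SECONDS, now=None):
    """Return sorted node names whose .yaml file is older than max_age seconds."""
    if now is None:
        now = time.time()
    stale = []
    for path in glob.glob(os.path.join(yaml_dir, "*.yaml")):
        if now - os.path.getmtime(path) > max_age:
            # Drop the directory and the ".yaml" suffix to recover the node name.
            stale.append(os.path.basename(path)[:-len(".yaml")])
    return sorted(stale)
```

Printing one name per line from stale_nodes() and running that from cron on the puppetmaster would report every box that has not checked in within the hour.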

...
 3. sign** boxes. problems here:
    a. These boxes are falling out of date, repeatedly, b/c they aren't
 in our normal updating path.
    b. these boxes don't email out to the same locations as the other
 boxes
    c. these boxes don't get faspassword updates properly
    d. these boxes don't get config changes normally via puppet

    (a) I'd like to suggest that they be put into a normal updating path
 and/or we set up a nag mail to tell us about them
    (b) obviously, fix their mail configs
    (c) fasclient is failing b/c of a missing token b/c, most likely, of
 (d)

   I'm open to suggestions on those but it is a bit annoying b/c while I
 understand their 'sensitivity' I think our way of treating them is
 making the problem WORSE not better.
I agree with your assessment.  I guess we need to tell releng our concerns
and figure out what needs to be done.  For (a): perhaps have releng okay us,
or a specific subset of sysadmins, to run updates on these boxes along with
all the other updates.

-Toshio
