Re: [SSSD] [PATCH] monitor: Service restart fixes

Tuesday, 6 January 2015

On 12/10/2014 09:16 PM, Stephen Gallagher wrote:
...

 On Wed, 2014-12-10 at 14:59 -0500, Stephen Gallagher wrote:
> There are actually two bugs here:
>
> 1) When either the kill(SIGTERM) or kill(SIGKILL) commands returned
> failure (for any reason), we would talloc_free(svc) which removed it
> from being eligible for restart, resulting in the service never
> starting again without an SSSD service restart.
>
> 2) There is a fairly wide race condition where it's possible for a
> SIGKILL timer to "catch up" to the child exit handler between us
> noticing the termination and actually restarting it. The race
> happens because we re-enter the mainloop and add a restart
> timeout to avoid a quick failure if we keep restarting due to a
> transitory issue (the mt_svc object, and therefore the SIGKILL
> timer, were never freed until we got to the actual service
> restart).
>
> We can minimize this race by recording the timer_event for the
> SIGKILL timeout in the mt_svc object. This way, if the process
> exits via SIGTERM, we will immediately remove the timer for the
> SIGKILL.
>
> This patch also removes the incorrect talloc_free(svc) calls on the
> kill() failures and replaces them with an attempt to just start up
> the service again and hope for the best.
>
> Resolves:
> https://fedorahosted.org/sssd/ticket/2525

 Just after sending this, I noticed another enhancement I could make to
 basically eliminate the potential race condition. Updated patch
 attached. 
Ack. Thank you Stephen.
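
For readers following the thread without the patch at hand, the two fixes
described above can be sketched roughly as follows. This is only an
illustrative state model, not the SSSD code: struct svc_sketch, its fields,
and every function name are hypothetical, and a plain boolean stands in for
the timer_event that the patch stores in the mt_svc object.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the monitor's per-service record (mt_svc in the
 * thread). Field names are illustrative, not SSSD's. */
struct svc_sketch {
    bool running;             /* child process believed alive */
    bool sigkill_timer_armed; /* models the stored SIGKILL timer_event */
    bool restart_pending;     /* a restart has been scheduled */
    int  restarts;            /* number of restarts performed */
};

/* Ask the service to stop. kill_ok models whether kill(SIGTERM) succeeded. */
void svc_request_stop(struct svc_sketch *svc, bool kill_ok)
{
    assert(svc != NULL);
    if (!kill_ok) {
        /* Fix 1: do not free the record; just try to start it again. */
        svc->running = false;
        svc->restart_pending = true;
        return;
    }
    /* SIGTERM sent: arm the SIGKILL fallback and remember that we did. */
    svc->sigkill_timer_armed = true;
}

/* Child exit handler: the process is gone (for example after SIGTERM). */
void svc_child_exited(struct svc_sketch *svc)
{
    assert(svc != NULL);
    svc->running = false;
    /* Fix 2: drop the pending SIGKILL timer immediately. */
    svc->sigkill_timer_armed = false;
    svc->restart_pending = true;
}

/* SIGKILL timer callback. Returns true if a SIGKILL would be sent.
 * With a real event loop a freed timer never fires; the check below
 * models that. */
bool svc_sigkill_timer_fired(struct svc_sketch *svc)
{
    assert(svc != NULL);
    if (!svc->sigkill_timer_armed) {
        return false;
    }
    svc->sigkill_timer_armed = false;
    return true;
}

/* Delayed restart, run later from the main loop. */
void svc_restart(struct svc_sketch *svc)
{
    assert(svc != NULL);
    if (!svc->restart_pending) {
        return;
    }
    svc->restart_pending = false;
    svc->running = true;
    svc->restarts++;
}
```

In this model, once svc_child_exited() has run, a late
svc_sigkill_timer_fired() is a no-op, so the freshly restarted service cannot
be killed by a stale timer. A failed stop request still leaves the record
eligible for svc_restart().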
