On Wed, 2014-12-10 at 14:59 -0500, Stephen Gallagher wrote:
> There are actually two bugs here:
>
> 1) When either the kill(SIGTERM) or the kill(SIGKILL) call failed
> (for any reason), we would talloc_free(svc), which made the service
> ineligible for restart. As a result, it never started again until
> SSSD itself was restarted.
>
> 2) There is a fairly wide race window in which a SIGKILL timer can
> "catch up" to the child exit handler between the moment we notice
> the termination and the moment we actually restart the service. The
> race happens because we re-enter the mainloop and add a restart
> timeout, which avoids a quick failure loop if we keep restarting
> due to a transitory issue. The mt_svc object, and therefore the
> SIGKILL timer, was never freed until we got to the actual service
> restart.
>
> We can minimize this race by recording the timer_event for the
> SIGKILL timeout in the mt_svc object. This way, if the process
> exits via SIGTERM, we will immediately remove the timer for the
> SIGKILL.
>
> This patch also removes the incorrect talloc_free(svc) calls on the
> kill() failures and replaces them with an attempt to just start up
> the service again and hope for the best.
>
> Resolves:
>
> https://fedorahosted.org/sssd/ticket/2525

Just after sending this, I noticed another enhancement I could make to
basically eliminate the potential race condition. Updated patch
attached.

Ack. Thank you, Stephen.
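
For readers following the thread, the two fixes described in the quoted
commit message can be sketched in plain C roughly as follows. This is an
illustration under stated assumptions, not the attached patch: the real
code uses talloc and tevent, while every name here (sketch_svc,
sketch_timer, and the sketch_* functions) is hypothetical, the result of
kill() is passed in as a parameter instead of being called, and the
"timer" is just a heap object standing in for a mainloop timer event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a mainloop timer event (not a tevent type). */
struct sketch_timer {
    bool armed;
};

/* Hypothetical, simplified service record; not the real mt_svc layout. */
struct sketch_svc {
    int pid;
    struct sketch_timer *kill_timer;  /* pending SIGKILL escalation, or NULL */
    bool restart_scheduled;
};

/* Stand-in for adding the restart timeout to the mainloop. */
void sketch_schedule_restart(struct sketch_svc *svc)
{
    assert(svc != NULL);
    svc->restart_scheduled = true;
}

/* Drop the pending SIGKILL timer, if any, so it can never fire later. */
void sketch_cancel_kill_timer(struct sketch_svc *svc)
{
    assert(svc != NULL);
    free(svc->kill_timer);            /* free(NULL) is a no-op */
    svc->kill_timer = NULL;
}

/*
 * Called after attempting kill(pid, SIGTERM); kill_rc is that call's
 * return value. On failure the service record is kept and a restart is
 * attempted, instead of freeing the record. On success the SIGKILL
 * escalation timer is recorded in the service record.
 */
void sketch_after_sigterm(struct sketch_svc *svc, int kill_rc)
{
    assert(svc != NULL);

    if (kill_rc != 0) {
        sketch_schedule_restart(svc);
        return;
    }

    sketch_cancel_kill_timer(svc);    /* avoid leaking an older timer */
    svc->kill_timer = malloc(sizeof(*svc->kill_timer));
    if (svc->kill_timer != NULL) {
        svc->kill_timer->armed = true;
    }
}

/*
 * Child exit handler: cancel the SIGKILL timer first, then schedule the
 * restart, so the timer cannot fire while the restart timeout is pending.
 */
void sketch_on_child_exit(struct sketch_svc *svc)
{
    assert(svc != NULL);
    sketch_cancel_kill_timer(svc);
    sketch_schedule_restart(svc);
}
```

The point of keeping the timer pointer inside the service record is that
the exit handler can reach it directly. In this sketch, calling
sketch_after_sigterm(&svc, 0) followed by sketch_on_child_exit(&svc)
leaves svc.kill_timer set to NULL and svc.restart_scheduled set to true,
while a failed kill (a nonzero kill_rc) schedules a restart without ever
arming a timer.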