New subject: "Child not responding" on loaded servers

Tuesday, 3 May 2016

On 1 May 2016 at 17:04, Jakub Hrozek <jhrozek(a)redhat.com> wrote:
...
> On 30 Apr 2016, at 10:28, Patrick Coleman <patrick.coleman(a)meraki.com> wrote:
> On 29 Apr 2016 9:10 pm, "Lukas Slebodnik" <lslebodn(a)redhat.com> wrote:
> >
> > Do you mean IO related load or CPU related load?
>
> Lots of both, but we're typically IO bound more of the time.
>
> > If there is an issue with CPU then you can mount sssd cache to tmpfs
> > to avoid such issues. (there are plans to improve it in 1.14)
>
> Cool, I'll give that a go.

> Alternatively, increase the 'timeout' option in sssd's sections.

I appreciate the advice, thank you. I've put /var/lib/sss on to a tmpfs
filesystem on a couple of loaded machines and seen what I believe to
be improvements - it's a little too early to say, but I'll report back
once I have a wider deployment.

I did want to feed back a little of our research into this issue. If
we strace the sssd_be subprocess on a loaded machine, we see it
sitting in msync() and fdatasync() for periods of up to 7.3 seconds in
one test. This is perhaps expected, given the machine is under heavy
IO load, but sssd makes a *lot* of these calls.

In a 7m 49.985s test (this is as long as the sssd_be process lasts
before it is killed by the parent for not replying to ping) on a
machine with moderate disk load and no new interactive logins, sssd
made 232 *sync calls. The median syscall takes only 67ms, but the
maximum is more than seven seconds - in the eight minute test sssd
spent 1m 00.044s in *sync system calls.

My (naive) analysis here is that the backend process is spending 13%
of its time unavailable to service account queries, because it's doing
cache maintenance. This seems to rather defeat the point of having a
cache... are my assumptions correct here? I'm happy to send the strace
log (and any other data) to interested parties off-list, just let me
know.
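
For anyone who wants to repeat the measurement, here is a minimal sketch
of how figures like these could be pulled out of a trace. It assumes the
log was captured with strace's -T option (which appends the time spent in
each call as <seconds> at the end of the line) and that each call appears
on a single line. The function names are made up for illustration and
this is not necessarily how the numbers above were produced.

```python
import re
import statistics

# Matches completed msync()/fdatasync() lines from `strace -T`, e.g.
#   fdatasync(23)                           = 0 <0.067000>
# A leading PID or timestamp is tolerated. Calls split across
# "<unfinished ...>" / "resumed" lines are not counted.
SYNC_CALL = re.compile(r"\b(msync|fdatasync)\(.*<(\d+\.\d+)>\s*$")


def sync_durations(lines):
    """Return the duration in seconds of every matching *sync call."""
    durations = []
    for line in lines:
        match = SYNC_CALL.search(line)
        if match:
            durations.append(float(match.group(2)))
    return durations


def summarize(durations, wall_seconds):
    """Summarize *sync calls over a trace lasting wall_seconds."""
    if not durations:
        raise ValueError("no *sync calls found in the trace")
    total = sum(durations)
    return {
        "calls": len(durations),
        "median": statistics.median(durations),
        "max": max(durations),
        "total": total,
        # Share of wall-clock time spent blocked in *sync calls.
        "fraction": total / wall_seconds,
    }
```

Called as summarize(sync_durations(open("sssd_be.strace")), 469.985),
where the file name is a placeholder, this gives the call count, median,
maximum and total. With the totals quoted above, 60.044 s out of
469.985 s is a fraction of about 0.128, which is where the 13% comes from.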

In an attempt to improve this behaviour, in addition to a tmpfs for
/var/lib/sss I've also just added the following to the nss and pam
stanzas in the config:

memcache_timeout = 1800
entry_cache_timeout = 1800

...the idea being they will respond from their own cache without
contacting the backend, which may be busy per the above. Is this
reasonable?

Cheers,

Patrick
