On Tue, May 03, 2016 at 03:52:03PM +0100, Patrick Coleman wrote:
On 1 May 2016 at 17:04, Jakub Hrozek <jhrozek(a)redhat.com>
>> On 30 Apr 2016, at 10:28, Patrick Coleman <patrick.coleman(a)meraki.com>
>> On 29 Apr 2016 9:10 pm, "Lukas Slebodnik" <lslebodn(a)redhat.com>
>> > Do you mean IO-related load or CPU-related load?
>> Lots of both, but we're typically IO bound more of the time.
>> > If there is an issue with CPU then you can mount the sssd cache on tmpfs
>> > to avoid such issues. (There are plans to improve this in 1.14.)
>> Cool, I'll give that a go.
> Alternatively, increase the 'timeout' option in sssd's sections.
I appreciate the advice, thank you. I've put /var/lib/sss onto a tmpfs
filesystem on a couple of loaded machines and seen what I believe to
be improvements - it's a little too early to say, but I'll report back
once I have a wider deployment.
I did want to feed back a little of our research into this issue. If
we strace the sssd_be subprocess on a loaded machine, we see it
sitting in msync() and fdatasync() for periods of up to 7.3 seconds in
one test. This is perhaps expected, given the machine is under heavy
IO load, but sssd makes a *lot* of these calls.
Yes, every cache update does 4 of these. This is a known issue I'm
working on right now:
In a 7m 49.985s test (this is as long as the sssd_be process lasts
before it is killed by the parent for not replying to ping) on a
machine with moderate disk load and no new interactive logins, sssd
made 232 *sync calls. The median syscall takes only 67ms, but the
maximum is more than seven seconds - in the eight minute test sssd
spent 1m 00.044s in *sync system calls.
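
Figures like the ones above can be pulled out of an strace log with a few lines of Python. This is only a minimal sketch: it assumes the log was captured with `strace -T` (so each line ends with the time spent in the call, in angle brackets), and the function name `summarize_sync_calls` is made up for illustration. The thread does not say how the original numbers were produced.

```python
import re
from statistics import median

# Assumed input format: lines from `strace -T`, for example
#   fdatasync(21) = 0 <0.067211>
# The trailing <seconds> field is the time spent inside the syscall.
SYNC_CALL = re.compile(
    r"\b(msync|fdatasync|fsync)\(.*\)\s*=\s*-?\d+.*<(\d+\.\d+)>\s*$"
)

def summarize_sync_calls(lines, wall_seconds):
    """Summarize time spent in *sync calls (hypothetical helper).

    lines        -- iterable of strace output lines
    wall_seconds -- total duration of the traced run, in seconds
    """
    durations = []
    for line in lines:
        match = SYNC_CALL.search(line)
        if match:
            durations.append(float(match.group(2)))
    total = sum(durations)
    return {
        "calls": len(durations),
        "median": median(durations) if durations else 0.0,
        "max": max(durations, default=0.0),
        "total": total,
        # Share of the run during which the process sat in a *sync call.
        "fraction": total / wall_seconds if wall_seconds else 0.0,
    }
```

With the numbers quoted in this thread, 60.044 s of sync time over a 469.985 s run gives a fraction of about 0.128, which is the roughly 13% mentioned below.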
My (naive) analysis here is that the backend process is spending 13%
of its time unavailable to service account queries, because it's doing
Very nice analysis.
Just a detail: it's not cache maintenance, but cache updates. The thing we
are doing wrong at the moment is that we do a full write of the whole object
even if nothing has changed.
This seems to rather defeat the point of having a
cache... are my assumptions correct here? I'm happy to send the strace
log (and any other data) to interested parties off-list, just let me
In an attempt to improve this behaviour, in addition to a tmpfs for
/var/lib/sss I've also just added the following to the nss and pam
stanzas in the config:
memcache_timeout = 1800
entry_cache_timeout = 1800
...the idea being they will respond from their own cache without
contacting the backend, which may be busy per the above. Is this
Yes it is, in the sense that the cache writes would be performed less
frequently. But your value of entry_cache_timeout is too low; the default
is 5400. By the way, an authentication request ignores the cache validity,
since during authentication we always try to get fresh group membership.
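
To make the effect of the timeout concrete, here is a rough back-of-the-envelope sketch in Python. It rests on simplifying assumptions that are not from the sssd code: each frequently requested entry is refreshed at most once per entry_cache_timeout, each refresh costs the 4 sync calls mentioned earlier in this thread, and authentication requests (which always refresh) are ignored. The function name is hypothetical.

```python
# Figure quoted earlier in this thread: each cache update does 4 *sync calls.
SYNC_CALLS_PER_UPDATE = 4

def max_sync_calls_per_hour(entry_cache_timeout, hot_entries=1):
    """Rough upper bound on *sync calls per hour from periodic refreshes.

    Assumption (not taken from sssd itself): an entry that is requested
    continuously is rewritten at most once per entry_cache_timeout seconds.
    """
    if entry_cache_timeout <= 0:
        raise ValueError("entry_cache_timeout must be positive")
    updates_per_hour = 3600 / entry_cache_timeout
    return updates_per_hour * SYNC_CALLS_PER_UPDATE * hot_entries

for timeout in (1800, 5400):
    print(timeout, max_sync_calls_per_hour(timeout))
```

Under this model a timeout of 1800 allows up to 8 sync calls per hour per busy entry, while the default of 5400 allows fewer than 3, which is why the lower value means more cache writes, not fewer.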
When #2602 is implemented, the cost of cache updates should (mostly) go