We have lately been having big problems with sssd caching. On our ssh servers (each with ~100-200 users), login may take several minutes because the sssd_be process uses 100% CPU time, and it may stay in this state for days. Clearing the cache and restarting sssd during the day usually helps, and then everything works for a few days, sometimes only hours. It is not clear what triggers this behaviour, maybe some combination of lots of users and a cache update at the same time.

The culprit seems to be the recent addition of a few big groups to LDAP for our access policy, which has worsened the situation and sssd performance.

On a test server, a simple id command with an empty cache and the same settings as in production takes:
[root@testsk tmp]# time id testusr
uid=1143(testusr) gid=100(users) groups=100(users),3318(roam),3102(nixe),1000(staff1),3785(wl-staff1),3119(system),3402(fileaccess),3377(vpn1),120(grp2),3123(devel),1001(devel3),3378(vpn2),3266(usr),3386(access3)

real    0m28.689s
user    0m0.006s
sys    0m0.007s
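
To see which part of the lookup is slow, the id call can be split into the NSS steps it roughly performs: getpwnam, getgrouplist, and one getgrgid per group. The following is only a minimal sketch, assuming Python 3.3+ is available on the test host (the stock RHEL6 python is older); timed and profile_id are names made up for this example.

```python
import grp
import os
import pwd
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.monotonic()
    result = fn(*args)
    return result, time.monotonic() - start


def profile_id(username):
    """Time the NSS calls that `id <username>` roughly corresponds to."""
    timings = {}
    pw, timings["getpwnam"] = timed(pwd.getpwnam, username)
    gids, timings["getgrouplist"] = timed(os.getgrouplist, username, pw.pw_gid)
    # Resolving a gid to a name also returns the group's member list,
    # so this is the step where very large groups would be expected to hurt.
    for gid in gids:
        start = time.monotonic()
        try:
            name = grp.getgrgid(gid).gr_name
        except KeyError:
            name = str(gid)  # gid with no resolvable group entry
        timings["getgrgid(%s)" % name] = time.monotonic() - start
    return timings
```

Calling profile_id("testusr") right after clearing the cache and printing the returned dict should show whether the time goes into a few large groups or is spread evenly.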

We currently have several groups with around 17 000 and 3000 users, so this id query creates over 100k ghost users in the cache:

[root@testsk tmp]# ldbsearch -H /var/lib/sss/db/cache_TESTAUTH.ldb |grep ghost |wc -l
asq: Unable to register control with rootdse!
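
The grep above only gives a total. To see which groups contribute the ghost entries, the ldbsearch output can be tallied per entry. This is a rough sketch, assuming ldbsearch prints LDIF records separated by blank lines with one "ghost:" line per ghost member; ghosts_per_group is a name made up for this example.

```python
from collections import Counter


def ghosts_per_group(ldif_lines):
    """Count `ghost` attribute values per entry DN in LDIF text."""
    counts = Counter()
    current_dn = None
    for raw in ldif_lines:
        line = raw.rstrip("\n")
        if line.startswith("dn:"):
            # Handles both "dn: value" and base64 "dn:: value"
            current_dn = line.split(":", 1)[1].lstrip(":").strip()
        elif not line:
            current_dn = None  # a blank line ends the current record
        elif current_dn and line.lower().startswith("ghost:"):
            counts[current_dn] += 1
    return counts
```

Piping the ldbsearch output into a small script that calls ghosts_per_group(sys.stdin).most_common(10) would list the ten entries with the most ghost members.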

Indeed, with full debug (the id command then takes over 1 minute), all I see in the logs is the ldap backend mostly adding ghost users to the cache, as it adds information from _all_ groups related to that uid. As the backend is not responding to monitor pings fast enough, the monitor tries to kill and restart it. The same also happens on the production servers. I have already extended the timeout to 60 but it seems not to be enough.

This latter case became especially relevant when we started to receive complaints from some people that httpd authentication was not working. The Apache error log shows:
[Tue Oct 29 12:21:36 2013] [error] [client xxx.xx.xx.xx] GROUP: testuser not in required group(s).
when in fact the user is in the required group, but it seems that sssd just fails to respond fast enough. This is a (PAM, AuthType Basic, Require group testgroup) kind of authentication.

This is on RHEL6.4, sssd-1.9.2-82.10.el6_4.x86_64. Configured services: nss, ldap.
Sanitized config:
config_file_version = 2
debug_level = 1
reconnection_retries = 3
timeout = 60
services = nss
domains = TESTAUTH
filter_groups = root
filter_users = root
reconnection_retries = 3
debug_level = 1
debug_level = 1
ldap_purge_cache_timeout = 3600
id_provider = ldap
auth_provider = ldap
ldap_uri = ldap://authserv.test
ldap_search_base = dc=test
ldap_user_search_base = ou=People,dc=test
ldap_group_search_base = ou=Group,dc=test

So in the end, any ideas or suggestions on how to improve the situation? Of course I'm willing to debug/test this more if needed, as the current situation is almost disastrous.

 - Sami

PS. A quick test on Fedora 19 with sssd-1.11.1-4.fc19 completed the same queries in 7 seconds or less, so apparently some progress in performance has been made. Any idea when the RHEL6 sssd will be rebased? I tried to compile the latest git version on RHEL6 but I couldn't find all the required components (e.g. configure: error: you must have the cifsidmap header installed to build the idmap plugin).