Hi,
We've got a number of machines using sssd to connect to LDAP for auth.
In the past we had problems with sssd crashing regularly [1], but
after posting here we built custom packages that disable netlink
notifications from the kernel, and stability has generally improved.

We're still seeing auth failures on seemingly random machines (perhaps
1-2%) when we run a process that connects to all hosts. The machines
are generally heavily loaded when this happens, and sssd.log looks like:

(Fri Apr 29 09:31:19 2016) [sssd] [ping_check] (0x0020): A service PING timed out on [nss]. Attempt [0]
(Fri Apr 29 09:31:29 2016) [sssd] [tasks_check_handler] (0x0020): Child (meraki) not responding! (yet)
(Fri Apr 29 09:31:39 2016) [sssd] [tasks_check_handler] (0x0020): Child (meraki) not responding! (yet)
(Fri Apr 29 09:31:39 2016) [sssd] [ping_check] (0x0020): A service PING timed out on [nss]. Attempt [0]

While sssd is in this state, it seems to deny auth randomly for LDAP
users - they receive "connection closed by remote host". It will
eventually restart its children, but that doesn't seem to fix the
problem.
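
To get a rough idea of how often the monitor reports these timeouts on a
given host, the sssd.log entries above can be tallied with a short
script. This is only a minimal sketch: the function name and the regular
expressions are my own and are based purely on the line format shown in
this mail, not on any documented sssd log format, so they may need
adjusting for other versions or debug levels.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt in this mail:
# (timestamp) [process] [function] (0xLEVEL): message
LOG_LINE = re.compile(
    r"^\((?P<ts>[^)]+)\) \[(?P<proc>.+?)\] \[(?P<func>\w+)\] "
    r"\((?P<level>0x[0-9a-fA-F]+)\): (?P<msg>.*)$"
)
PING_TIMEOUT = re.compile(r"PING timed out on \[(?P<name>[^\]]+)\]")
CHILD_STUCK = re.compile(r"Child \((?P<name>[^)]+)\) not responding")


def summarize_monitor_events(lines):
    """Count ping timeouts and unresponsive-child reports per service/domain.

    Hypothetical helper: returns a Counter keyed by (event, name).
    Lines that do not match the assumed layout are skipped.
    """
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue
        msg = match.group("msg")
        ping = PING_TIMEOUT.search(msg)
        if ping:
            counts[("ping_timeout", ping.group("name"))] += 1
            continue
        child = CHILD_STUCK.search(msg)
        if child:
            counts[("child_not_responding", child.group("name"))] += 1
    return counts


# The four sssd.log entries quoted above, one entry per line.
SAMPLE = [
    "(Fri Apr 29 09:31:19 2016) [sssd] [ping_check] (0x0020): "
    "A service PING timed out on [nss]. Attempt [0]",
    "(Fri Apr 29 09:31:29 2016) [sssd] [tasks_check_handler] (0x0020): "
    "Child (meraki) not responding! (yet)",
    "(Fri Apr 29 09:31:39 2016) [sssd] [tasks_check_handler] (0x0020): "
    "Child (meraki) not responding! (yet)",
    "(Fri Apr 29 09:31:39 2016) [sssd] [ping_check] (0x0020): "
    "A service PING timed out on [nss]. Attempt [0]",
]

counts = summarize_monitor_events(SAMPLE)
for (event, name), n in sorted(counts.items()):
    print(f"{event:22s} {name:10s} {n}")
```

On a real host the same function can be fed an open file, e.g.
summarize_monitor_events(open("/var/log/sssd/sssd.log")), which makes it
easier to compare affected and unaffected machines.
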
Logs for the meraki domain and for nss indicate the subprocesses are running:

/var/log/sssd/sssd_meraki.log:
(Fri Apr 29 09:30:53 2016) [sssd[be[meraki]]] [sdap_save_user] (0x0400): Storing info for user blinken
(Fri Apr 29 09:31:22 2016) [sssd[be[meraki]]] [sdap_initgr_rfc2307_next_base] (0x0400): Searching for groups with base [dc=meraki,dc=com]
(Fri Apr 29 09:31:22 2016) [sssd[be[meraki]]] [sdap_get_generic_ext_step] (0x0400): calling ldap_search_ext with [(&(memberuid=blinken)(objectClass=posixGroup)(cn=*)(&(gidNumber=*)(!(gidNumber=0))))][dc=meraki,dc=com].
(Fri Apr 29 09:31:22 2016) [sssd[be[meraki]]] [sdap_get_generic_op_finished] (0x0400): Search result: Success(0), no errmsg set

/var/log/sssd/sssd_nss.log:
(Fri Apr 29 09:31:22 2016) [sssd[nss]] [nss_cmd_getgrgid_search] (0x0080): No matching domain found for [1155]
(Fri Apr 29 09:31:22 2016) [sssd[nss]] [nss_cmd_getbynam] (0x0100): Requesting info for [blinken] from [<ALL>]
(Fri Apr 29 09:31:22 2016) [sssd[nss]] [nss_cmd_initgroups_search] (0x0100): Requesting info for [blinken@meraki]
(Fri Apr 29 09:31:26 2016) [sssd[nss]] [calc_flat_name] (0x0080): Flat name requested but domain has noflat name set, falling back to domain name
(Fri Apr 29 09:31:26 2016) [sssd[nss]] [nss_cmd_getbynam] (0x0100): Requesting info for [meraki] from [<ALL>]
(Fri Apr 29 09:31:26 2016) [sssd[nss]] [nss_cmd_initgroups_search] (0x0080): No matching domain found for [meraki], fail!

We first saw this behaviour on sssd 1.11.7 and have since upgraded to
sssd 1.13.4, with more or less the same symptoms. We've tried with
enumerate both on and off, with no apparent change in behaviour.
Does anyone have any suggestions here? Let me know if I can provide
more detailed debugging information (perhaps off-list).
Cheers,
Patrick

[1] https://lists.fedorahosted.org/archives/list/sssd-users@lists.fedorahoste...