Hi Jakub,
Apologies for the long delay in response as I was dragged away for other projects! So my
previous (false)theory was a result of sssd_nss not being able to see entries that the
sssd_be places into the memcache. It would not be able to find it's group so it would
query any other domains it could see. I've now been able to isolate the issue down to
an even smaller plausible cause.
After some digging through logs of an affected machine I've discovered a very
interesting set of logs. This is quite verbose so bear with me. The command I ran was:
$ getent group $GID
Here we see NSS requesting a GID number to be resolved/fetched:
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [nss_getby_id] (0x0400): Input ID: 1000001111
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_send] (0x0400): CR #187: New request
'Group by ID'
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_select_domains] (0x0400): CR #187:
Performing a multi-domain search
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_domains] (0x0400): CR #187:
Search will check the cache and check the data provider
First it checks the NSS cache to see if an entry is present in it's cache:
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_set_domain] (0x0400): CR #187: Using
domain [
mydomain.com]
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_send] (0x0400): CR #187: Looking
up GID:1000001111@mydomain.com
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_ncache] (0x0400): CR #187:
Checking negative cache for [GID:1000001111@mydomain.com]
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_ncache] (0x0400): CR #187:
[GID:1000001111@mydomain.com] is not present in negative cache
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_cache] (0x0400): CR #187: Looking
up [GID:1000001111@mydomain.com] in cache
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_cache] (0x0400): CR #187: Object
[GID:1000001111@mydomain.com] was not found in cache
It's not present in the cache according to NSS so it requests the backend to search
the domain provider for this entry:
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_dp] (0x0400): CR #187: Looking up
[GID:1000001111@mydomain.com] in data provider
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [sss_dp_issue_request] (0x0400): Issuing request
for [0x55ee5f5d1b10:2:1000001111@mydomain.com]
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [sss_dp_get_account_msg] (0x0400): Creating request
for [mydomain.com][0x2][BE_REQ_GROUP][idnumber=1000001111:-]
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [sss_dp_internal_get_send] (0x0400): Entering
request [0x55ee5f5d1b10:2:1000001111@mydomain.com]
The domain provider finds the entry on the domain and fills the cache:
(Thu Oct 17 17:01:42 2019) [sssd[be[mydomain.com]]] [sdap_save_group] (0x0400): Storing
info for group TheGroup(a)mydomain.com
(Thu Oct 17 17:01:42 2019) [sssd[be[mydomain.com]]] [sysdb_store_group] (0x1000): The
group record of TheGroup(a)mydomain.com did not change, only updated the timestamp cache
What's interesting is that the sssd_be tells us that the entry was already present,
yet nss was unaware of any entries in the cache, nss didn't even say (not exact quote)
"entry found, needs updating", which after some testing on a working machine, is
what occurs when nss encounters an entry that is out of date.
Here we see sssd_nss responding to the backend's return of the group entry.
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [sss_dp_get_reply] (0x1000): Got reply from Data
Provider - DP error code: 0 errno: 0 error message: Success
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_cache] (0x0400): CR #187: Looking
up [GID:1000001111@mydomain.com] in cache
(Thu Oct 17 17:02:07 2019) [sssd[nss]] [cache_req_search_cache] (0x0400): CR #187: Object
[GID:1000001111@mydomain.com] was not found in cache
Either the backend is entering data wrongly into the cache or nss is unable to read
certain entries that are placed in the cache, I would like to think it is the former
because the issue is so intermittent and for the most part, it works correctly. The only
workaround we have found so far is stopping sssd.service, removing the ldbs from
/var/lib/sss/db/ and restarting the service. This allows sssd to work correctly and return
the groups correctly.
I hope this makes sense,
Thanks,
Jamal