I was suspecting a race condition, because the cleanup task, like the
rest of the code, is asynchronous. I was suspecting the following might
happen:
- initgroups starts:
- users are written to the cache
- groups are written to the cache but not linked yet to the user
- cleanup task starts
- cleanup task removes the group objects because they are
"empty". It shouldn't happen because the cleanup task should
only remove expired entries, but IIRC Lukas saw a similar
"groups are written to the cache but not linked yet to the user objects"
Is it possible for the responder to answer a client about group information
before the groups are written to the cache AND linked to the user? That's what the
getgroups syscall (from the client) returning the wrong number of groups would
suggest when the problem occurs. Could that be related to ghost or fake entries?
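
To make the hypothesis concrete, here is a toy Python model of the
interleaving described above. It is not SSSD code: the cache layout, the
function names (initgroups_steps, buggy_cleanup, run) and the user and
group names are all made up for illustration, and the cleanup is
deliberately written to drop "empty" groups, which the real one should
not do.

```python
# Toy model of the suspected race. NOT SSSD code; every name is hypothetical.

def initgroups_steps(cache, user, groups):
    """Generator: each yield marks a point where another task could run."""
    cache["users"][user] = set()
    yield "user written"
    for g in groups:
        cache["groups"].setdefault(g, set())
    yield "groups written, not linked"
    for g in groups:
        if g in cache["groups"]:  # only groups that still exist get linked
            cache["groups"][g].add(user)
            cache["users"][user].add(g)
    yield "groups linked"


def buggy_cleanup(cache):
    """Hypothetical cleanup that (wrongly) removes groups with no members."""
    empty = [name for name, members in cache["groups"].items() if not members]
    for name in empty:
        del cache["groups"][name]


def run(interleave_cleanup):
    """Run initgroups, optionally letting cleanup fire before the link step."""
    cache = {"users": {}, "groups": {}}
    for step in initgroups_steps(cache, "alice", ["staff", "wheel"]):
        if interleave_cleanup and step == "groups written, not linked":
            buggy_cleanup(cache)
    return sorted(cache["users"]["alice"])
```

In this model run(False) gives the user both groups, while run(True)
leaves the user with none, which is the kind of short group list a
client would then see.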
> Maybe you could show us where to look exactly for :
> - where the backend is writing the groups data to the sysdb cache
So the operation that evaluates what groups the user is a member of is
called initgroups. IIRC you're using the rfc2307 (non-bis) schema, so
the initgroups request that you run starts at
src/providers/ldap/sdap_async_initgroups.c:385 in function
sdap_initgr_rfc2307_send() and ends at sdap_initgr_rfc2307_recv()
> - where the backend is signaling to the responder that the cache has been updated
The schema-specific request is the one I listed above; it then
returns to the generic LDAP code in ldap_common.c. The function that
signals over sbus (the D-Bus protocol used over a UNIX socket) is
sdap_handler_done(), in particular be_req_terminate()
> - where the responder is aware that he can now check the cache to get the answer
This is done in src/responder/common/responder_dp.c. The request is
sent with sss_dp_get_account_send().
This code is a bit complex, because concurrent requests are just added to
a queue in sss_dp_issue_request() if the corresponding request is already
found in the rctx->dp_request_table hash table. But the first request that
finishes would receive an sbus message from the provider in
sss_dp_internal_get_done(). Then it would iterate over the queue of
requests and mark them as done or failed.
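
The queueing described above can be sketched roughly like this. It is a
simplified Python model, not the actual responder code: the Dispatcher
class, its method names and the request key are hypothetical, and only
the idea (one request per key goes to the provider, every queued caller
is notified when the reply arrives) is taken from the description.

```python
# Toy model of request coalescing. NOT the SSSD implementation.

class Dispatcher:
    def __init__(self):
        self.request_table = {}  # key -> list of callbacks waiting on it
        self.sent = []           # keys actually sent to the provider

    def issue_request(self, key, callback):
        """Queue the callback; contact the provider only for the first one."""
        if key in self.request_table:
            self.request_table[key].append(callback)
            return
        self.request_table[key] = [callback]
        self.sent.append(key)

    def request_done(self, key, ok):
        """Provider replied: notify every queued caller, then drop the key."""
        for callback in self.request_table.pop(key, []):
            callback("done" if ok else "failed")


results = []
dp = Dispatcher()
key = ("initgroups", "alice")
dp.issue_request(key, lambda status: results.append(("first", status)))
dp.issue_request(key, lambda status: results.append(("second", status)))
dp.request_done(key, ok=True)
```

Both callers are completed by the single reply, so the second one never
triggers its own provider lookup.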
> - where the responder is actually getting the data from the sysdb cache
The callback that should be invoked by this generic NSS code is in
src/responder/nss/nsssrv_cmd.c, in particular
nss_cmd_initgroups_search() and the function check_cache().
Thank you for this extensive answer. We were quite close to this understanding.
We'll try to dig more.