On Fri, Apr 03, 2015 at 05:59:10PM +0200, Thomas HUMMEL wrote:
I'm using sssd-ldap-1.11.6 (from the official CentOS repo) on CentOS release
6.6 (Final) on a cluster of compute nodes running the slurm scheduler,
version 14.11.
sssd is configured without enumeration, with cache_credentials enabled and
default values for the various cache timeouts.
It works fine except in the following case, where there seems to be a caching
problem:
[ the following is 100% reproducible ]
a) I clear the cache with the following commands :
. /etc/init.d/sssd stop
. rm -rf /var/lib/sss/mc/* /var/lib/sss/db/*
. /etc/init.d/sssd start
b) I launch a "job array" consisting of 100 or so simple tasks. Basically this
will execute in batch many instances (each one called a task) of the same
program in parallel on the compute node.
Each task writes its output to a .out text file owned by <user>:<gid>.
-> so many processes end up querying sssd in parallel to retrieve the user groups
What happens is that :
. the first task completes without error
. tasks 2 and 3 (or something like that) fail with a "permission denied"
. tasks > 3 complete without error
. also, if we ask slurm to launch the tasks one after the other instead of in
parallel, the problem does not occur
- the job array is very fast since each task is very simple. Many tasks can
complete in under a second.
- if I don't clear sssd cache or if I just issue sss_cache -E or -g, the
problem occurs randomly and may be hard to reproduce.
At full debug level, the sssd logs show that the LDAP answers are correct and
that sssd, only for entries not already in the cache, adds so-called "fake groups":
ex : 'Adding fake group gensoft to sysdb'
A simple patch to slurm that prints (with getgroups(2)) the number of groups
of the user shows that, for failed tasks, the list of groups retrieved
is incomplete, which explains the "permission denied" message.
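
The check described above can be approximated outside slurm. The following is a minimal sketch in Python, not the actual slurm patch (which is not shown in this thread): it compares the supplementary groups the current process holds, as returned by getgroups(2), with what NSS reports for the same user at that moment. The function name group_report is made up for illustration.

```python
import os
import pwd


def group_report():
    """Compare the groups held by this process with what NSS returns now."""
    uid = os.getuid()
    try:
        entry = pwd.getpwuid(uid)
        name, gid = entry.pw_name, entry.pw_gid
    except KeyError:
        # No passwd entry for this uid: fall back to numeric values.
        name, gid = str(uid), os.getgid()
    # Supplementary groups attached to this process (getgroups(2)).
    held = set(os.getgroups())
    # Groups NSS (and so sssd, if it is configured) reports for the user now.
    expected = set(os.getgrouplist(name, gid))
    # getgroups(2) is not required to list the primary gid, so ignore it here.
    missing = expected - held - {gid}
    return {"user": name, "held": held, "expected": expected, "missing": missing}


if __name__ == "__main__":
    report = group_report()
    print("user:", report["user"])
    print("groups held by process:", len(report["held"]))
    print("groups reported by NSS:", len(report["expected"]))
    if report["missing"]:
        print("gids missing from the process:", sorted(report["missing"]))
```

Run from inside a task, a non-empty "missing" list would point at a process that was started with an incomplete group list.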
In fact, the missing groups seem to be the "fake" groups which seem to be
put in sssd cache by the first task.
So my guess is that :
. task 1 fetches groups missing from the cache and first flags them as "fake"
. before task 1 finishes "resolving" the fake group entries, tasks 2 and 3
discard those incomplete entries
. task 1 finishes replacing fake by real groups
. following tasks behave as expected regarding groups
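
To make the guess above concrete, here is a toy model in Python. It is only an illustration of the hypothesis, not a description of how sssd's cache works internally: the dictionary cache, the "fake"/"real" states, the user name alice and the groups other than gensoft are all invented for the example.

```python
def visible_groups(cache, user):
    """Groups a reader sees if it discards entries still flagged as fake."""
    return [name for name, state in cache[user].items() if state == "real"]


def simulate():
    # Hypothetical data: "users" is assumed to be resolved already,
    # the other two groups are missing from the cache.
    ldap_groups = ["users", "gensoft", "bioinfo"]
    cache = {"alice": {"users": "real"}}
    seen = {}

    # Task 1, step 1: store placeholders for the groups it does not have yet.
    for name in ldap_groups:
        cache["alice"].setdefault(name, "fake")

    # Tasks 2 and 3 look up the user while the placeholders are unresolved.
    seen["task2"] = visible_groups(cache, "alice")
    seen["task3"] = visible_groups(cache, "alice")

    # Task 1, step 2: replace the placeholders with real entries.
    for name in ldap_groups:
        cache["alice"][name] = "real"
    seen["task1"] = visible_groups(cache, "alice")

    # Later tasks find a fully populated cache.
    seen["task4"] = visible_groups(cache, "alice")
    return seen


if __name__ == "__main__":
    for task, groups in sorted(simulate().items()):
        print(task, groups)
```

In this model tasks 2 and 3 get a shorter group list than tasks 1 and 4, which matches the observed pattern of failures, but the model says nothing about whether sssd really behaves this way.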
That sounds like a good analysis, except it would also be a bug..:-)
In case the back end is contacted at all and fetches data from the
server, the other requests should be suspended until the first one finishes.
You said earlier this is 100% reproducible and you were able to gather
the debug logs, right? Could we see them? Since there seems to be some
kind of a race condition, it might be nice to also enable debug_microseconds.
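
For readers unfamiliar with the expected behaviour mentioned above (concurrent requests wait for the first back-end lookup instead of seeing partial data), the general pattern can be sketched as follows. This is a generic Python illustration under assumed names (SingleFlight, slow_fetch, demo), not sssd code.

```python
import threading
import time


class SingleFlight:
    """Let one caller fetch a key while concurrent callers wait for it."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._lock = threading.Lock()
        self._cache = {}
        self._pending = {}  # key -> Event set when the in-flight fetch ends

    def get(self, key):
        with self._lock:
            if key in self._cache:
                return self._cache[key]
            event = self._pending.get(key)
            leader = event is None
            if leader:
                event = threading.Event()
                self._pending[key] = event
        if not leader:
            # Another caller is already fetching: wait, then look again.
            event.wait()
            return self.get(key)
        try:
            value = self._fetch(key)
            with self._lock:
                # Publish only the complete result, never a partial one.
                self._cache[key] = value
            return value
        finally:
            with self._lock:
                del self._pending[key]
            event.set()


def demo(workers=8):
    calls = []

    def slow_fetch(user):
        # Stand-in for a slow back-end lookup; the data is hypothetical.
        calls.append(user)
        time.sleep(0.05)
        return ["users", "gensoft"]

    flight = SingleFlight(slow_fetch)
    results = [None] * workers

    def worker(index):
        results[index] = flight.get("alice")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results, len(calls)


if __name__ == "__main__":
    results, fetches = demo()
    print("back-end fetches:", fetches)
    print("all callers saw the same groups:", all(r == results[0] for r in results))
```

With this pattern every caller gets either the complete result or waits; none of them observes a half-populated entry.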
Any ideas ?
Here is my sssd.conf file :
config_file_version = 2
services = nss, pam
domains = pasteur_ldap_home
filter_users = root,ldap,named,avahi,haldaemon,dbus,radiusd,news,nscd
ldap_tls_reqcert = allow
auth_provider = ldap
ldap_schema = rfc2307
ldap_search_base = xxxx
ldap_group_search_base = xxxx
id_provider = ldap
ldap_id_use_start_tls = True
# We do not authorize password change
chpass_provider = none
ldap_uri = ldap://xxxx/
cache_credentials = True
ldap_tls_cacertdir = /etc/openldap/certs
ldap_network_timeout = 3
# getent passwd will only list /etc/passwd, but id or getent passwd login will query
#enumerate = True
ldap_page_size = 500
#debug_level = 0x02F0
debug_level = 0x77F0
Thomas Hummel | Institut Pasteur
<hummel(a)pasteur.fr> | Groupe Exploitation et Infrastructure
sssd-users mailing list