Hello,
We have lately been having big problems with sssd caching. On our ssh servers (each with ~100-200 users), login may take several minutes because the sssd_be process uses 100% CPU time, and it may stay in this state for days. Clearing the cache and restarting sssd during the day usually helps, and then everything works for a few days, sometimes only hours. It is not clear what triggers this behaviour; maybe some combination of lots of users and a cache update at the same time.
The culprit seems to be the recent addition of a few big groups to LDAP for our access policy, which worsened the situation and sssd performance.
On a test server, a simple id command with an empty cache and the same settings as in production takes:

[root@testsk tmp]# time id testusr
uid=1143(testusr) gid=100(users) groups=100(users),3318(roam),3102(nixe),1000(staff1),3785(wl-staff1),3119(system),3402(fileaccess),3377(vpn1),120(grp2),3123(devel),1001(devel3),3378(vpn2),3266(usr),3386(access3)

real 0m28.689s
user 0m0.006s
sys 0m0.007s
We have currently several groups with around 17 000 and 3000 users so this id query creates over 100k ghost users to cache:
[root@testsk tmp]# ldbsearch -H /var/lib/sss/db/cache_TESTAUTH.ldb |grep ghost |wc -l
asq: Unable to register control with rootdse!
105196
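To see which groups contribute most of those ghost entries, the same ldbsearch output could be tallied per cache record instead of counted as a whole. The following is only a minimal sketch: the function name count_ghosts_per_dn and the sample records are hypothetical, and it assumes that each record starts with a "dn:" line, lists one "ghost:" attribute per line, and that dn lines are not folded across several lines.

```shell
#!/bin/sh
# Tally "ghost:" attributes per record from LDIF-style text on stdin.
# Assumption: every record begins with a "dn:" line and dn lines are not folded.
count_ghosts_per_dn() {
    awk '
        /^dn: /    { dn = substr($0, 5); next }   # remember the current record
        /^ghost: / { count[dn]++ }                # one ghost member per line
        END        { for (d in count) printf "%d %s\n", count[d], d }
    ' | sort -rn
}

# Hypothetical sample input standing in for real ldbsearch output.
sample='dn: name=biggroup,cn=groups,cn=TESTAUTH,cn=sysdb
ghost: alice
ghost: bob
ghost: carol

dn: name=smallgroup,cn=groups,cn=TESTAUTH,cn=sysdb
ghost: dave
'

result=$(printf '%s\n' "$sample" | count_ghosts_per_dn)
printf '%s\n' "$result"
```

On a real system the function would be fed from the ldbsearch command shown above instead of the sample text; the records with the highest counts are the groups that dominate the cache.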
Indeed, with full debug logging (the id command then takes over 1 minute), all I see in the logs is the LDAP backend mostly adding ghost users to the cache, as it adds information from _all_ groups related to that uid. As the backend does not respond to monitor pings fast enough, the monitor tries to kill and restart it. The same happens on production servers. I have already extended the timeout to 60 but it seems not to be enough.
This latter case seems to be especially relevant, since we started to receive complaints from some people that httpd authentication was not working. The Apache error log shows:

[Tue Oct 29 12:21:36 2013] [error] [client xxx.xx.xx.xx] GROUP: testuser not in required group(s).

when in fact the user is in the required group; it seems that sssd just fails to respond fast enough. This is a (PAM, AuthType Basic, Require group testgroup) kind of authentication.
This is on RHEL6.4, sssd-1.9.2-82.10.el6_4.x86_64. Configured services nss, ldap: sanitized config:
------------------------
[sssd]
config_file_version = 2
debug_level = 1
reconnection_retries = 3
timeout = 60
services = nss
domains = TESTAUTH

[nss]
filter_groups = root
filter_users = root
reconnection_retries = 3
debug_level = 1

[domain/TESTAUTH]
debug_level = 1
ldap_purge_cache_timeout = 3600
id_provider = ldap
auth_provider = ldap
ldap_uri = ldap://authserv.test
ldap_search_base = dc=test
ldap_user_search_base = ou=People,dc=test
ldap_group_search_base = ou=Group,dc=test
------------------------
So in the end, any ideas or suggestions how to improve the situation? Of course I'm willing to debug/test this more if needed as the current situation is almost disastrous.
Cheers, - Sami
ps. Quick test on a Fedora 19 and sssd-1.11.1-4.fc19 made the same queries in 7 seconds or less so apparently some progress in performance has been done. Any idea when would RHEL6 sssd be rebased? I tried to compile latest git-version on RHEL6 but I couldn't find all required components (for ex. configure: error: you must have the cifsidmap header installed to build the idmap plugin).
On Wed, Oct 30, 2013 at 12:18:44PM +0200, Sami K wrote:
Hi Sami,
I'm sorry you are having problems with SSSD.
In 6.5, we added a new "ignore_group_members" option that makes all groups appear as empty. Setting this option to "true" would make a huge performance gain at the expense of not seeing the group members. But if your environment relies on group membership mostly for access control, that should be fine.
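As a rough sketch of what that could look like in the configuration posted in the original message (the section name TESTAUTH is taken from that config; this assumes an sssd build that actually ships the option), the setting would go into the domain section:

```ini
[domain/TESTAUTH]
# Assumption: the installed sssd supports this option (RHEL 6.5 or later,
# per the message above). Groups are then returned without their members.
ignore_group_members = true
```

After changing the setting, sssd would need a restart, and clearing the existing cache is probably sensible so that previously stored ghost members are not kept around.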
ps. Quick test on a Fedora 19 and sssd-1.11.1-4.fc19 made the same queries in 7 seconds or less so apparently some progress in performance has been done. Any idea when would RHEL6 sssd be rebased?
Not in RHEL-6.5 :-) Currently it's not clear if RHEL6 will rebase. (And details about future RHEL updates are not usually disclosed on public mailing lists.)
I tried to compile latest git-version on RHEL6 but I couldn't find all required components (for ex. configure: error: you must have the cifsidmap header installed to build the idmap plugin).
Passing --disable-cifs-idmap-plugin to configure should get rid of this requirement.
On Wed, Oct 30, 2013 at 12:18:44PM +0200, Sami K wrote:
Hello,
ps. Quick test on a Fedora 19 and sssd-1.11.1-4.fc19 made the same queries in 7 seconds or less so apparently some progress in performance has been done. Any idea when would RHEL6 sssd be rebased? I tried to compile latest git-version on RHEL6 but I couldn't find all required components (for ex. configure: error: you must have the cifsidmap header installed to build the idmap plugin).
sorry for that, please use the configure option --disable-cifs-idmap-plugin to get around this. It is already tracked to make the cifsidmap support optional (https://fedorahosted.org/sssd/ticket/2125).
HTH
bye, Sumit
sssd-users mailing list sssd-users@lists.fedorahosted.org https://lists.fedorahosted.org/mailman/listinfo/sssd-users
On (30/10/13 11:40), Sumit Bose wrote:
It is possible to create a src.rpm directly from the tarball (sssd>1.10) or from the git repository with the script "make_srpm.sh". This script is located in the subdirectory "contrib/fedora/".
Then you can rebuild the src.rpm with rpmbuild or mock without any problem, because the cifs plugin is automatically disabled for older distributions in the spec file.
One last note: if you want to use a newer version on RHEL6, I would suggest building sssd from the 1.11 branch. The master branch may not be stable enough for production use, and the 1.11 branch does not have the cifs plugin.
LS
Thank you for all the comments and suggestions,
2013/10/30 Jakub Hrozek jhrozek@redhat.com
On Wed, Oct 30, 2013 at 12:18:44PM +0200, Sami K wrote:
Any idea when would RHEL6 sssd be rebased?
Not in RHEL-6.5 :-) Currently it's not clear if RHEL6 will rebase. (And details about future RHEL updates are not usually disclosed on public mailing list).
I guessed that much, just trying to make incentive to rebase if it solves problems :)
2013/10/31 Lukas Slebodnik lslebodn@redhat.com
On (30/10/13 11:40), Sumit Bose wrote:
On Wed, Oct 30, 2013 at 12:18:44PM +0200, Sami K wrote:
sorry for that, please use the configure option --disable-cifs-idmap-plugin to get around this. It is already tracked to make the cifsidmap support optional (https://fedorahosted.org/sssd/ticket/2125).
Thanks, this worked.
It is possible to create src.rpm directly from tarball (sssd>1.10) or git repository with script "make_srpm.sh". This script is located in subdirectory "contrib/fedora/"
And this was even better - thanks for the tip. Script worked great after a small change:
-----------------------
[root@testm1 contrib]# diff -u fedora/make_srpm.sh rhel/make_srpm.sh
--- fedora/make_srpm.sh 2013-11-01 11:53:53.128687041 +0200
+++ rhel/make_srpm.sh 2013-11-01 12:00:00.587957406 +0200
@@ -108,10 +108,10 @@
 > "$RPMBUILD/SPECS/$PACKAGE_NAME.spec"

 NAME="$PACKAGE_NAME-$PACKAGE_VERSION"
-git archive --format=tar.gz --prefix="$NAME"/ \
- --output "$RPMBUILD/SOURCES/$NAME.tar.gz" \
+git archive --format=tar --prefix="$NAME"/ \
 --remote="file://$SRC_DIR" \
- HEAD
+ HEAD \
+ | gzip > "$RPMBUILD/SOURCES/$NAME.tar.gz"

 cp "$SRC_DIR"/contrib/*.patch "$RPMBUILD/SOURCES"
-----------------------
git archive in RHEL6 does not have support for tar.gz format:
[testm1 contrib]# git --version
git version 1.7.1
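The workaround in the diff above (emit a plain tar stream and compress it with gzip in a pipe) can be tried in isolation. This is a minimal sketch in a throwaway repository; the directory layout, the file README and the name demo-1.0 are made up for illustration, and unlike make_srpm.sh it runs git archive inside the repository instead of using --remote.

```shell
#!/bin/sh
# Hypothetical scratch area; NAME plays the role of $NAME in make_srpm.sh.
work=$(mktemp -d)
NAME="demo-1.0"
mkdir "$work/src"

(
    cd "$work/src" || exit 1
    git init -q
    echo "hello" > README
    git add README
    # Identity passed via environment so no global git config is required.
    GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com \
    GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com \
        git commit -q -m "initial commit"

    # Plain tar from git, compression done by gzip: avoids --format=tar.gz,
    # which the git 1.7.1 shipped with RHEL6 does not accept.
    git archive --format=tar --prefix="$NAME"/ HEAD | gzip > "$work/$NAME.tar.gz"
)

# List the archive contents to confirm it is a valid gzip-compressed tarball.
listing=$(tar -tzf "$work/$NAME.tar.gz")
printf '%s\n' "$listing"
```

The resulting file should be equivalent in content to what a newer git produces with --format=tar.gz, so the rest of the script can keep using the same $NAME.tar.gz path.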
Then you can rebuild src.rpm with rpmbuild or mock without any problem, because cifs-plugin is automatically disabled for older distributions in spec file.
Last note, if you wanted to use newer version on RHEL6 I would suggest to build sssd from 1.11 branch. Master branch needn't be very stable for production release and 1.11 branch does not have cifs plugin.
In other news, I seem unable to reproduce on RHEL6 the performance level I saw on F19; I tried with sssd-1.11.3. So it is either a configuration error on my part or something else, and I have to investigate further. Also, the suggested option "ignore_group_members" really makes a difference, except that it is not suitable for us in all environments. The Apache PAM module for 'require group' asks specifically for group members (I tried it out), so no luck there. We really should not use that method anymore, but we have a bunch of legacy sites and modifying all of them would be a mess.
Thanks,
- Sami
On Fri, Nov 01, 2013 at 08:03:47PM +0200, Sami K wrote:
I wonder if, for your specific environment, enabling enumeration would actually be beneficial? If the SSSD clients are mostly up and stick to the same server, then the incremental enumerations would only download anything in case the LDAP data actually changed.
We usually don't recommend enumeration, but in case you have large groups AND rely on seeing the group members, then downloading all the content in one go after startup and then only checking for updates might actually be faster.
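For reference, a sketch of how enumeration might be switched on for the domain from the original config (the section name TESTAUTH comes from that message; whether this helps in practice is untested here):

```ini
[domain/TESTAUTH]
# Assumption: enumeration is acceptable for this LDAP server's load.
# sssd then downloads all users and groups after startup and afterwards
# only checks for updates.
enumerate = true
```

Note that this would not be combined with ignore_group_members in the environments that rely on seeing group members, such as the Apache 'require group' setups mentioned earlier in the thread.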