Hello,
Sometimes SSSD does not recover after being killed by its own watchdog. This is a DEV environment that is somewhat poorly monitored; OOM kills are common (but SSSD itself is not killed). RHEL 8.8, sssd-2.8.2 with Active Directory.
1. Under memory pressure, the system becomes unresponsive, and SSSD's own watchdog terminates sssd_be and restarts it.
2. Since the system is still operating very slowly, it takes quite a while to start sssd_be. From the output of the "ps" command I can see that the sssd_be start time is 09:32:13, but sssd_be's initial log message "Starting with debug level = 0x0070" is dated 09:34:02. So it took almost 2 minutes to start sssd_be (the commands I used are shown right after this list).
3. Meanwhile, the nss/pam responders tried to connect to the backend 3 times and gave up with the message "Unable to reconnect: maximum retries exceeded".
4. The OOM killer finally kills some process (not an SSSD one) and system performance returns to normal.
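For reference, the comparison can be done with something like the following (the backend log file name depends on the domain, so the path here is just an example for my setup):

  ps -o lstart= -C sssd_be                                                  # process start time as reported by the kernel
  grep -m1 'Starting with debug level' /var/log/sssd/sssd_company.com.log  # first message sssd_be logged after startup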
So we end up with SSSD up and running, but not functioning, because the nss/pam responders will never try connecting to the backend again. And it was caused by SSSD's own watchdog. It looks like if the watchdog hadn't killed sssd_be, it would have recovered after the OOM killer killed the memory-hog process.
I still cannot believe that SSSD cannot recover on its own from such a simple situation. To my mind it should try reconnecting to the backend every 60s or something like that.
Is it expected behaviour or am I missing something?
Kind regards, Grigory Trenin
Hi,
this sounds like https://github.com/SSSD/sssd/issues/6803. This should be fixed for RHEL-8.8 with package version sssd-2.8.2-3.el8_8 from errata https://access.redhat.com/errata/RHBA-2023:4525. Are you using this version or an older one?
bye, Sumit
Hi Sumit,
Yes, I'm running this version. "rpm -q --changelog" also shows that the fix is there:
$ rpm -q --changelog sssd | head -2
* Mon Jul 10 2023 Alexey Tikhonov atikhono@redhat.com - 2.8.2-3
- Resolves: rhbz#2219351 - [sssd] SSSD enters failed state after heavy load in the system [rhel-8.8.0.z]
Yes, this bug looks similar... but it might be a different issue. In my logs I don't see any handshake_timeouts.
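(I looked for them with something along these lines, using the default log directory:)

  grep -ri handshake /var/log/sssd/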
I can see that the PAM and NSS responders tried to connect to the backend 3 times (because reconnection_retries=3 by default) and then gave up:
(2024-07-21 9:32:13): [pam] [sbus_dbus_connect_address] (0x0020): Unable to connect to unix:path=/var/lib/sss/pipes/private/sbus-dp_company.com [org.freedesktop.DBus.Error.NoServer]: Failed to connect to socket /var/lib/sss/pipes/private/sbus-dp_company.com: Connection refused
* ... skipping repetitive backtrace ...
(2024-07-21 9:32:13): [pam] [sbus_reconnect_attempt] (0x0020): Unable to connect to D-Bus
* ... skipping repetitive backtrace ...
(2024-07-21 9:32:16): [pam] [sbus_dbus_connect_address] (0x0020): Unable to connect to unix:path=/var/lib/sss/pipes/private/sbus-dp_company.com [org.freedesktop.DBus.Error.NoServer]: Failed to connect to socket /var/lib/sss/pipes/private/sbus-dp_company.com: Connection refused
* ... skipping repetitive backtrace ...
(2024-07-21 9:32:16): [pam] [sbus_reconnect_attempt] (0x0020): Unable to connect to D-Bus
* ... skipping repetitive backtrace ...
(2024-07-21 9:32:27): [pam] [sbus_dbus_connect_address] (0x0020): Unable to connect to unix:path=/var/lib/sss/pipes/private/sbus-dp_company.com [org.freedesktop.DBus.Error.NoServer]: Failed to connect to socket /var/lib/sss/pipes/private/sbus-dp_company.com: Connection refused
* ... skipping repetitive backtrace ...
(2024-07-21 9:32:27): [pam] [sbus_reconnect_attempt] (0x0020): Unable to connect to D-Bus
* ... skipping repetitive backtrace ...
(2024-07-21 9:32:27): [pam] [sbus_reconnect] (0x0020): Unable to reconnect: maximum retries exceeded.
(2024-07-21 9:32:27): [pam] [sss_dp_on_reconnect] (0x0010): Could not reconnect to company.com provider.
"lsof +E -aUc sssd" also shows that neither PAM nor NSS responders are connected to the other side of /var/lib/sss/pipes/private/sbus-dp_company.com Unix socket. If I kill sssd_nss process manually it reconnects to the socket just fine.
Am I right in my guess that the responders only try to connect "reconnection_retries" times, and if they do not succeed, will not try to reconnect again until the responder is restarted?
Kind regards, Grigory Trenin
Hi,
yes, have you already tried to increase this value in your environment?
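Something like the following in sssd.conf should do it (the value 10 is just an example, and as far as I know the option is read by each responder from its own section):

  [nss]
  # attempts to reconnect to the data provider before giving up (default 3)
  reconnection_retries = 10

  [pam]
  reconnection_retries = 10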
Please note that with SSSD 2.10 the way communication between the components is handled has changed and this option is obsolete. Hopefully the new communication scheme will also make the issue you are seeing go away.
bye, Sumit
Hi Sumit,
I haven't tried it yet; I have just investigated it and wanted to make sure that I'm understanding it correctly. Thank you for clarifying. I'll certainly have to try it, because there will be no SSSD 2.10 in RHEL 8. I'm also thinking about other options... is there a way to disable SSSD's own watchdog?
Kind regards, Grigory Trenin
Hi,
you cannot disable it completely, but you can increase the time between the checks with the `timeout` parameter. Please note that it should not be set too high, because otherwise, if e.g. the SSSD backend gets stuck, it will not be restarted and SSSD might end up in a similarly unresponsive state.
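For example, in the domain section (the value is only an illustration; if I remember correctly, the watchdog terminates the process after about three missed heartbeats, so the effective kill time is roughly three times this value):

  [domain/company.com]
  # watchdog heartbeat interval in seconds for this domain's backend (default 10)
  timeout = 30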
bye, Sumit