Hi Sumit,
Yes, I'm running this version. "rpm -q --changelog" also shows that the fix is there:
$ rpm -q --changelog sssd | head -2
* Mon Jul 10 2023 Alexey Tikhonov atikhono@redhat.com - 2.8.2-3
- Resolves: rhbz#2219351 - [sssd] SSSD enters failed state after heavy load in the system [rhel-8.8.0.z]
Yes, this bug looks similar... but it might be a different issue. In my logs I don't see any handshake_timeouts.
I can see that the PAM and NSS responders tried to connect to the backend 3 times (because reconnection_retries=3 by default) and then gave up:
(2024-07-21 9:32:13): [pam] [sbus_dbus_connect_address] (0x0020): Unable to connect to unix:path=/var/lib/sss/pipes/private/sbus-dp_company.com [org.freedesktop.DBus.Error.NoServer]: Failed to connect to socket /var/lib/sss/pipes/private/sbus-dp_company.com: Connection refused
* ... skipping repetitive backtrace ...
(2024-07-21 9:32:13): [pam] [sbus_reconnect_attempt] (0x0020): Unable to connect to D-Bus
* ... skipping repetitive backtrace ...
(2024-07-21 9:32:16): [pam] [sbus_dbus_connect_address] (0x0020): Unable to connect to unix:path=/var/lib/sss/pipes/private/sbus-dp_company.com [org.freedesktop.DBus.Error.NoServer]: Failed to connect to socket /var/lib/sss/pipes/private/sbus-dp_company.com: Connection refused
* ... skipping repetitive backtrace ...
(2024-07-21 9:32:16): [pam] [sbus_reconnect_attempt] (0x0020): Unable to connect to D-Bus
* ... skipping repetitive backtrace ...
(2024-07-21 9:32:27): [pam] [sbus_dbus_connect_address] (0x0020): Unable to connect to unix:path=/var/lib/sss/pipes/private/sbus-dp_company.com [org.freedesktop.DBus.Error.NoServer]: Failed to connect to socket /var/lib/sss/pipes/private/sbus-dp_company.com: Connection refused
* ... skipping repetitive backtrace ...
(2024-07-21 9:32:27): [pam] [sbus_reconnect_attempt] (0x0020): Unable to connect to D-Bus
* ... skipping repetitive backtrace ...
(2024-07-21 9:32:27): [pam] [sbus_reconnect] (0x0020): Unable to reconnect: maximum retries exceeded.
(2024-07-21 9:32:27): [pam] [sss_dp_on_reconnect] (0x0010): Could not reconnect to company.com provider.
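To double-check the attempt count without reading the log by eye, a small Python helper along these lines could tally the failures per responder. This is only a sketch: it assumes the debug line layout shown in the excerpt above (timestamp, responder, function, level, message), and the names LINE_RE and summarize are made up for this example.

```python
import re
from collections import Counter

# Assumed line layout, taken from the excerpt above:
# (2024-07-21 9:32:13): [pam] [sbus_reconnect_attempt] (0x0020): message
LINE_RE = re.compile(
    r"\((?P<ts>\d{4}-\d{2}-\d{2}\s+\d{1,2}:\d{2}:\d{2})\):\s+"
    r"\[(?P<responder>[^\]]+)\]\s+"
    r"\[(?P<func>[^\]]+)\]\s+"
    r"\((?P<level>0x[0-9a-fA-F]+)\):\s+"
    r"(?P<msg>.*)"
)


def summarize(lines):
    """Return (failed reconnect attempts per responder, responders that gave up)."""
    attempts = Counter()
    gave_up = set()
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # e.g. the "skipping repetitive backtrace" lines
        if m["func"] == "sbus_reconnect_attempt":
            attempts[m["responder"]] += 1
        elif m["func"] == "sbus_reconnect" and "maximum retries exceeded" in m["msg"]:
            gave_up.add(m["responder"])
    return attempts, gave_up
```

Passing it the lines of a responder log (any iterable of strings, such as an open file object) returns the per-responder counts and the set of responders that logged "maximum retries exceeded".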
"lsof +E -aUc sssd" also shows that neither the PAM nor the NSS responder is connected to the other side of the /var/lib/sss/pipes/private/sbus-dp_company.com Unix socket. If I kill the sssd_nss process manually, it reconnects to the socket just fine.
Am I right in guessing that a responder tries to reconnect only "reconnection_retries" times and, if none of those attempts succeeds, does not try again until the responder is restarted?
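To make the question concrete, here is a toy Python model of the behavior I am guessing at. It is not SSSD code: the class Responder and its methods reconnect and restart are hypothetical names, the delays between attempts are left out, and the only detail taken from the real system is the reconnection_retries default of 3.

```python
class Responder:
    """Toy model of the guessed behavior; not taken from SSSD sources."""

    def __init__(self, connect, reconnection_retries=3):
        self.connect = connect  # callable returning True when the backend accepts
        self.reconnection_retries = reconnection_retries
        self.connected = False
        self.gave_up = False

    def reconnect(self):
        """Try up to reconnection_retries times, then stop trying for good."""
        if self.gave_up:
            return False  # guessed: no further attempts until a restart
        for _ in range(self.reconnection_retries):
            if self.connect():
                self.connected = True
                return True
        self.gave_up = True
        return False

    def restart(self):
        """Model a responder restart: the retry budget is fresh again."""
        self.gave_up = False
        self.connected = False
        return self.reconnect()
```

In this model, a backend that comes back after the third failed attempt is never noticed by reconnect, while restart connects immediately, which matches what I see when I kill sssd_nss by hand.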
Kind regards,
Grigory Trenin