Hello,
Sometimes SSSD does not recover after sssd_be is killed by SSSD's own watchdog. This is a DEV environment, somewhat poorly monitored, and OOM kills are common (although SSSD itself is not killed by the OOM killer). The setup is RHEL 8.8, sssd-2.8.2 with Active Directory.
1. Under memory pressure the system becomes unresponsive, and SSSD's own watchdog terminates sssd_be and restarts it.
2. Because the system is still running very slowly, sssd_be takes a long time to start. According to the output of “ps”, the sssd_be start time is 09:32:13, but its first log message, “Starting with debug level = 0x0070”, is timestamped 09:34:02. So sssd_be took almost 2 minutes to start.
3. Meanwhile, the nss/pam responders tried to connect to the backend 3 times and gave up with the message “Unable to reconnect: maximum retries exceeded”.
4. The OOM killer finally kills some process (not an SSSD one), and system performance returns to normal.
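As a quick check of the arithmetic in step 2, here is a minimal Python sketch that computes the gap between the two timestamps. The function name startup_delay_seconds is made up for illustration, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime

def startup_delay_seconds(process_start: str, first_log: str) -> int:
    """Seconds between the process start time and its first log line.

    Assumption: both HH:MM:SS timestamps are from the same day.
    """
    fmt = "%H:%M:%S"
    delta = datetime.strptime(first_log, fmt) - datetime.strptime(process_start, fmt)
    return int(delta.total_seconds())

# Timestamps taken from the "ps" output and the sssd_be log described above.
delay = startup_delay_seconds("09:32:13", "09:34:02")
print(delay)  # 109 seconds, i.e. 1 minute 49 seconds
```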
So we end up with SSSD up and running but not functioning, because the nss/pam responders never try to connect to the backend again. And this was triggered by SSSD's own watchdog. It looks as if, had the watchdog not killed sssd_be, SSSD would have recovered once the OOM killer had killed the memory-hogging process.
I still cannot believe that SSSD cannot recover on its own from such a simple situation. In my view, the responders should keep trying to reconnect to the backend periodically, every 60 seconds or so.
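To illustrate the difference I have in mind, here is a small Python simulation. It is not SSSD code: the function names, the 1-second interval between the bounded attempts, and the one-hour horizon are all my assumptions for the sake of the example. Only the 3 attempts, the roughly 109-second backend startup, and the 60-second period come from the description above.

```python
from typing import Optional

def bounded_retries(backend_ready_at: int, max_retries: int = 3,
                    interval: int = 1) -> Optional[int]:
    """Try a fixed number of times, then give up for good.

    Returns the time (in seconds) of the successful attempt, or None.
    The interval between attempts is an assumed value, not SSSD's real one.
    """
    for attempt in range(1, max_retries + 1):
        now = attempt * interval
        if now >= backend_ready_at:
            return now
    return None

def periodic_retries(backend_ready_at: int, interval: int = 60,
                     horizon: int = 3600) -> Optional[int]:
    """Keep retrying at a fixed period until the (assumed) horizon is reached."""
    now = interval
    while now <= horizon:
        if now >= backend_ready_at:
            return now
        now += interval
    return None

# The backend needed about 109 seconds to come up in the incident above.
print(bounded_retries(109))   # None: all attempts happen before the backend is ready
print(periodic_retries(109))  # 120: the second periodic attempt succeeds
```

With these assumed numbers, the bounded policy never reconnects, while the periodic one recovers on its second attempt.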
Is this expected behaviour, or am I missing something?
Kind regards,
Grigory Trenin