On Wed, Jun 24, 2015 at 10:18:26AM -0700, Janelle wrote:
On 6/24/15 12:38 AM, Jakub Hrozek wrote:
>On Tue, Jun 23, 2015 at 07:52:46AM -0700, Janelle wrote:
>>On 6/23/15 7:33 AM, John Hodrien wrote:
>>>On Tue, 23 Jun 2015, Janelle wrote:
>>>>Servers are behind a load-balancer. Address never changes.
>>>But one problem with that is that SSSD will see multiple servers as one
>>>server, and so will mark the server as failed if the load balancer
>>>with a broken back end server.
>>>Works much better in my experience when you tell SSSD about all the
>>Sadly that is not possible. If SSSD did load balancing when given multiple
>>servers, then yes, but it does not. When you are running 30,000 servers with
>>3000 users, you have to load balance or SSSD simply dies and an ssh login
>>takes 5 minutes to complete.
>What is the configuration you were running here? I'm interested in
>seeing how we can make SSSD not die :-)
>>The only way to make SSSD happy and not kill
>>the single server it would point to is to have multiple servers behind a
>Hmm, did you consider SRV records as John pointed out elsewhere? Then
>you could load-balance using weight fields of SRV records..
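For reference, the weight-based selection that SRV records describe (RFC 2782) can be sketched roughly as below. This is only an illustration of the algorithm in the RFC, not SSSD's actual resolver code, and the host names used with it are hypothetical placeholders.

```python
import random
from typing import NamedTuple


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def order_srv(records, rng=random):
    """Order SRV records as RFC 2782 describes: lowest priority value first,
    weighted random choice among records that share a priority."""
    ordered = []
    for priority in sorted({r.priority for r in records}):
        # Zero-weight records go to the front of the pool, per the RFC.
        pool = sorted(
            (r for r in records if r.priority == priority),
            key=lambda r: r.weight != 0,
        )
        while pool:
            total = sum(r.weight for r in pool)
            pick = rng.randint(0, total)  # inclusive on both ends
            running = 0
            for i, rec in enumerate(pool):
                running += rec.weight
                if running >= pick:
                    # First record whose running sum reaches the pick wins.
                    ordered.append(pool.pop(i))
                    break
    return ordered
```

With two records at the same priority and weights 70 and 30, the first one should be tried first roughly 70% of the time, and a record with a higher priority value is only tried after those.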
>>Am I completely off base to think this is the way to go? Can SSSD be
>>taught to actually load balance?
>I'm not exactly sure how you would like SSSD to behave. Would this
>ticket help - https://fedorahosted.org/sssd/ticket/2499
What I found was that when the VIP servers are updated, even though most of
the systems continue to run, a large population seems to say the LDAP server
has lost connection. And then SSSD stops trying unless you restart
Have you tried if cycling the offline/online status with USR1 and USR2
ldap_id_use_start_tls = false
[autofs]
edentials = true
ldap_tls_cacertdir = /etc/openldap/cacerts
[sssd[be[default]]] [fo_resolve_service_send] (0x0020): No available servers for service 'LDAP'
[sssd[be[default]]] [sss_ldap_init_sys_connect_done] (0x0020): ldap_install_tls failed: Connect
[sssd[be[default]]] [sdap_sys_connect_done] (0x0020): sdap_async_connect_call request failed.
(ignore cert error - it is set to ALLOW)
A simple "service sssd restart" solves it, but you can see the server is
still up. A telnet connect to either of 389 or 636 works fine. It seems to
me like SSSD just gives up and stops trying?
At that point sssd goes offline, right?
Could you try experimenting with a short offline_timeout? (see man
sssd.conf for more details on that option)
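In case it helps with the experiment, here is a minimal sketch of cycling the offline/online state with the two signals mentioned earlier. It relies on the behaviour described in the sssd(8) man page (SIGUSR1 simulates offline operation, SIGUSR2 asks SSSD to go online immediately). The helper name, the pause length and the pid file path in the usage comment are assumptions for illustration, so please check them against your system.

```python
import os
import signal
import time


def cycle_sssd_online(pid, pause=1.0, kill=os.kill, sleep=time.sleep):
    """Force SSSD offline, wait briefly, then ask it to go back online.

    `kill` and `sleep` are injectable only so the helper can be exercised
    without touching a real process.
    """
    kill(pid, signal.SIGUSR1)  # simulate offline operation
    sleep(pause)
    kill(pid, signal.SIGUSR2)  # go online immediately


# Hypothetical usage, run as root; the pid file location is an assumption:
#   with open("/var/run/sssd.pid") as fh:
#       cycle_sssd_online(int(fh.read().strip()))
```

If SSSD reconnects after this without a full restart, that would suggest the backend is simply staying offline longer than you would like, which is what a shorter offline_timeout is meant to address.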
As a side note - nslcd works flawlessly and the server might disconnect for
a second, then it comes back and nslcd restores the connection. It does not seem
to give up as SSSD does :-(
I think it's because nslcd is not as stateful as sssd, so it would try
to connect every time. But I'm not totally sure without seeing the issue