On 6/24/15 12:38 AM, Jakub Hrozek wrote:
On Tue, Jun 23, 2015 at 07:52:46AM -0700, Janelle wrote:
> On 6/23/15 7:33 AM, John Hodrien wrote:
>> On Tue, 23 Jun 2015, Janelle wrote:
>>> Servers are behind a load-balancer. Address never changes.
>> But one problem with that is that SSSD will see multiple servers as one
>> server, and so will mark the server as failed if the load balancer
>> presents it with a broken back end server.
>> Works much better in my experience when you tell SSSD about all the
> Sadly that is not possible. If SSSD did load balancing when given multiple
> servers, then yes, but it does not. When you are running 30,000 servers with
> 3000 users, you have to load balance or SSSD simply dies and an ssh login
> takes 5 minutes to complete.
What is the configuration you were running here? I'm interested in
seeing how we can make SSSD not die :-)
> The only way to make SSSD happy and not kill
> the single server it would point to is to have multiple servers behind a
Hmm, did you consider SRV records, as John pointed out elsewhere? Then
you could load-balance using the weight fields of the SRV records.
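
To make the weight idea concrete, here is a rough sketch of the selection
order described in RFC 2782: lowest priority first, then a weighted random
pick within each priority. The record type, the function name and the
example.com hosts are made up for illustration, and whether a particular
client orders its servers exactly this way is an assumption you would need
to verify.

```python
import random
from typing import List, NamedTuple, Optional


class SrvRecord(NamedTuple):
    # Hypothetical container for the four SRV fields; not an SSSD type.
    priority: int
    weight: int
    port: int
    target: str


def order_srv_records(records: List[SrvRecord],
                      rng: Optional[random.Random] = None) -> List[SrvRecord]:
    """Return records in the order a client following RFC 2782 would try them."""
    if rng is None:
        rng = random.Random()
    ordered: List[SrvRecord] = []
    # Lower priority values are tried first.
    for priority in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == priority]
        # RFC 2782 places weight-0 records at the front of the group.
        group.sort(key=lambda r: r.weight != 0)
        while group:
            total = sum(r.weight for r in group)
            pick = rng.randint(0, total)  # inclusive on both ends
            running = 0
            for i, rec in enumerate(group):
                running += rec.weight
                if running >= pick:
                    ordered.append(group.pop(i))
                    break
    return ordered


if __name__ == "__main__":
    # Hypothetical records: two weighted primaries and one fallback.
    demo = [
        SrvRecord(0, 30, 389, "ldap1.example.com"),
        SrvRecord(0, 10, 389, "ldap2.example.com"),
        SrvRecord(10, 0, 389, "ldap-backup.example.com"),
    ]
    for rec in order_srv_records(demo):
        print(rec.priority, rec.weight, rec.target)
```

With records like these, each client would try ldap1 first roughly three
times out of four, so the load spreads across the fleet without a VIP in
front of the servers.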
> Am I completely off base to think this is the way to go? Can SSSD be
> taught to actually load balance?
I'm not exactly sure how you would like SSSD to behave. Would this
ticket help? https://fedorahosted.org/sssd/ticket/2499
What I found was that when the VIP servers are updated, even though most
of the systems continue to run, a large population of them seems to report
that the LDAP server has lost its connection. And then SSSD stops trying
unless you

ldap_id_use_start_tls = false
[autofs]
edentials = true
ldap_tls_cacertdir = /etc/openldap/cacerts

[sssd[be[default]]] [fo_resolve_service_send] (0x0020): No available servers for service 'LDAP'
[sssd[be[default]]] [sss_ldap_init_sys_connect_done] (0x0020): ldap_install_tls failed:
[sssd[be[default]]] [sdap_sys_connect_done] (0x0020): sdap_async_connect_call request failed.

(ignore cert error - it is set to ALLOW)
A simple "service sssd restart" solves it, but you can see the server is
still up. A telnet connection to either port 389 or port 636 works fine.
It seems to me like SSSD just gives up and stops trying?
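
For anyone who wants to script the telnet check, a minimal sketch follows.
The function name and the ldap-vip.example.com host are placeholders, not
anything from my setup. Note that a successful TCP connect only shows that
the port accepts connections; it does not exercise the TLS setup that the
log above reports as failing.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical VIP name; 389 is LDAP and 636 is LDAPS.
    for port in (389, 636):
        state = "open" if port_is_open("ldap-vip.example.com", port) else "closed"
        print(port, state)
```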
As a side note - nslcd works flawlessly. The server might disconnect
for a second, then it comes back and nslcd restores the connection. It
does not seem to give up as SSSD does :-(