On 6/24/15 10:52 AM, Jakub Hrozek wrote:
>On Wed, Jun 24, 2015 at 10:18:26AM -0700, Janelle wrote:
>>On 6/24/15 12:38 AM, Jakub Hrozek wrote:
>>>On Tue, Jun 23, 2015 at 07:52:46AM -0700, Janelle wrote:
>>>>On 6/23/15 7:33 AM, John Hodrien wrote:
>>>>>On Tue, 23 Jun 2015, Janelle wrote:
>>>>>
>>>>>>Servers are behind a load-balancer. Address never changes.
>>>>>But one problem with that is that SSSD will see multiple servers as one
>>>>>server, and so will mark the server as failed if the load balancer
>>>>>presents it with a broken back end server.
>>>>>
>>>>>Works much better in my experience when you tell SSSD about all the
>>>>>servers.
>>>>>
>>>>>jh
>>>>Sadly that is not possible. If SSSD did load balancing when given multiple
>>>>servers, then yes, but it does not. When you are running 30,000 servers with
>>>>3000 users, you have to load balance or SSSD simply dies and an ssh login
>>>>takes 5 minutes to complete.
>>>What is the configuration you were running here? I'm interested in
>>>seeing how we can make SSSD not die :-)
>>>
>>>>The only way to make SSSD happy and not kill
>>>>the single server it would point to is to have multiple servers behind a
>>>>VIP.
>>>Hmm, did you consider SRV records as John pointed out elsewhere? Then
>>>you could load-balance using weight fields of SRV records..
>>>
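For reference, the weight-based balancing that SRV records allow can be sketched roughly as below. This is only an illustration of the general idea (lowest priority first, then a random pick proportional to weight), not SSSD's actual resolver code. The function name and the (priority, weight, target) tuples are made up for the example, and it simplifies RFC 2782, which still gives zero-weight records a small chance of being picked.

```python
import random


def pick_srv_target(records, rng=random):
    """Pick one target from (priority, weight, target) tuples.

    Only records with the lowest priority value are candidates; among
    them the choice is random, proportional to weight.
    """
    if not records:
        raise ValueError("no SRV records")
    best = min(priority for priority, _, _ in records)
    candidates = [r for r in records if r[0] == best]
    weights = [weight for _, weight, _ in candidates]
    if sum(weights) == 0:
        # All weights are zero: fall back to a uniform choice.
        return rng.choice(candidates)[2]
    return rng.choices(candidates, weights=weights, k=1)[0][2]
```

With two same-priority records weighted 3 and 1, the first target would be chosen about three times as often as the second, and a record with a higher priority value would only matter if a client falls back to it after the preferred ones fail.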
>>>>Am I completely off base to think this is the way to go? Can SSSD be
>>>>taught to actually load balance?
>>>I'm not exactly sure how you would like SSSD to behave. Would this
>>>ticket help - https://fedorahosted.org/sssd/ticket/2499 ?
>>>_______________________________________________
>>>sssd-users mailing list
>>>sssd-users(a)lists.fedorahosted.org
>>>https://lists.fedorahosted.org/mailman/listinfo/sssd-users
>>What I found was that when the VIP servers are updated, even though most of
>>the systems continue to run, a large population seems to say the LDAP server
>Have you tried whether cycling the offline/online status with USR1 and
>USR2 helps?
>
>>has lost connection. And then SSSD stops trying unless you restart it:
>>
>>[sssd[be[default]]] [fo_resolve_service_send] (0x0020): No available
>>servers for service 'LDAP'
>>[sssd[be[default]]] [sss_ldap_init_sys_connect_done] (0x0020):
>>ldap_install_tls failed: Connect error
>>[sssd[be[default]]] [sdap_sys_connect_done] (0x0020):
>>sdap_async_connect_call request failed.
>>
>>(ignore cert error - it is set to ALLOW)
>>
>>A simple "service sssd restart" solves it, but you can see the server is
>>still up. A telnet connect to either of 389 or 636 works fine. It seems to
>>me like SSSD just gives up and stops trying?
>At that point sssd goes offline, right?
>
>Could you try experimenting with a short offline_timeout? (see man
>sssd.conf for more details on that option)
>
>>As a side note - nslcd works flawlessly: the server might disconnect for
>>a second, then it comes back and nslcd restores the connection. It does not
>>seem to give up as SSSD does :-(
>I think it's because nslcd is not as stateful as sssd, so it would try
>to connect every time. But I'm not totally sure without seeing the issue
>myself..
In what version was offline_timeout added? I would expect that, with a
default of 60, it would recover, but it does not seem to. Maybe there is a
version issue here?
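
For the USR1/USR2 suggestion, here is a minimal sketch of how the offline/online cycle could be scripted. It relies on the signals documented in sssd(8): SIGUSR1 asks SSSD to simulate offline operation and SIGUSR2 asks it to go online immediately. The function names, the pause length, and the pid file location mentioned afterwards are assumptions for illustration, not anything confirmed in this thread.

```python
import os
import signal
import time


def cycle_offline_online(pid, send=os.kill, sleep=time.sleep, pause=1.0):
    """Ask process `pid` to go offline, wait `pause` seconds, then go online."""
    send(pid, signal.SIGUSR1)  # sssd(8): simulate offline operation
    sleep(pause)
    send(pid, signal.SIGUSR2)  # sssd(8): go online immediately


def read_pid(path):
    """Read an integer PID from a pid file (the path is supplied by the caller)."""
    with open(path) as fh:
        return int(fh.read().strip())
```

Run as root, something like cycle_offline_online(read_pid("/var/run/sssd.pid")) would do the cycle, assuming the pid file lives at that path on the system in question. The send and sleep parameters exist only so the sequence can be exercised without signalling a real process.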