On 08/20/2012 02:01 PM, Stephen Gallagher wrote:
On Mon, 2012-08-20 at 12:28 +0200, Sigbjorn Lie wrote:
>
> On Mon, August 20, 2012 12:05, Jakub Hrozek wrote:
>> On Mon, Aug 20, 2012 at 08:33:47AM +0200, Sigbjorn Lie wrote:
>>
>>> Hi,
>>>
>>>
>>> When I arrived at the office this morning our Nagios server was displaying a lot of alarms.
>>>
>>>
>>> The "sssd_pam" process was consuming 100% CPU, and I was unable to log on to the box as anything other than root.
>>>
>>> 2310 root 20 0 219m 44m 2176 R 99.6 0.3 2883:27 sssd_pam
>>>
>>>
>>>
>>> In the /var/log/sssd/sssd_pam.log file, the following error message was repeated:
>>>
>>>
>>> [sssd[pam]] [accept_fd_handler] (0x0020): Accept failed [Too many open files]
>>>
>>>
>>>
>>> This being our Nagios server, the maximum number of concurrent open files has been increased from the default 1024 to 4096 for all users.
>>>
>>> This is RHEL 6.3 with sssd-1.8.0-32.
>>>
>>>
>>> What can I do to prevent this from happening in the future?
>>>
>>>
>>>
>>> Regards,
>>> Siggi
>>>
>> In SSSD 1.8 the file descriptor limit was raised to either 8k or the
>> hard limit from limits.conf, whichever was lower.
>>
> That would be 4k for me then.
>
>> There is also a new option, fd_limit, that can be used to set the limit;
>> in cases where SSSD has the CAP_SYS_RESOURCE capability, it can even
>> override the hard limit from limits.conf [1]
> When is the appropriate time to use this? I presume what I need is more file descriptors, not fewer?
>
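For reference, a minimal sketch of what such a configuration might look like. The section placement and the 8192 value here are illustrative only (8192 matches the default ceiling mentioned above, not a tuned recommendation):

```ini
# /etc/sssd/sssd.conf (fragment, illustrative values)
[sssd]
services = nss, pam
config_file_version = 2

[pam]
# Raise the descriptor limit for the PAM responder.
# With CAP_SYS_RESOURCE this can exceed the hard limit
# from limits.conf; otherwise it is capped by it.
fd_limit = 8192
```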
>> I'd like to ask for some more info to tell whether the server was simply
>> busy or whether we were really leaking a file descriptor:
>>
>> Do you know how many files were open at the time?
> No, recovering the services was a higher priority than collecting info at the time.
>
>> Were there many concurrent logins happening on that server?
> Yes, Nagios spawns a lot of ssh sessions to other hosts. There were also other processes spawned from cron which were hung.
>
>> Did you have a chance to run lsof to check which file descriptors were open?
> I'm sorry, no.
>
>
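For next time, a quick way to capture this information even without lsof is to count the entries under /proc/&lt;pid&gt;/fd. A sketch (it falls back to the current shell's PID for demonstration when sssd_pam is not running; run as root to inspect SSSD's own descriptors):

```shell
#!/bin/sh
# Count open file descriptors for each sssd_pam process.
pids=$(pgrep -x sssd_pam || echo $$)
for pid in $pids; do
    count=$(ls /proc/"$pid"/fd | wc -l)
    echo "pid $pid has $count open file descriptors"
done
```

Running this periodically (e.g. from cron) would show whether the count climbs steadily toward the limit, which would point at a leak rather than a burst of concurrent logins.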
> We did increase the system-wide nproc value in limits.conf from 1024 to 4096 four days ago due to too many Nagios checks running at the same time. If that is the issue, then we will see the SSSD issue happening again in a few days.
>
> Anything I should change in SSSD's config, or should I just wait for this to happen again and collect more information?
>
> We are not running SELinux on this box.
>
> Any other existing known bugs I should be aware of?
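As an aside, note that nproc and nofile are separate limits in limits.conf: nproc caps the number of processes, while nofile caps open file descriptors per process. A sketch of what both raised limits might look like (the 4096 values mirror the ones mentioned in this thread; the wildcard scope is illustrative):

```ini
# /etc/security/limits.conf (fragment, illustrative values)
*    soft    nproc     4096
*    soft    nofile    4096
*    hard    nofile    4096
```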
Please ensure that you are using the latest kernel from RHEL 6.3.
Earlier kernels (I'm not sure how far back) defaulted to having only
1024 file descriptors available to SSSD, whereas the latest kernels now
allow it to reach 4096.
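One way to verify which limit the running process actually received from the kernel is to read /proc/&lt;pid&gt;/limits. A sketch, using the current shell's PID as a stand-in (substitute the sssd_pam PID on a live system):

```shell
#!/bin/sh
# Print the open-files limit the kernel enforces on a process.
# The line shows both the soft and hard limit.
grep "Max open files" /proc/$$/limits
```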
I updated the kernel on the systems having issues some weeks ago, and the issue has not reappeared yet. :)
Sigbjorn, we've made several changes to how SSSD handles file
descriptors upstream which are targeted for RHEL 6.4. The primary ones:
1) SSSD will be able to request an arbitrarily high number of file
descriptors through a config file option, fd_limit.
2) SSSD will no longer hold on to file descriptors for long-running
applications, and will close them once the PAM conversation concludes.
We had been doing this for performance (opening the socket is overhead)
but the file descriptor issue has proven more serious.
Sounds very good. Thanks for the update. :)