On May 21, 2013, at 12:24 PM, Jakub Hrozek wrote:
On Tue, May 21, 2013 at 09:58:06AM +0000, Steve Traylen wrote:
>
> Hi,
>
> We are experiencing trouble with our interactive login service, where sssd is crashing
> every few hours on most of the nodes.
> We run Scientific Linux CERN SLC release 6.4 (Carbon) (which is RHEL 6.4 based),
> more precisely:
>
Hi,
I'm sorry for the trouble.
:-) No need to apologize.
> sssd-1.9.2-82.7.el6_4.x86_64
> sssd-client-1.9.2-82.7.el6_4.x86_64
>
That's pretty much the latest stable release.
> We have some logs
>
> (Mon May 20 14:31:27 2013) [sssd[pam]] [sss_dp_init] (0x0010): Failed to connect to monitor services.
> (Mon May 20 14:31:27 2013) [sssd[pam]] [sss_process_init] (0x0010): fatal error setting up backend connector
>
> Is this the backend failing to restart?
>
>
It's actually the PAM process failing to reconnect to the UNIX pipe that links it
to the monitor process, which watches over all the SSSD worker processes.
>
> Here is our sssd.conf
>
[snip]
The config file looks good to me.
>
> and a backtrace below. I can provide a core file.
>
Providing the core file would be great if you don't mind. However, keep
in mind that the core file might contain confidential information, in
some cases even passwords.
Hi Jakub,
I'll make the core files available to you offline. In fact, we have a selection of
core files:
1) sssd_be_core.14772
#0  dbus_watch_handle (watch=0x90, flags=2) at dbus-watch.c:650
#1  0x000000000046edcc in sbus_watch_handler (ev=<value optimized out>, fde=<value optimized out>, flags=<value optimized out>, data=<value optimized out>) at src/sbus/sssd_dbus_common.c:93
#2  0x00007fcb8f3e23ff in epoll_event_loop (ev=<value optimized out>, location=<value optimized out>) at ../tevent_standard.c:328
#3  std_event_loop_once (ev=<value optimized out>, location=<value optimized out>) at ../tevent_standard.c:567
#4  0x00007fcb8f3de8f0 in _tevent_loop_once (ev=0x176a590, location=0x47c713 "src/util/server.c:601") at ../tevent.c:507
#5  0x00007fcb8f3de95b in tevent_common_loop_wait (ev=0x176a590, location=0x47c713 "src/util/server.c:601") at ../tevent.c:608
#6  0x0000000000451973 in server_loop (main_ctx=0x176b6a0) at src/util/server.c:601
#7  0x000000000041a066 in main (argc=<value optimized out>, argv=<value optimized out>) at src/providers/data_provider_be.c:2732
2) sssd_nss_core.32675
#0  dbus_watch_handle (watch=0x90, flags=2) at dbus-watch.c:650
#1  0x000000000047227c in sbus_watch_handler (ev=<value optimized out>, fde=<value optimized out>, flags=<value optimized out>, data=<value optimized out>) at src/sbus/sssd_dbus_common.c:93
#2  0x00007fbd5c5623ff in epoll_event_loop (ev=<value optimized out>, location=<value optimized out>) at ../tevent_standard.c:328
#3  std_event_loop_once (ev=<value optimized out>, location=<value optimized out>) at ../tevent_standard.c:567
#4  0x00007fbd5c55e8f0 in _tevent_loop_once (ev=0x89e3b0, location=0x4816a3 "src/util/server.c:601") at ../tevent.c:507
#5  0x00007fbd5c55e95b in tevent_common_loop_wait (ev=0x89e3b0, location=0x4816a3 "src/util/server.c:601") at ../tevent.c:608
#6  0x000000000045a1b3 in server_loop (main_ctx=0x89f530) at src/util/server.c:601
#7  0x00000000004090a0 in main (argc=<value optimized out>, argv=<value optimized out>)
3) sssd_pam_core.18894
#0  0x00007f297c2778a5 in raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1  0x00007f297c279085 in abort () at abort.c:92
#2  0x00007f297f41cc3c in talloc_abort (reason=0x7f297f422378 "Bad talloc magic value - unknown value") at ../talloc.c:317
#3  0x00007f297f41cdf1 in talloc_abort_unknown_value (ptr=<value optimized out>) at ../talloc.c:341
#4  talloc_chunk_from_ptr (ptr=<value optimized out>) at ../talloc.c:360
#5  talloc_get_name (ptr=<value optimized out>) at ../talloc.c:1153
#6  0x00007f297f41ce1e in talloc_check_name (ptr=0x1fe4c90, name=0x461f5d "struct pam_auth_req") at ../talloc.c:1172
#7  0x0000000000410e0a in pam_dp_process_reply (pending=0x1fdd250, ptr=<value optimized out>) at src/responder/pam/pamsrv_dp.c:42
#8  0x00007f297edb161a in complete_pending_call_and_unlock (connection=0x1fb9780, pending=0x1fdd250, message=<value optimized out>) at dbus-connection.c:2234
#9  0x00007f297edb386f in dbus_connection_dispatch (connection=0x1fb9780) at dbus-connection.c:4397
#10 0x000000000045425e in sbus_dispatch (ev=0x1fad3d0, te=<value optimized out>, tv=..., data=<value optimized out>) at src/sbus/sssd_dbus_connection.c:104
#11 0x00007f297f62bbd9 in tevent_common_loop_timer_delay (ev=0x1fad3d0) at ../tevent_timed.c:254
#12 0x00007f297f62b2ab in std_event_loop_once (ev=<value optimized out>, location=<value optimized out>) at ../tevent_standard.c:560
#13 0x00007f297f6278f0 in _tevent_loop_once (ev=0x1fad3d0, location=0x46d063 "src/util/server.c:601") at ../tevent.c:507
#14 0x00007f297f62795b in tevent_common_loop_wait (ev=0x1fad3d0, location=0x46d063 "src/util/server.c:601") at ../tevent.c:608
#15 0x0000000000455bb3 in server_loop (main_ctx=0x1fae550) at src/util/server.c:601
#16 0x0000000000409b32 in main (argc=<value optimized out>, argv=<value optimized out>) at src/responder/pam/pamsrv.c:260
> Cheers; Steve
>
> # ls -ltd /core.18894
> -rw-------. 1 root root 1253376 May 14 20:10 /core.18894
> # file /core.18894
> /core.18894: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/libexec/sssd/sssd_pam --debug-to-files'
> #
>
> .....does this ring a bell?
No, sorry. This looks like some kind of new bug to me. Can you install
sssd-debuginfo (debuginfo-install sssd should be all you need) and
generate the backtrace again?
When does the crash happen? Can you see any pattern in SSSD usage? I'd
like to reproduce the crash in-house if possible.
I don't have a pattern at the moment. We were using sssd happily on our internal
test clusters with a limited number of users. When we switched to a user-facing
cluster (14,000 users), we started to have problems. I suspect we have more crashes
when particular machines are under load, and we certainly have fewer (*) sssd
crashes since we fixed another piece of software that was going bananas CPU-wise.
I will try to find a correlation.
(*) Of 60 (hopefully identical) nodes, around 10 have had an sssd crash in the last
couple of days.
>
_______________________________________________
sssd-devel mailing list
sssd-devel(a)lists.fedorahosted.org
https://lists.fedorahosted.org/mailman/listinfo/sssd-devel