-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Apologies for not starting with any of this info -- I was copying from
an internal e-mail and didn't think to add it all into a mailing list
message. I am running CentOS 6.6 and SSSD 1.11.6.
On 11/04/2015 04:10 AM, Jakub Hrozek wrote:
On Tue, Nov 03, 2015 at 04:15:57PM -0500, Novosielski, Ryan wrote:
> Over time, I’ve been having seemingly random sssd quits that
> I’ve not been able to figure out. Today, I finally traced it to
> fluctuations on my Infiniband fabric:
>
> sssd.log
>
> (Tue Nov 3 13:17:59 2015) [sssd] [message_type] (0x0200): netlink
> Message type: 16
> (Tue Nov 3 13:17:59 2015) [sssd] [link_msg_handler] (0x1000): netlink
> link message: iface idx 4 (ib0) flags 0x1003 (broadcast,multicast,up)
> (Tue Nov 3 13:17:59 2015) [sssd] [message_type] (0x0200): netlink
> Message type: 16
> (Tue Nov 3 13:17:59 2015) [sssd] [link_msg_handler] (0x1000): netlink
> link message: iface idx 4 (ib0) flags 0x11043
> (broadcast,multicast,up,running,lowerup)
These messages should just tell sssd to reset its online/offline
status. It's possible we have a bug there, but this code should be
pretty stable, there were no changes there recently. Maybe the
services don't handle the resetOffline signal well, but that's
just my speculation.
A colleague wondered whether the trigger might be a status change that
happens within a single second, as it does here?
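For reference, the two flag values in the netlink messages can be
decoded with a short sketch like the one below. It assumes the standard
Linux IFF_* bit values from linux/if.h; the short names are only chosen
to resemble what sssd prints, and decode_flags and flag_diff are
hypothetical helper names, not anything from sssd itself.

```python
# Decode the "flags 0x..." value from a netlink link message.
# Bit values follow the Linux IFF_* constants (linux/if.h).
# The lowercase names are an assumption meant to mirror the log output.
IFF_FLAGS = [
    (0x1, "up"),
    (0x2, "broadcast"),
    (0x4, "debug"),
    (0x8, "loopback"),
    (0x10, "pointopoint"),
    (0x20, "notrailers"),
    (0x40, "running"),
    (0x80, "noarp"),
    (0x100, "promisc"),
    (0x200, "allmulti"),
    (0x400, "master"),
    (0x800, "slave"),
    (0x1000, "multicast"),
    (0x2000, "portsel"),
    (0x4000, "automedia"),
    (0x8000, "dynamic"),
    (0x10000, "lowerup"),
    (0x20000, "dormant"),
    (0x40000, "echo"),
]


def decode_flags(value):
    """Return the names of all flag bits set in value, lowest bit first."""
    return [name for bit, name in IFF_FLAGS if value & bit]


def flag_diff(before, after):
    """Return (gained, lost) flag names between two flag values."""
    gained = decode_flags(after & ~before)
    lost = decode_flags(before & ~after)
    return gained, lost


print(decode_flags(0x1003))
print(decode_flags(0x11043))
print(flag_diff(0x1003, 0x11043))
```

Read this way, the second message differs from the first only by the
running and lowerup bits, with both messages logged in the same second.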
There's no way to disable the netlink integration except for build
changes.
> This exactly corresponds to the time in /var/log/messages for
> the unexplained shutdown:
>
> 2015-11-03T13:17:59-05:00 node75 sssd[pam]: Shutting down
> 2015-11-03T13:17:59-05:00 node75 sssd[be[default]]: Shutting down
> 2015-11-03T13:17:59-05:00 node75 sssd[nss]: Shutting down
>
> Here is sssd_default.log for good measure:
>
> (Tue Nov 3 13:17:59 2015) [sssd[be[default]]] [sbus_remove_watch]
> (0x2000): 0x1414770/0x14133d0
> (Tue Nov 3 13:17:59 2015) [sssd[be[default]]] [sbus_remove_watch]
> (0x2000): 0x1414770/0x13fef90
> (Tue Nov 3 13:17:59 2015) [sssd[be[default]]] [be_ptask_destructor]
> (0x0400): Terminating periodic task [Cleanup of default]
> (Tue Nov 3 13:17:59 2015) [sssd[be[default]]] [sdap_handle_release]
> (0x2000): Trace: sh[0x14bd850], connected[1], ops[(nil)],
> ldap[0x1424260], destructor_lock[0], release_memory[0]
> (Tue Nov 3 13:17:59 2015) [sssd[be[default]]]
> [remove_connection_callback] (0x4000): Successfully removed
> connection callback.
> (Tue Nov 3 13:17:59 2015) [sssd[be[default]]] [sbus_remove_watch]
> (0x2000): 0x1415970/0x1416430
> (Tue Nov 3 13:17:59 2015) [sssd[be[default]]] [remove_socket_symlink]
> (0x4000): The symlink points to
> [/var/lib/sss/pipes/private/sbus-dp_default.18702]
> (Tue Nov 3 13:17:59 2015) [sssd[be[default]]] [remove_socket_symlink]
> (0x4000): The path including our pid is
> [/var/lib/sss/pipes/private/sbus-dp_default.18702]
> (Tue Nov 3 13:17:59 2015) [sssd[be[default]]] [remove_socket_symlink]
> (0x4000): Removed the symlink
> (Tue Nov 3 13:17:59 2015) [sssd[be[default]]] [be_client_destructor]
> (0x0400): Removed PAM client
> (Tue Nov 3 13:17:59 2015) [sssd[be[default]]] [be_client_destructor]
> (0x0400): Removed NSS client
Is there anything else in the logs? Can you look for messages just
before the "server_setup" line? (The server_setup function is the
first one we print after startup)
They were above. Here's an excerpt from the right spot in sssd.log.
Everything relevant, IMO, happened in that one second.
(Tue Nov 3 13:17:59 2015) [sssd] [link_msg_handler] (0x1000): netlink
link message: iface idx 4 (ib0) flags 0x11043
(broadcast,multicast,up,running,lowerup)
(Tue Nov 3 15:38:56 2015) [sssd] [server_setup] (0x0400): CONFDB:
/var/lib/sss/db/config.ldb
The messages prior to that are just the normal stuff.
(Tue Nov 3 13:17:57 2015) [sssd] [service_send_ping] (0x0100):
Pinging nss
(Tue Nov 3 13:17:57 2015) [sssd] [sbus_add_timeout] (0x2000): 0x1b9fe20
(Tue Nov 3 13:17:57 2015) [sssd] [service_send_ping] (0x0100):
Pinging pam
(Tue Nov 3 13:17:57 2015) [sssd] [sbus_add_timeout] (0x2000): 0x1b9fe60
(Tue Nov 3 13:17:57 2015) [sssd] [sbus_remove_timeout] (0x2000):
0x1b9fe20
(Tue Nov 3 13:17:57 2015) [sssd] [sbus_dispatch] (0x4000): dbus conn:
0x1b9a5d0
(Tue Nov 3 13:17:57 2015) [sssd] [sbus_dispatch] (0x4000): Dispatching.
(Tue Nov 3 13:17:57 2015) [sssd] [ping_check] (0x0100): Service nss
replied to ping
(Tue Nov 3 13:17:57 2015) [sssd] [sbus_remove_timeout] (0x2000):
0x1b9fe60
(Tue Nov 3 13:17:57 2015) [sssd] [sbus_dispatch] (0x4000): dbus conn:
0x1b99cc0
(Tue Nov 3 13:17:57 2015) [sssd] [sbus_dispatch] (0x4000): Dispatching.
(Tue Nov 3 13:17:57 2015) [sssd] [ping_check] (0x0100): Service pam
replied to ping
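In case it helps anyone else dig through a large debug log, here is a
rough sketch of how the lines just before each "server_setup" entry
could be pulled out. It assumes each log entry sits on a single line in
the file in the format shown above (the excerpts here are wrapped by the
mail client); lines_before_restart and its parameters are hypothetical
names of my own.

```python
import re
from collections import deque

# Matches the "[function] (0xNNNN):" part of an entry such as:
# (Tue Nov 3 15:38:56 2015) [sssd] [server_setup] (0x0400): CONFDB: ...
FUNC = re.compile(r"\] \[(\w+)\] \(0x[0-9a-fA-F]+\):")


def lines_before_restart(lines, context=20, marker="server_setup"):
    """Yield (context_lines, marker_line) for each entry logged by marker.

    context_lines holds up to `context` entries seen since the previous
    marker entry, so it is empty if two marker entries are adjacent.
    """
    recent = deque(maxlen=context)
    for line in lines:
        line = line.rstrip("\n")
        match = FUNC.search(line)
        if match and match.group(1) == marker:
            yield list(recent), line
            recent.clear()
        else:
            recent.append(line)
```

Passing an open file object for sssd.log as `lines` should work, since
the function only iterates over it line by line.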
What about the sssd.log itself?
That's what is right above and what I'd posted before to show the
netlink messages. If you want the whole log, or a certain portion of
it, let me know where it would be helpful to send it. I've had the
debug level turned up quite high to try to track this one down, so the
log is rather large.
> I can duplicate this by manually taking down the Infiniband
> link:
>
> [root@node24 ~]# service sssd status
> sssd (pid 9132) is running...
> [root@node24 ~]# ifdown ib0
> [root@node24 ~]# service sssd status
> sssd dead but pid file exists
What SSSD version are you running on what OS?
If it's RHEL/CentOS or one of its derivatives, can you enable
abrt to see if there's any crash? Are there any sssd-related
messages in syslog?
I will turn abrt on. The problem is extremely intermittent and it is
compute nodes that are having the problem, so it's going to take a
while to get anything useful out of it.
The only sssd messages in syslog were the ones I posted first (from
/var/log/messages), about it shutting down.
- --
____ *Note: UMDNJ is now Rutgers-Biomedical and Health Sciences*
|| \\UTGERS |---------------------*O*---------------------
||_// Biomedical | Ryan Novosielski - Senior Technologist
|| \\ and Health | novosirj(a)rutgers.edu - 973/972.0922 (2x0922)
|| \\ Sciences | OIRT/High Perf & Res Comp - MSB C630, Newark
`'
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.22 (GNU/Linux)
iEYEARECAAYFAlY6Kc8ACgkQmb+gadEcsb6pLACZAbsRcZPEYiFyfODD46VZsths
XqEAn3WaGE5aAYLInDDiUE42IM47c7oj
=bNHL
-----END PGP SIGNATURE-----