Hi,
We run sssd to bind a number of machines to LDAP for auth. On a subset
of these machines, we have software that makes several thousand IPv6
route changes per second.
Recently, we found that on these hosts the sssd_nss responder process
fails several times a day[1], and will not recover until sssd is
restarted. strace[2] of the main sssd process indicates that sssd is
receiving many, many netlink messages - so many, in fact, that sssd
cannot process them fast enough and is receiving ENOBUFS from
recvmsg(2).
The messages that are received seem to get forwarded[3] to the sssd
responders over the unix socket and flood them until they fail.
From what I can see, the netlink code in
src/monitor/monitor_netlink.c:setup_netlink() subscribes to netlink
notifications with the aim of detecting things like wifi network
changes. This isn't something we'd find useful on our servers and
seems to have performance implications - is there any easy way of
turning off this functionality in sssd that I've missed?
We see this issue running sssd 1.11.7.
Cheers,
Patrick
1. The failures look something like this. I have replaced our sss
domain with "ourdomain"
/var/log/sssd/sssd_nss.log
(Tue Mar 22 02:58:01 2016) [sssd[nss]] [accept_fd_handler] (0x0100):
Client connected!
(Tue Mar 22 02:58:01 2016) [sssd[nss]] [nss_cmd_initgroups] (0x0100):
Requesting info for [systemuser] from [<ALL>]
(Tue Mar 22 02:58:01 2016) [sssd[nss]] [nss_cmd_initgroups_search]
(0x0100): Requesting info for [systemuser@ourdomain]
(Tue Mar 22 02:59:04 2016) [sssd[nss]]
[nss_cmd_initgroups_dp_callback] (0x0040): Unable to get information
from Data Provider
Error: 3, 5, (null)
Will try to return what we have in cache
(Tue Mar 22 02:59:14 2016) [sssd[nss]] [server_setup] (0x0080):
CONFDB: /var/lib/sss/db/config.ldb
(Tue Mar 22 02:59:14 2016) [sssd[nss]] [monitor_common_send_id]
(0x0100): Sending ID: (nss,1)
(Tue Mar 22 02:59:14 2016) [sssd[nss]] [check_file] (0x0020): lstat
for [/var/lib/sss/pipes/private/sbus-dp_ourdomain] failed: [2][No such
file or directory].
(Tue Mar 22 02:59:14 2016) [sssd[nss]] [sbus_client_init] (0x0020):
check_file failed for [/var/lib/sss/pipes/private/sbus-dp_ourdomain].
(Tue Mar 22 02:59:14 2016) [sssd[nss]] [sss_dp_init] (0x0010): Failed
to connect to monitor services.
(Tue Mar 22 02:59:14 2016) [sssd[nss]] [sss_process_init] (0x0010):
fatal error setting up backend connector
(Tue Mar 22 02:59:14 2016) [sssd[nss]] [server_setup] (0x0080):
CONFDB: /var/lib/sss/db/config.ldb
(Tue Mar 22 02:59:14 2016) [sssd[nss]] [monitor_common_send_id]
(0x0100): Sending ID: (nss,1)
(Tue Mar 22 02:59:14 2016) [sssd[nss]] [check_file] (0x0020): lstat
for [/var/lib/sss/pipes/private/sbus-dp_ourdomain] failed: [2][No such
file or directory].
(Tue Mar 22 02:59:14 2016) [sssd[nss]] [sbus_client_init] (0x0020):
check_file failed for [/var/lib/sss/pipes/private/sbus-dp_ourdomain].
(Tue Mar 22 02:59:14 2016) [sssd[nss]] [sss_dp_init] (0x0010): Failed
to connect to monitor services.
(Tue Mar 22 02:59:14 2016) [sssd[nss]] [sss_process_init] (0x0010):
fatal error setting up backend connector
(Tue Mar 22 02:59:14 2016) [sssd[nss]] [server_setup] (0x0080):
CONFDB: /var/lib/sss/db/config.ldb
(Tue Mar 22 02:59:14 2016) [sssd[nss]] [monitor_common_send_id]
(0x0100): Sending ID: (nss,1)
(Tue Mar 22 02:59:14 2016) [sssd[nss]] [check_file] (0x0020): lstat
for [/var/lib/sss/pipes/private/sbus-dp_ourdomain] failed: [2][No such
file or directory].
(Tue Mar 22 02:59:14 2016) [sssd[nss]] [sbus_client_init] (0x0020):
check_file failed for [/var/lib/sss/pipes/private/sbus-dp_ourdomain].
(Tue Mar 22 02:59:14 2016) [sssd[nss]] [sss_dp_init] (0x0010): Failed
to connect to monitor services.
(Tue Mar 22 02:59:14 2016) [sssd[nss]] [sss_process_init] (0x0010):
fatal error setting up backend connector
(no further messages until service restarts)
2. strace sssd -i
sendmsg(12, {msg_name(0)=NULL,
msg_iov(2)=[{"l\1\0\1\0\0\0\0M\10\0\0e\0\0\0\1\1o\0\35\0\0\0/org/fre"...,
120}, {"", 0}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 120
gettimeofday({1458663090, 595705}, NULL) = 0
epoll_wait(5, {{EPOLLIN, {u32=150507384, u64=150507384}}}, 1, 1563) = 1
recvmsg(12, {msg_name(0)=NULL,
msg_iov(1)=[{"l\2\1\1\0\0\0\0M\10\0\0\10\0\0\0\5\1u\0M\10\0\0\0\2\1\1\0\0\0\0"...,
2048}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC},
MSG_CMSG_CLOEXEC) = 24
gettimeofday({1458663090, 595978}, NULL) = 0
recvmsg(12, 0xffcdcac0, MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource
temporarily unavailable)
gettimeofday({1458663090, 596110}, NULL) = 0
gettimeofday({1458663090, 596151}, NULL) = 0
epoll_wait(5, {{EPOLLIN|EPOLLERR, {u32=150512424, u64=150512424}}}, 1, 1563) = 1
recvmsg(11, 0xffcdcb3c, 0) = -1 ENOBUFS (No buffer space available)
gettimeofday({1458663090, 596330}, NULL) = 0
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2819, ...}) = 0
write(2, "(Tue Mar 22 09:11:30 2016) [sssd"..., 65(Tue Mar 22 09:11:30
2016) [sssd] [netlink_fd_handler] (0x0020): ) = 65
write(2, "Error while reading from netlink"..., 36Error while reading
from netlink fd
) = 36
gettimeofday({1458663090, 596538}, NULL) = 0
epoll_wait(5, {{EPOLLIN, {u32=150512424, u64=150512424}}}, 1, 1563) = 1
recvmsg(11, {msg_name(12)={sa_family=AF_NETLINK, pid=0,
groups=00000400},
msg_iov(1)=[{"\224\0\0\0\31\0\0\0\0\0\0\0\0\0\0\0\n\200\0\0\1\0\0\1\0\2\0\0\10\0\17\0"...,
4096}], msg_controllen=24, {cmsg_len=24, cmsg_level=SOL_SOCKET,
cmsg_type=SCM_CREDENTIALS{pid=0, uid=0, gid=0}}, msg_flags=0}, 0) =
148
gettimeofday({1458663090, 596679}, NULL) = 0
sendmsg(12, {msg_name(0)=NULL,
msg_iov(2)=[{"l\1\0\1\0\0\0\0N\10\0\0e\0\0\0\1\1o\0\35\0\0\0/org/fre"...,
120}, {"", 0}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 120
gettimeofday({1458663090, 596777}, NULL) = 0
3. /var/log/sssd/sssd.log
(Wed Mar 23 06:59:07 2016) [sssd] [message_type] (0x0200): netlink
Message type: 24
(Wed Mar 23 06:59:07 2016) [sssd] [route_msg_debug_print] (0x1000):
route idx 209591 flags 0X200 family 10 addr
fd0a:9b09:1f7:0:218:aff:fe34:3d08/128
(Wed Mar 23 06:59:07 2016) [sssd] [network_status_change_cb] (0x2000):
A networking status change detected signaling providers to reset
offline status
(Wed Mar 23 06:59:07 2016) [sssd] [service_signal] (0x0020): Could not
signal service [ourdomain].
(Wed Mar 23 06:59:07 2016) [sssd] [message_type] (0x0200): netlink
Message type: 24
(Wed Mar 23 06:59:07 2016) [sssd] [route_msg_debug_print] (0x1000):
route idx 209591 flags 0X200 family 10 addr
fd0a:9b09:1f7:0:218:aff:fe33:3b66/128
(Wed Mar 23 06:59:07 2016) [sssd] [network_status_change_cb] (0x2000):
A networking status change detected signaling providers to reset
offline status
(Wed Mar 23 06:59:07 2016) [sssd] [service_signal] (0x0020): Could not
signal service [ourdomain].
(Wed Mar 23 06:59:07 2016) [sssd] [message_type] (0x0200): netlink
Message type: 24
$ grep network_status_change_cb sssd.log | grep '06:59:12' | wc -l
1245