Hello gurus,

We have been running a 3-node FreeIPA cluster for some time without major trouble. One server may go stale from time to time, but restarting it has never been a real problem.

A few days ago, we had to migrate the VMs between two clouds (disk images copied from one to the other). They were renumbered from the old to the new IPv4 address space. Not that easy, but we finally got it done with all DNS entries in sync. Yet, since the migration, the ns-slapd process hangs randomly far more often than before (from once every few months to several times a day) and is especially hard to restart on any node.
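
For what it's worth, the DNS sanity check after the renumbering was essentially a loop like the one below, confirming that forward and reverse records agree on the new addresses (the hostnames are placeholders for our actual masters):

# for h in ipa1.example.com ipa2.example.com ipa3.example.com; do dig +short A "$h"; dig +short -x "$(dig +short A "$h")"; done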

While the process is starting up, the netstat output looks like this:

Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6  184527      0 10.217.151.3:389        10.217.151.2:52314      ESTABLISHED 29948/ns-slapd      
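
(For reference, that snapshot is captured with something along the lines of the command below; the Recv-Q column is the second field.)

# netstat -tnp | grep ns-slapd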

Netstat and tcpdump show that it drains the receive queue very slowly (sometimes around 79 bytes every 1-2 seconds). At some point it simply stops processing and hangs (only kill -9 can take it down). When stale, strace shows the process looping only on:

getpeername(8, 0x7ffe62c49fd0, 0x7ffe62c49f94) = -1 ENOTCONN (Transport endpoint is not connected)
poll([{fd=50, events=POLLIN}, {fd=7, events=POLLIN}, {fd=8, events=POLLIN}, {fd=9, events=POLLIN}, {fd=117, events=POLLIN}, {fd=116, events=POLLIN}, {fd=115, events=POLLIN}, {fd=114, events=POLLIN}, {fd=89, events=POLLIN}, {fd=85, events=POLLIN}, {fd=83, events=POLLIN}, {fd=82, events=POLLIN}, {fd=81, events=POLLIN}, {fd=80, events=POLLIN}, {fd=79, events=POLLIN}, {fd=78, events=POLLIN}, {fd=77, events=POLLIN}, {fd=76, events=POLLIN}, {fd=67, events=POLLIN}, {fd=72, events=POLLIN}, {fd=69, events=POLLIN}, {fd=64, events=POLLIN}, {fd=66, events=POLLIN}], 23, 250) = 0 (Timeout)
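
That trace was taken by attaching to the already running daemon, roughly as below (the filter is just to cut the noise down to socket and poll activity; adjust to taste):

# strace -f -p "$(pidof ns-slapd)" -e trace=network,poll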

If it does get through startup replication, one of the servers will hang a little later, freezing the whole cluster and forcing us to restart the faulty node to unlock things.

When stale, the dirsrv access log only contains entries like:
[20/Oct/2019:17:52:46.950029525 +0100] conn=86 fd=131 slot=131 connection from 10.217.151.4 to 10.217.151.4
[20/Oct/2019:17:52:51.280412883 +0100] conn=87 fd=132 slot=132 SSL connection from 10.217.151.10 to 10.217.151.4
[20/Oct/2019:17:52:54.956204031 +0100] conn=88 fd=133 slot=133 connection from 10.217.151.4 to 10.217.151.4
[20/Oct/2019:17:53:04.966542441 +0100] conn=89 fd=134 slot=134 connection from 10.217.151.2 to 10.217.151.4
[20/Oct/2019:17:53:22.659053020 +0100] conn=90 fd=135 slot=135 SSL connection from 10.217.151.10 to 10.217.151.4
[20/Oct/2019:17:53:51.006707605 +0100] conn=91 fd=136 slot=136 connection from 10.217.151.4 to 10.217.151.4
[20/Oct/2019:17:53:54.514162543 +0100] conn=92 fd=137 slot=137 SSL connection from 10.217.151.10 to 10.217.151.4
[20/Oct/2019:17:53:59.011602776 +0100] conn=93 fd=138 slot=138 connection from 10.217.151.3 to 10.217.151.4
[20/Oct/2019:17:54:09.019296900 +0100] conn=94 fd=139 slot=139 connection from 10.217.151.4 to 10.217.151.4
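
Those lines come from tailing the access log of the stale instance, i.e. something like the following (the instance name is a placeholder for ours):

# tail -f /var/log/dirsrv/slapd-EXAMPLE-COM/access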

And netstat lists tens of accepted network connections that are stale, like:
tcp6     286      0 10.217.151.4:389        10.217.151.10:32512      ESTABLISHED 29948/ns-slapd      
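
To get a feel for how many of them are stuck, we count the ESTABLISHED ns-slapd sockets whose Recv-Q (second field) never drains, with something like:

# netstat -tnp | awk '/ns-slapd/ && $2 > 0' | wc -l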


The underlying network seems clean and uses jumbo frames. tcpdump and ping show zero packet loss and no retransmits. Fearing a jumbo frame issue, we even forced the MTU down to 1500, without success.
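
For the record, the MTU experiment was essentially the following (the interface name is an example; 1472 bytes of payload plus headers corresponds to a 1500-byte frame with DF set):

# ip link set dev eth0 mtu 1500
# ping -M do -s 1472 -c 10 10.217.151.2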

Entropy seems fine as well:
# cat /proc/sys/kernel/random/entropy_avail
3138

Versions running on all servers:
ipa-client-4.6.5-11.el7.centos.x86_64
ipa-client-common-4.6.5-11.el7.centos.noarch
ipa-common-4.6.5-11.el7.centos.noarch
ipa-server-4.6.5-11.el7.centos.x86_64
ipa-server-common-4.6.5-11.el7.centos.noarch
ipa-server-dns-4.6.5-11.el7.centos.noarch


I would gladly welcome any hint regarding this critical problem.

/Sylvain.