I did not check all connections, but all of those I checked come in from Server A. I could
not check all of them, because I do not have all of the access files. As I mentioned
before, I have quite some pcaps and I go through them. My previous statement about quite
some number of TCP Resets was incorrect. Those belonged to other TCP streams. The
observation I made is that I only see TCP Retransmissions for LDAP streams between A and
B. To be more precise, only for replication sessions. The replication did work for a
period of about 48 hours. During this period, I saw only one TCP Retransmission for a
replication session. So far, I just went through the pcaps from Server B.
However, this suggests to me that the MMR issue is linked to the TCP Retransmissions. I am
almost at the point that it may be a hardware failure (NIC) either on A or B. But then, I
don’t get why C is not implicated…..
From: Rich Megginson [mailto:rmeggins@redhat.com]
Sent: Friday, November 16, 2012 10:20 AM
To: Reinhard Nappert
Cc: General discussion list for the 389 Directory server project.
Subject: Re: [389-users] MMR issue ...
On 11/16/2012 08:11 AM, Reinhard Nappert wrote:
This is correct. This also means that this connection is established and active. What I
don’t understand is why this replication session stays active? Any idea….
Because the replication supplier is keeping the connection open? From which machine are
all these open connections from? What are those servers doing?
From: Rich Megginson [mailto:rmeggins@redhat.com]
Sent: Thursday, November 15, 2012 2:06 PM
To: Reinhard Nappert
Cc: General discussion list for the 389 Directory server project.
Subject: Re: [389-users] MMR issue ...
On 11/15/2012 12:01 PM, Reinhard Nappert wrote:
The system is cleaned up right now. The next time, when it happens, I hope I get this
information.
I am surprised that one session has more than 20.000 ops. There are not many updates on
the system…..
These are not just update operations, but all of the other search and replication extended
operations.
From: Rich Megginson [mailto:rmeggins@redhat.com]
Sent: Thursday, November 15, 2012 1:36 PM
To: Reinhard Nappert
Cc: General discussion list for the 389 Directory server project.
Subject: Re: [389-users] MMR issue ...
On 11/15/2012 11:30 AM, Reinhard Nappert wrote:
Rich,
I also made the following observation:
I see quite some open connections of the replication directory:
connection: 84:20121114233917Z:5:5::cn=replication,cn=config
connection: 85:20121113224528Z:3:3::cn= replication,cn=config
connection: 88:20121112163205Z:3:3::cn= replication,cn=config
connection: 89:20121113220034Z:4:4::cn= replication,cn=config
connection: 90:20121115061343Z:209:209::cn= replication,cn=config
connection: 91:20121114234513Z:23374:23373:r:cn= replication,cn=config
connection: 92:20121114235823Z:9579:9579::cn= replication,cn=config
connection: 93:20121115120212Z:1077:1076:r:cn= replication,cn=config
I only see this on Server B. A and C are fine, meaning that the session all disappear.
If you do a netstat, what does it say are the states of these connections?
Let me know, if you have any other idea what might be going on.
Thanks,
-Reinhard
From:
389-users-bounces@lists.fedoraproject.org<mailto:389-users-bounces@lists.fedoraproject.org>
[mailto:389-users-bounces@lists.fedoraproject.org] On Behalf Of Reinhard Nappert
Sent: Tuesday, November 13, 2012 3:54 PM
To: Rich Megginson
Cc: General discussion list for the 389 Directory server project.
Subject: Re: [389-users] MMR issue ...
There are a lot of RST for the time frame, especially on Server A.
I also see a lot TCP Retransmission for this particular replication session. I also see a
couple of TCP Dup ACKs. Is this enough to cause the issue?
Of course, I don’t have an idea, what is the cause of that!
From: Rich Megginson [mailto:rmeggins@redhat.com]
Sent: Tuesday, November 13, 2012 1:57 PM
To: Reinhard Nappert
Cc: General discussion list for the 389 Directory server project.
Subject: Re: [389-users] MMR issue ...
On 11/13/2012 11:53 AM, Reinhard Nappert wrote:
How would you proceed to figure out what is going on there?
Since it doesn't appear that the replication logs are giving enough information, and
you don't see any disconnects or TCP resets happening in the packet capture, then I
guess you have no choice but to familiarize yourself with the source code and use gdb.
You see that I ran out of ideas!
Thanks
From: Rich Megginson [mailto:rmeggins@redhat.com]
Sent: Tuesday, November 13, 2012 1:32 PM
To: Reinhard Nappert
Cc: General discussion list for the 389 Directory server project.
Subject: Re: [389-users] MMR issue ...
On 11/13/2012 11:21 AM, Reinhard Nappert wrote:
The 3 servers do not crash.
I am not sure about the network, though. My first assumption was that the firewall
(between A and B) might cause the issue. The latest occurrence (the one, I described) had
the firewall removed. I see quite some TCP Retransmissions in the packet captures. Could
that be the issue?
That could be, although that would mean there are so many tcp retransmissions that take
such a long time to process that it causes the application to think the network connection
has timed out.
-Reinhard
From: Rich Megginson [mailto:rmeggins@redhat.com]
Sent: Tuesday, November 13, 2012 1:15 PM
To: General discussion list for the 389 Directory server project.
Cc: Reinhard Nappert
Subject: Re: [389-users] MMR issue ...
On 11/13/2012 11:02 AM, Reinhard Nappert wrote:
Rich,
Do you know what the cause of this issue is?
No, I don't know.
You would expect that you saw this issue in different deployments, but I only saw it in
one instance.
If it turns out that the issue I see is identical the issue, you mentioned, I’d like to
know, when it was fixed.
Upon further investigation, this does not appear to be the same as
https://fedorahosted.org/389/ticket/374
I'm not sure what the problem is. I've seen timeouts when servers crash or there
are network issues.
Thanks,
-Reinhard
From:
389-users-bounces@lists.fedoraproject.org<mailto:389-users-bounces@lists.fedoraproject.org>
[mailto:389-users-bounces@lists.fedoraproject.org] On Behalf Of Reinhard Nappert
Sent: Tuesday, November 13, 2012 12:22 PM
To: Rich Megginson; General discussion list for the 389 Directory server project.
Subject: Re: [389-users] MMR issue ...
I use 1.2.8.2
From: Rich Megginson [mailto:rmeggins@redhat.com]
Sent: Tuesday, November 13, 2012 12:18 PM
To: General discussion list for the 389 Directory server project.
Cc: Reinhard Nappert
Subject: Re: [389-users] MMR issue ...
On 11/13/2012 09:24 AM, Reinhard Nappert wrote:
Hi,
I’ve encountered issues with a MMR setup, which looks like the following:
A ------- B
\ /
\ /
\ /
C
The replication works for approximately 24 hours. There are not many changes to the
content anyway. After about 1 day, the attribute value of the type
“nsds5replicaLastUpdateStatus” changes to “1 Can't acquire busy replica “ of the
replication agreement object from type “nsDS5ReplicationAgreement”. I see this message on
C for the agreement “C-to-B”. The start-time of the last update is 01:08:33. When I
check the status on B, it looks fine for “B-to-C” and “B-to-A”, however, the start-time of
the last update is stuck at 01:08:36 for “B-to-C”, whereas A gets updated afterwards as
well. I don’t have the values for A!
When, I check errors and access on the boxes, I see the following:
Errors on A:
[10/Nov/2012:01:19:31 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Warning: unable to receive endReplication extended operation response (Timed out)
[10/Nov/2012:01:25:01 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Unable to receive the response for a startReplication extended operation to consumer
(Can't contact LDAP server). Will retry later.
[10/Nov/2012:01:25:05 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Replication bind with SIMPLE auth resumed
[10/Nov/2012:02:26:29 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Unable to receive the response for a startReplication extended operation to consumer
(Timed out). Will retry later.
[10/Nov/2012:02:31:55 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Unable to receive the response for a startReplication extended operation to consumer
(Can't contact LDAP server). Will retry later.
[10/Nov/2012:02:31:59 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Replication bind with SIMPLE auth resumed
[10/Nov/2012:02:43:36 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Unable to receive the response for a startReplication extended operation to consumer
(Timed out). Will retry later.
[10/Nov/2012:03:03:00 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Unable to receive the response for a startReplication extended operation to consumer
(Timed out). Will retry later.
[10/Nov/2012:03:08:24 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Unable to receive the response for a startReplication extended operation to consumer
(Can't contact LDAP server). Will retry later.
[10/Nov/2012:03:11:35 -0300] slapi_ldap_bind - Error: could not send bind request for id
[cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server)
-5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[10/Nov/2012:03:11:35 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP
server) ((null))
[10/Nov/2012:03:14:45 -0300] slapi_ldap_bind - Error: could not send bind request for id
[cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server)
-5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[10/Nov/2012:03:14:52 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Replication bind with SIMPLE auth resumed
[10/Nov/2012:03:33:29 -0300] slapi_ldap_bind - Error: could not send bind request for id
[cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server)
-5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[10/Nov/2012:03:33:29 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP
server) ((null))
[10/Nov/2012:03:43:29 -0300] slapi_ldap_bind - Error: timeout after [0.0] seconds reading
bind response for [cn=replication,cn=config] mech [SIMPLE]
[10/Nov/2012:03:43:29 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Replication bind with SIMPLE auth failed: LDAP error 85 (Timed out) ((null))
[10/Nov/2012:03:46:39 -0300] slapi_ldap_bind - Error: could not send bind request for id
[cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server)
-5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[10/Nov/2012:03:46:39 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP
server) ((null))
[10/Nov/2012:03:46:42 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Replication bind with SIMPLE auth resumed
[10/Nov/2012:05:12:02 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Unable to receive the response for a startReplication extended operation to consumer
(Timed out). Will retry later.
[10/Nov/2012:06:16:01 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Unable to receive the response for a startReplication extended operation to consumer
(Timed out). Will retry later.
[10/Nov/2012:06:21:27 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Unable to receive the response for a startReplication extended operation to consumer
(Can't contact LDAP server). Will retry later.
[10/Nov/2012:06:21:46 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Replication bind with SIMPLE auth resumed
[10/Nov/2012:06:39:23 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Unable to receive the response for a startReplication extended operation to consumer
(Timed out). Will retry later.
[10/Nov/2012:07:43:45 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Unable to receive the response for a startReplication extended operation to consumer
(Timed out). Will retry later.
[10/Nov/2012:19:38:22 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Unable to receive the response for a startReplication extended operation to consumer
(Timed out). Will retry later.
[10/Nov/2012:19:43:51 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Unable to receive the response for a startReplication extended operation to consumer
(Can't contact LDAP server). Will retry later.
[10/Nov/2012:19:43:54 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Replication bind with SIMPLE auth resumed
[10/Nov/2012:20:18:21 -0300] slapi_ldap_bind - Error: could not send bind request for id
[cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server)
-5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[10/Nov/2012:20:18:21 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP
server) ((null))
[10/Nov/2012:20:23:21 -0300] slapi_ldap_bind - Error: could not send bind request for id
[cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server)
-5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[10/Nov/2012:20:28:21 -0300] slapi_ldap_bind - Error: could not send bind request for id
[cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server)
-5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[10/Nov/2012:20:33:20 -0300] slapi_ldap_bind - Error: could not send bind request for id
[cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server)
-5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[10/Nov/2012:20:38:22 -0300] slapi_ldap_bind - Error: could not send bind request for id
[cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server)
-5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[10/Nov/2012:20:43:22 -0300] slapi_ldap_bind - Error: could not send bind request for id
[cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server)
-5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[10/Nov/2012:20:48:20 -0300] slapi_ldap_bind - Error: could not send bind request for id
[cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server)
-5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[10/Nov/2012:21:00:15 -0300] slapi_ldap_bind - Error: timeout after [0.0] seconds reading
bind response for [cn=replication,cn=config] mech [SIMPLE]
[10/Nov/2012:21:00:15 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Replication bind with SIMPLE auth failed: LDAP error 85 (Timed out) ((null))
[10/Nov/2012:21:03:24 -0300] slapi_ldap_bind - Error: could not send bind request for id
[cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server)
-5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[10/Nov/2012:21:03:24 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP
server) ((null))
[10/Nov/2012:21:06:36 -0300] slapi_ldap_bind - Error: could not send bind request for id
[cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server)
-5961 (TCP connection reset by peer.) 115 (Operation now in progress)
[10/Nov/2012:21:16:39 -0300] slapi_ldap_bind - Error: timeout after [0.0] seconds reading
bind response for [cn=replication,cn=config] mech [SIMPLE]
[10/Nov/2012:21:16:39 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Replication bind with SIMPLE auth failed: LDAP error 85 (Timed out) ((null))
[10/Nov/2012:21:24:14 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389):
Replication bind with SIMPLE auth resumed
Errors on B:
[10/Nov/2012:01:27:56 -0300] NSMMReplicationPlugin - agmt="cn=B-to-A" (A:389):
Replication bind with SIMPLE auth resumed
Errors on C:
No entry for the time
I also analyzed the access files and see the following:
From access on A:
I see replication sessions from B and C, established at 01:00:10 and closed at 01:01:14.
Next replication session from C, established at 01:05:12 and closed at 01:06:12. There was
NO session from B.
B and C establish the next replication session at 01:07:52. During the session, a MOD is
performed at 01:08:33. C closes this session at 01:09:35. The session from B stays open,
and I see a ABANDOM request from A at 01:17:53. Eventually, the session gets closed (no
unbind) at 01:23:25.
[10/Nov/2012:01:17:53 -0300] conn=1439 op=3 ABANDON targetop=NOTFOUND msgid=3
Replication sessions from C continue to work fine.
There is one more replication session from B, established at 01:27:56 and closed at
01:28:38. After this session, B does not establish any replication session anymore.
From access on B:
I see replication sessions from A and C, established at 01:00:09 and closed at 01:01:14.
Next replication session from C, established at 01:05:12 and closed at 01:06:12. There was
NO session from A.
A and C establish the next replication session at 01:07:52. During the session, a MOD is
performed at 01:08:33. C closes this session at 01:09:35. The session from A stays open
forever!!!
Replication sessions from C continue to work fine.
From access on C:
I see replication sessions from A and B, established at 01:00:09 and closed at 01:01:14.
I don’t see any replication session from the 01:05 time-slot
A and B establish the next replication session at 01:07:53. During the session, a MOD is
performed at 01:08:33. A closes this session at 01:09:33 and B closes session at
01:09:38
From now on, I only see replication sessions from A!
Has anybody seen this kind of behavior? Any feedback is highly appreciated.
This looks like
https://fedorahosted.org/389/ticket/374
What version of 389-ds-base?
Thanks,
-Reinhard
--
389 users mailing list
389-users@lists.fedoraproject.org<mailto:389-users@lists.fedoraproject.org>
https://admin.fedoraproject.org/mailman/listinfo/389-users
--
389 users mailing list
389-users@lists.fedoraproject.org<mailto:389-users@lists.fedoraproject.org>
https://admin.fedoraproject.org/mailman/listinfo/389-users