There are a lot of RST for the time frame, especially on Server A.
I also see a lot TCP Retransmission for this particular replication
session. I also see a couple of TCP Dup ACKs. Is this enough to cause
the issue?
The RST would cause the connection to be closed. Other than that, not
sure what specifically would cause timeouts.
Of course, I don’t have an idea, what is the cause of that!
*From:*Rich Megginson [mailto:rmeggins@redhat.com]
*Sent:* Tuesday, November 13, 2012 1:57 PM
*To:* Reinhard Nappert
*Cc:* General discussion list for the 389 Directory server project.
*Subject:* Re: [389-users] MMR issue ...
On 11/13/2012 11:53 AM, Reinhard Nappert wrote:
How would you proceed to figure out what is going on there?
Since it doesn't appear that the replication logs are giving enough
information, and you don't see any disconnects or TCP resets happening
in the packet capture, then I guess you have no choice but to
familiarize yourself with the source code and use gdb.
You see that I ran out of ideas!
Thanks
*From:*Rich Megginson [mailto:rmeggins@redhat.com]
*Sent:* Tuesday, November 13, 2012 1:32 PM
*To:* Reinhard Nappert
*Cc:* General discussion list for the 389 Directory server project.
*Subject:* Re: [389-users] MMR issue ...
On 11/13/2012 11:21 AM, Reinhard Nappert wrote:
The 3 servers do not crash.
I am not sure about the network, though. My first assumption was that
the firewall (between A and B) might cause the issue. The latest
occurrence (the one, I described) had the firewall removed. I see
quite some TCP Retransmissions in the packet captures. Could that be
the issue?
That could be, although that would mean there are so many tcp
retransmissions that take such a long time to process that it causes
the application to think the network connection has timed out.
-Reinhard
*From:*Rich Megginson [mailto:rmeggins@redhat.com]
*Sent:* Tuesday, November 13, 2012 1:15 PM
*To:* General discussion list for the 389 Directory server project.
*Cc:* Reinhard Nappert
*Subject:* Re: [389-users] MMR issue ...
On 11/13/2012 11:02 AM, Reinhard Nappert wrote:
Rich,
Do you know what the cause of this issue is?
No, I don't know.
You would expect that you saw this issue in different deployments, but
I only saw it in one instance.
If it turns out that the issue I see is identical the issue, you
mentioned, I’d like to know, when it was fixed.
Upon further investigation, this does not appear to be the same as
https://fedorahosted.org/389/ticket/374
I'm not sure what the problem is. I've seen timeouts when servers
crash or there are network issues.
Thanks,
-Reinhard
*From:*389-users-bounces@lists.fedoraproject.org
<mailto:389-users-bounces@lists.fedoraproject.org>
[mailto:389-users-bounces@lists.fedoraproject.org] *On Behalf Of
*Reinhard Nappert
*Sent:* Tuesday, November 13, 2012 12:22 PM
*To:* Rich Megginson; General discussion list for the 389 Directory
server project.
*Subject:* Re: [389-users] MMR issue ...
I use 1.2.8.2
*From:*Rich Megginson [mailto:rmeggins@redhat.com]
*Sent:* Tuesday, November 13, 2012 12:18 PM
*To:* General discussion list for the 389 Directory server project.
*Cc:* Reinhard Nappert
*Subject:* Re: [389-users] MMR issue ...
On 11/13/2012 09:24 AM, Reinhard Nappert wrote:
Hi,
I’ve encountered issues with a MMR setup, which looks like the following:
A ------- B
\ /
\ /
\ /
C
The replication works for approximately 24 hours. There are not many
changes to the content anyway. After about 1 day, the attribute value
of the type “nsds5replicaLastUpdateStatus” changes to “1 Can't
acquire busy replica “ of the replication agreement object from type
“nsDS5ReplicationAgreement”. I see this message on C for the
agreement “C-to-B”. The start-time of the last update is 01:08:33.
When I check the status on B, it looks fine for “B-to-C” and
“B-to-A”, however, the start-time of the last update is stuck at
01:08:36 for “B-to-C”, whereas A gets updated afterwards as well. I
don’t have the values for A!
When, I check errors and access on the boxes, I see the following:
Errors on A:
[10/Nov/2012:01:19:31 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Warning: unable to receive endReplication extended operation
response (Timed out)
[10/Nov/2012:01:25:01 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Unable to receive the response for a startReplication
extended operation to consumer (Can't contact LDAP server). Will retry
later.
[10/Nov/2012:01:25:05 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Replication bind with SIMPLE auth resumed
[10/Nov/2012:02:26:29 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Unable to receive the response for a startReplication
extended operation to consumer (Timed out). Will retry later.
[10/Nov/2012:02:31:55 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Unable to receive the response for a startReplication
extended operation to consumer (Can't contact LDAP server). Will retry
later.
[10/Nov/2012:02:31:59 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Replication bind with SIMPLE auth resumed
[10/Nov/2012:02:43:36 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Unable to receive the response for a startReplication
extended operation to consumer (Timed out). Will retry later.
[10/Nov/2012:03:03:00 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Unable to receive the response for a startReplication
extended operation to consumer (Timed out). Will retry later.
[10/Nov/2012:03:08:24 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Unable to receive the response for a startReplication
extended operation to consumer (Can't contact LDAP server). Will retry
later.
[10/Nov/2012:03:11:35 -0300] slapi_ldap_bind - Error: could not send
bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91
(Can't connect to the LDAP server) -5961 (TCP connection reset by
peer.) 115 (Operation now in progress)
[10/Nov/2012:03:11:35 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Replication bind with SIMPLE auth failed: LDAP error 91
(Can't connect to the LDAP server) ((null))
[10/Nov/2012:03:14:45 -0300] slapi_ldap_bind - Error: could not send
bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91
(Can't connect to the LDAP server) -5961 (TCP connection reset by
peer.) 115 (Operation now in progress)
[10/Nov/2012:03:14:52 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Replication bind with SIMPLE auth resumed
[10/Nov/2012:03:33:29 -0300] slapi_ldap_bind - Error: could not send
bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91
(Can't connect to the LDAP server) -5961 (TCP connection reset by
peer.) 115 (Operation now in progress)
[10/Nov/2012:03:33:29 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Replication bind with SIMPLE auth failed: LDAP error 91
(Can't connect to the LDAP server) ((null))
[10/Nov/2012:03:43:29 -0300] slapi_ldap_bind - Error: timeout after
[0.0] seconds reading bind response for [cn=replication,cn=config]
mech [SIMPLE]
[10/Nov/2012:03:43:29 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Replication bind with SIMPLE auth failed: LDAP error 85
(Timed out) ((null))
[10/Nov/2012:03:46:39 -0300] slapi_ldap_bind - Error: could not send
bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91
(Can't connect to the LDAP server) -5961 (TCP connection reset by
peer.) 115 (Operation now in progress)
[10/Nov/2012:03:46:39 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Replication bind with SIMPLE auth failed: LDAP error 91
(Can't connect to the LDAP server) ((null))
[10/Nov/2012:03:46:42 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Replication bind with SIMPLE auth resumed
[10/Nov/2012:05:12:02 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Unable to receive the response for a startReplication
extended operation to consumer (Timed out). Will retry later.
[10/Nov/2012:06:16:01 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Unable to receive the response for a startReplication
extended operation to consumer (Timed out). Will retry later.
[10/Nov/2012:06:21:27 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Unable to receive the response for a startReplication
extended operation to consumer (Can't contact LDAP server). Will retry
later.
[10/Nov/2012:06:21:46 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Replication bind with SIMPLE auth resumed
[10/Nov/2012:06:39:23 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Unable to receive the response for a startReplication
extended operation to consumer (Timed out). Will retry later.
[10/Nov/2012:07:43:45 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Unable to receive the response for a startReplication
extended operation to consumer (Timed out). Will retry later.
[10/Nov/2012:19:38:22 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Unable to receive the response for a startReplication
extended operation to consumer (Timed out). Will retry later.
[10/Nov/2012:19:43:51 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Unable to receive the response for a startReplication
extended operation to consumer (Can't contact LDAP server). Will retry
later.
[10/Nov/2012:19:43:54 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Replication bind with SIMPLE auth resumed
[10/Nov/2012:20:18:21 -0300] slapi_ldap_bind - Error: could not send
bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91
(Can't connect to the LDAP server) -5961 (TCP connection reset by
peer.) 115 (Operation now in progress)
[10/Nov/2012:20:18:21 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Replication bind with SIMPLE auth failed: LDAP error 91
(Can't connect to the LDAP server) ((null))
[10/Nov/2012:20:23:21 -0300] slapi_ldap_bind - Error: could not send
bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91
(Can't connect to the LDAP server) -5961 (TCP connection reset by
peer.) 115 (Operation now in progress)
[10/Nov/2012:20:28:21 -0300] slapi_ldap_bind - Error: could not send
bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91
(Can't connect to the LDAP server) -5961 (TCP connection reset by
peer.) 115 (Operation now in progress)
[10/Nov/2012:20:33:20 -0300] slapi_ldap_bind - Error: could not send
bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91
(Can't connect to the LDAP server) -5961 (TCP connection reset by
peer.) 115 (Operation now in progress)
[10/Nov/2012:20:38:22 -0300] slapi_ldap_bind - Error: could not send
bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91
(Can't connect to the LDAP server) -5961 (TCP connection reset by
peer.) 115 (Operation now in progress)
[10/Nov/2012:20:43:22 -0300] slapi_ldap_bind - Error: could not send
bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91
(Can't connect to the LDAP server) -5961 (TCP connection reset by
peer.) 115 (Operation now in progress)
[10/Nov/2012:20:48:20 -0300] slapi_ldap_bind - Error: could not send
bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91
(Can't connect to the LDAP server) -5961 (TCP connection reset by
peer.) 115 (Operation now in progress)
[10/Nov/2012:21:00:15 -0300] slapi_ldap_bind - Error: timeout after
[0.0] seconds reading bind response for [cn=replication,cn=config]
mech [SIMPLE]
[10/Nov/2012:21:00:15 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Replication bind with SIMPLE auth failed: LDAP error 85
(Timed out) ((null))
[10/Nov/2012:21:03:24 -0300] slapi_ldap_bind - Error: could not send
bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91
(Can't connect to the LDAP server) -5961 (TCP connection reset by
peer.) 115 (Operation now in progress)
[10/Nov/2012:21:03:24 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Replication bind with SIMPLE auth failed: LDAP error 91
(Can't connect to the LDAP server) ((null))
[10/Nov/2012:21:06:36 -0300] slapi_ldap_bind - Error: could not send
bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91
(Can't connect to the LDAP server) -5961 (TCP connection reset by
peer.) 115 (Operation now in progress)
[10/Nov/2012:21:16:39 -0300] slapi_ldap_bind - Error: timeout after
[0.0] seconds reading bind response for [cn=replication,cn=config]
mech [SIMPLE]
[10/Nov/2012:21:16:39 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Replication bind with SIMPLE auth failed: LDAP error 85
(Timed out) ((null))
[10/Nov/2012:21:24:14 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B"
(B:389): Replication bind with SIMPLE auth resumed
Errors on B:
[10/Nov/2012:01:27:56 -0300] NSMMReplicationPlugin - agmt="cn=B-to-A"
(A:389): Replication bind with SIMPLE auth resumed
Errors on C:
No entry for the time
I also analyzed the access files and see the following:
From access on A:
I see replication sessions from B and C, established at 01:00:10 and
closed at 01:01:14.
Next replication session from C, established at 01:05:12 and closed at
01:06:12. There was NO session from B.
B and C establish the next replication session at 01:07:52. During the
session, a MOD is performed at 01:08:33. C closes this session at
01:09:35. The session from B stays open, and I see a ABANDOM request
from A at 01:17:53. Eventually, the session gets closed (no unbind) at
01:23:25.
[10/Nov/2012:01:17:53 -0300] conn=1439 op=3 ABANDON targetop=NOTFOUND
msgid=3
Replication sessions from C continue to work fine.
There is one more replication session from B, established at 01:27:56
and closed at 01:28:38. After this session, B does not establish any
replication session anymore.
From access on B:
I see replication sessions from A and C, established at 01:00:09 and
closed at 01:01:14.
Next replication session from C, established at 01:05:12 and closed at
01:06:12. There was NO session from A.
A and C establish the next replication session at 01:07:52. During the
session, a MOD is performed at 01:08:33. C closes this session at
01:09:35. The session from A stays open forever!!!
Replication sessions from C continue to work fine.
From access on C:
I see replication sessions from A and B, established at 01:00:09 and
closed at 01:01:14.
I don’t see any replication session from the 01:05 time-slot
A and B establish the next replication session at 01:07:53. During the
session, a MOD is performed at 01:08:33. A closes this session at
01:09:33 and B closes session at 01:09:38
From now on, I only see replication sessions from A!
Has anybody seen this kind of behavior? Any feedback is highly
appreciated.
This looks like
https://fedorahosted.org/389/ticket/374
What version of 389-ds-base?
Thanks,
-Reinhard
--
389 users mailing list
389-users(a)lists.fedoraproject.org <mailto:389-users@lists.fedoraproject.org>
https://admin.fedoraproject.org/mailman/listinfo/389-users
--
389 users mailing list
389-users(a)lists.fedoraproject.org <mailto:389-users@lists.fedoraproject.org>
https://admin.fedoraproject.org/mailman/listinfo/389-users