On 11/13/2012 11:53 AM, Reinhard Nappert wrote:

How would you proceed to figure out what is going on there?


Since it doesn't appear that the replication logs are giving enough information, and you don't see any disconnects or TCP resets happening in the packet capture, then I guess you have no choice but to familiarize yourself with the source code and use gdb.

 

You see that I ran out of ideas!

 

Thanks

 

From: Rich Megginson [mailto:rmeggins@redhat.com]
Sent: Tuesday, November 13, 2012 1:32 PM
To: Reinhard Nappert
Cc: General discussion list for the 389 Directory server project.
Subject: Re: [389-users] MMR issue ...

 

On 11/13/2012 11:21 AM, Reinhard Nappert wrote:

The 3 servers do not crash.

 

I am not sure about the network, though. My first assumption was that the firewall (between A and B) might cause the issue. The latest occurrence (the one, I described) had the firewall removed. I see quite some TCP Retransmissions in the packet captures. Could that be the issue?


That could be, although that would mean there are so many tcp retransmissions that take such a long time to process that it causes the application to think the network connection has timed out.


 

-Reinhard

 

From: Rich Megginson [mailto:rmeggins@redhat.com]
Sent: Tuesday, November 13, 2012 1:15 PM
To: General discussion list for the 389 Directory server project.
Cc: Reinhard Nappert
Subject: Re: [389-users] MMR issue ...

 

On 11/13/2012 11:02 AM, Reinhard Nappert wrote:

Rich,

 

Do you know what the cause of this issue is?


No, I don't know.



You would expect that you saw this issue in different deployments, but I only saw it in one instance.

 

If it turns out that the issue I see is identical the issue, you mentioned, I’d like to know, when it was fixed.


Upon further investigation, this does not appear to be the same as
https://fedorahosted.org/389/ticket/374

I'm not sure what the problem is.  I've seen timeouts when servers crash or there are network issues.



 

Thanks,

-Reinhard

 

From: 389-users-bounces@lists.fedoraproject.org [mailto:389-users-bounces@lists.fedoraproject.org] On Behalf Of Reinhard Nappert
Sent: Tuesday, November 13, 2012 12:22 PM
To: Rich Megginson; General discussion list for the 389 Directory server project.
Subject: Re: [389-users] MMR issue ...

 

I use 1.2.8.2

 

From: Rich Megginson [mailto:rmeggins@redhat.com]
Sent: Tuesday, November 13, 2012 12:18 PM
To: General discussion list for the 389 Directory server project.
Cc: Reinhard Nappert
Subject: Re: [389-users] MMR issue ...

 

On 11/13/2012 09:24 AM, Reinhard Nappert wrote:

Hi,

 

I’ve encountered issues with a MMR setup, which looks like the following:

 

     A ------- B

       \           /

         \       /

           \   /

             C

 

The replication works for approximately 24 hours. There are not many changes to the content anyway. After about 1 day, the attribute  value of the type “nsds5replicaLastUpdateStatus”  changes to “1 Can't acquire busy replica “ of the replication agreement object from type “nsDS5ReplicationAgreement”.  I see this message on C for the agreement “C-to-B”.  The start-time of the last update is 01:08:33.  When I check the status on B, it looks fine for “B-to-C” and “B-to-A”, however, the start-time of the last update is stuck at 01:08:36 for “B-to-C”, whereas A gets updated afterwards as well. I don’t have the values for A!

 

When, I check errors and access on the boxes, I see the following:

 

Errors on A:

[10/Nov/2012:01:19:31 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Warning: unable to receive endReplication extended operation response (Timed out)

[10/Nov/2012:01:25:01 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.

[10/Nov/2012:01:25:05 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth resumed

[10/Nov/2012:02:26:29 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.

[10/Nov/2012:02:31:55 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.

[10/Nov/2012:02:31:59 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth resumed

[10/Nov/2012:02:43:36 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.

[10/Nov/2012:03:03:00 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.

[10/Nov/2012:03:08:24 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.

[10/Nov/2012:03:11:35 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:03:11:35 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))

[10/Nov/2012:03:14:45 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:03:14:52 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth resumed

[10/Nov/2012:03:33:29 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:03:33:29 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))

[10/Nov/2012:03:43:29 -0300] slapi_ldap_bind - Error: timeout after [0.0] seconds reading bind response for [cn=replication,cn=config] mech [SIMPLE]

[10/Nov/2012:03:43:29 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth failed: LDAP error 85 (Timed out) ((null))

[10/Nov/2012:03:46:39 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:03:46:39 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))

[10/Nov/2012:03:46:42 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth resumed

[10/Nov/2012:05:12:02 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.

[10/Nov/2012:06:16:01 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.

[10/Nov/2012:06:21:27 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.

[10/Nov/2012:06:21:46 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth resumed

[10/Nov/2012:06:39:23 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.

[10/Nov/2012:07:43:45 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.

[10/Nov/2012:19:38:22 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.

[10/Nov/2012:19:43:51 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.

[10/Nov/2012:19:43:54 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth resumed

[10/Nov/2012:20:18:21 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:20:18:21 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))

[10/Nov/2012:20:23:21 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:20:28:21 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:20:33:20 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:20:38:22 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:20:43:22 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:20:48:20 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:21:00:15 -0300] slapi_ldap_bind - Error: timeout after [0.0] seconds reading bind response for [cn=replication,cn=config] mech [SIMPLE]

[10/Nov/2012:21:00:15 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth failed: LDAP error 85 (Timed out) ((null))

[10/Nov/2012:21:03:24 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:21:03:24 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))

[10/Nov/2012:21:06:36 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:21:16:39 -0300] slapi_ldap_bind - Error: timeout after [0.0] seconds reading bind response for [cn=replication,cn=config] mech [SIMPLE]

[10/Nov/2012:21:16:39 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth failed: LDAP error 85 (Timed out) ((null))

[10/Nov/2012:21:24:14 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth resumed

 

Errors on B:

[10/Nov/2012:01:27:56 -0300] NSMMReplicationPlugin - agmt="cn=B-to-A" (A:389): Replication bind with SIMPLE auth resumed

 

Errors on C:

No entry for the time

 

I also analyzed the access files and see the following:

From access on A:

I see replication sessions from B and C, established at 01:00:10 and closed at 01:01:14.

Next replication session from C, established at 01:05:12 and closed at 01:06:12. There was NO session from B.

B and C establish the next replication session at 01:07:52. During the session, a MOD is performed at 01:08:33. C closes this session at 01:09:35. The session from B stays open, and I see a ABANDOM request from A at 01:17:53. Eventually, the session gets closed (no unbind) at 01:23:25.

[10/Nov/2012:01:17:53 -0300] conn=1439 op=3 ABANDON targetop=NOTFOUND msgid=3

Replication sessions from C continue to work fine.

There is one more replication session from B, established at 01:27:56 and closed at 01:28:38. After this session, B does not establish any replication session anymore.

 

From access on B:

I see replication sessions from A and C, established at 01:00:09 and closed at 01:01:14.

Next replication session from C, established at 01:05:12 and closed at 01:06:12. There was NO session from A.

A and C establish the next replication session at 01:07:52. During the session, a MOD is performed at 01:08:33. C closes this session at 01:09:35. The session from A stays open forever!!!

Replication sessions from C continue to work fine.

 

From access on C:

I see replication sessions from A and B, established at 01:00:09 and closed at 01:01:14.

I don’t see any replication session from the 01:05 time-slot

A and B establish the next replication session at 01:07:53. During the session, a MOD is performed at 01:08:33. A closes this session at 01:09:33 and B closes session at 01:09:38

From now on, I only see replication sessions from A!

 

Has anybody seen this kind of behavior? Any feedback is highly appreciated.

This looks like https://fedorahosted.org/389/ticket/374

What version of 389-ds-base?

 

Thanks,

-Reinhard





--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users

 





--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users