On 11/16/2012 08:43 AM, Reinhard Nappert wrote:

I did not check all connections, but all of those I checked come in from Server A. I could not check all of them, because I do not have all of the access files. As I mentioned before, I have quite some pcaps and I go through them. My previous statement about quite some number of TCP Resets was incorrect. Those belonged to other TCP streams. The observation I made is that I only see TCP Retransmissions for LDAP streams between A and B. To be more precise, only for replication sessions. The replication did work for a period of about 48 hours. During this period, I saw only one TCP Retransmission for a replication session. So far, I just went through the pcaps from Server B.

 

However, this suggests to me that the MMR issue is linked to the TCP Retransmissions. I am almost at the point that it may be a hardware failure (NIC) either on A or B. But then, I don’t get why C is not implicated…..


I don't know.  389 only knows about what is going on at the TCP layer via the networking stack API and NSPR.

 

From: Rich Megginson [mailto:rmeggins@redhat.com]
Sent: Friday, November 16, 2012 10:20 AM
To: Reinhard Nappert
Cc: General discussion list for the 389 Directory server project.
Subject: Re: [389-users] MMR issue ...

 

On 11/16/2012 08:11 AM, Reinhard Nappert wrote:

This is correct. This also means that this connection is established and active. What I don’t understand is why this replication session stays active? Any idea….


Because the replication supplier is keeping the connection open?  From which machine are all these open connections from?  What are those servers doing?


 

From: Rich Megginson [mailto:rmeggins@redhat.com]
Sent: Thursday, November 15, 2012 2:06 PM
To: Reinhard Nappert
Cc: General discussion list for the 389 Directory server project.
Subject: Re: [389-users] MMR issue ...

 

On 11/15/2012 12:01 PM, Reinhard Nappert wrote:

The system is cleaned up right now.  The next time, when it happens, I hope I get this information.

 

I am surprised that one session has more than 20.000 ops. There are not many updates on the system…..


These are not just update operations, but all of the other search and replication extended operations.



 

From: Rich Megginson [mailto:rmeggins@redhat.com]
Sent: Thursday, November 15, 2012 1:36 PM
To: Reinhard Nappert
Cc: General discussion list for the 389 Directory server project.
Subject: Re: [389-users] MMR issue ...

 

On 11/15/2012 11:30 AM, Reinhard Nappert wrote:

Rich,

 

I also made the following observation:

 

I see quite some open connections of the replication directory:

connection: 84:20121114233917Z:5:5::cn=replication,cn=config

connection: 85:20121113224528Z:3:3::cn= replication,cn=config

connection: 88:20121112163205Z:3:3::cn= replication,cn=config

connection: 89:20121113220034Z:4:4::cn= replication,cn=config

connection: 90:20121115061343Z:209:209::cn= replication,cn=config

connection: 91:20121114234513Z:23374:23373:r:cn= replication,cn=config

connection: 92:20121114235823Z:9579:9579::cn= replication,cn=config

connection: 93:20121115120212Z:1077:1076:r:cn= replication,cn=config

 

I only see this on Server B. A and C are fine, meaning that the session all disappear.


If you do a netstat, what does it say are the states of these connections?




 

Let me know, if you have any other idea what might be going on.

 

Thanks,

-Reinhard

 

From: 389-users-bounces@lists.fedoraproject.org [mailto:389-users-bounces@lists.fedoraproject.org] On Behalf Of Reinhard Nappert
Sent: Tuesday, November 13, 2012 3:54 PM
To: Rich Megginson
Cc: General discussion list for the 389 Directory server project.
Subject: Re: [389-users] MMR issue ...

 

There are a lot of RST for the time frame, especially on Server A.

 

I also see a lot TCP Retransmission for this particular replication session. I also see a couple of TCP Dup ACKs. Is this enough to cause the issue?

 

Of course, I don’t have an idea, what is the cause of that!

 

From: Rich Megginson [mailto:rmeggins@redhat.com]
Sent: Tuesday, November 13, 2012 1:57 PM
To: Reinhard Nappert
Cc: General discussion list for the 389 Directory server project.
Subject: Re: [389-users] MMR issue ...

 

On 11/13/2012 11:53 AM, Reinhard Nappert wrote:

How would you proceed to figure out what is going on there?


Since it doesn't appear that the replication logs are giving enough information, and you don't see any disconnects or TCP resets happening in the packet capture, then I guess you have no choice but to familiarize yourself with the source code and use gdb.




 

You see that I ran out of ideas!

 

Thanks

 

From: Rich Megginson [mailto:rmeggins@redhat.com]
Sent: Tuesday, November 13, 2012 1:32 PM
To: Reinhard Nappert
Cc: General discussion list for the 389 Directory server project.
Subject: Re: [389-users] MMR issue ...

 

On 11/13/2012 11:21 AM, Reinhard Nappert wrote:

The 3 servers do not crash.

 

I am not sure about the network, though. My first assumption was that the firewall (between A and B) might cause the issue. The latest occurrence (the one, I described) had the firewall removed. I see quite some TCP Retransmissions in the packet captures. Could that be the issue?


That could be, although that would mean there are so many tcp retransmissions that take such a long time to process that it causes the application to think the network connection has timed out.





 

-Reinhard

 

From: Rich Megginson [mailto:rmeggins@redhat.com]
Sent: Tuesday, November 13, 2012 1:15 PM
To: General discussion list for the 389 Directory server project.
Cc: Reinhard Nappert
Subject: Re: [389-users] MMR issue ...

 

On 11/13/2012 11:02 AM, Reinhard Nappert wrote:

Rich,

 

Do you know what the cause of this issue is?


No, I don't know.






You would expect that you saw this issue in different deployments, but I only saw it in one instance.

 

If it turns out that the issue I see is identical the issue, you mentioned, I’d like to know, when it was fixed.


Upon further investigation, this does not appear to be the same as https://fedorahosted.org/389/ticket/374

I'm not sure what the problem is.  I've seen timeouts when servers crash or there are network issues.






 

Thanks,

-Reinhard

 

From: 389-users-bounces@lists.fedoraproject.org [mailto:389-users-bounces@lists.fedoraproject.org] On Behalf Of Reinhard Nappert
Sent: Tuesday, November 13, 2012 12:22 PM
To: Rich Megginson; General discussion list for the 389 Directory server project.
Subject: Re: [389-users] MMR issue ...

 

I use 1.2.8.2

 

From: Rich Megginson [mailto:rmeggins@redhat.com]
Sent: Tuesday, November 13, 2012 12:18 PM
To: General discussion list for the 389 Directory server project.
Cc: Reinhard Nappert
Subject: Re: [389-users] MMR issue ...

 

On 11/13/2012 09:24 AM, Reinhard Nappert wrote:

Hi,

 

I’ve encountered issues with a MMR setup, which looks like the following:

 

     A ------- B

       \           /

         \       /

           \   /

             C

 

The replication works for approximately 24 hours. There are not many changes to the content anyway. After about 1 day, the attribute  value of the type “nsds5replicaLastUpdateStatus”  changes to “1 Can't acquire busy replica “ of the replication agreement object from type “nsDS5ReplicationAgreement”.  I see this message on C for the agreement “C-to-B”.  The start-time of the last update is 01:08:33.  When I check the status on B, it looks fine for “B-to-C” and “B-to-A”, however, the start-time of the last update is stuck at 01:08:36 for “B-to-C”, whereas A gets updated afterwards as well. I don’t have the values for A!

 

When, I check errors and access on the boxes, I see the following:

 

Errors on A:

[10/Nov/2012:01:19:31 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Warning: unable to receive endReplication extended operation response (Timed out)

[10/Nov/2012:01:25:01 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.

[10/Nov/2012:01:25:05 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth resumed

[10/Nov/2012:02:26:29 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.

[10/Nov/2012:02:31:55 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.

[10/Nov/2012:02:31:59 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth resumed

[10/Nov/2012:02:43:36 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.

[10/Nov/2012:03:03:00 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.

[10/Nov/2012:03:08:24 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.

[10/Nov/2012:03:11:35 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:03:11:35 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))

[10/Nov/2012:03:14:45 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:03:14:52 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth resumed

[10/Nov/2012:03:33:29 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:03:33:29 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))

[10/Nov/2012:03:43:29 -0300] slapi_ldap_bind - Error: timeout after [0.0] seconds reading bind response for [cn=replication,cn=config] mech [SIMPLE]

[10/Nov/2012:03:43:29 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth failed: LDAP error 85 (Timed out) ((null))

[10/Nov/2012:03:46:39 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:03:46:39 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))

[10/Nov/2012:03:46:42 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth resumed

[10/Nov/2012:05:12:02 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.

[10/Nov/2012:06:16:01 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.

[10/Nov/2012:06:21:27 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.

[10/Nov/2012:06:21:46 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth resumed

[10/Nov/2012:06:39:23 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.

[10/Nov/2012:07:43:45 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.

[10/Nov/2012:19:38:22 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Timed out). Will retry later.

[10/Nov/2012:19:43:51 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.

[10/Nov/2012:19:43:54 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth resumed

[10/Nov/2012:20:18:21 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:20:18:21 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))

[10/Nov/2012:20:23:21 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:20:28:21 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:20:33:20 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:20:38:22 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:20:43:22 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:20:48:20 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:21:00:15 -0300] slapi_ldap_bind - Error: timeout after [0.0] seconds reading bind response for [cn=replication,cn=config] mech [SIMPLE]

[10/Nov/2012:21:00:15 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth failed: LDAP error 85 (Timed out) ((null))

[10/Nov/2012:21:03:24 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:21:03:24 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth failed: LDAP error 91 (Can't connect to the LDAP server) ((null))

[10/Nov/2012:21:06:36 -0300] slapi_ldap_bind - Error: could not send bind request for id [cn=replication,cn=config] mech [SIMPLE]: error 91 (Can't connect to the LDAP server) -5961 (TCP connection reset by peer.) 115 (Operation now in progress)

[10/Nov/2012:21:16:39 -0300] slapi_ldap_bind - Error: timeout after [0.0] seconds reading bind response for [cn=replication,cn=config] mech [SIMPLE]

[10/Nov/2012:21:16:39 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth failed: LDAP error 85 (Timed out) ((null))

[10/Nov/2012:21:24:14 -0300] NSMMReplicationPlugin - agmt="cn=A-to-B" (B:389): Replication bind with SIMPLE auth resumed

 

Errors on B:

[10/Nov/2012:01:27:56 -0300] NSMMReplicationPlugin - agmt="cn=B-to-A" (A:389): Replication bind with SIMPLE auth resumed

 

Errors on C:

No entry for the time

 

I also analyzed the access files and see the following:

From access on A:

I see replication sessions from B and C, established at 01:00:10 and closed at 01:01:14.

Next replication session from C, established at 01:05:12 and closed at 01:06:12. There was NO session from B.

B and C establish the next replication session at 01:07:52. During the session, a MOD is performed at 01:08:33. C closes this session at 01:09:35. The session from B stays open, and I see a ABANDOM request from A at 01:17:53. Eventually, the session gets closed (no unbind) at 01:23:25.

[10/Nov/2012:01:17:53 -0300] conn=1439 op=3 ABANDON targetop=NOTFOUND msgid=3

Replication sessions from C continue to work fine.

There is one more replication session from B, established at 01:27:56 and closed at 01:28:38. After this session, B does not establish any replication session anymore.

 

From access on B:

I see replication sessions from A and C, established at 01:00:09 and closed at 01:01:14.

Next replication session from C, established at 01:05:12 and closed at 01:06:12. There was NO session from A.

A and C establish the next replication session at 01:07:52. During the session, a MOD is performed at 01:08:33. C closes this session at 01:09:35. The session from A stays open forever!!!

Replication sessions from C continue to work fine.

 

From access on C:

I see replication sessions from A and B, established at 01:00:09 and closed at 01:01:14.

I don’t see any replication session from the 01:05 time-slot

A and B establish the next replication session at 01:07:53. During the session, a MOD is performed at 01:08:33. A closes this session at 01:09:33 and B closes session at 01:09:38

From now on, I only see replication sessions from A!

 

Has anybody seen this kind of behavior? Any feedback is highly appreciated.

This looks like https://fedorahosted.org/389/ticket/374

What version of 389-ds-base?

 

Thanks,

-Reinhard








--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users

 








--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users