Re: [EXTERNAL] Re: Replication delay, connection blocking ending in closed - B1 - 389-users - Fedora Mailing-Lists

Friday, 5 March 2021

Hi Colin,

The important point in what you describe is that master2 keeps a
replication session open
(i.e. no ext 2.16.840.1.113730.3.5.5) without sending any updates.

So the problem is either on master2 or on the network.

My guess is that master2 has trouble determining the next change to send.
A possible scenario is that there is a huge changelog and it is walked
from the start
(as described in GitHub issue 4644 ...).

Regards,
    Pierre
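
To check how often a supplier holds a session open without sending anything, the consumer's access log can be scanned for connections that sat idle for a long time before closing. The following is only a minimal sketch, not a 389 DS tool: it assumes unwrapped access-log lines in the format quoted later in this thread, an English/C locale for the month abbreviation, and the function name and one-minute threshold are made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Timestamp, UTC offset, connection number and the remainder of an access-log line.
LINE_RE = re.compile(
    r"^\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})\.(\d+) ([+-]\d{4})\] conn=(\d+) (.*)$"
)
# Close code such as B1 at the end of a "closed" line.
CLOSE_RE = re.compile(r"closed - (\w+)")


def parse_time(stamp, frac, zone):
    # strptime's %f accepts at most six digits, so nanoseconds are truncated.
    micro = frac[:6].ljust(6, "0")
    return datetime.strptime(f"{stamp}.{micro} {zone}", "%d/%b/%Y:%H:%M:%S.%f %z")


def idle_before_close(lines, threshold=timedelta(minutes=1)):
    """Yield (conn, close_code, idle) for connections whose last logged
    activity was at least `threshold` before the connection closed."""
    last_seen = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that are not in the expected format
        stamp, frac, zone, conn, rest = m.groups()
        when = parse_time(stamp, frac, zone)
        closed = CLOSE_RE.search(rest)
        if closed:
            previous = last_seen.pop(conn, None)
            if previous is not None and when - previous >= threshold:
                yield conn, closed.group(1), when - previous
        else:
            last_seen[conn] = when
```

Feeding it the access log, for example `for conn, code, idle in idle_before_close(open(path)): print(conn, code, idle)` with `path` pointing at the consumer's access log, lists each connection that went quiet before being closed, which makes it easier to see whether the long pauses always involve the same supplier.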

On Thu, Mar 4, 2021 at 8:36 PM Colin Tulloch <Colin.Tulloch(a)entrust.com>
wrote:

...
 I tried it just now but it was way too verbose - we filled up 500 MB of
 error logs in 15 minutes.

 We have a lot more space, but it could take hours before we see another
 failure (these delays intermittently cause an application to fail).
 Unfortunately we can't reproduce it on demand.

 -----Original Message-----
 From: William Brown [mailto:wbrown@suse.de]
 Sent: Wednesday, March 03, 2021 8:38 PM
 To: 389-users(a)lists.fedoraproject.org
 Subject: [EXTERNAL] [389-users] Re: Replication delay, connection blocking
 ending in closed - B1

 Can you turn on replication logging? I think it's level 8192 in the
 errorlog-level.
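
 As a rough illustration of that suggestion, the sketch below builds the LDIF for such a change. It is written under assumptions the thread does not spell out: that the setting is the `nsslapd-errorlog-level` attribute on `cn=config`, and that the LDIF is then applied with a tool such as `ldapmodify`. The function name is hypothetical.

```python
REPLICATION_DEBUG = 8192  # replication logging level mentioned in the thread


def errorlog_level_ldif(level: int = REPLICATION_DEBUG) -> str:
    """Return LDIF that replaces the error-log level on cn=config.

    Assumption: the attribute is nsslapd-errorlog-level; the thread
    itself only says "errorlog-level".
    """
    return "\n".join([
        "dn: cn=config",
        "changetype: modify",
        "replace: nsslapd-errorlog-level",
        f"nsslapd-errorlog-level: {level}",
        "",
    ])


print(errorlog_level_ldif())
```

 Because this level is very verbose, it is worth checking free disk space first and restoring the previous value once a failure has been captured.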

 > On 4 Mar 2021, at 11:12, Colin Tulloch <Colin.Tulloch(a)entrust.com>
 wrote:
 >
 > Hello –
 >
 > We are seeing an issue where changes can be very slow to replicate to
 one of our consumers (up to 15m+).  We have a large topology, but in this
 case the issue is isolated between 2 masters that replicate to 1 consumer.
 >
 > In one example entry addition I found, it appears that we see:
 >
 > -          connection from master1 to consumer1, bunch of changes pushed
 > -          then master2 connects to consumer1
 > -          master1 stops doing changes but stays connected
 > -          master2 does literally 1 change (success, no issues), stays
 connected
 > -          16 minutes go by, no additional changes or replication EXT
 ops for that connection are done (master1->consumer1 EXT ops continue
 normally…)
 > -          after that long pause, master2 disconnects - B1 bad BER tag
 code
 > -          and then Master1 resumes making tons of changes
 >
 > Searches in this DB and others continue to take place, and changes in
 other DBs.  So it wasn’t as if the server was unresponsive/hung.  It is
 almost as if that DB went “read-only” for a time – I’m unable to tell if
 something else besides replication was attempting but unable to make writes
 though.
 >
 >
 > Anyone see something like this before?  We see lots of B1 codes
 randomly, I’ve never understood what may cause that – the description of
 corruption/physical network problems does not make much sense.  Maybe if
 our directories were replicating to each other over the internet, or using
 Wifi….
 >
 > The time it takes for that connection to end in the B1 doesn’t seem to
 > line up with any dirsrv OR system/TCP timeouts either
 >
 >
 > Nothing illuminating in the error logs really.
 >
 > Log snips of this happening;
 >
 > [03/Mar/2021:13:50:34.395939343 -0600] conn=270076 fd=280 slot=280
 > connection from master1 to consumer1 ...
 > [03/Mar/2021:13:50:34.608149165 -0600] conn=270076 op=13 MOD
 dn="cn=CRLblahblah,c=US"
 > [03/Mar/2021:13:50:34.609565373 -0600] conn=270076 op=13 RESULT err=0
 > tag=103 nentries=0 etime=0.001451229 csn=603febdd00035aae0000
 > [03/Mar/2021:13:50:34.818645762 -0600] conn=270076 op=14 EXT
 oid="2.16.840.1.113730.3.5.5" name="replication-multimaster-extop"
 > [03/Mar/2021:13:50:34.820544342 -0600] conn=270076 op=14 RESULT err=0
 > tag=120 nentries=0 etime=0.002004143
 >
 > [03/Mar/2021:13:50:34.460210520 -0600] conn=270077 fd=307 slot=307
 > connection from master2 to consumer1
 > [03/Mar/2021:13:50:34.460676562 -0600] conn=270077 op=0 BIND
 > dn="cn=replication manager,cn=config" method=128 version=3
 > [03/Mar/2021:13:50:34.460992487 -0600] conn=270077 op=0 RESULT err=0
 tag=97 nentries=0 etime=0.000376400 dn="cn=replication manager,cn=config"
 > <snipped replication startup jargon>
 > [03/Mar/2021:13:50:35.715331361 -0600] conn=270077 op=5 MOD
 dn="cn=CRLblahblah,c=US"
 > [03/Mar/2021:13:50:35.717330540 -0600] conn=270077 op=5 RESULT err=0
 > tag=103 nentries=0 etime=0.002066162 csn=603febdd00045aae0000 ...
 > [03/Mar/2021:13:50:36.828647054 -0600] conn=270076 op=15 EXT
 oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
 > [03/Mar/2021:13:50:36.828888493 -0600] conn=270076 op=15 RESULT err=0
 > tag=120 nentries=0 etime=0.000368293
 > [03/Mar/2021:13:50:37.112334122 -0600] conn=270076 op=16 EXT
 oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
 > [03/Mar/2021:13:50:37.112782309 -0600] conn=270076 op=16 RESULT err=0
 > tag=120 nentries=0 etime=0.000624719
 > [03/Mar/2021:13:50:38.113792312 -0600] conn=270076 op=17 EXT
 oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
 > [03/Mar/2021:13:50:38.114092751 -0600] conn=270076 op=17 RESULT err=0
 > tag=120 nentries=0 etime=0.000476209 … continued EXT ops on
 > conn=270076, from master1 …
 > [03/Mar/2021:14:06:07.872623403 -0600] conn=270077 op=-1 fd=307 closed
 > - B1
 > [03/Mar/2021:14:06:07.916640931 -0600] conn=270076 op=1106 EXT
 oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
 > [03/Mar/2021:14:06:07.917191372 -0600] conn=270076 op=1106 RESULT
 > err=0 tag=120 nentries=0 etime=0.000662946
 > [03/Mar/2021:14:06:07.921837193 -0600] conn=270076 op=1107 MOD
 dn="cn=CRLblahblah,c=US"
 > [03/Mar/2021:14:06:07.923754219 -0600] conn=270076 op=1107 RESULT
 > err=0 tag=103 nentries=0 etime=0.002036469 csn=603febdd00065aae0000
 > and a flood of changes now ...
 >
 >
 > Colin Tulloch
 > Architect, USmPKI
 > colin.tulloch(a)entrust.com
 >
 > _______________________________________________
 > 389-users mailing list -- 389-users(a)lists.fedoraproject.org
 > To unsubscribe send an email to 389-users-leave(a)lists.fedoraproject.org
 > Fedora Code of Conduct:
 > https://docs.fedoraproject.org/en-US/project/code-of-conduct/
 > List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
 > List Archives:
 > https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org
 > Do not reply to spam on the list, report it:
 > https://pagure.io/fedora-infrastructure

 —
 Sincerely,

 William Brown

 Senior Software Engineer, 389 Directory Server
 SUSE Labs, Australia

--

389 Directory Server Development Team
