Hi Colin,
The important point in what you describe is that master2 keeps a replication session open (i.e: no ext 2.16.840.1.113730.3.5.5) without sending any updates
So the problem is either on master2 or on network
my guess is that master2 has trouble to determine the next change to send. A possible scenario is that there is a huge changelog and it is walked from the start (as described in github issue 4644 ...)
Regards, Pierre
On Thu, Mar 4, 2021 at 8:36 PM Colin Tulloch Colin.Tulloch@entrust.com wrote:
I tried it just now but it was way too verbose - we filled up 500mb of error logs in 15 minutes.
We have a lot more space, but it could take hours before we see another failure (these delays intermittently cause an application to fail). Unfortunately we can't re-produce on demand.
-----Original Message----- From: William Brown [mailto:wbrown@suse.de] Sent: Wednesday, March 03, 2021 8:38 PM To: 389-users@lists.fedoraproject.org Subject: [EXTERNAL] [389-users] Re: Replication delay, connection blocking ending in closed - B1
WARNING: This email originated outside of Entrust. DO NOT CLICK links or attachments unless you trust the sender and know the content is safe.
Can you turn on replication logging? I think it's level 8192 in the errorlog-level.
On 4 Mar 2021, at 11:12, Colin Tulloch Colin.Tulloch@entrust.com
wrote:
Hello –
We are seeing an issue where changes can be very slow to replicate to
one of our consumers (up to 15m+). We have a large topology, but in this case the issue is isolated between 2 masters that replicate to 1 consumer.
In one example entry addition I found, it appears that we see;
connection from master1 to consumer1, bunch of changes pushed
then master2 connects to consumer1
master1 stops doing changes but stays connected
master2 does literally 1 change (success, no issues), stays
connected
16 minutes goes by, no additional changes or replication EXT
ops for that connection are done (master1->consumer1 EXT ops continue normally…)
after that long pause, master2 disconnects - B1 bad BER tag
code
and then Master1 resumes making tons of changes
Searches in this DB and others continue to take place, and changes in
other DBs. So it wasn’t as if the server was unresponsive/hung. It is almost as if that DB went “read-only” for a time – I’m unable to tell if something else besides replication was attempting but unable to make writes though.
Anyone see something like this before? We see lots of B1 codes
randomly, I’ve never understood what may cause that – the description of corruption/physical network problems does not make much sense. Maybe if our directories were replicating to eachother over the internet, or using Wifi….
The time it takes for that connection to end in the B1 doesn’t seem to line up with any dirsrv OR system/TCP timeouts either
Nothing illuminating in the error logs really.
Log snips of this happening;
[03/Mar/2021:13:50:34.395939343 -0600] conn=270076 fd=280 slot=280 connection from master1 to consumer1 ... [03/Mar/2021:13:50:34.608149165 -0600] conn=270076 op=13 MOD
dn="cn=CRLblahblah,c=US"
[03/Mar/2021:13:50:34.609565373 -0600] conn=270076 op=13 RESULT err=0 tag=103 nentries=0 etime=0.001451229 csn=603febdd00035aae0000 [03/Mar/2021:13:50:34.818645762 -0600] conn=270076 op=14 EXT
oid="2.16.840.1.113730.3.5.5" name="replication-multimaster-extop"
[03/Mar/2021:13:50:34.820544342 -0600] conn=270076 op=14 RESULT err=0 tag=120 nentries=0 etime=0.002004143
[03/Mar/2021:13:50:34.460210520 -0600] conn=270077 fd=307 slot=307 connection from master2 to consumer1 [03/Mar/2021:13:50:34.460676562 -0600] conn=270077 op=0 BIND dn="cn=replication manager,cn=config" method=128 version=3 [03/Mar/2021:13:50:34.460992487 -0600] conn=270077 op=0 RESULT err=0
tag=97 nentries=0 etime=0.000376400 dn="cn=replication manager,cn=config"
<snipped replication startup jargon> [03/Mar/2021:13:50:35.715331361 -0600] conn=270077 op=5 MOD
dn="cn=CRLblahblah,c=US"
[03/Mar/2021:13:50:35.717330540 -0600] conn=270077 op=5 RESULT err=0 tag=103 nentries=0 etime=0.002066162 csn=603febdd00045aae0000 ... [03/Mar/2021:13:50:36.828647054 -0600] conn=270076 op=15 EXT
oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
[03/Mar/2021:13:50:36.828888493 -0600] conn=270076 op=15 RESULT err=0 tag=120 nentries=0 etime=0.000368293 [03/Mar/2021:13:50:37.112334122 -0600] conn=270076 op=16 EXT
oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
[03/Mar/2021:13:50:37.112782309 -0600] conn=270076 op=16 RESULT err=0 tag=120 nentries=0 etime=0.000624719 [03/Mar/2021:13:50:38.113792312 -0600] conn=270076 op=17 EXT
oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
[03/Mar/2021:13:50:38.114092751 -0600] conn=270076 op=17 RESULT err=0 tag=120 nentries=0 etime=0.000476209 … continued EXT ops on conn=270076, from master1 … [03/Mar/2021:14:06:07.872623403 -0600] conn=270077 op=-1 fd=307 closed
- B1
[03/Mar/2021:14:06:07.916640931 -0600] conn=270076 op=1106 EXT
oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
[03/Mar/2021:14:06:07.917191372 -0600] conn=270076 op=1106 RESULT err=0 tag=120 nentries=0 etime=0.000662946 [03/Mar/2021:14:06:07.921837193 -0600] conn=270076 op=1107 MOD
dn="cn=CRLblahblah,c=US"
[03/Mar/2021:14:06:07.923754219 -0600] conn=270076 op=1107 RESULT err=0 tag=103 nentries=0 etime=0.002036469 csn=603febdd00065aae0000 and a flood of changes now ...
Colin Tulloch Architect, USmPKI colin.tulloch@entrust.com
389-users mailing list -- 389-users@lists.fedoraproject.org To unsubscribe send an email to 389-users-leave@lists.fedoraproject.org Fedora Code of Conduct: https://urldefense.com/v3/__https://docs.fedoraproject.org/en-US/proje ct/code-of-conduct/__;!!FJ-Y8qCqXTj2!Ps6Dn3c8qrA0DRCX7rW2IFhPkZirblU2W u8kUbk3We1GkGmYpPtcGYtadjVcbLfQh5Y$ List Guidelines: https://urldefense.com/v3/__https://fedoraproject.org/wiki/Mailing_lis t_guidelines__;!!FJ-Y8qCqXTj2!Ps6Dn3c8qrA0DRCX7rW2IFhPkZirblU2Wu8kUbk3 We1GkGmYpPtcGYtadjVcTvoSprg$ List Archives: https://urldefense.com/v3/__https://lists.fedoraproject.org/archives/l ist/389-users@lists.fedoraproject.org__;!!FJ-Y8qCqXTj2!Ps6Dn3c8qrA0DRC X7rW2IFhPkZirblU2Wu8kUbk3We1GkGmYpPtcGYtadjVcbxMY_7c$ Do not reply to spam on the list, report it: https://urldefense.com/v3/__https://pagure.io/fedora-infrastructure__; !!FJ-Y8qCqXTj2!Ps6Dn3c8qrA0DRCX7rW2IFhPkZirblU2Wu8kUbk3We1GkGmYpPtcGYt adjVcGTRcLlY$
— Sincerely,
William Brown
Senior Software Engineer, 389 Directory Server SUSE Labs, Australia _______________________________________________ 389-users mailing list -- 389-users@lists.fedoraproject.org To unsubscribe send an email to 389-users-leave@lists.fedoraproject.org Fedora Code of Conduct: https://urldefense.com/v3/__https://docs.fedoraproject.org/en-US/project/cod... List Guidelines: https://urldefense.com/v3/__https://fedoraproject.org/wiki/Mailing_list_guid... List Archives: https://urldefense.com/v3/__https://lists.fedoraproject.org/archives/list/38... Do not reply to spam on the list, report it: https://urldefense.com/v3/__https://pagure.io/fedora-infrastructure__;!!FJ-Y... _______________________________________________ 389-users mailing list -- 389-users@lists.fedoraproject.org To unsubscribe send an email to 389-users-leave@lists.fedoraproject.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.... Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure
389-users@lists.fedoraproject.org