Afternoon all
I’ve got a slightly strange one with one of our FreeIPA clusters, whereby the topology suffixes appear to have disappeared. From what I can see, this is causing replication issues between the hosts, which is causing us issues with bootstrapping new clients against FreeIPA.
I’m not aware of any config changes that have happened on the FreeIPA hosts that could have caused this issue, so am a bit stumped atm.
Is someone able to advise next steps on how to investigate the cause and correct the configuration?
Regards Gavin
On Wed, Apr 4, 2018 at 4:31 PM, Gavin Williams via FreeIPA-users freeipa-users@lists.fedorahosted.org wrote:
Afternoon all
I’ve got a slightly strange one with one of our FreeIPA clusters, whereby the topology suffixes appear to have disappeared.
How does this manifest? Are they not visible in the Web UI, the CLI, or both?
From what I can see, this is causing replication issues between the hosts, which is causing us issues with bootstrapping new clients against FreeIPA.
I’m not aware of any config changes that have happened on the FreeIPA hosts that could have caused this issue, so am a bit stumped atm.
Is someone able to advise next steps on how to investigate the cause and correct the configuration?
For anything regarding replication, a good start is to check directory server error and access logs on both sides.
https://www.freeipa.org/page/Files_to_be_attached_to_bug_report#Directory_se... https://www.freeipa.org/page/Troubleshooting#Directory_Server_issues
Next step could be to check for replication conflicts:
https://access.redhat.com/documentation/en-us/red_hat_directory_server/10/ht...
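As a rough sketch of the conflict check, the snippet below only builds an ldapsearch command line that looks for entries carrying the nsds5ReplConflict attribute. The helper name, the default URI and the bind DN are assumptions for illustration; adjust them to your deployment.

```python
def conflict_search_argv(base_dn, uri="ldap://localhost",
                         bind_dn="cn=Directory Manager"):
    """Build an ldapsearch argument list that finds replication conflict entries.

    Hypothetical helper: it only assembles arguments and does not contact
    any server. The uri and bind_dn defaults are assumptions.
    """
    return [
        "ldapsearch",
        "-x",                      # simple bind
        "-H", uri,                 # server to query
        "-D", bind_dn,             # bind as this DN
        "-W",                      # prompt for the password
        "-b", base_dn,             # search base, e.g. the IPA suffix
        "(nsds5ReplConflict=*)",   # entries marked as replication conflicts
        "dn", "nsds5ReplConflict", # attributes to return
    ]


if __name__ == "__main__":
    # Suffix taken from the log excerpt in this thread.
    print(" ".join(conflict_search_argv("dc=weareact,dc=net")))
```

The resulting list can be passed to subprocess.run() on the server, or the printed command can be copied into a shell (with the filter quoted).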
Regards Gavin
Petr
Yeh, I was unable to see the suffixes and replication agreements via the WebUI.
However searching using ldapsearch, they were still present. So I tracked the issue down to my named user account not having enough permissions. Logged in as ‘admin’ user and was able to see all the details.
So that just leaves the issue with the fact that replication broke in the first place. Looking back through the slapd error log, I came across this:
[28/Mar/2018:17:27:04.558588967 +0100] - ERR - NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn - retry (49) the transaction (csn=5abbc252002800040000) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[28/Mar/2018:17:27:04.575793325 +0100] - ERR - NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn - Failed to write entry with csn (5abbc252002800040000); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[28/Mar/2018:17:27:04.578285790 +0100] - ERR - NSMMReplicationPlugin - write_changelog_and_ruv - Can't add a change for ipaUniqueID=6dc4846c-27a3-11e8-a0a5-fa163e82604c,cn=sudorules,cn=sudo,dc=weareact,dc=net (uniqid: 64da9801-27a311e8-8bfb8904-640ff48c, optype: 8) to changelog csn 5abbc252002800040000
[28/Mar/2018:17:27:04.595240157 +0100] - ERR - NSMMReplicationPlugin - process_postop - Failed to apply update (5abbc252002800040000) error (1). Aborting replication session(conn=453585 op=18)
[28/Mar/2018:17:27:14.160079067 +0100] - ERR - NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn - retry (49) the transaction (csn=5abbc252002800040000) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[28/Mar/2018:17:27:14.161481168 +0100] - ERR - NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn - Failed to write entry with csn (5abbc252002800040000); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[28/Mar/2018:17:27:14.162533841 +0100] - ERR - NSMMReplicationPlugin - write_changelog_and_ruv - Can't add a change for ipaUniqueID=6dc4846c-27a3-11e8-a0a5-fa163e82604c,cn=sudorules,cn=sudo,dc=weareact,dc=net (uniqid: 64da9801-27a311e8-8bfb8904-640ff48c, optype: 8) to changelog csn 5abbc252002800040000
[28/Mar/2018:17:27:14.177194703 +0100] - ERR - NSMMReplicationPlugin - process_postop - Failed to apply update (5abbc252002800040000) error (1). Aborting replication session(conn=453594 op=6)
Any pointers on identifying possible cause?
Cheers Gav
On 5 Apr 2018, at 18:24, Petr Vobornik <pvoborni@redhat.com> wrote:
On 04/05/2018 11:28 PM, Gavin Williams via FreeIPA-users wrote:
Petr
Yeh, I was unable to see the suffixes and replication agreements via the WebUI.
However searching using ldapsearch, they were still present. So I tracked the issue down to my named user account not having enough permissions. Logged in as ‘admin’ user and was able to see all the details.
So that just leaves the issue with the fact that replication broke in the first place. Looking back through the slapd error log, I came across this:
The errors below do not indicate that replication is broken. A replication session failed and was retried: the errors are 10 seconds apart and refer to different replication connections, so replication was probably working in between and has probably resumed since.
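To check this kind of timing on your own logs, here is a minimal sketch that pulls the aborted replication sessions out of error-log lines and reports the connection numbers and the gap between them. The regular expressions are written against the excerpt in this thread only and are an assumption about the log layout, not a complete parser for the 389-ds error log.

```python
import re
from datetime import datetime

# Patterns derived from the log excerpt in this thread (assumed layout).
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] - ERR - NSMMReplicationPlugin - (?P<msg>.*)$"
)
ABORT_RE = re.compile(r"Aborting replication session\(conn=(\d+) op=(\d+)\)")


def parse_ts(raw):
    """Parse e.g. '28/Mar/2018:17:27:04.558588967 +0100'.

    strptime's %f accepts at most 6 fractional digits, so the nanosecond
    field is truncated to microseconds. %b assumes English month names.
    """
    stamp, zone = raw.split(" ")
    whole, frac = stamp.split(".")
    return datetime.strptime(f"{whole}.{frac[:6]} {zone}",
                             "%d/%b/%Y:%H:%M:%S.%f %z")


def aborted_sessions(lines):
    """Return a list of (timestamp, conn) for each aborted session."""
    found = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        abort = ABORT_RE.search(match.group("msg"))
        if abort:
            found.append((parse_ts(match.group("ts")), int(abort.group(1))))
    return found


if __name__ == "__main__":
    sample = [
        "[28/Mar/2018:17:27:04.595240157 +0100] - ERR - NSMMReplicationPlugin - "
        "process_postop - Failed to apply update (5abbc252002800040000) error (1). "
        "Aborting replication session(conn=453585 op=18)",
        "[28/Mar/2018:17:27:14.177194703 +0100] - ERR - NSMMReplicationPlugin - "
        "process_postop - Failed to apply update (5abbc252002800040000) error (1). "
        "Aborting replication session(conn=453594 op=6)",
    ]
    sessions = aborted_sessions(sample)
    for (t1, c1), (t2, c2) in zip(sessions, sessions[1:]):
        print(f"conn {c1} -> conn {c2}: {(t2 - t1).total_seconds():.1f}s apart")
```

Different conn values with a gap of several seconds between aborts suggest separate sessions, which fits the retry behaviour described here.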
The underlying problem is concurrent access to the changelog: incoming connections write to it while outgoing replication connections read it. The access is protected by locks at the db (BDB) level, and in some situations deadlocks can occur. The db layer has a mechanism to abort one of the threads and let the other continue. The aborted thread logs an error and has to be retried, either immediately or after returning an error to the client. That is what you are seeing.
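The retry behaviour can be pictured with a toy model. This is not the directory server's code: DeadlockAbort and write_with_retry are hypothetical names, and the retry limit of 50 is only a guess based on the "retry (49)" message in the log.

```python
class DeadlockAbort(Exception):
    """Stand-in for the db layer aborting a locker (DB_LOCK_DEADLOCK)."""


def write_with_retry(write_once, max_retries=50):
    """Call write_once() until it succeeds or the retries run out.

    Returns the number of attempts used. If every attempt is aborted,
    the last DeadlockAbort is re-raised so the caller can report an
    error and let the whole operation be retried later.
    """
    for attempt in range(1, max_retries + 1):
        try:
            write_once()
            return attempt
        except DeadlockAbort:
            if attempt == max_retries:
                raise


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_write():
        # Aborted twice by the (simulated) deadlock detector, then succeeds.
        state["calls"] += 1
        if state["calls"] < 3:
            raise DeadlockAbort()

    print("attempts needed:", write_with_retry(flaky_write))
```

In this model, an error like the ones in the log corresponds to the case where all immediate retries are used up and the failure is handed back to the caller, which then starts a new session.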
Which thread is aborted is determined by the configured deadlock policy; the default tends to abort the writing one. If these errors occur frequently and cause problems, it is worth changing this policy. In the entry:
dn: cn=config,cn=ldbm database,cn=plugins,cn=config
change the attribute nsslapd-db-deadlock-policy to:
nsslapd-db-deadlock-policy: 6
This aborts the thread holding the fewest write locks, which will be the outgoing replication connection rather than the writer, and should have less impact.
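For reference, the change above can be written as an LDIF modify record. The sketch below just renders that record as text; the function name is made up for illustration, and nothing here talks to the server.

```python
CONFIG_DN = "cn=config,cn=ldbm database,cn=plugins,cn=config"
ATTRIBUTE = "nsslapd-db-deadlock-policy"


def deadlock_policy_ldif(policy=6):
    """Render an LDIF modify record that sets the deadlock policy.

    Hypothetical helper: it only builds the text. The DN, attribute name
    and value 6 come from this thread.
    """
    return "\n".join([
        f"dn: {CONFIG_DN}",
        "changetype: modify",
        f"replace: {ATTRIBUTE}",
        f"{ATTRIBUTE}: {int(policy)}",
        "",  # trailing newline terminates the record
    ])


if __name__ == "__main__":
    print(deadlock_policy_ldif(), end="")
```

Saved to a file, the output could be applied with something like ldapmodify -x -D "cn=Directory Manager" -W -f deadlock-policy.ldif on each server. I believe the directory server needs a restart before the new policy takes effect, but check the documentation for your version.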
[28/Mar/2018:17:27:04.558588967 +0100] - ERR - NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn - retry (49) the transaction (csn=5abbc252002800040000) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[28/Mar/2018:17:27:04.575793325 +0100] - ERR - NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn - Failed to write entry with csn (5abbc252002800040000); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[28/Mar/2018:17:27:04.578285790 +0100] - ERR - NSMMReplicationPlugin - write_changelog_and_ruv - Can't add a change for ipaUniqueID=6dc4846c-27a3-11e8-a0a5-fa163e82604c,cn=sudorules,cn=sudo,dc=weareact,dc=net (uniqid: 64da9801-27a311e8-8bfb8904-640ff48c, optype: 8) to changelog csn 5abbc252002800040000
[28/Mar/2018:17:27:04.595240157 +0100] - ERR - NSMMReplicationPlugin - process_postop - Failed to apply update (5abbc252002800040000) error (1). Aborting replication session(conn=453585 op=18)
[28/Mar/2018:17:27:14.160079067 +0100] - ERR - NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn - retry (49) the transaction (csn=5abbc252002800040000) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[28/Mar/2018:17:27:14.161481168 +0100] - ERR - NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn - Failed to write entry with csn (5abbc252002800040000); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[28/Mar/2018:17:27:14.162533841 +0100] - ERR - NSMMReplicationPlugin - write_changelog_and_ruv - Can't add a change for ipaUniqueID=6dc4846c-27a3-11e8-a0a5-fa163e82604c,cn=sudorules,cn=sudo,dc=weareact,dc=net (uniqid: 64da9801-27a311e8-8bfb8904-640ff48c, optype: 8) to changelog csn 5abbc252002800040000
[28/Mar/2018:17:27:14.177194703 +0100] - ERR - NSMMReplicationPlugin - process_postop - Failed to apply update (5abbc252002800040000) error (1). Aborting replication session(conn=453594 op=6)
Any pointers on identifying possible cause?
FreeIPA-users mailing list -- freeipa-users@lists.fedorahosted.org To unsubscribe send an email to freeipa-users-leave@lists.fedorahosted.org
Ludwig et al,
Apologies for resurrecting this thread!
Which thread is aborted is determined by the configured deadlock policy; the default tends to abort the writing one. If these errors occur frequently and cause problems, it is worth changing this policy. In the entry:
dn: cn=config,cn=ldbm database,cn=plugins,cn=config
change the attribute nsslapd-db-deadlock-policy to:
nsslapd-db-deadlock-policy: 6
This aborts the thread holding the fewest write locks, which will be the outgoing replication connection rather than the writer, and should have less impact.
I've seen a few similar messages in our error logs, but after a period of time the replication resumed without conflict. I have not found any remaining issues in the directory server logs after following the history of this thread and the thread ([Freeipa-users] replica DS failure deadlock; 10/18/2016).
Would adjusting the nsslapd-threadnumber on both IPA servers potentially contribute to these messages? During my initial deployment, I adjusted the value of nsslapd-threadnumber per the history of the thread ([Freeipa-users] performance scaling of sssd / freeipa; 01/26/2017). Of course, this value may not be appropriate for our deployment and I can certainly scale it back.
Thank you for any information! John DeSantis
On Fri, 6 Apr 2018 at 03:37, Ludwig Krispenz via FreeIPA-users freeipa-users@lists.fedorahosted.org wrote:
-- Red Hat GmbH, http://www.de.redhat.com/, Registered seat: Grasbrunn, Commercial register: Amtsgericht Muenchen, HRB 153243, Managing Directors: Charles Cachera, Michael Cunningham, Michael O'Neill, Eric Shander