I’m facing issues with replication in the following scenario:
Three Linux (Rocky) nodes running 389 Directory Server version 2.4.5 B2024.198.0000. Replication is configured in a ring topology: node01 -> node02 -> node03 -> node01. Password changes are made through the PWM-Project web interface.
Problem: At some point, the synchronization between nodes is lost. When I attempt to restart replication, the node being updated crashes the database. For example, when initializing replication from node01 to node02, the following error occurs:

---------
[09/Dec/2024:11:32:30.382466035 -0300] - DEBUG - bdb_ldbm_back_wire_import - bdb_bulk_import_queue returned 0 with entry uid=app.tzv.w,OU=APLICACOES,dc=colorado,dc=local
[09/Dec/2024:11:32:30.387198997 -0300] - DEBUG - bdb_ldbm_back_wire_import - bdb_bulk_import_queue returned 0 with entry uid=app.poc.w,OU=APLICACOES,dc=colorado,dc=local
[09/Dec/2024:11:32:30.390378254 -0300] - ERR - factory_destructor - ERROR bulk import abandoned
[09/Dec/2024:11:32:30.557600717 -0300] - ERR - bdb_import_run_pass - import userroot: Thread monitoring returned: -23
[09/Dec/2024:11:32:30.559453847 -0300] - ERR - bdb_public_bdb_import_main - import userroot: Aborting all Import threads...
[09/Dec/2024:11:32:36.468531612 -0300] - ERR - bdb_public_bdb_import_main - import userroot: Import threads aborted.
[09/Dec/2024:11:32:36.470641812 -0300] - INFO - bdb_public_bdb_import_main - import userroot: Closing files...
[09/Dec/2024:11:32:36.553007637 -0300] - ERR - bdb_public_bdb_import_main - import userroot: Import failed.
[09/Dec/2024:11:32:36.574692177 -0300] - DEBUG - NSMMReplicationPlugin - consumer_connection_extension_destructor - Aborting total update in progress for replicated area dc=colorado,dc=local connid=7019159
[09/Dec/2024:11:32:36.577255941 -0300] - ERR - process_bulk_import_op - NULL target sdn
[09/Dec/2024:11:32:36.579573401 -0300] - DEBUG - NSMMReplicationPlugin - replica_relinquish_exclusive_access - conn=7019159 op=-1 repl="dc=colorado,dc=local": Released replica held by locking_purl=conn=7019159 id=3
[09/Dec/2024:11:32:36.600514849 -0300] - ERR - pw_get_admin_users - Search failed for cn=GRP_SRV_PREHASHED_PASSWORD,ou=389,OU=GRUPOS,ou=colorado,dc=colorado,dc=local: error 10 - Password Policy Administrators can not be set
[09/Dec/2024:11:32:36.757883417 -0300] - DEBUG - NSMMReplicationPlugin - decode_startrepl_extop - decoding payload...
[09/Dec/2024:11:32:36.760105387 -0300] - DEBUG - NSMMReplicationPlugin - decode_startrepl_extop - decoded protocol_oid: 2.16.840.1.113730.3.6.1
[09/Dec/2024:11:32:36.762467539 -0300] - DEBUG - NSMMReplicationPlugin - decode_startrepl_extop - decoded repl_root: dc=colorado,dc=local
[09/Dec/2024:11:32:36.765113155 -0300] - DEBUG - NSMMReplicationPlugin - decode_startrepl_extop - decoded csn: 6756ff84000001910000
[09/Dec/2024:11:32:36.767727935 -0300] - DEBUG - NSMMReplicationPlugin - decode_startrepl_extop: RUV:
[09/Dec/2024:11:32:36.769205061 -0300] - DEBUG - NSMMReplicationPlugin - decode_startrepl_extop: {replicageneration} 6748f91f000001910000
[09/Dec/2024:11:32:36.770721824 -0300] - DEBUG - NSMMReplicationPlugin - decode_startrepl_extop: {replica 401 ldap://node01.ldap.colorado.br:389} 6748f921000001910000 6756ff6f000101910000 00000000
[09/Dec/2024:11:32:36.772753378 -0300] - DEBUG - NSMMReplicationPlugin - decode_startrepl_extop: {replica 403 ldap://node03-ldap:389} 6748f9db000101930000 6756ff79000001930000 00000000
[09/Dec/2024:11:32:36.774289526 -0300] - DEBUG - NSMMReplicationPlugin - decode_startrepl_extop: {replica 402 ldap://node02.ldap.colorado.br:389} 6748f996000101920000 6756ff34000001920000 00000000
[09/Dec/2024:11:32:36.775750926 -0300] - DEBUG - NSMMReplicationPlugin - decode_startrepl_extop - Finshed decoding payload.
[09/Dec/2024:11:32:36.777404849 -0300] - DEBUG - NSMMReplicationPlugin - consumer_connection_extension_acquire_exclusive_access - conn=7019230 op=4 Acquired consumer connection extension
[09/Dec/2024:11:32:36.779856975 -0300] - DEBUG - NSMMReplicationPlugin - multisupplier_extop_StartNSDS50ReplicationRequest - conn=7019230 op=4 repl="dc=colorado,dc=local": Begin incremental protocol
[09/Dec/2024:11:32:36.781999075 -0300] - DEBUG - _csngen_adjust_local_time - gen state before 6756ff7b0001:1733754747:0:0
[09/Dec/2024:11:32:36.784626039 -0300] - DEBUG - _csngen_adjust_local_time - gen state after 6756ff840000:1733754756:0:0
[09/Dec/2024:11:32:36.786708353 -0300] - DEBUG - csngen_adjust_time - gen state before 6756ff840000:1733754756:0:0
[09/Dec/2024:11:32:36.788232997 -0300] - DEBUG - csngen_adjust_time - gen state after 6756ff840001:1733754756:0:0
[09/Dec/2024:11:32:36.790217310 -0300] - DEBUG - NSMMReplicationPlugin - replica_get_exclusive_access - conn=7019230 op=4 repl="dc=colorado,dc=local": Acquired replica
----------

To restore synchronization, I need to delete all replication configurations and recreate them. However, the issue reappears after some time.
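For reference, the online total update from node01 to node02 is started and monitored roughly like this (a sketch; the agreement name and the Directory Manager password variable are placeholders for my actual values):

    # Trigger a total re-initialization of node02 from node01
    # (agreement name "agmt-node01-to-node02" is a placeholder).
    dsconf -D "cn=Directory Manager" -w "$DM_PASSWORD" ldap://node01.ldap.colorado.br \
        repl-agmt init --suffix "dc=colorado,dc=local" "agmt-node01-to-node02"

    # Poll the initialization status from the supplier side.
    dsconf -D "cn=Directory Manager" -w "$DM_PASSWORD" ldap://node01.ldap.colorado.br \
        repl-agmt init-status --suffix "dc=colorado,dc=local" "agmt-node01-to-node02"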
I’d appreciate any suggestions on how to identify and resolve this issue permanently.
Thanks.
Hi,
You are describing two different symptoms:
- replication gets desynchronized
- replication initialization fails
We do not have much data about the first one; it could have many causes. About the second one:
The "bdb_bulk_import_queue returned 0 with entry ..." messages are normal. The first error is:

[09/Dec/2024:11:32:30.390378254 -0300] - ERR - factory_destructor - ERROR bulk import abandoned
This happens on the server being initialized when the connection handling the replication initialization is unexpectedly closed. That is usually because:
[1] there was a problem on the supplier from which the replication gets initialized (are there errors in that supplier's error log, or has it crashed?);
[2] the network is unreliable (which could also explain why replication gets desynchronized).
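A quick way to check [1] is to look at the supplier's error log around the time of the failed initialization and at the agreement status as seen from the supplier. A sketch, assuming the instance is named "colorado" and using a placeholder agreement name:

    # On node01 (the supplier): errors logged around the failed init
    # (instance name "colorado" and log path are assumptions; adjust to your deployment).
    grep -E ' - (ERR|CRIT|EMERG) - ' /var/log/dirsrv/slapd-colorado/errors | tail -n 50

    # Agreement status as seen from the supplier.
    dsconf -D "cn=Directory Manager" -w "$DM_PASSWORD" ldap://node01.ldap.colorado.br \
        repl-agmt status --suffix "dc=colorado,dc=local" "agmt-node01-to-node02"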
regards Pierre
Hi, Pierre,
I appreciate your on-point assessment of this issue.
I reviewed the logs on node01, which was sending data for replication, and noticed that at the exact moment the error was recorded on node02, node01 logged an LDAP error 51 ("Server is busy").
Given that node01 and node02 are virtual machines on the same network segment, I do not believe this issue is related to the network interface.
This leads me to focus on the minimum resource requirements or internal parameters of the 389 Directory Server as potential causes.
Currently, both nodes are configured with 4 vCPUs and 4 GB of RAM each.
I suspect that some internal parameter in the 389 DS configuration might be contributing to this issue.
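For reference, this is roughly how I would dump the backend tuning currently in effect to compare it with the documented recommendations (a sketch; on 2.x builds the bdb attributes may sit under cn=bdb,cn=config,..., so the subtree search covers both layouts):

    # Global ldbm/bdb tuning on the consumer being initialized.
    ldapsearch -H ldap://node02.ldap.colorado.br -D "cn=Directory Manager" -W \
        -b "cn=config,cn=ldbm database,cn=plugins,cn=config" -s sub \
        nsslapd-dbcachesize nsslapd-db-locks nsslapd-cache-autosize

    # Per-backend entry and DN cache sizes for userroot.
    ldapsearch -H ldap://node02.ldap.colorado.br -D "cn=Directory Manager" -W \
        -b "cn=userroot,cn=ldbm database,cn=plugins,cn=config" -s base \
        nsslapd-cachememsize nsslapd-dncachememsize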
Looking forward to your insights.
So node02 returns an LDAP_BUSY error, which is relatively rare (it usually happens just after a "Retry count exceeded" error is logged), because transactions get aborted too often due to DB lock conflicts. You can try to tweak nsslapd-db-deadlock-policy, but it is a bit puzzling: during a bulk import (i.e. importing entries from another supplier), only the import is active on the backend, so there should not be DB lock conflicts during an import.

And I do not understand why deleting and recreating the agreements could solve such issues, especially since agreements toward and from the target replica are disabled while the initialization is in progress. Whether they exist or not should not change anything... unless there is a bug somewhere and a DB lock is leaking. But that does not ring any bells in my mind...
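For reference, a sketch of how nsslapd-db-deadlock-policy could be inspected and adjusted. The attribute maps to libdb's DB_LOCK_* deadlock-resolution policies; on 2.x builds it may live under cn=bdb,cn=config,cn=ldbm database,cn=plugins,cn=config, so verify the DN with the search first. The value below is only an illustration, not a recommendation:

    # Read the current deadlock policy (subtree search covers both DN layouts).
    ldapsearch -H ldap://node02.ldap.colorado.br -D "cn=Directory Manager" -W \
        -b "cn=config,cn=ldbm database,cn=plugins,cn=config" -s sub \
        nsslapd-db-deadlock-policy

    # deadlock-policy.ldif (DN assumes the 2.x layout with the cn=bdb sub-entry)
    dn: cn=bdb,cn=config,cn=ldbm database,cn=plugins,cn=config
    changetype: modify
    replace: nsslapd-db-deadlock-policy
    nsslapd-db-deadlock-policy: 6

    # Apply it (a restart may be needed for the change to take effect).
    ldapmodify -H ldap://node02.ldap.colorado.br -D "cn=Directory Manager" -W \
        -f deadlock-policy.ldif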
I used the *iperf* program to check for any issues in data traffic between the nodes, but no abnormalities were detected.
I also adjusted the Linux *limits* parameters according to the tuning and performance recommendations from the 389 documentation.
Additionally, I upgraded the *389 Directory Server* from version *2.4.5* to *2.5.2*.
However, despite these changes, the same errors persist when starting the replication process.
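A quick way to confirm that the tuned limits are actually applied to the running server (a sketch; it assumes a single ns-slapd instance per host):

    # File-descriptor limit in effect for the running ns-slapd process.
    grep -i 'open files' /proc/$(pidof ns-slapd)/limits

    # Server-side descriptor and connection settings, for comparison.
    ldapsearch -H ldap://node02.ldap.colorado.br -D "cn=Directory Manager" -W \
        -b "cn=config" -s base \
        nsslapd-maxdescriptors nsslapd-conntablesize nsslapd-threadnumber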