389 community,
We are currently testing an upgrade of our LDAP infrastructure centos 7.9 389-ds-1.3.11.1-5.el7_9 to Rhel 9.4 389-ds 2.4.5-9.el9_4. We are wanting to use the replication method to upgrade alongside production and not fully shutdown our center for the migration which would be impactful. The error I am reporting is for initialization of replicas by a rhel 9.x provider. Appears to be something related to rhel 9.x and 389-ds 2.x. Note: I have not seen any other issues besides replica initialization.
Our site has run 389-DS for many years w/o issue and never have seen this issue before. I have done numerous tests to try and isolate this failure. I tried a smaller database and it still fails. different providers and consumers have the same error for rhel 9.x Provider.
Rhel 9.4 389-DS 2.4.5-9 Provider fails to fully initialize Rhel 9.4 389-DS 2.4.5-9 Consumer database. A Rhel 8.10 provider 389-ds 2.4.5-1 can load our whole data base w/o error on the same consumer Rhel 9.4 389-DS 2.4.5-9 on port 636 LDAPS. I tried rhel 9.4 389-ds 2.4.5-1 (same version that worked on Rhel 8.10) and it fails to the same consumer. I tried changing ports and protocol on rhel 9.4 Provider - Port 636 LDAPS fails, 389 STARTTLS fails, 389 LDAP works on rhel9.4.
The issue is rhel 9.x provider initializing database on rhel 9.4 consumer (fails Sending 33552 entries with TLS). I have tried restricting Provider and consumer to TLS 1.2 then TLS 1.3 and both fail. I tried tuning nsds5ReplicaFlowControlWindow and Debug logs show it gets past the authentication piece and starts to load the consumer and then gets ERR – factory destructor - ERROR bulk import abandoned on consumer. The Provider shows Error "Failed to send extended operation: LDAP error 51 (Server is busy)".
I did try tuning nsds5ReplicaFlowControlWindow and nsds5ReplicaFlowControlPause with various values with little success. I also worry is I start slowing down the Provider to replica traffic it will impact performance. The error logs below are with nsslapd-errorlog-level=24576 and default values for the mentioned tunables.
LDAP Debug Provider Error Log: [13/Nov/2024:15:54:37.386325753 -0600] - DEBUG - NSMMReplicationPlugin - acquire_replica - agmt="cn=ldap-repb-test-ra" (ldap-repb-test:636): Replica was successfully acquired. [13/Nov/2024:15:54:37.388243309 -0600] - INFO - NSMMReplicationPlugin - repl5_tot_run - Beginning total update of replica "agmt="cn=ldap-repb-test-ra" (ldap-repb-test:636)". [13/Nov/2024:15:54:37.390191833 -0600] - DEBUG - NSMMReplicationPlugin - replica_subentry_check - replication keep alive entry <cn=repl keep alive 8,dc=ncsa,dc=illinois,dc=edu> already exists [13/Nov/2024:15:54:37.513840430 -0600] - ERR - NSMMReplicationPlugin - perform_operation - agmt="cn=ldap-repb-test-ra" (ldap-repb-test:636): Failed to send extended operation: LDAP error 51 (Server is busy) [13/Nov/2024:15:54:37.515818173 -0600] - DEBUG - NSMMReplicationPlugin - check_flow_control_tot_init - agmt="cn=ldap-repb-test-ra" (ldap-repb-test:636) - Invalid message ids [ msgid sent: -1, rcv: 5] [13/Nov/2024:15:54:37.519877720 -0600] - ERR - NSMMReplicationPlugin - repl5_tot_log_operation_failure - agmt="cn=ldap-repb-test-ra" (ldap-repb-test:636): Received error 51 (Server is busy): for total update operation
LDAP Consumer (Replica) Error Log: [13/Nov/2024:15:54:37.345059927 -0600] - DEBUG - NSMMReplicationPlugin - replica_disable_replication - replica dc=ncsa,dc=illinois,dc=edu is acquired [13/Nov/2024:15:54:37.378553066 -0600] - INFO - bdb_instance_start - Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database [13/Nov/2024:15:54:37.383226816 -0600] - DEBUG - bdb_ldbm_back_wire_import - bdb_bulk_import_start returned 0 [13/Nov/2024:15:54:37.383985431 -0600] - DEBUG - NSMMReplicationPlugin - multisupplier_extop_StartNSDS50ReplicationRequest - encoded response: 0 [13/Nov/2024:15:54:37.384635904 -0600] - DEBUG - NSMMReplicationPlugin - multisupplier_extop_StartNSDS50ReplicationRequest - Finished encoding payload [13/Nov/2024:15:54:37.385240342 -0600] - DEBUG - NSMMReplicationPlugin - multisupplier_extop_StartNSDS50ReplicationRequest - conn=162194 op=3 repl="dc=ncsa,dc=illinois,dc=edu": StartNSDS90ReplicationRequest: response=0 rc=0 [13/Nov/2024:15:54:37.385907940 -0600] - DEBUG - NSMMReplicationPlugin - consumer_connection_extension_relinquish_exclusive_access - conn=162194 op=3 Relinquishing consumer connection extension [13/Nov/2024:15:54:37.390220175 -0600] - DEBUG - bdb_ldbm_back_wire_import - bdb_bulk_import_queue returned 0 with entry dc=ncsa,dc=illinois,dc=edu [13/Nov/2024:15:54:37.505440971 -0600] - DEBUG - bdb_ldbm_back_wire_import - bdb_bulk_import_queue returned 0 with entry cn=Directory Administrators,dc=ncsa,dc=illinois,dc=edu [13/Nov/2024:15:54:37.507799711 -0600] - DEBUG - bdb_ldbm_back_wire_import - bdb_bulk_import_queue returned 0 with entry ou=Groups,dc=ncsa,dc=illinois,dc=edu [13/Nov/2024:15:54:37.509710361 -0600] - DEBUG - bdb_ldbm_back_wire_import - bdb_bulk_import_queue returned 0 with entry ou=People,dc=ncsa,dc=illinois,dc=edu [13/Nov/2024:15:54:37.511664842 -0600] - DEBUG - bdb_ldbm_back_wire_import - bdb_bulk_import_queue returned 0 with entry cn=repl keep alive 5,dc=ncsa,dc=illinois,dc=edu [13/Nov/2024:15:54:37.513327884 -0600] - DEBUG - bdb_ldbm_back_wire_import - bdb_bulk_import_queue returned 0 with entry ou=resource_delta,dc=ncsa,dc=illinois,dc=edu [13/Nov/2024:15:54:37.514630160 -0600] - DEBUG - bdb_ldbm_back_wire_import - bdb_bulk_import_queue returned 0 with entry ou=resource_vforge,dc=ncsa,dc=illinois,dc=edu [13/Nov/2024:15:54:37.516008120 -0600] - DEBUG - bdb_ldbm_back_wire_import - bdb_bulk_import_queue returned 0 with entry ou=resource_holli,dc=ncsa,dc=illinois,dc=edu .. .. .. [13/Nov/2024:15:54:37.530146158 -0600] - DEBUG - bdb_ldbm_back_wire_import - bdb_bulk_import_queue returned 0 with entry cn=PD Managers,ou=Groups,dc=ncsa,dc=illinois,dc=edu [13/Nov/2024:15:54:37.542448330 -0600] - ERR - factory_destructor - ERROR bulk import abandoned [13/Nov/2024:15:54:37.584123281 -0600] - ERR - bdb_import_run_pass - import userroot: Thread monitoring returned: -23
[13/Nov/2024:15:54:37.585520872 -0600] - ERR - bdb_public_bdb_import_main - import userroot: Aborting all Import threads... [13/Nov/2024:15:54:43.494296253 -0600] - ERR - bdb_public_bdb_import_main - import userroot: Import threads aborted. [13/Nov/2024:15:54:43.496403618 -0600] - INFO - bdb_public_bdb_import_main - import userroot: Closing files... [13/Nov/2024:15:54:43.503149502 -0600] - ERR - bdb_public_bdb_import_main - import userroot: Import failed. [13/Nov/2024:15:54:43.506359910 -0600] - DEBUG - NSMMReplicationPlugin - consumer_connection_extension_destructor - Aborting total update in progress for replicated area dc=ncsa,dc=illinois,dc=edu connid=162194 [13/Nov/2024:15:54:43.507738378 -0600] - ERR - process_bulk_import_op - NULL target sdn [13/Nov/2024:15:54:43.509027437 -0600] - DEBUG - NSMMReplicationPlugin - replica_relinquish_exclusive_access - conn=162194 op=-1 repl="dc=ncsa,dc=illinois,dc=edu": Released replica held by locking_purl=conn=162194 id=3
-Tim
there is no apparent code changes between 389-ds-base-2.4.5-9 and 389-ds-base-2.4.5 that relate to the supplier's messages from ldap/servers/plugins/replication/repl5_connection.c :
[13/Nov/2024:15:54:37.513840430 -0600] - ERR - NSMMReplicationPlugin - perform_operation - agmt="cn=ldap-repb-test-ra" (ldap-repb-test:636): Failed to send extended operation: LDAP error 51 (Server is busy) [13/Nov/2024:15:54:37.515818173 -0600] - DEBUG - NSMMReplicationPlugin - check_flow_control_tot_init - agmt="cn=ldap-repb-test-ra" (ldap-repb-test:636) - Invalid message ids [ msgid sent: -1, rcv: 5]
is it always failing in the same exact way?
are there any error messages in the form of ERR - setup_ol_tls_conn - failed: unable to create new TLS context ?
how many CPU cores are available on the supplier? is there 1 NUMA node configured for the LDAP service?
verify what is the nsslapd-ioblocktimeout value, default 10000mS / 10sec , may be too long dsconf INSTANCE config get nsslapd-ioblocktimeout 2.1.105. nsslapd-ioblocktimeout https://docs.redhat.com/en/documentation/red_hat_directory_server/12/html/co...
verify what is the nsslapd-idletimeout value, default 3600 sec, may be too long 2.1.102. nsslapd-idletimeout https://docs.redhat.com/en/documentation/red_hat_directory_server/12/html/co...
-> dsconf INSTANCE config get | grep timeout dsconf INSTANCE config replace nsslapd-idletimeout=xx dsconf INSTANCE config replace nsslapd-ioblocktimeout=xx
may be try to change the nsds5ReplicaReleaseTimeout value from default 60sec to 45sec ( not under 30 seconds), for the duration of the LDAP init: 2.7.20. nsDS5ReplicaReleaseTimeout https://docs.redhat.com/en/documentation/red_hat_directory_server/12/html/co...
or the parameter nsDS5ReplicaTimeout to a shorter value than the default of 120 sec, for the duration of the LDAP init: 2.8.32. nsDS5ReplicaTimeout https://docs.redhat.com/en/documentation/red_hat_directory_server/12/html/co...
-> ldapsearch -o ldif-wrap=no -LLLxD "cn=directory manager" -W -b "cn=mapping tree,cn=config" nsDS5ReplicaTimeout nsds5ReplicaProtocolTimeout nsDS5ReplicaReleaseTimeout
( dsconf INSTANCE replication set --help ... --repl-release-timeout REPL_RELEASE_TIMEOUT A timeout in seconds a replication supplier should send updates before it yields its replication session
dsconf INSTANCE replication set --suffix SUFFIX ... )
eventually try using LMDB instead of BDB?
was there any dsconf backup and restore previously done?
note: there is no 389-ds 2.4.5-1 build on RHDS-11.x on RHEL-8.x or RHEL-8.x IdM, the 2.4.5.x builds are for RHDS-12.4 on RHEL-9.4 or RHEL-9.4 IdM, the 389-ds-base version 1.4.3.39-* was for RHDS-11.9 on RHEL-8.10 and RHEL-8.10 IdM, may be a typo in this comment? " A Rhel 8.10 provider 389-ds 2.4.5-1 can load our whole data base w/o error on the same consumer Rhel 9.4 389-DS 2.4.5-9 on port 636 LDAPS. I tried rhel 9.4 389-ds 2.4.5-1 (same version that worked on Rhel 8.10) "
concern about the 389-ds-base versions mentioned: the interesting described scenario appears to be with 2.4.5-9 and 2.4.5-1 on RHEL-9.4, but the versions 389-ds-base-2.4.5-1 and 2.4.5-9 were for RHEL-9.4 IdM, not for a standalone LDAP server ( supported scenario are RHEL IdM or RHDS )
Cordially, Marc S.
On Sat, Dec 28, 2024 at 11:21 PM Timothy Bouvet via 389-users < 389-users@lists.fedoraproject.org> wrote:
Is anyone seeing this 389-DS question?
389-users mailing list -- 389-users@lists.fedoraproject.org To unsubscribe send an email to 389-users-leave@lists.fedoraproject.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.... Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue
On 12/29/24 8:07 AM, Timothy Bouvet via 389-users wrote:
389 community,
We are currently testing an upgrade of our LDAP infrastructure centos 7.9 389-ds-1.3.11.1-5.el7_9 to Rhel 9.4 389-ds 2.4.5-9.el9_4. We are wanting to use the replication method to upgrade alongside production and not fully shutdown our center for the migration which would be impactful. The error I am reporting is for initialization of replicas by a rhel 9.x provider. Appears to be something related to rhel 9.x and 389-ds 2.x. Note: I have not seen any other issues besides replica initialization.
Our site has run 389-DS for many years w/o issue and never have seen this issue before. I have done numerous tests to try and isolate this failure. I tried a smaller database and it still fails. different providers and consumers have the same error for rhel 9.x Provider.
Rhel 9.4 389-DS 2.4.5-9 Provider fails to fully initialize Rhel 9.4 389-DS 2.4.5-9 Consumer database. A Rhel 8.10 provider 389-ds 2.4.5-1 can load our whole data base w/o error on the same consumer Rhel 9.4 389-DS 2.4.5-9 on port 636 LDAPS. I tried rhel 9.4 389-ds 2.4.5-1 (same version that worked on Rhel 8.10) and it fails to the same consumer. I tried changing ports and protocol on rhel 9.4 Provider - Port 636 LDAPS fails, 389 STARTTLS fails, 389 LDAP works on rhel9.4.
The issue is rhel 9.x provider initializing database on rhel 9.4 consumer (fails Sending 33552 entries with TLS). I have tried restricting Provider and consumer to TLS 1.2 then TLS 1.3 and both fail. I tried tuning nsds5ReplicaFlowControlWindow and Debug logs show it gets past the authentication piece and starts to load the consumer and then gets ERR – factory destructor - ERROR bulk import abandoned on consumer. The Provider shows Error "Failed to send extended operation: LDAP error 51 (Server is busy)".
I did try tuning nsds5ReplicaFlowControlWindow and nsds5ReplicaFlowControlPause with various values with little success. I also worry is I start slowing down the Provider to replica traffic it will impact performance. The error logs below are with nsslapd-errorlog-level=24576 and default values for the mentioned tunables.
LDAP Debug Provider Error Log: [13/Nov/2024:15:54:37.386325753 -0600] - DEBUG - NSMMReplicationPlugin - acquire_replica - agmt="cn=ldap-repb-test-ra" (ldap-repb-test:636): Replica was successfully acquired. [13/Nov/2024:15:54:37.388243309 -0600] - INFO - NSMMReplicationPlugin - repl5_tot_run - Beginning total update of replica "agmt="cn=ldap-repb-test-ra" (ldap-repb-test:636)". [13/Nov/2024:15:54:37.390191833 -0600] - DEBUG - NSMMReplicationPlugin - replica_subentry_check - replication keep alive entry <cn=repl keep alive 8,dc=ncsa,dc=illinois,dc=edu> already exists [13/Nov/2024:15:54:37.513840430 -0600] - ERR - NSMMReplicationPlugin - perform_operation - agmt="cn=ldap-repb-test-ra" (ldap-repb-test:636): Failed to send extended operation: LDAP error 51 (Server is busy) [13/Nov/2024:15:54:37.515818173 -0600] - DEBUG - NSMMReplicationPlugin - check_flow_control_tot_init - agmt="cn=ldap-repb-test-ra" (ldap-repb-test:636) - Invalid message ids [ msgid sent: -1, rcv: 5]
At that point the supplier (aka provider) has acquired the replica on the consumer, updated the schema and send several entries. Then the supplier sending a next entry (via extop) receives a err=51 (server busy). You may check in the consumer access log if it exist err=51 on extop but I suspect that the extop was never received by the consumer. Indeed the supplier logs a 'msgid sent: -1' that may be because the supplier failed to send the extop at the network level. In short I suspect that the consumer can not sustain the import rate. I would suggest to reduce nsds5ReplicaFlowControlWindow to 50 or less.
Another option is to use offline init.
regards thierry
[13/Nov/2024:15:54:37.519877720 -0600] - ERR - NSMMReplicationPlugin - repl5_tot_log_operation_failure - agmt="cn=ldap-repb-test-ra" (ldap-repb-test:636): Received error 51 (Server is busy): for total update operation
LDAP Consumer (Replica) Error Log: [13/Nov/2024:15:54:37.345059927 -0600] - DEBUG - NSMMReplicationPlugin - replica_disable_replication - replica dc=ncsa,dc=illinois,dc=edu is acquired [13/Nov/2024:15:54:37.378553066 -0600] - INFO - bdb_instance_start - Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database [13/Nov/2024:15:54:37.383226816 -0600] - DEBUG - bdb_ldbm_back_wire_import - bdb_bulk_import_start returned 0 [13/Nov/2024:15:54:37.383985431 -0600] - DEBUG - NSMMReplicationPlugin - multisupplier_extop_StartNSDS50ReplicationRequest - encoded response: 0 [13/Nov/2024:15:54:37.384635904 -0600] - DEBUG - NSMMReplicationPlugin - multisupplier_extop_StartNSDS50ReplicationRequest - Finished encoding payload [13/Nov/2024:15:54:37.385240342 -0600] - DEBUG - NSMMReplicationPlugin - multisupplier_extop_StartNSDS50ReplicationRequest - conn=162194 op=3 repl="dc=ncsa,dc=illinois,dc=edu": StartNSDS90ReplicationRequest: response=0 rc=0 [13/Nov/2024:15:54:37.385907940 -0600] - DEBUG - NSMMReplicationPlugin - consumer_connection_extension_relinquish_exclusive_access - conn=162194 op=3 Relinquishing consumer connection extension [13/Nov/2024:15:54:37.390220175 -0600] - DEBUG - bdb_ldbm_back_wire_import - bdb_bulk_import_queue returned 0 with entry dc=ncsa,dc=illinois,dc=edu [13/Nov/2024:15:54:37.505440971 -0600] - DEBUG - bdb_ldbm_back_wire_import - bdb_bulk_import_queue returned 0 with entry cn=Directory Administrators,dc=ncsa,dc=illinois,dc=edu [13/Nov/2024:15:54:37.507799711 -0600] - DEBUG - bdb_ldbm_back_wire_import - bdb_bulk_import_queue returned 0 with entry ou=Groups,dc=ncsa,dc=illinois,dc=edu [13/Nov/2024:15:54:37.509710361 -0600] - DEBUG - bdb_ldbm_back_wire_import - bdb_bulk_import_queue returned 0 with entry ou=People,dc=ncsa,dc=illinois,dc=edu [13/Nov/2024:15:54:37.511664842 -0600] - DEBUG - bdb_ldbm_back_wire_import - bdb_bulk_import_queue returned 0 with entry cn=repl keep alive 5,dc=ncsa,dc=illinois,dc=edu [13/Nov/2024:15:54:37.513327884 -0600] - DEBUG - bdb_ldbm_back_wire_import - bdb_bulk_import_queue returned 0 with entry ou=resource_delta,dc=ncsa,dc=illinois,dc=edu [13/Nov/2024:15:54:37.514630160 -0600] - DEBUG - bdb_ldbm_back_wire_import - bdb_bulk_import_queue returned 0 with entry ou=resource_vforge,dc=ncsa,dc=illinois,dc=edu [13/Nov/2024:15:54:37.516008120 -0600] - DEBUG - bdb_ldbm_back_wire_import - bdb_bulk_import_queue returned 0 with entry ou=resource_holli,dc=ncsa,dc=illinois,dc=edu .. .. .. [13/Nov/2024:15:54:37.530146158 -0600] - DEBUG - bdb_ldbm_back_wire_import - bdb_bulk_import_queue returned 0 with entry cn=PD Managers,ou=Groups,dc=ncsa,dc=illinois,dc=edu [13/Nov/2024:15:54:37.542448330 -0600] - ERR - factory_destructor - ERROR bulk import abandoned [13/Nov/2024:15:54:37.584123281 -0600] - ERR - bdb_import_run_pass - import userroot: Thread monitoring returned: -23
[13/Nov/2024:15:54:37.585520872 -0600] - ERR - bdb_public_bdb_import_main - import userroot: Aborting all Import threads... [13/Nov/2024:15:54:43.494296253 -0600] - ERR - bdb_public_bdb_import_main - import userroot: Import threads aborted. [13/Nov/2024:15:54:43.496403618 -0600] - INFO - bdb_public_bdb_import_main - import userroot: Closing files... [13/Nov/2024:15:54:43.503149502 -0600] - ERR - bdb_public_bdb_import_main - import userroot: Import failed. [13/Nov/2024:15:54:43.506359910 -0600] - DEBUG - NSMMReplicationPlugin - consumer_connection_extension_destructor - Aborting total update in progress for replicated area dc=ncsa,dc=illinois,dc=edu connid=162194 [13/Nov/2024:15:54:43.507738378 -0600] - ERR - process_bulk_import_op - NULL target sdn [13/Nov/2024:15:54:43.509027437 -0600] - DEBUG - NSMMReplicationPlugin - replica_relinquish_exclusive_access - conn=162194 op=-1 repl="dc=ncsa,dc=illinois,dc=edu": Released replica held by locking_purl=conn=162194 id=3
-Tim
389-users@lists.fedoraproject.org