I have a test instance with two 2.5 container instances replicating. I took one down, not gracefully, and restarted it using the 3.1.1 container. It showed that it was recovering the RUV and then died 60 seconds after container start. I've tried restarting it several times and it consistently fails after 60 seconds. I've never seen a 389ds instance fail to recover, so this is alarming. Here's what I see in the logs:
[02/Aug/2024:23:40:08.164519017 +0000] - INFO - main - 389-Directory/3.1.1 B2024.213.0201 starting up
[02/Aug/2024:23:40:08.167190043 +0000] - INFO - main - Setting the maximum file descriptor limit to: 65535
[02/Aug/2024:23:40:08.324281525 +0000] - INFO - PBKDF2_SHA256 - Based on CPU performance, chose 2048 rounds
[02/Aug/2024:23:40:08.328414518 +0000] - INFO - ldbm_instance_config_cachememsize_set - force a minimal value 512000
[02/Aug/2024:23:40:08.334434881 +0000] - NOTICE - bdb_start_autotune - found 126976000k physical memory
[02/Aug/2024:23:40:08.336991774 +0000] - NOTICE - bdb_start_autotune - found 101486140k available
[02/Aug/2024:23:40:08.340297223 +0000] - NOTICE - bdb_start_autotune - total cache size: 29477568512 B;
[02/Aug/2024:23:40:08.343560343 +0000] - NOTICE - bdb_start - Detected Disorderly Shutdown last time Directory Server was running, recovering database.
[02/Aug/2024:23:40:50.311047857 +0000] - INFO - slapi_vattrspi_regattr - Because pwdpolicysubentry is a new registered virtual attribute , nsslapd-ignore-virtual-attrs was set to 'off'
[02/Aug/2024:23:40:50.367322989 +0000] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUVs - Rebuilding the replication changelog RUV, this may take several minutes...
[02/Aug/2024:23:41:06.202467004 +0000] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUVs - Rebuilding replication changelog RUV complete. Result 0 (Success)
(dies here and the container exits)
Is docker killing the container before recovery completes?
On Mon, Aug 5, 2024, at 9:16 AM, tdarby@arizona.edu wrote:
389-users mailing list -- 389-users@lists.fedoraproject.org
To unsubscribe send an email to 389-users-leave@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject.org
Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue
I'm looking at everything, but I don't think it's Docker at this point. It does look as though something is killing it off with a 60-second timeout, though.
Tim Darby

From: David Boreham <david@bozemanpass.com>
Sent: Monday, August 5, 2024 08:47
To: 389-users@lists.fedoraproject.org
Subject: [EXT] [389-users] Re: container 3.1.1: BDB is failing to recover from hard shutdown
On 8/5/24 17:59, Darby, Tim - (tdarby) wrote:
At this point the server had already successfully completed the BDB recovery and the replication plugin startup (the changelog RUV was rebuilt). Unfortunately it had not yet finished starting all the other plugins, so it was not yet up and running. So I agree that the most likely root cause is that DS is too slow to complete its startup (after a crash) and Docker kills it. Was it SIGKILL? It should try SIGTERM first.
thierry
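One way to answer the SIGTERM-vs-SIGKILL question is to look at the dead container's exit code (e.g. with `docker inspect -f '{{.State.ExitCode}}' <container>`): by the usual Unix convention, a process killed by a signal exits with 128 plus the signal number, so 143 means SIGTERM and 137 means SIGKILL. A small sketch of that mapping (the helper name `signal_from_exit_code` is mine, not part of any tool):

```python
import signal

# A container whose main process was killed by SIGTERM (graceful stop)
# exits with 128 + 15 = 143; SIGKILL gives 128 + 9 = 137. The exit code
# itself can be read with: docker inspect -f '{{.State.ExitCode}}' <name>

def signal_from_exit_code(exit_code: int):
    """Return the Signals member a 128+N exit code encodes, or None."""
    if exit_code > 128:
        return signal.Signals(exit_code - 128)
    return None  # normal exit, not a signal death

print(signal_from_exit_code(143).name)  # SIGTERM
print(signal_from_exit_code(137).name)  # SIGKILL
```

If the recovering instance shows 137, Docker (or the wrapper inside the container) escalated to SIGKILL rather than stopping the server cleanly.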
Hi,
There is a default timeout of 60 seconds for the container to start up: https://github.com/389ds/389-ds-base/blob/e7b0e20c1a76753a78c0c0e04d3579fd2037cdee/src/lib389/cli/dscontainer#L367 Try passing DS_STARTUP_TIMEOUT with a value greater than 60 seconds.
HTH
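In outline, the dscontainer wrapper linked above behaves roughly like the following sketch (a simplification for illustration, not the actual code): it polls for server readiness and gives up once DS_STARTUP_TIMEOUT seconds (default 60) have elapsed.

```python
import os
import time

def wait_for_startup(is_ready, poll_interval=1.0):
    """Poll is_ready() until it returns True or DS_STARTUP_TIMEOUT expires.

    Returns True if the server came up within the budget, False if the
    deadline passed (at which point the wrapper tears the container down).
    """
    timeout = int(os.environ.get("DS_STARTUP_TIMEOUT", "60"))
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_ready():
            return True
        time.sleep(poll_interval)
    return False

# In the log above, BDB recovery plus the changelog RUV rebuild ran from
# 23:40:08 to 23:41:06, about 58 seconds, leaving almost no headroom under
# the 60-second default. Raising the budget at container start, e.g.
#   docker run -e DS_STARTUP_TIMEOUT=600 ...
# gives recovery room to finish before the wrapper gives up.
```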
On Mon, Aug 5, 2024 at 5:17 PM tdarby@arizona.edu wrote:
Thanks, this may actually be the issue. It's odd that I've never seen the default 60 second timeout have an effect before.
Tim Darby
From: Viktor Ashirov <vashirov@redhat.com>
Sent: Monday, August 5, 2024 09:35
To: General discussion list for the 389 Directory server project. <389-users@lists.fedoraproject.org>
Subject: [EXT] [389-users] Re: container 3.1.1: BDB is failing to recover from hard shutdown