Hi all,
We have a long-standing 389ds master LDAP server that was found to be unable to contact its slaves. Specifically, the slaves show nothing in their logs about any kind of connection, while the master is logging this:
[12/Nov/2019:21:39:47.212715697 +0000] - ERR - slapi_ldap_bind - Could not send bind request for id [(anon)] authentication mechanism [EXTERNAL]: error -1 (Can't contact LDAP server), system error 0 (no error), network error 0 (Unknown error, host “ldap01:636”)
The key part is “system error 0 (no error)”, which leaves us stumped. The error is obviously “success”.
Has anyone seen this kind of thing before?
This is 389ds running on CentOS7 as follows:
389-ds-base-1.3.9.1-10.el7.x86_64
Regards, Graham —
On 11/12/19 4:47 PM, Graham Leggett wrote:
[12/Nov/2019:21:39:47.212715697 +0000] - ERR - slapi_ldap_bind - Could not send bind request for id [(anon)] authentication mechanism [EXTERNAL]: error -1 (Can't contact LDAP server), system error 0 (no error), network error 0 (Unknown error, host “ldap01:636”)
What is the bind method of the agreement? SSLCLIENTAUTH? The problem is that the ID is anonymous (anon). So it's not binding correctly to the consumer. What do you have for these attributes in the replication agreement:
This is what I have:
dn: cn=blah,cn=replica,cn=dc\3Dexample\2Cdc\3Dcom,cn=mapping tree,cn=config
nsDS5ReplicaBindMethod: sslclientauth nsDS5ReplicaTransportInfo: LDAPS nsDS5ReplicaBindDN: cn=replication manager,cn=config
Mark
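For anyone following along, here is a minimal sketch of how the attributes Mark asks about could be pulled out of LDIF output. The helper name `agreement_bind_settings` is made up for illustration, and the `ldapsearch` invocation in the comment assumes a local instance and a Directory Manager bind; adjust both for your own setup.

```sh
#!/bin/sh
# Hypothetical helper (not part of 389ds): filter LDIF on stdin down to the
# DN and the connection/bind attributes of each replication agreement.
# Assumes unwrapped LDIF, i.e. one attribute per line.
agreement_bind_settings() {
    grep -i -E '^(dn|nsDS5Replica(Host|Port|BindMethod|TransportInfo|BindDN)):'
}

# Example use against a live server (URL and bind DN are assumptions):
#   ldapsearch -LLL -o ldif-wrap=no -H ldap://localhost \
#       -D "cn=Directory Manager" -W \
#       -b "cn=mapping tree,cn=config" \
#       "(objectClass=nsDS5ReplicationAgreement)" | agreement_bind_settings
```

The same filter should also work on a copy of the instance's dse.ldif, as long as long lines have not been folded.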
389-users mailing list -- 389-users@lists.fedoraproject.org To unsubscribe send an email to 389-users-leave@lists.fedoraproject.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject....
On 13 Nov 2019, at 08:13, Mark Reynolds mreynolds@redhat.com wrote:
What is the bind method of the agreement? SSLCLIENTAUTH? The problem is that the ID is anonymous (anon). So it's not binding correctly to the consumer. What do you have for these attributes in the replication agreement:
Hmmm, ldap01:636 also seems like a bad hostname too?
— Sincerely,
William Brown
Senior Software Engineer, 389 Directory Server SUSE Labs
On 13 Nov 2019, at 00:13, Mark Reynolds mreynolds@redhat.com wrote:
What is the bind method of the agreement? SSLCLIENTAUTH? The problem is that the ID is anonymous (anon). So it's not binding correctly to the consumer. What do you have for these attributes in the replication agreement:
Wireshark picked up more of the problem: the 389ds LDAP slave is telling the 389ds LDAP master that it does not recognise the CA:
Transmission Control Protocol, Src Port: 636, Dst Port: 53994, Seq: 5462, Ack: 2279, Len: 7
Transport Layer Security
    TLSv1.2 Record Layer: Alert (Level: Fatal, Description: Unknown CA)
        Content Type: Alert (21)
        Version: TLS 1.2 (0x0303)
        Length: 2
        Alert Message
            Level: Fatal (2)
            Description: Unknown CA (48)
(The certificates are privately generated, and have been in place since 2016, and are all still valid.)
This is in turn caused by the 389ds LDAP master having, for some reason, decided not to pass the full certificate chain across to the slave (intermediates are involved), so the slave is quite correctly saying unknown CA.
Does anyone know why 389ds would suddenly stop sending the full certificate chain while replicating?
It also looks like the error handling in 389ds SSL is broken - if the slave sent “unknown CA” to the master, the master needs to log that fact, and not report the error as “success”.
Regards, Graham —
On 13 Nov 2019, at 09:34, Graham Leggett minfrin@sharp.fm wrote:
Does anyone know why 389ds would suddenly stop sending the full certificate chain while replicating?
It also looks like the error handling in 389ds SSL is broken - if the slave sent “unknown CA” to the master, the master needs to log that fact, and not report the error as “success”.
We'll need to see the output of certutil -L -d /etc/dirsrv/slapd-<instance>/ from both the master and replica servers please.
In a TLS auth process the client doesn't send its CA. If you get unknown CA, it's most likely that the replica has either had the CA and its chain members expire, or they are not marked as trusted for client auth. So that's why I'd like to see the certutil output please.
— Sincerely,
William Brown
Senior Software Engineer, 389 Directory Server SUSE Labs
On 13 Nov 2019, at 01:37, William Brown wbrown@suse.de wrote:
In a TLS auth process the client doesn't send its CA. If you get unknown CA, it's most likely that the replica has either had the CA and its chain members expire, or they are not marked as trusted for client auth. So that's why I'd like to see the certutil output please.
I discovered the same problem had been reported in OpenLDAP: https://www.centos.org/forums/viewtopic.php?t=67042
This in turn is caused by a regression in NSS, where it is no longer sufficient to have a trusted root certificate: you now need all intermediate certificates marked as trusted as well.
Making the following change to the intermediate certs fixed the problem:
[root@ldap01 ~]# certutil -L -d /etc/dirsrv/slapd-hg

Certificate Nickname                Trust Attributes
                                    SSL,S/MIME,JAR/XPI

intermediateB                       ,,
intermediateA                       ,,
rootrootroot                        CT,C,C
ldap01                              u,u,u
[root@ldap01 ~]# certutil -M -d /etc/dirsrv/slapd-hg -t "CT,C,C" -n "intermediateA"
[root@ldap01 ~]# certutil -M -d /etc/dirsrv/slapd-hg -t "CT,C,C" -n "intermediateB"
[root@ldap01 ~]# certutil -L -d /etc/dirsrv/slapd-hg

Certificate Nickname                Trust Attributes
                                    SSL,S/MIME,JAR/XPI

intermediateA                       CT,C,C
intermediateB                       CT,C,C
rootrootroot                        CT,C,C
ldap01                              u,u,u
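As a rough sketch of how to spot the affected entries on other instances, the listing above can be filtered for certificates that carry no trust attributes at all. The function name `list_untrusted_certs` is invented for this example, and it assumes the usual `certutil -L` layout, where the trust attributes are the last whitespace-separated field on each row.

```sh
#!/bin/sh
# Hypothetical helper: read `certutil -L` output on stdin and print the
# nickname of every certificate whose trust attributes are empty (",,").
list_untrusted_certs() {
    awk '
        # Header rows never end in a bare ",," field, so only data rows match.
        $NF == ",," {
            nick = $0
            sub(/[ \t]+,,[ \t]*$/, "", nick)   # drop the trust column
            sub(/^[ \t]+/, "", nick)           # drop any leading padding
            print nick
        }
    '
}

# Example use (instance path taken from the listing above):
#   certutil -L -d /etc/dirsrv/slapd-hg | list_untrusted_certs
```

Each nickname printed is only a candidate for the certutil -M change shown above; it is worth confirming that it really is an intermediate CA, and not a stray leaf certificate, before marking it trusted.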
Raised the bug here: https://bugzilla.redhat.com/show_bug.cgi?id=1771979
Regards, Graham —
On 13 Nov 2019, at 20:29, Graham Leggett minfrin@sharp.fm wrote:
Raised the bug here: https://bugzilla.redhat.com/show_bug.cgi?id=1771979
Awesome work, thanks for following up on this!
— Sincerely,
William Brown
Senior Software Engineer, 389 Directory Server SUSE Labs
On 13 Nov 2019, at 12:29, Graham Leggett minfrin@sharp.fm wrote:
Raised the bug here: https://bugzilla.redhat.com/show_bug.cgi?id=1771979
Coming back to this one: I got to the bottom of it while investigating something else that wasn’t working.
This wasn’t a regression in NSS, but rather a regression in the openldap libraries shipped by RHEL7.5 and above.
For reasons that I haven’t found, an architecture change was made halfway through the RHEL7 lifecycle, where openldap was linked to openssl instead of NSS.
Openldap's NSS support and openldap’s openssl support differ in a fundamental way. With NSS, when openldap makes an SSL connection, intermediate certificates are filled in by the client side as normal. With openssl, when openldap makes an SSL connection, intermediate certificates are ignored and the connection breaks.
The hack workaround above fixes this because openldap’s openssl support expects you to place intermediate certs in your trusted certificate store. 389ds has its own hack workaround to make replication sort-of work while bound to two different crypto libraries: it exports the certs trusted in NSS into the CA certificate list passed to openldap. As soon as you mark the intermediates as trusted in NSS, they are included in that export, openldap finds them, and things work.
Fundamentally there are two bugs:
- https://bugzilla.redhat.com/show_bug.cgi?id=1898924
- An architectural change halfway through the lifecycle of what is supposed to be a stable OS.
Regards, Graham —
On 19 Nov 2020, at 20:34, Graham Leggett minfrin@sharp.fm wrote:
An architectural change half way through the lifecycle of what is supposed to be a stable OS.
I seem to remember this change (it happened while I worked at RH). If memory serves correctly, OpenLDAP upstream removed or deprecated their NSS support. That made it much harder to apply fixes for both stability and technical issues, so moving to OpenSSL was the "best move" for customer support.
Even 389-ds, which has to link to OpenLDAP for some outbound client operations, swapped internally from NSS to OpenSSL for this as well, which involves extracting some certificates into temporary stores for the OpenLDAP client to use. It's quite fun, to put it mildly.
There were very good reasons for those decisions, and the change was very carefully managed too. It was a lot of work, and ultimately it did make the OpenLDAP client library better for our maintainers and many consumers. But as you have noticed, these are complex systems, designed and built in a time that preceded deep testing of things. I believe our client TLS auth tests were written *after* the OpenLDAP to OpenSSL switch was made, which could be why this was not noticed.
Anyway, I hope that this gives some more context to why this happened.
— Sincerely,
William Brown
Senior Software Engineer, 389 Directory Server SUSE Labs, Australia
On 19 Nov 2020, at 10:34, Graham Leggett minfrin@sharp.fm wrote:
An architectural change half way through the lifecycle of what is supposed to be a stable OS.
End of 2023, the bug is still present in RHEL9:
[11/Dec/2023:23:02:09.510906411 +0000] - ERR - slapi_ldap_bind - Could not send bind request for id [(anon)] authentication mechanism [EXTERNAL]: error -1 (Can't contact LDAP server), system error -5987 (Invalid function argument.), network error 0 (Unknown error, host “ldap2.example.com:636")
This time, the workaround of forcing the intermediate certificates to be marked trusted no longer works. We now get a low-level complaint about a certificate verification failure. The error message doesn’t tell us which certificate failed, but it is an openssl message:
[11/Dec/2023:19:45:28.115134273 +0000] - ERR - NSMMReplicationPlugin - bind_and_check_pwp - agmt=“cn=ldap2" (thor:636) - Replication bind with EXTERNAL auth failed: LDAP error -1 (Can't contact LDAP server) (error:0A000086:SSL routines::certificate verify failed (self-signed certificate in certificate chain))
No self-signed certificates are being used. These are certs issued by public CAs, which, like all public CAs, have intermediate certs.
The bugs I raised in 2020 were all abandoned and closed.
Regards, Graham —