On 13 Nov 2019, at 00:13, Mark Reynolds <mreynolds@redhat.com> wrote:
We have a long-standing 389ds master LDAP server that was found to be unable to contact its slaves. Specifically, the slaves show nothing in their logs about any kind of connection, while the master is logging this:

[12/Nov/2019:21:39:47.212715697 +0000] - ERR - slapi_ldap_bind - Could not send bind request for id [(anon)] authentication mechanism [EXTERNAL]: error -1 (Can't contact LDAP server), system error 0 (no error), network error 0 (Unknown error, host “ldap01:636”)

What is the bind method of the agreement?  SSLCLIENTAUTH?  The problem is that the ID is anonymous (anon).  So it's not binding correctly to the consumer.   What do you have for these attributes in the replication agreement:

Wireshark picked up more of the problem - the 389ds LDAP slave is telling the 389ds LDAP master that it (the slave) does not recognise the CA:

Transmission Control Protocol, Src Port: 636, Dst Port: 53994, Seq: 5462, Ack: 2279, Len: 7
Transport Layer Security
    TLSv1.2 Record Layer: Alert (Level: Fatal, Description: Unknown CA)
        Content Type: Alert (21)
        Version: TLS 1.2 (0x0303)
        Length: 2
        Alert Message
            Level: Fatal (2)
            Description: Unknown CA (48)
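
For anyone wanting to check a capture like this without Wireshark, here is a rough Python sketch that decodes the 7 bytes of a plaintext TLS alert record like the one above. The function name decode_tls_alert and the lookup tables are my own, the description codes are only a subset of those in RFC 5246, and it assumes the alert was sent unencrypted during the handshake (which is the case when Wireshark can show the description).

```python
import struct

# Subset of alert descriptions from RFC 5246, section 7.2 (not exhaustive).
ALERT_DESCRIPTIONS = {
    0: "close_notify",
    40: "handshake_failure",
    42: "bad_certificate",
    43: "unsupported_certificate",
    44: "certificate_revoked",
    45: "certificate_expired",
    46: "certificate_unknown",
    48: "unknown_ca",
}
ALERT_LEVELS = {1: "warning", 2: "fatal"}


def decode_tls_alert(record: bytes):
    """Decode a plaintext TLS alert record into (level, description).

    Layout: 1 byte content type, 2 bytes version, 2 bytes length,
    then the 2-byte alert message (level, description).
    """
    if len(record) != 7:
        raise ValueError("a plaintext alert record is exactly 7 bytes")
    content_type, _version, length, level, description = struct.unpack(
        "!BHHBB", record
    )
    if content_type != 21:
        raise ValueError("not an alert record (content type is not 21)")
    if length != 2:
        raise ValueError("unexpected alert length (encrypted alert?)")
    return (
        ALERT_LEVELS.get(level, f"unknown level {level}"),
        ALERT_DESCRIPTIONS.get(description, f"unknown description {description}"),
    )


if __name__ == "__main__":
    # Content type 21, version 0x0303, length 2, level 2, description 48,
    # i.e. the same field values as in the capture above.
    print(decode_tls_alert(bytes.fromhex("15030300020230")))
```

The byte string in the example is built from the field values shown in the capture, not copied from the actual packet.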

(The certificates are privately generated, and have been in place since 2016, and are all still valid.)

This is in turn caused by the 389ds LDAP master having for some reason decided not to pass the full certificate chain across to the slave (intermediates are involved), so the slave is quite correctly saying unknown CA.
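
One way to sanity-check a chain is to count the certificates in it. The sketch below is only that: it assumes the certificates one side presents have been saved as PEM text by some means outside this script, and the helper names split_pem_bundle and chain_has_at_least are made up for illustration. It counts certificates only and does not verify signatures or ordering.

```python
import re

# Matches one PEM-encoded certificate, including its BEGIN/END markers.
PEM_CERT = re.compile(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
    re.DOTALL,
)


def split_pem_bundle(text: str):
    """Return each PEM certificate block found in text, in order."""
    return PEM_CERT.findall(text)


def chain_has_at_least(text: str, expected: int) -> bool:
    """True if the bundle holds at least `expected` certificates.

    With one intermediate, a full chain is the leaf plus the intermediate,
    so expected would be 2 (the root is normally already on the peer).
    """
    return len(split_pem_bundle(text)) >= expected


if __name__ == "__main__":
    # Placeholder bodies, not real certificates: only the markers matter here.
    leaf_only = "-----BEGIN CERTIFICATE-----\nAAAA\n-----END CERTIFICATE-----\n"
    print(chain_has_at_least(leaf_only, 2))
```

If only the leaf certificate turns up where a leaf plus intermediate was expected, that would be consistent with the unknown CA alert.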

Does anyone know why 389ds would suddenly stop sending the full certificate chain while replicating?

It also looks like the error handling in 389ds SSL is broken - if the slave sent “unknown CA” to the master, the master needs to log that fact, and not report the error as “success”.