Hi
So, I have what seems to be a slightly odd problem. I think I've worked out what the solution might be, but not the root cause. In any case, I wanted to run it by you all and see whether you agree or have any insight into it.
The background
We're running 6 directory servers (version 4.5.0-21) on CentOS 7.4.1708, 3 of which have the CA role. I've been running the directory blissfully uneventfully for 7ish months now. We have experimented a little bit with the CA features, but nothing that can't be done trivially with the web interface (on reflection I'm sure it probably is trivial to revoke your primary certificate authority with the web interface, but you know what I mean).
The problem
In the past few days I've had occasion to try to create a new replica, but on each attempt the process fails at around this point:

    [4/4]: configuring ipa-custodia to start on boot
    Done configuring ipa-custodia.
    The ipa-replica-install command failed, exception: HTTPError: 404 Client Error: Not Found
    404 Client Error: Not Found
    The ipa-replica-install command failed. See /var/log/ipareplica-install.log for more information
Now, I've learned a fair amount over the past few days digging into this, like what ipa-custodia is, and how to poke it.
It seems that at this point the process is still actively doing things: it appears to be generating some kind of NSS certificate/key store. That step fails because apparently it can't find the key for the entry "auditSigningCert cert-pki-ca". Specifically, in custodiainstance.__get_keys, the call to cli.fetch_key fails for this nickname (but no others).
So, more digging, and I find that yes indeed, the private key appears to be missing from the cert database on one of the directory servers (specifically the "first" directory server).
I haven't quite joined the dots on how custodia is working here, but using the following command:

    sudo certutil -L -d /etc/pki/pki-tomcat/alias

I can determine that on the first directory server, the trust attributes for this cert are ",,P" whereas on the other two CA directory servers, the trust attributes are "u,u,uP", and that indeed the key is missing from the first directory server in this database. I also note that the cert databases seem to be divergent in other ways between the CA servers. Which I find interesting.
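For anyone wanting to repeat that check across several servers, here is a rough sketch of how the listing could be filtered. It assumes the usual `certutil -L` layout (nickname, then whitespace, then the three comma-separated trust fields) and it assumes that the absence of a "u" flag means the private key is not in the database, which matches what I saw here. The function name certs_without_key is my own invention, not part of any tool.

```shell
#!/bin/sh
# Read `certutil -L` output on stdin and print the nickname of every
# certificate whose trust attributes contain no "u" flag.
# Assumption: no "u" flag means the matching private key is absent.
certs_without_key() {
    awk 'NF >= 2 && $NF ~ /^[A-Za-z]*,[A-Za-z]*,[A-Za-z]*$/ && $NF !~ /u/ {
        # Drop the trailing trust-attribute column, keep the nickname.
        sub(/[ \t]+[^ \t]+$/, "")
        print
    }'
}
```

It would be used as `sudo certutil -L -d /etc/pki/pki-tomcat/alias | certs_without_key`. Listing the keys directly with `certutil -K -d /etc/pki/pki-tomcat/alias -f /etc/pki/pki-tomcat/alias/pwdfile.txt` should give a second opinion on which keys are actually present.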
Anyway, my next step is to copy the cert databases to another machine and try to import the cert/key from a "good" CA db into the "bad" CA db using pk12util.
This gives me a segmentation fault.
So, I try with a new DB. I export all the cert/key pairs from the "bad" CA individually and import them into a new DB, replicating the trust attributes. So far so good. I also export the missing cert/key from a "good" CA and import that into the same new DB. Also apparently good.
The solution?
So, at this point, I feel relatively confident that I have constructed a good DB and I should be able to perform some surgery to remove the old "bad" DB and replace it with this "good" DB.
My questions are:
1. Does this approach seem reasonable or am I oversimplifying?
2. If this is a reasonable approach: what's my best method for performing the surgery? ipactl stop, move bad db directory out of way, move "good" db in, don't forget the selinux stuff, then ipactl start again?
3. How could this even happen in the first place? Is it a known issue?
4. Shouldn't the CA databases basically all look the same between servers created at the same time? Why might they diverge?
5. Do you have any other comments or questions which you feel might be pertinent?
Thanks in advance for any input or insights shared.
Best Regards
Andy
As an update, just in case somebody comes across this thread in the future:
I copied the environment to a test rig and performed the surgery as proposed. And it worked. I was able to promote a new replica.
For those really interested in the details, here's the series of steps I performed. Some steps are slightly edited, so the working directory for each step might not be quite right, but it's not far off.

    # Unpack the alias directory copied from a "good" CA (ds2)
    mkdir work
    cd work
    mkdir ds2
    cd ds2
    tar -zxvf ~/ds2-alias.tgz
    cd ..

    # Create a new, empty NSS DB
    mkdir new
    certutil -N -d new
    cp -a /etc/pki/pki-tomcat/alias/pwdfile.txt new

    # Export each cert/key pair from the local ("bad") DB
    mkdir stage
    cd stage
    for i in 'caSigningCert cert-pki-ca' 'Server-Cert cert-pki-ca' 'ocspSigningCert cert-pki-ca' 'subsystemCert cert-pki-ca' ; do
        echo "$i"
        pk12util -o "$i" -d /etc/pki/pki-tomcat/alias/ -n "$i" -k /etc/pki/pki-tomcat/alias/pwdfile.txt -w /etc/pki/pki-tomcat/alias/pwdfile.txt
    done

    # Export the missing audit signing cert/key from the "good" DB
    pk12util -d ../ds2/etc/pki/pki-tomcat/alias/ -n 'auditSigningCert cert-pki-ca' -k ../ds2/etc/pki/pki-tomcat/alias/pwdfile.txt -w /etc/pki/pki-tomcat/alias/pwdfile.txt -o 'auditSigningCert cert-pki-ca'

    # Import everything into the new DB
    for i in * ; do
        pk12util -i "$i" -w ../new/pwdfile.txt -k ../new/pwdfile.txt -d ../new/ -n "$i"
    done

    # Compare the listings and replicate the trust attributes
    cd ../new
    certutil -L -d .
    certutil -L -d /etc/pki/pki-tomcat/alias/
    certutil -d . -n 'caSigningCert cert-pki-ca' -M -t 'CTu,Cu,Cu'
    certutil -d . -n 'auditSigningCert cert-pki-ca' -M -t 'u,u,uP'
    certutil -L -d .
    cd ..

    # Swap the new DB in, fixing ownership and SELinux context
    ipactl stop
    systemctl stop certmonger
    cd /etc/pki/pki-tomcat/
    mv alias alias-old
    mv ~/work/new alias
    chown -R pkiuser:pkiuser alias
    restorecon -R alias
    chcon -R -u system_u alias
    systemctl start certmonger
    ipactl start
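Before swapping a rebuilt DB in, it may be worth comparing its listing against one from a known-good CA rather than eyeballing the two `certutil -L` outputs. The following is only a sketch under the same layout assumption about `certutil -L` output as before; nss_listing and compare_listings are names I made up, and the file arguments are whatever saved listings you have to hand.

```shell
#!/bin/sh
# Normalise `certutil -L` output on stdin into sorted
# "nickname<TAB>trust attributes" lines, skipping headers and blanks.
nss_listing() {
    awk 'NF >= 2 && $NF ~ /^[A-Za-z]*,[A-Za-z]*,[A-Za-z]*$/ {
        trust = $NF
        sub(/[ \t]+[^ \t]+$/, "")      # strip the trust column from $0
        printf "%s\t%s\n", $0, trust
    }' | sort
}

# Diff two saved listings; exit status 0 means nicknames and trust
# attributes match, non-zero means they differ (diff output shows how).
compare_listings() {
    a=$(mktemp) b=$(mktemp)
    nss_listing < "$1" > "$a"
    nss_listing < "$2" > "$b"
    diff "$a" "$b"
    rc=$?
    rm -f "$a" "$b"
    return "$rc"
}
```

For example, save `certutil -L -d new > new.txt` and `certutil -L -d ds2/etc/pki/pki-tomcat/alias > good.txt`, then run `compare_listings good.txt new.txt`. This only compares nicknames and trust flags, not the certificates or keys themselves, so it is a sanity check rather than proof that the DB is good.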
I'll perform the same-ish series of steps in production in a maintenance window in the not too distant future.
I'm still wondering how this might have happened, whether some cosmic event has corrupted the NSSDB, or ... /shrug
Anyway, I think it's basically fixed.
Regards
A
On 10 July 2018 at 21:54, Andy Stubbs andy.stubbs@treatwell.com wrote:
Andrew Stubbs, PhD Head of Technical Operations
treatwell.co.uk
freeipa-users@lists.fedorahosted.org