On 10/7/21 12:01, Spike White wrote:
> FYI -- update on this situation.
>
> AD DC logs no help. They show the exact same response sent back to a
> good machine account password renewal as for a failed renewal.
>
> One of the AD administrators has identified a particular AD DC NIC
> teaming configuration that they state has caused problems with Kerberos
> in the past. It's on a small percentage of their AD DCs and they will
> work to correct it. They will keep us apprised of progress.
>
> I'm skeptical that's the underlying root cause -- for two reasons:
> 1. If Kerberos were sensitive to this, it should affect all Kerberos
> operations (Kerberos auth, etc.) and not just the kpasswd operations.
> 2. This is not occurring on our older RHEL6 and RHEL7 builds that are
> AD-integrated via our older commercial AD integration product. It's
> occurring only on our sssd-integrated builds.
>
> At this point, we've turned off debug level 7 (it was filling up our
> /var/log filesystems and we have the verbose adcli update output from at
> least two failed clients). We're going to take the alternate
> suggestion of setting ad_maximum_machine_account_password_age to 0
> (disabling sssd from updating password) and run a cron job to do 'adcli
> update'.
>
> We're wrapping this adcli_update with tcpdump to get the exact kpasswd
> request/response packets, as well as wrapping with KRB5_TRACE.
>
> We want to call adcli update exactly as sssd calls it.
> From SOURCES/sssd-2.4.0/src/providers/ad/ad_machine_pw_renewal.c, this
> appears to be how sssd calls external program /usr/sbin/adcli to do its
> adcli update:
>
> /usr/sbin/adcli update --verbose --domain=$AD_DOMAIN
> --host-keytab=/etc/krb5.keytab --host-fqdn=$FQDN
> --computer-password-lifetime=30
>
> because we aren't doing any Samba stuff.
Question: how would Samba stuff be relevant to updating the Kerberos
ticket using adcli?
> Is that the correct
> invocation? We'll set computer-password-lifetime lower, say to 7.
> Because we want to see examples more frequently, to find failed updates.
>
> BTW, the packet capture on a successful machine account password renewal
> is only 8K, so that very targeted debug will not swamp our /var/log or
> /tmp filesystems.
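
For what it's worth, a minimal sketch of such a wrapper might look like the
following. It assumes root, MIT Kerberos (for KRB5_TRACE) and kpasswd traffic
on port 464. AD_DOMAIN, FQDN, LIFETIME, OUTDIR, the ADCLI override and the two
function names are placeholders of mine, not anything sssd or adcli ships. The
adcli options are the ones quoted above.

```shell
#!/bin/sh
# Sketch only: wrap one 'adcli update' run with a packet capture and a
# Kerberos trace. All variable and function names here are hypothetical.

# run_adcli DOMAIN FQDN LIFETIME_DAYS
# Runs adcli with the options quoted in this thread. ADCLI can be
# overridden (e.g. ADCLI=echo) to print the arguments instead of running.
run_adcli() {
    "${ADCLI:-/usr/sbin/adcli}" update --verbose \
        --domain="$1" \
        --host-keytab=/etc/krb5.keytab \
        --host-fqdn="$2" \
        --computer-password-lifetime="$3"
}

run_capture() {
    domain="${AD_DOMAIN:-EXAMPLE.COM}"        # placeholder domain
    fqdn="${FQDN:-$(hostname -f)}"
    lifetime="${LIFETIME:-7}"
    outdir="${OUTDIR:-/var/tmp/adcli-debug}"  # placeholder directory
    stamp=$(date +%Y%m%d-%H%M%S)
    mkdir -p "$outdir" || return 1

    # Capture kpasswd (464) and Kerberos (88) traffic, TCP and UDP alike.
    tcpdump -i any -U -w "$outdir/adcli-$stamp.pcap" 'port 464 or port 88' &
    tcpdump_pid=$!
    sleep 2   # give tcpdump a moment to start capturing

    # KRB5_TRACE makes the MIT Kerberos library log its steps to a file.
    KRB5_TRACE="$outdir/krb5-$stamp.trace"
    export KRB5_TRACE
    run_adcli "$domain" "$fqdn" "$lifetime" > "$outdir/adcli-$stamp.log" 2>&1
    rc=$?

    sleep 2   # let trailing packets arrive before stopping the capture
    kill "$tcpdump_pid"
    wait "$tcpdump_pid" 2>/dev/null
    return "$rc"
}

# Only touch the network and the keytab when explicitly asked to.
if [ "${1:-}" = "--run" ]; then
    run_capture
fi
```

Called from cron with --run, each attempt would leave one .pcap, one .trace and
one .log file sharing a timestamp, so a failed renewal can be lined up against
the packets for that same run.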
>
> Spike
>
> On Wed, Aug 25, 2021 at 10:32 AM Spike White <spikewhitetx@gmail.com> wrote:
>
> Sssd experts,
>
> *_Short summary:_* How can we troubleshoot sssd’s ‘Automatic
> Kerberos Host Keytab Renewal’ process? We have ~0.4% of our Linux
> servers dropping off the AD domain monthly.
>
> *_Longer explanation:_*
>
> Over the past two years, we have on-boarded sssd as our Linux AD
> integration component, largely displacing a former commercial
> product that did the same.
>
> We have about 20K Linux servers that are sssd-enabled. A mix of
> RHEL6, RHEL7, RHEL8, Oracle Linux 6, 7 and 8. We have ~7K Linux
> servers still on the old commercial product. (For certain edge-case
> scenarios, such as DMZs, the commercial product works better.)
>
> Our AD forest is a single AD forest, with 4 regional child domains.
> All with transitive trust. Sssd auto-discovers parent domain and
> all 4 child domains, no problem – whenever it’s adcli joined to its
> regional local domain.
>
> Why am I writing this?
>
> Because we are researching an ongoing problem reported by L1 server
> ops. About 70 – 80 sssd-enabled Linux servers / month drop off the
> domain. Out of our current sssd-enabled population of ~20K servers,
> that’s not horrible. But still it should be better. (Our former
> commercial product did better.)
>
> It’s not limited to one particular OS, OS version, build location or
> region. We have surveyed; it seems to occur randomly among all OS
> versions, regions and locations.
>
> To be clear, it’s extremely likely that this behavior arises from
> some subtle misconfiguration on our part – not from any sssd or
> adcli or Kerberos bug. We have a couple of configuration
> improvements we’re pursuing. (Kerberos max ticket lifetime mismatch
> between AD and /etc/krb5.conf file for instance.)
>
> We are taking sssd’s default settings for
> ad_maximum_machine_account_password_age and
> ad_machine_account_password_renewal_opts. So after 30 days, sssd
> will attempt daily to renew the host Kerberos keytab file. It
> should re-attempt daily if not renewed. By company policy, our AD
> disables any machine accounts that have not renewed their
> credentials in 40 days. So when we find servers that have dropped
> off the domain, it’s because they have not renewed their AD machine
> accounts in 40 days.
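
With those numbers (renewal attempts from day 30, account disabled at day 40),
a client has roughly ten daily attempts before it drops off. A rough sketch of
a check that flags hosts inside that window is below. It assumes the keytab's
modification time is a usable proxy for the last successful renewal, which may
not hold if anything else rewrites the file ('klist -kt' shows the per-key
timestamps instead). It also assumes GNU stat. The function names and the two
constants are mine, with values taken from the description above.

```shell
#!/bin/sh
# Sketch only: classify a host by how old its machine credentials look.
RENEW_AFTER_DAYS=30     # age at which sssd starts trying to renew (per this thread)
DISABLE_AFTER_DAYS=40   # age at which AD disables the account (per this thread)

# classify_age AGE_DAYS
# Prints "ok", "overdue: N day(s) left" or "disabled".
classify_age() {
    if [ "$1" -lt "$RENEW_AFTER_DAYS" ]; then
        echo "ok"
    elif [ "$1" -lt "$DISABLE_AFTER_DAYS" ]; then
        echo "overdue: $(( DISABLE_AFTER_DAYS - $1 )) day(s) left"
    else
        echo "disabled"
    fi
}

# keytab_age_days FILE [NOW_EPOCH]
# Prints the whole days since FILE was last modified (assumption: the
# mtime approximates the last successful renewal).
keytab_age_days() {
    mtime=$(stat -c %Y "$1") || return 1
    now=${2:-$(date +%s)}
    echo $(( (now - mtime) / 86400 ))
}
```

Something like classify_age "$(keytab_age_days /etc/krb5.keytab)" run daily
would report the hosts that are overdue while there is still time to capture a
failing attempt.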
>
> We have had SRs open with our OS vendors (Red Hat and Oracle
> respectively) for months now, to no great help. (They gave a few
> suggestions, but none panned out.)
>
> We thought we were hitting this bug:
>
> https://github.com/SSSD/sssd/issues/4762
>
> But packet captures proved that adcli update is using TCP on
> RHEL7/8. Thus, this might be a potential problem, but only on
> RHEL6. (BTW ‘udp_preference_limit = 0’ doesn’t force use of TCP for
> the kpasswd invocation in RHEL6 – it still uses UDP. Thus, the
> recommended work-around for this bug doesn’t work.)
>
> So that isn’t our underlying problem.
>
> We’re at a loss now – as you can see, we’re grasping at straws.
>
> How can we troubleshoot sssd’s ‘automatic Kerberos Host keytab
> renewal’ process? Whenever we inspect a particular server it
> works. We can’t run all sssd clients at debug level 9; it fills up
> /var/log filesystem after a few days of that. We’re interested in
> troubleshooting that one particular sssd process on all clients; not
> all parts of sssd.
>
> Other than a steep learning curve (on our part), obscure situations
> (like DMZ auto-discovery of AD controllers) and exotic scenarios
> (like above), we’re quite happy with our 2 yr journey of direct AD
> integration with sssd. Obviously, the troubleshooting tools on
> RHEL6 are very minimal. But certainly, overall the quality of sssd
> on RHEL7/8 is excellent. AD integration has innumerable devils in
> the details; I’m amazed that sssd performs as well as it does
> against our multi-domain forest.
>
> Spike
>
> PS the problem with sssd auto-discovery of AD controllers in DMZs
> has been fixed in a recent sssd release. The better discovery
> algorithm was implemented – same one used by Windows clients and
> commercial products. It’s just that recent sssd version is not on
> RHEL7 or 8.
>
>
>
> _______________________________________________
> sssd-users mailing list -- sssd-users@lists.fedorahosted.org
> To unsubscribe send an email to sssd-users-leave@lists.fedorahosted.org
> Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives: https://lists.fedorahosted.org/archives/list/sssd-users@lists.fedorahosted.org
> Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure