Adding the list since Sumit appears to be busy. The info is anonymized so it should be
ok. Hopefully, the gz file makes it through.
=G=
________________________________
From: Galen Johnson
Sent: Thursday, September 21, 2017 5:36 PM
To: Sumit Bose
Cc: Philip Holman
Subject: sssd email login performance
Hi Sumit,
I'm finally getting a chance to follow up on the email thread (of the same title) from
the sssd list. We've seen some delays (multi-second) for auth requests when users use
their email address versus their id. I've attached a tar file with several log files.
Phil may need to explain the summary file if you have any questions about it. We are
running CentOS 7.4 now, but I'm fairly certain that it uses the same binaries as RHEL
7.4. These logs were taken while on 7.3. I noticed that sssd was bumped to 1.15 with 7.4.
Some outstanding questions we have are:
1. The cache does not appear to be used for the email attribute. Why not?
2. We're also curious why the ldap requests add 2 seconds when performing the same
query from the command-line returns almost immediately.
3. Is it possible to have SSSD ignore the domain and just immediately look up the
address? We see "is_email_from_domain" in the domain log (reflected in the nss
log). We checked the man pages and nothing really jumped out as a config option.
It should be noted that we also moved the sssd db cache to tmpfs (per a blog from Jakub).
Thanks for any insight
=G=
Phil's analysis follows:
To wrap up, I took one more look at one of the very slow email logins to pull out a trace
of what it was doing. The attached files are the log snippets with line breaks marking off
the incoming requests to make it more clear what each module was servicing when. The
summary.txt shows the summarized entry for the connection and also gives an abridged
combined view of the logs marking where the 7 seconds appear to have gone. So this seemed
enough info to share if we have the opportunity for a consult with someone.
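For anyone who wants to repeat this kind of trace on their own logs, here is a minimal sketch in Python that flags pauses between consecutive log lines. It assumes each line starts with a timestamp of the form "(Thu Sep 21 17:36:01 2017)", optionally with a ":microseconds" field after the seconds, and that the day and month names are in English. The function name find_gaps and the 1-second threshold are illustrative and not part of sssd.

```python
import re
from datetime import datetime

# Assumed log prefix: "(Thu Sep 21 17:36:01 2017) [component] ..."
# An optional ":microseconds" field after the seconds is skipped.
STAMP = re.compile(
    r"^\((\w{3} \w{3} [ \d]\d \d\d:\d\d:\d\d)(?::\d+)? (\d{4})\)"
)


def find_gaps(lines, threshold=1.0):
    """Return (seconds, previous_line, next_line) for each pause >= threshold."""
    gaps = []
    prev_time = prev_line = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue  # skip lines without a recognizable timestamp
        when = datetime.strptime(
            f"{match.group(1)} {match.group(2)}", "%a %b %d %H:%M:%S %Y"
        )
        if prev_time is not None:
            delta = (when - prev_time).total_seconds()
            if delta >= threshold:
                gaps.append((delta, prev_line.rstrip(), line.rstrip()))
        prev_time, prev_line = when, line
    return gaps
```

Running find_gaps over the lines of a single log file (for example, list(open(path))) gives the places where that module sat idle; merging the nss, pam and domain logs by timestamp first gives the combined view.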
The short version is that roughly 1 second went to the bind that tests the user, but the
other 6 appear to have gone to interactions with the local caches rather than
the DCs. So that makes the cache files and related configuration look suspicious. It also
explains why our earlier checks (against logs or live tests) of the Exnet
interactions have failed to show any latency issues on those steps.
Possibly the fiddling we've already done with the cache files and cache config
resolved this, but it is probably still worth passing this along to someone knowledgeable
who might be able to explain what about the setup likely made everything go sideways.
Otherwise, we might be facing some kind of build-up pattern where it will always look rosy
after a restart and gradually degrade over time as state builds up.
It might also be a good idea to bounce and clear out sssd/pam state on the weekly restarts
just to protect against any possible build-up (unless we want to intentionally avoid that
for now to see if it does degrade over time).