https://bugzilla.redhat.com/show_bug.cgi?id=2088481
Bug ID: 2088481 Summary: selinux_child: Cannot beign SELinux transaction Product: Fedora Version: 34 Status: NEW Component: sssd Assignee: sssd-maintainers@lists.fedoraproject.org Reporter: rcritten@redhat.com QA Contact: extras-qa@fedoraproject.org CC: abokovoy@redhat.com, atikhono@redhat.com, jhrozek@redhat.com, lslebodn@redhat.com, luk.claes@gmail.com, mzidek@redhat.com, pbrezina@redhat.com, sbose@redhat.com, ssorce@redhat.com, sssd-maintainers@lists.fedoraproject.org Target Milestone: --- Classification: Fedora
Description of problem:
Some logins fail under load due to contention over the SELinux transaction lock in libsemanage.
I'm trying to gather scalability information on IPA. The test revolves around bringing up a number of client VMs, enrolling as IPA clients and running a forking client that does a PAM transaction with the login service. Each client has a set of unique users with non-expired passwords to authenticate with. No pre-loading of the cache is done.
In this particular case I brought up 10 clients and ran the test tool do to 10 logins. All of these 100 logins happen more or less simultaneously (time is synced). 30 of the authentications failed due to the transaction locking.
IPA should be able to handle 100 authentications without breaking a sweat and the server-side logs don't show any issues but it may be somewhat load related. If I re-run the test tool after the fact, even clearing the SSSD cache I can't reproduce the transaction failure.
I wonder if a retry could be implemented.
The selinux_child log contains the following. It is consistent across all the clients (log level 3):
(2022-05-19 13:42:01): [selinux_child[11668]] [libsemanage] (0x0020): Could not get direct transaction lock at /var/lib/selinux/targeted/semanage.trans.LOCK. ********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE: * (2022-05-19 13:41:56): [selinux_child[11668]] [main] (0x0400): selinux_child started. * (2022-05-19 13:41:56): [selinux_child[11668]] [main] (0x2000): Running with effective IDs: [0][0]. * (2022-05-19 13:41:56): [selinux_child[11668]] [main] (0x2000): Running with real IDs [0][0]. * (2022-05-19 13:41:56): [selinux_child[11668]] [main] (0x0400): context initialized * (2022-05-19 13:41:56): [selinux_child[11668]] [unpack_buffer] (0x2000): seuser length: 12 * (2022-05-19 13:41:56): [selinux_child[11668]] [unpack_buffer] (0x2000): seuser: unconfined_u * (2022-05-19 13:41:56): [selinux_child[11668]] [unpack_buffer] (0x2000): mls_range length: 14 * (2022-05-19 13:41:56): [selinux_child[11668]] [unpack_buffer] (0x2000): mls_range: s0-s0:c0.c1023 * (2022-05-19 13:41:56): [selinux_child[11668]] [unpack_buffer] (0x2000): username length: 23 * (2022-05-19 13:41:56): [selinux_child[11668]] [unpack_buffer] (0x2000): username: user6client000.ipa.test * (2022-05-19 13:41:56): [selinux_child[11668]] [main] (0x0400): performing selinux operations * (2022-05-19 13:41:56): [selinux_child[11668]] [seuser_needs_update] (0x2000): sss_get_seuser: ret: 0 seuser: unconfined_u mls: s0-s0:c0.c1023 * (2022-05-19 13:41:56): [selinux_child[11668]] [sss_seuser_exists] (0x0400): seuser exists: no * (2022-05-19 13:41:56): [selinux_child[11668]] [seuser_needs_update] (0x0400): The SELinux user does need an update * (2022-05-19 13:42:01): [selinux_child[11668]] [libsemanage] (0x0020): Could not get direct transaction lock at /var/lib/selinux/targeted/semanage.trans.LOCK. ********************** BACKTRACE DUMP ENDS HERE *********************************
(2022-05-19 13:42:01): [selinux_child[11668]] [sss_set_seuser] (0x0020): Cannot begin SELinux transaction (2022-05-19 13:42:01): [selinux_child[11668]] [main] (0x0020): Cannot set SELinux login context. ********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE: * (2022-05-19 13:42:01): [selinux_child[11668]] [sss_set_seuser] (0x0020): Cannot begin SELinux transaction * (2022-05-19 13:42:01): [selinux_child[11668]] [main] (0x0020): Cannot set SELinux login context. ********************** BACKTRACE DUMP ENDS HERE *********************************
(2022-05-19 13:42:01): [selinux_child[11668]] [main] (0x0020): selinux_child failed! (2022-05-19 13:42:01): [selinux_child[11671]] [libsemanage] (0x0020): Could not get direct transaction lock at /var/lib/selinux/targeted/semanage.trans.LOCK. ********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE: * (2022-05-19 13:41:56): [selinux_child[11671]] [main] (0x0400): selinux_child started. * (2022-05-19 13:41:56): [selinux_child[11671]] [main] (0x2000): Running with effective IDs: [0][0]. * (2022-05-19 13:41:56): [selinux_child[11671]] [main] (0x2000): Running with real IDs [0][0]. * (2022-05-19 13:41:56): [selinux_child[11671]] [main] (0x0400): context initialized * (2022-05-19 13:41:56): [selinux_child[11671]] [unpack_buffer] (0x2000): seuser length: 12 * (2022-05-19 13:41:56): [selinux_child[11671]] [unpack_buffer] (0x2000): seuser: unconfined_u * (2022-05-19 13:41:56): [selinux_child[11671]] [unpack_buffer] (0x2000): mls_range length: 14 * (2022-05-19 13:41:56): [selinux_child[11671]] [unpack_buffer] (0x2000): mls_range: s0-s0:c0.c1023 * (2022-05-19 13:41:56): [selinux_child[11671]] [unpack_buffer] (0x2000): username length: 23 * (2022-05-19 13:41:56): [selinux_child[11671]] [unpack_buffer] (0x2000): username: user1client000.ipa.test * (2022-05-19 13:41:56): [selinux_child[11671]] [main] (0x0400): performing selinux operations * (2022-05-19 13:41:56): [selinux_child[11671]] [seuser_needs_update] (0x2000): sss_get_seuser: ret: 0 seuser: unconfined_u mls: s0-s0:c0.c1023 * (2022-05-19 13:41:56): [selinux_child[11671]] [sss_seuser_exists] (0x0400): seuser exists: no * (2022-05-19 13:41:56): [selinux_child[11671]] [seuser_needs_update] (0x0400): The SELinux user does need an update * (2022-05-19 13:42:01): [selinux_child[11671]] [libsemanage] (0x0020): Could not get direct transaction lock at /var/lib/selinux/targeted/semanage.trans.LOCK. ********************** BACKTRACE DUMP ENDS HERE *********************************
(2022-05-19 13:42:01): [selinux_child[11671]] [sss_set_seuser] (0x0020): Cannot begin SELinux transaction (2022-05-19 13:42:01): [selinux_child[11671]] [main] (0x0020): Cannot set SELinux login context. ********************** PREVIOUS MESSAGE WAS TRIGGERED BY THE FOLLOWING BACKTRACE: * (2022-05-19 13:42:01): [selinux_child[11671]] [sss_set_seuser] (0x0020): Cannot begin SELinux transaction * (2022-05-19 13:42:01): [selinux_child[11671]] [main] (0x0020): Cannot set SELinux login context. ********************** BACKTRACE DUMP ENDS HERE *********************************
(2022-05-19 13:42:01): [selinux_child[11671]] [main] (0x0020): selinux_child failed!
Version-Release number of selected component (if applicable):
sssd-common-2.5.2-2.fc34
https://bugzilla.redhat.com/show_bug.cgi?id=2088481
--- Comment #1 from Alexey Tikhonov atikhono@redhat.com ---
Failing function is `semanage_begin_transaction()`: ``` /* Attempt to obtain a transaction lock on the manager. If another * process has the lock then this function may block, depending upon * the timeout value in the handle. * * Note that if the semanage_handle has not yet obtained a transaction * lock whenever a writer function is called, there will be an * implicit call to this function. */ extern int semanage_begin_transaction(semanage_handle_t *); ```
From a quick glance I don't find an API to specify this timeout explicitly. But https://github.com/SELinuxProject/selinux/blob/0a8c177dacdc1df96ea11bb8aa75e...: ``` sh->timeout = SEMANAGE_COMMIT_READ_WAIT; ``` and https://github.com/SELinuxProject/selinux/blob/0a8c177dacdc1df96ea11bb8aa75e...: ``` https://github.com/SELinuxProject/selinux/blob/0a8c177dacdc1df96ea11bb8aa75e... ```
This matches timestamps in the log: ``` * (2022-05-19 13:41:56): [selinux_child[11668]] [seuser_needs_update] (0x0400): The SELinux user does need an update * (2022-05-19 13:42:01): [selinux_child[11668]] [libsemanage] (0x0020): Could not get direct transaction lock at /var/lib/selinux/targeted/semanage.trans.LOCK. ```
I'm not sure a retry would help. If a retry would succeed immediately this would mean a bug in `libsemanage`, imo.
Btw, Rob, if your goal is a load testing of a *server* then you could consider setting `selinux_provider = none` on the client. It will reduce amount of fetched data, but I think overall request rate will be increased.
https://bugzilla.redhat.com/show_bug.cgi?id=2088481
--- Comment #2 from Rob Crittenden rcritten@redhat.com --- Thanks, Sumit suggested the same. I'm trying at first with vanilla client and server installs and I'll go from there. Modifying this on all clients is trivial with Ansible or other config management tools so users that don't do SELinux assignments would be fine with that, but I'm focusing on the defaults right now.
https://bugzilla.redhat.com/show_bug.cgi?id=2088481
Andre Boscatto aboscatt@redhat.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Resolution|--- |NOTABUG CC| |aboscatt@redhat.com Status|NEW |CLOSED Last Closed| |2022-06-02 13:51:14
--- Comment #3 from Andre Boscatto aboscatt@redhat.com --- Hi Rob,
We discussed this BZ during our weekly call a couple of minutes ago and Sumit gave a perfect background overview about what is happening.
It seems to be a collateral effect of the Load/Performance Test setup, which is something expected from this kind of situation.
With that being said, I will go ahead and close it as NOTABUG, please re-open if you disagree or if there is any flaw we might find in a real case scenario that should be indeed addressed.
Kindly
https://bugzilla.redhat.com/show_bug.cgi?id=2088481
--- Comment #4 from Alexey Tikhonov atikhono@redhat.com --- JFTR: I disagree with the resolution.
- limits aren't documented - there is no clear message neither to the user nor to the admin explaining what has happened - this creates artificial bottleneck and degrades performance limits overall for no good reason
There is clearly a room for improvement. Perhaps scenario is not important enough to justify such improvement, so resolution could be "WONTFIX". But "NOTBUG" sounds like denial of any issue, and I think this is wrong.
https://bugzilla.redhat.com/show_bug.cgi?id=2088481
Andre Boscatto aboscatt@redhat.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Resolution|NOTABUG |WONTFIX
--- Comment #5 from Andre Boscatto aboscatt@redhat.com --- As discussed with Alexey in private, I am changing the resolution from NOTABUG to WONTFIX.
Plans are to, at least, take a look into this during CY22Q4 or later (CY23Q1), to validate what can be done to improve the user experience (the load is low, and we have similar cases already opened in the past).
Rob, if you have new information by then, please let us know, so we can take it into account the new results from the benchmark you are working on.
Last but not least, thanks, Alexey, for pointing it out and explaining it differently and bringing another point of view, I do really appreciate it.
Kindly
https://bugzilla.redhat.com/show_bug.cgi?id=2088481
Red Hat One Jira (issues.redhat.com) redhat-one-jira@redhat.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Link ID| |Red Hat Issue Tracker | |SSSD-4878
https://bugzilla.redhat.com/show_bug.cgi?id=2088481
--- Comment #6 from Rob Crittenden rcritten@redhat.com --- It's up to you I guess but if enough people, and it doesn't require many, log into a sssd-managed system simultaneously it's very possible for one or more to fail because of contention on the SELinux lock.
sssd-maintainers@lists.fedoraproject.org