On Tue, 2017-06-13 at 14:12 +0200, Joakim Tjernlund wrote:
On Mon, 2017-06-12 at 18:06 +0200, Joakim Tjernlund wrote:
> On Mon, 2017-06-12 at 17:51 +0200, Jakub Hrozek wrote:
> > On Mon, Jun 12, 2017 at 03:38:28PM +0000, Joakim Tjernlund wrote:
> > > On Mon, 2017-06-12 at 17:32 +0200, Joakim Tjernlund wrote:
> > > > On Mon, 2017-06-12 at 16:01 +0200, Jakub Hrozek wrote:
> > > > > > On Mon, Jun 12, 2017 at 01:53:27PM +0000, Joakim Tjernlund wrote:
> > > > > > > On Sun, 2017-06-11 at 20:55 +0200, Jakub Hrozek wrote:
> > > > > > > > On Sat, Jun 10, 2017 at 07:56:47AM +0000, Joakim Tjernlund wrote:
> > > > > > > > > On Sat, 2017-06-10 at 08:24 +0200, Jakub Hrozek wrote:
> > > > > > > > > > On Fri, Jun 09, 2017 at 04:28:45PM +0000, Joakim Tjernlund wrote:
> > > > > > > > > > Both 1.15.2 and git master hang after less than 24 hours on
> > > > > > > > > > a server.
> > > > > > > > > >
> > > > > > > > > > I can see this repeating in the domain log:
> > > > > > > > > >
> > > > > > > > > > (Fri Jun 9 18:21:49 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
> > > > > > > > > > (Fri Jun 9 18:21:49 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0xf65ce0] on /var/lib/sss/db/cache_infinera.com.ldb
> > > > > > > > > > (Fri Jun 9 18:22:42 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
> > > > > > > > > > (Fri Jun 9 18:22:42 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0x239cce0] on /var/lib/sss/db/cache_infinera.com.ldb
> > > > > > > > > > (Fri Jun 9 18:23:35 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
> > > > > > > > > > (Fri Jun 9 18:23:35 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0x1421ce0] on /var/lib/sss/db/cache_infinera.com.ldb
> > > > > > > > > > (Fri Jun 9 18:24:28 2017) [sssd[be[infinera.com]]] [orderly_shutdown] (0x0010): SIGTERM: killing children
> > > > > > > > > > (Fri Jun 9 18:24:28 2017) [sssd[be[infinera.com]]] [ldb] (0x0010): A transaction is still active in ldb context [0x1cb0ce0] on /var/lib/sss/db/cache_infinera.com.ldb
> > > > > > > > >
> > > > > > > > > This is caused by a write to disk taking too long.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Can I just increase the timeout for now? I will patch the code
> > > > > > > > if needed. On this server we need enumerate = true ATM, cannot
> > > > > > > > just turn it off.
> > > > > > >
> > > > > > > Oh, sure. The other alternative might be to mount the cache to
> > > > > > > tmpfs.
> > > > > >
> > > > > > After mounting a tmpfs this morning on /var/lib/sss/db, the error
> > > > > > has returned.
> > > > > > Seems to be an additional problem here.
> > > > > >
> > > > > > I don't think this AD is that big either:
> > > > > > # > getent passwd | wc -l
> > > > > > 3236
> > > > > > # > getent group | wc -l
> > > > > > 885
> > > > > >
> > > > > > Any ideas?
> > > > >
> > > > > Can you get a pstack of when the process is 'stuck' ?
> > > > >
> > > > > Does increasing the 'timeout' parameter from its default '10' to
> > > > > maybe 30 in the domain section help?
> > > >
> > > > I see A LOT of this in the log (figured I'd look before I restart sssd):
> > > >
> > > >
> > > > (Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [sdap_find_entry_by_origDN] (0x4000): Searching cache for [CN=Jovy\20Sena,OU=Sunnyvale,OU=CorpUsers,DC=infinera,DC=com].
> > > > (Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Added timed event "ltdb_callback": 0x4c28c00
> > > >
> > > > (Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Added timed event "ltdb_timeout": 0x4c28cc0
> > > >
> > > > (Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Running timer event 0x4c28c00 "ltdb_callback"
> > > >
> > > > (Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Destroying timer event 0x4c28cc0 "ltdb_timeout"
> > > >
> > > > (Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Ending timer event 0x4c28c00 "ltdb_callback"
> > > >
> > > > (Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Added timed event "ltdb_callback": 0x34ccf50
> > > >
> > > > (Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Added timed event "ltdb_timeout": 0x34cd0c0
> > > >
> > > > (Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Running timer event 0x34ccf50 "ltdb_callback"
> > > >
> > > > (Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Destroying timer event 0x34cd0c0 "ltdb_timeout"
> > > >
> > > > (Mon Jun 12 15:55:09 2017) [sssd[be[infinera.com]]] [ldb] (0x4000): Ending timer event 0x34ccf50 "ltdb_callback"
> > >
> > > After just adding timout = 30 and restarting sssd it still hung. Had to
> > > clear out (saved a copy first)
> >
> > ^^^^^^^^^^^
> > There is a typo here, I wonder if you used the correct spelling in the
> > config? Also, did you add the option to the domain section?
>
> It is now :) was in the wrong section before
timeout = 30 in domain section SEEMS to help, no problem since yesterday.
What did I really do here?
However, now I see that the output of getent group / getent group <a-grp-name>
is incomplete: members are missing.
And it varies between machines; even ones that have enumerate = false have an
incomplete member list for a random group name.
Jocke