On 20-04-2020 14:51, Rob Crittenden wrote:
> *** EXTERNAL E-MAIL ***
>
>
> Kees Bakker via FreeIPA-users wrote:
>> On 20-04-2020 09:58, Kees Bakker via FreeIPA-users wrote:
>>> On 20-04-2020 09:09, Florence Blanc-Renaud wrote:
>>>> On 4/20/20 8:28 AM, Kees Bakker via FreeIPA-users wrote:
>>>>> Hey,
>>>>>
>>>>> I'm looking for advice how to analyse/debug this.
>>>>>
>>>>> On one of the masters the dirsrv is unresponsive. It runs, but every
>>>>> attempt to connect it hangs.
>>>>>
>>>>> The command "systemctl status" does not show anything
alarming
>>>>>
>>>>> ● dirsrv(a)EXAMPLE-COM.service - 389 Directory Server EXAMPLE-COM.
>>>>> Loaded: loaded (/usr/lib/systemd/system/dirsrv@.service;
enabled; vendor preset: disabled)
>>>>> Active: active (running) since vr 2020-04-17 13:46:25 CEST; 1h
33min ago
>>>>> Process: 3123 ExecStartPre=/usr/sbin/ds_systemd_ask_password_acl
/etc/dirsrv/slapd-%i/dse.ldif (code=exited, status=0/SUCCESS)
>>>>> Main PID: 3134 (ns-slapd)
>>>>> Status: "slapd started: Ready to process requests"
>>>>> CGroup:
/system.slice/system-dirsrv.slice/dirsrv(a)EXAMPLE-COM.service
>>>>> └─3134 /usr/sbin/ns-slapd -D
/etc/dirsrv/slapd-EXAMPLE-COM -i /var/run/dirsrv/slapd-EXAMPLE-COM.pid
>>>>>
>>>>> apr 17 15:13:54
linge.example.com ns-slapd[3134]: GSSAPI client step
1
>>>>> apr 17 15:13:54
linge.example.com ns-slapd[3134]: GSSAPI client step
1
>>>>> apr 17 15:13:54
linge.example.com ns-slapd[3134]: GSSAPI client step
1
>>>>> apr 17 15:13:54
linge.example.com ns-slapd[3134]: GSSAPI client step
1
>>>>> apr 17 15:13:54
linge.example.com ns-slapd[3134]: GSSAPI client step
2
>>>>> apr 17 15:18:54
linge.example.com ns-slapd[3134]: GSSAPI client step
1
>>>>> apr 17 15:18:54
linge.example.com ns-slapd[3134]: GSSAPI client step
1
>>>>> apr 17 15:18:55
linge.example.com ns-slapd[3134]: GSSAPI client step
1
>>>>> apr 17 15:18:55
linge.example.com ns-slapd[3134]: GSSAPI client step
1
>>>>> apr 17 15:18:55
linge.example.com ns-slapd[3134]: GSSAPI client step
2
>>>>>
>>>>> However, an ldapsearch command hangs forever
>>>>>
>>>>> [root@rotte ~]# ldapsearch -H
ldaps://linge.example.com -D
uid=keesbtest,cn=users,cn=accounts,dc=example,dc=com -W -LLL -o ldif-wrap=no -b
cn=users,cn=accounts,dc=example,dc=com
'(&(objectClass=person)(memberOf=cn=admins,cn=groups,cn=accounts,dc=example,dc=com))'
uid
>>>>> Enter LDAP Password:
>>>>>
>>>>> Even if I use the socket
(ldapi://%2fvar%2frun%2fslapd-EXAMPLE-COM.socket) the ldapsearch
>>>>> command hangs.
>>>>>
>>>>> "ipactl status" hangs
>>>>>
>>>>> "kinit" hangs
>>>>>
>>>>>
>>>> Hi,
>>>> you can start by having a look at dirsrv error log in
>>>> /var/log/dirsrv-slapd-YOUR_DOMAIN/errors, and the journal.
>>>>
>>>> The FAQ page of 389 also explains a few troubleshooting steps:
>>>>
http://www.port389.org/docs/389ds/FAQ/faq.html#Troubleshooting
>>> I did exactly that, look at the "errors" log, but there was no
clue, at least
>>> not for me. Strange enough it kept running for a few hours and then it
>>> was hanging again.
>>>
>>> I tried the command "ipctl restart", but that was hanging forever.
>>> However "systemctl restart dirsrv@MY-DOMAIN" was able to restart
>>> it after several minutes. Meanwhile the sn-slapd process was using 100%
>>> CPU.
>>>
>>> Another remark I want to make. Every ldap connection (ldapsearch, whatever)
>>> hangs for ever. No timeout, nothing.
>>>
>>> When it rains, it pours, they say. There is another master with the same
symptom.
>>> I'm getting nervous now.
>>>
>>> Thanks for the Troubleshooting link. I'll have to dive into the deep, I
guess.
>> Could it be a deadlock?
>>
>> [root@linge ~]# grep -a1 '^#0.*lock' slapd-stacktrace.1587374239.txt
>> Thread 23 (Thread 0x7ff8ff265700 (LWP 14474)):
>> #0 0x00007ff929430144 in pthread_rwlock_rdlock () at /lib64/libpthread.so.0
>> #1 0x00007ff9190cc49c in map_rdlock () at
/usr/lib64/dirsrv/plugins/schemacompat-plugin.so
>> --
>> Thread 7 (Thread 0x7ff8f7255700 (LWP 14490)):
>> #0 0x00007ff929430144 in pthread_rwlock_rdlock () at /lib64/libpthread.so.0
>> #1 0x00007ff9190cc49c in map_rdlock () at
/usr/lib64/dirsrv/plugins/schemacompat-plugin.so
>> --
>> Thread 4 (Thread 0x7ff8f5a52700 (LWP 14493)):
>> #0 0x00007ff929430144 in pthread_rwlock_rdlock () at /lib64/libpthread.so.0
>> #1 0x00007ff9190cc49c in map_rdlock () at
/usr/lib64/dirsrv/plugins/schemacompat-plugin.so
>> --
>> Thread 2 (Thread 0x7ff92c355700 (LWP 15679)):
>> #0 0x00007ff92943035e in pthread_rwlock_wrlock () at /lib64/libpthread.so.0
>> #1 0x00007ff9190b6639 in backend_be_pre_write_cb () at
/usr/lib64/dirsrv/plugins/schemacompat-plugin.so
>>
>> Without debuginfo the trace of these threads look like this:
>>
>> Thread 23 (Thread 0x7ff8ff265700 (LWP 14474)):
>> #0 0x00007ff929430144 in pthread_rwlock_rdlock () at /lib64/libpthread.so.0
>> #1 0x00007ff9190cc49c in map_rdlock () at
/usr/lib64/dirsrv/plugins/schemacompat-plugin.so
>> #2 0x00007ff9190b8745 in backend_search_cb () at
/usr/lib64/dirsrv/plugins/schemacompat-plugin.so
>> #3 0x00007ff92bcd9028 in plugin_call_func () at /usr/lib64/dirsrv/libslapd.so.0
>> #4 0x00007ff92bcd92e3 in plugin_call_plugins () at
/usr/lib64/dirsrv/libslapd.so.0
>> #5 0x00007ff92bccc0d7 in op_shared_search () at /usr/lib64/dirsrv/libslapd.so.0
>> #6 0x0000562454427bbe in do_search ()
>> #7 0x000056245441595a in connection_threadmain ()
>> #8 0x00007ff929a8cc5b in _pt_root () at /lib64/libnspr4.so
>> #9 0x00007ff92942ce65 in start_thread () at /lib64/libpthread.so.0
>> #10 0x00007ff928ad888d in clone () at /lib64/libc.so.6
>>
>> Thread 7 (Thread 0x7ff8f7255700 (LWP 14490)):
>> #0 0x00007ff929430144 in pthread_rwlock_rdlock () at /lib64/libpthread.so.0
>> #1 0x00007ff9190cc49c in map_rdlock () at
/usr/lib64/dirsrv/plugins/schemacompat-plugin.so
>> #2 0x00007ff9190b8745 in backend_search_cb () at
/usr/lib64/dirsrv/plugins/schemacompat-plugin.so
>> #3 0x00007ff92bcd9028 in plugin_call_func () at /usr/lib64/dirsrv/libslapd.so.0
>> #4 0x00007ff92bcd92e3 in plugin_call_plugins () at
/usr/lib64/dirsrv/libslapd.so.0
>> #5 0x00007ff92bccc0d7 in op_shared_search () at /usr/lib64/dirsrv/libslapd.so.0
>> #6 0x0000562454427bbe in do_search ()
>> #7 0x000056245441595a in connection_threadmain ()
>> #8 0x00007ff929a8cc5b in _pt_root () at /lib64/libnspr4.so
>> #9 0x00007ff92942ce65 in start_thread () at /lib64/libpthread.so.0
>> #10 0x00007ff928ad888d in clone () at /lib64/libc.so.6
>>
>> Thread 4 (Thread 0x7ff8f5a52700 (LWP 14493)):
>> #0 0x00007ff929430144 in pthread_rwlock_rdlock () at /lib64/libpthread.so.0
>> #1 0x00007ff9190cc49c in map_rdlock () at
/usr/lib64/dirsrv/plugins/schemacompat-plugin.so
>> #2 0x00007ff9190b8745 in backend_search_cb () at
/usr/lib64/dirsrv/plugins/schemacompat-plugin.so
>> #3 0x00007ff92bcd9028 in plugin_call_func () at /usr/lib64/dirsrv/libslapd.so.0
>> #4 0x00007ff92bcd92e3 in plugin_call_plugins () at
/usr/lib64/dirsrv/libslapd.so.0
>> #5 0x00007ff92bccc0d7 in op_shared_search () at /usr/lib64/dirsrv/libslapd.so.0
>> #6 0x0000562454427bbe in do_search ()
>> #7 0x000056245441595a in connection_threadmain ()
>> #8 0x00007ff929a8cc5b in _pt_root () at /lib64/libnspr4.so
>> #9 0x00007ff92942ce65 in start_thread () at /lib64/libpthread.so.0
>> #10 0x00007ff928ad888d in clone () at /lib64/libc.so.6
>>
>> Thread 2 (Thread 0x7ff92c355700 (LWP 15679)):
>> #0 0x00007ff92943035e in pthread_rwlock_wrlock () at /lib64/libpthread.so.0
>> #1 0x00007ff9190b6639 in backend_be_pre_write_cb () at
/usr/lib64/dirsrv/plugins/schemacompat-plugin.so
>> #2 0x00007ff92bcd9028 in plugin_call_func () at /usr/lib64/dirsrv/libslapd.so.0
>> #3 0x00007ff92bcd92e3 in plugin_call_plugins () at
/usr/lib64/dirsrv/libslapd.so.0
>> #4 0x00007ff91b08ac88 in ldbm_back_delete () at
/usr/lib64/dirsrv/plugins/libback-ldbm.so
>> #5 0x00007ff92bc89feb in op_shared_delete () at /usr/lib64/dirsrv/libslapd.so.0
>> #6 0x00007ff92bc8a1c8 in delete_internal_pb () at
/usr/lib64/dirsrv/libslapd.so.0
>> #7 0x00007ff9196f131e in delete_changerecord () at
/usr/lib64/dirsrv/plugins/libretrocl-plugin.so
>> #8 0x00007ff9196f1605 in changelog_trim_thread_fn () at
/usr/lib64/dirsrv/plugins/libretrocl-plugin.so
>> #9 0x00007ff929a8cc5b in _pt_root () at /lib64/libnspr4.so
>> #10 0x00007ff92942ce65 in start_thread () at /lib64/libpthread.so.0
>> #11 0x00007ff928ad888d in clone () at /lib64/libc.so.6
>>
>> With the following (stripped) debuginfo you can see that it is hanging while
deleting
>> a changerecord.
>>
>> Thread 2 (Thread 0x7ff92c355700 (LWP 15679)):
>> #0 0x00007ff92943035e in pthread_rwlock_wrlock () at /lib64/libpthread.so.0
>> #1 0x00007ff9190b6639 in backend_be_pre_write_cb () at
/usr/lib64/dirsrv/plugins/schemacompat-plugin.so
>> #2 0x00007ff92bcd9028 in plugin_call_func (list=0x5624565a3340,
operation=operation@entry=453, pb=pb@entry=0x5624629c28
>> 40, call_one=call_one@entry=0) at ldap/servers/slapd/plugin.c:2028
>> n = <optimized out>
>> func = 0x7ff9190b65a0 <backend_be_pre_write_cb>
>> rc = <optimized out>
>> return_value = 0
>> count = 0
>> #3 0x00007ff92bcd92e3 in plugin_call_list (pb=0x5624629c2840, operation=453,
list=<optimized out>) at ldap/servers/slap
>> d/plugin.c:1972
>> p = 0x5624564ea2c0
>> locked = <optimized out>
>> plugin_list_number = 3
>> rc = 0
>> do_op = <optimized out>
>> #4 0x00007ff92bcd92e3 in plugin_call_plugins (pb=pb@entry=0x5624629c2840,
whichfunction=whichfunction@entry=453) at lda
>> p/servers/slapd/plugin.c:442
>> p = 0x5624564ea2c0
>> locked = <optimized out>
>> plugin_list_number = 3
>> rc = 0
>> do_op = <optimized out>
>> #5 0x00007ff91b08ac88 in ldbm_back_delete (pb=0x5624629c2840) at
ldap/servers/slapd/back-ldbm/ldbm_delete.c:373
>> be = 0x5624564fc000
>> inst = 0x56245635c680
>> li = 0x5624562d3a40
>> e = 0x562462377810
>> tombstone = 0x0
>> original_tombstone = 0x0
>> tmptombstone = 0x0
>> dn = 0x56245d90fa70 "changenumber=879530,cn=changelog"
>> txn = {back_txn_txn = 0x562459c2cc60}
>> parent_txn = 0x0
>> retval = 0
>> msg = <optimized out>
>> errbuf = 0x0
>> retry_count = <optimized out>
>> disk_full = 0
>> parent_found = <optimized out>
>> ruv_c_init = 0
>> parent_modify_c = {old_entry = 0x0, new_entry = 0x0, smods = 0x0,
attr_encrypt = 0}
>> ruv_c = {old_entry = 0x0, new_entry = 0x0, smods = 0x0, attr_encrypt =
0}
>> ...
>>
> What distribution are you using and what version of 389-ds?
>
Centos7
[root@linge ~]# rpm -qa 389\*
389-ds-base-debuginfo-1.3.9.1-13.el7_7.x86_64
389-ds-base-libs-1.3.9.1-13.el7_7.x86_64
389-ds-base-1.3.9.1-13.el7_7.x86_64
[root@linge ~]# rpm -qa centos-release
centos-release-7-7.1908.0.el7.centos.x86_64
This is a known bug [1].
With the same bug there are two deadlock scenario but only one is fixed
(for example in slapi-nis-0.56.4-1 [2]).
A fix for the second one is under tests.
At the moment I would recommend the workaround [3]. The drawback is a
growth of the retroCL database but unless you have a very high rate of
update it should not be a concern.
[1]