Hi all, thanks for answering;
it took some time until I could get back to this problem.
On 09.08.2018 at 10:44, Ludwig Krispenz wrote:
On 08/09/2018 01:53 AM, William Brown wrote:
> Sadly this doesn't tell us much :(
> we could get a pstack along with iotop to see which threads do the IO,
unfortunately pstack doesn't give any output, even with debugging
symbols installed:
ii 389-ds-base 1.3.3.5-4+deb8u1
amd64 389 Directory Server suite - server
ii 389-ds-base-dbg 1.3.3.5-4+deb8u1
amd64 389 Directory Server suite - server debugging symbols
ii 389-ds-base-libs 1.3.3.5-4+deb8u1
amd64 389 Directory Server suite - libraries
ii 389-ds-base-libs-dbg 1.3.3.5-4+deb8u1
amd64 389 Directory Server suite - library debugging symbols
pstack 718
718: /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-ldap1 -i
/var/run/dirsrv/slapd-ldap1.pid -w /var/run/dirsrv/slapd-ldap1.startpid
(No symbols found)
0x7f69b42b37bc: ???? (7f69a0018270, 0, 7f69a00824a0, 0, 7f69b5c1e616,
564c254998e0) + ffffd6e2892ce4a0
0x3000001b3: ???? (0, 200000, 0, 31, 3, 564c252df680) + ffffa9b3da9bbfe3
crawl: Input/output error
Error tracing through process 718
0x564c242aaa00: ????
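As a fallback when pstack prints "(No symbols found)", gdb can usually resolve the frames via the installed -dbg packages (a sketch only, assuming gdb is available on the box; 718 is the ns-slapd PID from the attempt above):

```shell
# Hypothetical fallback: dump per-thread backtraces of the running
# ns-slapd with gdb instead of pstack (PID 718 taken from above).
gdb -p 718 -batch -ex 'set pagination off' \
    -ex 'thread apply all bt' > ns-slapd-stacks.txt 2>&1
```

Matching the thread IDs in that output against iotop's per-thread view would then show which threads are doing the I/O.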
> regular mods or the BDB regulars like trickle, checkpointing ....
>
> It would be good to see the ldbm config here I think. It would also be
> good to know what kind of work load you have. In some configurations of
> the network, things like load balancers could be sending more
> read/write traffic to one server, rather than over all of them.
The workload is generally OK: with two cores the load is around 0.80, I/O
wait is small, but CPU sometimes reaches 100% or more (see
https://nextcloud.datenkollektiv.net/s/XEHmCyscEt4j9cn).
dn: cn=ldbm database,cn=plugins,cn=config
objectClass: top
objectClass: nsSlapdPlugin
objectClass: extensibleObject
cn: ldbm database
nsslapd-pluginPath: libback-ldbm
nsslapd-pluginInitfunc: ldbm_back_init
nsslapd-pluginType: database
nsslapd-pluginEnabled: on
nsslapd-plugin-depends-on-type: Syntax
nsslapd-plugin-depends-on-type: matchingRule
nsslapd-pluginId: ldbm-backend
nsslapd-pluginVersion: 1.3.3.5
nsslapd-pluginVendor: 389 Project
nsslapd-pluginDescription: high-performance LDAP backend database plugin
numSubordinates: 58
An example of one ldbm database instance config:
dn: cn=example_net,cn=ldbm database,cn=plugins,cn=config
objectClass: top
objectClass: extensibleobject
objectClass: nsbackendinstance
cn: notraces_net
creatorsName: cn=directory manager
modifiersName: cn=directory manager
createTimestamp: 20171214123209Z
modifyTimestamp: 20171214123209Z
numSubordinates: 4
nsslapd-suffix: dc=notraces,dc=net
nsslapd-cachesize: -1
nsslapd-cachememsize: 1346371
nsslapd-readonly: off
nsslapd-require-index: off
nsslapd-directory: /var/lib/dirsrv/slapd-ldap0/db/notraces_net
nsslapd-dncachememsize: 512000
The issue: even if I set nsslapd-dncachememsize to a higher value (while
dirsrv is stopped), dirsrv changes the value back after starting.
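For reference, this is how I would try setting the value online (a sketch only: "cn=example_net" and the 10 MB value are placeholders, and the server may still override the value at startup, which is exactly the problem described above):

```shell
# Hypothetical example: raise the DN cache of one backend online.
# Backend name and size are placeholders, not values from my setup.
ldapmodify -H ldap://localhost -D "cn=directory manager" -w secret <<'EOF'
dn: cn=example_net,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsslapd-dncachememsize
nsslapd-dncachememsize: 10485760
EOF
```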
> Another spot to check is your log buffer size, as it may be small
> (causing the threads to all be writing small amounts). The logging
> subsystem is slated for a rework in the future to make it more
> scalable.
The access log is off; only the audit log is enabled, and there is not
much in it. Probably our special setup with a lot of single databases
(one per email domain, more than 50 DBs) is responsible.
Maybe the high write rates are not really responsible for the poor
LDAP performance (see my thread "ldapsearch performance problem" on
this list from June).
It's still the same: sometimes (a few times per minute) the LDAP server
becomes unresponsive for one up to several seconds. I tested it with a
loop like:
while true; do time ldapsearch -h localhost -o ldif-wrap=no -D
"cn=directory manager" -w secret -s sub -b
"dc=datenkollektiv.net,dc=net" 'objectclass=*' dn > /dev/null; done
Interestingly, if one of these queries hangs for some time, every other
query from other hosts also hangs for exactly the same time.
We captured an strace alongside the queries (output here:
https://nextcloud.datenkollektiv.net/s/XEHmCyscEt4j9cn),
so I see e.g. for a query which took about 5.489 seconds (the timestamp
above each result marks when it started):
in query.log:
12:55:39.745018658
real 0m5.489s
user 0m0.024s
sys 0m0.040s
12:55:45.237697886
real 0m1.625s
user 0m0.004s
sys 0m0.008s
12:55:46.869872903
in strace.log:
[pid 8088] 12:55:39.739539 poll([{fd=435, events=POLLOUT}], 1, 1800000
<unfinished ...>
[pid 8058] 12:55:39.739573 <... write resumed> ) = 1 <0.000087>
[pid 8088] 12:55:39.739723 <... poll resumed> ) = 1 ([{fd=435,
revents=POLLOUT}]) <0.000168>
[pid 8058] 12:55:39.739754 write(426, "dn: cn=ntU"..., 344 <unfinished
...>
[pid 8088] 12:55:39.739824 sendto(435, "0<\2\1\2d7\0043u"..., 62, 0,
NULL, 0 <unfinished ...>
[pid 8058] 12:55:39.739858 <... write resumed> ) = 344 <0.000086>
[pid 8088] 12:55:39.739917 <... sendto resumed> ) = 62 <0.000077>
[pid 8058] 12:55:39.739944 write(426, "\n", 1 <unfinished ...>
[pid 8088] 12:55:39.740032 clock_gettime(CLOCK_MONOTONIC, <unfinished ...>
[pid 8058] 12:55:39.740061 <... write resumed> ) = 1 <0.000100>
[pid 8088] 12:55:39.740191 <... clock_gettime resumed> {1083541,
905025629}) = 0 <0.000149>
[pid 8058] 12:55:39.740223 write(426, "dn: cn=ntU"..., 352 <unfinished
...>
[pid 8088] 12:55:39.740296 poll([{fd=435, events=POLLOUT}], 1, 1800000
<unfinished ...>
[pid 8058] 12:55:39.740331 <... write resumed> ) = 352 <0.000090>
[pid 8088] 12:55:39.740391 <... poll resumed> ) = 1 ([{fd=435,
revents=POLLOUT}]) <0.000079>
[pid 8058] 12:55:39.740423 write(426, "\n", 1 <unfinished ...>
[pid 13776] 12:55:39.740502 <... select resumed> ) = 0 (Timeout) <0.129345>
[pid 8088] 12:55:39.740528 sendto(435, "0\f\2\1\2e\7\n\1\0"..., 14, 0,
NULL, 0 <unfinished ...>
[pid 8058] 12:55:39.740562 <... write resumed> ) = 1 <0.000123>
[pid 13776] 12:55:39.740705 gettimeofday( <unfinished ...>
[pid 8088] 12:55:39.740739 <... sendto resumed> ) = 14 <0.000191>
[pid 8058] 12:55:39.740760 write(426, "dn: cn=num"..., 355 <unfinished
...>
[pid 13776] 12:55:39.740952 <... gettimeofday resumed> {1534244139,
740898}, NULL) = 0 <0.000229>
[pid 8088] 12:55:39.740986 setsockopt(435, SOL_TCP, TCP_CORK, [0], 4
<unfinished ...>
[pid 8058] 12:55:39.741019 <... write resumed> ) = 355 <0.000240>
[pid 13776] 12:55:39.742191 poll([{fd=436, events=POLLIN|POLLPRI}], 1, 0
<unfinished ...>
[pid 8088] 12:55:39.742258 <... setsockopt resumed> ) = 0 <0.001252>
[pid 8058] 12:55:39.742289 write(426, "\n", 1 <unfinished ...>
[pid 13776] 12:55:39.743439 <... poll resumed> ) = 0 (Timeout) <0.001206>
[pid 8088] 12:55:39.743487 write(409, "\0", 1 <unfinished ...>
[pid 8058] 12:55:39.743529 <... write resumed> ) = 1 <0.001219>
[pid 13776] 12:55:39.748812 select(0, NULL, NULL, NULL, {0, 256000}
<unfinished ...>
[pid 13772] 12:55:39.748893 <... select resumed> ) = 0 (Timeout) <0.258082>
[pid 13772] 12:55:39.748934 gettimeofday({1534244139, 748952}, NULL) = 0
<0.000019>
[pid 13772] 12:55:39.749001 poll([{fd=434, events=POLLIN|POLLPRI}], 1,
0) = 0 (Timeout) <0.000022>
[pid 13772] 12:55:39.749072 select(0, NULL, NULL, NULL, {0, 512000}
<unfinished ...>
[pid 8088] 12:55:39.749148 <... write resumed> ) = 1 <0.005636>
[pid 8058] 12:55:39.749189 write(426, "dn: cn=obj"..., 345 <unfinished
...>
[pid 8088] 12:55:39.749254 futex(0x564c333401dc, FUTEX_WAIT_PRIVATE,
6381055, NULL <unfinished ...>
[pid 8058] 12:55:39.749314 <... write resumed> ) = 345 <0.000103>
[pid 8058] 12:55:39.749360 write(426, "\n", 1) = 1 <0.000036>
[pid 8058] 12:55:39.749638 write(426, "dn: cn=own"..., 332) = 332
<0.000035>
[pid 8058] 12:55:39.749729 write(426, "\n", 1) = 1 <0.000033>
[pid 8058] 12:55:39.749956 write(426, "dn: cn=par"..., 337) = 337
<0.000036>
[pid 8058] 12:55:39.750049 write(426, "\n", 1) = 1 <0.000035>
[pid 8058] 12:55:39.750214 write(426, "dn: cn=see"..., 336) = 336
<0.000034>
[pid 8058] 12:55:39.750303 write(426, "\n", 1) = 1 <0.000032>
[pid 8058] 12:55:39.750446 write(426, "dn: cn=sn,"..., 361) = 361
<0.000034>
[pid 8058] 12:55:39.750532 write(426, "\n", 1) = 1 <0.000032>
[pid 8058] 12:55:39.750663 write(426, "dn: cn=tar"..., 352) = 352
<0.000033>
[pid 8058] 12:55:39.750747 write(426, "\n", 1) = 1 <0.000032>
[pid 8058] 12:55:39.750877 write(426, "dn: cn=tel"..., 389) = 389
<0.000033>
[pid 8058] 12:55:39.750963 write(426, "\n", 1) = 1 <0.000035>
[pid 8058] 12:55:39.751111 write(426, "dn: cn=uid"..., 328) = 328
<0.000034>
[pid 8058] 12:55:39.751199 write(426, "\n", 1) = 1 <0.000032>
....
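To find the stalls in such a large strace log, one can filter on the per-call durations that strace -T appends in angle brackets (a sketch; strace.log is the capture linked above, and the 5 ms threshold is arbitrary):

```shell
# Print the duration and full line of every syscall that took >= 5 ms,
# slowest first, using the trailing "<0.005636>" field from strace -T.
awk 'match($0, /<[0-9]+\.[0-9]+>$/) {
       d = substr($0, RSTART + 1, RLENGTH - 2)   # strip the angle brackets
       if (d + 0 >= 0.005) print d, $0
     }' strace.log | sort -rn | head -20
```

In the excerpt above this would surface e.g. the 5.6 ms write on fd 409 while all the surrounding calls finish in microseconds.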
But this is not a regular pattern.
We still don't have a clue whether the performance problem is caused by
Xen virtualisation, by a dirsrv configuration issue on our side, or by
replication (three masters replicating to each other). But replication
doesn't seem to be the problem, since the issue still exists with
replication disabled.
Before upgrading to the current (more powerful) machine we didn't have
these problems (but back then with replication between only two servers).
Debian Jessie 8.0
Kernel 4.9.0-0
Xen-Version Xen 4.8.4-pre
Setup for Kolab-Groupware
Thanks for ideas.
Jan