Hello Gurus,
I'm looking for an explanation of the following performance behaviour.
My env: 389-ds-base-1.3.5.15-1.fc24.x86_64 in multimaster fractional replication.
Running rsearch for 5 minutes with 1 thread, I see spikes for a basic read using the uid index.
Running the same search with 10 threads, the average ms/op figures are much better, with no major spikes/bursts.
Any explanation would be much appreciated.
See below for the 1-thread run and its spikes/bursts:
T 300 -t 1
rsearch: 1 threads launched.
T1 min= 0ms, max= 5ms, count = 54710
T1 min= 0ms, max= 42ms, count = 64930
T1 min= 0ms, max= 2ms, count = 65174
T1 min= 0ms, max= 2ms, count = 65110
T1 min= 0ms, max= 44ms, count = 64966
T1 min= 0ms, max= 1ms, count = 65101
T1 min= 0ms, max= 22ms, count = 65056
T1 min= 0ms, max= 32ms, count = 64981
T1 min= 0ms, max= 1ms, count = 65145
T1 min= 0ms, max= 1ms, count = 65223
T1 min= 0ms, max= 27ms, count = 65015
T1 min= 0ms, max= 1ms, count = 65182
T1 min= 0ms, max= 3ms, count = 65213
T1 min= 0ms, max= 23ms, count = 64760
T1 min= 0ms, max= 2ms, count = 64214
T1 min= 0ms, max= 3ms, count = 52279
T1 min= 0ms, max= 11ms, count = 64914
T1 min= 0ms, max= 1ms, count = 65118
T1 min= 0ms, max= 5ms, count = 64852
T1 min= 0ms, max= 91ms, count = 64180
T1 min= 0ms, max= 4ms, count = 64746
T1 min= 0ms, max= 1ms, count = 65080
T1 min= 0ms, max= 12ms, count = 65110
T1 min= 0ms, max= 702ms, count = 59243
T1 min= 0ms, max= 1ms, count = 65082
T1 min= 0ms, max= 89ms, count = 64331
T1 min= 0ms, max= 23ms, count = 64647
T1 min= 0ms, max= 5ms, count = 64818
T1 min= 0ms, max= 55ms, count = 64374
T1 min= 0ms, max= 8ms, count = 64713
300 sec >= 300
Final Average rate: 6394.22/sec = 0.1564msec/op, total: 64713
And the final average rate for 10 threads, with no significant spikes/bursts at this thread count:
20180905 14:07:23 - Rate: 16962.10/thr (16962.10/sec = 0.0590ms/op), total:169621 (10 thr)
300 sec >= 300
Final Average rate: 17420.40/sec = 0.0574msec/op, total:169621
Hi,
It could be the flushing of logs (the access log) to disk, which happens more often when server load is higher. You could use iostat or dstat to see what is happening on the disk.
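For example, a minimal sketch (run on the server while rsearch is going; the interval and options are just a starting point):

iostat -x 1   (extended per-device I/O statistics, refreshed every second)
dstat -cd 1   (CPU and disk activity side by side, refreshed every second)

Write bursts that line up with the latency spikes would point at log flushing.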
Regards, Andrey
De: "Ghiurea, Isabella" Isabella.Ghiurea@nrc-cnrc.gc.ca À: 389-users@lists.fedoraproject.org Envoyé: Mercredi 5 Septembre 2018 23:14:24 Objet: [389-users] ldap perfomance
Hello Gurus,
looking for an answer to the following performance behavior my env: 389-ds-base-1.3.5.15-1.fc24.x86_64 in multimaster fractional replication
running rsearch for 5 min with 1 thread seeing spikes for a basic read using index uid
And running with 10 threads same search the avg ms/ops performance are much better with no major spike/burst
Any explanation much appreciate it
see bellow for 1 thread and the spike/burst
T 300 -t 1 rsearch: 1 threads launched. T1 min= 0ms, max= 5ms, count = 54710 T1 min= 0ms, max= 42ms, count = 64930 T1 min= 0ms, max= 2ms, count = 65174 T1 min= 0ms, max= 2ms, count = 65110 T1 min= 0ms, max= 44ms, count = 64966 T1 min= 0ms, max= 1ms, count = 65101 T1 min= 0ms, max= 22ms, count = 65056 T1 min= 0ms, max= 32ms, count = 64981 T1 min= 0ms, max= 1ms, count = 65145 T1 min= 0ms, max= 1ms, count = 65223 T1 min= 0ms, max= 27ms, count = 65015 T1 min= 0ms, max= 1ms, count = 65182 T1 min= 0ms, max= 3ms, count = 65213 T1 min= 0ms, max= 23ms, count = 64760 T1 min= 0ms, max= 2ms, count = 64214 T1 min= 0ms, max= 3ms, count = 52279 T1 min= 0ms, max= 11ms, count = 64914 T1 min= 0ms, max= 1ms, count = 65118 T1 min= 0ms, max= 5ms, count = 64852 T1 min= 0ms, max= 91ms, count = 64180 T1 min= 0ms, max= 4ms, count = 64746 T1 min= 0ms, max= 1ms, count = 65080 T1 min= 0ms, max= 12ms, count = 65110 T1 min= 0ms, max= 702ms, count = 59243 T1 min= 0ms, max= 1ms, count = 65082 T1 min= 0ms, max= 89ms, count = 64331 T1 min= 0ms, max= 23ms, count = 64647 T1 min= 0ms, max= 5ms, count = 64818 T1 min= 0ms, max= 55ms, count = 64374 T1 min= 0ms, max= 8ms, count = 64713
T1 min= 0ms, max= 8ms, count = 64713 300 sec >= 300 Final Average rate: 6394.22/sec = 0.1564msec/op, total: 64713
And final avg rate for 10 threads, no significant spike/burst for this num of threads
20180905 14:07:23 - Rate: 16962.10/thr (16962.10/sec = 0.0590ms/op), total:169621 (10 thr) 300 sec >= 300 Final Average rate: 17420.40/sec = 0.0574msec/op, total:169621
389-users mailing list -- 389-users@lists.fedoraproject.org To unsubscribe send an email to 389-users-leave@lists.fedoraproject.org Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/389-users@lists.fedoraproject....
A more detailed discussion about it: http://www.port389.org/docs/389ds/design/logging-performance-improvement.htm...
You could also disable logging and see whether the spikes disappear to be sure of their source: http://www.port389.org/docs/389ds/howto/howto-logsystemperf.html
This does not explain it: running 1 thread takes 0.1564 msec/op while running 10 threads takes 0.0590 ms/op, and the 10-thread run should require the access log to be flushed more frequently, yet with 10 threads I do not see the spikes in execution time that show up with 1 thread. Maybe something else?
On 9/6/2018 8:50 AM, isabella.ghiurea@nrc-cnrc.gc.ca wrote:
I think, looking at the data you posted, the question you're asking is "why, when I subject my server to a continuous search operation load, do some operations have much longer latency than others?".
This isn't a performance issue per se, imho, because overall performance is acceptable. The problem is that rsearch reports a maximum search response time that is quite high. That maximum could have been measured on only _one_ operation, though.
On Thu, 2018-09-06 at 08:55 -0600, David Boreham wrote:
I think, looking at the data you posted, the question you're asking is "why, when I subject my server to a continuous search operation load, do some operations have much longer latency than others?".
If they are doing the same operation repeatedly, then it *is* an issue, because it shows that within the server architecture there is a flaw that causes non-deterministic behaviour.
I have observed this behaviour myself, and traced it to logging (I did a server build with log_* as a stub function).
But there are plenty of other places in the code which could be issues - I know plenty of them, but time is the enemy of all things :)
On 9/6/2018 6:41 PM, William Brown wrote:
If they are doing the same operation repeatedly, then it *is* an issue because it shows that within the server architecture there is a flaw that causes non deterministic behaviour.
I didn't say it wasn't a problem -- just attempting to clarify the specific scope of the problem.
On 09/06/2018 07:50 AM, isabella.ghiurea@nrc-cnrc.gc.ca wrote:
This does not explain it: running 1 thread takes 0.1564 msec/op while running 10 threads takes 0.0590 ms/op
Yes, but that's an average. Running 10 threads doesn't make individual searches take less time.
When you're running just one thread, you're getting the result of one CPU core. When you increase the number of threads in rsearch, the server will use more CPU cores. You'll see more searches completed in the same amount of time, which rsearch will report as a lower average time/op. When you increase threads enough to saturate the CPU cores in the server, then average time will start to increase again.
and the 10-thread run should require the access log to be flushed more frequently, yet with 10 threads I do not see the spikes in execution time that show up with 1 thread. Maybe something else?
With access logging enabled, I see similar spikes in search time reported by rsearch. When I turn access logging off, I don't. Try it yourself and see if you can reproduce the behavior you reported:
# ldapmodify -h localhost -D 'cn=Directory Manager' -W
dn: cn=config
changetype: modify
replace: nsslapd-accesslog-logging-enabled
nsslapd-accesslog-logging-enabled: off
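Afterwards you can turn access logging back on by setting nsslapd-accesslog-logging-enabled to on with the same kind of ldapmodify.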
Hi there,
I have seen this behaviour due to an issue in the design of access log buffer flushing. During a buffer flush all other threads are delayed, which can cause this spike.
You can confirm this by changing your access log buffer size up or down.
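As a rough sketch of that kind of test (assuming the standard cn=config buffering switch, nsslapd-accesslog-logbuffering, is what controls this on your version), you can also turn log buffering off entirely and re-run the same rsearch to see how the spike pattern changes:

ldapmodify -h localhost -D 'cn=Directory Manager' -W
dn: cn=config
changetype: modify
replace: nsslapd-accesslog-logbuffering
nsslapd-accesslog-logbuffering: off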
Sadly this problem cannot be fixed simply without a complete rewrite of the logging subsystem - which I would love to do, but I am not employed to work on 389 at this time, so I lack the time.
Hope that helps.
On Wed, 2018-09-05 at 21:14 +0000, Ghiurea, Isabella wrote:
On 9/6/2018 6:37 PM, William Brown wrote:
Interesting to see this, because when I first ran performance tests on the DS, around 20 years ago, I quickly discovered that the logging code was a bottleneck. So I re-wrote it with in-memory ping-pong buffering for the operation threads' log output, and removed any synchronous filesystem writes. I expect that in the years since, perhaps concurrency bugs led to that code being removed, or it could be that it no longer performs well with modern hardware.
On Fri, 2018-09-07 at 13:38 -0600, David Boreham wrote:
Could be both. Modern hardware is especially bad at handling any situation where locking gets involved, due to the highly asynchronous nature of modern CPUs and their cache layers.
There is a lot on my "todo list" to rearchitect the code so that it scales on modern systems. A lot of what I want to do is directly inspired by Erlang's actor model and concurrent readability.