In a ticket, Thierry and I mentioned that we should have a quick discussion about ideas for profiling: what we want it to look like and what we need. I think it's important that we improve our observability into the server so that we can target improvements correctly.
I think we should know:
* Who is the target audience for our profiling tools?
* What kind of information do we want?
* Potential solutions for the above.
With those in mind, I think Thierry suggested STAP (SystemTap) scripts.
* Target audience - developers (us) and some “highly experienced” admins (STAP is not the easiest thing to run).
* Information - STAP would largely give us timing and possibly allow some variable/struct extraction. STAP also makes it a bit easier to look at connection info.
The alternative I would suggest is an “event” struct and a logging service.
At the start of an operation we create an event struct. As we enter/exit a plugin we can append timing information, and the plugin itself can add details (for example, the backend could add idl performance metrics or other data). At the end of the operation, we log the event struct as a JSON blob to our access log, associated with the conn/op (see the sketch after this list).
* Target - anyone; it's a log level, so it's really easy to enable (think mailing list or user support: people can easily send us diagnostic logs).
* Information - we need a bit more work to structure the “event” struct internally for profiling, but we'd get timings, and possibly internal variable data as well, in the event.
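As a rough illustration of the idea (all field names here are hypothetical, not an agreed format), an operation's event could be built up as it passes through plugins and then emitted as one JSON blob in the access log:

    import json
    import time

    # Hypothetical sketch of the proposed "event" structure; field names
    # are made up for illustration, not an agreed-upon format.
    event = {
        "conn": 12,          # connection id the event is associated with
        "op": 3,             # operation id on that connection
        "optype": "SRCH",    # operation type
        "start": time.time(),
        "plugins": [],       # per-plugin enter/exit timing appended below
    }

    def plugin_enter(event, name):
        event["plugins"].append({"name": name, "enter": time.time()})

    def plugin_exit(event, name, **details):
        # close the most recent open entry for this plugin and attach
        # any plugin-provided details (e.g. backend idl metrics)
        for p in reversed(event["plugins"]):
            if p["name"] == name and "exit" not in p:
                p["exit"] = time.time()
                p.update(details)
                break

    plugin_enter(event, "ldbm_backend")
    plugin_exit(event, "ldbm_backend", idl_lookups=4)

    event["duration"] = time.time() - event["start"]
    # at operation end, the whole event is written to the access log as one JSON line
    print(json.dumps(event))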
I think these are two possible approaches. STAP is less invasive, easier to start now, but harder to extend later. Logging is more accessible to users/admins, easier to extend later, but more work to add now.
What do we think?
C10K is a scalability problem that a server can face when dealing with events from thousands of connections (i.e. clients) at the same time. Events can be new connections, new operations on established connections, or closure of a connection (from the client or the server side).
For 389-ds, the C10K problem was resolved with a new framework, *Nunc-Stans*. Nunc-Stans was first enabled in RHDS 7.4 and improved/fixed in 7.5. Robustness issues  and  were reported in 7.5 and it was decided to disable Nunc-Stans. It is not known whether those issues also exist in 7.4. William posted a PR to fix those two issues . Nunc-Stans is a complex framework with its own dynamics. Reviewing this PR is not easy, and even a careful review may not guarantee that it fixes  and , or that it does not introduce other unexpected side effects.
From there we discussed two options (but there may be others):
1. Review and merge the PR , then later run some intensive tests
   aiming to verify , and checking the robustness in order to
2. Build some tests to:
   1. measure the benefit of NS, as  and  do not prevent some
   2. identify possible reproducers for  and 
   3. create robustness and long-duration NS-specific tests
   4. review and merge the PR 
As the PR  is not intended as a performance improvement, the results of step 2.1 will determine the priority, according to the measured performance benefits.
Comments are welcome.
Regarding the 2.1 plan, we made the following notes for the test plan (a rough client sketch follows the list):
The benefit of Nunc-Stans can only be measured with a large number
of connections (i.e. clients), above 1000. That means a set of clients
(sometimes all of them) should keep their connections *opened*. Clients
should run on several hosts so that the clients are not the bottleneck.
For the two types of events (new connections and new operations),
the measurement could be:
* Event: New connections
     o Start all clients in parallel to establish connections
       (keeping them opened), take the duration to reach 1000, 2000,
       ... 10000 connections, and check whether there are drops or not
     o Establish 1000 connections and monitor the duration to open 100
       more; do the same starting with 2000, ... 10000
     o Clients should not run any operations during the monitoring
* Event: New operations
     o Start all clients and, when 1000 connections are established,
       launch simple operations (e.g. a base-scope search:
       search -s base -b "" objectclass) and monitor how many of them
       can be handled. The same with 2000, ... 10000.
     o Response time and work queue length could be monitored to be
       sure the bottleneck is not the workers.
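As a rough sketch of what one such measurement client could look like (a single-process illustration using python-ldap; the server URI and connection counts are placeholders, and in practice the clients would be spread across several hosts/processes with raised file-descriptor limits, as noted above):

    import time
    import ldap   # python-ldap

    URI = "ldap://server.example.com:389"   # placeholder server URI

    def open_connections(count):
        """Open `count` connections, keep them open, return them and the elapsed time."""
        conns = []
        start = time.time()
        for _ in range(count):
            c = ldap.initialize(URI)
            c.simple_bind_s()          # anonymous bind; connection stays open
            conns.append(c)
        return conns, time.time() - start

    def run_base_searches(conns):
        """One simple base-scope search per open connection, return the elapsed time."""
        start = time.time()
        for c in conns:
            c.search_s("", ldap.SCOPE_BASE, "(objectClass=*)")
        return time.time() - start

    if __name__ == "__main__":
        for target in (1000, 2000, 10000):
            conns, dt = open_connections(target)
            print("%d connections established in %.2fs" % (target, dt))
            print("base searches on %d connections took %.2fs"
                  % (target, run_base_searches(conns)))
            for c in conns:
                c.unbind_s()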
 https://bugzilla.redhat.com/show_bug.cgi?id=1605554 connection leaks
I remember that Simon did some docs work for lib389. What are we currently using for that? I would like to extend this to the server core so that our man pages are easier to update, more extensive, and kept up to date along with our releases.
Sphinx was recommended and the name rings a bell…
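If Sphinx is what we end up using, a minimal sketch of a conf.py covering both lib389 API docs and generated man pages might look like this (the project name, extensions, and the dsconf entry are assumptions for illustration, not what is actually in the tree):

    # conf.py - minimal Sphinx configuration sketch; every value here is an
    # assumption for illustration, not the project's actual docs setup.
    project = "389 Directory Server"
    master_doc = "index"

    extensions = [
        "sphinx.ext.autodoc",    # pull API docs from lib389 docstrings
        "sphinx.ext.viewcode",   # link rendered docs back to the source
    ]

    # "make man" renders these sources as troff man pages, which is one way
    # to keep server man pages generated and versioned with each release.
    # (source start file, name, description, authors, manual section)
    man_pages = [
        ("man/dsconf", "dsconf", "389 Directory Server configuration tool",
         ["389 Project"], 8),
    ]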