Hello everyone,
I have a request for advice on how to approach monitoring of replication in an environment
with approximately 30 FreeIPA servers, all in a master-master replication agreement, using
389-ds (389-ds-base-1.4.3.28-6). I am currently looking for ways to reduce the number of
replicas (because there are more to come) and need to justify it to the architecture
department with evidence based on experimental observations.
The problem we are facing is that our installation has started experiencing lag in some
operations, such as adding user groups, HBAC rules, and sudo rules. The heaviest by impact
is the automember-rebuild operation.
The number of entities being added is not large: at most 10 groups and several sudo and
HBAC rules. For automember-rebuild I cannot say for certain, because I have not yet
figured out what operations it performs internally. The "lag" manifests as latency in LDAP
operations, leading to timeouts, which in turn cause some services that rely on Kerberos
or DNS (because FreeIPA uses the LDAP directory for everything) to go down. Our monitoring
system also shows that the outage propagates through the replicas as replication progresses.
The classic approach of monitoring replication agreements through the
nsds5replicaLastUpdateStatus attribute is not sufficient. We need a more dynamic approach
that can show the "waves" or replication sessions throughout the environment,
which can help in further tuning replication parameters.
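To make the "waves" idea more concrete, here is a rough sketch of the kind of comparison I have in mind. It assumes that the nsds50ruv values have already been read from a supplier and a consumer (fetching them is left out), that each element looks like "{replica <rid> ldap://host:port} <mincsn> <maxcsn>", and that the first 8 hex digits of a CSN are a timestamp in seconds. The function names and the sample hostnames are made up for illustration. Polling this periodically on every replica should show how far each one trails behind each originating replica ID.

```python
import re

# Assumed layout of an nsds50ruv element:
#   {replica <rid> ldap://host:port} <mincsn> <maxcsn>
# Elements without CSNs, and the {replicageneration} element, are skipped.
RUV_RE = re.compile(
    r"^\{replica (\d+) [^}]*\}\s+([0-9a-fA-F]{20})\s+([0-9a-fA-F]{20})$"
)


def csn_time(csn):
    """Return the timestamp (seconds) encoded in the first 8 hex digits of a CSN."""
    return int(csn[:8], 16)


def parse_ruv(values):
    """Map replica ID -> max CSN for every RUV element that carries CSNs."""
    result = {}
    for value in values:
        m = RUV_RE.match(value.strip())
        if m:
            result[int(m.group(1))] = m.group(3).lower()
    return result


def lag_seconds(supplier_ruv, consumer_ruv):
    """Per replica ID, seconds by which the consumer's max CSN trails the supplier's.

    None means the consumer has no CSN at all for that replica ID.
    """
    sup = parse_ruv(supplier_ruv)
    con = parse_ruv(consumer_ruv)
    lags = {}
    for rid, sup_csn in sup.items():
        con_csn = con.get(rid)
        if con_csn is None:
            lags[rid] = None
        else:
            lags[rid] = max(0, csn_time(sup_csn) - csn_time(con_csn))
    return lags


# Hypothetical sample values, not taken from a real server.
supplier = [
    "{replicageneration} 5f3e0000000000040000",
    "{replica 4 ldap://ipa1.example.test:389} 63f5a000000000040000 63f5a1b0000000040000",
    "{replica 7 ldap://ipa2.example.test:389} 63f5a000000000070000 63f5a1b0000000070000",
]
consumer = [
    "{replica 4 ldap://ipa1.example.test:389} 63f5a000000000040000 63f5a1a6000000040000",
]
print(lag_seconds(supplier, consumer))
```

This only gives one-second resolution and says nothing about why a replica is behind, which is why I am still asking about probe points below.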
I am facing the following problems:
1) The only way to get full replication information currently is to turn on full debug
logging in the error log. While this can be done in test environments, I cannot rely on it
in production. I thought that BPF could be the answer, but I am not sure whether dirsrv has
internal support (predefined probe points) for it. Have any of the developers tried using
BPF to monitor features in 389-ds?
2) Regardless of BPF support, I can still try to implement monitoring with it, in
conjunction with debug symbols. However, another problem is that I do not know the exact
algorithm of the replication process. I have read this article
(https://www.port389.org/docs/389ds/design/replication_troubleshooting.html), but it is
still not detailed enough for my purposes. Can you shed some light on the approach I should
take here? In my mind, the first step should be very basic: attach to a set of
consumer-level functions responsible for receiving replica updates, and monitor the
latency, the number of incoming connections at a given point in time, and so on. If you
could point me in the right direction (beyond just pointing to the repository and telling
me to search the source code), I would greatly appreciate it.
3) This feature
(https://directory.fedoraproject.org/docs/389ds/design/log-operation-stats...) is not
supported in my version of 389-ds, is it? Is there a way to patch my version to support
it?
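For the basic latency monitoring mentioned in point 2, a low-tech fallback I am considering (instead of BPF, not as an equivalent) is to summarise the etime values from the access log per minute. The sketch below assumes RESULT lines carry a bracketed timestamp at the start and an "etime=<seconds>" field; the function name and the sample lines are made up, so please treat it as an illustration only.

```python
import re
from collections import defaultdict

# Assumed access log shape:
#   [21/Apr/2023:10:15:30.123456789 +0000] conn=12 op=3 RESULT err=0 ... etime=0.000579
# Group 1 is the timestamp truncated to the minute, group 2 is the etime in seconds.
LINE_RE = re.compile(
    r"^\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}):\d{2}[^\]]*\].* RESULT .*\betime=(\d+(?:\.\d+)?)"
)


def etime_by_minute(lines):
    """Return {minute: (number of RESULT lines, maximum etime)} for the given lines."""
    buckets = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            buckets[m.group(1)].append(float(m.group(2)))
    return {minute: (len(v), max(v)) for minute, v in buckets.items()}


# Hypothetical sample lines, not copied from a real log.
sample = [
    "[21/Apr/2023:10:15:30.123456789 +0000] conn=12 op=3 RESULT err=0 tag=101 nentries=1 etime=0.000579",
    "[21/Apr/2023:10:15:45.000000000 +0000] conn=12 op=4 RESULT err=0 tag=103 nentries=0 etime=2.5",
    '[21/Apr/2023:10:16:01.000000000 +0000] conn=13 op=0 BIND dn="" method=128 version=3',
]
print(etime_by_minute(sample))
```

Comparing these per-minute maxima across replicas might at least show when each server gets slow, but it does not distinguish replicated operations from client operations.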
Thank you in advance for your help.