On 07/24/2014 09:08 AM, Thomas Walker wrote:
On Wed, Jul 23, 2014 at 06:23:24PM -0600, Rich Megginson wrote:
Not sure.  This is a complex area.

Are you using IdM/FreeIPA or just plain 389?
Just plain old 389 (backend for sssd) with a load balancer inbetween (which is flapping when we get erratic response times to simple bind attempts).
A lightly loaded server is fine but once we get more than a frew dozen connections, the pauses become a real problem...

Some stack traces taking during the run might be helpful.  Please
see http://port389.org/wiki/FAQ#Debugging_Hangs
I guess you'll have to replace "yum" with "apt" and I'm not sure how
the debuginfo packages are handled on Debian.

Ah, missed that bit of the website.  Attaching a tarball of dumps from a hang this morning.  I had an external process "pinging" the LDAP server doing lookups for a UID 'bazbar.$i' via loopback every few seconds.  It's request at 1406210089 went unregistered by the server until 1406210106 (log timestamps below are GMT):

[24/Jul/2014:13:55:06 +0000] conn=301 fd=182 slot=182 connection from ::1 to ::1
[24/Jul/2014:13:55:06 +0000] conn=301 op=0 BIND dn="" method=128 version=3
[24/Jul/2014:13:55:06 +0000] conn=301 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn=""
[24/Jul/2014:13:55:06 +0000] conn=301 op=1 SRCH base="ou=People,dc=n,dc=twosigma,dc=com" scope=2 filter="(&(uid=bazbar.7)(objectClass=posixAccount))" attrs=ALL
[24/Jul/2014:13:55:06 +0000] conn=301 op=1 RESULT err=0 tag=101 nentries=0 etime=0
[24/Jul/2014:13:55:06 +0000] conn=301 op=2 UNBIND
[24/Jul/2014:13:55:06 +0000] conn=301 op=2 fd=182 closed - U1

I've obfuscated a few of the buffers in the stack dumps to keep my security folks happy, shouldn't be any trouble though.  These "hangs" always seem to be followed by a flurry of angry clients when things unjam:

[24/Jul/2014:13:55:04 +0000] conn=299 op=1 ABANDON targetop=NOTFOUND msgid=1
[24/Jul/2014:13:55:06 +0000] conn=302 op=1 ABANDON targetop=0 msgid=1 nentries=0 etime=0
[24/Jul/2014:13:55:06 +0000] conn=307 op=1 ABANDON targetop=0 msgid=1 nentries=0 etime=0
[24/Jul/2014:13:55:06 +0000] conn=308 op=1 ABANDON targetop=0 msgid=1 nentries=0 etime=0
[24/Jul/2014:13:55:06 +0000] conn=309 op=1 ABANDON targetop=0 msgid=1 nentries=0 etime=0
[24/Jul/2014:13:55:06 +0000] conn=313 op=1 ABANDON targetop=0 msgid=1 nentries=0 etime=0
[24/Jul/2014:13:55:06 +0000] conn=315 op=1 ABANDON targetop=0 msgid=1 nentries=0 etime=0
[24/Jul/2014:13:55:06 +0000] conn=316 op=1 ABANDON targetop=0 msgid=1 nentries=0 etime=0
[24/Jul/2014:13:55:06 +0000] conn=317 op=1 ABANDON targetop=0 msgid=1 nentries=0 etime=0
[24/Jul/2014:13:55:06 +0000] conn=318 op=1 ABANDON targetop=0 msgid=1 nentries=0 etime=0
[24/Jul/2014:13:55:06 +0000] conn=319 op=1 ABANDON targetop=0 msgid=1 nentries=0 etime=0
[24/Jul/2014:13:55:06 +0000] conn=320 op=1 ABANDON targetop=0 msgid=1 nentries=0 etime=0
[24/Jul/2014:13:55:06 +0000] conn=321 op=1 ABANDON targetop=NOTFOUND msgid=1
[24/Jul/2014:13:55:06 +0000] conn=322 op=1 ABANDON targetop=0 msgid=1 nentries=0 etime=0

even a few TCP resets:

[24/Jul/2014:13:55:09 +0000] conn=323 op=-1 fd=182 closed error 104 (Connection reset by peer) - TCP connection reset by peer.
[24/Jul/2014:13:55:09 +0000] conn=326 op=-1 fd=165 closed error 104 (Connection reset by peer) - TCP connection reset by peer.
[24/Jul/2014:13:55:11 +0000] conn=330 op=-1 fd=276 closed error 104 (Connection reset by peer) - TCP connection reset by peer.
[24/Jul/2014:13:55:11 +0000] conn=332 op=-1 fd=190 closed error 104 (Connection reset by peer) - TCP connection reset by peer.

Some initial observations based on the stack traces:

1) I think you are running too many threads.  Having too many threads causes too much thread contention, which leads to performance problems.  Did you set nsslapd-threadnumber?  If so, try setting it to 2 times the number of cores on your system.  If you didn't touch nsslapd-threadnumber, then it must be a bug, because the default number is 30 (for historical reasons).

2) Thread 1 summary:
fsync fsync ?? vslapd_log_error slapd_log_error_proc_internal slapd_log_error_proc handle_pr_read_ready slapd_daemon main

When doing stack traces, it is best to do it with no additional error log levels.  This fsync() in the daemon main loop is very expensive.

3) There are a lot of stack traces waiting on locks related to virtual attribute lookups, for example:
Thread 86
?? ?? ?? ?? PR_vsmprintf slapi_ch_smprintf vattr_map_namespace_sp_getlist slapi_vattr_namespace_values_get_sp send_specific_attrs send_ldap_search_entry_ext send_ldap_search_entry iterate send_results_ext op_shared_search do_search connection_dispatch_operation ?? start_thread clone ??

You might want to see if this is a source of your performance problems.  Are you using Class of Service, Roles, per-subtree Password Policy?
Try setting nsslapd-ignore-virtual-attrs: on in cn=config to disable virtual attribute lookups.  At least, that will either rule out vattrs as the source of the performance problems, or point out that we really need to address this performance sooner rather than later.

4) There are a lot of threads waiting for some sort of input from the client e.g.

Thread 98
poll poll ?? connection_read_operation connection_threadmain ?? start_thread clone ??

This means the client has initiated an operation but has not completed sending the data, either due to client problems, or there is some sort of network problem preventing the server's receipt of the data.


Here is the list of all of the "interesting" threads (that is, threads not just sitting in connection_wait_for_new_work):

Thread 126
poll poll ?? slapd_poll write_function openldap_write_function ber_int_sb_write ber_flush2 flush_ber send_ldap_search_entry_ext send_ldap_search_entry iterate send_results_ext op_shared_search do_search connection_dispatch_operation ?? start_thread clone ??

Thread 124
__lll_lock_wait __lll_lock_wait pthread_rwlock_rdlock slapi_rwlock_rdlock vattr_map_lookup vattr_map_namespace_sp_getlist vattr_test_filter slapi_vattr_filter_test_ext_internal vattr_test_filter_list slapi_vattr_filter_test_ext_internal slapi_vattr_filter_test_ext slapi_vattr_filter_test ldbm_back_next_search_entry_ext iterate send_results_ext op_shared_search do_search connection_dispatch_operation ?? start_thread clone ??

Thread 120
__lll_lock_wait __lll_lock_wait pthread_rwlock_rdlock slapi_rwlock_rdlock vattr_map_lookup vattr_map_namespace_sp_getlist vattr_test_filter slapi_vattr_filter_test_ext_internal vattr_test_filter_list slapi_vattr_filter_test_ext_internal slapi_vattr_filter_test_ext slapi_vattr_filter_test ldbm_back_next_search_entry_ext iterate send_results_ext op_shared_search do_search connection_dispatch_operation ?? start_thread clone ??

Thread 115
__lll_lock_wait __lll_lock_wait _L_lock_1145 pthread_mutex_lock PR_Lock connection_threadmain ?? start_thread clone ??

Thread 114
__lll_lock_wait __lll_lock_wait pthread_rwlock_rdlock slapi_rwlock_rdlock vattr_map_lookup vattr_map_namespace_sp_getlist vattr_test_filter slapi_vattr_filter_test_ext_internal slapi_vattr_filter_test_ext_internal vattr_test_filter_list slapi_vattr_filter_test_ext_internal slapi_vattr_filter_test_ext slapi_vattr_filter_test ldbm_back_next_search_entry_ext iterate send_results_ext op_shared_search do_search connection_dispatch_operation ?? start_thread clone ??

Thread 113
__lll_lock_wait __lll_lock_wait _L_lock_1145 pthread_mutex_lock PR_Lock connection_threadmain ?? start_thread clone ??

Thread 112
__lll_lock_wait __lll_lock_wait _L_lock_1145 pthread_mutex_lock PR_Lock connection_set_ssl_ssf connection_dispatch_operation connection_threadmain ?? start_thread clone ??

Thread 107
pthread_rwlock_rdlock pthread_rwlock_rdlock slapi_rwlock_rdlock vattr_map_lookup vattr_map_namespace_sp_getlist slapi_vattr_namespace_values_get_sp send_specific_attrs send_ldap_search_entry_ext send_ldap_search_entry iterate send_results_ext op_shared_search do_search connection_dispatch_operation ?? start_thread clone ??

Thread 106
__lll_lock_wait __lll_lock_wait pthread_rwlock_rdlock slapi_rwlock_rdlock vattr_map_lookup vattr_map_namespace_sp_getlist vattr_test_filter slapi_vattr_filter_test_ext_internal slapi_vattr_filter_test_ext_internal vattr_test_filter_list slapi_vattr_filter_test_ext_internal slapi_vattr_filter_test_ext slapi_vattr_filter_test ldbm_back_next_search_entry_ext iterate send_results_ext op_shared_search do_search connection_dispatch_operation ?? start_thread clone ??

Thread 105
poll poll ?? slapd_poll write_function openldap_write_function ber_int_sb_write ber_flush2 flush_ber send_ldap_search_entry_ext send_ldap_search_entry iterate send_results_ext op_shared_search do_search connection_dispatch_operation ?? start_thread clone ??

Thread 101
__lll_lock_wait __lll_lock_wait pthread_rwlock_rdlock slapi_rwlock_rdlock aclanom_get_suffix_info acl_access_allowed acl_access_allowed_main plugin_call_acl_plugin test_filter_access slapi_vattr_filter_test_ext_internal vattr_test_filter_list slapi_vattr_filter_test_ext_internal slapi_vattr_filter_test_ext slapi_vattr_filter_test ldbm_back_next_search_entry_ext iterate send_results_ext op_shared_search do_search connection_dispatch_operation ?? start_thread clone ??

Thread 99
__lll_lock_wait __lll_lock_wait pthread_rwlock_rdlock slapi_rwlock_rdlock vattr_map_lookup vattr_map_namespace_sp_getlist slapi_vattr_namespace_values_get_sp send_specific_attrs send_ldap_search_entry_ext send_ldap_search_entry iterate send_results_ext op_shared_search do_search connection_dispatch_operation ?? start_thread clone ??

Thread 98
poll poll ?? connection_read_operation connection_threadmain ?? start_thread clone ??

Thread 97
poll poll ?? connection_read_operation connection_threadmain ?? start_thread clone ??

Thread 90
__lll_lock_wait __lll_lock_wait _L_lock_1145 pthread_mutex_lock PR_Lock connection_threadmain ?? start_thread clone ??

Thread 86
?? ?? ?? ?? PR_vsmprintf slapi_ch_smprintf vattr_map_namespace_sp_getlist slapi_vattr_namespace_values_get_sp send_specific_attrs send_ldap_search_entry_ext send_ldap_search_entry iterate send_results_ext op_shared_search do_search connection_dispatch_operation ?? start_thread clone ??

Thread 81
__lll_lock_wait __lll_lock_wait pthread_rwlock_rdlock slapi_rwlock_rdlock aclanom_get_suffix_info acl_access_allowed acl_access_allowed_main plugin_call_acl_plugin test_filter_access slapi_vattr_filter_test_ext_internal vattr_test_filter_list slapi_vattr_filter_test_ext_internal slapi_vattr_filter_test_ext slapi_vattr_filter_test ldbm_back_next_search_entry_ext iterate send_results_ext op_shared_search do_search connection_dispatch_operation ?? start_thread clone ??

Thread 1
fsync fsync ?? vslapd_log_error slapd_log_error_proc_internal slapd_log_error_proc handle_pr_read_ready slapd_daemon main




--
389 users mailing list
389-users@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/389-users