URL: https://github.com/SSSD/sssd/pull/107 Author: fidencio Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler Action: opened
PR body: """ While debugging rhbz#1396912 some deadlock on sssd_be was noticed[0] and it's been caused by the use of non async-signal-safe functions from the signal_handler (please, see man 7 signal for more info about which are the async-signal-safe functions that can be used).
The removal of those functions we lose some debug messages, which is the least of problems now. Also instead of re-setting the watchdog timeout in case of time-shift (please, see b8ceaeb for more info), let's just re-set the timer's time.
Along with this commit, the function to teardown the watchdog has been removed as it's not used anymore.
[0]: [root@dusan ~]# pstack 17922 #0 __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95 #1 0x00007fe707d04f93 in _L_lock_14932 () from /lib64/libc.so.6 #2 0x00007fe707d02013 in __GI___libc_malloc (bytes=140630248638304, bytes@entry=15) at malloc.c:2891 #3 0x00007fe707d0888a in __GI___strdup (s=0x7fe707dff4f7 "/etc/localtime") at strdup.c:42 #4 0x00007fe707d31b61 in tzset_internal (always=<optimized out>, explicit=explicit@entry=1) at tzset.c:438 #5 0x00007fe707d32523 in __tz_convert (timer=timer@entry=0x7ffcd5d2b090, use_localtime=use_localtime@entry=1, tp=tp@entry=0x7fe708041d40 <_tmbuf>) at tzset.c:621 #6 0x00007fe707d30521 in __GI_localtime (t=t@entry=0x7ffcd5d2b090) at localtime.c:42 #7 0x00007fe70886c7b0 in sss_vdebug_fn (file=<optimized out>, line=<optimized out>, function=0x7fe70bff27f0 <__FUNCTION__.9379> "watchdog_handler", level=16, flags=flags@entry=0, format=format@entry=0x7fe70bff2760 "Watchdog timer overflow, killing process!\n", ap=ap@entry=0x7ffcd5d2b130) at src/util/debug.c:248 #8 0x00007fe70886c995 in sss_debug_fn (file=file@entry=0x7fe70bff263b "src/util/util_watchdog.c", line=line@entry=82, function=function@entry=0x7fe70bff27f0 <__FUNCTION__.9379> "watchdog_handler", level=level@entry=16, format=format@entry=0x7fe70bff2760 "Watchdog timer overflow, killing process!\n") at src/util/debug.c:284 #9 0x00007fe70bfdb409 in watchdog_handler (sig=<optimized out>) at src/util/util_watchdog.c:81 #10 <signal handler called> #11 0x00007fe707cff664 in _int_malloc (av=av@entry=0x7fe70803c760 <main_arena>, bytes=bytes@entry=151) at malloc.c:3494 #12 0x00007fe707d01fbc in __GI___libc_malloc (bytes=bytes@entry=151) at malloc.c:2893 #13 0x00007fe708450749 in __talloc_with_prefix (prefix_len=0, size=55, context=0x7fe718373210) at ../talloc.c:668 #14 __talloc (size=55, context=0x7fe718373210) at ../talloc.c:708 #15 _talloc_named_const (name=0x7fe70bb7015d "../common/ldb_pack.c:425", size=55, context=0x7fe718373210) at ../talloc.c:865 #16 talloc_named_const (context=<optimized out>, size=size@entry=55, name=name@entry=0x7fe70bb7015d "../common/ldb_pack.c:425") at ../talloc.c:1606 #17 0x00007fe70bb61803 in ldb_unpack_data_only_attr_list (ldb=ldb@entry=0x7fe70e4d52c0, data=data@entry=0x7ffcd5d2b990, message=0x7fe7184aa1e0, list=list@entry=0x0, list_size=list_size@entry=0, nb_elements_in_db=nb_elements_in_db@entry=0x0) at ../common/ldb_pack.c:425 #18 0x00007fe70bb61a7d in ldb_unpack_data (ldb=ldb@entry=0x7fe70e4d52c0, data=data@entry=0x7ffcd5d2b990, message=<optimized out>) at ../common/ldb_pack.c:470 #19 0x00007fe6fdc29b46 in ltdb_parse_data_unpack (key=..., data=..., private_data=0x7ffcd5d2ba70) at ../ldb_tdb/ldb_search.c:249 #20 0x00007fe70a5e0a24 in tdb_parse_data (tdb=tdb@entry=0x7fe70e4eaa10, key=..., offset=15619748, len=414772, parser=parser@entry=0x7fe6fdc29b10 <ltdb_parse_data_unpack>, private_data=private_data@entry=0x7ffcd5d2ba70) at ../common/io.c:637 #21 0x00007fe70a5dc1fc in tdb_parse_record (tdb=0x7fe70e4eaa10, key=..., parser=parser@entry=0x7fe6fdc29b10 <ltdb_parse_data_unpack>, private_data=private_data@entry=0x7ffcd5d2ba70) at ../common/tdb.c:253 #22 0x00007fe6fdc29e7b in ltdb_search_dn1 (module=module@entry=0x7fe70e4eab50, dn=dn@entry=0x7fe7183c4940, msg=msg@entry=0x7fe7184aa1e0) at ../ldb_tdb/ldb_search.c:287 #23 0x00007fe6fdc2acbb in ltdb_dn_list_load (module=module@entry=0x7fe70e4eab50, dn=dn@entry=0x7fe7183c4940, list=list@entry=0x7fe7183c3a30) at ../ldb_tdb/ldb_index.c:181 #24 0x00007fe6fdc2bbbb in ltdb_index_add1 (module=module@entry=0x7fe70e4eab50, dn=dn@entry=0x7fe7183bf3e0 "name=testuser7045@domain.com,cn=users,cn=DOMAIN.COM,cn=sysdb", v_idx=v_idx@entry=0, el=<optimized out>, el=<optimized out>) at ../ldb_tdb/ldb_index.c:1134 #25 0x00007fe6fdc2c62c in ltdb_index_add_el (el=0x7fe7184aa3e0, dn=0x7fe7183bf3e0 "name=testuser7045@domain.com,cn=users,cn=DOMAIN.COM,cn=sysdb", module=0x7fe70e4eab50) at ../ldb_tdb/ldb_index.c:1180 #26 ltdb_index_add_element (module=module@entry=0x7fe70e4eab50, dn=<optimized out>, el=el@entry=0x7fe7184aa3e0) at ../ldb_tdb/ldb_index.c:1290 #27 0x00007fe6fdc290bb in ltdb_modify_internal (module=module@entry=0x7fe70e4eab50, msg=0x7fe7183bf0c0, req=req@entry=0x7fe7183bdc10) at ../ldb_tdb/ldb_tdb.c:903 #28 0x00007fe6fdc2958a in ltdb_modify (ctx=0x7fe7183c2950, ctx=0x7fe7183c2950) at ../ldb_tdb/ldb_tdb.c:998 #29 ltdb_callback (ev=<optimized out>, te=<optimized out>, t=..., private_data=<optimized out>) at ../ldb_tdb/ldb_tdb.c:1380 #30 0x00007fe708664b4f in tevent_common_loop_timer_delay (ev=ev@entry=0x7fe70e4d2890) at ../tevent_timed.c:341 #31 0x00007fe708665b5a in epoll_event_loop_once (ev=0x7fe70e4d2890, location=<optimized out>) at ../tevent_epoll.c:911 #32 0x00007fe708664257 in std_event_loop_once (ev=0x7fe70e4d2890, location=0x7fe70bb72ec5 "../common/ldb.c:631") at ../tevent_standard.c:114 #33 0x00007fe70866040d in _tevent_loop_once (ev=ev@entry=0x7fe70e4d2890, location=location@entry=0x7fe70bb72ec5 "../common/ldb.c:631") at ../tevent.c:533 #34 0x00007fe70bb6bc4f in ldb_wait (handle=0x7fe7183c4530, type=<optimized out>) at ../common/ldb.c:631 #35 0x00007fe70bb6c793 in ldb_autotransaction_request (ldb=0x7fe70e4d52c0, req=0x7fe7183bdc10) at ../common/ldb.c:573 #36 0x00007fe70bb6d263 in ldb_modify (ldb=ldb@entry=0x7fe70e4d52c0, message=<optimized out>) at ../common/ldb.c:1655 #37 0x00007fe70bfa2ab5 in sysdb_set_cache_entry_attr (ldb=0x7fe70e4d52c0, entry_dn=entry_dn@entry=0x7fe7183c4760, attrs=attrs@entry=0x7fe7183bf680, mod_op=mod_op@entry=2) at src/db/sysdb_ops.c:1159 #38 0x00007fe70bfa304d in sysdb_rep_ts_entry_attr (sysdb=0x7fe70e4eadd0, attrs=0x7fe7183bf680, entry_dn=0x7fe7183c4760) at src/db/sysdb_ops.c:1218 #39 sysdb_set_ts_entry_attr (sysdb=sysdb@entry=0x7fe70e4eadd0, entry_dn=entry_dn@entry=0x7fe7183c4760, attrs=attrs@entry=0x7fe7183bb840, mod_op=mod_op@entry=2) at src/db/sysdb_ops.c:1248 #40 0x00007fe70bfa4aa9 in sysdb_set_entry_attr (sysdb=0x7fe70e4eadd0, entry_dn=0x7fe7183c4760, attrs=attrs@entry=0x7fe7183bb840, mod_op=mod_op@entry=2) at src/db/sysdb_ops.c:1199 #41 0x00007fe70bfa4b5f in sysdb_set_user_attr (domain=domain@entry=0x7fe70e4d62f0, name=name@entry=0x7fe7183c01f0 "testuser7045@domain.com", attrs=attrs@entry=0x7fe7183bb840, mod_op=mod_op@entry=2) at src/db/sysdb_ops.c:1285 #42 0x00007fe70bfa58c3 in sysdb_add_user (domain=domain@entry=0x7fe70e4d62f0, name=name@entry=0x7fe7183c01f0 "testuser7045@domain.com", uid=uid@entry=1415408147, gid=<optimized out>, gid@entry=1415400513, gecos=gecos@entry=0x7fe710465d00 "Test User7045", homedir=homedir@entry=0x0, shell=shell@entry=0x0, orig_dn=orig_dn@entry=0x7fe710465940 "CN=Test User7045,OU=Sales,DC=DOMAIN,DC=COM", attrs=attrs@entry=0x7fe7183bb840, cache_timeout=cache_timeout@entry=5400, now=now@entry=1481105315) at src/db/sysdb_ops.c:1928 #43 0x00007fe70bfab271 in sysdb_store_new_user (now=1481105315, cache_timeout=5400, attrs=0x7fe7183bb840, orig_dn=0x7fe710465940 "CN=Test User7045,OU=Sales,DC=DOMAIN,DC=COM", shell=0x0, homedir=0x0, gecos=0x7fe710465d00 "Test User7045", gid=1415400513, uid=1415408147, name=0x7fe7183c01f0 "testuser7045@domain.com", domain=0x7fe70e4d62f0) at src/db/sysdb_ops.c:2549 #44 sysdb_store_user (domain=domain@entry=0x7fe70e4d62f0, name=0x7fe7183c01f0 "testuser7045@domain.com", pwd=pwd@entry=0x0, uid=1415408147, gid=1415400513, gecos=gecos@entry=0x7fe710465d00 "Test User7045", homedir=homedir@entry=0x0, shell=shell@entry=0x0, orig_dn=orig_dn@entry=0x7fe710465940 "CN=Test User7045,OU=Sales,DC=DOMAIN,DC=COM", attrs=attrs@entry=0x7fe7183bb840, remove_attrs=0x7fe7183c08a0, cache_timeout=cache_timeout@entry=5400, now=now@entry=1481105315) at src/db/sysdb_ops.c:2499 #45 0x00007fe6fba0d9f9 in sdap_save_user (memctx=memctx@entry=0x7fe70e544ee0, opts=opts@entry=0x7fe70e518400, dom=dom@entry=0x7fe70e4d62f0, attrs=<optimized out>, _usn_value=_usn_value@entry=0x7ffcd5d2c260, now=now@entry=1481105315) at src/providers/ldap/sdap_async_users.c:509 #46 0x00007fe6fba0df9a in sdap_save_users (memctx=memctx@entry=0x7fe70e544e40, sysdb=0x7fe70e4eadd0, dom=0x7fe70e4d62f0, opts=0x7fe70e518400, users=<optimized out>, num_users=10006, _usn_value=_usn_value@entry=0x7fe70e544e60) at src/providers/ldap/sdap_async_users.c:572 #47 0x00007fe6fba0e460 in sdap_get_users_done (subreq=<optimized out>) at src/providers/ldap/sdap_async_users.c:938 #48 0x00007fe6fba0c9d5 in sdap_search_user_process (subreq=0x0) at src/providers/ldap/sdap_async_users.c:814 #49 0x00007fe6fba07379 in generic_ext_search_handler (subreq=0x0, opts=<optimized out>) at src/providers/ldap/sdap_async.c:1689 #50 0x00007fe6fba0991b in sdap_get_generic_op_finished (op=<optimized out>, reply=<optimized out>, error=<optimized out>, pvt=<optimized out>) at src/providers/ldap/sdap_async.c:1621 #51 0x00007fe6fba083cd in sdap_process_message (ev=<optimized out>, sh=<optimized out>, msg=0x7fe70e5f9ce0) at src/providers/ldap/sdap_async.c:353 #52 sdap_process_result (ev=<optimized out>, pvt=<optimized out>) at src/providers/ldap/sdap_async.c:197 #53 0x00007fe708664b4f in tevent_common_loop_timer_delay (ev=ev@entry=0x7fe70e4cbc30) at ../tevent_timed.c:341 #54 0x00007fe708665b5a in epoll_event_loop_once (ev=0x7fe70e4cbc30, location=<optimized out>) at ../tevent_epoll.c:911 #55 0x00007fe708664257 in std_event_loop_once (ev=0x7fe70e4cbc30, location=0x7fe70bfee8e7 "src/util/server.c:702") at ../tevent_standard.c:114 #56 0x00007fe70866040d in _tevent_loop_once (ev=ev@entry=0x7fe70e4cbc30, location=location@entry=0x7fe70bfee8e7 "src/util/server.c:702") at ../tevent.c:533 #57 0x00007fe7086605ab in tevent_common_loop_wait (ev=0x7fe70e4cbc30, location=0x7fe70bfee8e7 "src/util/server.c:702") at ../tevent.c:637 #58 0x00007fe7086641f7 in std_event_loop_wait (ev=0x7fe70e4cbc30, location=0x7fe70bfee8e7 "src/util/server.c:702") at ../tevent_standard.c:140 #59 0x00007fe70bfd1993 in server_loop (main_ctx=0x7fe70e4cd080) at src/util/server.c:702 #60 0x00007fe70c84cb82 in main (argc=8, argv=<optimized out>) at src/providers/data_provider_be.c:587
Resolves: https://fedorahosted.org/sssd/ticket/3266
Signed-off-by: Fabiano Fidêncio fidencio@redhat.com """
To pull the PR as Git branch: git remote add ghsssd https://github.com/SSSD/sssd git fetch ghsssd pull/107/head:pr107 git checkout pr107
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
lslebodn commented: """ On (12/12/16 01:20), fidencio wrote:
While debugging rhbz#1396912 some deadlock on sssd_be was noticed[0] and it's been caused by the use of non async-signal-safe functions from the signal_handler (please, see man 7 signal for more info about which are the async-signal-safe functions that can be used).
The removal of those functions we lose some debug messages,
If these messages are important we can write them(static strings) to stderr or other file descriptor directly with write which can be safely called inside signal handler.
which is the least of problems now. Also instead of re-setting the watchdog timeout in case of time-shift (please, see b8ceaeb for more info), let's just re-set the timer's time.
Along with this commit, the function to teardown the watchdog has been removed as it's not used anymore.
Must not be removed.
One of the main usesaces is to disable watchdog for debugging with gdb. It is very usefull. You might add comment there why it should not be removed.
NOTE: I didn't read the patch. Just commenting commit message.
LS
"""
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-266418517
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
fidencio commented: """ On Mon, Dec 12, 2016 at 1:22 PM, lslebodn notifications@github.com wrote:
On (12/12/16 01:20), fidencio wrote:
While debugging rhbz#1396912 some deadlock on sssd_be was noticed[0] and it's been caused by the use of non async-signal-safe functions from the signal_handler (please, see man 7 signal for more info about which are the async-signal-safe functions that can be used).
The removal of those functions we lose some debug messages,
If these messages are important we can write them(static strings) to stderr or other file descriptor directly with write which can be safely called inside signal handler.
I would say it doesn't worth the trouble, to be honest. But, well, other's opinions are welcome!
which is the least of problems now. Also instead of re-setting the watchdog timeout in case of time-shift (please, see b8ceaeb for more info), let's just re-set the timer's time.
Along with this commit, the function to teardown the watchdog has been removed as it's not used anymore.
Must not be removed.
One of the main usesaces is to disable watchdog for debugging with gdb. It is very usefull. You might add comment there why it should not be removed.
Okay.
NOTE: I didn't read the patch. Just commenting commit message.
LS
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/SSSD/sssd/pull/107#issuecomment-266418517, or mute the thread https://github.com/notifications/unsubscribe-auth/AAG4ehvkxOoiIMntatP_ZmDKMyTknDZmks5rHTxqgaJpZM4LKVED .
"""
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-266430843
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
fidencio commented: """ On Mon, Dec 12, 2016 at 1:50 PM, Pavel Březina notifications@github.com wrote:
*@pbrezina* commented on this pull request.
In src/util/util_watchdog.c https://github.com/SSSD/sssd/pull/107#pullrequestreview-12447634:
*
* As restarting the watchdog from the signal_handler may
* lead sssd_be to a deadlock due to the use of non async-
* signal-safe function. let's just (re)set the timer's
* time.
*/
/* As done when setting up the watchdog handler,
* we give 1 second head start to the watchdog event */
its.it_value.tv_sec = watchdog_ctx.interval.tv_sec + 1;
its.it_value.tv_nsec = 0;
its.it_interval.tv_sec = watchdog_ctx.interval.tv_sec;
its.it_interval.tv_nsec = 0;
ret = timer_settime(watchdog_ctx.timerid, 0, &its, NULL);
if (ret == -1) {
_exit(1);
This is not sufficient because we need to reset the tevent timer. I think watchdog itself is actually fired in the right time because the timer clock is adjusted. The problematic timer is the tevent one, which needs to be rescheduled.
Okay. I guess I see what you mean. Even with this timeout being reset the watchdog would end up killing the process due to the tevent timeout, is it? Considering that's the case I'd say there's nothing much that can be done from our side apart from ignoring the ticks on those cases. At least till we have it changed/fixed on tevent's side. And suggestions are welcome!
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/SSSD/sssd/pull/107#pullrequestreview-12447634, or mute the thread https://github.com/notifications/unsubscribe-auth/AAG4ehq9DnuKDBXQwCWLtnxBLhHl90yUks5rHUMRgaJpZM4LKVED .
Thanks for the review! -- Fabiano Fidêncio
"""
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-266431337
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
jhrozek commented: """ So..I wonder how much of a hack this is but if the intent here is to reset the watchdog from a signal handler, could we send a signal to self (`kill(getpid(), SIGHUP)`) with a signal that is handled by tevent and in that tevent-driven signal handler, reset the watchdog? """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-266446340
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
fidencio commented: """ On Mon, Dec 12, 2016 at 3:37 PM, Jakub Hrozek notifications@github.com wrote:
So..I wonder how much of a hack this is but if the intent here is to reset the watchdog from a signal handler, could we send a signal to self (kill(getpid(), SIGHUP)) with a signal that is handled by tevent and in that tevent-driven signal handler, reset the watchdog?
Things are getting more and more hacky. I'm wondering here, what would be the problem to simply `_exit(1)` in case of a timeshift is detected?
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/SSSD/sssd/pull/107#issuecomment-266446340, or mute the thread https://github.com/notifications/unsubscribe-auth/AAG4enqr9bqncI7VM76uw-d4O9L1-INtks5rHVwdgaJpZM4LKVED .
"""
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-266447493
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
jhrozek commented: """ I'm not sure I like the idea of killing a service because the admin runs ntpdate, even if the monitor should restart the service. @simo5 do you have a preference here?
I really wonder why doesn't tevent use monotonic clock itself, should we ask on the samba list? """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-266449652
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
simo5 commented: """ Yes we should ask, I think we really need to try to use monotonic clocks for most tasks. """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-266450977
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
jhrozek commented: """ On Mon, Dec 12, 2016 at 06:55:13AM -0800, Simo Sorce wrote:
Yes we should ask, I think we really need to try to use monotonic clocks for most tasks.
Thank you, @simo5, do you also have a preference for how to reset the watchdog from a 'plain' POSIX signal handler where we can't use signal-unsafe functions? My suggestion was to signal 'self' to get into a tevent-managed signal handler, Fabiano suggested to play it safe and just exist and let monitor restart sssd_be.
"""
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-266458115
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
simo5 commented: """ well you could have a globalk variable for the watchdog and change it from a custom signal handler, but the point of the watchdog is to go thorugh the tevent handler instead so that we are sure the machinery is working and not stuck somwhere. Resetting directly from the singal handler would bypass all processing and therefore render the watchdog useless I guess. """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-266460639
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
jhrozek commented: """ On Mon, Dec 12, 2016 at 07:30:30AM -0800, Simo Sorce wrote:
well you could have a globalk variable for the watchdog and change it from a custom signal handler, but the point of the watchdog is to go thorugh the tevent handler instead so that we are sure the machinery is working and not stuck somwhere. Resetting directly from the singal handler would bypass all processing and therefore render the watchdog useless I guess.
The problem here (as I understand it, Pavel or Fabiano can correct me if I'm wrong) is that the watchdog increases the counter inside a POSIX signal handler, but resets the counter in a tevent timer (to make sure the mainloop is being processed).
Now, if the time drifts, we still are receiving the monotonic SIGRT signals into the POSIX handlers, but because the tevent timer never gets invoked (it's set to be invoked in a time in the future, because the time drifted), we never reset the counter.
We can detect the time has drifted in the POSIX SIGRT handler, the question I'm trying to answer is how should we restart the tevent timer when we receive the SIGRT signal, but we because we are in the POSIX handler, we are quite restriced in what we can do..
"""
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-266462611
URL: https://github.com/SSSD/sssd/pull/107 Author: fidencio Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler Action: synchronized
To pull the PR as Git branch: git remote add ghsssd https://github.com/SSSD/sssd git fetch ghsssd pull/107/head:pr107 git checkout pr107
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
pbrezina commented: """ There are two scenarios: 1) timeshift during system boot -- it is very common to be several hours 2) timeshift due to an ntp update when booted up -- usually only few seconds, not a big deal
The problem with tevent timer is that if we shift backwards the timer remains too far in the future. This applies to all timers, not only for the watchdog. Forward shift is not a problem, it just executes the timers immediately. Resetting the watchdog helps in a way that sssd is not killed, we don't have any capability to reschedule all timed event and we actually can not tell that sssd will be functioning properly (dyndns, sudo refresh, enumeration, domain refresh, even idle timer on socket activation)... all those operations that depends on time() would become unreliable.
I think the best thing to do would be restart the process (although the question is how would this affect the boot up) and patch tevent to deal with timeshift either by using monotonic clock or by detecting them and altering timers accordingly. """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-266700933
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
fidencio commented: """ Pavel,
On Tue, Dec 13, 2016 at 11:20 AM, Pavel Březina notifications@github.com wrote:
There are two scenarios:
- timeshift during system boot -- it is very common to be several
hours 2. timeshift due to an ntp update when booted up -- usually only few seconds, not a big deal
The problem with tevent timer is that if we shift backwards the timer remains too far in the future. This applies to all timers, not only for the watchdog. Forward shift is not a problem, it just executes the timers immediately. Resetting the watchdog helps in a way that sssd is not killed, we don't have any capability to reschedule all timed event and we actually can not tell that sssd will be functioning properly (dyndns, sudo refresh, enumeration, domain refresh, even idle timer on socket activation)... all those operations that depends on time() would become unreliable.
I think the best thing to do would be restart the process (although the question is how would this affect the boot up) and patch tevent to deal with timeshift either by using monotonic clock or by detecting them and altering timers accordingly.
In the latest version of patch I've just called _exit(1) when the timeshift is detected. About patching tevent, I've seen some old discussions happening and it doesn't seem a trivial thing to do. Would the patch, as it is right now, be acceptable and then a work on tevent could be done later (yes, I'd add it to my queue and do it as soon as we have an agreement on doing this)?
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/SSSD/sssd/pull/107#issuecomment-266700933, or mute the thread https://github.com/notifications/unsubscribe-auth/AAG4eiXYbcI_GV_YJeh4guURQkDwkfEZks5rHnF-gaJpZM4LKVED .
"""
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-266702069
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
simo5 commented: """ On Tue, 2016-12-13 at 02:25 -0800, fidencio wrote:
Pavel,
On Tue, Dec 13, 2016 at 11:20 AM, Pavel Březina notifications@github.com wrote:
There are two scenarios:
- timeshift during system boot -- it is very common to be several
hours 2. timeshift due to an ntp update when booted up -- usually only few seconds, not a big deal
The problem with tevent timer is that if we shift backwards the timer remains too far in the future. This applies to all timers, not only for the watchdog. Forward shift is not a problem, it just executes the timers immediately. Resetting the watchdog helps in a way that sssd is not killed, we don't have any capability to reschedule all timed event and we actually can not tell that sssd will be functioning properly (dyndns, sudo refresh, enumeration, domain refresh, even idle timer on socket activation)... all those operations that depends on time() would become unreliable.
I think the best thing to do would be restart the process (although the question is how would this affect the boot up) and patch tevent to deal with timeshift either by using monotonic clock or by detecting them and altering timers accordingly.
In the latest version of patch I've just called _exit(1) when the timeshift is detected. About patching tevent, I've seen some old discussions happening and it doesn't seem a trivial thing to do. Would the patch, as it is right now, be acceptable and then a work on tevent could be done later (yes, I'd add it to my queue and do it as soon as we have an agreement on doing this)?
This is really a blunt tool (calling exit()), but until tevent can be fixed the only other option would be to use some wrapper to keep track of all existing timed events and cancel and restart them all if the clock changes abruptly. The easiest option is to teach tevent about the existence of monotonic clocks and allow to set a default clock type to use as well as overrides for specific timers that want to fire on the actual time not just X seconds in the future.
Simo.
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
jhrozek commented: """ On Tue, Dec 13, 2016 at 02:44:44AM -0800, Simo Sorce wrote:
On Tue, 2016-12-13 at 02:25 -0800, fidencio wrote:
Pavel,
On Tue, Dec 13, 2016 at 11:20 AM, Pavel Březina notifications@github.com wrote:
There are two scenarios:
- timeshift during system boot -- it is very common to be several
hours 2. timeshift due to an ntp update when booted up -- usually only few seconds, not a big deal
The problem with tevent timer is that if we shift backwards the timer remains too far in the future. This applies to all timers, not only for the watchdog. Forward shift is not a problem, it just executes the timers immediately. Resetting the watchdog helps in a way that sssd is not killed, we don't have any capability to reschedule all timed event and we actually can not tell that sssd will be functioning properly (dyndns, sudo refresh, enumeration, domain refresh, even idle timer on socket activation)... all those operations that depends on time() would become unreliable.
I think the best thing to do would be restart the process (although the question is how would this affect the boot up) and patch tevent to deal with timeshift either by using monotonic clock or by detecting them and altering timers accordingly.
In the latest version of patch I've just called _exit(1) when the timeshift is detected. About patching tevent, I've seen some old discussions happening and it doesn't seem a trivial thing to do. Would the patch, as it is right now, be acceptable and then a work on tevent could be done later (yes, I'd add it to my queue and do it as soon as we have an agreement on doing this)?
This is really a blunt tool (calling exit()), but until tevent can be fixed the only other option would be to use some wrapper to keep track of all existing timed events and cancel and restart them all if the clock changes abruptly.
that's why I suggested signaling self to a tevent-driven signal handler from where we can just set up the timer anew.
If there is any other way to 'break out' of the POSIX signal handler into somewhere where we can call tevent/talloc (or in general unsafe calls) I'm all ears.
The easiest option is to teach tevent about the existence of monotonic clocks and allow to set a default clock type to use as well as overrides for specific timers that want to fire on the actual time not just X seconds in the future.
Simo.
-- Simo Sorce * Red Hat, Inc * New York
-- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/SSSD/sssd/pull/107#issuecomment-266706120
"""
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-266745077
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
simo5 commented: """ On Tue, 2016-12-13 at 05:59 -0800, Jakub Hrozek wrote:
On Tue, Dec 13, 2016 at 02:44:44AM -0800, Simo Sorce wrote:
On Tue, 2016-12-13 at 02:25 -0800, fidencio wrote:
Pavel,
On Tue, Dec 13, 2016 at 11:20 AM, Pavel Březina notifications@github.com wrote:
There are two scenarios:
- timeshift during system boot -- it is very common to be several
hours 2. timeshift due to an ntp update when booted up -- usually only few seconds, not a big deal
The problem with tevent timer is that if we shift backwards the timer remains too far in the future. This applies to all timers, not only for the watchdog. Forward shift is not a problem, it just executes the timers immediately. Resetting the watchdog helps in a way that sssd is not killed, we don't have any capability to reschedule all timed event and we actually can not tell that sssd will be functioning properly (dyndns, sudo refresh, enumeration, domain refresh, even idle timer on socket activation)... all those operations that depends on time() would become unreliable.
I think the best thing to do would be restart the process (although the question is how would this affect the boot up) and patch tevent to deal with timeshift either by using monotonic clock or by detecting them and altering timers accordingly.
In the latest version of patch I've just called _exit(1) when the timeshift is detected. About patching tevent, I've seen some old discussions happening and it doesn't seem a trivial thing to do. Would the patch, as it is right now, be acceptable and then a work on tevent could be done later (yes, I'd add it to my queue and do it as soon as we have an agreement on doing this)?
This is really a blunt tool (calling exit()), but until tevent can be fixed the only other option would be to use some wrapper to keep track of all existing timed events and cancel and restart them all if the clock changes abruptly.
that's why I suggested signaling self to a tevent-driven signal handler from where we can just set up the timer anew.
If there is any other way to 'break out' of the POSIX signal handler into somewhere where we can call tevent/talloc (or in general unsafe calls) I'm all ears.
I guess I need to understand better what exactly you want to do to be able to advice on something. I can think of a coulpe of options, none of them particularly elegant :)
Simo.
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
jhrozek commented: """ On Tue, Dec 13, 2016 at 07:06:58AM -0800, Simo Sorce wrote:
On Tue, 2016-12-13 at 05:59 -0800, Jakub Hrozek wrote:
On Tue, Dec 13, 2016 at 02:44:44AM -0800, Simo Sorce wrote:
On Tue, 2016-12-13 at 02:25 -0800, fidencio wrote:
Pavel,
On Tue, Dec 13, 2016 at 11:20 AM, Pavel Březina notifications@github.com wrote:
There are two scenarios:
- timeshift during system boot -- it is very common to be several
hours 2. timeshift due to an ntp update when booted up -- usually only few seconds, not a big deal
The problem with tevent timer is that if we shift backwards the timer remains too far in the future. This applies to all timers, not only for the watchdog. Forward shift is not a problem, it just executes the timers immediately. Resetting the watchdog helps in a way that sssd is not killed, we don't have any capability to reschedule all timed event and we actually can not tell that sssd will be functioning properly (dyndns, sudo refresh, enumeration, domain refresh, even idle timer on socket activation)... all those operations that depends on time() would become unreliable.
I think the best thing to do would be restart the process (although the question is how would this affect the boot up) and patch tevent to deal with timeshift either by using monotonic clock or by detecting them and altering timers accordingly.
In the latest version of patch I've just called _exit(1) when the timeshift is detected. About patching tevent, I've seen some old discussions happening and it doesn't seem a trivial thing to do. Would the patch, as it is right now, be acceptable and then a work on tevent could be done later (yes, I'd add it to my queue and do it as soon as we have an agreement on doing this)?
This is really a blunt tool (calling exit()), but until tevent can be fixed the only other option would be to use some wrapper to keep track of all existing timed events and cancel and restart them all if the clock changes abruptly.
that's why I suggested signaling self to a tevent-driven signal handler from where we can just set up the timer anew.
If there is any other way to 'break out' of the POSIX signal handler into somewhere where we can call tevent/talloc (or in general unsafe calls) I'm all ears.
I guess I need to understand better what exactly you want to do to be able to advice on something. I can think of a coulpe of options, none of them particularly elegant :)
OK, let me try to explain better.
A machine drifts time. Then an SSSD process receives SIGRT in watchdog_handler() and detects the time has drifted, so it avoids increasing the watchdog ticks counter -- this is done in watchdog_detect_timeshift() at the moment.
At that point, in the current master, we call teardown_watchdog() and setup_watchdog() to set a new watchdog (the part that is based on tevent timers). This is unsafe to do in a signal handler because it involves malloc and free among others called from tevent.
What I'm trying to figure out is how to reset the watchdog when I detect in watchdog_detect_timeshift() the time is out of sync and the tevent timer that resets the ticks will not arrive until the sssd process receives enough SIGRT signals to get itself killed.
Does the question make sense now?
"""
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-266778681
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
simo5 commented: """ On Tue, 2016-12-13 at 08:02 -0800, Jakub Hrozek wrote:
On Tue, Dec 13, 2016 at 07:06:58AM -0800, Simo Sorce wrote:
On Tue, 2016-12-13 at 05:59 -0800, Jakub Hrozek wrote:
On Tue, Dec 13, 2016 at 02:44:44AM -0800, Simo Sorce wrote:
On Tue, 2016-12-13 at 02:25 -0800, fidencio wrote:
Pavel,
On Tue, Dec 13, 2016 at 11:20 AM, Pavel Březina notifications@github.com wrote:
There are two scenarios:
- timeshift during system boot -- it is very common to be several
hours 2. timeshift due to an ntp update when booted up -- usually only few seconds, not a big deal
The problem with tevent timer is that if we shift backwards the timer remains too far in the future. This applies to all timers, not only for the watchdog. Forward shift is not a problem, it just executes the timers immediately. Resetting the watchdog helps in a way that sssd is not killed, we don't have any capability to reschedule all timed event and we actually can not tell that sssd will be functioning properly (dyndns, sudo refresh, enumeration, domain refresh, even idle timer on socket activation)... all those operations that depends on time() would become unreliable.
I think the best thing to do would be restart the process (although the question is how would this affect the boot up) and patch tevent to deal with timeshift either by using monotonic clock or by detecting them and altering timers accordingly.
In the latest version of patch I've just called _exit(1) when the timeshift is detected. About patching tevent, I've seen some old discussions happening and it doesn't seem a trivial thing to do. Would the patch, as it is right now, be acceptable and then a work on tevent could be done later (yes, I'd add it to my queue and do it as soon as we have an agreement on doing this)?
This is really a blunt tool (calling exit()), but until tevent can be fixed the only other option would be to use some wrapper to keep track of all existing timed events and cancel and restart them all if the clock changes abruptly.
that's why I suggested signaling self to a tevent-driven signal handler from where we can just set up the timer anew.
If there is any other way to 'break out' of the POSIX signal handler into somewhere where we can call tevent/talloc (or in general unsafe calls) I'm all ears.
I guess I need to understand better what exactly you want to do to be able to advice on something. I can think of a coulpe of options, none of them particularly elegant :)
OK, let me try to explain better.
A machine drifts time. Then an SSSD process receives SIGRT in watchdog_handler() and detects the time has drifted, so it avoids increasing the watchdog ticks counter -- this is done in watchdog_detect_timeshift() at the moment.
At that point, in the current master, we call teardown_watchdog() and setup_watchdog() to set a new watchdog (the part that is based on tevent timers). This is unsafe to do in a signal handler because it involves malloc and free among others called from tevent.
What I'm trying to figure out is how to reset the watchdog when I detect in watchdog_detect_timeshift() the time is out of sync and the tevent timer that resets the ticks will not arrive until the sssd process receives enough SIGRT signals to get itself killed.
Does the question make sense now?
Yes, I see a few ways, a file descriptor (like tevent does for handling signals) used to fire a tevent_fd event that will perform these actions. Or a global variable, that is checked in the main loop at idle times and resets the watchdog, finally a mutex on which a dedicated thread waits, the thread only job is to run the reset if the mutex is released (but then the mutex needs to be re-acquired, probably on the next tick), but then again not sure if the code run in the setup_watchdog() is thread safe.
Simo.
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
jhrozek commented: """ On Tue, Dec 13, 2016 at 08:16:33AM -0800, Simo Sorce wrote:
On Tue, 2016-12-13 at 08:02 -0800, Jakub Hrozek wrote:
On Tue, Dec 13, 2016 at 07:06:58AM -0800, Simo Sorce wrote:
On Tue, 2016-12-13 at 05:59 -0800, Jakub Hrozek wrote:
On Tue, Dec 13, 2016 at 02:44:44AM -0800, Simo Sorce wrote:
On Tue, 2016-12-13 at 02:25 -0800, fidencio wrote:
Pavel,
On Tue, Dec 13, 2016 at 11:20 AM, Pavel Březina notifications@github.com wrote:
> There are two scenarios: > > 1. timeshift during system boot -- it is very common to be several > hours > 2. timeshift due to an ntp update when booted up -- usually only few > seconds, not a big deal > > The problem with tevent timer is that if we shift backwards the timer > remains too far in the future. This applies to all timers, not only for the > watchdog. Forward shift is not a problem, it just executes the timers > immediately. Resetting the watchdog helps in a way that sssd is not killed, > we don't have any capability to reschedule all timed event and we actually > can not tell that sssd will be functioning properly (dyndns, sudo refresh, > enumeration, domain refresh, even idle timer on socket activation)... all > those operations that depends on time() would become unreliable. > > I think the best thing to do would be restart the process (although the > question is how would this affect the boot up) and patch tevent to deal > with timeshift either by using monotonic clock or by detecting them and > altering timers accordingly. >
In the latest version of patch I've just called _exit(1) when the timeshift is detected. About patching tevent, I've seen some old discussions happening and it doesn't seem a trivial thing to do. Would the patch, as it is right now, be acceptable and then a work on tevent could be done later (yes, I'd add it to my queue and do it as soon as we have an agreement on doing this)?
This is really a blunt tool (calling exit()), but until tevent can be fixed the only other option would be to use some wrapper to keep track of all existing timed events and cancel and restart them all if the clock changes abruptly.
that's why I suggested signaling self to a tevent-driven signal handler from where we can just set up the timer anew.
If there is any other way to 'break out' of the POSIX signal handler into somewhere where we can call tevent/talloc (or in general unsafe calls) I'm all ears.
I guess I need to understand better what exactly you want to do to be able to advice on something. I can think of a coulpe of options, none of them particularly elegant :)
OK, let me try to explain better.
A machine drifts time. Then an SSSD process receives SIGRT in watchdog_handler() and detects the time has drifted, so it avoids increasing the watchdog ticks counter -- this is done in watchdog_detect_timeshift() at the moment.
At that point, in the current master, we call teardown_watchdog() and setup_watchdog() to set a new watchdog (the part that is based on tevent timers). This is unsafe to do in a signal handler because it involves malloc and free among others called from tevent.
What I'm trying to figure out is how to reset the watchdog when I detect in watchdog_detect_timeshift() the time is out of sync and the tevent timer that resets the ticks will not arrive until the sssd process receives enough SIGRT signals to get itself killed.
Does the question make sense now?
Yes, I see a few ways, a file descriptor (like tevent does for handling signals) used to fire a tevent_fd event that will perform these actions. Or a global variable, that is checked in the main loop at idle times and resets the watchdog, finally a mutex on which a dedicated thread waits, the thread only job is to run the reset if the mutex is released (but then the mutex needs to be re-acquired, probably on the next tick), but then again not sure if the code run in the setup_watchdog() is thread safe.
Thank you.
Without doing any coding, I like the idea of the fd the best.
"""
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-266785347
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
pbrezina commented: """ Sounds good. """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-267003707
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
Label: +Changes requested
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
jhrozek commented: """ I just set the Changes Requested label so that it's clear to reviewers new patch set is coming up.. """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-267285051
URL: https://github.com/SSSD/sssd/pull/107 Author: fidencio Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler Action: synchronized
To pull the PR as Git branch: git remote add ghsssd https://github.com/SSSD/sssd git fetch ghsssd pull/107/head:pr107 git checkout pr107
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
fidencio commented: """ Patch updated according to Simo's suggestion. Removing the label "Changes requested" as it's done for review. """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-268272046
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
Label: -Changes requested
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
fidencio commented: """ CI: http://sssd-ci.duckdns.org/logs/job/59/63/summary.html Rawhide failure is unrelated. """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-268283999
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
jhrozek commented: """ Hmm, I'm sorry, I think this is my fault for suggesting we move more stuff out of the POSIX signal handler, but I don't think we can remove the watchdog overflow. Moving the watchdog overflow out of the signal handler and into the tevent loop would effectively disable watchdog, because if sssd is really blocked, the tevent fd handler would never be called and the watchdog would never kill the process..
So I think the only part that should be moved is the re-initializing of the watchdog in case we detect a timeshift..
btw I only realized this when I tried to test that watchdog still works. I set a ridiculously short timeout (2) and a breakpoint in sdap_save_user. I only saw SIGRT incoming, but the process was never killed. """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-268495109
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
Label: +Changes requested
URL: https://github.com/SSSD/sssd/pull/107 Author: fidencio Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler Action: synchronized
To pull the PR as Git branch: git remote add ghsssd https://github.com/SSSD/sssd git fetch ghsssd pull/107/head:pr107 git checkout pr107
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
Label: -Changes requested
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
fidencio commented: """ Okay, patch has been updated. """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-268503583
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
fidencio commented: """ Not sure why CentOS CI failed. Internal CI has passed in the most of the supported distros and the failures doesn't seem related to this patch: http://sssd-ci.duckdns.org/logs/job/59/64/summary.html """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-268540889
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
lslebodn commented: """ retest this please """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-269941038
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
fidencio commented: """ On Mon, Jan 2, 2017 at 8:35 AM, lslebodn notifications@github.com wrote:
retest this please
Is this comment for the author of the patch or is this comment for the reviewer of the patch?
"""
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-269977806
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
lslebodn commented: """ On (02/01/17 06:15), fidencio wrote:
On Mon, Jan 2, 2017 at 8:35 AM, lslebodn notifications@github.com wrote:
retest this please
Is this comment for the author of the patch or is this comment for the reviewer of the patch?
It is a command for jenkins plugin to retest a PR.
https://github.com/jenkinsci/ghprb-plugin/blob/master/README.md https://wiki.jenkins-ci.org/display/JENKINS/Github+pull+request+builder+plug...
There seems to be some issues with epel repository and therefore installation of packages failed https://ci.centos.org/job/sssd-CentOS6/260/console
LS
"""
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-269985115
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
Label: +Changes requested
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
fidencio commented: """ Here is the diff between the older version and the one about to be pushed: ``` [ffidenci@cat sssd]$ git diff diff --git a/src/util/util_watchdog.c b/src/util/util_watchdog.c index b4889a1..17954d1 100644 --- a/src/util/util_watchdog.c +++ b/src/util/util_watchdog.c @@ -75,7 +75,7 @@ static void watchdog_handler(int sig)
/* if a pre-defined number of ticks passed by kills itself */ if (__sync_add_and_fetch(&watchdog_ctx.ticks, 1) > WATCHDOG_MAX_TICKS) { - _exit(1); + kill(-getpgrp(), SIGTERM); } }
@@ -127,8 +127,8 @@ static errno_t watchdog_fd_recv_data(int fd) }
static void watchdog_fd_read_handler(struct tevent_context *ev, - struct tevent_fd *fde, - uint16_t flags, + struct tevent_fd *fde, + uint16_t flags, void *data) { errno_t ret;
``` """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-270127095
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
Label: -Changes requested
URL: https://github.com/SSSD/sssd/pull/107 Author: fidencio Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler Action: synchronized
To pull the PR as Git branch: git remote add ghsssd https://github.com/SSSD/sssd git fetch ghsssd pull/107/head:pr107 git checkout pr107
URL: https://github.com/SSSD/sssd/pull/107 Author: fidencio Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler Action: synchronized
To pull the PR as Git branch: git remote add ghsssd https://github.com/SSSD/sssd git fetch ghsssd pull/107/head:pr107 git checkout pr107
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
fidencio commented: """ Argh, and also: ``` diff --git a/src/util/util_watchdog.c b/src/util/util_watchdog.c index 17954d1..77ba705 100644 --- a/src/util/util_watchdog.c +++ b/src/util/util_watchdog.c @@ -50,7 +50,7 @@ static bool watchdog_detect_timeshift(void) if (cur_time < prev_time) { /* Time shift detected. We need to restart watchdog. */ if (write(watchdog_ctx.pipefd[1], "1", 1) != 1) { - _exit(1); + kill(-getpgrp(), SIGTERM); }
return true; ``` """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-270128553
URL: https://github.com/SSSD/sssd/pull/107 Author: fidencio Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler Action: synchronized
To pull the PR as Git branch: git remote add ghsssd https://github.com/SSSD/sssd git fetch ghsssd pull/107/head:pr107 git checkout pr107
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
jhrozek commented: """ hmm, it seems I was wrong and at least with systemd (is that the difference?) when we kill the whole process group also the nss and pam responders (that were started explicitly) are killed.
So I was wrong. Can you please retest if you see the same? But in general I think we can stick to the previous version, sorry about guiding you down the wrong path. """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-270622372
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
fidencio commented: """ Hmm. I really didn't notice it before but you're right. It ends up killing all explicitly started responders on my setup. Sorry for not having it tested properly before submitting this version.
I'll replace it by _exit(1) as it was before and resubmit the patch. """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-270624750
URL: https://github.com/SSSD/sssd/pull/107 Author: fidencio Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler Action: synchronized
To pull the PR as Git branch: git remote add ghsssd https://github.com/SSSD/sssd git fetch ghsssd pull/107/head:pr107 git checkout pr107
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
jhrozek commented: """ ACK """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-270630387
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
Label: +Accepted
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
lslebodn commented: """ On (03/01/17 06:41), fidencio wrote:
Argh, and also:
diff --git a/src/util/util_watchdog.c b/src/util/util_watchdog.c index 17954d1..77ba705 100644 --- a/src/util/util_watchdog.c +++ b/src/util/util_watchdog.c @@ -50,7 +50,7 @@ static bool watchdog_detect_timeshift(void) if (cur_time < prev_time) { /* Time shift detected. We need to restart watchdog. */ if (write(watchdog_ctx.pipefd[1], "1", 1) != 1) { - _exit(1); + kill(-getpgrp(), SIGTERM);
That would work if we call setpgrp() in server_setup or just in sssd_be; because other processes does/should not use fork/exec
You can try following example: ``` #include <sys/types.h> #include <unistd.h> #include <signal.h>
int test_func(int main_sec_sleep) { int ret; pid_t main_pid = getpid(); setpgrp();
printf(" getpid() %d:\n", getpid()); printf(" getpgrp() %d:\n", getpgrp());
for (int i=0; i<3; i++) { ret = fork(); if ( ret == 0 ) { printf(" getpid() %d:\n", getpid()); printf(" getpgrp() %d:\n", getpgrp()); sleep(1000); break; } }
if ( main_pid == getpid()) { sleep(main_sec_sleep); kill(-main_pid, 9); } }
int main(void) { int ret; pid_t main_pid = getpid();
printf("getpid() %d:\n", getpid()); printf("getpgrp() %d:\n", getpgrp());
for (int i=0; i<3; i++) { ret = fork(); if ( ret == 0 ) { test_func(5 + i); break; } }
if ( main_pid == getpid()) { sleep(15); kill(-main_pid, 9); } }
```
As you can see I could use -main_pid() in kill becsaue getpid and getpgrp returned the same value
LS
"""
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-270635169
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
jhrozek commented: """ btw now I'm wondering if the setpgrp should be a separate patch also for stable branches because I guess the bug was present in sssd for quite a long time? """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-270637274
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
lslebodn commented: """ On (05/01/17 04:52), Jakub Hrozek wrote:
btw now I'm wondering if the setpgrp should be a separate patch also for stable branches because I guess the bug was present in sssd for quite a long time?
sure; Should Fabiano prepare new PR or it can be in this one. Because it's related.
LS
"""
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-270639258
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
fidencio commented: """ On Thu, Jan 5, 2017 at 2:03 PM, lslebodn notifications@github.com wrote:
On (05/01/17 04:52), Jakub Hrozek wrote:
btw now I'm wondering if the setpgrp should be a separate patch also for
stable branches because I guess the bug was present in sssd for quite a long time?
sure; Should Fabiano prepare new PR or it can be in this one. Because it's related.
As the patch should come before this one I'd prefer to have both in this very same PR.
Best Regards, -- Fabiano Fidêncio
"""
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-270639914
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
Label: -Accepted
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
Label: +Changes requested
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
lslebodn commented: """ On (05/01/17 04:52), Jakub Hrozek wrote:
btw now I'm wondering if the setpgrp should be a separate patch also for stable branches because I guess the bug was present in sssd for quite a long time?
I kind of misunderstood the question. You meant 1.13 branch.
We call `kill(-getpgrp(), SIGTERM);` only for final cleanup in monitor_quit.
But in case of setpgrp() we would be able to call do following diff and cleanup would be automaticaly done per service ``` - kret = kill(svc->pid, SIGTERM); + kret = kill(-svc->pid, SIGTERM); ```
But patch for 1.13 and for 1.14 might be different.
LS
"""
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-270641029
URL: https://github.com/SSSD/sssd/pull/107 Author: fidencio Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler Action: synchronized
To pull the PR as Git branch: git remote add ghsssd https://github.com/SSSD/sssd git fetch ghsssd pull/107/head:pr107 git checkout pr107
URL: https://github.com/SSSD/sssd/pull/107 Author: fidencio Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler Action: synchronized
To pull the PR as Git branch: git remote add ghsssd https://github.com/SSSD/sssd git fetch ghsssd pull/107/head:pr107 git checkout pr107
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
Label: -Changes requested
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
fidencio commented: """ New version of the series has been pushed and attend both Pavel and Lukáš comments.
As mentioned in the first patch this series will only work with SELinux on Permissive mode and a follow up patch will be needed once SELinux policies get updated on Fedora/EL. """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-271327849
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
lslebodn commented: """ I checked the man page and API on linux and BSD and setpgrp has a different API on these platforms. But there is similar function which do the same and has the same API on both platforms. `setpgid()`.
It would be good if you could change it. It will reduce some platform specific ifdefs. And we needn't change SELinux policy twice from setpgrp -> setpgid.
ATM; there is a comment that we should fail after failure in setpgrp after fixed SElinux policy. But it would cause a problems on older distributions where we do not plan to rebase sssd and therefore SELinux policy would not be updated. Upstream version would not work there.
I think we should just scream to log file and inform users that it can leak some child processes in rare cases. And we should use -getpid() in watchdog handler. It will kill whole process groups if setpgrp/setpgid passed and only the process if it such process group does not exist.
Here is a diff; feel free to rephrase some comments ``` diff --git a/src/util/server.c b/src/util/server.c index 60019a7d3..9d1af6a0a 100644 --- a/src/util/server.c +++ b/src/util/server.c @@ -460,16 +460,16 @@ int server_setup(const char *name, int flags, struct logrotate_ctx *lctx; char *locale; int watchdog_interval; + pid_t my_pid;
- ret = setpgrp(); + my_pid = getpid(); + ret = setpgid(my_pid, my_pid); if (ret != EOK) { ret = errno; DEBUG(SSSDBG_MINOR_FAILURE, - "Failed setting process group: %s[ %d]\n", + "Failed setting process group: %s[ %d]. " + "We might leak processes in case of failure\n", sss_strerror(ret), ret); - - /* Do not abort here till SELinux policies are fixed, - * otherwise SSSD won't start up. */ }
ret = chown_debug_file(NULL, uid, gid); diff --git a/src/util/util_watchdog.c b/src/util/util_watchdog.c index 044e87f6e..68f055f3b 100644 --- a/src/util/util_watchdog.c +++ b/src/util/util_watchdog.c @@ -50,7 +50,12 @@ static void watchdog_detect_timeshift(void) if (cur_time < prev_time) { /* Time shift detected. We need to restart watchdog. */ if (write(watchdog_ctx.pipefd[1], "1", 1) != 1) { - kill(-getpgrp(), SIGTERM); + /* getpid() should return the same value as getpgrp() + * If they are not the same then setpgrp()/setpgid() failed + * and we should not kill whole process group but only current + * process. + */ + kill(-getpid(), SIGTERM); } } } ``` """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-271523487
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
lslebodn commented: """ Are there any objection to the proposed change? """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-271523691
On Tue, Jan 10, 2017 at 10:13 AM, lslebodn sssd-github-notification@fedorahosted.org wrote:
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
lslebodn commented: """ Are there any objection to the proposed change? """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-271523691
sssd-devel mailing list -- sssd-devel@lists.fedorahosted.org To unsubscribe send an email to sssd-devel-leave@lists.fedorahosted.org
I kinda okay with the changes. There are a few things that go against my preference but, sincerely, doesn't worth even arguing about then.
Feel free to squash your patch on mine before pushing ...
Or do you want me to submit a new patchset?
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
Label: +Changes requested
URL: https://github.com/SSSD/sssd/pull/107 Author: fidencio Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler Action: synchronized
To pull the PR as Git branch: git remote add ghsssd https://github.com/SSSD/sssd git fetch ghsssd pull/107/head:pr107 git checkout pr107
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
Label: -Changes requested
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
fidencio commented: """ Patchset updated! """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-271558289
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
pbrezina commented: """ Looks good to me. """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-271574259
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
jhrozek commented: """ To me as well. I tested again the watchdog restart and the timeshift and both cases work fine. """
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-272409701
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
Label: +Accepted
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
lslebodn commented: """ master: * e6a5f8c58539fc31fd81fac89cfc85703b4250ea * 087162b85e191af51637904702813969b35eaadc
sssd-1-14: * 0606a71b698c4acf954ba7284e62acbd0aa5e52d * 442985a7af2262fab57f56c7a8cd40af10081610
LS
"""
See the full comment at https://github.com/SSSD/sssd/pull/107#issuecomment-275086247
URL: https://github.com/SSSD/sssd/pull/107 Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler
Label: +Pushed
URL: https://github.com/SSSD/sssd/pull/107 Author: fidencio Title: #107: WATCHDOG: Avoid non async-signal-safe from the signal_handler Action: closed
To pull the PR as Git branch: git remote add ghsssd https://github.com/SSSD/sssd git fetch ghsssd pull/107/head:pr107 git checkout pr107
sssd-devel@lists.fedorahosted.org