Systemd maintains a default shutdown timeout, previously set to 120 secs and now changed
to 45 secs (if I remember correctly).
If a process does not terminate completely during shutdown within that time, systemd
kills the process forcefully.
That means:
Case 1: Everything works normally and as expected
=================================================
All processes terminate as quickly as is safely possible, and the system shutdown
completes in a correspondingly short time frame. Systemd never actually waits the full x
secs; the timeout never takes effect and is superfluous.
If a process has overridden the default timeout, systemd waits that long before it kills
the process. Processes that did not override the default get killed after the default 120
secs (or now 45 secs). 120 secs is pretty long, so as long as everything works normally,
nothing bad should happen.
Either way, the system comes down as fast as possible; no shorter time is achievable.
The default timeout does not matter!
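For illustration, this is roughly what a per-service override of the default looks like:
a drop-in file for a (hypothetical) service, giving systemd a service-specific stop
timeout instead of the global default. The service name "example.service" and the 10 min
value are made up for the example.

```ini
# /etc/systemd/system/example.service.d/override.conf
# (can be created with: systemctl edit example.service)
[Service]
# This service knows it may need a long, orderly shutdown (e.g. flushing
# data to disk), so it requests its own stop timeout instead of relying
# on the distribution-wide DefaultTimeoutStopSec.
TimeoutStopSec=10min
```

After `systemctl daemon-reload`, systemd waits up to this per-unit value before sending
SIGKILL to that service; all units without such an override still fall under the global
default discussed above.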
Case 2: Some processes "hang" due to an error and never terminate
=================================================================
This is an unrecoverable error, and termination must be forced, either by systemd (i.e.
by forcefully killing the process) or by other means. The question is how you determine
whether a process is "hanging".
A default timeout value, if *properly* determined, may be useful here. A wildly guessed
value is more harmful than beneficial.
Case 3: Some processes take unexpectedly longer to terminate
============================================================
As long as a program is still working, it is not wise to cancel it: the risk of
significant damage (data loss) is too high.
To ensure that a default timeout value does not do more harm than good, it must be chosen
with sufficient generosity.
In this case, a timeout value is at best ineffective and at worst harmful.
In summary:
===========
(a) A short default timeout brings no advantage to a functioning system. On a slowed-down
or otherwise impaired system, a short timeout value carries a high risk of damage.
(b) Even if some process(es) correctly request a longer timeout, other processes that do
not, but rely on reasonable behavior of the overall system, can still be terminated
abruptly, prematurely, and with damage.
The situation for servers and workstations differs.
On the one hand: with a workstation, it is your own data that may no longer be usable;
with a server, it is usually the data of third parties, and the responsibility weighs
much heavier.
On the other hand, a server under heavy load reacts far more unpredictably than a
workstation, so the distinction between case 2 and case 3 above is much harder to make.
Colin Walters (CoreOS) points this out with much more detailed technical knowledge than I
have (https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.o...).
However, the motto must be: if in doubt, wait a bit longer to see whether the system
shuts down safely, rather than forcing an end prematurely and risking damage.
Accordingly, CoreOS has (responsibly in my view) decided to retain the previous timeout
value and override the change.
Proposal: Server should do the same.