*Why*
The recent spice-vdagent update causes all virtual machines to take 90 seconds longer on every shutdown/reboot: https://bugzilla.redhat.com/show_bug.cgi?id=1813667 The service hangs when systemd tries to stop it, and systemd then kills it after a 90 second timeout expires.
This is a recurring pattern: I have seen services blocking shutdown/reboot in the past, and so far we haven't been able to do anything about it from a blocker perspective. I think that for cases where the problem occurs very frequently or every time, we should have a way to block the release until it's fixed. I find it a very poor experience to wait 90+ seconds for machine reboot/shutdown. Much poorer than, say, a crashing desktop application (which we block on), because that application can be replaced with a different one. System services mostly can't be replaced, and certainly not by a general user.
*Proposal*
So I propose to amend the "System services" criterion [1]:
```
All system services present after installation with one of the release-blocking package sets must start properly, unless they require hardware which is not present.
```
with something like this:
```
All system services present after installation with one of the release-blocking package sets must start properly, unless they require hardware which is not present.
*All system services present after installation with one of the release-blocking package sets must not time out frequently or regularly when they are being stopped during system reboot/shutdown.*
```
The way it is written, the mentioned bug would be a conditional violation of that criterion (applies only to VMs) and we'd need to use our judgement to determine whether it's a blocker.
Thoughts?
[1] https://fedoraproject.org/wiki/Fedora_32_Final_Release_Criteria#System_servi...
On Tue, Mar 17, 2020 at 4:48 AM Kamil Paral kparal@redhat.com wrote:
All system services present after installation with one of the release-blocking package sets must not time out frequently or regularly when they are being stopped during system reboot/shutdown.
I like it generally, but I worry we'll get hung up on the definition of "frequently" or "regularly". For shutdown in particular, I'm less concerned with service timeouts (granted, I'm not paying for my compute by the minute). I'm leaning toward something like "predictably" or "reliably" (which is an awkward use of the word). Basically, it's not a blocker unless it does it every time.
On Tue, Mar 17, 2020 at 2:10 PM Ben Cotton bcotton@redhat.com wrote:
On Tue, Mar 17, 2020 at 4:48 AM Kamil Paral kparal@redhat.com wrote:
All system services present after installation with one of the release-blocking package sets must not time out frequently or regularly when they are being stopped during system reboot/shutdown.
I like it generally, but I worry we'll get hung up on the definition of "frequently" or "regularly". For shutdown in particular, I'm less concerned with service timeouts (granted, I'm not paying for my compute by the minute). I'm leaning toward something like "predictably" or "reliably" (which is an awkward use of the word). Basically, it's not a blocker unless it does it every time.
Yeah, I'd prefer something like: *All system services present after installation with one of the release-blocking package sets must not time out every time when they are being stopped during system reboot/shutdown.*
I like the general proposal, just different wording seems better. If timeout doesn't happen regularly, I won't block on it.
On Tue, Mar 17, 2020, 19:10 Frantisek Zatloukal fzatlouk@redhat.com wrote:
On Tue, Mar 17, 2020 at 2:10 PM Ben Cotton bcotton@redhat.com wrote:
On Tue, Mar 17, 2020 at 4:48 AM Kamil Paral kparal@redhat.com wrote:
All system services present after installation with one of the release-blocking package sets must not time out frequently or regularly when they are being stopped during system reboot/shutdown.
I like it generally, but I worry we'll get hung up on the definition of "frequently" or "regularly". For shutdown in particular, I'm less concerned with service timeouts (granted, I'm not paying for my compute by the minute). I'm leaning toward something like "predictably" or "reliably" (which is an awkward use of the word). Basically, it's not a blocker unless it does it every time.
Yeah, I'd prefer something like: *All system services present after installation with one of the release-blocking package sets must not time out every time when they are being stopped during system reboot/shutdown.*
I like the general proposal, just different wording seems better. If timeout doesn't happen regularly, I won't block on it.
I like the wording on this better. And I am very much in favor of the proposal.
test mailing list -- test@lists.fedoraproject.org To unsubscribe send an email to test-leave@lists.fedoraproject.org Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/ List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines List Archives: https://lists.fedoraproject.org/archives/list/test@lists.fedoraproject.org
On Tue, Mar 17, 2020 at 2:40 PM Frantisek Zatloukal fzatlouk@redhat.com wrote:
Yeah, I'd prefer something like: *All system services present after installation with one of the release-blocking package sets must not time out every time when they are being stopped during system reboot/shutdown.*
It's possible but I find it too weak. See my reply to Ben, thanks.
I like the general proposal, just different wording seems better. If timeout doesn't happen regularly, I won't block on it.
"Every time" is different from "regularly".
On Tue, Mar 17, 2020, 15:34 Kamil Paral kparal@redhat.com wrote:
On Tue, Mar 17, 2020 at 2:40 PM Frantisek Zatloukal fzatlouk@redhat.com wrote:
Yeah, I'd prefer something like: *All system services present after installation with one of the release-blocking package sets must not time out every time when they are being stopped during system reboot/shutdown.*
It's possible but I find it too weak. See my reply to Ben, thanks.
I like the general proposal, just different wording seems better. If timeout doesn't happen regularly, I won't block on it.
"Every time" is different from "regularly".
Doesn't need to be. "Every time" implies "regularly", "regularly" can mean "every time".
I get your point. For the record, I am for blocking on this only if it happens every time.
On Tue, Mar 17, 2020 at 2:10 PM Ben Cotton bcotton@redhat.com wrote:
On Tue, Mar 17, 2020 at 4:48 AM Kamil Paral kparal@redhat.com wrote:
All system services present after installation with one of the release-blocking package sets must not time out frequently or regularly when they are being stopped during system reboot/shutdown.
I like it generally, but I worry we'll get hung up on the definition of "frequently" or "regularly". For shutdown in particular, I'm less concerned with service timeouts (granted, I'm not paying for my compute by the minute).
The problem is not in the timeout itself, but when you need to wait for it. If I do 30 system reboots a day, because I test stuff, a 90 second timeout is suddenly a lot of time :) As a regular user, when I want to shut down the computer and leave, and the system just doesn't, and I need to wait 2 minutes before it does, I get annoyed. Or when I want to reboot to a different operating system, etc. It's those cases where you wait for it, which matter.
I'm leaning toward something like "predictably" or "reliably" (which is an awkward use of the word).
I'm all for finding better words. "Reliably" might be confusing :-) I want to cover scenarios where it occurs very frequently (e.g. in 70% of shutdowns), or deterministically (e.g. for all VMs, but not on bare metal), and discuss those. I don't think we can come up with a clear definition here that would let us specify exactly when to block and when not to.
As an example of problems that could be found and discussed, I have a suspicion that PackageKit hangs if you try to reboot the machine too early after it starts up. Likely not a blocker, because it's not a very frequent use case, but it could be discussed. Or imagine PackageKit hanging every time you perform some particular operation, like a package install. That could be a more viable candidate for a blocker.
Basically, it's not a blocker unless it does it every time.
This is probably the only exact scenario where we could make it a clear blocker, no discussion needed. But it also means no discussion possible. If we say "every time", as Frantisek proposed as well, we'll likely use this criterion once in a few years, and certainly not now against spice-vdagent. Because it doesn't time out every time, it times out every time just on VMs. That is a condition, i.e. it is not "every time".
We can have such a criterion for sure, I just think it's not that useful. I'd rather have one that gives us some leeway to discuss and decide how important the issue is. Which means including "frequently", "regularly", "predictably" or something along those lines.
I think failure to shut down promptly should not be a "blocker" event.
Is shutdown in three minutes instead of three seconds an aggravation for the user who wants to use a new kernel or another operating system? Yes.
Does unreasonable time to shut down indicate lack of quality in Fedora? Of course.
However, to call this a blocker looks like an attempt by QA to extort developers to address a defect that has little impact on usability.
If you do not want to wait so long for shutdown, pull the plug, or learn how to configure what services run and how long they wait when told to stop.
This is not a true blocker. Real blockers must have greater impact: problems like dnf unable to update a system, faults that prevent network access, or inability to start a graphical desktop.
Richard Ryniker composed on 2020-03-17 14:33 (UTC-0400):
Is shutdown in three minutes instead of three seconds an aggravation for the user who wants to use a new kernel or another operating system? Yes.
When the UPS says the battery is exhausted, must shutdown now, it's more than just an aggravation.
On Tue, Mar 17, 2020 at 12:41 PM Felix Miata mrmazda@earthlink.net wrote:
Richard Ryniker composed on 2020-03-17 14:33 (UTC-0400):
Is shutdown in three minutes instead of three seconds an aggravation for the user who wants to use a new kernel or another operating system? Yes.
When the UPS says the battery is exhausted, must shutdown now, it's more than just an aggravation.
Yep, good point. I'm not familiar with systemd, upowerd, and UPS integration and policies - but for sure there are use cases where heavy write workloads can so thoroughly dirty a filesystem that it can take many minutes to flush to stable media. But I expect that the design of this system accounts for the peak usage time to stop all processes cleanly. You don't really want to lose that data. But at a certain point, you'd want to SIGKILL everything anyway, so that there's time for sync() to return. The systemd reboot/shutdown does umount or remount read-only; both fully flush all file system data and metadata to stable media. But this might take longer than just sync() so I think in such cases you really want sync() to succeed, and the shutdown is nice to have.
When the UPS says the battery is exhausted, must shutdown now, it's more than just an aggravation.
It should not be. Worst case should be similar to abrupt power failure. (Real case: I pull the wrong plug out of my power strip, disconnecting my server instead of the device I intended to move.) Reboot the server, it recovers filesystems from their journals in a few seconds, and life is good again.
If this does not work, then there are more serious problems than time-outs during shutdown, and these problems may indeed be blocker candidates.
Your UPS should be able to give an earlier warning than "Battery is exhausted." but there is no completely adequate advance warning.
For critical activities, one provides a UPS that can say "Power remains for N seconds." where N is sufficient for application-specific shutdown (not system shutdown, though that may also occur). Even this is not proof against catastrophe and user error, but greater durability requires multiple systems, automatic fall-back, and more exotic strategies.
Personally, I should be very happy to have faster shutdown and boot times. I just feel these are quality, not blocker issues.
On Mar 17, 2020 at 09:46:53 +0100, Kamil Paral wrote:
*Why*
The recent spice-vdagent update causes all virtual machines to take 90 seconds longer on every shutdown/reboot: https://bugzilla.redhat.com/show_bug.cgi?id=1813667 The service hangs when systemd tries to stop it, and systemd then kills it after a 90 second timeout expires.
This is a recurring pattern, I saw services blocking shutdown/reboot in the past, and so far we haven't been able to do anything about it from a blocker perspective. I think that for cases where the problem occurs very frequently or every time, we should have a way to block the release until it's fixed. I find it a very poor experience to wait 90+ seconds for machine reboot/shutdown. Much poorer than, say, a crashing desktop application (which we block on), because that application can be replaced with a different one. System services mostly can't be replaced, and certainly not by a general user.
*Proposal*
So I propose to amend the "System services" criterion [1]:
All system services present after installation with one of the release-blocking package sets must start properly, unless they require hardware which is not present.
with something like this:
All system services present after installation with one of the release-blocking package sets must start properly, unless they require hardware which is not present. *All system services present after installation with one of the release-blocking package sets must not time out frequently or regularly when they are being stopped during system reboot/shutdown.*
The way it is written, the mentioned bug would be a conditional violation of that criterion (applies only to VMs) and we'd need to use our judgement to determine whether it's a blocker.
Thoughts?
[1] https://fedoraproject.org/wiki/Fedora_32_Final_Release_Criteria#System_servi...
I can't recall, but, is podman included in a release blocking package set?
I ask because it already fails to properly exit and is eventually timed out. Podman fails this criterion for F31 (and presumably F32). That said, I'm all for this criterion.
-- Thank you, Mairi Dulaney.
On Tue, Mar 17, 2020 at 4:49 PM Dulaney jdulaney@gnu.org wrote:
I can't recall, but, is podman included in a release blocking package set?
I ask because it already fails to properly exit and is eventually timed out. Podman fails this criterion for F31 (and presumably F32). That said, I'm all for this criterion.
This is only about services included in the default installation. As far as I know, podman isn't a service, so this change wouldn't affect it at all.
Also, I don't think podman is included in a default package set anywhere, but I could be wrong here.
On Tue, Mar 17, 2020 at 2:47 AM Kamil Paral kparal@redhat.com wrote:
I find it a very poor experience to wait 90+ seconds for machine reboot/shutdown. Much poorer than, say, a crashing desktop application (which we block on), because that application can be replaced with a different one. System services mostly can't be replaced, and certainly not by a general user.
I agree. I always force power off in this case, a) in the context of desktop reboots I'm very strongly biased toward DO IT NOW! ; b) I love torture testing file systems.
All system services present after installation with one of the release-blocking package sets must not time out frequently or regularly when they are being stopped during system reboot/shutdown.
Alternate 1: 'must not time out when'
Alternate 2: 'must not consistently time out when'
I think blocking on transient unit timeouts may not be practical even if desirable. But there remains a problem with all of these: what if the timeout is 90s and the unit consistently takes 80s to stop? I think that's no different than 90s and systemd just killing it off. Some units have 5 minute or even indefinite timeouts set. If the criterion is hinged on the timeout being reached, then it may often not be a blocker even if that's the intent.
Alternate 3: 'must not consistently hang for more than 30s when'
I might get on board with 10s being the max.
There could be differences of opinion for different editions. There are Server use cases where you do not want to just start killing off processes. TimeoutStopSec= should probably be honored in those cases; as well as units that properly send EXTEND_TIMEOUT_USEC=… which shouldn't be second guessed. If the system is hanging, the user should become suspicious and investigate, rather than pull the plug.
On a desktop system? I'm hard pressed to think of why terminate should not be instant. Give it a second and then kill it.
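For illustration, here is a minimal sketch of how a service could ask systemd for more stop time with `EXTEND_TIMEOUT_USEC=`, as mentioned above. It assumes the standard `sd_notify` datagram protocol over the socket named in `$NOTIFY_SOCKET`, and that the unit's `NotifyAccess=` setting lets systemd accept the message. The function names are hypothetical and not taken from any Fedora service.

```python
import os
import socket


def build_extend_timeout_message(extra_seconds: float) -> bytes:
    """Build the notification payload asking systemd for more time."""
    if extra_seconds <= 0:
        raise ValueError("extra_seconds must be positive")
    usec = int(extra_seconds * 1_000_000)  # systemd expects microseconds
    return f"EXTEND_TIMEOUT_USEC={usec}".encode("ascii")


def notify_extend_timeout(extra_seconds: float) -> bool:
    """Send the payload to $NOTIFY_SOCKET.

    Returns False when the variable is unset, i.e. the process was not
    started by a service manager that listens for notifications.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # Abstract-namespace socket: the leading '@' stands for a NUL byte.
        address = b"\0" + path[1:].encode()
    else:
        address = path.encode()
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(build_extend_timeout_message(extra_seconds), address)
    return True
```

A service that is still doing real work during stop (for example, saving state) would call `notify_extend_timeout(30)` periodically. A service that is merely hung sends nothing, so its normal stop timeout still applies.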
-- Chris Murphy
On Tue, Mar 17, 2020 at 7:58 PM Chris Murphy lists@colorremedies.com wrote:
Alternate 1: 'must not time out when'
Alternate 2: 'must not consistently time out when'
I think blocking on transient unit timeouts may not be practical even if desirable. But there remains a problem with all of these: what if the timeout is 90s and the unit consistently takes 80s to stop? I think that's no different than 90s and systemd just killing it off. Some units have 5 minute or even indefinite timeouts set. If the criterion is hinged on the timeout being reached, then it may often not be a blocker even if that's the intent.
That was a tradeoff I was willing to make...
Alternate 3: 'must not consistently hang for more than 30s when'
...but I like this one even better. The word *hang* implies that it's not doing anything. So if e.g. libvirt-guests.service takes 45 seconds to save the state of all your VMs, that's not hanging (and it will not occur when you have no VMs running, so this is easy to distinguish). If the filesystem needs to be synced, that's not hanging (and if you sync before reboot, it will not happen, easy to distinguish). The word *consistently* is also important, it will not apply to e.g. cups.service or packagekit.service acting up once per 30 reboots. But if cups.service hangs each time you print a document, that's consistent, and now we need to determine whether it's serious enough to block the release. This was the reason why I also included *frequently* in the original proposal - I think there should be a judgement call when it doesn't occur every single time. If it is consistent but infrequent (e.g. each time you print), it might not be a good idea to block on it. Although, I guess I should've used *and* instead of *or*, i.e. "frequently and consistently".
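The difference between "consistently" and "frequently" can be sketched in code. This is only an illustration under stated assumptions: the `StopSample` type, the `summarize` helper, and the sample data are hypothetical, and the 30-second threshold is the figure from Alternate 3. A duration alone cannot tell a hang from real work (such as saving VM state), so that judgement would still be made by people.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

HANG_THRESHOLD_S = 30.0  # the "more than 30s" figure from Alternate 3


@dataclass(frozen=True)
class StopSample:
    """One observed shutdown: how long the unit took to stop, and whether
    the suspected trigger (e.g. "running in a VM") was present."""
    seconds: float
    trigger_present: bool


def summarize(samples: Iterable[StopSample],
              threshold: float = HANG_THRESHOLD_S) -> Tuple[bool, float]:
    """Return (consistent, frequency).

    consistent: every shutdown with the trigger present exceeded the threshold.
    frequency:  share of all observed shutdowns that exceeded the threshold.
    """
    samples = list(samples)
    if not samples:
        return False, 0.0
    triggered = [s for s in samples if s.trigger_present]
    slow = [s for s in samples if s.seconds > threshold]
    consistent = bool(triggered) and all(s.seconds > threshold for s in triggered)
    return consistent, len(slow) / len(samples)


# Hypothetical data: a unit that stalls for 90 s whenever the trigger is present.
observed = [StopSample(90.0, True), StopSample(1.2, False),
            StopSample(90.0, True), StopSample(0.8, False)]
consistent, frequency = summarize(observed)
```

Because the check uses a fixed threshold rather than the unit's own timeout, a unit that takes 80 seconds against a 90-second timeout is flagged just like one that systemd has to kill.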
On Wed, Mar 18, 2020 at 2:08 AM Kamil Paral kparal@redhat.com wrote:
On Tue, Mar 17, 2020 at 7:58 PM Chris Murphy lists@colorremedies.com wrote:
Alternate 1: 'must not time out when'
Alternate 2: 'must not consistently time out when'
I think blocking on transient unit timeouts may not be practical even if desirable. But there remains a problem with all of these: what if the timeout is 90s and the unit consistently takes 80s to stop? I think that's no different than 90s and systemd just killing it off. Some units have 5 minute or even indefinite timeouts set. If the criterion is hinged on the timeout being reached, then it may often not be a blocker even if that's the intent.
That was a tradeoff I was willing to make...
Alternate 3: 'must not consistently hang for more than 30s when'
...but I like this one even better. The word hang implies that it's not doing anything. So if e.g. libvirt-guests.service takes 45 seconds to save the state of all your VMs, that's not hanging (and it will not occur when you have no VMs running, so this is easy to distinguish). If the filesystem needs to be synced, that's not hanging (and if you sync before reboot, it will not happen, easy to distinguish). The word consistently is also important, it will not apply to e.g. cups.service or packagekit.service acting up once per 30 reboots. But if cups.service hangs each time you print a document, that's consistent, and now we need to determine whether it's serious enough to block the release. This was the reason why I also included frequently in the original proposal - I think there should be a judgement call when it doesn't occur every single time. If it is consistent but infrequent (e.g. each time you print), it might not be a good idea to block on it. Although, I guess I should've used and instead of or, i.e. "frequently and consistently".
I have no objection to using frequently. A frequent occurrence happens less often than a consistent occurrence, so there may be more wiggle room using frequently.
A long delay on a shutdown or restart must be a blocker criterion. It's an excuse for folks to say "Fedora is slow". The pundits will say it's broken. Also it's very annoying especially when you're working on software or doing testing.
Please don't let this go undone because of debates over verbiage. It would be better to get this approved and tweak the words later than to let this go.
On Mon, Feb 22, 2021 at 1:45 PM Pat Kelly pmkellly@frontier.com wrote:
A long delay on a shutdown or restart must be a blocker criterion. It's an excuse for folks to say "Fedora is slow". The pundits will say it's broken. Also it's very annoying especially when you're working on software or doing testing.
Please don't let this go undone because of debates over verbiage. It would be better to get this approved and tweak the words later than to let this go.
Folks who don't have this mailing list's 2020 archive might see this as a new email. Just to clarify, Pat replied to my proposal from last year, which you can read in full here: https://lists.fedoraproject.org/archives/list/test@lists.fedoraproject.org/t...