On Wed, Nov 27, 2019 at 9:17 PM Chris Murphy <lists@colorremedies.com> wrote:
On Wed, Nov 27, 2019 at 10:35 AM pmkellly@frontier.com
<pmkellly@frontier.com> wrote:

> https://fedoraproject.org/wiki/User:Tablepc/Draft_testcase_reboot

Why does it need to happen on bare metal only? Any problem discovered
by this test case is sure to need a deep dive whether it happens on
bare metal or in a VM.

I also tend to think that this test case doesn't need to be bare metal exclusive.
 
One thing all my Btrfs effort has taught me is how underestimated
hardware and firmware bugs are in the storage stack (even before
Btrfs, the ZFS folks knew about this, which is part of why ZFS was
invented).

A fair point about VM testing is whether the disk cache mode affects
the outcome. I use unsafe because it's faster and I embrace misery. I
think QA bots are now mostly using unsafe because it's faster too. So
depending on the situation it may be true that certain corruptions are
expected if unsafe is used, but I *think* unsafe is only unsafe in the
event the host crashes or experiences a power failure. I do forced
power-offs of VMs all the time and never lose anything; in the case of
ext4 and XFS, journal replay always makes the file system consistent
again. And journal replay in that example is expected, not a bug.
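For context, a sketch of where that cache mode gets set on a QEMU
guest; this is a config fragment, and the image path is a placeholder:

```shell
# Sketch only: selecting the disk cache mode for a QEMU guest.
# fedora.qcow2 is a placeholder image path.
qemu-system-x86_64 -m 2048 \
  -drive file=fedora.qcow2,format=qcow2,if=virtio,cache=unsafe

# With libvirt, the same setting lives on the disk's <driver> element
# in the domain XML, e.g. <driver name='qemu' type='qcow2' cache='unsafe'/>.
# cache=unsafe ignores guest flush requests, so data is only at risk if
# the *host* crashes or loses power, matching the behavior described above.
```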

By "that example", do you mean the story you just described, or the "bad result example" from the test case? Because in that test case example, if the machine was correctly powered off/rebooted, there should be no reason to replay the journal or see dirty bits.
 

How to test, step 2 and 3:
This only applies to FAT and ext4. XFS and Btrfs have no boot-time
fsck; both depend on log replay after an unclean shutdown. Also, there are
error messages for common unclean shutdowns, and error messages for
uncommon problems. I think we only care about the former, correct?
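On the ext4 side at least, the clean/dirty state is recorded in the
superblock, so it can be read directly; a sketch of what that check
could look like (the device path is a placeholder, and the field
parsing is an assumption about the `dumpe2fs -h` header format):

```shell
# parse_state: pull the "Filesystem state" field out of dumpe2fs header
# output ("clean", "clean with errors", or "not clean").
parse_state() {
  awk -F': *' '/^Filesystem state/ {print $2}'
}

# Hypothetical real usage (requires root; /dev/vda2 is a placeholder):
#   sudo dumpe2fs -h /dev/vda2 | parse_state
# For FAT, `fsck.fat -n <device>` similarly reports a set dirty bit.

# Demonstration on canned dumpe2fs-style output:
printf 'Filesystem volume name:   fedora\nFilesystem state:         clean\n' | parse_state
# prints: clean
```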

I believe so. Is there a tool that could tell us whether the currently mounted drives were mounted cleanly, or whether some error correction had to be performed? Because this is quickly getting into territory where we will need to provide a large number of examples and rely on people parsing the output, comparing, and (mis)judging it. The wording of the journal output can also change at any time. And I don't really like that.
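To make the concern concrete, the parsing we'd otherwise rely on might
look something like this; the kernel message strings below
("recovering journal", etc.) are assumptions based on typical ext4/XFS
dmesg output, and exactly the kind of wording that can silently change
between releases:

```shell
# check_log: scan a saved boot log for messages suggesting the previous
# shutdown was unclean. The grep patterns are assumptions, not a stable API.
check_log() {
  grep -Ei 'recovering journal|starting recovery|dirty bit' "$1" \
    && echo "unclean mount detected" \
    || echo "no recovery messages found"
}

# Real usage would be roughly:
#   journalctl -kb > boot.log && check_log boot.log

printf 'EXT4-fs (vda2): mounted filesystem with ordered data mode\n' > /tmp/clean.log
check_log /tmp/clean.log
# prints: no recovery messages found
```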
 

Steps 4-7: I'm not following the purpose of these steps. What I'd like
to see for step 4 is: if we get a bad result (any of the result 2
messages), we need to collect the journal for the prior boot (`sudo
journalctl -b-1 > journal.log`) and attach it to a bug report; or we
could maybe parse for systemd messages suggesting it didn't get
everything unmounted. But offhand I don't know what those messages
would be; I'd have to go dig into the systemd code to find them.

I think the purpose is to verify that both reboot and poweroff shut down the system correctly without any filesystem issues (which means fully committed journals and no dirty bits set).

If we can make all of this easy to test, then we can automate it, and it's a worthwhile effort. If it's mostly guesswork, I wonder whether we really need such a test case. Yes, it is something that can go wrong (as everything can), but it's not very likely (and if it does, affected people will probably notice soon and file bugs). And we need to be picky when adding test cases, because we can't test anything and everything; we need to focus on the most important or problematic bits.