On 2021-07-28 1:12 a.m., Chris Murphy wrote:
On Tue, Jul 27, 2021 at 3:36 PM old sixpack13
<sixpack13(a)online.de> wrote:
...
>> is your GPU from intel ?
>> if so:
>> - I get it too, sometimes while browsing with FF.
>> - Ctrl+Alt+F3 to get a console (?) and do dmesg => ...GPU Crash dump ... GPU
>> hang...
>>
>> +++ EDIT +++
>> I should have read the first thread again: it's an Intel GPU.
>>
>> anyway, after Ctrl+Alt+F3 you should be able to do
>> "sync && sync && sudo systemctl reboot"
>>
>> saves the headache about a possible (?) btrfs filesystem corruption when doing a
>> "hardcore power off"
>> IIRC, a btrfs scrub ... afterwards could help
>>
>> There shouldn't be such a thing as file system corruption following
>> forced power off. It's sufficiently well tested on ext4, xfs, and
>> btrfs that if there's corruption, it's almost certainly a drive
>> firmware bug getting write order wrong by not honoring flush/FUA when
>> it should.
>>
>> Btrfs has a bit of an advantage in these cases because it's got a
>> pretty simple update order:
>>
>> data + metadata -> flush/FUA -> superblock -> flush/FUA
>>
>> So in theory, the superblock only points to trees that are definitely
>> valid. All changes, data and metadata get written into free space
>> (copy-on-write, no overwrites), and therefore the worst case is data
>> being written is simply lost during a crash because a superblock
>> update didn't happen before the crash. A superblock that points to
>> bad/stale/missing trees means a new superblock made it to disk before
>> the metadata did, and that metadata was lost. That's a firmware bug. We
>> know that because an astronomical number of tests are run on all the
>> file systems, including btrfs, using xfstests. And a number of those
>> tests use dm-log-writes, which expressly tests for proper write
>> ordering by the file system.
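
A user-space analogy for the ordering described above, as a minimal sketch: write the new data somewhere new, flush it, and only then switch the pointer. This is not how btrfs is implemented internally; commit_file is a made-up helper name, and it assumes a sync(1) that accepts file arguments (GNU coreutils 8.24 or later).

```shell
# Commit stdin to the file named in $1, in "data first, pointer last" order.
commit_file() {
    cf_target="$1"
    cf_tmp="$cf_target.new.$$"
    # 1. Write the new data into a new file, never over the old copy.
    cat > "$cf_tmp" || return 1
    # 2. Flush the new data before anything points at it.
    sync "$cf_tmp" || return 1
    # 3. Switch the pointer; rename is atomic within one file system.
    mv -f "$cf_tmp" "$cf_target" || return 1
    # 4. Flush the directory so the rename itself is durable.
    sync "$(dirname "$cf_target")"
}
```

Usage would be something like: printf 'new contents\n' | commit_file /path/to/file. If the machine dies before step 3, the old file is still intact and only the new data is lost, which mirrors the worst case described for btrfs.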
>>
>> Even in case of such a firmware bug, Btrfs can sometimes recover by
>> mounting with:
>>
>> mount -o usebackuproot
>> mount -o rescue=usebackuproot
>>
>> (same thing)
>>
>> This picks an older root to mount instead of the one the super says
>> should be the most recent. But this still implies the drive firmware
>> did something wrong.
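
For the archives, a minimal sketch of how one might try this without writing anything to the damaged file system. The device /dev/sdX2 and the mount point /mnt are placeholders, and the run wrapper is a made-up helper that only prints the command unless RUN=1 is set. As far as I know the rescue= spelling needs kernel 5.9 or newer.

```shell
DEV="${DEV:-/dev/sdX2}"   # placeholder device, substitute your own
MNT="${MNT:-/mnt}"        # placeholder mount point

# Print the command by default; set RUN=1 to really execute it (needs root).
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

try_backup_root() {
    # Mount read-only so nothing is written while inspecting the damage.
    run mount -o ro,rescue=usebackuproot "$DEV" "$MNT"
}
```

Calling try_backup_root with RUN unset just shows the mount command, which is handy for checking the device name before doing it for real.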
>>
>> btrfs scrub checks integrity: it compares the information in data
>> and metadata blocks with the checksum for each block; this can only be
>> done with the file system mounted.
>>
>> btrfs check checks the consistency of the file system. It's a
>> metadata-only check, but it verifies not just that the checksum
>> matches but that the metadata itself is correct; the file system needs
>> to be unmounted.
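
To keep the two straight, a minimal sketch of both invocations. The mount point and device are placeholders, and run is a made-up dry-run wrapper that prints the command unless RUN=1 is set.

```shell
MNT="${MNT:-/mnt}"        # placeholder: where the file system is mounted
DEV="${DEV:-/dev/sdX2}"   # placeholder: the underlying block device

# Print the command by default; set RUN=1 to really execute it (needs root).
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

scrub_mounted() {
    # Scrub runs against a mounted file system.
    # -B stays in the foreground and prints statistics when finished.
    run btrfs scrub start -B "$MNT"
}

check_unmounted() {
    # Check runs against the unmounted device.
    # --readonly is stated explicitly so no repair is attempted.
    run btrfs check --readonly "$DEV"
}
```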
>>
>> There's also the write time and read time tree checkers. Not
>> everything is included in these checks but it does catch certain kinds
>> of corruption at either read time (it's already happened and on disk
>> so let's stop here and not make it worse), or write time (it's not yet
>> on disk, let's stop here). Common causes of write time tree check
>> errors are memory bit flips, but also sometimes kernel bugs and even
>> btrfs bugs. I guess you could call it a nascent online fsck, but
>> without repair capability. Currently it flips the file system
>> read-only to stop further confusion and keep data safe.
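
A minimal sketch for spotting tree-checker trips in a kernel log. The message wording in the pattern is from memory and may differ between kernel versions, so treat it as an assumption and adjust as needed; tree_checker_hits is a made-up helper name.

```shell
# Filter kernel log text on stdin for btrfs tree-checker reports.
tree_checker_hits() {
    grep -E 'BTRFS.*(read|write) time tree block corruption detected'
}
```

Typical use would be dmesg | tree_checker_hits, or journalctl -k -b -1 | tree_checker_hits to look at the previous boot.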
Just an update:
The type of failure that I consistently kept seeing in the logs was
illegal addressing in high memory during shutdown. The failing address
always looked like a couple of stuck high address bits, which was
unlikely with a well-tested processor, motherboard and memory.
Sometimes that was not fatal, while other times it caused a hard crash or
a freeze. Repeated memory and motherboard testing has shown no fault,
and I've replaced everything else multiple times. This kind of fault is
really tough to diagnose. Upon a lot of reflection in the middle of more
than a few nights, I upgraded the BIOS and ACPI firmware on this Lenovo
P300 from the 2016 version to the latest June/2020 version. Since then
I have not experienced a failure in 3 weeks of daily use with daily
shutdowns.
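
For anyone chasing a similar pattern, a minimal sketch of how to see which bits a set of failing addresses have in common: AND them all together, and whatever is left is set in every sample. The helper name common_bits is made up, and the addresses are whatever you copy out of your own logs.

```shell
# Print (in hex) the bits that are set in every address given as an argument.
common_bits() {
    cb_acc=-1                       # start with all bits set
    for cb_addr in "$@"; do
        cb_acc=$(( cb_acc & cb_addr ))
    done
    printf '%x\n' "$cb_acc"
}
```

For example, common_bits 0xff00 0xf0f0 prints f000. Bits that stay set across otherwise unrelated faults are worth comparing against what a sane address in that region should look like.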
As an aside, Lenovo does not make it easy to upgrade unless you are on
Windows -- something for the Lenovo folks to rectify in their firmware
distribution site. Thankfully the stand-alone DVD image did the trick.
This outcome is a huge relief, but would seem to show that something in
the ACPI code is not checking for valid addresses. The newer ACPI
firmware is no longer providing bogus addresses, and the problem has
gone away -- something for the kernel people to think about, so that the
faulty/missing checks can be identified.
I'm still waiting and watching, but the machine is now highly stable
when running Fedora.
--
John Mellor