Today one of three machines failed to boot 6.2.0-0.rc4. The failing machine had not been rebooted in a week, and it now won't boot any kernel; GRUB appears to be aborting with a pointer out of range error. All three machines use ext4 and LUKS, but only the failing machine uses mdraid. I haven't recovered the failing machine yet, but I plan to downgrade GRUB tomorrow. If it boots after that, it would confirm a GRUB bug and I'll file a bug report. Has anyone else seen this, or is there a different issue I should be looking for?
On Tue, Jan 17, 2023 at 18:37:20 -0600, Bruno Wolff III bruno@wolff.to wrote:
On these machines /boot is not encrypted, so LUKS is unlikely to be a factor in the issue.
Running grub2-install in a live image fixed this. I'm not sure if this is a bug or if people are expected to do this themselves (before rebooting) because there are issues trying to automate this for legacy systems. It might be that the scripts to do updates don't pull the correct devices when /boot is on raid. It is very annoying though when this happens.
On Wed, Jan 18, 2023 at 12:18:30 -0600, Bruno Wolff III bruno@wolff.to wrote:
I filed bug 2162113 about this issue, though I'm good for now after running grub2-install.
On Wed, Jan 18, 2023, at 1:18 PM, Bruno Wolff III wrote:
Running grub2-install in a live image fixed this.
Is the system firmware BIOS or UEFI? Either way it's kinda confusing...
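The BIOS-or-UEFI question can usually be answered from the booted system itself, since Linux exposes /sys/firmware/efi only when it was started through UEFI. A minimal sketch, assuming it is run on the machine in question (a live image may have been booted in a different mode than the installed system); the function name `firmware_type` and its `sysfs_root` parameter are made up for illustration:

```python
import os


def firmware_type(sysfs_root="/sys"):
    """Return "UEFI" if the kernel was booted via UEFI, else "BIOS".

    The kernel creates <sysfs_root>/firmware/efi only on UEFI boots.
    sysfs_root is a hypothetical parameter so the check can be tested
    against a scratch directory instead of the real /sys.
    """
    efi_dir = os.path.join(sysfs_root, "firmware", "efi")
    return "UEFI" if os.path.isdir(efi_dir) else "BIOS"


if __name__ == "__main__":
    print(firmware_type())
```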
BIOS GRUB does not update the embedded core.img (in the MBR gap or GPT BIOS Boot partition), so when the RPM version changes, neither the embedded GRUB nor any of the modules in /boot/grub2 change. So a GRUB bug triggering boot failure on BIOS is ... not expected.
UEFI GRUB does update the grubx64.efi OSLoader found on the EFI System partition. So a GRUB bug could trigger boot failure here. But I'm surprised grub2-install even executes: it is not recommended because of well-known issues, as the resulting EFI file behaves rather differently from the EFI file produced in Fedora infra and included in the GRUB RPM.
So either way it's kind of a weird result... I will speculate: if UEFI, the problem could be that /boot/grub2 contained some older version of modules that your specific use case requires. (In the typical case in Fedora on UEFI systems, modules aren't installed in either /usr or the EFI System partition; everything needed is baked into the pre-built grubx64.efi OSLoader.) Updating the GRUB RPM installed a new grubx64.efi but did not update the modules in /boot/grub2. That mismatch could cause a problem that grub2-install would resolve, because if that command does execute, it results in version parity between grubx64.efi and the modules in /boot/grub2. But again, on UEFI grub2-install really should not be used. It's not a great situation, but right now there is no agreement between distros using Secure Boot and upstream GRUB on how to handle grub-install any differently than it is handled today.
I pretty much see two options. If you want to use Fedora's grub package, you should "reset" it by following these instructions: https://fedoraproject.org/wiki/GRUB_2#Instructions_for_UEFI-based_systems
If you have a special use case for GRUB that Fedora's package doesn't meet: (a) I'd file an RFE bug in RHBZ and explain that use case, so the bootloader team is at least aware of it. And (b) I'd build GRUB from upstream source, and then you can use grub-install as expected. (It won't work out of the box with Secure Boot enabled and Fedora's shim, but I assume you're not using Secure Boot or else grub2-install wouldn't have fixed the problem.)
But after writing all that, maybe UEFI doesn't apply to your use case :D
I'm not sure if this is a bug or if people are expected to do this themselves (before rebooting)
On BIOS, yes, it's expected to require manual user intervention to update the embedded GRUB binary and its modules found on the /boot volume.
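With /boot on mdraid, that manual step generally means repeating grub2-install for each disk that holds a member of the array. A rough sketch of how the disk list could be derived from /proc/mdstat follows. It only prints commands and runs nothing. The helper names, the sample mdstat text, the array name `md0`, and the partition-naming heuristics are all assumptions for illustration, not what Fedora's update scripts do:

```python
import re

# Array line in /proc/mdstat, e.g. "md0 : active raid1 sdb1[1] sda1[0]".
ARRAY_LINE = re.compile(r"^(md\S*)\s*:\s*(.*)$")
# Member token such as "sda1[0]" or "sdb1[1](F)".
MEMBER = re.compile(r"^([A-Za-z0-9_\-]+)\[\d+\]")


def parse_mdstat(text):
    """Map each md array name to a sorted list of its member devices."""
    arrays = {}
    for line in text.splitlines():
        match = ARRAY_LINE.match(line)
        if not match:
            continue
        members = []
        for token in match.group(2).split():
            member = MEMBER.match(token)
            if member:
                members.append(member.group(1))
        arrays[match.group(1)] = sorted(members)
    return arrays


def parent_disk(partition):
    """Guess the whole-disk name for a partition (heuristic, by name only)."""
    match = re.match(r"^((?:nvme\d+n\d+)|(?:mmcblk\d+))p\d+$", partition)
    if match:
        return match.group(1)  # nvme0n1p2 -> nvme0n1
    match = re.match(r"^([a-z]+)\d+$", partition)
    if match:
        return match.group(1)  # sda1 -> sda
    return partition  # already a whole disk, or an unrecognized name


def install_commands(mdstat_text, array):
    """Build one grub2-install command per disk backing the given array."""
    disks = sorted({parent_disk(p) for p in parse_mdstat(mdstat_text)[array]})
    return ["grub2-install /dev/" + disk for disk in disks]


# Made-up sample; on a real system read the text from /proc/mdstat.
SAMPLE = """Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      1048512 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""

if __name__ == "__main__":
    for command in install_commands(SAMPLE, "md0"):
        print(command)
```

For the sample above this prints `grub2-install /dev/sda` and `grub2-install /dev/sdb`. Review the list by hand before running anything, since the name-based guess can be wrong for unusual device names.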
because there are issues trying to automate this for legacy systems. It might be that the scripts to do updates don't pull the correct devices when /boot is on raid. It is very annoying though when this happens.
I don't know whether https://github.com/coreos/bootupd is doing this automatically on BIOS firmware systems now, or is planning to. At one time they were. I think the focus from this point forward is on UEFI Secure Boot workflows rather than inherently insecurable BIOS scenarios, and that the ship has just sailed.