George Avrunin writes:
[..]
> I don't have any idea what's going on and it's very inconvenient (not
> to mention strongly discouraged by the powers that be) to have to keep
> going on campus to restart the machine. So I'd be very grateful for
> suggestions about how to figure this out, or at least stop it from
> happening again.
The capsule summary here is that the system appears to lock up under high
I/O, either disk or network. A dnf upgrade puts a heavy load on both:
network I/O while it goes out and downloads the updates from the repos,
and disk I/O while it installs them. If everything is already downloaded,
it's mostly just disk I/O.
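If you want to see which of the two phases triggers it, you can split
them apart (assuming your dnf is recent enough to have --downloadonly,
which current Fedora releases do):

```shell
# Phase 1: network-heavy - fetch all packages, install nothing.
sudo dnf upgrade --downloadonly

# Phase 2: disk-heavy - packages are now cached locally, so this run
# is mostly package installation I/O.
sudo dnf upgrade
```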
You can test that theory by simulating some load yourself. Something like

    dd if=/dev/urandom of=/tmp/junk$$ bs=1M count=100 &

Kick this off a dozen times or so to write a gig's worth of junk into /tmp
(presuming there's space for it).
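As a loop, that looks something like the sketch below. One caveat: on
current Fedora /tmp is usually tmpfs, i.e. RAM-backed, so for a genuine
disk test point TARGET at a directory that actually lives on the disk
(the default of /var/tmp here is my assumption, adjust to taste):

```shell
#!/bin/sh
# Launch a dozen parallel dd writers to generate disk load.
# TARGET, job count, and file size are all tunable assumptions.
TARGET=${TARGET:-/var/tmp}
JOBS=12
for i in $(seq 1 "$JOBS"); do
    # 12 x 100 MiB = roughly 1.2 GiB of junk, written concurrently
    dd if=/dev/urandom of="$TARGET/junk.$i" bs=1M count=100 status=none &
done
wait                        # block until every writer finishes
ls -lh "$TARGET"/junk.*
```

Remember to delete the junk files afterwards.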
If this locks up the machine, there you go. If not, and you think your dnf
upgrade was downloading stuff, try generating some network load. You'll
need some bandwidth available yourself. You can take the dozen files of
junk, put them in /var/www/html (presuming that apache is running), and
wget them all, in parallel, from this machine on some other box.
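A runnable sketch of the parallel-fetch idea. Here a throwaway Python
web server on localhost stands in for apache on the remote box, purely
so the commands can be tried end to end; the port, file count, and file
sizes are all placeholder assumptions:

```shell
#!/bin/sh
# Serve a few junk files over HTTP and fetch them back in parallel.
# In the real test the server is apache on the locked-up box and the
# fetches run from some other machine; localhost stands in here.
fetch() {   # wget as in the text above, or curl if wget isn't installed
    if command -v wget >/dev/null 2>&1; then wget -q "$1" -O "$2"
    else curl -s "$1" -o "$2"; fi
}
mkdir -p /tmp/www /tmp/dl
for i in 1 2 3; do
    head -c 1048576 /dev/urandom > "/tmp/www/junk.$i"   # 1 MiB each
done
python3 -m http.server 8099 --bind 127.0.0.1 --directory /tmp/www \
    >/dev/null 2>&1 &
SERVER=$!
sleep 1                             # give the server a moment to come up
pids=
for i in 1 2 3; do
    fetch "http://127.0.0.1:8099/junk.$i" "/tmp/dl/junk.$i" &
    pids="$pids $!"
done
wait $pids                          # all parallel fetches finished
kill "$SERVER"
```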
For extra credit you can try generating both disk and network load.
If this turns out to reliably lock up this particular bit of hardware, there
you go. What can you do about it? Very little. It's going to be either
failing hardware (hard drive, power supply, or RAM), or a kernel bug.
Looking up the spec sheet for your box, it looks like both spinning rust
and SSDs were available options. You didn't say which one you have, but if
your drives are spinning rust, that's the most likely point of failure.
Pretty much the only easily accessible clue would be SMART diagnostics on
the hard drive(s). See if there's anything there that tells you that the
hard drive is on its last legs. The next most accessible clue requires
being physically at the machine: a RAM tester. Do Fedora live images still
include a memtest option, does anyone know?
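Pulling the SMART data is a one-liner per drive with smartmontools
(assumption: the smartmontools package is installed; needs root). A
sketch that walks the common device names:

```shell
#!/bin/sh
# Print SMART health and attributes for whatever disks are present.
# Read-only queries, safe to run. Reallocated or pending sector counts
# climbing above zero are the classic "drive is dying" tell.
found=0
for dev in /dev/sd[a-z] /dev/nvme[0-9]n1; do
    [ -e "$dev" ] || continue       # skip unmatched glob patterns
    found=1
    echo "=== $dev ==="
    smartctl -H -A "$dev"           # -H overall health, -A attributes
done
[ "$found" -eq 1 ] || echo "no /dev/sd* or /dev/nvme* devices found"
```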
You could be hitting a kernel bug. In the old days, I would rig up a
crossover cable on my PC's serial port, configure the kernel with a serial
console, and capture kernel oopses on the other machine over the serial
link. RS-232 ports are long gone now, but I have some vague recollection
of serial over USB being an option. Another option worth exploring is
remote syslogging. Maybe the kernel can eke out an extra packet or two to
a remote syslog before crashing.
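One concrete mechanism for that last idea is the kernel's netconsole
module, which fires console messages out as raw UDP packets and can
sometimes get a panic message onto the wire when userspace syslog
daemons can't. A sketch, where the IPs, port numbers, interface name,
and MAC address are all placeholders you'd replace with your own:

```shell
# On the flaky box: send kernel console output to a log host over UDP.
# Format: local-port@local-ip/interface,remote-port@remote-ip/remote-mac
# 192.168.1.10/eth0 = this box, 192.168.1.20 = log host (assumptions).
modprobe netconsole \
    netconsole=6666@192.168.1.10/eth0,514@192.168.1.20/aa:bb:cc:dd:ee:ff

# On the log host, anything listening on that UDP port will do, e.g.:
#   nc -u -l 514
```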
But at least confirming that you can reliably reproduce a lockup by
simulating high disk or network I/O is better than nothing.