On Tue, 17 May 2022 07:46:09 -0400
Sam Varshavchik <mrsam@courier-mta.com> wrote:
And I had something like this happen when a power supply was going
flaky. The voltages had drifted out of spec as it decayed.
I too have seen that.
SSH into the system to see if it responds?
If your network supports OvrC you can set it to notify you when a system
goes offline.
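
If nothing like OvrC is available, a small probe run from a second machine can do roughly the same job. The following is only a minimal sketch: it assumes the target accepts TCP connections on the SSH port (22), and the address in the usage comment is a placeholder, not anything from this thread. It prints a timestamp each time the host changes between reachable and unreachable.

```python
import socket
import time
from datetime import datetime, timezone


def is_reachable(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def watch(host, port=22, interval=30.0, probe=is_reachable,
          max_checks=None, sleep=time.sleep):
    """Print a timestamped line whenever the host goes up or down.

    Runs forever unless max_checks is given. Returns the list of
    (timestamp, is_up) transitions that were observed.
    """
    last = None
    checks = 0
    transitions = []
    while max_checks is None or checks < max_checks:
        up = probe(host, port)
        if up != last:
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            print(f"{stamp} {host}:{port} {'UP' if up else 'DOWN'}")
            transitions.append((stamp, up))
            last = up
        checks += 1
        if max_checks is None or checks < max_checks:
            sleep(interval)
    return transitions


# Usage (placeholder address): watch("192.0.2.10", interval=30.0)
```

A successful TCP connect only shows that the network stack is still answering, so a box with a hung userland may still look "up" to this check.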
Tough problem to diagnose.
Agree (but an interesting challenge)
You can try to determine whether the problem occurs randomly at a low rate (likely
hardware) or more deterministically after a long period of uptime by rebooting on
a schedule. If the scheduled reboots make the crashes stop, the cause probably
builds up with uptime. Random failures are more likely hardware (e.g., power
supply), while deterministic ones are more likely software (e.g., memory leak).
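
One rough way to put a number on "random vs. deterministic", if you have a record of how long the system stayed up before each crash, is to look at the spread of those uptimes. This is just a heuristic sketch, not a diagnosis: the 0.3 threshold is an arbitrary assumption, and a handful of samples will not prove anything.

```python
from statistics import mean, stdev


def uptime_spread(uptimes_hours):
    """Return (mean, coefficient of variation) of the uptimes before each crash."""
    if len(uptimes_hours) < 2:
        raise ValueError("need at least two crash intervals")
    m = mean(uptimes_hours)
    return m, stdev(uptimes_hours) / m


def guess_cause(uptimes_hours, cv_threshold=0.3):
    """Label the crash pattern as regular or irregular.

    cv_threshold is an assumed cutoff: a small spread relative to the mean
    means the crashes arrive after similar uptimes each time.
    """
    _, cv = uptime_spread(uptimes_hours)
    if cv < cv_threshold:
        return "regular (suspect software)"
    return "irregular (suspect hardware)"


# Crashes after roughly three days every time look like a leak:
print(guess_cause([70, 72, 71, 69]))
# Crashes after wildly different uptimes look more like flaky hardware:
print(guess_cause([3, 120, 40, 400]))
```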
Getting crashdumps or console messages can help pinpoint the time. Some
people have used a video camera set for time lapse to capture interesting
details from the screen.
I've had systems that crashed when IT hit them with a periodic network scan. I
used the time correlation with network logs on a second system to determine the
cause (faulty implementation of SNMP). You need a system watching the
network and a way to pinpoint the time of the failure.
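
The time correlation itself can be sketched in a few lines once you have failure timestamps and a list of logged network events. Everything here is illustrative: the event descriptions, the five-minute window, and the function names are my assumptions, and real logs would need parsing into (time, description) pairs first.

```python
from collections import Counter
from datetime import datetime, timedelta


def events_before(failure_time, events, window=timedelta(minutes=5)):
    """Return the (time, description) events logged within `window` before a failure."""
    return [(t, d) for t, d in events
            if timedelta(0) <= failure_time - t <= window]


def recurring_suspects(failure_times, events, window=timedelta(minutes=5)):
    """Count, per event description, how many failures it shortly preceded."""
    counts = Counter()
    for failure in failure_times:
        # Use a set so one failure counts each description at most once.
        for desc in {d for _, d in events_before(failure, events, window)}:
            counts[desc] += 1
    return counts


# Made-up example data, not from any real log:
events = [
    (datetime(2022, 5, 16, 2, 0), "snmp scan"),
    (datetime(2022, 5, 16, 9, 30), "backup"),
    (datetime(2022, 5, 17, 2, 0), "snmp scan"),
]
failures = [datetime(2022, 5, 16, 2, 3), datetime(2022, 5, 17, 2, 1)]
print(recurring_suspects(failures, events).most_common())
```

An event that shows up before every failure is only a lead, so it is still worth reproducing the crash by triggering that event by hand.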