Well, I suspected the power supply, but the guy I spoke with at Titan said other things would fail before the SSD if that were the problem.

The power should be pretty stable, and the machine is connected to a good transient suppressor strip.  Anyway, there was no lightning when it died, which was within the past 24 hours.  I had been checking the smartmontools output every few days, and it showed no errors and temperatures below 37 C.

Titan has suggested installing a SATA SSD (eliminating the M.2) and I'm going to try that.  He suggested it might be a software issue, e.g. that something might be erasing the partition table on the drive (I don't have another machine handy to verify this), but this seems really unlikely.  I just installed F35 and a moderate set of scientific packages, no proprietary software.  The only access is via ssh inside a VPN, and I have the only account.
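
For whenever a second machine (or a USB enclosure) is available, here is a rough sketch of how the "erased partition table" theory could be checked. It only looks for the two standard on-disk signatures: the 0x55 0xAA boot signature at byte offset 510 of the first sector, and the "EFI PART" GPT header signature at the start of the second sector. The 512-byte sector size, the function names, and the device path in the comment are my assumptions, not anything Titan suggested.

```python
SECTOR = 512  # assumed logical sector size; 4Kn drives use 4096


def partition_table_status(head: bytes, sector: int = SECTOR) -> dict:
    """Report which partition-table signatures are present in the first two sectors."""
    if len(head) < 2 * sector:
        raise ValueError("need at least two sectors of data")
    # Boot signature of the (protective) MBR: bytes 510-511 of sector 0.
    mbr_ok = head[510:512] == b"\x55\xaa"
    # GPT header signature: first 8 bytes of sector 1.
    gpt_ok = head[sector:sector + 8] == b"EFI PART"
    return {"mbr_signature": mbr_ok, "gpt_signature": gpt_ok}


def read_head(device: str, sector: int = SECTOR) -> bytes:
    """Read the first two sectors of a block device (usually needs root)."""
    with open(device, "rb") as f:
        return f.read(2 * sector)


# Hypothetical usage, with the suspect drive attached to a working machine:
#   print(partition_table_status(read_head("/dev/nvme0n1")))
```

If both signatures are still there, a wiped partition table would not explain the failure. If the drive is not detected at all, this check cannot even be run, which would itself point away from a software cause.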

On Tue, Feb 22, 2022 at 10:47 AM George N. White III <gnwiii@gmail.com> wrote:
On Tue, 22 Feb 2022 at 10:04, Neal Becker <ndbecker2@gmail.com> wrote:
Thanks Richard.  Yes, I talked with Titan; they suggested trying the PCIe-to-M.2 adapter.  I will try them again.
I have not checked for BIOS updates.  Not sure how to go about that (the last time I did that it required an MS-DOS floppy disk).

Haven't tried the SSDs in another device because I don't have one.  But the fact that replacing the SSD makes the machine work, where it wasn't working before, tells me they were damaged.  I have power-cycled the workstation at least once, and the BIOS did not find any SSD to boot from.  So a power cycle didn't fix it, but replacing the SSD did.

I will try Titan again later today, but just looking for ideas.   

With this history, I'd probably replace the workstation power supply.   I would also scan the 
system board for capacitors with bulging tops or overheated components.  

Are there any externally powered devices connected to the workstation (other than the monitor)? 

Are you in an area with frequent lightning storms?  How stable is your power?  Is the system 
connected to a UPS?    

I had a similar experience with spinning disks in a system that contained a drive-bay radio receiver 
and was connected to a satellite dish and GPS receiver on the roof, and an antenna controller.  Everything
was powered by a high quality UPS.  I added a heavy wire connecting the antenna controller case to the 
workstation case and the failures stopped.   

I gather you now have space for two M.2 SSDs.   If you haven't discarded the non-working devices, 
it would be interesting to see if any are detected and what smartmontools says about them, but 
you also have the option to put /var on a separate drive.  smartmontools can monitor a drive and 
report any problems it detects, but you may also want to run self-tests periodically.

 

Thanks,
Neal

On Tue, Feb 22, 2022 at 8:44 AM Richard Shaw <hobbes1069@gmail.com> wrote:
On Tue, Feb 22, 2022 at 7:34 AM Neal Becker <ndbecker2@gmail.com> wrote:
I know this is a bit OT, but you guys are great at answering all questions.

I bought a workstation from Titan Computers around 1/2020 (dual EPYC CPUs).  After about 1 year it stopped working.  I could ssh to it, but almost any command would return Input/Output error.  Unfortunately journalctl also gave Input/Output error, so I couldn't see the logs.  cat /proc/partitions did not show any NVMe device (the root device, on which the OS was installed).
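
For reference, the /proc/partitions check above can be scripted.  This is only a sketch: the function names are invented for illustration, and it assumes the usual four-column layout (major, minor, #blocks, name).  Since /proc is a virtual filesystem, reading it does not touch the failed disk, although the Python interpreter itself would have to be already running or cached once the root device is gone.

```python
def block_devices(partitions_text: str) -> list[str]:
    """Return the device names listed in the text of /proc/partitions."""
    names = []
    for line in partitions_text.splitlines():
        fields = line.split()
        # Data rows have four fields: major, minor, #blocks, name.
        # The header row and the blank line after it are skipped.
        if len(fields) == 4 and fields[0].isdigit():
            names.append(fields[3])
    return names


def nvme_present(partitions_text: str) -> bool:
    """True if any listed device name starts with 'nvme'."""
    return any(name.startswith("nvme") for name in block_devices(partitions_text))


# Hypothetical usage on the workstation itself:
#   with open("/proc/partitions") as f:
#       print(nvme_present(f.read()))
```

A small loop that runs this every minute and reports to another machine would record when the device dropped off the bus.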

I replaced the SSD with a Samsung 980 Pro.  Reinstalled Fedora.  It then worked for a few weeks, then showed the exact same symptoms.

I replaced the SSD with another Samsung 980 Pro, this time with a heatsink.  Reinstalled Fedora.  It worked for a few weeks.  Then the same symptoms.

Then I replaced it with a 4th Samsung 980 Pro, but this time instead of using the M.2 socket I used a PCIe-to-M.2 adapter (in case something was wrong with the M.2 socket).  Also added a surge protector outlet for good measure.  Reinstalled.  Watched the smartctl output.  No errors.  Temperature was always low.

Now it's failed again, exactly same symptoms.

Any ideas?

I remember your other email about a month or so ago and thought it was really strange. Have you tried the drives in another system to confirm they're truly dead? 

I would check for BIOS updates just for good measure. Other than that, have you had any communication with Titan about it?

Thanks,
Richard
_______________________________________________
users mailing list -- users@lists.fedoraproject.org
To unsubscribe send an email to users-leave@lists.fedoraproject.org
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/users@lists.fedoraproject.org
Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure


--
Those who don't understand recursion are doomed to repeat it


--
George N. White III


