I'm running 64-bit F29 using the default LVM, but with no separate home partition. Today smartctl reported that "Current_Pending_Sector" and "Offline_Uncorrectable" increased from 0 to 1. Running a self-test failed almost immediately with
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%            42494          3299402936
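For reference, the attribute counts and the self-test log above come from smartctl itself; the invocations are roughly:

smartctl -A /dev/sda           # SMART attribute table, including Current_Pending_Sector
                               # and Offline_Uncorrectable
smartctl -t long /dev/sda      # start an extended (offline) self-test
smartctl -l selftest /dev/sda  # read the self-test log once the test has finished or failed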
I'm trying to follow the instructions in https://www.smartmontools.org/wiki/BadBlockHowto . Fdisk gives
[root@lenovo-pc ~]# fdisk -lu /dev/sda
Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0xc3017146

Device     Boot      Start        End    Sectors   Size Id Type
/dev/sda1  *          2048    1026047    1024000   500M  7 HPFS/NTFS/exFAT
/dev/sda2           1026048  536872959  535846912 255.5G  7 HPFS/NTFS/exFAT
/dev/sda3         536872960  538970111    2097152     1G 83 Linux
/dev/sda4         538970112 3907028991 3368058880   1.6T  5 Extended
/dev/sda5         538972160 3907028991 3368056832   1.6T 8e Linux LVM
[root@lenovo-pc ~]#
so the bad LBA is in both sda4 and sda5. Trying tune2fs to find the block size gives
[root@lenovo-pc ~]# tune2fs -l /dev/sda4 | grep Block
tune2fs: Attempt to read block from filesystem resulted in short read while trying to open /dev/sda4
Couldn't find valid filesystem superblock.
[root@lenovo-pc ~]# tune2fs -l /dev/sda5 | grep Block
tune2fs: Bad magic number in super-block while trying to open /dev/sda5
[root@lenovo-pc ~]# tune2fs -l /dev/mapper/fedora-root | grep Block
Block count:              419037184
Block size:               4096
Blocks per group:         32768
[root@lenovo-pc ~]#
so I'm guessing that the block size is 4096. In computing the problem block, I'm not sure whether to use /dev/sda4 or /dev/sda5. Also, if I run debugfs, it doesn't allow me to open either device, so at this point I'm not sure how to identify either the inode or file (if there is one) corresponding to the block. Can anyone help?
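For reference, the howto's arithmetic against /dev/sda5 would go roughly as follows; this is only a sketch, because it assumes 4096-byte filesystem blocks and that the root filesystem's data starts at the default 1 MiB offset into the LVM physical volume, neither of which is verified here:

# LBA 3299402936 falls inside /dev/sda5, which starts at sector 538972160.
# Offset of the bad sector within the partition, in 512-byte sectors:
#   3299402936 - 538972160 = 2760430776
# Convert 512-byte sectors to 4096-byte filesystem blocks (divide by 8):
#   2760430776 / 8 = 345053847
# The filesystem is inside LVM rather than directly on sda5, so the logical
# volume's own offset within the physical volume has to be subtracted as
# well (256 blocks if the data area starts at the default 1 MiB):
#   345053847 - 256 = 345053591

Pinning down that last offset is the awkward part, which is why scanning the filesystem directly with badblocks (as suggested below) is simpler.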
BTW, I have another machine also running 64-bit F29, with the same size HDD, and no disk problems. The fdisk output is exactly the same, as are the errors when using tune2fs or debugfs on sda4 or sda5, so those errors have nothing to do with my HDD problems; they appear to be generic to any Fedora install.
On 4/21/19 5:06 PM, Andre Robatino wrote:
> Device     Boot      Start        End    Sectors   Size Id Type
> /dev/sda1  *          2048    1026047    1024000   500M  7 HPFS/NTFS/exFAT
> /dev/sda2           1026048  536872959  535846912 255.5G  7 HPFS/NTFS/exFAT
> /dev/sda3         536872960  538970111    2097152     1G 83 Linux
> /dev/sda4         538970112 3907028991 3368058880   1.6T  5 Extended
> /dev/sda5         538972160 3907028991 3368056832   1.6T 8e Linux LVM
>
> so the bad LBA is in both sda4 and sda5. Trying tune2fs to find the block size gives
>
> [root@lenovo-pc ~]# tune2fs -l /dev/sda4 | grep Block
> tune2fs: Attempt to read block from filesystem resulted in short read while trying to open /dev/sda4
sda4 is an extended partition. That's just a container, no filesystem.
> Couldn't find valid filesystem superblock.
> [root@lenovo-pc ~]# tune2fs -l /dev/sda5 | grep Block
> tune2fs: Bad magic number in super-block while trying to open /dev/sda5
sda5 is an LVM volume, also not directly a filesystem.
> [root@lenovo-pc ~]# tune2fs -l /dev/mapper/fedora-root | grep Block
> Block count:              419037184
> Block size:               4096
> Blocks per group:         32768
Do you have a home partition as well? If so, it's more likely to be in that one. Try running "badblocks -s -b 4096 /dev/mapper/fedora-root" and, if you have a home partition, "badblocks -s -b 4096 /dev/mapper/fedora-home". I'm assuming that you have 4K blocks; that's the default for ext4. If you get a hit from badblocks, then run debugfs on the filesystem and enter "icheck <number from badblocks>". That should give you at least one inode. If not, then maybe the block isn't in use. Then exit debugfs and run "find / -xdev -inum <inode number>" to find the file corresponding to the inode. Use /home instead of / if that's where the bad block was found.
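In command form, that whole sequence is roughly the following sketch (the block and inode numbers are placeholders until badblocks and icheck report real ones):

badblocks -s -b 4096 /dev/mapper/fedora-root   # list bad 4K blocks, showing progress
debugfs /dev/mapper/fedora-root                # then, at the debugfs prompt:
    icheck <block number from badblocks>       #   map block -> inode
    quit
find / -xdev -inum <inode number>              # map inode -> file path
# (use fedora-home and /home instead if the bad block turned up there)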
Thanks for the info. I only have a root partition, no home partition. Badblocks found one bad block (it said 1/0/0 errors after it found the bad block, when it was still running):
[root@lenovo-pc ~]# badblocks -s -b 4096 /dev/mapper/fedora-root
Checking for bad blocks (read-only test): 345053591
done, 3:05:03 elapsed. (0/0/0 errors)
[root@lenovo-pc ~]#
Unfortunately, debugfs doesn't let me open the filesystem:
[root@lenovo-pc ~]# debugfs
debugfs 1.44.6 (5-Mar-2019)
debugfs: open /dev/mapper/fedora-root
/dev/mapper/fedora-root: Inode bitmap checksum does not match bitmap while reading allocation bitmaps
debugfs: icheck 345053591
icheck: Filesystem not open
debugfs: quit
[root@lenovo-pc ~]#
If I run debugfs on the other machine which has no disk errors, I get a slightly different error:
[root@compaq-pc ~]# debugfs
debugfs 1.44.6 (5-Mar-2019)
debugfs: open /dev/mapper/fedora-root
/dev/mapper/fedora-root: Block bitmap checksum does not match bitmap while reading allocation bitmaps
debugfs: quit
[root@compaq-pc ~]#
On 4/22/19 7:07 AM, Andre Robatino wrote:
> Unfortunately, debugfs doesn't let me open the filesystem:
>
> [root@lenovo-pc ~]# debugfs
> debugfs 1.44.6 (5-Mar-2019)
> debugfs: open /dev/mapper/fedora-root
> /dev/mapper/fedora-root: Inode bitmap checksum does not match bitmap while reading allocation bitmaps
> debugfs: icheck 345053591
> icheck: Filesystem not open
> debugfs: quit
Try running "sync" first. If that doesn't help, then try running debugfs with the -n option. If that doesn't work either, you might need to reboot to get an fsck.
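If it comes to the -n fallback, a rough sketch (the -n option disables metadata checksum verification, which is what is blocking the open above):

debugfs -n /dev/mapper/fedora-root
    icheck 345053591
    quit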
Running "sync" fixed it:
[root@lenovo-pc ~]# sync
[root@lenovo-pc ~]# debugfs
debugfs 1.44.6 (5-Mar-2019)
debugfs: open /dev/mapper/fedora-root
debugfs: icheck 345053591
Block           Inode number
345053591       86246540
debugfs: quit
[root@lenovo-pc ~]#
The "find" command identified it as an old file in my home directory that hasn't changed in 11 years. I have everything backed up on the other machine, so copied the file back from there. After doing that, the "Current Pending Sector Count" went from 1 to 0. The "Uncorrectable Sector Count" is still 1. A short smartctl test was successful.
Also, the following appeared in dmesg referencing the same bad LBA 3299402936:
[49702.067452] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[49702.067457] sd 0:0:0:0: [sda] tag#0 Sense Key : Medium Error [current]
[49702.067459] sd 0:0:0:0: [sda] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
[49702.067463] sd 0:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 c4 a8 e4 00 00 02 00 00
[49702.067465] print_req_error: I/O error, dev sda, sector 3299402936 flags 0
[49704.912179] sd 0:0:0:0: [sda] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[49704.912185] sd 0:0:0:0: [sda] tag#0 Sense Key : Medium Error [current]
[49704.912189] sd 0:0:0:0: [sda] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
[49704.912193] sd 0:0:0:0: [sda] tag#0 CDB: Read(10) 28 00 c4 a8 e4 b8 00 00 08 00
[49704.912197] print_req_error: I/O error, dev sda, sector 3299402936 flags 0
Running "hdparm --read-sector 3299402936 /dev/sda" according to https://www.smartmontools.org/wiki/BadBlockHowto shows that the formerly bad sector is readable, so I'm not sure why the "Uncorrectable Sector Count" is still 1. I don't want to force it to be marked as a bad sector unless I'm sure it actually is bad - if the error happens again with the same sector, I can always mark it as bad later.
On Mon, 22 Apr 2019 at 20:10, Andre Robatino <robatino@fedoraproject.org> wrote:
Running "hdparm --read-sector 3299402936 /dev/sda" according to
https://www.smartmontools.org/wiki/BadBlockHowto shows that the formerly bad sector is readable, so I'm not sure why the "Uncorrectable Sector Count" is still 1. I don't want to force it to be marked as a bad sector unless I'm sure it actually is bad - if the error happens again with the same sector, I can always mark it as bad later.
Before I retired I worked in remote sensing, which involves lots of data moving through lots of drives. Over the years we had many drives fail, and we learned that it was best to replace a drive at the first sign of problems. We also found that failure rates increase rapidly after the warranty expires, so we started replacing drives at the end of their warranty period.
Unless you have a lot more time than money, you should consider replacing the drive. If it is under warranty, you should try to run the vendor's diagnostics to get a return authorization (some vendors accept a smartmontools test report). We had some non-critical uses for old drives (e.g., transferring a system image between boxes) where the risk of a disk failure was a small chance of losing a few hours, weighed against the certainty of spending a few hours to arrange the purchase of a new drive.
OK, but this drive is on a machine with 2 operating systems installed. Reinstalling and reconfiguring those takes just as much time before the drive fails as after, so replacing preemptively wastes time (until the drive starts experiencing regular problems, which it isn't yet). And I have at least 2 backups for all of the data, so I won't lose anything. If a drive starts experiencing regular problems, then I order a new one and replace it before it dies. So far this is an isolated problem. I had a drive once that acquired a bad sector, and nothing changed for another 1 or 2 years until it started adding new bad sectors regularly; then I replaced it while it was still usable. Another drive failed fairly suddenly with no warning after only 1.5 years and I lost some non-critical data, since I wasn't taking backups as seriously at the time. That won't happen again.
The point is, no matter how often I replace the drives, failures can happen. Without backups, those could result in data loss. With backups, I can avoid data loss if I know which files are affected, which Samuel's information makes possible. And if the drive holds one or more OSes (not just data) it takes time to reinstall and reconfigure and it's not worth it until a problem starts repeating.
On Tue, 23 Apr 2019 at 04:25, Andre Robatino <robatino@fedoraproject.org> wrote:
> OK, but this drive is on a machine with 2 operating systems installed. Reinstalling and reconfiguring those takes just as much time before the drive fails as after, so replacing preemptively wastes time (until the drive starts experiencing regular problems, which it isn't yet).
If the drive is still working, it should not be a big job to clone the disk to the replacement, so all you lose is the time it takes to do the copy. If you have space elsewhere, you can create an image of the disk now and restore it when the replacement arrives. I use a USB device that provides slots for several types of drives. It is slow, but it lets me create a cloned drive offline and then swap out the old one.
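For a drive that is already throwing read errors, GNU ddrescue is a reasonable choice for the copy, since it keeps going past unreadable sectors and records what it could not recover; a sketch, assuming the replacement shows up as /dev/sdb (double-check the device names, this overwrites the target):

# First pass: copy everything readable, skip the slow scraping of bad areas (-n),
# and keep a map file so the copy can be resumed or refined later.
ddrescue -f -n /dev/sda /dev/sdb rescue.map
# Optional second pass: retry the areas that failed, up to 3 times each.
ddrescue -f -r3 /dev/sda /dev/sdb rescue.map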
> And I have at least 2 backups for all of the data, so I won't lose anything. If a drive starts experiencing regular problems, then I order a new one and replace it before it dies. So far this is an isolated problem. I had a drive once that acquired a bad sector, and nothing changed for another 1 or 2 years until it started adding new bad sectors regularly; then I replaced it while it was still usable. Another drive failed fairly suddenly with no warning after only 1.5 years and I lost some non-critical data, since I wasn't taking backups as seriously at the time. That won't happen again.
>
> The point is, no matter how often I replace the drives, failures can happen. Without backups, those could result in data loss. With backups, I can avoid data loss if I know which files are affected, which Samuel's information makes possible. And if the drive holds one or more OSes (not just data) it takes time to reinstall and reconfigure and it's not worth it until a problem starts repeating.
At work (I'm retired now) I kept an image of a fresh install for each OS and applied updates, so I always had a good base system ready to go when a drive acted up. We did backups (a full backup, then a series of incrementals, and repeat), and the first sign of trouble was usually a failure on the full backup.