Hi, I have a Fedora 35 system that uses rsync to operate as a backup server.
It has an 8TB RAID5 array, and for the last few days it has
crashed/segfaulted at what appears to be the same time that it starts to
back up a particular remote host. This suggests to me that perhaps
there is some spot on the disk, related to this particular host's
data, that is triggering this.
When it happens, there is a segfault message on the console, but nothing
related to it in the logs. There are bits from the kernel about being
unable to write prior to the crash, however:
Aug 1 12:24:32 mail03 kernel: [2415225.412978] EXT4-fs warning (device md2): ext4_end_bio:343: I/O error 10 writing to inode 232141206 starting block 3033088)
Aug 1 12:24:32 mail03 kernel: [2415225.412987] Buffer I/O error on device md2, logical block 3033088
Aug 1 12:24:32 mail03 kernel: [2415225.413025] Buffer I/O error on device md2, logical block 3033089
...
Aug 1 12:24:32 mail03 kernel: [2415225.526007] JBD2: Detected IO errors while flushing file data on md2-8
Aug 1 12:24:35 mail03 kernel: [2415227.560338] JBD2: Detected IO errors while flushing file data on md2-8
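To get an overview of which blocks are affected, the "Buffer I/O error"
lines can be collected per device. This is only a minimal Python sketch: it
assumes each log entry sits on a single line (not wrapped as in an email),
and the name failing_blocks is made up for illustration.

```python
import re

# Matches e.g. "Buffer I/O error on device md2, logical block 3033088"
PATTERN = re.compile(r"Buffer I/O error on device (\w+), logical block (\d+)")

def failing_blocks(lines):
    """Return {device: sorted list of logical block numbers} from log lines."""
    found = {}
    for line in lines:
        match = PATTERN.search(line)
        if match:
            device, block = match.group(1), int(match.group(2))
            found.setdefault(device, set()).add(block)
    return {device: sorted(blocks) for device, blocks in found.items()}

# The two complete "Buffer I/O error" entries quoted above, unwrapped.
sample = [
    "Aug 1 12:24:32 mail03 kernel: [2415225.412987] Buffer I/O error on device md2, logical block 3033088",
    "Aug 1 12:24:32 mail03 kernel: [2415225.413025] Buffer I/O error on device md2, logical block 3033089",
]
print(failing_blocks(sample))
```

If the reported blocks turn out to be one contiguous run, that would be
consistent with a single bad region rather than scattered failures.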
How do I identify which of the four disks this is? I've run smartctl short
self-tests on each disk in the array, but all four passed without error.
What is md2-8?
From /proc/mdstat:
md2 : active raid5 sde1[4] sdc1[7] sda1[5] sdf1[6]
      8790402048 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 0/22 pages [0KB], 65536KB chunk
Note that the array reports itself as fully operational, with all four
members up.
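As a rough way to narrow down which member a failing block lands on, the
logical block number can be mapped onto the RAID5 layout. The Python sketch
below is built on assumptions and is not a definitive answer: it assumes the
"logical block" is counted in 4 KiB filesystem blocks (the block size is not
shown anywhere above), that "algorithm 2" is the left-symmetric layout, and
that the data offset on every member is the same. The function name locate
is made up for illustration.

```python
def locate(fs_block, fs_block_size=4096, chunk_size=512 * 1024, raid_disks=4):
    """Map a block on the array to (stripe, data_slot, parity_slot).

    Assumes the left-symmetric RAID5 layout ("algorithm 2" in /proc/mdstat).
    The 4096-byte block size is an assumption, not taken from the post.
    """
    data_disks = raid_disks - 1
    # Which 512k chunk of the array's address space holds this block.
    chunk_number = (fs_block * fs_block_size) // chunk_size
    # Each stripe holds one data chunk per data disk.
    stripe, data_index = divmod(chunk_number, data_disks)
    # Left-symmetric: parity starts on the last slot and rotates backwards,
    # with data chunks following the parity chunk and wrapping around.
    parity_slot = data_disks - (stripe % raid_disks)
    data_slot = (parity_slot + 1 + data_index) % raid_disks
    return stripe, data_slot, parity_slot

# First failing block from the kernel log.
print(locate(3033088))
```

The slots returned here are RAID role positions, not the bracketed numbers
in /proc/mdstat, so they would still have to be matched to sdX names (for
example from the output of mdadm --detail /dev/md2). A write also updates
the parity chunk of the stripe, so either the data slot or the parity slot
could be the one reporting the error.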
I'm also now running a full fsck scan of the disk:
# fsck -Vfp -C0 /dev/md2
fsck from util-linux 2.37.4
[/usr/sbin/fsck.ext4 (1) -- /var/backup] fsck.ext4 -fp -C0 /dev/md2
/dev/md2: |=== | 5.7%
but it'll clearly take a while.
I also don't see any errors in the kernel log related to any of the four
individual disks.