Software RAID: Detecting, querying and testing

 

More about software RAID…

/var/log/messages always seems to fill screens with error messages, no matter what happened. When a disk crashes, though, the kernel reports a huge number of errors. Some nasty examples, for the masochists:
kernel: scsi0 channel 0 : resetting for second half of retries.
kernel: SCSI bus is being reset for host 0 channel 0.
kernel: scsi0: Sending Bus Device Reset CCB #2666 to Target 0
kernel: scsi0: Bus Device Reset CCB #2666 to Target 0 Completed
kernel: scsi : aborting command due to timeout : pid 2649, scsi0, channel 0, id 0, lun 0 Write (6) 18 33 11 24 00
kernel: scsi0: Aborting CCB #2669 to Target 0
kernel: SCSI host 0 channel 0 reset (pid 2644) timed out - trying harder
kernel: SCSI bus is being reset for host 0 channel 0.
kernel: scsi0: CCB #2669 to Target 0 Aborted
kernel: scsi0: Resetting BusLogic BT-958 due to Target 0
kernel: scsi0: *** BusLogic BT-958 Initialized Successfully ***

Most often, disk failures look like this:

kernel: sidisk I/O error: dev 08:01, sector 1590410
kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002

or like this:
kernel: hde: read_intr: error=0x10 { SectorIdNotFound }, CHS=31563/14/35, sector=0
kernel: hde: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
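
Spotting these lines by eye in a busy log gets tedious, so here is a minimal Python sketch of a filter for them. The regular expressions are derived only from the sample messages above; other kernels and drivers word their errors differently, so the pattern list is an assumption to adapt, and the names find_disk_errors and DISK_ERROR_PATTERNS are made up for this illustration.

```python
import re

# Patterns modelled on the sample messages in this article only.
# Assumption: your kernel/driver may phrase errors differently.
DISK_ERROR_PATTERNS = [
    # e.g. "I/O error: dev 08:01, sector 1590410"
    re.compile(r"I/O error: dev [0-9a-f]{2}:[0-9a-f]{2}, sector \d+"),
    # e.g. "SCSI disk error : host 0 channel 0 id 0 lun 0 return code = ..."
    re.compile(r"SCSI disk error : host \d+ channel \d+ id \d+ lun \d+"),
    # e.g. "hde: read_intr: error=0x10 { SectorIdNotFound }"
    re.compile(r"hd[a-z]+: read_intr: (?:error|status)=0x[0-9a-f]+"),
]


def find_disk_errors(lines):
    """Return the lines that match any of the disk error patterns."""
    return [
        line for line in lines
        if any(p.search(line) for p in DISK_ERROR_PATTERNS)
    ]


sample = [
    "kernel: scsi0: *** BusLogic BT-958 Initialized Successfully ***",
    "kernel: sidisk I/O error: dev 08:01, sector 1590410",
    "kernel: SCSI disk error : host 0 channel 0 id 0 lun 0 return code = 28000002",
    "kernel: hde: read_intr: error=0x10 { SectorIdNotFound }, CHS=31563/14/35, sector=0",
    "kernel: hde: read_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }",
]

hits = find_disk_errors(sample)
for line in hits:
    print(line)
```

To run it against a real log, pass an open file instead of the sample list, for example find_disk_errors(open("/var/log/messages")), assuming that is where your system logs kernel messages.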

And, as expected, the classic look at /proc/mdstat will also reveal problems:
Personalities : [linear] [raid0] [raid1] [translucent]
read_ahead not set
md7 : active raid1 sdc9[0] sdd5[8] 32000 blocks [2/1] [U_]
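
The same check can be scripted. The Python sketch below is a rough illustration, not a replacement for proper monitoring. It assumes that [2/1] means two configured members with only one active, and that each _ in [U_] marks a missing member while U marks a healthy one. The function name degraded_arrays is invented for this example.

```python
import re

# Start of an array entry, e.g. "md7 : active raid1 ..."
ARRAY_RE = re.compile(r"^(md\d+) : ")
# Member counters and status flags, e.g. "[2/1] [U_]"
COUNTS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Map array name -> (configured, active, flags) for degraded arrays.

    The counters may sit on the array line itself (as in the output
    above) or on a following line, so the last array name seen is
    remembered while scanning.
    """
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        name = ARRAY_RE.match(line)
        if name:
            current = name.group(1)
        counts = COUNTS_RE.search(line)
        if counts and current:
            configured = int(counts.group(1))
            active = int(counts.group(2))
            flags = counts.group(3)
            if active < configured or "_" in flags:
                result[current] = (configured, active, flags)
    return result


SAMPLE = """Personalities : [linear] [raid0] [raid1] [translucent]
read_ahead not set
md7 : active raid1 sdc9[0] sdd5[8] 32000 blocks [2/1] [U_]
"""

report = degraded_arrays(SAMPLE)
for array, (configured, active, flags) in report.items():
    print(f"{array}: {active} of {configured} members active {flags}")
```

On a live system the text would come from reading /proc/mdstat, for example degraded_arrays(open("/proc/mdstat").read()).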

Later in this section we will learn how to monitor RAID with mdadm so that we receive alerts about disk failures. Now it’s time to learn more about interpreting /proc/mdstat.