Sooner or later hard drives fail…

That is why we use RAID arrays. The best solution is hardware RAID, where a specialized processor on the controller board does the work. I do not include in that category the cheap solutions integrated on the motherboard (so-called Fake RAID).

Unfortunately, real RAID controllers are sometimes too pricey. This is where software RAID comes to the rescue.

The good news is that it is included in most recent operating systems. Linux is no exception: its software RAID is really well optimized and is even recommended over Fake RAID for better performance.

Now we are protected in case of failure, but RAID 1 and RAID 5 only protect the data against the failure of one drive. It is therefore better to replace the failed drive as soon as possible, yet you may not want to stop the machine right away.

NOTE: If you have IDE drives, do not use the following procedure. IDE drives are NOT HOTSWAPPABLE, and removing one from a running machine may cause MORE DAMAGE.

The same applies to ordinary (non-hot-swap) SATA and SCSI drives.

If you have hot-swappable drives (SCA or similar), you can replace a drive while the machine is running.

If you are not sure, check the documentation that came with your hardware.

And now, after all these precautions, let’s start:

Determine the failed drive

To check which array and which drive has a problem, simply type:

cat /proc/mdstat

Here is sample output:

md2 : active raid5 sdd2[4](F) sda2[0] sdc1[2] sdb2[1]
106221312 blocks level 5, 256k chunk, algorithm 2 [4/3] [UUU_]

In this case the failed member is sdd2 (marked with (F)), so the problem drive is sdd.
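
If you would rather not pick the (F) flag out by eye, here is a minimal sketch that does it with awk. The function name failed_members is my own invention, and it assumes the usual /proc/mdstat layout shown above, so treat it as an illustration rather than a robust parser.

```shell
# Print "<array> <member>" for every member flagged (F) in an mdstat file.
# failed_members is a hypothetical helper name; the file defaults to /proc/mdstat.
failed_members() {
    awk '
        /^md[^ ]* :/ {
            for (i = 3; i <= NF; i++) {
                if ($i ~ /\(F\)$/) {
                    dev = $i
                    sub(/\[.*/, "", dev)   # strip the "[4](F)" suffix
                    print $1, dev
                }
            }
        }
    ' "${1:-/proc/mdstat}"
}
```

Calling failed_members with no argument reads the live /proc/mdstat; passing a file name lets you try it on a saved copy first.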

Check drive size and type

For the size, type:

fdisk -l

and look for sdd in the output.

To check the exact drive model, type:

dmesg | less

and again look at the lines just before "SCSI device sdd:".
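
On kernels that expose sysfs, the same two facts can usually be read directly instead of searching through fdisk and dmesg output. The sketch below is an alternative under that assumption: drive_info is a made-up name, and the second argument exists only so the function can be pointed at a test directory instead of /sys/block.

```shell
# Print model and size in bytes for a drive, read from sysfs.
# drive_info is a hypothetical helper; /sys/block/<dev>/size is in 512-byte sectors.
drive_info() {
    dev=$1
    root=${2:-/sys/block}
    model=$(sed 's/ *$//' "$root/$dev/device/model")   # trim trailing padding
    sectors=$(cat "$root/$dev/size")
    echo "$dev: model=$model bytes=$((sectors * 512))"
}
```

For example, drive_info sdd prints one line you can keep next to the order for the replacement drive.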

The next step is to obtain a replacement drive (ideally the same model).

Dump the partition table from the drive, if it is still readable:

sfdisk -d /dev/sdd > partitions.sdd
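
Before pulling the drive it is worth confirming that the dump really contains partitions, because a half-dead drive may give you an empty or truncated file. A minimal sketch, assuming the dump has one "/dev/... : start=..., size=..." line per partition; the name dump_partition_count is hypothetical.

```shell
# Count the non-empty partition entries in an sfdisk dump file.
# dump_partition_count is a hypothetical helper; entries with size=0 are skipped.
dump_partition_count() {
    awk '/^\/dev\// && $0 !~ /size= *0,/ { n++ } END { print n + 0 }' "$1"
}
```

If dump_partition_count partitions.sdd prints 0, assume the dump is unusable and plan to recreate the partitions by hand later.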

Remove the failed member from the array (it must already be marked as failed, as the (F) flag in /proc/mdstat shows):

mdadm /dev/md2 -r /dev/sdd2

Look up the Host, Channel, Id and Lun of the drive to replace by looking in:

cat /proc/scsi/scsi
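
The four numbers are spread over a "Host: scsi1 Channel: 00 Id: 03 Lun: 00" line, with the vendor and model on the line after it. The sketch below flattens each device into the "host channel id lun" form used by the commands that follow. The name scsi_addresses is my own, and the function assumes that classic file layout. Note that the file does not show sdX names, so you still have to match the drive by its model or Id.

```shell
# List every attached SCSI device as "host channel id lun | Vendor: ... Model: ...".
# scsi_addresses is a hypothetical helper; the file defaults to /proc/scsi/scsi.
scsi_addresses() {
    awk '
        /^Host:/ {
            host = $2
            sub(/^scsi/, "", host)          # "scsi1" -> "1"
            # adding 0 turns "00" and "03" into plain 0 and 3
            addr = (host + 0) " " ($4 + 0) " " ($6 + 0) " " ($8 + 0)
        }
        /^ *Vendor:/ {
            line = $0
            sub(/^ +/, "", line)            # drop the leading indent
            print addr " | " line
        }
    ' "${1:-/proc/scsi/scsi}"
}
```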

Remove the drive from the bus

echo "scsi remove-single-device 1 0 3 0" > /proc/scsi/scsi

Verify that the drive has been correctly removed by looking in:

cat /proc/scsi/scsi
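
If you want a yes/no answer instead of reading the listing, something along these lines should work. It is only a sketch: scsi_present is a hypothetical name, and it assumes the same "Host: scsiN Channel: NN Id: NN Lun: NN" line format.

```shell
# Exit 0 if a device with the given address is still listed, 1 otherwise.
# usage: scsi_present HOST CHANNEL ID LUN [FILE]   (hypothetical helper)
scsi_present() {
    awk -v h="$1" -v c="$2" -v i="$3" -v l="$4" '
        /^Host:/ {
            host = $2
            sub(/^scsi/, "", host)
            if (host + 0 == h + 0 && $4 + 0 == c + 0 &&
                $6 + 0 == i + 0 && $8 + 0 == l + 0)
                found = 1
        }
        END { exit(found ? 0 : 1) }
    ' "${5:-/proc/scsi/scsi}"
}
```

After the remove command above, scsi_present 1 0 3 0 should fail. If it still succeeds, do not unplug the drive yet.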

Physically replace the drive

Unplug the drive from your SCA bay and insert the new drive.

Add the new drive to the bus:

echo "scsi add-single-device 1 0 3 0" > /proc/scsi/scsi

(this should spin up the drive as well)

Recreate the layout

Re-partition the drive using the previously dumped partition table:

sfdisk /dev/sdd < partitions.sdd

If the failed drive was unreadable, you need to create the new partitions by hand at this point (for example with fdisk), each at least as large as the one it replaces.

Add the drive to your array

mdadm /dev/md2 -a /dev/sdd2

You can check whether the operation was successful by issuing:

cat /proc/mdstat
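
While the array rebuilds, /proc/mdstat normally shows a progress line such as "recovery = 8.5% (...)". Here is a small sketch that pulls out just the percentage. The name rebuild_progress is hypothetical, and the function relies on that line format, so it prints nothing when no rebuild is running.

```shell
# Print the recovery/resync percentage(s) found in an mdstat file, if any.
# rebuild_progress is a hypothetical helper; the file defaults to /proc/mdstat.
rebuild_progress() {
    awk '
        /(recovery|resync) *=/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9.]+%$/) print $i
        }
    ' "${1:-/proc/mdstat}"
}
```

A simple way to wait for the rebuild is to loop while rebuild_progress still prints something, sleeping a minute between checks.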

Inspired with modifications from
