Linux Software RAID
This describes how to use Linux to configure inexpensive disks in redundant arrays without any proprietary hardware. The huge advantage of this technique is that there is no RAID controller to fail and leave your data stranded in a striping pattern that only that controller knows about.
Warning
If your motherboard has a RAID "feature", I would not use it. If the motherboard dies, or any component fails in a way that renders it inoperable, your data could easily be trapped until you track down a completely identical motherboard on eBay. Linux software RAID does not have this problem.
Setup
Set up partitions to use the partition type "fd" which is "Linux raid autodetect".
   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1          25      200781   fd  Linux raid autodetect
/dev/sda2              26         269     1959930   82  Linux swap / Solaris
/dev/sda3             270        5140    39126307+  fd  Linux raid autodetect
/dev/sda4            5141       10011    39126307+  fd  Linux raid autodetect
Here I am making a boot partition, a swap partition and 2 identically sized data partitions. These can be made into RAID volumes with each other or across other physical drives. I leave the swap as swap because I don’t think it’s entirely useful to have swap space RAID protected.
Once you have the partition table the way you want it on one physical drive, you can clone it to another by doing something like this:
sfdisk -d /dev/sda | sfdisk /dev/sdb
You need to have the Linux RAID enabled in the kernel. From an install disk you might need to:
modprobe md-mod
modprobe raid1
(Or whatever RAID level you’re after.) To build this support into your own kernel, configure it here:
Device_Drivers->Multi-device_support_(RAID_and_LVM)->RAID_support->RAID-1_(mirroring)_mode
You might also need this if it’s not already there:
# emerge -av sys-fs/mdadm
Now set up some md devices (md stands for "multiple devices"):
livecd ~ # mknod /dev/md1 b 9 1
livecd ~ # mknod /dev/md2 b 9 2
livecd ~ # mknod /dev/md3 b 9 3
livecd ~ # mknod /dev/md4 b 9 4
Time to actually set up the RAID devices:
livecd ~ # mdadm --create --verbose /dev/md1 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm: size set to 200704K
mdadm: array /dev/md1 started.
livecd ~ # mdadm --create --verbose /dev/md3 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3
mdadm: size set to 39126208K
mdadm: array /dev/md3 started.
livecd ~ # mdadm --create --verbose /dev/md4 --level=1 --raid-devices=2 /dev/sda4 /dev/sdb4
mdadm: size set to 39126208K
mdadm: array /dev/md4 started.
Here’s an example of a serious RAID setup:
mdadm --create --verbose /dev/md3 --level=5 --raid-devices=22 /dev/sdc \
    /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj \
    /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq \
    /dev/sdr /dev/sds /dev/sdt /dev/sdu /dev/sdv /dev/sdw /dev/sdx
Here’s one with a hot spare:
mdadm --verbose --create /dev/md3 --level=5 --raid-devices=7 --spare-devices=1 /dev/sdd /dev/sde /dev/sdf /dev/sdc /dev/sdg /dev/sdi /dev/sdj /dev/sdh
Note that if the RAID volumes were already set up, then just make the devices with mknod and then instead of "creating" the RAID components, just re"assemble" them like this:
mdadm --assemble --verbose /dev/md4 /dev/sda4 /dev/sdb4
Note that if the RAID volumes were already set up and there are redundant disks, it might happen that the RAID volume starts without its redundant mirror. For example, if you find:
md1 : active raid1 sda1[0]
521984 blocks [2/1] [U_]
And you know that it should be being mirrored with /dev/sdb1, then do this:
# mdadm /dev/md1 --add /dev/sdb1
And then look for this:
md1 : active raid1 sdb1[1] sda1[0]
521984 blocks [2/2] [UU]
The command returns right away, but the mirror actually takes a long time to synchronize in the background. Presumably this can be done with a drive full of data, and the process copies everything over to the new member. You can check up on it with:
watch -n 1 cat /proc/mdstat
Note that a better way to get more information about what is going on with a RAID array is through mdadm:
mdadm --query --detail /dev/md3
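For unattended checks, the `[U_]` marker seen in the `/proc/mdstat` output above can be scanned for in a script. The following is only a minimal sketch: the function name `degraded_arrays` and its optional file argument are made up for illustration, and it assumes the mdstat layout shown in this document, where an underscore in the member map marks a missing device.

```shell
# degraded_arrays [FILE]
# Print the name of every md array whose member map (such as [U_])
# contains an underscore. FILE defaults to /proc/mdstat.
# The function name and the FILE argument are illustrative assumptions.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/ { name = $1 }             # remember the array name
        name != "" && match($0, /\[[U_]+\]/) {
            map = substr($0, RSTART, RLENGTH)   # the member map, e.g. [U_]
            if (index(map, "_") > 0) print name
            name = ""
        }
    ' "${1:-/proc/mdstat}"
}

# Typical use on a live system:
#   degraded_arrays
```

An empty result means no array in the file reports a missing member.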
To stop a RAID array basically means to have the system stop worrying about it:
mdadm --misc --stop /dev/md3
This is useful if you start an array but something goes wrong and you need to reuse the disks that are part of the aborted or inactive or unwanted array.
For temporary purposes, you can set up an /etc/mdadm.conf file like this:
mdadm --detail --scan > /etc/mdadm.conf
but for a more permanent installation (once the OS is installed), this is a good format for this file:
DEVICE /dev/sda*
DEVICE /dev/sdb*
ARRAY /dev/md1 devices=/dev/sda1,/dev/sdb1
#ARRAY /dev/md2 devices=/dev/sda2,/dev/sdb2
ARRAY /dev/md3 devices=/dev/sda3,/dev/sdb3
ARRAY /dev/md4 devices=/dev/sda4,/dev/sdb4
MAILADDR sysnet-admin@sysnet.ucsd.edu
Next, create the filesystems and the swap space:
livecd ~ # mkfs.ext2 /dev/md1
livecd ~ # mkswap /dev/sda2
livecd ~ # mkswap /dev/sdb2
livecd ~ # swapon /dev/sd[ab]2
livecd ~ # mkfs.ext3 /dev/md3
livecd ~ # mkfs.ext3 /dev/md4
Install GRUB and write it to the MBR of both drives:
emerge grub
grub --no-floppy
Setup MBR on /dev/sda:
root (hd0,0)
setup (hd0)
Setup MBR on /dev/sdb:
device (hd0) /dev/sdb
root (hd0,0)
setup (hd0)
quit
Don’t forget that your kernel will want a root=/dev/md3 parameter.
Recovery
So imagine that you have some error like this:
[hb.xed.ch][~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdd2[1] sdc2[0]
1951808 blocks [2/2] [UU]
md2 : active raid1 sdd3[1]
78268608 blocks [2/1] [_U]
md0 : active raid1 sdd1[1] sdc1[0]
192640 blocks [2/2] [UU]
Here md2 is running on sdd3 alone: its partner partition on the first disk (sdc3) is dead and not being used. Install a replacement disk; it will show up as a blank drive with no partitions.
The partition table has to be recreated. Make sure you get this right!!
:-> [swamp][~]$ fdisk -l /dev/sdc
Disk /dev/sdc: 82.3 GB, 82348277760 bytes
255 heads, 63 sectors/track, 10011 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1   *           1          25      200781   fd  Linux raid autodetect
/dev/sdc2              26         269     1959930   82  Linux swap / Solaris
/dev/sdc3             270        5140    39126307+  fd  Linux raid autodetect
/dev/sdc4            5141       10011    39126307+  fd  Linux raid autodetect
:-> [swamp][~]$ fdisk -l /dev/sdd
Disk /dev/sdd: 82.3 GB, 82348277760 bytes
255 heads, 63 sectors/track, 10011 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Disk /dev/sdd doesn't contain a valid partition table
From this you can see that sdd is the blank drive, no doubt about it. So don’t copy sdd’s table to sdc; that would be bad. Note also that the drives are identical in size. This is good. In fact, it might be smart to leave a few percent of each disk unpartitioned to begin with, to allow for differences in actual capacity between brands with the same nominal size.
Rebuild the partition table using the same technique used at installation:
sfdisk -d /dev/sdc | sfdisk /dev/sdd
Double check:
fdisk -l /dev/sdc; fdisk -l /dev/sdd
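The visual double check can be backed up by a mechanical comparison of the two partition dumps. This is a rough sketch, not a definitive tool: the function name `same_layout` is hypothetical, and it assumes the dumps come from `sfdisk -d` as used above and differ only in the device name and in comment lines.

```shell
# same_layout DISK_A DUMP_A DISK_B DUMP_B
# Succeed (exit 0) if two saved "sfdisk -d" dumps describe the same layout
# once each disk name is replaced by a placeholder and comment lines are
# dropped. The function name is an illustrative assumption.
same_layout() {
    a=$(sed -e "s|$1|DISK|g" -e '/^#/d' "$2")
    b=$(sed -e "s|$3|DISK|g" -e '/^#/d' "$4")
    [ "$a" = "$b" ]
}

# Typical use (file names under /tmp are placeholders):
#   sfdisk -d /dev/sdc > /tmp/sdc.dump
#   sfdisk -d /dev/sdd > /tmp/sdd.dump
#   same_layout /dev/sdc /tmp/sdc.dump /dev/sdd /tmp/sdd.dump && echo "layouts match"
```

Comparing saved dumps rather than live disks keeps the check read-only, which seems prudent right after rewriting a partition table.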
Set up swap space on the other volume and go ahead and use it:
:-> [swamp][~]$ mkswap /dev/sdd2
Setting up swapspace version 1, size = 2006962 kB
no label, UUID=81258423-4245-431b-8fb9-137e9651e7dd
:-> [swamp][~]$ swapon /dev/sdd2
Resetting Old RAID Partitions
Sometimes you might run one drive for a while and then introduce its matching RAID mirror, and the mirror will spontaneously associate with an md device of its own. Here’s how to clear that before adding it back to the real md set.
If I add hdc1 and it shows an md0 that shouldn’t exist, like this:
md0 : active raid1 hdc1[1]
104320 blocks [2/1] [_U]
Mark it as failed:
# mdadm --manage --fail /dev/md0
And then:
# mdadm --manage --remove /dev/md0
This actually didn’t seem to work! I just fixed the mdadm.conf and rebooted.
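One probable reason the commands above had no effect is that mdadm's `--fail` and `--remove` expect a member device after the array name. A possible alternative, untested on this setup, is to stop the stray array and erase the md superblock on the partition so it no longer assembles by itself. The sketch below only prints the commands for review instead of running them; the function name `reset_stale_member` is hypothetical, and `--zero-superblock` destroys the RAID metadata on the named partition, so it is only appropriate for a member that is about to be resynced from the good drive.

```shell
# reset_stale_member ARRAY PARTITION
# Print (do not run) the commands that would stop a stray array and wipe
# the md superblock from its lone member. The function name is an
# illustrative assumption.
# WARNING: --zero-superblock erases the RAID metadata on PARTITION.
reset_stale_member() {
    array=$1
    part=$2
    echo "mdadm --stop $array"
    echo "mdadm --zero-superblock $part"
}

# Review the plan for the example in this section:
reset_stale_member /dev/md0 /dev/hdc1
```

Once the printed commands look right, they can be run by hand as root, after which the partition can be added to the real array with `--add`.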
Do the recovery mirroring
Do the small easy one first:
:-> [swamp][~]$ mdadm --manage /dev/md1 --add /dev/sdd1
mdadm: added /dev/sdd1
Now check to see that it’s working:
:-> [swamp][~]$ cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdd1[2] sdc1[0]
200704 blocks [2/1] [U_]
[===================>.] recovery = 98.4% (198912/200704) finish=0.0min speed=39782K/sec
Check again to see it complete:
:-> [swamp][~]$ cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdd1[1] sdc1[0]
200704 blocks [2/2] [UU]
Do the rest:
:-> [swamp][~]$ mdadm --manage /dev/md3 --add /dev/sdd3
:-> [swamp][~]$ mdadm --manage /dev/md4 --add /dev/sdd4
These things go sequentially, so don’t panic if one waits for the other to finish. If you issue these commands, they’ll work eventually.
You can keep an eye on it:
watch cat /proc/mdstat
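If you would rather log the progress than watch it, the percentage can be pulled out of the same file. This is a small sketch under the assumption that the progress line looks like the `recovery = 98.4%` example shown earlier; the function name `rebuild_progress` is made up for illustration.

```shell
# rebuild_progress [FILE]
# Print "ARRAY PERCENT" for every array that currently shows a recovery
# or resync progress line. FILE defaults to /proc/mdstat.
# The function name and the FILE argument are illustrative assumptions.
rebuild_progress() {
    awk '
        /^md[0-9]+ :/ { name = $1 }        # remember the array name
        /(recovery|resync) *=/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) print name, $i
        }
    ' "${1:-/proc/mdstat}"
}

# Typical use on a live system:
#   rebuild_progress
```

No output means that no array in the file is reporting rebuild progress at the moment.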
Go ahead and put the bootloader in the MBR on the new drive using the procedure outlined above.
Mounting a partition that used to be in a RAID set
Maybe you’ve decommissioned a pair of mirrored RAID drives and you’re thinking of reusing them but you want to see what was on them to make sure it’s nothing important. Here is how to simply mount this drive on a different Linux box. You can check to see that the drive is plugged in and recognized somewhere with:
$ cat /proc/partitions
major minor  #blocks  name
   3     0  156290904 hda
   3     1     104391 hda1
   3     2     498015 hda2
  22     0  156290904 sdb
  22     1     104391 sdb1
  22     2     498015 sdb2
There it is, sdb. But if you try:
$ mount /dev/sdb1 /mnt/b/1
You get:
mount: unknown filesystem type 'mdraid'
It doesn’t work like that. You need to make sure the RAID kernel modules are loaded (having them compiled in works too):
# modprobe md
# modprobe raid1
If that’s good and you have sys-fs/mdadm installed then you can do:
# mdadm --assemble /dev/md1 /dev/sdb1
# mount /dev/md1 /mnt/b/1
And you’re looking at the contents.
Using Hardware RAID On Linux with 3Ware
Notes about hardware RAID using a particular 3Ware card (which I don’t use any more, but maybe this information will still be useful).
Replacing a drive after a failure
First find the problem. Use a command much like this:
:-> [hb][~]$ /sbin/tw_cli /c0 show
Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    OK             -       -       64K     4656.51   ON     OFF
u1    SPARE     OK             -       -       -       465.753   -      OFF
Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     465.76 GB   976773168     WD-WCANU1306978
p1     OK               u0     465.76 GB   976773168     WD-WCANU1241597
p2     OK               u0     465.76 GB   976773168     WD-WCANU1230955
p3     OK               u0     465.76 GB   976773168     WD-WCANU1222179
p4     OK               u0     465.76 GB   976773168     WD-WCANU1318737
p5     OK               u0     465.76 GB   976773168     WD-WCANU1230683
p6     OK               u0     465.76 GB   976773168     WD-WCANU1240889
p7     OK               u0     465.76 GB   976773168     WD-WCANU1234675
p8     OK               u0     465.76 GB   976773168     9QG0DRJH
p9     SMART-FAILURE    u1     465.76 GB   976773168     WD-WCANU1231205
p10    OK               u0     465.76 GB   976773168     WD-WCANU1255530
p11    OK               u0     465.76 GB   976773168     WD-WCANU1059605
This shows that drive 9 is showing signs of flakiness; though it may still work fine, it may not, and it should be replaced. A problem drive will likely end up isolated in its own RAID unit, because the controller grabs the good drive from the hot spare and starts using it in the array.
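Spotting the odd port in a long table is easy to get wrong by eye, so a filter can help. The following is a minimal sketch that assumes the `tw_cli /c0 show` port table looks like the one above, with the port name in the first column and its status in the second; the function name `bad_ports` is hypothetical.

```shell
# bad_ports
# Read "tw_cli /cX show" output on standard input and print the port and
# status of every port whose status is not OK.
# The function name is an illustrative assumption.
bad_ports() {
    awk '$1 ~ /^p[0-9]+$/ && $2 != "OK" { print $1, $2 }'
}

# Typical use with the controller path from this document:
#   /sbin/tw_cli /c0 show | bad_ports
```

Any status other than OK is reported, so a removed drive or an empty port would show up as well as a SMART failure.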
Now extract the bad drive. This can be very confusing. I have proven twice this month that the labels that are on hb are good. There are no labels on puzzlebox. It’s very important to pull the right drive, especially when things are bad, and this can be confusing, so be careful.
:-> [hb][~]$ /sbin/tw_cli /c0 show
Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    OK             -       -       64K     4656.51   ON     OFF
Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     465.76 GB   976773168     WD-WCANU1306978
p1     OK               u0     465.76 GB   976773168     WD-WCANU1241597
p2     OK               u0     465.76 GB   976773168     WD-WCANU1230955
p3     OK               u0     465.76 GB   976773168     WD-WCANU1222179
p4     OK               u0     465.76 GB   976773168     WD-WCANU1318737
p5     OK               u0     465.76 GB   976773168     WD-WCANU1230683
p6     OK               u0     465.76 GB   976773168     WD-WCANU1240889
p7     OK               u0     465.76 GB   976773168     WD-WCANU1234675
p8     OK               u0     465.76 GB   976773168     9QG0DRJH
p9     DRIVE-REMOVED    -      -           -             -
p10    OK               u0     465.76 GB   976773168     WD-WCANU1255530
p11    OK               u0     465.76 GB   976773168     WD-WCANU1059605
This is as expected and now the hot spare unit (u1) goes away.
Put a new drive in. Much easier than step #2 - don’t forget to bring a Phillips screwdriver to the machine room, by the way.
:-> [hb][~]$ /sbin/tw_cli /c0 show
Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    OK             -       -       64K     4656.51   ON     OFF
Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     465.76 GB   976773168     WD-WCANU1306978
p1     OK               u0     465.76 GB   976773168     WD-WCANU1241597
p2     OK               u0     465.76 GB   976773168     WD-WCANU1230955
p3     OK               u0     465.76 GB   976773168     WD-WCANU1222179
p4     OK               u0     465.76 GB   976773168     WD-WCANU1318737
p5     OK               u0     465.76 GB   976773168     WD-WCANU1230683
p6     OK               u0     465.76 GB   976773168     WD-WCANU1240889
p7     OK               u0     465.76 GB   976773168     WD-WCANU1234675
p8     OK               u0     465.76 GB   976773168     9QG0DRJH
p9     OK               -      465.76 GB   976773168     WD-WCAS81281763
p10    OK               u0     465.76 GB   976773168     WD-WCANU1255530
p11    OK               u0     465.76 GB   976773168     WD-WCANU1059605
So now drive 9 is back in and it seems OK, but it’s not part of any unit. It’s basically doing nothing useful.
Assign the new drive to a hot spare group. This is actually the important and non-obvious part of the operation.
:-> [hb][~]$ /sbin/tw_cli /c0 add type=spare disk=9
Creating new unit on controller /c0 ... Done. The new unit is /c0/u1.
:-> [hb][~]$ /sbin/tw_cli /c0 show
Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-5    OK             -       -       64K     4656.51   ON     OFF
u1    SPARE     OK             -       -       -       465.753   -      OFF
Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     OK               u0     465.76 GB   976773168     WD-WCANU1306978
p1     OK               u0     465.76 GB   976773168     WD-WCANU1241597
p2     OK               u0     465.76 GB   976773168     WD-WCANU1230955
p3     OK               u0     465.76 GB   976773168     WD-WCANU1222179
p4     OK               u0     465.76 GB   976773168     WD-WCANU1318737
p5     OK               u0     465.76 GB   976773168     WD-WCANU1230683
p6     OK               u0     465.76 GB   976773168     WD-WCANU1240889
p7     OK               u0     465.76 GB   976773168     WD-WCANU1234675
p8     OK               u0     465.76 GB   976773168     9QG0DRJH
p9     OK               u1     465.76 GB   976773168     WD-WCAS81281763
p10    OK               u0     465.76 GB   976773168     WD-WCANU1255530
p11    OK               u0     465.76 GB   976773168     WD-WCANU1059605
Now the controller knows that this is a hot spare. Everything looks good again.
Immediately go buy a replacement! For hb, there’s one in my office that should not be used for anything but hb and there’s one in the machine room locker which the full sys admin team can get at in my absence. They should both be there!
Bonus step: send the dead drive in to the manufacturer if it’s still under warranty. Can’t have too many of these things lying around.
Areca Cards
cli64
I usually put this utility at /root/cli64.
Show status of current raid environment. This is good for diagnostic checks.
`disk info`
If there is a failure you probably need to pull that failed disk out. It is important to get the correct number. The proper number for finding the physical disk is in the column labeled "Slot#".
There is a password protection. I generally have no need for it and will set it to "0000". This command must be issued in the session that requires password clearance. Just do this before proceeding.
`set password=0000`
To set auto activation of an incomplete RAID. 1 is on, 0 is off. Unfortunately, this doesn’t always work.
`sys autoact p=1`
To see what is going on with a "raid set":
`rsf info raid=3`
If you pull the bad drive and replace it, you might get a "Free" status. This doesn’t help anything. Either your previous Hot Spares are hard at work now becoming primary working drives or you will be activating them to do so. Either way, you want that "Free" drive to be a Hot Spare (assuming you weren’t dumb enough to set up a system without a hot spare). To do this you need the following command with the number of the drive. The number is important to get right, and it is confusing. The disk info command has a column labeled "CLI> #", the first column. Use this number to specify which drive to turn into a hot spare. This is (probably) different from the bay number you use to pull a failed physical drive. For example, this turns the drive in physical bay 17 into a hot spare.
`rsf createhs drv=25`
Sometimes there is a failure and the raid set just sits there. I think this tends to happen when the drive fails during a reboot (which is a time slightly more prone to failure). Here is a sequence where I check the raid set which contained a "Failed" that was replaced and turned to "Free". That is not ok since you actually want the drive’s status to be "rs2" or whatever raid set is correct and the rsf info’s "Raid Set State" to be "Rebuilding".
CLI> rsf info raid=3
Raid Set Information
===========================================
Raid Set Name        : rs2
Member Disks         : 7
Total Raw Capacity   : 14000.0GB
Free Raw Capacity    : 14000.0GB
Min Member Disk Size : 2000.0GB
Raid Set State       : Incompleted
===========================================
GuiErrMsg<0x00>: Success.
CLI> rsf activate raid=3
GuiErrMsg<0x00>: Success.
CLI> rsf info raid=3
Raid Set Information
===========================================
Raid Set Name        : rs2
Member Disks         : 7
Total Raw Capacity   : 14000.0GB
Free Raw Capacity    : 0.0GB
Min Member Disk Size : 2000.0GB
Raid Set State       : Rebuilding
===========================================
GuiErrMsg<0x00>: Success.
How’s that rebuild going? Check with:
`cli64 vsf info`
If there are problems, check the log:
`cli64 event info`
Areca support likes to see what you’re running:
`cli64 sys info`
`cli64 sys showcfg`
Note
If a drive fails during a reboot, the RAID card doesn’t know what to make of the situation. So instead of automatically rebuilding from a hot spare, it will just do nothing. You have to activate it as shown above. This means that whenever the file server is booted, a drive status report should be generated to make sure everything is starting out properly.
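A boot-time report of that kind could start from something like the following sketch, which pulls the "Raid Set State" value out of `rsf info` output so that a script can log or mail it. It assumes the output format shown in the transcript above; the function name `raid_set_state` is made up for illustration, and no assumption is made here about what string a healthy raid set reports.

```shell
# raid_set_state
# Read "rsf info" output on standard input and print the value of the
# "Raid Set State" field. The function name is an illustrative assumption.
raid_set_state() {
    awk -F':' '/^Raid Set State/ {
        state = $2
        gsub(/^[ \t]+|[ \t]+$/, "", state)   # trim surrounding blanks
        print state
    }'
}

# Typical use with the utility location mentioned in this document:
#   /root/cli64 rsf info raid=3 | raid_set_state
```

In the transcript above this would report "Incompleted" before activation and "Rebuilding" afterwards, which is exactly the difference a boot-time check needs to catch.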