Linux Software RAID

This describes how to use Linux to configure inexpensive disks in redundant arrays without any proprietary hardware. The huge advantage of this technique is that there is no RAID controller to fail, leaving your data stranded in a striping pattern that only that controller knows about.

Warning
If your motherboard has a RAID "feature" I would not use it. If your motherboard dies, or any component fails in a way that renders it inoperable, your data could easily be trapped until you track down a completely identical motherboard on eBay. Linux software RAID does not have this problem.

Setup

Set up partitions to use the partition type "fd" which is "Linux raid autodetect".

          Boot Start  End     Blocks     Id  System
/dev/sda1   *  1      25      200781     fd  Linux raid autodetect
/dev/sda2      26     269     1959930    82  Linux swap / Solaris
/dev/sda3      270    5140    39126307+  fd  Linux raid autodetect
/dev/sda4      5141   10011   39126307+  fd  Linux raid autodetect

Here I am making a boot partition, a swap partition and 2 identically sized data partitions. These can be made into RAID volumes with each other or across other physical drives. I leave the swap as swap because I don’t think it’s entirely useful to have swap space RAID protected.

Once you have the partition table the way you want it on one physical drive, you can clone it to another by doing something like this:

sfdisk -d /dev/sda | sfdisk /dev/sdb

You need to have Linux RAID support enabled in the kernel. From an install disk you might need to:

modprobe md-mod
modprobe raid1

(Or whatever RAID level you’re after.) To build this module, enable it in the kernel configuration here:

Device_Drivers->Multi-device_support_(RAID_and_LVM)->RAID_support->RAID-1_(mirroring)_mode

You might also need this if it’s not already there:

# emerge -av sys-fs/mdadm

Now set up some md devices (md stands for "multiple devices"):

livecd ~ # mknod /dev/md1 b 9 1
livecd ~ # mknod /dev/md2 b 9 2
livecd ~ # mknod /dev/md3 b 9 3
livecd ~ # mknod /dev/md4 b 9 4

Time to actually set up the RAID devices:

livecd ~ # mdadm --create --verbose /dev/md1 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm: size set to 200704K
mdadm: array /dev/md1 started.
livecd ~ # mdadm --create --verbose /dev/md3 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3
mdadm: size set to 39126208K
mdadm: array /dev/md3 started.
livecd ~ # mdadm --create --verbose /dev/md4 --level=1 --raid-devices=2 /dev/sda4 /dev/sdb4
mdadm: size set to 39126208K
mdadm: array /dev/md4 started.

Here’s an example of a serious RAID setup:

mdadm --create --verbose /dev/md3 --level=5 --raid-devices=22 /dev/sdc \
/dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj \
/dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq \
/dev/sdr /dev/sds /dev/sdt /dev/sdu /dev/sdv /dev/sdw /dev/sdx

Here’s one with a hot spare:

mdadm --verbose --create /dev/md3 --level=5 --raid-devices=7 \
--spare-devices=1 /dev/sdd /dev/sde /dev/sdf /dev/sdc /dev/sdg \
/dev/sdi /dev/sdj /dev/sdh
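
As a rough sanity check on what a RAID 5 layout like this yields, here is a small arithmetic sketch. RAID 5 spends one member's worth of space on parity, and hot spares add no capacity. The function name `raid5_usable` and the 500 GB member size are made up for illustration; they are not from the setup above.

```shell
# raid5_usable MEMBERS SIZE
# Usable capacity of a RAID 5 array: one member's worth of space holds parity.
# MEMBERS corresponds to --raid-devices; hot spares are not counted.
raid5_usable() {
    members=$1
    size=$2
    echo $(( ($members - 1) * $size ))
}

# The hot spare example: 7 active members. 500 (GB) is a hypothetical size.
raid5_usable 7 500
```

With those assumed numbers the seven-member array would offer 3000 GB, the same as six bare drives.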

Note that if the RAID volumes were already set up, then just make the devices with mknod and, instead of creating the RAID components, assemble them again like this:

mdadm --assemble --verbose /dev/md4 /dev/sda4 /dev/sdb4

Note that if the RAID volumes were already set up and there are redundant disks, it might happen that the RAID volume starts without its redundant mirror. For example, if you find:

md1 : active raid1 sda1[0]
      521984 blocks [2/1] [U_]

And you know that it should be being mirrored with /dev/sdb1, then do this:

# mdadm /dev/md1 --add /dev/sdb1

And then look for this:

md1 : active raid1 sdb1[1] sda1[0]
      521984 blocks [2/2] [UU]

The command returns right away, but the mirror actually takes a long time to get established properly. Presumably this can be done with a drive full of data and this process copies everything. You can check up on it with:

watch -n 1 cat /proc/mdstat

Note that a better way to get more information about what is going on with a RAID array is through mdadm:

mdadm --query --detail /dev/md3
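
For a scripted check rather than eyeballing the output, something like the following sketch might do. It reads /proc/mdstat-style text and prints the arrays whose member map (the `[U_]` part) contains an underscore. The function name `degraded_arrays` is my own invention, and it assumes the mdstat layout shown in these notes.

```shell
# degraded_arrays: read /proc/mdstat-style text on stdin and print the name
# of every array whose member map (e.g. [U_]) shows a missing member.
degraded_arrays() {
    awk '
        $1 ~ /^md/ && $2 == ":" { name = $1 }      # remember the current array
        /\[[U_]*_[U_]*\]/       { print name }     # member map contains "_"
    '
}

# Only run against the live file where it exists.
if [ -r /proc/mdstat ]; then
    degraded_arrays < /proc/mdstat
fi
```

No output means every array that is running has all of its members.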

To stop a RAID array basically means to have the system stop worrying about it:

mdadm --misc --stop /dev/md3

This is useful if you start an array but something goes wrong and you need to reuse the disks that are part of the aborted or inactive or unwanted array.

For temporary purposes, you can set up an /etc/mdadm.conf file like this:

mdadm --detail --scan > /etc/mdadm.conf

but for a more permanent installation (once the OS is installed), this is a good format for this file:

Contents of /etc/mdadm.conf

DEVICE    /dev/sda*
DEVICE    /dev/sdb*
ARRAY           /dev/md1 devices=/dev/sda1,/dev/sdb1
#ARRAY           /dev/md2 devices=/dev/sda2,/dev/sdb2
ARRAY           /dev/md3 devices=/dev/sda3,/dev/sdb3
ARRAY           /dev/md4 devices=/dev/sda4,/dev/sdb4
MAILADDR   sysnet-admin@sysnet.ucsd.edu

Now for some filesystems

livecd ~ # mkfs.ext2 /dev/md1
livecd ~ # mkswap /dev/sda2
livecd ~ # mkswap /dev/sdb2
livecd ~ # swapon /dev/sd[ab]2
livecd ~ # mkfs.ext3 /dev/md3
livecd ~ # mkfs.ext3 /dev/md4

Getting the bootloader on both drives

emerge grub
grub --no-floppy

Setup MBR on /dev/sda:

root (hd0,0)
setup (hd0)

Setup MBR on /dev/sdb:

device (hd0) /dev/sdb
root (hd0,0)
setup (hd0)
quit

Don’t forget that your kernel will want a root=/dev/md3 parameter.

Recovery

So imagine that you have some error like this:

[hb.xed.ch][~]# cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdd2[1] sdc2[0]
      1951808 blocks [2/2] [UU]

md2 : active raid1 sdd3[1]
      78268608 blocks [2/1] [_U]

md0 : active raid1 sdd1[1] sdc1[0]
      192640 blocks [2/2] [UU]

Here md2 is degraded: [2/1] [_U] means only one of its two members is active. Its partition on the first disk is dead and not being used. Install a new disk, which will then show up as a blank drive.

The partition table has to be recreated. Make sure you get this right!!

Display the partition tables of both drives:
:-> [swamp][~]$ fdisk -l /dev/sdc

Disk /dev/sdc: 82.3 GB, 82348277760 bytes
255 heads, 63 sectors/track, 10011 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1   *           1          25      200781   fd  Linux raid autodetect
/dev/sdc2              26         269     1959930   82  Linux swap / Solaris
/dev/sdc3             270        5140    39126307+  fd  Linux raid autodetect
/dev/sdc4            5141       10011    39126307+  fd  Linux raid autodetect
:-> [swamp][~]$ fdisk -l /dev/sdd

Disk /dev/sdd: 82.3 GB, 82348277760 bytes
255 heads, 63 sectors/track, 10011 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdd doesn't contain a valid partition table

From this you can see that sdd is the blank drive, no doubt about it. So don’t copy sdd’s table to sdc; that would be bad. Note also that the drives are identical in size. This is good. In fact, it might be smart to partition from the start with a few percent left unused, to account for differences between brands in drives of the same nominal size.
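
To put a number on that headroom idea, here is a small arithmetic sketch. The function name `cylinders_with_headroom` and the 2% reserve are arbitrary choices of mine; the 10011 cylinder count is the one fdisk reports above.

```shell
# cylinders_with_headroom TOTAL PERCENT
# Number of cylinders to partition when PERCENT of the disk is left unused.
cylinders_with_headroom() {
    total=$1
    reserve=$2
    echo $(( $total * (100 - $reserve) / 100 ))
}

# 10011 cylinders as reported by fdisk; the 2% reserve is an assumption.
cylinders_with_headroom 10011 2
```

With a 2% reserve the last partition would end at cylinder 9810 instead of 10011, giving up roughly 200 cylinders as insurance against a slightly smaller replacement drive.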

Rebuild the partition table using the same technique used at installation:

sfdisk -d /dev/sdc | sfdisk /dev/sdd

Double check:

fdisk -l /dev/sdc; fdisk -l /dev/sdd
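
If comparing the two listings by eye feels error prone, a sketch like the following could compare the `sfdisk -d` dumps instead. The function names `normalize_dump` and `same_layout` are hypothetical helpers of mine. The only assumption about the dump format is that the device path appears literally in it, so replacing it with a placeholder makes the two dumps comparable.

```shell
# normalize_dump DEVICE: read an `sfdisk -d` dump on stdin and replace the
# device path with a placeholder so dumps from two drives can be compared.
normalize_dump() {
    sed -e "s|$1|DISK|g"
}

# same_layout DISK_A DISK_B: succeed only if both dumps are non-empty and
# identical once the device names are normalized.
same_layout() {
    a=$(sfdisk -d "$1" | normalize_dump "$1")
    b=$(sfdisk -d "$2" | normalize_dump "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

# Example (run as root):
#   same_layout /dev/sdc /dev/sdd && echo "partition tables match"
```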

Set up swap space on the other volume and go ahead and use it:

:-> [swamp][~]$ mkswap /dev/sdd2
Setting up swapspace version 1, size = 2006962 kB
no label, UUID=81258423-4245-431b-8fb9-137e9651e7dd
:-> [swamp][~]$ swapon /dev/sdd2

Resetting Old RAID Partitions

Sometimes you might run one drive for a while and then introduce its matching RAID mirror, and the mirror will spontaneously associate with an md device. Here’s how to clear that before adding it back to the real md set.

If I add hdc1 and it shows an md0 that shouldn’t exist, like this:

md0 : active raid1 hdc1[1]
      104320 blocks [2/1] [_U]

Mark it as failed:

# mdadm --manage --fail /dev/md0

And then:

# mdadm --manage --remove /dev/md0

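This actually didn’t seem to work! (Note that --fail and --remove expect a member device such as /dev/hdc1 to be named after the array, which the commands above omit.) I just fixed the mdadm.conf and rebooted.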

Do the recovery mirroring

Do the small easy one first:

:-> [swamp][~]$ mdadm --manage /dev/md1 --add /dev/sdd1
mdadm: added /dev/sdd1

Now check to see that it’s working:

:-> [swamp][~]$ cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdd1[2] sdc1[0]
      200704 blocks [2/1] [U_]
      [===================>.]  recovery = 98.4% (198912/200704) finish=0.0min speed=39782K/sec

Check again to see it complete:

:-> [swamp][~]$ cat /proc/mdstat
Personalities : [raid1]
md1 : active raid1 sdd1[1] sdc1[0]
      200704 blocks [2/2] [UU]

Do the rest:

:-> [swamp][~]$ mdadm --manage /dev/md3 --add /dev/sdd3
:-> [swamp][~]$ mdadm --manage /dev/md4 --add /dev/sdd4

These things go sequentially, so don’t panic if one waits for the other to finish. If you issue these commands, they’ll work eventually.

You can keep an eye on it:

watch cat /proc/mdstat
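
If you would rather have a one-line summary than watch the whole file, here is a sketch that pulls the rebuild percentage out of mdstat-style text. The function name `rebuild_progress` is made up, and it assumes progress lines of the form shown above ("recovery = 98.4% ..."); it also accepts "resync" in the same position.

```shell
# rebuild_progress: read /proc/mdstat-style text on stdin and print
# "ARRAY PERCENT" for each array that has a recovery or resync in progress.
rebuild_progress() {
    awk '
        $1 ~ /^md/ && $2 == ":" { name = $1 }      # remember the current array
        /(recovery|resync) *=/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /%$/) print name, $i      # the field ending in "%"
        }
    '
}

# Only run against the live file where it exists.
if [ -r /proc/mdstat ]; then
    rebuild_progress < /proc/mdstat
fi
```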

Go ahead and put the bootloader in the MBR on the new drive using the procedure outlined above.

Mounting a partition that used to be in a RAID set

Maybe you’ve decommissioned a pair of mirrored RAID drives and you’re thinking of reusing them but you want to see what was on them to make sure it’s nothing important. Here is how to simply mount this drive on a different Linux box. You can check to see that the drive is plugged in and recognized somewhere with:

$ cat /proc/partitions
major minor  #blocks  name
 3     0  156290904 hda
 3     1     104391 hda1
 3     2     498015 hda2
22     0  156290904 sdb
22     1     104391 sdb1
22     2     498015 sdb2

There it is, sdb. But if you try:

$ mount /dev/sdb1 /mnt/b/1

You get:

mount: unknown filesystem type 'mdraid'

It doesn’t work like that. You need to make sure all your kernel modules are happy for RAID (having them compiled into the kernel works too):

# modprobe md
# modprobe raid1

If that’s good and you have sys-fs/mdadm installed then you can do:

# mdadm --assemble /dev/md1 /dev/sdb1
# mount /dev/md1 /mnt/b/1

And you’re looking at the contents.

Using Hardware RAID On Linux with 3Ware

Notes about hardware RAID using a particular 3Ware card (which I don’t use any more, but maybe this information will still be useful).

Replacing a drive after a failure

1. First find the problem. Use a command much like this:

 :-> [hb][~]$ /sbin/tw_cli /c0 show

 Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
 ------------------------------------------------------------------------------
 u0    RAID-5    OK             -       -       64K     4656.51   ON     OFF
 u1    SPARE     OK             -       -       -       465.753   -      OFF

 Port   Status           Unit   Size        Blocks        Serial
 ---------------------------------------------------------------
 p0     OK               u0     465.76 GB   976773168     WD-WCANU1306978
 p1     OK               u0     465.76 GB   976773168     WD-WCANU1241597
 p2     OK               u0     465.76 GB   976773168     WD-WCANU1230955
 p3     OK               u0     465.76 GB   976773168     WD-WCANU1222179
 p4     OK               u0     465.76 GB   976773168     WD-WCANU1318737
 p5     OK               u0     465.76 GB   976773168     WD-WCANU1230683
 p6     OK               u0     465.76 GB   976773168     WD-WCANU1240889
 p7     OK               u0     465.76 GB   976773168     WD-WCANU1234675
 p8     OK               u0     465.76 GB   976773168     9QG0DRJH
 p9     SMART-FAILURE    u1     465.76 GB   976773168     WD-WCANU1231205
 p10    OK               u0     465.76 GB   976773168     WD-WCANU1255530
 p11    OK               u0     465.76 GB   976773168     WD-WCANU1059605

This shows that drive 9 is showing signs of flakiness. Though it may work great, it may not, and it should be replaced. It is likely that a problem drive will be isolated in its own RAID unit, since the controller will pull the good drive out of the hot spare unit and start using it in the array.
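
With a dozen ports it is easy to miss the one bad line, so a filter along these lines might help. It is only a sketch: the function name `bad_ports` is mine, and it assumes the port table layout shown above (port name in the first column, status in the second).

```shell
# bad_ports: read `tw_cli /c0 show` output on stdin and print the port and
# status of every drive whose status is anything other than OK.
bad_ports() {
    awk '$1 ~ /^p[0-9]+$/ && $2 != "OK" { print $1, $2 }'
}

# Example:
#   /sbin/tw_cli /c0 show | bad_ports
```

Against the listing above this would print only "p9 SMART-FAILURE".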

2. Now extract the bad drive. Identifying the right physical drive can be very confusing, and pulling the wrong one when things are already bad makes them much worse, so be careful. I have proven twice this month that the labels that are on hb are good. There are no labels on puzzlebox.

 :-> [hb][~]$ /sbin/tw_cli /c0 show

 Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
 ------------------------------------------------------------------------------
 u0    RAID-5    OK             -       -       64K     4656.51   ON     OFF

 Port   Status           Unit   Size        Blocks        Serial
 ---------------------------------------------------------------
 p0     OK               u0     465.76 GB   976773168     WD-WCANU1306978
 p1     OK               u0     465.76 GB   976773168     WD-WCANU1241597
 p2     OK               u0     465.76 GB   976773168     WD-WCANU1230955
 p3     OK               u0     465.76 GB   976773168     WD-WCANU1222179
 p4     OK               u0     465.76 GB   976773168     WD-WCANU1318737
 p5     OK               u0     465.76 GB   976773168     WD-WCANU1230683
 p6     OK               u0     465.76 GB   976773168     WD-WCANU1240889
 p7     OK               u0     465.76 GB   976773168     WD-WCANU1234675
 p8     OK               u0     465.76 GB   976773168     9QG0DRJH
 p9     DRIVE-REMOVED    -      -           -             -
 p10    OK               u0     465.76 GB   976773168     WD-WCANU1255530
 p11    OK               u0     465.76 GB   976773168     WD-WCANU1059605

This is as expected and now the hot spare unit (u1) goes away.

3. Put a new drive in. This is much easier than step 2. Don’t forget to bring a Phillips screwdriver to the machine room, by the way.

 :-> [hb][~]$ /sbin/tw_cli /c0 show

 Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
 ------------------------------------------------------------------------------
 u0    RAID-5    OK             -       -       64K     4656.51   ON     OFF

 Port   Status           Unit   Size        Blocks        Serial
 ---------------------------------------------------------------
 p0     OK               u0     465.76 GB   976773168     WD-WCANU1306978
 p1     OK               u0     465.76 GB   976773168     WD-WCANU1241597
 p2     OK               u0     465.76 GB   976773168     WD-WCANU1230955
 p3     OK               u0     465.76 GB   976773168     WD-WCANU1222179
 p4     OK               u0     465.76 GB   976773168     WD-WCANU1318737
 p5     OK               u0     465.76 GB   976773168     WD-WCANU1230683
 p6     OK               u0     465.76 GB   976773168     WD-WCANU1240889
 p7     OK               u0     465.76 GB   976773168     WD-WCANU1234675
 p8     OK               u0     465.76 GB   976773168     9QG0DRJH
 p9     OK               -      465.76 GB   976773168     WD-WCAS81281763
 p10    OK               u0     465.76 GB   976773168     WD-WCANU1255530
 p11    OK               u0     465.76 GB   976773168     WD-WCANU1059605

So now drive 9 is back in and it seems OK, but it’s not part of any unit. It’s basically doing nothing useful.

4. Assign the new drive to a hot spare group. This is actually the important and non-obvious part of the operation.

 :-> [hb][~]$ /sbin/tw_cli /c0 add type=spare disk=9
 Creating new unit on controller /c0 ...  Done. The new unit is /c0/u1.

 :-> [hb][~]$ /sbin/tw_cli /c0 show

 Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
 ------------------------------------------------------------------------------
 u0    RAID-5    OK             -       -       64K     4656.51   ON     OFF
 u1    SPARE     OK             -       -       -       465.753   -      OFF

 Port   Status           Unit   Size        Blocks        Serial
 ---------------------------------------------------------------
 p0     OK               u0     465.76 GB   976773168     WD-WCANU1306978
 p1     OK               u0     465.76 GB   976773168     WD-WCANU1241597
 p2     OK               u0     465.76 GB   976773168     WD-WCANU1230955
 p3     OK               u0     465.76 GB   976773168     WD-WCANU1222179
 p4     OK               u0     465.76 GB   976773168     WD-WCANU1318737
 p5     OK               u0     465.76 GB   976773168     WD-WCANU1230683
 p6     OK               u0     465.76 GB   976773168     WD-WCANU1240889
 p7     OK               u0     465.76 GB   976773168     WD-WCANU1234675
 p8     OK               u0     465.76 GB   976773168     9QG0DRJH
 p9     OK               u1     465.76 GB   976773168     WD-WCAS81281763
 p10    OK               u0     465.76 GB   976773168     WD-WCANU1255530
 p11    OK               u0     465.76 GB   976773168     WD-WCANU1059605

Now the controller knows that this is a hot spare. Everything looks good again.

5. Immediately go buy a replacement! For hb, there’s one in my office that should not be used for anything but hb and there’s one in the machine room locker which the full sys admin team can get at in my absence. They should both be there!

6. Bonus step: send the dead drive in to the manufacturer if it’s still under warranty. Can’t have too many of these things lying around.

Areca Cards

cli64

I usually put this utility at /root/cli64.

Show status of current raid environment. This is good for diagnostic checks.

`disk info`

If there is a failure you probably need to pull that failed disk out. It is important to get the correct number. The proper number for finding the physical disk is in the column labeled "Slot#".

There is password protection. I generally have no need for it and will set it to "0000". This command must be issued in the session that requires password clearance. Just do this before proceeding.

`set password=0000`

To set auto activation of an incomplete RAID (1 is on, 0 is off), use the following. Unfortunately, this doesn’t always work.

`sys autoact p=1`

To see what is going on with a "raid set":

`rsf info raid=3`

If you pull the bad drive and replace it, you might get a "Free" status. This doesn’t help anything. Either your previous Hot Spares are hard at work now becoming primary working drives or you will be activating them to do so. Either way, you want that "Free" drive to be a Hot Spare (assuming you weren’t dumb enough to set up a system without a hot spare). To do this you need the following command with the number of the drive. The number is important to get right, and it is confusing. The disk info command has a first column labeled "CLI> #". Use this number to specify which drive to turn into a hot spare. This is (probably) different from the number of the bay to pull a physical drive out of after a failure. For example, this is to turn the drive in physical bay 17 into a hot spare.

`rsf createhs drv=25`

Sometimes there is a failure and the raid set just sits there. I think this tends to happen when the drive fails during a reboot (which is a time slightly more prone to failure). Here is a sequence where I check the raid set which contained a "Failed" that was replaced and turned to "Free". That is not ok since you actually want the drive’s status to be "rs2" or whatever raid set is correct and the rsf info’s "Raid Set State" to be "Rebuilding".

CLI> rsf info raid=3
Raid Set Information
===========================================
Raid Set Name        : rs2
Member Disks         : 7
Total Raw Capacity   : 14000.0GB
Free Raw Capacity    : 14000.0GB
Min Member Disk Size : 2000.0GB
Raid Set State       : Incompleted
===========================================
GuiErrMsg<0x00>: Success.

CLI> rsf activate raid=3
GuiErrMsg<0x00>: Success.

CLI> rsf info raid=3
Raid Set Information
===========================================
Raid Set Name        : rs2
Member Disks         : 7
Total Raw Capacity   : 14000.0GB
Free Raw Capacity    : 0.0GB
Min Member Disk Size : 2000.0GB
Raid Set State       : Rebuilding
===========================================
GuiErrMsg<0x00>: Success.
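
To check the state from a script, a filter like this could pick out just the "Raid Set State" field. This is a sketch: the function name `raid_set_state` is made up, it assumes the "Name : Value" layout shown in the session above, and it assumes cli64 accepts the command as arguments the way the other one-shot cli64 invocations in these notes do.

```shell
# raid_set_state: read `rsf info` output on stdin and print only the value
# of the "Raid Set State" field (e.g. Incompleted, Rebuilding).
raid_set_state() {
    awk -F ': *' '/^Raid Set State/ { print $2 }'
}

# Example (path and raid set number as used elsewhere in these notes):
#   /root/cli64 rsf info raid=3 | raid_set_state
```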

How’s that rebuild going? Check with:

`cli64 vsf info`

If there are problems, check the log:

`cli64 event info`

Areca support likes to see what you’re running:

`cli64 sys info`
`cli64 sys showcfg`

Note
If a drive fails during a reboot, the RAID card doesn’t know what to make of the situation. So instead of automatically rebuilding from a hot spare, it will just do nothing. You have to activate it as shown above. This means that whenever the file server is booted, a drive status report should be generated to make sure everything is starting out properly.
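
One way to get such a report is a small wrapper called from whatever boot-time hook your system offers. This is only a sketch: the function name `status_report` and the log path in the example are assumptions of mine, and how you hook it into the boot sequence depends on the distribution.

```shell
# status_report OUTFILE COMMAND [ARGS...]
# Run COMMAND and save its output, under a timestamp header, in OUTFILE.
status_report() {
    out=$1
    shift
    {
        echo "RAID status report: $(date)"
        "$@" 2>&1
    } > "$out"
}

# Example, using the cli64 location mentioned above (log path is hypothetical):
#   status_report /var/log/raid-boot-status.txt /root/cli64 disk info
```

The saved file can then be read or mailed by hand; nothing here interprets the controller's output.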