Working with software RAID in Linux.

05 October 2007

This post assumes that you've worked with Linux enough to know that software RAID exists in the v2.6 kernel series, though not necessarily how it works.

If you're not familiar with it, RAID (Redundant Array of Inexpensive Disks) is a set of techniques that replicate data across multiple hard drives on the assumption that, at some point, a drive is going to fail. If the data can be found in some form on another drive, it's still available. Otherwise you're out of luck unless you made backups, and if you're really unfortunate the machine crashed along with the drive, which means a full rebuild.

So, in a nutshell, this is how you'd get useful information out of the RAID subsystem, as well as how to fix things if a drive blows up. The easiest way to check on the status of your RAID arrays is by querying the kernel's /proc/mdstat file:

drwho@leandra:~$ cat /proc/mdstat 
Personalities : [raid1] 
md1 : active raid1 sdb2[1] sda2[0]
      1003968 blocks [2/2] [UU]

md2 : active raid1 sdb4[1] sda4[0]
      242428800 blocks [2/2] [UU]

md0 : active raid1 sdb1[1] sda1[0]
      256896 blocks [2/2] [UU]

unused devices: <none>

As you can see from the output, there are three RAID-1 instances running on Leandra, named md0 through md2. Each has two disk partitions associated with it (sda? and sdb?), and both are healthy and happy - you can tell from the [UU] block. Each position in that block stands for one drive, and a 'U' means that drive is up, i.e., functioning normally.

This is what a degraded RAID looks like in /proc/mdstat:

drwho@akara ~ $ cat /proc/mdstat 
Personalities : [raid1] 
md1 : active raid1 hda2[0]
      2000256 blocks [2/1] [U_]

md2 : active raid1 hda4[0]
      55364288 blocks [2/1] [U_]

md0 : active raid1 hda1[0]
      250368 blocks [2/1] [U_]

unused devices: <none>

As you can see, even though there are three RAID-1 instances, each one has only one active drive (hda?) where there should be two. Also, the status block of each looks like [U_], which means that one drive (hda) is up and the other is missing, i.e., dead.
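
Incidentally, if you want a script or cron job to flag this condition for you, a quick grep against /proc/mdstat will do the trick. This is just a minimal sketch that keys off the [U_]-style status blocks shown above - adjust the pattern to taste:

# print each degraded md device along with its status line
# (an underscore inside the [..] block marks a missing member)
grep -B1 '\[.*_.*\]' /proc/mdstat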

Another way of getting information out of the RAID subsystem is to use the mdadm command to directly query a RAID device:

leandra ~ # mdadm --query --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Fri Aug 25 16:42:22 2006
     Raid Level : raid1
     Array Size : 256896 (250.92 MiB 263.06 MB)
  Used Dev Size : 256896 (250.92 MiB 263.06 MB)
   Raid Devices : 2
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu Oct  4 16:17:43 2007
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 51b1e89c:44a34a14:de91038f:e41f5312
         Events : 0.1728

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1

From the output, you can see that device md0 has two devices (drives) associated with it, and that both devices are working at this time. The state of the RAID device is 'clean'. Either of these means that the array as a whole and the individual drives are functioning normally - in other words, you have nothing to worry about.

This is what a degraded RAID looks like in the output of the mdadm command:

akara ~ # mdadm --query --detail /dev/md0
/dev/md0:
        Version : 00.90.03
  Creation Time : Tue Jun 26 16:40:18 2007
     Raid Level : raid1
     Array Size : 250368 (244.54 MiB 256.38 MB)
  Used Dev Size : 250368 (244.54 MiB 256.38 MB)
   Raid Devices : 2
  Total Devices : 1
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu Oct  4 16:27:26 2007
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 0
  Spare Devices : 0

           UUID : 923b9460:b7c3cf49:84637909:be5f2af2
         Events : 0.14

    Number   Major   Minor   RaidDevice State
       0       3        1        0      active sync   /dev/hda1
       1       0        0        1      removed

Here, you can see that only one drive of the two in the array is operational: even though two RAID devices are listed, only one is actually working. At the very bottom of the output you can see that /dev/hda1 is active, but its twin (/dev/hdb1) isn't present; in its place there's just the word 'removed'. In other words, Akara's /dev/hdb blew out, so the kernel automatically failed it and removed it from the array.
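
As an aside, if a drive is obviously dying but the kernel hasn't kicked it out of the array yet, you can do it yourself with mdadm before powering down. A sketch, assuming the sick partitions are the /dev/hdb? members of Akara's arrays:

# mark the member as faulty, then pull it out of the array
mdadm /dev/md0 --fail /dev/hdb1
mdadm /dev/md0 --remove /dev/hdb1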

Now, the fix for this is relatively simple: Power down the affected machine, replace the dead drive, and boot back up into single user mode. Your replacement drive should ideally be identical in size to the dead one, but a larger drive won't hurt anything, and in fact might make future upgrades easier (note: I haven't actually tried this yet). Once in single user mode, you'll have to partition the new drive such that each partition has the same type ('fd', Linux RAID autodetect) as the corresponding partition on the good drive, and is the same size or larger. When this is done, you'd use the mdadm command to add each partition back into its RAID device. Assuming that the second drive in device md0 went bad, you'd add the first partition to the array thusly: mdadm /dev/md0 --add /dev/hdb1
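
If you'd rather not recreate the partitions by hand in fdisk, sfdisk can copy the partition table (types and all) from the surviving drive in one shot. Here's a rough sketch of the whole repair, assuming a layout like Akara's where /dev/hda is the good drive and /dev/hdb is the blank replacement - be absolutely certain which drive is which before running it, because sfdisk will cheerfully overwrite the wrong one:

# copy the partition table (including the 'fd' types) from the good
# drive to the replacement
sfdisk -d /dev/hda | sfdisk /dev/hdb

# then add each new partition back into its array
# (on Akara, md0, md1, and md2 live on partitions 1, 2, and 4)
mdadm /dev/md0 --add /dev/hdb1
mdadm /dev/md1 --add /dev/hdb2
mdadm /dev/md2 --add /dev/hdb4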

Substitute the names of the disk and RAID device as appropriate. And double-check your work before hitting the enter key, so you don't accidentally add the wrong partition to the wrong array!
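
One sanity check that helps here is to compare the partition tables of the two drives before adding anything; the layouts should line up. Something like this, again assuming Akara's hda/hdb naming:

# the good drive and the replacement should show matching partition layouts
fdisk -l /dev/hda /dev/hdb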

If you look at the contents of the /proc/mdstat file, you'll see that /dev/md0 now has two devices associated with it and that they are resynchronizing. The resync process can take a considerable amount of time, so be prepared to go out for pizza while this is happening. On Leandra (dual processor AMD Athlon-64 4800+) with an NVidia SATA-2 controller, rebuilding a mirror of 256 megabytes took less than 60 wallclock seconds. Rebuilding a 200 gigabyte mirror, on the other hand, took nearly 70 wallclock minutes. You can check on the progress by periodically looking at the contents of the /proc/mdstat file. I did it this way:

while true; do
cat /proc/mdstat
sleep 5
done

Use control+C to break out of this endless loop.
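
If you have the watch utility installed (it comes with the procps package on most distributions), it will do the same thing and redraw the screen for you:

watch -n 5 cat /proc/mdstat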

It should be noted that both rebuilds were done from single user mode, meaning that only a minimal set of systemware was running at the time and the hard drives weren't doing much except rebuilding the arrays.

When everything was finished, I rebooted Leandra one more time and let her come up normally.