Experimenting with btrfs in production.

04 November 2019

EDIT - 20230422 - Fixed the command to increase the amount of space used on a new and bigger drive. Also updated some of the links because the official btrfs page has changed.

EDIT - 20230129 - Changed the btrfs replacement command a bit. Added a command block to force the SATA controller to rescan the devices available to it.

EDIT - 20211120 - Edited the page so that it makes more sense. The last couple of edits were out of sequence. Cleaned up a few things, too.

EDIT - 20211107 @ 1324 UTC-7 - Added how to monitor the drive replacement process.

EDIT - 20201206 @ 2216 UTC-7 - Added how to remove a hard drive and replace it with a bigger one to upgrade.

EDIT - 20200311 @ 1859 UTC-7 - Added how to replace a dead hard drive in a btrfs pool.

EDIT - 20191104 @ 2057 UTC-7 - Figured out how long it takes to scrub 40TB of disk space.  Also did a couple of experiments with rebalancing btrfs and monitored how long it took.

A couple of weeks ago while working on Leandra I started feeling more and more dissatisfied with how I had her storage array set up.  I had a bunch of 4TB hard drives inside her chassis glued together with Linux's [mdadm](https://wiki.archlinux.org/index.php/RAID) subsystem into what amounts to a mother-huge hard drive (a RAID-5 array with a hotspare in case one blew out), and LVM on top of that which let me pretend that I was partitioning that mother-huge hard drive so I could mount large-ish pieces of it in different places.  The thing is, while you can technically resize those virtual partitions (logical volumes) to reallocate space, it's not exactly easy.  There's a lot of fiddly stuff that you have to do (resize the file system, resize the logical volume to match, grow the logical volume that needs space, grow the filesystem that needs space, make sure that you actually have enough space) and it gets annoying in a crisis.  There was a second concern, which was figuring out which drive was the one that blew out when none of them were labelled or even had indicators of any kind that showed which drive was doing something (like throwing errors because it had crashed).  This was a problem that required fairly major surgery to fix, on both the hardware and the software side.

By the bye, the purpose of this post isn't to show off how clever I am or brag about Leandra.  This is one part the kind of tutorial I wish I'd had when I was first starting out, and I hope that it helps somebody wrap their mind around some of the more obscure aspects of system administration.  This post is also something of a cheatsheet, both for me and for anyone out there in a similar situation who needs to get something fixed in a hurry without a whole lot of trial and error.  If deep geek porn isn't your thing, feel free to close the tab; I don't mind (but keep it in mind if you know anyone who might need it later).

Ultimately, what I was looking for was something that'd let me treat Leandra's filesystem as a single huge hard drive with as few layers of software on top as possible, and as little mucking around with storage allocation as I could get away with.  I wanted something along the lines of what I had when I first started running Linux: A formatted hard drive with everything on it, and I didn't have to worry about what partition had how much space and where it was mounted.  The first thing I did was acquire a set of larger hard drives, because if I was going to rebuild Leandra's drive array I may as well do it from scratch and get an upgrade in the bargain.  So, I spent a few months picking up 8 TB SATA hard drives here and there and a really nice external drive array so I could run some experiments without having to crack open Leandra's case before it was time.  At the same time I was researching btrfs so I knew what I was getting into.  For starters, btrfs's RAID-5 and RAID-6 support isn't really stable.  It still has a couple of deal-breaking bugs but RAID-1 has been stable for a long time and I'm okay with that.

If you're not familiar with RAID I've already mentioned two different variants, RAID-1 and RAID-5.  To explain them in a non-technical manner, RAID-1 is sometimes referred to as drive mirroring.  Basically, the Linux kernel says "Here are two hard drives of the same size.  I'm going to make them exact duplicates of one another down to the very last bit.  Every time something is written to the file system, it'll be written to both drives at once, so if one of them dies there will still be one copy."  RAID-5 is block-level striping with parity; what this means is that a group of hard drives is stuck together end-to-end into an array.  Every time something is written to the file system, that file is written in a row across the array, so that every drive has a little piece of it.  If one of the drives blows out, the parity values stored in the array can be used to recalculate the missing pieces when the drive is replaced.
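
If the "parity values" part sounds like hand-waving, the underlying trick is just XOR.  Here's a toy demonstration you can run in bash (the byte values are made up, and there's nothing btrfs-specific about it):

# Three "drives", each holding one byte of a stripe.
[root@leandra drwho]# d1=0x5a; d2=0x3c; d3=0xf0

# The parity value is the XOR of all of the data bytes.
[root@leandra drwho]# printf 'parity: 0x%02x\n' $(( d1 ^ d2 ^ d3 ))
parity: 0x96

# If the drive holding d2 dies, XORing the survivors with the parity
# value reconstructs the missing byte.
[root@leandra drwho]# printf 'recovered: 0x%02x\n' $(( d1 ^ d3 ^ 0x96 ))
recovered: 0x3c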

btrfs' implementation of RAID-1 is a little different from how it's usually done, and in some ways is a little more efficient.  Essentially, you give btrfs a bunch of hard drives, called a pool, and you tell it how you want to use that pool.  In Leandra's case I specified RAID-1 for both data and filesystem metadata.  Rather than doing a lot of work making two drives into perfect bit-level copies of each other, btrfs basically says, "Okay, I'll just keep two copies of your data in the pool on different devices, so if one drive goes there will be a very good chance that the other copies will be fine."

Before anyone out there asks, the reason I didn't go with ZFS is because it's not a first-class citizen in the Linux kernel due to licensing conflicts.  I do not feel comfortable putting lots of important data on a file system that doesn't get the same kind of care and feeding as the rest of the Linux kernel.  btrfs has been incorporated into the Linux kernel since 2009 and is under constant development.

Installing the btrfs software package was about as straightforward as it gets:

[root@leandra drwho]# pacman -S btrfs-progs

Building the initial btrfs array was pretty easy - I installed the four 8 TB drives in the external array, plugged it into Leandra, and powered it up.  Let's call them /dev/sda, /dev/sdb, /dev/sde, and /dev/sdf (because that's what they look like now; I didn't make notes or screencaps during the process in case it blew up in my face and I had to abort).  They were brand-new drives, right out of the box, so I partitioned each one identically:

[root@leandra drwho]# fdisk -l /dev/sdf
Disk /dev/sdf: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: WDC WD80EMAZ-00W
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 7E9388E2-4428-4DEC-A2BF-D18C7F18804E

Device     Start         End     Sectors  Size Type
/dev/sdf1   2048 15628053134 15628051087  7.3T Linux filesystem
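
I did the partitioning interactively with fdisk (n to create the partition, accept the defaults, w to write it out).  If you'd rather script it, something like this sgdisk invocation (from the gptfdisk package; an equivalent I didn't actually use here) should produce the same single-partition layout:

# Wipe any existing partition tables, then create one "Linux filesystem"
# (type 8300) partition spanning the whole drive.
[root@leandra drwho]# sgdisk --zap-all /dev/sdf
[root@leandra drwho]# sgdisk --new=1:0:0 --typecode=1:8300 /dev/sdf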

Because this was a learning experience (I've never worked with btrfs before), I first made sure that Leandra's backups were in good condition and restorable should it prove necessary.  I started off by making a basic RAID-1 with two of the drives:

# -L btrfs - give the pool the friendly name "btrfs"
# -d raid1 - file data in the array is in a RAID-1
# -m raid1 - file metadata in the array is in a RAID-1, also
[root@leandra drwho]# mkfs.btrfs -L btrfs -d raid1 -m raid1 /dev/sda1 /dev/sdb1

# mount the pool with the friendly name "btrfs" at /btrfs
[root@leandra drwho]# mkdir /btrfs
[root@leandra drwho]# mount LABEL=btrfs /btrfs

Boom.  8 terabytes of disk space.  Now, let's talk about that whole "treat the array the way I used to treat a single hard drive back in the day" thing.  btrfs implements the idea of subvolumes which, without getting into namespaces or anything like that, can be thought of like this: "Make a subdirectory called /btrfs/home.  Give it the friendly name 'home'.  Flip a couple of bits on 'home' so that Linux thinks that subdirectory is a hard drive named 'home'."  Now I can refer to it with that friendly name and not have to type the GUID for that device (which is this if you're curious: b7be05b1-63ef-4ae9-878c-2be8aa219d62).  Nice for keeping things straight but terrible for typing.  I did this for the four directory trees that aren't on the boot device (/home, /opt, /srv, and /var):

[root@leandra drwho]# btrfs subvolume create /btrfs/home
[root@leandra drwho]# btrfs subvolume create /btrfs/opt
[root@leandra drwho]# btrfs subvolume create /btrfs/srv
[root@leandra drwho]# btrfs subvolume create /btrfs/var

# a lot of btrfs commands won't work unless the pool is mounted
[root@leandra drwho]# btrfs subvolume list /btrfs
ID 256 gen 90747 top level 5 path home
ID 259 gen 73133 top level 5 path opt
ID 260 gen 74883 top level 5 path srv
ID 261 gen 90748 top level 5 path var

Just to have something to work with, I copied some data into the array from directories that don't really change very often.

[root@leandra drwho]# cd /
[root@leandra drwho]# rsync -a opt/ /btrfs/opt/
[root@leandra drwho]# rsync -a srv/ /btrfs/srv/

Nice.  Then I added the other two hard drives to the btrfs pool to get a feel for how to grow it:

[root@leandra drwho]# btrfs device add /dev/sde1 /btrfs
[root@leandra drwho]# btrfs device add /dev/sdf1 /btrfs

I'm not sure if I really need to do this, but just in case I decided to manually grow the filesystem so it filled the entire pool before starting a rebalance to distribute the data more evenly:

[root@leandra drwho]# btrfs filesystem resize max /btrfs
[root@leandra drwho]# btrfs filesystem balance /btrfs
[root@leandra drwho]# btrfs filesystem show
Label: 'btrfs'  uuid: b7be05b1-63ef-4ae9-878c-2be8aa219d62
        Total devices 4 FS bytes used 4.14TiB
        devid    1 size 7.28TiB used 2.07TiB path /dev/sdf1
        devid    2 size 7.28TiB used 2.07TiB path /dev/sde1
        devid    3 size 7.28TiB used 2.07TiB path /dev/sdb1
        devid    4 size 7.28TiB used 2.07TiB path /dev/sda1

Rebalancing can take a very long time, so it's not the sort of thing you'd want to do very often.  Best practice is supposedly every month or so.  Given that a full rebalance takes two or three days to finish on Leandra, I do it every six to twelve months; most of the time I use filters to restrict which data blocks get shuffled around.
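
If you want to automate the filtered kind of rebalance, a crontab entry along these lines would do it (the schedule and thresholds here are hypothetical; tune them to your own pool):

# /etc/crontab - rebalance data and metadata blocks that are less than
# half full, at 0300 on the first of every month.
0 3 1 * * root /usr/bin/btrfs balance start -dusage=50 -musage=50 /btrfs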

To make sure that the subvolumes would mount on boot, I edited the /etc/fstab file to reference them correctly:

# btrfs volume
#UUID=b7be05b1-63ef-4ae9-878c-2be8aa219d62
LABEL=btrfs     /btrfs          btrfs   autodefrag,noatime      0 0

# switch out the old mounts (commented out) for the new ones
#LABEL=home     /home           ext4     rw,noatime,stripe=512,data=ordered     0 2
LABEL=btrfs     /home   btrfs   subvol=home,autodefrag,noatime  0 0

#LABEL=opt      /opt            ext4     rw,noatime,stripe=512,data=ordered     0 4
LABEL=btrfs     /opt    btrfs   subvol=opt,autodefrag,noatime   0 0

#LABEL=srv      /srv            ext4     rw,noatime,stripe=512,data=ordered     0 5
LABEL=btrfs     /srv    btrfs   subvol=srv,autodefrag,noatime   0 0

#LABEL=var      /var            ext4     rw,noatime,stripe=512,data=ordered     0 3
LABEL=btrfs     /var    btrfs   subvol=var,autodefrag,noatime   0 0
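
Before rebooting it's worth checking that the new entries actually parse and mount.  Something like this works as a sanity check (I didn't capture my own output at the time):

# Mount everything in /etc/fstab that isn't already mounted; an error
# here means there's a typo in the file.
[root@leandra drwho]# mount -a

# Eyeball the mounted btrfs subvolumes.
[root@leandra drwho]# findmnt -t btrfs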

Now to replicate the rest of the data into the btrfs pool so Leandra could hit the ground running when I booted her back up.  Time to drop into single-user mode, log in as the root user, and kick off the fairly lengthy task.

# i hate systemd.
[root@leandra drwho]# systemctl isolate rescue.target

[root@leandra drwho]# cd /
[root@leandra drwho]# rsync -a home/ /btrfs/home/
[root@leandra drwho]# rsync -a var/ /btrfs/var/
# this takes a while...

[root@leandra drwho]# shutdown -h now

That was a pretty solid proof-of-concept if I ever saw one.  Now to worry about the hardware.  I mentioned earlier that not having any way to tell what drive was doing what was a problem.  While I was traveling last month I picked up some hot-swap drive bays so I could pop drives out and back in without needing to power Leandra down or break out the toolkit.  I picked up a four-drive unit and a three-drive unit which filled Leandra's external-facing bays all the way.  I was going to put all of the hard drives into those bays and leave the last one on the bottom standing empty.  In the event that a drive dies I can look for the error light, figure out which drive it was, slap a replacement into the empty bay, and then trigger reconstruction.  This part of the process was pretty much what you'd expect: Power Leandra down, rip out the old hard drives, install the hotswap bays, transfer the drives from the external chassis into those bays, and re-do the cabling.

Much to my surprise, when I booted Leandra back up she automatically detected the btrfs pool, fired it up, mounted the subvolumes in /etc/fstab, and we were back in business.  I let this configuration soak for a week or two and encountered no problems, crashes, or incompatibilities.  Just to see what would happen, I grabbed two of the older 4 TB hard drives I'd removed earlier, repartitioned them to clean them out and popped them into two of the empty hot-swap bays.  The Linux kernel detected the newly installed drives as /dev/sdg and /dev/sdh and all I had to do was add them to the pool and trigger a rebalance:

[root@leandra drwho]# btrfs device add /dev/sdg1 /btrfs
[root@leandra drwho]# btrfs device add /dev/sdh1 /btrfs

# Note: This command is for resizing the array but doesn't actually change
#   how much space on each individual drive is or is not used.  It's a subtle
#   and somewhat irritating distinction.
[root@leandra drwho]# btrfs filesystem resize max /btrfs
[root@leandra drwho]# btrfs filesystem balance /btrfs
[root@leandra drwho]# btrfs filesystem df /btrfs 
Data, RAID1: total=4.13TiB, used=4.13TiB
System, RAID1: total=32.00MiB, used=608.00KiB
Metadata, RAID1: total=9.00GiB, used=7.61GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

# hmm... nice, but that doesn't really tell me what I want to know
# let's look at the devices in the pool
[root@leandra drwho]# btrfs filesystem show
Label: 'btrfs'  uuid: b7be05b1-63ef-4ae9-878c-2be8aa219d62
        Total devices 6 FS bytes used 4.14TiB
        devid    1 size 7.28TiB used 2.07TiB path /dev/sdf1
        devid    2 size 7.28TiB used 2.07TiB path /dev/sde1
        devid    3 size 7.28TiB used 2.07TiB path /dev/sdb1
        devid    4 size 7.28TiB used 2.07TiB path /dev/sda1
        devid    5 size 3.64TiB used 0.00B path /dev/sdg1
        devid    6 size 3.64TiB used 0.00B path /dev/sdh1
# yay!

Incidentally, when I was figuring out how to physically label the hotswap bays, one of the methods from this discussion was what I used to trigger the drives' activity lights so I could tell them apart:

# which drive is devid 1?  it maps to /dev/sdf1, so...
[root@leandra drwho]# dd if=/dev/sdf1 of=/dev/null

# the first drive in the array is glowing solid purple, so I guess that's it
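
Another way to match bays to drives, without staring at blinkenlights, is to go by serial number: the /dev/sd? names can shuffle around between boots, but the symlinks under /dev/disk/by-id/ embed the model and serial number printed on each drive's label.  A quick sketch (your ID strings will obviously differ):

# Map the stable drive identifiers to the current /dev/sd? names,
# skipping the per-partition entries.
[root@leandra drwho]# ls -l /dev/disk/by-id/ata-* | grep -v part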

I already talked about btrfs rebalancing earlier as part of the "grow the pool" process.  It is recommended that you do this periodically as regular maintenance.  However, it is also recommended that you use filters to clean up data and metadata blocks that are only partially full, so that you don't bog your system down for days at a time.  I didn't time how long a rebalance of blocks that were 20% full took, but it was certainly less than five seconds.  Other runs were timed like this:

# XX is the percentage at which data blocks will be rebalanced in the pool.
# YY is the percentage at which metadata blocks will be rebalanced in the pool.
# Note that there should be no space in between -[dm] and "usage"
[root@leandra drwho]# time btrfs balance start -dusage=XX -musage=YY /btrfs
Done, had to relocate 2 out of 4286 chunks

real    0m1.109s
user    0m0.000s
sys     0m0.113s
  • 25% full blocks - about 1.11 seconds.
  • 30% full blocks - about 1.02 seconds.
  • 35% full blocks - 10.51 seconds.
  • 40% full blocks - 2.05 seconds.
  • 45% full blocks - 2.08 seconds.
  • 50% full blocks - 1 minute, 4.52 seconds.

These figures will probably change the longer I use Leandra's btrfs pool, but if I keep on top of it, it shouldn't get too bad. To keep tabs on the rebalancing process, use the following command:

[root@leandra drwho]# btrfs balance status /btrfs
Balance on '/btrfs' is running
2 out of about 3 chunks balanced (396 considered),  33% left

There are circumstances in which you may want to cancel a rebalancing run, say, if your UPS has just kicked on and you know you don't have a lot of time. This is how you'd do it:

[root@leandra drwho]# btrfs balance cancel /btrfs

UPDATED: Scrubbing is the process of verifying the data in the btrfs pool to make sure it's not corrupted.  This can be a lengthy process so it's probably not a good idea to run it very often.  Best practice is supposed to be once a month or so.  I ran a regular scrub on Leandra, and the process took 4:42:19 (4 hours, 42 minutes, 19 seconds).

[root@leandra drwho]# btrfs scrub start /btrfs

You won't get any output while it's running, but you can query the status of the scrub:

[root@leandra drwho]# btrfs scrub status /btrfs
UUID:             b7be05b1-63ef-4ae9-878c-2be8aa219d62
Scrub started:    Sat Nov 20 12:13:45 2021
Status:           running
Duration:         3:01:38
Time left:        0:09:44
ETA:              Sat Nov 20 15:25:08 2021
Total to scrub:   8.65TiB
Bytes scrubbed:   8.21TiB  (94.91%)
Rate:             789.97MiB/s
Error summary:    no errors found

You can, of course, run this command as part of a loop to keep tabs on the process:

[root@leandra drwho]# while true; do
    btrfs scrub status /btrfs
    echo
    sleep 10
done

If you really need to you can cancel a running scrub.  It won't mess the system up because it'll finish the part it's on and then cleanly terminate:

[root@leandra drwho]# btrfs scrub cancel /btrfs
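
Since scrubs are a monthly-ish chore they're a natural candidate for automation.  I believe recent builds of btrfs-progs ship a systemd timer template for exactly this (check whether your distro actually packages it before relying on it):

# The timer instance name is the mount point run through systemd's path
# escaping; for /btrfs that's just "btrfs".
[root@leandra drwho]# systemd-escape --path /btrfs
btrfs
[root@leandra drwho]# systemctl enable --now btrfs-scrub@btrfs.timer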

Speaking of I/O intensive stuff, one of the things I did was modify my backup scripts to detect when a btrfs maintenance operation was happening and gracefully abort if that was the case.  A handy thing about the btrfs scrub/balance status commands is that they exit with 0 if a job is not running, and non-zero if one is.  The exit value is captured in the $? variable.

#...
# Test for a running btrfs scrub job.
echo "Testing for a running btrfs scrub..."
sudo btrfs scrub status /btrfs > /dev/null
if [ $? -gt 0 ]; then
    echo "A btrfs scrub is running.  Terminating offsite backup."
    exit 1
else
    echo "btrfs scrub not running.  Proceeding."
    echo
fi

# Test for a running btrfs balance job.
echo "Testing for a running btrfs balance..."
sudo btrfs balance status /btrfs > /dev/null
if [ $? -gt 0 ]; then
    echo "A btrfs rebalance is running.  Terminating offsite backup."
    exit 1
else
    echo "btrfs rebalance not running.  Proceeding."
    echo
fi
#...

Let's get some more in-depth btrfs usage stats:

[root@leandra drwho]# btrfs filesystem usage /btrfs
Overall:
    Device size:                  36.39TiB
    Device allocated:              8.29TiB
    Device unallocated:           28.10TiB
    Device missing:                  0.00B
    Used:                          8.28TiB
    Free (estimated):             14.05TiB      (min: 14.05TiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,RAID1: Size:4.13TiB, Used:4.13TiB
   /dev/sda1       2.07TiB
   /dev/sdb1       2.06TiB
   /dev/sde1       2.07TiB
   /dev/sdf1       2.07TiB

Metadata,RAID1: Size:9.00GiB, Used:7.61GiB
   /dev/sda1       2.00GiB
   /dev/sdb1       7.00GiB
   /dev/sde1       2.00GiB
   /dev/sdf1       7.00GiB

System,RAID1: Size:32.00MiB, Used:608.00KiB
   /dev/sdb1      32.00MiB
   /dev/sde1      32.00MiB

Unallocated:
   /dev/sda1       5.21TiB
   /dev/sdb1       5.21TiB
   /dev/sde1       5.21TiB
   /dev/sdf1       5.21TiB
   /dev/sdg1       3.64TiB
   /dev/sdh1       3.64TiB

Let's see if any of the drives are throwing errors:

[root@leandra drwho]# btrfs device stats --check /btrfs | grep -v ' 0$'
[root@leandra drwho]# 
# nope.
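
That --check flag also makes the exit code useful: it comes back non-zero if any error counter is non-zero, so the same exit-status trick from the backup script above applies.  Here's a sketch of a little watchdog you could run out of cron (hypothetical script; adjust the notification to taste):

#!/usr/bin/env bash
# Complain loudly if any drive in the pool has logged I/O errors.
if ! btrfs device stats --check /btrfs > /dev/null; then
    echo "btrfs device errors detected on $(hostname)!" | \
        mail -s "btrfs error report" root
fi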

So, what happens when a hard drive dies?  Assuming that you have a replacement handy (and if you're building out an array like Leandra's, you probably do) that's as big as the drive that tanked or larger, the process is really simple.  First, though, here's what was flooding Leandra's kernel message buffer:

...
[Wed Mar 11 17:05:58 2020] sd 1:0:0:0: [sdb] tag#21 FAILED Result:
    hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Wed Mar 11 17:05:58 2020] sd 1:0:0:0: [sdb] tag#21 CDB: Write(16)
    8a 00 00 00 00 00 00 00 08 80 00 00 00 08 00 00
[Wed Mar 11 17:05:58 2020] BTRFS warning (device sdg1): lost page write due to
    IO error on /dev/sdb1
[Wed Mar 11 17:05:58 2020] BTRFS error (device sdg1): bdev /dev/sdb1 errs:
    wr 6423932, rd 301858, flush 14645, corrupt 0, gen 0
[Wed Mar 11 17:05:59 2020] BTRFS error (device sdg1): error writing primary super block to device 3
...

Over.  And over.  And over again.  What this means in essence is, Leandra was trying to write some data to /dev/sdg1, and it was supposed to be mirrored to /dev/sdb1, but /dev/sdb1 had flatlined so it wasn't writing the copy.  I plugged the replacement drive into the hot-swap bay I keep empty for just such an occasion and partitioned it like the others (regular old fdisk has supported GPT partition tables for a while now, so I didn't need any other software).  Once it was in I used btrfs filesystem show to determine which device number corresponded to /dev/sdb1 (scroll back up for that); for the record, it was device 3.  Watching the kernel message buffer when I plugged the new drive in (dmesg -Tw) I saw that it showed up as /dev/sdh1.  So, I followed the official documentation to start the replacement process:

[root@leandra drwho]# btrfs replace start -r 3  /dev/sdh1 /btrfs

The command line arguments, broken out:

  • replace - replace a failed drive
  • start - start the process
  • -r - don't try to read from the failed drive unless there is no other copy of a data block
  • 3 - the failed device is #3
  • /dev/sdh1 - the replacement drive
  • /btrfs - where the btrfs pool is mounted

You can monitor the status of the drive replacement while it's running with this command:

{13:22:59 @ Sun Nov 07}
[drwho @ leandra:(4) ~]$ sudo btrfs replace status /btrfs 
0.3% done, 0 write errs, 0 uncorr. read errs

The output won't update rapidly because the replacement process can take a while, but it's something to help you keep an eye on things. I didn't time how long the replacement process took; if I had to guess, not more than four hours from start to finish.  The end of the replacement process was anticlimactic at best (which is how maintenance should always be):

[root@leandra drwho]# btrfs filesystem show
Label: 'btrfs'  uuid: b7be05b1-63ef-4ae9-878c-2be8aa219d62
        Total devices 6 FS bytes used 4.39TiB
        devid    1 size 7.28TiB used 2.20TiB path /dev/sdg1
        devid    2 size 7.28TiB used 2.20TiB path /dev/sdf1
        devid    3 size 7.28TiB used 2.20TiB path /dev/sdh1
        devid    4 size 7.28TiB used 2.20TiB path /dev/sda1
        devid    5 size 3.64TiB used 0.00B path /dev/sdd1
        devid    6 size 3.64TiB used 0.00B path /dev/sdc1
# Hey - /dev/sdh1 is now devid 3!
# And where's /dev/sdb1?  ¯\_(ツ)_/¯

Finally, I yanked out the failed hard drive and got ready to take a sledgehammer to it.  Here's what it looked like in the kernel message buffer:

[Wed Mar 11 18:26:24 2020] ata2: SATA link down (SStatus 0 SControl 300)
[Wed Mar 11 18:26:24 2020] ata2.00: detaching (SCSI 1:0:0:0)
[Wed Mar 11 18:26:24 2020] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
[Wed Mar 11 18:26:24 2020] sd 1:0:0:0: [sdb] Synchronize Cache(10) failed:
    Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Wed Mar 11 18:26:24 2020] sd 1:0:0:0: [sdb] Stopping disk
[Wed Mar 11 18:26:24 2020] sd 1:0:0:0: [sdb] Start/Stop Unit failed: Result:
    hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK

What if you wanted to just upgrade a drive in your array? As you can see in the above examples I had two 4 TB hard drives in the pool of six. In the almost a year since I migrated Leandra to btrfs, I discovered that, when you build a pool of drives, you really do need to match the sizes. The two (older) 4 TB drives never got used once. Even after load testing and multiple rebalances, they never accumulated any data at all. So, when it came time to retire some older drives I picked one of those lame ducks. Since I wasn't taking notes while I did this, I don't have much in the way of captured output, but I do have the commands in my shell history.

I picked one of the old 4 TB drives to get rid of (to stay consistent with the rest of this document, it's /dev/sdc) and pulled it out of the array with a surprisingly obvious command.

[root@leandra drwho]# btrfs device delete /dev/sdc1 /btrfs

After a couple of seconds, it was... well... gone. There weren't any data blocks that had to be relocated because it was an odd drive out, so it was a quick procedure. I yanked the old drive out and tossed it into the shredder, then popped the new drive into the now vacant hot-swap bay. A quick peek at the output of dmesg showed that the replacement was device /dev/sdi.

As previously demonstrated, I created a new GPT partition table and disk partition to lay out the drive - /dev/sdi1. The command to splice the new drive into the array was:

[root@leandra drwho]# btrfs device add /dev/sdi1 /btrfs
[root@leandra drwho]# btrfs filesystem show /btrfs
Label: 'btrfs'  uuid: b7be05b1-63ef-4ae9-878c-2be8aa219d62
        Total devices 6 FS bytes used 2.01TiB
        devid    1 size 7.28TiB used 1.03TiB path /dev/sdd1
        devid    2 size 7.28TiB used 1.03TiB path /dev/sdc1
        devid    3 size 7.28TiB used 1.03TiB path /dev/sdf1
        devid    4 size 7.28TiB used 1.03TiB path /dev/sdg1
        devid    6 size 3.64TiB used 0.00B path /dev/sdb1
        devid    7 size 7.28TiB used 0.00B path /dev/sdi1

The add procedure didn't take very long, either. However, I then had to rebalance the btrfs pool to redistribute the data blocks. This process took just over 24 hours. What I did (and this is generally good practice) is do all of my work in a GNU Screen session so that I could disconnect from Leandra, do other things, and reconnect to pick up where I left off. The rebalance was chugging along in one shell while I monitored its progress in another:

{23:35:29 @ Sun Dec 06}
[drwho @ leandra:(4) ~]$ while true; do
> sudo btrfs balance status /btrfs
> echo
> sleep 30
> done

That command gave me progress reports every thirty seconds. Eventually, just before dinner a day later, I was greeted with the following output:

[root@leandra drwho]# btrfs filesystem show /btrfs
Label: 'btrfs'  uuid: b7be05b1-63ef-4ae9-878c-2be8aa219d62
        Total devices 6 FS bytes used 2.05TiB
        devid    1 size 7.28TiB used 839.03GiB path /dev/sdd1
        devid    2 size 7.28TiB used 841.00GiB path /dev/sdc1
        devid    3 size 7.28TiB used 840.00GiB path /dev/sdf1
        devid    4 size 7.28TiB used 841.00GiB path /dev/sdg1
        devid    6 size 3.64TiB used 0.00B path /dev/sdb1
        devid    7 size 7.28TiB used 841.03GiB path /dev/sdi1

The new drive was integrated into the pool, and the volume of data was scattered more evenly across more of the drives. I hadn't expected that to be quite so painless.

Note: Every time I update this post Leandra's hardware configuration has drifted a bit, so the device names and GUIDs are a little different. I could go back and make everything line up perfectly, but I really don't think there's much of a point. If you do this stuff, you're going to find that your system layout (device names, in particular) changes also.

Sometimes when you plug the replacement drive in, it won't show up as a /dev/sd? device file. It's a minor annoyance but is easily fixable by telling the SATA drivers to rescan every SATA interface and rebuild the list of devices they see. Here's how I do it (which I should put into a shell script but haven't gotten around to yet):

[root@leandra ~]# for i in /sys/class/scsi_host/*/scan; do
>    echo "- - -" > "$i"
> done

This process is fairly quick. If it takes longer than a minute you probably have bigger problems with your system.
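
For completeness, here's roughly what that shell script would look like if I ever got around to writing it (an untested sketch):

#!/usr/bin/env bash
# rescan-sata.sh - ask every SCSI/SATA host adapter to rescan its bus
# for new devices.  Run as root.
set -eu

for host in /sys/class/scsi_host/*/scan; do
    echo "Rescanning ${host%/scan}..."
    echo "- - -" > "$host"
done

echo "Done.  Check dmesg for newly detected drives."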

Something that I just had to do again and never really documented well: how to replace a failed drive in an array with a larger one, both to upgrade the amount of space and to make sure the array stayed consistent. The above procedures work, but there's an odd-one-out command that didn't really make sense in the context of the other stuff I wrote.

As background, Leandra's drives are getting up there again, with between 20,000 and 30,000 hours of runtime, and they're starting to fail. Because I built the drive array incrementally (one new drive per paycheck for a couple of months in a row), they're wearing out in more or less the same order they were installed. Upgrading to more space per drive for the same price point just makes sense, so I've been swapping out the bad 8TB drives with 14TB ones. The replacement procedure works the way it's supposed to, but it doesn't take into account the bigger drive size. The command sudo btrfs filesystem resize max /btrfs acts like it works and doesn't throw any errors, but it doesn't actually do what one would think:

{12:24:42 @ Fri Apr 21}
[drwho @ leandra:(7) user]$ sudo btrfs filesystem show
Label: 'btrfs'  uuid: b7be05b1-63ef-4ae9-878c-2be8aa219d62
        Total devices 6 FS bytes used 4.34TiB
        devid    1 size 7.28TiB used 1.45TiB path /dev/sde1
        devid    2 size 7.28TiB used 1.45TiB path /dev/sdd1
        devid    3 size 7.28TiB used 1.45TiB path /dev/sda1
        devid    4 size 7.28TiB used 1.45TiB path /dev/sdg1
        devid    7 size 7.28TiB used 1.45TiB path /dev/sdc1
        devid    8 size 7.28TiB used 1.45TiB path /dev/sdh1

Can you guess which drives are bigger than 8TB? Neither could I. So I resized each device in the pool individually:

{12:25:22 @ Fri Apr 21}
[drwho @ leandra:(7) user]$ for i in $(sudo btrfs filesystem show | grep devid | awk '{print $2}'); do
>    sudo btrfs filesystem resize $i:max /btrfs
> done

{12:29:11 @ Fri Apr 21}
[drwho @ leandra:(7) user]$ sudo btrfs filesystem show
Label: 'btrfs'  uuid: b7be05b1-63ef-4ae9-878c-2be8aa219d62
        Total devices 6 FS bytes used 4.34TiB
        devid    1 size 7.28TiB used 1.45TiB path /dev/sde1
        devid    2 size 7.28TiB used 1.45TiB path /dev/sdd1
        devid    3 size 12.73TiB used 1.45TiB path /dev/sda1
        devid    4 size 12.73TiB used 1.45TiB path /dev/sdg1
        devid    7 size 12.73TiB used 1.45TiB path /dev/sdc1
        devid    8 size 12.73TiB used 1.45TiB path /dev/sdh1
# Huh.