Experimenting with btrfs in production.

Oct 19 2019

EDIT - 20200311 @ 1859 UTC-7 - Added how to replace a dead hard drive in a btrfs pool.

EDIT - 20191104 @ 2057 UTC-7 - Figured out how long it takes to scrub 40TB of disk space.  Also did a couple of experiments with rebalancing btrfs and monitored how long it took.

A couple of weeks ago while working on Leandra I started feeling more and more dissatisfied with how I had her storage array set up.  I had a bunch of 4TB hard drives inside her chassis glued together with Linux's mdadm subsystem into what amounts to a mother-huge hard drive (a RAID-5 array with a hotspare in case one blew out), and LVM on top of that which let me pretend that I was partitioning that mother-huge hard drive so I could mount large-ish pieces of it in different places.  The thing is, while you can technically resize those virtual partitions (logical volumes) to reallocate space, it's not exactly easy.  There's a lot of fiddly stuff that you have to do (resize the file system, resize the logical volume to match, grow the logical volume that needs space, grow the filesystem that needs space, make sure that you actually have enough space) and it gets annoying in a crisis.  There was a second concern, which was figuring out which drive was the one that blew out when none of them were labelled or even had indicators of any kind that showed which drive was doing something (like throwing errors because it had crashed).  This was a problem that required fairly major surgery to fix, on both hardware and software.

By the bye, the purpose of this post isn't to show off how clever I am or brag about Leandra.  This is one part the kind of tutorial I wish I'd had when I was first starting out, and I hope that it helps somebody wrap their mind around some of the more obscure aspects of system administration.  This post is also one part cheatsheet, both for me and for anyone out there in a similar situation who needs to get something fixed in a hurry, without a whole lot of trial and error.  If deep geek porn isn't your thing, feel free to close the tab; I don't mind (but keep it in mind if you know anyone who might need it later).

Ultimately, what I was looking for was something that'd let me treat Leandra's filesystem as a single huge hard drive with as few layers of software on top as possible, and as little mucking around with storage allocation as I could get away with.  I wanted something along the lines of what I had when I first started running Linux on her: A formatted hard drive with everything on it, and I didn't have to worry about what partition had how much space and where it was mounted.  The first thing I did was acquire a set of larger hard drives, because if I was going to rebuild Leandra's drive array I may as well do it from scratch and get an upgrade in the bargain.  So, I spent a few months picking up 8 TB SATA hard drives here and there and a really nice external drive array so I could run some experiments without having to crack open Leandra's case before it was time.  At the same time I was researching btrfs so I knew what I was getting into.  For starters, btrfs's RAID-5 and RAID-6 support isn't really stable.  It still has a couple of deal-breaking bugs but RAID-1 has been stable for a long time, and I'm okay with that.

If you're not familiar with RAID I've already mentioned two different variants, RAID-1 and RAID-5.  To explain them in a non-technical manner, RAID-1 is sometimes referred to as drive mirroring.  Basically, the Linux kernel says "Here are two hard drives of the same size.  I'm going to make them exact duplicates of one another down to the very last bit.  Every time something is written to the file system, it'll be written to both drives at once, so if one of them dies there will still be one copy."  RAID-5 is block level striping with parity; what this means is there is a group of hard drives stuck together end-to-end into an array.  Every time something is written to the file system, that file is written in a row across the array, so that every drive has a little piece of it.  If one of the drives blows out the file's parity values in the array can be used to recalculate the missing pieces when the drive is replaced.
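If the parity idea seems like hand-waving, here is a toy sketch of it using XOR on single bytes.  The values and the parity function name are made up for illustration; a real array does this per stripe across whole blocks, and this is not how mdadm or btrfs actually implement it.

```shell
#!/usr/bin/env bash
# Toy illustration of RAID-5 style parity using XOR on single bytes.
# The byte values below are arbitrary examples, not real drive contents.

# XOR together any number of integer "blocks" and print the result as hex.
parity() {
    local p=0 b
    for b in "$@"; do
        p=$(( p ^ b ))
    done
    printf '0x%02X\n' "$p"
}

d1=0xA5; d2=0x3C; d3=0x0F

# The parity block is the XOR of all the data blocks in the stripe.
p=$(parity "$d1" "$d2" "$d3")
echo "parity:       $p"

# Pretend the drive holding d2 died: XOR the survivors with the parity
# block and the missing piece falls out.
echo "recovered d2: $(parity "$d1" "$d3" "$p")"
```

The same function does both jobs because XOR is its own inverse, which is why one drive's worth of parity can cover the loss of any single drive in the stripe.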

btrfs' implementation of RAID-1 is a little different from how it's usually done, and in some ways is a little more efficient.  Essentially, you give btrfs a bunch of hard drives, called a pool, and you tell it how you want to use that pool.  In Leandra's case I specified RAID-1 for both data and filesystem metadata.  Rather than doing a lot of work making two drives into perfect bit-level copies of each other, btrfs basically says, "Okay, I'll just keep two copies of your data in the pool on different devices, so if one drive goes there will be a very good chance that the other copies of the data will be fine."
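Because btrfs only promises "two copies on different devices," the drives in the pool don't have to be the same size.  A rough way to estimate usable space, as a sketch: take the smaller of half the total, or the total minus the largest drive.  The function name is hypothetical and the rule of thumb ignores metadata overhead.

```shell
# Estimate usable space for a two-copy (RAID-1) btrfs pool.
# Arguments: device sizes, all in the same unit (TiB here).
# Rule of thumb: usable = min(total / 2, total - largest device).
raid1_usable() {
    printf '%s\n' "$@" | awk '
        { total += $1; if ($1 > largest) largest = $1 }
        END {
            half = total / 2
            rest = total - largest
            printf "%.2f\n", (half < rest) ? half : rest
        }'
}

# Four 7.28 TiB drives plus two 3.64 TiB drives.
raid1_usable 7.28 7.28 7.28 7.28 3.64 3.64
```

The second term matters when one drive dwarfs the rest: an 8 TB drive paired with a single 2 TB drive can only mirror 2 TB, because every block needs a partner on a different device.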

Before anyone out there asks, the reason I didn't go with ZFS is because it's not a first-class citizen in the Linux kernel due to licensing conflicts.  I do not feel comfortable putting lots of important data on a file system that doesn't get the same kind of care and feeding as the rest of the Linux kernel.  btrfs has been incorporated into the Linux kernel since 2009 and is under constant development.

Installing the btrfs software package was about as straightforward as it gets:

[root@leandra drwho]# pacman -S btrfs-progs

Building the initial btrfs array was pretty easy - I installed the four 8 TB drives in the external array, plugged it into Leandra, and powered it up.  Let's call them /dev/sda, /dev/sdb, /dev/sde, and /dev/sdf (because that's what they look like now; I didn't make notes or screencaps during the process in case it blew up in my face and I had to abort).  They were brand-new drives, right out of the box, so I partitioned each one identically:

[root@leandra drwho]# fdisk -l /dev/sdf
Disk /dev/sdf: 7.28 TiB, 8001563222016 bytes, 15628053168 sectors
Disk model: WDC WD80EMAZ-00W
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 7E9388E2-4428-4DEC-A2BF-D18C7F18804E

Device     Start         End     Sectors  Size Type
/dev/sdf1   2048 15628053134 15628051087  7.3T Linux filesystem

Because this was a learning experience (I've never worked with btrfs before), I first made sure that Leandra's backups were in good condition and restorable should it be necessary.  I started off by making a basic RAID-1 with two of the drives:

# -L btrfs - give the pool the friendly name "btrfs"
# -d raid1 - file data in the array is in a RAID-1
# -m raid1 - file metadata in the array is in a RAID-1, also
[root@leandra drwho]# mkfs.btrfs -L btrfs -d raid1 -m raid1 /dev/sda1 /dev/sdb1

# mount the thing with the friendly name "btrfs" at /btrfs
[root@leandra drwho]# mkdir /btrfs
[root@leandra drwho]# mount LABEL=btrfs /btrfs

Boom.  8 terabytes of disk space.  Now, let's talk about that whole "treat the array the way I used to treat a single hard drive back in the day" thing.  btrfs implements the idea of subvolumes which, without getting into namespaces or anything like that, can be thought of like this: "Make a subdirectory called /btrfs/home.  Give it the friendly name 'home'.  Flip a couple of bits on 'home' so that Linux thinks of that subdirectory as a hard drive named 'home'."  Now I can refer to it with that friendly name and not have to type the GUID for that device (which is this if you're curious: b7be05b1-63ef-4ae9-878c-2be8aa219d62).  Nice for keeping things straight but terrible for typing.  I did this for the four directory trees that aren't on the boot device (/home, /opt, /srv, and /var):

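# subvolumes have to be created inside the mounted pool, so give the full path
[root@leandra drwho]# btrfs subvolume create /btrfs/home
[root@leandra drwho]# btrfs subvolume create /btrfs/opt
[root@leandra drwho]# btrfs subvolume create /btrfs/srv
[root@leandra drwho]# btrfs subvolume create /btrfs/var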

# a lot of btrfs commands won't work unless the pool itself is mounted
[root@leandra drwho]# btrfs subvolume list /btrfs
ID 256 gen 90747 top level 5 path home
ID 259 gen 73133 top level 5 path opt
ID 260 gen 74883 top level 5 path srv
ID 261 gen 90748 top level 5 path var

Just to have something to work with, I copied some data into the array from directories that don't really change very often.

[root@leandra drwho]# cd /
[root@leandra drwho]# rsync -a opt/ /btrfs/opt/
[root@leandra drwho]# rsync -a srv/ /btrfs/srv/

Nice.  Then I added the other two hard drives to the btrfs pool to get a feel for how to grow it:

[root@leandra drwho]# btrfs device add /dev/sde1 /btrfs
[root@leandra drwho]# btrfs device add /dev/sdf1 /btrfs

I'm not sure if I really need to do this, but just in case I decided to manually grow the filesystem so it filled the entire pool before starting a rebalance to distribute the data more evenly:

[root@leandra drwho]# btrfs filesystem resize max /btrfs
[root@leandra drwho]# btrfs filesystem balance /btrfs
[root@leandra drwho]# btrfs filesystem show
Label: 'btrfs'  uuid: b7be05b1-63ef-4ae9-878c-2be8aa219d62
        Total devices 4 FS bytes used 4.14TiB
        devid    1 size 7.28TiB used 2.07TiB path /dev/sdf1
        devid    2 size 7.28TiB used 2.07TiB path /dev/sde1
        devid    3 size 7.28TiB used 2.07TiB path /dev/sdb1
        devid    4 size 7.28TiB used 2.07TiB path /dev/sda1

Rebalancing can take a very long time, so it's not the sort of thing you'd want to do very often.  Best practice is every month or so.  Given that it takes two or three days to finish on Leandra, I might do it every quarter or thereabouts.

To make sure that the subvolumes would mount on boot, I edited the /etc/fstab file to reference them correctly:

# btrfs volume
LABEL=btrfs     /btrfs          btrfs   autodefrag,noatime      0 0

# switch out the old mounts (commented out) for the new ones
#LABEL=home     /home           ext4     rw,noatime,stripe=512,data=ordered     0 2
LABEL=btrfs     /home   btrfs   subvol=home,autodefrag,noatime  0 0

#LABEL=opt      /opt            ext4     rw,noatime,stripe=512,data=ordered     0 4
LABEL=btrfs     /opt    btrfs   subvol=opt,autodefrag,noatime   0 0

#LABEL=srv      /srv            ext4     rw,noatime,stripe=512,data=ordered     0 5
LABEL=btrfs     /srv    btrfs   subvol=srv,autodefrag,noatime   0 0

#LABEL=var      /var            ext4     rw,noatime,stripe=512,data=ordered     0 3
LABEL=btrfs     /var    btrfs   subvol=var,autodefrag,noatime   0 0
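Since every one of those subvolume lines follows the same pattern, they can be generated rather than typed by hand.  This is only a sketch: the function name is hypothetical, and the mount options are simply copied from the entries above.

```shell
# Print one fstab line per subvolume of a labelled btrfs pool.
# $1 is the pool label; the remaining arguments are subvolume names,
# each mounted at /<name>.
fstab_lines() {
    local label=$1; shift
    local sv
    for sv in "$@"; do
        printf 'LABEL=%s\t/%s\tbtrfs\tsubvol=%s,autodefrag,noatime\t0 0\n' \
            "$label" "$sv" "$sv"
    done
}

fstab_lines btrfs home opt srv var
```

Eyeball the output before appending it to /etc/fstab; a typo in that file is a good way to end up in an emergency shell on the next boot.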

Now to synch the rest of the data into the btrfs pool so Leandra could hit the ground running when I booted her back up.  Time to boot down into single-user mode, log in as the root user, and kick off the fairly lengthy task.

# i hate systemd.
[root@leandra drwho]# systemctl isolate rescue.target

[root@leandra drwho]# cd /
[root@leandra drwho]# rsync -a home/ /btrfs/home/
[root@leandra drwho]# rsync -a var/ /btrfs/var/
# this takes a while...

[root@leandra drwho]# shutdown -h now

That was a pretty solid proof-of-concept if I ever saw one.  Now to worry about the hardware.  I mentioned earlier that not having any way to tell what drive was doing what was a problem.  While I was traveling last month I picked up some hot-swap drive bays so I could pop drives out and back in without needing to power Leandra down or break out the toolkit.  I picked up a four-drive unit and a three-drive unit which just filled Leandra's external-facing bays all the way up.  I was going to put all of the hard drives into those bays and leave the last one on the bottom standing empty.  In the event that a drive dies, I can look for the error light, figure out which drive it was, and slap a replacement into the empty bay, and then trigger reconstruction.  This part of the process was pretty much what you'd expect: Power Leandra down, rip out the old hard drives, install the hotswap bays, transfer the drives from the external chassis into the bays, and re-do the cabling.

Much to my surprise, when I booted Leandra back up she automatically detected the btrfs pool, fired it up, mounted the subvolumes in /etc/fstab, and we were back in business.  I let this configuration soak for a week or two and encountered no problems, crashes, or incompatibilities.  Just to see what would happen, I grabbed two of the older 4 TB hard drives I'd removed earlier, repartitioned them to clean them out and popped them into two of the empty hot-swap bays.  The Linux kernel detected the newly installed drives as /dev/sdg and /dev/sdh and all I had to do was add them to the pool, grow the filesystem (which I probably don't need to do) and trigger a rebalance:

[root@leandra drwho]# btrfs device add /dev/sdg1 /btrfs
[root@leandra drwho]# btrfs device add /dev/sdh1 /btrfs
[root@leandra drwho]# btrfs filesystem resize max /btrfs
[root@leandra drwho]# btrfs filesystem balance /btrfs
[root@leandra drwho]# btrfs filesystem df /btrfs 
Data, RAID1: total=4.13TiB, used=4.13TiB
System, RAID1: total=32.00MiB, used=608.00KiB
Metadata, RAID1: total=9.00GiB, used=7.61GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

# hmm... nice, but that doesn't really tell me what I want to know
[root@leandra drwho]# btrfs filesystem show
Label: 'btrfs'  uuid: b7be05b1-63ef-4ae9-878c-2be8aa219d62
        Total devices 6 FS bytes used 4.14TiB
        devid    1 size 7.28TiB used 2.07TiB path /dev/sdf1
        devid    2 size 7.28TiB used 2.07TiB path /dev/sde1
        devid    3 size 7.28TiB used 2.07TiB path /dev/sdb1
        devid    4 size 7.28TiB used 2.07TiB path /dev/sda1
        devid    5 size 3.64TiB used 0.00B path /dev/sdg1
        devid    6 size 3.64TiB used 0.00B path /dev/sdh1
# yay!

Incidentally, when I was figuring out how to physically label the hotswap bays, one of the methods from this discussion thread was what I used to trigger the drive's activity light so I could tell them apart:

# let's look at the devices in the pool
[root@leandra drwho]# btrfs filesystem show
Label: 'btrfs'  uuid: b7be05b1-63ef-4ae9-878c-2be8aa219d62
        Total devices 6 FS bytes used 4.14TiB
        devid    1 size 7.28TiB used 2.07TiB path /dev/sdf1
        devid    2 size 7.28TiB used 2.07TiB path /dev/sde1
        devid    3 size 7.28TiB used 2.07TiB path /dev/sdb1
        devid    4 size 7.28TiB used 2.07TiB path /dev/sda1
        devid    5 size 3.64TiB used 0.00B path /dev/sdg1
        devid    6 size 3.64TiB used 0.00B path /dev/sdh1

# which drive is devid 1?  it maps to /dev/sdf1, so...
[root@leandra drwho]# cat /dev/sdf1 > /dev/null

# the first drive in the array is glowing solid purple, so I guess that's it
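
If you have a lot of bays to label, looking up each path by eye gets old.  Here's a small sketch that pulls the device path for a given devid out of the btrfs filesystem show output; the function name is hypothetical and it assumes the "devid N size ... path /dev/..." layout shown above.

```shell
# Read `btrfs filesystem show` output on stdin and print the device path
# that belongs to the devid given as $1.
devid_to_path() {
    awk -v want="$1" '$1 == "devid" && $2 == want { print $NF }'
}

# Example use (not run here):
#   dev=$(btrfs filesystem show /btrfs | devid_to_path 1)
#   cat "$dev" > /dev/null    # make that drive's activity light blink
```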

UPDATED: Scrubbing is the process of verifying the data in the btrfs pool to make sure it's not corrupted.  This can be a lengthy process so it's probably not a good idea to run it very often.  Best practice is supposed to be once a month or so.  I ran a regular scrub on Leandra, and the process took 4:42:19 (4 hours, 42 minutes, 19 seconds).

[root@leandra drwho]# btrfs scrub start /btrfs

I already talked about btrfs rebalancing earlier, as part of the "grow the pool" process.  It is also recommended that you do this periodically as regular maintenance, using filters so that only the data and metadata blocks that are partially full get cleaned up; that way you don't bog your system down for days at a time.  I didn't time how long a rebalance of blocks that were 20% full took, but certainly less than five seconds.  Other runs were timed like this:

# XX is a usage threshold in percent: only data blocks that are less than XX% full get rebalanced.
# YY is the same kind of threshold for metadata blocks.
# Note that there should be no space in between -[dm] and "usage"
[root@leandra drwho]# time btrfs balance start -dusage=XX -musage=YY /btrfs
Done, had to relocate 2 out of 4286 chunks

real    0m1.109s
user    0m0.000s
sys     0m0.113s

25% full blocks - about 1.11 seconds.  30% full blocks - about 1.02 seconds.  35% full blocks - 10.51 seconds.  40% full blocks - 2.05 seconds.  45% full blocks - 2.08 seconds.  50% full blocks - 1 minute, 4.52 seconds.  These figures will probably change wildly the longer I use Leandra's btrfs pool, but if I keep on top of it it shouldn't get too bad.
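
Stepping through those thresholds by hand is tedious, so here is a sketch of a loop that does it.  The function name and the RUN switch are hypothetical; by default it only prints the commands it would run, so you can sanity-check them first.

```shell
# Walk a list of usage thresholds from low to high, running (or just
# printing) one filtered balance per threshold.
# $1 is the mount point; the remaining arguments are percentages.
# Set RUN=1 in the environment to actually execute the commands.
stepped_balance() {
    local mount=$1; shift
    local pct
    for pct in "$@"; do
        if [ "${RUN:-0}" = 1 ]; then
            btrfs balance start "-dusage=$pct" "-musage=$pct" "$mount" || return 1
        else
            echo "btrfs balance start -dusage=$pct -musage=$pct $mount"
        fi
    done
}

stepped_balance /btrfs 20 25 30 35 40 45 50
```

Starting low means the cheap passes free up space first, which tends to make the later, heavier passes go faster.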

Rebalancing and scrubbing can be done while the system is online and operating normally but they are both I/O intensive operations.  If you want to keep tabs on either process (and for pete's sake, you don't want to do both at the same time, you'll kill the system) these are the commands you'll use:

[root@leandra drwho]# btrfs filesystem balance status /btrfs
Balance on '/btrfs' is running
2 out of about 3 chunks balanced (396 considered),  33% left

[root@leandra drwho]# btrfs scrub status /btrfs
UUID:             b7be05b1-63ef-4ae9-878c-2be8aa219d62
Scrub started:    Sun Oct  6 22:31:10 2019
Status:           aborted
Duration:         0:13:08
Total to scrub:   8.28TiB
Rate:             0.00B/s
Error summary:    no errors found
# the last time I ran a scrub I cancelled it after 13 minutes

If you really need to, you can cancel a running scrub or rebalance.  It won't mess the system up, it'll finish the part it's on and then cleanly terminate.  Here's how you do that:

[root@leandra drwho]# btrfs scrub cancel /btrfs
ERROR: scrub cancel failed on /btrfs: not running
# there's no scrub running, but you get the point

[root@leandra drwho]# btrfs filesystem balance cancel /btrfs
ERROR: balance cancel on '/btrfs' failed: Not in progress
# there's no balance running, either

Speaking of I/O intensive stuff, one of the things I did was modify my backup scripts to detect when a btrfs maintenance operation was happening and gracefully abort if that was the case.  A handy thing about the btrfs scrub/balance status command is that it'll exit with a 0 if a job is not running, and a non-zero value if one is.  The exit value is captured by the $? variable.

# Test for a running btrfs scrub job.
echo "Testing for a running btrfs scrub..."
sudo btrfs scrub status /btrfs > /dev/null
if [ $? -gt 0 ]; then
    echo "A btrfs scrub is running.  Terminating offsite backup."
    exit 1
else
    echo "btrfs scrub not running.  Proceeding."
fi

# Test for a running btrfs balance job.
echo "Testing for a running btrfs balance..."
sudo btrfs balance status /btrfs > /dev/null
if [ $? -gt 0 ]; then
    echo "A btrfs rebalance is running.  Terminating offsite backup."
    exit 1
else
    echo "btrfs rebalance not running.  Proceeding."
fi
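
Exit codes can differ between btrfs-progs versions, so a belt-and-suspenders alternative is to look at what the status command prints.  This is a sketch only: the function name is hypothetical, and the two patterns it matches ("Status: running" for a scrub, "is running" for a balance) are assumptions based on the output format shown earlier, so check them against your own system.

```shell
# Return 0 (true) if the status text on stdin reports a running job.
# Matches a scrub's "Status:   running" line or a balance's
# "Balance on '...' is running" line.
btrfs_job_running() {
    grep -Eq '^Status:[[:space:]]+running|is running'
}

# Example use in a backup script (not run here):
#   if btrfs scrub status /btrfs | btrfs_job_running; then
#       echo "A btrfs scrub is running.  Terminating offsite backup."
#       exit 1
#   fi
```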

Let's get some more in-depth btrfs usage stats:

[root@leandra drwho]# btrfs filesystem usage /btrfs
Overall:
    Device size:                  36.39TiB
    Device allocated:              8.29TiB
    Device unallocated:           28.10TiB
    Device missing:                  0.00B
    Used:                          8.28TiB
    Free (estimated):             14.05TiB      (min: 14.05TiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,RAID1: Size:4.13TiB, Used:4.13TiB
   /dev/sda1       2.07TiB
   /dev/sdb1       2.06TiB
   /dev/sde1       2.07TiB
   /dev/sdf1       2.07TiB

Metadata,RAID1: Size:9.00GiB, Used:7.61GiB
   /dev/sda1       2.00GiB
   /dev/sdb1       7.00GiB
   /dev/sde1       2.00GiB
   /dev/sdf1       7.00GiB

System,RAID1: Size:32.00MiB, Used:608.00KiB
   /dev/sdb1      32.00MiB
   /dev/sde1      32.00MiB

Unallocated:
   /dev/sda1       5.21TiB
   /dev/sdb1       5.21TiB
   /dev/sde1       5.21TiB
   /dev/sdf1       5.21TiB
   /dev/sdg1       3.64TiB
   /dev/sdh1       3.64TiB

Let's see if any of the drives are throwing errors:

[root@leandra drwho]# btrfs device stats --check /btrfs | grep -v ' 0$'
[root@leandra drwho]# 
# nope.

So, what happens when a hard drive dies?  In a situation like this it's a pretty straightforward fix.  Assuming that you have a replacement handy (and if you're building out an array like Leandra's, you probably do) that's as big as the drive that tanked or larger, the process is really simple.  I keep Leandra's drives in hot-swap bays so I don't have to shut her down to switch them out.  First, though, here's what was flooding Leandra's kernel message buffer:

[Wed Mar 11 17:05:58 2020] sd 1:0:0:0: [sdb] tag#21 FAILED Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Wed Mar 11 17:05:58 2020] sd 1:0:0:0: [sdb] tag#21 CDB: Write(16) 8a 00 00 00 00 00 00 00 08 80 00 00 00 08 00 00
[Wed Mar 11 17:05:58 2020] BTRFS warning (device sdg1): lost page write due to IO error on /dev/sdb1
[Wed Mar 11 17:05:58 2020] BTRFS error (device sdg1): bdev /dev/sdb1 errs: wr 6423932, rd 301858, flush 14645, corrupt 0, gen 0
[Wed Mar 11 17:05:59 2020] BTRFS error (device sdg1): error writing primary super block to device 3

Over.  And over.  And over again.  What this means in essence is, Leandra was trying to write data to /dev/sdb1, one of the two places a mirrored copy was supposed to land, but /dev/sdb1 had flatlined so the copy wasn't getting written.  (The "device sdg1" in those messages is just the name the kernel uses to identify the whole pool, not the drive that was failing.)  So, I plugged the replacement drive into the hot-swap bay I keep empty for just such an occasion and partitioned it like the others (regular old fdisk has supported GPT partition tables for huge hard drives for a while now, so I didn't need any other software.)  Once it was in I used btrfs filesystem show to determine which device number corresponded to /dev/sdb1 (scroll back up for that); for the record, it was device 3.  Watching the kernel message buffer when I plugged the new drive in (dmesg -Tw) I saw that it showed up as /dev/sdh1.  So, I followed the official documentation to start the replacement process:

[root@leandra drwho]# btrfs replace start -B -r 3  /dev/sdh1 /btrfs

The command line arguments, broken out:

  • replace - replace a failed drive
  • start - start the process
  • -B - do not do it in the background (i.e., don't drop back to the command line)
  • -r - don't try to read from the failed drive unless there is no other copy of a data block
  • 3 - the failed device is #3
  • /dev/sdh1 - the replacement drive
  • /btrfs - where the btrfs pool is mounted
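
In a crisis it's easy to fat-finger the device number, so here is a sketch that looks it up and prints the replace command for you to review.  The function name is hypothetical, it never executes anything itself, and it assumes the failed drive still shows up by path in the btrfs filesystem show output (a drive that has vanished entirely may be listed as missing instead).

```shell
# Read `btrfs filesystem show` output on stdin and print the matching
# `btrfs replace start` command.  Dry run only: nothing is executed.
# $1 = failed device path, $2 = replacement device, $3 = mount point.
# Exits non-zero if the failed device is not found in the listing.
replace_cmd() {
    local failed=$1 new=$2 mount=$3
    awk -v failed="$failed" -v new="$new" -v mount="$mount" '
        $1 == "devid" && $NF == failed {
            printf "btrfs replace start -B -r %s %s %s\n", $2, new, mount
            found = 1
        }
        END { exit !found }'
}

# Example use (not run here):
#   btrfs filesystem show /btrfs | replace_cmd /dev/sdb1 /dev/sdh1 /btrfs
```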

I didn't time how long the replacement process took; if I had to guess, not more than four hours from start to finish.  The end of the replacement process was anticlimactic at best (which is how maintenance should always be):

[root@leandra drwho]# btrfs replace start -B -r 3  /dev/sdh1 /btrfs
[root@leandra drwho]# btrfs filesystem show
Label: 'btrfs'  uuid: b7be05b1-63ef-4ae9-878c-2be8aa219d62
        Total devices 6 FS bytes used 4.39TiB
        devid    1 size 7.28TiB used 2.20TiB path /dev/sdg1
        devid    2 size 7.28TiB used 2.20TiB path /dev/sdf1
        devid    3 size 7.28TiB used 2.20TiB path /dev/sdh1
        devid    4 size 7.28TiB used 2.20TiB path /dev/sda1
        devid    5 size 3.64TiB used 0.00B path /dev/sdd1
        devid    6 size 3.64TiB used 0.00B path /dev/sdc1
# Hey - /dev/sdh1 is now devid 3!
# And where's /dev/sdb1?  ¯\_(ツ)_/¯

Finally, I yanked out the failed hard drive and got ready to take a sledgehammer to it.  Here's what it looked like in the kernel message buffer:

[Wed Mar 11 18:26:24 2020] ata2: SATA link down (SStatus 0 SControl 300)
[Wed Mar 11 18:26:24 2020] ata2.00: detaching (SCSI 1:0:0:0)
[Wed Mar 11 18:26:24 2020] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
[Wed Mar 11 18:26:24 2020] sd 1:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[Wed Mar 11 18:26:24 2020] sd 1:0:0:0: [sdb] Stopping disk
[Wed Mar 11 18:26:24 2020] sd 1:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK