[gentoo-amd64] btrfs Was: Soliciting new RAID ideas

From: Duncan <1i5t5.duncan@cox.net>
To: gentoo-amd64@lists.gentoo.org
Subject: [gentoo-amd64] btrfs  Was: Soliciting new RAID ideas
Date: Wed, 28 May 2014 03:12:05 +0000 (UTC)
Message-ID: <pan$64532$35068efb$291c3acc$915b8eda@cox.net>
In-Reply-To: dfd718f8f9a0570aa19880f354b5dbef@thegeezer.net

thegeezer posted on Wed, 28 May 2014 00:38:03 +0100 as excerpted:

> depending on your budget a pair of large sata drives + mdadm will be
> ideal, if you had lvm already you could simply 'move' then 'enlarge'
> your existing stuff (tm) : i'd like to know how btrfs would do the same
> for anyone who can let me know.
> you have raid6 because you probably know that raid5 is just waiting for
> trouble, so i'd probably start looking at btrfs for your financial data
> to be checksummed.

Given that I'm a regular on the btrfs list as well as running it myself, 
I'm likely to know more about it than most.  Here's a whirlwind rundown 
with a strong emphasis on practical points a lot of people miss (IOW, I'm 
skipping a lot of the commonly covered and obvious stuff).  Point 6 below 
directly answers your move/enlarge question.  Meanwhile, points 1, 7 and 
8 are critically important, as we see a lot of people on the btrfs list 
getting them wrong.

1) Since there's raid5/6 discussion on the thread... Don't use btrfs 
raid56 modes at this time, except purely for playing around with trashable 
or fully backed up data.  The implementation as introduced isn't code-
complete: the operational runtime side works, but recovery from dropped 
devices, not so much.  Thus, in terms of data safety you're effectively 
running a slow raid0 with lots of extra overhead, one that can be 
considered trash if a device drops.  The sole benefit is that once the 
raid56 recovery code gets merged (and is tested for a kernel cycle or two 
to work out the initial bugs), you'll get what amounts to a "free" upgrade 
to the raid5 or raid6 mode you originally configured.  The filesystem was 
doing the operational parity calculation and writes all along; it just 
couldn't yet use them for actual recovery, as the code simply wasn't there.

2) Btrfs raid0, raid1 and raid10 modes, along with single mode (on a 
single device or multiple devices) and dup mode (two copies on a single 
device; it's the metadata default there, except on ssd, where the default 
is only a single copy since some ssds dedup anyway), are reasonably mature 
and stable.  That is, they're stable to the same point as btrfs in 
general, which is to say it's "mostly stable, keep your backups fresh but 
you're not /too/ likely to have to use them."  There are still enough bugs 
being fixed in each kernel release, however, that running the latest 
stable series is /strongly/ recommended, as your data is at risk from 
known-and-fixed bugs (even if at this point they only tend to hit the 
corner-cases) if you're not doing so.

3) It's worth noting that btrfs treats data and metadata separately -- 
when you do a mkfs.btrfs, you can configure redundancy modes separately 
for each.  The single-device default is (as above) dup metadata (except 
on ssd) and single data; the multi-device default is raid1 metadata and 
single data.
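
To make the separate data/metadata settings concrete, here's a minimal 
Python sketch that only builds (it does not run) a mkfs.btrfs command 
line.  The helper name and the device names are hypothetical; -d and -m 
are the mkfs.btrfs options for the data and metadata profiles.

```python
import shlex

# Profiles named in this post; your btrfs-progs may know others.
PROFILES = {"single", "dup", "raid0", "raid1", "raid10"}

def mkfs_btrfs_cmd(devices, *, data, metadata):
    """Return the argv for mkfs.btrfs with explicit data/metadata profiles."""
    if not devices:
        raise ValueError("need at least one device")
    for profile in (data, metadata):
        if profile not in PROFILES:
            raise ValueError(f"unknown profile: {profile}")
    # -d sets the data profile, -m the metadata profile.
    return ["mkfs.btrfs", "-d", data, "-m", metadata, *devices]

# Hypothetical device names: raid1 for both data and metadata.
cmd = mkfs_btrfs_cmd(["/dev/sdX1", "/dev/sdY1"], data="raid1", metadata="raid1")
print(shlex.join(cmd))
```

Printing the command instead of executing it leaves room to review it 
first, since mkfs destroys whatever is on the target devices.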

4) FWIW, most of my btrfs formatted partitions are dual-device raid1 mode 
for both data and metadata, on ssd.  (My second backup is reiserfs on 
spinning rust, just in case some Armageddon bug eats all the btrfs at the 
same time, working copy and first backup alike.  Btrfs is stable enough 
now that that's extremely unlikely, but I didn't consider it so back when 
I set things up nearly a year ago.)

The reason for my raid1 mode choice isn't that of ordinary raid1, it's 
specifically due to btrfs' checksumming and data integrity features -- if 
one copy fails its checksum, btrfs will, IF IT HAS ANOTHER COPY TO TRY, 
check the second copy and if it's good, will use it and rewrite the bad 
copy.  Btrfs scrub allows checking the entire filesystem for checksum 
errors and restoring any errors it finds from good copies where possible.
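
As a toy illustration of that read-and-repair logic (a sketch of the 
idea, not btrfs's actual code), the following Python keeps each "mirror" 
as an entry in a list.  zlib.crc32 is only a stand-in checksum chosen for 
the example, and read_block is a made-up name.

```python
import zlib

def checksum(block: bytes) -> int:
    # Stand-in checksum; an assumption for illustration only.
    return zlib.crc32(block)

def read_block(copies, expected):
    """Return data from the first copy that verifies, rewriting bad copies.

    copies is a mutable list of bytes, one entry per mirror.
    Raises IOError if no copy matches the expected checksum.
    """
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        raise IOError("checksum mismatch on every copy; nothing to repair from")
    for i, c in enumerate(copies):
        if checksum(c) != expected:
            copies[i] = good  # rewrite the bad copy from the good one
    return good

data = b"ledger entry 42"
csum = checksum(data)

raid1 = [b"ledger entry 4\x00", data]   # first mirror is corrupted
assert read_block(raid1, csum) == data
assert raid1 == [data, data]            # bad copy was repaired

single = [b"ledger entry 4\x00"]        # no second copy to fall back on
try:
    read_block(single, csum)
except IOError as err:
    print("unrecoverable:", err)
```

With only one copy the error is still detected, but there is nothing to 
repair it from.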

Obviously, the default single data mode (or raid0) won't have a second 
copy to check and rewrite from, while raid1 (and raid10) modes will.  So 
will dup-mode metadata on a single device, but dup mode isn't allowed for 
data, only metadata.  The one exception is the mixed-blockgroup mode that 
mixes data and metadata together; that's the default on filesystems under 
1 GiB but isn't recommended on large filesystems for performance reasons.

So I wanted a second copy of both data and metadata to take advantage of 
btrfs' data integrity and scrub features, and with btrfs raid1 mode, I 
get both that and the traditional raid1 device-loss protection as well. 
=:^)

5) It's worth noting that as of now, btrfs raid1 mode is only two-way-
mirrored, no matter how many devices are configured into the filesystem.  
N-way-mirrored is the next feature on the roadmap after the raid56 work 
is completed, but given how nearly every btrfs feature has taken much 
longer to complete than originally planned, I'm not expecting it until 
sometime next year, now.

Which is unfortunate, as my risk vs. cost sweet spot would be 3-way-
mirroring, covering in case *TWO* copies of a block failed checksum.  Oh, 
well, it's coming, even if it seems at this point like the proverbial 
carrot dangling off a stick held in front of the donkey.
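
One practical consequence of two-copies-only is how much space you get 
from more than two devices.  The estimate below is my own idealized 
arithmetic, not something btrfs reports: it assumes every block needs 
exactly two copies on two different devices and ignores chunk granularity 
and metadata overhead.  The function name is hypothetical.

```python
def raid1_usable(device_sizes):
    """Estimate usable space for two-copy mirroring across devices of any size.

    Every block needs two copies on two different devices, so capacity is
    capped both by half the total and by what the remaining devices can
    pair with the largest one.
    """
    if len(device_sizes) < 2:
        raise ValueError("raid1 needs at least two devices")
    total = sum(device_sizes)
    return min(total // 2, total - max(device_sizes))

print(raid1_usable([1000, 1000]))        # 1000
print(raid1_usable([1000, 1000, 1000]))  # 1500, still only two copies
print(raid1_usable([3000, 1000, 1000]))  # 2000, limited by the smaller pair
```

So a third device adds capacity, not a third copy.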

6) Btrfs handles moving then enlarging (parallel to LVM) using btrfs 
device add/delete, to add or delete a device to/from a filesystem (moving 
the content off a to-be-deleted device in the process), plus btrfs 
balance, to restripe/convert/rebalance between devices as well as to free 
allocated but empty data and metadata chunks back to unallocated.  
There's also btrfs filesystem resize, but that's more like the 
conventional filesystem resize command, resizing the part of the 
filesystem on an individual device (partitioned/virtual or whole physical 
device).

So to add a device, you'd btrfs device add, then btrfs balance, with an 
optional conversion to a different redundancy mode if desired, to 
rebalance the existing data and metadata onto that device.  (Without the 
rebalance it would be used for new chunks, but existing data and metadata 
chunks would stay where they were.  I'll omit the "chunk definition" 
discussion in the interest of brevity.)

To delete a device, you'd btrfs device delete, which would move all the 
data on that device onto other existing devices in the filesystem, after 
which it could be removed.
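
Spelled out as command lines, the two procedures look roughly like the 
output of this Python sketch.  It only prints the commands rather than 
running them, and the function names, mountpoint and device names are 
hypothetical.

```python
import shlex

def grow_plan(mountpoint, new_device, convert=None):
    """Commands to add a device and rebalance onto it, optionally converting."""
    balance = ["btrfs", "balance", "start"]
    if convert:
        # Convert both data (-d) and metadata (-m) chunks to the new profile.
        balance += [f"-dconvert={convert}", f"-mconvert={convert}"]
    balance.append(mountpoint)
    return [["btrfs", "device", "add", new_device, mountpoint], balance]

def move_plan(mountpoint, old_device, new_device):
    """Commands to migrate a filesystem from old_device to new_device.

    No balance is needed here: deleting the old device moves its data
    onto the remaining device(s).
    """
    return [
        ["btrfs", "device", "add", new_device, mountpoint],
        ["btrfs", "device", "delete", old_device, mountpoint],
    ]

# Hypothetical names throughout.
print("# move to a bigger device")
for argv in move_plan("/mnt/data", "/dev/sdX1", "/dev/sdY1"):
    print(shlex.join(argv))
print("# add a second device and convert to raid1")
for argv in grow_plan("/mnt/data", "/dev/sdZ1", convert="raid1"):
    print(shlex.join(argv))
```

Both operations work on a mounted filesystem, which is the main 
similarity to the LVM workflow.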

7) Given the thread, I'd be remiss to omit this one.  VM images and other 
large "internal-rewrite-pattern" files (large database files, etc) need 
special treatment on btrfs, at least currently.  As such, btrfs may not 
be the greatest solution for Mark (tho it would work fine with special 
procedures), given the several VMs he runs.  This one unfortunately hits 
a lot of people. =:^(  But here's a heads-up, so it doesn't have to hit 
anyone reading this! =:^)

As a property of the technology, any copy-on-write-based filesystem is 
going to find files where various bits of existing data within the file 
are repeatedly rewritten (as opposed to new data simply being appended, 
think a log file or live-stored audio/video stream) extremely challenging 
to deal with.  The problem is that unlike ordinary filesystems that 
rewrite the data in place such that a file continues to occupy the same 
extents as it did before, copy-on-write filesystems will write a changed 
block to a different location.  While COW does mean atomic updates and 
thus more reliability since either the new data or the old data should 
exist, never an unpredictable mixture of the two, as a result of the 
above rewrite pattern, this type of internally-rewritten file gets 
**HEAVILY** fragmented over time.

We've had filefrag reports of several gig files with over 100K extents!  
Obviously, this isn't going to be the most efficient file in the world to 
access!
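
A toy model shows how quickly this adds up.  The Python below is a 
deliberately crude sketch (one block per write, every COW write landing 
in fresh space at the end of the disk), not a model of btrfs's real 
allocator, and the names are made up for the example.

```python
import random

def count_extents(physical):
    """Count runs of physically contiguous blocks, in logical file order."""
    return 1 + sum(1 for a, b in zip(physical, physical[1:]) if b != a + 1)

def simulate(blocks, rewrites, cow, seed=0):
    rng = random.Random(seed)
    physical = list(range(blocks))   # freshly written file: one extent
    next_free = blocks               # COW writes land in fresh space
    for _ in range(rewrites):
        i = rng.randrange(blocks)
        if cow:
            physical[i] = next_free  # changed block goes elsewhere
            next_free += 1
        # in-place rewrite: the block keeps its location, nothing to do
    return count_extents(physical)

print("in-place:", simulate(10_000, 5_000, cow=False))
print("cow:     ", simulate(10_000, 5_000, cow=True))
```

The in-place file stays at a single extent, while the COW file ends up 
split into thousands of pieces after the same writes.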

For smaller files, up to a couple hundred MiB or perhaps a bit more, 
btrfs has the autodefrag mount option, which can help a lot.  With this 
option enabled, whenever a block of a file is changed and rewritten, thus 
written elsewhere, btrfs queues up a rewrite of the entire file to happen 
in the background.  The rewrite will be done sequentially, thus defragging 
the file.  This works quite well for firefox's sqlite database files, for 
instance, as they're internal-rewrite-pattern, but they're small enough 
that autodefrag handles them reasonably nicely.

But this solution doesn't scale so well as the file size increases toward 
and past a GiB, particularly for files with a continuous stream of 
internal rewrites such as can happen with an operating VM writing to its 
virtual storage device.  At some point, the stream of writes comes in 
faster than the file can be rewritten, and things start to back up!

To deal with this case, there's the NOCOW file attribute, set with chattr 
+C.  However, to be effective, this attribute must be set when the file 
is empty, before it has existing content.  The easiest way to do that is 
to set the attribute on the directory which will contain the files.  
While it doesn't affect the directory itself any, newly created files 
within that directory inherit the NOCOW attribute before they have data, 
thus allowing it to work without having to worry about it that much.  For 
existing files, create a new directory, set its NOCOW attribute, and COPY 
(don't move, and don't use cp --reflink) the existing files into it.
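
For existing files, the steps come out as something like the commands 
this Python sketch prints (it doesn't execute anything).  The paths and 
the helper name are hypothetical, and cp --reflink=never assumes a 
reasonably recent GNU coreutils; on an older cp, a plain cp without any 
--reflink option does an ordinary copy.

```python
import shlex
from pathlib import PurePosixPath

def nocow_migration(files, nocow_dir):
    """Commands to create a NOCOW directory and copy existing files into it."""
    target = PurePosixPath(nocow_dir)
    cmds = [
        ["mkdir", "-p", str(target)],
        ["chattr", "+C", str(target)],   # new files created here inherit NOCOW
    ]
    for f in files:
        # A real copy, not a move or reflink, so the data is written fresh
        # into a file that was NOCOW from the moment it was created.
        dest = target / PurePosixPath(f).name
        cmds.append(["cp", "--reflink=never", f, str(dest)])
    return cmds

# Hypothetical paths.
for argv in nocow_migration(["/var/lib/vm/guest.img"], "/var/lib/vm-nocow"):
    print(shlex.join(argv))
```

Afterward, lsattr on the new files should show the C attribute; if it 
doesn't, the copy didn't go through a NOCOW directory.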

Once you have your large internal-rewrite-pattern files set NOCOW, btrfs 
will rewrite them in-place as an ordinary filesystem would, thus avoiding 
the problem.

Except for one thing.  I haven't mentioned btrfs snapshots yet, as that 
feature, apart from this caveat, is covered well enough elsewhere.  But 
here's the problem.  A snapshot locks the existing file data in place.  
As a result, the first write to a block within a file after a snapshot 
MUST be COW, even if the file is otherwise set NOCOW.

If only the occasional one-off snapshot is done it's not /too/ bad, as 
all the internal file writes between snapshots are NOCOW; it's only the 
first one to each file block after a snapshot that must be COW.  But many 
people and distros are script-automating their snapshots in order to 
have rollback capability, and on btrfs, snapshots are (ordinarily) light 
enough that people are sometimes configuring a snapshot a minute!  If 
only a minute's changes can be written to the existing location, then 
there's a snapshot and changes must be written to a new location, then 
another snapshot and yet another location...  Basically the NOCOW we set 
on that file isn't doing us any good!
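
To put rough numbers on that, here's a small Python model of the rule 
"the first write to each block after a snapshot must be COW".  It's a 
sketch under simplifying assumptions (one block per write, snapshots 
taken after a fixed number of writes, nothing pinned before the first 
snapshot), and the function name is made up.

```python
def forced_cow_writes(writes, snapshot_every):
    """Count writes to a NOCOW file that must still be COW.

    writes is a sequence of block numbers in write order; a snapshot is
    taken after every snapshot_every writes (0 means never).
    """
    shared = set()     # blocks still pinned by the most recent snapshot
    touched = set()    # every block the file has had written so far
    forced = 0
    for n, block in enumerate(writes, start=1):
        if block in shared:
            forced += 1            # first write after a snapshot: must COW
            shared.discard(block)  # the new copy is private again
        touched.add(block)
        if snapshot_every and n % snapshot_every == 0:
            shared = set(touched)  # snapshot pins all existing data
    return forced

pattern = [0, 1, 2, 3] * 25   # 100 rewrites cycling over four blocks
print(forced_cow_writes(pattern, snapshot_every=0))    # 0
print(forced_cow_writes(pattern, snapshot_every=50))   # 4
print(forced_cow_writes(pattern, snapshot_every=4))    # 96
```

With no snapshots nothing is forced to COW; with a snapshot after every 
pass over the file, nearly every write is.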

8) I'm making this a separate point, as it's important and a lot of 
people get it wrong.  NOCOW and snapshots don't mix!

There is, however, a (partial) workaround.  Because snapshots stop at 
btrfs subvolume boundaries, if you put your large VM images and similar 
large internal-rewrite-pattern files (databases, etc) in subvolumes, 
making that directory I suggested above a full subvolume not just a NOCOW 
directory, snapshots of the parent subvolume will not include the VM 
images subvolume, thus leaving the VM images alone.  This solves the 
snapshot-broken-NOCOW and thus the fragmentation issue, but it DOES mean 
that those VM images must be backed up using more conventional methods 
since snapshotting won't work for them.
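
In command form, that layout could be set up along the lines this Python 
sketch prints (nothing is executed).  The paths, the snapshot name and 
the helper names are hypothetical; the snapshot destination is assumed to 
be on the same btrfs filesystem as the source.

```python
import shlex

def vm_subvolume_setup(vm_dir):
    """Commands to make a dedicated NOCOW subvolume for VM images."""
    return [
        ["btrfs", "subvolume", "create", vm_dir],
        ["chattr", "+C", vm_dir],   # files created here inherit NOCOW
    ]

def snapshot_cmd(source, dest):
    """Read-only snapshot of source; nested subvolumes are not included."""
    return ["btrfs", "subvolume", "snapshot", "-r", source, dest]

# Hypothetical layout: /home is a subvolume, VM images live in /home/vm.
plan = vm_subvolume_setup("/home/vm")
plan.append(snapshot_cmd("/home", "/snapshots/home-2014-05-28"))
for argv in plan:
    print(shlex.join(argv))
```

In the resulting snapshot, /home/vm shows up only as an empty directory, 
which is exactly why the images need their own backup arrangement.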

9) Some other still partially broken bits of btrfs include:

9a) Quotas:  Just don't use them on btrfs at this point.  Performance 
doesn't scale (altho there's a rewrite in progress), and they are buggy.  
Additionally, the scaling interaction with snapshots is geometrically 
negative, sometimes requiring 64 GiB of RAM or more and coming to a near 
standstill at that, for users with enough quota-groups and enough 
snapshots.  If you need quotas, use a more traditional filesystem with 
stable quota support.  Hopefully by this time next year...

9b) Snapshot-aware-defrag:  This was enabled at one point but simply 
didn't scale, when it turned out people were doing things like per-minute 
snapshots and thus had thousands and thousands of snapshots.  So this has 
been disabled for the time being.  Btrfs defrag will defrag the working 
copy it is run on, but currently doesn't account for snapshots, so data 
that was fragmented at snapshot time gets duplicated as it is 
defragmented.  However, the developers plan to re-enable the feature once 
they have rewritten various bits to scale far better than they do at 
present.

9c) Send and receive.  Btrfs send and receive are a very nice feature 
that can make backups far faster, with far less data transferred.  
They're great when they work.  Unfortunately, there are still various 
corner-cases where they don't.  (As an example, a recent fix was for the 
case where subdir B was nested inside subdir A for the first, full send/
receive, but later, the relationship was reversed, with subdir B made the 
parent of subdir A.  Until the recent fix, send/receive couldn't handle 
that sort of corner-case.)  You can go ahead and use it if it's working 
for you, as if it finishes without error, the copy should be 100% 
reliable.  However, have an alternate plan for backups if you suddenly 
hit one of those corner-cases and send/receive quits working.
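
One way to wire in that alternate plan is a wrapper that tries 
send/receive first and falls back to something else on failure.  The 
Python below is a hedged sketch, not a tested backup tool: the function 
names and paths are hypothetical, rsync is just one possible fallback, 
and the demonstration at the end uses stand-ins rather than real 
commands.

```python
import subprocess

def run_pipeline(send_argv, recv_argv):
    """Run send | receive; return True only if both sides exit cleanly."""
    send = subprocess.Popen(send_argv, stdout=subprocess.PIPE)
    recv = subprocess.Popen(recv_argv, stdin=send.stdout)
    send.stdout.close()  # let send see a broken pipe if receive exits early
    rc_recv = recv.wait()
    rc_send = send.wait()
    return rc_recv == 0 and rc_send == 0

def backup(snapshot, parent, dest, fallback_argv,
           pipeline=run_pipeline, run=subprocess.call):
    """Try btrfs send/receive; on failure run the fallback command."""
    send = ["btrfs", "send"] + (["-p", parent] if parent else []) + [snapshot]
    if pipeline(send, ["btrfs", "receive", dest]):
        return "send/receive"
    if run(fallback_argv) == 0:
        return "fallback"
    raise RuntimeError("both backup methods failed")

# Dry demonstration: pretend send/receive fails and the fallback succeeds.
method = backup("/snapshots/home-new", "/snapshots/home-old", "/backup",
                ["rsync", "-a", "/snapshots/home-new/", "/backup/home-new/"],
                pipeline=lambda s, r: False, run=lambda argv: 0)
print(method)   # fallback
```

Passing the runner functions in as parameters keeps the logic checkable 
without touching a real filesystem.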

Of course it's worth mentioning that 9b and 9c deal with features that 
most filesystems don't have at all, so with the exception of quotas, it's 
not like something's broken on btrfs that works on other filesystems.  
Instead, these features are (nearly) unique to btrfs, so even if they 
come with certain limitations, that's still better than not having the 
option of using the feature at all, because it simply doesn't exist on 
other filesystems!

10) Btrfs in general is headed toward stable now, and a lot of people, 
including me, have used it for a significant amount of time without 
problems, but it's still new enough that you're strongly urged to make 
and test your backups, because by not doing so, you're stating by your 
actions if not your words, that you simply don't care if some as yet 
undiscovered and unfixed bug in the filesystem eats your data.

For similar reasons, altho already mentioned above, run the latest stable 
kernel from the latest stable kernel series, at the oldest, and consider 
running rc kernels from at least rc2 or so (by which time any real data 
eating bugs, in btrfs or elsewhere, should be found and fixed, or at 
least published).  With anything older, you are literally risking your 
data to known and fixed bugs.

As is said, take reasonable care and you're much less likely to be the 
statistic!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
