From: Duncan <1i5t5.duncan@cox.net>
To: gentoo-amd64@lists.gentoo.org
Subject: [gentoo-amd64] btrfs Was: Soliciting new RAID ideas
Date: Wed, 28 May 2014 03:12:05 +0000 (UTC)
Message-ID: <pan$64532$35068efb$291c3acc$915b8eda@cox.net>
In-Reply-To: dfd718f8f9a0570aa19880f354b5dbef@thegeezer.net
thegeezer posted on Wed, 28 May 2014 00:38:03 +0100 as excerpted:
> depending on your budget a pair of large sata drives + mdadm will be
> ideal, if you had lvm already you could simply 'move' then 'enlarge'
> your existing stuff (tm) : i'd like to know how btrfs would do the same
> for anyone who can let me know.
> you have raid6 because you probably know that raid5 is just waiting for
> trouble, so i'd probably start looking at btrfs for your financial data
> to be checksummed.
Given that I'm a regular on the btrfs list as well as running it myself,
I'm likely to know more about it than most. Here's a whirlwind rundown
with a strong emphasis on practical points a lot of people miss (IOW, I'm
skipping a lot of the commonly covered and obvious stuff). Point 6 below
directly answers your move/enlarge question. Meanwhile, points 1, 7 and
8 are critically important, as we see a lot of people on the btrfs list
getting them wrong.
1) Since there's raid5/6 discussion on the thread... Don't use btrfs
raid56 modes at this time, except purely for playing around with trashable
or fully backed up data. The implementation as introduced isn't
code-complete: the operational runtime side works, but recovery from
dropped devices doesn't yet. Thus, in terms of data safety you're
effectively running a slow raid0 with lots of extra overhead, one that
can be considered trash if a device drops. The sole benefit is that
when the raid56 mode recovery code gets merged (and has been tested for
a kernel cycle or two to work out the initial bugs), you'll get what
amounts to a "free" upgrade to the raid5 or raid6 mode you originally
configured. The parity calculations and writes have been happening all
along; they just can't yet be used for actual recovery, because the
code to do so simply isn't there.
2) Btrfs raid0, raid1 and raid10 modes, along with single mode (on one
or multiple devices) and dup mode (on a single device; metadata there
defaults to dup, i.e. two copies, except on ssd, where the default is a
single copy since some ssds dedup anyway), are reasonably mature and
stable, to the same point as btrfs in general, anyway. Which is to say
it's "mostly stable, keep your backups fresh but you're not /too/
likely to have to use them." There are still enough bugs being fixed in
each kernel release, however, that running the latest stable series is
/strongly/ recommended. If you're not, your data is at risk from
known-and-fixed bugs (even if at this point they tend to hit only the
corner-cases).
3) It's worth noting that btrfs treats data and metadata separately --
when you do a mkfs.btrfs, you can configure redundancy modes separately
for each. The single-device default is (as above) dup metadata (except
for ssd) and single data; the multi-device default is raid1 metadata
and single data.
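As a rough sketch of point 3: the data and metadata profiles are picked
at mkfs time with mkfs.btrfs -d (data) and -m (metadata). The Python
helper below only assembles and prints the command line, it doesn't run
anything. The helper name, the profile list and the device names are
made up for illustration.

```python
import shlex

# Profiles the post treats as reasonably mature. raid5/raid6 are left
# out on purpose, per point 1. This list is an illustrative assumption.
SUGGESTED_PROFILES = {"single", "dup", "raid0", "raid1", "raid10"}


def mkfs_btrfs_cmd(devices, data="single", metadata="raid1"):
    """Build (not run) a mkfs.btrfs command with separate profiles."""
    if not devices:
        raise ValueError("need at least one device")
    for profile in (data, metadata):
        if profile not in SUGGESTED_PROFILES:
            raise ValueError(f"profile not in suggested set: {profile}")
    return ["mkfs.btrfs", "-d", data, "-m", metadata, *devices]


if __name__ == "__main__":
    # Placeholder device names: two-device raid1 for data and metadata.
    cmd = mkfs_btrfs_cmd(["/dev/sdX1", "/dev/sdY1"],
                         data="raid1", metadata="raid1")
    print(shlex.join(cmd))
```

Printing rather than executing keeps the sketch harmless: mkfs destroys
whatever is on the target devices, so check the line by hand first.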
4) FWIW, most of my btrfs formatted partitions are dual-device raid1 mode
for both data and metadata, on ssd. (Second backup is reiserfs on
spinning-rust, just in case some Armageddon bug eats all the btrfs at the
same time, working copy and first backup, tho btrfs is stable enough now
that's extremely unlikely, but I didn't consider it so back when I set
things up nearly a year ago now.)
The reason for my raid1 mode choice isn't that of ordinary raid1, it's
specifically due to btrfs' checksumming and data integrity features -- if
one copy fails its checksum, btrfs will, IF IT HAS ANOTHER COPY TO TRY,
check the second copy and if it's good, will use it and rewrite the bad
copy. Btrfs scrub allows checking the entire filesystem for checksum
errors and restoring any errors it finds from good copies where possible.
Obviously, the default single data mode (or raid0) won't have a second
copy to check and rewrite from, while raid1 (and raid10) modes will. So
will dup-mode metadata on a single device, but dup mode is allowed only
for metadata, not data, with one exception: mixed-blockgroup mode, which
mixes data and metadata together. That's the default on filesystems
under 1 GiB but isn't recommended on large filesystems for performance
reasons.
So I wanted a second copy of both data and metadata to take advantage of
btrfs' data integrity and scrub features, and with btrfs raid1 mode, I
get both that and the traditional raid1 device-loss protection as well.
=:^)
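To make the "second copy" argument concrete, here's a toy model in
Python, not btrfs code. Two mirrors hold the same blocks, a checksum is
stored per block, and a read hands back a copy that passes its checksum
while rewriting any copy that doesn't. The class and method names are
invented, zlib.crc32 is only a stand-in for the real checksum, and
checking every copy on each read is a simplification.

```python
import zlib


def checksum(block: bytes) -> int:
    # Stand-in checksum for the toy model (assumption, not btrfs' own).
    return zlib.crc32(block)


class ToyRaid1:
    """Two mirrored copies of each block plus one stored checksum."""

    def __init__(self, blocks):
        self.copies = [list(blocks), list(blocks)]
        self.sums = [checksum(b) for b in blocks]
        self.repaired = 0

    def read(self, idx):
        good = None
        bad_devs = []
        for dev, copy in enumerate(self.copies):
            if checksum(copy[idx]) == self.sums[idx]:
                if good is None:
                    good = copy[idx]
            else:
                bad_devs.append(dev)
        if good is None:
            # No second copy left to fall back on.
            raise OSError(f"block {idx}: no copy passes its checksum")
        for dev in bad_devs:
            self.copies[dev][idx] = good   # rewrite the bad copy
            self.repaired += 1
        return good

    def scrub(self):
        """Read every block so each repairable copy gets rewritten."""
        for idx in range(len(self.sums)):
            self.read(idx)
        return self.repaired


if __name__ == "__main__":
    fs = ToyRaid1([b"alpha", b"beta", b"gamma"])
    fs.copies[0][1] = b"bEta"          # silently corrupt one copy
    print(fs.read(1), fs.copies[0][1], fs.repaired)
```

With a single copy (single or raid0 data) the same corruption would end
in the OSError branch: detected, but with nothing to repair from.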
5) It's worth noting that as of now, btrfs raid1 mode is only two-way-
mirrored, no matter how many devices are configured into the filesystem.
N-way-mirrored is the next feature on the roadmap after the raid56 work
is completed, but given how nearly every btrfs feature has taken much
longer to complete than originally planned, I'm not expecting it until
sometime next year, now.
Which is unfortunate, as my risk vs. cost sweet spot would be 3-way-
mirroring, covering in case *TWO* copies of a block failed checksum. Oh,
well, it's coming, even if it seems at this point like the proverbial
carrot dangling off a stick held in front of the donkey.
6) Btrfs handles moving then enlarging (parallel to LVM) using btrfs
device add/delete, to add or delete a device to/from a filesystem (moving
the content off a to-be-deleted device in the process), plus btrfs
balance, to restripe/convert/rebalance between devices as well as to free
allocated but empty data and metadata chunks back to unallocated.
There's also btrfs filesystem resize, but that's more like the
conventional filesystem resize command, resizing the part of the
filesystem on an individual device (partitioned/virtual or whole
physical device).
So to add a device, you'd btrfs device add, then btrfs balance, with an
optional conversion to a different redundancy mode if desired, to
rebalance the existing data and metadata onto that device. (Without the
rebalance it would be used for new chunks, but existing data and metadata
chunks would stay where they were. I'll omit the "chunk definition"
discussion in the interest of brevity.)
To delete a device, you'd btrfs device delete, which would move all the
data on that device onto other existing devices in the filesystem, after
which it could be removed.
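The add/balance/delete steps above can be sketched as follows. This is
a Python helper that only builds and prints the command lines, under
the assumption of a mounted filesystem at a placeholder mountpoint. The
function names, device names and /mnt/data are all hypothetical, and
the "move" sequence (add the new device, then delete the old one) is
just one plausible reading of the move-then-enlarge case.

```python
import shlex


def add_device_cmds(new_dev, mountpoint, convert=None):
    """Commands to add a device, then rebalance onto it.

    If convert is given (e.g. "raid1"), the balance also converts both
    data and metadata to that profile.
    """
    balance = ["btrfs", "balance", "start"]
    if convert:
        balance += [f"-dconvert={convert}", f"-mconvert={convert}"]
    balance.append(mountpoint)
    return [["btrfs", "device", "add", new_dev, mountpoint], balance]


def delete_device_cmd(old_dev, mountpoint):
    """Command to delete a device; its data is moved off first."""
    return ["btrfs", "device", "delete", old_dev, mountpoint]


def move_cmds(old_dev, new_dev, mountpoint):
    """Add the new device, then delete the old one (which migrates data)."""
    return [["btrfs", "device", "add", new_dev, mountpoint],
            delete_device_cmd(old_dev, mountpoint)]


if __name__ == "__main__":
    for cmd in add_device_cmds("/dev/sdY1", "/mnt/data", convert="raid1"):
        print(shlex.join(cmd))
    for cmd in move_cmds("/dev/sdX1", "/dev/sdZ1", "/mnt/data"):
        print(shlex.join(cmd))
```

Note that the delete only succeeds if the remaining devices have room
(and enough devices) for the configured redundancy mode.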
7) Given the thread, I'd be remiss to omit this one. VM images and other
large "internal-rewrite-pattern" files (large database files, etc) need
special treatment on btrfs, at least currently. As such, btrfs may not
be the greatest solution for Mark (tho it would work fine with special
procedures), given the several VMs he runs. This one unfortunately hits
a lot of people. =:^( But here's a heads-up, so it doesn't have to hit
anyone reading this! =:^)
As a property of the technology, any copy-on-write-based filesystem is
going to find files where various bits of existing data within the file
are repeatedly rewritten (as opposed to new data simply being appended,
think a log file or live-stored audio/video stream) extremely challenging
to deal with. The problem is that unlike ordinary filesystems that
rewrite the data in place such that a file continues to occupy the same
extents as it did before, copy-on-write filesystems will write a changed
block to a different location. While COW does mean atomic updates and
thus more reliability since either the new data or the old data should
exist, never an unpredictable mixture of the two, as a result of the
above rewrite pattern, this type of internally-rewritten file gets
**HEAVILY** fragmented over time.
We've had filefrag reports of several gig files with over 100K extents!
Obviously, this isn't going to be the most efficient file in the world to
access!
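A toy model shows how quickly that happens. The Python sketch below is
not how btrfs allocates space; it simply assumes each COW rewrite of a
block lands at the next free location, and counts an extent as a run of
logically adjacent blocks that are also physically adjacent. All names
and numbers are illustrative.

```python
import random


def count_extents(physical):
    """Count runs of blocks whose physical locations are consecutive."""
    extents = 1
    for prev, cur in zip(physical, physical[1:]):
        if cur != prev + 1:
            extents += 1
    return extents


def simulate(num_blocks, num_rewrites, cow, seed=0):
    """Rewrite random blocks of a file and return its extent count."""
    rng = random.Random(seed)
    physical = list(range(num_blocks))   # fresh file: one extent
    next_free = num_blocks
    for _ in range(num_rewrites):
        block = rng.randrange(num_blocks)
        if cow:
            # Copy-on-write: the changed block goes somewhere new.
            physical[block] = next_free
            next_free += 1
        # Rewrite-in-place: the block keeps its location.
    return count_extents(physical)


if __name__ == "__main__":
    print("in-place:", simulate(10_000, 5_000, cow=False))
    print("cow:     ", simulate(10_000, 5_000, cow=True))
```

The in-place file stays at one extent no matter how many rewrites it
sees, while the COW file picks up new extents with nearly every write.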
For smaller files, up to a couple hundred MiB or perhaps a bit more,
btrfs has the autodefrag mount option, which can help a lot. With this
option enabled, whenever a block of a file is changed and rewritten, thus
written elsewhere, btrfs queues up a rewrite of the entire file to happen
in the background. The rewrite will be done sequentially, thus defragging
the file. This works quite well for firefox's sqlite database files, for
instance, as they're internal-rewrite-pattern, but they're small enough
that autodefrag handles them reasonably nicely.
But this solution doesn't scale so well as the file size increases toward
and past a GiB, particularly for files with a continuous stream of
internal rewrites such as can happen with an operating VM writing to its
virtual storage device. At some point, the stream of writes comes in
faster than the file can be rewritten, and things start to back up!
To deal with this case, there's the NOCOW file attribute, set with chattr
+C. However, to be effective, this attribute must be set when the file
is empty, before it has existing content. The easiest way to do that is
to set the attribute on the directory which will contain the files.
While it doesn't affect the directory itself any, newly created files
within that directory inherit the NOCOW attribute before they have data,
thus allowing it to work without having to worry about it that much. For
existing files, create a new directory, set its NOCOW attribute, and COPY
(don't move, and don't use cp --reflink) the existing files into it.
Once you have your large internal-rewrite-pattern files set NOCOW, btrfs
will rewrite them in-place as an ordinary filesystem would, thus avoiding
the problem.
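The directory procedure above might look something like this. Again
it's a Python sketch that only prints the commands. The paths are
placeholders, the function name is invented, and cp --reflink=never
assumes a GNU coreutils recent enough to accept that value (a plain cp
was a full copy by default on older versions).

```python
import shlex


def nocow_dir_cmds(directory, existing_files=()):
    """Commands to create a NOCOW directory and copy files into it."""
    cmds = [
        ["mkdir", "-p", directory],
        # Set NOCOW on the (still empty) directory; new files inherit it.
        ["chattr", "+C", directory],
    ]
    for path in existing_files:
        # A real copy, not a move or reflink, so the data is rewritten
        # into a file that was NOCOW from the moment it was created.
        cmds.append(["cp", "--reflink=never", path, directory + "/"])
    # Show the directory's attributes so the C flag can be checked.
    cmds.append(["lsattr", "-d", directory])
    return cmds


if __name__ == "__main__":
    for cmd in nocow_dir_cmds("/vm/images-nocow", ["/vm/old/guest.img"]):
        print(shlex.join(cmd))
```

Once the copies are verified, the originals can be removed; the old
files stay COW no matter what attribute is set on them afterward.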
Except for one thing. I haven't mentioned btrfs snapshots yet, since
apart from this caveat that feature is covered well enough elsewhere.
But here's the problem: a snapshot locks the existing file data in place.
As a result, the first write to a block within a file after a snapshot
MUST be COW, even if the file is otherwise set NOCOW.
If only the occasional one-off snapshot is done it's not /too/ bad, as
all the internal file writes between snapshots are NOCOW, it's only the
first one to each file block after a snapshot that must be COW. But many
people and distros are script-automating their snapshots in order to
have rollback capability, and on btrfs, snapshots are (ordinarily) light
enough that people are sometimes configuring a snapshot a minute! If
only a minute's changes can be written to the existing location, then
there's a snapshot and changes must be written to a new location, then
another snapshot and yet another location... Basically the NOCOW we set
on that file isn't doing us any good!
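The effect is easy to see in a toy model. In the Python sketch below
(illustrative only, with made-up names and numbers), a snapshot marks
every block of a NOCOW file as pinned, the first write to a pinned
block has to go to a new location, and later writes to that block are
in-place again until the next snapshot.

```python
import random


def relocations(num_blocks, num_writes, snapshot_every, seed=0):
    """Count writes to a NOCOW file that are forced to be COW.

    snapshot_every is the number of writes between snapshots;
    0 means no snapshots are ever taken.
    """
    rng = random.Random(seed)
    pinned = [False] * num_blocks
    moved = 0
    for n in range(num_writes):
        if snapshot_every and n % snapshot_every == 0:
            # The snapshot locks all existing data in place.
            pinned = [True] * num_blocks
        block = rng.randrange(num_blocks)
        if pinned[block]:
            moved += 1             # first write after a snapshot: COW
            pinned[block] = False  # the new location is ours again
        # Otherwise the write happens in place, as NOCOW intends.
    return moved


if __name__ == "__main__":
    for every in (0, 500, 10, 1):
        print(every, relocations(100, 1000, every))
```

With no snapshots nothing is relocated; with a snapshot before every
write, every single write is relocated and NOCOW buys nothing.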
8) I'm making this a separate point as it's important and a lot of people
get it wrong. NOCOW and snapshots don't mix!
There is, however, a (partial) workaround. Because snapshots stop at
btrfs subvolume boundaries, if you put your large VM images and similar
large internal-rewrite-pattern files (databases, etc) in subvolumes,
making that directory I suggested above a full subvolume, not just a NOCOW
directory, snapshots of the parent subvolume will not include the VM
images subvolume, thus leaving the VM images alone. This solves the
snapshot-broken-NOCOW and thus the fragmentation issue, but it DOES mean
that those VM images must be backed up using more conventional methods
since snapshotting won't work for them.
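A minimal sketch of that workaround, assuming the VM images are to live
at a placeholder path inside an already mounted btrfs. The Python
helper just prints the two commands: create the subvolume, then set
NOCOW on it while it's still empty. The function name and the path are
hypothetical.

```python
import shlex


def vm_subvolume_cmds(path):
    """Commands to create a dedicated NOCOW subvolume for VM images."""
    return [
        # A subvolume, so snapshots of the parent stop at its boundary.
        ["btrfs", "subvolume", "create", path],
        # NOCOW on the empty subvolume; files created inside inherit it.
        ["chattr", "+C", path],
    ]


if __name__ == "__main__":
    for cmd in vm_subvolume_cmds("/vm/images"):
        print(shlex.join(cmd))
```

Existing images still have to be copied in, not moved, for the NOCOW
attribute to take effect on them.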
9) Some other still partially broken bits of btrfs include:
9a) Quotas: Just don't use them on btrfs at this point. Performance
doesn't scale (altho there's a rewrite in progress), and they are buggy.
Additionally, the scaling interaction with snapshots is geometrically
negative, sometimes requiring 64 GiB of RAM or more and coming to a near
standstill at that, for users with enough quota-groups and enough
snapshots. If you need quotas, use a more traditional filesystem with
stable quota support. Hopefully by this time next year...
9b) Snapshot-aware-defrag: This was enabled at one point but simply
didn't scale, when it turned out people were doing things like per-minute
snapshots and thus had thousands and thousands of snapshots. So this has
been disabled for the time being. Btrfs defrag will defrag the working
copy it is run on, but currently doesn't account for snapshots, so data
that was fragmented at snapshot time gets duplicated as it is
defragmented. However, they plan to re-enable the feature once they have
rewritten various bits to scale far better than they do at present.
9c) Send and receive. Btrfs send and receive are a very nice feature
that can make backups far faster, with far less data transferred.
They're great when they work. Unfortunately, there are still various
corner-cases where they don't. (As an example, a recent fix was for the
case where subdir B was nested inside subdir A for the first, full send/
receive, but later, the relationship was reversed, with subdir B made the
parent of subdir A. Until the recent fix, send/receive couldn't handle
that sort of corner-case.) You can go ahead and use it if it's working
for you, as if it finishes without error, the copy should be 100%
reliable. However, have an alternate plan for backups if you suddenly
hit one of those corner-cases and send/receive quits working.
Of course it's worth mentioning that 9b and 9c deal with features that most
filesystems don't have at all, so with the exception of quotas, it's not
like something's broken on btrfs that works on other filesystems.
Instead, these features are (nearly) unique to btrfs, so even if they
come with certain limitations, that's still better than not having the
option of using the feature at all, because it simply doesn't exist on
the other filesystem!
10) Btrfs in general is headed toward stable now, and a lot of people,
including me, have used it for a significant amount of time without
problems, but it's still new enough that you're strongly urged to make
and test your backups, because by not doing so, you're stating by your
actions if not your words, that you simply don't care if some as yet
undiscovered and unfixed bug in the filesystem eats your data.
For similar reasons, altho already mentioned above, run the latest stable
kernel from the latest stable kernel series, at the oldest, and consider
running rc kernels from at least rc2 or so (by which time any real data
eating bugs, in btrfs or elsewhere, should be found and fixed, or at
least published). With anything older, you are literally risking your
data to known and fixed bugs.
As is said, take reasonable care and you're much less likely to be the
statistic!
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman