From: Duncan <1i5t5.duncan@cox.net>
To: gentoo-amd64@lists.gentoo.org
Subject: [gentoo-amd64] btrfs Was: Soliciting new RAID ideas
Date: Wed, 28 May 2014 03:12:05 +0000 (UTC)
References: <20140527223938.GA3701@sgi.com>

thegeezer posted on Wed, 28 May 2014 00:38:03 +0100 as
excerpted:

> depending on your budget a pair of large sata drives + mdadm will be
> ideal, if you had lvm already you could simply 'move' then 'enlarge'
> your existing stuff (tm) : i'd like to know how btrfs would do the same
> for anyone who can let me know.
> you have raid6 because you probably know that raid5 is just waiting for
> trouble, so i'd probably start looking at btrfs for your financial data
> to be checksummed.

Given that I'm a regular on the btrfs list as well as running it myself, I'm likely to know more about it than most. Here's a whirlwind rundown with a strong emphasis on practical points a lot of people miss (IOW, I'm skipping a lot of the commonly covered and obvious stuff). Point 6 below directly answers your move/enlarge question. Meanwhile, points 1, 7 and 8 are critically important, as we see a lot of people on the btrfs list getting them wrong.

1) Since there's raid5/6 discussion on the thread... Don't use btrfs raid56 modes at this time, except purely for playing around with trashable or fully backed up data. The implementation as introduced isn't code-complete, and while the operational runtime side works, recovery from dropped devices, not so much.

Thus, in terms of data safety you're effectively running a slow raid0 with lots of extra overhead, one that can be considered trash if a device drops. The sole benefit is that when the raid56 mode recovery code gets merged (and has been tested for a kernel cycle or two to work out the initial bugs), you'll get what amounts to a "free" upgrade to the raid5 or raid6 mode you originally configured. The operational parity calculation and writes were being done all along; they just couldn't yet be used for actual recovery, as the code simply wasn't there to do so.

2) Btrfs raid0, raid1 and raid10 modes, along with single mode (on single or multiple devices) and dup mode (on a single device, where metadata is by default duplicated -- two copies -- except on ssd, where the default is only a single copy since some ssds dedup anyway), are reasonably mature and stable, to the same point as btrfs in general, anyway, which is to say it's "mostly stable, keep your backups fresh but you're not /too/ likely to have to use them." There are still enough bugs being fixed in each kernel release, however, that running the latest stable series is /strongly/ recommended, as your data is at risk from known-fixed bugs (even if at this point they only tend to hit the corner-cases) if you're not doing so.

3) It's worth noting that btrfs treats data and metadata separately -- when you do a mkfs.btrfs, you can configure redundancy modes separately for each, the single-device default being (as above) dup metadata (except for ssd), single data, and the multi-device default being raid1 metadata, single data.

4) FWIW, most of my btrfs formatted partitions are dual-device raid1 mode for both data and metadata, on ssd. (Second backup is reiserfs on spinning-rust, just in case some Armageddon bug eats all the btrfs at the same time, working copy and first backup. Btrfs is stable enough now that that's extremely unlikely, but I didn't consider it so back when I set things up nearly a year ago now.)

The reason for my raid1 mode choice isn't that of ordinary raid1, it's specifically due to btrfs' checksumming and data integrity features -- if one copy fails its checksum, btrfs will, IF IT HAS ANOTHER COPY TO TRY, check the second copy and if it's good, will use it and rewrite the bad copy. Btrfs scrub allows checking the entire filesystem for checksum errors and repairing any errors it finds from good copies where possible.

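To make the "IF IT HAS ANOTHER COPY TO TRY" part concrete, here's a toy model in Python. It is only a sketch of the idea, not btrfs code: the names checksum() and read_block() are made up for illustration, and zlib's CRC32 is a stand-in for whatever checksum the filesystem really uses.

```python
import zlib


def checksum(data: bytes) -> int:
    # Stand-in checksum for illustration only (assumption: any CRC will do here).
    return zlib.crc32(data)


def read_block(copies: list, expected: int):
    """Return (good_data, repaired_count) for one block.

    `copies` holds every stored copy of the block (one entry for single
    mode, two for raid1/dup). Bad copies are overwritten in place from
    the first copy that passes its checksum.
    """
    good = next((c for c in copies if checksum(c) == expected), None)
    if good is None:
        # Single mode with a bad block, or both mirrors bad: detected, not fixable.
        raise IOError("no copy passes its checksum")
    repaired = 0
    for i, c in enumerate(copies):
        if checksum(c) != expected:
            copies[i] = good
            repaired += 1
    return good, repaired


if __name__ == "__main__":
    block = b"financial data"
    mirrors = [b"financial dat?", block]  # first copy silently corrupted
    print(read_block(mirrors, checksum(block)))
    print(mirrors)
```

With two copies the bad one gets rewritten from the good one; with only one copy the same corruption can be detected but not repaired. A scrub amounts to doing this for every block instead of waiting for a read to trip over it.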
Obviously, the default single data mode (or raid0) won't have a second copy to check and rewrite from, while raid1 (and raid10) modes will. So will dup-mode metadata on a single device, but dup mode isn't allowed for data, only metadata, with one exception: the mixed-blockgroup mode that mixes data and metadata together. That's the default on filesystems under 1 GiB but isn't recommended on large filesystems, for performance reasons.

So I wanted a second copy of both data and metadata to take advantage of btrfs' data integrity and scrub features, and with btrfs raid1 mode, I get both that and the traditional raid1 device-loss protection as well. =:^)

5) It's worth noting that as of now, btrfs raid1 mode is only two-way-mirrored, no matter how many devices are configured into the filesystem. N-way-mirrored is the next feature on the roadmap after the raid56 work is completed, but given how nearly every btrfs feature has taken much longer to complete than originally planned, I'm not expecting it until sometime next year, now.

Which is unfortunate, as my risk vs. cost sweet spot would be 3-way-mirroring, covering the case where *TWO* copies of a block fail checksum. Oh, well, it's coming, even if it seems at this point like the proverbial carrot dangling off a stick held in front of the donkey.

6) Btrfs handles moving then enlarging (parallel to LVM) using btrfs device add/delete, to add or delete a device to/from a filesystem (moving the content off a to-be-deleted device in the process), plus btrfs balance, to restripe/convert/rebalance between devices as well as to free allocated but empty data and metadata chunks back to unallocated. There's also btrfs filesystem resize, but that's more like the conventional filesystem resize command, resizing the part of the filesystem on an individual device (partitioned/virtual or whole physical device).

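As a dry-run sketch of the commands just named, here's a bit of Python that only builds and prints the command lines; it doesn't execute anything. The device names, mountpoint and helper function names are made-up placeholders, so adjust to taste and check the btrfs-progs manpages for your version before running any of it as root.

```python
import shlex


def grow_commands(new_dev, mountpoint, convert=None):
    """Commands to add a device, then rebalance existing chunks onto it."""
    cmds = [["btrfs", "device", "add", new_dev, mountpoint]]
    balance = ["btrfs", "balance", "start"]
    if convert:
        # Optional conversion of data (-d) and metadata (-m) redundancy mode.
        balance += [f"-dconvert={convert}", f"-mconvert={convert}"]
    balance.append(mountpoint)
    cmds.append(balance)
    return cmds


def remove_commands(old_dev, mountpoint):
    """Command to migrate content off a device and drop it from the filesystem."""
    return [["btrfs", "device", "delete", old_dev, mountpoint]]


def resize_command(devid, size, mountpoint):
    """Command to resize the part of the filesystem on one device (by devid)."""
    return ["btrfs", "filesystem", "resize", f"{devid}:{size}", mountpoint]


if __name__ == "__main__":
    # Hypothetical "move then enlarge": add a bigger disk, then drop the old one.
    plan = grow_commands("/dev/sdc", "/mnt/data", convert="raid1")
    plan += remove_commands("/dev/sda", "/mnt/data")
    for argv in plan:
        print(shlex.join(argv))
```

Printing first and running by hand afterward is deliberate: device delete and balance can both take hours on a big filesystem, so it's worth eyeballing the plan before committing to it.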
So to add a device, you'd btrfs device add, then btrfs balance, with an optional conversion to a different redundancy mode if desired, to rebalance the existing data and metadata onto that device. (Without the rebalance it would be used for new chunks, but existing data and metadata chunks would stay where they were. I'll omit the "chunk definition" discussion in the interest of brevity.)

To delete a device, you'd btrfs device delete, which would move all the data on that device onto other existing devices in the filesystem, after which it could be removed.

7) Given the thread, I'd be remiss to omit this one. VM images and other large "internal-rewrite-pattern" files (large database files, etc) need special treatment on btrfs, at least currently. As such, btrfs may not be the greatest solution for Mark (tho it would work fine with special procedures), given the several VMs he runs. This one unfortunately hits a lot of people. =:^( But here's a heads-up, so it doesn't have to hit anyone reading this! =:^)

As a property of the technology, any copy-on-write-based filesystem is going to find files where various bits of existing data within the file are repeatedly rewritten (as opposed to new data simply being appended, think a log file or live-stored audio/video stream) extremely challenging to deal with. The problem is that unlike ordinary filesystems that rewrite the data in place such that a file continues to occupy the same extents as it did before, copy-on-write filesystems will write a changed block to a different location. While COW does mean atomic updates and thus more reliability since either the new data or the old data should exist, never an unpredictable mixture of the two, as a result of the above rewrite pattern, this type of internally-rewritten file gets **HEAVILY** fragmented over time.

We've had filefrag reports of several-gig files with over 100K extents! Obviously, this isn't going to be the most efficient file in the world to access!

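For anyone who wants to see why that happens, here's a deliberately crude Python model. It is a toy under stated assumptions, not how btrfs allocates: the file starts as one contiguous run of blocks, every COW rewrite puts the changed block at the next free location, and the function names are mine.

```python
import random


def count_extents(physical):
    """Count runs of physically contiguous blocks in logical order."""
    return 1 + sum(1 for a, b in zip(physical, physical[1:]) if b != a + 1)


def simulate(num_blocks, rewrites, cow, seed=0):
    """Rewrite random blocks inside a file and return the resulting extent count."""
    rng = random.Random(seed)
    physical = list(range(num_blocks))  # freshly written file: one extent
    next_free = num_blocks              # toy allocator: always append elsewhere
    for _ in range(rewrites):
        i = rng.randrange(num_blocks)
        if cow:
            # Copy-on-write: the changed block lands at a new location.
            physical[i] = next_free
            next_free += 1
        # In-place rewrite: the block keeps its location, nothing to do.
    return count_extents(physical)


if __name__ == "__main__":
    print("in-place:", simulate(1000, 500, cow=False))
    print("cow:     ", simulate(1000, 500, cow=True))
```

The in-place case stays at a single extent no matter how many rewrites you throw at it, while the COW case picks up roughly one or two new extents per rewritten block. Scale that up to a VM image that gets rewritten all day and the filefrag numbers above stop being surprising.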
For smaller files, up to a couple hundred MiB or perhaps a bit more, btrfs has the autodefrag mount option, which can help a lot. With this option enabled, whenever a block of a file is changed and rewritten, thus written elsewhere, btrfs queues up a rewrite of the entire file to happen in the background. The rewrite will be done sequentially, thus defragging the file. This works quite well for firefox's sqlite database files, for instance, as they're internal-rewrite-pattern, but they're small enough that autodefrag handles them reasonably nicely.

But this solution doesn't scale so well as the file size increases toward and past a GiB, particularly for files with a continuous stream of internal rewrites such as can happen with an operating VM writing to its virtual storage device. At some point, the stream of writes comes in faster than the file can be rewritten, and things start to back up!

To deal with this case, there's the NOCOW file attribute, set with chattr +C. However, to be effective, this attribute must be set when the file is empty, before it has existing content. The easiest way to do that is to set the attribute on the directory which will contain the files. While it doesn't affect the directory itself any, newly created files within that directory inherit the NOCOW attribute before they have data, thus allowing it to work without having to worry about it that much. For existing files, create a new directory, set its NOCOW attribute, and COPY (don't move, and don't use cp --reflink) the existing files into it.

Once you have your large internal-rewrite-pattern files set NOCOW, btrfs will rewrite them in-place as an ordinary filesystem would, thus avoiding the problem.

Except for one thing. I haven't mentioned btrfs snapshots yet as that feature, but for this caveat, is covered well enough elsewhere. But here's the problem. A snapshot locks the existing file data in place.
As a result, the first write to a block within a file after a snapshot MUST be COW, even if the file is otherwise set NOCOW.

If only the occasional one-off snapshot is done it's not /too/ bad, as all the internal file writes between snapshots are NOCOW, it's only the first one to each file block after a snapshot that must be COW. But many people and distros are script-automating their snapshots in order to have rollback capability, and on btrfs, snapshots are (ordinarily) light enough that people are sometimes configuring a snapshot a minute! If only a minute's changes can be written to the existing location, then there's a snapshot and changes must be written to a new location, then another snapshot and yet another location... Basically the NOCOW we set on that file isn't doing us any good!

8) So making this a separate point as it's important and a lot of people get it wrong. NOCOW and snapshots don't mix!

There is, however, a (partial) workaround. Because snapshots stop at btrfs subvolume boundaries, if you put your large VM images and similar large internal-rewrite-pattern files (databases, etc) in subvolumes, making that directory I suggested above a full subvolume not just a NOCOW directory, snapshots of the parent subvolume will not include the VM images subvolume, thus leaving the VM images alone. This solves the snapshot-broken-NOCOW and thus the fragmentation issue, but it DOES mean that those VM images must be backed up using more conventional methods since snapshotting won't work for them.

9) Some other still partially broken bits of btrfs include:

9a) Quotas: Just don't use them on btrfs at this point. Performance doesn't scale (altho there's a rewrite in progress), and they are buggy. Additionally, the scaling interaction with snapshots is geometrically negative, sometimes requiring 64 GiB of RAM or more and coming to a near standstill at that, for users with enough quota-groups and enough snapshots.
If you need quotas, use a more traditional filesystem with stable quota support. Hopefully by this time next year...

9b) Snapshot-aware-defrag: This was enabled at one point but simply didn't scale, when it turned out people were doing things like per-minute snapshots and thus had thousands and thousands of snapshots. So this has been disabled for the time being. Btrfs defrag will defrag the working copy it is run on, but currently doesn't account for snapshots, so data that was fragmented at snapshot time gets duplicated as it is defragmented. However, they plan to re-enable the feature once they have rewritten various bits to scale far better than they do at present.

9c) Send and receive. Btrfs send and receive are a very nice feature that can make backups far faster, with far less data transferred. They're great when they work. Unfortunately, there are still various corner-cases where they don't. (As an example, a recent fix was for the case where subdir B was nested inside subdir A for the first, full send/receive, but later, the relationship was reversed, with subdir B made the parent of subdir A. Until the recent fix, send/receive couldn't handle that sort of corner-case.) You can go ahead and use it if it's working for you, as if it finishes without error, the copy should be 100% reliable. However, have an alternate plan for backups if you suddenly hit one of those corner-cases and send/receive quits working.

Of course it's worth mentioning that b and c deal with features that most filesystems don't have at all, so with the exception of quotas, it's not like something's broken on btrfs that works on other filesystems. Instead, these features are (nearly) unique to btrfs, so even if they come with certain limitations, that's still better than not having the option of using the feature at all, because it simply doesn't exist on the other filesystem!

10) Btrfs in general is headed toward stable now, and a lot of people, including me, have used it for a significant amount of time without problems, but it's still new enough that you're strongly urged to make and test your backups, because by not doing so, you're stating by your actions if not your words, that you simply don't care if some as yet undiscovered and unfixed bug in the filesystem eats your data.

For similar reasons, altho already mentioned above, run the latest stable kernel from the latest stable kernel series, at the oldest, and consider running rc kernels from at least rc2 or so (by which time any real data-eating bugs, in btrfs or elsewhere, should be found and fixed, or at least published). Run anything older and you are literally risking your data to known and fixed bugs.

As is said, take reasonable care and you're much less likely to be the statistic!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman