From mboxrd@z Thu Jan  1 00:00:00 1970
To: gentoo-amd64@lists.gentoo.org
From: Duncan <1i5t5.duncan@cox.net>
Subject: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value?
Date: Sat, 22 Jun 2013 10:29:59 +0000 (UTC)

Rich Freeman posted on Fri, 21 Jun 2013 11:13:51 -0400 as excerpted:

> On Fri, Jun 21, 2013 at 10:27 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>> Question: Would you use [btrfs] for raid1 yet, as I'm doing?
>> What about as a single-device filesystem?

> If I wanted to use raid1 I might consider using btrfs now.  I think it
> is still a bit risky, but the established use cases have gotten a fair
> bit of testing now.  I'd be more confident in using it with a single
> device.

OK, so we agree on the basic confidence level of various btrfs
features.  I trust my own judgement a bit more now. =:^)

> To migrate today would require finding someplace to dump all the data
> offline and migrate the drives, as there is no in-place way to
> migrate multiple ext3/4 logical volumes on top of mdadm to a single
> btrfs on bare metal.

... Unless you still have enough unpartitioned space available.

What I did a few years ago was buy a 1 TB USB drive that I found at a
good price.  (It was very near the price of half-TB drives at the
time; I figured out later the store must have been shipped a pallet of
the wrong ones for a sale on the half-TB version of the same thing, so
it was a single-store, get-it-while-they're-there-to-get deal.)

That's how I was able to migrate from the raid6 I had back to raid1.
I had to squeeze the data/partitions a bit to get everything to fit,
but it did, and that's how I ended up with 4-way raid1, since it /had/
been a 4-way raid6.  All 300-gig drives at the time, so the TB USB
drive had /plenty/ of room. =:^)
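For anyone wanting to do the same offline shuffle, it looks roughly
like the below.  This is only a sketch with hypothetical device names
and mount points (and ext4 standing in for whatever filesystem you
actually run); adapt partition sizes to your own layout:

  # copy everything to the temporary USB drive, preserving attributes
  rsync -aHAX /mnt/raid6/ /mnt/usb-1tb/

  # tear down the old raid6 and wipe its md metadata
  mdadm --stop /dev/md0
  mdadm --zero-superblock /dev/sd[a-d]1

  # recreate the same partitions as a 4-way raid1
  mdadm --create /dev/md0 --level=1 --raid-devices=4 /dev/sd[a-d]1

  # new filesystem, then copy the data back
  mkfs.ext4 /dev/md0
  mount /dev/md0 /mnt/raid1
  rsync -aHAX /mnt/usb-1tb/ /mnt/raid1/

Naturally, if the root filesystem is among those being migrated, you'd
run all of that booted from the USB copy or a rescue environment.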
> Without replying to anything in particular, both you and Bob have
> mentioned the importance of multiple redundancy.
>
> Obviously risk goes down as redundancy goes up.  If you protect 25
> drives of data with 1 drive of parity then you need 2/26 drives to
> fail to hose 25 drives of data.

Ouch!

> If you protect 1 drive of data with 25 drives of parity (call them
> mirrors or parity or whatever - they're functionally equivalent) then
> you need 25/26 drives to fail to lose 1 drive of data.

Almost correct.  Except that with 25/26 failed, you'd still have 1
working, which with raid1/mirroring would be enough.

(AFAIK that's the difference with parity.  Parity raid is generally
done on a minimum of three devices, two of data plus a third of
parity, and going down to just one device isn't enough; you can lose
only one, or two if you have two-way parity as with raid6.  With
mirroring/raid1 the devices are all essentially identical, so one is
enough to keep going; you'd have to lose 26/26 to be dead in the
water.  But 25/26 dead or 26/26 dead, you better HOPE it never comes
down to where that matters!)

> RAID 1 is actually less effective - if you protect 13 drives of data
> with 13 mirrors you need 2/26 drives to fail to lose 1 drive of data
> (they just have to be the wrong 2).  However, you do need to consider
> that RAID is not the only way to protect data, and I'm not sure that
> multiple-redundancy raid-1 is the most cost-effective strategy.

The first time I read that thru, I read it wrong and was about to
disagree.  Then I realized what you meant... and that it was an
equally valid reading of what you wrote, except...

AFAIK 13 drives of data with 13 mirrors wouldn't (normally) be called
raid1 (unless it's 13 individual raid1s).  An arrangement of that
nature, if configured together, would normally be set up as raid10,
2-way-mirrored and 13-way-striped (or possibly raid0+1, but that's not
recommended for technical reasons having to do with rebuild thruput).
It could also be configured as what mdraid calls linear mode (which
isn't really raid, but happens to be handled by the same md/raid
driver in Linux) across the 13, plus raid1, or, if they're configured
as separate volumes, as 13 individual two-disk raid1s.  Any of those
might be what you meant (and the wording appears to favor 13
individual raid1s).

What I interpreted it as initially was a 13-way raid1, mirrored again
at a second level to 13 additional drives, which would be called
raid11, except that there's no benefit of that over a simple
single-layer 26-way raid1, so the raid11 term is seldom seen, and
that's clearly not what you meant.

Anyway, you're correct if it's just two-way-mirrored.  However, at
that scale, if one was to do only two-way mirroring, one would usually
do either raid10 for the 13-way striping, or 13 separate raid1s, which
would give one the opportunity to make some of them 3-way-mirrored (or
more) raid1s for the really vital data, leaving the less vital data as
simple 2-way-mirror raid1s.

Or raid6, and get loss-of-two tolerance, but as this whole subthread
is discussing, that can be problematic for thruput.  (I've
occasionally seen reference to raid7, which is said to be 3-way-parity
and thus loss-of-three tolerant, but AFAIK there's no support for it
in the kernel, and I wouldn't be surprised if all implementations are
proprietary.  AFAIK, in practice, once that many devices get involved,
it's raid10 with N-way mirroring on the raid1 portion that gets
implemented, or other multi-level raid schemes.)
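Purely to make the naming concrete, here's roughly what those two
arrangements look like in mdadm terms.  A sketch only; the device
names are hypothetical, and mdadm's raid10 picks the number of mirror
copies via --layout:

  # 26 drives, 2 copies of each stripe (raid10 "near-2" layout):
  # 13 drives of usable capacity, and at least one copy of every
  # stripe has to survive
  mdadm --create /dev/md0 --level=10 --layout=n2 --raid-devices=26 \
        /dev/sd[a-z]1

  # versus 13 independent two-disk raid1 pairs, with (say) one pair
  # bumped to a 3-way mirror for the really vital data
  mdadm --create /dev/md1 --level=1 --raid-devices=2 \
        /dev/sda1 /dev/sdb1
  mdadm --create /dev/md2 --level=1 --raid-devices=3 \
        /dev/sdc1 /dev/sdd1 /dev/sde1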
> If I had 2 drives of data to protect and had 4 spare drives to do it
> with, I doubt I'd set up a 3x raid-1/5/10 setup (or whatever you want
> to call it - imho raid "levels" are poorly named as there really is
> just striping and mirroring and adding RS parity and everything else
> is just combinations).  Instead I'd probably set up a
> RAID1/5/10/whatever with single redundancy for faster storage and
> recovery, and an offline backup (compressed and with
> incrementals/etc).  The backup gets you more security and you only
> need it in a very unlikely double-failure.  I'd only invest in
> multiple redundancy in the event that the risk-weighted cost of
> having the node go down exceeds the cost of the extra drives.
> Frankly in that case RAID still isn't the right solution - you need a
> backup node someplace else entirely as hard drives aren't the only
> thing that can break in your server.

So we're talking six drives, two of data and four "spares" to play
with.  Often that's set up as raid10, either two-way-striped and
3-way-mirrored, or 3-way-striped and 2-way-mirrored, depending on
whether the loss-of-two tolerance of 3-way mirroring or the thruput of
3-way striping is considered of higher value.

You're right that at that level you DO need a real backup, and it
should take priority over raid-whatever.

HOWEVER, instead of creating a SINGLE raid across all those drives,
it's possible to partition them up and create multiple raids out of
the partitions, with one set being a backup of the other.  And since
you've already stated that there's only two drives worth of data,
there's certainly room enough amongst the six drives total to do just
that.

This is in fact how I ran my raids, both my raid6 config and my raid1
config, for a number of years, and is in fact how I have my
(raid1-mode) btrfs filesystems set up now on the SSDs.

Effectively I had/have each drive partitioned up into two sets of
partitions, my "working" set and my "backup" set.  Then I md-raided
each partition, at my chosen level, across all devices.  So on each
physical device, partition 5 might be the working rootfs partition,
partition 6 the working home partition... partition 9 the backup
rootfs partition, and partition 10 the backup home partition.  They
might end up being md3 (rootwork), md4 (homework), md7 (rootbak) and
md8 (homebak).  (See the sketch at the end of this section.)

That way, you're protected against physical device death by the
redundancy of the raids, and from fat-fingering or an update gone
wrong by the redundancy of the backup partitions across the same
physical devices.

What's nice about an arrangement such as this is that it gives you
quite a bit more flexibility than you'd have with a single raid, since
it's now possible to decide, "Hmm, I don't think I actually need a
backup of /var/log, so I think I'll only run with one log
partition/raid, instead of the usual working/backup arrangement."

Similarly, "You know, I ultimately don't need backups of the gentoo
tree and overlays, or of the kernel git tree, at all, since as Linus
says, 'Real men upload it to the net and let others be their backup,'
and I can always redownload that from the net, so I think I'll raid0
this partition and not keep any copies at all, since re-downloading's
less trouble than dealing with the backups anyway."

Finally, and possibly critically, it's possible to say, "You know,
what happens if I've just wiped rootbak in order to make a new root
backup, and I have a crash and working-root refuses to boot?  I think
I need a rootbak2, and with the space I saved by doing only one log
partition and by making the sources trees raid0, I have room for it
now, without using any more space than I would, had I had everything
on the same raid."
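In concrete mdadm terms, the working/backup arrangement described
above might look like the minimal sketch below, assuming four drives
sda-sdd partitioned identically (the device and md numbers are just
the hypothetical ones from the example):

  # working set, each partition raided across all four drives
  mdadm --create /dev/md3 --level=1 --raid-devices=4 /dev/sd[a-d]5
  mdadm --create /dev/md4 --level=1 --raid-devices=4 /dev/sd[a-d]6

  # backup set: same raid level, different partitions, same drives
  mdadm --create /dev/md7 --level=1 --raid-devices=4 /dev/sd[a-d]9
  mdadm --create /dev/md8 --level=1 --raid-devices=4 /dev/sd[a-d]10

Refreshing rootbak is then just a matter of mounting md7 and copying
the current md3 contents over, from the running system.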
Another nice thing about it, and this is what I would have ended up
doing if I hadn't conveniently found that 1 TB USB drive at such a
good price, is that while the whole thing is partitioned up and in
use, it's very possible to temporarily wipe out the backup partitions,
recreate them as a different raid level or a different filesystem, or
otherwise reorganize that area, then reboot into the new version and
do the same to what were the working copies.

(For the area that was raid0: well, it was raid0 because it's easy to
recreate, so just blow it away and recreate it on the new layout.  And
for the single-raid log without a backup copy, it's simple enough to
point the log elsewhere, or keep it on rootfs long enough to redo that
set of partitions across all physical devices.)

Again, this isn't just theory; it really works, as I've done it to
various degrees at various times, even if I found copying to the
external 1 TB USB drive and booting from it more convenient when I
transferred from raid6 to raid1.

And since I do run ~arch, there have been a number of times I've
needed to boot to rootbak instead of rootworking, including once when
a ~arch portage was hosing symlinks just as a glibc update came along,
thus breaking glibc (!!), once when a bash update broke, and another
time when a glibc update mostly worked but I needed to downgrade, and
the protection built into the glibc ebuild wasn't letting me do it
from my working root.

What's nice about this setup, in regard to booting to rootbak instead
of the usual working root, is that unlike booting to a liveCD/DVD
rescue disk, you have the full working system installed, configured,
and running just as it was when the backup was made.  That makes it
much easier to pick up and run from where you left off, with all the
tools you're used to having and the modes of working you're used to
using, instead of being limited to some artificial rescue environment,
often with limited tools, and in any case set up and configured
differently than your own system, because rootbak IS your own system,
just from a few days/weeks/months ago, whenever it was that you last
did the backup.

Anyway, with the parameters you specified, two drives full of data and
four spare drives (presumably of a similar size), there's a LOT of
flexibility:

* raid10 across four drives (two-mirror, two-stripe) with the other
  two as backup.  This would probably be my choice given the
  two-disks-of-data, six-disks-total constraints (but see below), and
  it appears it might be your choice as well.

* raid6 across four drives (two mirror, two parity) with two as
  backups.  Not a choice I'd likely make, but a choice.

* A working pair of drives plus two sets of backups.  Also not a
  choice I'd likely make.

* raid10 across all six drives, in either 3-mirror/2-stripe or
  3-stripe/2-mirror mode.  I'd probably elect for this with
  3-stripe/2-mirror, for the 3X speed and space, and prioritize a
  separate backup; see the discussion below.

* Two independent 3-disk raid5s.  IMO there are better options for
  most cases, with the possible exception of primarily slow-media
  usage; just which options are better depends on usage and priorities
  tho.

* Or some hybrid combination of these.
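As a concrete illustration of the reorganize-in-place trick mentioned
above, converting the backup set from md/raid1 to a btrfs raid1 might
go something like this.  Again just a sketch, reusing the hypothetical
device names from earlier; in practice you'd double-check which arrays
are which before zeroing anything:

  # retire the backup rootfs raid and wipe its md metadata
  umount /dev/md7
  mdadm --stop /dev/md7
  mdadm --zero-superblock /dev/sd[a-d]9

  # recreate the same partitions as a multi-device btrfs raid1
  mkfs.btrfs -d raid1 -m raid1 \
      /dev/sda9 /dev/sdb9 /dev/sdc9 /dev/sdd9

  # copy the working root over, then later reboot into it and give
  # what was the working set the same treatment
  btrfs device scan
  mount /dev/sda9 /mnt/newroot
  cp -ax / /mnt/newroot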
> This sort of rationale is why I don't like arguments like "RAM is
> cheap" or "HDs are cheap" or whatever.  The fact is that wasting
> money on any component means investing less in some other component
> that could give you more space/performance/whatever-makes-you-happy.
> If you have $1000 that you can afford to blow on extra drives then
> you have $1000 you could blow on RAM, CPU, an extra server, or a
> trip to Disney.  Why not blow it on something useful?

[This gets philosophical.  OK to quit here if uninterested.]

You're right.  "RAM and HDs are cheap"... relative to WHAT?  The
big-screen TV/monitor I WOULD have been replacing my much smaller
monitor with, if I hadn't been spending the money on the "cheap" RAM
and HDs?

Of course, "time is cheap" comes with the same caveats, and can
actually end up being far more dear.  Stress and hassle of
administration, similarly.  And sometimes just a bit of investment in
another "expensive" HD saves you quite a bit of "cheap" time and
stress that's actually more expensive.

"It's all relative"... to one's individual priorities.  Because one
thing's for sure: both money and time are fungible, and if they aren't
spent on one thing, they WILL be on another (even if that "spending"
is savings, for money), and ultimately it's one's individual
priorities that should rank where that spending goes.  And I can't set
your priorities and you can't set mine, so...

But from my observation, a LOT of folks don't realize that and/or
don't take the time necessary to reevaluate their own priorities from
time to time, so they end up spending out of line with their real
priorities, and end up rather unhappy people as a result!

That's one reason why I have a personal policy of deliberately
reevaluating personal priorities from time to time (as well as being
aware of them constantly), and rearranging spending, of money, time
and otherwise, in accordance with those reranked priorities.  I'm
absolutely positive I'm a happier man for doing so! =:^)

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman