From: Duncan <1i5t5.duncan@cox.net>
To: gentoo-amd64@lists.gentoo.org
Subject: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value?
Date: Fri, 21 Jun 2013 14:27:38 +0000 (UTC)
User-Agent: Pan/0.140 (Chocolate Salty Balls; GIT 459f52e /usr/src/portage/src/egit-src/pan2)

Rich Freeman posted on Fri, 21 Jun 2013 06:28:35 -0400 as excerpted:

> On Fri, Jun 21, 2013 at 3:31 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>> So with 4k block sizes on a 5-device raid6, you'd have 20k stripes,
>> 12k in data across three devices, and 8k of parity across the other
>> two devices.
>
> With mdadm on a 5-device raid6 with 512K chunks you have 1.5M in a
> stripe, not 20k.  If you modify one block it needs to read all 1.5M,
> or it needs to read at least the old chunk on the single drive to be
> modified and both old parity chunks (which on such a small array is 3
> disks either way).

I'll admit to not fully understanding chunks/stripes/strides in terms
of actual size, tho I believe you're correct: it's well over the
filesystem block size, and a half-meg chunk is probably right.

However, the original post assumed a 4k block size, which is pretty
standard since that's the usual memory page size and therefore a
convenient filesystem block size as well, so that's what I was using as
the base for my numbers.  At a 4k block size, a 5-device raid6 stripe
would be 3*4k=12k of data plus 2*4k=8k of parity.

>> Fourth, back to the parity.  Remember, raid5/6 writes out all that
>> parity (but basically never reads it in normal operation, only when
>> degraded, in order to reconstruct the data from the missing
>> device(s)), yet doesn't actually use it for integrity checking.
>
> I wasn't aware of this - I can't believe it isn't even an option
> either.  Note to self - start doing weekly scrubs...

Indeed.
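For reference, a weekly md scrub just amounts to poking the array's
sysfs interface -- typically a cron job echoing "check" into
/sys/block/mdX/md/sync_action.  Here's a minimal sketch of the same
thing in Python, assuming (hypothetically) the array is /dev/md0 and
the script runs as root; adjust to taste:

#!/usr/bin/env python3
# Minimal md scrub sketch: ask md to verify every stripe, then report
# how many sectors disagreed.  Assumes /dev/md0 and root privileges.
import pathlib
import time

MD = pathlib.Path("/sys/block/md0/md")

def scrub_md():
    # "check" makes md read all members and compare data against
    # parity (raid5/6) or the mirror copies (raid1/10), without
    # rewriting anything.
    (MD / "sync_action").write_text("check\n")

    # sync_action reads back "idle" once the pass has finished.
    while (MD / "sync_action").read_text().strip() != "idle":
        time.sleep(60)

    # mismatch_cnt counts sectors where the copies/parity disagreed;
    # nonzero means it's time to investigate.
    print("mismatch_cnt:", (MD / "mismatch_cnt").read_text().strip())

if __name__ == "__main__":
    scrub_md()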
That's one of the things that frustrated me with mdraid -- all that
data integrity metadata sitting there, but just going to waste in
normal operation, used only for device recovery.  And recovery itself
can be a problem, because if there *IS* an undetected error (cosmic ray
or whatever) and a device then goes out, you'll lose integrity on the
rebuilt device as well (at least if it was a data device that dropped
out and not parity), because the parity no longer agrees with the
corrupted data and will thus reconstruct a bad copy of the data onto
the replacement device.

And it's one of the things which so attracted me to btrfs, too, and why
I was so frustrated to find it can only do single redundancy (two-way
mirroring), with no way to do more.  The btrfs sales pitch talks up
data integrity and the ability to go find a good copy when one copy
checks out bad, but what if the only other copy allowed is bad as well?
OOPS!  But as I said, N-way mirroring is on the btrfs roadmap; it's
simply not there yet.

>> The single down side to raid1 as opposed to raid5/6 is the loss of
>> the extra space made available by the data striping,
>> 3*single-device-space in the case of 5-way raid6 (or 4-way raid5)
>> vs. 1*single-device-space in the case of raid1.  Otherwise, no
>> contest, hands down, raid1 over raid6.
>
> This is a HUGE downside.  The only downside to raid1 over not having
> raid at all is that your disk space cost doubles.  raid5/6 is
> considerably cheaper in that regard.  In a 5-disk raid5 the cost of
> redundancy is only 25% more, vs a 100% additional cost for raid1.  To
> accomplish the same space as a 5-disk raid5 you'd need 8 disks.
> Sure, read performance would be vastly superior, but if you're going
> to spend $300 more on hard drives and whatever it takes to get so
> many SATA ports on your system you could instead add an extra 32GB of
> RAM or put your OS on a mirrored SSD.  I suspect that both of those
> options on a typical workload are going to make a far bigger
> improvement in performance.

I'd suggest that with the exception of large database servers, where
the object is to be able to cache the entire db in RAM, the SSDs are
likely the better option.

FWIW, my general "gentoo/amd64 user" rule of thumb is 1-2 gig base,
plus 1-2 gig per core.  Certainly that scale can slide either way, and
it'd probably slide down for folks not doing system rebuilds in tmpfs
as gentooers often do, but up or down, unless you put that RAM in a
battery-backed ramdisk, 32 gig is a LOT of RAM, even for an 8-core.

FWIW, with my old dual-dual-core (so four cores), 8 gig of RAM was
nicely roomy, tho I /did/ sometimes bump the top and end up either
dropping cache or swapping.  When I upgraded to the 6-core, I used that
rule of thumb and figured ~12 gig, but due to the power-of-two
efficiency rule I ended up with 16 gig, figuring that was more than I'd
use in practice, but better than limiting it to 8 gig.  I was right.
The 16 gig is certainly nice, but in reality I'm typically wasting
several gigs of it entirely, with not even cache filling it up.  I
typically run ~1 gig in application memory and several gigs in cache,
with only a few tens of MB in buffers.  But while I'll often exceed my
old capacity of 8 gig, it's seldom by much, and 12 gig would handle
everything including cache without dropping anything at well over the
90th percentile, probably 97% or thereabouts.
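To put that rule of thumb in concrete terms, here's the trivial
arithmetic as a sketch; the base and per-core figures are just the
hand-wavy 1-2 gig ranges above, nothing measured:

# "1-2 GiB base plus 1-2 GiB per core", expressed as a low/high range.
def ram_estimate_gib(cores):
    low = 1 + 1 * cores    # light use, no tmpfs builds
    high = 2 + 2 * cores   # heavy use, PORTAGE_TMPDIR in tmpfs, etc.
    return low, high

for cores in (4, 6, 8):
    low, high = ram_estimate_gib(cores)
    print("%d cores: roughly %d-%d GiB" % (cores, low, high))

An 8-core lands around 9-18 gig on that scale, which is why 32 gig
looks like overkill for most workloads.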
Even with parallel make at both the ebuild and the global portage level
and with PORTAGE_TMPDIR in tmpfs, I hit 100% on the cores well before I
run out of RAM and start dropping cache or swapping.  The only time
that has NOT been the case is when I deliberately saturate things, say
a kernel build with an open-ended -j so it stacks up several hundred
jobs at once.

Meanwhile, the paired SSDs in btrfs raid1 make a HUGE practical
difference, especially in things like the (cold-cache) portage tree
(and overlays) sync, kernel git pull, etc.  (In my case actual booting
didn't get a huge boost, as I run ntp-client and ntpd at boot and the
ntp-client time sync takes ~12 seconds, more than the rest of the boot
put together.  But cold-cache loading of kde happens faster now -- I
actually uninstalled ksplash and just go text-console login, to X black
screen, to the kde/plasma desktop now.  But the tree sync and kernel
pull are still the places I appreciate the SSDs most.)

And notably, because the cold-cache system is so much faster with the
SSDs, I tend to actually shut down instead of suspending now, so I tend
to cache even less and thus use less memory with the SSDs than before.
I /could/ probably do 8 gig of RAM now instead of 16 and not miss it.
Even a gig per core, 6 gig, wouldn't be terrible, tho below that things
would start to bottleneck and pinch a bit again, I suspect.

> Which is better really depends on your workload.  In my case much of
> my raid space is used by mythtv, or for storage of stuff I only
> occasionally use.  In these use cases the performance of the raid5 is
> more than adequate, and I'd rather be able to keep shows around for
> an extra 6 months in HD than have the DVR respond a millisecond
> faster when I hit play.  If you really have sustained random access
> of the bulk of your data then a raid1 would make much more sense.

Definitely.  For MythTV or similar massive media needs, raid5 will be
fast enough.  And I suspect single-device-loss tolerance is a
reasonable risk tradeoff for you too: after all it /is/ just media, so
tolerating the loss of one device is good, while the risk of losing a
second before a replacement finishes rebuilding is acceptable, given
the cost vs. size tradeoff against the massive size requirements of
video.

But again, the OP seemed to find his speed benchmarks disappointing, to
say the least, and I believe pointing to raid6 as the culprit is
accurate.  Given that he's running stock-trading VMs with
production-level reliability expectations, I'm guessing raid5/6 really
isn't the ideal match.  Massive media, yes, definitely.  Massive VMs,
not so much.

>> So several points on btrfs:
>>
>> 1) It's still in heavy development.
>
> That is what is keeping me away.  I won't touch it until I can use it
> with raid5, and the first commit containing that hit the kernel weeks
> ago I think (and it has known gaps).  Until it is stable I'm sticking
> with my current setup.

Question: Would you use it for raid1 yet, as I'm doing?  What about as
a single-device filesystem?  Do you consider my estimates of
reliability in those cases (almost but not quite stable for
single-device; raid1/raid0/raid10 somewhere in the middle, say a year
behind single-device; and raid5/6/50/60 about a year behind that)
reasonably accurate?

Because if you're waiting until btrfs raid5 is fully stable, that's
likely to be some wait yet -- I'd say a year, likely more, given that
everything btrfs has seemed to take longer than people expected.
But if you're simply waiting until it matures to roughly the point that
btrfs raid1 is at now, or maybe even a bit less -- certainly to where
the feature is complete, plus say a kernel release to work out a few
more wrinkles -- then that's quite possible by year-end.

>> 2) RAID levels work QUITE a bit differently on btrfs.  In
>> particular, what btrfs calls raid1 mode (with the same applying to
>> raid10) is simply two-way-mirroring, NO MATTER THE NUMBER OF
>> DEVICES.  There's no multi-way mirroring yet available
>
> Odd, for some reason I thought it let you specify arbitrary numbers
> of copies, but looking around I think you're right.  It does store
> two copies of metadata regardless of the number of drives unless you
> override this.

The default is single-copy data and dual-copy metadata, regardless of
the number of devices (a single device does DUP metadata by default,
two copies on the same device).  The exception is SSDs, where the
metadata default is single-copy, since many SSD firmwares dedup
identical data anyway.  (Sandforce firmware, with its compression
features, is known to do this.  Mine -- Corsair Neutron SSDs, aimed at
the server/workstation market where unpredictability isn't considered
a feature -- doesn't, since one of its selling points is stable
performance regardless of the data it's fed.)  At least that's the
explanation given for the SSD exception.

But the real gotcha is that there's no way to set up N-way (N>2)
redundancy on btrfs raid1/10, and I know for a fact that catches some
admins by nasty surprise, as I've seen it come up on the btrfs list as
well as had my own personal disappointment with it -- tho luckily I
did my research and figured that out before I actually installed on
btrfs.  I just wish they'd called it 2-way mirroring instead of raid1,
as that wouldn't be the labeling deception I consider the btrfs raid1
moniker to be at this point, and admins would be far less likely to be
caught unaware when a second device goes haywire that they /thought/
they were covered for.

Of course at this point it's all still development anyway, so no sane
admin is going to be lacking backups in any case, but there are a lot
of people flying by the seat of their pants out there who have NOT
done the research, and they show up frequently on the btrfs list,
after it's too late.  (Tho certainly there are fewer of them showing
up now than a year ago, when I first investigated btrfs -- I think
both due to btrfs maturing quite a bit since then and to a lot of the
original btrfs hype dying down, which is a good thing considering the
number of folks that were installing it, only to find out once they
lost data that it was still development.)

> However, if one considered raid1 expensive, having multiple layers
> of redundancy is REALLY expensive if you aren't using Reed-Solomon
> and many data disks.

Well, depending on the use case.  In your media case, certainly.
However, that's one of the few cases that still gobbles storage space
as fast as the manufacturers up their capacities, and it is likely to
continue doing so for at least a few more years, given that HD is
still coming in (so a lot of the media is still SD) and quad-HD is in
the wings as well, now.  But once we hit half-petabyte capacities, I
suppose even quad-HD won't be gobbling the space as fast as they can
upgrade it, any more.  So a half-decade or so, maybe?
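Stepping back to the raid1-is-exactly-two-copies point for a moment, a
back-of-the-envelope sketch of usable space makes the tradeoff
concrete.  It assumes equal-size devices and ignores metadata
overhead, so treat the numbers as illustration only:

# Usable space for N equal devices under a few layouts, ignoring
# metadata overhead.  The point: btrfs "raid1" keeps exactly two
# copies of each extent no matter how many devices are present.
def usable_tb(n_devices, size_tb):
    total = n_devices * size_tb
    return {
        "btrfs raid1 (2 copies)": total / 2.0,
        "true n-way mirror": size_tb,
        "md raid5": (n_devices - 1) * size_tb,
        "md raid6": (n_devices - 2) * size_tb,
    }

for layout, tb in usable_tb(5, 2.0).items():
    print("%-24s ~%.1f TB usable" % (layout, tb))

With five 2 TB devices that's ~5 TB usable for btrfs raid1 versus
~8 TB for raid5 and ~6 TB for raid6, but the btrfs raid1 case is still
only guaranteed to survive a single device loss, since only two copies
of anything ever exist.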
Plus of course the sheer bandwidth requirements for quad-HD are
astounding, so at that point either some serious raid0/x0 striping or
SSDs will be pretty much mandatory for the speed anyway, remaining SSD
size limits or no.

> From my standpoint I don't think raid1 is the best use of money in
> most cases, either for performance OR for data security.  If you
> want performance the money is probably better spent on other
> components.  If you want data security the money is probably better
> spent on offline backups.  However, this very-much depends on how
> the disks will be used - there are certainly cases where raid1 is
> your best option.

I agree when the use is primarily video media.  Other than that, a
pair of 2 TB spinning-rust drives still tends to go quite a long way,
and tends to be a pretty good cost/risk tradeoff IMO.  Throwing in a
third 2 TB drive for three-way raid1 mirroring is often a good idea as
well, where the additional data security is needed, but beyond that
the cost/benefit balance probably doesn't make a whole lot of sense,
agreed.

And offline backups are important too, but with dual 2 TB drives many
people can live with a TB of data and do multiple raid1s, giving
themselves both a logically offline backup and physical device
redundancy.  And if that means they do backups to the second raid set
on the same physical devices more reliably than they would with an
external drive they have to physically dig out and/or attach each time
(as turned out to be the case for me), then the pair of 2 TB drives is
quite a reasonable investment indeed.

But if you're going for performance, spinning-rust raid simply doesn't
cut it at the consumer level any longer.  Put at least the commonly
used data on SSD, leaving say the media data on spinning rust for the
time being if the budget doesn't work otherwise -- as I've actually
done here with my (much smaller than yours) media collection, figuring
it's not worth the cost to put /it/ on SSD just yet.

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman