From: Duncan <1i5t5.duncan@cox.net>
To: gentoo-amd64@lists.gentoo.org
Subject: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value?
Date: Fri, 21 Jun 2013 14:27:38 +0000 (UTC)
Message-ID: <pan$b657d$d0d9cd1a$724ca571$a36894cc@cox.net>
In-Reply-To: CAGfcS_mmTfUSPXRB557n3=ieg_-xqQMMuRdOUJ5H5vbqhdcXQA@mail.gmail.com
Rich Freeman posted on Fri, 21 Jun 2013 06:28:35 -0400 as excerpted:
> On Fri, Jun 21, 2013 at 3:31 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>> So with 4k block sizes on a 5-device raid6, you'd have 20k stripes, 12k
>> in data across three devices, and 8k of parity across the other two
>> devices.
>
> With mdadm on a 5-device raid6 with 512K chunks you have 1.5M in a
> stripe, not 20k. If you modify one block it needs to read all 1.5M, or
> it needs to read at least the old chunk on the single drive to be
> modified and both old parity chunks (which on such a small array is 3
> disks either way).
I'll admit to not fully understanding chunks/stripes/strides in terms of
actual size, tho I believe you're correct; a chunk is well over the
filesystem block size, and a half-meg is probably right. However, the
original post went with a 4k blocksize, which is pretty standard since
that's the usual memory page size as well and thus makes for a
convenient filesystem blocksize, so that's what I was using as a base
for my numbers. With a 4k blocksize, a 5-device raid6 stripe would be
3*4k=12k of data, plus 2*4k=8k of parity.
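Just to put quick numbers on both views, here's a python sketch of that
stripe arithmetic (purely my own illustration, not anything mdadm
reports, and the function name is made up):

# Rough stripe arithmetic for an N-device raid6: every device
# contributes one chunk per stripe, two of them parity (P and Q).
def raid6_stripe(devices, chunk_bytes):
    data = (devices - 2) * chunk_bytes
    parity = 2 * chunk_bytes
    return data, parity, data + parity

# The 4k filesystem-blocksize view I was using:
print(raid6_stripe(5, 4 * 1024))     # 12k data + 8k parity = 20k stripe

# mdadm's default 512k chunks, the view Rich describes:
print(raid6_stripe(5, 512 * 1024))   # 1.5M data + 1M parity per stripe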
>> Fourth, back to the parity. Remember, raid5/6 has all that parity
>> that it writes out (but basically never reads in normal mode, only
>> when degraded, in order to reconstruct the data from the missing
>> device(s)), but doesn't actually use it for integrity checking.
>
> I wasn't aware of this - I can't believe it isn't even an option either.
> Note to self - start doing weekly scrubs...
Indeed. That's one of the things that frustrated me with mdraid -- all
that data integrity metadata there, but just going to waste in normal
operation, only used for device recovery.
Which itself can be a problem as well, because if there *IS* an
undetected cosmic-ray error or whatever and a device then goes out, you
lose integrity on the replacement device in the rebuild as well (if it
was a data device that dropped out and not parity, anyway), because the
parity was computed against the undetected error and the rebuild will
thus write a bad copy of the data onto the replacement device.
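For what it's worth, kicking off such a check by hand is just a sysfs
write; a minimal python sketch, assuming the array is md0 and you're
root (some distros ship a cron'd script doing essentially the same
thing):

# Start a read-only consistency check ("scrub") on an md array by
# writing to its sync_action node in sysfs.  /dev/md0 is assumed.
from pathlib import Path

SYNC_ACTION = Path("/sys/block/md0/md/sync_action")

def start_check():
    # "check" reads all data and parity and counts mismatches;
    # "repair" would additionally rewrite parity that disagrees.
    SYNC_ACTION.write_text("check\n")

def current_action():
    # Reads back e.g. "idle", "check" or "resync".
    return SYNC_ACTION.read_text().strip()

if __name__ == "__main__":
    start_check()
    print("sync_action:", current_action())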
And it's one of the things that so attracted me to btrfs, too, and why
I was so frustrated to see it could only do single redundancy (two-way
mirroring), with no way to do more. The btrfs sales pitch talks about
how great the data integrity is and the ability to go find a good copy
when the data's bad, but what if the only allowed second copy is bad as
well? OOPS!
But as I said, N-way mirroring is on the btrfs roadmap, it's simply not
there yet.
>> The single down side to raid1 as opposed to raid5/6 is the loss of the
>> extra space made available by the data striping, 3*single-device-space
>> in the case of 5-way raid6 (or 4-way raid5) vs. 1*single-device-space
>> in the case of raid1. Otherwise, no contest, hands down, raid1 over
>> raid6.
>
> This is a HUGE downside. The only downside to raid1 over not having
> raid at all is that your disk space cost doubles. raid5/6 is
> considerably cheaper in that regard. In a 5-disk raid5 the cost of
> redundancy is only 25% more, vs a 100% additional cost for raid1. To
> accomplish the same space as a 5-disk raid5 you'd need 8 disks. Sure,
> read performance would be vastly superior, but if you're going to spend
> $300 more on hard drives and whatever it takes to get so many SATA ports
> on your system you could instead add an extra 32GB of RAM or put your OS
> on a mirrored SSD. I suspect that both of those options on a typical
> workload are going to make a far bigger improvement in performance.
I'd suggest that, with the exception of large database servers where
the object is to be able to cache the entire db in RAM, the SSDs are
likely the better option.
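Rich's space-cost arithmetic above is easy enough to sanity-check; a
quick python sketch of usable capacity per level (my own helper, with
the per-drive size as a parameter):

# Usable capacity for a few raid levels; raid1 taken as two-way mirrors.
def usable_tb(drives, level, size_tb=1.0):
    if level == "raid1":
        return drives / 2 * size_tb     # mirrored pairs
    if level == "raid5":
        return (drives - 1) * size_tb   # one drive's worth of parity
    if level == "raid6":
        return (drives - 2) * size_tb   # two drives' worth of parity
    raise ValueError(level)

print(usable_tb(5, "raid5"))   # 4.0 -- five drives, ~25% overhead
print(usable_tb(8, "raid1"))   # 4.0 -- eight drives to match, 100% overhead
print(usable_tb(5, "raid6"))   # 3.0 -- the 5-device raid6 case above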
FWIW, my general "gentoo/amd64 user" rule of thumb is 1-2 gig base, plus
1-2 gig per core. Certainly that scale can slide either way and it'd
probably slide down for folks not doing system rebuilds in tmpfs, as
gentooers often do, but up or down, unless you put that ram in a battery-
backed ramdisk, 32-gig is a LOT of ram, even for an 8-core.
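Expressed as a trivial python helper (the names are mine, obviously),
that rule of thumb looks like this:

# 1-2 gig base plus 1-2 gig per core, as a (low, high) range in gig.
def suggested_ram_gig(cores, base=(1, 2), per_core=(1, 2)):
    return (base[0] + per_core[0] * cores,
            base[1] + per_core[1] * cores)

print(suggested_ram_gig(6))   # (7, 14): I rounded up to 16 gig
print(suggested_ram_gig(8))   # (9, 18): 32 gig is well past the high end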
FWIW, with my old dual-dual-core (so four cores), 8 gig RAM was nicely
roomy, tho I /did/ sometimes bump the top and thus end up either dumping
cache or swapping. When I upgraded to the 6-core, I used that rule of
thumb and figured ~12 gig, but due to the power-of-two efficiency rule,
I ended up with 16 gig, figuring that was more than I'd use in practice,
but better than limiting it to 8 gig.
I was right. The 16 gig is certainly nice, but in reality, I'm typically
entirely wasting several gigs of it, not even cache filling it up. I
typically run ~1 gig in application memory and several gigs in cache,
with only a few tens of MB in buffer. But while I'll often exceed my old
capacity of 8 gig, it's seldom by much, and 12 gig would handle
everything including cache without dumping at well over the 90th
percentile, probably 97% or thereabouts. Even with parallel make at
both the ebuild and global portage level and with PORTAGE_TMPDIR in
tmpfs, I hit 100% on the cores well before I run out of RAM and start
dumping cache or swapping. The only time that has NOT been the case is
when I deliberately saturate, say with a kernel build and an open-ended
-j so it stacks up several hundred jobs at once.
Meanwhile, the paired SSDs in btrfs raid1 make a HUGE practical
difference, especially in things like the (cold-cache) portage tree (and
overlays) sync, kernel git pull, etc. (In my case actual booting didn't
get a huge boost as I run ntp-client and ntpd at boot, and the ntp-client
time sync takes ~12 seconds, more than the rest of the boot put
together. But cold-cache loading of kde happens faster now -- I
actually uninstalled ksplash and just go from text-console login, to X
black screen, to the kde/plasma desktop, now. But the tree sync and
kernel pull are still
the places I appreciate the SSDs most.)
And notably, because the cold-cache system is so much faster with the
SSDs, I tend to actually shut down instead of suspending now, so I tend
to cache even less and thus use less memory with the SSDs than before.
I /could/ probably do 8-gig RAM now instead of 16, and not miss it.
Even a gig per core, 6 gig, wouldn't be terrible, tho below that things
would start to bottleneck and pinch a bit again, I suspect.
> Which is better really depends on your workload. In my case much of
> my raid space is used by mythtv, or for storage of stuff I only
> occasionally use. In these use cases the performance of the raid5 is
> more than adequate, and I'd rather be able to keep shows around for an
> extra 6 months in HD than have the DVR respond a millisecond faster
> when I hit play. If you really have sustained random access of the
> bulk of your data then a raid1 would make much more sense.
Definitely. For MythTV or similar massive media needs, raid5 will be
fast enough. And I suspect single-device-loss tolerance is a reasonable
risk tradeoff for you too: after all it /is/ just media, so tolerating
the loss of a single device is good, and the risk of losing a second
device before a full rebuild onto a replacement completes is acceptable,
given the cost vs. size tradeoff with the massive size requirements of
video.
But again, the OP seemed to find his speed benchmarks disappointing, to
say the least, and I believe pointing out raid6 as the culprit is
accurate. Which means that, given his stock-trading VMs and their
production-level reliability requirements, I'm guessing raid5/6 really
isn't the ideal match. Massive media, yes, definitely. Massive VMs,
not so much.
>> So several points on btrfs:
>>
>> 1) It's still in heavy development.
>
> That is what is keeping me away. I won't touch it until I can use it
> with raid5, and the first commit containing that hit the kernel weeks
> ago I think (and it has known gaps). Until it is stable I'm sticking
> with my current setup.
Question: Would you use it for raid1 yet, as I'm doing? What about as
a single-device filesystem? And do you consider my reliability
estimates for those cases (almost but not quite stable for
single-device; raid1/raid0/raid10 kind of in the middle, say a year
behind single-device; raid5/6/50/60 about a year behind that) to be
reasonably accurate?
Because if you're waiting until btrfs raid5 is fully stable, that's
likely to be some wait yet -- I'd say a year, likely more given that
everything btrfs has seemed to take longer than people expected. But if
you're simply waiting until btrfs raid5 matures to roughly the point
btrfs raid1 is at now, or maybe even a bit less, but certainly to where
it's feature-complete plus, say, a kernel release to work out a few more
wrinkles, then that's quite possible by year-end.
>> 2) RAID levels work QUITE a bit differently on btrfs. In particular,
>> what btrfs calls raid1 mode (with the same applying to raid10) is
>> simply two-way-mirroring, NO MATTER THE NUMBER OF DEVICES. There's no
>> multi-way mirroring yet available
>
> Odd, for some reason I thought it let you specify arbitrary numbers of
> copies, but looking around I think you're right. It does store two
> copies of metadata regardless of the number of drives unless you
> override this.
Default is single-copy data and dual-copy metadata, regardless of the
number of devices (a single device does DUP metadata, two copies on the
same device, by default), with the exception of SSDs, where the metadata
default is single-copy, since many SSD firmwares dedup identical-copy
data anyway. (Sandforce firmware, with its compression features, is
known to do this. Mine -- I forget the firmware brand ATM, but they're
Corsair Neutron SSDs aimed at the server/workstation market, where
unpredictability isn't considered a feature -- doesn't, as one of its
selling points is stable performance regardless of the data it's fed.)
At least that's the explanation given for the SSD exception.
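If you'd rather be explicit than trust the defaults, mkfs.btrfs takes
-m and -d for the metadata and data profiles; a minimal python wrapper
sketch (the device paths are placeholders, adjust for your own disks):

# Force explicit profiles so the SSD single-metadata default can't
# surprise you.  Remember that btrfs "raid1" means two-way mirroring.
import subprocess

def mkfs_btrfs_raid1(dev_a, dev_b):
    subprocess.run(
        ["mkfs.btrfs", "-m", "raid1", "-d", "raid1", dev_a, dev_b],
        check=True,
    )

# Example with placeholder devices:
# mkfs_btrfs_raid1("/dev/sda2", "/dev/sdb2")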
But the real gotcha is that there's no way to set up N-way (N>2)
redundancy on btrfs raid1/10, and I know for a fact that catches some
admins by nasty surprise, as I've seen it come up on the btrfs list as
well as had my own personal disappointment with it, tho luckily I did my
research and figured that out before I actually installed on btrfs.
I just wish they'd called it 2-way-mirroring instead of raid1, as that
wouldn't be the deceptive labeling that I consider the btrfs raid1
moniker to be at this point, and admins would be far less likely to be
caught unaware when a second device goes haywire, a failure they
/thought/ they'd be covered for. Of course at this point it's all still
development anyway, so no sane admin is going to be lacking backups in
any case, but there are a lot of people flying by the seat of their
pants out there who have NOT done the research, and they show up
frequently on the btrfs list, after it's too late. (Tho certainly there
are fewer of them showing up now than
a year ago, when I first investigated btrfs, I think both due to btrfs
maturing quite a bit since then and to a lot of the original btrfs hype
dying down, which is a good thing considering the number of folks that
were installing it, only to find out once they lost data that it was
still development.)
> However, if one considered raid1 expensive, having multiple layers of
> redundancy is REALLY expensive if you aren't using Reed Solomon and many
> data disks.
Well, that depends on the use case. In your media case, certainly.
However, that's one of the few cases that still gobbles storage space as
fast as the manufacturers up their capacities, and it's likely to
continue doing so for at least a few more years, given that HD is still
coming in, so a lot of the media is still SD, with quad-HD in the wings
as well, now. But once drives hit the half-petabyte range, I suppose
even quad-HD won't be gobbling the space as fast as capacities can be
upgraded, any more. So a half-decade or so, maybe?
Plus of course the sheer bandwidth requirements for quad-HD are
astounding, so at that point either some serious raid0/x0 striping or
SSDs for the speed will be pretty much mandatory anyway, remaining SSD
size limits or not.
> From my standpoint I don't think raid1 is the best use of money in most
> cases, either for performance OR for data security. If you want
> performance the money is probably better spent on other components. If
> you want data security the money is probably better spent on offline
> backups. However, this very-much depends on how the disks will be used
> - there are certainly cases where raid1 is your best option.
I agree when the use is primarily video media. Other than that, a pair
of 2 TB spinning rust drives tends to still go quite a long way, and
tends to be a pretty good cost/risk tradeoff IMO. Throwing in a third 2-
TB drive for three-way raid1 mirroring is often a good idea as well,
where the additional data security is needed, but beyond that, the cost/
benefit balance probably doesn't make a whole lot of sense, agreed.
And offline backups are important too, but with dual 2TB drives, many
people can live with a TB of data and do multiple raid1s, giving
themselves both logically offline backup and physical device redundancy.
And if that means they do backups to the second raid set on the same
physical devices more reliably than they would with an external that they
have to physically look for and/or attach each time (as turned out to be
the case for me), then the pair of 2TB drives is quite a reasonable
investment indeed.
But if you're going for performance, spinning rust raid simply doesn't
cut it at the consumer level any longer. Put at least the commonly used
data on SSD, leaving, say, the media data on spinning rust for the time
being if the budget doesn't work otherwise, as I've actually done here
with my (much smaller than yours) media collection, figuring it's not
worth the cost to put /it/ on SSD just yet.
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman