To: gentoo-amd64@lists.gentoo.org
From: Duncan <1i5t5.duncan@cox.net>
Subject: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value?
Date: Fri, 21 Jun 2013 07:31:35 +0000 (UTC)

Mark Knecht posted on Thu, 20 Jun 2013 12:10:04 -0700 as excerpted:

> Does anyone know of info on how the starting sector number might
> impact RAID performance under Gentoo?  The drives are WD-500G RE3
> drives shown here:
>
> http://www.amazon.com/Western-Digital-WD5002ABYS-3-5-inch-Enterprise/dp/B001EMZPD0/ref=cm_cr_pr_product_top
>
> These are NOT 4k sector sized drives.
>
> Specifically I'm a 5-drive RAID6 for about 1.45TB of storage.  My
> benchmarking seems abysmal at around 40MB/S using dd copying large
> files.  It's higher, around 80MB/S if the file being transferred is
> coming from an SSD, but even 80MB/S seems slow to me.  I see a LOT of
> wait time in top.  And my 'large file' copies might not be large
> enough as the machine has 24GB of DRAM and I've only been copying
> 21GB so it's possible some of that is cached.

I /suspect/ that the problem isn't striping, tho that can be a factor, but rather your choice of raid6.  Note that I personally ran md/raid6 here for a while, so I know a bit of what I'm talking about.  I didn't realize the full implications of what I was setting up originally, or I'd not have chosen raid6 in the first place, but live and learn as they say, and that I did.

General rule: raid6 is abysmal for writing, and gets dramatically worse as fragmentation sets in, tho reading is reasonable.
The reason is that in order to properly parity-check and write out less-than-full-stripe writes, the system must effectively read in the existing data and merge it with the new data, then recalculate the parity, before writing the new data AND 100% of the (two-way in raid6) parity.  Further, because raid sits below the filesystem level, it knows nothing about what parts of the filesystem are actually used, and must read and write the FULL data stripe (perhaps minus the new data bit, I'm not sure), including parts that will be empty on a freshly formatted filesystem.

So with 4k chunk sizes on a 5-device raid6, you'd have 20k stripes: 12k of data across three devices, and 8k of parity across the other two devices.  Now you go to write a 1k file, but in order to do so the full 12k of existing data must be read in, even on an empty filesystem, because the RAID doesn't know it's empty!  Then the new data must be merged in and new parity calculated, then the full 20k must be written back out: certainly the 8k of parity, but also likely the full 12k of data even if most of it is simply a rewrite, and almost certainly at least the 4k strip on the device the new data lands on.

As I said, that gets much worse as a filesystem ages, since fragmentation means writes are more often to, say, 3 stripe fragments instead of a single whole stripe.  That's what proper stride size, etc, can help with, if the filesystem's reasonably fragmentation resistant, but even then filesystem aging certainly won't /help/.

Reads, meanwhile, are reasonable speed (in normal non-degraded mode), because on a raid6 the data is at least two-way striped (that's on a 4-device raid6; your 5-device array would have three-way striped data, the other two devices being parity of course), so you do get moderate striping read bonuses.  Then there's all that parity information written out at every write, but it's not actually used to check the reliability of the data in normal operation, only to reconstruct if a device or two goes missing.

On a well laid out system, I/O to the separate drives at least shouldn't interfere with each other, assuming SATA and a chipset and bus layout that can handle them in parallel -- not /that/ big a feat on today's hardware, at least as long as you're still doing "spinning rust", as the mechanical drive latency is almost certainly the bottleneck there, and at least that can be parallelized to a reasonable degree across the individual drives.
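If it helps to see the arithmetic spelled out, here's a trivial Python sketch of that read-modify-write penalty.  The function name and defaults are mine, purely for illustration, and it follows the simplified "read the whole data stripe, rewrite the whole stripe" model described above; real md/raid6 can sometimes shortcut this by reading only the old data chunk plus the old parity, but the write-amplification picture is similar either way.

# Simplified model of a sub-stripe write on raid6, per the description
# above: read back the whole data stripe, then rewrite data plus both
# parity chunks.  Numbers are the 5-device, 4k-chunk example from the text.

def raid6_small_write(devices=5, chunk_kib=4, write_kib=1):
    """Return (KiB read, KiB written, I/O amplification) for one small write."""
    data_devices = devices - 2                  # raid6: two devices' worth of parity
    data_stripe_kib = data_devices * chunk_kib  # 3 * 4k = 12k of data per stripe
    parity_kib = 2 * chunk_kib                  # 2 * 4k = 8k of parity per stripe
    read_kib = data_stripe_kib                  # existing data read back in first
    written_kib = data_stripe_kib + parity_kib  # full 20k stripe written back out
    return read_kib, written_kib, (read_kib + written_kib) / write_kib

r, w, amp = raid6_small_write()
print(f"1k write: read {r}k, write {w}k, roughly {amp:.0f}x the I/O actually requested")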
What I ultimately came to realize here is that unless the job at hand is nearly 100% read on the raid, and with the caveat that you have enough space, raid1 is almost certainly at least as good if not a better choice.  If you have the devices to support it, you can go for raid10/50/60, and a raid10 across 5 devices is certainly possible with mdraid, but a straight raid6... you're generally better off with an N-way raid1, for a couple of reasons.

First, md/raid1 is surprisingly, even astoundingly, good at multi-task scheduling of reads.  Any time there are multiple read tasks going on (like during boot), raid1 works really well, with the scheduler distributing the tasks among the available devices, thus minimizing seek latency.  Take a 5-device raid1: you can very likely accomplish at least 5, and possibly 6 or even 7, read jobs in say 110-120% of the time it would take to do just the longest one on a single device, almost certainly well before a single device could have done the two longest read jobs.

This also works if a single task is alternating reads of N different files/directories, since the scheduler will again distribute the jobs among the devices.  Say one device's head stays over the directory information while another goes to read the first file, a third reads another file, etc.  The heads stay where they are until they're needed elsewhere, so the more devices you have in the raid1, the more likely it is that another read from the same location still has a head sitting right over it and can simply read the data as the correct portion of the disk spins underneath, instead of first seeking to the correct spot.

It's worth pointing out that in the parallel-read-job case, thanks to this read scheduling, md/raid1 can often beat raid0 performance, despite raid0's technically better single-job thruput numbers.  This was something I learned by experience as well -- it makes sense in hindsight, but I had TOTALLY not calculated for it in my original setup, where I was running raid0 for things like the gentoo ebuild tree and the kernel sources, since I didn't need redundancy for them.  My raid0 performance there was rather disappointing, because portage tree updates, dependency calculation, and the kernel build process don't optimize well for thruput, which is what raid0 provides, but rather for parallel I/O, where raid1 shines, especially for reads.

Second, md/raid1 writes, because they happen in parallel with the bottleneck being the spinning rust, basically occur at the speed of the slowest disk.  So you don't get N-way parallel write speed, just single-disk speed, but it's still *WAY* *WAY* better than raid6, which has to read in the existing data and do the merge before it can write anything back out.  **THAT'S THE RAID6 PERFORMANCE KILLER**, or at least it was for me: it effectively gives you half-device-speed writes, because in too many cases the data must be read in before it can be written.  Raid1 doesn't have that problem -- it doesn't get a write performance multiplier from the N devices, but at least it doesn't get device performance cut in half like raid5/6 does.

Third, the read-scheduling benefits of the first point help, to a lesser extent, with large same-raid1 copies as well.  Consider: the first block must be read by one device, then written to all devices at the new location, the second block similarly, then the third, etc.  But with proper scheduling, an N-way raid1 doing an N-block copy has done only N+1 operations on each device by the end of the copy.  IOW, given the memory to use as a buffer, the reads can be done in parallel, one block from each device, then the writes go out one block at a time to all devices.  So a 5-way raid1 will have done 6 jobs on each of the 5 devices at the end -- 1 read and 5 writes -- to copy 5 blocks.  (In actuality, due to read-ahead, I think it's optimally 64k per device, 16 4k blocks on each, 320k total, but that's well within the usual minimum 2MB drive buffer size, and the drive will probably do that on its own if both read and write caching are on, given scheduling that forces a cache flush only at the end, not multiple times in the middle.  So all the kernel has to do is be sure it's not interfering by forcing untimely flushes, and the drives should optimize on their own.)
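Here's that counting argument as another tiny Python sketch, purely illustrative (the function names and the idealized one-read-per-mirror assumption are mine, not a benchmark of anything):

# Counting model for copying N blocks within an N-way raid1, per the
# paragraph above: reads can fan out across the mirrors, but every
# write has to land on every mirror.

import math

def raid1_copy_ops_per_device(mirrors=5, blocks=5):
    reads = math.ceil(blocks / mirrors)  # ideally one read per device
    writes = blocks                      # every mirror writes every block
    return reads + writes

def single_disk_copy_ops(blocks=5):
    return 2 * blocks                    # one disk reads and writes every block

print("5-way raid1, 5-block copy:", raid1_copy_ops_per_device(), "ops per device")  # 6
print("single disk, 5-block copy:", single_disk_copy_ops(), "ops on the one disk")  # 10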
Fourth, back to the parity.  Remember, raid5/6 has all that parity that it writes out (but basically never reads in normal mode, only when degraded, in order to reconstruct the data from the missing device(s)), yet it doesn't actually use it for integrity checking.  So while raid1 doesn't have the benefit of that parity data, it's not like raid5/6 used it anyway, and an N-way raid1 means even MORE missing-device protection, since you can lose all but one device and keep right on going as if nothing happened.  A 5-way raid1 can lose 4 devices, not just the two devices of a 5-way raid6 or the single device of a raid5.  Yes, there's the loss of parity/integrity data with raid1, BUT RAID5/6 DOESN'T USE THAT DATA FOR INTEGRITY CHECKING ANYWAY, ONLY FOR RECONSTRUCTION IN THE CASE OF DEVICE LOSS!  So the N-way raid1 is far more redundant, since you have N copies of the data, not one copy plus two-way-parity-that's-never-used-except-for-reconstruction.

Fifth, in the event of device loss, a raid1 continues to function at normal speed, because it's simply an N-way copy with a bit of extra metadata to keep track of the number of N ways.  Of course you'll lose the read-parallelization benefit that the missing device provided, and you'll lose its redundancy, but in general performance remains pretty much the same no matter how many ways it's raid1 mirrored.  Contrast that with raid5/6, whose read performance is SEVERELY impacted by device loss, since it must then reconstruct the data using the parity data rather than simply reading it from somewhere else, which is what raid1 does.

The single downside to raid1 as opposed to raid5/6 is the loss of the extra space made available by the data striping: 3*single-device-space in the case of a 5-way raid6 (or a 4-way raid5) vs. 1*single-device-space in the case of raid1.  Otherwise, no contest, hands down, raid1 over raid6.

IOW, you're seeing now exactly why raid6, and to a lesser extent raid5, have such terrible performance (as opposed to reliability) reputations.  Really, unless you simply don't have the space to make it raid1, I **STRONGLY** urge you to try that instead.  I know I was very happily surprised by the results I got, and only then realized what all the negativity I'd seen around raid5/6 had been about, as I really hadn't understood it at all when I was doing my original research.
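To put that space vs. redundancy trade-off in numbers, one more small Python sketch (again just the arithmetic from above, nothing md-specific; the helper name is mine, and capacity is in multiples of one member device's size):

# Usable capacity vs. tolerated device failures for a 5-device array,
# following the raid1 / raid5 / raid6 arithmetic above.

def layout(devices, level):
    if level == "raid1":            # N-way mirror: one device's worth of space
        return 1, devices - 1
    if level == "raid5":            # one device's worth of parity
        return devices - 1, 1
    if level == "raid6":            # two devices' worth of parity
        return devices - 2, 2
    raise ValueError(level)

for level in ("raid1", "raid5", "raid6"):
    usable, failures = layout(5, level)
    print(f"5-device {level}: {usable}x device capacity, survives {failures} device loss(es)")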
Meanwhile, Rich0 already brought up btrfs, which really does promise a better solution to many of these issues than md/raid, in part due to what is arguably a "layering violation", but one that really DOES allow for some serious optimizations in the multiple-drive case: as a filesystem, it DOES know what's real data and what's empty space that isn't worth worrying about, and unlike raid5/6 parity, it really DOES care about data integrity, not just rebuilding in case of device failure.  So several points on btrfs:

1) It's still in heavy development.  The base single-device filesystem case works reasonably well now and is /almost/ stable, tho I'd still urge people to keep good backups, as it's simply not time-tested and settled, and won't be for at least a few more kernels while they're still busy with the other features.  Second-level raid0/raid1/raid10 support is at an intermediate stage: primary development and initial bug testing and fixing are done, but they're still working on bugs that people doing only traditional single-device filesystems simply don't have to worry about.  Third-round raid5/6 support is still very new, introduced as VERY experimental only with 3.9 IIRC, and is currently EXPECTED to eat data in power-loss or crash events, so it's ONLY good for preliminary testing at this point.

Thus, if you're using btrfs at all, keep good backups, and keep current, even -rc (if not live-git) on the kernel, because there really are fixes in every single kernel for very real corner-case problems they're still coming across.  But single-device is /relatively/ stable now, so provided you keep good *TESTED* backups, are willing and able to use them if it comes to it, and keep current on the kernel, go for that.  I'm personally running dual-device raid1 mode across two SSDs, at that second-stage deployment level.  I tried that (but still on spinning rust) a year ago and decided btrfs simply wasn't ready for me yet, so it has come quite a way in the last year.  But raid5/6 mode is still fresh third-tier development, which I'd not consider usable until at LEAST 3.11 and probably 3.12 or later (maybe a year from now, since it's less mature than raid1 was at this point last year, but it should mature a bit faster).

Takeaway: if you don't have a backup you're prepared to use, you shouldn't even be THINKING about btrfs at this point, no matter WHAT type of deployment you're considering.  If you do, you're probably reasonably safe with traditional single-device btrfs, intermediately risky/safe with raid0/1/10, and shouldn't even think about raid5/6 for real deployment yet, period.

2) RAID levels work QUITE a bit differently on btrfs.  In particular, what btrfs calls raid1 mode (and the same applies to raid10) is simply two-way mirroring, NO MATTER THE NUMBER OF DEVICES.  There's no multi-way mirroring available yet, unless you're willing to apply not-yet-mainlined patches.  It's planned, but not yet merged.  The roadmap says it'll happen after raid5/6 are introduced (they have been, but aren't yet really finished, including power-loss recovery, etc), so I'm guessing 3.12 at the earliest, as I think 3.11 is still focused on raid5/6 completion.

3) Btrfs raid1 mode is also used to provide a second source for its data integrity feature, such that if one copy's checksum doesn't verify, it'll try the other one.  Unfortunately #2 means there's only the single fallback to try, but that's better than most filesystems, which either have no data integrity checking at all or, if they have it, no fallback if it fails.

The combination of #2 and #3 was a bitter pill for me a year ago, when I was still running on aging spinning rust and thus didn't trust two-copy-only redundancy.  I really like the data integrity feature, but just a single backup copy was a great disappointment since I didn't trust my old hardware, and unfortunately two-copy-max remains the case for so-called raid1.  (Raid5/6 mode apparently introduces N-way copies or some such, but as I said, it's not complete yet and is EXPECTED to eat data.  N-way mirroring will build on that and is on the horizon, but it has been on the horizon, and not seeming to get much closer, for over a year now...)  Fortunately for me, my budget is in far better shape this year, and with the two new SSDs I purchased, and spinning rust still for backup, I trust my hardware enough now to run the 2-way-only mirroring that btrfs calls raid1 mode.
4) As mentioned in the btrfs intro paragraph above, btrfs, being a filesystem, actually knows what data is real data and what can safely be left untracked and thus unsynced.  Thus the read-data-in-before-writing-it problem will be rather smaller, certainly on freshly formatted disks where most existing data WILL be garbage/zeroes (trimmed if on an SSD, as mkfs.btrfs issues a trim command for the entire filesystem range before it creates the superblocks, etc, so empty space really /is/ zeroed).  Similarly with "slack space" that's not currently used but was used previously, as the filesystem ages: btrfs knows that it can ignore that data too, and thus won't have to read it in to update the parity when writing to a raid5/6 mode btrfs.

5) There are various other nice btrfs features, and a few caveats as well, but with the exception of anything btrfs-raid-related I've totally forgotten about, they're out of scope for this thread, which is after all about raid, so I'll skip discussing them here.

So bottom line, I really recommend md/raid1 for now.  Unless you want to go md/raid10 with three-way mirroring on the raid1 side; AFAIK that's doable with 5 devices, but it's simpler, certainly conceptually simpler, which can make a difference to an admin trying to work with it, with 6.

If the data simply won't fit on the 5-way raid1 and you want to keep at least 2-device-loss protection, consider splitting it up: raid1 with three devices for the first half, then either get a sixth device and do the same with the second half, or go raid1 with two devices and put your less critical data on the second set.  Or do the raid10-with-5-devices thing, but I'll admit that while I've read that it's possible, I don't really conceptually understand it myself and haven't tried it, so I have no personal opinion or experience to offer on that.  But in that case I really would try to scrape up the money for a sixth device if possible, and do raid10 with 3-way redundancy and 2-way striping across the six, simply because it's easier to conceptualize and thus to properly administer.

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman