* [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? @ 2013-06-20 19:10 Mark Knecht 2013-06-20 19:16 ` Volker Armin Hemmann ` (4 more replies) 0 siblings, 5 replies; 46+ messages in thread From: Mark Knecht @ 2013-06-20 19:10 UTC (permalink / raw To: Gentoo AMD64 Hi, Does anyone know of info on how the starting sector number might impact RAID performance under Gentoo? The drives are WD-500G RE3 drives shown here: http://www.amazon.com/Western-Digital-WD5002ABYS-3-5-inch-Enterprise/dp/B001EMZPD0/ref=cm_cr_pr_product_top These are NOT 4k sector sized drives. Specifically I'm running a 5-drive RAID6 for about 1.45TB of storage. My benchmarking seems abysmal at around 40MB/S using dd copying large files. It's higher, around 80MB/S if the file being transferred is coming from an SSD, but even 80MB/S seems slow to me. I see a LOT of wait time in top. And my 'large file' copies might not be large enough as the machine has 24GB of DRAM and I've only been copying 21GB so it's possible some of that is cached. Then I looked again at how I partitioned the drives originally and see the starting sector of partition 3 as 8594775. I started wondering if something like 4K block sizes at the file system level might be getting munged across 16k chunk sizes in the RAID. Maybe the blocks are being torn apart in bad ways for performance? That led me down a bunch of rabbit holes and I haven't found any light yet. Looking for some thoughtful ideas from those more experienced in this area. Cheers, Mark ^ permalink raw reply [flat|nested] 46+ messages in thread
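A quick way to take the page cache out of benchmark numbers like these is to either write more data than fits in RAM with an explicit flush, or to bypass the cache with direct I/O. A minimal sketch, assuming the array is mounted somewhere like /mnt/raid (the path and sizes here are placeholders, not taken from the thread):

  # write test: ~30GB is larger than the 24GB of RAM, and conv=fdatasync
  # makes dd wait for the data to actually reach the disks before reporting
  dd if=/dev/zero of=/mnt/raid/ddtest bs=1M count=30000 conv=fdatasync

  # alternatively, bypass the page cache entirely with direct I/O
  dd if=/dev/zero of=/mnt/raid/ddtest bs=1M count=30000 oflag=direct

  # read test: drop the caches first so the file really comes off the platters
  sync; echo 3 > /proc/sys/vm/drop_caches
  dd if=/mnt/raid/ddtest of=/dev/null bs=1M

Numbers taken this way are comparable between runs and between layouts, which matters for the RAID6-versus-alternatives testing discussed later in the thread.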
* Re: [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? 2013-06-20 19:10 [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? Mark Knecht @ 2013-06-20 19:16 ` Volker Armin Hemmann 2013-06-20 19:28 ` Mark Knecht 2013-06-20 20:45 ` Mark Knecht 2013-06-20 19:27 ` Rich Freeman ` (3 subsequent siblings) 4 siblings, 2 replies; 46+ messages in thread From: Volker Armin Hemmann @ 2013-06-20 19:16 UTC (permalink / raw To: gentoo-amd64 Am 20.06.2013 21:10, schrieb Mark Knecht: > Hi, > Does anyone know of info on how the starting sector number might > impact RAID performance under Gentoo? The drives are WD-500G RE3 > drives shown here: > > http://www.amazon.com/Western-Digital-WD5002ABYS-3-5-inch-Enterprise/dp/B001EMZPD0/ref=cm_cr_pr_product_top > > These are NOT 4k sector sized drives. > > Specifically I'm a 5-drive RAID6 for about 1.45TB of storage. My > benchmarking seems abysmal at around 40MB/S using dd copying large > files. It's higher, around 80MB/S if the file being transferred is > coming from an SSD, but even 80MB/S seems slow to me. I see a LOT of > wait time in top. And my 'large file' copies might not be large enough > as the machine has 24GB of DRAM and I've only been copying 21GB so > it's possible some of that is cached. > > Then I looked again at how I partitioned the drives originally and > see the starting sector of sector 3 as 8594775. I started wondering if > something like 4K block sizes at the file system level might be > getting munged across 16k chunk sizes in the RAID. Maybe the blocks > are being torn apart in bad ways for performance? That led me down a > bunch of rabbit holes and I haven't found any light yet. > > Looking for some thoughtful ideas from those more experienced in this area. > > Cheers, > Mark > > man mkfs.xfs man mkfs.ext4 look for stripe size etc. Have fun. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? 2013-06-20 19:16 ` Volker Armin Hemmann @ 2013-06-20 19:28 ` Mark Knecht 2013-06-20 20:45 ` Mark Knecht 1 sibling, 0 replies; 46+ messages in thread From: Mark Knecht @ 2013-06-20 19:28 UTC (permalink / raw To: Gentoo AMD64 On Thu, Jun 20, 2013 at 12:16 PM, Volker Armin Hemmann <volkerarmin@googlemail.com> wrote: <SNIP> >> Looking for some thoughtful ideas from those more experienced in this area. >> >> Cheers, >> Mark >> >> > > man mkfs.xfs > > man mkfs.ext4 > > look for stripe size etc. > > Have fun. > I am probably mistaken but I thought that stuff was for hardware RAID and that for mdadm type software RAID it was handled by mdadm? I certainly don't remember any of the Linux software RAID pages I've read about setting up RAID suggesting that these options are important, but I'll go look around. Thanks! ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? 2013-06-20 19:16 ` Volker Armin Hemmann 2013-06-20 19:28 ` Mark Knecht @ 2013-06-20 20:45 ` Mark Knecht 2013-06-24 18:47 ` Volker Armin Hemmann 1 sibling, 1 reply; 46+ messages in thread From: Mark Knecht @ 2013-06-20 20:45 UTC (permalink / raw To: Gentoo AMD64 On Thu, Jun 20, 2013 at 12:16 PM, Volker Armin Hemmann <volkerarmin@googlemail.com> wrote: <SNIP> > man mkfs.xfs > > man mkfs.ext4 > > look for stripe size etc. > > Have fun. > Volker, I find way down at the bottom of the RAID setup page that they do say stride & stripe are important for RAID4 & RAID5, but remain non-committal for RAID6. None the less thanks for the idea. Now I guess I have to figure out how to test it in less than 10 weeks. I think I'm in trouble at this point having only 1 file system. Possibly it would be better to have a second just to be able to change settings with tune2fs and being able to do it quickly. None the less thanks. - Mark ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? 2013-06-20 20:45 ` Mark Knecht @ 2013-06-24 18:47 ` Volker Armin Hemmann 2013-06-24 19:11 ` Mark Knecht 0 siblings, 1 reply; 46+ messages in thread From: Volker Armin Hemmann @ 2013-06-24 18:47 UTC (permalink / raw To: gentoo-amd64 Am 20.06.2013 22:45, schrieb Mark Knecht: > On Thu, Jun 20, 2013 at 12:16 PM, Volker Armin Hemmann > <volkerarmin@googlemail.com> wrote: > <SNIP> >> man mkfs.xfs >> >> man mkfs.ext4 >> >> look for stripe size etc. >> >> Have fun. >> > Volker, > I find way down at the bottom of the RAID setup page that they do > say stride & stripe are important for RAID4 & RAID5, but remain > non-committal for RAID6. raid 6 is just raid5 with additional parity. So stripe size is not less important. ^ permalink raw reply [flat|nested] 46+ messages in thread
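To make the man-page pointer concrete: for ext4 the two knobs are stride (chunk size divided by filesystem block size) and stripe width (stride times the number of data-bearing disks). A minimal sketch using the figures mentioned earlier in the thread (16k chunks, 4k blocks, 5-disk RAID6 so 3 data disks); /dev/md0 is a placeholder, and the exact option spelling should be checked against the mke2fs/tune2fs man pages on your system:

  # stride = 16k chunk / 4k block = 4; stripe-width = 4 * 3 data disks = 12
  mkfs.ext4 -b 4096 -E stride=4,stripe-width=12 /dev/md0

  # an existing filesystem can be adjusted without reformatting
  tune2fs -E stride=4,stripe_width=12 /dev/md0

  # verify what the filesystem believes about the geometry underneath it
  dumpe2fs -h /dev/md0 | grep -i 'stride\|stripe'

Recent mkfs.ext4 versions will usually pick these values up automatically when run directly on an md device, but it is worth verifying rather than assuming.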
* Re: [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? 2013-06-24 18:47 ` Volker Armin Hemmann @ 2013-06-24 19:11 ` Mark Knecht 0 siblings, 0 replies; 46+ messages in thread From: Mark Knecht @ 2013-06-24 19:11 UTC (permalink / raw To: Gentoo AMD64 On Mon, Jun 24, 2013 at 11:47 AM, Volker Armin Hemmann <volkerarmin@googlemail.com> wrote: > Am 20.06.2013 22:45, schrieb Mark Knecht: >> On Thu, Jun 20, 2013 at 12:16 PM, Volker Armin Hemmann >> <volkerarmin@googlemail.com> wrote: >> <SNIP> >>> man mkfs.xfs >>> >>> man mkfs.ext4 >>> >>> look for stripe size etc. >>> >>> Have fun. >>> >> Volker, >> I find way down at the bottom of the RAID setup page that they do >> say stride & stripe are important for RAID4 & RAID5, but remain >> non-committal for RAID6. > > raid 6 is just raid5 with additional parity. So stripe size is not less > important. > Yeah, as I continued to study that became more apparent. The Linux RAID wiki not saying anything about it was (apparently) just an oversight on their part. At this point I'm basically getting set up to tear my whole machine apart and rebuild it from scratch. When I do I'll benchmark whatever RAID options I think will meet my long term needs and then report back anything I find. Personally, I think that RAID6 should be just slightly slower than RAID5, and use slightly more CPU power doing it. How RAID5/6 really compares with RAID1 isn't really that much of an issue for me as using only RAID1 won't give me enough storage using any combination of my 5 500GB drives. I think if I was into spending some money I'd look at buying a second SSD and do RAID1 for my / and then just use the disks for the VMs & video, but don't see that as an option right now. Cheers, Mark ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? 2013-06-20 19:10 [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? Mark Knecht 2013-06-20 19:16 ` Volker Armin Hemmann @ 2013-06-20 19:27 ` Rich Freeman 2013-06-20 19:31 ` Mark Knecht 2013-06-21 7:31 ` [gentoo-amd64] " Duncan ` (2 subsequent siblings) 4 siblings, 1 reply; 46+ messages in thread From: Rich Freeman @ 2013-06-20 19:27 UTC (permalink / raw To: gentoo-amd64 On Thu, Jun 20, 2013 at 3:10 PM, Mark Knecht <markknecht@gmail.com> wrote: > Looking for some thoughtful ideas from those more experienced in this area. Please do share your findings. I suspect my own RAID+LVM+EXT3/4 system is not optimized - especially with LVM I have no idea how blocks in ext3/4 end up mapping to stripes and physical blocks. Oh, and this is on 4k disks. Honestly, this is one of the reasons I REALLY want to move to btrfs when it fully supports raid5. Right now the various layers don't talk to each other and that means a lot of micro-management if you don't want a lot of read-write-read cycles (to say nothing of what you can buy with a filesystem that can aim to overwrite entire stripes at a time). Rich ^ permalink raw reply [flat|nested] 46+ messages in thread
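For anyone wanting to see how the layers actually line up rather than guessing, each layer can be queried directly. A minimal sketch, with /dev/md0 and the LVM names as placeholders:

  # chunk size and layout of the md array
  mdadm --detail /dev/md0 | grep -iE 'level|chunk|devices'

  # where LVM starts placing data within the PV (ideally a multiple of the chunk size)
  pvs -o +pe_start /dev/md0

  # what ext3/4 was told about the geometry underneath it, if anything
  dumpe2fs -h /dev/mapper/vg0-lv0 | grep -i 'stride\|stripe'

If the PV data start or the LV boundaries are not chunk-aligned, the filesystem's idea of a stripe no longer matches the array's, and stride/stripe-width tuning can't help much.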
* Re: [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? 2013-06-20 19:27 ` Rich Freeman @ 2013-06-20 19:31 ` Mark Knecht 0 siblings, 0 replies; 46+ messages in thread From: Mark Knecht @ 2013-06-20 19:31 UTC (permalink / raw To: Gentoo AMD64 On Thu, Jun 20, 2013 at 12:27 PM, Rich Freeman <rich0@gentoo.org> wrote: > On Thu, Jun 20, 2013 at 3:10 PM, Mark Knecht <markknecht@gmail.com> wrote: >> Looking for some thoughtful ideas from those more experienced in this area. > > Please do share your findings. I suspect my own RAID+LVM+EXT3/4 > system is not optimized - especially with LVM I have no idea how > blocks in ext3/4 end up mapping to stripes and physical blocks. Oh, > and this is on 4k disks. > > Honestly, this is one of the reasons I REALLY want to move to btrfs > when it fully supports raid5. Right now the various layers don't talk > to each other and that means a lot of micro-management if you don't > want a lot of read-write-read cycles (to say nothing of what you can > buy with a filesystem that can aim to overwrite entire stripes at a > time). > > Rich > I'll share everything I find, true or false, and maybe as a group we can figure out what's right. In the meantime, please be careful with your RAID5 and do good backups :-) I ran RAID5 for awhile but moved to RAID6 due to the number of reports I read where one drive went bad on a RAID5 and then the RAID lost a second drive before the original bad drive was replaced and everything was gone. Cheers, Mark ^ permalink raw reply [flat|nested] 46+ messages in thread
* [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-20 19:10 [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? Mark Knecht 2013-06-20 19:16 ` Volker Armin Hemmann 2013-06-20 19:27 ` Rich Freeman @ 2013-06-21 7:31 ` Duncan 2013-06-21 10:28 ` Rich Freeman ` (2 more replies) 2013-06-22 12:49 ` [gentoo-amd64] " B Vance 2013-06-23 11:31 ` thegeezer 4 siblings, 3 replies; 46+ messages in thread From: Duncan @ 2013-06-21 7:31 UTC (permalink / raw To: gentoo-amd64 Mark Knecht posted on Thu, 20 Jun 2013 12:10:04 -0700 as excerpted: > Does anyone know of info on how the starting sector number might > impact RAID performance under Gentoo? The drives are WD-500G RE3 drives > shown here: > > http://www.amazon.com/Western-Digital-WD5002ABYS-3-5-inch-Enterprise/dp/B001EMZPD0/ref=cm_cr_pr_product_top > > These are NOT 4k sector sized drives. > > Specifically I'm running a 5-drive RAID6 for about 1.45TB of storage. My > benchmarking seems abysmal at around 40MB/S using dd copying large > files. > It's higher, around 80MB/S if the file being transferred is coming from > an SSD, but even 80MB/S seems slow to me. I see a LOT of wait time in > top. > And my 'large file' copies might not be large enough as the machine has > 24GB of DRAM and I've only been copying 21GB so it's possible some of > that is cached. I /suspect/ that the problem isn't striping, tho that can be a factor, but rather, your choice of raid6. Note that I personally ran md/raid-6 here for a while, so I know a bit of what I'm talking about. I didn't realize the full implications of what I was setting up originally, or I'd have not chosen raid6 in the first place, but live and learn as they say, and that I did. General rule, raid6 is abysmal for writing and gets dramatically worse as fragmentation sets in, tho reading is reasonable. The reason is that in order to properly parity-check and write out less-than-full-stripe writes, the system must effectively read in the existing data and merge it with the new data, then recalculate the parity, before writing the new data AND 100% of the (two-way in raid-6) parity. Further, because raid sits below the filesystem level, it knows nothing about what parts of the filesystem are actually used, and must read and write the FULL data stripe (perhaps minus the new data bit, I'm not sure), including parts that will be empty on a freshly formatted filesystem. So with 4k block sizes on a 5-device raid6, you'd have 20k stripes, 12k in data across three devices, and 8k of parity across the other two devices. Now you go to write a 1k file, but in order to do so the full 12k of existing data must be read in, even on an empty filesystem, because the RAID doesn't know it's empty! Then the new data must be merged in and new checksums created, then the full 20k must be written back out, certainly the 8k of parity, but also likely the full 12k of data even if most of it is simply rewrite, but almost certainly at least the 4k strip on the device the new data is written to. As I said that gets much worse as a filesystem ages, due to fragmentation meaning writes are more often writes to say 3 stripe fragments instead of a single whole stripe. That's what proper stride size, etc, can help with, if the filesystem's reasonably fragmentation resistant, but even then filesystem aging certainly won't /help/. 
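One tunable worth mentioning alongside this read-modify-write behaviour is the md stripe cache, which buffers partial stripes so that more writes can be merged into full-stripe writes before they hit the disks. A minimal sketch, with /dev/md0 as a placeholder; the value costs RAM (pages per device, times page size, times member count) and does not persist across reboots:

  # what the array's chunk size actually is
  mdadm --detail /dev/md0 | grep 'Chunk Size'

  # raid5/6 stripe cache, in pages per device; the default is 256
  cat /sys/block/md0/md/stripe_cache_size

  # a larger cache gives md a better chance of assembling full stripes
  echo 4096 > /sys/block/md0/md/stripe_cache_size

It doesn't remove the read-modify-write penalty described above, but it often takes the worst edge off it for bursty writes.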
Reads, meanwhile, are reasonable speed (in normal non-degraded mode), because on a raid6 the data is at least two-way striped (on a 4-device raid, your 5-device would be three-way striped data, the other two being parity of course), so you do get moderate striping read bonuses. Then there's all that parity information available and written out at every write, but it's not actually used to check the reliability of the data in normal operation, only to reconstruct if a device or two goes missing. On a well laid out system, I/O to the separate drives at least shouldn't interfere with each other, assuming SATA and a chipset and bus layout that can handle them in parallel, not /that/ big a feat on today's hardware at least as long as you're still doing "spinning rust", as the mechanical drive latency is almost certainly the bottleneck there, and at least that can be parallelized to a reasonable degree across the individual drives. What I ultimately came to realize here is that unless the job at hand is nearly 100% read on the raid, with the caveat that you have enough space, raid1 is almost certainly at least as good if not a better choice. If you have the devices to support it, you can go for raid10/50/60, and a raid10 across 5 devices is certainly possible with mdraid, but a straight raid-6... you're generally better off with an N-way raid-1, for a couple reasons. First, md/raid1 is surprisingly, even astoundingly, good at multi-task scheduling reads. So any time there's multiple I/O read tasks going on (like during boot), raid1 works really well, with the scheduler distributing tasks among the available devices, this minimizing seek- latency. So take a 5-device raid-1, you can very likely accomplish at least 5 and possibly 6 or even 7 read jobs in say 110 or 120% of the time it would take to do just the longest one on a single device, almost certainly well before a single device could have done the two longest read jobs. This also works if there's a single task alternating reads of N different files/directories, since the scheduler will again distribute jobs among the devices, so say one device head stays over the directory information, while another goes to read the first file, the second reads another file, etc, and the heads stay where they are until they're needed elsewhere so the more devices in raid1 you have the more likely it is that more data read from the same location still has a head located right over it and can just read it as the correct portion of the disk spins underneath, instead of first seeking to the correct spot on the disk. It's worth pointing out that in the case of parallel job read access, due to this parallel read-scheduling md/raid1 can often best raid0 performance, despite raid0's technically better single-job thruput numbers. This was something I learned by experience as well, that makes sense but that I had TOTALLY not realized or calculated for in my original setup, as I was running raid0 for things like the gentoo ebuild tree and the kernel sources, since I didn't need redundancy for them. My raid0 performance there was rather disappointing, because both portage tree updates and dep calculation and the kernel build process don't optimize well for thruput, which is what raid0 does, but optimize rather better for parallel I/O, where raid1 shines especially for read. Second, md/raid1 writes, because they happen in parallel with the bottleneck being the spinning rust, basically occur at the speed of the slowest disk. 
So you don't get N-way parallel job write speed, just single disk speed, but it's still *WAY* *WAY* better than raid6, which has to read in the existing data and do the merge before it can write back out. **THAT'S THE RAID6 PERFORMANCE KILLER**, or at least it was for me, effectively giving you half-device speed writes because the data in too many cases must be read in first before it can be written. Raid1 doesn't have that problem -- it doesn't get a write performance multiplier from the N devices, but at least it doesn't get device performance cut in half like raid5/6 does. Third, the read-scheduling benefits of #1 help to a lesser extent with large same-raid1 copies as well. Consider, the first block must be read by one device, then written to all at the new location. The second similarly, then the third, etc. But, with proper scheduling an N-way raid1 doing an N-block copy has done N+1 operations on all devices at the end of that N-block copy. IOW, given the memory to use as a buffer, the read can be done in parallel, reading N blocks in at once, one from each device, then the writes, one block at a time to all devices. So a 5-way raid1 will have done 6 jobs on each of the 5 devices at the end, 1 read and 5 writes, to write out 5 blocks. (In actuality due to read-ahead I think it's optimally 64k blocks per device, 16 4k blocks on each, 320k total, but that's well within the usual minimal 2MB drive buffer size, and the drive will probably do that on its own if both read and write-caching are on, given scheduling that forces a cache-flush only at the end, not multiple times in the middle. So all the kernel has to do is be sure it's not interfering by forcing untimely flushes, and the drives should optimize on their own.) Fourth, back to the parity. Remember, raid5/6 has all that parity that it writes out (but basically never reads in normal mode, only when degraded, in order to reconstruct the data from the missing device(s)), but doesn't actually use it for integrity checking. So while raid1 doesn't have the benefit of that parity data, it's not like raid5/6 used it anyway, and an N-way raid1 means even MORE missing-device protection since you can lose all but one device and keep right on going as if nothing happened. So a 5-way raid1 can lose 4 devices, not just the two devices of a 5-way raid6 or the single device of a raid5. Yes, there's the loss of parity/integrity data with raid1, BUT RAID5/6 DOESN'T USE THAT DATA FOR INTEGRITY CHECKING ANYWAY, ONLY FOR RECONSTRUCTION IN THE CASE OF DEVICE LOSS! So the N-way raid1 is far more redundant since you have N copies of the data, not one copy plus two-way-parity-that's-never-used-except-for-reconstruction. Fifth, in the event of device loss, a raid1 continues to function at normal speed, because it's simply an N-way copy with a bit of extra metadata to keep track of the number of N-ways. Of course you'll lose the benefit of read-parallelization that the missing device provided, and you'll lose the redundancy of the missing device, but in general, performance remains pretty much the same no matter how many ways it's raid-1 mirrored. Contrast that with raid5/6 which is SEVERELY read performance impacted by device loss, since it then must reconstruct the data using the parity data, not simply read it from somewhere else, which is what raid1 does. The single down side to raid1 as opposed to raid5/6 is the loss of the extra space made available by the data striping, 3*single-device-space in the case of 5-way raid6 (or 4-way raid5) vs. 
1*single-device-space in the case of raid1. Otherwise, no contest, hands down, raid1 over raid6. IOW, you're seeing now exactly why raid6 and to a lesser extent raid5 have such terrible performance (as opposed to reliability) reputations. Really, unless you simply don't have the space to make it raid1, I **STRONGLY** urge you to try that instead. I know I was very happily surprised by the results I got, and only then realized what all the negativity I'd seen around raid5/6 had been about, as I really hadn't understood it at all when I was doing my original research. Meanwhile, Rich0 already brought up btrfs, which really does promise a better solution to many of these issues than md/raid, in part due to what is arguably a "layering violation", but one that really DOES allow for some serious optimizations in the multiple-drive case, because as a filesystem, it DOES know what's real data and what's empty space that isn't worth worrying about, and because unlike raid5/6 parity, it really DOES care about data integrity, not just rebuilding in case of device failure. So several points on btrfs: 1) It's still in heavy development. The base single-device filesystem case works reasonably well now and is /almost/ stable, tho I'd still urge people to keep good backups as it's simply not time tested and settled, and won't be for at least a few more kernels as they're still busy with the other features. Second-level raid0/raid1/raid10 is at an intermediate level. Primary development and initial bug testing and fixing is done but they're still working on bugs that people doing only traditional single-device filesystems simply don't have to worry about. Third-round raid5/6 is still very new, introduced as VERY experimental only with 3.9 IIRC, and is currently EXPECTED to eat data in power-loss or crash events, so it's ONLY good for preliminary testing at this point. Thus, if you're using btrfs at all, keep good backups, and keep current, even -rc (if not live-git) on the kernel, because there really are fixes in every single kernel for very real corner-case problems they are still coming across. But single-device is /relatively/ stable now, so provided you keep good *TESTED* backups and are willing and able to use them if it comes to it, and keep current on the kernel, go for that. And I'm personally running dual-device raid-1 mode across two SSDs, at the second stage deployment level. I tried that (but still on spinning rust) a year ago and decided btrfs simply wasn't ready for me yet, so it has come quite a way in the last year. But raid5/6 mode is still fresh third-tier development, which I'd not consider usable until at LEAST 3.11 and probably 3.12 or later (maybe a year from now, since it's less mature than raid1 was at this point last year, but should mature a bit faster). Takeaway: If you don't have a backup you're prepared to use, you shouldn't be even THINKING about btrfs at this point, no matter WHAT type of deployment you're considering. If you do, you're probably reasonably safe with traditional single-device btrfs, intermediately risky/safe with raid0/1/10, don't even think about raid5/6 for real deployment yet, period. 2) RAID levels work QUITE a bit differently on btrfs. In particular, what btrfs calls raid1 mode (with the same applying to raid10) is simply two-way-mirroring, NO MATTER THE NUMBER OF DEVICES. There's no multi-way mirroring yet available, unless you're willing to apply not-yet-mainstreamed patches. It's planned, but not yet applied. 
The roadmap says it'll happen after raid5/6 are introduced (they have been, but aren't yet really finished including power-loss-recovery, etc), so I'm guessing 3.12 at the earliest as I think 3.11 is still focused on raid5/6 completion. 3) Btrfs raid1 mode is used to provide second-source for its data integrity feature as well, such that if one copy's checksum doesn't verify, it'll try the other one. Unfortunately #2 means there's only the single fallback to try, but that's better than most filesystems, without data integrity at all, or if they have it, no fallback if it fails. The combination of #2 and #3 was a bitter pill for me a year ago, when I was still running on aging spinning rust, and thus didn't trust two-copy-only redundancy. I really like the data integrity feature, but just a single backup copy was a great disappointment since I didn't trust my old hardware, and unfortunately two-copy-max remains the case for so-called raid1. (Raid5/6 mode apparently introduces N-way copies or some such, but as I said, it's not complete yet and is EXPECTED to eat data. N-way-mirroring will build on that and is on the horizon, but it has been on the horizon and not seeming to get much closer for over a year now...) Fortunately for me, my budget is in far better shape this year, and with the dual new SSDs I purchased and with spinning rust for backup still, I trust my hardware enough now to run the 2-way-only mirroring that btrfs calls raid1 mode. 4) As mentioned above in the btrfs intro paragraph, btrfs, being a filesystem, actually knows what data is actual data, and what is safely left untracked and thus unsynced. Thus, the read-data-in-before-writing-it problem will be rather less, certainly on freshly formatted disks where most existing data WILL be garbage/zeros (trimmed if on SSD, as mkfs.btrfs issues a trim command for the entire filesystem range before it creates the superblocks, etc, so empty space really /is/ zeroed). Similarly with "slack space" that's not currently used but was used previously, as the filesystem ages -- btrfs knows that it can ignore that data too, and thus won't have to read it in to update the checksum when writing to a raid5/6 mode btrfs. 5) There's various other nice btrfs features and a few caveats as well, but with the exception of anything btrfs-raid pertaining I totally forgot about, they're out of scope for this thread, which is after all, on raid, so I'll skip discussing them here. So bottom line, I really recommend md/raid1 for now. Unless you want to go md/raid10, with three-way-mirroring on the raid1 side. AFAIK that's doable with 5 devices, but it's simpler, certainly conceptually simpler which can make a difference to an admin trying to work with it, with 6. If the data simply won't fit on the 5-way raid1 and you want to keep at least 2-device-loss protection, consider splitting it up, raid1 with three devices for the first half, then either get a sixth device to do the same with the second half, or go raid1 with two devices and put your less critical data on the second set. Or, do the raid10 with 5 devices thing, but I'll admit that while I've read that it's possible, I don't really conceptually understand it myself, and haven't tried it, so I have no personal opinion or experience to offer on that. But in that case I really would try to scrape up the money for a sixth device if possible, and do raid10 with 3-way redundancy 2-way-striping across the six, simply because it's easier to conceptualize and thus to properly administer. 
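For reference, the two layouts suggested here map onto mdadm options directly. A minimal sketch, with the device names purely as placeholders:

  # six devices, "near" layout with three copies of every block:
  # 3-way mirroring, 2-way striping, usable space of two devices
  mdadm --create /dev/md0 --level=10 --raid-devices=6 --layout=n3 /dev/sd[b-g]1

  # md raid10 also works across an odd device count, e.g. the existing
  # five disks with two copies spread around them
  mdadm --create /dev/md0 --level=10 --raid-devices=5 --layout=n2 /dev/sd[b-f]1

The n2/n3 "near" layouts are the easiest to reason about; the far/offset variants trade layout simplicity for sequential read speed.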
-- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 46+ messages in thread
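As a concrete illustration of the btrfs raid1 point above (two-way mirroring regardless of device count), creating and inspecting such a filesystem looks roughly like this; device names and mount point are placeholders:

  # mirror both data and metadata across the member devices (two copies
  # total, no matter how many devices are given)
  mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc

  mount /dev/sdb /mnt
  # show how data and metadata chunks are actually allocated
  btrfs filesystem df /mnt

As of the kernels discussed in this thread, adding a third device to such a filesystem adds capacity, not a third copy.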
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 7:31 ` [gentoo-amd64] " Duncan @ 2013-06-21 10:28 ` Rich Freeman 2013-06-21 14:23 ` Bob Sanders ` (2 more replies) 2013-06-21 17:40 ` Mark Knecht 2013-06-30 1:04 ` Rich Freeman 2 siblings, 3 replies; 46+ messages in thread From: Rich Freeman @ 2013-06-21 10:28 UTC (permalink / raw To: gentoo-amd64 On Fri, Jun 21, 2013 at 3:31 AM, Duncan <1i5t5.duncan@cox.net> wrote: > So with 4k block sizes on a 5-device raid6, you'd have 20k stripes, 12k > in data across three devices, and 8k of parity across the other two > devices. With mdadm on a 5-device raid6 with 512K chunks you have 1.5M in a stripe, not 20k. If you modify one block it needs to read all 1.5M, or it needs to read at least the old chunk on the single drive to be modified and both old parity chunks (which on such a small array is 3 disks either way). > Fourth, back to the parity. Remember, raid5/6 has all that parity that it > writes out (but basically never reads in normal mode, only when degraded, > in order to reconstruct the data from the missing device(s)), but > doesn't actually use it for integrity checking. I wasn't aware of this - I can't believe it isn't even an option either. Note to self - start doing weekly scrubs... > The single down side to raid1 as opposed to raid5/6 is the loss of the > extra space made available by the data striping, 3*single-device-space in > the case of 5-way raid6 (or 4-way raid5) vs. 1*single-device-space in the > case of raid1. Otherwise, no contest, hands down, raid1 over raid6. This is a HUGE downside. The only downside to raid1 over not having raid at all is that your disk space cost doubles. raid5/6 is considerably cheaper in that regard. In a 5-disk raid5 the cost of redundancy is only 25% more, vs a 100% additional cost for raid1. To accomplish the same space as a 5-disk raid5 you'd need 8 disks. Sure, read performance would be vastly superior, but if you're going to spend $300 more on hard drives and whatever it takes to get so many SATA ports on your system you could instead add an extra 32GB of RAM or put your OS on a mirrored SSD. I suspect that both of those options on a typical workload are going to make a far bigger improvement in performance. Which is better really depends on your workload. In my case much of my raid space is used by mythtv, or for storage of stuff I only occasionally use. In these use cases the performance of the raid5 is more than adequate, and I'd rather be able to keep shows around for an extra 6 months in HD than have the DVR respond a millisecond faster when I hit play. If you really have sustained random access of the bulk of your data then a raid1 would make much more sense. > So several points on btrfs: > > 1) It's still in heavy development. That is what is keeping me away. I won't touch it until I can use it with raid5, and the first commit containing that hit the kernel weeks ago I think (and it has known gaps). Until it is stable I'm sticking with my current setup. > 2) RAID levels work QUITE a bit differently on btrfs. In particular, > what btrfs calls raid1 mode (with the same applying to raid10) is simply > two-way-mirroring, NO MATTER THE NUMBER OF DEVICES. There's no multi-way > mirroring yet available Odd, for some reason I thought it let you specify arbitrary numbers of copies, but looking around I think you're right. It does store two copies of metadata regardless of the number of drives unless you override this. 
However, if one considered raid1 expensive, having multiple layers of redundancy is REALLY expensive if you aren't using Reed Solomon and many data disks. From my standpoint I don't think raid1 is the best use of money in most cases, either for performance OR for data security. If you want performance the money is probably better spent on other components. If you want data security the money is probably better spent on offline backups. However, this very-much depends on how the disks will be used - there are certainly cases where raid1 is your best option. Rich ^ permalink raw reply [flat|nested] 46+ messages in thread
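On the weekly-scrub note above: md exposes the check through sysfs, so it is easy to script or cron. A minimal sketch, with md0 as a placeholder:

  # read every block, data and parity, and count (but don't repair) inconsistencies
  echo check > /sys/block/md0/md/sync_action

  # progress appears in /proc/mdstat; the result is summarised here afterwards
  cat /sys/block/md0/md/mismatch_cnt

  # writing 'repair' instead of 'check' rewrites mismatched parity from the data

Some distributions ship a periodic cron job doing exactly this; on a hand-rolled setup it has to be added by hand.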
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 10:28 ` Rich Freeman @ 2013-06-21 14:23 ` Bob Sanders 2013-06-21 14:27 ` Duncan 2013-06-22 23:04 ` Mark Knecht 2 siblings, 0 replies; 46+ messages in thread From: Bob Sanders @ 2013-06-21 14:23 UTC (permalink / raw To: gentoo-amd64 Rich Freeman, mused, then expounded: > On Fri, Jun 21, 2013 at 3:31 AM, Duncan <1i5t5.duncan@cox.net> wrote: > > > The single down side to raid1 as opposed to raid5/6 is the loss of the > > extra space made available by the data striping, 3*single-device-space in > > the case of 5-way raid6 (or 4-way raid5) vs. 1*single-device-space in the > > case of raid1. Otherwise, no contest, hands down, raid1 over raid6. > > This is a HUGE downside. The only downside to raid1 over not having > raid at all is that your disk space cost doubles. raid5/6 is > considerably cheaper in that regard. In a 5-disk raid5 the cost of > redundancy is only 25% more, vs a 100% additional cost for raid1. To > accomplish the same space as a 5-disk raid5 you'd need 8 disks. Sure, > read performance would be vastly superior, but if you're going to > spend $300 more on hard drives and whatever it takes to get so many > SATA ports on your system you could instead add an extra 32GB of RAM > or put your OS on a mirrored SSD. I suspect that both of those > options on a typical workload are going to make a far bigger > improvement in performance. > However, the incidence of failure is less with RAID1 than RAID5/6. As the number of devices increases, the failure rate increases. Indeed, the performance and total space can outweigh the increase in device failure. However, more devices - especially more devices that have motors and bearings - take more power, generate more heat, and increase the need for more backups to avert an increase in failures. Bob -- - ^ permalink raw reply [flat|nested] 46+ messages in thread
* [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 10:28 ` Rich Freeman 2013-06-21 14:23 ` Bob Sanders @ 2013-06-21 14:27 ` Duncan 2013-06-21 15:13 ` Rich Freeman 2013-06-22 23:04 ` Mark Knecht 2 siblings, 1 reply; 46+ messages in thread From: Duncan @ 2013-06-21 14:27 UTC (permalink / raw To: gentoo-amd64 Rich Freeman posted on Fri, 21 Jun 2013 06:28:35 -0400 as excerpted: > On Fri, Jun 21, 2013 at 3:31 AM, Duncan <1i5t5.duncan@cox.net> wrote: >> So with 4k block sizes on a 5-device raid6, you'd have 20k stripes, 12k >> in data across three devices, and 8k of parity across the other two >> devices. > > With mdadm on a 5-device raid6 with 512K chunks you have 1.5M in a > stripe, not 20k. If you modify one block it needs to read all 1.5M, or > it needs to read at least the old chunk on the single drive to be > modified and both old parity chunks (which on such a small array is 3 > disks either way). I'll admit to not fully understanding chunks/stripes/strides in terms of actual size tho I believe you're correct, it's well over the filesystem block size and a half-meg is probably right. However, the original post went with a 4k blocksize, which is pretty standard as that's the usual memory page size as well so it makes for a convenient filesystem blocksize too, so that's what I was using as a base for my numbers. If it's 4k blocksize, then 5-device raid6 stripe would be 3*4k=12k of data, plus 2*4k=8k of parity. >> Forth, back to the parity. Remember, raid5/6 has all that parity that >> it writes out (but basically never reads in normal mode, only when >> degraded, >> in ordered to reconstruct the data from the missing device(s)), but >> doesn't actually use it for integrity checking. > > I wasn't aware of this - I can't believe it isn't even an option either. > Note to self - start doing weekly scrubs... Indeed. That's one of the things that frustrated me with mdraid -- all that data integrity metadata there, but just going to waste in normal operation, only used for device recovery. Which itself can be a problem as well, because if there *IS* an undetected cosmic-ray-error or whatever and a device goes out, that means you'll lose integrity on a second device in the rebuild as well (if it was a data device that dropped out and not parity anyway), because the parity's screwed against the undetected error and will thus rebuild a bad copy of the data on the replacement device. And it's one of the things which so attracted me to btrfs, too, and why I was so frustrated to see it could only be a single redundancy (two-way- mirrored), no way to do more. The btrfs sales pitch talks about how great data integrity and the ability to go find a good copy when the data's bad, but what if the only allowed second copy is bad as well? OOPS! But as I said, N-way mirroring is on the btrfs roadmap, it's simply not there yet. >> The single down side to raid1 as opposed to raid5/6 is the loss of the >> extra space made available by the data striping, 3*single-device-space >> in the case of 5-way raid6 (or 4-way raid5) vs. 1*single-device-space >> in the case of raid1. Otherwise, no contest, hands down, raid1 over >> raid6. > > This is a HUGE downside. The only downside to raid1 over not having > raid at all is that your disk space cost doubles. raid5/6 is > considerably cheaper in that regard. In a 5-disk raid5 the cost of > redundancy is only 25% more, vs a 100% additional cost for raid1. To > accomplish the same space as a 5-disk raid5 you'd need 8 disks. 
Sure, > read performance would be vastly superior, but if you're going to spend > $300 more on hard drives and whatever it takes to get so many SATA ports > on your system you could instead add an extra 32GB of RAM or put your OS > on a mirrored SSD. I suspect that both of those options on a typical > workload are going to make a far bigger improvement in performance. I'd suggest that with the exception of large database servers where the object is to be able to cache the entire db in RAM, the SSDs are likely a better option. FWIW, my general "gentoo/amd64 user" rule of thumb is 1-2 gig base, plus 1-2 gig per core. Certainly that scale can slide either way and it'd probably slide down for folks not doing system rebuilds in tmpfs, as gentooers often do, but up or down, unless you put that ram in a battery- backed ramdisk, 32-gig is a LOT of ram, even for an 8-core. FWIW, with my old dual-dual-core (so four cores), 8 gig RAM was nicely roomy, tho I /did/ sometimes bump the top and thus end up either dumping either cache or swapping. When I upgraded to the 6-core, I used that rule of thumb and figured ~12 gig, but due to the power-of-twos efficiency rule, I ended up with 16 gig, figuring that was more than I'd use in practice, but better than limiting it to 8 gig. I was right. The 16 gig is certainly nice, but in reality, I'm typically entirely wasting several gigs of it, not even cache filling it up. I typically run ~1 gig in application memory and several gigs in cache, with only a few tens of MB in buffer. But while I'll often exceed my old capacity of 8 gig, it's seldom by much, and 12 gig would handle everything including cache without dumping at well over the 90th percentile, probably 97% or there abouts. Even with parallel make at both the ebuild and global portage level and with PORTAGE_TMPDIR in tmpfs, I hit 100% on the cores well before I run out of RAM and start dumping cache or swapping. The only time that has NOT been the case is when I deliberately saturate, say a kernel build with an open-ended -j so it stacks up several-hundred jobs at once. Meanwhile, the paired SSDs in btrfs raid1 make a HUGE practical difference, especially in things like the (cold-cache) portage tree (and overlays) sync, kernel git pull, etc. (In my case actual booting didn't get a huge boost as I run ntp-client and ntpd at boot, and the ntp-client time sync takes ~12 seconds, more than the rest of the boot put together. But cold-cache loading kde happens faster now -- I actually uninstalled the ksplash and just go text-console login to x-black-screen to kde/plasma desktop, now. But the tree sync and kernel pull are still the places I appreciate the SSDs most.) And notably, because the cold-cache system is so much faster with the SSDs, I tend to actually shut-down instead of suspending, now, so I tend to cache even less and thus use less memory with the SSDs than before. I /could/ probably do 8-gig RAM now instead of 16, and not miss it. Even a gig per core, 6-gig, wouldn't be terrible, tho below that would start to bottleneck and pinch a bit again I suspect. > Which is better really depends on your workload. In my case much of my > raid space is used my mythtv, or for storage of stuff I only > occasionally use. In these use cases the performance of the raid5 is > more than adequate, and I'd rather be able to keep shows around for an > extra 6 months in HD than have the DVR respond a millisecond faster when > I hit play. 
If you really have sustained random access of the bulk of > your data than a raid1 would make much more sense. Definitely. For mythTV or similar massive media needs, raid5 will be fast enough. And I suspect just the single device-loss tolerance is a reasonable risk tradeoff for you too, since after all it /is/ just media, so tolerating loss of a single device is good, but the risk of losing two before a full rebuild with a replacement if one fails is acceptable, given the cost vs. size tradeoff with the massive size requirements of video. But again, the OP seemed to find his speed benchmarks disappointing, to say the least, and I believe pointing out raid6 as the culprit is accurate. Which, given his production-rating reliability stock trading VMs usage, I'm guessing raid5/6 really isn't the ideal match. Massive media, yes, definitely. Massive VMs, not so much. >> So several points on btrfs: >> >> 1) It's still in heavy development. > > That is what is keeping me away. I won't touch it until I can use it > with raid5, and the first common containing that hit the kernel weeks > ago I think (and it has known gaps). Until it is stable I'm sticking > with my current setup. Question: Would you use it for raid1 yet, as I'm doing? What about as a single-device filesystem? Do you believe my estimates of reliability in those cases (almost but not quite stable for single-device, kind of in the middle for raid1/raid0/raid10, say a year behind single-device and raid5/6/50/60 about a year behind that) reasonably accurate? Because if you're waiting until btrfs raid5 is fully stable, that's likely to be some wait yet -- I'd say a year, likely more given that everything btrfs has seemed to take longer than people expected. But if you're simply waiting until it matures to the point that say btrfs raid1 is at now, or maybe even a bit less, but certainly to where it's complete plus say a kernel release to work out a few more wrinkles, then that's quite possible by year-end. >> 2) RAID levels work QUITE a bit differently on btrfs. In particular, >> what btrfs calls raid1 mode (with the same applying to raid10) is >> simply two-way-mirroring, NO MATTER THE NUMBER OF DEVICES. There's no >> multi-way mirroring yet available > > Odd, for some reason I thought it let you specify arbitrary numbers of > copies, but looking around I think you're right. It does store two > copies of metadata regardless of the number of drives unless you > override this. Default is single-copy data, dual-copy metadata, regardless of number of devices (single device does DUP metadata, two copies on the same device, by default), with the exception of SSDs, where the metadata default is single since many of the SSD firmwares (sandforce firmware, with its compression features, is known to do this, tho mine, I forgot the firmware brand ATM but it's Corsair Neutron SSDs aimed at the server/ workstation market where unpredictability isn't considered a feature, doesn't as one of its features is stable performance and usage regardless of the data its fed) do dedup on identical copy data anyway. At least that's the explanation given for the SSD exception. But the real gotcha is that there's no way to setup N-way (N>2) redundancy on btrfs raid1/10, and I know for a fact that catches some admins by nasty surprise, as I've seen it come up on the btrfs list as well as had my own personal disappointment with it, tho luckily I did my research and figured that out before I actually installed on btrfs. 
I just wish they'd called it 2-way-mirroring instead of raid1, as that wouldn't be the deception in labeling that I consider the btrfs raid1 moniker at this point, and admins would be far less likely to be caught unaware when a second device goes haywire that they /thought/ they'd be covered for. Of course at this point it's all still development anyway, so no sane admin is going to be lacking backups in any case, but there's a lot of people flying by the seat of their pants out there, who have NOT done the research, and they show up frequently on the btrfs list, after it's too late. (Tho certainly there's less of them showing up now than a year ago, when I first investigated btrfs, I think both due to btrfs maturing quite a bit since then and to a lot of the original btrfs hype dying down, which is a good thing considering the number of folks that were installing it, only to find out once they lost data that it was still development.) > However, if one considered raid1 expensive, having multiple layers of > redundancy is REALLY expensive if you aren't using Reed Solomon and many > data disks. Well, depending on the use case. In your media case, certainly. However, that's one of the few cases that still gobbles storage space as fast as the manufacturers up their capacities, and that is likely to continue to do so for at least a few more years, given that HD is still coming in, so a lot of the media is still SD, and with quad-HD in the wings as well, now. But once we hit half-petabyte, I suppose even quad-HD won't be gobbling the space as fast as they can upgrade it, any more. So a half-decade or so, maybe? Plus of course the sheer bandwidth requirements for quad-HD are astounding, so at that point either some serious raid0/x0 raid or ssds for the speed will be pretty mandatory anyway, remaining SSD size limits or no SSD size limits. > From my standpoint I don't think raid1 is the best use of money in most > cases, either for performance OR for data security. If you want > performance the money is probably better spent on other components. If > you want data security the money is probably better spent on offline > backups. However, this very-much depends on how the disks will be used > - there are certainly cases where raid1 is your best option. I agree when the use is primarily video media. Other than that, a pair of 2 TB spinning rust drives tends to still go quite a long way, and tends to be a pretty good cost/risk tradeoff IMO. Throwing in a third 2-TB drive for three-way raid1 mirroring is often a good idea as well, where the additional data security is needed, but beyond that, the cost/benefit balance probably doesn't make a whole lot of sense, agreed. And offline backups are important too, but with dual 2TB drives, many people can live with a TB of data and do multiple raid1s, giving themselves both logically offline backup and physical device redundancy. And if that means they do backups to the second raid set on the same physical devices more reliably than they would with an external that they have to physically look for and/or attach each time (as turned out to be the case for me), then the pair of 2TB drives is quite a reasonable investment indeed. But if you're going for performance, spinning rust raid simply doesn't cut it at the consumer level any longer. 
SSD at least the commonly used data, leaving say the media data on spinning rust for the time being if the budget doesn't work otherwise, as I've actually done here with my (much smaller than yours) media collection, figuring it not worth the cost to put /it/ on SSD just yet. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 14:27 ` Duncan @ 2013-06-21 15:13 ` Rich Freeman 2013-06-22 10:29 ` Duncan 0 siblings, 1 reply; 46+ messages in thread From: Rich Freeman @ 2013-06-21 15:13 UTC (permalink / raw To: gentoo-amd64 On Fri, Jun 21, 2013 at 10:27 AM, Duncan <1i5t5.duncan@cox.net> wrote: > Rich Freeman posted on Fri, 21 Jun 2013 06:28:35 -0400 as excerpted: > >> That is what is keeping me away. I won't touch it until I can use it >> with raid5, and the first common containing that hit the kernel weeks >> ago I think (and it has known gaps). Until it is stable I'm sticking >> with my current setup. > > Question: Would you use it for raid1 yet, as I'm doing? What about as a > single-device filesystem? Do you believe my estimates of reliability in > those cases (almost but not quite stable for single-device, kind of in > the middle for raid1/raid0/raid10, say a year behind single-device and > raid5/6/50/60 about a year behind that) reasonably accurate? If I wanted to use raid1 I might consider using btrfs now. I think it is still a bit risky, but the established use cases have gotten a fair bit of testing now. I'd be more confident in using it with a single device. > > Because if you're waiting until btrfs raid5 is fully stable, that's > likely to be some wait yet -- I'd say a year, likely more given that > everything btrfs has seemed to take longer than people expected. That's my thought as well. Right now I'm not running out of space, so I'm hoping that I can wait until the next time I need to migrate my data (from 1TB to 5+TB drives, for example). With such a scenario I don't need to have 10 drives mounted at once to migrate the data - I can migrate existing data to 1-2 drives, remove the old ones, and expand the new array. To migrate today would require finding someplace to dump all the data offline and migrate the drives, as there is no in-place way to migrate multiple ext3/4 logical volumes on top of mdadm to a single btrfs on bare metal. Without replying to anything in particular both you and Bob have mentioned the importance of multiple redundancy. Obviously risk goes down as redundancy goes up. If you protect 25 drives of data with 1 drive of parity then you need 2/26 drives to fail to hose 25 drives of data. If you protect 1 drive of data with 25 drives of parity (call them mirrors or parity or whatever - they're functionally equivalent) then you need 25/26 drives to fail to lose 1 drive of data. RAID 1 is actually less effective - if you protect 13 drives of data with 13 mirrors you need 2/26 drives to fail to lose 1 drive of data (they just have to be the wrong 2). However, you do need to consider that RAID is not the only way to protect data, and I'm not sure that multiple-redundancy raid-1 is the most cost-effective strategy. If I had 2 drives of data to protect and had 4 spare drives to do it with, I doubt I'd set up a 3x raid-1/5/10 setup (or whatever you want to call it - imho raid "levels" are poorly named as there really is just striping and mirroring and adding RS parity and everything else is just combinations). Instead I'd probably set up a RAID1/5/10/whatever with single redundancy for faster storage and recovery, and an offline backup (compressed and with incrementals/etc). The backup gets you more security and you only need it in a very unlikely double-failure. 
I'd only invest in multiple redundancy in the event that the risk-weighted cost of having the node go down exceeds the cost of the extra drives. Frankly in that case RAID still isn't the right solution - you need a backup node someplace else entirely as hard drives aren't the only thing that can break in your server. This sort of rationale is why I don't like arguments like "RAM is cheap" or "HDs are cheap" or whatever. The fact is that wasting money on any component means investing less in some other component that could give you more space/performance/whatever-makes-you-happy. If you have $1000 that you can afford to blow on extra drives then you have $1000 you could blow on RAM, CPU, an extra server, or a trip to Disney. Why not blow it on something useful? Rich ^ permalink raw reply [flat|nested] 46+ messages in thread
* [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 15:13 ` Rich Freeman @ 2013-06-22 10:29 ` Duncan 2013-06-22 11:12 ` Rich Freeman 0 siblings, 1 reply; 46+ messages in thread From: Duncan @ 2013-06-22 10:29 UTC (permalink / raw To: gentoo-amd64 Rich Freeman posted on Fri, 21 Jun 2013 11:13:51 -0400 as excerpted: > On Fri, Jun 21, 2013 at 10:27 AM, Duncan <1i5t5.duncan@cox.net> wrote: >> Question: Would you use [btrfs] for raid1 yet, as I'm doing? >> What about as a single-device filesystem? > If I wanted to use raid1 I might consider using btrfs now. I think it > is still a bit risky, but the established use cases have gotten a fair > bit of testing now. I'd be more confident in using it with a single > device. OK, so we agree on the basic confidence level of various btrfs features. I trust my own judgement a bit more now. =:^) > To migrate today would require finding someplace to dump all > the data offline and migrate the drives, as there is no in-place way to > migrate multiple ext3/4 logical volumes on top of mdadm to a single > btrfs on bare metal. ... Unless you have enough unpartitioned space available still. What I did a few years ago is buy a 1 TB USB drive I found at a good deal. (It was very near the price of half-TB drives at the time, I figured out later they must have gotten shipped a pallet of the wrong ones for a sale on the half-TB version of the same thing, so it was a single-store, get-it-while-they're-there-to-get, deal.) That's how I was able to migrate from the raid6 I had back to raid1. I had to squeeze the data/partitions a bit to get everything to fit, but it did, and that's how I ended up with 4-way raid1, since it /had/ been a 4- way raid6. All 300-gig drives at the time, so the TB USB had /plenty/ of room. =:^) > Without replying to anything in particular both you and Bob have > mentioned the importance of multiple redundancy. > > Obviously risk goes down as redundancy goes up. If you protect 25 > drives of data with 1 drive of parity then you need 2/26 drives to fail > to hose 25 drives of data. Ouch! > If you protect 1 drive of data with 25 drives of parity (call them > mirrors or parity or whatever - they're functionally equivalent) then > you need 25/26 drives to fail to lose 1 drive of data. Almost correct. Except that with 25/26 failed, you'd still have 1 working, which with raid1/mirroring would be enough. (AFAIK that's the difference with parity. Parity is generally done on a minimum of two devices with the third as parity, and going down to just one isn't enough, you can lose only one, or two if you have two-way-parity as with raid6. With mirroring/raid1, they're all essentially identical, so one is enough to keep going, you'd have to loose 26/26 to be dead in the water. But 25/26 dead or 26/26 dead, you better HOPE it never comes down to where that matters!) > RAID 1 is actually less effective - if you protect 13 > drives of data with 13 mirrors you need 2/26 drives to fail to lose 1 > drive of data (they just have to be the wrong 2). However, you do need > to consider that RAID is not the only way to protect data, and I'm not > sure that multiple-redundancy raid-1 is the most cost-effective > strategy. The first time I read that thru I read it wrong, and was about to disagree. Then I realized what you meant... and that it was an equally valid read of what you wrote, except... AFAIK 13 drives of data with 13 mirrors wouldn't (normally) be called raid1 (unless it's 13 individual raid1s). 
Normally, an arrangement of that nature if configured together would be configured as raid10, 2-way- mirrored, 13-way-striped (or possibly raid0+1, but that's not recommended for technical reasons having to do with rebuild thruput), tho it could also be configured as what mdraid calls linear mode (which isn't really raid, but it happens to be handled by the same md/raid driver in Linux) across the 13, plus raid1, or if they're configured as separate volumes, 13 individual two-disk raid1s, any of which might be what you meant (and the wording appears to favor 13 individual raid1s). What I interpreted it as initially was a 13-way raid1, mirrored again at a second level to 13 additional drives, which would be called raid11, except that there's no benefit of that over a simple single-layer 26-way raid1 so the raid11 term is seldom seen, and that's clearly not what you meant. Anyway, you're correct if it's just two-way-mirrored. However, at that level, if one was to do only two-way-mirroring, one would usually do either raid10 for the 13-way striping, or 13 separate raid1s, which would give one the opportunity to make some of them 3-way-mirrored (or more) raid1s for the really vital data, leaving the less vital data as simple 2-way-mirror-raid1s. Or raid6 and get loss-of-two tolerance, but as this whole subthread is discussing, that can be problematic for thruput. (I've occasionally seen reference to raid7, which is said to be 3-way-parity, loss-of-three- tolerance, but AFAIK there's no support for it in the kernel, and I wouldn't be surprised if all implementations are proprietary. AFAIK, in practice, raid10 with N-way mirroring on the raid1 portion is implemented once that many devices get involved, or other multi-level raid schemes.) > If I had 2 drives of data to protect and had 4 spare drives to do it > with, I doubt I'd set up a 3x raid-1/5/10 setup (or whatever you want to > call it - imho raid "levels" are poorly named as there really is just > striping and mirroring and adding RS parity and everything else is just > combinations). Instead I'd probably set up a RAID1/5/10/whatever with > single redundancy for faster storage and recovery, and an offline backup > (compressed and with incrementals/etc). The backup gets you more > security and you only need it in a very unlikely double-failure. I'd > only invest in multiple redundancy in the event that the risk-weighted > cost of having the node go down exceeds the cost of the extra drives. > Frankly in that case RAID still isn't the right solution - you need a > backup node someplace else entirely as hard drives aren't the only thing > that can break in your server. So we're talking six drives, two of data and four "spares" to play with. Often that's setup as raid10, either two-way-striped and 3-way-mirrored, or 3-way-striped and 2-way-mirrored, depending on whether the loss-of-two tolerance of 3-way-mirroring or thruput of three-way-striping, is considered of higher value. You're right that at that level, you DO need a real backup, and it should take priority over raid-whatever. HOWEVER, in addition to creating a SINGLE raid across all those drives, it's possible to partition them up, and create multiple raids out of the partitions, with one set being a backup of the other. And since you've already stated that there's only two drives worth of data, there's certainly room enough amongst the six drives total to do just that. 
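For the record, both of those six-drive raid10 shapes are just a --layout choice in mdadm; a rough sketch (device names and partition numbers purely illustrative):

  # two-way-striped, three-way-mirrored: ~2 drives of capacity, survives any two failures
  mdadm --create /dev/md0 --level=10 --layout=n3 --raid-devices=6 /dev/sd[abcdef]2

  # three-way-striped, two-way-mirrored: ~3 drives of capacity, guaranteed to survive only one
  mdadm --create /dev/md0 --level=10 --layout=n2 --raid-devices=6 /dev/sd[abcdef]2

The partition-it-up-and-keep-a-backup-set approach mentioned above is just more such arrays, built from partitions instead of whole drives.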
This is in fact how I ran my raids, both my raid6 config and my raid1 config, for a number of years, and is in fact how I have my (raid1-mode) btrfs filesystems set up now on the SSDs.

Effectively I had/have each drive partitioned up into two sets of partitions, my "working" set and my "backup" set. Then I md-raided each partition, at my chosen level, across all devices. So on each physical device partition 5 might be the working rootfs partition, partition 6 the working home partition... partition 9 the backup rootfs partition, and partition 10 the backup home partition. They might end up being md3 (rootwork), md4 (homework), md7 (rootbak) and md8 (homebak). That way, you're protected against physical device death by the redundancy of the raids, and from fat-fingering or an update gone wrong by the redundancy of the backup partitions across the same physical devices.

What's nice about an arrangement such as this is that it gives you quite a bit more flexibility than you'd have with a single raid, since it's now possible to decide "Hmm, I don't think I actually need a backup of /var/log, so I think I'll only run with one log partition/raid, instead of the usual working/backup arrangement." Similarly, "You know, I ultimately don't need backups of the gentoo tree and overlays, or of the kernel git tree, at all, since as Linus says, 'Real men upload it to the net and let others be their backup', and I can always redownload that from the net, so I think I'll raid0 this partition and not keep any copies at all, since re-downloading's less trouble than dealing with the backups anyway."

Finally, and possibly critically, it's possible to say, "You know, what happens if I've just wiped rootbak in order to make a new root backup, and I have a crash and working-root refuses to boot? I think I need a rootbak2, and with the space I saved by doing only one log partition and by making the sources trees raid0, I have room for it now, without using any more space than I would have had I put everything on the same raid."

Another nice thing about it, and this is what I would have ended up doing if I hadn't conveniently found that 1 TB USB drive at such a good price, is that while the whole thing is partitioned up and in use, it's very possible to wipe out the backup partitions temporarily, recreate them as a different raid level or a different filesystem, or otherwise reorganize that area, then reboot into the new version, and do the same to what was the working copies. (For the area that was raid0, well, it was raid0 because it's easy to recreate, so just blow it away and recreate it on the new layout. And for the single-raid log without a backup copy, it's simple enough to point the log elsewhere, or keep it on rootfs, for long enough to redo that set of partitions across all physical devices.)

Again, this isn't just theory, it really works, as I've done it to various degrees at various times, even if I found copying to the external 1 TB USB drive and booting from it more convenient when I transferred from raid6 to raid1. And since I do run ~arch, there have been a number of times I've needed to boot to rootbak instead of the working root, including once when a ~arch portage was hosing symlinks just as a glibc update came along, thus breaking glibc (!!), once when a bash update broke, and another time when a glibc update mostly worked but I needed to downgrade and the protection built into the glibc ebuild wasn't letting me do it from my working root.
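A minimal sketch of the mdadm side of such a working/backup split (drive letters and the four-device count are only illustrative; the partition and md numbers follow the example above):

  # working set: partition 5 = root, partition 6 = home, raided across all four drives
  mdadm --create /dev/md3 --level=1 --raid-devices=4 /dev/sd[abcd]5
  mdadm --create /dev/md4 --level=1 --raid-devices=4 /dev/sd[abcd]6

  # backup set: same levels, different partitions, same physical drives
  mdadm --create /dev/md7 --level=1 --raid-devices=4 /dev/sd[abcd]9
  mdadm --create /dev/md8 --level=1 --raid-devices=4 /dev/sd[abcd]10

Each array then gets its own filesystem, and the backup set is refreshed by mounting it and copying the working set over whenever a checkpoint is wanted.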
What's nice about this setup in regard to booting to rootbak instead of the usual working root, is that unlike booting to a liveCD/DVD rescue disk, you have the full working system installed, configured and running just as it was when the backup was made. That makes it much easier to pickup and run from where you left off, with all the tools you're used to having and modes of working you're used to using, instead of being limited to some artificial rescue environment often with limited tools, and in any case setup and configured differently than you have your own system, because rootbak IS your own system, just from a few days/weeks/ months ago, whenever it was that you last did the backup. Anyway, with the parameters you specified, two drives full of data and four spare drives (presumably of a similar size), there's a LOT of flexibility. There's raid10 across four drives (two-mirror, two-stripe) with the other two as backup (this would probably be my choice given the 2-disks of data, 6 disk total, constraints, but see below, and it appears this might be your choice as well), or raid6 across four drives (two mirror, two parity) with two as backups (not a choice I'd likely make, but a choice), or a working pair of drives plus two sets of backups (not a choice I'd likely make), or raid10 across all six drives in either 3- mirror/2-stripe or 3-stripe/2-mirror mode (I'd probably elect for this with 3-stripe/2-mirror for the 3X speed and space, and prioritize a separate backup, see the discussion below), or two independent 3-disk raid5s (IMO there's better options for most cases, with the possible exception of primarily slow media usage, just which options are better depends on usage and priorities tho), or some hybrid combination of these. > This sort of rationale is why I don't like arguments like "RAM is cheap" > or "HDs are cheap" or whatever. The fact is that wasting money on any > component means investing less in some other component that could give > you more space/performance/whatever-makes-you-happy. If you have $1000 > that you can afford to blow on extra drives then you have $1000 you > could blow on RAM, CPU, an extra server, or a trip to Disney. Why not > blow it on something useful? [ This gets philosophical. OK to quit here if uninterested. ] You're right. "RAM and HDs are cheap"... relative to WHAT, the big- screen TV/monitor I WOULD have been replacing my much smaller monitor with, if I hadn't been spending the money on the "cheap" RAM and HDs? Of course, "time is cheap" comes with the same caveats, and can actually end up being far more dear. Stress and hassle of administration similarly. And sometimes, just a bit of investment in another "expensive" HD, saves you quite a bit of "cheap" time and stress, that's actually more expensive. "It's all relative"... to one's individual priorities. Because one thing's for sure, both money and time are fungible, and if they aren't spent on one thing, they WILL be on another (even if that "spent" is savings, for money), and ultimately, it's one's individual priorities that should rank where that spending goes. And I can't set your priorities and you can't set mine, so... But from my observation, a LOT of folks don't realize that and/or don't take the time necessary to reevaluate their own priorities from time to time, so end up spending out of line with their real priorities, and end up rather unhappy people as a result! 
That's one reason why I have a personal policy to deliberately reevaluate personal priorities from time to time (as well as being aware of them constantly), and rearrange spending, money time and otherwise, in accordance with those reranked priorities. I'm absolutely positive I'm a happier man for doing so! =:^) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-22 10:29 ` Duncan @ 2013-06-22 11:12 ` Rich Freeman 2013-06-22 15:45 ` Duncan 0 siblings, 1 reply; 46+ messages in thread From: Rich Freeman @ 2013-06-22 11:12 UTC (permalink / raw To: gentoo-amd64 On Sat, Jun 22, 2013 at 6:29 AM, Duncan <1i5t5.duncan@cox.net> wrote: > Rich Freeman posted on Fri, 21 Jun 2013 11:13:51 -0400 as excerpted: >> If you protect 1 drive of data with 25 drives of parity (call them >> mirrors or parity or whatever - they're functionally equivalent) then >> you need 25/26 drives to fail to lose 1 drive of data. > > Almost correct. DOH - good catch. Would need 26 fails. > AFAIK 13 drives of data with 13 mirrors wouldn't (normally) be called > raid1 (unless it's 13 individual raid1s)... That's why I commented that I find RAID "levels" extremely unhelpful. There is striping, mirroring, and RS parity, and every possible combination of the above. We have a special name raid5 for striping with one RS parity drive. We have another special name raid6 for striping with two RS parity drives. We don't have a special name for striping with 37 RS parity drives. Yet, all three of these are the same thing. I was referring to 13 data drives with one mirror each . If you lose two drives you could potential lose one drive of data. If you made that one big raid10 then if you lose two drives you could lose 13 drives of data. Both scenarios involve bad luck in terms of what pair goes. > You're right that at that level, you DO need a real backup, and it should > take priority over raid-whatever. HOWEVER, in addition to creating a > SINGLE raid across all those drives, it's possible to partition them up, > and create multiple raids out of the partitions, with one set being a > backup of the other. I wouldn't consider that a great strategy. Sure, it is convenient, but it does you no good at all if your computer burns up in a fire. Multiple-level redundancy just seems to be past the point of diminishing returns to me. If I wanted to spend that kind of money I'd probably spend it differently. However, I do agree that mdadm should support more flexible arrays. For example, my boot partition is raid1 (since grub doesn't support anything else), and I have it set up across all 5 of my drives. However, the reality is that only two get used and the others are treated only as spares. So, that is just a waste of space, and it is actually more annoying from a config perspective because it would be really nice if my system could boot from an arbitrary drive. Oh, as far as raid on partitions goes - I do use this for a different purpose. If you have a collection of drives of different sizes it can reduce space waste. Suppose you have 3 500GB drives and 2 1TB drives. If you put them all directly in a raid5 you get 2TB of space. If you chop the 1TB drives into 2 500GB partitions then you can get two raid5s - one 2TB in space, and the other 500GB in space. That is 500GB more data for the same space. Oh, and I realize I wrote raid5. With mdadm you can set up a 2-drive raid5. It is functionally equivalent to a raid1 I think, and I believe you can convert between them, but since I generally intend to expand arrays I prefer to just set them up as raid5 from the start. Since I stick lvm on top I don't care if the space is chopped up. Rich ^ permalink raw reply [flat|nested] 46+ messages in thread
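As a concrete sketch of that mixed-size example (all device names hypothetical): with sda, sdb, sdc as the 500GB drives and sdd, sde as the 1TB drives, each of the latter split into two ~500GB partitions, the two arrays described above would look roughly like:

  # five ~500GB pieces in one raid5: 4 x 500GB = ~2TB usable
  mdadm --create /dev/md10 --level=5 --raid-devices=5 /dev/sd[abc]1 /dev/sd[de]1

  # the leftover halves of the two 1TB drives: a 2-device raid5, ~500GB usable
  mdadm --create /dev/md11 --level=5 --raid-devices=2 /dev/sd[de]2

That compares with about 2TB total if all five whole drives went into a single raid5, which is where the extra 500GB comes from.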
* [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-22 11:12 ` Rich Freeman @ 2013-06-22 15:45 ` Duncan 0 siblings, 0 replies; 46+ messages in thread From: Duncan @ 2013-06-22 15:45 UTC (permalink / raw To: gentoo-amd64 Rich Freeman posted on Sat, 22 Jun 2013 07:12:25 -0400 as excerpted: > Multiple-level redundancy just seems to be past the point of diminishing > returns to me. If I wanted to spend that kind of money I'd probably > spend it differently. My point was that for me, it wasn't multiple level redundancy. It was simply device redundancy (raid), and fat-finger redundancy (backups), on the same set of drives so I was protected from either scenario. The fire/flood scenario would certainly get me if I didn't have offsite backups, but just as you call multiple redundancy past your point of diminishing returns, I call the fire/flood scenario past mine. If that happens, I figure I'll have far more important things to worry about than rebuilding my computer for awhile. And chances are, when I do get around to it, things will be progressed enough that much of the data won't be worth so much any more anyway. Besides, the real /important/ data is in my head. What's worth rebuilding, will be nearly as easy to rebuild due to what's in my head, as it would be to go thru what's now historical data and try to pick up the pieces, sorting thru what's still worth keeping around and what's not. Tho as I said, I do/did keep an additional level of backup on that 1 TB drive, but it's on-site too, and while not in the computer, it's generally nearby enough that it'd be lost too in case of flood/fire. It's more a convenience than a real backup, and I don't really keep it upto date, but if it survived and what's in the computer itself didn't, I do have old copies of much of my data, simply because it's still there from the last time I used that drive as convenient temporary storage while I switched things around. > However, I do agree that mdadm should support more flexible arrays. For > example, my boot partition is raid1 (since grub doesn't support anything > else), and I have it set up across all 5 of my drives. However, the > reality is that only two get used and the others are treated only as > spares. So, that is just a waste of space, and it is actually more > annoying from a config perspective because it would be really nice if my > system could boot from an arbitrary drive. Three points on that. First, obviously you're not on grub2 yet. It handles all sorts of raid, lvm, newer filesystems like btrfs (and zfs for those so inclined), various filesystems, etc, natively, thru its modules. Second, /boot is an interesting case. Here, originally (with grub1 and the raid6s across 4 drives) I setup a 4-drive raid1. But, I actually installed grub to the boot sector of all four drives, and tested each one booting just to grub by itself (the other drives off), so I knew it was using its own grub, not pointed somewhere else. But I was still worried about it as while I could boot from any of the drives, they were a single raid1, which meant no fat-finger redundancy, and doing a usable backup of /boot isn't so easy. So I think it was when I switched from raid6 to raid1 for almost the entire system, that I switched to dual dual-drive raid1s for /boot as well, and of course tested booting to each one alone again, just to be sure. 
That gave me fat-finger redundancy, as well as added convenience since I run git kernels: I was able to update just the one dual-drive raid1 /boot with the git kernels, then update the backup with the releases once they came out, which made for a nice division of stable kernel vs pre-release there.

That dual dual-drive raid1 setup proved very helpful when I upgraded to grub2 as well, since I was able to play around with it on the one dual-drive raid1 /boot while the other one stayed safely bootable grub1, until I had grub2 working the way I wanted on the working /boot, and had again installed and tested it on both component hard drives, booting to grub and to the full raid1 system from each drive by itself, with the others entirely shut off. Only when I had both drives of the working /boot up and running grub2 did I mount the backup /boot as well and copy over the now-working config to it, before running grub2-install on those two drives.

Of course somewhere along the way, IIRC at the same time as the raid6 to raid1 conversion, I had also upgraded to gpt partitions from traditional mbr. When I did, I had the foresight to create BOTH dedicated BIOS boot partitions and EFI partitions on each of the four drives. grub1 wasn't using them, but that was fine; they were small (tiny). That made the upgrade to grub2 even easier, since grub2 could install its core into the dedicated BIOS partitions. The EFI partitions remain unused to this day, but as I said, they're tiny, and with gpt they're specifically typed and labeled so they can't mix me up, either.

(BTW, talking about data integrity, if you're not on GPT yet, do consider it. It keeps a second partition table at the end of the drive as well as the one at the beginning, and unlike mbr they're checksummed, so corruption is detected. It also kills the primary/extended/logical distinction so no more worrying about that, and allows partition labels, much like filesystem labels, which makes tracking and managing what's what **FAR** easier. I GPT-partition everything now, including my USB thumbdrives if I partition them at all!)

When that machine slowly died and I transferred to a new half-TB drive, thinking it was the aging 300-gigs (it wasn't, caps were dying on the by-then 8-year-old mobo), and then transferred that into my new machine without raid, I did the usual working/backup partition arrangement, but got frustrated without the ability to have a backup /boot, because with just one device, the boot sector could point just one place: at the core grub2 in the dedicated BIOS boot partition, which in turn pointed at the usual /boot. Now grub2's better in this regard than grub1, since that core grub2 has an emergency mode that would give me limited ability to load a backup /boot, but that's an entirely manual process with a comparatively limited grub2 emergency shell without additional modules available, and I didn't actually take advantage of it to configure a backup /boot that it could reach.

But when I switched to the SSDs, I again had multiple devices, the pair of SSDs, which I set up with individual /boots, plus the original one still on the spinning rust. Again I installed grub2 to each one, pointed at its own separately configured /boot, so now I actually have three separately configured and bootable /boots, one on each of the SSDs and a third on the spinning rust half-TB. (FWIW the four old 300-gigs are sitting on the shelf. I need to badblocks or dd them to wipe, and I have a friend that'll buy them off me.)

Third point.
/boot partition raid1 across all five drives and three are wasted? How? I believe if you check, all five will have a mirror of the data (not just two unless it's btrfs raid1 not mdadm raid1, but btrfs is / entirely/ different in that regard). Either they're all wasted but one, or none are wasted, depending on how you look at it. Meanwhile, do look into installing grub on each drive, so you can boot from any of them. I definitely know it's possible as that's what I've been doing, tested, for quite some time. > Oh, as far as raid on partitions goes - I do use this for a different > purpose. If you have a collection of drives of different sizes it can > reduce space waste. Suppose you have 3 500GB drives and 2 1TB drives. > If you put them all directly in a raid5 you get 2TB of space. If you > chop the 1TB drives into 2 500GB partitions then you can get two raid5s > - one 2TB in space, and the other 500GB in space. That is 500GB more > data for the same space. Oh, and I realize I wrote raid5. With mdadm > you can set up a 2-drive raid5. It is functionally equivalent to a > raid1 I think, You better check. Unless I'm misinformed, which I could be as I've not looked at this in awhile and both mdadm and the kernel have changed quite a bit since then, that'll be setup as a degraded raid5, which means if you lose one... But I do know raid10 can be setup like that, on fewer drives than it'd normally take, with the mirrors in "far" mode I believe, and it just arranges the stripes as it needs to. It's quite possible that they fixed it so raid5 works similarly and can do the same thing now, in which case that degraded thing I knew about is obsolete. But unless you know for sure, please do check. > and I believe you can convert between them, but since I generally intend > to expand arrays I prefer to just set them up as raid5 from the start. > Since I stick lvm on top I don't care if the space is chopped up. There's a lot of raid conversion ability in modern mdadm. I think most levels can be converted between, given sufficient devices. Again, a lot has changed in that regard since I set my originals up, I'd guess somewhere around 2008. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 46+ messages in thread
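That per-drive grub install is a short loop; a sketch, assuming the grub2-install naming used earlier in the thread and that each GPT drive already carries its small BIOS boot partition:

  for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
      grub2-install "$d"
  done

After that each drive has its own boot sector and core image, so any surviving raid1 member can bring the machine up by itself.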
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 10:28 ` Rich Freeman 2013-06-21 14:23 ` Bob Sanders 2013-06-21 14:27 ` Duncan @ 2013-06-22 23:04 ` Mark Knecht 2013-06-22 23:17 ` Matthew Marlowe ` (2 more replies) 2 siblings, 3 replies; 46+ messages in thread From: Mark Knecht @ 2013-06-22 23:04 UTC (permalink / raw To: Gentoo AMD64

On Fri, Jun 21, 2013 at 3:28 AM, Rich Freeman <rich0@gentoo.org> wrote:
> On Fri, Jun 21, 2013 at 3:31 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>> So with 4k block sizes on a 5-device raid6, you'd have 20k stripes, 12k
>> in data across three devices, and 8k of parity across the other two
>> devices.
>
> With mdadm on a 5-device raid6 with 512K chunks you have 1.5M in a
> stripe, not 20k. If you modify one block it needs to read all 1.5M,
> or it needs to read at least the old chunk on the single drive to be
> modified and both old parity chunks (which on such a small array is 3
> disks either way).
>

Hi Rich,
I've been rereading everyone's posts as well as trying to collect my own thoughts. One question I have at this point, being that you and I seem to be the two non-RAID1 users (but not necessarily devotees) at this time, is what chunk size, stride & stripe width you are using. Are you currently using 512K chunks on your RAID5? If so that's potentially quite different than my 16K chunk RAID6. The more I read through this thread and other things on the web, the more I am concerned that 16K chunks may have forced far more IO operations than really make sense for performance. Unfortunately there's no easy way for me to really test this right now as the RAID6 uses the whole drive. However, for every 512K I want to get off the drive you might need 1 chunk whereas I'm going to need what, 32 chunks? That's got to be a lot more IO operations on my machine, isn't it?

For clarity, I'm at a 16K chunk, with a stride of 4 blocks (16K) and a stripe width of 12 blocks (48K):

c2RAID6 ~ # tune2fs -l /dev/md3 | grep RAID
Filesystem volume name:   RAID6root
RAID stride:              4
RAID stripe width:        12
c2RAID6 ~ #

c2RAID6 ~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md3 : active raid6 sdb3[9] sdf3[5] sde3[6] sdd3[7] sdc3[8]
      1452264480 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

unused devices: <none>
c2RAID6 ~ #

As I understand one of your earlier responses, I think you are using 4K sector drives, which again has that extra level of complexity in terms of creating the partitions initially, but after that should be fairly straightforward to use. (I think.) That said, there are trade-offs between RAID5 & RAID6, but have you measured speeds using anything like the dd method I posted yesterday, or any other way that we could compare?

As I think Duncan asked about storage usage requirements in another part of this thread, I'll just document it here. The machine serves 3 main purposes for me:

1) It's my day in, day out desktop. I run almost totally Gentoo 64-bit stable unless I need to keyword a package to get what I need. Over time I tend to let my keyworded packages go stable if they are working for me. The overall storage requirements for this, including my home directory, typically don't run over 50GB.

2) The machine runs 3 Windows VMs every day - 2 Win 7 & 1 Win XP. Total storage for the basic VMs is about 150GB. XP is just for things like NetFlix. These 3 VMs typically have 9 cores allocated to them (6+2+1), leaving 3 for Gentoo to run the hardware, etc.
The 6 core VM is often using 80-100% of its CPUs sustained for long stretches (hours to days). It's doing a lot of stock market math...

3) More recently, and really the reason to consolidate into a single RAID of any type, I have about 900GB of mp4s which has been on an external USB drive, and backed up to a second USB drive. However this is mostly storage. We watch most of this video on the TV using the second copy drive hooked directly to the TV or copied onto Kindles. I've been having to keep multiple backups of this outside the machine (poor man's RAID1 - two separate USB drives hooked up one at a time!) ;-) I'd rather just keep it safe on the RAID6. That said, I've not yet put it on the RAID6 as I have these performance issues I'd like to solve first. (If possible. Duncan is making me worry that they cannot be solved...)

Lastly, even if I completely buy into Duncan's well-formed reasons about why RAID1 might be faster, using 500GB drives I see no single RAID solution for me other than RAID5/6. The real RAID1/RAID6 comparison from a storage standpoint would be a (conceptual) 3-drive RAID6 vs a 3-drive RAID1. Both create 500GB of storage and can (conceptually) lose 2 drives and still recover data. However, adding another drive to the RAID1 gains you more speed but no storage (buying into Duncan's points), vs adding storage to the RAID6 and probably reducing speed. As I need storage, what other choices do I have?

Answering myself: take the 5 drives, create two RAIDs - a 500GB 2-drive RAID1 for the system + VMs, and then a 3-drive RAID5 for video data maybe? I don't know...

Or buy more hardware and do a 2-drive SSD RAID1 for the system, or a hardware RAID controller, etc. The options explode if I start buying more hardware.

Also, THANKS TO EVERYONE for the continued conversation.

Cheers,
Mark ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-22 23:04 ` Mark Knecht @ 2013-06-22 23:17 ` Matthew Marlowe 2013-06-23 11:43 ` Rich Freeman 2013-06-28 0:51 ` Duncan 2 siblings, 0 replies; 46+ messages in thread From: Matthew Marlowe @ 2013-06-22 23:17 UTC (permalink / raw To: gentoo-amd64 [-- Attachment #1: Type: text/plain, Size: 5732 bytes --] I would recommend that anyone concerned about mdadm software raid performance on gentoo test via tools like bonnie++ before putting any data on the drives and separate from data into different sets/volumes. I did testing two years ago watching read, write burst and sustained rates, file ops per second, etc.... Ended up getting 7 2tb enterprise data drives Disk 1 is os, no raid Disk 2-5 are data, raid 10 Disk 6-7 are backups and to test/scratch space, raid 0 On Jun 22, 2013 4:04 PM, "Mark Knecht" <markknecht@gmail.com> wrote: > On Fri, Jun 21, 2013 at 3:28 AM, Rich Freeman <rich0@gentoo.org> wrote: > > On Fri, Jun 21, 2013 at 3:31 AM, Duncan <1i5t5.duncan@cox.net> wrote: > >> So with 4k block sizes on a 5-device raid6, you'd have 20k stripes, 12k > >> in data across three devices, and 8k of parity across the other two > >> devices. > > > > With mdadm on a 5-device raid6 with 512K chunks you have 1.5M in a > > stripe, not 20k. If you modify one block it needs to read all 1.5M, > > or it needs to read at least the old chunk on the single drive to be > > modified and both old parity chunks (which on such a small array is 3 > > disks either way). > > > > Hi Rich, > I've been rereading everyone's posts as well as trying to collect > my own thoughts. One question I have at this point, being that you and > I seem to be the two non-RAID1 users (but not necessarily devotees) at > this time, is what chunk size, stride & stripe width with you are > using? Are you currently using 512K chunks on your RAID5? If so that's > potentially quite different than my 16K chunk RAID6. The more I read > through this thread and other things on the web the more I am > concerned that 16K chunks has possibly forced far more IO operations > that really makes sense for performance. Unfortunately there's no easy > way to me to really test this right now as the RAID6 uses the whole > drive. However for every 512K I want to get off the drive you might > need 1 chuck whereas I'm going to need what, 32 chunks? That's got to > be a lot more IO operations on my machine isn't it? > > For clarity, I'm a 16K chunk, stride of 4K, stripe of 12K: > > c2RAID6 ~ # tune2fs -l /dev/md3 | grep RAID > Filesystem volume name: RAID6root > RAID stride: 4 > RAID stripe width: 12 > c2RAID6 ~ # > > c2RAID6 ~ # cat /proc/mdstat > Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] > md3 : active raid6 sdb3[9] sdf3[5] sde3[6] sdd3[7] sdc3[8] > 1452264480 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] > [UUUUU] > > unused devices: <none> > c2RAID6 ~ # > > As I understand one of your earlier responses I think you are using > 4K sector drives, which again has that extra level of complexity in > terms of creating the partitions initially, but after that should be > fairly straight forward to use. (I think) That said there are > trade-offs between RAID5 & RAID6 but have you measured speeds using > anything like the dd method I posted yesterday, or any other way that > we could compare? > > As I think Duncan asked about storage usage requirements in another > part of this thread I'll just document it here. 
The machine serves > main 3 purposes for me: > > 1) It's my day in, day out desktop. I run almostly totally Gentoo > 64-bit stable unless I need to keyword a package to get what I need. > Over time I tend to let my keyworded packages go stable if they are > working for me. The overall storage requirements for this, including > my home directory, typically don't run over 50GB. > > 2) The machine runs 3 Windows VMs every day - 2 Win 7 & 1 Win XP. > Total storage for the basic VMs is about 150GB. XP is just for things > like NetFlix. These 3 VMs typically have allocated 9 cores allocated > to them (6+2+1) leaving 3 for Gentoo to run the hardware, etc. The 6 > core VM is often using 80-100% of its CPUs sustained for times. (hours > to days.) It's doing a lot of stock market math... > > 3) More recently, and really the reason to consolidate into a single > RAID of any type, I have about 900GB of mp4s which has been on an > external USB drive, and backed up to a second USB drive. However this > is mostly storage. We watch most of this video on the TV using the > second copy drive hooked directly to the TV or copied onto Kindles. > I've been having to keep multiple backups of this outside the machine > (poor man's RAID1 - two separate USB drives hooked up one at a time!) > ;-) I'd rather just keep it safe on the RAID 6, That said, I've not > yet put it on the RAID6 as I have these performance issues I'd like to > solve first. (If possible. Duncan is making me worry that they cannot > be solved...) > > Lastly, even if I completely buy into Duncan's well formed reasons > about why RAID1 might be faster, using 500GB drives I see no single > RAID solution for me other than RAID5/6. The real RAID1/RAID6 > comparison from a storage standpoint would be a (conceptual) 3-drive > RAID6 vs 3 drive RAID1. Both create 500GB of storage and can > (conceptually) lose 2 drives and still recover data. However adding > another drive to the RAID1 gains you more speed but no storage (buying > into Duncan's points) vs adding storage to the RAID6 and probably > reducing speed. As I need storage what other choices do I have? > > Answering myself, take the 5 drives, create two RAIDS - a 500GB > 2-drive RAID1 for the system + VMs, and then a 3-drive RAID5 for video > data maybe? I don't know... > > Or buy more hardware and do a 2 drive SSD RAID1 for the system, or > a hardware RAID controller, etc. The options explode if I start buying > more hardware. > > Also, THANKS TO EVERYONE for the continued conversation. > > Cheers, > Mark > > [-- Attachment #2: Type: text/html, Size: 6574 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-22 23:04 ` Mark Knecht 2013-06-22 23:17 ` Matthew Marlowe @ 2013-06-23 11:43 ` Rich Freeman 2013-06-23 15:23 ` Mark Knecht 2013-06-28 0:51 ` Duncan 2 siblings, 1 reply; 46+ messages in thread From: Rich Freeman @ 2013-06-23 11:43 UTC (permalink / raw To: gentoo-amd64 On Sat, Jun 22, 2013 at 7:04 PM, Mark Knecht <markknecht@gmail.com> wrote: > I've been rereading everyone's posts as well as trying to collect > my own thoughts. One question I have at this point, being that you and > I seem to be the two non-RAID1 users (but not necessarily devotees) at > this time, is what chunk size, stride & stripe width with you are > using? I'm using 512K chunks on the two RAID5s which are my LVM PVs: md7 : active raid5 sdc3[0] sdd3[6] sde3[7] sda4[2] sdb4[5] 971765760 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU] bitmap: 1/2 pages [4KB], 65536KB chunk md6 : active raid5 sda3[0] sdd2[4] sdb3[3] sde2[5] 2197687296 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU] bitmap: 2/6 pages [8KB], 65536KB chunk On top of this I have a few LVs with ext4 filesystems: tune2fs -l /dev/vg1/root | grep RAID RAID stride: 128 RAID stripe width: 384 (this is root, bin, sbin, lib) tune2fs -l /dev/vg1/data | grep RAID RAID stride: 19204 (this is just about everything else) tune2fs -l /dev/vg1/video | grep RAID RAID stride: 11047 (this is mythtv video) Those were all the defaults picked, and with the exception of root I believe the array was quite different when the others were created. I'm pretty confident that none of these are optimizes, and I'd be shocked if any of them are aligned unless this is automated (including across pvmoves, reshaping, and such). That is part of why I'd like to move to btrfs - optimizing raid with mdadm+lvm+mkfs.ext4 involves a lot of micromanagement as far as I'm aware. Docs are very spotty at best, and it isn't at all clear that things get adjusted as needed when you actually take advantage of things like pvmove or reshaping arrays. I suspect that having btrfs on bare metal will be more likely to result in something that keeps itself in-tune. Rich ^ permalink raw reply [flat|nested] 46+ messages in thread
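The usual rule of thumb for ext4 on md raid (generic guidance, not something verified against these particular arrays) is stride = chunk size / filesystem block size, and stripe-width = stride x number of data disks. For a 512K-chunk, 5-device raid5 (4 data disks) with 4K blocks that works out to stride 128 and stripe-width 512, which could be set explicitly at mkfs time, for example:

  mkfs.ext4 -b 4096 -E stride=128,stripe-width=512 /dev/vg1/somelv

(the logical volume name is just a placeholder; tune2fs can adjust the same fields on an existing filesystem). Values like 19204 and 11047 don't fit that formula for any of the layouts mentioned, which is consistent with the suspicion above that they weren't deliberately tuned.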
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-23 11:43 ` Rich Freeman @ 2013-06-23 15:23 ` Mark Knecht 0 siblings, 0 replies; 46+ messages in thread From: Mark Knecht @ 2013-06-23 15:23 UTC (permalink / raw To: Gentoo AMD64 On Sun, Jun 23, 2013 at 4:43 AM, Rich Freeman <rich0@gentoo.org> wrote: > On Sat, Jun 22, 2013 at 7:04 PM, Mark Knecht <markknecht@gmail.com> wrote: >> I've been rereading everyone's posts as well as trying to collect >> my own thoughts. One question I have at this point, being that you and >> I seem to be the two non-RAID1 users (but not necessarily devotees) at >> this time, is what chunk size, stride & stripe width with you are >> using? > > I'm using 512K chunks on the two RAID5s which are my LVM PVs: > md7 : active raid5 sdc3[0] sdd3[6] sde3[7] sda4[2] sdb4[5] > 971765760 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU] > bitmap: 1/2 pages [4KB], 65536KB chunk > > md6 : active raid5 sda3[0] sdd2[4] sdb3[3] sde2[5] > 2197687296 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU] > bitmap: 2/6 pages [8KB], 65536KB chunk > > On top of this I have a few LVs with ext4 filesystems: > tune2fs -l /dev/vg1/root | grep RAID > RAID stride: 128 > RAID stripe width: 384 > (this is root, bin, sbin, lib) > > tune2fs -l /dev/vg1/data | grep RAID > RAID stride: 19204 > (this is just about everything else) > > tune2fs -l /dev/vg1/video | grep RAID > RAID stride: 11047 > (this is mythtv video) > > Those were all the defaults picked, and with the exception of root I > believe the array was quite different when the others were created. > I'm pretty confident that none of these are optimizes, and I'd be > shocked if any of them are aligned unless this is automated (including > across pvmoves, reshaping, and such). > > That is part of why I'd like to move to btrfs - optimizing raid with > mdadm+lvm+mkfs.ext4 involves a lot of micromanagement as far as I'm > aware. Docs are very spotty at best, and it isn't at all clear that > things get adjusted as needed when you actually take advantage of > things like pvmove or reshaping arrays. I suspect that having btrfs > on bare metal will be more likely to result in something that keeps > itself in-tune. > > Rich > Thanks Rich. I'm finding that helpful. I completely agree on the micromanagement comment. At one level or another that's sort of what this whole thread is about! On your root partition I sort of wonder about the stripe width. Assuming I did it right (5, 5, 512, 4) his little page calculates 128 for the stride and 512 stripe width. (4 data disks * 128 I think) Just a piece of info. http://busybox.net/~aldot/mkfs_stride.html Returning to the title of the thread, asking about partition location essentially, I woke up this morning and had sort of decided to just try changing the chunk size to something large like your 512K. 
It seems I'm out of luck, as my partition size is not (apparently) divisible by 512K:

c2RAID6 ~ # mdadm --grow /dev/md3 --chunk=512 --backup-file=/backups/ChunkSizeBackup
mdadm: component size 484088160K is not a multiple of chunksize 512K
c2RAID6 ~ # mdadm --grow /dev/md3 --chunk=256 --backup-file=/backups/ChunkSizeBackup
mdadm: component size 484088160K is not a multiple of chunksize 256K
c2RAID6 ~ # mdadm --grow /dev/md3 --chunk=128 --backup-file=/backups/ChunkSizeBackup
mdadm: component size 484088160K is not a multiple of chunksize 128K
c2RAID6 ~ # mdadm --grow /dev/md3 --chunk=64 --backup-file=/backups/ChunkSizeBackup
mdadm: component size 484088160K is not a multiple of chunksize 64K
c2RAID6 ~ #

c2RAID6 ~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md3 : active raid6 sdb3[9] sdf3[5] sde3[6] sdd3[7] sdc3[8]
      1452264480 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

unused devices: <none>

c2RAID6 ~ # fdisk -l /dev/sdb

Disk /dev/sdb: 500.1 GB, 500107862016 bytes, 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x8b45be24

   Device Boot      Start        End     Blocks  Id  System
/dev/sdb1   *          63     112454      56196  83  Linux
/dev/sdb2          112455    8514449   4200997+  82  Linux swap / Solaris
/dev/sdb3         8594775  976773167 484089196+  fd  Linux raid autodetect
c2RAID6 ~ #

I suspect I might be much better off if all the partition sizes were divisible by 2048 and started on a 2048-sector multiple, like the newer fdisk tools enforce. I am thinking I won't make much headway unless I completely rebuild the system from bare metal up. If I'm going to do that then I need to get a good copy of the whole RAID onto some other drive, which is a big scary job, and then start over with an install disk I guess. Not sure I'm up for that just yet on a Sunday morning...

Take care,
Mark ^ permalink raw reply [flat|nested] 46+ messages in thread
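A quick arithmetic check of which chunk sizes the existing component size could accept at all (using the 484088160K figure mdadm reports):

  for k in 512 256 128 64 32 16; do
      echo "${k}K -> remainder $(( 484088160 % k ))"
  done

Only 16K (the current chunk) and 32K come back with remainder 0, which matches the errors above. In principle the component size could first be shrunk to a 512K multiple with mdadm --grow --size=... (only after shrinking the filesystem, and with all the risk that implies), but that is exactly the kind of surgery that makes the rebuild-from-scratch option look more attractive.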
* [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-22 23:04 ` Mark Knecht 2013-06-22 23:17 ` Matthew Marlowe 2013-06-23 11:43 ` Rich Freeman @ 2013-06-28 0:51 ` Duncan 2013-06-28 3:18 ` Matthew Marlowe 2 siblings, 1 reply; 46+ messages in thread From: Duncan @ 2013-06-28 0:51 UTC (permalink / raw To: gentoo-amd64 Mark Knecht posted on Sat, 22 Jun 2013 16:04:06 -0700 as excerpted: > Lastly, even if I completely buy into Duncan's well formed reasons about > why RAID1 might be faster, using 500GB drives I see no single RAID > solution for me other than RAID5/6. The real RAID1/RAID6 comparison from > a storage standpoint would be a (conceptual) 3-drive RAID6 vs 3 drive > RAID1. Both create 500GB of storage and can (conceptually) lose 2 drives > and still recover data. However adding another drive to the RAID1 gains > you more speed but no storage (buying into Duncan's points) vs adding > storage to the RAID6 and probably reducing speed. As I need storage what > other choices do I have? > > Answering myself, take the 5 drives, create two RAIDS - a 500GB > 2-drive RAID1 for the system + VMs, and then a 3-drive RAID5 for video > data maybe? I don't know... > > Or buy more hardware and do a 2 drive SSD RAID1 for the system, or > a hardware RAID controller, etc. The options explode if I start buying > more hardware. Finally getting back to this on what's my "weekend"... Unfortunately, given 900 gigs media data and 150 gigs of VMs, with 5 500 gig drives to work with, you're right, simply making a raid1 out of everything isn't possible. You could do a 4-drive raid10, two-way striped and two-way mirrored, for a TB of storage for the media files and possibly squeeze the VMs between the SSD and the raid, with the 5th half-TB as a backup, but it'd be quite tight and non-optimal, plus losing the wrong two drives on the raid10 would put it out of commission so you'd have only one-drive-loss- tolerance there. You could buy a sixth half-TB and try either three-way-striping and two- way mirroring for the same one-drive-loss tolerance but a good 1.5 TB (3- way half-TB stripe) space, giving you plenty of space and thruput speed but at the cost of only single-drive-loss-tolerance. You could use the same six in a raid10 with the reverse configuration, two-way-stripe three-way-mirror, for better loss-of-two-tolerance but at only a TB of space and have the same squeeze as the 4-way raid10 (but now without the extra drive for backup), or... Personally, I'd probably be intensely motivated enough to try the 2-way- stripe 3-way-mirror 6-drive raid10, squeezing the media space as necessary to do it (maybe by using external drives for what wouldn't fit), but that's still a compromise... and includes buying that sixth drive. So the raid6 might well be the best alternative you have, given the data size AND physical device size constraints. But some time testing the performance of different configs and familiarizing yourself with the options and operation, as you've decided to do now, certainly won't hurt. I DID say I wasn't real strong on the chunk options, etc, myself, and you're using ext4, not the reiserfs I was using, and I believe ext4 has at least some potential performance upside compared to reiserfs, so it's quite possible that with some chunk/stride/ etc tweaking, you can get something better, performance-wise. Tho I expect raid6 will never be a speed demon, and may well never perform as you had originally expected/hoped. 
But better than the initial results should be possible, hopefully, and familiarizing yourself with things while experimenting has benefits of its own, so that's an idea I can agree with 100%. =:^) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-28 0:51 ` Duncan @ 2013-06-28 3:18 ` Matthew Marlowe 0 siblings, 0 replies; 46+ messages in thread From: Matthew Marlowe @ 2013-06-28 3:18 UTC (permalink / raw To: gentoo-amd64 I supported about 250 gentoo vm's using about 30 SAS 15K rpm 144GB drives awhile back. Drives were split into 14 disk RAID10 sets. Then each RAID10 set was split it into 200-500GB virtual drives, and the virtual machines were grouped into sets of 3-5 and matched with a virtual drive. Virtual machines on the same virtual drive were setup to use thin provisioning, so that only used up as much storage space as their data differed from the canonical gentoo os image which was usually less than 20%. The virtual drives were usually only 30-50% full and we could virtually provision 2TB+ of virtual machines on a single 500GB virtual drive. Don't underestimate what you can do with small drives, especially if they are fast and you have a lot of them.... On Thu, Jun 27, 2013 at 5:51 PM, Duncan <1i5t5.duncan@cox.net> wrote: > Mark Knecht posted on Sat, 22 Jun 2013 16:04:06 -0700 as excerpted: > >> Lastly, even if I completely buy into Duncan's well formed reasons about >> why RAID1 might be faster, using 500GB drives I see no single RAID >> solution for me other than RAID5/6. The real RAID1/RAID6 comparison from >> a storage standpoint would be a (conceptual) 3-drive RAID6 vs 3 drive >> RAID1. Both create 500GB of storage and can (conceptually) lose 2 drives >> and still recover data. However adding another drive to the RAID1 gains >> you more speed but no storage (buying into Duncan's points) vs adding >> storage to the RAID6 and probably reducing speed. As I need storage what >> other choices do I have? >> >> Answering myself, take the 5 drives, create two RAIDS - a 500GB >> 2-drive RAID1 for the system + VMs, and then a 3-drive RAID5 for video >> data maybe? I don't know... >> >> Or buy more hardware and do a 2 drive SSD RAID1 for the system, or >> a hardware RAID controller, etc. The options explode if I start buying >> more hardware. > > Finally getting back to this on what's my "weekend"... > > Unfortunately, given 900 gigs media data and 150 gigs of VMs, with 5 500 > gig drives to work with, you're right, simply making a raid1 out of > everything isn't possible. > > You could do a 4-drive raid10, two-way striped and two-way mirrored, for > a TB of storage for the media files and possibly squeeze the VMs between > the SSD and the raid, with the 5th half-TB as a backup, but it'd be quite > tight and non-optimal, plus losing the wrong two drives on the raid10 > would put it out of commission so you'd have only one-drive-loss- > tolerance there. > > You could buy a sixth half-TB and try either three-way-striping and two- > way mirroring for the same one-drive-loss tolerance but a good 1.5 TB (3- > way half-TB stripe) space, giving you plenty of space and thruput speed > but at the cost of only single-drive-loss-tolerance. > > You could use the same six in a raid10 with the reverse configuration, > two-way-stripe three-way-mirror, for better loss-of-two-tolerance but at > only a TB of space and have the same squeeze as the 4-way raid10 (but now > without the extra drive for backup), or... 
> > Personally, I'd probably be intensely motivated enough to try the 2-way- > stripe 3-way-mirror 6-drive raid10, squeezing the media space as > necessary to do it (maybe by using external drives for what wouldn't > fit), but that's still a compromise... and includes buying that sixth > drive. > > So the raid6 might well be the best alternative you have, given the data > size AND physical device size constraints. > > But some time testing the performance of different configs and > familiarizing yourself with the options and operation, as you've decided > to do now, certainly won't hurt. I DID say I wasn't real strong on the > chunk options, etc, myself, and you're using ext4, not the reiserfs I was > using, and I believe ext4 has at least some potential performance upside > compared to reiserfs, so it's quite possible that with some chunk/stride/ > etc tweaking, you can get something better, performance-wise. Tho I > expect raid6 will never be a speed demon, and may well never perform as > you had originally expected/hoped. But better than the initial results > should be possible, hopefully, and familiarizing yourself with things > while experimenting has benefits of its own, so that's an idea I can > agree with 100%. =:^) > > -- > Duncan - List replies preferred. No HTML msgs. > "Every nonfree program has a lord, a master -- > and if you use the program, he is your master." Richard Stallman > > ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 7:31 ` [gentoo-amd64] " Duncan 2013-06-21 10:28 ` Rich Freeman @ 2013-06-21 17:40 ` Mark Knecht 2013-06-21 17:56 ` Bob Sanders ` (2 more replies) 2013-06-30 1:04 ` Rich Freeman 2 siblings, 3 replies; 46+ messages in thread From: Mark Knecht @ 2013-06-21 17:40 UTC (permalink / raw To: Gentoo AMD64 On Fri, Jun 21, 2013 at 12:31 AM, Duncan <1i5t5.duncan@cox.net> wrote: > Mark Knecht posted on Thu, 20 Jun 2013 12:10:04 -0700 as excerpted: > >> Does anyone know of info on how the starting sector number might >> impact RAID performance under Gentoo? The drives are WD-500G RE3 drives >> shown here: >> >> http://www.amazon.com/Western-Digital-WD5002ABYS-3-5-inch-Enterprise/dp/ > B001EMZPD0/ref=cm_cr_pr_product_top >> >> These are NOT 4k sector sized drives. >> >> Specifically I'm a 5-drive RAID6 for about 1.45TB of storage. My >> benchmarking seems abysmal at around 40MB/S using dd copying large >> files. >> It's higher, around 80MB/S if the file being transferred is coming from >> an SSD, but even 80MB/S seems slow to me. I see a LOT of wait time in >> top. >> And my 'large file' copies might not be large enough as the machine has >> 24GB of DRAM and I've only been copying 21GB so it's possible some of >> that is cached. > > I /suspect/ that the problem isn't striping, tho that can be a factor, > but rather, your choice of raid6. Note that I personally ran md/raid-6 > here for awhile, so I know a bit of what I'm talking about. I didn't > realize the full implications of what I was setting up originally, or I'd > have not chosen raid6 in the first place, but live and learn as they say, > and that I did. > > General rule, raid6 is abysmal for writing and gets dramatically worse as > fragmentation sets in, tho reading is reasonable. The reason is that in > ordered to properly parity-check and write out less-than-full-stripe > writes, the system must effectively read-in the existing data and merge > it with the new data, then recalculate the parity, before writing the new > data AND 100% of the (two-way in raid-6) parity. Further, because raid > sits below the filesystem level, it knows nothing about what parts of the > filesystem are actually used, and must read and write the FULL data > stripe (perhaps minus the new data bit, I'm not sure), including parts > that will be empty on a freshly formatted filesystem. > > So with 4k block sizes on a 5-device raid6, you'd have 20k stripes, 12k > in data across three devices, and 8k of parity across the other two > devices. Now you go to write a 1k file, but in ordered to do so the full > 12k of existing data must be read in, even on an empty filesystem, > because the RAID doesn't know it's empty! Then the new data must be > merged in and new checksums created, then the full 20k must be written > back out, certainly the 8k of parity, but also likely the full 12k of > data even if most of it is simply rewrite, but almost certainly at least > the 4k strip on the device the new data is written to. > <SNIP> Hi Duncan, Wonderful post but much too long to carry on a conversation in-line. As you sound pretty sure of your understanding/history I'll assume you're right 100% of the time, but only maybe 80% of the post feels right to me at this time so let's assume I have much to learn and go from there. 
I expect that others here are in a similar situation to me - they use RAID but are laboring with little hard data on what different portions of the system are doing and how to optimize it. I certainly feel that's true in my case. I hope this thread, over the near or long term, might help a bit for me and potentially others.

In thinking about this issue this morning, I think it's important for me to get down to basics and verify as much as possible, step by step, so that I don't layer good work on top of bad assumptions. To that end, and before I move too much farther forward, let me document a few things about my system and the hardware available to work with, and see if you, Rich, Bob, Volker or anyone else wants to chime in about what is correct, not correct, or a better way to use it.

Basic Machine - ASUS Rampage II Extreme motherboard (4/1/2010) + 24GB DDR3 + Core i7-980x Extreme 12 core processor
1 SSD - 120GB SATA3 on its own controller
5+ HDD - WD5002ABYS RAID Edition 3 SATA3 drives using Intel integrated controllers
(NOTE: I can possibly go to a 6-drive RAID if I made some changes in the box but that's for later)

According to the WD spec (http://www.wdc.com/en/library/spec/2879-701281.pdf) the 500GB drives sustain 113MB/S to the drive. Using hdparm I measure 107MB/S or higher for all 5 drives:

c2RAID6 ~ # hdparm -tT /dev/sdb

/dev/sdb:
 Timing cached reads:   17374 MB in  2.00 seconds = 8696.12 MB/sec
 Timing buffered disk reads:  322 MB in  3.00 seconds = 107.20 MB/sec
c2RAID6 ~ #

The SSD on its own PCI Express controller clocks in at about 250MB/S for reads.

c2RAID6 ~ # hdparm -tT /dev/sda

/dev/sda:
 Timing cached reads:   17492 MB in  2.00 seconds = 8754.42 MB/sec
 Timing buffered disk reads:  760 MB in  3.00 seconds = 253.28 MB/sec
c2RAID6 ~ #

TESTING: I'm using dd to test. It gives an easy-to-read result and seems to be used a lot. I can use bonnie++ or IOzone later but I don't think that's necessary quite yet. Being that I have 24GB and don't want cached data to affect the test speeds, I do the following:

1) Using dd I created a 50GB file for copying using the following commands:

cd /mnt/fastVM
dd if=/dev/random of=random1 bs=1000 count=0 seek=$[1000*1000*50]

mark@c2RAID6 /VirtualMachines/bonnie $ ls -alh /mnt/fastVM/ran*
-rw-r--r-- 1 mark mark 47G Jun 21 07:10 /mnt/fastVM/random1
mark@c2RAID6 /VirtualMachines/bonnie $

2) To ensure that nothing is cached and the copies are (hopefully) completely fair, as root I do the following between each test:

sync
free -h
echo 3 > /proc/sys/vm/drop_caches
free -h

An example:

c2RAID6 ~ # sync
c2RAID6 ~ # free -h
             total       used       free     shared    buffers     cached
Mem:           23G        23G       129M         0B       8.5M        21G
-/+ buffers/cache:        1.6G        21G
Swap:          12G         0B        12G
c2RAID6 ~ # echo 3 > /proc/sys/vm/drop_caches
c2RAID6 ~ # free -h
             total       used       free     shared    buffers     cached
Mem:           23G       2.6G        20G         0B       884K       1.3G
-/+ buffers/cache:        1.3G        22G
Swap:          12G         0B        12G
c2RAID6 ~ #

3) As a first test I copy, using dd, the 50GB file from the SSD to the RAID6. As long as reading the SSD is much faster than writing the RAID6, it should be a test of primarily the RAID6 write speed:

mark@c2RAID6 /VirtualMachines/bonnie $ dd if=/mnt/fastVM/random1 of=SDDCopy
97656250+0 records in
97656250+0 records out
50000000000 bytes (50 GB) copied, 339.173 s, 147 MB/s
mark@c2RAID6 /VirtualMachines/bonnie $

If I clear cache as above and rerun the test it's always 145-155MB/S.

4) As a second test I read from the RAID6 and write back to the RAID6.
I see MUCH lower speeds, again repeatable: mark@c2RAID6 /VirtualMachines/bonnie $ dd if=SDDCopy of=HDDWrite 97656250+0 records in 97656250+0 records out 50000000000 bytes (50 GB) copied, 1187.07 s, 42.1 MB/s mark@c2RAID6 /VirtualMachines/bonnie $ 5) As a final test, and just looking for problems if any, I do an SDD to SDD copy which clocked in at close to 200MB/S mark@c2RAID6 /mnt/fastVM $ dd if=random1 of=SDDCopy 97656250+0 records in 97656250+0 records out 50000000000 bytes (50 GB) copied, 251.105 s, 199 MB/s mark@c2RAID6 /mnt/fastVM $ So, being that this RAID6 was grown yesterday from something that has existed for a year or two, I'm not sure of its fragmentation, or even how to determine that at this time. However it seems my problem is RAID6 reads, not RAID6 writes, at least to new and probably never-used disk space. I will also report more later but I can state that just using top there's never much CPU usage doing this but a LOT of WAIT time when reading the RAID6. It really appears the system is spinning its wheels waiting for the RAID to get data from the disk. One place where I wanted to double check your thinking. My thought is that a RAID1 will _NEVER_ outperform the hdparm -tT read speeds as it has to read from three drives and make sure they are all good before returning data to the user. I don't see how that could ever be faster than what a single drive file system could do, which for these drives would be the 113MB/S WD spec number, correct? As I'm currently getting 145MB/S it appears on the surface that the RAID6 is providing some value, at least in these early days of use. Maybe it will degrade over time though. Comments? Cheers, Mark ^ permalink raw reply [flat|nested] 46+ messages in thread
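To make the sync/drop-caches sequence repeatable between runs, the whole thing can be wrapped in a small script; a minimal sketch only, using the commands already shown above, where the destination directory on the RAID6 is an assumption to be adjusted:

#!/bin/bash
# Sketch: time one dd copy with the page cache dropped first (run as root).
SRC=/mnt/fastVM/random1          # the existing 50GB test file on the SSD
DST=/VirtualMachines/bonnie/ddtest   # assumed directory on the RAID6, adjust to taste
sync
echo 3 > /proc/sys/vm/drop_caches
dd if="$SRC" of="$DST" bs=1M conv=fsync   # conv=fsync makes the reported time include the final flush
rm -f "$DST"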
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 17:40 ` Mark Knecht @ 2013-06-21 17:56 ` Bob Sanders 2013-06-21 18:12 ` Mark Knecht 2013-06-21 17:57 ` Rich Freeman 2013-06-22 14:23 ` Duncan 2 siblings, 1 reply; 46+ messages in thread From: Bob Sanders @ 2013-06-21 17:56 UTC (permalink / raw To: gentoo-amd64 Mark Knecht, mused, then expounded: > On Fri, Jun 21, 2013 at 12:31 AM, Duncan <1i5t5.duncan@cox.net> wrote: > > Mark Knecht posted on Thu, 20 Jun 2013 12:10:04 -0700 as excerpted: > > > > Basic Machine - ASUS Rampage II Extreme motherboard (4/1/2010) + 24GB > DDR3 + Core i7-980x Extreme 12 core processor The hit to iop performance is mainly due to the large number of cores in the high-end Intel CPU. I suggest you find a nice 4-core Intel processor, something non-extreme. You'll find all your IO will improve. > 1 SDD - 120GB SATA3 on its own controller > 5+ HDD - WD5002ABYS RAID Edition 3 SATA3 drives using Intel integrated > controllers > > Again, if you're serious about RAID, get an LSI MegaRAID card. While I have my dislikes about the LSI controller, it's a lot faster than using MD and much faster (and more reliable) than any BIOS software RAID. Oh, and don't believe all the published numbers on drives, etc... benchmarking is an art. Bob -- - ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 17:56 ` Bob Sanders @ 2013-06-21 18:12 ` Mark Knecht 0 siblings, 0 replies; 46+ messages in thread From: Mark Knecht @ 2013-06-21 18:12 UTC (permalink / raw To: Gentoo AMD64 On Fri, Jun 21, 2013 at 10:56 AM, Bob Sanders <rsanders@sgi.com> wrote: > Mark Knecht, mused, then expounded: >> On Fri, Jun 21, 2013 at 12:31 AM, Duncan <1i5t5.duncan@cox.net> wrote: >> > Mark Knecht posted on Thu, 20 Jun 2013 12:10:04 -0700 as excerpted: >> > >> >> Basic Machine - ASUS Rampage II Extreme motherboard (4/1/2010) + 24GB >> DDR3 + Core i7-980x Extreme 12 core processor > > The hit to iop performance is mainly due to the large number of cores in > the the high end Intel cpu. I suggest you find a nice 4-core Intel > processor, something non-extreme. you'll find all your IO will improve. > Interesting point but not likely to happen. I run 3 Windows VMs all day, most of which are doing numerical calculations and not a huge amount of IO in the Windows environment itself. In my usage model the 12 cores get a workout nearly every day. > >> 1 SDD - 120GB SATA3 on it's own controller >> 5+ HDD - WD5002ABYS RAID Edition 3 SATA3 drives using Intel integrated >> controllers >> > > Again, if you're serious about RAID, get an LSI MegaRAID card. While I > have my dislikes about the LSI controller, it's a lot faster than using > MD and much faster (and more reliable) than any bios software RAID. > I suppose if I accept your assertion above then an LSI MegaRAID might be a better solution specifically because I _am_ using the 12 core Extreme processor. Will consider, at least in the long run, if this thread & work doesn't yield significantly improved results over the next few weeks. > Oh, and don't believe all the published numbers on drives, > etc...benchmarking is an art. > > Bob Absolutely! :-) Thanks, Mark ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 17:40 ` Mark Knecht 2013-06-21 17:56 ` Bob Sanders @ 2013-06-21 17:57 ` Rich Freeman 2013-06-21 18:10 ` Gary E. Miller 2013-06-21 18:38 ` Mark Knecht 2013-06-22 14:23 ` Duncan 2 siblings, 2 replies; 46+ messages in thread From: Rich Freeman @ 2013-06-21 17:57 UTC (permalink / raw To: gentoo-amd64 On Fri, Jun 21, 2013 at 1:40 PM, Mark Knecht <markknecht@gmail.com> wrote: > One place where I wanted to double check your thinking. My thought > is that a RAID1 will _NEVER_ outperform the hdparm -tT read speeds as > it has to read from three drives and make sure they are all good > before returning data to the user. That isn't correct. In theory it could be done that way, but every raid1 implementation I've heard of makes writes to all drives (obviously), but reads from only a single drive (assuming it is correct). That means that read latency is greatly reduced since they can be split across two drives which effectively means two heads per "platter." Also, raid1 typically does not include checksumming, so if there is a discrepancy between the drives there is no way to know which one is right. With raid5 at least you can always correct discrepancies if you have all the disks (though as Duncan pointed out in practice this only happens if you do an explicit scrub on mdadm). With btrfs every block is checksummed and so as long as there is one good (err, consistent) copy somewhere it will be used. Rich ^ permalink raw reply [flat|nested] 46+ messages in thread
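For anyone wanting to exercise that btrfs checksumming rather than take it on faith, the verification pass is just a scrub; a minimal sketch, assuming a btrfs filesystem mounted at /mnt/data (the mount point is an assumption):

btrfs scrub start /mnt/data    # walks all data and metadata and verifies checksums against the stored copies
btrfs scrub status /mnt/data   # reports progress plus any checksum or read errors found (and corrected, if a good copy existed)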
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 17:57 ` Rich Freeman @ 2013-06-21 18:10 ` Gary E. Miller 2013-06-21 18:38 ` Mark Knecht 1 sibling, 0 replies; 46+ messages in thread From: Gary E. Miller @ 2013-06-21 18:10 UTC (permalink / raw To: gentoo-amd64; +Cc: rich0 [-- Attachment #1: Type: text/plain, Size: 1722 bytes --] Yo Rich! On Fri, 21 Jun 2013 13:57:20 -0400 Rich Freeman <rich0@gentoo.org> wrote: > In theory it could be done that way, but every > raid1 implementation I've heard of makes writes to all drives > (obviously), but reads from only a single drive (assuming it is > correct). That means that read latency is greatly reduced since they > can be split across two drives which effectively means two heads per > "platter." Yes, that is what I see in practice. A much reduced average read time. And if you are really pressed for speed, add more stripes and get even more speed. > Also, raid1 typically does not include checksumming, so if > there is a discrepancy between the drives there is no way to know > which one is right. Uh, not exactly correct. Remember each HDD has ECC for each sector. If there is a read error the HDD will detect the bad ECC and report the error to the RAID1 hardware/software. Then RAID1 is smart enough to try to read from the 2nd drive. > With raid5 at least you can always correct > discrepancies if you have all the disks Not really. If 2 disks fail in an n+1 RAID5 you are out of luck. Not as uncommon an occurrence as one might think. > (though as Duncan pointed out > in practice this only happens if you do an explicit scrub on mdadm). Which you should be doing at least weekly. Otherwise you only find out your disks have failed when you try to do a full copy or backup, and then you likely have multiple failures and you are out of luck. RGDS GARY --------------------------------------------------------------------------- Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97701 gem@rellim.com Tel:+1(541)382-8588 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
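Since the drives' own ECC and sector-reallocation counters are what surface those read errors first, it is worth watching them directly as well; a minimal sketch using smartmontools, where the device name is an example only:

smartctl -A /dev/sdb | grep -Ei 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'
# non-zero raw values here usually mean the drive is already remapping sectors or failing to read some
smartctl -t long /dev/sdb    # kick off the drive's own full surface self-test in the background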
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 17:57 ` Rich Freeman 2013-06-21 18:10 ` Gary E. Miller @ 2013-06-21 18:38 ` Mark Knecht 2013-06-21 18:50 ` Gary E. Miller 2013-06-21 18:53 ` Bob Sanders 1 sibling, 2 replies; 46+ messages in thread From: Mark Knecht @ 2013-06-21 18:38 UTC (permalink / raw To: Gentoo AMD64 On Fri, Jun 21, 2013 at 10:57 AM, Rich Freeman <rich0@gentoo.org> wrote: > On Fri, Jun 21, 2013 at 1:40 PM, Mark Knecht <markknecht@gmail.com> wrote: >> One place where I wanted to double check your thinking. My thought >> is that a RAID1 will _NEVER_ outperform the hdparm -tT read speeds as >> it has to read from three drives and make sure they are all good >> before returning data to the user. > > That isn't correct. In theory it could be done that way, but every > raid1 implementation I've heard of makes writes to all drives > (obviously), but reads from only a single drive (assuming it is > correct). That means that read latency is greatly reduced since they > can be split across two drives which effectively means two heads per > "platter." Also, raid1 typically does not include checksumming, so if > there is a discrepancy between the drives there is no way to know > which one is right. With raid5 at least you can always correct > discrepancies if you have all the disks (though as Duncan pointed out > in practice this only happens if you do an explicit scrub on mdadm). > With btrfs every block is checksummed and so as long as there is one > good (err, consistent) copy somewhere it will be used. > > Rich > Humm... OK, we agree on RAID1 writes. All data must be written to all drives so there's no way to implement any real speed up in that area. If I simplistically assume that write speeds are similar to hdparm -tT read speeds then that's that. On the read side I'm not sure if I'm understanding your point. I agree that a so-designed RAID1 system could/might read smaller portions of a larger read from RAID1 drives in parallel, taking some data from one drive and some from another drive, and then only take action corrective if one of the drives had troubles. However I don't know that mdadm-based RAID1 does anything like that. Does it? It seems to me that unless I at least _request_ all data from all drives and minimally compare at least some error flag from the controller telling me one drive had trouble reading a sector then how do I know if anything bad is happening? Or maybe you're saying it's RAID1 and I don't know if anything bad is happening _unless_ I do a scrub and specifically check all the drives for consistency? Just trying to get clear what you're saying. I do mdadm scrubs at least once a week. I still do them by hand. They have never appeared terribly expensive watching top or iotop but sometimes when I'm watching NetFlix or Hulu in a VM I get more pauses when the scrub is taking place, but it's not huge. I agree that RAID5 gives you an opportunity to get things fixed, but there are folks who lose a disk in a RAID5, start the rebuild, and then lose a second disk during the rebuild. That was my main reason to go to RAID6. Not that I would ever run the array degraded but that I could still tolerate a second loss while the rebuild was happening and hopefully get by. That was similar to my old 3-disk RAID1 where I'd have to lose all 3 disks to be out of business. Thanks, Mark ^ permalink raw reply [flat|nested] 46+ messages in thread
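One way to see for yourself whether md is spreading reads across RAID1 members is to watch per-device throughput while something large is read from the array; a minimal sketch using iostat from sysstat, where the member names and file path are examples only:

iostat -dxm 5 sdb sdc sdd sde sdf    # per-device read MB/s, refreshed every 5 seconds
# in another terminal, read something big from the array:
dd if=/VirtualMachines/bonnie/SDDCopy of=/dev/null bs=1M
# if md is balancing reads, several members show read traffic at once instead of just one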
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 18:38 ` Mark Knecht @ 2013-06-21 18:50 ` Gary E. Miller 2013-06-21 18:57 ` Rich Freeman 2013-06-22 14:34 ` Duncan 2013-06-21 18:53 ` Bob Sanders 1 sibling, 2 replies; 46+ messages in thread From: Gary E. Miller @ 2013-06-21 18:50 UTC (permalink / raw To: gentoo-amd64; +Cc: markknecht [-- Attachment #1: Type: text/plain, Size: 2699 bytes --] Yo Mark! On Fri, 21 Jun 2013 11:38:00 -0700 Mark Knecht <markknecht@gmail.com> wrote: > On the read side I'm not sure if I'm understanding your point. I agree > that a so-designed RAID1 system could/might read smaller portions of a > larger read from RAID1 drives in parallel, taking some data from one > drive and some from another drive, and then only take action > corrective if one of the drives had troubles. However I don't know > that mdadm-based RAID1 does anything like that. Does it? It surely does. I have confirmed that at least monthly since md has existed in the kernel. > It seems to me that unless I at least _request_ all data from all > drives and minimally compare at least some error flag from the > controller telling me one drive had trouble reading a sector then how > do I know if anything bad is happening? Correct. You can't tell if you can read something without trying to read it. Which is why you should do a full raid rebuild every week. > > Or maybe you're saying it's RAID1 and I don't know if anything bad is > happening _unless_ I do a scrub and specifically check all the drives > for consistency? No. A simple read will find the problem. But given it is RAID1 the only way to be sure to read from both drives is a raid rebuild. > I do mdadm scrubs at least once a week. I still do them by hand. They > have never appeared terribly expensive watching top or iotop but > sometimes when I'm watching NetFlix or Hulu in a VM I get more pauses > when the scrub is taking place, but it's not huge. Which is why you should cron them at oh-dark-thirty. > > I agree that RAID5 gives you an opportunity to get things fixed, but > there are folks who lose a disk in a RAID5, start the rebuild, and > then lose a second disk during the rebuild. Because they failed to do weekly rebuilds. > Not that I would ever run the array degraded but that I > could still tolerate a second loss while the rebuild was happening and > hopefully get by. Sadly most people make their RAID5 or RAID6 out of brand new, consecutively serial numbered drives. They then get exactly the same temp, voltage, humidity, seek stress until they all fail within days of each other. I have personally seen 4 of 5 drives in a RAID5 fail within 3 days many times. Usually on a Friday, when the tech decides the drive replacement can wait until Monday. Your only protection against a full RAIDx failure is an offsite backup. RGDS GARY --------------------------------------------------------------------------- Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97701 gem@rellim.com Tel:+1(541)382-8588 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
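For the oh-dark-thirty scheduling, the md "check" action can be driven straight from cron through sysfs; a minimal sketch, where the /dev/md0 array name and the weekly slot are assumptions:

# /etc/cron.weekly/md-check  (sketch, runs as root)
#!/bin/sh
echo check > /sys/block/md0/md/sync_action    # read every member and compare, without rewriting anything
# afterwards, see whether any stripes disagreed:
cat /sys/block/md0/md/mismatch_cnt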
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 18:50 ` Gary E. Miller @ 2013-06-21 18:57 ` Rich Freeman 2013-06-22 14:34 ` Duncan 1 sibling, 0 replies; 46+ messages in thread From: Rich Freeman @ 2013-06-21 18:57 UTC (permalink / raw To: gentoo-amd64 On Fri, Jun 21, 2013 at 2:50 PM, Gary E. Miller <gem@rellim.com> wrote: > On Fri, 21 Jun 2013 11:38:00 -0700 > Mark Knecht <markknecht@gmail.com> wrote: >> Or maybe you're saying it's RAID1 and I don't know if anything bad is >> happening _unless_ I do a scrub and specifically check all the drives >> for consistency? > > No. A simple read will find the problem. But given it is RAID1 the only > way to be sure to read from both dirves is a raid rebuild. Keep in mind that a read will only find the problem if it is visible to the hard drive's ECC. A silent error would not be detected. It could be detected by a rebuild, though it could not be reliably fixed in this way. With raid5 a silent error in a single drive per stripe could be fixed in a rebuild. > > Your only protection against a full RAIDx failure is an offsite backup. ++ That's why I'm not big on crazy levels of redundancy. RAID is first and foremost a restoration avoidance tool, not a backup solution. It reduces the risk of needing restoration, but it does not cover as many failure modes as an offline backup. If btrfs eats your data it really won't matter how many platters it had to chew on in the process. So, by all means use RAID, but if you're going to spend a lot of money on redundant disks, spend it on a backup solution instead (which might very well involve disks, though you should move them offsite). Rich ^ permalink raw reply [flat|nested] 46+ messages in thread
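As a concrete minimal example of the "spend it on a backup instead" point, even a nightly rsync to a machine in another building covers failure modes no amount of RAID redundancy will; the hostname and paths here are placeholders only:

rsync -aHx --delete /home/ backuphost:/backups/$(hostname)/home/
# -a preserves permissions and times, -H keeps hard links, -x stays on one filesystem, --delete mirrors removals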
* [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 18:50 ` Gary E. Miller 2013-06-21 18:57 ` Rich Freeman @ 2013-06-22 14:34 ` Duncan 2013-06-22 22:15 ` Gary E. Miller 1 sibling, 1 reply; 46+ messages in thread From: Duncan @ 2013-06-22 14:34 UTC (permalink / raw To: gentoo-amd64 Gary E. Miller posted on Fri, 21 Jun 2013 11:50:43 -0700 as excerpted: > On Fri, 21 Jun 2013 11:38:00 -0700 Mark Knecht <markknecht@gmail.com> > wrote: > >> On the read side I'm not sure if I'm understanding your point. I agree >> that a so-designed RAID1 system could/might read smaller portions of a >> larger read from RAID1 drives in parallel, taking some data from one >> drive and some from another drive, and then only take action corrective >> if one of the drives had troubles. However I don't know that >> mdadm-based RAID1 does anything like that. Does it? > > It surely does. I have confirmed that at least monthly since md has > existed in the kernel. Out of curiosity, /how/ do you confirm that? I agree based on real usage experience, but with a claim that you're confirming it at least monthly, it sounds like you have a standardized/scripted test, and I'm interested in what/how you do it. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-22 14:34 ` Duncan @ 2013-06-22 22:15 ` Gary E. Miller 2013-06-28 0:20 ` Duncan 0 siblings, 1 reply; 46+ messages in thread From: Gary E. Miller @ 2013-06-22 22:15 UTC (permalink / raw To: gentoo-amd64; +Cc: 1i5t5.duncan [-- Attachment #1: Type: text/plain, Size: 1401 bytes --] Yo Duncan! On Sat, 22 Jun 2013 14:34:36 +0000 (UTC) Duncan <1i5t5.duncan@cox.net> wrote: > >> On the read side I'm not sure if I'm understanding your point. I > >> agree that a so-designed RAID1 system could/might read smaller > >> portions of a larger read from RAID1 drives in parallel, taking > >> some data from one drive and some from another drive, and then > >> only take action corrective if one of the drives had troubles. > >> However I don't know that mdadm-based RAID1 does anything like > >> that. Does it? > > > > It surely does. I have confirmed that at least monthly since md has > > existed in the kernel. > > Out of curiosity, /how/ do you confirm that? I agree based on real > usage experience, but with a claim that you're confirming it at least > monthly, it sounds like you have a standardized/scripted test, and > I'm interested in what/how you do it. I have around 30 RAID1 sets in production right now. Some of them doing mostly reads and some mostly writes. Some are HDD and some SSD. The RAID sets are pushed pretty hard 24x7 and we watch the performance pretty closely to plan updates. I have collectd performance graphs going way back. RGDS GARY --------------------------------------------------------------------------- Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97701 gem@rellim.com Tel:+1(541)382-8588 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
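For the kind of long-running collectd history Gary describes, the disk plugin is enough to get per-device throughput graphs over time; a minimal sketch of the relevant collectd.conf fragment, with option details worth double-checking against collectd.conf(5):

LoadPlugin disk
<Plugin disk>
  Disk "/^sd/"           # collect statistics for all sd* devices
  IgnoreSelected false
</Plugin>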
* [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-22 22:15 ` Gary E. Miller @ 2013-06-28 0:20 ` Duncan 2013-06-28 0:41 ` Gary E. Miller 0 siblings, 1 reply; 46+ messages in thread From: Duncan @ 2013-06-28 0:20 UTC (permalink / raw To: gentoo-amd64 Gary E. Miller posted on Sat, 22 Jun 2013 15:15:16 -0700 as excerpted: >> >> [Does md/raid1 do parallel reads of multiple files at once?] >> > >> > It surely does. I have confirmed that at least monthly since md has >> > existed in the kernel. >> >> Out of curiosity, /how/ do you confirm that? I agree based on real >> usage experience, but with a claim that you're confirming it at least >> monthly, it sounds like you have a standardized/scripted test, and I'm >> interested in what/how you do it. > > I have around 30 RAID1 sets in production right now. Some of them doing > mostly reads and some mostly writes. Some are HDD and some SSD. > The RAID sets are pushed pretty hard 24x7 and we watch the performance > pretty closely to plan updates. I have collectd performance graphs > going way back. So you're basically confirming it with normal usage as well, but have documented performance history going pretty well all the way back. Not the simple test script I was hoping for, but pretty impressive, none-the- less. Thanks. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-28 0:20 ` Duncan @ 2013-06-28 0:41 ` Gary E. Miller 0 siblings, 0 replies; 46+ messages in thread From: Gary E. Miller @ 2013-06-28 0:41 UTC (permalink / raw To: gentoo-amd64; +Cc: 1i5t5.duncan [-- Attachment #1: Type: text/plain, Size: 1233 bytes --] Yo Duncan! On Fri, 28 Jun 2013 00:20:45 +0000 (UTC) Duncan <1i5t5.duncan@cox.net> wrote: > > I have around 30 RAID1 sets in production right now. Some of them > > doing mostly reads and some mostly writes. Some are HDD and some > > SSD. The RAID sets are pushed pretty hard 24x7 and we watch the > > performance pretty closely to plan updates. I have collectd > > performance graphs going way back. > > So you're basically confirming it with normal usage as well, but have > documented performance history going pretty well all the way back. > Not the simple test script I was hoping for, but pretty impressive, > none-the- less. I find that 'hdparm -tT', 'dd' and 'bonnie++' will match up pretty well with what I see in production. Just be sure to use really large test file sizes with bonnie++ and dd. dd also needs a pretty large block size (bs=) and pretty large/fast source of bits when writing. With bonnie++ you can easily see the speed differences between raw disks and various RAID types. RGDS GARY --------------------------------------------------------------------------- Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97701 gem@rellim.com Tel:+1(541)382-8588 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
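When moving beyond dd, a bonnie++ run sized well past installed RAM keeps the page cache from flattering the numbers; a minimal sketch, where the directory, size and user are assumptions (the size should be at least twice RAM, so roughly 48GB on a 24GB machine):

# run as root; bonnie++ drops privileges to the named user for the test
bonnie++ -d /VirtualMachines/bonnie -s 49152 -n 0 -u mark
# -d test directory, -s file size in MB (49152 MB is about 48GB), -n 0 skips the small-file tests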
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 18:38 ` Mark Knecht 2013-06-21 18:50 ` Gary E. Miller @ 2013-06-21 18:53 ` Bob Sanders 1 sibling, 0 replies; 46+ messages in thread From: Bob Sanders @ 2013-06-21 18:53 UTC (permalink / raw To: gentoo-amd64 Mark Knecht, mused, then expounded: > > > I agree that RAID5 gives you an opportunity to get things fixed, but > there are folks who lose a disk in a RAID5, start the rebuild, and > then lose a second disk during the rebuild. That was my main reason to > go to RAID6. Not that I would ever run the array degraded but that I > could still tolerate a second loss while the rebuild was happening and > hopefully get by. That was similar to my old 3-disk RAID1 where I'd > have to lose all 3 disks to be out of business. > If the drives in the RAID came from the same build lot, the chances of multi-drive failure are fairly high, if one fails. I've had 3 out of four drives, from the same lot build, fail at the same time. I've had others never fail. And a few that fail over time where others from the same lot failed within a month of the first failure. Bob -- - ^ permalink raw reply [flat|nested] 46+ messages in thread
* [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 17:40 ` Mark Knecht 2013-06-21 17:56 ` Bob Sanders 2013-06-21 17:57 ` Rich Freeman @ 2013-06-22 14:23 ` Duncan 2013-06-23 1:02 ` Mark Knecht 2 siblings, 1 reply; 46+ messages in thread From: Duncan @ 2013-06-22 14:23 UTC (permalink / raw To: gentoo-amd64 Mark Knecht posted on Fri, 21 Jun 2013 10:40:48 -0700 as excerpted: > On Fri, Jun 21, 2013 at 12:31 AM, Duncan <1i5t5.duncan@cox.net> wrote: > <SNIP> > > Wonderful post but much too long to carry on a conversation > in-line. FWIW... I'd have a hard time doing much of anything else, these days, no matter the size. Otherwise, I'd be likely to forget a point. But I do try to snip or summarize when possible. And I do understand your choice and agree with it for you. It's just not one I'd find workable for me... which is why I'm back to inline, here. > As you sound pretty sure of your understanding/history I'll > assume you're right 100% of the time, but only maybe 80% of the post > feels right to me at this time so let's assume I have much to learn and > go from there. That's a very nice way of saying "I'll have to verify that before I can fully agree, but we'll go with it for now." I'll have to remember it! =:^) > In thinking about this issue this morning I think it's important to > me to get down to basics and verify as much as possible, step-by-step, > so that I don't layer good work on top of bad assumptions. Extremely reasonable approach. =:^) > Basic Machine - ASUS Rampage II Extreme motherboard (4/1/2010) + 24GB > DDR3 + Core i7-980x Extreme 12 core processor That's a very impressive base. But as you point out elsewhere, you use it. Multiple VMs running MS should well use use both the dozen cores and the 24 gig RAM. As an aside, it's interesting how well your dozen cores, 24 gig RAM, fits my basic two gigs a core rule of thumb. Obviously I'd consider that reasonably well balanced RAM/cores-wise. > 1 SDD - 120GB SATA3 on it's own controller > 5+ HDD - WD5002ABYS RAID Edition 3 SATA3 drives > using Intel integrated controllers > > (NOTE: I can possibly go to a 6-drive RAID if I made some changes in the > box but that's for later) > > According to the WD spec > (http://www.wdc.com/en/library/spec/2879-701281.pdf) the 500GB drives OK, single 120 gig main drive (SSD), 5 half-TB drives for the raid. > [...] sustain 113MB/S to the drive. Using hdparm I measure 107MB/S > or higher for all 5 drives [...] > The SDD on it's own PCI Express controller clocks in at about 250MB/S > for reads. OK. But there's a caveat on the measured "spinning rust" speeds. You're effectively getting "near best case". I suppose you're familiar with absolute velocity vs rotational velocity vs distance from center. Think merry-go-round as a kid or crack-the-whip as a teen (or insert your own experience here). The closer to the center you are the slower you go at the same rotational speed (RPM). Conversely, the farther from the center you are, the faster you're actually moving at the same RPM. Rotational disk data I/O rates have a similar effect -- data toward the outside edge of the platter (beginning of the disk) is faster to read/ write, while data toward the inside edge (center) is slower. Based on my own hddparm tests on partitioned drives where I knew the location of the partition, vs. the results for the drive as a whole, the speed reported for rotational drives as a whole, is the speed near the outside edge (beginning of the disk). 
Thus, it'd be rather interesting to partition up one of those drives with a small partition at the beginning and another at the end, and do an hdparm -t of each, as well as of the whole disk. I bet you'd find the one at the end reports rather lower numbers, while the report for the drive as a whole is similar to that of the partition near the beginning of the drive, much faster. A good SSD won't have this same sort of variance, since it's SSD and the latency to any of its flash, at least as presented by the firmware which should deal with any variance as it distributes wear, should be similar. (Cheap SSDs and standard USB thumbdrive flash storage works differently, however. Often they assume FAT and have a small amount of fast and resilient but expensive SLC flash at the beginning, where the FAT would be, with the rest of the device much slower and less resilient to rewrite but far cheaper MLC. I was just reading about this recently as I researched my own SSDs.) > TESTING: I'm using dd to test. It gives an easy to read anyway result > and seems to be used a lot. I can use bonnie++ or IOzone later but I > don't think that's necessary quite yet. Agreed. > Being that I have 24GB and don't > want cached data to effect the test speeds I do the following: > > 1) Using dd I created a 50GB file for copying using the following > commands: > > cd /mnt/fastVM > dd if=/dev/random of=random1 bs=1000 count=0 seek=$[1000*1000*50] It'd be interesting to see what the reported speed is here... See below for more. > 2) To ensure that nothing is cached and the copies are (hopefully) > completely fair as root I do the following between each test: > > sync free -h > echo 3 > /proc/sys/vm/drop_caches > free -h Good job. =:^) > 3) As a first test I copy using dd the 50GB file from the SDD to the > RAID6. OK, that answered the question I had about where that file you created actually was -- on the SSD. > As long as reading the SDD is much faster than writing the RAID6 > then it should be a test of primarily the RAID6 write speed: > > dd if=/mnt/fastVM/random1 of=SDDCopy > 97656250+0 records in 97656250+0 records out > 50000000000 bytes (50 GB) copied, 339.173 s, 147 MB/s > If I clear cache as above and rerun the test it's always 145-155MB/S ... Assuming $PWD is now on the raid. You had the path shown too, which I snipped, but that doesn't tell /me/ (as opposed to you, who should know based on your mounts) anything about whether it's on the raid or not. However, the above including the drop-caches demonstrates enough care that I'm quite confident you'd not make /that/ mistake. > 4) As a second test I read from the RAID6 and write back to the RAID6. > I see MUCH lower speeds, again repeatable: > > dd if=SDDCopy of=HDDWrite > 97656250+0 records in 97656250+0 records out > 50000000000 bytes (50 GB) copied, 1187.07 s, 42.1 MB/s > 5) As a final test, and just looking for problems if any, I do an SDD to > SDD copy which clocked in at close to 200MB/S > > dd if=random1 of=SDDCopy > 97656250+0 records in 97656250+0 records out > 50000000000 bytes (50 GB) copied, 251.105 s, 199 MB/s > So, being that this RAID6 was grown yesterday from something that > has existed for a year or two I'm not sure of it's fragmentation, or > even how to determine that at this time. However it seems my problem are > RAID6 reads, not RAID6 writes, at least to new an probably never used > disk space. Reading all that, one question occurs to me. 
If you want to test read and write separately, why the intermediate step of dd-ing from /dev/ random to ssd, then from ssd to raid or ssd? Why not do direct dd if=/dev/random (or urandom, see note below) of=/desired/target ... for write tests, and then (after dropping caches), if=/desired/target of=/dev/null ... for read tests? That way there's just the one block device involved, not both. /dev/random note: I presume with that hardware you have one of the newer CPUs with the new Intel hardware random instruction, with the appropriate kernel config hooking it into /dev/random, and/or otherwise have /dev/random hooked up to a hardware random number generator. Otherwise, using that much random data could block until more suitably random data is generated from approved kernel sources. Thus, the following probably doesn't apply to you, but it may well apply to others, and is good practice in any case, unless you KNOW your random isn't going to block due to hardware generation, and even then it's worth noting that when you're posting examples like the above. In general, for tests such as this where a LOT of random data is needed, but cryptographic-quality random isn't necessarily required, use /dev/urandom. In the event that real-random data gets too low, /dev/urandom will switch to pseudo-random generation, which should be "good enough" for this sort of usage. /dev/random, OTOH, will block until it gets more random data from sources the kernel trusts to be truly random. On some machines with relatively limited sources of randomness the kernel considers truly random, therefore, just grabbing 50 GB of data from /dev/random could take QUITE some time (days maybe? I don't know). Obviously you don't have /too/ big a problem with it as you got the data from /dev/random, but it's worth noting. If your machine has a hardware random generator hooked into /dev/random, then /dev/urandom will never switch to pseudo-random in any case, so for tests of anything above /kilobytes/ of random data (and even at that...), just use urandom and you won't have to worry about it either way. OTOH, if you're generating an SSH key or something, always use /dev/random as that needs cryptographic security level randomness, but that'll take just a few bytes of randomness, not kilobytes let alone gigabytes, and if your hardware doesn't have good randomness and it does block, wiggling your mouse around a bit (obviously assumes a local command, remote could require something other than mouse, obviously) should give it enough randomness to continue. Meanwhile, dd-ing either from /dev/urandom as source, or to /dev/null as sink, with only the test-target block device as a real block device, should give you "purer" read-only and write-only tests. In theory it shouldn't matter much given your method of testing, but as we all know, theory and reality aren't always well aligned. Of course the next question follows on from the above. I see a write to the raid, and a copy from the raid to the raid, so read/write on the raid, and a copy from the ssd to the ssd, read/write on it, but no test of from the raid read. So if=/dev/urandom of=/mnt/raid/target ... should give you raid write. drop-caches if=/mnt/raid/target of=/dev/null ... should give you raid read. *THEN* we have good numbers on both to compare the raid read/write to. What I suspect you'll find, unless fragmentation IS your problem, is that both read (from the raid) alone and write (to the raid) alone should be much faster than read/write (from/to the raid). 
The problem with read/write is that you're on "rotating rust" hardware and there's some latency as it repositions the heads from the read location to the write location and back. If I'm correct and that's what you find, a workaround specific to dd would be to specify a much larger block size, so it reads in far more data at once, then writes it out at once, with far fewer switches between modes. In the above you didn't specify bs (or the separate input/output equivilents, ibs/obs respectively) at all, so it's using 512-byte blocksize defaults. From what I know of hardware, 64KB is a standard read-ahead, so in theory you should see improvements using larger block sizes upto at LEAST that size, and on a 5-disk raid6, probably 3X that, 192KB, which should in theory do a full 64KB buffer on each of the three data drives of the 5- way raid6 (the other two being parity). I'm guessing you'll see a "knee" at the 192 KB (that's 2^10 power not 10^3 power BTW) block size, and above that you might see improvement, but not near as much, since the hardware should be doing full 64KB blocks which it's optimized to. There's likely to be another knee at the 16MB point (again, power of two, not 10), or more accurately, the 48MB point (3*16MB), since that's the size of the device hardware buffers (again, three devices worth of data-stripe, since the other two are parity, 3*16MB=48MB). Above that, theory says you'll see even less improvement, since the caches will be full and any improvement still seen should be purely that of less switches between read/write mode and thus less seeks. But it'd be interesting to see how closely theory matches reality, there's very possibly a fly in that theoretical ointment somewhere. =:^\ Of course configurable block size is specific to dd. Real life file transfers may well be quite a different story. That's where the chunk size, stripe size, etc, stuff comes in, setting the defaults for the kernel for that device, and again, I'll freely admit to not knowing as much as I could in that area. > I will also report more later but I can state that just using top > there's never much CPU usage doing this but a LOT of WAIT time when > reading the RAID6. It really appears the system is spinning it's wheels > waiting for the RAID to get data from the disk. When you're dealing with spinning rust, any time you have a transfer of any size (certainly GB), you WILL see high wait times. Disks are simply SLOW. Even SSDs are no match for system memory, tho their enough closer to help a lot and can be close enough that the bottleneck is elsewhere. (Modern SSDs saturate the SATA-600 links with thruput above 500 MByte/ sec, making the SATA-600 bus the bottleneck, or the 1x PCI-E 2.xlink if that's what it's running on, since they saturate at 485MByte/sec or so, tho PCI-E 3.x is double that so nearly a GByte/sec and a single SATA-600 won't saturate that. Modern DDR3 SDRAM by comparision runs 10+ GByte/sec LOW end, two orders of magnitude faster. Numbers fresh from wikipedia, BTW.) > One place where I wanted to double check your thinking. My thought > is that a RAID1 will _NEVER_ outperform the hdparm -tT read speeds as it > has to read from three drives and make sure they are all good before > returning data to the user. I don't see how that could ever be faster > than what a single drive file system could do which for these drives > would be the 113MB/S WD spec number, correct? 
As I'm currently getting > 145MB/S it appears on the surface that the RAID6 is providing some > value, at least in these early days of use. Maybe it will degrade over > time though. As someone else already posted, that's NOT correct. Neither raid1 nor raid6, at least the mdraid implementations, verify the data. Raid1 doesn't have parity at all, just many copies, and raid6 has parity but only uses it for rebuilds, NOT to check data integrity under normal usage -- it too simply reads the data and returns it. What raid1 does (when it's getting short reads only one at a time) is send the request to every spindle. The first one that returns the data wins; the others simply get their returns thrown away. So under small-one-at-a-time reading conditions, the speed of raid1 reads should be the speed of the fastest disk in the bunch. The raid1 read advantage is in the fact that there's often more than one read going on at once, or that the read is big enough to split up, so different spindles can be seeking to and reading different parts of the request in parallel. (This also helps in fragmented file conditions as long as fragmentation isn't overwhelming, since a raid1 can then send different spindle heads to read the different segments in parallel, instead of reading one at a time serially, as it would have to do in a single spindle case.) In theory, the stripes of raid6 /can/ lead to better thruput for reads. In fact, my experience both with raid6 and with raid0 demonstrates that not to be the case as often as one might expect, due either to small reads or due to fragmentation breaking up the big reads thus negating the theoretical thruput advantage of multiple stripes. To be fair, my raid0 experience was as I mentioned earlier, with files I could easily redownload from the net, mostly the portage tree and overlays, along with the kernel git tree. Due to the frequency of update and the fast rate of change as well as the small files, fragmentation was quite a problem, and the files were small enough I likely wouldn't have seen the full benefit of the 4-way raid0 stripes in any case, so that wasn't a best-case test scenario. But it's what one practically puts on raid0, because it IS easily redownloaded from the net, so it DOESN'T matter that a loss of any of the raid0 component devices will kill the entire thing. If I'd have been using the raid0 for much bigger media files, mp3s or video of megabytes in size minimum, that get saved and never changed so there's little fragmentation, I expect my raid0 experience would have been *FAR* better. But at the same time, that's not the type of data that it generally makes SENSE to store on a raid0 without backups or redundancy of any sort, unless it's simply VDR files that if a device drops from the raid and you lose it you don't particularly care (which would make a GREAT raid0 candidate), so... Raid6 is the stripes of raid0, plus two-way-parity. So since the parity is ignored for reads, for them it's effectively raid0 with two less stripes then the number of devices. Thus your 5-device raid6 is effectively a 3-device raid0 in terms of reads. In theory, thruput for large reads done by themselves should be pretty good -- three times that of a single device. In fact... due either to multiple jobs happening at once, or to a mix of read/write happening at once, or to fragmentation, I was disappointed, and far happier with raid1. 
But your situation is indeed rather different than mine, and depending on how much writing happens in those big VM files and how the filesystem you choose handles fragmentation, you could be rather happier with raid6 than I was. But I'd still suggest you try raid1 if the amount of data you're handling will let you. Honestly, it surprised me how well raid1 did for me. I wasn't prepared for that at all, and I believe the comparison to what I was getting on raid6 is what colored my opinion of raid6 so badly. I had NO IDEA there would be that much difference! But your experience may indeed be different. The only way to know is to try it. However, one thing I either overlooked or that hasn't been posted yet is just how much data you're talking about. You're running five 500-gig drives in raid6 now, which should give you 3*500=1500 gigs (10-power) capacity. If it's under a third full, 500 GB (10-power), you can go raid1 with as many mirrors as you like of the five, and keep the rest of them for hot-spares or whatever. If you're running (or plan to be running) near capacity, over 2/3 full, 1 TB (10-power), you really don't have much option but raid6. If you're in between, 1/3 to 2/3 full, 500-1000 GB (10-power), then a raid10 is possible, perhaps 4-spindle with the 5th as a hot-spare. (A spindle configured as a hot-spare is kept unused but ready for use by mdadm and the kernel. If a spindle should drop out, the hot-spare is automatically inserted in its place and a rebuild immediately started. This narrows the danger zone during which you're degraded and at risk if further spindles drop out, because handling is automatic so you're back to full un-degraded as soon as possible. However, it doesn't eliminate that danger zone should another one drop out during the rebuild, which is after all quite stressful on the remaining drives due to all that reading going on, so the risk is greater during a rebuild than under normal operation.) So if you're over 2/3 full, or expect to be in short order, there's little sense in further debate on at least /your/ raid6, as that's pretty much what you're stuck with. (Unless you can categorize some data as more important than the rest, and raid it, while the other can be considered worth the risk of loss if the device goes, in which case we're back in play with other options once again.) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 46+ messages in thread
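If the in-between case applies, the 4-spindle raid10 plus hot-spare sketched above would be created along these lines; purely illustrative, the partition names are placeholders and the command destroys whatever is on them:

mdadm --create /dev/md10 --level=10 --raid-devices=4 --spare-devices=1 \
      /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
# 4 active members give roughly 1TB usable from 500GB drives; the 5th sits idle as a hot-spare
mdadm --detail /dev/md10    # confirm layout, chunk size and spare status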
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-22 14:23 ` Duncan @ 2013-06-23 1:02 ` Mark Knecht 2013-06-23 1:48 ` Mark Knecht 0 siblings, 1 reply; 46+ messages in thread From: Mark Knecht @ 2013-06-23 1:02 UTC (permalink / raw To: Gentoo AMD64 On Sat, Jun 22, 2013 at 7:23 AM, Duncan <1i5t5.duncan@cox.net> wrote: > Mark Knecht posted on Fri, 21 Jun 2013 10:40:48 -0700 as excerpted: > >> On Fri, Jun 21, 2013 at 12:31 AM, Duncan <1i5t5.duncan@cox.net> wrote: >> <SNIP> <SNIP> > > ... Assuming $PWD is now on the raid. You had the path shown too, which > I snipped, but that doesn't tell /me/ (as opposed to you, who should know > based on your mounts) anything about whether it's on the raid or not. > However, the above including the drop-caches demonstrates enough care > that I'm quite confident you'd not make /that/ mistake. > >> 4) As a second test I read from the RAID6 and write back to the RAID6. >> I see MUCH lower speeds, again repeatable: >> >> dd if=SDDCopy of=HDDWrite >> 97656250+0 records in 97656250+0 records out >> 50000000000 bytes (50 GB) copied, 1187.07 s, 42.1 MB/s > >> 5) As a final test, and just looking for problems if any, I do an SDD to >> SDD copy which clocked in at close to 200MB/S >> >> dd if=random1 of=SDDCopy >> 97656250+0 records in 97656250+0 records out >> 50000000000 bytes (50 GB) copied, 251.105 s, 199 MB/s > >> So, being that this RAID6 was grown yesterday from something that >> has existed for a year or two I'm not sure of it's fragmentation, or >> even how to determine that at this time. However it seems my problem are >> RAID6 reads, not RAID6 writes, at least to new an probably never used >> disk space. > > Reading all that, one question occurs to me. If you want to test read > and write separately, why the intermediate step of dd-ing from /dev/ > random to ssd, then from ssd to raid or ssd? > > Why not do direct dd if=/dev/random (or urandom, see note below) > of=/desired/target ... for write tests, and then (after dropping caches), > if=/desired/target of=/dev/null ... for read tests? That way there's > just the one block device involved, not both. > 1) I was a bit worried about using it in a way it might not have been intended to be used. 2) I felt that if I had a specific file then results should be repeatable, or at least not dependent on what's in the file. <SNIP> > > Meanwhile, dd-ing either from /dev/urandom as source, or to /dev/null as > sink, with only the test-target block device as a real block device, > should give you "purer" read-only and write-only tests. In theory it > shouldn't matter much given your method of testing, but as we all know, > theory and reality aren't always well aligned. > Will try some tests this way tomorrow morning. > > Of course the next question follows on from the above. I see a write to > the raid, and a copy from the raid to the raid, so read/write on the > raid, and a copy from the ssd to the ssd, read/write on it, but no test > of from the raid read. > > So > > if=/dev/urandom of=/mnt/raid/target ... should give you raid write. > > drop-caches > > if=/mnt/raid/target of=/dev/null ... should give you raid read. > > *THEN* we have good numbers on both to compare the raid read/write to. > > What I suspect you'll find, unless fragmentation IS your problem, is that > both read (from the raid) alone and write (to the raid) alone should be > much faster than read/write (from/to the raid). 
> The problem with read/write is that you're on "rotating rust" hardware > and there's some latency as it repositions the heads from the read > location to the write location and back. > If this lack of performance is truly driven by the drive rotational issues then I completely agree. > If I'm correct and that's what you find, a workaround specific to dd > would be to specify a much larger block size, so it reads in far more > data at once, then writes it out at once, with far fewer switches between > modes. In the above you didn't specify bs (or the separate input/output > equivilents, ibs/obs respectively) at all, so it's using 512-byte > blocksize defaults. > So help me clarify this before I do the work and find out I didn't understand. Whereas earlier I created a file using: dd if=/dev/random of=random1 bs=1000 count=0 seek=$[1000*1000*50] if what you are suggesting is more like this very short example: mark@c2RAID6 /VirtualMachines/bonnie $ dd if=/dev/urandom of=urandom1 bs=4096 count=$[1000*100] 100000+0 records in 100000+0 records out 409600000 bytes (410 MB) copied, 25.8825 s, 15.8 MB/s mark@c2RAID6 /VirtualMachines/bonnie $ then the results for writing this 400MB file are very slow, but I'm sure I don't understand what you're asking, or urandom is the limiting factor here. I'll look for a reply (from you or anyone else who has a better grasp of Duncan's idea than I do) before I do much more. Thanks! - Mark ^ permalink raw reply [flat|nested] 46+ messages in thread
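The block-size suggestion can also be tested directly against the existing 50GB file with a small sweep, dropping caches before each pass; a minimal sketch to run as root from the same directory used above (each pass copies the whole file, so it takes a while, and a smaller source file works just as well for comparing block sizes):

for bs in 4K 64K 192K 1M 16M 48M; do
    sync
    echo 3 > /proc/sys/vm/drop_caches
    echo "== bs=$bs =="
    dd if=SDDCopy of=HDDWrite bs=$bs conv=fsync 2>&1 | tail -n1   # keep only the throughput summary line
    rm -f HDDWrite
done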
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-23 1:02 ` Mark Knecht @ 2013-06-23 1:48 ` Mark Knecht 2013-06-28 3:36 ` Duncan 0 siblings, 1 reply; 46+ messages in thread From: Mark Knecht @ 2013-06-23 1:48 UTC (permalink / raw To: Gentoo AMD64 On Sat, Jun 22, 2013 at 6:02 PM, Mark Knecht <markknecht@gmail.com> wrote: > if what you are suggesting is more like this very short example: > > mark@c2RAID6 /VirtualMachines/bonnie $ dd if=/dev/urandom of=urandom1 > bs=4096 count=$[1000*100] > 100000+0 records in > 100000+0 records out > 409600000 bytes (410 MB) copied, 25.8825 s, 15.8 MB/s > mark@c2RAID6 /VirtualMachines/bonnie $ > Duncan, Actually, using your idea of piping things to /dev/null it appears that the random number generator itself is only capable of 15MB/S on my machine. mark@c2RAID6 /VirtualMachines/bonnie $ dd if=/dev/urandom of=/dev/null bs=4096 count=$[1000] 1000+0 records in 1000+0 records out 4096000 bytes (4.1 MB) copied, 0.260608 s, 15.7 MB/s mark@c2RAID6 /VirtualMachines/bonnie $ It doesn't change much based on block size or number of bytes I pipe. If this speed is representative of how well that works then I think I have to use a file. It appears this guy gets similar values: http://www.globallinuxsecurity.pro/quickly-fill-a-disk-with-random-bits-without-dev-urandom/ On the other hand, piping /dev/zero appears to be very fast - basically the speed of the processor I think: mark@c2RAID6 /VirtualMachines/bonnie $ dd if=/dev/zero of=/dev/null bs=4096 count=$[1000] 1000+0 records in 1000+0 records out 4096000 bytes (4.1 MB) copied, 0.000622594 s, 6.6 GB/s mark@c2RAID6 /VirtualMachines/bonnie $ - Mark ^ permalink raw reply [flat|nested] 46+ messages in thread
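If the goal is simply lots of incompressible data faster than /dev/urandom can produce it, one common workaround is to expand a small seed with a fast stream cipher; a minimal sketch only, where the output path, size and passphrase are arbitrary choices rather than anything from this thread:

openssl enc -aes-128-ctr -nosalt -pass pass:bench-seed < /dev/zero 2>/dev/null \
    | dd of=/mnt/fastVM/random2 bs=1M count=4096 iflag=fullblock
# an AES-CTR keystream over zeros looks random to a compressor but typically runs at hundreds of MB/s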
* [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-23 1:48 ` Mark Knecht @ 2013-06-28 3:36 ` Duncan 2013-06-28 9:12 ` Duncan 0 siblings, 1 reply; 46+ messages in thread From: Duncan @ 2013-06-28 3:36 UTC (permalink / raw To: gentoo-amd64 Mark Knecht posted on Sat, 22 Jun 2013 18:48:15 -0700 as excerpted: > Duncan, Again, following up now that it's my "weekend" and I have a chance... > Actually, using your idea of piping things to /dev/null it appears > that the random number generator itself is only capable of 15MB/S on my > machine. It doesn't change much based on block size of number of bytes > I pipe. =:^( Well, you tried. > If this speed is representative of how well that works then I think > I have to use a file. It appears this guy gets similar values: > > http://www.globallinuxsecurity.pro/quickly-fill-a-disk-with-random-bits- without-dev-urandom/ Wow, that's a very nice idea he has there! I'll have to remember that! The same idea should work for creating any relatively large random file, regardless of final use. Just crypt-setup the thing and dd /dev/zero into it. FWIW, you're doing better than my system does, however. I seem to run about 13 MB/s from /dev/urandom (upto 13.7 depending on blocksize). And back to the random vs urandom discussion, random totally blocked here after a few dozen bytes, waiting for more random data to be generated. So the fact that you actually got a usefully sized file out of it does indicate that you must have hardware random and that it's apparently working well. > On the other hand, piping /dev/zero appears to be very fast - > basically the speed of the processor I think: > > $ dd if=/dev/zero of=/dev/null bs=4096 count=$[1000] > 1000+0 records in 1000+0 records out 4096000 bytes (4.1 MB) copied, > 0.000622594 s, 6.6 GB/s What's most interesting to me when I tried that here is that unlike urandom, zero's output varies DRAMATICALLY by blocksize. With bs=$((1024*1024)) (aka 1MB), I get 14.3 GB/s, tho at the default bs=512, I get only 1.2 GB/s. (Trying a few more values, 1024*512 gives me very similar 14.5 GB/s, 1024*64 is already down to 13.2 GB/s, 1024*128=13.9 and 1024*256=14.1, while on the high side 1024*1024*2 is already down to 10.2 GB/s. So quarter MB to one MB seems the ideal range, on my hardware.) But of course, if your device is compressible-data speed-sensitive, as are say the sandforce-controller-based ssds, /dev/zero isn't going to give you anything like the real-world benchmark random data would (tho it should be a great best-case compressible-data test). Tho it's unlikely to matter on most spinning rust, AFAIK, and SSDs like my Corsair Neutrons (Link_A_Media/LAMD-based controller), which have as a bullet-point feature that they're data compression agnostic, unlike the sandforce- based SSDs. Since /dev/zero is so fast, I'd probably do a few initial tests to determine whether compressible data makes a difference on what you're testing, then use /dev/zero if it doesn't appear to, to get a reasonable base config, then finally double-check that against random data again. Meanwhile, here's another idea for random data, seeing as /dev/urandom is speed limited. Upto your memory constraints anyway, you should be able to dd if=/dev/urandom of=/some/file/on/tmpfs . Then you can dd if=/tmpfs/file, of=/dev/test/target, or if you want a bigger file than a direct tmpfs file will let you use, try something like this: cat /tmpfs/file /tmpfs/file /tmpfs/file | dd of=/dev/test/target ... 
which would give you 3X the data size of /tmpfs/file. (Man, testing that with a 10 GB tmpfs file (on a 12 GB tmpfs /tmp), I can see see how slow that 13 MB/s /dev/urandom actually is as I'm creating it! OUCH! I waited awhile before I started typing this comment... I've been typing slowly and looking at the usage graph as I type, and I'm still only at maybe 8 gigs, depending on where my cache usage was when I started, right now!) cd /tmp dd if=/dev/urandom of=/tmp/10gig.testfile bs=$((1024*1024)) count=10240 (10240 records, 10737418240 bytes, but it says 11 GB copied, I guess dd uses 10^3 multipliers, anyway, ~783 s, 13.7 MB/s) ls -l 10gig.testfile (confirm the size, 10737418240 bytes) cat 10gig.testfile 10gig.testfile 10gig.testfile \ 10gig.testfile 10gig.testfile | dd of=/dev/null (that's 5x, yielding 50 GB power of 2, 104857600+0 records, 53687091200 bytes, ~140s, 385 MB/s at the default 512-byte blocksize) Wow, what a difference block size makes there, too! Trying the above cat/ dd with bs=$((1024*1024)) (1MB) yields ~30s, 1.8 GB/s! 1GB block size (1024*1024*1024) yields about the same, 30s, 1.8 GB/s. LOL dd didn't like my idea to try a 10 GB buffer size! dd: memory exhausted by input buffer of size 10737418240 bytes (10 GiB) (No wonder, as that'd be 10GB in tmpfs/cache and a 10GB buffer, and I'm /only/ running 16 gigs RAM and no swap! But it won't take 2 GB either. Checking, looks like as my normal user I'm running a ulimit of 1-gig memory size, 2-gig virtual-size, so I'm sort of surprised it took the 1GB buffer... maybe that counts against virtual only or something? ) Low side again, ~90s, 599 MB/s @ 1KB (1024 byte) bs, already a dramatic improvement from the 140s 385 MB/s of the default 512-byte block. 2KB bs yields 52s, 1 GB/s 16KB bs yields 31s, 1.7 GB/s, near optimum already. High side again, 1024*1024*4 (4MB) bs appears to be best-case, just under 29s, 1.9 GB/s. Going to 8MB takes another second, 1.8 GB/s again, which is not a big surprise given that the memory page size is 4MB, so that's an unsurprising peak performance point. FWIW, cat seems to run just over 100% single-core saturation while dd seems to run just under, @97% or so. Running two instances in parallel (using the peak 4MB block size, 1.9 GB/ s with a single run) seems to cut performance some, but not nearly in half. (I got 1.5 GB/s and 1.6 GB/s, but I started one then switched to a different terminal to start the other, so they only overlapped by maybe 30s or so of the 35s on each.). OK, so that's all memory/cpu since neither end is actual storage, but that does give me a reasonable base against which to benchmark actual storage (rust or ssd), if I wished. What's interesting is that by, I guess pure coincidence, my 385 MB/s original 512-byte blocksize figure is reasonably close to what the SSD read benchmarks are with hddparm. IIRC the hdparm/ssd numbers were some higher, but not so much so (470 MB/sec I just tested). But the bus speed maxes out not /too/ far above that (500-600 MB/sec, theoretically 600 MB/ sec on SATA-600, but real world obviously won't /quite/ hit that, IIRC best numbers I've seen anywhere are 585 or so). So now I guess I send this and do some more testing of real device, now that you've provoked my curiosity and I have the 50 GB (mostly) pseudorandom file sitting in tmpfs already. Maybe I'll post those results later. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." 
Richard Stallman ^ permalink raw reply [flat|nested] 46+ messages in thread
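For reference, the dm-crypt trick from that globallinuxsecurity.pro link boils down to roughly the sketch below. The device, mapping name and cipher are placeholders rather than anything from the thread, and the dd step overwrites whatever the mapping points at, so triple-check the target first. The idea is simply that the kernel's AES is far faster than /dev/urandom, so zeros pushed through the crypto layer come out the other side as full-speed pseudorandom data:

# map the target device with a throwaway random key (plain dm-crypt, no LUKS header)
cryptsetup create scratch /dev/sdX --cipher=aes-xts-plain64 --key-file=/dev/urandom
# fill the mapping with zeros; what actually lands on /dev/sdX is effectively random
dd if=/dev/zero of=/dev/mapper/scratch bs=$((1024*1024))
# tear the mapping down again
cryptsetup remove scratch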
* [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-28 3:36 ` Duncan @ 2013-06-28 9:12 ` Duncan 2013-06-28 17:50 ` Gary E. Miller 0 siblings, 1 reply; 46+ messages in thread From: Duncan @ 2013-06-28 9:12 UTC (permalink / raw To: gentoo-amd64

Duncan posted on Fri, 28 Jun 2013 03:36:10 +0000 as excerpted:

> So now I guess I send this and do some more testing of real device, now
> that you've provoked my curiosity and I have the 50 GB (mostly)
> pseudorandom file sitting in tmpfs already. Maybe I'll post those
> results later.

Well, I decided to use something rather smaller, both because I wanted to run it against my much smaller btrfs partitions on the ssd, and because the big file was taking too long for the benchmarks I wanted to do in the time I wanted to do them.

I settled on a 4 GiB file. Speeds are power-of-10-based since that's what dd reports, unless otherwise stated. Sizes are power-of-2-based unless otherwise stated. This was filesystem-layer-based, not direct to device, and a single I/O task, plus whatever the system might have had going on in the background. Also note that after reading the dd manpage, I added the conv=fsync parameter, hoping that gave me more accurate speed ratings by reducing the effect of write-caching.

SSD speeds, dual Corsair Neutron n256gp3 SATA-600 ssds, running btrfs raid1 data and metadata:

To SSD: peak was upper 250s MB/s over a wide blocksize range of 1 MiB to 1 GiB. I believe the btrfs checksumming might lower speeds here somewhat, as it's quite a bit lower than the rated 450 MB/s sequential write speed.

From SSD: peak was lower 480s MB/s, blocksize 32 KiB to 512 KiB (a narrower blocksize range, and a much smaller block size, than I expected). This is MUCH better, far closer to the 540 MB/s ratings.

To/from SSD: At around 220 MB/s, peak was somewhat lower than the write-only peak, as might be expected. Best-case blocksize range seemed to be 256 KiB to 2 MiB.

So, the best mixed-access case would seem to be a blocksize near 1 MiB. I did a few timed cps also, then did the math to confirm the dd numbers. They were close enough.

Spinning rust speeds, single Seagate st9500424as, 7200rpm 2.5" 16MB buffer SATA-300 disk drive, reiserfs. Tests were done on a partition located roughly 40% thru the drive. I didn't test this one as closely and didn't do rust-to-rust tests at all, but:

To rust: upper 70s MB/s, blocksize didn't seem to matter much.

From rust: upper 90s MB/s, blocksize up to 4 MiB.

-- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 46+ messages in thread
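For anyone wanting to repeat this sort of run, the tests described above amount to variations on the following sketch. The paths are placeholders, the block size is whatever is being probed, and the cache-dropping step is an extra precaution rather than something stated in the thread (it needs root):

# write test: conv=fsync keeps the final flush inside the timed run
dd if=/tmp/4gig.testfile of=/mnt/target/ddtest bs=$((1024*1024)) conv=fsync
# drop the page cache so the read test actually hits the device
echo 3 > /proc/sys/vm/drop_caches
# read test
dd if=/mnt/target/ddtest of=/dev/null bs=$((1024*1024))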
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-28 9:12 ` Duncan @ 2013-06-28 17:50 ` Gary E. Miller 2013-06-29 5:40 ` Duncan 0 siblings, 1 reply; 46+ messages in thread From: Gary E. Miller @ 2013-06-28 17:50 UTC (permalink / raw To: gentoo-amd64; +Cc: 1i5t5.duncan

[-- Attachment #1: Type: text/plain, Size: 1612 bytes --]

Yo Duncan!

On Fri, 28 Jun 2013 09:12:24 +0000 (UTC) Duncan <1i5t5.duncan@cox.net> wrote:

> Duncan posted on Fri, 28 Jun 2013 03:36:10 +0000 as excerpted:
>
> I settled on a 4 GiB file. Speeds are power-of-10-based since that's
> what dd reports, unless otherwise stated.

dd is pretty good at testing linear file performance, pretty useless for testing mysql performance.

> To SSD: peak was upper 250s MB/s over a wide blocksize range of 1 MiB
> From SSD: peak was lower 480s MB/s, blocksize 32 KiB to 512 KiB

Sounds about right. Your speeds are now so high that small differences in the SATA controller chip will be bigger than the differences between some SSD drives. Use a PCIe/SATA card and your performance will drop from what you see.

> Spinning rust speeds, single Seagate st9500424as, 7200rpm 2.5" 16MB

Those are pretty old and slow. If you are going to test an HDD against a newer SSD you should at least test a newer HDD. A new 2TB drive could get pretty close to your SSD performance in linear tests.

> To rust: upper 70s MB/s, blocksize didn't seem to matter much.
> From rust: upper 90s MB/s, blocksize up to 4 MiB.

Seems about right, for that drive.

I think your numbers are about right, if your workload is just reading and writing big linear files. For a MySQL workload there would be a lot of random reads/writes/seeks and the SSD would really shine.

RGDS GARY --------------------------------------------------------------------------- Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97701 gem@rellim.com Tel:+1(541)382-8588

[-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-28 17:50 ` Gary E. Miller @ 2013-06-29 5:40 ` Duncan 0 siblings, 0 replies; 46+ messages in thread From: Duncan @ 2013-06-29 5:40 UTC (permalink / raw To: gentoo-amd64

Gary E. Miller posted on Fri, 28 Jun 2013 10:50:08 -0700 as excerpted:

> Yo Duncan!

Nice greeting, BTW. Good to cheer a reader up after a long day with things not going right, especially after seeing it several times in a row so there's a bit of familiarity now. =:^)

> On Fri, 28 Jun 2013 09:12:24 +0000 (UTC)
> Duncan <1i5t5.duncan@cox.net> wrote:
>
>> Duncan posted on Fri, 28 Jun 2013 03:36:10 +0000 as excerpted:
>>
>> I settled on a 4 GiB file. Speeds are power-of-10-based since that's
>> what dd reports, unless otherwise stated.
>
> dd is pretty good at testing linear file performance, pretty useless for
> testing mysql performance.

Recognized. A single-i/o-job test, but it's something, it's reasonably repeatable, and when done on the actual filesystem, it's real-world times and flexible in data and block size, if single-job limited. Plus, unlike some of the more exotic tests which need to be installed separately, it's commonly already installed and available for use on most *ix systems. =:^)

>> To SSD: peak was upper 250s MB/s over a wide blocksize range of 1 MiB
>> From SSD: peak was lower 480s MB/s, blocksize 32 KiB to 512 KiB
>
> Sounds about right. Your speeds are now so high that small differences
> in the SATA controller chip will be bigger than that between some SSD
> drives. Use a PCIe/SATA card and your performance will drop from what
> you see.

Good point. I was thinking about that the other day. SSDs are fast enough that a single one saturates a SATA-600 channel all by itself and already fills an older PCIe 2.0 1x lane. SATA port-multipliers are arguably still useful for slower spinning rust, but not so much for SSD, where the bottleneck is often already the SATA and/or PCIe, so doubling up will indeed only slow things down. And most add-on SATA cards have several SATA ports hanging off the same 1x PCIe, which means they'll bottleneck if actually using more than a single port, too. I believe I have seen 4x PCIe SATA cards, which would allow four or so SATA ports (I think 5), but they tend to be higher priced.

After pondering that for a bit, I decided I'd take a closer look next time I was at Fry's Electronics, to see what was actually available, as well as the prices. Until last year I was still running old PCI-X boxes, so the whole PCI-E thing itself is still relatively new to me, and I'm still reorienting myself to the modern bus and its implications in terms of addon cards, etc.

>> Spinning rust speeds, single Seagate st9500424as, 7200rpm 2.5" 16MB
>
> Those are pretty old and slow. If you are going to test an HDD against
> a newer SSD you should at least test a newer HDD. A new 2TB drive could
> get pretty close to your SSD performance in linear tests.

Well, it's not particularly old, but it *IS* a 2.5 inch, down from the old 3.5 inch standard, which due to the smaller diameter does mean lower rim/maximum speeds at the same RPM. And of course 7200 RPM is middle of the pack as well. The fast stuff (tho calling any spinning rust "fast" in the age of SSDs does rather jar, it's relative!) is 15000 RPM.
But 2.5 inch does seem to be on its way to becoming the new standard for desktops and servers as well, helped along by the three factors of storage density, SSDs (which are invariably 2.5 inch, and even that's due to the standard form factor, as often the circuit boards aren't full height and/or are largely empty space; 3.5 inch is just /ridiculously/ huge for them), and the newer focus on power efficiency (plus raw spindle density!) in the data center.

There's still a lot of inertia behind the 3.5 inch standard, just as there is behind spinning rust, and it's not going away overnight, but in the larger picture, 3.5 inch tends to look as anachronistic as a full size desktop in an age when even the laptop is being displaced by the tablet and mobile phone. Which isn't to say there's no one still using them, by far (my main machine is still a mid tower, easier to switch out parts on them, after all), but just sayin' what I'm sayin'. Anyway...

>> To rust: upper 70s MB/s, blocksize didn't seem to matter much.
>> From rust: upper 90s MB/s, blocksize up to 4 MiB.
>
> Seems about right, for that drive.
>
> I think your numbers are about right, if your workload is just reading
> and writing big linear files. For a MySQL workload there would be a lot
> of random reads/writes/seeks and the SSD would really shine.

Absolutely. And perhaps more to the point given the list and thus the readership...

As I said in a different thread on a different list recently, I didn't see my boot times change much, the major factor there being the ntp-client time-sync, at ~12 seconds usually (just long enough to trigger openrc's first 10-second warning in the minute timeout...), but *WOW*, did the SSDs drop my emerge sync, as well as kernel git pull, time! Those both involve many smaller files that will tend to highly fragment over time due to constant churn, and that's EXACTLY the job type where good SSDs can (and do!) really shine! Like your MySQL db example (tho that's high-activity large-file rather than a high-activity huge number of smaller files), except this one's likely more directly useful to a larger share of the list readership. =:^)

Meanwhile, the other thing with the boot times is that I boot to a CLI login, so I don't tend to count the X and kde startup times as boot. But kde starts up much faster too, and that would count as boot time for many users.

Additionally, I have one X app, pan, that in my (not exactly design targeted) usage had a startup time that really suffered on spinning rust, so much so that for years I've had it start with the kde session, so that if it takes five minutes on cold-cache to start up, no big deal, I have other things to do and it's ready when I'm ready for it.

It's a newsgroups (nntp) app, which as designed (by default) ships with a 10 MB article cache, and expires headers in (IIRC) two weeks. But my usage, in addition to following my various lists with it using gmane's list2news service, is as a long-time technical group and list archive. My text-instance pan (the one I use the most) has a cache size of several gig (with about a gig actually used) and is set to no-expiry on messages. In fact, I have ISP newsgroups archived in pan for an ISP server that hasn't even existed for several years now, as well as the archives for several mailing lists going back over a decade to 2002. So this text-instance pan tends to be another prime example of a best-use-case-for-SSDs.
Thousands, actually tens of thousands of files I believe, all in the same cache dir, with pan accessing them all to rebuild its threading tree in memory at startup. (For years there's been talk of switching that to a database, so it doesn't have to all be in memory at once, but the implementation has yet to be coded up and switched to.)

On spinning rust, I did note a good speed boost if I backed up everything and did a mkfs and restore from time to time, so it's definitely a high-fragmentation use-case as well. *GREAT* use case for SSD, and there too, I noticed a HUGE difference. Tho I've not actually timed the startup since switching to SSD, I do know that the pan icon appears in the system tray far earlier than it did, such that I almost think it's there as soon as the system tray is, now, whereas on the spinning rust, it would take five minutes or more to appear.

... Which is something those dd results don't, and can't, show at all. Single-i/o-thread access to a single rather large (GBs) file, unchanged since it was originally written, is one thing. Access to thousands or tens of thousands of constantly changing or multi-write-thread interwoven little files, or for that matter to a high-activity large file, thus (depending on the filesystem) potentially triggering COW fragmentation there, is something entirely different. And the many-parallel-job seek latency of spinning rust is something that dd simply does not and cannot really measure, as it's simply the wrong tool for that sort of purpose.

-- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 7:31 ` [gentoo-amd64] " Duncan 2013-06-21 10:28 ` Rich Freeman 2013-06-21 17:40 ` Mark Knecht @ 2013-06-30 1:04 ` Rich Freeman 2 siblings, 0 replies; 46+ messages in thread From: Rich Freeman @ 2013-06-30 1:04 UTC (permalink / raw To: gentoo-amd64

On Fri, Jun 21, 2013 at 3:31 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> BUT RAID5/6 DOESN'T USE
> THAT DATA FOR INTEGRITY CHECKING ANYWAY, ONLY FOR RECONSTRUCTION IN THE
> CASE OF DEVICE LOSS!

Well, to drive this point home in the case of the thread that wouldn't die, I had put an entry in crontab a week ago to do a weekly forced check of all my arrays. Last week it passed. Today, towards the end of the check, drive performance seriously deteriorated, and eventually smartd sent me an email about pending sectors (these are read errors).

Long story short, I ended up failing the drive out of the array (at which point my system stopped crawling), tried wiping the bad sectors individually, and after self-tests kept failing I even tried zeroing the drive. With sustained read failures under those circumstances I decided the drive qualified for RMA. The drive was almost a year old.

So, I'm crossing my fingers that I don't suffer another failure, and I'll be careful about clean shutdowns. Since the problem was discovered before I had dual failures, the RAID should be recoverable without further loss.

If you don't already, check your arrays weekly in crontab. Scripts for this can be found online or I'd be happy to post the one I dug up somewhere...

Rich ^ permalink raw reply [flat|nested] 46+ messages in thread
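For reference, a weekly md check of the sort described above is just a write to sysfs, so a minimal version of such a cron script might look like the sketch below. The script path in the crontab line is invented for the example; progress shows up in /proc/mdstat, and any inconsistencies found are counted in /sys/block/mdX/md/mismatch_cnt:

#!/bin/sh
# kick off a scrub ("check") of every md array; must run as root
for md in /sys/block/md*/md; do
    echo check > "$md/sync_action"
done

# example crontab entry, running it early Sunday morning:
# 0 3 * * 0 /usr/local/sbin/raid-check.sh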
* Re: [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? 2013-06-20 19:10 [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? Mark Knecht ` (2 preceding siblings ...) 2013-06-21 7:31 ` [gentoo-amd64] " Duncan @ 2013-06-22 12:49 ` B Vance 2013-06-22 13:12 ` Rich Freeman 2013-06-23 11:31 ` thegeezer 4 siblings, 1 reply; 46+ messages in thread From: B Vance @ 2013-06-22 12:49 UTC (permalink / raw To: gentoo-amd64

On Thu, 2013-06-20 at 12:10 -0700, Mark Knecht wrote:
<SNIP>
> Looking for some thoughtful ideas from those more experienced in this area.
>
> Cheers,
> Mark

Not necessarily the kind of answer you are looking for, but a year or so back I converted my NAS from hardware RAID1 to Linux software RAID1 to RAID1 on ZFS. Before the conversion to ZFS I had issues with the NAS being unable to keep up with requests. Since then I have been able to hit the NAS relatively hard with no visible effects.

Just to give an idea, a normal load involves streaming an HD movie to the TV, streaming music to a second system, and being used as the shared storage for four computers, two of which almost constantly hit the shared drive for data (it keeps the distfile directory for all the systems as well as serving as the local rsync), plus, once a month, transferring data to removable storage devices. All of this goes over cat6 Ethernet and occasionally USB2. I'm unsure how I would go about measuring the throughput, mainly because I never cared in the past as long as the files transferred at a reasonable pace and the video/audio didn't stutter.

By no means is my NAS a high-end system. Its stats are:

AMD64 X2 4200
ASUS A8V MoBo (I think)
4GB RAM
2 x Silicon Image Sil 3114 SATA RAID cards (4 port PCI cards)
3 x 1.5TB Seagate drives (on RAID cards)
4 x 2TB Western Digital drives (on RAID cards)
2 x Western Digital antique 80GB drives (mirrored on motherboard for OS)
Marvell GigE network cards (have a second card to add once I figure out how to automatically load balance through two cards)
Case with 2 x 120mm fans on top, 3 x 120mm fans on the front, 1 x 240mm fan on the side

Total storage available 6.3TB, of which 3.4TB is used. An image of the pool is created on a daily basis via cron jobs, which are overwritten every 3 days.
(Image of Day 1, Day 2, Day 3, then Day 4 overwrites Day 1.) The pool started with 5 x 750GB drives and has been grown slowly as I find deals on better drives.

The main advantage of using ZFS on Linux is the ease of growing your pools. As long as you know the id of the drive (preferably the hardware id, not the delegated one), it's so simple even I can manage it. Since I'm nowhere near the technical level of most folk here, anyone can do it. For what it's worth (very little, I know), I think that ZFS has too many advantages over Linux software RAID for it to be a real competition. YMMV

B. Vance ^ permalink raw reply [flat|nested] 46+ messages in thread
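For the curious, growing a pool the way described above is essentially a one-liner per disk or mirror pair. A rough sketch, with the pool name and the by-id paths purely as placeholders:

# add another mirrored pair to an existing pool, addressing disks by hardware id
zpool add tank mirror /dev/disk/by-id/ata-DISK_SERIAL_1 /dev/disk/by-id/ata-DISK_SERIAL_2
# confirm the new vdev and the extra capacity
zpool status tank
zpool list tank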
* Re: [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? 2013-06-22 12:49 ` [gentoo-amd64] " B Vance @ 2013-06-22 13:12 ` Rich Freeman 0 siblings, 0 replies; 46+ messages in thread From: Rich Freeman @ 2013-06-22 13:12 UTC (permalink / raw To: gentoo-amd64

On Sat, Jun 22, 2013 at 8:49 AM, B Vance <anonymous.pseudonym.88@gmail.com> wrote:
> Main advantage of using ZFS on linux is the ease of growing your pools.
> As long as you know the id of the drive (preferably the hardware id not
> the delegated one), its so simple I can manage it. Since I'm nowhere
> near the technical level of most folk here, anyone can do it. For what
> it's worth (very little I know), I think that ZFS has too many
> advantages over linux software RAID for it to be a real competition.

I'm holding out for btrfs, but for all the same reasons. I really don't want to mess with zfs on linux (fuse, etc - and the license issues - the thing I don't get is that Oracle maintains both).

However, the last time I checked, ZFS does not support reshaping of RAID-Z. That is a major limitation for me, as I almost always expand arrays gradually. You can add additional raid-z's to a zpool, but if you have a raid-z with 5 drives you can't add 1 more drive to it as part of the same raid-z. That means that it gets treated as a mirror and not a stripe, and that means that if you add 10 drives in this manner one at a time you get 5 drives of capacity and not 9.

Btrfs targets making raids re-shapeable, just like mdadm. But in general COW makes a LOT more sense with RAID, because the layer-breaking allows them to often avoid read-modify-write cycles by writing complete stripes more often, and files aren't modified in place so you can consolidate changes for many files into a single stripe (granted, that can cause fragmentation). ZFS has all those advantages, being COW, as will btrfs when it is ready for prime time.

Rich ^ permalink raw reply [flat|nested] 46+ messages in thread
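For contrast, the kind of mdadm reshape being referred to looks roughly like the sketch below. The device names are placeholders; a reshape runs in the background and can take many hours, and the filesystem still has to be grown separately afterwards:

# add a new disk as a spare, then grow a 5-device array to 6 active devices
mdadm --add /dev/md0 /dev/sdf1
mdadm --grow /dev/md0 --raid-devices=6
# watch the reshape progress
cat /proc/mdstat
# once it finishes, grow the filesystem, e.g. for ext4:
resize2fs /dev/md0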
* Re: [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? 2013-06-20 19:10 [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? Mark Knecht ` (3 preceding siblings ...) 2013-06-22 12:49 ` [gentoo-amd64] " B Vance @ 2013-06-23 11:31 ` thegeezer 4 siblings, 0 replies; 46+ messages in thread From: thegeezer @ 2013-06-23 11:31 UTC (permalink / raw To: gentoo-amd64; +Cc: Mark Knecht

[-- Attachment #1: Type: text/plain, Size: 3887 bytes --]

Howdy,

My own 2c on the issue is to suggest LVM. It looks at things in a slightly different way and lets me treat all my disks as one large volume I can carve up.

It supports multi-way mirroring, so I can choose to create a volume for all my pictures which is on at least 3 drives.
It supports volume striping (RAID0), so I can put swap and scratch files there.
It does support other RAID levels, but I can't find where the scrub option is.
It supports volume concatenation, so I can just keep growing my MythTV recordings volume by adding another disk.
It supports encrypted volumes, so I can put all my guarded stuff in there.
It supports (with some magic) nested volumes, so I can have an encrypted volume sitting inside a mirrored volume so my secrets are protected.

I can partition my drives in 3 parts, so that I can create a volume group of fast, medium and slow based on where on the disk the partition is (start track ~150MB/sec, end track ~60MB/sec; numbers sort of remembered, sort of made up). I can have a bunch of disks for long-term storage and hdparm can spin them down all the time. Live movement, even of a root volume, also means that I can keep moving storage to the storage drives, or decide to use a fast disk as a storage disk and have that spin down too.

I think the crucial aspect is to also consider what you wish to put on the drives. If it is just pr0n, do you really care if it gets lost? If it is just scratch areas that need to be fast, ditto. Where the different RAIDs are good is the use of parity, so you don't lose half of your potential storage size as you would with a mirror. Bit rot is real; all it takes is a single misaligned charged particle from that nuclear furnace in the sky to knock a single bit out of magnetic alignment, so it will require regular scrubbing, maybe in a cron job.
https://wiki.archlinux.org/index.php/Software_RAID_and_LVM#Data_scrubbing

Specifically on the bandwidth issue, I'd suggest:
1. take all the drives out of RAID if you can and run a benchmark against them individually; I like the benchmark tool in palimpsest, but that's me.
2. concurrently run dd if=/dev/zero of=/dev/sdX on all drives and see how it compares to the individual scores; this will show you the computer mainboard/chipset effect.
3. you might find https://raid.wiki.kernel.org/index.php/RAID_setup#Calculation a good starting point for calculating strides and stripes, and http://forums.gentoo.org/viewtopic-t-942794-start-0.html shows the benefit of adjusting the numbers.

hope this helps!

On 06/20/2013 08:10 PM, Mark Knecht wrote:
<SNIP>
[-- Attachment #2: Type: text/html, Size: 5389 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
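As a rough illustration of the sort of LVM layout suggested above, a minimal sketch follows; the volume group name, sizes and device names are all invented for the example. Growing any of these volumes later is an lvextend plus the matching filesystem resize:

# pool four disks (or partitions) into one volume group
pvcreate /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
vgcreate bigvg /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
# a volume kept as three mirrored copies for the pictures (-m 2 = two extra copies)
lvcreate -m 2 -L 200G -n pictures bigvg
# a striped (RAID0-style) volume for swap and scratch space
lvcreate -i 4 -I 64 -L 32G -n scratch bigvg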