From: Duncan <1i5t5.duncan@cox.net>
To: gentoo-amd64@lists.gentoo.org
Subject: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value?
Date: Sat, 22 Jun 2013 14:23:36 +0000 (UTC)

Mark Knecht posted on Fri, 21 Jun 2013 10:40:48 -0700 as excerpted:

> On Fri, Jun 21, 2013 at 12:31 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>
> Wonderful post but much too long to carry on a conversation
> in-line.

FWIW... I'd have a hard time doing much of anything else, these days, no matter the size. Otherwise, I'd be likely to forget a point. But I do try to snip or summarize when possible.

And I do understand your choice and agree with it for you. It's just not one I'd find workable for me... which is why I'm back to inline, here.

> As you sound pretty sure of your understanding/history I'll
> assume you're right 100% of the time, but only maybe 80% of the post
> feels right to me at this time so let's assume I have much to learn and
> go from there.

That's a very nice way of saying "I'll have to verify that before I can fully agree, but we'll go with it for now." I'll have to remember it! =:^)

> In thinking about this issue this morning I think it's important to
> me to get down to basics and verify as much as possible, step-by-step,
> so that I don't layer good work on top of bad assumptions.

Extremely reasonable approach. =:^)

> Basic Machine - ASUS Rampage II Extreme motherboard (4/1/2010) + 24GB
> DDR3 + Core i7-980x Extreme 12 core processor

That's a very impressive base. But as you point out elsewhere, you use it. Multiple VMs running MS should well use both the dozen cores and the 24 gig RAM.

As an aside, it's interesting how well your dozen cores and 24 gig RAM fit my basic two-gigs-per-core rule of thumb.
Obviously I'd consider that reasonably well balanced RAM/cores-wise.

> 1 SDD - 120GB SATA3 on it's own controller
> 5+ HDD - WD5002ABYS RAID Edition 3 SATA3 drives
> using Intel integrated controllers
>
> (NOTE: I can possibly go to a 6-drive RAID if I made some changes in the
> box but that's for later)
>
> According to the WD spec
> (http://www.wdc.com/en/library/spec/2879-701281.pdf) the 500GB drives

OK, single 120 gig main drive (SSD), 5 half-TB drives for the raid.

> [...] sustain 113MB/S to the drive. Using hdparm I measure 107MB/S
> or higher for all 5 drives [...]
> The SDD on it's own PCI Express controller clocks in at about 250MB/S
> for reads.

OK. But there's a caveat on the measured "spinning rust" speeds. You're effectively getting "near best case".

I suppose you're familiar with absolute velocity vs rotational velocity vs distance from center. Think merry-go-round as a kid or crack-the-whip as a teen (or insert your own experience here). The closer to the center you are, the slower you go at the same rotational speed (RPM). Conversely, the farther from the center you are, the faster you're actually moving at the same RPM.

Rotational disk data I/O rates have a similar effect -- data toward the outside edge of the platter (beginning of the disk) is faster to read/write, while data toward the inside edge (center) is slower.

Based on my own hdparm tests on partitioned drives where I knew the location of the partition, vs. the results for the drive as a whole, the speed reported for a rotational drive as a whole is the speed near the outside edge (beginning of the disk).

Thus, it'd be rather interesting to partition up one of those drives with a small partition at the beginning and another at the end, and do an hdparm -t of each, as well as of the whole disk. I bet you'd find the one at the end reports rather lower numbers, while the report for the drive as a whole is similar to that of the partition near the beginning of the drive, much faster. (Concrete example commands are sketched a bit further down.)

A good SSD won't have this same sort of variance: the latency to any of its flash, at least as presented by the firmware (which should deal with any variance as it distributes wear), should be similar. (Cheap SSDs and standard USB thumbdrive flash storage work differently, however. Often they assume FAT and have a small amount of fast and resilient but expensive SLC flash at the beginning, where the FAT would be, with the rest of the device much slower and less resilient to rewrites but far cheaper MLC. I was just reading about this recently as I researched my own SSDs.)

> TESTING: I'm using dd to test. It gives an easy to read anyway result
> and seems to be used a lot. I can use bonnie++ or IOzone later but I
> don't think that's necessary quite yet.

Agreed.

> Being that I have 24GB and don't
> want cached data to effect the test speeds I do the following:
>
> 1) Using dd I created a 50GB file for copying using the following
> commands:
>
> cd /mnt/fastVM
> dd if=/dev/random of=random1 bs=1000 count=0 seek=$[1000*1000*50]

It'd be interesting to see what the reported speed is here... See below for more.

> 2) To ensure that nothing is cached and the copies are (hopefully)
> completely fair as root I do the following between each test:
>
> sync
> free -h
> echo 3 > /proc/sys/vm/drop_caches
> free -h

Good job. =:^)

> 3) As a first test I copy using dd the 50GB file from the SDD to the
> RAID6.

OK, that answered the question I had about where that file you created actually was -- on the SSD.
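Coming back to the outer-vs-inner-edge point above, the comparison I was suggesting would look something like the following. (The device and partition names are placeholders for illustration, not your actual layout -- substitute whichever drive you scratch-partition for the test.)

  hdparm -t /dev/sdX     # whole drive: reads near the start, so effectively the fast outer edge
  hdparm -t /dev/sdX1    # small partition at the beginning of the disk (outer edge)
  hdparm -t /dev/sdX2    # small partition at the end of the disk (inner edge; expect lower numbers)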
> As long as reading the SDD is much faster than writing the RAID6
> then it should be a test of primarily the RAID6 write speed:
>
> dd if=/mnt/fastVM/random1 of=SDDCopy
> 97656250+0 records in
> 97656250+0 records out
> 50000000000 bytes (50 GB) copied, 339.173 s, 147 MB/s
>
> If I clear cache as above and rerun the test it's always 145-155MB/S

... Assuming $PWD is now on the raid. You had the path shown too, which I snipped, but that doesn't tell /me/ (as opposed to you, who should know based on your mounts) anything about whether it's on the raid or not. However, the above, including the drop-caches, demonstrates enough care that I'm quite confident you'd not make /that/ mistake.

> 4) As a second test I read from the RAID6 and write back to the RAID6.
> I see MUCH lower speeds, again repeatable:
>
> dd if=SDDCopy of=HDDWrite
> 97656250+0 records in
> 97656250+0 records out
> 50000000000 bytes (50 GB) copied, 1187.07 s, 42.1 MB/s
>
> 5) As a final test, and just looking for problems if any, I do an SDD to
> SDD copy which clocked in at close to 200MB/S
>
> dd if=random1 of=SDDCopy
> 97656250+0 records in
> 97656250+0 records out
> 50000000000 bytes (50 GB) copied, 251.105 s, 199 MB/s
>
> So, being that this RAID6 was grown yesterday from something that
> has existed for a year or two I'm not sure of it's fragmentation, or
> even how to determine that at this time. However it seems my problem are
> RAID6 reads, not RAID6 writes, at least to new an probably never used
> disk space.

Reading all that, one question occurs to me. If you want to test read and write separately, why the intermediate step of dd-ing from /dev/random to the ssd, then from the ssd to the raid or the ssd? Why not dd directly: if=/dev/random (or urandom, see note below) of=/desired/target ... for the write tests, and then (after dropping caches) if=/desired/target of=/dev/null ... for the read tests? That way there's just the one block device involved, not both.

/dev/random note: I presume with that hardware you have one of the newer CPUs with the new Intel hardware random instruction, with the appropriate kernel config hooking it into /dev/random, and/or otherwise have /dev/random hooked up to a hardware random number generator. Otherwise, using that much random data could block until more suitably random data is generated from approved kernel sources.

Thus, the following probably doesn't apply to you, but it may well apply to others, and it's good practice in any case, unless you KNOW your random isn't going to block due to hardware generation -- and even then it's worth noting when you're posting examples like the above.

In general, for tests such as this where a LOT of random data is needed, but cryptographic-quality randomness isn't necessarily required, use /dev/urandom. In the event that real-random data gets too low, /dev/urandom will switch to pseudo-random generation, which should be "good enough" for this sort of usage. /dev/random, OTOH, will block until it gets more random data from sources the kernel trusts to be truly random. On some machines with relatively limited sources of randomness the kernel considers truly random, therefore, just grabbing 50 GB of data from /dev/random could take QUITE some time (days, maybe? I don't know). Obviously you don't have /too/ big a problem with it as you got the data from /dev/random, but it's worth noting.
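(As a quick aside, if you want to see how much entropy the kernel currently thinks it has on hand -- and thus how likely /dev/random is to block on a big read -- this is one way to peek at it. Just an illustration; the value is the kernel's estimate in bits:)

  cat /proc/sys/kernel/random/entropy_avail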
If your machine has a hardware random generator hooked into /dev/random, then /dev/urandom will never switch to pseudo-random in any case, so for tests of anything above /kilobytes/ of random data (and even at that...), just use urandom and you won't have to worry about it either way.

OTOH, if you're generating an SSH key or something, always use /dev/random as that needs cryptographic-security-level randomness, but that'll take just a few bytes of randomness, not kilobytes let alone gigabytes, and if your hardware doesn't have good randomness and it does block, wiggling the mouse around a bit (that assumes a local command, obviously; remote could require something other than the mouse) should give it enough randomness to continue.

Meanwhile, dd-ing either from /dev/urandom as source, or to /dev/null as sink, with only the test-target block device as a real block device, should give you "purer" read-only and write-only tests. In theory it shouldn't matter much given your method of testing, but as we all know, theory and reality aren't always well aligned.

Of course the next question follows on from the above. I see a write to the raid, and a copy from the raid to the raid, so read/write on the raid, and a copy from the ssd to the ssd, read/write on it, but no test of a read from the raid by itself.

So if=/dev/urandom of=/mnt/raid/target ... should give you raid write. Drop caches, then if=/mnt/raid/target of=/dev/null ... should give you raid read. *THEN* we have good numbers on both to compare the raid read/write to.

What I suspect you'll find, unless fragmentation IS your problem, is that both read (from the raid) alone and write (to the raid) alone should be much faster than read/write (from/to the raid). The problem with read/write is that you're on "rotating rust" hardware and there's some latency as it repositions the heads from the read location to the write location and back.

If I'm correct and that's what you find, a workaround specific to dd would be to specify a much larger block size, so it reads in far more data at once, then writes it out at once, with far fewer switches between modes. In the above you didn't specify bs (or the separate input/output equivalents, ibs/obs respectively) at all, so it's using the 512-byte default blocksize.

From what I know of hardware, 64KB is a standard read-ahead, so in theory you should see improvements using larger block sizes up to at LEAST that size, and on a 5-disk raid6, probably 3X that, 192KB, which should in theory do a full 64KB buffer on each of the three data drives of the 5-way raid6 (the other two being parity).

I'm guessing you'll see a "knee" at the 192 KB block size (that's binary-K, powers of 2, not powers of 10, BTW), and above that you might see improvement, but not nearly as much, since the hardware should already be doing the full 64KB blocks it's optimized for.

There's likely to be another knee at the 16MB point (again, power of two, not ten), or more accurately, the 48MB point (3*16MB), since that's the size of the device hardware buffers (again, three devices' worth of data-stripe, since the other two are parity, 3*16MB=48MB). Above that, theory says you'll see even less improvement, since the caches will be full and any improvement still seen should be purely that of fewer switches between read/write mode and thus fewer seeks.

But it'd be interesting to see how closely theory matches reality; there's very possibly a fly in that theoretical ointment somewhere. =:^\

Of course configurable block size is specific to dd.
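To make that concrete, here's a rough sketch of the two separate tests with a larger block size, along the lines of the above. (The mount point /mnt/raid and the file name ddtest are assumptions for illustration, not your actual paths; see the urandom note above too.)

  # write-only test: urandom as source, the raid as the only real block device
  dd if=/dev/urandom of=/mnt/raid/ddtest bs=192K count=262144    # roughly 48 GiB

  # drop caches between runs, as you're already doing
  sync
  echo 3 > /proc/sys/vm/drop_caches

  # read-only test: the raid as source, /dev/null as sink
  dd if=/mnt/raid/ddtest of=/dev/null bs=192K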
Real-life file transfers may well be quite a different story. That's where the chunk size, stripe size, etc. stuff comes in, setting the defaults for the kernel for that device, and again, I'll freely admit to not knowing as much as I could in that area.

> I will also report more later but I can state that just using top
> there's never much CPU usage doing this but a LOT of WAIT time when
> reading the RAID6. It really appears the system is spinning it's wheels
> waiting for the RAID to get data from the disk.

When you're dealing with spinning rust, any time you have a transfer of any size (certainly GB), you WILL see high wait times. Disks are simply SLOW. Even SSDs are no match for system memory, tho they're enough closer to help a lot, and can be close enough that the bottleneck is elsewhere. (Modern SSDs saturate the SATA-600 links with thruput above 500 MByte/sec, making the SATA-600 bus the bottleneck, or the 1x PCI-E 2.x link if that's what it's running on, since those saturate at 485 MByte/sec or so, tho PCI-E 3.x is double that, so nearly a GByte/sec, and a single SATA-600 won't saturate that. Modern DDR3 SDRAM by comparison runs 10+ GByte/sec LOW end, two orders of magnitude faster. Numbers fresh from wikipedia, BTW.)

> One place where I wanted to double check your thinking. My thought
> is that a RAID1 will _NEVER_ outperform the hdparm -tT read speeds as it
> has to read from three drives and make sure they are all good before
> returning data to the user. I don't see how that could ever be faster
> than what a single drive file system could do which for these drives
> would be the 113MB/S WD spec number, correct? As I'm currently getting
> 145MB/S it appears on the surface that the RAID6 is providing some
> value, at least in these early days of use. Maybe it will degrade over
> time though.

As someone else already posted, that's NOT correct. Neither raid1 nor raid6, at least in the mdraid implementations, verifies the data on read. Raid1 doesn't have parity at all, just multiple copies, and raid6 has parity but only uses it for rebuilds, NOT to check data integrity under normal usage -- it too simply reads the data and returns it.

What raid1 does (when it's getting short reads, only one at a time) is send the request to every spindle. The first one that returns the data wins; the others simply get their returns thrown away. So under small, one-at-a-time reading conditions, the speed of raid1 reads should be the speed of the fastest disk in the bunch.

The raid1 read advantage is in the fact that there's often more than one read going on at once, or that the read is big enough to split up, so different spindles can be seeking to and reading different parts of the request in parallel. (This also helps in fragmented-file conditions as long as fragmentation isn't overwhelming, since a raid1 can then send different spindle heads to read the different segments in parallel, instead of reading one at a time serially, as it would have to do in the single-spindle case.)

In theory, the stripes of raid6 /can/ lead to better thruput for reads. In fact, my experience both with raid6 and with raid0 demonstrates that not to be the case as often as one might expect, due either to small reads or to fragmentation breaking up the big reads, thus negating the theoretical thruput advantage of multiple stripes.

To be fair, my raid0 experience was, as I mentioned earlier, with files I could easily redownload from the net, mostly the portage tree and overlays, along with the kernel git tree.
Due to the frequency of update and the fast rate of change, as well as the small files, fragmentation was quite a problem, and the files were small enough that I likely wouldn't have seen the full benefit of the 4-way raid0 stripes in any case, so that wasn't a best-case test scenario. But it's what one practically puts on raid0, because it IS easily redownloaded from the net, so it DOESN'T matter that the loss of any of the raid0 component devices will kill the entire thing.

If I'd been using the raid0 for much bigger media files, mp3s or video of megabytes in size minimum, that get saved and never changed so there's little fragmentation, I expect my raid0 experience would have been *FAR* better. But at the same time, that's not the type of data it generally makes SENSE to store on a raid0 without backups or redundancy of any sort -- unless it's simply VDR files where, if a device drops from the raid and you lose them, you don't particularly care (which would make a GREAT raid0 candidate), so...

Raid6 is the stripes of raid0, plus two-way parity. Since the parity is ignored for reads, for them it's effectively a raid0 with two fewer stripes than the number of devices. Thus your 5-device raid6 is effectively a 3-device raid0 in terms of reads. In theory, thruput for large reads done by themselves should be pretty good -- three times that of a single device. In fact... due either to multiple jobs happening at once, or to a mix of read/write happening at once, or to fragmentation, I was disappointed, and far happier with raid1.

But your situation is indeed rather different than mine, and depending on how much writing happens in those big VM files and how the filesystem you choose handles fragmentation, you could be rather happier with raid6 than I was. But I'd still suggest you try raid1 if the amount of data you're handling will let you.

Honestly, it surprised me how well raid1 did for me. I wasn't prepared for that at all, and I believe the comparison with what I was getting on raid6 is what colored my opinion of raid6 so badly. I had NO IDEA there would be that much difference! But your experience may indeed be different. The only way to know is to try it.

However, one thing I either overlooked or that hasn't been posted yet is just how much data you're talking about. You're running five 500-gig drives in raid6 now, which should give you 3*500=1500 gigs (10-power) capacity.

If it's under a third full, 500 GB (10-power), you can go raid1 with as many mirrors as you like of the five, and keep the rest of them for hot-spares or whatever.

If you're running (or plan to be running) near capacity, over 2/3 full, 1 TB (10-power), you really don't have much option but raid6.

If you're in between, 1/3 to 2/3 full, 500-1000 GB (10-power), then a raid10 is possible, perhaps 4-spindle with the 5th as a hot-spare.

(A spindle configured as a hot-spare is kept unused but ready for use by mdadm and the kernel. If a spindle should drop out, the hot-spare is automatically inserted in its place and a rebuild immediately started. This narrows the danger zone during which you're degraded and at risk if further spindles drop out, because handling is automatic, so you're back to full, un-degraded operation as soon as possible. However, it doesn't eliminate that danger zone should another spindle drop out during the rebuild, which is after all quite stressful on the remaining drives, due to all the reading going on, so the risk is greater during a rebuild than under normal operation.)
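For reference, if you do end up in that middle range and try the 4-spindle raid10 with the 5th drive as a hot-spare, the mdadm invocation would look roughly like this. (The device names and the md number are placeholders, not your actual drives -- double-check everything before running anything destructive.)

  mdadm --create /dev/md0 --level=10 --raid-devices=4 --spare-devices=1 \
        /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1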
So if you're over 2/3 full, or expect to be in short order, there's little sense in further debate on at least /your/ raid6, as that's pretty much what you're stuck with. (Unless you can categorize some data as more important than the rest, and raid that, while the rest can be considered worth the risk of loss if a device goes, in which case we're back in play with other options once again.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman