From: Duncan <1i5t5.duncan@cox.net>
To: gentoo-amd64@lists.gentoo.org
Subject: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value?
Date: Sat, 22 Jun 2013 14:23:36 +0000 (UTC)

Mark Knecht posted on Fri, 21 Jun 2013 10:40:48 -0700 as excerpted:

> On Fri, Jun 21, 2013 at 12:31 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>
> Wonderful post but much too long to carry on a conversation
> in-line.

FWIW... I'd have a hard time doing much of anything else, these days, no matter the size. Otherwise, I'd be likely to forget a point. But I do try to snip or summarize when possible.

And I do understand your choice and agree with it for you. It's just not one I'd find workable for me... which is why I'm back to inline, here.

> As you sound pretty sure of your understanding/history I'll
> assume you're right 100% of the time, but only maybe 80% of the post
> feels right to me at this time so let's assume I have much to learn and
> go from there.

That's a very nice way of saying "I'll have to verify that before I can fully agree, but we'll go with it for now." I'll have to remember it! =:^)

> In thinking about this issue this morning I think it's important to
> me to get down to basics and verify as much as possible, step-by-step,
> so that I don't layer good work on top of bad assumptions.

Extremely reasonable approach. =:^)

> Basic Machine - ASUS Rampage II Extreme motherboard (4/1/2010) + 24GB
> DDR3 + Core i7-980x Extreme 12 core processor

That's a very impressive base. But as you point out elsewhere, you use it. Multiple VMs running MS should well use both the dozen cores and the 24 gig RAM.

As an aside, it's interesting how well your dozen cores and 24 gig RAM fit my basic two-gigs-per-core rule of thumb.
Obviously I'd consider that reasonably well balanced RAM/cores-wise.

> 1 SDD - 120GB SATA3 on it's own controller
> 5+ HDD - WD5002ABYS RAID Edition 3 SATA3 drives
> using Intel integrated controllers
>
> (NOTE: I can possibly go to a 6-drive RAID if I made some changes in the
> box but that's for later)
>
> According to the WD spec
> (http://www.wdc.com/en/library/spec/2879-701281.pdf) the 500GB drives

OK, single 120 gig main drive (SSD), 5 half-TB drives for the raid.

> [...] sustain 113MB/S to the drive. Using hdparm I measure 107MB/S
> or higher for all 5 drives [...]
> The SDD on it's own PCI Express controller clocks in at about 250MB/S
> for reads.

OK. But there's a caveat on the measured "spinning rust" speeds. You're effectively getting "near best case".

I suppose you're familiar with absolute velocity vs rotational velocity vs distance from center. Think merry-go-round as a kid or crack-the-whip as a teen (or insert your own experience here). The closer to the center you are, the slower you go at the same rotational speed (RPM). Conversely, the farther from the center you are, the faster you're actually moving at the same RPM.

Rotational disk data I/O rates have a similar effect -- data toward the outside edge of the platter (beginning of the disk) is faster to read/write, while data toward the inside edge (center) is slower.

Based on my own hdparm tests on partitioned drives where I knew the location of the partition, vs. the results for the drive as a whole, the speed reported for a rotational drive as a whole is the speed near the outside edge (beginning of the disk).

Thus, it'd be rather interesting to partition up one of those drives with a small partition at the beginning and another at the end, and do an hdparm -t of each, as well as of the whole disk. I bet you'd find the one at the end reports rather lower numbers, while the report for the drive as a whole is similar to that of the partition near the beginning of the drive, much faster. (Concrete example commands are sketched a bit further down.)

A good SSD won't have this same sort of variance: the latency to any of its flash, at least as presented by the firmware (which should deal with any variance as it distributes wear), should be similar. (Cheap SSDs and standard USB thumbdrive flash storage work differently, however. Often they assume FAT and have a small amount of fast and resilient but expensive SLC flash at the beginning, where the FAT would be, with the rest of the device much slower and less resilient to rewrites but far cheaper MLC. I was just reading about this recently as I researched my own SSDs.)

> TESTING: I'm using dd to test. It gives an easy to read anyway result
> and seems to be used a lot. I can use bonnie++ or IOzone later but I
> don't think that's necessary quite yet.

Agreed.

> Being that I have 24GB and don't
> want cached data to effect the test speeds I do the following:
>
> 1) Using dd I created a 50GB file for copying using the following
> commands:
>
> cd /mnt/fastVM
> dd if=/dev/random of=random1 bs=1000 count=0 seek=$[1000*1000*50]

It'd be interesting to see what the reported speed is here... See below for more.

> 2) To ensure that nothing is cached and the copies are (hopefully)
> completely fair as root I do the following between each test:
>
> sync
> free -h
> echo 3 > /proc/sys/vm/drop_caches
> free -h

Good job. =:^)

> 3) As a first test I copy using dd the 50GB file from the SDD to the
> RAID6.

OK, that answered the question I had about where that file you created actually was -- on the SSD.
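Coming back to the outer-vs-inner-edge point above, the comparison I was suggesting would look something like the following. (The device and partition names are placeholders for illustration, not your actual layout -- substitute whichever drive you scratch-partition for the test.)

  hdparm -t /dev/sdX     # whole drive: reads near the start, so effectively the fast outer edge
  hdparm -t /dev/sdX1    # small partition at the beginning of the disk (outer edge)
  hdparm -t /dev/sdX2    # small partition at the end of the disk (inner edge; expect lower numbers)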
> As long as reading the SDD is much faster than writing the RAID6
> then it should be a test of primarily the RAID6 write speed:
>
> dd if=/mnt/fastVM/random1 of=SDDCopy
> 97656250+0 records in
> 97656250+0 records out
> 50000000000 bytes (50 GB) copied, 339.173 s, 147 MB/s
>
> If I clear cache as above and rerun the test it's always 145-155MB/S

... Assuming $PWD is now on the raid. You had the path shown too, which I snipped, but that doesn't tell /me/ (as opposed to you, who should know based on your mounts) anything about whether it's on the raid or not. However, the above, including the drop-caches, demonstrates enough care that I'm quite confident you'd not make /that/ mistake.

> 4) As a second test I read from the RAID6 and write back to the RAID6.
> I see MUCH lower speeds, again repeatable:
>
> dd if=SDDCopy of=HDDWrite
> 97656250+0 records in
> 97656250+0 records out
> 50000000000 bytes (50 GB) copied, 1187.07 s, 42.1 MB/s
>
> 5) As a final test, and just looking for problems if any, I do an SDD to
> SDD copy which clocked in at close to 200MB/S
>
> dd if=random1 of=SDDCopy
> 97656250+0 records in
> 97656250+0 records out
> 50000000000 bytes (50 GB) copied, 251.105 s, 199 MB/s
>
> So, being that this RAID6 was grown yesterday from something that
> has existed for a year or two I'm not sure of it's fragmentation, or
> even how to determine that at this time. However it seems my problem are
> RAID6 reads, not RAID6 writes, at least to new an probably never used
> disk space.

Reading all that, one question occurs to me. If you want to test read and write separately, why the intermediate step of dd-ing from /dev/random to the ssd, then from the ssd to the raid or the ssd? Why not dd directly: if=/dev/random (or urandom, see note below) of=/desired/target ... for the write tests, and then (after dropping caches) if=/desired/target of=/dev/null ... for the read tests? That way there's just the one block device involved, not both.

/dev/random note: I presume with that hardware you have one of the newer CPUs with the new Intel hardware random instruction, with the appropriate kernel config hooking it into /dev/random, and/or otherwise have /dev/random hooked up to a hardware random number generator. Otherwise, using that much random data could block until more suitably random data is generated from approved kernel sources.

Thus, the following probably doesn't apply to you, but it may well apply to others, and it's good practice in any case, unless you KNOW your random isn't going to block due to hardware generation -- and even then it's worth noting when you're posting examples like the above.

In general, for tests such as this where a LOT of random data is needed, but cryptographic-quality randomness isn't necessarily required, use /dev/urandom. In the event that real-random data gets too low, /dev/urandom will switch to pseudo-random generation, which should be "good enough" for this sort of usage. /dev/random, OTOH, will block until it gets more random data from sources the kernel trusts to be truly random. On some machines with relatively limited sources of randomness the kernel considers truly random, therefore, just grabbing 50 GB of data from /dev/random could take QUITE some time (days, maybe? I don't know). Obviously you don't have /too/ big a problem with it as you got the data from /dev/random, but it's worth noting.
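(As a quick aside, if you want to see how much entropy the kernel currently thinks it has on hand -- and thus how likely /dev/random is to block on a big read -- this is one way to peek at it. Just an illustration; the value is the kernel's estimate in bits:)

  cat /proc/sys/kernel/random/entropy_avail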
If your machine has a hardware random generator hooked into /dev/random, then /dev/urandom will never switch to pseudo-random in any case, so for tests of anything above /kilobytes/ of random data (and even at that...), just use urandom and you won't have to worry about it either way.

OTOH, if you're generating an SSH key or something, always use /dev/random as that needs cryptographic-security-level randomness, but that'll take just a few bytes of randomness, not kilobytes let alone gigabytes, and if your hardware doesn't have good randomness and it does block, wiggling the mouse around a bit (that assumes a local command, obviously; remote could require something other than the mouse) should give it enough randomness to continue.

Meanwhile, dd-ing either from /dev/urandom as source, or to /dev/null as sink, with only the test-target block device as a real block device, should give you "purer" read-only and write-only tests. In theory it shouldn't matter much given your method of testing, but as we all know, theory and reality aren't always well aligned.

Of course the next question follows on from the above. I see a write to the raid, and a copy from the raid to the raid, so read/write on the raid, and a copy from the ssd to the ssd, read/write on it, but no test of a read from the raid by itself.

So if=/dev/urandom of=/mnt/raid/target ... should give you raid write. Drop caches, then if=/mnt/raid/target of=/dev/null ... should give you raid read. *THEN* we have good numbers on both to compare the raid read/write to.

What I suspect you'll find, unless fragmentation IS your problem, is that both read (from the raid) alone and write (to the raid) alone should be much faster than read/write (from/to the raid). The problem with read/write is that you're on "rotating rust" hardware and there's some latency as it repositions the heads from the read location to the write location and back.

If I'm correct and that's what you find, a workaround specific to dd would be to specify a much larger block size, so it reads in far more data at once, then writes it out at once, with far fewer switches between modes. In the above you didn't specify bs (or the separate input/output equivalents, ibs/obs respectively) at all, so it's using the 512-byte default blocksize.

From what I know of hardware, 64KB is a standard read-ahead, so in theory you should see improvements using larger block sizes up to at LEAST that size, and on a 5-disk raid6, probably 3X that, 192KB, which should in theory do a full 64KB buffer on each of the three data drives of the 5-way raid6 (the other two being parity).

I'm guessing you'll see a "knee" at the 192 KB block size (that's binary-K, powers of 2, not powers of 10, BTW), and above that you might see improvement, but not nearly as much, since the hardware should already be doing the full 64KB blocks it's optimized for.

There's likely to be another knee at the 16MB point (again, power of two, not ten), or more accurately, the 48MB point (3*16MB), since that's the size of the device hardware buffers (again, three devices' worth of data-stripe, since the other two are parity, 3*16MB=48MB). Above that, theory says you'll see even less improvement, since the caches will be full and any improvement still seen should be purely that of fewer switches between read/write mode and thus fewer seeks.

But it'd be interesting to see how closely theory matches reality; there's very possibly a fly in that theoretical ointment somewhere. =:^\

Of course configurable block size is specific to dd.
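To make that concrete, here's a rough sketch of the two separate tests with a larger block size, along the lines of the above. (The mount point /mnt/raid and the file name ddtest are assumptions for illustration, not your actual paths; see the urandom note above too.)

  # write-only test: urandom as source, the raid as the only real block device
  dd if=/dev/urandom of=/mnt/raid/ddtest bs=192K count=262144    # roughly 48 GiB

  # drop caches between runs, as you're already doing
  sync
  echo 3 > /proc/sys/vm/drop_caches

  # read-only test: the raid as source, /dev/null as sink
  dd if=/mnt/raid/ddtest of=/dev/null bs=192K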
Real-life file transfers may well be quite a different story. That's where the chunk size, stripe size, etc. stuff comes in, setting the defaults for the kernel for that device, and again, I'll freely admit to not knowing as much as I could in that area.

> I will also report more later but I can state that just using top
> there's never much CPU usage doing this but a LOT of WAIT time when
> reading the RAID6. It really appears the system is spinning it's wheels
> waiting for the RAID to get data from the disk.

When you're dealing with spinning rust, any time you have a transfer of any size (certainly GB), you WILL see high wait times. Disks are simply SLOW. Even SSDs are no match for system memory, tho they're enough closer to help a lot, and can be close enough that the bottleneck is elsewhere. (Modern SSDs saturate the SATA-600 links with thruput above 500 MByte/sec, making the SATA-600 bus the bottleneck, or the 1x PCI-E 2.x link if that's what it's running on, since those saturate at 485 MByte/sec or so, tho PCI-E 3.x is double that, so nearly a GByte/sec, and a single SATA-600 won't saturate that. Modern DDR3 SDRAM by comparison runs 10+ GByte/sec LOW end, two orders of magnitude faster. Numbers fresh from wikipedia, BTW.)

> One place where I wanted to double check your thinking. My thought
> is that a RAID1 will _NEVER_ outperform the hdparm -tT read speeds as it
> has to read from three drives and make sure they are all good before
> returning data to the user. I don't see how that could ever be faster
> than what a single drive file system could do which for these drives
> would be the 113MB/S WD spec number, correct? As I'm currently getting
> 145MB/S it appears on the surface that the RAID6 is providing some
> value, at least in these early days of use. Maybe it will degrade over
> time though.

As someone else already posted, that's NOT correct. Neither raid1 nor raid6, at least in the mdraid implementations, verifies the data on read. Raid1 doesn't have parity at all, just multiple copies, and raid6 has parity but only uses it for rebuilds, NOT to check data integrity under normal usage -- it too simply reads the data and returns it.

What raid1 does (when it's getting short reads, only one at a time) is send the request to every spindle. The first one that returns the data wins; the others simply get their returns thrown away. So under small, one-at-a-time reading conditions, the speed of raid1 reads should be the speed of the fastest disk in the bunch.

The raid1 read advantage is in the fact that there's often more than one read going on at once, or that the read is big enough to split up, so different spindles can be seeking to and reading different parts of the request in parallel. (This also helps in fragmented-file conditions as long as fragmentation isn't overwhelming, since a raid1 can then send different spindle heads to read the different segments in parallel, instead of reading one at a time serially, as it would have to do in the single-spindle case.)

In theory, the stripes of raid6 /can/ lead to better thruput for reads. In fact, my experience both with raid6 and with raid0 demonstrates that not to be the case as often as one might expect, due either to small reads or to fragmentation breaking up the big reads, thus negating the theoretical thruput advantage of multiple stripes.

To be fair, my raid0 experience was, as I mentioned earlier, with files I could easily redownload from the net, mostly the portage tree and overlays, along with the kernel git tree.
Due to the frequency of update and the fast rate of change, as well as the small files, fragmentation was quite a problem, and the files were small enough that I likely wouldn't have seen the full benefit of the 4-way raid0 stripes in any case, so that wasn't a best-case test scenario. But it's what one practically puts on raid0, because it IS easily redownloaded from the net, so it DOESN'T matter that the loss of any of the raid0 component devices will kill the entire thing.

If I'd been using the raid0 for much bigger media files, mp3s or video of megabytes in size minimum, that get saved and never changed so there's little fragmentation, I expect my raid0 experience would have been *FAR* better. But at the same time, that's not the type of data it generally makes SENSE to store on a raid0 without backups or redundancy of any sort -- unless it's simply VDR files where, if a device drops from the raid and you lose them, you don't particularly care (which would make a GREAT raid0 candidate), so...

Raid6 is the stripes of raid0, plus two-way parity. Since the parity is ignored for reads, for them it's effectively a raid0 with two fewer stripes than the number of devices. Thus your 5-device raid6 is effectively a 3-device raid0 in terms of reads. In theory, thruput for large reads done by themselves should be pretty good -- three times that of a single device. In fact... due either to multiple jobs happening at once, or to a mix of read/write happening at once, or to fragmentation, I was disappointed, and far happier with raid1.

But your situation is indeed rather different than mine, and depending on how much writing happens in those big VM files and how the filesystem you choose handles fragmentation, you could be rather happier with raid6 than I was. But I'd still suggest you try raid1 if the amount of data you're handling will let you.

Honestly, it surprised me how well raid1 did for me. I wasn't prepared for that at all, and I believe the comparison with what I was getting on raid6 is what colored my opinion of raid6 so badly. I had NO IDEA there would be that much difference! But your experience may indeed be different. The only way to know is to try it.

However, one thing I either overlooked or that hasn't been posted yet is just how much data you're talking about. You're running five 500-gig drives in raid6 now, which should give you 3*500=1500 gigs (10-power) capacity.

If it's under a third full, 500 GB (10-power), you can go raid1 with as many mirrors as you like of the five, and keep the rest of them for hot-spares or whatever.

If you're running (or plan to be running) near capacity, over 2/3 full, 1 TB (10-power), you really don't have much option but raid6.

If you're in between, 1/3 to 2/3 full, 500-1000 GB (10-power), then a raid10 is possible, perhaps 4-spindle with the 5th as a hot-spare.

(A spindle configured as a hot-spare is kept unused but ready for use by mdadm and the kernel. If a spindle should drop out, the hot-spare is automatically inserted in its place and a rebuild immediately started. This narrows the danger zone during which you're degraded and at risk if further spindles drop out, because handling is automatic, so you're back to full, un-degraded operation as soon as possible. However, it doesn't eliminate that danger zone should another spindle drop out during the rebuild, which is after all quite stressful on the remaining drives, due to all the reading going on, so the risk is greater during a rebuild than under normal operation.)
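For reference, if you do end up in that middle range and try the 4-spindle raid10 with the 5th drive as a hot-spare, the mdadm invocation would look roughly like this. (The device names and the md number are placeholders, not your actual drives -- double-check everything before running anything destructive.)

  mdadm --create /dev/md0 --level=10 --raid-devices=4 --spare-devices=1 \
        /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1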
So if you're over 2/3 full, or expect to be in short order, there's little sense in further debate on at least /your/ raid6, as that's pretty much what you're stuck with. (Unless you can categorize some data as more important than the rest, and raid that, while the rest can be considered worth the risk of loss if a device goes, in which case we're back in play with other options once again.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman