From: Duncan <1i5t5.duncan@cox.net>
To: gentoo-amd64@lists.gentoo.org
Subject: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value?
Date: Fri, 21 Jun 2013 14:27:38 +0000 (UTC)
User-Agent: Pan/0.140 (Chocolate Salty Balls; GIT 459f52e /usr/src/portage/src/egit-src/pan2)

Rich Freeman posted on Fri, 21 Jun 2013 06:28:35 -0400 as excerpted:

> On Fri, Jun 21, 2013 at 3:31 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>> So with 4k block sizes on a 5-device raid6, you'd have 20k stripes,
>> 12k in data across three devices, and 8k of parity across the other
>> two devices.
>
> With mdadm on a 5-device raid6 with 512K chunks you have 1.5M in a
> stripe, not 20k.  If you modify one block it needs to read all 1.5M,
> or it needs to read at least the old chunk on the single drive to be
> modified and both old parity chunks (which on such a small array is 3
> disks either way).

I'll admit to not fully understanding chunks/stripes/strides in terms
of actual size, tho I believe you're correct: it's well over the
filesystem block size, and a half-meg chunk is probably right.

However, the original post assumed a 4k block size, which is pretty
standard since that's the usual memory page size and therefore a
convenient filesystem block size as well, so that's what I was using as
the base for my numbers.  At a 4k block size, a 5-device raid6 stripe
would be 3*4k=12k of data plus 2*4k=8k of parity.

>> Fourth, back to the parity.  Remember, raid5/6 writes out all that
>> parity (but basically never reads it in normal operation, only when
>> degraded, in order to reconstruct the data from the missing
>> device(s)), yet doesn't actually use it for integrity checking.
>
> I wasn't aware of this - I can't believe it isn't even an option
> either.  Note to self - start doing weekly scrubs...

Indeed.
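For reference, a weekly md scrub just amounts to poking the array's
sysfs interface -- typically a cron job echoing "check" into
/sys/block/mdX/md/sync_action.  Here's a minimal sketch of the same
thing in Python, assuming (hypothetically) the array is /dev/md0 and
the script runs as root; adjust to taste:

#!/usr/bin/env python3
# Minimal md scrub sketch: ask md to verify every stripe, then report
# how many sectors disagreed.  Assumes /dev/md0 and root privileges.
import pathlib
import time

MD = pathlib.Path("/sys/block/md0/md")

def scrub_md():
    # "check" makes md read all members and compare data against
    # parity (raid5/6) or the mirror copies (raid1/10), without
    # rewriting anything.
    (MD / "sync_action").write_text("check\n")

    # sync_action reads back "idle" once the pass has finished.
    while (MD / "sync_action").read_text().strip() != "idle":
        time.sleep(60)

    # mismatch_cnt counts sectors where the copies/parity disagreed;
    # nonzero means it's time to investigate.
    print("mismatch_cnt:", (MD / "mismatch_cnt").read_text().strip())

if __name__ == "__main__":
    scrub_md()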
That's one of the things that frustrated me with mdraid -- all that
data integrity metadata sitting there, but just going to waste in
normal operation, used only for device recovery.  And recovery itself
can be a problem, because if there *IS* an undetected error (cosmic ray
or whatever) and a device then goes out, you'll lose integrity on the
rebuilt device as well (at least if it was a data device that dropped
out and not parity), because the parity no longer agrees with the
corrupted data and will thus reconstruct a bad copy of the data onto
the replacement device.

And it's one of the things which so attracted me to btrfs, too, and why
I was so frustrated to find it can only do single redundancy (two-way
mirroring), with no way to do more.  The btrfs sales pitch talks up
data integrity and the ability to go find a good copy when one copy
checks out bad, but what if the only other copy allowed is bad as well?
OOPS!  But as I said, N-way mirroring is on the btrfs roadmap; it's
simply not there yet.

>> The single down side to raid1 as opposed to raid5/6 is the loss of
>> the extra space made available by the data striping,
>> 3*single-device-space in the case of 5-way raid6 (or 4-way raid5)
>> vs. 1*single-device-space in the case of raid1.  Otherwise, no
>> contest, hands down, raid1 over raid6.
>
> This is a HUGE downside.  The only downside to raid1 over not having
> raid at all is that your disk space cost doubles.  raid5/6 is
> considerably cheaper in that regard.  In a 5-disk raid5 the cost of
> redundancy is only 25% more, vs a 100% additional cost for raid1.  To
> accomplish the same space as a 5-disk raid5 you'd need 8 disks.
> Sure, read performance would be vastly superior, but if you're going
> to spend $300 more on hard drives and whatever it takes to get so
> many SATA ports on your system you could instead add an extra 32GB of
> RAM or put your OS on a mirrored SSD.  I suspect that both of those
> options on a typical workload are going to make a far bigger
> improvement in performance.

I'd suggest that with the exception of large database servers, where
the object is to be able to cache the entire db in RAM, the SSDs are
likely the better option.

FWIW, my general "gentoo/amd64 user" rule of thumb is 1-2 gig base,
plus 1-2 gig per core.  Certainly that scale can slide either way, and
it'd probably slide down for folks not doing system rebuilds in tmpfs
as gentooers often do, but up or down, unless you put that RAM in a
battery-backed ramdisk, 32 gig is a LOT of RAM, even for an 8-core.

FWIW, with my old dual-dual-core (so four cores), 8 gig of RAM was
nicely roomy, tho I /did/ sometimes bump the top and end up either
dropping cache or swapping.  When I upgraded to the 6-core, I used that
rule of thumb and figured ~12 gig, but due to the power-of-two
efficiency rule I ended up with 16 gig, figuring that was more than I'd
use in practice, but better than limiting it to 8 gig.  I was right.
The 16 gig is certainly nice, but in reality I'm typically wasting
several gigs of it entirely, with not even cache filling it up.  I
typically run ~1 gig in application memory and several gigs in cache,
with only a few tens of MB in buffers.  But while I'll often exceed my
old capacity of 8 gig, it's seldom by much, and 12 gig would handle
everything including cache without dropping anything at well over the
90th percentile, probably 97% or thereabouts.
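To put that rule of thumb in concrete terms, here's the trivial
arithmetic as a sketch; the base and per-core figures are just the
hand-wavy 1-2 gig ranges above, nothing measured:

# "1-2 GiB base plus 1-2 GiB per core", expressed as a low/high range.
def ram_estimate_gib(cores):
    low = 1 + 1 * cores    # light use, no tmpfs builds
    high = 2 + 2 * cores   # heavy use, PORTAGE_TMPDIR in tmpfs, etc.
    return low, high

for cores in (4, 6, 8):
    low, high = ram_estimate_gib(cores)
    print("%d cores: roughly %d-%d GiB" % (cores, low, high))

An 8-core lands around 9-18 gig on that scale, which is why 32 gig
looks like overkill for most workloads.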
Even with parallel make at both the ebuild and the global portage level
and with PORTAGE_TMPDIR in tmpfs, I hit 100% on the cores well before I
run out of RAM and start dropping cache or swapping.  The only time
that has NOT been the case is when I deliberately saturate things, say
a kernel build with an open-ended -j so it stacks up several hundred
jobs at once.

Meanwhile, the paired SSDs in btrfs raid1 make a HUGE practical
difference, especially in things like the (cold-cache) portage tree
(and overlays) sync, kernel git pull, etc.  (In my case actual booting
didn't get a huge boost, as I run ntp-client and ntpd at boot and the
ntp-client time sync takes ~12 seconds, more than the rest of the boot
put together.  But cold-cache loading of kde happens faster now -- I
actually uninstalled ksplash and just go text-console login, to X black
screen, to the kde/plasma desktop now.  But the tree sync and kernel
pull are still the places I appreciate the SSDs most.)

And notably, because the cold-cache system is so much faster with the
SSDs, I tend to actually shut down instead of suspending now, so I tend
to cache even less and thus use less memory with the SSDs than before.
I /could/ probably do 8 gig of RAM now instead of 16 and not miss it.
Even a gig per core, 6 gig, wouldn't be terrible, tho below that things
would start to bottleneck and pinch a bit again, I suspect.

> Which is better really depends on your workload.  In my case much of
> my raid space is used by mythtv, or for storage of stuff I only
> occasionally use.  In these use cases the performance of the raid5 is
> more than adequate, and I'd rather be able to keep shows around for
> an extra 6 months in HD than have the DVR respond a millisecond
> faster when I hit play.  If you really have sustained random access
> of the bulk of your data then a raid1 would make much more sense.

Definitely.  For MythTV or similar massive media needs, raid5 will be
fast enough.  And I suspect single-device-loss tolerance is a
reasonable risk tradeoff for you too: after all it /is/ just media, so
tolerating the loss of one device is good, while the risk of losing a
second before a replacement finishes rebuilding is acceptable, given
the cost vs. size tradeoff against the massive size requirements of
video.

But again, the OP seemed to find his speed benchmarks disappointing, to
say the least, and I believe pointing to raid6 as the culprit is
accurate.  Given that he's running stock-trading VMs with
production-level reliability expectations, I'm guessing raid5/6 really
isn't the ideal match.  Massive media, yes, definitely.  Massive VMs,
not so much.

>> So several points on btrfs:
>>
>> 1) It's still in heavy development.
>
> That is what is keeping me away.  I won't touch it until I can use it
> with raid5, and the first commit containing that hit the kernel weeks
> ago I think (and it has known gaps).  Until it is stable I'm sticking
> with my current setup.

Question: Would you use it for raid1 yet, as I'm doing?  What about as
a single-device filesystem?  Do you consider my estimates of
reliability in those cases (almost but not quite stable for
single-device; raid1/raid0/raid10 somewhere in the middle, say a year
behind single-device; and raid5/6/50/60 about a year behind that)
reasonably accurate?

Because if you're waiting until btrfs raid5 is fully stable, that's
likely to be some wait yet -- I'd say a year, likely more, given that
everything btrfs has seemed to take longer than people expected.
But if you're simply waiting until it matures to roughly the point that
btrfs raid1 is at now, or maybe even a bit less -- certainly to where
the feature is complete, plus say a kernel release to work out a few
more wrinkles -- then that's quite possible by year-end.

>> 2) RAID levels work QUITE a bit differently on btrfs.  In
>> particular, what btrfs calls raid1 mode (with the same applying to
>> raid10) is simply two-way-mirroring, NO MATTER THE NUMBER OF
>> DEVICES.  There's no multi-way mirroring yet available
>
> Odd, for some reason I thought it let you specify arbitrary numbers
> of copies, but looking around I think you're right.  It does store
> two copies of metadata regardless of the number of drives unless you
> override this.

The default is single-copy data and dual-copy metadata, regardless of
the number of devices (a single device does DUP metadata by default,
two copies on the same device).  The exception is SSDs, where the
metadata default is single-copy, since many SSD firmwares dedup
identical data anyway.  (Sandforce firmware, with its compression
features, is known to do this.  Mine -- Corsair Neutron SSDs, aimed at
the server/workstation market where unpredictability isn't considered
a feature -- doesn't, since one of its selling points is stable
performance regardless of the data it's fed.)  At least that's the
explanation given for the SSD exception.

But the real gotcha is that there's no way to set up N-way (N>2)
redundancy on btrfs raid1/10, and I know for a fact that catches some
admins by nasty surprise, as I've seen it come up on the btrfs list as
well as had my own personal disappointment with it -- tho luckily I
did my research and figured that out before I actually installed on
btrfs.  I just wish they'd called it 2-way mirroring instead of raid1,
as that wouldn't be the labeling deception I consider the btrfs raid1
moniker to be at this point, and admins would be far less likely to be
caught unaware when a second device goes haywire that they /thought/
they were covered for.

Of course at this point it's all still development anyway, so no sane
admin is going to be lacking backups in any case, but there are a lot
of people flying by the seat of their pants out there who have NOT
done the research, and they show up frequently on the btrfs list,
after it's too late.  (Tho certainly there are fewer of them showing
up now than a year ago, when I first investigated btrfs -- I think
both due to btrfs maturing quite a bit since then and to a lot of the
original btrfs hype dying down, which is a good thing considering the
number of folks that were installing it, only to find out once they
lost data that it was still development.)

> However, if one considered raid1 expensive, having multiple layers
> of redundancy is REALLY expensive if you aren't using Reed-Solomon
> and many data disks.

Well, depending on the use case.  In your media case, certainly.
However, that's one of the few cases that still gobbles storage space
as fast as the manufacturers up their capacities, and it is likely to
continue doing so for at least a few more years, given that HD is
still coming in (so a lot of the media is still SD) and quad-HD is in
the wings as well, now.  But once we hit half-petabyte capacities, I
suppose even quad-HD won't be gobbling the space as fast as they can
upgrade it, any more.  So a half-decade or so, maybe?
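Stepping back to the raid1-is-exactly-two-copies point for a moment, a
back-of-the-envelope sketch of usable space makes the tradeoff
concrete.  It assumes equal-size devices and ignores metadata
overhead, so treat the numbers as illustration only:

# Usable space for N equal devices under a few layouts, ignoring
# metadata overhead.  The point: btrfs "raid1" keeps exactly two
# copies of each extent no matter how many devices are present.
def usable_tb(n_devices, size_tb):
    total = n_devices * size_tb
    return {
        "btrfs raid1 (2 copies)": total / 2.0,
        "true n-way mirror": size_tb,
        "md raid5": (n_devices - 1) * size_tb,
        "md raid6": (n_devices - 2) * size_tb,
    }

for layout, tb in usable_tb(5, 2.0).items():
    print("%-24s ~%.1f TB usable" % (layout, tb))

With five 2 TB devices that's ~5 TB usable for btrfs raid1 versus
~8 TB for raid5 and ~6 TB for raid6, but the btrfs raid1 case is still
only guaranteed to survive a single device loss, since only two copies
of anything ever exist.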
Plus of course the sheer bandwidth requirements for quad-HD are
astounding, so at that point either some serious raid0/x0 striping or
SSDs will be pretty much mandatory for the speed anyway, remaining SSD
size limits or no.

> From my standpoint I don't think raid1 is the best use of money in
> most cases, either for performance OR for data security.  If you
> want performance the money is probably better spent on other
> components.  If you want data security the money is probably better
> spent on offline backups.  However, this very-much depends on how
> the disks will be used - there are certainly cases where raid1 is
> your best option.

I agree when the use is primarily video media.  Other than that, a
pair of 2 TB spinning-rust drives still tends to go quite a long way,
and tends to be a pretty good cost/risk tradeoff IMO.  Throwing in a
third 2 TB drive for three-way raid1 mirroring is often a good idea as
well, where the additional data security is needed, but beyond that
the cost/benefit balance probably doesn't make a whole lot of sense,
agreed.

And offline backups are important too, but with dual 2 TB drives many
people can live with a TB of data and do multiple raid1s, giving
themselves both a logically offline backup and physical device
redundancy.  And if that means they do backups to the second raid set
on the same physical devices more reliably than they would with an
external drive they have to physically dig out and/or attach each time
(as turned out to be the case for me), then the pair of 2 TB drives is
quite a reasonable investment indeed.

But if you're going for performance, spinning-rust raid simply doesn't
cut it at the consumer level any longer.  Put at least the commonly
used data on SSD, leaving say the media data on spinning rust for the
time being if the budget doesn't work otherwise -- as I've actually
done here with my (much smaller than yours) media collection, figuring
it's not worth the cost to put /it/ on SSD just yet.

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman