To: gentoo-amd64@lists.gentoo.org
From: Duncan <1i5t5.duncan@cox.net>
Subject: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value?
Date: Fri, 21 Jun 2013 07:31:35 +0000 (UTC)

Mark Knecht posted on Thu, 20 Jun 2013 12:10:04 -0700 as excerpted:

> Does anyone know of info on how the starting sector number might
> impact RAID performance under Gentoo?  The drives are WD-500G RE3
> drives shown here:
>
> http://www.amazon.com/Western-Digital-WD5002ABYS-3-5-inch-Enterprise/dp/B001EMZPD0/ref=cm_cr_pr_product_top
>
> These are NOT 4k sector sized drives.
>
> Specifically I'm a 5-drive RAID6 for about 1.45TB of storage.  My
> benchmarking seems abysmal at around 40MB/S using dd copying large
> files.  It's higher, around 80MB/S if the file being transferred is
> coming from an SSD, but even 80MB/S seems slow to me.  I see a LOT of
> wait time in top.  And my 'large file' copies might not be large
> enough as the machine has 24GB of DRAM and I've only been copying
> 21GB so it's possible some of that is cached.

I /suspect/ that the problem isn't striping, tho that can be a factor, but rather your choice of raid6.  Note that I personally ran md/raid6 here for a while, so I know a bit of what I'm talking about.  I didn't realize the full implications of what I was setting up originally, or I'd not have chosen raid6 in the first place, but live and learn as they say, and that I did.

General rule: raid6 is abysmal for writing, and gets dramatically worse as fragmentation sets in, tho reading is reasonable.
The reason is that in order to properly parity-check and write out less-than-full-stripe writes, the system must effectively read in the existing data and merge it with the new data, then recalculate the parity, before writing the new data AND 100% of the (two-way in raid6) parity.  Further, because raid sits below the filesystem level, it knows nothing about what parts of the filesystem are actually used, and must read and write the FULL data stripe (perhaps minus the new data bit, I'm not sure), including parts that will be empty on a freshly formatted filesystem.

So with 4k chunk sizes on a 5-device raid6, you'd have 20k stripes: 12k of data across three devices, and 8k of parity across the other two devices.  Now you go to write a 1k file, but in order to do so the full 12k of existing data must be read in, even on an empty filesystem, because the RAID doesn't know it's empty!  Then the new data must be merged in and new parity calculated, then the full 20k must be written back out: certainly the 8k of parity, but also likely the full 12k of data even if most of it is simply a rewrite, and almost certainly at least the 4k strip on the device the new data lands on.

As I said, that gets much worse as a filesystem ages, since fragmentation means writes are more often to, say, 3 stripe fragments instead of a single whole stripe.  That's what proper stride size, etc, can help with, if the filesystem's reasonably fragmentation resistant, but even then filesystem aging certainly won't /help/.

Reads, meanwhile, are reasonable speed (in normal non-degraded mode), because on a raid6 the data is at least two-way striped (that's on a 4-device raid6; your 5-device array would have three-way striped data, the other two devices being parity of course), so you do get moderate striping read bonuses.  Then there's all that parity information written out at every write, but it's not actually used to check the reliability of the data in normal operation, only to reconstruct if a device or two goes missing.

On a well laid out system, I/O to the separate drives at least shouldn't interfere with each other, assuming SATA and a chipset and bus layout that can handle them in parallel -- not /that/ big a feat on today's hardware, at least as long as you're still doing "spinning rust", as the mechanical drive latency is almost certainly the bottleneck there, and at least that can be parallelized to a reasonable degree across the individual drives.
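If it helps to see the arithmetic spelled out, here's a trivial Python sketch of that read-modify-write penalty.  The function name and defaults are mine, purely for illustration, and it follows the simplified "read the whole data stripe, rewrite the whole stripe" model described above; real md/raid6 can sometimes shortcut this by reading only the old data chunk plus the old parity, but the write-amplification picture is similar either way.

# Simplified model of a sub-stripe write on raid6, per the description
# above: read back the whole data stripe, then rewrite data plus both
# parity chunks.  Numbers are the 5-device, 4k-chunk example from the text.

def raid6_small_write(devices=5, chunk_kib=4, write_kib=1):
    """Return (KiB read, KiB written, I/O amplification) for one small write."""
    data_devices = devices - 2                  # raid6: two devices' worth of parity
    data_stripe_kib = data_devices * chunk_kib  # 3 * 4k = 12k of data per stripe
    parity_kib = 2 * chunk_kib                  # 2 * 4k = 8k of parity per stripe
    read_kib = data_stripe_kib                  # existing data read back in first
    written_kib = data_stripe_kib + parity_kib  # full 20k stripe written back out
    return read_kib, written_kib, (read_kib + written_kib) / write_kib

r, w, amp = raid6_small_write()
print(f"1k write: read {r}k, write {w}k, roughly {amp:.0f}x the I/O actually requested")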
What I ultimately came to realize here is that unless the job at hand is nearly 100% read on the raid, and with the caveat that you have enough space, raid1 is almost certainly at least as good if not a better choice.  If you have the devices to support it, you can go for raid10/50/60, and a raid10 across 5 devices is certainly possible with mdraid, but a straight raid6... you're generally better off with an N-way raid1, for a couple of reasons.

First, md/raid1 is surprisingly, even astoundingly, good at multi-task scheduling of reads.  Any time there are multiple read tasks going on (like during boot), raid1 works really well, with the scheduler distributing the tasks among the available devices, thus minimizing seek latency.  Take a 5-device raid1: you can very likely accomplish at least 5, and possibly 6 or even 7, read jobs in say 110-120% of the time it would take to do just the longest one on a single device, almost certainly well before a single device could have done the two longest read jobs.

This also works if a single task is alternating reads of N different files/directories, since the scheduler will again distribute the jobs among the devices.  Say one device's head stays over the directory information while another goes to read the first file, a third reads another file, etc.  The heads stay where they are until they're needed elsewhere, so the more devices you have in the raid1, the more likely it is that another read from the same location still has a head sitting right over it and can simply read the data as the correct portion of the disk spins underneath, instead of first seeking to the correct spot.

It's worth pointing out that in the parallel-read-job case, thanks to this read scheduling, md/raid1 can often beat raid0 performance, despite raid0's technically better single-job thruput numbers.  This was something I learned by experience as well -- it makes sense in hindsight, but I had TOTALLY not calculated for it in my original setup, where I was running raid0 for things like the gentoo ebuild tree and the kernel sources, since I didn't need redundancy for them.  My raid0 performance there was rather disappointing, because portage tree updates, dependency calculation, and the kernel build process don't optimize well for thruput, which is what raid0 provides, but rather for parallel I/O, where raid1 shines, especially for reads.

Second, md/raid1 writes, because they happen in parallel with the bottleneck being the spinning rust, basically occur at the speed of the slowest disk.  So you don't get N-way parallel write speed, just single-disk speed, but it's still *WAY* *WAY* better than raid6, which has to read in the existing data and do the merge before it can write anything back out.  **THAT'S THE RAID6 PERFORMANCE KILLER**, or at least it was for me: it effectively gives you half-device-speed writes, because in too many cases the data must be read in before it can be written.  Raid1 doesn't have that problem -- it doesn't get a write performance multiplier from the N devices, but at least it doesn't get device performance cut in half like raid5/6 does.

Third, the read-scheduling benefits of the first point help, to a lesser extent, with large same-raid1 copies as well.  Consider: the first block must be read by one device, then written to all devices at the new location, the second block similarly, then the third, etc.  But with proper scheduling, an N-way raid1 doing an N-block copy has done only N+1 operations on each device by the end of the copy.  IOW, given the memory to use as a buffer, the reads can be done in parallel, one block from each device, then the writes go out one block at a time to all devices.  So a 5-way raid1 will have done 6 jobs on each of the 5 devices at the end -- 1 read and 5 writes -- to copy 5 blocks.  (In actuality, due to read-ahead, I think it's optimally 64k per device, 16 4k blocks on each, 320k total, but that's well within the usual minimum 2MB drive buffer size, and the drive will probably do that on its own if both read and write caching are on, given scheduling that forces a cache flush only at the end, not multiple times in the middle.  So all the kernel has to do is be sure it's not interfering by forcing untimely flushes, and the drives should optimize on their own.)
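Here's that counting argument as another tiny Python sketch, purely illustrative (the function names and the idealized one-read-per-mirror assumption are mine, not a benchmark of anything):

# Counting model for copying N blocks within an N-way raid1, per the
# paragraph above: reads can fan out across the mirrors, but every
# write has to land on every mirror.

import math

def raid1_copy_ops_per_device(mirrors=5, blocks=5):
    reads = math.ceil(blocks / mirrors)  # ideally one read per device
    writes = blocks                      # every mirror writes every block
    return reads + writes

def single_disk_copy_ops(blocks=5):
    return 2 * blocks                    # one disk reads and writes every block

print("5-way raid1, 5-block copy:", raid1_copy_ops_per_device(), "ops per device")  # 6
print("single disk, 5-block copy:", single_disk_copy_ops(), "ops on the one disk")  # 10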
Fourth, back to the parity.  Remember, raid5/6 has all that parity that it writes out (but basically never reads in normal mode, only when degraded, in order to reconstruct the data from the missing device(s)), yet it doesn't actually use it for integrity checking.  So while raid1 doesn't have the benefit of that parity data, it's not like raid5/6 used it anyway, and an N-way raid1 means even MORE missing-device protection, since you can lose all but one device and keep right on going as if nothing happened.  A 5-way raid1 can lose 4 devices, not just the two devices of a 5-way raid6 or the single device of a raid5.  Yes, there's the loss of parity/integrity data with raid1, BUT RAID5/6 DOESN'T USE THAT DATA FOR INTEGRITY CHECKING ANYWAY, ONLY FOR RECONSTRUCTION IN THE CASE OF DEVICE LOSS!  So the N-way raid1 is far more redundant, since you have N copies of the data, not one copy plus two-way-parity-that's-never-used-except-for-reconstruction.

Fifth, in the event of device loss, a raid1 continues to function at normal speed, because it's simply an N-way copy with a bit of extra metadata to keep track of the number of N ways.  Of course you'll lose the read-parallelization benefit that the missing device provided, and you'll lose its redundancy, but in general performance remains pretty much the same no matter how many ways it's raid1 mirrored.  Contrast that with raid5/6, whose read performance is SEVERELY impacted by device loss, since it must then reconstruct the data using the parity data rather than simply reading it from somewhere else, which is what raid1 does.

The single downside to raid1 as opposed to raid5/6 is the loss of the extra space made available by the data striping: 3*single-device-space in the case of a 5-way raid6 (or a 4-way raid5) vs. 1*single-device-space in the case of raid1.  Otherwise, no contest, hands down, raid1 over raid6.

IOW, you're seeing now exactly why raid6, and to a lesser extent raid5, have such terrible performance (as opposed to reliability) reputations.  Really, unless you simply don't have the space to make it raid1, I **STRONGLY** urge you to try that instead.  I know I was very happily surprised by the results I got, and only then realized what all the negativity I'd seen around raid5/6 had been about, as I really hadn't understood it at all when I was doing my original research.
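To put that space vs. redundancy trade-off in numbers, one more small Python sketch (again just the arithmetic from above, nothing md-specific; the helper name is mine, and capacity is in multiples of one member device's size):

# Usable capacity vs. tolerated device failures for a 5-device array,
# following the raid1 / raid5 / raid6 arithmetic above.

def layout(devices, level):
    if level == "raid1":            # N-way mirror: one device's worth of space
        return 1, devices - 1
    if level == "raid5":            # one device's worth of parity
        return devices - 1, 1
    if level == "raid6":            # two devices' worth of parity
        return devices - 2, 2
    raise ValueError(level)

for level in ("raid1", "raid5", "raid6"):
    usable, failures = layout(5, level)
    print(f"5-device {level}: {usable}x device capacity, survives {failures} device loss(es)")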
Meanwhile, Rich0 already brought up btrfs, which really does promise a better solution to many of these issues than md/raid, in part due to what is arguably a "layering violation", but one that really DOES allow for some serious optimizations in the multiple-drive case: as a filesystem, it DOES know what's real data and what's empty space that isn't worth worrying about, and unlike raid5/6 parity, it really DOES care about data integrity, not just rebuilding in case of device failure.  So several points on btrfs:

1) It's still in heavy development.  The base single-device filesystem case works reasonably well now and is /almost/ stable, tho I'd still urge people to keep good backups, as it's simply not time-tested and settled, and won't be for at least a few more kernels while they're still busy with the other features.  Second-level raid0/raid1/raid10 support is at an intermediate stage: primary development and initial bug testing and fixing are done, but they're still working on bugs that people doing only traditional single-device filesystems simply don't have to worry about.  Third-round raid5/6 support is still very new, introduced as VERY experimental only with 3.9 IIRC, and is currently EXPECTED to eat data in power-loss or crash events, so it's ONLY good for preliminary testing at this point.

Thus, if you're using btrfs at all, keep good backups, and keep current, even -rc (if not live-git) on the kernel, because there really are fixes in every single kernel for very real corner-case problems they're still coming across.  But single-device is /relatively/ stable now, so provided you keep good *TESTED* backups, are willing and able to use them if it comes to it, and keep current on the kernel, go for that.  I'm personally running dual-device raid1 mode across two SSDs, at that second-stage deployment level.  I tried that (but still on spinning rust) a year ago and decided btrfs simply wasn't ready for me yet, so it has come quite a way in the last year.  But raid5/6 mode is still fresh third-tier development, which I'd not consider usable until at LEAST 3.11 and probably 3.12 or later (maybe a year from now, since it's less mature than raid1 was at this point last year, but it should mature a bit faster).

Takeaway: if you don't have a backup you're prepared to use, you shouldn't even be THINKING about btrfs at this point, no matter WHAT type of deployment you're considering.  If you do, you're probably reasonably safe with traditional single-device btrfs, intermediately risky/safe with raid0/1/10, and shouldn't even think about raid5/6 for real deployment yet, period.

2) RAID levels work QUITE a bit differently on btrfs.  In particular, what btrfs calls raid1 mode (and the same applies to raid10) is simply two-way mirroring, NO MATTER THE NUMBER OF DEVICES.  There's no multi-way mirroring available yet, unless you're willing to apply not-yet-mainlined patches.  It's planned, but not yet merged.  The roadmap says it'll happen after raid5/6 are introduced (they have been, but aren't yet really finished, including power-loss recovery, etc), so I'm guessing 3.12 at the earliest, as I think 3.11 is still focused on raid5/6 completion.

3) Btrfs raid1 mode is also used to provide a second source for its data integrity feature, such that if one copy's checksum doesn't verify, it'll try the other one.  Unfortunately #2 means there's only the single fallback to try, but that's better than most filesystems, which either have no data integrity checking at all or, if they have it, no fallback if it fails.

The combination of #2 and #3 was a bitter pill for me a year ago, when I was still running on aging spinning rust and thus didn't trust two-copy-only redundancy.  I really like the data integrity feature, but just a single backup copy was a great disappointment since I didn't trust my old hardware, and unfortunately two-copy-max remains the case for so-called raid1.  (Raid5/6 mode apparently introduces N-way copies or some such, but as I said, it's not complete yet and is EXPECTED to eat data.  N-way mirroring will build on that and is on the horizon, but it has been on the horizon, and not seeming to get much closer, for over a year now...)  Fortunately for me, my budget is in far better shape this year, and with the two new SSDs I purchased, and spinning rust still for backup, I trust my hardware enough now to run the 2-way-only mirroring that btrfs calls raid1 mode.
4) As mentioned in the btrfs intro paragraph above, btrfs, being a filesystem, actually knows what data is real data and what can safely be left untracked and thus unsynced.  Thus the read-data-in-before-writing-it problem will be rather smaller, certainly on freshly formatted disks where most existing data WILL be garbage/zeroes (trimmed if on an SSD, as mkfs.btrfs issues a trim command for the entire filesystem range before it creates the superblocks, etc, so empty space really /is/ zeroed).  Similarly with "slack space" that's not currently used but was used previously, as the filesystem ages: btrfs knows that it can ignore that data too, and thus won't have to read it in to update the parity when writing to a raid5/6 mode btrfs.

5) There are various other nice btrfs features, and a few caveats as well, but with the exception of anything btrfs-raid-related I've totally forgotten about, they're out of scope for this thread, which is after all about raid, so I'll skip discussing them here.

So bottom line, I really recommend md/raid1 for now.  Unless you want to go md/raid10 with three-way mirroring on the raid1 side; AFAIK that's doable with 5 devices, but it's simpler, certainly conceptually simpler, which can make a difference to an admin trying to work with it, with 6.

If the data simply won't fit on the 5-way raid1 and you want to keep at least 2-device-loss protection, consider splitting it up: raid1 with three devices for the first half, then either get a sixth device and do the same with the second half, or go raid1 with two devices and put your less critical data on the second set.  Or do the raid10-with-5-devices thing, but I'll admit that while I've read that it's possible, I don't really conceptually understand it myself and haven't tried it, so I have no personal opinion or experience to offer on that.  But in that case I really would try to scrape up the money for a sixth device if possible, and do raid10 with 3-way redundancy and 2-way striping across the six, simply because it's easier to conceptualize and thus to properly administer.

-- 
Duncan - List replies preferred.  No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman