* [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? @ 2013-06-20 19:10 Mark Knecht 2013-06-20 19:16 ` Volker Armin Hemmann ` (4 more replies) 0 siblings, 5 replies; 46+ messages in thread From: Mark Knecht @ 2013-06-20 19:10 UTC (permalink / raw To: Gentoo AMD64 Hi, Does anyone know of info on how the starting sector number might impact RAID performance under Gentoo? The drives are WD-500G RE3 drives shown here: http://www.amazon.com/Western-Digital-WD5002ABYS-3-5-inch-Enterprise/dp/B001EMZPD0/ref=cm_cr_pr_product_top These are NOT 4k sector sized drives. Specifically I'm running a 5-drive RAID6 for about 1.45TB of storage. My benchmarking seems abysmal at around 40MB/S using dd copying large files. It's higher, around 80MB/S if the file being transferred is coming from an SSD, but even 80MB/S seems slow to me. I see a LOT of wait time in top. And my 'large file' copies might not be large enough as the machine has 24GB of DRAM and I've only been copying 21GB so it's possible some of that is cached. Then I looked again at how I partitioned the drives originally and see the starting sector of partition 3 as 8594775. I started wondering if something like 4K block sizes at the file system level might be getting munged across 16k chunk sizes in the RAID. Maybe the blocks are being torn apart in bad ways for performance? That led me down a bunch of rabbit holes and I haven't found any light yet. Looking for some thoughtful ideas from those more experienced in this area. Cheers, Mark ^ permalink raw reply [flat|nested] 46+ messages in thread
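A quick way to take the page cache out of benchmark numbers like these is to either write more data than fits in RAM with an explicit flush, or to bypass the cache with direct I/O. A minimal sketch, assuming the array is mounted somewhere like /mnt/raid (the path and sizes here are placeholders, not taken from the thread):

  # write test: ~30GB is larger than the 24GB of RAM, and conv=fdatasync
  # makes dd wait for the data to actually reach the disks before reporting
  dd if=/dev/zero of=/mnt/raid/ddtest bs=1M count=30000 conv=fdatasync

  # alternatively, bypass the page cache entirely with direct I/O
  dd if=/dev/zero of=/mnt/raid/ddtest bs=1M count=30000 oflag=direct

  # read test: drop the caches first so the file really comes off the platters
  sync; echo 3 > /proc/sys/vm/drop_caches
  dd if=/mnt/raid/ddtest of=/dev/null bs=1M

Numbers taken this way are comparable between runs and between layouts, which matters for the RAID6-versus-alternatives testing discussed later in the thread.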
* Re: [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? 2013-06-20 19:10 [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? Mark Knecht @ 2013-06-20 19:16 ` Volker Armin Hemmann 2013-06-20 19:28 ` Mark Knecht 2013-06-20 20:45 ` Mark Knecht 2013-06-20 19:27 ` Rich Freeman ` (3 subsequent siblings) 4 siblings, 2 replies; 46+ messages in thread From: Volker Armin Hemmann @ 2013-06-20 19:16 UTC (permalink / raw To: gentoo-amd64 Am 20.06.2013 21:10, schrieb Mark Knecht: > Hi, > Does anyone know of info on how the starting sector number might > impact RAID performance under Gentoo? The drives are WD-500G RE3 > drives shown here: > > http://www.amazon.com/Western-Digital-WD5002ABYS-3-5-inch-Enterprise/dp/B001EMZPD0/ref=cm_cr_pr_product_top > > These are NOT 4k sector sized drives. > > Specifically I'm a 5-drive RAID6 for about 1.45TB of storage. My > benchmarking seems abysmal at around 40MB/S using dd copying large > files. It's higher, around 80MB/S if the file being transferred is > coming from an SSD, but even 80MB/S seems slow to me. I see a LOT of > wait time in top. And my 'large file' copies might not be large enough > as the machine has 24GB of DRAM and I've only been copying 21GB so > it's possible some of that is cached. > > Then I looked again at how I partitioned the drives originally and > see the starting sector of sector 3 as 8594775. I started wondering if > something like 4K block sizes at the file system level might be > getting munged across 16k chunk sizes in the RAID. Maybe the blocks > are being torn apart in bad ways for performance? That led me down a > bunch of rabbit holes and I haven't found any light yet. > > Looking for some thoughtful ideas from those more experienced in this area. > > Cheers, > Mark > > man mkfs.xfs man mkfs.ext4 look for stripe size etc. Have fun. ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? 2013-06-20 19:16 ` Volker Armin Hemmann @ 2013-06-20 19:28 ` Mark Knecht 2013-06-20 20:45 ` Mark Knecht 1 sibling, 0 replies; 46+ messages in thread From: Mark Knecht @ 2013-06-20 19:28 UTC (permalink / raw To: Gentoo AMD64 On Thu, Jun 20, 2013 at 12:16 PM, Volker Armin Hemmann <volkerarmin@googlemail.com> wrote: <SNIP> >> Looking for some thoughtful ideas from those more experienced in this area. >> >> Cheers, >> Mark >> >> > > man mkfs.xfs > > man mkfs.ext4 > > look for stripe size etc. > > Have fun. > I am probably mistaken but I thought that stuff was for hardware RAID and that for mdadm type software RAID it was handled by mdadm? I certainly don't remember any of the Linux software RAID pages I've read about setting up RAID suggesting that these options are important, but I'll go look around. Thanks! ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? 2013-06-20 19:16 ` Volker Armin Hemmann 2013-06-20 19:28 ` Mark Knecht @ 2013-06-20 20:45 ` Mark Knecht 2013-06-24 18:47 ` Volker Armin Hemmann 1 sibling, 1 reply; 46+ messages in thread From: Mark Knecht @ 2013-06-20 20:45 UTC (permalink / raw To: Gentoo AMD64 On Thu, Jun 20, 2013 at 12:16 PM, Volker Armin Hemmann <volkerarmin@googlemail.com> wrote: <SNIP> > man mkfs.xfs > > man mkfs.ext4 > > look for stripe size etc. > > Have fun. > Volker, I find way down at the bottom of the RAID setup page that they do say stride & stripe are important for RAID4 & RAID5, but remain non-committal for RAID6. None the less thanks for the idea. Now I guess I have to figure out how to test it in less than 10 weeks. I think I'm in trouble at this point having only 1 file system. Possibly it would be better to have a second just to be able to change settings with tune2fs and being able to do it quickly. None the less thanks. - Mark ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? 2013-06-20 20:45 ` Mark Knecht @ 2013-06-24 18:47 ` Volker Armin Hemmann 2013-06-24 19:11 ` Mark Knecht 0 siblings, 1 reply; 46+ messages in thread From: Volker Armin Hemmann @ 2013-06-24 18:47 UTC (permalink / raw To: gentoo-amd64 Am 20.06.2013 22:45, schrieb Mark Knecht: > On Thu, Jun 20, 2013 at 12:16 PM, Volker Armin Hemmann > <volkerarmin@googlemail.com> wrote: > <SNIP> >> man mkfs.xfs >> >> man mkfs.ext4 >> >> look for stripe size etc. >> >> Have fun. >> > Volker, > I find way down at the bottom of the RAID setup page that they do > say stride & stripe are important for RAID4 & RAID5, but remain > non-committal for RAID6. raid 6 is just raid5 with additional parity. So stripe size is not less important. ^ permalink raw reply [flat|nested] 46+ messages in thread
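To make the man-page pointer concrete: for ext4 the two knobs are stride (chunk size divided by filesystem block size) and stripe width (stride times the number of data-bearing disks). A minimal sketch using the figures mentioned earlier in the thread (16k chunks, 4k blocks, 5-disk RAID6 so 3 data disks); /dev/md0 is a placeholder, and the exact option spelling should be checked against the mke2fs/tune2fs man pages on your system:

  # stride = 16k chunk / 4k block = 4; stripe-width = 4 * 3 data disks = 12
  mkfs.ext4 -b 4096 -E stride=4,stripe-width=12 /dev/md0

  # an existing filesystem can be adjusted without reformatting
  tune2fs -E stride=4,stripe_width=12 /dev/md0

  # verify what the filesystem believes about the geometry underneath it
  dumpe2fs -h /dev/md0 | grep -i 'stride\|stripe'

Recent mkfs.ext4 versions will usually pick these values up automatically when run directly on an md device, but it is worth verifying rather than assuming.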
* Re: [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? 2013-06-24 18:47 ` Volker Armin Hemmann @ 2013-06-24 19:11 ` Mark Knecht 0 siblings, 0 replies; 46+ messages in thread From: Mark Knecht @ 2013-06-24 19:11 UTC (permalink / raw To: Gentoo AMD64 On Mon, Jun 24, 2013 at 11:47 AM, Volker Armin Hemmann <volkerarmin@googlemail.com> wrote: > Am 20.06.2013 22:45, schrieb Mark Knecht: >> On Thu, Jun 20, 2013 at 12:16 PM, Volker Armin Hemmann >> <volkerarmin@googlemail.com> wrote: >> <SNIP> >>> man mkfs.xfs >>> >>> man mkfs.ext4 >>> >>> look for stripe size etc. >>> >>> Have fun. >>> >> Volker, >> I find way down at the bottom of the RAID setup page that they do >> say stride & stripe are important for RAID4 & RAID5, but remain >> non-committal for RAID6. > > raid 6 is just raid5 with additional parity. So stripe size is not less > important. > Yeah, as I continued to study that became more apparent. The Linux RAID wiki not saying anything about it was (apparently) just an oversight on their part. At this point I'm basically getting set up to tear my whole machine apart and rebuild it from scratch. When I do I'll benchmark whatever RAID options I think will meet my long term needs and then report back anything I find. Personally, I think that RAID6 should be just slightly slower than RAID5, and use slightly more CPU power doing it. How RAID5/6 really compares with RAID1 isn't really that much of an issue for me as using only RAID1 won't give me enough storage using any combination of my 5 500GB drives. I think if I was into spending some money I'd look at buying a second SSD and do RAID1 for my / and then just use the disks for the VMs & video, but don't see that as an option right now. Cheers, Mark ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? 2013-06-20 19:10 [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? Mark Knecht 2013-06-20 19:16 ` Volker Armin Hemmann @ 2013-06-20 19:27 ` Rich Freeman 2013-06-20 19:31 ` Mark Knecht 2013-06-21 7:31 ` [gentoo-amd64] " Duncan ` (2 subsequent siblings) 4 siblings, 1 reply; 46+ messages in thread From: Rich Freeman @ 2013-06-20 19:27 UTC (permalink / raw To: gentoo-amd64 On Thu, Jun 20, 2013 at 3:10 PM, Mark Knecht <markknecht@gmail.com> wrote: > Looking for some thoughtful ideas from those more experienced in this area. Please do share your findings. I suspect my own RAID+LVM+EXT3/4 system is not optimized - especially with LVM I have no idea how blocks in ext3/4 end up mapping to stripes and physical blocks. Oh, and this is on 4k disks. Honestly, this is one of the reasons I REALLY want to move to btrfs when it fully supports raid5. Right now the various layers don't talk to each other and that means a lot of micro-management if you don't want a lot of read-write-read cycles (to say nothing of what you can buy with a filesystem that can aim to overwrite entire stripes at a time). Rich ^ permalink raw reply [flat|nested] 46+ messages in thread
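For anyone wanting to see how the layers actually line up rather than guessing, each layer can be queried directly. A minimal sketch, with /dev/md0 and the LVM names as placeholders:

  # chunk size and layout of the md array
  mdadm --detail /dev/md0 | grep -iE 'level|chunk|devices'

  # where LVM starts placing data within the PV (ideally a multiple of the chunk size)
  pvs -o +pe_start /dev/md0

  # what ext3/4 was told about the geometry underneath it, if anything
  dumpe2fs -h /dev/mapper/vg0-lv0 | grep -i 'stride\|stripe'

If the PV data start or the LV boundaries are not chunk-aligned, the filesystem's idea of a stripe no longer matches the array's, and stride/stripe-width tuning can't help much.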
* Re: [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? 2013-06-20 19:27 ` Rich Freeman @ 2013-06-20 19:31 ` Mark Knecht 0 siblings, 0 replies; 46+ messages in thread From: Mark Knecht @ 2013-06-20 19:31 UTC (permalink / raw To: Gentoo AMD64 On Thu, Jun 20, 2013 at 12:27 PM, Rich Freeman <rich0@gentoo.org> wrote: > On Thu, Jun 20, 2013 at 3:10 PM, Mark Knecht <markknecht@gmail.com> wrote: >> Looking for some thoughtful ideas from those more experienced in this area. > > Please do share your findings. I suspect my own RAID+LVM+EXT3/4 > system is not optimized - especially with LVM I have no idea how > blocks in ext3/4 end up mapping to stripes and physical blocks. Oh, > and this is on 4k disks. > > Honestly, this is one of the reasons I REALLY want to move to btrfs > when it fully supports raid5. Right now the various layers don't talk > to each other and that means a lot of micro-management if you don't > want a lot of read-write-read cycles (to say nothing of what you can > buy with a filesystem that can aim to overwrite entire stripes at a > time). > > Rich > I'll share everything I find, true or false, and maybe as a group we can figure out what's right. In the meantime, please be careful with your RAID5 and do good backups :-) I ran RAID5 for awhile but moved to RAID6 due to the number of reports I read where one drive went bad on a RAID5 and then the RAID lost a second drive before the original bad drive was replaced and everything was gone. Cheers, Mark ^ permalink raw reply [flat|nested] 46+ messages in thread
* [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-20 19:10 [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? Mark Knecht 2013-06-20 19:16 ` Volker Armin Hemmann 2013-06-20 19:27 ` Rich Freeman @ 2013-06-21 7:31 ` Duncan 2013-06-21 10:28 ` Rich Freeman ` (2 more replies) 2013-06-22 12:49 ` [gentoo-amd64] " B Vance 2013-06-23 11:31 ` thegeezer 4 siblings, 3 replies; 46+ messages in thread From: Duncan @ 2013-06-21 7:31 UTC (permalink / raw To: gentoo-amd64 Mark Knecht posted on Thu, 20 Jun 2013 12:10:04 -0700 as excerpted: > Does anyone know of info on how the starting sector number might > impact RAID performance under Gentoo? The drives are WD-500G RE3 drives > shown here: > > http://www.amazon.com/Western-Digital-WD5002ABYS-3-5-inch-Enterprise/dp/B001EMZPD0/ref=cm_cr_pr_product_top > > These are NOT 4k sector sized drives. > > Specifically I'm running a 5-drive RAID6 for about 1.45TB of storage. My > benchmarking seems abysmal at around 40MB/S using dd copying large > files. > It's higher, around 80MB/S if the file being transferred is coming from > an SSD, but even 80MB/S seems slow to me. I see a LOT of wait time in > top. > And my 'large file' copies might not be large enough as the machine has > 24GB of DRAM and I've only been copying 21GB so it's possible some of > that is cached. I /suspect/ that the problem isn't striping, tho that can be a factor, but rather, your choice of raid6. Note that I personally ran md/raid-6 here for a while, so I know a bit of what I'm talking about. I didn't realize the full implications of what I was setting up originally, or I'd have not chosen raid6 in the first place, but live and learn as they say, and that I did. General rule, raid6 is abysmal for writing and gets dramatically worse as fragmentation sets in, tho reading is reasonable. The reason is that in order to properly parity-check and write out less-than-full-stripe writes, the system must effectively read in the existing data and merge it with the new data, then recalculate the parity, before writing the new data AND 100% of the (two-way in raid-6) parity. Further, because raid sits below the filesystem level, it knows nothing about what parts of the filesystem are actually used, and must read and write the FULL data stripe (perhaps minus the new data bit, I'm not sure), including parts that will be empty on a freshly formatted filesystem. So with 4k block sizes on a 5-device raid6, you'd have 20k stripes, 12k in data across three devices, and 8k of parity across the other two devices. Now you go to write a 1k file, but in order to do so the full 12k of existing data must be read in, even on an empty filesystem, because the RAID doesn't know it's empty! Then the new data must be merged in and new checksums created, then the full 20k must be written back out, certainly the 8k of parity, but also likely the full 12k of data even if most of it is simply rewrite, but almost certainly at least the 4k strip on the device the new data is written to. As I said that gets much worse as a filesystem ages, due to fragmentation meaning writes are more often writes to say 3 stripe fragments instead of a single whole stripe. That's what proper stride size, etc, can help with, if the filesystem's reasonably fragmentation resistant, but even then filesystem aging certainly won't /help/. 
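One tunable worth mentioning alongside this read-modify-write behaviour is the md stripe cache, which buffers partial stripes so that more writes can be merged into full-stripe writes before they hit the disks. A minimal sketch, with /dev/md0 as a placeholder; the value costs RAM (pages per device, times page size, times member count) and does not persist across reboots:

  # what the array's chunk size actually is
  mdadm --detail /dev/md0 | grep 'Chunk Size'

  # raid5/6 stripe cache, in pages per device; the default is 256
  cat /sys/block/md0/md/stripe_cache_size

  # a larger cache gives md a better chance of assembling full stripes
  echo 4096 > /sys/block/md0/md/stripe_cache_size

It doesn't remove the read-modify-write penalty described above, but it often takes the worst edge off it for bursty writes.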
Reads, meanwhile, are reasonable speed (in normal non-degraded mode), because on a raid6 the data is at least two-way striped (on a 4-device raid, your 5-device would be three-way striped data, the other two being parity of course), so you do get moderate striping read bonuses. Then there's all that parity information available and written out at every write, but it's not actually used to check the reliability of the data in normal operation, only to reconstruct if a device or two goes missing. On a well laid out system, I/O to the separate drives at least shouldn't interfere with each other, assuming SATA and a chipset and bus layout that can handle them in parallel, not /that/ big a feat on today's hardware at least as long as you're still doing "spinning rust", as the mechanical drive latency is almost certainly the bottleneck there, and at least that can be parallelized to a reasonable degree across the individual drives. What I ultimately came to realize here is that unless the job at hand is nearly 100% read on the raid, with the caveat that you have enough space, raid1 is almost certainly at least as good if not a better choice. If you have the devices to support it, you can go for raid10/50/60, and a raid10 across 5 devices is certainly possible with mdraid, but a straight raid-6... you're generally better off with an N-way raid-1, for a couple reasons. First, md/raid1 is surprisingly, even astoundingly, good at multi-task scheduling reads. So any time there's multiple I/O read tasks going on (like during boot), raid1 works really well, with the scheduler distributing tasks among the available devices, this minimizing seek- latency. So take a 5-device raid-1, you can very likely accomplish at least 5 and possibly 6 or even 7 read jobs in say 110 or 120% of the time it would take to do just the longest one on a single device, almost certainly well before a single device could have done the two longest read jobs. This also works if there's a single task alternating reads of N different files/directories, since the scheduler will again distribute jobs among the devices, so say one device head stays over the directory information, while another goes to read the first file, the second reads another file, etc, and the heads stay where they are until they're needed elsewhere so the more devices in raid1 you have the more likely it is that more data read from the same location still has a head located right over it and can just read it as the correct portion of the disk spins underneath, instead of first seeking to the correct spot on the disk. It's worth pointing out that in the case of parallel job read access, due to this parallel read-scheduling md/raid1 can often best raid0 performance, despite raid0's technically better single-job thruput numbers. This was something I learned by experience as well, that makes sense but that I had TOTALLY not realized or calculated for in my original setup, as I was running raid0 for things like the gentoo ebuild tree and the kernel sources, since I didn't need redundancy for them. My raid0 performance there was rather disappointing, because both portage tree updates and dep calculation and the kernel build process don't optimize well for thruput, which is what raid0 does, but optimize rather better for parallel I/O, where raid1 shines especially for read. Second, md/raid1 writes, because they happen in parallel with the bottleneck being the spinning rust, basically occur at the speed of the slowest disk. 
So you don't get N-way parallel job write speed, just single disk speed, but it's still *WAY* *WAY* better than raid6, which has to read in the existing data and do the merge before it can write back out. **THAT'S THE RAID6 PERFORMANCE KILLER**, or at least it was for me, effectively giving you half-device speed writes because the data in too many cases must be read in first before it can be written. Raid1 doesn't have that problem -- it doesn't get a write performance multiplier from the N devices, but at least it doesn't get device performance cut in half like raid5/6 does. Third, the read-scheduling benefits of #1 help to a lesser extent with large same-raid1 copies as well. Consider, the first block must be read by one device, then written to all at the new location. The second similarly, then the third, etc. But, with proper scheduling an N-way raid1 doing an N-block copy has done N+1 operations on all devices at the end of that N-block copy. IOW, given the memory to use as a buffer, the read can be done in parallel, reading N blocks in at once, one from each device, then the writes, one block at a time to all devices. So a 5-way raid1 will have done 6 jobs on each of the 5 devices at the end, 1 read and 5 writes, to write out 5 blocks. (In actuality due to read-ahead I think it's optimally 64k blocks per device, 16 4k blocks on each, 320k total, but that's well within the usual minimal 2MB drive buffer size, and the drive will probably do that on its own if both read and write-caching are on, given scheduling that forces a cache-flush only at the end, not multiple times in the middle. So all the kernel has to do is be sure it's not interfering by forcing untimely flushes, and the drives should optimize on their own.) Fourth, back to the parity. Remember, raid5/6 has all that parity that it writes out (but basically never reads in normal mode, only when degraded, in order to reconstruct the data from the missing device(s)), but doesn't actually use it for integrity checking. So while raid1 doesn't have the benefit of that parity data, it's not like raid5/6 used it anyway, and an N-way raid1 means even MORE missing-device protection since you can lose all but one device and keep right on going as if nothing happened. So a 5-way raid1 can lose 4 devices, not just the two devices of a 5-way raid6 or the single device of a raid5. Yes, there's the loss of parity/integrity data with raid1, BUT RAID5/6 DOESN'T USE THAT DATA FOR INTEGRITY CHECKING ANYWAY, ONLY FOR RECONSTRUCTION IN THE CASE OF DEVICE LOSS! So the N-way raid1 is far more redundant since you have N copies of the data, not one copy plus two-way-parity-that's-never-used-except-for-reconstruction. Fifth, in the event of device loss, a raid1 continues to function at normal speed, because it's simply an N-way copy with a bit of extra metadata to keep track of the number of N-ways. Of course you'll lose the benefit of read-parallelization that the missing device provided, and you'll lose the redundancy of the missing device, but in general, performance remains pretty much the same no matter how many ways it's raid-1 mirrored. Contrast that with raid5/6 which is SEVERELY read performance impacted by device loss, since it then must reconstruct the data using the parity data, not simply read it from somewhere else, which is what raid1 does. The single down side to raid1 as opposed to raid5/6 is the loss of the extra space made available by the data striping, 3*single-device-space in the case of 5-way raid6 (or 4-way raid5) vs. 
1*single-device-space in the case of raid1. Otherwise, no contest, hands down, raid1 over raid6. IOW, you're seeing now exactly why raid6 and to a lesser extent raid5 have such terrible performance (as opposed to reliability) reputations. Really, unless you simply don't have the space to make it raid1, I **STRONGLY** urge you to try that instead. I know I was very happily surprised by the results I got, and only then realized what all the negativity I'd seen around raid5/6 had been about, as I really hadn't understood it at all when I was doing my original research. Meanwhile, Rich0 already brought up btrfs, which really does promise a better solution to many of these issues than md/raid, in part due to what is arguably a "layering violation", but one that really DOES allow for some serious optimizations in the multiple-drive case, because as a filesystem, it DOES know what's real data and what's empty space that isn't worth worrying about, and because unlike raid5/6 parity, it really DOES care about data integrity, not just rebuilding in case of device failure. So several points on btrfs: 1) It's still in heavy development. The base single-device filesystem case works reasonably well now and is /almost/ stable, tho I'd still urge people to keep good backups as it's simply not time tested and settled, and won't be for at least a few more kernels as they're still busy with the other features. Second-level raid0/raid1/raid10 is at an intermediate level. Primary development and initial bug testing and fixing is done but they're still working on bugs that people doing only traditional single-device filesystems simply don't have to worry about. Third-round raid5/6 is still very new, introduced as VERY experimental only with 3.9 IIRC, and is currently EXPECTED to eat data in power-loss or crash events, so it's ONLY good for preliminary testing at this point. Thus, if you're using btrfs at all, keep good backups, and keep current, even -rc (if not live-git) on the kernel, because there really are fixes in every single kernel for very real corner-case problems they are still coming across. But single-device is /relatively/ stable now, so provided you keep good *TESTED* backups and are willing and able to use them if it comes to it, and keep current on the kernel, go for that. And I'm personally running dual-device raid-1 mode across two SSDs, at the second stage deployment level. I tried that (but still on spinning rust) a year ago and decided btrfs simply wasn't ready for me yet, so it has come quite a way in the last year. But raid5/6 mode is still fresh third-tier development, which I'd not consider usable until at LEAST 3.11 and probably 3.12 or later (maybe a year from now, since it's less mature than raid1 was at this point last year, but should mature a bit faster). Takeaway: If you don't have a backup you're prepared to use, you shouldn't be even THINKING about btrfs at this point, no matter WHAT type of deployment you're considering. If you do, you're probably reasonably safe with traditional single-device btrfs, intermediately risky/safe with raid0/1/10, don't even think about raid5/6 for real deployment yet, period. 2) RAID levels work QUITE a bit differently on btrfs. In particular, what btrfs calls raid1 mode (with the same applying to raid10) is simply two-way-mirroring, NO MATTER THE NUMBER OF DEVICES. There's no multi-way mirroring yet available, unless you're willing to apply not-yet-mainstreamed patches. It's planned, but not yet applied. 
The roadmap says it'll happen after raid5/6 are introduced (they have been, but aren't yet really finished including power-loss-recovery, etc), so I'm guessing 3.12 at the earliest as I think 3.11 is still focused on raid5/6 completion. 3) Btrfs raid1 mode is used to provide second-source for its data integrity feature as well, such that if one copy's checksum doesn't verify, it'll try the other one. Unfortunately #2 means there's only the single fallback to try, but that's better than most filesystems, without data integrity at all, or if they have it, no fallback if it fails. The combination of #2 and #3 was a bitter pill for me a year ago, when I was still running on aging spinning rust, and thus didn't trust two-copy-only redundancy. I really like the data integrity feature, but just a single backup copy was a great disappointment since I didn't trust my old hardware, and unfortunately two-copy-max remains the case for so-called raid1. (Raid5/6 mode apparently introduces N-way copies or some such, but as I said, it's not complete yet and is EXPECTED to eat data. N-way-mirroring will build on that and is on the horizon, but it has been on the horizon and not seeming to get much closer for over a year now...) Fortunately for me, my budget is in far better shape this year, and with the dual new SSDs I purchased and with spinning rust for backup still, I trust my hardware enough now to run the 2-way-only mirroring that btrfs calls raid1 mode. 4) As mentioned above in the btrfs intro paragraph, btrfs, being a filesystem, actually knows what data is actual data, and what is safely left untracked and thus unsynced. Thus, the read-data-in-before-writing-it problem will be rather less, certainly on freshly formatted disks where most existing data WILL be garbage/zeros (trimmed if on SSD, as mkfs.btrfs issues a trim command for the entire filesystem range before it creates the superblocks, etc, so empty space really /is/ zeroed). Similarly with "slack space" that's not currently used but was used previously, as the filesystem ages -- btrfs knows that it can ignore that data too, and thus won't have to read it in to update the checksum when writing to a raid5/6 mode btrfs. 5) There's various other nice btrfs features and a few caveats as well, but with the exception of anything btrfs-raid pertaining I totally forgot about, they're out of scope for this thread, which is after all, on raid, so I'll skip discussing them here. So bottom line, I really recommend md/raid1 for now. Unless you want to go md/raid10, with three-way-mirroring on the raid1 side. AFAIK that's doable with 5 devices, but it's simpler, certainly conceptually simpler which can make a difference to an admin trying to work with it, with 6. If the data simply won't fit on the 5-way raid1 and you want to keep at least 2-device-loss protection, consider splitting it up, raid1 with three devices for the first half, then either get a sixth device to do the same with the second half, or go raid1 with two devices and put your less critical data on the second set. Or, do the raid10 with 5 devices thing, but I'll admit that while I've read that it's possible, I don't really conceptually understand it myself, and haven't tried it, so I have no personal opinion or experience to offer on that. But in that case I really would try to scrape up the money for a sixth device if possible, and do raid10 with 3-way redundancy 2-way-striping across the six, simply because it's easier to conceptualize and thus to properly administer. 
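For reference, the two layouts suggested here map onto mdadm options directly. A minimal sketch, with the device names purely as placeholders:

  # six devices, "near" layout with three copies of every block:
  # 3-way mirroring, 2-way striping, usable space of two devices
  mdadm --create /dev/md0 --level=10 --raid-devices=6 --layout=n3 /dev/sd[b-g]1

  # md raid10 also works across an odd device count, e.g. the existing
  # five disks with two copies spread around them
  mdadm --create /dev/md0 --level=10 --raid-devices=5 --layout=n2 /dev/sd[b-f]1

The n2/n3 "near" layouts are the easiest to reason about; the far/offset variants trade layout simplicity for sequential read speed.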
-- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 46+ messages in thread
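As a concrete illustration of the btrfs raid1 point above (two-way mirroring regardless of device count), creating and inspecting such a filesystem looks roughly like this; device names and mount point are placeholders:

  # mirror both data and metadata across the member devices (two copies
  # total, no matter how many devices are given)
  mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc

  mount /dev/sdb /mnt
  # show how data and metadata chunks are actually allocated
  btrfs filesystem df /mnt

As of the kernels discussed in this thread, adding a third device to such a filesystem adds capacity, not a third copy.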
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 7:31 ` [gentoo-amd64] " Duncan @ 2013-06-21 10:28 ` Rich Freeman 2013-06-21 14:23 ` Bob Sanders ` (2 more replies) 2013-06-21 17:40 ` Mark Knecht 2013-06-30 1:04 ` Rich Freeman 2 siblings, 3 replies; 46+ messages in thread From: Rich Freeman @ 2013-06-21 10:28 UTC (permalink / raw To: gentoo-amd64 On Fri, Jun 21, 2013 at 3:31 AM, Duncan <1i5t5.duncan@cox.net> wrote: > So with 4k block sizes on a 5-device raid6, you'd have 20k stripes, 12k > in data across three devices, and 8k of parity across the other two > devices. With mdadm on a 5-device raid6 with 512K chunks you have 1.5M in a stripe, not 20k. If you modify one block it needs to read all 1.5M, or it needs to read at least the old chunk on the single drive to be modified and both old parity chunks (which on such a small array is 3 disks either way). > Fourth, back to the parity. Remember, raid5/6 has all that parity that it > writes out (but basically never reads in normal mode, only when degraded, > in order to reconstruct the data from the missing device(s)), but > doesn't actually use it for integrity checking. I wasn't aware of this - I can't believe it isn't even an option either. Note to self - start doing weekly scrubs... > The single down side to raid1 as opposed to raid5/6 is the loss of the > extra space made available by the data striping, 3*single-device-space in > the case of 5-way raid6 (or 4-way raid5) vs. 1*single-device-space in the > case of raid1. Otherwise, no contest, hands down, raid1 over raid6. This is a HUGE downside. The only downside to raid1 over not having raid at all is that your disk space cost doubles. raid5/6 is considerably cheaper in that regard. In a 5-disk raid5 the cost of redundancy is only 25% more, vs a 100% additional cost for raid1. To accomplish the same space as a 5-disk raid5 you'd need 8 disks. Sure, read performance would be vastly superior, but if you're going to spend $300 more on hard drives and whatever it takes to get so many SATA ports on your system you could instead add an extra 32GB of RAM or put your OS on a mirrored SSD. I suspect that both of those options on a typical workload are going to make a far bigger improvement in performance. Which is better really depends on your workload. In my case much of my raid space is used by mythtv, or for storage of stuff I only occasionally use. In these use cases the performance of the raid5 is more than adequate, and I'd rather be able to keep shows around for an extra 6 months in HD than have the DVR respond a millisecond faster when I hit play. If you really have sustained random access of the bulk of your data then a raid1 would make much more sense. > So several points on btrfs: > > 1) It's still in heavy development. That is what is keeping me away. I won't touch it until I can use it with raid5, and the first commit containing that hit the kernel weeks ago I think (and it has known gaps). Until it is stable I'm sticking with my current setup. > 2) RAID levels work QUITE a bit differently on btrfs. In particular, > what btrfs calls raid1 mode (with the same applying to raid10) is simply > two-way-mirroring, NO MATTER THE NUMBER OF DEVICES. There's no multi-way > mirroring yet available Odd, for some reason I thought it let you specify arbitrary numbers of copies, but looking around I think you're right. It does store two copies of metadata regardless of the number of drives unless you override this. 
However, if one considered raid1 expensive, having multiple layers of redundancy is REALLY expensive if you aren't using Reed Solomon and many data disks. From my standpoint I don't think raid1 is the best use of money in most cases, either for performance OR for data security. If you want performance the money is probably better spent on other components. If you want data security the money is probably better spent on offline backups. However, this very-much depends on how the disks will be used - there are certainly cases where raid1 is your best option. Rich ^ permalink raw reply [flat|nested] 46+ messages in thread
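On the weekly-scrub note above: md exposes the check through sysfs, so it is easy to script or cron. A minimal sketch, with md0 as a placeholder:

  # read every block, data and parity, and count (but don't repair) inconsistencies
  echo check > /sys/block/md0/md/sync_action

  # progress appears in /proc/mdstat; the result is summarised here afterwards
  cat /sys/block/md0/md/mismatch_cnt

  # writing 'repair' instead of 'check' rewrites mismatched parity from the data

Some distributions ship a periodic cron job doing exactly this; on a hand-rolled setup it has to be added by hand.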
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 10:28 ` Rich Freeman @ 2013-06-21 14:23 ` Bob Sanders 2013-06-21 14:27 ` Duncan 2013-06-22 23:04 ` Mark Knecht 2 siblings, 0 replies; 46+ messages in thread From: Bob Sanders @ 2013-06-21 14:23 UTC (permalink / raw To: gentoo-amd64 Rich Freeman, mused, then expounded: > On Fri, Jun 21, 2013 at 3:31 AM, Duncan <1i5t5.duncan@cox.net> wrote: > > > The single down side to raid1 as opposed to raid5/6 is the loss of the > > extra space made available by the data striping, 3*single-device-space in > > the case of 5-way raid6 (or 4-way raid5) vs. 1*single-device-space in the > > case of raid1. Otherwise, no contest, hands down, raid1 over raid6. > > This is a HUGE downside. The only downside to raid1 over not having > raid at all is that your disk space cost doubles. raid5/6 is > considerably cheaper in that regard. In a 5-disk raid5 the cost of > redundancy is only 25% more, vs a 100% additional cost for raid1. To > accomplish the same space as a 5-disk raid5 you'd need 8 disks. Sure, > read performance would be vastly superior, but if you're going to > spend $300 more on hard drives and whatever it takes to get so many > SATA ports on your system you could instead add an extra 32GB of RAM > or put your OS on a mirrored SSD. I suspect that both of those > options on a typical workload are going to make a far bigger > improvement in performance. > However, the incidence of failure is less with RAID1 than RAID5/6. As the number of devices increases, the failure rate increases. Indeed, the performance and total space can outweigh the increase in device failure. However, more devices - especially more devices that have motors and bearings - take more power, generate more heat, and increase the need for more backups to avert an increase in failures. Bob -- - ^ permalink raw reply [flat|nested] 46+ messages in thread
* [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 10:28 ` Rich Freeman 2013-06-21 14:23 ` Bob Sanders @ 2013-06-21 14:27 ` Duncan 2013-06-21 15:13 ` Rich Freeman 2013-06-22 23:04 ` Mark Knecht 2 siblings, 1 reply; 46+ messages in thread From: Duncan @ 2013-06-21 14:27 UTC (permalink / raw To: gentoo-amd64 Rich Freeman posted on Fri, 21 Jun 2013 06:28:35 -0400 as excerpted: > On Fri, Jun 21, 2013 at 3:31 AM, Duncan <1i5t5.duncan@cox.net> wrote: >> So with 4k block sizes on a 5-device raid6, you'd have 20k stripes, 12k >> in data across three devices, and 8k of parity across the other two >> devices. > > With mdadm on a 5-device raid6 with 512K chunks you have 1.5M in a > stripe, not 20k. If you modify one block it needs to read all 1.5M, or > it needs to read at least the old chunk on the single drive to be > modified and both old parity chunks (which on such a small array is 3 > disks either way). I'll admit to not fully understanding chunks/stripes/strides in terms of actual size tho I believe you're correct, it's well over the filesystem block size and a half-meg is probably right. However, the original post went with a 4k blocksize, which is pretty standard as that's the usual memory page size as well so it makes for a convenient filesystem blocksize too, so that's what I was using as a base for my numbers. If it's 4k blocksize, then 5-device raid6 stripe would be 3*4k=12k of data, plus 2*4k=8k of parity. >> Forth, back to the parity. Remember, raid5/6 has all that parity that >> it writes out (but basically never reads in normal mode, only when >> degraded, >> in ordered to reconstruct the data from the missing device(s)), but >> doesn't actually use it for integrity checking. > > I wasn't aware of this - I can't believe it isn't even an option either. > Note to self - start doing weekly scrubs... Indeed. That's one of the things that frustrated me with mdraid -- all that data integrity metadata there, but just going to waste in normal operation, only used for device recovery. Which itself can be a problem as well, because if there *IS* an undetected cosmic-ray-error or whatever and a device goes out, that means you'll lose integrity on a second device in the rebuild as well (if it was a data device that dropped out and not parity anyway), because the parity's screwed against the undetected error and will thus rebuild a bad copy of the data on the replacement device. And it's one of the things which so attracted me to btrfs, too, and why I was so frustrated to see it could only be a single redundancy (two-way- mirrored), no way to do more. The btrfs sales pitch talks about how great data integrity and the ability to go find a good copy when the data's bad, but what if the only allowed second copy is bad as well? OOPS! But as I said, N-way mirroring is on the btrfs roadmap, it's simply not there yet. >> The single down side to raid1 as opposed to raid5/6 is the loss of the >> extra space made available by the data striping, 3*single-device-space >> in the case of 5-way raid6 (or 4-way raid5) vs. 1*single-device-space >> in the case of raid1. Otherwise, no contest, hands down, raid1 over >> raid6. > > This is a HUGE downside. The only downside to raid1 over not having > raid at all is that your disk space cost doubles. raid5/6 is > considerably cheaper in that regard. In a 5-disk raid5 the cost of > redundancy is only 25% more, vs a 100% additional cost for raid1. To > accomplish the same space as a 5-disk raid5 you'd need 8 disks. 
Sure, > read performance would be vastly superior, but if you're going to spend > $300 more on hard drives and whatever it takes to get so many SATA ports > on your system you could instead add an extra 32GB of RAM or put your OS > on a mirrored SSD. I suspect that both of those options on a typical > workload are going to make a far bigger improvement in performance. I'd suggest that with the exception of large database servers where the object is to be able to cache the entire db in RAM, the SSDs are likely a better option. FWIW, my general "gentoo/amd64 user" rule of thumb is 1-2 gig base, plus 1-2 gig per core. Certainly that scale can slide either way and it'd probably slide down for folks not doing system rebuilds in tmpfs, as gentooers often do, but up or down, unless you put that ram in a battery- backed ramdisk, 32-gig is a LOT of ram, even for an 8-core. FWIW, with my old dual-dual-core (so four cores), 8 gig RAM was nicely roomy, tho I /did/ sometimes bump the top and thus end up either dumping either cache or swapping. When I upgraded to the 6-core, I used that rule of thumb and figured ~12 gig, but due to the power-of-twos efficiency rule, I ended up with 16 gig, figuring that was more than I'd use in practice, but better than limiting it to 8 gig. I was right. The 16 gig is certainly nice, but in reality, I'm typically entirely wasting several gigs of it, not even cache filling it up. I typically run ~1 gig in application memory and several gigs in cache, with only a few tens of MB in buffer. But while I'll often exceed my old capacity of 8 gig, it's seldom by much, and 12 gig would handle everything including cache without dumping at well over the 90th percentile, probably 97% or there abouts. Even with parallel make at both the ebuild and global portage level and with PORTAGE_TMPDIR in tmpfs, I hit 100% on the cores well before I run out of RAM and start dumping cache or swapping. The only time that has NOT been the case is when I deliberately saturate, say a kernel build with an open-ended -j so it stacks up several-hundred jobs at once. Meanwhile, the paired SSDs in btrfs raid1 make a HUGE practical difference, especially in things like the (cold-cache) portage tree (and overlays) sync, kernel git pull, etc. (In my case actual booting didn't get a huge boost as I run ntp-client and ntpd at boot, and the ntp-client time sync takes ~12 seconds, more than the rest of the boot put together. But cold-cache loading kde happens faster now -- I actually uninstalled the ksplash and just go text-console login to x-black-screen to kde/plasma desktop, now. But the tree sync and kernel pull are still the places I appreciate the SSDs most.) And notably, because the cold-cache system is so much faster with the SSDs, I tend to actually shut-down instead of suspending, now, so I tend to cache even less and thus use less memory with the SSDs than before. I /could/ probably do 8-gig RAM now instead of 16, and not miss it. Even a gig per core, 6-gig, wouldn't be terrible, tho below that would start to bottleneck and pinch a bit again I suspect. > Which is better really depends on your workload. In my case much of my > raid space is used my mythtv, or for storage of stuff I only > occasionally use. In these use cases the performance of the raid5 is > more than adequate, and I'd rather be able to keep shows around for an > extra 6 months in HD than have the DVR respond a millisecond faster when > I hit play. 
If you really have sustained random access of the bulk of > your data than a raid1 would make much more sense. Definitely. For mythTV or similar massive media needs, raid5 will be fast enough. And I suspect just the single device-loss tolerance is a reasonable risk tradeoff for you too, since after all it /is/ just media, so tolerating loss of a single device is good, but the risk of losing two before a full rebuild with a replacement if one fails is acceptable, given the cost vs. size tradeoff with the massive size requirements of video. But again, the OP seemed to find his speed benchmarks disappointing, to say the least, and I believe pointing out raid6 as the culprit is accurate. Which, given his production-rating reliability stock trading VMs usage, I'm guessing raid5/6 really isn't the ideal match. Massive media, yes, definitely. Massive VMs, not so much. >> So several points on btrfs: >> >> 1) It's still in heavy development. > > That is what is keeping me away. I won't touch it until I can use it > with raid5, and the first common containing that hit the kernel weeks > ago I think (and it has known gaps). Until it is stable I'm sticking > with my current setup. Question: Would you use it for raid1 yet, as I'm doing? What about as a single-device filesystem? Do you believe my estimates of reliability in those cases (almost but not quite stable for single-device, kind of in the middle for raid1/raid0/raid10, say a year behind single-device and raid5/6/50/60 about a year behind that) reasonably accurate? Because if you're waiting until btrfs raid5 is fully stable, that's likely to be some wait yet -- I'd say a year, likely more given that everything btrfs has seemed to take longer than people expected. But if you're simply waiting until it matures to the point that say btrfs raid1 is at now, or maybe even a bit less, but certainly to where it's complete plus say a kernel release to work out a few more wrinkles, then that's quite possible by year-end. >> 2) RAID levels work QUITE a bit differently on btrfs. In particular, >> what btrfs calls raid1 mode (with the same applying to raid10) is >> simply two-way-mirroring, NO MATTER THE NUMBER OF DEVICES. There's no >> multi-way mirroring yet available > > Odd, for some reason I thought it let you specify arbitrary numbers of > copies, but looking around I think you're right. It does store two > copies of metadata regardless of the number of drives unless you > override this. Default is single-copy data, dual-copy metadata, regardless of number of devices (single device does DUP metadata, two copies on the same device, by default), with the exception of SSDs, where the metadata default is single since many of the SSD firmwares (sandforce firmware, with its compression features, is known to do this, tho mine, I forgot the firmware brand ATM but it's Corsair Neutron SSDs aimed at the server/ workstation market where unpredictability isn't considered a feature, doesn't as one of its features is stable performance and usage regardless of the data its fed) do dedup on identical copy data anyway. At least that's the explanation given for the SSD exception. But the real gotcha is that there's no way to setup N-way (N>2) redundancy on btrfs raid1/10, and I know for a fact that catches some admins by nasty surprise, as I've seen it come up on the btrfs list as well as had my own personal disappointment with it, tho luckily I did my research and figured that out before I actually installed on btrfs. 
I just wish they'd called it 2-way-mirroring instead of raid1, as that wouldn't be the deception in labeling that I consider the btrfs raid1 moniker at this point, and admins would be far less likely to be caught unaware when a second device goes haywire that they /thought/ they'd be covered for. Of course at this point it's all still development anyway, so no sane admin is going to be lacking backups in any case, but there's a lot of people flying by the seat of their pants out there, who have NOT done the research, and they show up frequently on the btrfs list, after it's too late. (Tho certainly there's less of them showing up now than a year ago, when I first investigated btrfs, I think both due to btrfs maturing quite a bit since then and to a lot of the original btrfs hype dying down, which is a good thing considering the number of folks that were installing it, only to find out once they lost data that it was still development.) > However, if one considered raid1 expensive, having multiple layers of > redundancy is REALLY expensive if you aren't using Reed Solomon and many > data disks. Well, depending on the use case. In your media case, certainly. However, that's one of the few cases that still gobbles storage space as fast as the manufacturers up their capacities, and that is likely to continue to do so for at least a few more years, given that HD is still coming in, so a lot of the media is still SD, and with quad-HD in the wings as well, now. But once we hit half-petabyte, I suppose even quad-HD won't be gobbling the space as fast as they can upgrade it, any more. So a half-decade or so, maybe? Plus of course the sheer bandwidth requirements for quad-HD are astounding, so at that point either some serious raid0/x0 raid or ssds for the speed will be pretty mandatory anyway, remaining SSD size limits or no SSD size limits. > From my standpoint I don't think raid1 is the best use of money in most > cases, either for performance OR for data security. If you want > performance the money is probably better spent on other components. If > you want data security the money is probably better spent on offline > backups. However, this very-much depends on how the disks will be used > - there are certainly cases where raid1 is your best option. I agree when the use is primarily video media. Other than that, a pair of 2 TB spinning rust drives tends to still go quite a long way, and tends to be a pretty good cost/risk tradeoff IMO. Throwing in a third 2-TB drive for three-way raid1 mirroring is often a good idea as well, where the additional data security is needed, but beyond that, the cost/benefit balance probably doesn't make a whole lot of sense, agreed. And offline backups are important too, but with dual 2TB drives, many people can live with a TB of data and do multiple raid1s, giving themselves both logically offline backup and physical device redundancy. And if that means they do backups to the second raid set on the same physical devices more reliably than they would with an external that they have to physically look for and/or attach each time (as turned out to be the case for me), then the pair of 2TB drives is quite a reasonable investment indeed. But if you're going for performance, spinning rust raid simply doesn't cut it at the consumer level any longer. 
SSD at least the commonly used data, leaving say the media data on spinning rust for the time being if the budget doesn't work otherwise, as I've actually done here with my (much smaller than yours) media collection, figuring it not worth the cost to put /it/ on SSD just yet. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 14:27 ` Duncan @ 2013-06-21 15:13 ` Rich Freeman 2013-06-22 10:29 ` Duncan 0 siblings, 1 reply; 46+ messages in thread From: Rich Freeman @ 2013-06-21 15:13 UTC (permalink / raw To: gentoo-amd64 On Fri, Jun 21, 2013 at 10:27 AM, Duncan <1i5t5.duncan@cox.net> wrote: > Rich Freeman posted on Fri, 21 Jun 2013 06:28:35 -0400 as excerpted: > >> That is what is keeping me away. I won't touch it until I can use it >> with raid5, and the first common containing that hit the kernel weeks >> ago I think (and it has known gaps). Until it is stable I'm sticking >> with my current setup. > > Question: Would you use it for raid1 yet, as I'm doing? What about as a > single-device filesystem? Do you believe my estimates of reliability in > those cases (almost but not quite stable for single-device, kind of in > the middle for raid1/raid0/raid10, say a year behind single-device and > raid5/6/50/60 about a year behind that) reasonably accurate? If I wanted to use raid1 I might consider using btrfs now. I think it is still a bit risky, but the established use cases have gotten a fair bit of testing now. I'd be more confident in using it with a single device. > > Because if you're waiting until btrfs raid5 is fully stable, that's > likely to be some wait yet -- I'd say a year, likely more given that > everything btrfs has seemed to take longer than people expected. That's my thought as well. Right now I'm not running out of space, so I'm hoping that I can wait until the next time I need to migrate my data (from 1TB to 5+TB drives, for example). With such a scenario I don't need to have 10 drives mounted at once to migrate the data - I can migrate existing data to 1-2 drives, remove the old ones, and expand the new array. To migrate today would require finding someplace to dump all the data offline and migrate the drives, as there is no in-place way to migrate multiple ext3/4 logical volumes on top of mdadm to a single btrfs on bare metal. Without replying to anything in particular both you and Bob have mentioned the importance of multiple redundancy. Obviously risk goes down as redundancy goes up. If you protect 25 drives of data with 1 drive of parity then you need 2/26 drives to fail to hose 25 drives of data. If you protect 1 drive of data with 25 drives of parity (call them mirrors or parity or whatever - they're functionally equivalent) then you need 25/26 drives to fail to lose 1 drive of data. RAID 1 is actually less effective - if you protect 13 drives of data with 13 mirrors you need 2/26 drives to fail to lose 1 drive of data (they just have to be the wrong 2). However, you do need to consider that RAID is not the only way to protect data, and I'm not sure that multiple-redundancy raid-1 is the most cost-effective strategy. If I had 2 drives of data to protect and had 4 spare drives to do it with, I doubt I'd set up a 3x raid-1/5/10 setup (or whatever you want to call it - imho raid "levels" are poorly named as there really is just striping and mirroring and adding RS parity and everything else is just combinations). Instead I'd probably set up a RAID1/5/10/whatever with single redundancy for faster storage and recovery, and an offline backup (compressed and with incrementals/etc). The backup gets you more security and you only need it in a very unlikely double-failure. 
I'd only invest in multiple redundancy in the event that the risk-weighted cost of having the node go down exceeds the cost of the extra drives. Frankly in that case RAID still isn't the right solution - you need a backup node someplace else entirely as hard drives aren't the only thing that can break in your server. This sort of rationale is why I don't like arguments like "RAM is cheap" or "HDs are cheap" or whatever. The fact is that wasting money on any component means investing less in some other component that could give you more space/performance/whatever-makes-you-happy. If you have $1000 that you can afford to blow on extra drives then you have $1000 you could blow on RAM, CPU, an extra server, or a trip to Disney. Why not blow it on something useful? Rich ^ permalink raw reply [flat|nested] 46+ messages in thread
* [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 15:13 ` Rich Freeman @ 2013-06-22 10:29 ` Duncan 2013-06-22 11:12 ` Rich Freeman 0 siblings, 1 reply; 46+ messages in thread From: Duncan @ 2013-06-22 10:29 UTC (permalink / raw To: gentoo-amd64 Rich Freeman posted on Fri, 21 Jun 2013 11:13:51 -0400 as excerpted: > On Fri, Jun 21, 2013 at 10:27 AM, Duncan <1i5t5.duncan@cox.net> wrote: >> Question: Would you use [btrfs] for raid1 yet, as I'm doing? >> What about as a single-device filesystem? > If I wanted to use raid1 I might consider using btrfs now. I think it > is still a bit risky, but the established use cases have gotten a fair > bit of testing now. I'd be more confident in using it with a single > device. OK, so we agree on the basic confidence level of various btrfs features. I trust my own judgement a bit more now. =:^) > To migrate today would require finding someplace to dump all > the data offline and migrate the drives, as there is no in-place way to > migrate multiple ext3/4 logical volumes on top of mdadm to a single > btrfs on bare metal. ... Unless you have enough unpartitioned space available still. What I did a few years ago is buy a 1 TB USB drive I found at a good deal. (It was very near the price of half-TB drives at the time, I figured out later they must have gotten shipped a pallet of the wrong ones for a sale on the half-TB version of the same thing, so it was a single-store, get-it-while-they're-there-to-get, deal.) That's how I was able to migrate from the raid6 I had back to raid1. I had to squeeze the data/partitions a bit to get everything to fit, but it did, and that's how I ended up with 4-way raid1, since it /had/ been a 4- way raid6. All 300-gig drives at the time, so the TB USB had /plenty/ of room. =:^) > Without replying to anything in particular both you and Bob have > mentioned the importance of multiple redundancy. > > Obviously risk goes down as redundancy goes up. If you protect 25 > drives of data with 1 drive of parity then you need 2/26 drives to fail > to hose 25 drives of data. Ouch! > If you protect 1 drive of data with 25 drives of parity (call them > mirrors or parity or whatever - they're functionally equivalent) then > you need 25/26 drives to fail to lose 1 drive of data. Almost correct. Except that with 25/26 failed, you'd still have 1 working, which with raid1/mirroring would be enough. (AFAIK that's the difference with parity. Parity is generally done on a minimum of two devices with the third as parity, and going down to just one isn't enough, you can lose only one, or two if you have two-way-parity as with raid6. With mirroring/raid1, they're all essentially identical, so one is enough to keep going, you'd have to loose 26/26 to be dead in the water. But 25/26 dead or 26/26 dead, you better HOPE it never comes down to where that matters!) > RAID 1 is actually less effective - if you protect 13 > drives of data with 13 mirrors you need 2/26 drives to fail to lose 1 > drive of data (they just have to be the wrong 2). However, you do need > to consider that RAID is not the only way to protect data, and I'm not > sure that multiple-redundancy raid-1 is the most cost-effective > strategy. The first time I read that thru I read it wrong, and was about to disagree. Then I realized what you meant... and that it was an equally valid read of what you wrote, except... AFAIK 13 drives of data with 13 mirrors wouldn't (normally) be called raid1 (unless it's 13 individual raid1s). 
Normally, an arrangement of that nature if configured together would be configured as raid10, 2-way- mirrored, 13-way-striped (or possibly raid0+1, but that's not recommended for technical reasons having to do with rebuild thruput), tho it could also be configured as what mdraid calls linear mode (which isn't really raid, but it happens to be handled by the same md/raid driver in Linux) across the 13, plus raid1, or if they're configured as separate volumes, 13 individual two-disk raid1s, any of which might be what you meant (and the wording appears to favor 13 individual raid1s). What I interpreted it as initially was a 13-way raid1, mirrored again at a second level to 13 additional drives, which would be called raid11, except that there's no benefit of that over a simple single-layer 26-way raid1 so the raid11 term is seldom seen, and that's clearly not what you meant. Anyway, you're correct if it's just two-way-mirrored. However, at that level, if one was to do only two-way-mirroring, one would usually do either raid10 for the 13-way striping, or 13 separate raid1s, which would give one the opportunity to make some of them 3-way-mirrored (or more) raid1s for the really vital data, leaving the less vital data as simple 2-way-mirror-raid1s. Or raid6 and get loss-of-two tolerance, but as this whole subthread is discussing, that can be problematic for thruput. (I've occasionally seen reference to raid7, which is said to be 3-way-parity, loss-of-three- tolerance, but AFAIK there's no support for it in the kernel, and I wouldn't be surprised if all implementations are proprietary. AFAIK, in practice, raid10 with N-way mirroring on the raid1 portion is implemented once that many devices get involved, or other multi-level raid schemes.) > If I had 2 drives of data to protect and had 4 spare drives to do it > with, I doubt I'd set up a 3x raid-1/5/10 setup (or whatever you want to > call it - imho raid "levels" are poorly named as there really is just > striping and mirroring and adding RS parity and everything else is just > combinations). Instead I'd probably set up a RAID1/5/10/whatever with > single redundancy for faster storage and recovery, and an offline backup > (compressed and with incrementals/etc). The backup gets you more > security and you only need it in a very unlikely double-failure. I'd > only invest in multiple redundancy in the event that the risk-weighted > cost of having the node go down exceeds the cost of the extra drives. > Frankly in that case RAID still isn't the right solution - you need a > backup node someplace else entirely as hard drives aren't the only thing > that can break in your server. So we're talking six drives, two of data and four "spares" to play with. Often that's setup as raid10, either two-way-striped and 3-way-mirrored, or 3-way-striped and 2-way-mirrored, depending on whether the loss-of-two tolerance of 3-way-mirroring or thruput of three-way-striping, is considered of higher value. You're right that at that level, you DO need a real backup, and it should take priority over raid-whatever. HOWEVER, in addition to creating a SINGLE raid across all those drives, it's possible to partition them up, and create multiple raids out of the partitions, with one set being a backup of the other. And since you've already stated that there's only two drives worth of data, there's certainly room enough amongst the six drives total to do just that. 
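For the record, both of those six-drive raid10 shapes are just a --layout choice in mdadm; a rough sketch (device names and partition numbers purely illustrative):

  # two-way-striped, three-way-mirrored: ~2 drives of capacity, survives any two failures
  mdadm --create /dev/md0 --level=10 --layout=n3 --raid-devices=6 /dev/sd[abcdef]2

  # three-way-striped, two-way-mirrored: ~3 drives of capacity, guaranteed to survive only one
  mdadm --create /dev/md0 --level=10 --layout=n2 --raid-devices=6 /dev/sd[abcdef]2

The partition-it-up-and-keep-a-backup-set approach mentioned above is just more such arrays, built from partitions instead of whole drives.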
This is in fact how I ran my raids, both my raid6 config and my raid1 config, for a number of years, and is in fact how I have my (raid1-mode) btrfs filesystems set up now on the SSDs.

Effectively I had/have each drive partitioned up into two sets of partitions, my "working" set and my "backup" set. Then I md-raided each partition, at my chosen level, across all devices. So on each physical device partition 5 might be the working rootfs partition, partition 6 the working home partition... partition 9 the backup rootfs partition, and partition 10 the backup home partition. They might end up being md3 (rootwork), md4 (homework), md7 (rootbak) and md8 (homebak). That way, you're protected against physical device death by the redundancy of the raids, and from fat-fingering or an update gone wrong by the redundancy of the backup partitions across the same physical devices.

What's nice about an arrangement such as this is that it gives you quite a bit more flexibility than you'd have with a single raid, since it's now possible to decide "Hmm, I don't think I actually need a backup of /var/log, so I think I'll only run with one log partition/raid, instead of the usual working/backup arrangement." Similarly, "You know, I ultimately don't need backups of the gentoo tree and overlays, or of the kernel git tree, at all, since as Linus says, 'Real men upload it to the net and let others be their backup', and I can always redownload that from the net, so I think I'll raid0 this partition and not keep any copies at all, since re-downloading's less trouble than dealing with the backups anyway."

Finally, and possibly critically, it's possible to say, "You know, what happens if I've just wiped rootbak in order to make a new root backup, and I have a crash and working-root refuses to boot? I think I need a rootbak2, and with the space I saved by doing only one log partition and by making the sources trees raid0, I have room for it now, without using any more space than I would have had I put everything on the same raid."

Another nice thing about it, and this is what I would have ended up doing if I hadn't conveniently found that 1 TB USB drive at such a good price, is that while the whole thing is partitioned up and in use, it's very possible to wipe out the backup partitions temporarily, recreate them as a different raid level or a different filesystem, or otherwise reorganize that area, then reboot into the new version, and do the same to what was the working copies. (For the area that was raid0, well, it was raid0 because it's easy to recreate, so just blow it away and recreate it on the new layout. And for the single-raid log without a backup copy, it's simple enough to point the log elsewhere, or keep it on rootfs, for long enough to redo that set of partitions across all physical devices.)

Again, this isn't just theory, it really works, as I've done it to various degrees at various times, even if I found copying to the external 1 TB USB drive and booting from it more convenient when I transferred from raid6 to raid1. And since I do run ~arch, there have been a number of times I've needed to boot to rootbak instead of the working root, including once when a ~arch portage was hosing symlinks just as a glibc update came along, thus breaking glibc (!!), once when a bash update broke, and another time when a glibc update mostly worked but I needed to downgrade and the protection built into the glibc ebuild wasn't letting me do it from my working root.
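A minimal sketch of the mdadm side of such a working/backup split (drive letters and the four-device count are only illustrative; the partition and md numbers follow the example above):

  # working set: partition 5 = root, partition 6 = home, raided across all four drives
  mdadm --create /dev/md3 --level=1 --raid-devices=4 /dev/sd[abcd]5
  mdadm --create /dev/md4 --level=1 --raid-devices=4 /dev/sd[abcd]6

  # backup set: same levels, different partitions, same physical drives
  mdadm --create /dev/md7 --level=1 --raid-devices=4 /dev/sd[abcd]9
  mdadm --create /dev/md8 --level=1 --raid-devices=4 /dev/sd[abcd]10

Each array then gets its own filesystem, and the backup set is refreshed by mounting it and copying the working set over whenever a checkpoint is wanted.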
What's nice about this setup in regard to booting to rootbak instead of the usual working root, is that unlike booting to a liveCD/DVD rescue disk, you have the full working system installed, configured and running just as it was when the backup was made. That makes it much easier to pickup and run from where you left off, with all the tools you're used to having and modes of working you're used to using, instead of being limited to some artificial rescue environment often with limited tools, and in any case setup and configured differently than you have your own system, because rootbak IS your own system, just from a few days/weeks/ months ago, whenever it was that you last did the backup. Anyway, with the parameters you specified, two drives full of data and four spare drives (presumably of a similar size), there's a LOT of flexibility. There's raid10 across four drives (two-mirror, two-stripe) with the other two as backup (this would probably be my choice given the 2-disks of data, 6 disk total, constraints, but see below, and it appears this might be your choice as well), or raid6 across four drives (two mirror, two parity) with two as backups (not a choice I'd likely make, but a choice), or a working pair of drives plus two sets of backups (not a choice I'd likely make), or raid10 across all six drives in either 3- mirror/2-stripe or 3-stripe/2-mirror mode (I'd probably elect for this with 3-stripe/2-mirror for the 3X speed and space, and prioritize a separate backup, see the discussion below), or two independent 3-disk raid5s (IMO there's better options for most cases, with the possible exception of primarily slow media usage, just which options are better depends on usage and priorities tho), or some hybrid combination of these. > This sort of rationale is why I don't like arguments like "RAM is cheap" > or "HDs are cheap" or whatever. The fact is that wasting money on any > component means investing less in some other component that could give > you more space/performance/whatever-makes-you-happy. If you have $1000 > that you can afford to blow on extra drives then you have $1000 you > could blow on RAM, CPU, an extra server, or a trip to Disney. Why not > blow it on something useful? [ This gets philosophical. OK to quit here if uninterested. ] You're right. "RAM and HDs are cheap"... relative to WHAT, the big- screen TV/monitor I WOULD have been replacing my much smaller monitor with, if I hadn't been spending the money on the "cheap" RAM and HDs? Of course, "time is cheap" comes with the same caveats, and can actually end up being far more dear. Stress and hassle of administration similarly. And sometimes, just a bit of investment in another "expensive" HD, saves you quite a bit of "cheap" time and stress, that's actually more expensive. "It's all relative"... to one's individual priorities. Because one thing's for sure, both money and time are fungible, and if they aren't spent on one thing, they WILL be on another (even if that "spent" is savings, for money), and ultimately, it's one's individual priorities that should rank where that spending goes. And I can't set your priorities and you can't set mine, so... But from my observation, a LOT of folks don't realize that and/or don't take the time necessary to reevaluate their own priorities from time to time, so end up spending out of line with their real priorities, and end up rather unhappy people as a result! 
That's one reason why I have a personal policy to deliberately reevaluate personal priorities from time to time (as well as being aware of them constantly), and rearrange spending, money time and otherwise, in accordance with those reranked priorities. I'm absolutely positive I'm a happier man for doing so! =:^) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-22 10:29 ` Duncan @ 2013-06-22 11:12 ` Rich Freeman 2013-06-22 15:45 ` Duncan 0 siblings, 1 reply; 46+ messages in thread From: Rich Freeman @ 2013-06-22 11:12 UTC (permalink / raw To: gentoo-amd64 On Sat, Jun 22, 2013 at 6:29 AM, Duncan <1i5t5.duncan@cox.net> wrote: > Rich Freeman posted on Fri, 21 Jun 2013 11:13:51 -0400 as excerpted: >> If you protect 1 drive of data with 25 drives of parity (call them >> mirrors or parity or whatever - they're functionally equivalent) then >> you need 25/26 drives to fail to lose 1 drive of data. > > Almost correct. DOH - good catch. Would need 26 fails. > AFAIK 13 drives of data with 13 mirrors wouldn't (normally) be called > raid1 (unless it's 13 individual raid1s)... That's why I commented that I find RAID "levels" extremely unhelpful. There is striping, mirroring, and RS parity, and every possible combination of the above. We have a special name raid5 for striping with one RS parity drive. We have another special name raid6 for striping with two RS parity drives. We don't have a special name for striping with 37 RS parity drives. Yet, all three of these are the same thing. I was referring to 13 data drives with one mirror each . If you lose two drives you could potential lose one drive of data. If you made that one big raid10 then if you lose two drives you could lose 13 drives of data. Both scenarios involve bad luck in terms of what pair goes. > You're right that at that level, you DO need a real backup, and it should > take priority over raid-whatever. HOWEVER, in addition to creating a > SINGLE raid across all those drives, it's possible to partition them up, > and create multiple raids out of the partitions, with one set being a > backup of the other. I wouldn't consider that a great strategy. Sure, it is convenient, but it does you no good at all if your computer burns up in a fire. Multiple-level redundancy just seems to be past the point of diminishing returns to me. If I wanted to spend that kind of money I'd probably spend it differently. However, I do agree that mdadm should support more flexible arrays. For example, my boot partition is raid1 (since grub doesn't support anything else), and I have it set up across all 5 of my drives. However, the reality is that only two get used and the others are treated only as spares. So, that is just a waste of space, and it is actually more annoying from a config perspective because it would be really nice if my system could boot from an arbitrary drive. Oh, as far as raid on partitions goes - I do use this for a different purpose. If you have a collection of drives of different sizes it can reduce space waste. Suppose you have 3 500GB drives and 2 1TB drives. If you put them all directly in a raid5 you get 2TB of space. If you chop the 1TB drives into 2 500GB partitions then you can get two raid5s - one 2TB in space, and the other 500GB in space. That is 500GB more data for the same space. Oh, and I realize I wrote raid5. With mdadm you can set up a 2-drive raid5. It is functionally equivalent to a raid1 I think, and I believe you can convert between them, but since I generally intend to expand arrays I prefer to just set them up as raid5 from the start. Since I stick lvm on top I don't care if the space is chopped up. Rich ^ permalink raw reply [flat|nested] 46+ messages in thread
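As a concrete sketch of that mixed-size example (all device names hypothetical): with sda, sdb, sdc as the 500GB drives and sdd, sde as the 1TB drives, each of the latter split into two ~500GB partitions, the two arrays described above would look roughly like:

  # five ~500GB pieces in one raid5: 4 x 500GB = ~2TB usable
  mdadm --create /dev/md10 --level=5 --raid-devices=5 /dev/sd[abc]1 /dev/sd[de]1

  # the leftover halves of the two 1TB drives: a 2-device raid5, ~500GB usable
  mdadm --create /dev/md11 --level=5 --raid-devices=2 /dev/sd[de]2

That compares with about 2TB total if all five whole drives went into a single raid5, which is where the extra 500GB comes from.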
* [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-22 11:12 ` Rich Freeman @ 2013-06-22 15:45 ` Duncan 0 siblings, 0 replies; 46+ messages in thread From: Duncan @ 2013-06-22 15:45 UTC (permalink / raw To: gentoo-amd64 Rich Freeman posted on Sat, 22 Jun 2013 07:12:25 -0400 as excerpted: > Multiple-level redundancy just seems to be past the point of diminishing > returns to me. If I wanted to spend that kind of money I'd probably > spend it differently. My point was that for me, it wasn't multiple level redundancy. It was simply device redundancy (raid), and fat-finger redundancy (backups), on the same set of drives so I was protected from either scenario. The fire/flood scenario would certainly get me if I didn't have offsite backups, but just as you call multiple redundancy past your point of diminishing returns, I call the fire/flood scenario past mine. If that happens, I figure I'll have far more important things to worry about than rebuilding my computer for awhile. And chances are, when I do get around to it, things will be progressed enough that much of the data won't be worth so much any more anyway. Besides, the real /important/ data is in my head. What's worth rebuilding, will be nearly as easy to rebuild due to what's in my head, as it would be to go thru what's now historical data and try to pick up the pieces, sorting thru what's still worth keeping around and what's not. Tho as I said, I do/did keep an additional level of backup on that 1 TB drive, but it's on-site too, and while not in the computer, it's generally nearby enough that it'd be lost too in case of flood/fire. It's more a convenience than a real backup, and I don't really keep it upto date, but if it survived and what's in the computer itself didn't, I do have old copies of much of my data, simply because it's still there from the last time I used that drive as convenient temporary storage while I switched things around. > However, I do agree that mdadm should support more flexible arrays. For > example, my boot partition is raid1 (since grub doesn't support anything > else), and I have it set up across all 5 of my drives. However, the > reality is that only two get used and the others are treated only as > spares. So, that is just a waste of space, and it is actually more > annoying from a config perspective because it would be really nice if my > system could boot from an arbitrary drive. Three points on that. First, obviously you're not on grub2 yet. It handles all sorts of raid, lvm, newer filesystems like btrfs (and zfs for those so inclined), various filesystems, etc, natively, thru its modules. Second, /boot is an interesting case. Here, originally (with grub1 and the raid6s across 4 drives) I setup a 4-drive raid1. But, I actually installed grub to the boot sector of all four drives, and tested each one booting just to grub by itself (the other drives off), so I knew it was using its own grub, not pointed somewhere else. But I was still worried about it as while I could boot from any of the drives, they were a single raid1, which meant no fat-finger redundancy, and doing a usable backup of /boot isn't so easy. So I think it was when I switched from raid6 to raid1 for almost the entire system, that I switched to dual dual-drive raid1s for /boot as well, and of course tested booting to each one alone again, just to be sure. 
That gave me fat-finger redundancy, as well as added convenience since I run git kernels: I was able to update just the one dual-drive raid1 /boot with the git kernels, then update the backup with the releases once they came out, which made for a nice division of stable kernel vs pre-release there.

That dual dual-drive raid1 setup proved very helpful when I upgraded to grub2 as well, since I was able to play around with it on the one dual-drive raid1 /boot while the other one stayed safely bootable grub1, until I had grub2 working the way I wanted on the working /boot, and had again installed and tested it on both component hard drives, booting to grub and to the full raid1 system from each drive by itself, with the others entirely shut off. Only when I had both drives of the working /boot up and running grub2 did I mount the backup /boot as well and copy over the now-working config to it, before running grub2-install on those two drives.

Of course somewhere along the way, IIRC at the same time as the raid6 to raid1 conversion, I had also upgraded to gpt partitions from traditional mbr. When I did, I had the foresight to create BOTH dedicated BIOS boot partitions and EFI partitions on each of the four drives. grub1 wasn't using them, but that was fine; they were small (tiny). That made the upgrade to grub2 even easier, since grub2 could install its core into the dedicated BIOS partitions. The EFI partitions remain unused to this day, but as I said, they're tiny, and with gpt they're specifically typed and labeled so they can't mix me up, either.

(BTW, talking about data integrity, if you're not on GPT yet, do consider it. It keeps a second partition table at the end of the drive as well as the one at the beginning, and unlike mbr they're checksummed, so corruption is detected. It also kills the primary/extended/logical distinction so no more worrying about that, and allows partition labels, much like filesystem labels, which makes tracking and managing what's what **FAR** easier. I GPT-partition everything now, including my USB thumbdrives if I partition them at all!)

When that machine slowly died and I transferred to a new half-TB drive, thinking it was the aging 300-gigs (it wasn't, caps were dying on the by-then 8-year-old mobo), and then transferred that into my new machine without raid, I did the usual working/backup partition arrangement, but got frustrated without the ability to have a backup /boot, because with just one device, the boot sector could point just one place: at the core grub2 in the dedicated BIOS boot partition, which in turn pointed at the usual /boot. Now grub2's better in this regard than grub1, since that core grub2 has an emergency mode that would give me limited ability to load a backup /boot, but that's an entirely manual process with a comparatively limited grub2 emergency shell without additional modules available, and I didn't actually take advantage of it to configure a backup /boot that it could reach.

But when I switched to the SSDs, I again had multiple devices, the pair of SSDs, which I set up with individual /boots, plus the original one still on the spinning rust. Again I installed grub2 to each one, pointed at its own separately configured /boot, so now I actually have three separately configured and bootable /boots, one on each of the SSDs and a third on the spinning rust half-TB. (FWIW the four old 300-gigs are sitting on the shelf. I need to badblocks or dd them to wipe, and I have a friend that'll buy them off me.)

Third point.
/boot partition raid1 across all five drives and three are wasted? How? I believe if you check, all five will have a mirror of the data (not just two unless it's btrfs raid1 not mdadm raid1, but btrfs is / entirely/ different in that regard). Either they're all wasted but one, or none are wasted, depending on how you look at it. Meanwhile, do look into installing grub on each drive, so you can boot from any of them. I definitely know it's possible as that's what I've been doing, tested, for quite some time. > Oh, as far as raid on partitions goes - I do use this for a different > purpose. If you have a collection of drives of different sizes it can > reduce space waste. Suppose you have 3 500GB drives and 2 1TB drives. > If you put them all directly in a raid5 you get 2TB of space. If you > chop the 1TB drives into 2 500GB partitions then you can get two raid5s > - one 2TB in space, and the other 500GB in space. That is 500GB more > data for the same space. Oh, and I realize I wrote raid5. With mdadm > you can set up a 2-drive raid5. It is functionally equivalent to a > raid1 I think, You better check. Unless I'm misinformed, which I could be as I've not looked at this in awhile and both mdadm and the kernel have changed quite a bit since then, that'll be setup as a degraded raid5, which means if you lose one... But I do know raid10 can be setup like that, on fewer drives than it'd normally take, with the mirrors in "far" mode I believe, and it just arranges the stripes as it needs to. It's quite possible that they fixed it so raid5 works similarly and can do the same thing now, in which case that degraded thing I knew about is obsolete. But unless you know for sure, please do check. > and I believe you can convert between them, but since I generally intend > to expand arrays I prefer to just set them up as raid5 from the start. > Since I stick lvm on top I don't care if the space is chopped up. There's a lot of raid conversion ability in modern mdadm. I think most levels can be converted between, given sufficient devices. Again, a lot has changed in that regard since I set my originals up, I'd guess somewhere around 2008. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 46+ messages in thread
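That per-drive grub install is a short loop; a sketch, assuming the grub2-install naming used earlier in the thread and that each GPT drive already carries its small BIOS boot partition:

  for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde; do
      grub2-install "$d"
  done

After that each drive has its own boot sector and core image, so any surviving raid1 member can bring the machine up by itself.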
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 10:28 ` Rich Freeman 2013-06-21 14:23 ` Bob Sanders 2013-06-21 14:27 ` Duncan @ 2013-06-22 23:04 ` Mark Knecht 2013-06-22 23:17 ` Matthew Marlowe ` (2 more replies) 2 siblings, 3 replies; 46+ messages in thread From: Mark Knecht @ 2013-06-22 23:04 UTC (permalink / raw To: Gentoo AMD64

On Fri, Jun 21, 2013 at 3:28 AM, Rich Freeman <rich0@gentoo.org> wrote:
> On Fri, Jun 21, 2013 at 3:31 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>> So with 4k block sizes on a 5-device raid6, you'd have 20k stripes, 12k
>> in data across three devices, and 8k of parity across the other two
>> devices.
>
> With mdadm on a 5-device raid6 with 512K chunks you have 1.5M in a
> stripe, not 20k. If you modify one block it needs to read all 1.5M,
> or it needs to read at least the old chunk on the single drive to be
> modified and both old parity chunks (which on such a small array is 3
> disks either way).
>

Hi Rich,
I've been rereading everyone's posts as well as trying to collect my own thoughts. One question I have at this point, being that you and I seem to be the two non-RAID1 users (but not necessarily devotees) at this time, is what chunk size, stride & stripe width you are using. Are you currently using 512K chunks on your RAID5? If so that's potentially quite different than my 16K chunk RAID6. The more I read through this thread and other things on the web, the more I am concerned that 16K chunks may have forced far more IO operations than really make sense for performance. Unfortunately there's no easy way for me to really test this right now as the RAID6 uses the whole drive. However, for every 512K I want to get off the drive you might need 1 chunk whereas I'm going to need what, 32 chunks? That's got to be a lot more IO operations on my machine, isn't it?

For clarity, I'm at a 16K chunk, with a stride of 4 blocks (16K) and a stripe width of 12 blocks (48K):

c2RAID6 ~ # tune2fs -l /dev/md3 | grep RAID
Filesystem volume name:   RAID6root
RAID stride:              4
RAID stripe width:        12
c2RAID6 ~ #

c2RAID6 ~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md3 : active raid6 sdb3[9] sdf3[5] sde3[6] sdd3[7] sdc3[8]
      1452264480 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

unused devices: <none>
c2RAID6 ~ #

As I understand one of your earlier responses, I think you are using 4K sector drives, which again has that extra level of complexity in terms of creating the partitions initially, but after that should be fairly straightforward to use. (I think.) That said, there are trade-offs between RAID5 & RAID6, but have you measured speeds using anything like the dd method I posted yesterday, or any other way that we could compare?

As I think Duncan asked about storage usage requirements in another part of this thread, I'll just document it here. The machine serves 3 main purposes for me:

1) It's my day in, day out desktop. I run almost totally Gentoo 64-bit stable unless I need to keyword a package to get what I need. Over time I tend to let my keyworded packages go stable if they are working for me. The overall storage requirements for this, including my home directory, typically don't run over 50GB.

2) The machine runs 3 Windows VMs every day - 2 Win 7 & 1 Win XP. Total storage for the basic VMs is about 150GB. XP is just for things like NetFlix. These 3 VMs typically have 9 cores allocated to them (6+2+1), leaving 3 for Gentoo to run the hardware, etc.
The 6 core VM is often using 80-100% of its CPUs sustained for long stretches (hours to days). It's doing a lot of stock market math...

3) More recently, and really the reason to consolidate into a single RAID of any type, I have about 900GB of mp4s which has been on an external USB drive, and backed up to a second USB drive. However this is mostly storage. We watch most of this video on the TV using the second copy drive hooked directly to the TV or copied onto Kindles. I've been having to keep multiple backups of this outside the machine (poor man's RAID1 - two separate USB drives hooked up one at a time!) ;-) I'd rather just keep it safe on the RAID6. That said, I've not yet put it on the RAID6 as I have these performance issues I'd like to solve first. (If possible. Duncan is making me worry that they cannot be solved...)

Lastly, even if I completely buy into Duncan's well-formed reasons about why RAID1 might be faster, using 500GB drives I see no single RAID solution for me other than RAID5/6. The real RAID1/RAID6 comparison from a storage standpoint would be a (conceptual) 3-drive RAID6 vs a 3-drive RAID1. Both create 500GB of storage and can (conceptually) lose 2 drives and still recover data. However, adding another drive to the RAID1 gains you more speed but no storage (buying into Duncan's points), vs adding storage to the RAID6 and probably reducing speed. As I need storage, what other choices do I have?

Answering myself: take the 5 drives, create two RAIDs - a 500GB 2-drive RAID1 for the system + VMs, and then a 3-drive RAID5 for video data maybe? I don't know...

Or buy more hardware and do a 2-drive SSD RAID1 for the system, or a hardware RAID controller, etc. The options explode if I start buying more hardware.

Also, THANKS TO EVERYONE for the continued conversation.

Cheers,
Mark ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-22 23:04 ` Mark Knecht @ 2013-06-22 23:17 ` Matthew Marlowe 2013-06-23 11:43 ` Rich Freeman 2013-06-28 0:51 ` Duncan 2 siblings, 0 replies; 46+ messages in thread From: Matthew Marlowe @ 2013-06-22 23:17 UTC (permalink / raw To: gentoo-amd64 [-- Attachment #1: Type: text/plain, Size: 5732 bytes --] I would recommend that anyone concerned about mdadm software raid performance on gentoo test via tools like bonnie++ before putting any data on the drives and separate from data into different sets/volumes. I did testing two years ago watching read, write burst and sustained rates, file ops per second, etc.... Ended up getting 7 2tb enterprise data drives Disk 1 is os, no raid Disk 2-5 are data, raid 10 Disk 6-7 are backups and to test/scratch space, raid 0 On Jun 22, 2013 4:04 PM, "Mark Knecht" <markknecht@gmail.com> wrote: > On Fri, Jun 21, 2013 at 3:28 AM, Rich Freeman <rich0@gentoo.org> wrote: > > On Fri, Jun 21, 2013 at 3:31 AM, Duncan <1i5t5.duncan@cox.net> wrote: > >> So with 4k block sizes on a 5-device raid6, you'd have 20k stripes, 12k > >> in data across three devices, and 8k of parity across the other two > >> devices. > > > > With mdadm on a 5-device raid6 with 512K chunks you have 1.5M in a > > stripe, not 20k. If you modify one block it needs to read all 1.5M, > > or it needs to read at least the old chunk on the single drive to be > > modified and both old parity chunks (which on such a small array is 3 > > disks either way). > > > > Hi Rich, > I've been rereading everyone's posts as well as trying to collect > my own thoughts. One question I have at this point, being that you and > I seem to be the two non-RAID1 users (but not necessarily devotees) at > this time, is what chunk size, stride & stripe width with you are > using? Are you currently using 512K chunks on your RAID5? If so that's > potentially quite different than my 16K chunk RAID6. The more I read > through this thread and other things on the web the more I am > concerned that 16K chunks has possibly forced far more IO operations > that really makes sense for performance. Unfortunately there's no easy > way to me to really test this right now as the RAID6 uses the whole > drive. However for every 512K I want to get off the drive you might > need 1 chuck whereas I'm going to need what, 32 chunks? That's got to > be a lot more IO operations on my machine isn't it? > > For clarity, I'm a 16K chunk, stride of 4K, stripe of 12K: > > c2RAID6 ~ # tune2fs -l /dev/md3 | grep RAID > Filesystem volume name: RAID6root > RAID stride: 4 > RAID stripe width: 12 > c2RAID6 ~ # > > c2RAID6 ~ # cat /proc/mdstat > Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] > md3 : active raid6 sdb3[9] sdf3[5] sde3[6] sdd3[7] sdc3[8] > 1452264480 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] > [UUUUU] > > unused devices: <none> > c2RAID6 ~ # > > As I understand one of your earlier responses I think you are using > 4K sector drives, which again has that extra level of complexity in > terms of creating the partitions initially, but after that should be > fairly straight forward to use. (I think) That said there are > trade-offs between RAID5 & RAID6 but have you measured speeds using > anything like the dd method I posted yesterday, or any other way that > we could compare? > > As I think Duncan asked about storage usage requirements in another > part of this thread I'll just document it here. 
The machine serves > main 3 purposes for me: > > 1) It's my day in, day out desktop. I run almostly totally Gentoo > 64-bit stable unless I need to keyword a package to get what I need. > Over time I tend to let my keyworded packages go stable if they are > working for me. The overall storage requirements for this, including > my home directory, typically don't run over 50GB. > > 2) The machine runs 3 Windows VMs every day - 2 Win 7 & 1 Win XP. > Total storage for the basic VMs is about 150GB. XP is just for things > like NetFlix. These 3 VMs typically have allocated 9 cores allocated > to them (6+2+1) leaving 3 for Gentoo to run the hardware, etc. The 6 > core VM is often using 80-100% of its CPUs sustained for times. (hours > to days.) It's doing a lot of stock market math... > > 3) More recently, and really the reason to consolidate into a single > RAID of any type, I have about 900GB of mp4s which has been on an > external USB drive, and backed up to a second USB drive. However this > is mostly storage. We watch most of this video on the TV using the > second copy drive hooked directly to the TV or copied onto Kindles. > I've been having to keep multiple backups of this outside the machine > (poor man's RAID1 - two separate USB drives hooked up one at a time!) > ;-) I'd rather just keep it safe on the RAID 6, That said, I've not > yet put it on the RAID6 as I have these performance issues I'd like to > solve first. (If possible. Duncan is making me worry that they cannot > be solved...) > > Lastly, even if I completely buy into Duncan's well formed reasons > about why RAID1 might be faster, using 500GB drives I see no single > RAID solution for me other than RAID5/6. The real RAID1/RAID6 > comparison from a storage standpoint would be a (conceptual) 3-drive > RAID6 vs 3 drive RAID1. Both create 500GB of storage and can > (conceptually) lose 2 drives and still recover data. However adding > another drive to the RAID1 gains you more speed but no storage (buying > into Duncan's points) vs adding storage to the RAID6 and probably > reducing speed. As I need storage what other choices do I have? > > Answering myself, take the 5 drives, create two RAIDS - a 500GB > 2-drive RAID1 for the system + VMs, and then a 3-drive RAID5 for video > data maybe? I don't know... > > Or buy more hardware and do a 2 drive SSD RAID1 for the system, or > a hardware RAID controller, etc. The options explode if I start buying > more hardware. > > Also, THANKS TO EVERYONE for the continued conversation. > > Cheers, > Mark > > [-- Attachment #2: Type: text/html, Size: 6574 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-22 23:04 ` Mark Knecht 2013-06-22 23:17 ` Matthew Marlowe @ 2013-06-23 11:43 ` Rich Freeman 2013-06-23 15:23 ` Mark Knecht 2013-06-28 0:51 ` Duncan 2 siblings, 1 reply; 46+ messages in thread From: Rich Freeman @ 2013-06-23 11:43 UTC (permalink / raw To: gentoo-amd64 On Sat, Jun 22, 2013 at 7:04 PM, Mark Knecht <markknecht@gmail.com> wrote: > I've been rereading everyone's posts as well as trying to collect > my own thoughts. One question I have at this point, being that you and > I seem to be the two non-RAID1 users (but not necessarily devotees) at > this time, is what chunk size, stride & stripe width with you are > using? I'm using 512K chunks on the two RAID5s which are my LVM PVs: md7 : active raid5 sdc3[0] sdd3[6] sde3[7] sda4[2] sdb4[5] 971765760 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU] bitmap: 1/2 pages [4KB], 65536KB chunk md6 : active raid5 sda3[0] sdd2[4] sdb3[3] sde2[5] 2197687296 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU] bitmap: 2/6 pages [8KB], 65536KB chunk On top of this I have a few LVs with ext4 filesystems: tune2fs -l /dev/vg1/root | grep RAID RAID stride: 128 RAID stripe width: 384 (this is root, bin, sbin, lib) tune2fs -l /dev/vg1/data | grep RAID RAID stride: 19204 (this is just about everything else) tune2fs -l /dev/vg1/video | grep RAID RAID stride: 11047 (this is mythtv video) Those were all the defaults picked, and with the exception of root I believe the array was quite different when the others were created. I'm pretty confident that none of these are optimizes, and I'd be shocked if any of them are aligned unless this is automated (including across pvmoves, reshaping, and such). That is part of why I'd like to move to btrfs - optimizing raid with mdadm+lvm+mkfs.ext4 involves a lot of micromanagement as far as I'm aware. Docs are very spotty at best, and it isn't at all clear that things get adjusted as needed when you actually take advantage of things like pvmove or reshaping arrays. I suspect that having btrfs on bare metal will be more likely to result in something that keeps itself in-tune. Rich ^ permalink raw reply [flat|nested] 46+ messages in thread
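The usual rule of thumb for ext4 on md raid (generic guidance, not something verified against these particular arrays) is stride = chunk size / filesystem block size, and stripe-width = stride x number of data disks. For a 512K-chunk, 5-device raid5 (4 data disks) with 4K blocks that works out to stride 128 and stripe-width 512, which could be set explicitly at mkfs time, for example:

  mkfs.ext4 -b 4096 -E stride=128,stripe-width=512 /dev/vg1/somelv

(the logical volume name is just a placeholder; tune2fs can adjust the same fields on an existing filesystem). Values like 19204 and 11047 don't fit that formula for any of the layouts mentioned, which is consistent with the suspicion above that they weren't deliberately tuned.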
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-23 11:43 ` Rich Freeman @ 2013-06-23 15:23 ` Mark Knecht 0 siblings, 0 replies; 46+ messages in thread From: Mark Knecht @ 2013-06-23 15:23 UTC (permalink / raw To: Gentoo AMD64 On Sun, Jun 23, 2013 at 4:43 AM, Rich Freeman <rich0@gentoo.org> wrote: > On Sat, Jun 22, 2013 at 7:04 PM, Mark Knecht <markknecht@gmail.com> wrote: >> I've been rereading everyone's posts as well as trying to collect >> my own thoughts. One question I have at this point, being that you and >> I seem to be the two non-RAID1 users (but not necessarily devotees) at >> this time, is what chunk size, stride & stripe width with you are >> using? > > I'm using 512K chunks on the two RAID5s which are my LVM PVs: > md7 : active raid5 sdc3[0] sdd3[6] sde3[7] sda4[2] sdb4[5] > 971765760 blocks super 1.2 level 5, 512k chunk, algorithm 2 [5/5] [UUUUU] > bitmap: 1/2 pages [4KB], 65536KB chunk > > md6 : active raid5 sda3[0] sdd2[4] sdb3[3] sde2[5] > 2197687296 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU] > bitmap: 2/6 pages [8KB], 65536KB chunk > > On top of this I have a few LVs with ext4 filesystems: > tune2fs -l /dev/vg1/root | grep RAID > RAID stride: 128 > RAID stripe width: 384 > (this is root, bin, sbin, lib) > > tune2fs -l /dev/vg1/data | grep RAID > RAID stride: 19204 > (this is just about everything else) > > tune2fs -l /dev/vg1/video | grep RAID > RAID stride: 11047 > (this is mythtv video) > > Those were all the defaults picked, and with the exception of root I > believe the array was quite different when the others were created. > I'm pretty confident that none of these are optimizes, and I'd be > shocked if any of them are aligned unless this is automated (including > across pvmoves, reshaping, and such). > > That is part of why I'd like to move to btrfs - optimizing raid with > mdadm+lvm+mkfs.ext4 involves a lot of micromanagement as far as I'm > aware. Docs are very spotty at best, and it isn't at all clear that > things get adjusted as needed when you actually take advantage of > things like pvmove or reshaping arrays. I suspect that having btrfs > on bare metal will be more likely to result in something that keeps > itself in-tune. > > Rich > Thanks Rich. I'm finding that helpful. I completely agree on the micromanagement comment. At one level or another that's sort of what this whole thread is about! On your root partition I sort of wonder about the stripe width. Assuming I did it right (5, 5, 512, 4) his little page calculates 128 for the stride and 512 stripe width. (4 data disks * 128 I think) Just a piece of info. http://busybox.net/~aldot/mkfs_stride.html Returning to the title of the thread, asking about partition location essentially, I woke up this morning and had sort of decided to just try changing the chunk size to something large like your 512K. 
It seems I'm out of luck, as my partition size is not (apparently) divisible by 512K:

c2RAID6 ~ # mdadm --grow /dev/md3 --chunk=512 --backup-file=/backups/ChunkSizeBackup
mdadm: component size 484088160K is not a multiple of chunksize 512K
c2RAID6 ~ # mdadm --grow /dev/md3 --chunk=256 --backup-file=/backups/ChunkSizeBackup
mdadm: component size 484088160K is not a multiple of chunksize 256K
c2RAID6 ~ # mdadm --grow /dev/md3 --chunk=128 --backup-file=/backups/ChunkSizeBackup
mdadm: component size 484088160K is not a multiple of chunksize 128K
c2RAID6 ~ # mdadm --grow /dev/md3 --chunk=64 --backup-file=/backups/ChunkSizeBackup
mdadm: component size 484088160K is not a multiple of chunksize 64K
c2RAID6 ~ #

c2RAID6 ~ # cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md3 : active raid6 sdb3[9] sdf3[5] sde3[6] sdd3[7] sdc3[8]
      1452264480 blocks super 1.2 level 6, 16k chunk, algorithm 2 [5/5] [UUUUU]

unused devices: <none>

c2RAID6 ~ # fdisk -l /dev/sdb

Disk /dev/sdb: 500.1 GB, 500107862016 bytes, 976773168 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x8b45be24

   Device Boot      Start        End     Blocks  Id  System
/dev/sdb1   *          63     112454      56196  83  Linux
/dev/sdb2          112455    8514449   4200997+  82  Linux swap / Solaris
/dev/sdb3         8594775  976773167 484089196+  fd  Linux raid autodetect
c2RAID6 ~ #

I suspect I might be much better off if all the partition sizes were divisible by 2048 and started on a 2048-sector multiple, like the newer fdisk tools enforce. I am thinking I won't make much headway unless I completely rebuild the system from bare metal up. If I'm going to do that then I need to get a good copy of the whole RAID onto some other drive, which is a big scary job, and then start over with an install disk I guess. Not sure I'm up for that just yet on a Sunday morning...

Take care,
Mark ^ permalink raw reply [flat|nested] 46+ messages in thread
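A quick arithmetic check of which chunk sizes the existing component size could accept at all (using the 484088160K figure mdadm reports):

  for k in 512 256 128 64 32 16; do
      echo "${k}K -> remainder $(( 484088160 % k ))"
  done

Only 16K (the current chunk) and 32K come back with remainder 0, which matches the errors above. In principle the component size could first be shrunk to a 512K multiple with mdadm --grow --size=... (only after shrinking the filesystem, and with all the risk that implies), but that is exactly the kind of surgery that makes the rebuild-from-scratch option look more attractive.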
* [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-22 23:04 ` Mark Knecht 2013-06-22 23:17 ` Matthew Marlowe 2013-06-23 11:43 ` Rich Freeman @ 2013-06-28 0:51 ` Duncan 2013-06-28 3:18 ` Matthew Marlowe 2 siblings, 1 reply; 46+ messages in thread From: Duncan @ 2013-06-28 0:51 UTC (permalink / raw To: gentoo-amd64 Mark Knecht posted on Sat, 22 Jun 2013 16:04:06 -0700 as excerpted: > Lastly, even if I completely buy into Duncan's well formed reasons about > why RAID1 might be faster, using 500GB drives I see no single RAID > solution for me other than RAID5/6. The real RAID1/RAID6 comparison from > a storage standpoint would be a (conceptual) 3-drive RAID6 vs 3 drive > RAID1. Both create 500GB of storage and can (conceptually) lose 2 drives > and still recover data. However adding another drive to the RAID1 gains > you more speed but no storage (buying into Duncan's points) vs adding > storage to the RAID6 and probably reducing speed. As I need storage what > other choices do I have? > > Answering myself, take the 5 drives, create two RAIDS - a 500GB > 2-drive RAID1 for the system + VMs, and then a 3-drive RAID5 for video > data maybe? I don't know... > > Or buy more hardware and do a 2 drive SSD RAID1 for the system, or > a hardware RAID controller, etc. The options explode if I start buying > more hardware. Finally getting back to this on what's my "weekend"... Unfortunately, given 900 gigs media data and 150 gigs of VMs, with 5 500 gig drives to work with, you're right, simply making a raid1 out of everything isn't possible. You could do a 4-drive raid10, two-way striped and two-way mirrored, for a TB of storage for the media files and possibly squeeze the VMs between the SSD and the raid, with the 5th half-TB as a backup, but it'd be quite tight and non-optimal, plus losing the wrong two drives on the raid10 would put it out of commission so you'd have only one-drive-loss- tolerance there. You could buy a sixth half-TB and try either three-way-striping and two- way mirroring for the same one-drive-loss tolerance but a good 1.5 TB (3- way half-TB stripe) space, giving you plenty of space and thruput speed but at the cost of only single-drive-loss-tolerance. You could use the same six in a raid10 with the reverse configuration, two-way-stripe three-way-mirror, for better loss-of-two-tolerance but at only a TB of space and have the same squeeze as the 4-way raid10 (but now without the extra drive for backup), or... Personally, I'd probably be intensely motivated enough to try the 2-way- stripe 3-way-mirror 6-drive raid10, squeezing the media space as necessary to do it (maybe by using external drives for what wouldn't fit), but that's still a compromise... and includes buying that sixth drive. So the raid6 might well be the best alternative you have, given the data size AND physical device size constraints. But some time testing the performance of different configs and familiarizing yourself with the options and operation, as you've decided to do now, certainly won't hurt. I DID say I wasn't real strong on the chunk options, etc, myself, and you're using ext4, not the reiserfs I was using, and I believe ext4 has at least some potential performance upside compared to reiserfs, so it's quite possible that with some chunk/stride/ etc tweaking, you can get something better, performance-wise. Tho I expect raid6 will never be a speed demon, and may well never perform as you had originally expected/hoped. 
But better than the initial results should be possible, hopefully, and familiarizing yourself with things while experimenting has benefits of its own, so that's an idea I can agree with 100%. =:^) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-28 0:51 ` Duncan @ 2013-06-28 3:18 ` Matthew Marlowe 0 siblings, 0 replies; 46+ messages in thread From: Matthew Marlowe @ 2013-06-28 3:18 UTC (permalink / raw To: gentoo-amd64 I supported about 250 gentoo vm's using about 30 SAS 15K rpm 144GB drives awhile back. Drives were split into 14 disk RAID10 sets. Then each RAID10 set was split it into 200-500GB virtual drives, and the virtual machines were grouped into sets of 3-5 and matched with a virtual drive. Virtual machines on the same virtual drive were setup to use thin provisioning, so that only used up as much storage space as their data differed from the canonical gentoo os image which was usually less than 20%. The virtual drives were usually only 30-50% full and we could virtually provision 2TB+ of virtual machines on a single 500GB virtual drive. Don't underestimate what you can do with small drives, especially if they are fast and you have a lot of them.... On Thu, Jun 27, 2013 at 5:51 PM, Duncan <1i5t5.duncan@cox.net> wrote: > Mark Knecht posted on Sat, 22 Jun 2013 16:04:06 -0700 as excerpted: > >> Lastly, even if I completely buy into Duncan's well formed reasons about >> why RAID1 might be faster, using 500GB drives I see no single RAID >> solution for me other than RAID5/6. The real RAID1/RAID6 comparison from >> a storage standpoint would be a (conceptual) 3-drive RAID6 vs 3 drive >> RAID1. Both create 500GB of storage and can (conceptually) lose 2 drives >> and still recover data. However adding another drive to the RAID1 gains >> you more speed but no storage (buying into Duncan's points) vs adding >> storage to the RAID6 and probably reducing speed. As I need storage what >> other choices do I have? >> >> Answering myself, take the 5 drives, create two RAIDS - a 500GB >> 2-drive RAID1 for the system + VMs, and then a 3-drive RAID5 for video >> data maybe? I don't know... >> >> Or buy more hardware and do a 2 drive SSD RAID1 for the system, or >> a hardware RAID controller, etc. The options explode if I start buying >> more hardware. > > Finally getting back to this on what's my "weekend"... > > Unfortunately, given 900 gigs media data and 150 gigs of VMs, with 5 500 > gig drives to work with, you're right, simply making a raid1 out of > everything isn't possible. > > You could do a 4-drive raid10, two-way striped and two-way mirrored, for > a TB of storage for the media files and possibly squeeze the VMs between > the SSD and the raid, with the 5th half-TB as a backup, but it'd be quite > tight and non-optimal, plus losing the wrong two drives on the raid10 > would put it out of commission so you'd have only one-drive-loss- > tolerance there. > > You could buy a sixth half-TB and try either three-way-striping and two- > way mirroring for the same one-drive-loss tolerance but a good 1.5 TB (3- > way half-TB stripe) space, giving you plenty of space and thruput speed > but at the cost of only single-drive-loss-tolerance. > > You could use the same six in a raid10 with the reverse configuration, > two-way-stripe three-way-mirror, for better loss-of-two-tolerance but at > only a TB of space and have the same squeeze as the 4-way raid10 (but now > without the extra drive for backup), or... 
> > Personally, I'd probably be intensely motivated enough to try the 2-way- > stripe 3-way-mirror 6-drive raid10, squeezing the media space as > necessary to do it (maybe by using external drives for what wouldn't > fit), but that's still a compromise... and includes buying that sixth > drive. > > So the raid6 might well be the best alternative you have, given the data > size AND physical device size constraints. > > But some time testing the performance of different configs and > familiarizing yourself with the options and operation, as you've decided > to do now, certainly won't hurt. I DID say I wasn't real strong on the > chunk options, etc, myself, and you're using ext4, not the reiserfs I was > using, and I believe ext4 has at least some potential performance upside > compared to reiserfs, so it's quite possible that with some chunk/stride/ > etc tweaking, you can get something better, performance-wise. Tho I > expect raid6 will never be a speed demon, and may well never perform as > you had originally expected/hoped. But better than the initial results > should be possible, hopefully, and familiarizing yourself with things > while experimenting has benefits of its own, so that's an idea I can > agree with 100%. =:^) > > -- > Duncan - List replies preferred. No HTML msgs. > "Every nonfree program has a lord, a master -- > and if you use the program, he is your master." Richard Stallman > > ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 7:31 ` [gentoo-amd64] " Duncan 2013-06-21 10:28 ` Rich Freeman @ 2013-06-21 17:40 ` Mark Knecht 2013-06-21 17:56 ` Bob Sanders ` (2 more replies) 2013-06-30 1:04 ` Rich Freeman 2 siblings, 3 replies; 46+ messages in thread From: Mark Knecht @ 2013-06-21 17:40 UTC (permalink / raw To: Gentoo AMD64 On Fri, Jun 21, 2013 at 12:31 AM, Duncan <1i5t5.duncan@cox.net> wrote: > Mark Knecht posted on Thu, 20 Jun 2013 12:10:04 -0700 as excerpted: > >> Does anyone know of info on how the starting sector number might >> impact RAID performance under Gentoo? The drives are WD-500G RE3 drives >> shown here: >> >> http://www.amazon.com/Western-Digital-WD5002ABYS-3-5-inch-Enterprise/dp/ > B001EMZPD0/ref=cm_cr_pr_product_top >> >> These are NOT 4k sector sized drives. >> >> Specifically I'm a 5-drive RAID6 for about 1.45TB of storage. My >> benchmarking seems abysmal at around 40MB/S using dd copying large >> files. >> It's higher, around 80MB/S if the file being transferred is coming from >> an SSD, but even 80MB/S seems slow to me. I see a LOT of wait time in >> top. >> And my 'large file' copies might not be large enough as the machine has >> 24GB of DRAM and I've only been copying 21GB so it's possible some of >> that is cached. > > I /suspect/ that the problem isn't striping, tho that can be a factor, > but rather, your choice of raid6. Note that I personally ran md/raid-6 > here for awhile, so I know a bit of what I'm talking about. I didn't > realize the full implications of what I was setting up originally, or I'd > have not chosen raid6 in the first place, but live and learn as they say, > and that I did. > > General rule, raid6 is abysmal for writing and gets dramatically worse as > fragmentation sets in, tho reading is reasonable. The reason is that in > ordered to properly parity-check and write out less-than-full-stripe > writes, the system must effectively read-in the existing data and merge > it with the new data, then recalculate the parity, before writing the new > data AND 100% of the (two-way in raid-6) parity. Further, because raid > sits below the filesystem level, it knows nothing about what parts of the > filesystem are actually used, and must read and write the FULL data > stripe (perhaps minus the new data bit, I'm not sure), including parts > that will be empty on a freshly formatted filesystem. > > So with 4k block sizes on a 5-device raid6, you'd have 20k stripes, 12k > in data across three devices, and 8k of parity across the other two > devices. Now you go to write a 1k file, but in ordered to do so the full > 12k of existing data must be read in, even on an empty filesystem, > because the RAID doesn't know it's empty! Then the new data must be > merged in and new checksums created, then the full 20k must be written > back out, certainly the 8k of parity, but also likely the full 12k of > data even if most of it is simply rewrite, but almost certainly at least > the 4k strip on the device the new data is written to. > <SNIP> Hi Duncan, Wonderful post but much too long to carry on a conversation in-line. As you sound pretty sure of your understanding/history I'll assume you're right 100% of the time, but only maybe 80% of the post feels right to me at this time so let's assume I have much to learn and go from there. 
I expect that others here are in a similar situation to me - they use RAID but are laboring with little hard data on what different portions of the system are doing and how to optimize it. I certainly feel that's true in my case. I hope this thread, over the near or long term, might help a bit for me and potentially others.

In thinking about this issue this morning, I think it's important for me to get down to basics and verify as much as possible, step by step, so that I don't layer good work on top of bad assumptions. To that end, and before I move too much farther forward, let me document a few things about my system and the hardware available to work with, and see if you, Rich, Bob, Volker or anyone else wants to chime in about what is correct, not correct, or a better way to use it.

Basic Machine - ASUS Rampage II Extreme motherboard (4/1/2010) + 24GB DDR3 + Core i7-980x Extreme 12 core processor
1 SSD - 120GB SATA3 on its own controller
5+ HDD - WD5002ABYS RAID Edition 3 SATA3 drives using Intel integrated controllers
(NOTE: I can possibly go to a 6-drive RAID if I made some changes in the box but that's for later)

According to the WD spec (http://www.wdc.com/en/library/spec/2879-701281.pdf) the 500GB drives sustain 113MB/S to the drive. Using hdparm I measure 107MB/S or higher for all 5 drives:

c2RAID6 ~ # hdparm -tT /dev/sdb

/dev/sdb:
 Timing cached reads:   17374 MB in  2.00 seconds = 8696.12 MB/sec
 Timing buffered disk reads:  322 MB in  3.00 seconds = 107.20 MB/sec
c2RAID6 ~ #

The SSD on its own PCI Express controller clocks in at about 250MB/S for reads.

c2RAID6 ~ # hdparm -tT /dev/sda

/dev/sda:
 Timing cached reads:   17492 MB in  2.00 seconds = 8754.42 MB/sec
 Timing buffered disk reads:  760 MB in  3.00 seconds = 253.28 MB/sec
c2RAID6 ~ #

TESTING: I'm using dd to test. It gives an easy-to-read result and seems to be used a lot. I can use bonnie++ or IOzone later but I don't think that's necessary quite yet. Being that I have 24GB and don't want cached data to affect the test speeds, I do the following:

1) Using dd I created a 50GB file for copying using the following commands:

cd /mnt/fastVM
dd if=/dev/random of=random1 bs=1000 count=0 seek=$[1000*1000*50]

mark@c2RAID6 /VirtualMachines/bonnie $ ls -alh /mnt/fastVM/ran*
-rw-r--r-- 1 mark mark 47G Jun 21 07:10 /mnt/fastVM/random1
mark@c2RAID6 /VirtualMachines/bonnie $

2) To ensure that nothing is cached and the copies are (hopefully) completely fair, as root I do the following between each test:

sync
free -h
echo 3 > /proc/sys/vm/drop_caches
free -h

An example:

c2RAID6 ~ # sync
c2RAID6 ~ # free -h
             total       used       free     shared    buffers     cached
Mem:           23G        23G       129M         0B       8.5M        21G
-/+ buffers/cache:        1.6G        21G
Swap:          12G         0B        12G
c2RAID6 ~ # echo 3 > /proc/sys/vm/drop_caches
c2RAID6 ~ # free -h
             total       used       free     shared    buffers     cached
Mem:           23G       2.6G        20G         0B       884K       1.3G
-/+ buffers/cache:        1.3G        22G
Swap:          12G         0B        12G
c2RAID6 ~ #

3) As a first test I copy, using dd, the 50GB file from the SSD to the RAID6. As long as reading the SSD is much faster than writing the RAID6, it should be a test of primarily the RAID6 write speed:

mark@c2RAID6 /VirtualMachines/bonnie $ dd if=/mnt/fastVM/random1 of=SDDCopy
97656250+0 records in
97656250+0 records out
50000000000 bytes (50 GB) copied, 339.173 s, 147 MB/s
mark@c2RAID6 /VirtualMachines/bonnie $

If I clear cache as above and rerun the test it's always 145-155MB/S.

4) As a second test I read from the RAID6 and write back to the RAID6.
I see MUCH lower speeds, again repeatable: mark@c2RAID6 /VirtualMachines/bonnie $ dd if=SDDCopy of=HDDWrite 97656250+0 records in 97656250+0 records out 50000000000 bytes (50 GB) copied, 1187.07 s, 42.1 MB/s mark@c2RAID6 /VirtualMachines/bonnie $ 5) As a final test, and just looking for problems if any, I do an SDD to SDD copy which clocked in at close to 200MB/S mark@c2RAID6 /mnt/fastVM $ dd if=random1 of=SDDCopy 97656250+0 records in 97656250+0 records out 50000000000 bytes (50 GB) copied, 251.105 s, 199 MB/s mark@c2RAID6 /mnt/fastVM $ So, being that this RAID6 was grown yesterday from something that has existed for a year or two, I'm not sure of its fragmentation, or even how to determine that at this time. However it seems my problem is RAID6 reads, not RAID6 writes, at least to new and probably never-used disk space. I will also report more later but I can state that just using top there's never much CPU usage doing this but a LOT of WAIT time when reading the RAID6. It really appears the system is spinning its wheels waiting for the RAID to get data from the disk. One place where I wanted to double check your thinking. My thought is that a RAID1 will _NEVER_ outperform the hdparm -tT read speeds as it has to read from three drives and make sure they are all good before returning data to the user. I don't see how that could ever be faster than what a single drive file system could do, which for these drives would be the 113MB/S WD spec number, correct? As I'm currently getting 145MB/S it appears on the surface that the RAID6 is providing some value, at least in these early days of use. Maybe it will degrade over time though. Comments? Cheers, Mark ^ permalink raw reply [flat|nested] 46+ messages in thread
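To make the sync/drop-caches sequence repeatable between runs, the whole thing can be wrapped in a small script; a minimal sketch only, using the commands already shown above, where the destination directory on the RAID6 is an assumption to be adjusted:

#!/bin/bash
# Sketch: time one dd copy with the page cache dropped first (run as root).
SRC=/mnt/fastVM/random1          # the existing 50GB test file on the SSD
DST=/VirtualMachines/bonnie/ddtest   # assumed directory on the RAID6, adjust to taste
sync
echo 3 > /proc/sys/vm/drop_caches
dd if="$SRC" of="$DST" bs=1M conv=fsync   # conv=fsync makes the reported time include the final flush
rm -f "$DST"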
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 17:40 ` Mark Knecht @ 2013-06-21 17:56 ` Bob Sanders 2013-06-21 18:12 ` Mark Knecht 2013-06-21 17:57 ` Rich Freeman 2013-06-22 14:23 ` Duncan 2 siblings, 1 reply; 46+ messages in thread From: Bob Sanders @ 2013-06-21 17:56 UTC (permalink / raw To: gentoo-amd64 Mark Knecht, mused, then expounded: > On Fri, Jun 21, 2013 at 12:31 AM, Duncan <1i5t5.duncan@cox.net> wrote: > > Mark Knecht posted on Thu, 20 Jun 2013 12:10:04 -0700 as excerpted: > > > > Basic Machine - ASUS Rampage II Extreme motherboard (4/1/2010) + 24GB > DDR3 + Core i7-980x Extreme 12 core processor The hit to iop performance is mainly due to the large number of cores in the high-end Intel CPU. I suggest you find a nice 4-core Intel processor, something non-extreme. You'll find all your IO will improve. > 1 SDD - 120GB SATA3 on its own controller > 5+ HDD - WD5002ABYS RAID Edition 3 SATA3 drives using Intel integrated > controllers > > Again, if you're serious about RAID, get an LSI MegaRAID card. While I have my dislikes about the LSI controller, it's a lot faster than using MD and much faster (and more reliable) than any BIOS software RAID. Oh, and don't believe all the published numbers on drives, etc... benchmarking is an art. Bob -- - ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 17:56 ` Bob Sanders @ 2013-06-21 18:12 ` Mark Knecht 0 siblings, 0 replies; 46+ messages in thread From: Mark Knecht @ 2013-06-21 18:12 UTC (permalink / raw To: Gentoo AMD64 On Fri, Jun 21, 2013 at 10:56 AM, Bob Sanders <rsanders@sgi.com> wrote: > Mark Knecht, mused, then expounded: >> On Fri, Jun 21, 2013 at 12:31 AM, Duncan <1i5t5.duncan@cox.net> wrote: >> > Mark Knecht posted on Thu, 20 Jun 2013 12:10:04 -0700 as excerpted: >> > >> >> Basic Machine - ASUS Rampage II Extreme motherboard (4/1/2010) + 24GB >> DDR3 + Core i7-980x Extreme 12 core processor > > The hit to iop performance is mainly due to the large number of cores in > the the high end Intel cpu. I suggest you find a nice 4-core Intel > processor, something non-extreme. you'll find all your IO will improve. > Interesting point but not likely to happen. I run 3 Windows VMs all day, most of which are doing numerical calculations and not a huge amount of IO in the Windows environment itself. In my usage model the 12 cores get a workout nearly every day. > >> 1 SDD - 120GB SATA3 on it's own controller >> 5+ HDD - WD5002ABYS RAID Edition 3 SATA3 drives using Intel integrated >> controllers >> > > Again, if you're serious about RAID, get an LSI MegaRAID card. While I > have my dislikes about the LSI controller, it's a lot faster than using > MD and much faster (and more reliable) than any bios software RAID. > I suppose if I accept your assertion above then an LSI MegaRAID might be a better solution specifically because I _am_ using the 12 core Extreme processor. Will consider, at least in the long run, if this thread & work doesn't yield significantly improved results over the next few weeks. > Oh, and don't believe all the published numbers on drives, > etc...benchmarking is an art. > > Bob Absolutely! :-) Thanks, Mark ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 17:40 ` Mark Knecht 2013-06-21 17:56 ` Bob Sanders @ 2013-06-21 17:57 ` Rich Freeman 2013-06-21 18:10 ` Gary E. Miller 2013-06-21 18:38 ` Mark Knecht 2013-06-22 14:23 ` Duncan 2 siblings, 2 replies; 46+ messages in thread From: Rich Freeman @ 2013-06-21 17:57 UTC (permalink / raw To: gentoo-amd64 On Fri, Jun 21, 2013 at 1:40 PM, Mark Knecht <markknecht@gmail.com> wrote: > One place where I wanted to double check your thinking. My thought > is that a RAID1 will _NEVER_ outperform the hdparm -tT read speeds as > it has to read from three drives and make sure they are all good > before returning data to the user. That isn't correct. In theory it could be done that way, but every raid1 implementation I've heard of makes writes to all drives (obviously), but reads from only a single drive (assuming it is correct). That means that read latency is greatly reduced since they can be split across two drives which effectively means two heads per "platter." Also, raid1 typically does not include checksumming, so if there is a discrepancy between the drives there is no way to know which one is right. With raid5 at least you can always correct discrepancies if you have all the disks (though as Duncan pointed out in practice this only happens if you do an explicit scrub on mdadm). With btrfs every block is checksummed and so as long as there is one good (err, consistent) copy somewhere it will be used. Rich ^ permalink raw reply [flat|nested] 46+ messages in thread
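For anyone wanting to exercise that btrfs checksumming rather than take it on faith, the verification pass is just a scrub; a minimal sketch, assuming a btrfs filesystem mounted at /mnt/data (the mount point is an assumption):

btrfs scrub start /mnt/data    # walks all data and metadata and verifies checksums against the stored copies
btrfs scrub status /mnt/data   # reports progress plus any checksum or read errors found (and corrected, if a good copy existed)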
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 17:57 ` Rich Freeman @ 2013-06-21 18:10 ` Gary E. Miller 2013-06-21 18:38 ` Mark Knecht 1 sibling, 0 replies; 46+ messages in thread From: Gary E. Miller @ 2013-06-21 18:10 UTC (permalink / raw To: gentoo-amd64; +Cc: rich0 [-- Attachment #1: Type: text/plain, Size: 1722 bytes --] Yo Rich! On Fri, 21 Jun 2013 13:57:20 -0400 Rich Freeman <rich0@gentoo.org> wrote: > In theory it could be done that way, but every > raid1 implementation I've heard of makes writes to all drives > (obviously), but reads from only a single drive (assuming it is > correct). That means that read latency is greatly reduced since they > can be split across two drives which effectively means two heads per > "platter." Yes, that is what I see in practice. A much reduced average read time. And if you are really pressed for speed, add more stripes and get even more speed. > Also, raid1 typically does not include checksumming, so if > there is a discrepancy between the drives there is no way to know > which one is right. Uh, not exactly correct. Remember each HDD has ECC for each sector. If there is a read error the HDD will detect the bad ECC and report the error to the RAID1 hardware/software. Then RAID1 is smart enough to try to read from the 2nd drive. > With raid5 at least you can always correct > discrepancies if you have all the disks Not really. If 2 disks fail in an n+1 RAID5 you are out of luck. Not as uncommon an occurrence as one might think. > (though as Duncan pointed out > in practice this only happens if you do an explicit scrub on mdadm). Which you should be doing at least weekly. Otherwise you only find out your disks have failed when you try to do a full copy or backup, and then you likely have multiple failures and you are out of luck. RGDS GARY --------------------------------------------------------------------------- Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97701 gem@rellim.com Tel:+1(541)382-8588 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
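Since the drives' own ECC and sector-reallocation counters are what surface those read errors first, it is worth watching them directly as well; a minimal sketch using smartmontools, where the device name is an example only:

smartctl -A /dev/sdb | grep -Ei 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'
# non-zero raw values here usually mean the drive is already remapping sectors or failing to read some
smartctl -t long /dev/sdb    # kick off the drive's own full surface self-test in the background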
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 17:57 ` Rich Freeman 2013-06-21 18:10 ` Gary E. Miller @ 2013-06-21 18:38 ` Mark Knecht 2013-06-21 18:50 ` Gary E. Miller 2013-06-21 18:53 ` Bob Sanders 1 sibling, 2 replies; 46+ messages in thread From: Mark Knecht @ 2013-06-21 18:38 UTC (permalink / raw To: Gentoo AMD64 On Fri, Jun 21, 2013 at 10:57 AM, Rich Freeman <rich0@gentoo.org> wrote: > On Fri, Jun 21, 2013 at 1:40 PM, Mark Knecht <markknecht@gmail.com> wrote: >> One place where I wanted to double check your thinking. My thought >> is that a RAID1 will _NEVER_ outperform the hdparm -tT read speeds as >> it has to read from three drives and make sure they are all good >> before returning data to the user. > > That isn't correct. In theory it could be done that way, but every > raid1 implementation I've heard of makes writes to all drives > (obviously), but reads from only a single drive (assuming it is > correct). That means that read latency is greatly reduced since they > can be split across two drives which effectively means two heads per > "platter." Also, raid1 typically does not include checksumming, so if > there is a discrepancy between the drives there is no way to know > which one is right. With raid5 at least you can always correct > discrepancies if you have all the disks (though as Duncan pointed out > in practice this only happens if you do an explicit scrub on mdadm). > With btrfs every block is checksummed and so as long as there is one > good (err, consistent) copy somewhere it will be used. > > Rich > Humm... OK, we agree on RAID1 writes. All data must be written to all drives so there's no way to implement any real speed up in that area. If I simplistically assume that write speeds are similar to hdparm -tT read speeds then that's that. On the read side I'm not sure if I'm understanding your point. I agree that a so-designed RAID1 system could/might read smaller portions of a larger read from RAID1 drives in parallel, taking some data from one drive and some from another drive, and then only take action corrective if one of the drives had troubles. However I don't know that mdadm-based RAID1 does anything like that. Does it? It seems to me that unless I at least _request_ all data from all drives and minimally compare at least some error flag from the controller telling me one drive had trouble reading a sector then how do I know if anything bad is happening? Or maybe you're saying it's RAID1 and I don't know if anything bad is happening _unless_ I do a scrub and specifically check all the drives for consistency? Just trying to get clear what you're saying. I do mdadm scrubs at least once a week. I still do them by hand. They have never appeared terribly expensive watching top or iotop but sometimes when I'm watching NetFlix or Hulu in a VM I get more pauses when the scrub is taking place, but it's not huge. I agree that RAID5 gives you an opportunity to get things fixed, but there are folks who lose a disk in a RAID5, start the rebuild, and then lose a second disk during the rebuild. That was my main reason to go to RAID6. Not that I would ever run the array degraded but that I could still tolerate a second loss while the rebuild was happening and hopefully get by. That was similar to my old 3-disk RAID1 where I'd have to lose all 3 disks to be out of business. Thanks, Mark ^ permalink raw reply [flat|nested] 46+ messages in thread
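One way to see for yourself whether md is spreading reads across RAID1 members is to watch per-device throughput while something large is read from the array; a minimal sketch using iostat from sysstat, where the member names and file path are examples only:

iostat -dxm 5 sdb sdc sdd sde sdf    # per-device read MB/s, refreshed every 5 seconds
# in another terminal, read something big from the array:
dd if=/VirtualMachines/bonnie/SDDCopy of=/dev/null bs=1M
# if md is balancing reads, several members show read traffic at once instead of just one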
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 18:38 ` Mark Knecht @ 2013-06-21 18:50 ` Gary E. Miller 2013-06-21 18:57 ` Rich Freeman 2013-06-22 14:34 ` Duncan 2013-06-21 18:53 ` Bob Sanders 1 sibling, 2 replies; 46+ messages in thread From: Gary E. Miller @ 2013-06-21 18:50 UTC (permalink / raw To: gentoo-amd64; +Cc: markknecht [-- Attachment #1: Type: text/plain, Size: 2699 bytes --] Yo Mark! On Fri, 21 Jun 2013 11:38:00 -0700 Mark Knecht <markknecht@gmail.com> wrote: > On the read side I'm not sure if I'm understanding your point. I agree > that a so-designed RAID1 system could/might read smaller portions of a > larger read from RAID1 drives in parallel, taking some data from one > drive and some from another drive, and then only take action > corrective if one of the drives had troubles. However I don't know > that mdadm-based RAID1 does anything like that. Does it? It surely does. I have confirmed that at least monthly since md has existed in the kernel. > It seems to me that unless I at least _request_ all data from all > drives and minimally compare at least some error flag from the > controller telling me one drive had trouble reading a sector then how > do I know if anything bad is happening? Correct. You can't tell if you can read something without trying to read it. Which is why you should do a full raid rebuild every week. > > Or maybe you're saying it's RAID1 and I don't know if anything bad is > happening _unless_ I do a scrub and specifically check all the drives > for consistency? No. A simple read will find the problem. But given it is RAID1 the only way to be sure to read from both drives is a raid rebuild. > I do mdadm scrubs at least once a week. I still do them by hand. They > have never appeared terribly expensive watching top or iotop but > sometimes when I'm watching NetFlix or Hulu in a VM I get more pauses > when the scrub is taking place, but it's not huge. Which is why you should cron them at oh-dark-thirty. > > I agree that RAID5 gives you an opportunity to get things fixed, but > there are folks who lose a disk in a RAID5, start the rebuild, and > then lose a second disk during the rebuild. Because they failed to do weekly rebuilds. > Not that I would ever run the array degraded but that I > could still tolerate a second loss while the rebuild was happening and > hopefully get by. Sadly most people make their RAID5 or RAID6 out of brand new, consecutively serial numbered drives. They then get exactly the same temp, voltage, humidity, seek stress until they all fail within days of each other. I have personally seen 4 of 5 drives in a RAID5 fail within 3 days many times. Usually on a Friday, when the tech decides the drive replacement can wait until Monday. Your only protection against a full RAIDx failure is an offsite backup. RGDS GARY --------------------------------------------------------------------------- Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97701 gem@rellim.com Tel:+1(541)382-8588 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
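For the oh-dark-thirty scheduling, the md "check" action can be driven straight from cron through sysfs; a minimal sketch, where the /dev/md0 array name and the weekly slot are assumptions:

# /etc/cron.weekly/md-check  (sketch, runs as root)
#!/bin/sh
echo check > /sys/block/md0/md/sync_action    # read every member and compare, without rewriting anything
# afterwards, see whether any stripes disagreed:
cat /sys/block/md0/md/mismatch_cnt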
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 18:50 ` Gary E. Miller @ 2013-06-21 18:57 ` Rich Freeman 2013-06-22 14:34 ` Duncan 1 sibling, 0 replies; 46+ messages in thread From: Rich Freeman @ 2013-06-21 18:57 UTC (permalink / raw To: gentoo-amd64 On Fri, Jun 21, 2013 at 2:50 PM, Gary E. Miller <gem@rellim.com> wrote: > On Fri, 21 Jun 2013 11:38:00 -0700 > Mark Knecht <markknecht@gmail.com> wrote: >> Or maybe you're saying it's RAID1 and I don't know if anything bad is >> happening _unless_ I do a scrub and specifically check all the drives >> for consistency? > > No. A simple read will find the problem. But given it is RAID1 the only > way to be sure to read from both dirves is a raid rebuild. Keep in mind that a read will only find the problem if it is visible to the hard drive's ECC. A silent error would not be detected. It could be detected by a rebuild, though it could not be reliably fixed in this way. With raid5 a silent error in a single drive per stripe could be fixed in a rebuild. > > Your only protection against a full RAIDx failure is an offsite backup. ++ That's why I'm not big on crazy levels of redundancy. RAID is first and foremost a restoration avoidance tool, not a backup solution. It reduces the risk of needing restoration, but it does not cover as many failure modes as an offline backup. If btrfs eats your data it really won't matter how many platters it had to chew on in the process. So, by all means use RAID, but if you're going to spend a lot of money on redundant disks, spend it on a backup solution instead (which might very well involve disks, though you should move them offsite). Rich ^ permalink raw reply [flat|nested] 46+ messages in thread
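As a concrete minimal example of the "spend it on a backup instead" point, even a nightly rsync to a machine in another building covers failure modes no amount of RAID redundancy will; the hostname and paths here are placeholders only:

rsync -aHx --delete /home/ backuphost:/backups/$(hostname)/home/
# -a preserves permissions and times, -H keeps hard links, -x stays on one filesystem, --delete mirrors removals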
* [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 18:50 ` Gary E. Miller 2013-06-21 18:57 ` Rich Freeman @ 2013-06-22 14:34 ` Duncan 2013-06-22 22:15 ` Gary E. Miller 1 sibling, 1 reply; 46+ messages in thread From: Duncan @ 2013-06-22 14:34 UTC (permalink / raw To: gentoo-amd64 Gary E. Miller posted on Fri, 21 Jun 2013 11:50:43 -0700 as excerpted: > On Fri, 21 Jun 2013 11:38:00 -0700 Mark Knecht <markknecht@gmail.com> > wrote: > >> On the read side I'm not sure if I'm understanding your point. I agree >> that a so-designed RAID1 system could/might read smaller portions of a >> larger read from RAID1 drives in parallel, taking some data from one >> drive and some from another drive, and then only take action corrective >> if one of the drives had troubles. However I don't know that >> mdadm-based RAID1 does anything like that. Does it? > > It surely does. I have confirmed that at least monthly since md has > existed in the kernel. Out of curiosity, /how/ do you confirm that? I agree based on real usage experience, but with a claim that you're confirming it at least monthly, it sounds like you have a standardized/scripted test, and I'm interested in what/how you do it. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-22 14:34 ` Duncan @ 2013-06-22 22:15 ` Gary E. Miller 2013-06-28 0:20 ` Duncan 0 siblings, 1 reply; 46+ messages in thread From: Gary E. Miller @ 2013-06-22 22:15 UTC (permalink / raw To: gentoo-amd64; +Cc: 1i5t5.duncan [-- Attachment #1: Type: text/plain, Size: 1401 bytes --] Yo Duncan! On Sat, 22 Jun 2013 14:34:36 +0000 (UTC) Duncan <1i5t5.duncan@cox.net> wrote: > >> On the read side I'm not sure if I'm understanding your point. I > >> agree that a so-designed RAID1 system could/might read smaller > >> portions of a larger read from RAID1 drives in parallel, taking > >> some data from one drive and some from another drive, and then > >> only take action corrective if one of the drives had troubles. > >> However I don't know that mdadm-based RAID1 does anything like > >> that. Does it? > > > > It surely does. I have confirmed that at least monthly since md has > > existed in the kernel. > > Out of curiosity, /how/ do you confirm that? I agree based on real > usage experience, but with a claim that you're confirming it at least > monthly, it sounds like you have a standardized/scripted test, and > I'm interested in what/how you do it. I have around 30 RAID1 sets in production right now. Some of them doing mostly reads and some mostly writes. Some are HDD and some SSD. The RAID sets are pushed pretty hard 24x7 and we watch the performance pretty closely to plan updates. I have collectd performance graphs going way back. RGDS GARY --------------------------------------------------------------------------- Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97701 gem@rellim.com Tel:+1(541)382-8588 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
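For the kind of long-running collectd history Gary describes, the disk plugin is enough to get per-device throughput graphs over time; a minimal sketch of the relevant collectd.conf fragment, with option details worth double-checking against collectd.conf(5):

LoadPlugin disk
<Plugin disk>
  Disk "/^sd/"           # collect statistics for all sd* devices
  IgnoreSelected false
</Plugin>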
* [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-22 22:15 ` Gary E. Miller @ 2013-06-28 0:20 ` Duncan 2013-06-28 0:41 ` Gary E. Miller 0 siblings, 1 reply; 46+ messages in thread From: Duncan @ 2013-06-28 0:20 UTC (permalink / raw To: gentoo-amd64 Gary E. Miller posted on Sat, 22 Jun 2013 15:15:16 -0700 as excerpted: >> >> [Does md/raid1 do parallel reads of multiple files at once?] >> > >> > It surely does. I have confirmed that at least monthly since md has >> > existed in the kernel. >> >> Out of curiosity, /how/ do you confirm that? I agree based on real >> usage experience, but with a claim that you're confirming it at least >> monthly, it sounds like you have a standardized/scripted test, and I'm >> interested in what/how you do it. > > I have around 30 RAID1 sets in production right now. Some of them doing > mostly reads and some mostly writes. Some are HDD and some SSD. > The RAID sets are pushed pretty hard 24x7 and we watch the performance > pretty closely to plan updates. I have collectd performance graphs > going way back. So you're basically confirming it with normal usage as well, but have documented performance history going pretty well all the way back. Not the simple test script I was hoping for, but pretty impressive, none-the- less. Thanks. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-28 0:20 ` Duncan @ 2013-06-28 0:41 ` Gary E. Miller 0 siblings, 0 replies; 46+ messages in thread From: Gary E. Miller @ 2013-06-28 0:41 UTC (permalink / raw To: gentoo-amd64; +Cc: 1i5t5.duncan [-- Attachment #1: Type: text/plain, Size: 1233 bytes --] Yo Duncan! On Fri, 28 Jun 2013 00:20:45 +0000 (UTC) Duncan <1i5t5.duncan@cox.net> wrote: > > I have around 30 RAID1 sets in production right now. Some of them > > doing mostly reads and some mostly writes. Some are HDD and some > > SSD. The RAID sets are pushed pretty hard 24x7 and we watch the > > performance pretty closely to plan updates. I have collectd > > performance graphs going way back. > > So you're basically confirming it with normal usage as well, but have > documented performance history going pretty well all the way back. > Not the simple test script I was hoping for, but pretty impressive, > none-the- less. I find that 'hdparm -tT', 'dd' and 'bonnie++' will match up pretty well with what I see in production. Just be sure to use really large test file sizes with bonnie++ and dd. dd also needs a pretty large block size (bs=) and pretty large/fast source of bits when writing. With bonnie++ you can easily see the speed differences between raw disks and various RAID types. RGDS GARY --------------------------------------------------------------------------- Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97701 gem@rellim.com Tel:+1(541)382-8588 [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
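When moving beyond dd, a bonnie++ run sized well past installed RAM keeps the page cache from flattering the numbers; a minimal sketch, where the directory, size and user are assumptions (the size should be at least twice RAM, so roughly 48GB on a 24GB machine):

# run as root; bonnie++ drops privileges to the named user for the test
bonnie++ -d /VirtualMachines/bonnie -s 49152 -n 0 -u mark
# -d test directory, -s file size in MB (49152 MB is about 48GB), -n 0 skips the small-file tests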
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 18:38 ` Mark Knecht 2013-06-21 18:50 ` Gary E. Miller @ 2013-06-21 18:53 ` Bob Sanders 1 sibling, 0 replies; 46+ messages in thread From: Bob Sanders @ 2013-06-21 18:53 UTC (permalink / raw To: gentoo-amd64 Mark Knecht, mused, then expounded: > > > I agree that RAID5 gives you an opportunity to get things fixed, but > there are folks who lose a disk in a RAID5, start the rebuild, and > then lose a second disk during the rebuild. That was my main reason to > go to RAID6. Not that I would ever run the array degraded but that I > could still tolerate a second loss while the rebuild was happening and > hopefully get by. That was similar to my old 3-disk RAID1 where I'd > have to lose all 3 disks to be out of business. > If the drives in the RAID came from the same build lot, the chances of multi-drive failure are fairly high, if one fails. I've had 3 out of four drives, from the same lot build, fail at the same time. I've had others never fail. And a few that fail over time where others from the same lot failed within a month of the first failure. Bob -- - ^ permalink raw reply [flat|nested] 46+ messages in thread
* [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 17:40 ` Mark Knecht 2013-06-21 17:56 ` Bob Sanders 2013-06-21 17:57 ` Rich Freeman @ 2013-06-22 14:23 ` Duncan 2013-06-23 1:02 ` Mark Knecht 2 siblings, 1 reply; 46+ messages in thread From: Duncan @ 2013-06-22 14:23 UTC (permalink / raw To: gentoo-amd64 Mark Knecht posted on Fri, 21 Jun 2013 10:40:48 -0700 as excerpted: > On Fri, Jun 21, 2013 at 12:31 AM, Duncan <1i5t5.duncan@cox.net> wrote: > <SNIP> > > Wonderful post but much too long to carry on a conversation > in-line. FWIW... I'd have a hard time doing much of anything else, these days, no matter the size. Otherwise, I'd be likely to forget a point. But I do try to snip or summarize when possible. And I do understand your choice and agree with it for you. It's just not one I'd find workable for me... which is why I'm back to inline, here. > As you sound pretty sure of your understanding/history I'll > assume you're right 100% of the time, but only maybe 80% of the post > feels right to me at this time so let's assume I have much to learn and > go from there. That's a very nice way of saying "I'll have to verify that before I can fully agree, but we'll go with it for now." I'll have to remember it! =:^) > In thinking about this issue this morning I think it's important to > me to get down to basics and verify as much as possible, step-by-step, > so that I don't layer good work on top of bad assumptions. Extremely reasonable approach. =:^) > Basic Machine - ASUS Rampage II Extreme motherboard (4/1/2010) + 24GB > DDR3 + Core i7-980x Extreme 12 core processor That's a very impressive base. But as you point out elsewhere, you use it. Multiple VMs running MS should well use use both the dozen cores and the 24 gig RAM. As an aside, it's interesting how well your dozen cores, 24 gig RAM, fits my basic two gigs a core rule of thumb. Obviously I'd consider that reasonably well balanced RAM/cores-wise. > 1 SDD - 120GB SATA3 on it's own controller > 5+ HDD - WD5002ABYS RAID Edition 3 SATA3 drives > using Intel integrated controllers > > (NOTE: I can possibly go to a 6-drive RAID if I made some changes in the > box but that's for later) > > According to the WD spec > (http://www.wdc.com/en/library/spec/2879-701281.pdf) the 500GB drives OK, single 120 gig main drive (SSD), 5 half-TB drives for the raid. > [...] sustain 113MB/S to the drive. Using hdparm I measure 107MB/S > or higher for all 5 drives [...] > The SDD on it's own PCI Express controller clocks in at about 250MB/S > for reads. OK. But there's a caveat on the measured "spinning rust" speeds. You're effectively getting "near best case". I suppose you're familiar with absolute velocity vs rotational velocity vs distance from center. Think merry-go-round as a kid or crack-the-whip as a teen (or insert your own experience here). The closer to the center you are the slower you go at the same rotational speed (RPM). Conversely, the farther from the center you are, the faster you're actually moving at the same RPM. Rotational disk data I/O rates have a similar effect -- data toward the outside edge of the platter (beginning of the disk) is faster to read/ write, while data toward the inside edge (center) is slower. Based on my own hddparm tests on partitioned drives where I knew the location of the partition, vs. the results for the drive as a whole, the speed reported for rotational drives as a whole, is the speed near the outside edge (beginning of the disk). 
Thus, it'd be rather interesting to partition up one of those drives with a small partition at the beginning and another at the end, and do an hdparm -t of each, as well as of the whole disk. I bet you'd find the one at the end reports rather lower numbers, while the report for the drive as a whole is similar to that of the partition near the beginning of the drive, much faster. A good SSD won't have this same sort of variance, since it's SSD and the latency to any of its flash, at least as presented by the firmware which should deal with any variance as it distributes wear, should be similar. (Cheap SSDs and standard USB thumbdrive flash storage works differently, however. Often they assume FAT and have a small amount of fast and resilient but expensive SLC flash at the beginning, where the FAT would be, with the rest of the device much slower and less resilient to rewrite but far cheaper MLC. I was just reading about this recently as I researched my own SSDs.) > TESTING: I'm using dd to test. It gives an easy to read anyway result > and seems to be used a lot. I can use bonnie++ or IOzone later but I > don't think that's necessary quite yet. Agreed. > Being that I have 24GB and don't > want cached data to effect the test speeds I do the following: > > 1) Using dd I created a 50GB file for copying using the following > commands: > > cd /mnt/fastVM > dd if=/dev/random of=random1 bs=1000 count=0 seek=$[1000*1000*50] It'd be interesting to see what the reported speed is here... See below for more. > 2) To ensure that nothing is cached and the copies are (hopefully) > completely fair as root I do the following between each test: > > sync free -h > echo 3 > /proc/sys/vm/drop_caches > free -h Good job. =:^) > 3) As a first test I copy using dd the 50GB file from the SDD to the > RAID6. OK, that answered the question I had about where that file you created actually was -- on the SSD. > As long as reading the SDD is much faster than writing the RAID6 > then it should be a test of primarily the RAID6 write speed: > > dd if=/mnt/fastVM/random1 of=SDDCopy > 97656250+0 records in 97656250+0 records out > 50000000000 bytes (50 GB) copied, 339.173 s, 147 MB/s > If I clear cache as above and rerun the test it's always 145-155MB/S ... Assuming $PWD is now on the raid. You had the path shown too, which I snipped, but that doesn't tell /me/ (as opposed to you, who should know based on your mounts) anything about whether it's on the raid or not. However, the above including the drop-caches demonstrates enough care that I'm quite confident you'd not make /that/ mistake. > 4) As a second test I read from the RAID6 and write back to the RAID6. > I see MUCH lower speeds, again repeatable: > > dd if=SDDCopy of=HDDWrite > 97656250+0 records in 97656250+0 records out > 50000000000 bytes (50 GB) copied, 1187.07 s, 42.1 MB/s > 5) As a final test, and just looking for problems if any, I do an SDD to > SDD copy which clocked in at close to 200MB/S > > dd if=random1 of=SDDCopy > 97656250+0 records in 97656250+0 records out > 50000000000 bytes (50 GB) copied, 251.105 s, 199 MB/s > So, being that this RAID6 was grown yesterday from something that > has existed for a year or two I'm not sure of it's fragmentation, or > even how to determine that at this time. However it seems my problem are > RAID6 reads, not RAID6 writes, at least to new an probably never used > disk space. Reading all that, one question occurs to me. 
If you want to test read and write separately, why the intermediate step of dd-ing from /dev/ random to ssd, then from ssd to raid or ssd? Why not do direct dd if=/dev/random (or urandom, see note below) of=/desired/target ... for write tests, and then (after dropping caches), if=/desired/target of=/dev/null ... for read tests? That way there's just the one block device involved, not both. /dev/random note: I presume with that hardware you have one of the newer CPUs with the new Intel hardware random instruction, with the appropriate kernel config hooking it into /dev/random, and/or otherwise have /dev/random hooked up to a hardware random number generator. Otherwise, using that much random data could block until more suitably random data is generated from approved kernel sources. Thus, the following probably doesn't apply to you, but it may well apply to others, and is good practice in any case, unless you KNOW your random isn't going to block due to hardware generation, and even then it's worth noting that when you're posting examples like the above. In general, for tests such as this where a LOT of random data is needed, but cryptographic-quality random isn't necessarily required, use /dev/urandom. In the event that real-random data gets too low, /dev/urandom will switch to pseudo-random generation, which should be "good enough" for this sort of usage. /dev/random, OTOH, will block until it gets more random data from sources the kernel trusts to be truly random. On some machines with relatively limited sources of randomness the kernel considers truly random, therefore, just grabbing 50 GB of data from /dev/random could take QUITE some time (days maybe? I don't know). Obviously you don't have /too/ big a problem with it as you got the data from /dev/random, but it's worth noting. If your machine has a hardware random generator hooked into /dev/random, then /dev/urandom will never switch to pseudo-random in any case, so for tests of anything above /kilobytes/ of random data (and even at that...), just use urandom and you won't have to worry about it either way. OTOH, if you're generating an SSH key or something, always use /dev/random as that needs cryptographic security level randomness, but that'll take just a few bytes of randomness, not kilobytes let alone gigabytes, and if your hardware doesn't have good randomness and it does block, wiggling your mouse around a bit (obviously assumes a local command, remote could require something other than mouse, obviously) should give it enough randomness to continue. Meanwhile, dd-ing either from /dev/urandom as source, or to /dev/null as sink, with only the test-target block device as a real block device, should give you "purer" read-only and write-only tests. In theory it shouldn't matter much given your method of testing, but as we all know, theory and reality aren't always well aligned. Of course the next question follows on from the above. I see a write to the raid, and a copy from the raid to the raid, so read/write on the raid, and a copy from the ssd to the ssd, read/write on it, but no test of from the raid read. So if=/dev/urandom of=/mnt/raid/target ... should give you raid write. drop-caches if=/mnt/raid/target of=/dev/null ... should give you raid read. *THEN* we have good numbers on both to compare the raid read/write to. What I suspect you'll find, unless fragmentation IS your problem, is that both read (from the raid) alone and write (to the raid) alone should be much faster than read/write (from/to the raid). 
The problem with read/write is that you're on "rotating rust" hardware and there's some latency as it repositions the heads from the read location to the write location and back. If I'm correct and that's what you find, a workaround specific to dd would be to specify a much larger block size, so it reads in far more data at once, then writes it out at once, with far fewer switches between modes. In the above you didn't specify bs (or the separate input/output equivilents, ibs/obs respectively) at all, so it's using 512-byte blocksize defaults. From what I know of hardware, 64KB is a standard read-ahead, so in theory you should see improvements using larger block sizes upto at LEAST that size, and on a 5-disk raid6, probably 3X that, 192KB, which should in theory do a full 64KB buffer on each of the three data drives of the 5- way raid6 (the other two being parity). I'm guessing you'll see a "knee" at the 192 KB (that's 2^10 power not 10^3 power BTW) block size, and above that you might see improvement, but not near as much, since the hardware should be doing full 64KB blocks which it's optimized to. There's likely to be another knee at the 16MB point (again, power of two, not 10), or more accurately, the 48MB point (3*16MB), since that's the size of the device hardware buffers (again, three devices worth of data-stripe, since the other two are parity, 3*16MB=48MB). Above that, theory says you'll see even less improvement, since the caches will be full and any improvement still seen should be purely that of less switches between read/write mode and thus less seeks. But it'd be interesting to see how closely theory matches reality, there's very possibly a fly in that theoretical ointment somewhere. =:^\ Of course configurable block size is specific to dd. Real life file transfers may well be quite a different story. That's where the chunk size, stripe size, etc, stuff comes in, setting the defaults for the kernel for that device, and again, I'll freely admit to not knowing as much as I could in that area. > I will also report more later but I can state that just using top > there's never much CPU usage doing this but a LOT of WAIT time when > reading the RAID6. It really appears the system is spinning it's wheels > waiting for the RAID to get data from the disk. When you're dealing with spinning rust, any time you have a transfer of any size (certainly GB), you WILL see high wait times. Disks are simply SLOW. Even SSDs are no match for system memory, tho their enough closer to help a lot and can be close enough that the bottleneck is elsewhere. (Modern SSDs saturate the SATA-600 links with thruput above 500 MByte/ sec, making the SATA-600 bus the bottleneck, or the 1x PCI-E 2.xlink if that's what it's running on, since they saturate at 485MByte/sec or so, tho PCI-E 3.x is double that so nearly a GByte/sec and a single SATA-600 won't saturate that. Modern DDR3 SDRAM by comparision runs 10+ GByte/sec LOW end, two orders of magnitude faster. Numbers fresh from wikipedia, BTW.) > One place where I wanted to double check your thinking. My thought > is that a RAID1 will _NEVER_ outperform the hdparm -tT read speeds as it > has to read from three drives and make sure they are all good before > returning data to the user. I don't see how that could ever be faster > than what a single drive file system could do which for these drives > would be the 113MB/S WD spec number, correct? 
As I'm currently getting > 145MB/S it appears on the surface that the RAID6 is providing some > value, at least in these early days of use. Maybe it will degrade over > time though. As someone else already posted, that's NOT correct. Neither raid1 nor raid6, at least the mdraid implementations, verify the data. Raid1 doesn't have parity at all, just many copies, and raid6 has parity but only uses it for rebuilds, NOT to check data integrity under normal usage -- it too simply reads the data and returns it. What raid1 does (when it's getting short reads only one at a time) is send the request to every spindle. The first one that returns the data wins; the others simply get their returns thrown away. So under small-one-at-a-time reading conditions, the speed of raid1 reads should be the speed of the fastest disk in the bunch. The raid1 read advantage is in the fact that there's often more than one read going on at once, or that the read is big enough to split up, so different spindles can be seeking to and reading different parts of the request in parallel. (This also helps in fragmented file conditions as long as fragmentation isn't overwhelming, since a raid1 can then send different spindle heads to read the different segments in parallel, instead of reading one at a time serially, as it would have to do in a single spindle case.) In theory, the stripes of raid6 /can/ lead to better thruput for reads. In fact, my experience both with raid6 and with raid0 demonstrates that not to be the case as often as one might expect, due either to small reads or due to fragmentation breaking up the big reads thus negating the theoretical thruput advantage of multiple stripes. To be fair, my raid0 experience was as I mentioned earlier, with files I could easily redownload from the net, mostly the portage tree and overlays, along with the kernel git tree. Due to the frequency of update and the fast rate of change as well as the small files, fragmentation was quite a problem, and the files were small enough I likely wouldn't have seen the full benefit of the 4-way raid0 stripes in any case, so that wasn't a best-case test scenario. But it's what one practically puts on raid0, because it IS easily redownloaded from the net, so it DOESN'T matter that a loss of any of the raid0 component devices will kill the entire thing. If I'd have been using the raid0 for much bigger media files, mp3s or video of megabytes in size minimum, that get saved and never changed so there's little fragmentation, I expect my raid0 experience would have been *FAR* better. But at the same time, that's not the type of data that it generally makes SENSE to store on a raid0 without backups or redundancy of any sort, unless it's simply VDR files that if a device drops from the raid and you lose it you don't particularly care (which would make a GREAT raid0 candidate), so... Raid6 is the stripes of raid0, plus two-way-parity. So since the parity is ignored for reads, for them it's effectively raid0 with two less stripes then the number of devices. Thus your 5-device raid6 is effectively a 3-device raid0 in terms of reads. In theory, thruput for large reads done by themselves should be pretty good -- three times that of a single device. In fact... due either to multiple jobs happening at once, or to a mix of read/write happening at once, or to fragmentation, I was disappointed, and far happier with raid1. 
But your situation is indeed rather different than mine, and depending on how much writing happens in those big VM files and how the filesystem you choose handles fragmentation, you could be rather happier with raid6 than I was. But I'd still suggest you try raid1 if the amount of data you're handling will let you. Honestly, it surprised me how well raid1 did for me. I wasn't prepared for that at all, and I believe the comparison to what I was getting on raid6 is what colored my opinion of raid6 so badly. I had NO IDEA there would be that much difference! But your experience may indeed be different. The only way to know is to try it. However, one thing I either overlooked or that hasn't been posted yet is just how much data you're talking about. You're running five 500-gig drives in raid6 now, which should give you 3*500=1500 gigs (10-power) capacity. If it's under a third full, 500 GB (10-power), you can go raid1 with as many mirrors as you like of the five, and keep the rest of them for hot-spares or whatever. If you're running (or plan to be running) near capacity, over 2/3 full, 1 TB (10-power), you really don't have much option but raid6. If you're in between, 1/3 to 2/3 full, 500-1000 GB (10-power), then a raid10 is possible, perhaps 4-spindle with the 5th as a hot-spare. (A spindle configured as a hot-spare is kept unused but ready for use by mdadm and the kernel. If a spindle should drop out, the hot-spare is automatically inserted in its place and a rebuild immediately started. This narrows the danger zone during which you're degraded and at risk if further spindles drop out, because handling is automatic so you're back to full un-degraded as soon as possible. However, it doesn't eliminate that danger zone should another one drop out during the rebuild, which is after all quite stressful on the remaining drives due to all that reading going on, so the risk is greater during a rebuild than under normal operation.) So if you're over 2/3 full, or expect to be in short order, there's little sense in further debate on at least /your/ raid6, as that's pretty much what you're stuck with. (Unless you can categorize some data as more important than the rest, and raid it, while the other can be considered worth the risk of loss if the device goes, in which case we're back in play with other options once again.) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 46+ messages in thread
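If the in-between case applies, the 4-spindle raid10 plus hot-spare sketched above would be created along these lines; purely illustrative, the partition names are placeholders and the command destroys whatever is on them:

mdadm --create /dev/md10 --level=10 --raid-devices=4 --spare-devices=1 \
      /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1
# 4 active members give roughly 1TB usable from 500GB drives; the 5th sits idle as a hot-spare
mdadm --detail /dev/md10    # confirm layout, chunk size and spare status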
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-22 14:23 ` Duncan @ 2013-06-23 1:02 ` Mark Knecht 2013-06-23 1:48 ` Mark Knecht 0 siblings, 1 reply; 46+ messages in thread From: Mark Knecht @ 2013-06-23 1:02 UTC (permalink / raw To: Gentoo AMD64 On Sat, Jun 22, 2013 at 7:23 AM, Duncan <1i5t5.duncan@cox.net> wrote: > Mark Knecht posted on Fri, 21 Jun 2013 10:40:48 -0700 as excerpted: > >> On Fri, Jun 21, 2013 at 12:31 AM, Duncan <1i5t5.duncan@cox.net> wrote: >> <SNIP> <SNIP> > > ... Assuming $PWD is now on the raid. You had the path shown too, which > I snipped, but that doesn't tell /me/ (as opposed to you, who should know > based on your mounts) anything about whether it's on the raid or not. > However, the above including the drop-caches demonstrates enough care > that I'm quite confident you'd not make /that/ mistake. > >> 4) As a second test I read from the RAID6 and write back to the RAID6. >> I see MUCH lower speeds, again repeatable: >> >> dd if=SDDCopy of=HDDWrite >> 97656250+0 records in 97656250+0 records out >> 50000000000 bytes (50 GB) copied, 1187.07 s, 42.1 MB/s > >> 5) As a final test, and just looking for problems if any, I do an SDD to >> SDD copy which clocked in at close to 200MB/S >> >> dd if=random1 of=SDDCopy >> 97656250+0 records in 97656250+0 records out >> 50000000000 bytes (50 GB) copied, 251.105 s, 199 MB/s > >> So, being that this RAID6 was grown yesterday from something that >> has existed for a year or two I'm not sure of it's fragmentation, or >> even how to determine that at this time. However it seems my problem are >> RAID6 reads, not RAID6 writes, at least to new an probably never used >> disk space. > > Reading all that, one question occurs to me. If you want to test read > and write separately, why the intermediate step of dd-ing from /dev/ > random to ssd, then from ssd to raid or ssd? > > Why not do direct dd if=/dev/random (or urandom, see note below) > of=/desired/target ... for write tests, and then (after dropping caches), > if=/desired/target of=/dev/null ... for read tests? That way there's > just the one block device involved, not both. > 1) I was a bit worried about using it in a way it might not have been intended to be used. 2) I felt that if I had a specific file then results should be repeatable, or at least not dependent on what's in the file. <SNIP> > > Meanwhile, dd-ing either from /dev/urandom as source, or to /dev/null as > sink, with only the test-target block device as a real block device, > should give you "purer" read-only and write-only tests. In theory it > shouldn't matter much given your method of testing, but as we all know, > theory and reality aren't always well aligned. > Will try some tests this way tomorrow morning. > > Of course the next question follows on from the above. I see a write to > the raid, and a copy from the raid to the raid, so read/write on the > raid, and a copy from the ssd to the ssd, read/write on it, but no test > of from the raid read. > > So > > if=/dev/urandom of=/mnt/raid/target ... should give you raid write. > > drop-caches > > if=/mnt/raid/target of=/dev/null ... should give you raid read. > > *THEN* we have good numbers on both to compare the raid read/write to. > > What I suspect you'll find, unless fragmentation IS your problem, is that > both read (from the raid) alone and write (to the raid) alone should be > much faster than read/write (from/to the raid). 
> The problem with read/write is that you're on "rotating rust" hardware > and there's some latency as it repositions the heads from the read > location to the write location and back. > If this lack of performance is truly driven by the drive rotational issues then I completely agree. > If I'm correct and that's what you find, a workaround specific to dd > would be to specify a much larger block size, so it reads in far more > data at once, then writes it out at once, with far fewer switches between > modes. In the above you didn't specify bs (or the separate input/output > equivilents, ibs/obs respectively) at all, so it's using 512-byte > blocksize defaults. > So help me clarify this before I do the work and find out I didn't understand. Whereas earlier I created a file using: dd if=/dev/random of=random1 bs=1000 count=0 seek=$[1000*1000*50] if what you are suggesting is more like this very short example: mark@c2RAID6 /VirtualMachines/bonnie $ dd if=/dev/urandom of=urandom1 bs=4096 count=$[1000*100] 100000+0 records in 100000+0 records out 409600000 bytes (410 MB) copied, 25.8825 s, 15.8 MB/s mark@c2RAID6 /VirtualMachines/bonnie $ then the results for writing this 400MB file are very slow, but I'm sure I don't understand what you're asking, or urandom is the limiting factor here. I'll look for a reply (from you or anyone else who has a better grasp of Duncan's idea than I do) before I do much more. Thanks! - Mark ^ permalink raw reply [flat|nested] 46+ messages in thread
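The block-size suggestion can also be tested directly against the existing 50GB file with a small sweep, dropping caches before each pass; a minimal sketch to run as root from the same directory used above (each pass copies the whole file, so it takes a while, and a smaller source file works just as well for comparing block sizes):

for bs in 4K 64K 192K 1M 16M 48M; do
    sync
    echo 3 > /proc/sys/vm/drop_caches
    echo "== bs=$bs =="
    dd if=SDDCopy of=HDDWrite bs=$bs conv=fsync 2>&1 | tail -n1   # keep only the throughput summary line
    rm -f HDDWrite
done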
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-23 1:02 ` Mark Knecht @ 2013-06-23 1:48 ` Mark Knecht 2013-06-28 3:36 ` Duncan 0 siblings, 1 reply; 46+ messages in thread From: Mark Knecht @ 2013-06-23 1:48 UTC (permalink / raw To: Gentoo AMD64 On Sat, Jun 22, 2013 at 6:02 PM, Mark Knecht <markknecht@gmail.com> wrote: > if what you are suggesting is more like this very short example: > > mark@c2RAID6 /VirtualMachines/bonnie $ dd if=/dev/urandom of=urandom1 > bs=4096 count=$[1000*100] > 100000+0 records in > 100000+0 records out > 409600000 bytes (410 MB) copied, 25.8825 s, 15.8 MB/s > mark@c2RAID6 /VirtualMachines/bonnie $ > Duncan, Actually, using your idea of piping things to /dev/null it appears that the random number generator itself is only capable of 15MB/S on my machine. mark@c2RAID6 /VirtualMachines/bonnie $ dd if=/dev/urandom of=/dev/null bs=4096 count=$[1000] 1000+0 records in 1000+0 records out 4096000 bytes (4.1 MB) copied, 0.260608 s, 15.7 MB/s mark@c2RAID6 /VirtualMachines/bonnie $ It doesn't change much based on block size or number of bytes I pipe. If this speed is representative of how well that works then I think I have to use a file. It appears this guy gets similar values: http://www.globallinuxsecurity.pro/quickly-fill-a-disk-with-random-bits-without-dev-urandom/ On the other hand, piping /dev/zero appears to be very fast - basically the speed of the processor I think: mark@c2RAID6 /VirtualMachines/bonnie $ dd if=/dev/zero of=/dev/null bs=4096 count=$[1000] 1000+0 records in 1000+0 records out 4096000 bytes (4.1 MB) copied, 0.000622594 s, 6.6 GB/s mark@c2RAID6 /VirtualMachines/bonnie $ - Mark ^ permalink raw reply [flat|nested] 46+ messages in thread
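If the goal is simply lots of incompressible data faster than /dev/urandom can produce it, one common workaround is to expand a small seed with a fast stream cipher; a minimal sketch only, where the output path, size and passphrase are arbitrary choices rather than anything from this thread:

openssl enc -aes-128-ctr -nosalt -pass pass:bench-seed < /dev/zero 2>/dev/null \
    | dd of=/mnt/fastVM/random2 bs=1M count=4096 iflag=fullblock
# an AES-CTR keystream over zeros looks random to a compressor but typically runs at hundreds of MB/s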
* [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-23 1:48 ` Mark Knecht @ 2013-06-28 3:36 ` Duncan 2013-06-28 9:12 ` Duncan 0 siblings, 1 reply; 46+ messages in thread From: Duncan @ 2013-06-28 3:36 UTC (permalink / raw To: gentoo-amd64 Mark Knecht posted on Sat, 22 Jun 2013 18:48:15 -0700 as excerpted: > Duncan, Again, following up now that it's my "weekend" and I have a chance... > Actually, using your idea of piping things to /dev/null it appears > that the random number generator itself is only capable of 15MB/S on my > machine. It doesn't change much based on block size of number of bytes > I pipe. =:^( Well, you tried. > If this speed is representative of how well that works then I think > I have to use a file. It appears this guy gets similar values: > > http://www.globallinuxsecurity.pro/quickly-fill-a-disk-with-random-bits- without-dev-urandom/ Wow, that's a very nice idea he has there! I'll have to remember that! The same idea should work for creating any relatively large random file, regardless of final use. Just crypt-setup the thing and dd /dev/zero into it. FWIW, you're doing better than my system does, however. I seem to run about 13 MB/s from /dev/urandom (upto 13.7 depending on blocksize). And back to the random vs urandom discussion, random totally blocked here after a few dozen bytes, waiting for more random data to be generated. So the fact that you actually got a usefully sized file out of it does indicate that you must have hardware random and that it's apparently working well. > On the other hand, piping /dev/zero appears to be very fast - > basically the speed of the processor I think: > > $ dd if=/dev/zero of=/dev/null bs=4096 count=$[1000] > 1000+0 records in 1000+0 records out 4096000 bytes (4.1 MB) copied, > 0.000622594 s, 6.6 GB/s What's most interesting to me when I tried that here is that unlike urandom, zero's output varies DRAMATICALLY by blocksize. With bs=$((1024*1024)) (aka 1MB), I get 14.3 GB/s, tho at the default bs=512, I get only 1.2 GB/s. (Trying a few more values, 1024*512 gives me very similar 14.5 GB/s, 1024*64 is already down to 13.2 GB/s, 1024*128=13.9 and 1024*256=14.1, while on the high side 1024*1024*2 is already down to 10.2 GB/s. So quarter MB to one MB seems the ideal range, on my hardware.) But of course, if your device is compressible-data speed-sensitive, as are say the sandforce-controller-based ssds, /dev/zero isn't going to give you anything like the real-world benchmark random data would (tho it should be a great best-case compressible-data test). Tho it's unlikely to matter on most spinning rust, AFAIK, and SSDs like my Corsair Neutrons (Link_A_Media/LAMD-based controller), which have as a bullet-point feature that they're data compression agnostic, unlike the sandforce- based SSDs. Since /dev/zero is so fast, I'd probably do a few initial tests to determine whether compressible data makes a difference on what you're testing, then use /dev/zero if it doesn't appear to, to get a reasonable base config, then finally double-check that against random data again. Meanwhile, here's another idea for random data, seeing as /dev/urandom is speed limited. Upto your memory constraints anyway, you should be able to dd if=/dev/urandom of=/some/file/on/tmpfs . Then you can dd if=/tmpfs/file, of=/dev/test/target, or if you want a bigger file than a direct tmpfs file will let you use, try something like this: cat /tmpfs/file /tmpfs/file /tmpfs/file | dd of=/dev/test/target ... 
which would give you 3X the data size of /tmpfs/file. (Man, testing that with a 10 GB tmpfs file (on a 12 GB tmpfs /tmp), I can see see how slow that 13 MB/s /dev/urandom actually is as I'm creating it! OUCH! I waited awhile before I started typing this comment... I've been typing slowly and looking at the usage graph as I type, and I'm still only at maybe 8 gigs, depending on where my cache usage was when I started, right now!) cd /tmp dd if=/dev/urandom of=/tmp/10gig.testfile bs=$((1024*1024)) count=10240 (10240 records, 10737418240 bytes, but it says 11 GB copied, I guess dd uses 10^3 multipliers, anyway, ~783 s, 13.7 MB/s) ls -l 10gig.testfile (confirm the size, 10737418240 bytes) cat 10gig.testfile 10gig.testfile 10gig.testfile \ 10gig.testfile 10gig.testfile | dd of=/dev/null (that's 5x, yielding 50 GB power of 2, 104857600+0 records, 53687091200 bytes, ~140s, 385 MB/s at the default 512-byte blocksize) Wow, what a difference block size makes there, too! Trying the above cat/ dd with bs=$((1024*1024)) (1MB) yields ~30s, 1.8 GB/s! 1GB block size (1024*1024*1024) yields about the same, 30s, 1.8 GB/s. LOL dd didn't like my idea to try a 10 GB buffer size! dd: memory exhausted by input buffer of size 10737418240 bytes (10 GiB) (No wonder, as that'd be 10GB in tmpfs/cache and a 10GB buffer, and I'm /only/ running 16 gigs RAM and no swap! But it won't take 2 GB either. Checking, looks like as my normal user I'm running a ulimit of 1-gig memory size, 2-gig virtual-size, so I'm sort of surprised it took the 1GB buffer... maybe that counts against virtual only or something? ) Low side again, ~90s, 599 MB/s @ 1KB (1024 byte) bs, already a dramatic improvement from the 140s 385 MB/s of the default 512-byte block. 2KB bs yields 52s, 1 GB/s 16KB bs yields 31s, 1.7 GB/s, near optimum already. High side again, 1024*1024*4 (4MB) bs appears to be best-case, just under 29s, 1.9 GB/s. Going to 8MB takes another second, 1.8 GB/s again, which is not a big surprise given that the memory page size is 4MB, so that's an unsurprising peak performance point. FWIW, cat seems to run just over 100% single-core saturation while dd seems to run just under, @97% or so. Running two instances in parallel (using the peak 4MB block size, 1.9 GB/ s with a single run) seems to cut performance some, but not nearly in half. (I got 1.5 GB/s and 1.6 GB/s, but I started one then switched to a different terminal to start the other, so they only overlapped by maybe 30s or so of the 35s on each.). OK, so that's all memory/cpu since neither end is actual storage, but that does give me a reasonable base against which to benchmark actual storage (rust or ssd), if I wished. What's interesting is that by, I guess pure coincidence, my 385 MB/s original 512-byte blocksize figure is reasonably close to what the SSD read benchmarks are with hddparm. IIRC the hdparm/ssd numbers were some higher, but not so much so (470 MB/sec I just tested). But the bus speed maxes out not /too/ far above that (500-600 MB/sec, theoretically 600 MB/ sec on SATA-600, but real world obviously won't /quite/ hit that, IIRC best numbers I've seen anywhere are 585 or so). So now I guess I send this and do some more testing of real device, now that you've provoked my curiosity and I have the 50 GB (mostly) pseudorandom file sitting in tmpfs already. Maybe I'll post those results later. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." 
Richard Stallman ^ permalink raw reply [flat|nested] 46+ messages in thread
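For reference, the dm-crypt trick from that globallinuxsecurity.pro link boils down to roughly the sketch below. The device, mapping name and cipher are placeholders rather than anything from the thread, and the dd step overwrites whatever the mapping points at, so triple-check the target first. The idea is simply that the kernel's AES is far faster than /dev/urandom, so zeros pushed through the crypto layer come out the other side as full-speed pseudorandom data:

# map the target device with a throwaway random key (plain dm-crypt, no LUKS header)
cryptsetup create scratch /dev/sdX --cipher=aes-xts-plain64 --key-file=/dev/urandom
# fill the mapping with zeros; what actually lands on /dev/sdX is effectively random
dd if=/dev/zero of=/dev/mapper/scratch bs=$((1024*1024))
# tear the mapping down again
cryptsetup remove scratch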
* [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-28 3:36 ` Duncan @ 2013-06-28 9:12 ` Duncan 2013-06-28 17:50 ` Gary E. Miller 0 siblings, 1 reply; 46+ messages in thread From: Duncan @ 2013-06-28 9:12 UTC (permalink / raw To: gentoo-amd64

Duncan posted on Fri, 28 Jun 2013 03:36:10 +0000 as excerpted:

> So now I guess I send this and do some more testing of real device, now
> that you've provoked my curiosity and I have the 50 GB (mostly)
> pseudorandom file sitting in tmpfs already. Maybe I'll post those
> results later.

Well, I decided to use something rather smaller, both because I wanted to run it against my much smaller btrfs partitions on the ssd, and because the big file was taking too long for the benchmarks I wanted to do in the time I wanted to do them.

I settled on a 4 GiB file. Speeds are power-of-10-based since that's what dd reports, unless otherwise stated. Sizes are power-of-2-based unless otherwise stated. This was filesystem-layer-based, not direct to device, and a single I/O task, plus whatever the system might have had going on in the background. Also note that after reading the dd manpage, I added the conv=fsync parameter, hoping that gave me more accurate speed ratings by reducing the effect of write-caching.

SSD speeds, dual Corsair Neutron n256gp3 SATA-600 ssds, running btrfs raid1 data and metadata:

To SSD: peak was upper 250s MB/s over a wide blocksize range of 1 MiB to 1 GiB. I believe the btrfs checksumming might lower speeds here somewhat, as it's quite a bit lower than the rated 450 MB/s sequential write speed.

From SSD: peak was lower 480s MB/s, blocksize 32 KiB to 512 KiB (a narrower blocksize range, and a much smaller block size, than I expected). This is MUCH better, far closer to the 540 MB/s ratings.

To/from SSD: At around 220 MB/s, peak was somewhat lower than the write-only peak, as might be expected. Best-case blocksize range seemed to be 256 KiB to 2 MiB.

So, the best mixed-access case would seem to be a blocksize near 1 MiB. I did a few timed cps also, then did the math to confirm the dd numbers. They were close enough.

Spinning rust speeds, single Seagate st9500424as, 7200rpm 2.5" 16MB buffer SATA-300 disk drive, reiserfs. Tests were done on a partition located roughly 40% thru the drive. I didn't test this one as closely and didn't do rust-to-rust tests at all, but:

To rust: upper 70s MB/s, blocksize didn't seem to matter much.

From rust: upper 90s MB/s, blocksize up to 4 MiB.

-- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 46+ messages in thread
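For anyone wanting to repeat this sort of run, the tests described above amount to variations on the following sketch. The paths are placeholders, the block size is whatever is being probed, and the cache-dropping step is an extra precaution rather than something stated in the thread (it needs root):

# write test: conv=fsync keeps the final flush inside the timed run
dd if=/tmp/4gig.testfile of=/mnt/target/ddtest bs=$((1024*1024)) conv=fsync
# drop the page cache so the read test actually hits the device
echo 3 > /proc/sys/vm/drop_caches
# read test
dd if=/mnt/target/ddtest of=/dev/null bs=$((1024*1024))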
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-28 9:12 ` Duncan @ 2013-06-28 17:50 ` Gary E. Miller 2013-06-29 5:40 ` Duncan 0 siblings, 1 reply; 46+ messages in thread From: Gary E. Miller @ 2013-06-28 17:50 UTC (permalink / raw To: gentoo-amd64; +Cc: 1i5t5.duncan

[-- Attachment #1: Type: text/plain, Size: 1612 bytes --]

Yo Duncan!

On Fri, 28 Jun 2013 09:12:24 +0000 (UTC) Duncan <1i5t5.duncan@cox.net> wrote:

> Duncan posted on Fri, 28 Jun 2013 03:36:10 +0000 as excerpted:
>
> I settled on a 4 GiB file. Speeds are power-of-10-based since that's
> what dd reports, unless otherwise stated.

dd is pretty good at testing linear file performance, pretty useless for testing mysql performance.

> To SSD: peak was upper 250s MB/s over a wide blocksize range of 1 MiB
> From SSD: peak was lower 480s MB/s, blocksize 32 KiB to 512 KiB

Sounds about right. Your speeds are now so high that small differences in the SATA controller chip will be bigger than the differences between some SSD drives. Use a PCIe/SATA card and your performance will drop from what you see.

> Spinning rust speeds, single Seagate st9500424as, 7200rpm 2.5" 16MB

Those are pretty old and slow. If you are going to test an HDD against a newer SSD you should at least test a newer HDD. A new 2TB drive could get pretty close to your SSD performance in linear tests.

> To rust: upper 70s MB/s, blocksize didn't seem to matter much.
> From rust: upper 90s MB/s, blocksize up to 4 MiB.

Seems about right, for that drive.

I think your numbers are about right, if your workload is just reading and writing big linear files. For a MySQL workload there would be a lot of random reads/writes/seeks and the SSD would really shine.

RGDS GARY --------------------------------------------------------------------------- Gary E. Miller Rellim 109 NW Wilmington Ave., Suite E, Bend, OR 97701 gem@rellim.com Tel:+1(541)382-8588

[-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 198 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
* [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-28 17:50 ` Gary E. Miller @ 2013-06-29 5:40 ` Duncan 0 siblings, 0 replies; 46+ messages in thread From: Duncan @ 2013-06-29 5:40 UTC (permalink / raw To: gentoo-amd64

Gary E. Miller posted on Fri, 28 Jun 2013 10:50:08 -0700 as excerpted:

> Yo Duncan!

Nice greeting, BTW. Good to cheer a reader up after a long day with things not going right, especially after seeing it several times in a row so there's a bit of familiarity now. =:^)

> On Fri, 28 Jun 2013 09:12:24 +0000 (UTC)
> Duncan <1i5t5.duncan@cox.net> wrote:
>
>> Duncan posted on Fri, 28 Jun 2013 03:36:10 +0000 as excerpted:
>>
>> I settled on a 4 GiB file. Speeds are power-of-10-based since that's
>> what dd reports, unless otherwise stated.
>
> dd is pretty good at testing linear file performance, pretty useless for
> testing mysql performance.

Recognized. A single-i/o-job test, but it's something, it's reasonably repeatable, and when done on the actual filesystem, it's real-world times and flexible in data and block size, if single-job limited. Plus, unlike some of the more exotic tests which need to be installed separately, it's commonly already installed and available for use on most *ix systems. =:^)

>> To SSD: peak was upper 250s MB/s over a wide blocksize range of 1 MiB
>> From SSD: peak was lower 480s MB/s, blocksize 32 KiB to 512 KiB
>
> Sounds about right. Your speeds are now so high that small differences
> in the SATA controller chip will be bigger than that between some SSD
> drives. Use a PCIe/SATA card and your performance will drop from what
> you see.

Good point. I was thinking about that the other day. SSDs are fast enough that a single one saturates a SATA-600 channel all by itself and already fills an older PCIe 2.0 1x lane. SATA port-multipliers are arguably still useful for slower spinning rust, but not so much for SSD, where the bottleneck is often already the SATA and/or PCIe, so doubling up will indeed only slow things down. And most add-on SATA cards have several SATA ports hanging off the same 1x PCIe, which means they'll bottleneck if actually using more than a single port, too. I believe I have seen 4x PCIe SATA cards, which would allow four or so SATA ports (I think 5), but they tend to be higher priced.

After pondering that for a bit, I decided I'd take a closer look next time I was at Fry's Electronics, to see what was actually available, as well as the prices. Until last year I was still running old PCI-X boxes, so the whole PCI-E thing itself is still relatively new to me, and I'm still reorienting myself to the modern bus and its implications in terms of addon cards, etc.

>> Spinning rust speeds, single Seagate st9500424as, 7200rpm 2.5" 16MB
>
> Those are pretty old and slow. If you are going to test an HDD against
> a newer SSD you should at least test a newer HDD. A new 2TB drive could
> get pretty close to your SSD performance in linear tests.

Well, it's not particularly old, but it *IS* a 2.5 inch, down from the old 3.5 inch standard, which due to the smaller diameter does mean lower rim/maximum speeds at the same RPM. And of course 7200 RPM is middle of the pack as well. The fast stuff (tho calling any spinning rust "fast" in the age of SSDs does rather jar, it's relative!) is 15000 RPM.
But 2.5 inch does seem to be on its way to becoming the new standard for desktops and servers as well, helped along by the three factors of storage density, SSDs (which are invariably 2.5 inch, and even that's due to the standard form factor, as often the circuit boards aren't full height and/or are largely empty space; 3.5 inch is just /ridiculously/ huge for them), and the newer focus on power efficiency (plus raw spindle density!) in the data center.

There's still a lot of inertia behind the 3.5 inch standard, just as there is behind spinning rust, and it's not going away overnight, but in the larger picture, 3.5 inch tends to look as anachronistic as a full size desktop in an age when even the laptop is being displaced by the tablet and mobile phone. Which isn't to say there's no one still using them, by far (my main machine is still a mid tower, easier to switch out parts on them, after all), but just sayin' what I'm sayin'. Anyway...

>> To rust: upper 70s MB/s, blocksize didn't seem to matter much.
>> From rust: upper 90s MB/s, blocksize up to 4 MiB.
>
> Seems about right, for that drive.
>
> I think your numbers are about right, if your workload is just reading
> and writing big linear files. For a MySQL workload there would be a lot
> of random reads/writes/seeks and the SSD would really shine.

Absolutely. And perhaps more to the point given the list and thus the readership...

As I said in a different thread on a different list recently, I didn't see my boot times change much, the major factor there being the ntp-client time-sync, at ~12 seconds usually (just long enough to trigger openrc's first 10-second warning in the minute timeout...), but *WOW*, did the SSDs drop my emerge sync, as well as kernel git pull, time! Those both involve many smaller files that will tend to highly fragment over time due to constant churn, and that's EXACTLY the job type where good SSDs can (and do!) really shine! Like your MySQL db example (tho that's high-activity large-file rather than a high-activity huge number of smaller files), except this one's likely more directly useful to a larger share of the list readership. =:^)

Meanwhile, the other thing with the boot times is that I boot to a CLI login, so I don't tend to count the X and kde startup times as boot. But kde starts up much faster too, and that would count as boot time for many users.

Additionally, I have one X app, pan, that in my (not exactly design targeted) usage had a startup time that really suffered on spinning rust, so much so that for years I've had it start with the kde session, so that if it takes five minutes on cold-cache to start up, no big deal, I have other things to do and it's ready when I'm ready for it.

It's a newsgroups (nntp) app, which as designed (by default) ships with a 10 MB article cache, and expires headers in (IIRC) two weeks. But my usage, in addition to following my various lists with it using gmane's list2news service, is as a long-time technical group and list archive. My text-instance pan (the one I use the most) has a cache size of several gig (with about a gig actually used) and is set to no-expiry on messages. In fact, I have ISP newsgroups archived in pan for an ISP server that hasn't even existed for several years now, as well as the archives for several mailing lists going back over a decade to 2002. So this text-instance pan tends to be another prime example of a best-use-case-for-SSDs.
Thousands, actually tens of thousands of files I believe, all in the same cache dir, with pan accessing them all to rebuild its threading tree in memory at startup. (For years there's been talk of switching that to a database, so it doesn't have to all be in memory at once, but the implementation has yet to be coded up and switched to.)

On spinning rust, I did note a good speed boost if I backed up everything and did a mkfs and restore from time to time, so it's definitely a high-fragmentation use-case as well. *GREAT* use case for SSD, and there too, I noticed a HUGE difference. Tho I've not actually timed the startup since switching to SSD, I do know that the pan icon appears in the system tray far earlier than it did, such that I almost think it's there as soon as the system tray is, now, whereas on the spinning rust, it would take five minutes or more to appear.

... Which is something those dd results don't, and can't, show at all. Single-i/o-thread access to a single rather large (GBs) file, unchanged since it was originally written, is one thing. Access to thousands or tens of thousands of constantly changing or multi-write-thread interwoven little files, or for that matter to a high-activity large file, thus (depending on the filesystem) potentially triggering COW fragmentation there, is something entirely different. And the many-parallel-job seek latency of spinning rust is something that dd simply does not and cannot really measure, as it's simply the wrong tool for that sort of purpose.

-- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 46+ messages in thread
* Re: [gentoo-amd64] Re: Is my RAID performance bad possibly due to starting sector value? 2013-06-21 7:31 ` [gentoo-amd64] " Duncan 2013-06-21 10:28 ` Rich Freeman 2013-06-21 17:40 ` Mark Knecht @ 2013-06-30 1:04 ` Rich Freeman 2 siblings, 0 replies; 46+ messages in thread From: Rich Freeman @ 2013-06-30 1:04 UTC (permalink / raw To: gentoo-amd64

On Fri, Jun 21, 2013 at 3:31 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> BUT RAID5/6 DOESN'T USE
> THAT DATA FOR INTEGRITY CHECKING ANYWAY, ONLY FOR RECONSTRUCTION IN THE
> CASE OF DEVICE LOSS!

Well, to drive this point home in the case of the thread that wouldn't die, I had put an entry in crontab a week ago to do a weekly forced check of all my arrays. Last week it passed. Today, towards the end of the check, drive performance seriously deteriorated, and eventually smartd sent me an email about pending sectors (these are read errors).

Long story short, I ended up failing the drive out of the array (at which point my system stopped crawling), tried wiping the bad sectors individually, and after self-tests kept failing I even tried zeroing the drive. With sustained read failures under those circumstances I decided the drive qualified for RMA. The drive was almost a year old.

So, I'm crossing my fingers that I don't suffer another failure, and I'll be careful about clean shutdowns. Since the problem was discovered before I had dual failures, the RAID should be recoverable without further loss.

If you don't already, check your arrays weekly in crontab. Scripts for this can be found online or I'd be happy to post the one I dug up somewhere...

Rich ^ permalink raw reply [flat|nested] 46+ messages in thread
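For reference, a weekly md check of the sort described above is just a write to sysfs, so a minimal version of such a cron script might look like the sketch below. The script path in the crontab line is invented for the example; progress shows up in /proc/mdstat, and any inconsistencies found are counted in /sys/block/mdX/md/mismatch_cnt:

#!/bin/sh
# kick off a scrub ("check") of every md array; must run as root
for md in /sys/block/md*/md; do
    echo check > "$md/sync_action"
done

# example crontab entry, running it early Sunday morning:
# 0 3 * * 0 /usr/local/sbin/raid-check.sh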
* Re: [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? 2013-06-20 19:10 [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? Mark Knecht ` (2 preceding siblings ...) 2013-06-21 7:31 ` [gentoo-amd64] " Duncan @ 2013-06-22 12:49 ` B Vance 2013-06-22 13:12 ` Rich Freeman 2013-06-23 11:31 ` thegeezer 4 siblings, 1 reply; 46+ messages in thread From: B Vance @ 2013-06-22 12:49 UTC (permalink / raw To: gentoo-amd64

On Thu, 2013-06-20 at 12:10 -0700, Mark Knecht wrote:
<SNIP>
> Looking for some thoughtful ideas from those more experienced in this area.
>
> Cheers,
> Mark

Not necessarily the kind of answer you are looking for, but a year or so back I converted my NAS from hardware RAID1 to Linux software RAID1 to RAID1 on ZFS. Before the conversion to ZFS I had issues with the NAS being unable to keep up with requests. Since then I have been able to hit the NAS relatively hard with no visible effects.

Just to give an idea, a normal load involves streaming an HD movie to the TV, streaming music to a second system, and being used as the shared storage for four computers, two of which almost constantly hit the shared drive for data (it keeps the distfile directory for all the systems as well as serving as the local rsync), plus, once a month, transferring data to removable storage devices. All of this goes over cat6 Ethernet and occasionally USB2. I'm unsure how I would go about measuring the throughput, mainly because I never cared in the past as long as the files transferred at a reasonable pace and the video/audio didn't stutter.

By no means is my NAS a high-end system. Its stats are:

AMD64 X2 4200
ASUS A8V MoBo (I think)
4GB RAM
2 x Silicon Image Sil 3114 SATA RAID cards (4 port PCI cards)
3 x 1.5TB Seagate drives (on RAID cards)
4 x 2TB Western Digital drives (on RAID cards)
2 x Western Digital antique 80GB drives (mirrored on motherboard for OS)
Marvell GigE network cards (have a second card to add once I figure out how to automatically load balance through two cards)
Case with 2 x 120mm fans on top, 3 x 120mm fans on the front, 1 x 240mm fan on the side

Total storage available 6.3TB, of which 3.4TB is used. An image of the pool is created on a daily basis via cron jobs, which are overwritten every 3 days.
(Image of Day 1, Day 2, Day 3, then Day 4 overwrites Day 1.) The pool started with 5 x 750GB drives and has been grown slowly as I find deals on better drives.

The main advantage of using ZFS on Linux is the ease of growing your pools. As long as you know the id of the drive (preferably the hardware id, not the delegated one), it's so simple even I can manage it. Since I'm nowhere near the technical level of most folk here, anyone can do it. For what it's worth (very little, I know), I think that ZFS has too many advantages over Linux software RAID for it to be a real competition. YMMV

B. Vance ^ permalink raw reply [flat|nested] 46+ messages in thread
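For the curious, growing a pool the way described above is essentially a one-liner per disk or mirror pair. A rough sketch, with the pool name and the by-id paths purely as placeholders:

# add another mirrored pair to an existing pool, addressing disks by hardware id
zpool add tank mirror /dev/disk/by-id/ata-DISK_SERIAL_1 /dev/disk/by-id/ata-DISK_SERIAL_2
# confirm the new vdev and the extra capacity
zpool status tank
zpool list tank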
* Re: [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? 2013-06-22 12:49 ` [gentoo-amd64] " B Vance @ 2013-06-22 13:12 ` Rich Freeman 0 siblings, 0 replies; 46+ messages in thread From: Rich Freeman @ 2013-06-22 13:12 UTC (permalink / raw To: gentoo-amd64

On Sat, Jun 22, 2013 at 8:49 AM, B Vance <anonymous.pseudonym.88@gmail.com> wrote:
> Main advantage of using ZFS on linux is the ease of growing your pools.
> As long as you know the id of the drive (preferably the hardware id not
> the delegated one), its so simple I can manage it. Since I'm nowhere
> near the technical level of most folk here, anyone can do it. For what
> it's worth (very little I know), I think that ZFS has too many
> advantages over linux software RAID for it to be a real competition.

I'm holding out for btrfs, but for all the same reasons. I really don't want to mess with zfs on linux (fuse, etc - and the license issues - the thing I don't get is that Oracle maintains both).

However, the last time I checked, ZFS does not support reshaping of RAID-Z. That is a major limitation for me, as I almost always expand arrays gradually. You can add additional raid-z's to a zpool, but if you have a raid-z with 5 drives you can't add 1 more drive to it as part of the same raid-z. That means that it gets treated as a mirror and not a stripe, and that means that if you add 10 drives in this manner one at a time you get 5 drives of capacity and not 9.

Btrfs targets making raids re-shapeable, just like mdadm. But in general COW makes a LOT more sense with RAID, because the layer-breaking allows them to often avoid read-modify-write cycles by writing complete stripes more often, and files aren't modified in place so you can consolidate changes for many files into a single stripe (granted, that can cause fragmentation). ZFS has all those advantages, being COW, as will btrfs when it is ready for prime time.

Rich ^ permalink raw reply [flat|nested] 46+ messages in thread
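For contrast, the kind of mdadm reshape being referred to looks roughly like the sketch below. The device names are placeholders; a reshape runs in the background and can take many hours, and the filesystem still has to be grown separately afterwards:

# add a new disk as a spare, then grow a 5-device array to 6 active devices
mdadm --add /dev/md0 /dev/sdf1
mdadm --grow /dev/md0 --raid-devices=6
# watch the reshape progress
cat /proc/mdstat
# once it finishes, grow the filesystem, e.g. for ext4:
resize2fs /dev/md0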
* Re: [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? 2013-06-20 19:10 [gentoo-amd64] Is my RAID performance bad possibly due to starting sector value? Mark Knecht ` (3 preceding siblings ...) 2013-06-22 12:49 ` [gentoo-amd64] " B Vance @ 2013-06-23 11:31 ` thegeezer 4 siblings, 0 replies; 46+ messages in thread From: thegeezer @ 2013-06-23 11:31 UTC (permalink / raw To: gentoo-amd64; +Cc: Mark Knecht

[-- Attachment #1: Type: text/plain, Size: 3887 bytes --]

Howdy,

My own 2c on the issue is to suggest LVM. It looks at things in a slightly different way and lets me treat all my disks as one large volume I can carve up.

It supports multi-way mirroring, so I can choose to create a volume for all my pictures which is on at least 3 drives.
It supports volume striping (RAID0), so I can put swap and scratch files there.
It does support other RAID levels, but I can't find where the scrub option is.
It supports volume concatenation, so I can just keep growing my MythTV recordings volume by adding another disk.
It supports encrypted volumes, so I can put all my guarded stuff in there.
It supports (with some magic) nested volumes, so I can have an encrypted volume sitting inside a mirrored volume so my secrets are protected.

I can partition my drives in 3 parts, so that I can create a volume group of fast, medium and slow based on where on the disk the partition is (start track ~150MB/sec, end track ~60MB/sec; numbers sort of remembered, sort of made up). I can have a bunch of disks for long-term storage and hdparm can spin them down all the time. Live movement, even of a root volume, also means that I can keep moving storage to the storage drives, or decide to use a fast disk as a storage disk and have that spin down too.

I think the crucial aspect is to also consider what you wish to put on the drives. If it is just pr0n, do you really care if it gets lost? If it is just scratch areas that need to be fast, ditto. Where the different RAIDs are good is the use of parity, so you don't lose half of your potential storage size as you would with a mirror. Bit rot is real; all it takes is a single misaligned charged particle from that nuclear furnace in the sky to knock a single bit out of magnetic alignment, so it will require regular scrubbing, maybe in a cron job.
https://wiki.archlinux.org/index.php/Software_RAID_and_LVM#Data_scrubbing

Specifically on the bandwidth issue, I'd suggest:
1. take all the drives out of RAID if you can and run a benchmark against them individually; I like the benchmark tool in palimpsest, but that's me.
2. concurrently run dd if=/dev/zero of=/dev/sdX on all drives and see how it compares to the individual scores; this will show you the computer mainboard/chipset effect.
3. you might find https://raid.wiki.kernel.org/index.php/RAID_setup#Calculation a good starting point for calculating strides and stripes, and http://forums.gentoo.org/viewtopic-t-942794-start-0.html shows the benefit of adjusting the numbers.

hope this helps!

On 06/20/2013 08:10 PM, Mark Knecht wrote:
<SNIP>
[-- Attachment #2: Type: text/html, Size: 5389 bytes --] ^ permalink raw reply [flat|nested] 46+ messages in thread
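As a rough illustration of the sort of LVM layout suggested above, a minimal sketch follows; the volume group name, sizes and device names are all invented for the example. Growing any of these volumes later is an lvextend plus the matching filesystem resize:

# pool four disks (or partitions) into one volume group
pvcreate /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
vgcreate bigvg /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1
# a volume kept as three mirrored copies for the pictures (-m 2 = two extra copies)
lvcreate -m 2 -L 200G -n pictures bigvg
# a striped (RAID0-style) volume for swap and scratch space
lvcreate -i 4 -I 64 -L 32G -n scratch bigvg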