* [gentoo-user] OT: btrfs raid 5/6
@ 2017-11-27 22:30 Bill Kenworthy
  2017-12-01 15:59 ` J. Roeleveld
  2017-12-01 16:58 ` Wols Lists
  0 siblings, 2 replies; 35+ messages in thread
From: Bill Kenworthy @ 2017-11-27 22:30 UTC (permalink / raw)
To: gentoo-user

Hi all,

I need to expand two bcache-fronted 4xdisk btrfs raid 10's. This requires
purchasing 4 drives (and one system does not have room for two more drives),
so I am trying to see if using raid 5 is an option.

I have been trying to find out whether btrfs raid 5/6 is stable enough to use,
but while there is mention of improvements in kernel 4.12 and fixes for the
write-hole problem, I can't see any reports that it's "working fine now",
though there is a Phoronix article saying Oracle has been using it since the
fixes.

Is anyone here successfully using btrfs raid 5/6? What is the status of scrub
and self-healing? The btrfs wiki is woefully out of date :(

BillK

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-11-27 22:30 [gentoo-user] OT: btrfs raid 5/6 Bill Kenworthy @ 2017-12-01 15:59 ` J. Roeleveld 2017-12-01 16:58 ` Wols Lists 1 sibling, 0 replies; 35+ messages in thread From: J. Roeleveld @ 2017-12-01 15:59 UTC (permalink / raw To: gentoo-user On Monday, November 27, 2017 11:30:13 PM CET Bill Kenworthy wrote: > Hi all, > I need to expand two bcache fronted 4xdisk btrfs raid 10's - this > requires purchasing 4 drives (and one system does not have room for two > more drives) so I am trying to see if using raid 5 is an option > > I have been trying to find if btrfs raid 5/6 is stable enough to use but > while there is mention of improvements in kernel 4.12, and fixes for the > write hole problem I cant see any reports that its "working fine now" > though there is a phoronix article saying Oracle is using it since the > fixes. > > Is anyone here successfully using btrfs raid 5/6? What is the status of > scrub and self healing? The btrfs wiki is woefully out of date :( > > BillK I have not seen any indication that BTRFS raid 5/6/.. is usable. Last status I heard: No scrub, no rebuild when disk failed, ... It should work as long as all disks stay functioning, but then I wonder why bother with anything more advanced than raid-0 ? It's the lack of progress with regards to proper "raid" support in BTRFS which made me stop considering it and simply went with ZFS. -- Joost ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-11-27 22:30 [gentoo-user] OT: btrfs raid 5/6 Bill Kenworthy 2017-12-01 15:59 ` J. Roeleveld @ 2017-12-01 16:58 ` Wols Lists 2017-12-01 17:14 ` Rich Freeman 1 sibling, 1 reply; 35+ messages in thread From: Wols Lists @ 2017-12-01 16:58 UTC (permalink / raw To: gentoo-user On 27/11/17 22:30, Bill Kenworthy wrote: > Hi all, > I need to expand two bcache fronted 4xdisk btrfs raid 10's - this > requires purchasing 4 drives (and one system does not have room for two > more drives) so I am trying to see if using raid 5 is an option > > I have been trying to find if btrfs raid 5/6 is stable enough to use but > while there is mention of improvements in kernel 4.12, and fixes for the > write hole problem I cant see any reports that its "working fine now" > though there is a phoronix article saying Oracle is using it since the > fixes. > > Is anyone here successfully using btrfs raid 5/6? What is the status of > scrub and self healing? The btrfs wiki is woefully out of date :( > Or put btrfs over md-raid? Thing is, with raid-6 over four drives, you have a 100% certainty of surviving a two-disk failure. With raid-10 you have a 33% chance of losing your array. Cheers, Wol ^ permalink raw reply [flat|nested] 35+ messages in thread
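To put numbers on the 33% figure above: with four disks there are six possible
two-disk failures, and in a RAID-10 built as two mirrored pairs exactly two of
them take out both halves of the same mirror, while RAID-6 over the same four
disks survives all six by construction. A small Python sketch (which assumes
that two-pair layout) enumerates the cases:

    # Enumerate two-disk failures for a 4-disk RAID-10 built from two
    # mirrored pairs (an assumed layout): the array is lost only when both
    # members of one pair fail.
    from itertools import combinations

    pairs = [{0, 1}, {2, 3}]
    cases = list(combinations(range(4), 2))
    fatal = [c for c in cases if set(c) in pairs]
    print(f"{len(fatal)} of {len(cases)} two-disk failures are fatal "
          f"({100 * len(fatal) // len(cases)}%)")    # -> 2 of 6 (33%)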
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-01 16:58 ` Wols Lists @ 2017-12-01 17:14 ` Rich Freeman 2017-12-01 17:24 ` Wols Lists 2017-12-06 23:28 ` Frank Steinmetzger 0 siblings, 2 replies; 35+ messages in thread From: Rich Freeman @ 2017-12-01 17:14 UTC (permalink / raw To: gentoo-user On Fri, Dec 1, 2017 at 11:58 AM, Wols Lists <antlists@youngman.org.uk> wrote: > On 27/11/17 22:30, Bill Kenworthy wrote: >> Hi all, >> I need to expand two bcache fronted 4xdisk btrfs raid 10's - this >> requires purchasing 4 drives (and one system does not have room for two >> more drives) so I am trying to see if using raid 5 is an option >> >> I have been trying to find if btrfs raid 5/6 is stable enough to use but >> while there is mention of improvements in kernel 4.12, and fixes for the >> write hole problem I cant see any reports that its "working fine now" >> though there is a phoronix article saying Oracle is using it since the >> fixes. >> >> Is anyone here successfully using btrfs raid 5/6? What is the status of >> scrub and self healing? The btrfs wiki is woefully out of date :( >> > Or put btrfs over md-raid? > > Thing is, with raid-6 over four drives, you have a 100% certainty of > surviving a two-disk failure. With raid-10 you have a 33% chance of > losing your array. > I tend to be a fan of parity raid in general for these reasons. I'm not sure the performance gains with raid-10 are enough to warrant the waste of space. With btrfs though I don't really see the point of "Raid-10" vs just a pile of individual disks in raid1 mode. Btrfs will do a so-so job of balancing the IO across them already (they haven't really bothered to optimize this yet). I've moved away from btrfs entirely until they sort things out. However, I would not use btrfs for raid-5/6 under any circumstances. That has NEVER been stable, and if anything has gone backwards. I'm sure they'll sort it out sometime, but I have no idea when. RAID-1 on btrfs is reasonably stable, but I've still had it run into issues (nothing that kept me from reading the data off the array, but I've had various issues with it, and when I finally moved it to ZFS it was in a state where I couldn't run it in anything other than degraded mode). You could run btrfs over md-raid, but other than the snapshots I think this loses a lot of the benefit of btrfs in the first place. You are vulnerable to the write hole, the ability of btrfs to recover data with soft errors is compromised (though you can detect it still), and you're potentially faced with more read-write-read cycles when raid stripes are modified. Both zfs and btrfs were really designed to work best on raw block devices without any layers below. They still work of course, but you don't get some of those optimizations since they don't have visibility into what is happening at the disk level. -- Rich ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-01 17:14 ` Rich Freeman @ 2017-12-01 17:24 ` Wols Lists 2017-12-06 23:28 ` Frank Steinmetzger 1 sibling, 0 replies; 35+ messages in thread From: Wols Lists @ 2017-12-01 17:24 UTC (permalink / raw To: gentoo-user On 01/12/17 17:14, Rich Freeman wrote: > You could run btrfs over md-raid, but other than the snapshots I think > this loses a lot of the benefit of btrfs in the first place. You are > vulnerable to the write hole, The write hole is now "fixed". In quotes because, although journalling has now been merged and is available, there still seem to be a few corner case (and not so corner case) bugs that need ironing out before it's solid. Cheers, Wol ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6
  2017-12-01 17:14 ` Rich Freeman
  2017-12-01 17:24 ` Wols Lists
@ 2017-12-06 23:28 ` Frank Steinmetzger
  2017-12-06 23:35 ` Rich Freeman
  1 sibling, 1 reply; 35+ messages in thread
From: Frank Steinmetzger @ 2017-12-06 23:28 UTC (permalink / raw)
To: gentoo-user

On Fri, Dec 01, 2017 at 12:14:12PM -0500, Rich Freeman wrote:
> On Fri, Dec 1, 2017 at 11:58 AM, Wols Lists <antlists@youngman.org.uk> wrote:
> > On 27/11/17 22:30, Bill Kenworthy wrote:
> >> […]
> >> Is anyone here successfully using btrfs raid 5/6? What is the status of
> >> scrub and self healing? The btrfs wiki is woefully out of date :(
> >>
> > […]
> > Thing is, with raid-6 over four drives, you have a 100% certainty of
> > surviving a two-disk failure. With raid-10 you have a 33% chance of
> > losing your array.
> >
> […]
> I tend to be a fan of parity raid in general for these reasons. I'm
> not sure the performance gains with raid-10 are enough to warrant the
> waste of space.
> […]
> and when I finally moved it to ZFS
> […]

I am about to upgrade my Gentoo-NAS from 2× to 4×6 TB WD Red (non-pro). The
current setup is a ZFS mirror. I had been holding off the purchase for months,
all the while pondering which RAID scheme to use. First it was raidz1 due to
space (I only have four bays), but I eventually discarded it due to reduced
resilience. Which brought me to raidz2 (any 2 drives may fail). But then I
came across that famous post by a developer on “You should always use mirrors
unless you are really really sure what you’re doing”. The main points were
higher strain on the entire array during resilvering (all drives need to read
everything instead of just one drive) and easier maintainability of a mirror
set (e.g. faster and easier upgrade).

I don’t really care about performance. It’s a simple media archive powered by
the cheapest Haswell Celeron I could get (with 16 Gigs of ECC RAM though ^^).
Sorry if I more or less stole the thread, but this is almost the same topic.
I could use a nudge in either direction. My workplace’s storage comprises
many 2× mirrors, but I am not a company and I am capped at four bays.

So, do you have any input for me before I fetch the dice?

-- 
Gruß | Greetings | Qapla’
Please do not share anything from, with or about me on any social network.

All PCs are compatible. Some are just more compatible than others.

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-06 23:28 ` Frank Steinmetzger @ 2017-12-06 23:35 ` Rich Freeman 2017-12-07 0:13 ` Frank Steinmetzger 2017-12-07 7:54 ` Richard Bradfield 0 siblings, 2 replies; 35+ messages in thread From: Rich Freeman @ 2017-12-06 23:35 UTC (permalink / raw To: gentoo-user On Wed, Dec 6, 2017 at 6:28 PM, Frank Steinmetzger <Warp_7@gmx.de> wrote: > > I don’t really care about performance. It’s a simple media archive powered > by the cheapest Haswell Celeron I could get (with 16 Gigs of ECC RAM though > ^^). Sorry if I more or less stole the thread, but this is almost the same > topic. I could use a nudge in either direction. My workplace’s storage > comprises many 2× mirrors, but I am not a company and I am capped at four > bays. > > So, Do you have any input for me before I fetch the dice? > IMO the cost savings for parity RAID trumps everything unless money just isn't a factor. Now, with ZFS it is frustrating because arrays are relatively inflexible when it comes to expansion, though that applies to all types of arrays. That is one major advantage of btrfs (and mdadm) over zfs. I hear they're working on that, but in general there are a lot of things in zfs that are more static compared to btrfs. -- Rich ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-06 23:35 ` Rich Freeman @ 2017-12-07 0:13 ` Frank Steinmetzger 2017-12-07 0:29 ` Rich Freeman 2017-12-07 7:54 ` Richard Bradfield 1 sibling, 1 reply; 35+ messages in thread From: Frank Steinmetzger @ 2017-12-07 0:13 UTC (permalink / raw To: gentoo-user [-- Attachment #1: Type: text/plain, Size: 1042 bytes --] On Wed, Dec 06, 2017 at 06:35:10PM -0500, Rich Freeman wrote: > On Wed, Dec 6, 2017 at 6:28 PM, Frank Steinmetzger <Warp_7@gmx.de> wrote: > > > > I don’t really care about performance. It’s a simple media archive powered > > by the cheapest Haswell Celeron I could get (with 16 Gigs of ECC RAM though > > ^^). Sorry if I more or less stole the thread, but this is almost the same > > topic. I could use a nudge in either direction. My workplace’s storage > > comprises many 2× mirrors, but I am not a company and I am capped at four > > bays. > > > > So, Do you have any input for me before I fetch the dice? > > > > IMO the cost savings for parity RAID trumps everything unless money > just isn't a factor. Cost saving compared to what? In my four-bay-scenario, mirror and raidz2 yield the same available space (I hope so). -- Gruß | Greetings | Qapla’ Please do not share anything from, with or about me on any social network. Advanced mathematics have the advantage that you can err more accurately. [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-07 0:13 ` Frank Steinmetzger @ 2017-12-07 0:29 ` Rich Freeman 2017-12-07 21:37 ` Frank Steinmetzger 0 siblings, 1 reply; 35+ messages in thread From: Rich Freeman @ 2017-12-07 0:29 UTC (permalink / raw To: gentoo-user On Wed, Dec 6, 2017 at 7:13 PM, Frank Steinmetzger <Warp_7@gmx.de> wrote: > On Wed, Dec 06, 2017 at 06:35:10PM -0500, Rich Freeman wrote: >> >> IMO the cost savings for parity RAID trumps everything unless money >> just isn't a factor. > > Cost saving compared to what? In my four-bay-scenario, mirror and raidz2 > yield the same available space (I hope so). > Sure, if you only have 4 drives and run raid6/z2 then it is no more efficient than mirroring. That said, it does provide more security because raidz2 can tolerate the failure of any two disks, while 2xraid1 or raid10 can tolerate only half of the combinations of two disks. The increased efficiency of parity raid comes as you scale up. They're equal at 4 disks. If you had 6 disks then raid6 holds 33% more. If you have 8 then it holds 50% more. That and it takes away the chance factor when you lose two disks. If you're really unlucky with 4xraid1 the loss of two disks could result in the loss of 25% of your data, while with an 8-disk raid6 the loss of two disks will never result in the loss of any data. (Granted, a 4xraid1 could tolerate the loss of 4 drives if you're very lucky - the luck factor is being eliminated and that cuts both ways.) If I had only 4 drives I probably wouldn't use raidz2. I might use raid5/raidz1, or two mirrors. With mdadm I'd probably use raid5 knowing that I can easily reshape the array if I want to expand it further. -- Rich ^ permalink raw reply [flat|nested] 35+ messages in thread
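The capacity comparison above is easy to tabulate. A short sketch, assuming
equal-sized drives, double parity for raid6/raidz2, and plain mirrored pairs:

    # Usable disks for double-parity RAID versus mirrored pairs.
    for n in (4, 6, 8, 10):
        parity_raid = n - 2     # raid6 / raidz2: two disks' worth of parity
        mirrors     = n // 2    # every disk has a twin
        extra = 100 * (parity_raid - mirrors) / mirrors
        print(f"{n} disks: raid6 {parity_raid}, mirrors {mirrors}, "
              f"raid6 holds {extra:.0f}% more")

This reproduces the figures quoted above: equal at 4 disks, 33% more usable
space at 6 disks, 50% more at 8.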
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-07 0:29 ` Rich Freeman @ 2017-12-07 21:37 ` Frank Steinmetzger 2017-12-07 21:49 ` Wols Lists 0 siblings, 1 reply; 35+ messages in thread From: Frank Steinmetzger @ 2017-12-07 21:37 UTC (permalink / raw To: gentoo-user [-- Attachment #1: Type: text/plain, Size: 1598 bytes --] On Wed, Dec 06, 2017 at 07:29:08PM -0500, Rich Freeman wrote: > On Wed, Dec 6, 2017 at 7:13 PM, Frank Steinmetzger <Warp_7@gmx.de> wrote: > > On Wed, Dec 06, 2017 at 06:35:10PM -0500, Rich Freeman wrote: > >> > >> IMO the cost savings for parity RAID trumps everything unless money > >> just isn't a factor. > > > > Cost saving compared to what? In my four-bay-scenario, mirror and raidz2 > > yield the same available space (I hope so). > > > > Sure, if you only have 4 drives and run raid6/z2 then it is no more > efficient than mirroring. That said, it does provide more security > because raidz2 can tolerate the failure of any two disks, while > 2xraid1 or raid10 can tolerate only half of the combinations of two > disks. Ooooh, I just came up with another good reason for raidz over mirror: I don't encrypt my drives because it doesn't hold sensitive stuff. (AFAIK native ZFS encryption is available in Oracle ZFS, so it might eventually come to the Linux world). So in case I ever need to send in a drive for repair/replacement, noone can read from it (or only in tiny bits'n'pieces from a hexdump), because each disk contains a mix of data and parity blocks. I think I'm finally sold. :) And with that, good night. -- Gruß | Greetings | Qapla’ Please do not share anything from, with or about me on any social network. “I think Leopard is a much better system [than Windows Vista] … but OS X in some ways is actually worse than Windows to program for. Their file system is complete and utter crap, which is scary.” – Linus Torvalds [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6
  2017-12-07 21:37 ` Frank Steinmetzger
@ 2017-12-07 21:49 ` Wols Lists
  2017-12-07 22:35   ` Frank Steinmetzger
  2017-12-11 23:20   ` Frank Steinmetzger
  0 siblings, 2 replies; 35+ messages in thread
From: Wols Lists @ 2017-12-07 21:49 UTC (permalink / raw)
To: gentoo-user

On 07/12/17 21:37, Frank Steinmetzger wrote:
> Ooooh, I just came up with another good reason for raidz over mirror:
> I don't encrypt my drives because it doesn't hold sensitive stuff. (AFAIK
> native ZFS encryption is available in Oracle ZFS, so it might eventually
> come to the Linux world).
>
> So in case I ever need to send in a drive for repair/replacement, noone can
> read from it (or only in tiny bits'n'pieces from a hexdump), because each
> disk contains a mix of data and parity blocks.
>
> I think I'm finally sold. :)
> And with that, good night.

So you've never heard of LUKS?

GPT
LUKS
MD-RAID
Filesystem

Simple stack so if you ever have to pull a disk, just delete the LUKS key
from it and everything from that disk is now random garbage.

(Oh - and md raid-5/6 also mix data and parity, so the same holds true
there.)

Cheers,
Wol

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-07 21:49 ` Wols Lists @ 2017-12-07 22:35 ` Frank Steinmetzger 2017-12-07 23:48 ` Wols Lists 2017-12-11 23:20 ` Frank Steinmetzger 1 sibling, 1 reply; 35+ messages in thread From: Frank Steinmetzger @ 2017-12-07 22:35 UTC (permalink / raw To: gentoo-user [-- Attachment #1: Type: text/plain, Size: 1291 bytes --] On Thu, Dec 07, 2017 at 09:49:29PM +0000, Wols Lists wrote: > > So in case I ever need to send in a drive for repair/replacement, noone can > > read from it (or only in tiny bits'n'pieces from a hexdump), because each > > disk contains a mix of data and parity blocks. > > > > I think I'm finally sold. :) > > And with that, good night. > > So you've never heard of LUKS? Sure thing, my laptop’s whole SSD is LUKSed and so are all my other home and backup partitions. But encrypting ZFS is different, because every disk needs to be encrypted separately since there is no separation between the FS and the underlying block device. This will result in a big computational overhead, choking my poor Celeron. When I benchmarked reading from a single LUKS container in a ramdisk, it managed around 160 MB/s IIRC. I might give it a try over the weekend before I migrate my data, but I’m not expecting miracles. Should have bought an i3 for that. > (Oh - and md raid-5/6 also mix data and parity, so the same holds true > there.) Ok, wasn’t aware of that. I thought I read in a ZFS article that this were a special thing. -- Gruß | Greetings | Qapla’ Please do not share anything from, with or about me on any social network. This is no signature. [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6
  2017-12-07 22:35 ` Frank Steinmetzger
@ 2017-12-07 23:48 ` Wols Lists
  2017-12-09 16:58 ` J. Roeleveld
  0 siblings, 1 reply; 35+ messages in thread
From: Wols Lists @ 2017-12-07 23:48 UTC (permalink / raw)
To: gentoo-user

On 07/12/17 22:35, Frank Steinmetzger wrote:
>> (Oh - and md raid-5/6 also mix data and parity, so the same holds true
>> > there.)
> Ok, wasn’t aware of that. I thought I read in a ZFS article that this were a
> special thing.

Say you've got a four-drive raid-6, it'll be something like

data1    data2    parity1  parity2
data3    parity3  parity4  data4
parity5  parity6  data5    data6

The only thing to watch out for (and zfs is likely the same) is that if a
file fits inside a single chunk it will be recoverable from a single drive.
And I think chunks can be anything up to 64MB.

Cheers,
Wol

^ permalink raw reply [flat|nested] 35+ messages in thread
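The rotation shown in that little table can be generated programmatically.
The toy sketch below merely reproduces the example layout (real md layouts
such as left-symmetric differ in the exact rotation order, so treat it as
illustration only):

    # Print stripes of a 4-drive double-parity layout where the two parity
    # blocks shift one disk to the left on every stripe, so each drive ends
    # up holding a mix of data and parity.
    def print_stripes(nstripes=3, ndisks=4):
        d = p = 1
        for s in range(nstripes):
            pcol = (ndisks - 2 - s) % ndisks   # first parity column
            qcol = (pcol + 1) % ndisks         # second parity column
            row = []
            for col in range(ndisks):
                if col in (pcol, qcol):
                    row.append(f"parity{p}"); p += 1
                else:
                    row.append(f"data{d}"); d += 1
            print("  ".join(f"{cell:<8}" for cell in row))

    print_stripes()

The output matches the three rows above, and the point about every disk
carrying both data and parity is visible directly.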
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-07 23:48 ` Wols Lists @ 2017-12-09 16:58 ` J. Roeleveld 2017-12-09 18:28 ` Wols Lists 0 siblings, 1 reply; 35+ messages in thread From: J. Roeleveld @ 2017-12-09 16:58 UTC (permalink / raw To: gentoo-user On Friday, December 8, 2017 12:48:45 AM CET Wols Lists wrote: > On 07/12/17 22:35, Frank Steinmetzger wrote: > >> (Oh - and md raid-5/6 also mix data and parity, so the same holds true > >> > >> > there.) > > > > Ok, wasn’t aware of that. I thought I read in a ZFS article that this were > > a special thing. > > Say you've got a four-drive raid-6, it'll be something like > > data1 data2 parity1 parity2 > data3 parity3 parity4 data4 > parity5 parity6 data5 data6 > > The only thing to watch out for (and zfs is likely the same) if a file > fits inside a single chunk it will be recoverable from a single drive. > And I think chunks can be anything up to 64MB. Except that ZFS doesn't have fixed on-disk-chunk-sizes. (especially if you use compression) See: https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz -- Joost ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-09 16:58 ` J. Roeleveld @ 2017-12-09 18:28 ` Wols Lists 2017-12-09 23:36 ` Rich Freeman 0 siblings, 1 reply; 35+ messages in thread From: Wols Lists @ 2017-12-09 18:28 UTC (permalink / raw To: gentoo-user On 09/12/17 16:58, J. Roeleveld wrote: > On Friday, December 8, 2017 12:48:45 AM CET Wols Lists wrote: >> On 07/12/17 22:35, Frank Steinmetzger wrote: >>>> (Oh - and md raid-5/6 also mix data and parity, so the same holds true >>>> >>>>> there.) >>> >>> Ok, wasn’t aware of that. I thought I read in a ZFS article that this were >>> a special thing. >> >> Say you've got a four-drive raid-6, it'll be something like >> >> data1 data2 parity1 parity2 >> data3 parity3 parity4 data4 >> parity5 parity6 data5 data6 >> >> The only thing to watch out for (and zfs is likely the same) if a file >> fits inside a single chunk it will be recoverable from a single drive. >> And I think chunks can be anything up to 64MB. > > Except that ZFS doesn't have fixed on-disk-chunk-sizes. (especially if you use > compression) > > See: > https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz > Which explains nothing, sorry ... :-( It goes on about 4K or 8K database blocks (and I'm talking about 64 MEG chunk sizes). And the OP was talking about files being recoverable from a disk that was removed from an array. Are you telling me that a *small* file has bits of it scattered across multiple drives? That would be *crazy*. If I have a file of, say, 10MB, and write it to an md-raid array, there is a good chance it will fit inside a single chunk, and be written - *whole* - to a single disk. With parity on another disk. How big does a file have to be on ZFS before it is too big to fit in a typical chunk, so that it gets split up across multiple drives? THAT is what I was on about, and that is what concerned the OP. I was just warning the OP that a chunk typically is rather more than just one disk block, so anybody harking back to the days of 512byte sectors could get a nasty surprise ... Cheers, Wol Cheers, Wol ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-09 18:28 ` Wols Lists @ 2017-12-09 23:36 ` Rich Freeman 2017-12-10 9:45 ` Wols Lists 0 siblings, 1 reply; 35+ messages in thread From: Rich Freeman @ 2017-12-09 23:36 UTC (permalink / raw To: gentoo-user On Sat, Dec 9, 2017 at 1:28 PM, Wols Lists <antlists@youngman.org.uk> wrote: > On 09/12/17 16:58, J. Roeleveld wrote: >> On Friday, December 8, 2017 12:48:45 AM CET Wols Lists wrote: >>> On 07/12/17 22:35, Frank Steinmetzger wrote: >>>>> (Oh - and md raid-5/6 also mix data and parity, so the same holds true >>>>> >>>>>> there.) >>>> >>>> Ok, wasn’t aware of that. I thought I read in a ZFS article that this were >>>> a special thing. >>> >>> Say you've got a four-drive raid-6, it'll be something like >>> >>> data1 data2 parity1 parity2 >>> data3 parity3 parity4 data4 >>> parity5 parity6 data5 data6 >>> >>> The only thing to watch out for (and zfs is likely the same) if a file >>> fits inside a single chunk it will be recoverable from a single drive. >>> And I think chunks can be anything up to 64MB. >> >> Except that ZFS doesn't have fixed on-disk-chunk-sizes. (especially if you use >> compression) >> >> See: >> https://www.delphix.com/blog/delphix-engineering/zfs-raidz-stripe-width-or-how-i-learned-stop-worrying-and-love-raidz >> > Which explains nothing, sorry ... :-( > > It goes on about 4K or 8K database blocks (and I'm talking about 64 MEG > chunk sizes). And the OP was talking about files being recoverable from > a disk that was removed from an array. Are you telling me that a *small* > file has bits of it scattered across multiple drives? That would be *crazy*. I'm not sure why it would be "crazy." Granted, most parity RAID systems seem to operate just as you describe, but I don't see why with Reed Solomon you couldn't store ONLY parity data on all the drives. All that matters is that you generate enough to recover the data - the original data contains no more information than an equivalent number of Reed-Solomon sets. Of course, with the original data I imagine you need to do less computation assuming you aren't bothering to check its integrity against the parity data. In case my point is clear a RAID would work perfectly fine if you had 5 drives with the capacity to store 4 drives wort of data, but instead of storing the original data across 4 drives and having 1 of parity, you instead compute 5 sets of parity so that now you have 9 sets of data that can tolerate the loss of any 5, then throw away the sets containing the original 4 sets of data and store the remaining 5 sets of parity data across the 5 drives. You can still tolerate the loss of one more set, but all 4 of the original sets of data have been tossed already. -- Rich ^ permalink raw reply [flat|nested] 35+ messages in thread
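The thought experiment above can be demonstrated with a toy erasure code. The
sketch below is illustration only: real implementations use Reed-Solomon over
GF(2^8) and keep the systematic data for speed, and nothing here reflects how
ZFS or md actually lay data out. It treats four data values as a degree-3
polynomial over a prime field, stores only five "parity" evaluations, and
shows that any four of them recover the originals even though the originals
themselves were never stored:

    # Toy erasure code: 4 data values -> 5 parity-only shares; any 4 shares
    # reconstruct the data (needs Python 3.8+ for pow(x, -1, P)).
    from itertools import combinations

    P = 2**31 - 1                      # prime modulus so division works

    def interp(points, x):
        """Evaluate at x the polynomial through `points` (Lagrange, mod P)."""
        total = 0
        for xi, yi in points:
            num = den = 1
            for xj, _ in points:
                if xj != xi:
                    num = num * (x - xj) % P
                    den = den * (xi - xj) % P
            total = (total + yi * num * pow(den, -1, P)) % P
        return total

    data = [11, 22, 33, 44]                                # the "4 disks" of data
    src = list(enumerate(data))                            # points (0..3, value)
    shares = [(x, interp(src, x)) for x in range(4, 9)]    # parity-only shares

    for kept in combinations(shares, 4):                   # lose any one share
        assert [interp(list(kept), x) for x in range(4)] == data
    print("any 4 of the 5 parity shares rebuild", data)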
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-09 23:36 ` Rich Freeman @ 2017-12-10 9:45 ` Wols Lists 2017-12-10 15:07 ` Rich Freeman 0 siblings, 1 reply; 35+ messages in thread From: Wols Lists @ 2017-12-10 9:45 UTC (permalink / raw To: gentoo-user On 09/12/17 23:36, Rich Freeman wrote: > you instead compute 5 sets of parity so that now you have 9 sets of > data that can tolerate the loss of any 5, then throw away the sets > containing the original 4 sets of data and store the remaining 5 sets > of parity data across the 5 drives. You can still tolerate the loss > of one more set, but all 4 of the original sets of data have been > tossed already. Is that how ZFS works? Cheers, Wol ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-10 9:45 ` Wols Lists @ 2017-12-10 15:07 ` Rich Freeman 2017-12-10 21:00 ` Wols Lists 0 siblings, 1 reply; 35+ messages in thread From: Rich Freeman @ 2017-12-10 15:07 UTC (permalink / raw To: gentoo-user On Sun, Dec 10, 2017 at 4:45 AM, Wols Lists <antlists@youngman.org.uk> wrote: > On 09/12/17 23:36, Rich Freeman wrote: >> you instead compute 5 sets of parity so that now you have 9 sets of >> data that can tolerate the loss of any 5, then throw away the sets >> containing the original 4 sets of data and store the remaining 5 sets >> of parity data across the 5 drives. You can still tolerate the loss >> of one more set, but all 4 of the original sets of data have been >> tossed already. > > Is that how ZFS works? > I doubt it, hence why I wrote "most parity RAID systems seem to operate just as you describe." -- Rich ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-10 15:07 ` Rich Freeman @ 2017-12-10 21:00 ` Wols Lists 2017-12-11 1:33 ` Rich Freeman 0 siblings, 1 reply; 35+ messages in thread From: Wols Lists @ 2017-12-10 21:00 UTC (permalink / raw To: gentoo-user On 10/12/17 15:07, Rich Freeman wrote: >> > Is that how ZFS works? >> > > I doubt it, hence why I wrote "most parity RAID systems seem to > operate just as you describe." So the OP needs to be aware that, if his file is smaller than the chunk size, then it *will* be recoverable from a disk pulled from an array, be it md-raid or zfs. The question is, then, how big is a chunk? And if zfs is anything like md-raid, it will be a lot bigger than the 512B or 4KB that a naive user would think. Cheers, Wol ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-10 21:00 ` Wols Lists @ 2017-12-11 1:33 ` Rich Freeman 0 siblings, 0 replies; 35+ messages in thread From: Rich Freeman @ 2017-12-11 1:33 UTC (permalink / raw To: gentoo-user On Sun, Dec 10, 2017 at 4:00 PM, Wols Lists <antlists@youngman.org.uk> wrote: > > So the OP needs to be aware that, if his file is smaller than the chunk > size, then it *will* be recoverable from a disk pulled from an array, be > it md-raid or zfs. > > The question is, then, how big is a chunk? And if zfs is anything like > md-raid, it will be a lot bigger than the 512B or 4KB that a naive user > would think. > I suspect the data is striped/chunked/etc at a larger scale. However, I'd really go a step further. Unless a filesystem or block layer is explicitly designed to prevent the retrieval of data without a key/etc, then I would not rely on something like this for security. Even actual encryption systems can have bugs that render them vulnerable. Something that at best provides this kind of security "by accident" is not something you should rely on. Data might be stored in journals, or metadata, or unwiped free space, or in any number of ways that makes it possible to retrieve even if it isn't obvious from casual inspection. If you don't want somebody recovering data from a drive you're disposing of, then you should probably be encrypting that drive one way or another with a robust encryption layer. That might be built into the filesystem, or it might be a block layer. If you're desperate I guess you could use the SMART security features provided by your drive firmware, which probably work, but which nobody can really vouch for but the drive manufacturer. Any of these are going to provide more security that relying on RAID striping to make data irretrievable. If you really care about security, then you're going to be paranoid about the tools that actually are designed to be secure, let alone the ones that aren't. -- Rich ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6
  2017-12-07 21:49 ` Wols Lists
  2017-12-07 22:35   ` Frank Steinmetzger
@ 2017-12-11 23:20   ` Frank Steinmetzger
  2017-12-12 10:15     ` Neil Bothwick
  1 sibling, 1 reply; 35+ messages in thread
From: Frank Steinmetzger @ 2017-12-11 23:20 UTC (permalink / raw)
To: gentoo-user

On Thu, Dec 07, 2017 at 09:49:29PM +0000, Wols Lists wrote:
> On 07/12/17 21:37, Frank Steinmetzger wrote:
> > Ooooh, I just came up with another good reason for raidz over mirror:
> > I don't encrypt my drives because it doesn't hold sensitive stuff. (AFAIK
> > native ZFS encryption is available in Oracle ZFS, so it might eventually
> > come to the Linux world).
> >
> > So in case I ever need to send in a drive for repair/replacement, noone can
> > read from it (or only in tiny bits'n'pieces from a hexdump), because each
> > disk contains a mix of data and parity blocks.
> >
> > I think I'm finally sold. :)
> > And with that, good night.
>
> So you've never heard of LUKS?
>
> GPT
> LUKS
> MD-RAID
> Filesystem

My new drives are finally here. One of them turned out to be an OEM. -_-
The shop says it will cover any warranty claims and it’s not a backyard
seller either, so methinks I’ll keep it.

To evaluate LUKS, I created the following setup (I just love ASCII-painting
in vim ^^):

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃                              tmpfs                              ┃
┃ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┃
┃ │  1 GB file  │ │  1 GB file  │ │  1 GB file  │ │  1 GB file  │ ┃
┃ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ ┃
┃        V               V               V               V        ┃
┃ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┃
┃ │ LUKS device │ │ LUKS device │ │ LUKS device │ │ LUKS device │ ┃
┃ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ ┃
┃        V               V               V               V        ┃
┃ ┌─────────────────────────────────────────────────────────────┐ ┃
┃ │                            RaidZ2                           │ ┃
┃ └─────────────────────────────────────────────────────────────┘ ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛

While dd'ing a 1500 MB file from and to the pool, my NAS Celeron achieved
(with the given number of vdevs out of all 4 being encrypted):

             non-encrypted   2 encrypted   4 encrypted
────────────────────────────────────────────────────
read           1600 MB/s       465 MB/s      290 MB/s
write          ~600 MB/s      ~200 MB/s     ~135 MB/s
scrub time     10 s (~ 100 MB/s)

So performance would be juuuust enough to satisfy GBE. I wonder though how
long a real scrub/resilver would take. The last scrub of my mirror, which
has 3.8 TB allocated, took 9¼ hours. Once the z2 pool is created and the
data migrated, I *will* have to do a resilver in any case, because I only
have four drives and they will all go into the pool, but two of them
currently make up the mirror.

I see myself buying an i3 before too long. Talk about first-world problems.

-- 
Gruß | Greetings | Qapla’
Please do not share anything from, with or about me on any social network.

When you are fine, don’t worry. It will pass.

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-11 23:20 ` Frank Steinmetzger @ 2017-12-12 10:15 ` Neil Bothwick 2017-12-12 12:18 ` Wols Lists 0 siblings, 1 reply; 35+ messages in thread From: Neil Bothwick @ 2017-12-12 10:15 UTC (permalink / raw To: gentoo-user [-- Attachment #1: Type: text/plain, Size: 2874 bytes --] On Tue, 12 Dec 2017 00:20:48 +0100, Frank Steinmetzger wrote: > My new drives are finally here. One of them turned out to be an OEM. -_- > The shop says it will cover any warranty claims and it’s not a backyard > seller either, so methinks I’ll keep it. > > To evaluate LUKS, I created the following setup (I just love > ASCII-painting in vim ^^): > > ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ > ┃ tmpfs ┃ > ┃ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┃ > ┃ │ 1 GB file │ │ 1 GB file │ │ 1 GB file │ │ 1 GB file │ ┃ > ┃ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ ┃ > ┃ V V V V ┃ > ┃ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┃ > ┃ │ LUKS device │ │ LUKS device │ │ LUKS device │ │ LUKS device │ ┃ > ┃ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ ┃ > ┃ V V V V ┃ > ┃ ┌─────────────────────────────────────────────────────────────┐ ┃ > ┃ │ RaidZ2 │ ┃ > ┃ └─────────────────────────────────────────────────────────────┘ ┃ > ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛ That means every write has to be encrypted 4 times, whereas using encryption in the filesystem means it only has to be done once. I tried setting encrypted BTRFS this way and there was a significant performance hit. I'm seriously considering going back to ZoL now that encryption is on the way. -- Neil Bothwick A printer consists of three main parts: the case, the jammed paper tray and the blinking red light. [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-12 10:15 ` Neil Bothwick @ 2017-12-12 12:18 ` Wols Lists 2017-12-12 13:24 ` Neil Bothwick 0 siblings, 1 reply; 35+ messages in thread From: Wols Lists @ 2017-12-12 12:18 UTC (permalink / raw To: gentoo-user On 12/12/17 10:15, Neil Bothwick wrote: > That means every write has to be encrypted 4 times, whereas using > encryption in the filesystem means it only has to be done once. I tried > setting encrypted BTRFS this way and there was a significant performance > hit. I'm seriously considering going back to ZoL now that encryption is > on the way. DISCLAIMER - I DON'T HAVE A CLUE HOW THIS ACTUALLY WORKS IN DETAIL but there's been a fair few posts on LKML sublists about how linux is very inefficient at using hardware encryption. Setup/teardown is expensive, and it only encrypts in small disk-size blocks, so somebody's been trying to make it encrypt in file-system-sized chunks. When/if they get this working, you'll probably notice a speedup of the order of 90% or so ... Cheers, Wol ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-12 12:18 ` Wols Lists @ 2017-12-12 13:24 ` Neil Bothwick 0 siblings, 0 replies; 35+ messages in thread From: Neil Bothwick @ 2017-12-12 13:24 UTC (permalink / raw To: gentoo-user [-- Attachment #1: Type: text/plain, Size: 1053 bytes --] On Tue, 12 Dec 2017 12:18:23 +0000, Wols Lists wrote: > > That means every write has to be encrypted 4 times, whereas using > > encryption in the filesystem means it only has to be done once. I > > tried setting encrypted BTRFS this way and there was a significant > > performance hit. I'm seriously considering going back to ZoL now that > > encryption is on the way. > > DISCLAIMER - I DON'T HAVE A CLUE HOW THIS ACTUALLY WORKS IN DETAIL > > but there's been a fair few posts on LKML sublists about how linux is > very inefficient at using hardware encryption. Setup/teardown is > expensive, and it only encrypts in small disk-size blocks, so somebody's > been trying to make it encrypt in file-system-sized chunks. When/if they > get this working, you'll probably notice a speedup of the order of 90% > or so ... This isn't so much a matter of hardware vs. software encryption, more that encrypting below the RAID level means everything has to be encrypted multiple times. -- Neil Bothwick There's no place like ~ [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-06 23:35 ` Rich Freeman 2017-12-07 0:13 ` Frank Steinmetzger @ 2017-12-07 7:54 ` Richard Bradfield 2017-12-07 9:28 ` Frank Steinmetzger 1 sibling, 1 reply; 35+ messages in thread From: Richard Bradfield @ 2017-12-07 7:54 UTC (permalink / raw To: gentoo-user On Wed, Dec 06, 2017 at 06:35:10PM -0500, Rich Freeman wrote: >On Wed, Dec 6, 2017 at 6:28 PM, Frank Steinmetzger <Warp_7@gmx.de> wrote: >> >> I don’t really care about performance. It’s a simple media archive powered >> by the cheapest Haswell Celeron I could get (with 16 Gigs of ECC RAM though >> ^^). Sorry if I more or less stole the thread, but this is almost the same >> topic. I could use a nudge in either direction. My workplace’s storage >> comprises many 2× mirrors, but I am not a company and I am capped at four >> bays. >> >> So, Do you have any input for me before I fetch the dice? >> > >IMO the cost savings for parity RAID trumps everything unless money >just isn't a factor. > >Now, with ZFS it is frustrating because arrays are relatively >inflexible when it comes to expansion, though that applies to all >types of arrays. That is one major advantage of btrfs (and mdadm) over >zfs. I hear they're working on that, but in general there are a lot >of things in zfs that are more static compared to btrfs. > >-- >Rich > When planning for ZFS pools, at least for home use, it's worth thinking about your usage pattern, and if you'll need to expand the pool before the lifetime of the drives rolls around. I incorporated ZFS' expansion inflexibility into my planned maintenance/servicing budget. I started out with 4x 2TB disks, limited to those 4 bays as you are, but planned to replace those drives after a period of 3-4 years. By the time the first of my drives began to show SMART errors, the price of a 3TB drive had dropped to what I had paid for the 2TB models, so I bought another set and did a rolling upgrade, bringing the pool up to 6TB. I expect I'll do the same thing late next year, I wonder if 4TB will be the sweet spot, or if I might be able to get something larger. -- Richard ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-07 7:54 ` Richard Bradfield @ 2017-12-07 9:28 ` Frank Steinmetzger 2017-12-07 9:52 ` Richard Bradfield 0 siblings, 1 reply; 35+ messages in thread From: Frank Steinmetzger @ 2017-12-07 9:28 UTC (permalink / raw To: gentoo-user On Thu, Dec 07, 2017 at 07:54:41AM +0000, Richard Bradfield wrote: > On Wed, Dec 06, 2017 at 06:35:10PM -0500, Rich Freeman wrote: > >On Wed, Dec 6, 2017 at 6:28 PM, Frank Steinmetzger <Warp_7@gmx.de> wrote: > >> > >>I don’t really care about performance. It’s a simple media archive powered > >>by the cheapest Haswell Celeron I could get (with 16 Gigs of ECC RAM though > >>^^). Sorry if I more or less stole the thread, but this is almost the same > >>topic. I could use a nudge in either direction. My workplace’s storage > >>comprises many 2× mirrors, but I am not a company and I am capped at four > >>bays. > >> > >>So, Do you have any input for me before I fetch the dice? > >> > > > >IMO the cost savings for parity RAID trumps everything unless money > >just isn't a factor. > > > >Now, with ZFS it is frustrating because arrays are relatively > >inflexible when it comes to expansion, though that applies to all > >types of arrays. That is one major advantage of btrfs (and mdadm) over > >zfs. I hear they're working on that, but in general there are a lot > >of things in zfs that are more static compared to btrfs. > > > >-- > >Rich > > > > When planning for ZFS pools, at least for home use, it's worth thinking > about your usage pattern, and if you'll need to expand the pool before > the lifetime of the drives rolls around. When I set the NAS up, I migrated everything from my existing individual external harddrives onto it (the biggest of which was 3 TB). So the main data slurping is over. Going from 6 to 12 TB should be enough™ for a loooong time unless I start buying TV series on DVD for which I don't have physical space. > I incorporated ZFS' expansion inflexibility into my planned > maintenance/servicing budget. What was the conclusion? That having no more free slots meant that you can just as well use the inflexible Raidz, otherwise would have gone with Mirror? > I expect I'll do the same thing late next year, I wonder if 4TB will be > the sweet spot, or if I might be able to get something larger. Me thinks 4 TB was already the sweet spot when I bought my drives a year back (regarding ¤/GiB). Just checked: 6 TB is the cheapest now according to a pricing search engine. Well, the German version anyway[1]. The brits are a bit more picky[2]. [1] https://geizhals.de/?cat=hde7s&xf=10287_NAS~957_Western+Digital&sort=r [2] https://skinflint.co.uk/?cat=hde7s&xf=10287_NAS%7E957_Western+Digital&sort=r -- This message was written using only recycled electrons. ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-07 9:28 ` Frank Steinmetzger @ 2017-12-07 9:52 ` Richard Bradfield 2017-12-07 14:53 ` Frank Steinmetzger 2017-12-07 18:35 ` Wols Lists 0 siblings, 2 replies; 35+ messages in thread From: Richard Bradfield @ 2017-12-07 9:52 UTC (permalink / raw To: gentoo-user On Thu, 7 Dec 2017, at 09:28, Frank Steinmetzger wrote: > > I incorporated ZFS' expansion inflexibility into my planned > > maintenance/servicing budget. > > What was the conclusion? That having no more free slots meant that you > can just as well use the inflexible Raidz, otherwise would have gone with > Mirror? Correct, I had gone back and forth between RaidZ2 and a pair of Mirrors. I needed the space to be extendable, but I calculated my usage growth to be below the rate at which drive prices were falling, so I could budget to replace the current set of drives in 3 years, and that would buy me a set of bigger ones when the time came. I did also investigate USB3 external enclosures, they're pretty fast these days. -- I apologize if my web client has mangled my message. Richard ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-07 9:52 ` Richard Bradfield @ 2017-12-07 14:53 ` Frank Steinmetzger 2017-12-07 15:26 ` Rich Freeman 2017-12-07 20:02 ` Wols Lists 2017-12-07 18:35 ` Wols Lists 1 sibling, 2 replies; 35+ messages in thread From: Frank Steinmetzger @ 2017-12-07 14:53 UTC (permalink / raw To: gentoo-user On Thu, Dec 07, 2017 at 09:52:55AM +0000, Richard Bradfield wrote: > On Thu, 7 Dec 2017, at 09:28, Frank Steinmetzger wrote: > > > I incorporated ZFS' expansion inflexibility into my planned > > > maintenance/servicing budget. > > > > What was the conclusion? That having no more free slots meant that you > > can just as well use the inflexible Raidz, otherwise would have gone with > > Mirror? > > Correct, I had gone back and forth between RaidZ2 and a pair of Mirrors. > I needed the space to be extendable, but I calculated my usage growth > to be below the rate at which drive prices were falling, so I could > budget to replace the current set of drives in 3 years, and that > would buy me a set of bigger ones when the time came. I see. I'm always looking for ways to optimise expenses and cut down on environmental footprint by keeping stuff around until it really breaks. In order to increase capacity, I would have to replace all four drives, whereas with a mirror, two would be enough. > I did also investigate USB3 external enclosures, they're pretty > fast these days. When I configured my kernel the other day, I discovered network block devices as an option. My PC has a hotswap bay[0]. Problem solved. :) Then I can do zpool replace with the drive-to-be-replaced still in the pool, which improves resilver read distribution and thus lessens the probability of a failure cascade. [0] http://www.sharkoon.com/?q=de/node/2171 ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-07 14:53 ` Frank Steinmetzger @ 2017-12-07 15:26 ` Rich Freeman 2017-12-07 16:04 ` Frank Steinmetzger 2017-12-07 20:02 ` Wols Lists 1 sibling, 1 reply; 35+ messages in thread From: Rich Freeman @ 2017-12-07 15:26 UTC (permalink / raw To: gentoo-user On Thu, Dec 7, 2017 at 9:53 AM, Frank Steinmetzger <Warp_7@gmx.de> wrote: > > I see. I'm always looking for ways to optimise expenses and cut down on > environmental footprint by keeping stuff around until it really breaks. In > order to increase capacity, I would have to replace all four drives, whereas > with a mirror, two would be enough. > That is a good point. Though I would note that you can always replace the raidz2 drives one at a time - you just get zero benefit until they're all replaced. So, if your space use grows at a rate lower than the typical hard drive turnover rate that is an option. > > When I configured my kernel the other day, I discovered network block > devices as an option. My PC has a hotswap bay[0]. Problem solved. :) Then I > can do zpool replace with the drive-to-be-replaced still in the pool, which > improves resilver read distribution and thus lessens the probability of a > failure cascade. > If you want to get into the network storage space I'd keep an eye on cephfs. I don't think it is quite to the point where it is a zfs/btrfs replacement option, but it could get there. I don't think the checksums are quite end-to-end, but they're getting better. Overall stability for cephfs itself (as opposed to ceph object storage) is not as good from what I hear. The biggest issue with it though is RAM use on the storage nodes. They want 1GB/TB RAM, which rules out a lot of the cheap ARM-based solutions. Maybe you can get by with less, but finding ARM systems with even 4GB of RAM is tough, and even that means only one hard drive per node, which means a lot of $40+ nodes to go on top of the cost of the drives themselves. Right now cephfs mainly seems to appeal to the scalability use case. If you have 10k servers accessing 150TB of storage and you want that all in one managed well-performing pool that is something cephfs could probably deliver that almost any other solution can't (and the ones that can cost WAY more than just one box running zfs on a couple of RAIDs). -- Rich ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-07 15:26 ` Rich Freeman @ 2017-12-07 16:04 ` Frank Steinmetzger 2017-12-07 23:09 ` Rich Freeman 0 siblings, 1 reply; 35+ messages in thread From: Frank Steinmetzger @ 2017-12-07 16:04 UTC (permalink / raw To: gentoo-user On Thu, Dec 07, 2017 at 10:26:34AM -0500, Rich Freeman wrote: > On Thu, Dec 7, 2017 at 9:53 AM, Frank Steinmetzger <Warp_7@gmx.de> wrote: > > > > I see. I'm always looking for ways to optimise expenses and cut down on > > environmental footprint by keeping stuff around until it really breaks. In > > order to increase capacity, I would have to replace all four drives, whereas > > with a mirror, two would be enough. > > > > That is a good point. Though I would note that you can always replace > the raidz2 drives one at a time - you just get zero benefit until > they're all replaced. So, if your space use grows at a rate lower > than the typical hard drive turnover rate that is an option. > > > > > When I configured my kernel the other day, I discovered network block > > devices as an option. My PC has a hotswap bay[0]. Problem solved. :) Then I > > can do zpool replace with the drive-to-be-replaced still in the pool, which > > improves resilver read distribution and thus lessens the probability of a > > failure cascade. > > > > If you want to get into the network storage space I'd keep an eye on > cephfs. No, I was merely talking about the use case of replacing drives on-the-fly with the limited hardware available (all slots are occupied). It was not about expanding my storage beyond what my NAS case can provide. Resilvering is risky business, more so with big drives and especially once they get older. That's why I was talking about adding the new drive externally, which allows me to use all old drives during resilvering. Once it is resilvered, I install it physically. > […] They want 1GB/TB RAM, which rules out a lot of the cheap ARM-based > solutions. Maybe you can get by with less, but finding ARM systems with > even 4GB of RAM is tough, and even that means only one hard drive per > node, which means a lot of $40+ nodes to go on top of the cost of the > drives themselves. No need to overshoot. It's a simple media archive and I'm happy with what I have, apart from a few shortcomings of the case regarding quality and space. My main goal was reliability, hence ZFS, ECC, and a Gold-rated PSU. They say RAID is not a backup. For me it is -- in case of disk failure, which is my main dread. You can't really get ECC on ARM, right? So M-ITX was the next best choice. I have a tiny (probably one of the smallest available) M-ITX case for four 3,5″ bays and an internal 2.5″ mount: https://www.inter-tech.de/en/products/ipc/storage-cases/sc-4100 Tata... -- I cna ytpe 300 wrods pre mniuet!!! ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-07 16:04 ` Frank Steinmetzger @ 2017-12-07 23:09 ` Rich Freeman 0 siblings, 0 replies; 35+ messages in thread From: Rich Freeman @ 2017-12-07 23:09 UTC (permalink / raw To: gentoo-user On Thu, Dec 7, 2017 at 11:04 AM, Frank Steinmetzger <Warp_7@gmx.de> wrote: > On Thu, Dec 07, 2017 at 10:26:34AM -0500, Rich Freeman wrote: > >> […] They want 1GB/TB RAM, which rules out a lot of the cheap ARM-based >> solutions. Maybe you can get by with less, but finding ARM systems with >> even 4GB of RAM is tough, and even that means only one hard drive per >> node, which means a lot of $40+ nodes to go on top of the cost of the >> drives themselves. > > You can't really get ECC on ARM, right? So M-ITX was the next best choice. I > have a tiny (probably one of the smallest available) M-ITX case for four > 3,5″ bays and an internal 2.5″ mount: > https://www.inter-tech.de/en/products/ipc/storage-cases/sc-4100 > I don't think ECC is readily available on ARM (most of those boards are SoCs where the RAM is integral and can't be expanded). If CephFS were designed with end-to-end checksums that wouldn't really matter much, because the client would detect any error in a storage node and could obtain a good copy from another node and trigger a resilver. However, I don't think Ceph is quite there, with checksums being used at various points but I think there are gaps where no checksum is protecting the data. That is one of the things I don't like about it. If I were designing the checksums for it I'd probably have the client compute the checksum and send it with the data, then at every step the checksum is checked, and stored in the metadata on permanent storage. Then when the ack goes back to the client that the data is written the checksum would be returned to the client from the metadata, and the client would do a comparison. Any retrieval would include the client obtaining the checksum from the metadata and then comparing it to the data from the storage nodes. I don't think this approach would really add any extra overhead (the metadata needs to be recorded when writing anyway, and read when reading anyway). It just ensures there is a checksum on separate storage from the data and that it is the one captured when the data was first written. A storage node could be completely unreliable in this scenario as it exists apart from the checksum being used to verify it. Storage nodes would still do their own checksum verification anyway since that would allow errors to be detected sooner and reduce latency, but this is not essential to reliability. Instead I think Ceph does not store checksums in the metadata. The client checksum is used to verify accurate transfer over the network, but then the various nodes forget about it, and record the data. If the data is backed on ZFS/btrfs/bluestore then the filesystem would compute its own checksum to detect silent corruption while at rest. However, if the data were corrupted by faulty software or memory failure after it was verified upon reception but before it was re-checksummed prior to storage then you would have a problem. In that case a scrub would detect non-matching data between nodes but with no way to determine which node is correct. If somebody with more knowledge of Ceph knows otherwise I'm all ears, because this is one of those things that gives me a bit of pause. 
Don't get me wrong - most other approaches have the same issues, but I can reduce the risk of some of that with ECC, but that isn't practical when you want many RAM-intensive storage nodes in the solution. -- Rich ^ permalink raw reply [flat|nested] 35+ messages in thread
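The write/read flow described two messages up is easy to sketch. The
following is a hypothetical illustration of the end-to-end idea only
(invented names, not Ceph's actual API or wire protocol): the checksum is
computed once by the client, verified at each hop, persisted in metadata,
and checked again by the client on read.

    # Hypothetical end-to-end checksum flow: the client's checksum travels
    # with the data, is stored in metadata, and is re-verified on read, so a
    # node corrupting data after receipt is still caught by the client.
    import hashlib

    def csum(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    metadata = {}   # object name -> checksum  (stand-in for a metadata pool)
    storage  = {}   # object name -> bytes     (stand-in for a storage node)

    def write(name: str, data: bytes) -> None:
        c = csum(data)
        assert csum(data) == c          # node verifies before acknowledging
        storage[name] = data
        metadata[name] = c
        assert metadata[name] == c      # client compares the ack'd checksum

    def read(name: str) -> bytes:
        data = storage[name]
        if csum(data) != metadata[name]:    # client-side verification
            raise IOError(f"silent corruption detected in {name}")
        return data

    write("obj", b"payload")
    assert read("obj") == b"payload"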
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-07 14:53 ` Frank Steinmetzger 2017-12-07 15:26 ` Rich Freeman @ 2017-12-07 20:02 ` Wols Lists 1 sibling, 0 replies; 35+ messages in thread From: Wols Lists @ 2017-12-07 20:02 UTC (permalink / raw To: gentoo-user On 07/12/17 14:53, Frank Steinmetzger wrote: > When I configured my kernel the other day, I discovered network block > devices as an option. My PC has a hotswap bay[0]. Problem solved. :) Then I > can do zpool replace with the drive-to-be-replaced still in the pool, which > improves resilver read distribution and thus lessens the probability of a > failure cascade. Or with mdadm, there's "mdadm --replace". If you want to swap a drive (rather than replace a failed drive), this both preserves redundancy and reduces the stress on the array by doing disk-to-disk copy rather than recalculating the new disk. Cheers, Wol ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-07 9:52 ` Richard Bradfield 2017-12-07 14:53 ` Frank Steinmetzger @ 2017-12-07 18:35 ` Wols Lists 2017-12-07 20:17 ` Richard Bradfield 1 sibling, 1 reply; 35+ messages in thread From: Wols Lists @ 2017-12-07 18:35 UTC (permalink / raw To: gentoo-user On 07/12/17 09:52, Richard Bradfield wrote: > I did also investigate USB3 external enclosures, they're pretty > fast these days. AARRGGHHHHH !!! If you're using mdadm, DO NOT TOUCH USB WITH A BARGE POLE !!! I don't know the details, but I gather the problems are very similar to the timeout problem, but much worse. I know the wiki says you can "get away" with USB, but only for a broken drive, and only when recovering *from* it. Cheers, Wol ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-07 18:35 ` Wols Lists @ 2017-12-07 20:17 ` Richard Bradfield 2017-12-07 20:39 ` Wols Lists 0 siblings, 1 reply; 35+ messages in thread From: Richard Bradfield @ 2017-12-07 20:17 UTC (permalink / raw To: gentoo-user On Thu, Dec 07, 2017 at 06:35:16PM +0000, Wols Lists wrote: >On 07/12/17 09:52, Richard Bradfield wrote: >> I did also investigate USB3 external enclosures, they're pretty >> fast these days. > >AARRGGHHHHH !!! > >If you're using mdadm, DO NOT TOUCH USB WITH A BARGE POLE !!! > >I don't know the details, but I gather the problems are very similar to >the timeout problem, but much worse. > >I know the wiki says you can "get away" with USB, but only for a broken >drive, and only when recovering *from* it. > >Cheers, >Wol > I'm using ZFS on Linux, does that make you any less terrified? :) I never ended up pursuing the USB enclosure, because disks got bigger faster than I needed more storage, but I'd be interested in hearing if there are real issues with trying to mount drive arrays over XHCI, given the failure of eSATA to achieve wide adoption it looked like a good route for future expansion. -- Richard ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [gentoo-user] OT: btrfs raid 5/6 2017-12-07 20:17 ` Richard Bradfield @ 2017-12-07 20:39 ` Wols Lists 0 siblings, 0 replies; 35+ messages in thread From: Wols Lists @ 2017-12-07 20:39 UTC (permalink / raw To: gentoo-user On 07/12/17 20:17, Richard Bradfield wrote: > On Thu, Dec 07, 2017 at 06:35:16PM +0000, Wols Lists wrote: >> On 07/12/17 09:52, Richard Bradfield wrote: >>> I did also investigate USB3 external enclosures, they're pretty >>> fast these days. >> >> AARRGGHHHHH !!! >> >> If you're using mdadm, DO NOT TOUCH USB WITH A BARGE POLE !!! >> >> I don't know the details, but I gather the problems are very similar to >> the timeout problem, but much worse. >> >> I know the wiki says you can "get away" with USB, but only for a broken >> drive, and only when recovering *from* it. >> >> Cheers, >> Wol >> > > I'm using ZFS on Linux, does that make you any less terrified? :) > > I never ended up pursuing the USB enclosure, because disks got bigger > faster than I needed more storage, but I'd be interested in hearing if > there are real issues with trying to mount drive arrays over XHCI, given > the failure of eSATA to achieve wide adoption it looked like a good > route for future expansion. > Sorry, not a clue. I don't know zfs. The problem with USB, as I understand it, is that USB itself times out. If that happens, there is presumably a tear-down/setup delay, which is the timeout problem, which upsets mdadm. My personal experience is that the USB protocol also seems vulnerable to crashing and losing drives. In the --replace scenario, the fact that you are basically streaming from the old drive to the new one seems not to trip over the problem, but anything else is taking rather unnecessary risks ... As for eSATA, I want to get hold of a JBOD enclosure, but I'll then need to get a PCI card with an external port-multiplier ESATA capability. I suspect one of the reasons it didn't take off was the multiplicity of specifications, such that people probably bought add-ons that were "unfit for purpose" because they didn't know what they were doing, or the mobo suppliers cut corners so the on-board ports were unfit for purpose, etc etc. So the whole thing sank with a bad rep it didn't deserve. Certainly, when I've been looking, the situation is, shall we say, confusing ... Cheers, Wol Cheers, Wol ^ permalink raw reply [flat|nested] 35+ messages in thread