* [gentoo-user] [offtopic] Copy-On-Write ?
  From: Helmut Jarausch @ 2017-09-07 15:46 UTC
  To: gentoo-user

Hi,

sorry, this question is not Gentoo specific - but I know there are many
very knowledgeable people on this list.

I'd like to "hard-link" a file X to Y - i.e. there is no additional
space on disk for Y.

But, contrary to the "standard" hard-link (ln), file Y should be stored
in a different place (inode) IF it gets modified. With the standard
hard-link, file X is the same as Y, so any changes to Y are seen in X
by definition.

Is this possible
- with an ext4 FS
- or only with a different (which) FS?

Many thanks for a hint,
Helmut
* Re: [gentoo-user] [offtopic] Copy-On-Write ?
  From: Simon Thelen @ 2017-09-08 19:09 UTC
  To: gentoo-user

On 17-09-07 at 17:46, Helmut Jarausch wrote:
> Hi,
Hello,

> sorry, this question is not Gentoo specific - but I know there are many
> very knowledgeable people on this list.
>
> I'd like to "hard-link" a file X to Y - i.e. there is no additional
> space on disk for Y.
>
> But, contrary to the "standard" hard-link (ln), file Y should be stored
> in a different place (inode) IF it gets modified.
> With the standard hard-link, file X is the same as Y, so any changes to
> Y are seen in X by definition.
> Is this possible
> - with an ext4 FS
> - or only with a different (which) FS

You can use GNU coreutils' `cp --reflink=always'. This will, however,
only work on filesystems which support the operation (afaik so far only
btrfs), though other CoW filesystems (such as ZFS) have similar
capabilities with snapshotting.

The only other possibility I know of would be creating an lvm partition
for that file and using lvm snapshots. You should also be able to
implement the functionality via fuse on top of an ext4 base if the
other solutions aren't to your taste.

--
Simon Thelen
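As a concrete sketch of the cp invocation mentioned above (file names
are made up; this needs GNU coreutils and a filesystem mounted with
reflink support):

  $ cp --reflink=always X Y   # instant clone; fails if the fs cannot reflink
  $ cp --reflink=auto X Y     # tries to reflink, silently falls back to a
                              # normal copy on e.g. ext4

With =always you get a hard failure instead of a silent full copy,
which is usually what you want when the whole point is to save space.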
* Re: [gentoo-user] [offtopic] Copy-On-Write ?
  From: Marc Joliet @ 2017-09-08 19:10 UTC
  To: gentoo-user

On Thursday, 7 September 2017, 17:46:27 CEST, Helmut Jarausch wrote:
> Hi,
>
> sorry, this question is not Gentoo specific - but I know there are many
> very knowledgeable people on this list.
>
> I'd like to "hard-link" a file X to Y - i.e. there is no additional
> space on disk for Y.
>
> But, contrary to the "standard" hard-link (ln), file Y should be stored
> in a different place (inode) IF it gets modified.
> With the standard hard-link, file X is the same as Y, so any changes to
> Y are seen in X by definition.
>
> Is this possible
> - with an ext4 FS
> - or only with a different (which) FS

This has come to be referred to as reflinks (see, e.g., the cp(1) man
page). I don't think ext4 supports them, but both btrfs and xfs do (xfs
only very recently, though, see for example [0]). There might be other
FSs that support it, too (bcachefs?), but I don't know about that.
Maybe at some point ext4 will add support for it, but since I mainly
use btrfs I don't care that much.

[0] http://strugglers.net/~andy/blog/2017/01/10/xfs-reflinks-and-deduplication/

> Many thanks for a hint,
> Helmut

HTH
--
Marc Joliet
--
"People who think they know everything really annoy those of us who
know we don't" - Bjarne Stroustrup
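For xfs the reflink feature has to be enabled at mkfs time on those
recent versions; a minimal sketch (the device name is a placeholder,
and the feature was still marked experimental around xfsprogs 4.9):

  $ mkfs.xfs -m reflink=1 /dev/sdX1
  $ mount /dev/sdX1 /mnt/test
  $ xfs_info /mnt/test | grep -o 'reflink=[01]'   # should report reflink=1

An existing xfs filesystem created without the flag cannot use
reflinks; at least back then it had to be recreated.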
* [gentoo-user] Re: [offtopic] Copy-On-Write ?
  From: Kai Krakow @ 2017-09-08 19:16 UTC
  To: gentoo-user

On Thu, 07 Sep 2017 17:46:27 +0200, Helmut Jarausch
<jarausch@skynet.be> wrote:

> Hi,
>
> sorry, this question is not Gentoo specific - but I know there are
> many very knowledgeable people on this list.
>
> I'd like to "hard-link" a file X to Y - i.e. there is no additional
> space on disk for Y.
>
> But, contrary to the "standard" hard-link (ln), file Y should be
> stored in a different place (inode) IF it gets modified.
> With the standard hard-link, file X is the same as Y, so any changes
> to Y are seen in X by definition.
>
> Is this possible
> - with an ext4 FS
> - or only with a different (which) FS

You can do this with "cp --reflink=always" if the filesystem supports
it. To my current knowledge, only btrfs (for a long time now) and xfs
(in newer kernel versions) support it. Not sure if ext4 supports it or
plans to support it.

It is different from hard linking in that the new file is linked by a
new inode, thus it has its own time stamp and permissions, unlike hard
links. Just the contents are initially shared, until you modify them.

Also keep in mind that this increases fragmentation, especially when
there are a lot of small modifications.

At least in btrfs there's also a caveat that the original extents may
not actually be split, and the split extents share parts of the
original extent. That means, if you delete the original later, the
copy will occupy more space than expected until you defragment the
file:

  File A extent map:  [1111][2222][3333]
  File B extent map:  [1111][2222][3333]

  Modify B:           [1111][22][4][2][3333]  <- one block modified

  Delete file A:      [----][2222][----]      <- extent 2 still mapped
  File B extent map:  [1111][22][4][2][3333]

So extent 2 is still on disk in its original state [2222].

  Defragment file B:  [1111][2242][3333]
  File A:             [----][----][----]      <- completely gone now

--
Regards,
Kai

Replies to list-only preferred.
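One way to watch this happening is filefrag from e2fsprogs; a rough
sketch (paths, sizes and offsets are invented, and the exact output
varies by kernel and btrfs version):

  $ dd if=/dev/zero of=a bs=1M count=256
  $ cp --reflink=always a b
  $ filefrag -v a b    # both files list extents at the same physical
                       # offsets, flagged as shared
  $ dd if=/dev/urandom of=b bs=4K count=1 seek=20000 conv=notrunc
  $ filefrag -v b      # a small new extent appears in the middle, the
                       # rest is still shared with a

After deleting a, something like `btrfs filesystem defragment b`
rewrites b into fresh extents so the old, partially referenced extent
can finally be freed.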
* Re: [gentoo-user] Re: [offtopic] Copy-On-Write ?
  From: Rich Freeman @ 2017-09-15 18:28 UTC
  To: gentoo-user

On Fri, Sep 8, 2017 at 3:16 PM, Kai Krakow <hurikhan77@gmail.com> wrote:
>
> At least in btrfs there's also a caveat that the original extents may
> not actually be split and the split extents share parts of the
> original extent. That means, if you delete the original later, the copy
> will occupy more space than expected until you defragment the file:
>

True, but keep in mind that this applies in general in btrfs to any
kind of modification to a file. If you modify 1MB in the middle of a
10GB file on ext4, it still ends up taking up 10GB of space. If you do
the same thing in btrfs you'll probably end up with the file taking up
10.001GB. Since btrfs doesn't overwrite files in-place it will
typically allocate a new extent for the additional 1MB, and the
original content at that position within the file is still on disk in
the original extent. It works a bit like a log-based filesystem in
this regard (which is also effectively copy on write).

--
Rich
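Spelled out as a command, that 1MB-in-the-middle modification could
look like this (file name and offset are arbitrary; conv=notrunc keeps
dd from truncating the file):

  $ dd if=/dev/urandom of=bigfile bs=1M count=1 seek=5120 conv=notrunc

On ext4 this overwrites the existing blocks in place, so the file still
occupies 10GB. On btrfs the 1MB goes into a newly allocated extent, and
the superseded 1MB inside the original extent is only reclaimed once
that extent is no longer referenced at all (for example after a
defragment or a full rewrite of the file).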
* [gentoo-user] Re: [offtopic] Copy-On-Write ?
  From: Kai Krakow @ 2017-09-16 12:06 UTC
  To: gentoo-user

On Fri, 15 Sep 2017 14:28:49 -0400, Rich Freeman <rich0@gentoo.org>
wrote:

> On Fri, Sep 8, 2017 at 3:16 PM, Kai Krakow <hurikhan77@gmail.com>
> wrote:
> >
> > At least in btrfs there's also a caveat that the original extents
> > may not actually be split and the split extents share parts of the
> > original extent. That means, if you delete the original later, the
> > copy will occupy more space than expected until you defragment the
> > file:
>
> True, but keep in mind that this applies in general in btrfs to any
> kind of modification to a file. If you modify 1MB in the middle of a
> 10GB file on ext4 you end up it taking up 10GB of space. If you do
> the same thing in btrfs you'll probably end up with the file taking up
> 10.001GB. Since btrfs doesn't overwrite files in-place it will
> typically allocate a new extent for the additional 1MB, and the
> original content at that position within the file is still on disk in
> the original extent. It works a bit like a log-based filesystem in
> this regard (which is also effectively copy on write).

Good point, this makes sense. I never thought about that.

But I guess that btrfs doesn't use 10G sized extents? And I also guess,
this is where autodefrag jumps in.

--
Regards,
Kai

Replies to list-only preferred.
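For reference, autodefrag is a btrfs mount option rather than a
separate tool; a sketch of how it is typically enabled (UUID and mount
point are placeholders):

  # /etc/fstab
  UUID=xxxx-xxxx  /data  btrfs  defaults,autodefrag  0 0

or, if I recall correctly, it can be toggled on a live system with a
remount:

  $ mount -o remount,autodefrag /data

It only affects data written after it is enabled and targets small
random writes, so it wouldn't rewrite the whole 10G file in this
example on its own.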
* Re: [gentoo-user] Re: [offtopic] Copy-On-Write ?
  From: Rich Freeman @ 2017-09-16 13:39 UTC
  To: gentoo-user

On Sat, Sep 16, 2017 at 8:06 AM, Kai Krakow <hurikhan77@gmail.com> wrote:
>
> But I guess that btrfs doesn't use 10G sized extents? And I also guess,
> this is where autodefrag jumps in.
>

It definitely doesn't use 10G extents considering the chunks are only
1GB. (For those who aren't aware, btrfs divides devices into chunks
which basically act like individual sub-devices to which operations
like mirroring/raid/etc are applied. This is why you can change raid
modes on the fly - the operation takes effect on new chunks. This also
allows clever things like a "RAID1" on 3x1TB disks to have 1.5TB of
useful space, because the chunks essentially balance themselves across
all three disks in pairs. It also is what causes the infamous issues
when btrfs runs low on space - once the last chunk is allocated it can
become difficult to rebalance/consolidate the remaining space.)

I couldn't actually find any info on default extent size. I did find a
128MB example in the docs, so presumably that isn't an unusual size.
So, the 1MB example would probably still work. Obviously if an entire
extent becomes obsolete it will lose its reference count and become
free.

Defrag was definitely intended to deal with this. I haven't looked at
the state of it in ages, when I stopped using it due to a bug and some
limitations. The main limitation being that defrag at least used to be
over-zealous. Not only would it free up the 1MB of wasted space, as in
this example, but if that 1GB file had a reflink clone it would go
ahead and split it into two duplicate 1GB extents. I believe that dedup
would do the reverse of this. Getting both to work together "the right
way" didn't seem possible the last time I looked into it, but if that
has changed I'm interested.

Granted, I've been moving away from btrfs lately, due to the fact that
it just hasn't matured as I originally thought it would. I really love
features like reflinks, but it has been years since it was "almost
ready" and it still tends to eat data. For the moment I'm relying more
on zfs. I'd love to switch back if they ever pull things together.

The other filesystem I'm eyeing with interest is cephfs, but that still
is slightly immature (on-disk checksums were only just added), and it
has a bit of overhead until you get into fairly large arrays. Cheap
arm-based OSD options seem to be fairly RAM-starved at the moment as
well given the ceph recommendation of 1GB/TB. arm64 still seems to be
slow to catch on, let alone cheap boards with 4-16GB of RAM.

--
Rich
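The chunk layout described above can be inspected directly, and the
low-space situation is usually worked around with a filtered balance;
a sketch (the mount point is hypothetical):

  $ btrfs filesystem usage /mnt         # allocated vs. unallocated space,
                                        # per device and per chunk type
  $ btrfs balance start -dusage=50 /mnt # rewrite only data chunks that are
                                        # at most 50% full, freeing the rest

The usage filter keeps the balance cheap compared to a full rebalance,
which would rewrite every chunk.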
* [gentoo-user] Re: [offtopic] Copy-On-Write ?
  From: Kai Krakow @ 2017-09-16 16:43 UTC
  To: gentoo-user

On Sat, 16 Sep 2017 09:39:33 -0400, Rich Freeman <rich0@gentoo.org>
wrote:

> On Sat, Sep 16, 2017 at 8:06 AM, Kai Krakow <hurikhan77@gmail.com>
> wrote:
> >
> > But I guess that btrfs doesn't use 10G sized extents? And I also
> > guess, this is where autodefrag jumps in.
> >
>
> It definitely doesn't use 10G extents considering the chunks are only
> 1GB. (For those who aren't aware, btrfs divides devices into chunks
> which basically act like individual sub-devices to which operations
> like mirroring/raid/etc are applied. This is why you can change raid
> modes on the fly - the operation takes effect on new chunks. This
> also allows clever things like a "RAID1" on 3x1TB disks to have 1.5TB
> of useful space, because the chunks essentially balance themselves
> across all three disks in pairs. It also is what causes the infamous
> issues when btrfs runs low on space - once the last chunk is allocated
> it can become difficult to rebalance/consolidate the remaining space.)

Actually, I'm running across 3x 1TB here on my desktop, with mraid1 and
draid 0. Combined with bcache it gives confident performance.

> I couldn't actually find any info on default extent size. I did find
> a 128MB example in the docs, so presumably that isn't an unusual size.
> So, the 1MB example would probably still work. Obviously if an entire
> extent becomes obsolete it will lose its reference count and become
> free.

According to the bees[1] source code it's actually 128M, if I remember
right.

> Defrag was definitely intended to deal with this. I haven't looked at
> the state of it in ages, when I stopped using it due to a bug and some
> limitations. The main limitation being that defrag at least used to
> be over-zealous. Not only would it free up the 1MB of wasted space,
> as in this example, but if that 1GB file had a reflink clone it would
> go ahead and split it into two duplicate 1GB extents. I believe that
> dedup would do the reverse of this. Getting both to work together
> "the right way" didn't seem possible the last time I looked into it,
> but if that has changed I'm interested.
>
> Granted, I've been moving away from btrfs lately, due to the fact that
> it just hasn't matured as I originally thought it would. I really
> love features like reflinks, but it has been years since it was
> "almost ready" and it still tends to eat data.

XFS has gained reflinks lately, and I think they are working on
snapshots currently. Kernel 4.14 or 4.15 promises new features for XFS
(if they get ready by then), so maybe that will be snapshots? I'm not
sure.

I was very happy a long time with XFS but switched to btrfs when it
became usable due to compression and stuff. But performance of
compression seems to get worse lately, IO performance drops due to
hogged CPUs even if my system really isn't that incapable.

What's still cool is that I don't need to manage volumes since the
volume manager is built into btrfs. XFS on LVM was not that flexible.
If btrfs wouldn't have this feature, I probably would have switched
back to XFS already.

> For the moment I'm
> relying more on zfs.

How does it perform memory-wise? Especially, I'm currently using bees[1]
for deduplication: It uses a 1G memory mapped file (you can choose
other sizes if you want), and it picks up new files really fast, within
a minute. I don't think zfs can do anything like that within the same
resources.

> I'd love to switch back if they ever pull things
> together. The other filesystem I'm eyeing with interest is cephfs,
> but that still is slightly immature (on-disk checksums were only just
> added), and it has a bit of overhead until you get into fairly large
> arrays. Cheap arm-based OSD options seem to be fairly RAM-starved at
> the moment as well given the ceph recommendation of 1GB/TB. arm64
> still seems to be slow to catch on, let alone cheap boards with 4-16GB
> of RAM.

Well, for servers, XFS is still my fs of choice. But I will be
evaluating btrfs for that soon, maybe compare it to zfs. When we have
evaluated the resource usage, we will buy matching hardware and set up
a new server, mainly for thin-provisioning container systems for web
hosting. I guess ZFS would be somewhat misused here as DAS.

If XFS gets into shape anytime soon with snapshotting features, I will
of course consider it. I've been using it for years and it was
extremely reliable, surviving power losses, not degrading in
performance. Something I cannot say about ext3, apparently. Also, XFS
gives good performance with JBOD because allocations are distributed
diagonally across the whole device. This is good for cheap hardware as
well as hardware raid controllers.

[1]: https://github.com/Zygo/bees

--
Regards,
Kai

Replies to list-only preferred.
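A sketch of how a layout like the mraid1/draid0 setup across three
disks is created (device names are placeholders):

  $ mkfs.btrfs -m raid1 -d raid0 /dev/sdb /dev/sdc /dev/sdd

and an existing filesystem can be converted between profiles later with
a filtered balance, e.g.

  $ btrfs balance start -mconvert=raid1 -dconvert=raid0 /mnt

which is exactly the per-chunk flexibility described earlier in the
thread.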
* Re: [gentoo-user] Re: [offtopic] Copy-On-Write ?
  From: Rich Freeman @ 2017-09-16 17:05 UTC
  To: gentoo-user

On Sat, Sep 16, 2017 at 9:43 AM, Kai Krakow <hurikhan77@gmail.com> wrote:
>
> Actually, I'm running across 3x 1TB here on my desktop, with mraid1 and
> draid 0. Combined with bcache it gives confident performance.
>

Not entirely sure I'd use the word "confident" to describe a filesystem
where the loss of one disk guarantees that:
1. You will lose data (no data redundancy).
2. But the filesystem will be able to tell you exactly what data you
   lost (as metadata will be fine).

> I was very happy a long time with XFS but switched to btrfs when it
> became usable due to compression and stuff. But performance of
> compression seems to get worse lately, IO performance drops due to
> hogged CPUs even if my system really isn't that incapable.
>

Btrfs performance is pretty bad in general right now. The problem is
that they just simply haven't gotten around to optimizing it fully,
mainly because they're more focused on getting rid of the data
corruption bugs (which is of course the right priority). For example,
with raid1 mode btrfs picks the disk to use for raid based on whether
the PID is even or odd, without any regard to disk utilization.

When I moved to zfs I noticed a huge performance boost. Fundamentally I
don't see why btrfs can't perform just as well as the others. It just
isn't there yet.

> What's still cool is that I don't need to manage volumes since the
> volume manager is built into btrfs. XFS on LVM was not that flexible.
> If btrfs wouldn't have this feature, I probably would have switched
> back to XFS already.

My main concern with xfs/ext4 is that neither provides on-disk
checksums or protection against the raid write hole.

I just switched motherboards a few weeks ago and either a connection or
a SATA port was bad because one of my drives was getting a TON of
checksum errors on zfs. I moved it to an LSI card and scrubbed, and
while it took forever and the system degraded the array more than once
due to the high error rate, eventually it patched up all the errors and
now the array is working without issue. I didn't suffer more than a bit
of inconvenience but with even mdadm raid1 I'd have had a HUGE headache
trying to recover from that (doing who knows how much troubleshooting
before realizing I had to do a slow full restore from backup with the
system down).

I just don't see how a modern filesystem can get away without having
full checksum support. It is a bit odd that it has taken so long for
Ceph to introduce it, and I'm still not sure if it is truly end-to-end,
or if at any point in its life the data isn't protected by checksums.
If I were designing something like Ceph I'd checksum the data at the
client the moment it enters storage, then independently store the
checksum and data, and then retrieve both and check it at the client
when the data leaves storage. Then you're protected against corruption
at any layer below that. You could of course have additional
protections to catch errors sooner before the client even sees them. I
think that the issue is that Ceph was really designed for object
storage originally and they just figured the application would be
responsible for data integrity.

The other benefit of checksums is that if they're done right scrubs can
go a lot faster, because you don't have to scrub all the redundancy
data synchronously. You can just start an idle-priority read thread on
every drive and then pause it anytime a drive is accessed, and an
access on one drive won't slow down the others. With traditional RAID
you have to read all the redundancy data synchronously because you
can't check the integrity of any of it without the full set. I think
even ZFS is stuck doing synchronous reads due to how it
stores/computes the checksums. This is something btrfs got right.

>
>> For the moment I'm
>> relying more on zfs.
>
> How does it perform memory-wise? Especially, I'm currently using bees[1]
> for deduplication: It uses a 1G memory mapped file (you can choose
> other sizes if you want), and it picks up new files really fast, within
> a minute. I don't think zfs can do anything like that within the same
> resources.

I'm not using deduplication, but my understanding is that zfs
deduplication:
1. Works just fine.
2. Uses a TON of RAM.

So, it might not be your cup of tea. There is no way to do semi-offline
dedup as with btrfs (not really offline in that the filesystem is fully
running - just that you periodically scan for dups and fix them after
the fact, vs detect them in realtime). With a semi-offline mode then
the performance hits would only come at a time of my choosing, vs using
gobs of RAM all the time to detect what are probably fairly rare dups.

That aside, I find it works fine memory-wise (I don't use dedup). It
has its own cache system not integrated fully into the kernel's native
cache, so it tends to hold on to a lot more RAM than other filesystems,
but you can tune this behavior so that it stays fairly tame.

--
Rich
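The "tune this behavior" part usually means capping the ARC; a minimal
sketch for ZFS on Linux (the 4 GiB value is arbitrary):

  # /etc/modprobe.d/zfs.conf
  options zfs zfs_arc_max=4294967296

  $ echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_max  # runtime change
  $ cat /proc/spl/kstat/zfs/arcstats                          # watch cache size
                                                              # (or use arcstat)

The ARC then shrinks back toward that ceiling instead of holding on to
most of the free RAM.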
* [gentoo-user] Re: [offtopic] Copy-On-Write ?
  From: Kai Krakow @ 2017-09-16 17:48 UTC
  To: gentoo-user

On Sat, 16 Sep 2017 10:05:21 -0700, Rich Freeman <rich0@gentoo.org>
wrote:

> On Sat, Sep 16, 2017 at 9:43 AM, Kai Krakow <hurikhan77@gmail.com>
> wrote:
> >
> > Actually, I'm running across 3x 1TB here on my desktop, with mraid1
> > and draid 0. Combined with bcache it gives confident performance.
> >
>
> Not entirely sure I'd use the word "confident" to describe a
> filesystem where the loss of one disk guarantees that:
> 1. You will lose data (no data redundancy).
> 2. But the filesystem will be able to tell you exactly what data you
> lost (as metadata will be fine).

I take daily backups with borg backup. It takes only 15 minutes to run,
and it has been tested twice successfully. The only breakdowns I had
were due to btrfs bugs, not hardware faults. This is confident enough
for my desktop system.

> > I was very happy a long time with XFS but switched to btrfs when it
> > became usable due to compression and stuff. But performance of
> > compression seems to get worse lately, IO performance drops due to
> > hogged CPUs even if my system really isn't that incapable.
> >
>
> Btrfs performance is pretty bad in general right now. The problem is
> that they just simply haven't gotten around to optimizing it fully,
> mainly because they're more focused on getting rid of the data
> corruption bugs (which is of course the right priority). For example,
> with raid1 mode btrfs picks the disk to use for raid based on whether
> the PID is even or odd, without any regard to disk utilization.
>
> When I moved to zfs I noticed a huge performance boost.

Interesting... While I never tried it, I always feared that it would
perform worse if not throwing RAM and ZIL/L2ARC at it.

> Fundamentally I don't see why btrfs can't perform just as well as the
> others. It just isn't there yet.

And it will take a long time still, because devs are still throwing new
features at it which need to stabilize.

> > What's still cool is that I don't need to manage volumes since the
> > volume manager is built into btrfs. XFS on LVM was not that
> > flexible. If btrfs wouldn't have this feature, I probably would
> > have switched back to XFS already.
>
> My main concern with xfs/ext4 is that neither provides on-disk
> checksums or protection against the raid write hole.

Btrfs suffers the same RAID5 write hole problem since years. I always
planned moving to RAID5 later (which is why I have 3 disks) but I fear
this won't be fixed any time soon due to design decisions made too
early.

> I just switched motherboards a few weeks ago and either a connection
> or a SATA port was bad because one of my drives was getting a TON of
> checksum errors on zfs. I moved it to an LSI card and scrubbed, and
> while it took forever and the system degraded the array more than once
> due to the high error rate, eventually it patched up all the errors
> and now the array is working without issue. I didn't suffer more than
> a bit of inconvenience but with even mdadm raid1 I'd have had a HUGE
> headache trying to recover from that (doing who knows how much
> troubleshooting before realizing I had to do a slow full restore from
> backup with the system down).

I found md raid not very reliable in the past but I didn't try again in
years. So this may have changed. I only remember it destroyed a file
system after an unclean shutdown not only once, that's not what I
expect from RAID1. Other servers with file systems on bare metal
survived this just fine.

> I just don't see how a modern filesystem can get away without having
> full checksum support. It is a bit odd that it has taken so long for
> Ceph to introduce it, and I'm still not sure if it is truly
> end-to-end, or if at any point in its life the data isn't protected by
> checksums. If I were designing something like Ceph I'd checksum the
> data at the client the moment it enters storage, then independently
> store the checksum and data, and then retrieve both and check it at
> the client when the data leaves storage. Then you're protected
> against corruption at any layer below that. You could of course have
> additional protections to catch errors sooner before the client even
> sees them. I think that the issue is that Ceph was really designed
> for object storage originally and they just figured the application
> would be responsible for data integrity.

I'd at least pass the checksum through all the layers while checking it
again, so you could detect which transport or layer is broken.

> The other benefit of checksums is that if they're done right scrubs
> can go a lot faster, because you don't have to scrub all the
> redundancy data synchronously. You can just start an idle-priority
> read thread on every drive and then pause it anytime a drive is
> accessed, and an access on one drive won't slow down the others. With
> traditional RAID you have to read all the redundancy data
> synchronously because you can't check the integrity of any of it
> without the full set. I think even ZFS is stuck doing synchronous
> reads due to how it stores/computes the checksums. This is something
> btrfs got right.

That's one other point why I decided for btrfs, though I don't make
much use of it currently. I used to do regular scrubs a while ago but
combined with bcache, that is an SSD killer... I killed my old 128G SSD
within one year, although I used overprovisioning. Well, I actually
didn't kill it, I swapped it at 99% lifetime according to smartctl. It
would probably still work for a long time in normal workloads.

> >> For the moment I'm
> >> relying more on zfs.
> >
> > How does it perform memory-wise? Especially, I'm currently using
> > bees[1] for deduplication: It uses a 1G memory mapped file (you can
> > choose other sizes if you want), and it picks up new files really
> > fast, within a minute. I don't think zfs can do anything like that
> > within the same resources.
>
> I'm not using deduplication, but my understanding is that zfs
> deduplication:
> 1. Works just fine.

No doubt...

> 2. Uses a TON of RAM.

That's the problem. And I think there is no near-line dedup tool
available?

> So, it might not be your cup of tea. There is no way to do
> semi-offline dedup as with btrfs (not really offline in that the
> filesystem is fully running - just that you periodically scan for dups
> and fix them after the fact, vs detect them in realtime). With a
> semi-offline mode then the performance hits would only come at a time
> of my choosing, vs using gobs of RAM all the time to detect what are
> probably fairly rare dups.

I'm using bees, and I'd call it near-line. Changes to files are picked
up at commit time, when a new generation is made, and then it walks the
new extents, maps those to files, and deduplicates the blocks. I was
surprised how fast it detects new duplicate blocks.

But it is still working through the rest of the file system (it has
been at it for days), at least without much impact on performance.
Giving up 1G of RAM for this is totally okay. Once it has finished
scanning the first time, I'm thinking about starting it at timed
intervals. But it looks like the impact will be so low that I can keep
it running all the time. Using cgroups to limit cpu and io shares works
really great.

I still haven't evaluated how it interferes with defragmenting, though,
or how big the impact is of bees fragmenting extents.

> That aside, I find it works fine memory-wise (I don't use dedup). It
> has its own cache system not integrated fully into the kernel's native
> cache, so it tends to hold on to a lot more ram than other
> filesystems, but you can tune this behavior so that it stays fairly
> tame.

I think the reasoning using own caching is, that block caching at the
vfs layer cannot just be done in an efficient way for a cow file system
with scrubbing and everything. You need to use good cache hinting
throughout the whole pipeline, which is currently slowly being
integrated into the kernel. E.g., when btrfs does cow action, bcache
doesn't get notified that it can discard the free block from cache. I
don't know if this is handled in the kernel cache layer...

--
Regards,
Kai

Replies to list-only preferred.
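The cgroup confinement mentioned here can be done ad hoc with systemd;
a rough sketch (the beesd invocation, UUID and limit values are
assumptions, and IOWeight needs the unified cgroup hierarchy - on
cgroup v1 the analogous knob was BlockIOWeight):

  $ systemd-run --scope -p CPUQuota=25% -p IOWeight=20 -- beesd <filesystem-uuid>

For a permanently installed unit, the same CPUQuota=/IOWeight= lines
can go into the unit's [Service] section via `systemctl edit` instead.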
* Re: [gentoo-user] Re: [offtopic] Copy-On-Write ?
  From: Rich Freeman @ 2017-09-16 19:02 UTC
  To: gentoo-user

On Sat, Sep 16, 2017 at 1:48 PM, Kai Krakow <hurikhan77@gmail.com> wrote:
> On Sat, 16 Sep 2017 10:05:21 -0700, Rich Freeman <rich0@gentoo.org>
> wrote:
>
>>
>> My main concern with xfs/ext4 is that neither provides on-disk
>> checksums or protection against the raid write hole.
>
> Btrfs suffers the same RAID5 write hole problem since years. I always
> planned moving to RAID5 later (which is why I have 3 disks) but I fear
> this won't be fixed any time soon due to design decisions made too
> early.
>

Btrfs RAID5 simply doesn't work. I don't think it was ever able to
recover from a failed drive - it really only exists so that they can
develop it.

>
>> I just switched motherboards a few weeks ago and either a connection
>> or a SATA port was bad because one of my drives was getting a TON of
>> checksum errors on zfs. I moved it to an LSI card and scrubbed, and
>> while it took forever and the system degraded the array more than once
>> due to the high error rate, eventually it patched up all the errors
>> and now the array is working without issue. I didn't suffer more than
>> a bit of inconvenience but with even mdadm raid1 I'd have had a HUGE
>> headache trying to recover from that (doing who knows how much
>> troubleshooting before realizing I had to do a slow full restore from
>> backup with the system down).
>
> I found md raid not very reliable in the past but I didn't try again in
> years. So this may have changed. I only remember it destroyed a file
> system after an unclean shutdown not only once, that's not what I
> expect from RAID1. Other servers with file systems on bare metal
> survived this just fine.
>

mdadm provides no protection against either silent corruption or the
raid hole. If your system dies/panics/etc while it is in the middle of
writing a stripe, then whatever previously occupied the space in that
stripe is likely to be lost. If your hard drive writes something to
disk other than what the OS told it to write, you'll also be likely to
lose a stripe unless you want to try to manually repair it (in theory
you could try to process the data manually excluding each of the drives
and try to work out which version of the data is correct, and then do
that for every damaged stripe).

Sure, both failure modes are rare, but they still exist. The fact that
you haven't personally experienced them doesn't change that. If I had
been using mdadm a few weeks ago I'd be restoring from backups. The
software would have worked fine, but if the disk doesn't write what it
was supposed to write, and the software has no way to recover from
this, then you're up the creek. With zfs and btrfs you aren't dependent
on the drive hardware detecting and reporting errors.

(In my case there were no errors reported by the drive at all for any
of this. I suspect the issue was in the SATA port or something else on
the motherboard. I haven't tried plugging in a scratch drive to try to
debug it, but I will be taking care not to use that port in the
future.)

> I think the reasoning using own caching is, that block caching at the
> vfs layer cannot just be done in an efficient way for a cow file system
> with scrubbing and everything.

Btrfs doesn't appear to have any issues despite being COW. There might
or might not be truth to your statement.

However, I think the real reason that ZFS on Linux uses its own cache
is just because it made it easier to just port the code over wholesale
by doing it this way. The goal of migrating the existing code was to
reduce the risk of regressions, which is why ZFS on Linux works as well
as it does. It would take them a long time to replace the caching layer
and there would be a lot of risk of introducing errors along the way,
so it just isn't as high a priority as getting it running in the first
place.

Plus ZFS has a bunch of both read and write cache features which aren't
really built into the kernel as far as I'm aware. Sure, there is bcache
and so on, but that isn't part of the regular kernel cache. Rewriting
ZFS to do things the linux way would be down the road, and it wouldn't
help them get it into the mainline kernel anyway due to the licensing
issue.

--
Rich
* Re: [gentoo-user] Re: [offtopic] Copy-On-Write ?
  From: Dan Douglas @ 2017-09-17 6:20 UTC
  To: gentoo-user

On 09/16/2017 07:06 AM, Kai Krakow wrote:
> On Fri, 15 Sep 2017 14:28:49 -0400, Rich Freeman <rich0@gentoo.org>
> wrote:
>
>> On Fri, Sep 8, 2017 at 3:16 PM, Kai Krakow <hurikhan77@gmail.com>
>> wrote:
>>>
>>> At least in btrfs there's also a caveat that the original extents
>>> may not actually be split and the split extents share parts of the
>>> original extent. That means, if you delete the original later, the
>>> copy will occupy more space than expected until you defragment the
>>> file:
>>
>> True, but keep in mind that this applies in general in btrfs to any
>> kind of modification to a file. If you modify 1MB in the middle
>> of a 10GB file on ext4 you end up it taking up 10GB of space. If
>> you do the same thing in btrfs you'll probably end up with the
>> file taking up 10.001GB. Since btrfs doesn't overwrite files
>> in-place it will typically allocate a new extent for the
>> additional 1MB, and the original content at that position within
>> the file is still on disk in the original extent. It works a bit
>> like a log-based filesystem in this regard (which is also
>> effectively copy on write).
>
> Good point, this makes sense. I never thought about that.
>
> But I guess that btrfs doesn't use 10G sized extents? And I also guess,
> this is where autodefrag jumps in.

According to btrfs-filesystem(8), defragmentation breaks reflinks, in
all but a few old kernel versions where I guess they tried to fix the
problem and apparently failed. This really makes much of what btrfs
does altogether pointless if you ever defragment manually or have
autodefrag enabled. Deduplication is broken for the same reason.
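The unsharing is easy to observe with a reasonably recent btrfs-progs;
a sketch (file names invented):

  $ cp --reflink=always big.img big-clone.img
  $ btrfs filesystem du -s big.img big-clone.img  # "Set shared" covers
                                                  # nearly everything
  $ btrfs filesystem defragment big-clone.img
  $ btrfs filesystem du -s big.img big-clone.img  # shared drops, "Exclusive"
                                                  # grows to a full second copy

So defragmenting a snapshotted or deduplicated dataset can roughly
double its disk usage.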
* [gentoo-user] Re: [offtopic] Copy-On-Write ?
  From: Kai Krakow @ 2017-09-17 9:17 UTC
  To: gentoo-user

On Sun, 17 Sep 2017 01:20:45 -0500, Dan Douglas <ormaaj@gmail.com>
wrote:

> On 09/16/2017 07:06 AM, Kai Krakow wrote:
> > On Fri, 15 Sep 2017 14:28:49 -0400, Rich Freeman <rich0@gentoo.org>
> > wrote:
> >
> >> On Fri, Sep 8, 2017 at 3:16 PM, Kai Krakow <hurikhan77@gmail.com>
> >> wrote:
> [...]
> >>
> >> True, but keep in mind that this applies in general in btrfs to any
> >> kind of modification to a file. If you modify 1MB in the middle
> >> of a 10GB file on ext4 you end up it taking up 10GB of space. If
> >> you do the same thing in btrfs you'll probably end up with the
> >> file taking up 10.001GB. Since btrfs doesn't overwrite files
> >> in-place it will typically allocate a new extent for the
> >> additional 1MB, and the original content at that position within
> >> the file is still on disk in the original extent. It works a bit
> >> like a log-based filesystem in this regard (which is also
> >> effectively copy on write).
> >
> > Good point, this makes sense. I never thought about that.
> >
> > But I guess that btrfs doesn't use 10G sized extents? And I also
> > guess, this is where autodefrag jumps in.
>
> According to btrfs-filesystem(8), defragmentation breaks reflinks, in
> all but a few old kernel versions where I guess they tried to fix the
> problem and apparently failed.

It was splitting and splicing all the reflinks which is actually a tree
walk with more and more extents coming into the equation, and ended up
doing a lot of small IO and needing a lot of memory. I think you really
cannot fix this when working with extents.

> This really makes much of what btrfs
> does altogether pointless if you ever defragment manually or have
> autodefrag enabled. Deduplication is broken for the same reason.

It's much easier to fix this for deduplication: Just write your common
denominator of an extent to a tmp file, then walk all the reflinks and
share them with parts of this extent.

If you carefully select what to defragment, there should be no problem.
A defrag tool could simply skip all the shared extents. A few fragments
do not hurt performance at all, but what's important is spatial
locality. A lot small fragments may hurt performance a lot, so one
could give the defragger a hint when to ignore the rule and still
defragment the extent. Also, when your deduplication window is 1M you
could probably safely defrag all extents smaller than 1M.

--
Regards,
Kai

Replies to list-only preferred.
* Re: [gentoo-user] Re: [offtopic] Copy-On-Write ?
  From: Dan Douglas @ 2017-09-17 13:20 UTC
  To: gentoo-user

On 09/17/2017 04:17 AM, Kai Krakow wrote:
> On Sun, 17 Sep 2017 01:20:45 -0500, Dan Douglas <ormaaj@gmail.com>
> wrote:
>
>> On 09/16/2017 07:06 AM, Kai Krakow wrote:
>>> On Fri, 15 Sep 2017 14:28:49 -0400, Rich Freeman <rich0@gentoo.org>
>>> wrote:
>>>
>>>> On Fri, Sep 8, 2017 at 3:16 PM, Kai Krakow <hurikhan77@gmail.com>
>>>> wrote:
>> [...]
>>>>
>>>> True, but keep in mind that this applies in general in btrfs to any
>>>> kind of modification to a file. If you modify 1MB in the middle
>>>> of a 10GB file on ext4 you end up it taking up 10GB of space. If
>>>> you do the same thing in btrfs you'll probably end up with the
>>>> file taking up 10.001GB. Since btrfs doesn't overwrite files
>>>> in-place it will typically allocate a new extent for the
>>>> additional 1MB, and the original content at that position within
>>>> the file is still on disk in the original extent. It works a bit
>>>> like a log-based filesystem in this regard (which is also
>>>> effectively copy on write).
>>>
>>> Good point, this makes sense. I never thought about that.
>>>
>>> But I guess that btrfs doesn't use 10G sized extents? And I also
>>> guess, this is where autodefrag jumps in.
>>
>> According to btrfs-filesystem(8), defragmentation breaks reflinks, in
>> all but a few old kernel versions where I guess they tried to fix the
>> problem and apparently failed.
>
> It was splitting and splicing all the reflinks which is actually a tree
> walk with more and more extents coming into the equation, and ended up
> doing a lot of small IO and needing a lot of memory. I think you really
> cannot fix this when working with extents.

I figured by "break up" they meant it eliminates the reflink by making
a full copy... so the increased space they're talking about isn't
really double that of the original data in other words.

>
>> This really makes much of what btrfs
>> does altogether pointless if you ever defragment manually or have
>> autodefrag enabled. Deduplication is broken for the same reason.
>
> It's much easier to fix this for deduplication: Just write your common
> denominator of an extent to a tmp file, then walk all the reflinks and
> share them with parts of this extent.
>
> If you carefully select what to defragment, there should be no problem.
> A defrag tool could simply skip all the shared extents. A few fragments
> do not hurt performance at all, but what's important is spatial
> locality. A lot small fragments may hurt performance a lot, so one
> could give the defragger a hint when to ignore the rule and still
> defragment the extent. Also, when your deduplication window is 1M you
> could probably safely defrag all extents smaller than 1M.

Yeah this sort of hurts with the way I deal with KVM image snapshots. I
have raw base images as backing files with lots of shared and null
data, so I run `fallocate --dig-holes' followed by `duperemove
--dedupe-options=same' on the cow-enabled base images and hope that
btrfs defrag can clean up the resulting fragmented mess, but it's a
slow process and doesn't seem to do a good job.
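Spelled out, that workflow looks roughly like this (paths are
placeholders; duperemove's -d actually submits the dedupe requests and
-r recurses into the directory):

  $ fallocate --dig-holes /var/lib/libvirt/images/base.raw   # punch out
                                                             # all-zero ranges
  $ duperemove -dr --dedupe-options=same /var/lib/libvirt/images/
  $ btrfs filesystem defragment -r /var/lib/libvirt/images/  # re-splits shared
                                                             # extents, see above

which is exactly the conflict discussed in this thread: the last step
undoes much of what the second one gained.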
* [gentoo-user] Re: [offtopic] Copy-On-Write ?
  From: Kai Krakow @ 2017-09-17 23:15 UTC
  To: gentoo-user

On Sun, 17 Sep 2017 08:20:50 -0500, Dan Douglas <ormaaj@gmail.com>
wrote:

> On 09/17/2017 04:17 AM, Kai Krakow wrote:
> > On Sun, 17 Sep 2017 01:20:45 -0500, Dan Douglas <ormaaj@gmail.com>
> > wrote:
> >
> >> On 09/16/2017 07:06 AM, Kai Krakow wrote:
> [...]
> [...]
> >> [...]
> [...]
> [...]
> >>
> >> According to btrfs-filesystem(8), defragmentation breaks reflinks,
> >> in all but a few old kernel versions where I guess they tried to
> >> fix the problem and apparently failed.
> >
> > It was splitting and splicing all the reflinks which is actually a
> > tree walk with more and more extents coming into the equation, and
> > ended up doing a lot of small IO and needing a lot of memory. I
> > think you really cannot fix this when working with extents.
>
> I figured by "break up" they meant it eliminates the reflink by making
> a full copy... so the increased space they're talking about isn't
> really double that of the original data in other words.
>
> >
> >> This really makes much of what btrfs
> >> does altogether pointless if you ever defragment manually or have
> >> autodefrag enabled. Deduplication is broken for the same reason.
> >
> > It's much easier to fix this for deduplication: Just write your
> > common denominator of an extent to a tmp file, then walk all the
> > reflinks and share them with parts of this extent.
> >
> > If you carefully select what to defragment, there should be no
> > problem. A defrag tool could simply skip all the shared extents. A
> > few fragments do not hurt performance at all, but what's important
> > is spatial locality. A lot small fragments may hurt performance a
> > lot, so one could give the defragger a hint when to ignore the rule
> > and still defragment the extent. Also, when your deduplication
> > window is 1M you could probably safely defrag all extents smaller
> > than 1M.
>
> Yeah this sort of hurts with the way I deal with KVM image snapshots.
> I have raw base images as backing files with lots of shared and null
> data, so I run `fallocate --dig-holes' followed by `duperemove
> --dedupe-options=same' on the cow-enabled base images and hope that
> btrfs defrag can clean up the resulting fragmented mess, but it's a
> slow process and doesn't seem to do a good job.

I would be interested in your results if you try bees[1] to deduplicate
your KVM images. It should be able to dig holes and merge blocks by
reflinking. I'm not sure if it would merge contiguous extents back into
one single extent, I think that's on a todo list. It could act as a
reflink-aware defragger then.

It currently does not work well for mixed datasum/nodatasum workloads,
so I made a PR[2] to ignore nocow files. A more elaborate patch would
not try to reflink datasum and nodatasum extents (nocow implies
nodatasum).

[1]: https://github.com/Zygo/bees
[2]: https://github.com/Zygo/bees/pull/21

--
Regards,
Kai

Replies to list-only preferred.