public inbox for gentoo-user@lists.gentoo.org
* [gentoo-user] [Gentoo-User] emerge --sync likely to kill SSD?
@ 2014-06-19  2:36 microcai
  2014-06-19  8:40 ` Amankwah
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: microcai @ 2014-06-19  2:36 UTC (permalink / raw
  To: gentoo-user

rsync does a bunch of 4k random I/O when updating the portage tree,
which will kill SSDs through a much higher write amplification factor.


I have a two-year-old SSD that reports a write amplification factor
of 26. I think the only reason is that I put the portage tree on this SSD
to speed it up.

What is the suggested way to reduce the write amplification of a portage sync?



* Re: [gentoo-user] [Gentoo-User] emerge --sync likely to kill SSD?
  2014-06-19  2:36 [gentoo-user] [Gentoo-User] emerge --sync likely to kill SSD? microcai
@ 2014-06-19  8:40 ` Amankwah
  2014-06-19 11:44   ` Neil Bothwick
  2014-06-19 22:03 ` Full Analyst
  2014-06-20 17:48 ` [gentoo-user] " Kai Krakow
  2 siblings, 1 reply; 17+ messages in thread
From: Amankwah @ 2014-06-19  8:40 UTC (permalink / raw
  To: gentoo-user

On Thu, Jun 19, 2014 at 10:36:59AM +0800, microcai wrote:
> rsync does a bunch of 4k random I/O when updating the portage tree,
> which will kill SSDs through a much higher write amplification factor.
> 
> 
> I have a two-year-old SSD that reports a write amplification factor
> of 26. I think the only reason is that I put the portage tree on this SSD
> to speed it up.
> 
> What is the suggested way to reduce the write amplification of a portage sync?
> 

Maybe the only solution is to move the portage tree to an HDD?



* Re: [gentoo-user] [Gentoo-User] emerge --sync likely to kill SSD?
  2014-06-19  8:40 ` Amankwah
@ 2014-06-19 11:44   ` Neil Bothwick
  2014-06-19 11:56     ` Rich Freeman
  0 siblings, 1 reply; 17+ messages in thread
From: Neil Bothwick @ 2014-06-19 11:44 UTC (permalink / raw
  To: gentoo-user


On Thu, 19 Jun 2014 16:40:08 +0800, Amankwah wrote:

> Maybe the only solution is to move the portage tree to an HDD?

Or tmpfs if you rarely reboot or have a fast enough connection to your
preferred portage mirror.
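
A single fstab line would do it, something along these lines (the size is
only an example, pick whatever fits your RAM), then just re-sync after a
reboot:

  tmpfs   /usr/portage   tmpfs   size=1G,noatime   0 0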


-- 
Neil Bothwick

The voices in my head may not be real, but they have some good ideas!



* Re: [gentoo-user] [Gentoo-User] emerge --sync likely to kill SSD?
  2014-06-19 11:44   ` Neil Bothwick
@ 2014-06-19 11:56     ` Rich Freeman
  2014-06-19 12:16       ` Kerin Millar
  0 siblings, 1 reply; 17+ messages in thread
From: Rich Freeman @ 2014-06-19 11:56 UTC (permalink / raw
  To: gentoo-user

On Thu, Jun 19, 2014 at 7:44 AM, Neil Bothwick <neil@digimed.co.uk> wrote:
> On Thu, 19 Jun 2014 16:40:08 +0800, Amankwah wrote:
>
>> Maybe the only solution is to move the portage tree to an HDD?
>
> Or tmpfs if you rarely reboot or have a fast enough connection to your
> preferred portage mirror.

There has been a proposal to move it to squashfs, which might
potentially also help.

The portage tree is 700M uncompressed, which seems like a bit much to
just leave in RAM all the time.
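
Something along these lines ought to work (untested sketch, the paths are
just examples); the image should come out at a fraction of that 700M:

  # build a compressed image of the tree and mount it read-only
  mksquashfs /usr/portage /var/cache/portage.sqfs -comp xz
  mount -t squashfs -o loop,ro /var/cache/portage.sqfs /usr/portage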

Mine is on an SSD, but the SMART attributes aren't well-documented so
I have no idea what the erase count or WAF is - just the LBA written
count.

Rich



* Re: [gentoo-user] [Gentoo-User] emerge --sync likely to kill SSD?
  2014-06-19 11:56     ` Rich Freeman
@ 2014-06-19 12:16       ` Kerin Millar
  0 siblings, 0 replies; 17+ messages in thread
From: Kerin Millar @ 2014-06-19 12:16 UTC (permalink / raw
  To: gentoo-user

On 19/06/2014 12:56, Rich Freeman wrote:
> On Thu, Jun 19, 2014 at 7:44 AM, Neil Bothwick <neil@digimed.co.uk> wrote:
>> On Thu, 19 Jun 2014 16:40:08 +0800, Amankwah wrote:
>>
>>> Maybe the only solution is to move the portage tree to an HDD?
>>
>> Or tmpfs if you rarely reboot or have a fast enough connection to your
>> preferred portage mirror.
>
> There has been a proposal to move it to squashfs, which might
> potentially also help.
>
> The portage tree is 700M uncompressed, which seems like a bit much to
> just leave in RAM all the time.

The tree will not necessarily be left in RAM all of the time. Pages 
allocated by tmpfs reside in pagecache. Given sufficient pressure, they 
may be migrated to swap. Even then, zswap [1] could be used so as to 
reduce write amplification. I like Neil's suggestion, assuming that the 
need to reboot is infrequent.
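
For the record, with a kernel built with CONFIG_ZSWAP it is only a matter of
flipping the parameters on; roughly like this (the values are merely
examples, see [1] for the details):

  echo 1  > /sys/module/zswap/parameters/enabled
  echo 20 > /sys/module/zswap/parameters/max_pool_percent
  # or, equivalently, boot with: zswap.enabled=1 zswap.max_pool_percent=20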

--Kerin

[1] https://www.kernel.org/doc/Documentation/vm/zswap.txt



* Re: [gentoo-user] [Gentoo-User] emerge --sync likely to kill SSD?
  2014-06-19  2:36 [gentoo-user] [Gentoo-User] emerge --sync likely to kill SSD? microcai
  2014-06-19  8:40 ` Amankwah
@ 2014-06-19 22:03 ` Full Analyst
  2014-06-20 17:48 ` [gentoo-user] " Kai Krakow
  2 siblings, 0 replies; 17+ messages in thread
From: Full Analyst @ 2014-06-19 22:03 UTC (permalink / raw
  To: gentoo-user

Hello microcai,

I use tmpfs heavily as I have an SSD.
Here is some information that may help you:

tank woody # mount -v | grep tmpfs
devtmpfs on /dev type devtmpfs (rw,relatime,size=8050440k,nr_inodes=2012610,mode=755)
tmpfs on /run type tmpfs (rw,nosuid,nodev,relatime,size=1610408k,mode=755)
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime)
cgroup_root on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,size=10240k,mode=755)
tmpfs on /var/tmp/portage type tmpfs (rw,size=12G)
tmpfs on /usr/portage type tmpfs (rw,size=12G)
tmpfs on /usr/src type tmpfs (rw,size=12G)
tmpfs on /tmp type tmpfs (rw,size=12G)
tmpfs on /home/woody/.mutt/cache type tmpfs (rw)

tank woody # cat /etc/fstab
# /etc/fstab: static file system information.
#
# noatime turns off atimes for increased performance (atimes normally aren't
# needed); notail increases performance of ReiserFS (at the expense of storage
# efficiency).  It's safe to drop the noatime options if you want and to
# switch between notail / tail freely.
#
# The root filesystem should have a pass number of either 0 or 1.
# All other filesystems should have a pass number of 0 or greater than 1.
#
# See the manpage fstab(5) for more information.
#

# <fs>            <mountpoint>    <type> <opts>        <dump/pass>

# NOTE: If your BOOT partition is ReiserFS, add the notail option to opts.
/dev/sda1        /            ext4 noatime,discard,user_xattr    0 1
/dev/sda3        /home            ext4 noatime,discard,user_xattr    0 1
#/dev/sda6        /home            ext3        noatime        0 1
#/dev/sda2        none            swap        sw        0 0
tmpfs            /var/tmp/portage    tmpfs        size=12G 0 0
tmpfs            /usr/portage        tmpfs        size=12G 0 0
tmpfs            /usr/src        tmpfs        size=12G        0 0
tmpfs            /tmp            tmpfs        size=12G        0 0
tmpfs            /home/woody/.mutt/cache/    tmpfs size=12G        0 0
#/dev/cdrom        /mnt/cdrom        auto        noauto,ro    0 0
#/dev/fd0        /mnt/floppy        auto        noauto        0 0
tank woody #

For the /usr/portage directory, if you reboot, all you have to do is run 
emerge-webrsync, or do what I do:
tank woody # l /usr/ | grep portage
  924221  4 drwxr-xr-x 170 root root  4096  1 mars  02:51 portage_tmpfs
    6771  0 drwxr-xr-x 171 root root  3500 11 juin  20:40 portage
tank woody #

/usr/portage_tmpfs is a backup of /usr/portage; this saves me from having to 
retrieve the whole portage tree from Gentoo's servers again.
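
Something like this would do the copy in both directions (a rough sketch; 
adjust the paths to your setup):

  # after boot: repopulate the tmpfs from the on-disk backup
  rsync -a --delete /usr/portage_tmpfs/ /usr/portage/
  # now and then (e.g. from cron): refresh the on-disk backup
  rsync -a --delete /usr/portage/ /usr/portage_tmpfs/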

Please note that I also use www-misc/profile-sync-daemon in order to 
store my browsers' cache in /tmp.

I rarely shut down my computer :)
Have fun

On 19/06/2014 04:36, microcai wrote:
> rsync does a bunch of 4k random I/O when updating the portage tree,
> which will kill SSDs through a much higher write amplification factor.
>
>
> I have a two-year-old SSD that reports a write amplification factor
> of 26. I think the only reason is that I put the portage tree on this SSD
> to speed it up.
>
> What is the suggested way to reduce the write amplification of a portage sync?
>




* [gentoo-user] Re: [Gentoo-User] emerge --sync likely to kill SSD?
  2014-06-19  2:36 [gentoo-user] [Gentoo-User] emerge --sync likely to kill SSD? microcai
  2014-06-19  8:40 ` Amankwah
  2014-06-19 22:03 ` Full Analyst
@ 2014-06-20 17:48 ` Kai Krakow
  2014-06-21  4:54   ` microcai
  2014-06-21 14:27   ` Peter Humphrey
  2 siblings, 2 replies; 17+ messages in thread
From: Kai Krakow @ 2014-06-20 17:48 UTC (permalink / raw
  To: gentoo-user

microcai <microcai@fedoraproject.org> wrote:

> rsync does a bunch of 4k random I/O when updating the portage tree,
> which will kill SSDs through a much higher write amplification factor.
> 
> 
> I have a two-year-old SSD that reports a write amplification factor
> of 26. I think the only reason is that I put the portage tree on this SSD
> to speed it up.

Use a file system that turns random writes into sequential writes, like the 
fairly new f2fs. You could try using it for your rootfs, but for now I 
suggest just creating a separate partition for it and either mounting it as 
/usr/portage or making /usr/portage a symlink into it (that way you could 
also use the partition for other things that generate short random writes, 
like log files).

Then I'd recommend changing your scheduler to deadline, bumping the I/O 
queue depth up to a much higher value (echo -n 2048 > 
/sys/block/sdX/queue/nr_requests), and telling the dirty page flusher not 
to run as early as it usually would (set vm.dirty_writeback_centisecs to 
1500 and vm.dirty_expire_centisecs to 3000). That way the VFS layer has a 
better chance to coalesce multi-block writes into one batch write, and f2fs 
will take care of writing them out in sequential order.

I'd also suggest not using the discard mount option and instead creating a 
cronjob that runs fstrim on the SSD devices. But YMMV.

As a safety measure, only ever partition and use 70-80% of your SSD so it 
can reliably do its wear-leveling. That will improve lifetime and keep 
performance up even with full filesystems.
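
Put together, it could look roughly like this (device names and the cron 
path are just examples, adapt them to your box):

  # separate f2fs partition for the portage tree
  mkfs.f2fs -l portage /dev/sdX4
  mount -t f2fs -o noatime /dev/sdX4 /usr/portage

  # scheduler and queue depth
  echo deadline > /sys/block/sdX/queue/scheduler
  echo -n 2048 > /sys/block/sdX/queue/nr_requests

  # relax the dirty page flusher (persist via /etc/sysctl.conf if you like)
  sysctl -w vm.dirty_writeback_centisecs=1500
  sysctl -w vm.dirty_expire_centisecs=3000

  # periodic trim instead of the discard mount option
  printf '#!/bin/sh\nfstrim -v /\n' > /etc/cron.weekly/fstrim
  chmod +x /etc/cron.weekly/fstrim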

-- 
Replies to list only preferred.




* Re: [gentoo-user] Re: [Gentoo-User] emerge --sync likely to kill SSD?
  2014-06-20 17:48 ` [gentoo-user] " Kai Krakow
@ 2014-06-21  4:54   ` microcai
  2014-06-21 14:27   ` Peter Humphrey
  1 sibling, 0 replies; 17+ messages in thread
From: microcai @ 2014-06-21  4:54 UTC (permalink / raw
  To: gentoo-user

2014-06-21 1:48 GMT+08:00 Kai Krakow <hurikhan77@gmail.com>:
> microcai <microcai@fedoraproject.org> wrote:
>
>> rsync does a bunch of 4k random I/O when updating the portage tree,
>> which will kill SSDs through a much higher write amplification factor.
>>
>>
>> I have a two-year-old SSD that reports a write amplification factor
>> of 26. I think the only reason is that I put the portage tree on this SSD
>> to speed it up.
>
> Use a file system that turns random writes into sequential writes, like the
> fairly new f2fs. You could try using it for your rootfs, but for now I
> suggest just creating a separate partition for it and either mounting it as
> /usr/portage or making /usr/portage a symlink into it (that way you could
> also use the partition for other things that generate short random writes,
> like log files).
>
> Then I'd recommend changing your scheduler to deadline, bumping the I/O
> queue depth up to a much higher value (echo -n 2048 >
> /sys/block/sdX/queue/nr_requests), and telling the dirty page flusher not
> to run as early as it usually would (set vm.dirty_writeback_centisecs to
> 1500 and vm.dirty_expire_centisecs to 3000). That way the VFS layer has a
> better chance to coalesce multi-block writes into one batch write, and f2fs
> will take care of writing them out in sequential order.
>
> I'd also suggest not using the discard mount option and instead creating a
> cronjob that runs fstrim on the SSD devices. But YMMV.
>
> As a safety measure, only ever partition and use 70-80% of your SSD so it
> can reliably do its wear-leveling. That will improve lifetime and keep
> performance up even with full filesystems.
>
> --


Many thanks to all of you!

Now I've put my portage tree on an F2FS partition.



* Re: [gentoo-user] Re: [Gentoo-User] emerge --sync likely to kill SSD?
  2014-06-20 17:48 ` [gentoo-user] " Kai Krakow
  2014-06-21  4:54   ` microcai
@ 2014-06-21 14:27   ` Peter Humphrey
  2014-06-21 14:54     ` Rich Freeman
  2014-06-21 19:24     ` Kai Krakow
  1 sibling, 2 replies; 17+ messages in thread
From: Peter Humphrey @ 2014-06-21 14:27 UTC (permalink / raw
  To: gentoo-user

On Friday 20 June 2014 19:48:14 Kai Krakow wrote:
> microcai <microcai@fedoraproject.org> wrote:
> > rsync does a bunch of 4k random I/O when updating the portage tree,
> > which will kill SSDs through a much higher write amplification factor.
> > 
> > I have a two-year-old SSD that reports a write amplification factor
> > of 26. I think the only reason is that I put the portage tree on this SSD
> > to speed it up.
> 
> Use a file system that turns random writes into sequential writes, like
> the fairly new f2fs. You could try using it for your rootfs, but for now I
> suggest just creating a separate partition for it and either mounting it
> as /usr/portage or making /usr/portage a symlink into it (that way you
> could also use the partition for other things that generate short random
> writes, like log files).

Well, there's a surprise! Thanks for mentioning f2fs. I've just converted my 
Atom box's seven partitions to it, recompiled the kernel to include it, 
changed the fstab entries and rebooted. It just worked.

--->8

> I'd also suggest not using the discard mount option and instead creating a
> cronjob that runs fstrim on the SSD devices. But YMMV.

I found that fstrim can't work on f2fs file systems. I don't know whether 
discard works yet.

Thanks again.

-- 
Regards
Peter




* Re: [gentoo-user] Re: [Gentoo-User] emerge --sync likely to kill SSD?
  2014-06-21 14:27   ` Peter Humphrey
@ 2014-06-21 14:54     ` Rich Freeman
  2014-06-21 19:19       ` [gentoo-user] " Kai Krakow
  2014-06-21 19:24     ` Kai Krakow
  1 sibling, 1 reply; 17+ messages in thread
From: Rich Freeman @ 2014-06-21 14:54 UTC (permalink / raw
  To: gentoo-user

On Sat, Jun 21, 2014 at 10:27 AM, Peter Humphrey <peter@prh.myzen.co.uk> wrote:
>
> I found that fstrim can't work on f2fs file systems. I don't know whether
> discard works yet.

Fstrim is to be preferred over discard in general.  However, I suspect
neither is needed for something like f2fs.  Being log-based it doesn't
really overwrite data in place.  I suspect that it waits until an
entire region of the disk is unused and then it TRIMs the whole
region.

However, I haven't actually used it and only know the little I've read
about it.  That is the principle of a log-based filesystem.

I'm running btrfs on my SSD root, which is supposed to be decent for
flash, but the SMART attributes of my drive aren't well-documented so
I couldn't tell you what the erase count is up to.
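
For what it's worth, smartmontools will at least dump the raw attribute 
table; the attribute names vary by vendor, so treat this as a sketch:

  smartctl -A /dev/sda    # e.g. Total_LBAs_Written, Wear_Leveling_Count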

Rich



* [gentoo-user] Re: Re: [Gentoo-User] emerge --sync likely to kill SSD?
  2014-06-21 14:54     ` Rich Freeman
@ 2014-06-21 19:19       ` Kai Krakow
  0 siblings, 0 replies; 17+ messages in thread
From: Kai Krakow @ 2014-06-21 19:19 UTC (permalink / raw
  To: gentoo-user

Rich Freeman <rich0@gentoo.org> wrote:

> On Sat, Jun 21, 2014 at 10:27 AM, Peter Humphrey <peter@prh.myzen.co.uk>
> wrote:
>>
>> I found that fstrim can't work on f2fs file systems. I don't know whether
>> discard works yet.
> 
> Fstrim is to be preferred over discard in general.  However, I suspect
> neither is needed for something like f2fs.  Being log-based it doesn't
> really overwrite data in place.  I suspect that it waits until an
> entire region of the disk is unused and then it TRIMs the whole
> region.

F2fs prefers to fill an entire erase block before touching the next. It also 
tries to coalesce small writes into 16k blocks before submitting them to 
disk. And according to the docs it supports trim/discard internally.

> However, I haven't actually used it and only know the little I've read
> about it.  That is the principle of a log-based filesystem.

There's an article at LWN [1], and in the comments you can find some 
important information about the technical details.

Posted Oct 11, 2012 21:11 UTC (Thu) by arnd:
| * Wear leveling usually works by having a pool of available erase blocks
|   in the drive. When you write to a new location, the drive takes one block
|   out of that pool and writes the data there. When the drive thinks you
|   are done writing to one block, it cleans up any partially written data
|   and puts a different block back into the pool.
| * f2fs tries to group writes into larger operations of at least page size
|   (16KB or more) to be efficient, current FTLs are horribly bad at 4KB
|   page size writes. It also tries to fill erase blocks (multiples of 2MB)
|   in the order that the devices can handle.
| * logfs actually works on block devices but hasn't been actively worked on
|   over the last few years. f2fs also promises better performance by using
|   only 6 erase blocks concurrently rather than 12 in the case of logfs. A
|   lot of the underlying principles are the same though.
| * The "industry" is moving away from raw flash interfaces towards eMMC and
|   related technologies (UFS, SD, ...). We are not going back to raw flash
|   any time soon, which is unfortunate for a number of reasons but also has
|   a few significant advantages. Having the FTL take care of bad block
|   management and wear leveling is one such advantage, at least if they get
|   it right.

According to Wikipedia [2], some more interesting features are on the way, 
like compression and data deduplication to lower the impact of writes.
 
[1]: http://lwn.net/Articles/518988/
[2]: http://en.wikipedia.org/wiki/F2FS

-- 
Replies to list only preferred.




* [gentoo-user] Re: Re: [Gentoo-User] emerge --sync likely to kill SSD?
  2014-06-21 14:27   ` Peter Humphrey
  2014-06-21 14:54     ` Rich Freeman
@ 2014-06-21 19:24     ` Kai Krakow
  2014-06-22  1:40       ` Rich Freeman
  1 sibling, 1 reply; 17+ messages in thread
From: Kai Krakow @ 2014-06-21 19:24 UTC (permalink / raw
  To: gentoo-user

Peter Humphrey <peter@prh.myzen.co.uk> wrote:

> On Friday 20 June 2014 19:48:14 Kai Krakow wrote:
>> microcai <microcai@fedoraproject.org> wrote:
>> > rsync does a bunch of 4k random I/O when updating the portage tree,
>> > which will kill SSDs through a much higher write amplification factor.
>> > 
>> > I have a two-year-old SSD that reports a write amplification factor
>> > of 26. I think the only reason is that I put the portage tree on this SSD
>> > to speed it up.
>> 
>> Use a file system that turns random writes into sequential writes, like
>> the fairly new f2fs. You could try using it for your rootfs, but for now
>> I suggest just creating a separate partition for it and either mounting
>> it as /usr/portage or making /usr/portage a symlink into it (that way
>> you could also use the partition for other things that generate short
>> random writes, like log files).
> 
> Well, there's a surprise! Thanks for mentioning f2fs. I've just converted
> my Atom box's seven partitions to it, recompiled the kernel to include it,
> changed the fstab entries and rebooted. It just worked.

It's said to be twice as fast with some workloads (especially write 
workloads). Can you confirm that? I haven't tried it that much yet - usually 
I use it only for pendrives. I have no experience using it for rootfs.

And while we are at it, I'd also like to mention bcache. Tho, conversion is 
not straightforward. However, I'm going to try that soon for my spinning 
rust btrfs.

-- 
Replies to list only preferred.




* Re: [gentoo-user] Re: Re: [Gentoo-User] emerge --sync likely to kill SSD?
  2014-06-21 19:24     ` Kai Krakow
@ 2014-06-22  1:40       ` Rich Freeman
  2014-06-22 11:44         ` [gentoo-user] " Kai Krakow
  0 siblings, 1 reply; 17+ messages in thread
From: Rich Freeman @ 2014-06-22  1:40 UTC (permalink / raw
  To: gentoo-user

On Sat, Jun 21, 2014 at 3:24 PM, Kai Krakow <hurikhan77@gmail.com> wrote:
> And while we are at it, I'd also like to mention bcache. Tho, conversion is
> not straightforward. However, I'm going to try that soon for my spinning
> rust btrfs.

I contemplated that, but I'd really like to see btrfs support
something more native.  Bcache is way too low-level for me and strikes
me as inefficient as a result.  Plus, since it sits UNDER btrfs you'd
probably lose all the fancy volume management features.

ZFS has ssd caching as part of the actual filesystem, and that seems
MUCH cleaner.

Rich



* [gentoo-user] Re: Re: Re: [Gentoo-User] emerge --sync likely to kill SSD?
  2014-06-22  1:40       ` Rich Freeman
@ 2014-06-22 11:44         ` Kai Krakow
  2014-06-22 13:44           ` Rich Freeman
  0 siblings, 1 reply; 17+ messages in thread
From: Kai Krakow @ 2014-06-22 11:44 UTC (permalink / raw
  To: gentoo-user

Rich Freeman <rich0@gentoo.org> wrote:

> On Sat, Jun 21, 2014 at 3:24 PM, Kai Krakow <hurikhan77@gmail.com> wrote:
>> And while we are at it, I'd also like to mention bcache. Tho, conversion
>> is not straightforward. However, I'm going to try that soon for my
>> spinning rust btrfs.
> 
> I contemplated that, but I'd really like to see btrfs support
> something more native.  Bcache is way too low-level for me and strikes
> me as inefficient as a result.  Plus, since it sits UNDER btrfs you'd
> probably lose all the fancy volume management features.

I don't see where you could lose the volume management features. You just 
add the device on top of the bcache device after you have initialized the 
raw device with a bcache superblock and attached it. The rest works the 
same, just that you use bcacheX instead of sdX devices.
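
For the record, the sequence is roughly this (device names invented; check 
the bcache docs before copying anything):

  make-bcache -C /dev/sdc1       # SSD partition becomes the cache device
  make-bcache -B /dev/sdb1       # HDD partition becomes a backing device
  # attach the backing device to the cache set (UUID printed by make-bcache -C)
  echo <cset-uuid> > /sys/block/bcache0/bcache/attach
  # then hand the bcache device to btrfs instead of the raw partition
  mkfs.btrfs /dev/bcache0
  btrfs device add /dev/bcache1 /mnt/pool   # or grow an existing pool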

Bcache is a general approach, and it seems to work very well for that 
already. There are hot data tracking patches and proposals to support adding 
a cache device to the btrfs pool and letting btrfs migrate data back and 
forth between the two. That would be native. But it would still lack the 
advanced features ZFS implements to make use of such caching devices, 
implementing different strategies even for ZIL, ARC, and L2ARC. That's the 
gap bcache tries to jump.
 
> ZFS has ssd caching as part of the actual filesystem, and that seems
> MUCH cleaner.

Yes, it is much more mature in that regard. Compared with ZFS, bcache is a 
lot like ZIL, while hot data relocation in btrfs would be a lot like L2ARC. 
ARC is a special-purpose RAM cache, separate from the VFS caches, which has 
special knowledge about ZFS structures to keep performance high. Some 
filesystems implement something similar by keeping tree structures 
completely in RAM. I think both bcache and hot data tracking take over parts 
of the work that ARC does for ZFS - note that "hot data tracking" is a 
generic VFS interface, while "hot data relocation" is something from btrfs. 
They work together, but it is not there yet.

From that point of view, I don't think something like ZIL should be 
implemented in btrfs itself but as a generic approach like bcache so every 
component in Linux can make use of it. Hot data relocation OTOH is 
interesting from another point of view and may become part of future btrfs 
as it benefits from knowledge about the filesystem itself, using a generic 
interface like "hot data tracking" in VFS - so other components can make use 
of that, too.

A ZIL-like cache and hot data relocation could probably solve a lot of 
fragmentation issues (especially a ZIL-like cache), so I hope work for that 
will get pushed a little more soon.

Having to prepare devices for bcache is kind of a show-stopper because it is 
no drop-in component that way. But OTOH I like that approach better than dm-
cache because it protects from using the backing device without going 
through the caching layer which could otherwise severely damage your data, 
and you get along with fewer devices and don't need to size a meta device 
(which probably needs to grow later if you add devices, I don't know).

-- 
Replies to list only preferred.




* Re: [gentoo-user] Re: Re: Re: [Gentoo-User] emerge --sync likely to kill SSD?
  2014-06-22 11:44         ` [gentoo-user] " Kai Krakow
@ 2014-06-22 13:44           ` Rich Freeman
  2014-06-24 18:34             ` [gentoo-user] " Kai Krakow
  0 siblings, 1 reply; 17+ messages in thread
From: Rich Freeman @ 2014-06-22 13:44 UTC (permalink / raw
  To: gentoo-user

On Sun, Jun 22, 2014 at 7:44 AM, Kai Krakow <hurikhan77@gmail.com> wrote:
> I don't see where you could lose the volume management features. You just
> add the device on top of the bcache device after you have initialized the
> raw device with a bcache superblock and attached it. The rest works the
> same, just that you use bcacheX instead of sdX devices.

Ah, didn't realize you could attach/remove devices to bcache later.
Presumably it handles device failures gracefully, ie exposing them to
the underlying filesystem so that it can properly recover?

>
> From that point of view, I don't think something like ZIL should be
> implemented in btrfs itself but as a generic approach like bcache so every
> component in Linux can make use of it. Hot data relocation OTOH is
> interesting from another point of view and may become part of future btrfs
> as it benefits from knowledge about the filesystem itself, using a generic
> interface like "hot data tracking" in VFS - so other components can make use
> of that, too.

The only problem with doing stuff like this at a lower level (both
write and read caching) is that it isn't RAID-aware.  If you write
10GB of data, you use 20GB of cache to do it if you're mirrored,
because the cache doesn't know about mirroring.  Offhand I'm not sure
if there are any performance penalties as well around the need for
barriers/etc with the cache not being able to be relied on to do the
right thing in terms of what gets written out - also, the data isn't
redundant while it is on the cache, unless you mirror the cache.
Granted, if you're using it for write intent logging then there isn't
much getting around that.

> Having to prepare devices for bcache is kind of a show-stopper because it is
> no drop-in component that way. But OTOH I like that approach better than dm-
> cache because it protects from using the backing device without going
> through the caching layer which could otherwise severely damage your data,
> and you get along with fewer devices and don't need to size a meta device
> (which probably needs to grow later if you add devices, I don't know).

And this is the main thing keeping me away from it.  It is REALLY
painful to migrate to/from.  Having it integrated into the filesystem
delivers all the same benefits of not being able to mount it without
the cache present.

Now excuse me while I go fix my btrfs (I tried re-enabling snapper and
it again got the filesystem into a worked-up state after trying to
clean up half a dozen snapshots at the same time - it works fine until
you go and try to write a lot of data to it, then it stops syncing
though you don't necessarily notice until a few hours later when the
write cache exhausts RAM and on reboot your disk reverts back a few
hours).  I suspect that if I just treat it gently for a few hours
btrfs will clean up the mess and it will work normally again, but the
damage apparently persists after a reboot if you go heavy in the disk
too quickly...

Rich



* [gentoo-user] Re: Re: Re: Re: [Gentoo-User] emerge --sync likely to kill SSD?
  2014-06-22 13:44           ` Rich Freeman
@ 2014-06-24 18:34             ` Kai Krakow
  2014-06-24 20:01               ` Rich Freeman
  0 siblings, 1 reply; 17+ messages in thread
From: Kai Krakow @ 2014-06-24 18:34 UTC (permalink / raw
  To: gentoo-user

Rich Freeman <rich0@gentoo.org> wrote:

> On Sun, Jun 22, 2014 at 7:44 AM, Kai Krakow <hurikhan77@gmail.com> wrote:
>> I don't see where you could lose the volume management features. You just
>> add the device on top of the bcache device after you have initialized the
>> raw device with a bcache superblock and attached it. The rest works the
>> same, just that you use bcacheX instead of sdX devices.
> 
> Ah, didn't realize you could attach/remove devices to bcache later.
> Presumably it handles device failures gracefully, ie exposing them to
> the underlying filesystem so that it can properly recover?

I'm not sure if multiple partitions can share the same cache device 
partition but more or less that's it: Initialize bcache, then attach your 
backing devices, then add those bcache devices to your btrfs.

I don't know how errors are handled, tho. But as with every caching 
technique (even in ZFS) your data is likely toast if the cache device dies 
in the middle of the action. Thus, you should put bcache on LVM RAID if you 
are going to use it for write caching (i.e. write-back mode). Read caching 
should be okay (write-through mode). Bcache is a little slower than other 
flash-cache implementations because it only reports data as written back to 
the FS once it has reached stable storage (which can be the cache device, 
tho, if you are using write-back mode). It was also designed with unexpected 
reboots in mind: it will replay transactions from its log on reboot. This 
means you can have unstable data on the raw device, which is why you should 
never try to use it directly, e.g. from a rescue disk. But since bcache 
wraps the partition with its own superblock, this mistake should be 
impossible.
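
For reference, the mode is just a sysfs knob on the assembled device 
(bcache0 being an example name):

  cat /sys/block/bcache0/bcache/cache_mode    # e.g. [writethrough] writeback ...
  echo writeback > /sys/block/bcache0/bcache/cache_mode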

I'm not sure how gracefully device failures are handled. I suppose in write-
back mode you can get into trouble because it's too late for bcache to tell 
the FS that there is a write error when it already confirmed that stable 
storage has been hit. Maybe it will just keep the data around so you could 
swap devices or will report the error next time when data is written to that 
location. It probably interferes with btrfs RAID logic on that matter.

> The only problem with doing stuff like this at a lower level (both
> write and read caching) is that it isn't RAID-aware.  If you write
> 10GB of data, you use 20GB of cache to do it if you're mirrored,
> because the cache doesn't know about mirroring.

Yes, it will write double the data to the cache then - but only if btrfs 
also did actually read both copies (which it probably does not because it 
has checksums and does not need to compare data, and let's just ignore the 
case that another process could try to read the same data from the other 
raid member later, that case should become optimized-out by the OS cache). 
Otherwise both caches should work pretty individually with their own set of 
data depending on how btrfs uses each device individually. Remember that 
btrfs raid is not a block-based raid where block locations would match 1:1 
on each device. Btrfs raid can place one mirror of data in two completely 
different locations on each member device (which is actually a good thing in 
case block errors accumulate in specific locations for a "faulty" model of a 
disk). In case of write caching it will of course cache double the data 
(because both members will be written to). But I think that's okay for the 
same reasons, except it will wear your cache device faster. But in that case 
I suggest using individual SSDs for each btrfs member device anyway. It's 
not optimal, I know. Could be useful to see some best practices and 
pros/cons on that topic (individual cache device per btrfs member vs. bcache 
on LVM RAID with bcache partitions on the RAID for all members). I think the 
best strategy depends on if you are write-most or read-most.

Thanks for mentioning. Interesting thoughts. ;-)

> Offhand I'm not sure
> if there are any performance penalties as well around the need for
> barriers/etc with the cache not being able to be relied on to do the
> right thing in terms of what gets written out - also, the data isn't
> redundant while it is on the cache, unless you mirror the cache.

This is partially what I outlined above. I think in the case of write 
caching, no barrier pass-through is needed. Bcache will confirm the barriers and 
that's all the FS needs to know (because bcache is supervising the FS, all 
requests go through the bcache layer, no direct access to the backing 
device). Of course, it's then bcache's job to ensure everything gets written 
out correctly in the background (whenever it feels to do so). But it can use 
its own write-barriers to ensure that for the underlying device - that's 
nothing the FS has to care about. Performance should be faster anyway 
because, well, you are writing to a faster device - that is what bcache is 
all about, isn't it? ;-)

I don't think write barriers are needed for read caching, at least not from 
the point of view of the FS. The caching layer, tho, will use them internally 
for its caching structures. Whether that has a bad effect on performance 
probably depends on the implementation, but my intuition says: no 
performance impact, because putting read data in the cache can be deferred 
and the data then written in the background (write-behind).

> Granted, if you're using it for write intent logging then there isn't
> much getting around that.

Well, sure for bcache. But I think in case of FS-internal write caching 
devices that case could be handled gracefully (the method which you'd 
prefer). Since in the internal case the cache has knowledge about the FS bad 
block handling, it can just retry writing data to another location/disk or 
keep it around until the admin fixed the problem with the backing device.

BTW: SSD firmwares usually suffer from problems similar to those outlined 
above, because they do writes in the background after they have already 
confirmed persistence to the OS layer. This is why SSD failures are usually 
much more severe than HDD failures. Do some research, and you should find 
tests about that topic. Consumer SSD firmwares especially have a big problem 
with that. So I'm not sure if it really should be bcache's job to fix that 
particular problem. You should just ensure good firmware and proper failure 
protection at the hardware level if you want to do fancy caching stuff - the 
FTL should be able to hide those problems before the whole thing explodes, 
then report errors before it can no longer ensure correct persistence. I 
suppose that is also where enterprise-grade SSDs behave differently. HDDs 
have related issues (SATA vs enterprise SCSI vs SAS, keywords: IO timeouts 
and bad blocks, and why you should not use consumer hardware for RAIDs). I 
think all the same holds true for ZFS.

>> Having to prepare devices for bcache is kind of a show-stopper because it
>> is no drop-in component that way. But OTOH I like that approach better
>> than dm- cache because it protects from using the backing device without
>> going through the caching layer which could otherwise severely damage
>> your data, and you get along with fewer devices and don't need to size a
>> meta device (which probably needs to grow later if you add devices, I
>> don't know).
> 
> And this is the main thing keeping me away from it.  It is REALLY
> painful to migrate to/from.  Having it integrated into the filesystem
> delivers all the same benefits of not being able to mount it without
> the cache present.

The migration pain is what currently keeps me away, too. Otherwise I would 
just buy one of those fancy new cheap but still speedy Crucial SSDs and 
"just enable" bcache... :-\

> Now excuse me while I go fix my btrfs (I tried re-enabling snapper and
> it again got the filesystem into a worked-up state after trying to
> clean up half a dozen snapshots at the same time - it works fine until
> you go and try to write a lot of data to it, then it stops syncing
> though you don't necessarily notice until a few hours later when the
> write cache exhausts RAM and on reboot your disk reverts back a few
> hours).  I suspect that if I just treat it gently for a few hours
> btrfs will clean up the mess and it will work normally again, but the
> damage apparently persists after a reboot if you go heavy in the disk
> too quickly...

You should report that to the btrfs list. You could try to "echo w > 
/proc/sysrq-trigger" and look at the blocked processes list in dmesg 
afterwards. I'm sure one important btrfs thread is in blocked state then...

-- 
Replies to list only preferred.




* Re: [gentoo-user] Re: Re: Re: Re: [Gentoo-User] emerge --sync likely to kill SSD?
  2014-06-24 18:34             ` [gentoo-user] " Kai Krakow
@ 2014-06-24 20:01               ` Rich Freeman
  0 siblings, 0 replies; 17+ messages in thread
From: Rich Freeman @ 2014-06-24 20:01 UTC (permalink / raw
  To: gentoo-user

On Tue, Jun 24, 2014 at 2:34 PM, Kai Krakow <hurikhan77@gmail.com> wrote:
> I'm not sure if multiple partitions can share the same cache device
> partition but more or less that's it: Initialize bcache, then attach your
> backing devices, then add those bcache devices to your btrfs.

Ah, if you are stuck with one bcache partition per cached device then
that will be fairly painful to manage.

> Yes, it will write double the data to the cache then - but only if btrfs
> also did actually read both copies (which it probably does not because it
> has checksums and does not need to compare data, and lets just ignore the
> case that another process could try to read the same data from the other
> raid member later, that case should become optimized-out by the OS cache).

I didn't realize you were proposing read caching only.  If you're only
caching reads then obviously that is much safer.  I think with btrfs
in raid1 mode with only two devices you can tell it to prefer a
particular device for reading in which case you could just bcache that
drive.  It would only read from the other drive if the cache failed.

However, I don't think btrfs lets you manually arrange drives into
array-like structures.  It auto-balances everything which is usually a
plus, but if you have 30 disks you can't tell it to treat them as 6x
5-disk RAID5s vs one 30-disk raid5 (I think).

Rich



end of thread

Thread overview: 17+ messages
2014-06-19  2:36 [gentoo-user] [Gentoo-User] emerge --sync likely to kill SSD? microcai
2014-06-19  8:40 ` Amankwah
2014-06-19 11:44   ` Neil Bothwick
2014-06-19 11:56     ` Rich Freeman
2014-06-19 12:16       ` Kerin Millar
2014-06-19 22:03 ` Full Analyst
2014-06-20 17:48 ` [gentoo-user] " Kai Krakow
2014-06-21  4:54   ` microcai
2014-06-21 14:27   ` Peter Humphrey
2014-06-21 14:54     ` Rich Freeman
2014-06-21 19:19       ` [gentoo-user] " Kai Krakow
2014-06-21 19:24     ` Kai Krakow
2014-06-22  1:40       ` Rich Freeman
2014-06-22 11:44         ` [gentoo-user] " Kai Krakow
2014-06-22 13:44           ` Rich Freeman
2014-06-24 18:34             ` [gentoo-user] " Kai Krakow
2014-06-24 20:01               ` Rich Freeman
