From: Duncan <1i5t5.duncan@cox.net>
To: gentoo-amd64@lists.gentoo.org
Subject: [gentoo-amd64] Re: Systemd migration: opinion and questions
Date: Thu, 26 Feb 2015 01:55:51 +0000 (UTC)
Message-ID: <pan$63dac$8fbeccdf$99e3327c$a5c109ed@cox.net>
In-Reply-To: 20150225195632.2dbe5cda@marcec.fritz.box

Marc Joliet posted on Wed, 25 Feb 2015 19:56:32 +0100 as excerpted:

> But regardless of what you use, I think that the worst offenders are
> services that write logs themselves (I'm looking at you, samba).

>> c) I use btrfs for my primary filesystems, and btrfs and journald's
>> binary-format journals don't play so well together. [...]
> 
> Well, I'm on an SSD, but even on the laptop I haven't noticed any
> performance issues (yet).  Then again, I use autodefrag, so that
> probably helps.

Autodefrag does help.

There are two related issues at work, here.

The primary one is that pretty much any COW-based filesystem, including 
btrfs, is going to have problems with internal-rewrite-pattern (as 
opposed to append-only rewrites) files of any significant size.  At the 
small end this includes sqlite database files such as those firefox and 
other mozilla products use.  These, autodefrag manages well.

At the larger end are multi-gig VM images and similarly sized database 
files.  These, autodefrag doesn't manage so well, particularly if writes 
are coming in at any significant rate, because at some point it's going 
to take longer to rewrite the entire file (or even the affected normally 
one-gig data chunk) than the time between incoming writes.

And the place where such fragmentation REALLY shows up is trying to run 
btrfs filesystem maintenance commands like balance.  On a sufficiently 
fragmented filesystem, particularly with quotas on too, as their tracking 
significantly complicates things, balance can take WEEKS on a 
single-digit-terabyte filesystem.

IOW, a lot of people don't notice it until something goes wrong and 
they're trying to replace a failed device with one of the btrfs raid 
modes, etc.  That's a nasty time to find out how tangled things were, and 
realize it'll take weeks to sort out, during which another device could 
well fail, leaving you high and dry!

The immediate (partial) solution to the problem with these large files, 
typically over a gig, is to set them nocow (which on btrfs must be done 
at creation time, while the file is still zero-sized, in order to take 
proper effect; this is normally accomplished by setting the directory 
they'll be in to nocow, which doesn't affect the directory itself, but 
does cause any newly created files or subdirs in it to inherit the nocow 
attribute).
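
As a rough sketch of that directory-level approach (the path 
/var/lib/vm-images and the helper names are purely illustrative, not 
anything from a real setup), something like the following would set the 
attribute with chattr +C before any files land in the directory, then 
show the result with lsattr -d.  It needs sufficient privileges and a 
filesystem that honors +C:

```python
import os
import subprocess
from pathlib import Path


def nocow_commands(directory):
    """Return the argv lists that mark `directory` nocow and then show its attributes."""
    d = str(Path(directory))
    return [
        ["chattr", "+C", d],  # +C = no copy-on-write; files created inside later inherit it
        ["lsattr", "-d", d],  # -d lists the directory itself rather than its contents
    ]


def make_nocow_dir(directory, run=subprocess.run):
    """Create `directory` if needed and set +C on it before any files are written there."""
    os.makedirs(directory, exist_ok=True)
    for argv in nocow_commands(directory):
        run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical location for large VM images; only prints the commands.
    for argv in nocow_commands("/var/lib/vm-images"):
        print(" ".join(argv))
```

Files that already exist in the directory keep whatever they had; only 
files created after the attribute is set pick it up.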

And this is actually what systemd-219 is doing with the journal files now.

But, setting nocow automatically disables both transparent compression 
(if otherwise enabled) and checksumming.  The latter isn't actually as 
bad as one might expect, because most applications (including systemd/
journald) that deal with such files already have some sort of builtin 
corruption detection and possible repair functionality -- they have to in 
order to work acceptably on traditional filesystems that didn't do 
filesystem level checksumming, and letting them have at it would indeed 
seem to be the best policy in this case.

The second, related problem, is snapshotting.  Because snapshotting 
relies on COW, snapshotting a nocow file forces it to effectively cow-1 
-- the first time a block is rewritten after a snapshot, it is cowed, 
despite the ordinary nocow.  Now set up, say, hourly auto-snapshotting using 
snapper or the like, and continue to write to that "nocow" file, and 
pretty soon it'll be as fragmented as if it weren't nocow at all!
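
To put rough numbers on that, here's a toy model (emphatically NOT how 
btrfs tracks extents internally, just the cow-1 rule as described above, 
with made-up block counts).  Each snapshot pins every block of the file, 
and the first rewrite of a pinned block has to go somewhere new, while 
later rewrites of that block happen in place until the next snapshot:

```python
def relocated_blocks(n_blocks, writes, snapshot_every=None):
    """Count block rewrites that are forced out of place in a toy nocow file.

    n_blocks: number of blocks in the file.
    writes: block indexes rewritten, in order.
    snapshot_every: take a snapshot before every Nth write (None = never).
    """
    shared = set()  # blocks still pinned by the most recent snapshot
    relocated = 0
    for i, block in enumerate(writes):
        if snapshot_every and i % snapshot_every == 0:
            shared = set(range(n_blocks))  # a snapshot pins every current block
        if block in shared:
            relocated += 1  # first rewrite after a snapshot: copied elsewhere (cow-1)
            shared.discard(block)
        # otherwise the block is rewritten in place, as nocow intends
    return relocated


if __name__ == "__main__":
    writes = [0, 1, 2, 3] * 25  # 100 rewrites cycling over a 4-block file
    for every in (None, 50, 4):
        print(every, relocated_blocks(4, writes, snapshot_every=every))
```

With no snapshots nothing moves; the more often the snapshots come 
relative to the writes, the closer the count gets to "every write 
relocates", which is just ordinary cow behavior again.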

With careful planning, separate subvolumes for the nocow files so they 
aren't snapshotted with the rest of the system, snapshotting the nocow 
subvolume with a period near the low frequency end of your target range 
(say every other day or weekly instead of daily or twice a day), and if 
they aren't rotated out regularly, periodic scripted btrfs defrags (say 
weekly or monthly) of the affected files, good admins generally can keep 
fragmentation from this source at least within reason.
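
A minimal sketch of what such a plan might look like when scripted, 
assuming hypothetical paths (/var/lib/vm-images for the nocow subvolume, 
/snapshots/vm-images for its snapshots) and leaving the scheduling to 
cron or a timer.  The function only builds the btrfs command lines; 
running them is left to the caller:

```python
import subprocess


def maintenance_plan(nocow_subvol, snapshot_dir, stamp):
    """Build the btrfs commands for a separately managed nocow subvolume."""
    return {
        # One-time: give the nocow files their own subvolume so that snapshots
        # of the parent subvolume do not include them.
        "create": ["btrfs", "subvolume", "create", nocow_subvol],
        # Low-frequency (say weekly) read-only snapshot of just that subvolume.
        "snapshot": ["btrfs", "subvolume", "snapshot", "-r",
                     nocow_subvol, f"{snapshot_dir}/{stamp}"],
        # Periodic (say weekly or monthly) recursive defrag of the affected files.
        "defrag": ["btrfs", "filesystem", "defragment", "-r", nocow_subvol],
    }


def run_step(plan, step, run=subprocess.run):
    """Run one named step of the plan; raises if the command fails."""
    return run(plan[step], check=True)


if __name__ == "__main__":
    plan = maintenance_plan("/var/lib/vm-images", "/snapshots/vm-images", "2015-02-26")
    for name, argv in plan.items():
        print(name + ":", " ".join(argv))
```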

And systemd-219 is actually creating a separate subvolume for its journal 
files now, by default, thus keeping them out of the general system (or 
/var) snapshot.  But while both that and nocowing the journal files do 
help, it's still a rather fragile solution, as admins who don't realize 
what's going on can be tempted to set daily or more frequent snapshotting 
on the journal subvolume too (or the subvolume may not take, say because 
it's an existing installation where there's already a directory by that 
name and thus there can't be a subvolume at the same place with the same 
name).


**BUT A BIG CAVEAT** lest anyone on stable with btrfs and systemd jump 
onto 219 too fast.  Yes, 219 DOES have some nice new features.  
Unfortunately, it's broken in a few new ways as well.

* Apparently, systemd-219's networkd breaks with at least static 
IPv4-only configurations, as my network failed to come up with it.  
Judging from the errors, it was trying IPv6, and because that failed 
(it's not even in my kernel), it gave up and didn't even try IPv4, 
instead trying to set the IPv4 IP and gateway values into IPv6, which 
obviously isn't going to work at all!

* There are also issues with the new tmpfiles.d configuration that has 
replaced d lines (create a directory if it doesn't exist) with new v 
lines (create a subvolume if on btrfs and possible, else fall back to d 
behavior and create a directory), because subvolume creation fails 
differently than directory creation, and the differences aren't all 
sorted out yet.
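
For illustration only, the v-line behavior as described above (try a 
subvolume, fall back to a plain directory rather than erroring out) 
could be sketched like this.  This is NOT systemd's code, and the 
function and parameter names are made up; the subvolume creator is 
passed in so the fallback logic can be read on its own:

```python
import os
import subprocess


def create_v(path, make_subvolume, make_directory=os.makedirs):
    """Mimic the described 'v' semantics; returns what ended up at `path`."""
    if os.path.exists(path):
        return "exists"  # something is already there, so leave it alone
    try:
        make_subvolume(path)  # works only on btrfs with sufficient privileges
        return "subvolume"
    except (OSError, subprocess.CalledProcessError):
        make_directory(path)  # the 'd' fallback: an ordinary directory
        return "directory"


def btrfs_subvolume_create(path):
    """One possible creator: shell out to the btrfs tool."""
    subprocess.run(["btrfs", "subvolume", "create", path], check=True)
```

The point is simply that both failure modes of the subvolume attempt (no 
btrfs tool at all, or the tool refusing) end in the same directory 
fallback.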

Hopefully, systemd-220 will fix the IPv4 issue and bring a bit more 
maturity to the tmpfiles.d subvolumes-creation feature by properly 
falling back to d/directories if need be, instead of erroring out.  
Meanwhile, hopefully a gentoo systemd-219-rX release will fix some of 
these issues as well.  But for right now, I'd suggest staying away from 
it, as it's definitely not prime-time ready in its current form.

FWIW, I'm back on 218-r3 for now, done with a quick emerge --usepkgonly 
<systemd-219.  I've not yet masked 219, however, so an update will try to 
bring it back in, and I will thus have to see what changes have happened 
and either mask it or try building it again, next time I update.

> What's funny though is that the systemd news file
> (http://cgit.freedesktop.org/systemd/systemd/tree/NEWS) occasionally
> refers to non-btrfs file systems as "legacy file systems".  At least, as
> a btrfs user I think it's funny :) .

Indeed.  They've definitely adopted btrfs and are running with it.  If 
you've read anything about their plans, the features of btrfs really do 
provide a filesystem-side ready-made solution for them to adopt, altho 
I'd still not call btrfs itself exactly mature -- even more than with 
other filesystems, if an admin is putting data on btrfs and doesn't have 
tested backups available, they really do NOT value that data, claims to 
the contrary notwithstanding.

And in a way, it's good, because systemd pushing it like that means 
systemd based distros will be pushing it too, which will bring far wider 
deployment of btrfs, ready or not, which will in turn help btrfs mature 
faster with all those additional strange-corner-case bug reports and 
hopefully fixes.  I just feel for the poor admins trusting their distro 
as they head into this without the backups they really should have... 
ultimately, a lot of them are unfortunately going to have to learn the 
lesson that having no backups really DOES mean you'd rather lose that 
data than bother with backups, the HARD way! =:^(

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


