From: Duncan <1i5t5.duncan@cox.net>
To: gentoo-amd64@lists.gentoo.org
Subject: [gentoo-amd64] Re: Systemd migration: opinion and questions
Date: Thu, 26 Feb 2015 01:55:51 +0000 (UTC)
Message-ID: <pan$63dac$8fbeccdf$99e3327c$a5c109ed@cox.net>
In-Reply-To: 20150225195632.2dbe5cda@marcec.fritz.box
Marc Joliet posted on Wed, 25 Feb 2015 19:56:32 +0100 as excerpted:
> But regardless of what you use, I think that the worst offenders are
> services that write logs themselves (I'm looking at you, samba).
>> c) I use btrfs for my primary filesystems, and btrfs and journald's
>> binary-format journals don't play so well together. [...]
>
> Well, I'm on an SSD, but even on the laptop I haven't noticed any
> performance issues (yet). Then again, I use autodefrag, so that
> probably helps.
Autodefrag does help.
There are two related issues at work, here.
The primary one is that pretty much any COW-based filesystem, including
btrfs, is going to have problems with files of any significant size that
are rewritten internally (as opposed to only being appended to). At the
small end this includes sqlite database files such as those firefox and
other mozilla products use. These, autodefrag manages well.
At the larger end are multi-gig VM images and similarly sized database
files. These, autodefrag doesn't manage so well, particularly if writes
are coming in at any significant rate, because at some point it's going
to take longer to rewrite the entire file (or even the affected normally
one-gig data chunk) than the time between incoming writes.
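To put rough numbers on that race, here is a toy Python model. The
function name, the throughput figure and the write interval are all
illustrative assumptions, not measurements, and real autodefrag behavior
is more involved than a single division.

```python
def defrag_keeps_up(rewrite_bytes, throughput_bytes_per_s, seconds_between_writes):
    """Return True if one rewrite finishes before the next write arrives.

    Toy model only: assumes constant sequential throughput and evenly
    spaced incoming writes, neither of which holds on a real system.
    """
    rewrite_seconds = rewrite_bytes / throughput_bytes_per_s
    return rewrite_seconds < seconds_between_writes


MIB = 1024 ** 2
GIB = 1024 ** 3

# Assumed numbers: 100 MiB/s sustained throughput, one write every 2 seconds.
small_sqlite = defrag_keeps_up(20 * MIB, 100 * MIB, 2.0)  # 0.2 s rewrite
vm_chunk = defrag_keeps_up(1 * GIB, 100 * MIB, 2.0)       # about 10 s rewrite
print(small_sqlite, vm_chunk)  # True False
```

With these made-up figures the small database file is rewritten long
before the next write lands, while the one-gig chunk never catches up.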
And the place where such fragmentation REALLY shows up is when running
btrfs filesystem maintenance commands like balance. On a sufficiently
fragmented filesystem, particularly with quotas on too (their
tracking significantly complicates things), balance can take WEEKS on a
single-digit-terabyte filesystem.
IOW, a lot of people don't notice it until something goes wrong and
they're trying to replace a failed device in one of the btrfs raid
modes, etc. That's a nasty time to find out how tangled things were, and
realize it'll take weeks to sort out, during which another device could
well fail, leaving you high and dry!
The immediate (partial) solution to the problem with these large files,
typically over a gig, is to set them nocow. On btrfs this must be done
at creation time, while the file is still zero-sized, in order to take
proper effect. It is normally accomplished by setting the directory
they'll be in to nocow, which doesn't affect the directory itself, but
does cause any newly created files or subdirs in it to inherit the nocow
attribute.
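As a minimal sketch of the directory approach, the Python below wraps
chattr +C (the No_COW attribute). The helper names and the example path
are hypothetical, and the command runner is injectable so the sketch can
be tried without touching a real btrfs filesystem.

```python
import subprocess


def nocow_dir_command(directory):
    """Build the chattr invocation that marks a directory No_COW (+C)."""
    return ["chattr", "+C", directory]


def prepare_nocow_dir(directory, run=subprocess.run):
    """Set +C on an existing, ideally still empty, directory.

    Files created in it afterwards inherit the attribute; files that
    already hold data are not converted by this.
    `run` is injectable so the sketch can be exercised without btrfs.
    """
    return run(nocow_dir_command(directory), check=True)


if __name__ == "__main__":
    # Dry run: print the command instead of executing it.
    # The path is a made-up example, not a recommended location.
    print(" ".join(nocow_dir_command("/var/lib/vm-images")))
```

The point of doing it on the directory first is that the image or
database file is then born nocow, rather than being flagged after it
already contains data.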
And this is actually what systemd-219 is doing with the journal files now.
But setting nocow automatically disables both transparent compression
(if otherwise enabled) and checksumming. The latter isn't actually as
bad as one might expect, because most applications (including systemd/
journald) that deal with such files already have some sort of builtin
corruption detection and possibly repair functionality -- they have to in
order to work acceptably on traditional filesystems that don't do
filesystem-level checksumming -- and letting them have at it would indeed
seem to be the best policy in this case.
The second, related problem is snapshotting. Because snapshotting
relies on COW, snapshotting a nocow file forces it to effectively cow-1
-- the first time a block is rewritten after a snapshot, it is cowed,
despite the ordinary nocow. Now set up, say, hourly auto-snapshotting using
snapper or the like, and continue to write to that "nocow" file, and
pretty soon it'll be as fragmented as if it weren't nocow at all!
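The cow-1 effect can be illustrated with a small simulation. This is a
toy model under stated assumptions (uniformly random block rewrites, a
snapshot every fixed number of writes, one relocation per first rewrite
of a block after a snapshot); the function and parameter names are made
up for the example and say nothing about btrfs internals.

```python
import random


def relocated_blocks(num_blocks, total_writes, writes_between_snapshots=None, seed=0):
    """Count block relocations for a nocow file under periodic snapshots.

    Toy model: a snapshot pins every block, so the first rewrite of a
    block after it must be cowed (relocated); later rewrites of the same
    block, before the next snapshot, happen in place.
    """
    rng = random.Random(seed)
    shared = set()          # blocks currently pinned by the latest snapshot
    relocations = 0
    for i in range(total_writes):
        if writes_between_snapshots and i % writes_between_snapshots == 0:
            shared = set(range(num_blocks))   # take a snapshot
        block = rng.randrange(num_blocks)
        if block in shared:                   # first write since the snapshot
            shared.discard(block)
            relocations += 1
    return relocations


never = relocated_blocks(1000, 12000)            # no snapshots at all
frequent = relocated_blocks(1000, 12000, 500)    # a snapshot every 500 writes
every_write = relocated_blocks(1000, 12000, 1)   # worst case
print(never, frequent, every_write)
```

Without snapshots nothing is relocated, and as the snapshot interval
shrinks the count approaches one relocation per write, which is what a
plain cow file would see.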
With careful planning, good admins can generally keep fragmentation from
this source at least within reason: put the nocow files on separate
subvolumes so they aren't snapshotted with the rest of the system,
snapshot the nocow subvolume at the low-frequency end of your target
range (say every other day or weekly instead of daily or twice a day),
and, if the files aren't rotated out regularly, run periodic scripted
btrfs defrags (say weekly or monthly) of the affected files.
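A scripted defrag pass along those lines might look like the sketch
below. It only builds the btrfs filesystem defragment commands for files
above a size threshold; the function name and the threshold are
assumptions for illustration, and actually running the commands, as well
as scheduling them weekly or monthly, is left to the caller.

```python
from pathlib import Path


def defrag_commands(directory, min_bytes):
    """Return one `btrfs filesystem defragment` command per large file.

    Only regular files of at least `min_bytes` under `directory` are
    selected; nothing is executed here.
    """
    commands = []
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file() and path.stat().st_size >= min_bytes:
            commands.append(["btrfs", "filesystem", "defragment", str(path)])
    return commands
```

A cron job or timer could then hand each returned command to
subprocess.run(cmd, check=True), which keeps the file selection easy to
test separately from the part that needs root and a btrfs mount.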
And systemd-219 is actually creating a separate subvolume for its journal
files now, by default, thus keeping them out of the general system (or
/var) snapshot. But while both that and nocowing the journal files do
help, it's still a reasonably fragile solution as long as admins don't
realize what's going on: they can be tempted to set daily or more
frequent snapshotting on the journal subvolume too, or the subvolume may
not take, say because it's an existing installation where there's
already a directory by that name and thus there can't be a subvolume at
the same place with the same name.
**BUT A BIG CAVEAT** lest anyone on stable with btrfs and systemd jump
onto 219 too fast. Yes, 219 DOES have some nice new features.
Unfortunately, it's broken in a few new ways as well.
* Apparently, systemd-219's networkd breaks with at least static IPv4-
only configurations, as my network failed to come up with it. From the
errors it was trying IPv6 and because that failed (it's not even in my
kernel), it gave up and didn't even try IPv4, instead trying to set the
IPv4 IP and gateway values into IPv6, which obviously isn't going to work
at all!
* There are also issues with the new tmpfiles.d configuration that has
replaced d lines (create a directory if it doesn't exist) with new v
lines (create a subvolume if on btrfs and possible, else fall back to d
behavior and create a directory), because subvolume creation fails
differently than directory creation, and the differences aren't all
sorted out yet.
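The intended v-line behavior, as described in the second item, can be
sketched in a few lines of Python. This is not systemd's implementation,
just the fallback logic as I understand it; make_subvolume is a
hypothetical callable standing in for whatever actually creates the
subvolume, assumed to raise OSError when that isn't possible.

```python
import os


def create_v_entry(path, make_subvolume, make_directory=os.makedirs):
    """Create `path` as a subvolume if possible, else as a plain directory.

    Sketch of the "v" semantics described above, not systemd's code.
    `make_subvolume` is a hypothetical callable that raises OSError when
    subvolume creation is not possible (not btrfs, no permission, ...).
    """
    if os.path.isdir(path):
        return "exists"          # like "d": leave an existing directory alone
    try:
        make_subvolume(path)
        return "subvolume"
    except OSError:
        make_directory(path)     # fall back to "d" behavior
        return "directory"
```

The fragile part is exactly the except branch: if subvolume creation
fails in some way the code doesn't treat as "fall back", the entry
errors out instead of ending up as an ordinary directory.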
Hopefully, systemd-220 will fix the IPv4 issue and bring a bit more
maturity to the tmpfiles.d subvolumes-creation feature by properly
falling back to d/directories if need be, instead of erroring out.
Meanwhile, hopefully a gentoo systemd-219-rX release will fix some of
these issues as well. But for right now, I'd suggest staying away from
it, as it's definitely not prime-time ready in its current form.
FWIW, I'm back on 218-r3 for now, done with a quick emerge --usepkgonly
'<systemd-219'. I've not yet masked 219, however, so an update will try to
bring it back in, and I will thus have to see what changes have happened
and either mask it or try building it again, next time I update.
> What's funny though is that the systemd news file
> (http://cgit.freedesktop.org/systemd/systemd/tree/NEWS) occasionally
> refers to non-btrfs file systems as "legacy file systems". At least, as
> a btrfs user I think it's funny :) .
Indeed. They've definitely adopted btrfs and are running with it. If
you've read anything about their plans, the features of btrfs really do
provide a filesystem-side ready-made solution for them to adopt, although
I'd still not call btrfs itself exactly mature -- even more than with
other filesystems, if an admin is putting data on btrfs and doesn't have
tested backups available, they really do NOT value that data, claims to
the contrary notwithstanding.
And in a way, it's good, because systemd pushing it like that means
systemd based distros will be pushing it too, which will bring far wider
deployment of btrfs, ready or not, which will in turn help btrfs mature
faster with all those additional strange-corner-case bug reports and
hopefully fixes. I just feel for the poor admins trusting their distro
as they head into this without the backups they really should have... as
ultimately, a lot of them are unfortunately going to have to learn the
lesson that no backups really DOES mean you'd rather lose that data than
bother with backups, the HARD way! =:^(
--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman