From mboxrd@z Thu Jan 1 00:00:00 1970
From: Richard Yao
Subject: Re: [gentoo-dev] Re: rfc: Does OpenRC really need mount-ro
Date: Wed, 17 Feb 2016 09:24:44 -0500
To: gentoo-dev@lists.gentoo.org
Reply-To: gentoo-dev@lists.gentoo.org
Message-Id: <4800E3E6-4D70-4DF8-8F40-705C6B77882B@gentoo.org>
References: <20160216180533.GB1450@whubbs1.gaikai.biz> <20160216184129.GB1704@whubbs1.gaikai.biz>

> On Feb 16, 2016, at 9:20 PM, Duncan <1i5t5.duncan@cox.net> wrote:
>
> William Hubbs posted on Tue, 16 Feb 2016 12:41:29 -0600 as excerpted:
>
>> What I'm trying to figure out is, what to do about re-mounting file
>> systems read-only.
>>
>> How does systemd do this? I didn't find an equivalent of the mount-ro
>> service there.
>
> For quite some time now, systemd has actually had a mechanism whereby the
> main systemd process reexecs (with a pivot-root) the initr* systemd and
> returns control to it during the shutdown process, thereby allowing a
> more controlled shutdown than traditional init systems because the final
> stages are actually running from the virtual-filesystem of the initr*,
> such that after everything running on the main root is shut down, the main
> root itself can actually be unmounted, not just mounted read-only,
> because there is literally nothing running on it any longer.
>
> There's still a fallback to read-only mounting if an initr* isn't used or
> if reinvoking the initr* version fails for some reason, but with an
> initr*, when everything's working properly, while there are still some
> bits of userspace running, they're no longer actually running off of the
> main root, so main root can actually be unmounted much like any other
> filesystem.

Systemd installs that go back into the initramfs at shutdown are rare, because there is a hook for the initramfs to tell systemd that it should re-exec it, and very few configurations do that. Even fewer of those that do it actually need it.

The biggest user of that mechanism of which I am aware is ZFS on EL/Fedora when booted with Dracut. It does not need it; the only reason it was implemented was that someone who did not understand how ZFS was designed to integrate with the boot and startup processes thought it was a good idea.

As it turns out, that behavior actually breaks the mechanism intended to make multipath sane: it marks the pool in such a way that all systems with access to the disks are told that a pool that will in fact be used on next boot is not going to be used by anyone. If one of them imports it and the system then boots, the pool can be damaged beyond repair.
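(Aside, on the hook mentioned a couple of paragraphs up: per the systemd initrd interface, an initramfs that wants to be returned to leaves an executable `shutdown` binary in /run/initramfs, and systemd pivots into it at the very end of shutdown if it is present. A rough sketch of that check follows; the path argument exists only so the logic can be exercised outside a real shutdown, and the function name is made up for illustration.)

```shell
# Sketch of the signal the initramfs leaves for systemd's shutdown code:
# an executable /run/initramfs/shutdown means "return to the initramfs
# instead of just remounting the root read-only".
will_return_to_initramfs() {
    # $1 lets the check be tested against an arbitrary path;
    # the real path is /run/initramfs/shutdown.
    hook="${1:-/run/initramfs/shutdown}"
    if [ -x "$hook" ]; then
        echo yes
    else
        echo no
    fi
}

will_return_to_initramfs   # "no" on most installs
```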
Thankfully, no one seems to boot EL/Fedora systems off ZFS pools in multipath environments. The code to hook into this special behavior will be removed in the future, but that is a low priority: none of the developers' employers care about it, and the almost negligible chance that the mechanism would save someone from data loss does not justify any of us spending our free time on it.

> The process is explained a bit better in the copious blogposted systemd
> documentation. Let's see if I can find a link...
>
> OK, this isn't where I originally read about it, which IIRC was aimed
> more at admins, while this is aimed at initr* devs, but that's probably a
> good thing as it includes more specific detail...
>
> https://www.freedesktop.org/wiki/Software/systemd/InitrdInterface/
>
> And here's some more, this time in the storage daemon controlled root and
> initr* context...
>
> https://www.freedesktop.org/wiki/Software/systemd/RootStorageDaemons/
>
>
> But... all that doesn't answer the original question directly, does it?
> Where there's no return to initr*, how /does/ systemd handle read-only
> mounting?
>
> First, the nice ascii-diagram flow charts in the bootup (7) manpage may
> be useful, in particular here, the shutdown diagram (tho IDK if you can
> find such things useful or not??).
>
> https://www.freedesktop.org/software/systemd/man/bootup.html
>
> Here's the shutdown diagram described in words:
>
> Initial shutdown is via two targets (as opposed to specific services),
> shutdown.target, which conflicts with all (normal) system services
> thereby shutting them down, and umount.target, which conflicts with file
> mounts, swaps, cryptsetup devices, etc. Here, we're obviously interested
> in umount.target. Then after those two targets are reached, various
> low-level services are run or stopped, in order to reach final.target.
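(To make the conflict wiring Duncan describes concrete: for an ordinary service, DefaultDependencies=yes adds the shutdown conflict implicitly, so an explicit equivalent would look roughly like the sketch below. The unit and binary names are made up for illustration.)

```ini
[Unit]
Description=Example daemon
# These two lines are what DefaultDependencies=yes (the default)
# adds implicitly, making the unit stop before shutdown.target:
Conflicts=shutdown.target
Before=shutdown.target

[Service]
ExecStart=/usr/bin/example-daemon
```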
> After final.target, the appropriate systemd-(reboot|poweroff|halt|kexec)
> service is run, to hit the ultimate (reboot|poweroff|halt|kexec).target,
> which of course is never actually evaluated, since the service actually
> does the intended action.
>
> The primary takeaway is that you might not be finding a specific systemd
> remount-ro service, because it might be a target, defined in terms of
> conflicts with mount units, etc, rather than a specific service.
>
> Neither shutdown.target nor umount.target have any wants or requires by
> default, but the various normal services and mount units conflict with
> them, either via default or specifically, so are shut down before the
> target can be reached.
>
> final.target has the After=shutdown.target umount.target setting, so
> won't be reached until they are reached.
>
> The respective (reboot|poweroff|halt|kexec).target units Requires= and
> After= their respective systemd-*.service units, and reboot and poweroff
> (but not halt and kexec) have 30-minute timeouts after which they run
> reboot-force or poweroff-force, respectively.
>
> The respective systemd-(reboot|poweroff|halt|kexec).service units
> Requires= and After= shutdown.target, umount.target and final.target, all
> three, so won't be run until those complete. They simply
> ExecStart=/usr/bin/systemctl --force their respective actions.
>
> And here's what the systemd.special (7) manpage says about umount.target:
>
> umount.target
>     A special target unit that umounts all mount and automount points
>     on system shutdown.
>
>     Mounts that shall be unmounted on system shutdown shall add
>     Conflicts dependencies to this unit for their mount unit,
>     which is implicitly done when DefaultDependencies=yes is set
>     (the default).
>
> But that /still/ doesn't reveal what actually does the remount-ro, as
> opposed to umount.
> I don't see that either, at the unit level, nor do I
> see anything related to it in, for instance, my auto-generated-from-fstab
> /run/systemd/generators/-.mount file or in the systemd-fstab-generator
> (8) manpage.
>
> Thus I must conclude that it's actually resolved in the mount-unit
> conflicts handling in systemd's source code, itself.
>
> And indeed... in systemd's tarball, we see in src/core/umount.c, in
> mount_points_list_umount...
>
> That the function actually remounts /everything/ (well, everything not in
> a container) read-only, before actually trying to umount them.
> Indentation restandardized on two-space here, to avoid unnecessary
> wrapping as posted. This is from systemd-228:
>
> static int mount_points_list_umount(MountPoint **head, bool *changed, bool log_error) {
>   MountPoint *m, *n;
>   int n_failed = 0;
>
>   assert(head);
>
>   LIST_FOREACH_SAFE(mount_point, m, n, *head) {
>
>     /* If we are in a container, don't attempt to
>      * read-only mount anything as that brings no real
>      * benefits, but might confuse the host, as we remount
>      * the superblock here, not the bind mount. */
>     if (detect_container() <= 0) {
>       _cleanup_free_ char *options = NULL;
>       /* MS_REMOUNT requires that the data parameter
>        * should be the same from the original mount
>        * except for the desired changes. Since we want
>        * to remount read-only, we should filter out
>        * rw (and ro too, because it confuses the kernel) */
>       (void) fstab_filter_options(m->options, "rw\0ro\0", NULL, NULL, &options);
>
>       /* We always try to remount directories read-only
>        * first, before we go on and umount them.
>        *
>        * Mount points can be stacked. If a mount
>        * point is stacked below / or /usr, we
>        * cannot umount or remount it directly,
>        * since there is no way to refer to the
>        * underlying mount. There's nothing we can do
>        * about it for the general case, but we can
>        * do something about it if it is aliased
>        * somewhere else via a bind mount. If we
>        * explicitly remount the super block of that
>        * alias read-only we hence should be
>        * relatively safe regarding keeping the fs we
>        * can otherwise not see dirty. */
>       log_info("Remounting '%s' read-only with options '%s'.", m->path, options);
>       (void) mount(NULL, m->path, NULL, MS_REMOUNT|MS_RDONLY, options);
>     }
>
>     /* Skip / and /usr since we cannot unmount that
>      * anyway, since we are running from it. They have
>      * already been remounted ro. */
>     if (path_equal(m->path, "/")
> #ifndef HAVE_SPLIT_USR
>         || path_equal(m->path, "/usr")
> #endif
>        )
>       continue;
>
>     /* Trying to umount. We don't force here since we rely
>      * on busy NFS and FUSE file systems to return EBUSY
>      * until we closed everything on top of them. */
>     log_info("Unmounting %s.", m->path);
>     if (umount2(m->path, 0) == 0) {
>       if (changed)
>         *changed = true;
>
>       mount_point_free(head, m);
>     } else if (log_error) {
>       log_warning_errno(errno, "Could not unmount %s: %m", m->path);
>       n_failed++;
>     }
>   }
>
>   return n_failed;
> }
>
>
> So the short answer ultimately is... Systemd has a single umount
> function, which first does remount-ro, so it's actually remounting
> (nearly) everything read-only, then tries umount.
>
>
> Meanwhile, (semi-)answering the elsewhere implied question of why only
> Linux needs the mount-ro service... I'm no BSD expert, but in my
> wanderings I came across a remark that they didn't need it, because their
> kernel reboot/halt/poweroff routines have a built-in kernelspace
> sync-and-remount-ro routine for anything that can't be unmounted, which
> Linux lacks.
> They obviously consider this a Linux deficiency, but while I've
> not come across the Linux reason for not doing it, an educated guess is
> that it's considered putting policy into the kernel, and that's
> considered a no-no; policy is userspace, and the kernel simply enforces
> it as directed (which is why kernel 2.4's devfs was removed for 2.6, to
> be replaced with the userspace-based udev). Additionally, not
> kernel-forcing the remount-ro bit does give developers a way to test the
> results of an uncontrolled shutdown, say on a specific testing filesystem
> only, without exposing the rest of the system, which can still be shut
> down normally, to it.
>
> So on Linux userspace must do the final umounts and force-read-onlys,
> because unlike the BSDs, the Linux kernel doesn't have builtin routines
> that automatically force it, regardless of userspace.
>
> But as others have said, on Linux the remount-ro is _definitely_
> required, and "bad things _will_ happen" if it's not done. (Just how bad
> depends on the filesystem and its mount options, and hardware, among
> other things.)
>
>
> Finally, one more thing to mention. On systems with magic SysRq in the
> kernel...
>
> echo 0x30 > /proc/sys/kernel/sysrq
>
> ... enables the sync (0x10) and remount-readonly (0x20) functions. (Of
> course only do this at shutdown/reboot, as you don't want to disturb the
> user's configured SysRq defaults in normal runtime.)
>
> You can then force emergency sync (s) and remount-read-only (u) with...
>
> echo s > /proc/sysrq-trigger
> echo u > /proc/sysrq-trigger
>
> As that's kernel emergency priority, it should force-sync and force
> everything readonly (and quiesce mid-layer block devices such as md
> and dm), even if it would normally refuse to do so due to files open for
> writing. You might consider something like that as a fallback, if normal
> mount-readonly fails.
> Of course it won't work if magic SysRq functionality
> isn't built into the kernel, but then you're no worse off than before,
> and are far better off on kernels where it's supported, so it's certainly
> worth considering. =:^)
>
> --
> Duncan - List replies preferred. No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master." Richard Stallman
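A fallback along the lines Duncan suggests could be sketched as below. This is illustrative, not a tested shutdown path; the control and trigger paths are parameters only so the logic can be exercised against scratch files instead of the real /proc, and the function name is made up.

```shell
# Hypothetical last-resort helper for an init script: enable only the
# sync (0x10) and remount-ro (0x20) SysRq functions, then fire them.
force_remount_ro() {
    sysrq_ctl="${1:-/proc/sys/kernel/sysrq}"
    sysrq_trigger="${2:-/proc/sysrq-trigger}"
    # If the kernel lacks magic SysRq, the trigger file is absent.
    [ -w "$sysrq_trigger" ] || return 1
    echo 0x30 > "$sysrq_ctl"      # allow sync + remount-ro only
    echo s > "$sysrq_trigger"     # emergency sync
    echo u > "$sysrq_trigger"     # emergency remount read-only
}
```

A mount-ro replacement would presumably call this only after the normal `mount -o remount,ro` path had already failed.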