public inbox for gentoo-dev@lists.gentoo.org
* [gentoo-dev] An example overlayfs sandbox test
@ 2017-09-22 23:43 James McMechan
  2017-09-23  0:18 ` Rich Freeman
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: James McMechan @ 2017-09-22 23:43 UTC (permalink / raw
  To: gentoo-dev@lists.gentoo.org

[-- Attachment #1: Type: text/plain, Size: 1220 bytes --]

Hello,
I thought an example of how an overlay sandbox could work was in order.

###
# load the overlayfs filesystem for this test
modprobe overlay

# make the directories for the test
mkdir -p /var/tmp/upper /var/tmp/work /mnt/gentoo

# now create a separate, non-persistent mount namespace
unshare -m bash

# setup the overlay
mount -toverlay -oupperdir=/var/tmp/upper/,workdir=/var/tmp/work/,lowerdir=/ overlay /mnt/gentoo/

# since I don't care about protecting /var/tmp/portage
# put the original on top of the overlay for better performance maybe?
mount -o bind /var/tmp/portage /mnt/gentoo/var/tmp/portage

# then like the handbook
cd /mnt/gentoo
mount -t proc proc proc
mount --rbind /sys sys
mount --rbind /dev dev

# finally change into the protected sandbox
chroot . bash

# mess up the system

exit # the chroot
exit # the unshare
### done.

This version allows the sandbox to work with the special files in /dev, /proc, and /sys.
Other options are available, for example separate dev/pts and dev/shm submounts.

When you exit the chroot and then the unshare, the /var/tmp/upper directory will contain all the changes made while in the chroot.
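A minimal sketch of inspecting that result, using the paths from the example above: regular files in the upper directory are writes the build made, and overlayfs records deletions as 0/0 character-device "whiteout" entries.

```shell
# Inspect /var/tmp/upper after leaving the chroot and the namespace.
upper=/var/tmp/upper

# Files created or modified while inside the sandbox:
find "$upper" -type f

# Deletions appear as character devices ("whiteouts"):
find "$upper" -type c
```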

Enjoy,

Jim McMechan



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [gentoo-dev] An example overlayfs sandbox test
  2017-09-22 23:43 [gentoo-dev] An example overlayfs sandbox test James McMechan
@ 2017-09-23  0:18 ` Rich Freeman
  2017-09-23  1:29   ` James McMechan
  2017-09-23 23:42 ` Alec Warner
  2017-09-24 12:55 ` [gentoo-dev] " Michał Górny
  2 siblings, 1 reply; 17+ messages in thread
From: Rich Freeman @ 2017-09-23  0:18 UTC (permalink / raw
  To: gentoo-dev

On Fri, Sep 22, 2017 at 4:43 PM, James McMechan
<james_mcmechan@hotmail.com> wrote:
>
> # now create a separate mount namespace non-persistent
> unshare -m bash
>

If you're going to go to the trouble to set up a container, you might
as well add some more isolation:

unshare --mount --net --pid --uts --cgroup --fork --ipc --mount-proc bash

I'm not sure how much of a hassle mapping a uid namespace would be or
if it would really add anything, especially if this chroots to portage
right away.
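For the record, a rough sketch of the uid-mapped variant with util-linux unshare (untested here; --map-root-user implies a new user namespace):

```shell
# Hypothetical: run the build environment as an unprivileged user
# while appearing as root inside the new user namespace.
# --map-root-user implies --user and maps the current uid/gid to 0.
unshare --map-root-user --mount --pid --fork --mount-proc bash
# inside: `id -u` reports 0, but the kernel still enforces the
# outer user's permissions on the host filesystem
```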

-- 
Rich



* Re: [gentoo-dev] An example overlayfs sandbox test
  2017-09-23  0:18 ` Rich Freeman
@ 2017-09-23  1:29   ` James McMechan
  2017-09-23  2:26     ` Rich Freeman
  0 siblings, 1 reply; 17+ messages in thread
From: James McMechan @ 2017-09-23  1:29 UTC (permalink / raw
  To: gentoo-dev@lists.gentoo.org

On Fri, Sep 22, 2017 at 5:18 PM, Rich Freeman <rich0@gentoo.org> wrote:
>On Fri, Sep 22, 2017 at 4:43 PM, James McMechan
><james_mcmechan@hotmail.com> wrote:
>>
>> # now create a separate mount namespace non-persistent
>> unshare -m bash
>>
>
>If you're going to go to the trouble to set up a container, you might
>as well add some more isolation:
>
>unshare --mount --net --pid --uts --cgroup --fork --ipc --mount-proc bash
>
>I'm not sure how much of a hassle mapping a uid namespace would be or
>if it would really add anything, especially if this chroots to portage
>right away.
>
>-- 
>Rich

Well, mostly it was an example; I am not actually very good with containers.
The more stuff is isolated, the more needs to be set up.

The mount namespace is the whole point of the example.

I would not want to change the networking; it should already be working,
and I would be better served by not messing with it.

Portage should not care about --pid, --uts (hostname/domainname), --cgroup, or --ipc.

The --mount-proc is not really helpful, as I immediately remount the entire
"/" filesystem at /mnt/gentoo and chroot into it after custom setup of proc, sys, and dev.

Now I could see a use for --map-root-user --user; then Portage could run as
root in the container with the least danger, being user portage:portage outside.

Enjoy

Jim McMechan



* Re: [gentoo-dev] An example overlayfs sandbox test
  2017-09-23  1:29   ` James McMechan
@ 2017-09-23  2:26     ` Rich Freeman
  2017-09-24  4:36       ` Tim Harder
  2017-09-24 15:39       ` James McMechan
  0 siblings, 2 replies; 17+ messages in thread
From: Rich Freeman @ 2017-09-23  2:26 UTC (permalink / raw
  To: gentoo-dev

On Fri, Sep 22, 2017 at 6:29 PM, James McMechan
<james_mcmechan@hotmail.com> wrote:
> On Fri, Sep 22, 2017 at 5:18 PM, Rich Freeman <rich0@gentoo.org> wrote:
>>On Fri, Sep 22, 2017 at 4:43 PM, James McMechan
>><james_mcmechan@hotmail.com> wrote:
>>>
>>> # now create a separate mount namespace non-persistent
>>> unshare -m bash
>>>
>>
>>If you're going to go to the trouble to set up a container, you might
>>as well add some more isolation:
>>
>>unshare --mount --net --pid --uts --cgroup --fork --ipc --mount-proc bash
>>
>
> I would not want to change the networking, it should already be working
> and I would be better served by not messing with it.
>

Well, that's the point.  You don't want networking to work during the
build phases.  Maybe you'd want it for the test phase.  In any case,
you would definitely want control over that in the ebuild.  Random
build systems shouldn't be talking to the internet, if for no other
reason than to keep them from fetching stuff to install that bypasses
the integrity checks.

If you create a new net namespace by default it won't have any
interfaces other than lo.
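A quick way to see this (a sketch, not tested here; the unprivileged form assumes the kernel permits unprivileged user namespaces):

```shell
# A fresh net namespace starts with only the loopback device.
# -r (--map-root-user) lets an unprivileged user create the namespaces.
unshare -r --net ip -o link show
# typically prints a single line for "lo", in state DOWN
```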

>
> The --mount-proc is not really helpful as I immediately remount the entire
> "/" filesystem at /mnt/gentoo and chroot into it after custom setup of proc sys and dev
>

As long as it doesn't see the host /proc then you're fine.  You just
wouldn't want to have it mounted into the container.

> Now I could see a use for  --map-root-user --user, then portage could run as
> root in the container with the least danger by being user portage:portage outside.
>

Certainly, but that takes a bit more work, and to be honest I've never
actually bothered to get it working using unshare.  It probably isn't
too difficult.

The options I listed basically "just work" without any real additional effort.

So, we're drifting off topic, but as long as we're coming up with
nice-to-have utilities it would be lovely if our install CDs had
something similar to systemd-nspawn to set up a container instead of a
chroot for performing the install.  If nothing else it would make
mount cleanup easier when you're done.  I imagine it would just be a
bit of shell scripting with util-linux on the CD - while nspawn is
bundled with systemd you don't need any of its fancier features for
doing an install.

Back on topic - none of this stuff will work on FreeBSD, which might
be an issue for those running Gentoo on that kernel.  Ditto for Prefix
I suppose.  I suspect that jails/etc would also do the job but you'd
need some arch-dependent code to set up the container.  Just about all
of these tricks are involving non-POSIX functionality.  Actually, I'm
not sure if even the current LD_PRELOAD approach is completely
portable, though it has the advantage of being entirely in userspace.

-- 
Rich



* Re: [gentoo-dev] An example overlayfs sandbox test
  2017-09-22 23:43 [gentoo-dev] An example overlayfs sandbox test James McMechan
  2017-09-23  0:18 ` Rich Freeman
@ 2017-09-23 23:42 ` Alec Warner
  2017-09-23 23:59   ` Rich Freeman
  2017-09-24 12:55 ` [gentoo-dev] " Michał Górny
  2 siblings, 1 reply; 17+ messages in thread
From: Alec Warner @ 2017-09-23 23:42 UTC (permalink / raw
  To: Gentoo Dev


On Fri, Sep 22, 2017 at 7:43 PM, James McMechan <james_mcmechan@hotmail.com>
wrote:

> Hello,
> I thought an example of how an overlay sandbox could work was in order.
>
> ###
> # load the overlayfs filesystem for this test
> modprobe overlay
>
> # make the directories for the test
> mkdir -p /var/tmp/upper /var/tmp/work /mnt/gentoo
>
> # now create a separate mount namespace non-persistent
> unshare -m bash
>
> # setup the overlay
> mount -toverlay -oupperdir=/var/tmp/upper/,workdir=/var/tmp/work/,lowerdir=/
> overlay /mnt/gentoo/
>
> # since I don't care about protecting /var/tmp/portage
> # put the original on top of the overlay for better performance maybe?
> mount -o bind /var/tmp/portage /mnt/gentoo/var/tmp/portage
>
> # then like the handbook
> cd /mnt/gentoo
> mount -t proc proc proc
> mount --rbind /sys sys
> mount --rbind /dev dev
>
> #finally change into the protected sandbox
> chroot . bash
>
> # mess up the system
>

> exit # the chroot
> exit # the unshare
> ### done.
>
> This version allows the sandbox to work with the special files in /dev,
> /proc, /sys
> other options are available for example a second separate dev/pts and
> dev/shm submounts
>
> When you exit the chroot and then the unshare, the /var/tmp/upper
> directory will contain all the changes made while in the chroot.
>

I'm not quite grasping how this informs me of violations, though.

Say that inside the chroot I read /etc/foo and then modify it (via
something like a sed call). In this implementation /etc/foo is available
(because / is the lowerdir), but my writes end up in /var/tmp/upper.

So by simple inspection of /var/tmp/upper I can detect that a violation
occurred... but I don't get any information as to what caused it, right? I'm
sure in trivial cases (sed $FOO) it's easy to figure out, but in other cases
it's a lot more complicated to determine which portion of the build is the
culprit. That is why the tracing portion is so useful. A thing tries to do
the bad thing, and it fails.

We could try forcing failures (say, by not having / mounted as lowerdir, so
syscalls against the rootfs would just fail with ENOENT), but then we are
still stuck with the tricky part, which is that sometimes things *do* need
to read/write from the rootfs, and the sandbox add* API is available to do
that. How would we implement something like that here?

-A


> Enjoy,
>
> Jim McMechan
>
>



* Re: [gentoo-dev] An example overlayfs sandbox test
  2017-09-23 23:42 ` Alec Warner
@ 2017-09-23 23:59   ` Rich Freeman
  2017-09-24  4:44     ` Tim Harder
  0 siblings, 1 reply; 17+ messages in thread
From: Rich Freeman @ 2017-09-23 23:59 UTC (permalink / raw
  To: gentoo-dev

On Sat, Sep 23, 2017 at 7:42 PM, Alec Warner <antarus@gentoo.org> wrote:
>
> We could try forcing failures (say, by not having / mounted as lowerdir, so
> syscalls against the rootfs would just fail as E_NOENT) but then we are
> still stuck with the tricky part; which is that sometimes things *do* need
> to read / write from the rootfs and the sandbox add* API is available to do
> that. How would we implement something like that here?
>

I would personally recommend against the overlay approach for all the
reasons you state.

A read-only container is a much simpler solution and generates the
same kinds of errors as the current sandbox approach, but likely with
fewer compatibility issues.  I'm not really sure what tracing gets us
that containers don't, other than having to make sure you trap
everything and handle it.  The kernel already handles attempts to
write to read-only files and so on.

We could add an API to designate specific files/directories/etc as
read-write, and then portage would bind mount them as writable in the
container.
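A sketch of how Portage might apply such designations, assuming a read-only bind view of the host has already been set up at a staging root (the paths and the writable list are hypothetical; cf. the current sandbox's addwrite):

```shell
# Hypothetical: paths an ebuild declared as writable.
writable="/var/tmp/portage /dev/shm"
root=/mnt/build   # pre-existing read-only bind view of the host

for p in $writable; do
    # each declared path becomes one read-write bind mount in the container
    mount --bind "$p" "$root$p"
done
```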

-- 
Rich



* Re: [gentoo-dev] An example overlayfs sandbox test
  2017-09-23  2:26     ` Rich Freeman
@ 2017-09-24  4:36       ` Tim Harder
  2017-09-24 15:39       ` James McMechan
  1 sibling, 0 replies; 17+ messages in thread
From: Tim Harder @ 2017-09-24  4:36 UTC (permalink / raw
  To: gentoo-dev

On 2017-09-22 22:26, Rich Freeman wrote:
> So, we're drifting in topic, but as long as we're coming up with
> nice-to-have utilities it would be lovely if our install CDs had
> something similar to systemd-nspawn to set up a container instead of a
> chroot for performing the install.  If nothing else it would make
> mount cleanup easier when you're done.  I imagine it would just be a
> bit of shell scripting with util-linux on the CD - while nspawn is
> bundled with systemd you don't need any of its fancier features for
> doing an install.

If you're fine with python, you could try pychroot [1]. I wrote it with
this kind of thing in mind and for doing easy namespaced chroots in
python using a context manager.

It's even in the tree already. ;)

Tim

[1]: https://github.com/pkgcore/pychroot



* Re: [gentoo-dev] An example overlayfs sandbox test
  2017-09-23 23:59   ` Rich Freeman
@ 2017-09-24  4:44     ` Tim Harder
  2017-09-24  8:24       ` [gentoo-dev] " Martin Vaeth
  0 siblings, 1 reply; 17+ messages in thread
From: Tim Harder @ 2017-09-24  4:44 UTC (permalink / raw
  To: gentoo-dev

On 2017-09-23 19:59, Rich Freeman wrote:
> A read-only container is a much simpler solution and generates the
> same kinds of errors as the current sandbox approach, but likely with
> fewer compatibility issues.  I'm not really sure what tracing gets us
> that containers don't, other than having to make sure you trap
> everything and handle it.  The kernel already handles attempts to
> write to read-only files and so on.

> We could add an API to designate specific files/directories/etc as
> read-write, and then portage would bind mount them as writable in the
> container.

I doubt bind mounts will scale as far as we'd need them for this
approach; i.e., the tons of bind mounts needed for complicated builds
would cause issues.

As has been mentioned before, a different way would be to write some
sort of FUSE fs that can handle only allowing access to a specified set
of files in a performant manner. Leveraged alongside
namespaces/containers, this would probably provide what we need,
assuming an API of sorts could be written to perform the various
request/deny/etc. actions that we already use for sandboxing on the
FUSE fs.

Tim



* [gentoo-dev] Re: An example overlayfs sandbox test
  2017-09-24  4:44     ` Tim Harder
@ 2017-09-24  8:24       ` Martin Vaeth
  2017-09-24 11:31         ` Rich Freeman
  0 siblings, 1 reply; 17+ messages in thread
From: Martin Vaeth @ 2017-09-24  8:24 UTC (permalink / raw
  To: gentoo-dev

Tim Harder <radhermit@gentoo.org> wrote:
> On 2017-09-23 19:59, Rich Freeman wrote:
>> A read-only container
>
> I doubt bind mounts will scale
>
> As has been mentioned before, a different way would be to write some
> sort of FUSE fs

The problem with both containers and FUSE is performance.
(For containers with thousands of binds I haven't tried it,
but for FUSE I know how unionfs-fuse slows down "normal"
operation, simply because the userspace implementation
requires many additional context switches.)

Both are fine for testing, but I am afraid not for regular
users' emerge operations, which usually involve too many file
operations, at least for certain packages (e.g. *-sources,
texlive-*).

It is the big advantage of overlay that it is implemented in
kernel and does not involve any time-consuming checks during
normal file operations.

Indeed, the price you pay is that the actual checking can be
done only once, at the very end of the compilation, and
so the only information you get is the name and time of the
violation (paths and file timestamps). But concerning performance,
this "only once" checking is an advantage, of course.

Main disadvantages: it requires the user to have overlay
support in the kernel and extended attribute support on
the file system containing the upper directory.
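Both requirements can be probed up front. A rough sketch, reusing the upperdir path from the example earlier in the thread (note that overlayfs itself uses trusted.overlay.* attributes, which need root; the user.* probe below is only a rough proxy for generic xattr support; setfattr is from sys-apps/attr):

```shell
# Is overlay available (built in or loadable)?
grep -qw overlay /proc/filesystems || modprobe overlay

# Does the filesystem holding the upper directory support xattrs?
touch /var/tmp/upper/.xattr-probe
setfattr -n user.probe -v 1 /var/tmp/upper/.xattr-probe && echo "xattrs OK"
rm -f /var/tmp/upper/.xattr-probe
```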




* Re: [gentoo-dev] Re: An example overlayfs sandbox test
  2017-09-24  8:24       ` [gentoo-dev] " Martin Vaeth
@ 2017-09-24 11:31         ` Rich Freeman
  2017-09-24 18:11           ` Martin Vaeth
  0 siblings, 1 reply; 17+ messages in thread
From: Rich Freeman @ 2017-09-24 11:31 UTC (permalink / raw
  To: gentoo-dev

On Sun, Sep 24, 2017 at 4:24 AM, Martin Vaeth <martin@mvath.de> wrote:
> Tim Harder <radhermit@gentoo.org> wrote:
>
> It is the big advantage of overlay that it is implemented in
> kernel and does not involve any time-consuming checks during
> normal file operations.
>

Why would you expect containers to behave any differently?  Either way
the kernel gets a path and has to figure out where the path is
actually stored, and check the inode for access permissions.

Now, I am concerned about the time to create the container, if we're
going to specify individual files, but the same would be true of an
overlay.

If you create a container and just read-only bind mount all the
top-level dirs from the root filesystem into it, and then mount a
read-write bind mount into the package build directory, that is just a
few operations.  I'd expect that to go fast with either a container or
overlay solution.
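A minimal sketch of that coarse setup, with hypothetical paths, run inside `unshare -m` (note that remounting read-only affects only the top bind mount, not recursive submounts):

```shell
# One read-only view of the whole root filesystem:
mkdir -p /mnt/build
mount --bind / /mnt/build
mount -o remount,bind,ro /mnt/build

# One read-write bind mount over the package build directory:
mount --bind /var/tmp/portage /mnt/build/var/tmp/portage
```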

If you actually want to go to the next step (which our current sandbox
does not) and only bind mount the specific files specified in DEPEND
and their RDEPEND then you're talking about creating thousands of bind
mounts.  I have no idea how that performs.  However, I suspect it
would be at least as slow to populate an overlayfs with just that
specific list of files.

You can't compare the file-level container solution against the
filesystem-level overlay solution when both solutions can be
implemented either way.  If you just replicate the current sandbox
functionality then setup time is tiny and you get visibility into
write violations only.  If you resolve dependencies and map in
individual files then you additionally get visibility into read
violations, at the cost of more time to create the build environment.

I'd be interested in how other distros are solving this problem,
because fundamentally what we're doing isn't really any different.

-- 
Rich



* Re: [gentoo-dev] An example overlayfs sandbox test
  2017-09-22 23:43 [gentoo-dev] An example overlayfs sandbox test James McMechan
  2017-09-23  0:18 ` Rich Freeman
  2017-09-23 23:42 ` Alec Warner
@ 2017-09-24 12:55 ` Michał Górny
  2 siblings, 0 replies; 17+ messages in thread
From: Michał Górny @ 2017-09-24 12:55 UTC (permalink / raw
  To: gentoo-dev

On Fri, 22.09.2017 at 23:43 +0000, James McMechan wrote:
> Hello,
> I thought an example of how an overlay sandbox could work was in order.
> 
> ###
> # load the overlayfs filesystem for this test
> modprobe overlay
> 
> # make the directories for the test
> mkdir -p /var/tmp/upper /var/tmp/work /mnt/gentoo
> 
> # now create a separate mount namespace non-persistent
> unshare -m bash
> 
> # setup the overlay
> mount -toverlay -oupperdir=/var/tmp/upper/,workdir=/var/tmp/work/,lowerdir=/ overlay /mnt/gentoo/
> 
> # since I don't care about protecting /var/tmp/portage
> # put the original on top of the overlay for better performance maybe?
> mount -o bind /var/tmp/portage /mnt/gentoo/var/tmp/portage
> 
> # then like the handbook
> cd /mnt/gentoo
> mount -t proc proc proc
> mount --rbind /sys sys
> mount --rbind /dev dev
> 
> #finally change into the protected sandbox
> chroot . bash
> 
> # mess up the system
> 
> exit # the chroot
> exit # the unshare
> ### done.
> 
> This version allows the sandbox to work with the special files in /dev, /proc, /sys
> other options are available for example a second separate dev/pts and dev/shm submounts
> 
> When you exit the chroot and then the unshare, the /var/tmp/upper directory will contain all the changes made while in the chroot.
> 

How does that deal with access violations to device nodes? Named pipes,
UNIX sockets?

-- 
Best regards,
Michał Górny




* Re: [gentoo-dev] An example overlayfs sandbox test
  2017-09-23  2:26     ` Rich Freeman
  2017-09-24  4:36       ` Tim Harder
@ 2017-09-24 15:39       ` James McMechan
  1 sibling, 0 replies; 17+ messages in thread
From: James McMechan @ 2017-09-24 15:39 UTC (permalink / raw
  To: gentoo-dev@lists.gentoo.org

On Fri, Sep 22, 2017 at 7:26 PM, Rich Freeman <rich0@gentoo.org> wrote:
>On Fri, Sep 22, 2017 at 6:29 PM, James McMechan <james_mcmechan@hotmail.com> wrote:
>> On Fri, Sep 22, 2017 at 5:18 PM, Rich Freeman <rich0@gentoo.org> wrote:
>>>On Fri, Sep 22, 2017 at 4:43 PM, James McMechan
>>><james_mcmechan@hotmail.com> wrote:
>>>>
>>>> # now create a separate mount namespace non-persistent
>>>> unshare -m bash
>>>>
>>>
>>>If you're going to go to the trouble to set up a container, you might
>>>as well add some more isolation:
>>>
>>>unshare --mount --net --pid --uts --cgroup --fork --ipc --mount-proc bash
>>>
>>
>> I would not want to change the networking, it should already be working
>> and I would be better served by not messing with it.
>>
>
>Well, that's the point.  You don't want networking to work during the
>build phases.  Maybe you'd want it for the test phase.  In any case,
>you would definitely want control over that in the ebuild.  Random
>build systems shouldn't be talking to the internet, if for no other
>reason than to avoid it fetching stuff to install that bypasses the
>integrity checks.
>
>If you create a new net namespace by default it won't have any
>interfaces other than lo.

Err, perhaps I was thinking of doing the switch too early; I had been thinking
of making sure networking works for the fetch phase...

>> The --mount-proc is not really helpful as I immediately remount the entire
>> "/" filesystem at /mnt/gentoo and chroot into it after custom setup of proc sys and dev
>>
>
>As long as it doesn't see the host /proc then you're fine.  You just
>wouldn't want to have it mounted into the container.

I am pretty sure a /proc needs to be mounted inside any container if you want
builds to work; mount, ps, and a lot of other stuff do not work without it.

I think you mean that only the container's /proc is mounted.

>> Now I could see a use for  --map-root-user --user, then portage could run as
>> root in the container with the least danger by being user portage:portage outside.
>>
>
>Certainly, but that takes a bit more work, and to be honest I've never
>actually bothered to get it working using unshare.  It probably isn't
>too difficult.

You were right, I just tried it and it seemed easy enough.

>The options I listed basically "just work" without any real additional effort.

I like the lack-of-effort part ;) I want the computer doing the work, not me.

>So, we're drifting in topic, but as long as we're coming up with
>nice-to-have utilities it would be lovely if our install CDs had
>something similar to systemd-nspawn to set up a container instead of a
>chroot for performing the install.  If nothing else it would make
>mount cleanup easier when you're done.  I imagine it would just be a
>bit of shell scripting with util-linux on the CD - while nspawn is
>bundled with systemd you don't need any of its fancier features for
>doing an install.

OK, but what is the advantage? The mounts disappearing when you exit
the container, and anything else?

>Back on topic - none of this stuff will work on FreeBSD, which might
>be an issue for those running Gentoo on that kernel.  Ditto for Prefix
>I suppose.  I suspect that jails/etc would also do the job but you'd
>need some arch-dependent code to set up the container.  Just about all
>of these tricks are involving non-POSIX functionality.  Actually, I'm
>not sure if even the current LD_PRELOAD approach is completely
>portable, though it has the advantage of being entirely in userspace.

I don't see any BSD stuff on the download page, so it slips my mind.

I have not used any of the BSD-derived systems in quite a few years.
A quick glance shows they are running things like Docker, so container-like
functionality exists, and I seem to remember that Linux was late to
the overlay stuff; the FreeBSD mount_unionfs man page says it went in as
mount_null in 4.4BSD, so that is present also.

Regarding Prefix (or maybe RAP): does sandbox even work there?

It looks like Prefix uses the host system's library... would either method
sandbox currently uses work?

RAP appeared to have its own C library, so it should be possible there.
I do not remember if sandbox worked under RAP; the whole project felt
very experimental when I tried it.

I think non-Linux systems did adopt LD_PRELOAD and ptrace to some extent,
but I do not think either was in any of the POSIX versions, though I did not
look much at the last version; it seemed mostly irrelevant by then.

Enjoy,

Jim McMechan



* [gentoo-dev] Re: An example overlayfs sandbox test
  2017-09-24 11:31         ` Rich Freeman
@ 2017-09-24 18:11           ` Martin Vaeth
  2017-09-25  0:49             ` Rich Freeman
  0 siblings, 1 reply; 17+ messages in thread
From: Martin Vaeth @ 2017-09-24 18:11 UTC (permalink / raw
  To: gentoo-dev

Rich Freeman <rich0@gentoo.org> wrote:
> On Sun, Sep 24, 2017 at 4:24 AM, Martin Vaeth <martin@mvath.de> wrote:
>> Tim Harder <radhermit@gentoo.org> wrote:
>>
>> It is the big advantage of overlay that it is implemented in
>> kernel and does not involve any time-consuming checks during
>> normal file operations.
>
> Why would you expect containers to behave any differently?

For overlay, there is only one directory to be checked in
addition for every file access.

For containers, at least a dozen binds are minimally required
(/usr /proc /sys /dev ...). But as you mentioned in your posting,
if you want to take more care, you easily have thousands of bind mounts.
At least implicitly in the kernel, all of these binds must be checked
for every file access. I am not sure whether this happens very quickly
by hashing (so that essentially only the creation costs time).

As mentioned, I do not have actual timing results. I am just afraid
that it might easily cost more than a context switch, which already
gives fuse-overlay a slowdown so large that I would not recommend
it for a sandbox.

> Now, I am concerned about the time to create the container, if we're
> going to specify individual files, but the same would be true of an
> overlay. [...]
> to populate an overlayfs with just that specific list of files.

No. For overlay you need only one mount (not even a bind)
and only one directory traversal at the end to check for
violations.
The nice thing is that this is practically independent of
the number or structure of directories/files you want to protect,
i.e. it scales perfectly well.
For the more fine-grained approach, you just delete the files
you do not want to have in the beginning. Not sure how quickly this
can be done, but once it is done, the slowdown when running the
sandbox is independent of the number of deleted files (because
here certainly only one hash lookup is required).

Of course, as mgorny already observed, overlay alone is not an
absolute protection (e.g. against writing to some /dev/...),
so perhaps it is a good idea to use containers as an additional
protection level.

> If you just replicate the current sandbox
> functionality then setup time is tiny

I am not so much concerned about the setup time but more about the
delay caused for file operations once the sandbox is set up.
Perhaps even a dozen bind directories already give a considerable
slowdown...




* Re: [gentoo-dev] Re: An example overlayfs sandbox test
  2017-09-24 18:11           ` Martin Vaeth
@ 2017-09-25  0:49             ` Rich Freeman
  2017-09-25 15:27               ` Martin Vaeth
  0 siblings, 1 reply; 17+ messages in thread
From: Rich Freeman @ 2017-09-25  0:49 UTC (permalink / raw
  To: gentoo-dev

On Sun, Sep 24, 2017 at 2:11 PM, Martin Vaeth <martin@mvath.de> wrote:
> Rich Freeman <rich0@gentoo.org> wrote:
>> On Sun, Sep 24, 2017 at 4:24 AM, Martin Vaeth <martin@mvath.de> wrote:
>>> Tim Harder <radhermit@gentoo.org> wrote:
>>>
>>> It is the big advantage of overlay that it is implemented in
>>> kernel and does not involve any time-consuming checks during
>>> normal file operations.
>>
>> Why would you expect containers to behave any differently?
>
> For overlay, there is only one directory to be checked in
> addition for every file access.
>
> For containers, at least a dozens of binds are minimally required
> (/usr /proc /sys /dev ...).

I wouldn't be surprised if it works with a single bind mount with
/proc and /dev and so on mounted on top of that.  You really don't
want to be passing these directories through to the host filesystem
anyway.

>
>> Now, I am concerned about the time to create the container, if we're
>> going to specify individual files, but the same would be true of an
>> overlay. [...]
>> to populate an overlayfs with just that specific list of files.
>
> No. For overlay you need only one mount (not even a bind)
> and only one directory traversal at the end to check for
> violations.

You say "not even a bind" as if that is a benefit.  I suspect bind
mounts operate more quickly than an overlayfs if anything.

> The nice thing is that this is practically independent of
> the number or structure of directories/files you want to protect,
> i.e. it scales perfectly well.
> For the more fine-grained approach, you just delete the files
> you do not want to have in the beginning. Not sure, how quick this
> can be done, but once it is done, the slowdown when running the
> sandbox is independent of the number of deleted files (because
> here certainly only one hash lookup is required).

Honestly, you can't really claim that overlayfs is superior to bind
mounts when it comes to access times without looking into how
fast bind mounts actually operate.  I'd have to read up on the kernel
VFS myself, but people run hosts with lots of containers all the time,
and they usually contain a ton of mountpoints.  The kernel obviously
has an efficient way to figure out what filesystem a path is actually
on, and it has to work this out even if you're using overlayfs,
since the kernel has to first figure out that the path is
even on the overlayfs.

It is possible that bind mount performance is inferior when you've
removed all but a thousand files from your overlayfs, and it is
possible that overlayfs performance is inferior.

>
>> If you just replicate the current sandbox
>> functionality then setup time is tiny
>
> I am not so much concerned about the setup time but more about the
> delay caused for file operations once the sandbox is set up.
> Perhaps even a dozen bind directories already give a considerable
> slowdown...
>

I run builds on Gentoo containers all the time and the host is
juggling dozens of bind mounts already.  Before I started using
containers I'd use bind mounts fairly often on monolithic hosts.  I
certainly haven't noticed any overhead.  There are certainly people
running FAR more containers per host than I am.  I wouldn't be
concerned with a couple of bind mounts. I have a ton of zfs
mountpoints as well and no issues.  (Bind mounts shouldn't have any
more cost than any other type of mount, and probably less.)

I wouldn't assume that thousands of bind mounts would have zero impact
without testing it, but I also have no reason to be concerned.

-- 
Rich



* [gentoo-dev] Re: An example overlayfs sandbox test
  2017-09-25  0:49             ` Rich Freeman
@ 2017-09-25 15:27               ` Martin Vaeth
  2017-09-25 15:34                 ` Rich Freeman
  0 siblings, 1 reply; 17+ messages in thread
From: Martin Vaeth @ 2017-09-25 15:27 UTC (permalink / raw
  To: gentoo-dev

Rich Freeman <rich0@gentoo.org> wrote:
>>
>> For containers, at least a dozens of binds are minimally required
>> (/usr /proc /sys /dev ...).
>
> I wouldn't be surprised if it works with a single bind mount with
> /proc and /dev and so on mounted on top of that.

Either you start with a writable tree and bind-mount some directories
read-only, or you do it the opposite way around. In both cases a dozen
or so bind mounts are the minimum.

> You say "not even a bind" as if that is a benefit.

In case the "non-scaling" argument has not become clear,
let me visualize it with a table:

         | "simple"       | "fine grained"
---------+----------------+-------------------
 Overlay | 1 mount        | 1 mount
---------+----------------+-------------------
Container| 10? bind mounts| 1000? bind mounts

> Honestly, you can't really claim that overlayfs is superior to bind

Correct. If the number of bind mounts really has no influence on the
file operations in the corresponding part of the tree - e.g. if there
is really a clever hashing of bind mounts - the above table does not
indicate any scaling problem.

We are at a point where some kernel source code inspection (or at the
very least serious benchmarking, preferably on a slow, low-memory
machine) is needed before we can continue this discussion in a serious
way. I do not have the time for this at the moment.



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [gentoo-dev] Re: An example overlayfs sandbox test
  2017-09-25 15:27               ` Martin Vaeth
@ 2017-09-25 15:34                 ` Rich Freeman
  2017-09-27 16:51                   ` Martin Vaeth
  0 siblings, 1 reply; 17+ messages in thread
From: Rich Freeman @ 2017-09-25 15:34 UTC (permalink / raw
  To: gentoo-dev

On Mon, Sep 25, 2017 at 11:27 AM, Martin Vaeth <martin@mvath.de> wrote:
> Rich Freeman <rich0@gentoo.org> wrote:
>>
>> I wouldn't be surprised if it works with a single bind mount with
>> /proc and /dev and so on mounted on top of that.
>
> Either you start with a writable tree and bind-mount some directories
> non-writable or the opposite way. Either way, a dozen or so bind-mounts
> are minimally necessary.
>

/proc, /sys, and /dev wouldn't be bind mounts.  They're just mounts.
And everything else would be pulled in with a read-only bind mount of
/.

You're going to need the same mounts of /proc, /sys, and /dev on an
overlay, unless you really wanted to let those pass through, which
seems like a bad idea.
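A minimal sketch of that layout might look as follows. The mount target and function name are illustrative; it would be run as root inside "unshare -m", and it is wrapped in a function so that sourcing the file mounts nothing:

```shell
#!/bin/bash
# Sketch of the setup described above: one read-only bind mount of /
# plus fresh proc/sys/dev instances on top, instead of passing the
# host's live ones through.  Call setup_sandbox as root inside
# "unshare -m"; the path /mnt/sandbox is illustrative.
setup_sandbox() {
    local root=/mnt/sandbox
    mkdir -p "$root"

    # one bind mount pulls in the whole tree, then remount it read-only
    mount --bind / "$root"
    mount -o remount,bind,ro "$root"

    # fresh kernel filesystems rather than the host's live ones
    mount -t proc proc "$root/proc"
    mount -t sysfs sysfs "$root/sys"
    mount -t devtmpfs devtmpfs "$root/dev"

    chroot "$root" /bin/bash
}
```

The point of the sketch is the mount count: everything except proc/sys/dev comes in through that single read-only bind of /.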

>> You say "not even a bind" as if that is a benefit.
>
> In case the "non-scaling" argument has not become clear,
> I try to visualize it by a table:
>
>          | "simple"       | "fine grained"
> ---------+----------------+-------------------
>  Overlay | 1 mount        | 1 mount
> ---------+----------------+-------------------
> Container| 10? bind mounts| 1000? bind mounts

Except it is more like:

         | "simple"        | "fine grained"
---------+-----------------+-----------------------------------------------
 Overlay | 1 mount         | 1 mount + 1000? file deletions in the overlay
---------+-----------------+-----------------------------------------------
Container| 1-2 bind mounts | 1000? bind mounts

I left out dev+sys+proc in both cases - it would be a few more mounts
either way.

And there is really no difference in performance between 1 mount and
10 in practice.

-- 
Rich


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [gentoo-dev] Re: An example overlayfs sandbox test
  2017-09-25 15:34                 ` Rich Freeman
@ 2017-09-27 16:51                   ` Martin Vaeth
  0 siblings, 0 replies; 17+ messages in thread
From: Martin Vaeth @ 2017-09-27 16:51 UTC (permalink / raw
  To: gentoo-dev

Rich Freeman <rich0@gentoo.org> wrote:
>>
>>          | "simple"       | "fine grained"
>> ---------+----------------+-------------------
>>  Overlay | 1 mount        | 1 mount
>> ---------+----------------+-------------------
>> Container| 10? bind mounts| 1000? bind mounts
>
> Except it is more like:
>
>          | "simple"        | "fine grained"
> ---------+-----------------+-----------------------------------------------
>  Overlay | 1 mount         | 1 mount + 1000? file deletions in the overlay
> ---------+-----------------+-----------------------------------------------
> Container| 1-2 bind mounts | 1000? bind mounts

I was not talking about the time to set up the overlay.
The file deletions affect only that setup time.

> I left out dev+sys+proc in both cases

No, they were not forgotten:
They are not necessary for the overlay approach!
As I emphasized, you do not even need a single bind for that approach.

> And there is really no difference in performance between 1 mount and
> 10 in practice.

Really? Tested with a few million file creations, deletions, openings,
etc.? Such numbers are not unusual for some projects: gentoo-sources
alone has ~60k files, all of them accessed several times in various
ways. So even a very small per-operation delay is multiplied by a huge
number.

That's also a reason why I mentioned that a slow machine would be good
for timing. For instance, gentoo-sources needs several minutes to emerge
on a machine with a slow processor and little RAM; the hard-disk speed
is not the reason for the delay. I would not like to see yet another
slowdown factor from a sandbox, one that is perhaps negligible only on
a fast system.
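A rough sketch of the workload such a timing test could use: create, open and delete a large number of files, approximating the file churn of a big emerge. One would run it directly, then under a bind-mounted tree, then under an overlayfs upper dir, and compare. The count is scaled down here and all paths are illustrative:

```shell
#!/bin/bash
# Workload sketch for the benchmark suggested above.  n is scaled down;
# push it toward ~60000 for a gentoo-sources-sized run.
dir=$(mktemp -d)
n=2000

SECONDS=0
for i in $(seq 1 "$n"); do
    : > "$dir/f$i"      # create
    : < "$dir/f$i"      # open for reading
    rm "$dir/f$i"       # delete
done
ops=$((n * 3))
echo "performed $ops file operations in ${SECONDS}s"

rm -r "$dir"
```

Repeating the identical script in each environment keeps the fork/exec overhead constant, so any difference between the runs comes from the mount setup itself.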



^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2017-09-27 16:51 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-09-22 23:43 [gentoo-dev] An example overlayfs sandbox test James McMechan
2017-09-23  0:18 ` Rich Freeman
2017-09-23  1:29   ` James McMechan
2017-09-23  2:26     ` Rich Freeman
2017-09-24  4:36       ` Tim Harder
2017-09-24 15:39       ` James McMechan
2017-09-23 23:42 ` Alec Warner
2017-09-23 23:59   ` Rich Freeman
2017-09-24  4:44     ` Tim Harder
2017-09-24  8:24       ` [gentoo-dev] " Martin Vaeth
2017-09-24 11:31         ` Rich Freeman
2017-09-24 18:11           ` Martin Vaeth
2017-09-25  0:49             ` Rich Freeman
2017-09-25 15:27               ` Martin Vaeth
2017-09-25 15:34                 ` Rich Freeman
2017-09-27 16:51                   ` Martin Vaeth
2017-09-24 12:55 ` [gentoo-dev] " Michał Górny

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox