From: "Michał Górny" <mgorny@gentoo.org>
To: gentoo-dev@lists.gentoo.org
Subject: Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
Date: Sun, 28 Jan 2018 10:10:43 +0100
Message-ID: <1517130643.1270.11.camel@gentoo.org>
In-Reply-To: <20180128070111.GA17078@meriadoc.perfinion.com>
On Sun, 28.01.2018 at 15:01 +0800, Jason Zaman wrote:
> On Sat, Jan 27, 2018 at 12:24:39AM +0100, Michał Górny wrote:
> > Migrating mirrors to the hashed structure
> > -----------------------------------------
> > The hard link solution allows us to save space on the master mirror.
> > Additionally, if ``-H`` option is used by the mirrors it avoids
> > transferring existing files again. However, this option is known
> > to be expensive and could cause significant server load. Without it,
> > all mirrors need to transfer a second copy of all the existing files.
> >
> > The symbolic link solution could be more reliable if we could rely
> > on mirrors using the ``--links`` rsync option. Without that, symbolic
> > links are not transferred at all.
>
> These rsync options might help for mirrors too:
> --compare-dest=DIR also compare destination files relative to DIR
> --copy-dest=DIR ... and include copies of unchanged files
> --link-dest=DIR hardlink to files in DIR when unchanged
>
> > Using hashed structure for local distfiles
> > ------------------------------------------
> > The hashed structure defined above could also be used for local distfile
> > storage as used by the package manager. For this to work, the package
> > manager authors need to ensure that:
> >
> > a. The ``${DISTDIR}`` variable in the ebuild scope points to a temporary
> > directory where distfiles specific to the package are linked
> > in a flat structure.
> >
> > b. All tools are updated to support the nested structure.
> >
> > c. The package manager provides a tool for users to easily manipulate
> > distfiles, in particular to add distfiles for fetch-restricted
> > packages into an appropriate subdirectory.
> >
> > For extended compatibility, the package manager may support finding
> > distfiles in flat and nested structure simultaneously.
>
> trying nested first then falling back to flat would make it easy for
> users if they have to download distfiles for fetch-restricted packages
> because then the instructions stay as "move it to
> /usr/portage/distfiles".
> or alternatively the tool could have a mode which will go through all
> files in the base dir and move it to where it should be in the nested
> tree. then you save everything to the same dir and run edist --fix
This is really out of scope here, and up to the Portage maintainers.
> > Rationale
> > =========
> > Algorithm for splitting distfiles
> > ---------------------------------
> > In the original debate that occurred in bug #534528 [#BUG534528]_,
> > three possible solutions for splitting distfiles were listed:
> >
> > a. using initial portion of filename,
> >
> > b. using initial portion of file hash,
> >
> > c. using initial portion of filename hash.
> >
> > The significant advantage of the filename option was simplicity. With
> > that solution, the users could easily determine the correct subdirectory
> > themselves. However, its significant disadvantage was very uneven
> > shuffling of data. In particular, the TeΧ Live packages alone count
> > almost 23500 distfiles and all use a common prefix, making it impossible
> > to split them further.
>
> the filename is the original upstream or the renamed one? eg
> SRC_URI="http://foo/foo.tar -> bar.tar" it will be bar.tar?
Renamed one. This is what distfiles use already. Otherwise we'd have
a lot of collisions on files named 'v1.2.3.tar.gz'.
> I think im in favour of using the initial part of the filename anyway.
> sure its not balanced but its still a hell of a lot more balanced than
> today and its really easy.
'More balanced' does not mean it solves the problem. If you have one
directory with ~25000 files, and others between almost empty and 4000,
then you still have a huge problem and a lot of silly reorganization
that looks like a 'good idea that misfired'.
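To make the imbalance concrete, here is a small sketch that buckets a
list of distfile names both ways and reports the share of the largest
bucket. The function names are made up, and BLAKE2b is only an assumed
choice of hash for the filename variant.

```python
import hashlib
from collections import Counter
from typing import Iterable


def by_filename_prefix(names: Iterable[str]) -> Counter:
    """Bucket distfiles by the first character of the filename."""
    return Counter(name[0].lower() for name in names if name)


def by_filename_hash(names: Iterable[str], digits: int = 1) -> Counter:
    """Bucket distfiles by the leading hex digits of a filename hash."""
    # Assumption: BLAKE2b; any well-mixing hash would serve the same purpose.
    return Counter(
        hashlib.blake2b(name.encode("utf-8")).hexdigest()[:digits]
        for name in names
    )


def largest_share(buckets: Counter) -> float:
    """Fraction of all files that end up in the single biggest bucket."""
    total = sum(buckets.values())
    return max(buckets.values()) / total if total else 0.0
```

Feeding it a list dominated by one common prefix (as with the TeX Live
distfiles) shows the prefix split putting nearly everything into one
bucket, while the hash split spreads the same names across all buckets.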
> Another thing im wondering is if we can just use the same dir layout as
> the packages themselves. that would fix texlive since it has a whole lot
> of separate packages. eg /usr/portage/distfiles/app-cat/pkg/pkg-1.0.tgz
Then you're replacing the problem of many files in a single directory
with the problem of a huge number of almost empty directories. In other
words, you replace a performance problem of one kind with a performance
problem of another kind, plus a potential inode problem...
> there is a problem if many packages use the same distfiles (quite
> extensive for SELinux, every single of the sec-policy/selinux-* packages
> has identical distfiles) so im not sure how to deal with it.
...and yes, the problem that we have a lot of distfiles shared between
different packages. Also, those distfiles are frequently huge
(think of a big upstream tarball being split into N packages in Gentoo,
e.g. Qt).
> this would also make it easy in future to make the sandbox restrict
> access to files outside of that package if we wanted to do that.
I don't see how that's relevant at all.
> > The alternate option of using file hash has the advantage of having
> > a more balanced split. Furthermore, since hashes are stored
> > in Manifests using them is zero-cost. However, this solution has two
> > significant disadvantages:
> >
> > 1. The hash values are unknown for newly-downloaded distfiles, so
> > ``repoman`` (or an equivalent tool) would have to use a temporary
> > directory before locating the file in appropriate subdirectory.
> >
> > 2. User-provided distfiles (e.g. for fetch-restricted packages) with
> > hash mismatches would be placed in the wrong subdirectory,
> > potentially causing confusing errors.
>
> Not just this, but on principle, I also think you should be able to read
> an ebuild and compute the url to download the file from the mirrors
> without any extra knowledge (especially downloading the distfile).
>
> > Using filename hashes has proven to provide a similar balance
> > to using file hashes. Furthermore, since filenames are known up front
> > this solution does not suffer from either of the listed problems.
> > While hashes need to be computed manually, hashing a short string
> > should not cause any performance problems.
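As a minimal sketch of option (c): the subdirectory depends only on the
(renamed) filename, so it can be computed before anything is
downloaded. The hash algorithm and the number of digits are assumptions
here (BLAKE2b, two hex digits); the text quoted above fixes neither,
and the function names are made up.

```python
import hashlib


def nested_subdir(filename: str, digits: int = 2) -> str:
    """Return the bucket for a distfile, derived from its filename only."""
    # Assumption: BLAKE2b over the UTF-8 encoded filename.
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    return digest[:digits]


def nested_path(filename: str, digits: int = 2) -> str:
    """Return the path of a distfile relative to the distfiles root."""
    return f"{nested_subdir(filename, digits)}/{filename}"
```

Because only the short filename string is hashed, the cost per distfile
is negligible compared to hashing the file contents.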
> >
> > .. figure:: glep-0075-extras/by-filename.png
> >
> > Distribution of distfiles by first character of filenames
> >
> > .. figure:: glep-0075-extras/by-csum.png
> >
> > Distribution of distfiles by first hex-digit of checksum
> > (x --- content checksum, + --- filename checksum)
> >
> > .. figure:: glep-0075-extras/by-csum2.png
> >
> > Distribution of distfiles by two first hex-digits of checksum
> > (x --- content checksum, + --- filename checksum)
>
> do you have an easy way to calculate how big the distfiles are per
> category or cat/pkg? i'd be interested to see.
An easy way, no. But it should be easy to write a script that does that.
The sources for my stuff are at:
https://github.com/mgorny/manifest-distfile-stats
Most of it won't be useful for that case, though, since it works
on combined and deduplicated Manifests.
If you want to do that, please also include a graph of total file sizes,
and mark how much of that is duplicated between groups.
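A rough starting point for such a script might look like the sketch
below. It is not taken from the repository above. It assumes the usual
<category>/<package>/Manifest layout and Manifest lines of the form
"DIST <filename> <size> <hash name> <hash value> ...", and the function
names are made up.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Set, Tuple


def parse_dist_entries(manifest_text: str) -> Dict[str, int]:
    """Map distfile name to size for every DIST line in one Manifest."""
    sizes: Dict[str, int] = {}
    for line in manifest_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "DIST":
            sizes[fields[1]] = int(fields[2])
    return sizes


def sizes_by_category(repo: Path) -> Tuple[Dict[str, int], int]:
    """Return (total bytes per category, bytes shared between categories).

    Within a category each filename is counted once, even if several
    packages list it. A distfile counts as shared when it appears in
    more than one category.
    """
    per_category: Dict[str, Dict[str, int]] = defaultdict(dict)
    for manifest in repo.glob("*/*/Manifest"):
        category = manifest.parent.parent.name
        text = manifest.read_text(encoding="utf-8")
        per_category[category].update(parse_dist_entries(text))

    totals = {cat: sum(files.values()) for cat, files in per_category.items()}

    seen_in: Dict[str, Set[str]] = defaultdict(set)
    size_of: Dict[str, int] = {}
    for cat, files in per_category.items():
        for name, size in files.items():
            seen_in[name].add(cat)
            size_of[name] = size
    shared = sum(size_of[n] for n, cats in seen_in.items() if len(cats) > 1)
    return totals, shared
```

The per-category totals can go straight into a graph, and the shared
figure gives a first idea of how much would be duplicated if distfiles
were grouped by category.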
> > Backwards Compatibility
> > =======================
> > Mirror compatibility
> > --------------------
> > The mirrored files are propagated to other mirrors as opaque directory
> > structure. Therefore, there are no backwards compatibility concerns
> > on the mirroring side.
> >
> > Backwards compatibility with existing clients is detailed
> > in `migrating mirrors to the hashed structure`_ section. Backwards
> > compatibility with the old clients will be provided by preserving
> > the flat structure during the transitional period.
>
> Even if there was no transition, things wouldnt be terrible because
> portage would fall back to just downloading from SRC_URI directly
> if the mirrors fail.
>
>
--
Best regards,
Michał Górny