* [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
@ 2018-01-26 23:24 Michał Górny
2018-01-27 1:48 ` Michael Orlitzky
` (4 more replies)
0 siblings, 5 replies; 44+ messages in thread
From: Michał Górny @ 2018-01-26 23:24 UTC
To: gentoo-dev
Hi, everyone.
Here's a little new something we've been silently debating back
in the day, then forgotten about it, then I've written a GLEP about it.
The number's not official yet.
HTML (with plots): https://dev.gentoo.org/~mgorny/tmp/glep-0075.html
---
GLEP: 75
Title: Split distfile mirror directory structure
Author:
Michał Górny <mgorny@gentoo.org>,
Robin H. Johnson <robbat2@gentoo.org>
Type: Standards Track
Status: Draft
Version: 1
Created: 2018-01-26
Last-Modified: 2018-01-27
Post-History: 2018-01-27
Content-Type: text/x-rst
---
Abstract
========
This GLEP describes the procedure for splitting the distfiles on mirrors
into multiple directories with the goal of reducing the number of files
in a single directory.
Motivation
==========
At the moment, both the package manager and Gentoo mirrors use a flat
directory structure to store files. While this solution usually works,
it does not scale well. Directories with a large number of files usually
incur a significant performance penalty, unless the filesystem is
specifically designed for that purpose.
According to the Gentoo repository state at 2018-01-26 16:23, there
was a total of 62652 unique distfiles in the repository. While
the users realistically hit around 10% of that, distfile mirrors often
hold even more files --- more so if old distfiles are not wiped
immediately.
While all filesystems used on Linux boxes should be able to cope with
a number that large, they may suffer a performance penalty with even
a few thousand files. Additionally, if mirrors enable directory indexes,
generating the index imposes both significant server overhead
and significant data transfer. At the moment, the index
of distfiles.gentoo.org is around 17 MiB in size.
Splitting the distfiles into multiple directories makes it possible
to avoid those problems by reducing the number of files in a single
directory. For example, splitting the aforementioned set of distfiles
into 16 roughly balanced directories reduces
the number of files in a single directory to around 4000. Splitting
them further into 256 directories (16x16) results in 200-300 files
per directory, which should avoid any performance problems long-term,
even assuming 300% growth in the number of distfiles.
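The figures above can be checked with a little arithmetic. The sketch below
assumes a perfectly even split and reads "300% growth" as four times
the current count. Both are assumptions made for illustration, and
the function name is hypothetical.

```python
TOTAL_DISTFILES = 62652  # repository count quoted above (2018-01-26)


def files_per_directory(total, directories):
    """Average number of files per directory for a perfectly even split."""
    return total / directories


for directories in (16, 256):
    average = files_per_directory(TOTAL_DISTFILES, directories)
    print(directories, "directories:", round(average), "files each")

# Assumption: "300% growth" means four times the current number of files.
grown = TOTAL_DISTFILES * 4
print("after growth, 256 directories:", round(files_per_directory(grown, 256)))
```

On that reading, the 256-way split still averages under a thousand files
per directory after the growth.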
Specification
=============
Mirror layout file
------------------
A mirror adhering to this specification should include a ``layout.conf``
file in the top distfile directory. This file uses the format
derived from the freedesktop Desktop Entry Specification file format
[#DESKTOP_FORMAT]_.
Before using each Gentoo mirror, the package manager should attempt
to fetch (update) its ``layout.conf`` file and process it to determine
how to use the mirror. If the file is not present, the package manager
should behave as if it were empty.
The package manager should recognize the sections and keys listed below.
It should ignore any unrecognized sections or keys --- the format
is intended to account for future extensions.
This specification currently defines one section: ``[structure]``.
This section defines one or more repository structure definitions
using sequential integer keys. The definition keyed as ``0``
is the most preferred structure. The package manager should use
the first structure format it recognizes as supported, and ignore any
it does not recognize. If this section is not present, the package
manager should behave as if only ``flat`` structure were supported.
The following structure definitions are supported:
* ``flat`` to indicate the traditional flat structure where all
distfiles are located in the top directory,
* ``filename-hash <algorithm> <cutoffs>`` to indicate the `filename
hash structure`_ explained below.
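As an illustration of the selection rule, the following is a minimal sketch
of how a client might pick a structure. The sample file contents, the use
of Python's ``configparser`` as a stand-in for the Desktop Entry format,
and the names ``SAMPLE_LAYOUT``, ``SUPPORTED`` and ``pick_structure`` are
all assumptions, not part of the specification.

```python
import configparser

# Hypothetical layout.conf contents, made up for this example.
SAMPLE_LAYOUT = """\
[structure]
0=filename-hash SHA512 8
1=flat
"""

# Structure types this (imaginary) client knows how to handle.
SUPPORTED = {"flat", "filename-hash"}


def pick_structure(layout_text):
    """Return the fields of the first supported structure definition."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(layout_text)
    if not parser.has_section("structure"):
        # Missing file or missing section: behave as if only flat is listed.
        return ["flat"]
    entries = parser["structure"]
    # Keys are sequential integers; 0 is the most preferred.
    for key in sorted((k for k in entries if k.isdigit()), key=int):
        fields = entries[key].split()
        if fields and fields[0] in SUPPORTED:
            return fields
    # The text above does not say what to do when nothing is recognized.
    return None


print(pick_structure(SAMPLE_LAYOUT))
```

Unknown sections and non-integer keys are simply skipped, which matches
the requirement to ignore anything unrecognized.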
Filename hash structure
-----------------------
When using the filename hash structure, the distfiles are split
into directories whose names are derived from the hash of distfile
filename. This structure has two parameters: *algorithm name*
and *cutoffs* list.
The algorithm name must correspond to a valid Manifest hash name.
An informational list of hashes is included in GLEP 74 [#GLEP74]_,
and the policies for introducing new hashes are covered by GLEP 59
[#GLEP59]_.
The cutoffs list specifies one or more integers separated by colons
(``:``), indicating the number of bits (starting with the most
significant bit) of the hash used to form subsequent subdirectory names.
For example, the list of ``2:4`` would indicate that top-level directory
names are formed using the 2 most significant bits of the hash (resulting
in 2² = 4 directories), and each of these directories would have
subdirectories formed using the next 4 bits of the hash (resulting
in 2⁴ = 16 subdirectories each).
The exact algorithm for determining the distfile location follows:
1. Let the distfile filename be **F**.
2. Compute the hash of **F** and store its binary value as **H**.
3. For each integer **C** in cutoff list:
a. Take **C** most significant bits of **H** and store them as **V**.
b. Convert **V** into hexadecimal integer, left padded with zeros
to **C/4** digits (rounded up) and append it to the path, followed
by the path separator.
c. Shift **H** left **C** bits.
4. Finally, append **F** to the obtained path.
In particular, note that when using nested directories
the subdirectories do not repeat the hash bits used in parent directory.
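The steps above can be sketched in Python as follows. This is only
an illustration: the function name ``distfile_path``, the UTF-8 encoding
of the filename, the ``/`` separator and the use of ``hashlib`` algorithm
names (rather than Manifest hash names) are assumptions.

```python
import hashlib


def distfile_path(filename, hash_name="sha512", cutoffs="8"):
    """Compute the relative path of a distfile in the filename hash structure.

    hash_name is a hashlib algorithm name (assumed mapping from the
    Manifest hash name); cutoffs is a colon-separated list such as "2:4".
    """
    # Assumption: the filename is hashed as UTF-8 bytes.
    digest = hashlib.new(hash_name, filename.encode("utf-8")).digest()
    remaining_bits = len(digest) * 8
    h = int.from_bytes(digest, "big")
    parts = []
    for c in (int(field) for field in cutoffs.split(":")):
        remaining_bits -= c
        v = h >> remaining_bits          # the C most significant bits
        h &= (1 << remaining_bits) - 1   # discard them (the "shift left" step)
        width = (c + 3) // 4             # C/4 hex digits, rounded up
        parts.append(format(v, "0{}x".format(width)))
    parts.append(filename)
    return "/".join(parts)


print(distfile_path("foo.tar.gz", "sha512", "4:4"))
```

With a single cutoff of ``8`` the directory name is the first two hex
digits of the digest; with ``4:4`` the same two digits are spread over two
nesting levels, since used bits are not repeated.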
Migrating mirrors to the hashed structure
-----------------------------------------
Since all distfile mirrors sync to the master Gentoo mirror, it should
be enough to perform all the needed changes on the master mirror
and wait for other mirrors to sync. The following procedure
is recommended:
1. Include the initial ``layout.conf`` listing only ``flat`` layout.
2. Create the new structure alongside the flat structure. Wait for
mirrors to sync.
3. Once all mirrors receive the new structure, update ``layout.conf``
to list the ``filename-hash`` structure.
4. Once a version of Portage supporting the new structure is stable long
enough, remove the fallback ``flat`` structure from ``layout.conf``
and duplicate distfiles.
This implies that during the migration period the distfiles will
be stored duplicated on the mirrors and therefore will occupy twice
as much space. Technically, this could be avoided either by using
hard links or symbolic links.
The hard link solution allows us to save space on the master mirror.
Additionally, if the mirrors use the ``-H`` rsync option, it avoids
transferring existing files again. However, this option is known
to be expensive and could cause significant server load. Without it,
all mirrors need to transfer a second copy of all the existing files.
The symbolic link solution could be more reliable if we could rely
on mirrors using the ``--links`` rsync option. Without that, symbolic
links are not transferred at all.
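For illustration, the hard link variant of step 2 could look roughly like
the sketch below. It assumes SHA512 of the UTF-8 filename with a single
8-bit cutoff; the function names are hypothetical and this is not
the actual mirror tooling.

```python
import hashlib
import os


def hashed_subdir(filename):
    # Assumption: SHA512 with one 8-bit cutoff, i.e. the first two
    # hex digits of the digest of the file name.
    return hashlib.sha512(filename.encode("utf-8")).hexdigest()[:2]


def link_into_hashed_tree(root):
    """Hard-link every regular file in root into its hashed subdirectory.

    Returns the list of newly created links. Existing targets are left
    alone, so the function can be re-run safely.
    """
    with os.scandir(root) as entries:
        names = [e.name for e in entries
                 if e.is_file(follow_symlinks=False)
                 and e.name != "layout.conf"]
    created = []
    for name in names:
        subdir = os.path.join(root, hashed_subdir(name))
        os.makedirs(subdir, exist_ok=True)
        target = os.path.join(subdir, name)
        if not os.path.lexists(target):
            # Both names share one inode, so no extra space is used.
            os.link(os.path.join(root, name), target)
            created.append(target)
    return created
```

The flat files stay in place, so old clients keep working while the new
tree is populated next to them.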
Using hashed structure for local distfiles
------------------------------------------
The hashed structure defined above could also be used for local distfile
storage as used by the package manager. For this to work, the package
manager authors need to ensure that:
a. The ``${DISTDIR}`` variable in the ebuild scope points to a temporary
directory where distfiles specific to the package are linked
in a flat structure.
b. All tools are updated to support the nested structure.
c. The package manager provides a tool for users to easily manipulate
distfiles, in particular to add distfiles for fetch-restricted
packages into an appropriate subdirectory.
For extended compatibility, the package manager may support finding
distfiles in flat and nested structure simultaneously.
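A lookup that accepts both locations could be sketched as below.
The preference for the nested location over the flat one, the SHA512 hash
with a single 8-bit cutoff and the name ``find_local_distfile`` are
assumptions made for the example.

```python
import hashlib
import os


def find_local_distfile(distdir, filename):
    """Return the path of filename under distdir, or None if absent."""
    # Assumption: SHA512 of the UTF-8 file name, single 8-bit cutoff.
    subdir = hashlib.sha512(filename.encode("utf-8")).hexdigest()[:2]
    candidates = [
        os.path.join(distdir, subdir, filename),  # nested structure
        os.path.join(distdir, filename),          # traditional flat structure
    ]
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```

A user who drops a fetch-restricted file straight into the top directory
would still be served by the second candidate.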
Rationale
=========
Algorithm for splitting distfiles
---------------------------------
In the original debate that occurred in bug #534528 [#BUG534528]_,
three possible solutions for splitting distfiles were listed:
a. using initial portion of filename,
b. using initial portion of file hash,
c. using initial portion of filename hash.
The significant advantage of the filename option was simplicity. With
that solution, the users could easily determine the correct subdirectory
themselves. However, its significant disadvantage was a very uneven
distribution of data. In particular, the TeΧ Live packages alone account for
almost 23500 distfiles and all use a common prefix, making it impossible
to split them further.
The alternate option of using file hash has the advantage of having
a more balanced split. Furthermore, since hashes are stored
in Manifests using them is zero-cost. However, this solution has two
significant disadvantages:
1. The hash values are unknown for newly-downloaded distfiles, so
``repoman`` (or an equivalent tool) would have to use a temporary
directory before locating the file in appropriate subdirectory.
2. User-provided distfiles (e.g. for fetch-restricted packages) with
hash mismatches would be placed in the wrong subdirectory,
potentially causing confusing errors.
Using filename hashes has proven to provide a balance similar
to that of file hashes. Furthermore, since filenames are known up front,
this solution does not suffer from either of the listed problems. While
hashes need to be computed manually, hashing a short string should not
cause any performance problems.
.. figure:: glep-0075-extras/by-filename.png
Distribution of distfiles by first character of filenames
.. figure:: glep-0075-extras/by-csum.png
Distribution of distfiles by first hex-digit of checksum
(x --- content checksum, + --- filename checksum)
.. figure:: glep-0075-extras/by-csum2.png
Distribution of distfiles by two first hex-digits of checksum
(x --- content checksum, + --- filename checksum)
Layout file
-----------
The presence of a control file has been suggested in the original
discussion. Its main purpose is to let package managers cleanly handle
the migration and detect how to correctly query the mirrors throughout
it. Furthermore, it makes future changes easier.
The format is specifically meant to hardcode as little about
the actual algorithm as possible. Therefore, we can easily change
the hash used or the exact split structure without having to update
the package managers or even provide a compatibility layout.
The file is also open for future extensions to provide additional mirror
metadata. However, no clear use for that has been determined so far.
Hash algorithm
--------------
The hash algorithm support is fully deferred to the existing code
in the package managers that is required to handle Manifests.
In particular, it is recommended to reuse one of the hashes that are
used in Manifest entries at the time. This avoids code duplication
and reuses an existing mechanism to handle hash upgrades.
During the discussion, it has been pointed out that this particular use case
does not require a cryptographically strong hash and a faster algorithm
could be used instead. However, given the short length of the hashed
strings, performance is not a problem, and speed does not justify
the resulting code duplication.
It has also been pointed out that e.g. the BLAKE2 hash family provides
the ability to create arbitrary-length hashes instead of truncating
the standard-length hash. However, not all implementations of BLAKE2
support that, and relying on it could reduce portability for no apparent
gain.
Backwards Compatibility
=======================
Mirror compatibility
--------------------
The mirrored files are propagated to other mirrors as opaque directory
structure. Therefore, there are no backwards compatibility concerns
on the mirroring side.
Backwards compatibility with existing clients is detailed
in the `migrating mirrors to the hashed structure`_ section. Backwards
compatibility with old clients will be provided by preserving
the flat structure during the transitional period.
The new clients will fetch the ``layout.conf`` file to avoid backwards
compatibility concerns in the future. In case of hitting an old mirror,
the package manager will default to the ``flat`` structure.
Package manager storage compatibility
-------------------------------------
The exact means of preserving backwards compatibility in package manager
storage are left to the package manager authors. However, it is
recommended that package managers continue to support the flat layout
even if it is no longer the default. The package manager may either
continue to read files from this location or automatically move them
to an appropriate subdirectory.
Reference Implementation
========================
TODO.
References
==========
.. [#DESKTOP_FORMAT] Desktop Entry Specification: Basic format of the file
(https://standards.freedesktop.org/desktop-entry-spec/latest/ar01s03.html)
.. [#GLEP74] GLEP 74: Full-tree verification using Manifest files:
Checksum algorithms (informational)
(https://www.gentoo.org/glep/glep-0074.html#checksum-algorithms-informational)
.. [#GLEP59] GLEP 59: Manifest2 hash policies and security implications
(https://www.gentoo.org/glep/glep-0059.html)
.. [#BUG534528] Bug 534528 - distfiles should be sorted into subdirectories
of DISTDIR
(https://bugs.gentoo.org/534528)
Copyright
=========
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
Unported License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.
--
Best regards,
Michał Górny
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-26 23:24 [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure Michał Górny
@ 2018-01-27 1:48 ` Michael Orlitzky
2018-01-27 2:44 ` R0b0t1
2018-01-27 8:30 ` Michał Górny
2018-01-28 7:01 ` Jason Zaman
` (3 subsequent siblings)
4 siblings, 2 replies; 44+ messages in thread
From: Michael Orlitzky @ 2018-01-27 1:48 UTC
To: gentoo-dev
On 01/26/2018 06:24 PM, Michał Górny wrote:
>
> The alternate option of using file hash has the advantage of having
> a more balanced split. Furthermore, since hashes are stored
> in Manifests using them is zero-cost. However, this solution has two
> significant disadvantages:
>
> 1. The hash values are unknown for newly-downloaded distfiles, so
> ``repoman`` (or an equivalent tool) would have to use a temporary
> directory before locating the file in appropriate subdirectory.
>
> 2. User-provided distfiles (e.g. for fetch-restricted packages) with
> hash mismatches would be placed in the wrong subdirectory,
> potentially causing confusing errors.
>
The filename proposal sounds fine, so this is only academic, but: are
these two points really disadvantages?
What are we worried about in using a temporary directory? Copying across
filesystem boundaries? Except in rare cases, $DISTDIR itself will be
usable as a temporary location (on the same filesystem), won't it?
For the second point, portage is going to tell me where to put the file,
isn't it? Then no matter what garbage I download, won't portage look for
it in the right place, because where-to-put-it is determined using the
same manifest hash that determines where-to-find-it?
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-27 1:48 ` Michael Orlitzky
@ 2018-01-27 2:44 ` R0b0t1
2018-01-27 8:30 ` Michał Górny
1 sibling, 0 replies; 44+ messages in thread
From: R0b0t1 @ 2018-01-27 2:44 UTC
To: gentoo-dev
On Fri, Jan 26, 2018 at 7:48 PM, Michael Orlitzky <mjo@gentoo.org> wrote:
> On 01/26/2018 06:24 PM, Michał Górny wrote:
>>
>> The alternate option of using file hash has the advantage of having
>> a more balanced split. Furthermore, since hashes are stored
>> in Manifests using them is zero-cost. However, this solution has two
>> significant disadvantages:
>>
>> 1. The hash values are unknown for newly-downloaded distfiles, so
>> ``repoman`` (or an equivalent tool) would have to use a temporary
>> directory before locating the file in appropriate subdirectory.
>>
>> 2. User-provided distfiles (e.g. for fetch-restricted packages) with
>> hash mismatches would be placed in the wrong subdirectory,
>> potentially causing confusing errors.
>>
>
> The filename proposal sounds fine,
I've had to interact with the distfile server by hand, and would
appreciate it if the files can be maintained in some way that finding
them is obvious without tools.
Every once in a while I navigate to the distfile root and need to
forcefully exit Firefox.
Cheers,
R0b0t1
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-27 1:48 ` Michael Orlitzky
2018-01-27 2:44 ` R0b0t1
@ 2018-01-27 8:30 ` Michał Górny
2018-01-27 11:36 ` Roy Bamford
2018-01-27 16:47 ` Michael Orlitzky
1 sibling, 2 replies; 44+ messages in thread
From: Michał Górny @ 2018-01-27 8:30 UTC
To: gentoo-dev
On Fri, 26.01.2018 at 20:48 -0500, Michael Orlitzky wrote:
> On 01/26/2018 06:24 PM, Michał Górny wrote:
> >
> > The alternate option of using file hash has the advantage of having
> > a more balanced split. Furthermore, since hashes are stored
> > in Manifests using them is zero-cost. However, this solution has two
> > significant disadvantages:
> >
> > 1. The hash values are unknown for newly-downloaded distfiles, so
> > ``repoman`` (or an equivalent tool) would have to use a temporary
> > directory before locating the file in appropriate subdirectory.
> >
> > 2. User-provided distfiles (e.g. for fetch-restricted packages) with
> > hash mismatches would be placed in the wrong subdirectory,
> > potentially causing confusing errors.
> >
>
> The filename proposal sounds fine, so this is only academic, but: are
> these two points really disadvantages?
>
> What are we worried about in using a temporary directory? Copying across
> filesystem boundaries? Except in rare cases, $DISTDIR itself will be
> usable a temporary location (on the same filesystem), won't it?
Why add the extra complexity when there's no need for one? Note that
there's also the problem of resuming transfers, so in the end we're
talking about permanent temporary directory where we keep unfinished
transfers.
> For the second point, portage is going to tell me where to put the file,
> isn't it? Then no matter what garbage I download, won't portage look for
> it in the right place, because where-to-put-it is determined using the
> same manifest hash that determines where-to-find-it?
No, it won't. Why would it? You're going to call something like:
edistadd foo.tar.gz bar.tar.gz
...and it will place the files in the right subdirectories.
--
Best regards,
Michał Górny
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-27 8:30 ` Michał Górny
@ 2018-01-27 11:36 ` Roy Bamford
2018-01-27 11:41 ` Michał Górny
2018-01-27 16:47 ` Michael Orlitzky
1 sibling, 1 reply; 44+ messages in thread
From: Roy Bamford @ 2018-01-27 11:36 UTC
To: gentoo-dev
On 2018.01.27 08:30, Michał Górny wrote:
> On Fri, 26.01.2018 at 20:48 -0500, Michael Orlitzky wrote:
> > On 01/26/2018 06:24 PM, Michał Górny wrote:
> > >
> > > The alternate option of using file hash has the advantage of
> having
> > > a more balanced split. Furthermore, since hashes are stored
> > > in Manifests using them is zero-cost. However, this solution has
> two
> > > significant disadvantages:
> > >
> > > 1. The hash values are unknown for newly-downloaded distfiles, so
> > > ``repoman`` (or an equivalent tool) would have to use a
> temporary
> > > directory before locating the file in appropriate subdirectory.
> > >
> > > 2. User-provided distfiles (e.g. for fetch-restricted packages)
> with
> > > hash mismatches would be placed in the wrong subdirectory,
> > > potentially causing confusing errors.
> > >
> >
> > The filename proposal sounds fine, so this is only academic, but:
> are
> > these two points really disadvantages?
> >
> > What are we worried about in using a temporary directory? Copying
> across
> > filesystem boundaries? Except in rare cases, $DISTDIR itself will be
> > usable a temporary location (on the same filesystem), won't it?
>
> Why add the extra complexity when there's no need for one? Note that
> there's also the problem of resuming transfers, so in the end we're
> talking about permanent temporary directory where we keep unfinished
> transfers.
>
> > For the second point, portage is going to tell me where to put the
> file,
> > isn't it? Then no matter what garbage I download, won't portage look
> for
> > it in the right place, because where-to-put-it is determined using
> the
> > same manifest hash that determines where-to-find-it?
>
> No, it won't. Why would it? You're going to call something like:
>
> edistadd foo.tar.gz bar.tar.gz
>
> ...and it will place the files in the right subdirectories.
>
> --
> Best regards,
> Michał Górny
>
>
>
>
Michał,
How does this work for fetch restricted files and finding other files
no longer on the mirrors?
It's no longer a download and move it to $DISTFILES, or is it?
Whatever it is, users will need to do it unless files in $DISTFILES
are accepted by package managers if they are not found in the main
structure.
--
Regards,
Roy Bamford
(Neddyseagoon) a member of
elections
gentoo-ops
forum-mods
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-27 11:36 ` Roy Bamford
@ 2018-01-27 11:41 ` Michał Górny
2018-01-27 16:42 ` Gordon Pettey
2018-01-30 1:21 ` Kent Fredric
0 siblings, 2 replies; 44+ messages in thread
From: Michał Górny @ 2018-01-27 11:41 UTC
To: gentoo-dev
On Sat, 27.01.2018 at 11:36 +0000, Roy Bamford wrote:
> On 2018.01.27 08:30, Michał Górny wrote:
> > On Fri, 26.01.2018 at 20:48 -0500, Michael Orlitzky wrote:
> > > On 01/26/2018 06:24 PM, Michał Górny wrote:
> > > >
> > > > The alternate option of using file hash has the advantage of
> >
> > having
> > > > a more balanced split. Furthermore, since hashes are stored
> > > > in Manifests using them is zero-cost. However, this solution has
> >
> > two
> > > > significant disadvantages:
> > > >
> > > > 1. The hash values are unknown for newly-downloaded distfiles, so
> > > > ``repoman`` (or an equivalent tool) would have to use a
> >
> > temporary
> > > > directory before locating the file in appropriate subdirectory.
> > > >
> > > > 2. User-provided distfiles (e.g. for fetch-restricted packages)
> >
> > with
> > > > hash mismatches would be placed in the wrong subdirectory,
> > > > potentially causing confusing errors.
> > > >
> > >
> > > The filename proposal sounds fine, so this is only academic, but:
> >
> > are
> > > these two points really disadvantages?
> > >
> > > What are we worried about in using a temporary directory? Copying
> >
> > across
> > > filesystem boundaries? Except in rare cases, $DISTDIR itself will be
> > > usable a temporary location (on the same filesystem), won't it?
> >
> > Why add the extra complexity when there's no need for one? Note that
> > there's also the problem of resuming transfers, so in the end we're
> > talking about permanent temporary directory where we keep unfinished
> > transfers.
> >
> > > For the second point, portage is going to tell me where to put the
> >
> > file,
> > > isn't it? Then no matter what garbage I download, won't portage look
> >
> > for
> > > it in the right place, because where-to-put-it is determined using
> >
> > the
> > > same manifest hash that determines where-to-find-it?
> >
> > No, it won't. Why would it? You're going to call something like:
> >
> > edistadd foo.tar.gz bar.tar.gz
> >
> > ...and it will place the files in the right subdirectories.
> >
> > --
> > Best regards,
> > Michał Górny
> >
> >
> >
> >
>
> Michał,
>
> How does this work for fetch restricted files and finding other files
> no longer on the mirrors?
>
> Its no longer a download and move it to $DISTFILES, or is it?
> Whatever it is, users will need to do it unless files in $DISTFILES
> are accepted by package managers if they are not found in the main
> structure.
I've just answered that, and it's in the GLEP also. There will be
a helper tool to make this easy. Furthermore, I think we may even make
Portage keep accepting both locations indefinitely.
As for finding files in your distdir, there's no reason why plain:
find -name 'foo.tar.gz'
wouldn't work.
--
Best regards,
Michał Górny
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-27 11:41 ` Michał Górny
@ 2018-01-27 16:42 ` Gordon Pettey
2018-01-27 16:48 ` Michael Orlitzky
2018-01-30 1:21 ` Kent Fredric
1 sibling, 1 reply; 44+ messages in thread
From: Gordon Pettey @ 2018-01-27 16:42 UTC
To: gentoo-dev
Why not use a hash of the file name instead of its contents? That
seems like it would be much simpler, and that's not going to reduce
the output space for balance...
On Sat, Jan 27, 2018 at 5:41 AM, Michał Górny <mgorny@gentoo.org> wrote:
> On Sat, 27.01.2018 at 11:36 +0000, Roy Bamford wrote:
>> On 2018.01.27 08:30, Michał Górny wrote:
>> > On Fri, 26.01.2018 at 20:48 -0500, Michael Orlitzky wrote:
>> > > On 01/26/2018 06:24 PM, Michał Górny wrote:
>> > > >
>> > > > The alternate option of using file hash has the advantage of
>> >
>> > having
>> > > > a more balanced split. Furthermore, since hashes are stored
>> > > > in Manifests using them is zero-cost. However, this solution has
>> >
>> > two
>> > > > significant disadvantages:
>> > > >
>> > > > 1. The hash values are unknown for newly-downloaded distfiles, so
>> > > > ``repoman`` (or an equivalent tool) would have to use a
>> >
>> > temporary
>> > > > directory before locating the file in appropriate subdirectory.
>> > > >
>> > > > 2. User-provided distfiles (e.g. for fetch-restricted packages)
>> >
>> > with
>> > > > hash mismatches would be placed in the wrong subdirectory,
>> > > > potentially causing confusing errors.
>> > > >
>> > >
>> > > The filename proposal sounds fine, so this is only academic, but:
>> >
>> > are
>> > > these two points really disadvantages?
>> > >
>> > > What are we worried about in using a temporary directory? Copying
>> >
>> > across
>> > > filesystem boundaries? Except in rare cases, $DISTDIR itself will be
>> > > usable a temporary location (on the same filesystem), won't it?
>> >
>> > Why add the extra complexity when there's no need for one? Note that
>> > there's also the problem of resuming transfers, so in the end we're
>> > talking about permanent temporary directory where we keep unfinished
>> > transfers.
>> >
>> > > For the second point, portage is going to tell me where to put the
>> >
>> > file,
>> > > isn't it? Then no matter what garbage I download, won't portage look
>> >
>> > for
>> > > it in the right place, because where-to-put-it is determined using
>> >
>> > the
>> > > same manifest hash that determines where-to-find-it?
>> >
>> > No, it won't. Why would it? You're going to call something like:
>> >
>> > edistadd foo.tar.gz bar.tar.gz
>> >
>> > ...and it will place the files in the right subdirectories.
>> >
>> > --
>> > Best regards,
>> > Michał Górny
>> >
>> >
>> >
>> >
>>
>> Michał,
>>
>> How does this work for fetch restricted files and finding other files
>> no longer on the mirrors?
>>
>> Its no longer a download and move it to $DISTFILES, or is it?
>> Whatever it is, users will need to do it unless files in $DISTFILES
>> are accepted by package managers if they are not found in the main
>> structure.
>
> I've just answered that, and it's in the GLEP also. There will be
> a helper tool to make this easy. Furthermore, I think we may even make
> Portage keep accepting both locations indefinitely.
>
> As for finding files in your distdir, there's no reason why plain:
>
> find -name 'foo.tar.gz'
>
> wouldn't work.
>
> --
> Best regards,
> Michał Górny
>
>
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-27 8:30 ` Michał Górny
2018-01-27 11:36 ` Roy Bamford
@ 2018-01-27 16:47 ` Michael Orlitzky
2018-01-27 18:14 ` Michał Górny
1 sibling, 1 reply; 44+ messages in thread
From: Michael Orlitzky @ 2018-01-27 16:47 UTC
To: gentoo-dev
On 01/27/2018 03:30 AM, Michał Górny wrote:
>>
>> What are we worried about in using a temporary directory? Copying across
>> filesystem boundaries? Except in rare cases, $DISTDIR itself will be
>> usable a temporary location (on the same filesystem), won't it?
>
> Why add the extra complexity when there's no need for one? Note that
> there's also the problem of resuming transfers, so in the end we're
> talking about permanent temporary directory where we keep unfinished
> transfers.
Can't argue with that, but I don't see it as a huge "con."
>> For the second point, portage is going to tell me where to put the file,
>> isn't it? Then no matter what garbage I download, won't portage look for
>> it in the right place, because where-to-put-it is determined using the
>> same manifest hash that determines where-to-find-it?
>
> No, it won't. Why would it? You're going to call something like:
>
> edistadd foo.tar.gz bar.tar.gz
>
> ...and it will place the files in the right subdirectories.
If we have a tool like edistadd, then I see the problem. But if we were
going to use file-data based hashes, then there would be no need for a
tool in most cases. As a developer, "repoman manifest" would handle it.
As a user, I'm going to see a message like,
Fetch instructions for games-fps/doom3-lms-4:
* Please download LastManStandingCoop4Multiplatform.zip from:
* http://www.moddb.com/mods/last-man-standing-coop/downloads
* and move it to /var/cache/portage/distfiles
except instead of $DISTDIR, it would suggest whatever directory is
computed from the hash in the manifest.
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-27 16:42 ` Gordon Pettey
@ 2018-01-27 16:48 ` Michael Orlitzky
2018-01-27 19:01 ` Gordon Pettey
0 siblings, 1 reply; 44+ messages in thread
From: Michael Orlitzky @ 2018-01-27 16:48 UTC
To: gentoo-dev
On 01/27/2018 11:42 AM, Gordon Pettey wrote:
> Why not use a hash of the file name instead of its contents? That
> seems like it would be much simpler, and that's not going to reduce
> the output space for balance...
That's the proposal =P
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-27 16:47 ` Michael Orlitzky
@ 2018-01-27 18:14 ` Michał Górny
2018-01-27 18:24 ` Michael Orlitzky
0 siblings, 1 reply; 44+ messages in thread
From: Michał Górny @ 2018-01-27 18:14 UTC
To: gentoo-dev
On Sat, 27.01.2018 at 11:47 -0500, Michael Orlitzky wrote:
> On 01/27/2018 03:30 AM, Michał Górny wrote:
> > >
> > > What are we worried about in using a temporary directory? Copying across
> > > filesystem boundaries? Except in rare cases, $DISTDIR itself will be
> > > usable a temporary location (on the same filesystem), won't it?
> >
> > Why add the extra complexity when there's no need for one? Note that
> > there's also the problem of resuming transfers, so in the end we're
> > talking about permanent temporary directory where we keep unfinished
> > transfers.
>
> Can't argue with that, but I don't see it as a huge "con."
>
>
> > > For the second point, portage is going to tell me where to put the file,
> > > isn't it? Then no matter what garbage I download, won't portage look for
> > > it in the right place, because where-to-put-it is determined using the
> > > same manifest hash that determines where-to-find-it?
> >
> > No, it won't. Why would it? You're going to call something like:
> >
> > edistadd foo.tar.gz bar.tar.gz
> >
> > ...and it will place the files in the right subdirectories.
>
> If we have a tool like edistadd, then I see the problem. But if we were
> going to use file-data based hashes, then there would be no need for a
> tool in most cases. As a developer, "repoman manifest" would handle it.
> As a user, I'm going to see a message like,
>
> Fetch instructions for games-fps/doom3-lms-4:
> * Please download LastManStandingCoop4Multiplatform.zip from:
> * http://www.moddb.com/mods/last-man-standing-coop/downloads
> * and move it to /var/cache/portage/distfiles
>
> except instead of $DISTDIR, it would suggest whatever directory is
> computed from the hash in the manifest.
>
How would that work if you had 5 different files, every one evaluating
to a different directory?
--
Best regards,
Michał Górny
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
From: Michael Orlitzky @ 2018-01-27 18:24 UTC
To: gentoo-dev
On 01/27/2018 01:14 PM, Michał Górny wrote:
>>
>> If we have a tool like edistadd, then I see the problem. But if we were
>> going to use file-data based hashes, then there would be no need for a
>> tool in most cases. As a developer, "repoman manifest" would handle it.
>> As a user, I'm going to see a message like,
>>
>> Fetch instructions for games-fps/doom3-lms-4:
>> * Please download LastManStandingCoop4Multiplatform.zip from:
>> * http://www.moddb.com/mods/last-man-standing-coop/downloads
>> * and move it to /var/cache/portage/distfiles
>>
>> except instead of $DISTDIR, it would suggest whatever directory is
>> computed from the hash in the manifest.
>>
>
> How would that work if you had 5 different files, every one evaluating
> to a different directory?
>
for i in range(1,N):
do-what-you-did-for-the-first-one(i)
For example,
Fetch instructions for app-cat/pkg:
*
* Please download file1 from:
* wherever file1 can be found
* and move it to $DISTDIR/subdir1
*
* Please download file2 from
* wherever file2 can be found
* and move it to $DISTDIR/subdir2
*
* ...
*
* Please download fileN from
* wherever fileN can be found
* and move it to $DISTDIR/subdirN
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
From: Gordon Pettey @ 2018-01-27 19:01 UTC
To: gentoo-dev
On Sat, Jan 27, 2018 at 10:48 AM, Michael Orlitzky <mjo@gentoo.org> wrote:
> On 01/27/2018 11:42 AM, Gordon Pettey wrote:
>> Why not use a hash of the file name instead of its contents? That
>> seems like it would be much simpler, and that's not going to reduce
>> the output space for balance...
>
> That's the proposal =P
I'm not following, then. What's all this about a temporary directory
because of not knowing the hash in advance? The ebuild must specify
the file name, or src_unpack wouldn't work. There is never a point
where the file name, and therefore its hash, is unknown.
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
From: Michał Górny @ 2018-01-27 19:47 UTC
To: gentoo-dev
On Sat, 27.01.2018 at 13:24 -0500, Michael Orlitzky wrote:
> On 01/27/2018 01:14 PM, Michał Górny wrote:
> > >
> > > If we have a tool like edistadd, then I see the problem. But if we were
> > > going to use file-data based hashes, then there would be no need for a
> > > tool in most cases. As a developer, "repoman manifest" would handle it.
> > > As a user, I'm going to see a message like,
> > >
> > > Fetch instructions for games-fps/doom3-lms-4:
> > > * Please download LastManStandingCoop4Multiplatform.zip from:
> > > * http://www.moddb.com/mods/last-man-standing-coop/downloads
> > > * and move it to /var/cache/portage/distfiles
> > >
> > > except instead of $DISTDIR, it would suggest whatever directory is
> > > computed from the hash in the manifest.
> > >
> >
> > How would that work if you had 5 different files, every one evaluating
> > to a different directory?
> >
>
> for i in range(1,N):
> do-what-you-did-for-the-first-one(i)
>
> For example,
>
> Fetch instructions for app-cat/pkg:
> *
> * Please download file1 from:
> * wherever file1 can be found
> * and move it to $DISTDIR/subdir1
> *
> * Please download file2 from
> * wherever file2 can be found
> * and move it to $DISTDIR/subdir2
> *
> * ...
> *
> * Please download fileN from
> * wherever fileN can be found
> * and move it to $DISTDIR/subdirN
>
Do you really believe this to be more friendly than a helper that places
all the files in correct directories?
--
Best regards,
Michał Górny
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
From: Michael Orlitzky @ 2018-01-27 20:16 UTC
To: gentoo-dev
On 01/27/2018 02:01 PM, Gordon Pettey wrote:
> On Sat, Jan 27, 2018 at 10:48 AM, Michael Orlitzky <mjo@gentoo.org> wrote:
>> On 01/27/2018 11:42 AM, Gordon Pettey wrote:
>>> Why not use a hash of the file name instead of its contents?...
>>
>> That's the proposal =P
>
> I'm not following, then. What's all this about a temporary directory
> because of not knowing the hash in advance? The ebuild must specify
> the file name, or src_unpack wouldn't work. There is never a point
> where the file name, and therefore its hash, is unknown.
>
There were three proposed ways to split up the files: by filename,
hash(filename), and hash(filedata). The winner was hash(filename). The
GLEP lists only two reasons for rejecting hash(filedata), and I was
curious about them.
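For reference, the three candidates can be sketched in Python as below. This is a minimal illustration, not the GLEP's algorithm: the hash function (SHA-256), the `width` parameter, and the function names are all assumptions, since the thread only speaks of an "initial portion" of the name or hash.

```python
import hashlib


def bucket_by_filename(name: str, width: int = 1) -> str:
    """Option (a): leading characters of the file name itself."""
    return name[:width]


def bucket_by_filename_hash(name: str, width: int = 1) -> str:
    """Option (c): leading hex digits of a hash of the file name.

    SHA-256 is an assumption made for this sketch only.
    """
    digest = hashlib.sha256(name.encode("utf-8")).hexdigest()
    return digest[:width]


def bucket_by_content_hash(path: str, width: int = 1) -> str:
    """Option (b): leading hex digits of a hash of the file data.

    Unlike the other two, this needs the file to be present on disk.
    """
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large distfiles are not loaded into memory.
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()[:width]
```

Only option (b) takes a path rather than a name, which is the practical difference the temporary-directory discussion is about: the bucket cannot be computed until the data has been fetched.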
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
From: Michael Orlitzky @ 2018-01-27 20:30 UTC
To: gentoo-dev
On 01/27/2018 02:47 PM, Michał Górny wrote:
>> *
>> * Please download fileN from
>> * wherever fileN can be found
>> * and move it to $DISTDIR/subdirN
>
> Do you really believe this to be more friendly than a helper that places
> all the files in correct directories?
Well, it's no worse than what we have now, because it is what we have
now. And if having an "edistadd" helper is easier, then it could check
the computed hash against the manifest before moving it under $DISTDIR.
If the computed hash and manifest hash don't match, a decent error
message can be displayed.
The person who asked for it and the person who is implementing this
prefer hash(filename), so by all means, do that. The two stated reasons
for rejecting hash(filedata) just stood out to me as not very
compelling. Maybe Robin's comment from the bug is a better reason?
> Checksum from the manifest is problematic because not every file in
> distfiles is in a Manifest (there are whitelisted files).
(https://bugs.gentoo.org/534528#c29)
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
From: Jason Zaman @ 2018-01-28 7:01 UTC
To: gentoo-dev
On Sat, Jan 27, 2018 at 12:24:39AM +0100, Michał Górny wrote:
> Migrating mirrors to the hashed structure
> -----------------------------------------
> The hard link solution allows us to save space on the master mirror.
> Additionally, if ``-H`` option is used by the mirrors it avoids
> transferring existing files again. However, this option is known
> to be expensive and could cause significant server load. Without it,
> all mirrors need to transfer a second copy of all the existing files.
>
> The symbolic link solution could be more reliable if we could rely
> on mirrors using the ``--links`` rsync option. Without that, symbolic
> links are not transferred at all.
These rsync options might help for mirrors too:
--compare-dest=DIR also compare destination files relative to DIR
--copy-dest=DIR ... and include copies of unchanged files
--link-dest=DIR hardlink to files in DIR when unchanged
> Using hashed structure for local distfiles
> ------------------------------------------
> The hashed structure defined above could also be used for local distfile
> storage as used by the package manager. For this to work, the package
> manager authors need to ensure that:
>
> a. The ``${DISTDIR}`` variable in the ebuild scope points to a temporary
> directory where distfiles specific to the package are linked
> in a flat structure.
>
> b. All tools are updated to support the nested structure.
>
> c. The package manager provides a tool for users to easily manipulate
> distfiles, in particular to add distfiles for fetch-restricted
> packages into an appropriate subdirectory.
>
> For extended compatibility, the package manager may support finding
> distfiles in flat and nested structure simultaneously.
Trying nested first and then falling back to flat would make it easy for
users who have to download distfiles for fetch-restricted packages,
because the instructions would stay as "move it to
/usr/portage/distfiles".
Alternatively, the tool could have a mode that goes through all
files in the base dir and moves each one to where it belongs in the
nested tree. Then you save everything to the same dir and run edist --fix.
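Both ideas can be sketched roughly as follows. This is only an illustration under stated assumptions: the two-hex-digit SHA-256 bucket and the names `nested_path`, `find_distfile`, and `fix_layout` are hypothetical, and `fix_layout` stands in for the suggested "edist --fix" mode, which does not exist.

```python
import hashlib
import os
import shutil


def nested_path(distdir, name):
    """Path of `name` in the nested layout (assumed: 2 hex digits of SHA-256)."""
    sub = hashlib.sha256(name.encode("utf-8")).hexdigest()[:2]
    return os.path.join(distdir, sub, name)


def find_distfile(distdir, name):
    """Try the nested location first, then fall back to the flat one."""
    for candidate in (nested_path(distdir, name), os.path.join(distdir, name)):
        if os.path.isfile(candidate):
            return candidate
    return None


def fix_layout(distdir):
    """Move every regular file in the top-level dir into its nested bucket."""
    moved = []
    for entry in sorted(os.listdir(distdir)):
        src = os.path.join(distdir, entry)
        if not os.path.isfile(src):
            continue  # skip bucket directories that already exist
        dst = nested_path(distdir, entry)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.move(src, dst)
        moved.append(entry)
    return moved
```

With the fallback in place, a user can keep dropping fetch-restricted files into the top-level directory, and the move into the nested tree becomes an optional tidy-up step.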
> Rationale
> =========
> Algorithm for splitting distfiles
> ---------------------------------
> In the original debate that occurred in bug #534528 [#BUG534528]_,
> three possible solutions for splitting distfiles were listed:
>
> a. using initial portion of filename,
>
> b. using initial portion of file hash,
>
> c. using initial portion of filename hash.
>
> The significant advantage of the filename option was simplicity. With
> that solution, the users could easily determine the correct subdirectory
> themselves. However, its significant disadvantage was very uneven
> shuffling of data. In particular, the TeΧ Live packages alone count
> almost 23500 distfiles and all use a common prefix, making it impossible
> to split them further.
Is the filename the original upstream one or the renamed one? E.g. with
SRC_URI="http://foo/foo.tar -> bar.tar", will it be bar.tar?
I think I'm in favour of using the initial part of the filename anyway.
Sure, it's not balanced, but it's still a hell of a lot more balanced than
today, and it's really easy.
Another thing I'm wondering is whether we can just use the same dir layout
as the packages themselves. That would fix texlive, since it has a whole
lot of separate packages, e.g. /usr/portage/distfiles/app-cat/pkg/pkg-1.0.tgz.
There is a problem if many packages use the same distfiles (quite
extensive for SELinux: every single one of the sec-policy/selinux-*
packages has identical distfiles), so I'm not sure how to deal with it.
This would also make it easy in future to make the sandbox restrict
access to files outside of that package, if we wanted to do that.
> The alternate option of using file hash has the advantage of having
> a more balanced split. Furthermore, since hashes are stored
> in Manifests using them is zero-cost. However, this solution has two
> significant disadvantages:
>
> 1. The hash values are unknown for newly-downloaded distfiles, so
> ``repoman`` (or an equivalent tool) would have to use a temporary
> directory before locating the file in appropriate subdirectory.
>
> 2. User-provided distfiles (e.g. for fetch-restricted packages) with
> hash mismatches would be placed in the wrong subdirectory,
> potentially causing confusing errors.
Not just this, but on principle, I also think you should be able to read
an ebuild and compute the url to download the file from the mirrors
without any extra knowledge (especially downloading the distfile).
> Using filename hashes has proven to provide a similar balance
> to using file hashes. Furthermore, since filenames are known up front
> this solution does not suffer from either of the listed problems. While
> hashes need to be computed manually, hashing a short string should not
> cause any performance problems.
>
> .. figure:: glep-0075-extras/by-filename.png
>
> Distribution of distfiles by first character of filenames
>
> .. figure:: glep-0075-extras/by-csum.png
>
> Distribution of distfiles by first hex-digit of checksum
> (x --- content checksum, + --- filename checksum)
>
> .. figure:: glep-0075-extras/by-csum2.png
>
> Distribution of distfiles by two first hex-digits of checksum
> (x --- content checksum, + --- filename checksum)
Do you have an easy way to calculate how big the distfiles are per
category or cat/pkg? I'd be interested to see.
> Backwards Compatibility
> =======================
> Mirror compatibility
> --------------------
> The mirrored files are propagated to other mirrors as opaque directory
> structure. Therefore, there are no backwards compatibility concerns
> on the mirroring side.
>
> Backwards compatibility with existing clients is detailed
> in `migrating mirrors to the hashed structure`_ section. Backwards
> compatibility with the old clients will be provided by preserving
> the flat structure during the transitional period.
Even if there was no transition, things wouldn't be terrible, because
Portage would fall back to just downloading from SRC_URI directly
if the mirrors fail.
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
From: Michał Górny @ 2018-01-28 9:10 UTC
To: gentoo-dev
On Sun, 28.01.2018 at 15:01 +0800, Jason Zaman wrote:
> On Sat, Jan 27, 2018 at 12:24:39AM +0100, Michał Górny wrote:
> > Migrating mirrors to the hashed structure
> > -----------------------------------------
> > The hard link solution allows us to save space on the master mirror.
> > Additionally, if ``-H`` option is used by the mirrors it avoids
> > transferring existing files again. However, this option is known
> > to be expensive and could cause significant server load. Without it,
> > all mirrors need to transfer a second copy of all the existing files.
> >
> > The symbolic link solution could be more reliable if we could rely
> > on mirrors using the ``--links`` rsync option. Without that, symbolic
> > links are not transferred at all.
>
> These rsync options might help for mirrors too:
> --compare-dest=DIR also compare destination files relative to DIR
> --copy-dest=DIR ... and include copies of unchanged files
> --link-dest=DIR hardlink to files in DIR when unchanged
>
> > Using hashed structure for local distfiles
> > ------------------------------------------
> > The hashed structure defined above could also be used for local distfile
> > storage as used by the package manager. For this to work, the package
> > manager authors need to ensure that:
> >
> > a. The ``${DISTDIR}`` variable in the ebuild scope points to a temporary
> > directory where distfiles specific to the package are linked
> > in a flat structure.
> >
> > b. All tools are updated to support the nested structure.
> >
> > c. The package manager provides a tool for users to easily manipulate
> > distfiles, in particular to add distfiles for fetch-restricted
> > packages into an appropriate subdirectory.
> >
> > For extended compatibility, the package manager may support finding
> > distfiles in flat and nested structure simultaneously.
>
> Trying nested first and then falling back to flat would make it easy for
> users who have to download distfiles for fetch-restricted packages,
> because the instructions would stay as "move it to
> /usr/portage/distfiles".
> Alternatively, the tool could have a mode that goes through all
> files in the base dir and moves each one to where it belongs in the
> nested tree. Then you save everything to the same dir and run edist --fix.
This is really outside the scope, and up to Portage maintainers.
> > Rationale
> > =========
> > Algorithm for splitting distfiles
> > ---------------------------------
> > In the original debate that occurred in bug #534528 [#BUG534528]_,
> > three possible solutions for splitting distfiles were listed:
> >
> > a. using initial portion of filename,
> >
> > b. using initial portion of file hash,
> >
> > c. using initial portion of filename hash.
> >
> > The significant advantage of the filename option was simplicity. With
> > that solution, the users could easily determine the correct subdirectory
> > themselves. However, its significant disadvantage was very uneven
> > shuffling of data. In particular, the TeΧ Live packages alone count
> > almost 23500 distfiles and all use a common prefix, making it impossible
> > to split them further.
>
> Is the filename the original upstream one or the renamed one? E.g. with
> SRC_URI="http://foo/foo.tar -> bar.tar", will it be bar.tar?
Renamed one. This is what distfiles use already. Otherwise we'd have
a lot of collisions on files named 'v1.2.3.tar.gz'.
> I think I'm in favour of using the initial part of the filename anyway.
> Sure, it's not balanced, but it's still a hell of a lot more balanced than
> today, and it's really easy.
'More balanced' does not mean it solves the problem. If you have one
directory with ~25000 files, and others between almost empty and 4000,
then you still have a huge problem and a lot of silly reorganization
that looks like a 'good idea that misfired'.
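The imbalance argument can be checked with a small sketch like the one below. The hash function (SHA-256), the synthetic file names, and the helper names are assumptions for illustration only; the real figures are the ones in the plots referenced by the GLEP draft.

```python
import hashlib
from collections import Counter


def by_first_char(name):
    """Bucket key for the plain-filename split."""
    return name[0]


def by_name_hash(name):
    """Bucket key for the filename-hash split (first hex digit, assumed SHA-256)."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()[0]


def bucket_sizes(names, key):
    """Count how many files land in each bucket under the given key function."""
    return Counter(key(n) for n in names)


def imbalance(sizes):
    """Largest bucket divided by the mean size of the non-empty buckets.

    A value of 1.0 means a perfectly even split.
    """
    mean = sum(sizes.values()) / len(sizes)
    return max(sizes.values()) / mean


# Synthetic input: many files sharing one prefix, plus a few others.
names = ["texlive-module-%d.tar.xz" % i for i in range(1000)]
names += ["a.tar", "b.tar"]

flat = bucket_sizes(names, by_first_char)
hashed = bucket_sizes(names, by_name_hash)
print("first char:", imbalance(flat))
print("name hash: ", imbalance(hashed))
```

With a shared prefix, nearly every file falls into the same first-character bucket, while the hashed split spreads the same names across up to 16 buckets.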
> Another thing I'm wondering is whether we can just use the same dir layout
> as the packages themselves. That would fix texlive, since it has a whole
> lot of separate packages, e.g. /usr/portage/distfiles/app-cat/pkg/pkg-1.0.tgz.
Then you're replacing the problem of many files in a single directory
with the problem of a huge number of almost empty directories. In other
words, you replace a performance problem of one kind with a performance
problem of another kind, plus a potential inode problem...
> There is a problem if many packages use the same distfiles (quite
> extensive for SELinux: every single one of the sec-policy/selinux-*
> packages has identical distfiles), so I'm not sure how to deal with it.
...and yes, the problem that we have a lot of distfiles shared between
different packages. Also, frequently those distfiles are actually huge
(think of big upstream tarball being split into N packages in Gentoo,
e.g. Qt).
> This would also make it easy in future to make the sandbox restrict
> access to files outside of that package, if we wanted to do that.
I don't see how that's relevant at all.
> > The alternate option of using file hash has the advantage of having
> > a more balanced split. Furthermore, since hashes are stored
> > in Manifests using them is zero-cost. However, this solution has two
> > significant disadvantages:
> >
> > 1. The hash values are unknown for newly-downloaded distfiles, so
> > ``repoman`` (or an equivalent tool) would have to use a temporary
> > directory before locating the file in appropriate subdirectory.
> >
> > 2. User-provided distfiles (e.g. for fetch-restricted packages) with
> > hash mismatches would be placed in the wrong subdirectory,
> > potentially causing confusing errors.
>
> Not just this, but on principle, I also think you should be able to read
> an ebuild and compute the url to download the file from the mirrors
> without any extra knowledge (especially downloading the distfile).
>
> > Using filename hashes has proven to provide a similar balance
> > to using file hashes. Furthermore, since filenames are known up front
> > this solution does not suffer from either of the listed problems. While
> > hashes need to be computed manually, hashing a short string should not
> > cause any performance problems.
> >
> > .. figure:: glep-0075-extras/by-filename.png
> >
> > Distribution of distfiles by first character of filenames
> >
> > .. figure:: glep-0075-extras/by-csum.png
> >
> > Distribution of distfiles by first hex-digit of checksum
> > (x --- content checksum, + --- filename checksum)
> >
> > .. figure:: glep-0075-extras/by-csum2.png
> >
> > Distribution of distfiles by two first hex-digits of checksum
> > (x --- content checksum, + --- filename checksum)
>
> Do you have an easy way to calculate how big the distfiles are per
> category or cat/pkg? I'd be interested to see.
Easy, no. But should be easy to write a script that does that.
The sources for my stuff are at:
https://github.com/mgorny/manifest-distfile-stats
Except most of it won't be useful for that case since it works
on combined and deduplicated Manifests.
If you want to do that, please also include a graph of total file sizes,
and mark how much of that is duplicated between groups.
> > Backwards Compatibility
> > =======================
> > Mirror compatibility
> > --------------------
> > The mirrored files are propagated to other mirrors as opaque directory
> > structure. Therefore, there are no backwards compatibility concerns
> > on the mirroring side.
> >
> > Backwards compatibility with existing clients is detailed
> > in `migrating mirrors to the hashed structure`_ section. Backwards
> > compatibility with the old clients will be provided by preserving
> > the flat structure during the transitional period.
>
> Even if there was no transition, things wouldn't be terrible, because
> Portage would fall back to just downloading from SRC_URI directly
> if the mirrors fail.
>
>
--
Best regards,
Michał Górny
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
From: Ulrich Mueller @ 2018-01-28 10:14 UTC
To: gentoo-dev
>>>>> On Sat, 27 Jan 2018, Michał Górny wrote:
> This specification currently defines one section: ``[structure]``.
> This section defines one or more repository structure definitions
> using sequential integer keys. The definition keyed as ``0``
> is the most preferred structure. The package manager should use
> the first structure format it recognizes as supported, and ignore any
> it does not recognize. If this section is not present, the package
> manager should behave as if only ``flat`` structure were supported.
It is not at all clear from this how integer keys are ordered. The
paragraph only says that "0" is most preferred, but says nothing about
comparison of other numbers.
For example, if there are keys "-1", "0", and "1" (these are
"sequential integer keys", right?), what is their order of preference?
Ulrich
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
From: Michał Górny @ 2018-01-28 10:16 UTC
To: gentoo-dev
On Sun, 28.01.2018 at 11:14 +0100, Ulrich Mueller wrote:
> > > > > > On Sat, 27 Jan 2018, Michał Górny wrote:
> > This specification currently defines one section: ``[structure]``.
> > This section defines one or more repository structure definitions
> > using sequential integer keys. The definition keyed as ``0``
> > is the most preferred structure. The package manager should use
> > the first structure format it recognizes as supported, and ignore any
> > it does not recognize. If this section is not present, the package
> > manager should behave as if only ``flat`` structure were supported.
>
> It is not at all clear from this how integer keys are ordered. The
> paragraph only says that "0" is most preferred, but says nothing about
> comparison of other numbers.
>
> For example, if there are keys "-1", "0", and "1" (these are
> "sequential integer keys", right?), what is their order of preference?
>
Please suggest a better wording. The idea was to use 0=, 1=, 2=...
--
Best regards,
Michał Górny
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
From: Ulrich Mueller @ 2018-01-28 10:22 UTC
>>>>> On Sun, 28 Jan 2018, Michał Górny wrote:
>> > This specification currently defines one section: ``[structure]``.
>> > This section defines one or more repository structure definitions
>> > using sequential integer keys. The definition keyed as ``0``
>> > is the most preferred structure. The package manager should use
>> > the first structure format it recognizes as supported, and ignore any
>> > it does not recognize. If this section is not present, the package
>> > manager should behave as if only ``flat`` structure were supported.
>>
>> It is not at all clear from this how integer keys are ordered. The
>> paragraph only says that "0" is most preferred, but says nothing about
>> comparison of other numbers.
>>
>> For example, if there are keys "-1", "0", and "1" (these are
>> "sequential integer keys", right?), what is their order of preference?
> Please suggest a better wording. The idea was to use 0=, 1=, 2=...
"... using non-negative integer keys. The definition with the
smallest key is the most preferred structure. The package manager
should ignore any formats it does not recognize."
Ulrich
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
From: Michał Górny @ 2018-01-28 10:40 UTC
To: gentoo-dev
On Sun, 28.01.2018 at 11:22 +0100, Ulrich Mueller wrote:
> > > > > > On Sun, 28 Jan 2018, Michał Górny wrote:
> > > > This specification currently defines one section: ``[structure]``.
> > > > This section defines one or more repository structure definitions
> > > > using sequential integer keys. The definition keyed as ``0``
> > > > is the most preferred structure. The package manager should use
> > > > the first structure format it recognizes as supported, and ignore any
> > > > it does not recognize. If this section is not present, the package
> > > > manager should behave as if only ``flat`` structure were supported.
> > >
> > > It is not at all clear from this how integer keys are ordered. The
> > > paragraph only says that "0" is most preferred, but says nothing about
> > > comparison of other numbers.
> > >
> > > For example, if there are keys "-1", "0", and "1" (these are
> > > "sequential integer keys", right?), what is their order of preference?
> > Please suggest a better wording. The idea was to use 0=, 1=, 2=...
>
> "... using non-negative integer keys. The definition with the
> smallest key is the most preferred structure. The package manager
> should ignore any formats it does not recognize."
>
> Ulrich
How about this then:
| This specification currently defines one section: ``[structure]``.
| This section defines one or more repository structure definitions
| using non-negative sequential integer keys. The definition with
| the ``0`` key is the most preferred structure. The package manager
| should ignore any formats it does not recognize. If this section
| is not present, the package manager should behave as if only ``flat``
| structure were specified.
I don't want people to skip numbers, and I want to avoid confusion
between 0/1 as initial number.
--
Best regards,
Michał Górny
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
From: Ulrich Mueller @ 2018-01-28 13:03 UTC
To: gentoo-dev
>>>>> On Sun, 28 Jan 2018, Michał Górny wrote:
> How about this then:
> | This specification currently defines one section: ``[structure]``.
> | This section defines one or more repository structure definitions
> | using non-negative sequential integer keys. The definition with
> | the ``0`` key is the most preferred structure. The package manager
> | should ignore any formats it does not recognize. If this section
> | is not present, the package manager should behave as if only ``flat``
> | structure were specified.
> I don't want people to skip numbers, and I want to avoid confusion
> between 0/1 as initial number.
I believe "sequential" is still somewhat ambiguous, and I wouldn't
split "non-negative integer" which is a technical term.
How about "consecutive non-negative integer keys"?
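To make the proposed wording concrete, here is a minimal sketch of how a client might pick a structure from such a file. It is not taken from the draft: the function name `pick_structure`, the `SUPPORTED` tuple, and the format name `filename-hash` are hypothetical (only `flat` appears in the quoted text), and rejecting skipped numbers is one possible reading of "consecutive".

```python
import configparser

# Formats this hypothetical client understands.
SUPPORTED = ("flat",)


def pick_structure(text, supported=SUPPORTED):
    """Return the most preferred supported structure, or None if none match."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section("structure"):
        # No section: behave as if only the flat structure were specified.
        return "flat"

    entries = {}
    for key, value in parser["structure"].items():
        # Accept only canonical non-negative integers ("0", "1", ...).
        if not key.isdigit() or str(int(key)) != key:
            raise ValueError("invalid key: %r" % key)
        entries[int(key)] = value

    keys = sorted(entries)
    if keys != list(range(len(keys))):
        raise ValueError("keys must be consecutive integers starting at 0")

    # The smallest key is the most preferred; skip unrecognized formats.
    for key in keys:
        if entries[key] in supported:
            return entries[key]
    return None
```

For example, with `0=filename-hash` and `1=flat`, a client that only knows `flat` skips entry 0 and uses entry 1, while a newer client takes entry 0.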
Ulrich
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
From: Andrew Barchuk @ 2018-01-28 20:43 UTC
To: gentoo-dev
[my apologies for posting the message to a wrong thread before]
Hi everyone,
> three possible solutions for splitting distfiles were listed:
>
> a. using initial portion of filename,
>
> b. using initial portion of file hash,
>
> c. using initial portion of filename hash.
>
> The significant advantage of the filename option was simplicity. With
> that solution, the users could easily determine the correct subdirectory
> themselves. However, its significant disadvantage was very uneven
> shuffling of data. In particular, the TeΧ Live packages alone count
> almost 23500 distfiles and all use a common prefix, making it impossible
> to split them further.
>
> The alternate option of using file hash has the advantage of having
> a more balanced split.
There's another option to use character ranges for each directory
computed in a way to have the files distributed evenly. One way to do
that is to use filename prefix of dynamic length so that each range
holds the same number of files. E.g. we would have Ab/, Ap/, Ar/ but
texlive-module-te/, texlive-module-th/, texlive-module-ti/. A similar
but simpler option is to use file names as range bounds (the same way
dictionaries use words to demarcate page bounds): each directory will
have a name of the first file located inside. This way files will be
distributed evenly and it's still easy to pick a correct directory where
a file will be located manually.
I have implemented a sketch of distfiles splitting that's using file
names as bounds in Python to demonstrate the idea (excuse possibly
non-idiomatic code, I'm not very versed in Python):
$ cat distfile-dirs.py
#!/usr/bin/env python3
import sys
"""
Builds list of dictionary directories to split the list of input files
into evenly. Each directory has name of the first file that is located
in the directory. Takes number of directories as an argument and reads
list of files from stdin. The resulting list of directories is printed
to stdout.
"""
dir_num = int(sys.argv[1])
distfiles = sys.stdin.read().splitlines()
distfile_num = len(distfiles)
dir_size = distfile_num / dir_num
# allows adding files in the beginning without repartitioning
dirs = ["0"]
next_dir = dir_size
while next_dir < distfile_num:
    dirs.append(distfiles[round(next_dir)])
    next_dir += dir_size
print("/\n".join(dirs) + "/")
$ cat pick-distfiles-dir.py
#!/usr/bin/env python3
"""
Picks the directory for a given file name. Takes a distfile name as an
argument. Reads sorted list of directories from stdin, name of each
directory is assumed to be the name of first file that's located inside.
"""
import sys
distfile = sys.argv[1]
dirs = sys.stdin.read().splitlines()
left = 0
right = len(dirs) - 1
while left < right:
    pivot = round((left + right) / 2)
    if (dirs[pivot] <= distfile):
        left = pivot + 1
    else:
        right = pivot - 1
if distfile < dirs[right]:
    print(dirs[right-1])
else:
    print(dirs[right])
$ # distfiles.txt contains all the distfile names
$ head -n5 distfiles.txt
0CD9CDDE3F56BB5250D87C54592F04CBC24F03BF-wagon-provider-api-2.10.jar
0CE1EDB914C94EBC388F086C6827E8BDEEC71AC2-commons-lang-2.6.jar
0DCC973606CBD9737541AA5F3E76DED6E3F4D0D0-iri.jar
0ad-0.0.22-alpha-unix-build.tar.xz
0ad-0.0.22-alpha-unix-data.tar.xz
$ # calculate 500 directories to split distfiles into evenly
$ cat distfiles.txt | ./distfile-dirs.py 500 > dirs.txt
$ tail -n5 dirs.txt
xrmap-2.29.tar.bz2/
xview-3.2p1.4-18c.tar.gz/
yasat-700.tar.gz/
yubikey-manager-qt-0.4.0.tar.gz/
zimg-2.5.1.tar.gz/
$ # pick a directory for xvinfo-1.0.1.tar.bz2
$ cat dirs.txt | ./pick-distfiles-dir.py xvinfo-1.0.1.tar.bz2
xview-3.2p1.4-18c.tar.gz/
Using the approach above the files will be distributed evenly among the
directories keeping the possibility to determine the directory for a
specific file by hand. It's possible if necessary to keep the directory
structure unchanged for very long time and it will likely stay
well-balanced. Picking a directory for a file is very cheap. The only
obvious downside I see is that it's necessary to know list of
directories to pick the correct one (can be mitigated by caching the
list of directories if important). If it's desirable to make directory
names shorter or to look less like file names it's fairly easy to
achieve by keeping only unique prefixes of directories. For example:
xrmap-2.29.tar.bz2/
xview-3.2p1.4-18c.tar.gz/
yasat-700.tar.gz/
yubikey-manager-qt-0.4.0.tar.gz/
zimg-2.5.1.tar.gz/
will become
xr/
xv/
ya/
yu/
z/
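The shortening step can be sketched as follows. This is only an illustration, not part of the scripts above: the helper names are made up, and it assumes the bounds are already sorted. Note that any file sorting between a shortened bound and the original full name would move into that directory, so the split should be computed on the shortened names.

```python
import os.path


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    return len(os.path.commonprefix([a, b]))


def shorten_bounds(bounds):
    """Cut each sorted bound down to the shortest prefix that still
    differs from both of its neighbours in the list."""
    short = []
    for i, name in enumerate(bounds):
        keep = 1
        if i > 0:
            keep = max(keep, common_prefix_len(name, bounds[i - 1]) + 1)
        if i + 1 < len(bounds):
            keep = max(keep, common_prefix_len(name, bounds[i + 1]) + 1)
        short.append(name[:keep])
    return short


bounds = [
    "xrmap-2.29.tar.bz2",
    "xview-3.2p1.4-18c.tar.gz",
    "yasat-700.tar.gz",
    "yubikey-manager-qt-0.4.0.tar.gz",
    "zimg-2.5.1.tar.gz",
]
print("/\n".join(shorten_bounds(bounds)) + "/")
```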
Thanks for taking time to consider the suggestion.
---
Andrew
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-28 20:43 ` Andrew Barchuk
@ 2018-01-28 21:17 ` Gordon Pettey
2018-01-28 22:00 ` Andrew Barchuk
2018-01-29 5:36 ` Michał Górny
1 sibling, 1 reply; 44+ messages in thread
From: Gordon Pettey @ 2018-01-28 21:17 UTC (permalink / raw
To: gentoo-dev
On Sun, Jan 28, 2018 at 2:43 PM, Andrew Barchuk <andrew@raindev.io> wrote:
> There's another option to use character ranges for each directory
> computed in a way to have the files distributed evenly. One way to do
> that is to use filename prefix of dynamic length so that each range
> holds the same number of files. E.g. we would have Ab/, Ap/, Ar/ but
> texlive-module-te/, texlive-module-th/, texlive-module-ti/. A similar
> but simpler option is to use file names as range bounds (the same way
> dictionaries use words to demarcate page bounds): each directory will
> have a name of the first file located inside. This way files will be
> distributed evenly and it's still easy to pick a correct directory where
> a file will be located manually.
>
> ...snip...
>
> Using the approach above the files will be distributed evenly among the
> directories keeping the possibility to determine the directory for a
> specific file by hand. It's possible if necessary to keep the directory
> structure unchanged for very long time and it will likely stay
> well-balanced. Picking a directory for a file is very cheap. The only
> obvious downside I see is that it's necessary to know list of
> directories to pick the correct one (can be mitigated by caching the
> list of directories if important). If it's desirable to make directory
> names shorter or to look less like file names it's fairly easy to
> achieve by keeping only unique prefixes of directories. For example:
To the contrary, that would not remain balanced, because your
boundaries are entirely dependent on exactly what is in the tree at
the moment you run your script. Now the package manager has to perform
directory listing, sort and find the file name that's closest, open
that directory, find the next closest filename (assuming multiple
levels of hierarchy), and so on, or you have to store yet another
index that duplicates information and takes additional space. Locating
by distfile name hash is effectively free.
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-28 21:17 ` Gordon Pettey
@ 2018-01-28 22:00 ` Andrew Barchuk
2018-01-28 22:13 ` Gordon Pettey
2018-01-28 22:14 ` Zac Medico
0 siblings, 2 replies; 44+ messages in thread
From: Andrew Barchuk @ 2018-01-28 22:00 UTC (permalink / raw
To: gentoo-dev
> To the contrary, that would not remain balanced, because your
> boundaries are entirely dependent on exactly what is in the tree at
> the moment you run your script. Now the package manager has to perform
> directory listing, sort and find the file name that's closest, open
> that directory, find the next closest filename (assuming multiple
> levels of hierarchy), and so on, or you have to store yet another
> index that duplicates information and takes additional space. Locating
> by distfile name hash is effectively free.
Sure, the tree won't be perfectly balanced but it will be pretty close.
E.g. if texlive-* dominates the tree today it will likely continue
dominating it for another 5 years. Statistical distribution of distfile
names will likely be changing very slowly.
Doing a binary search through a list of couple of hundred of directories
is really cheap. I don't see a reason to organize distfiles in a
multi-level hierarchy: e.g. if the goal is to keep no more than 1000
files in a folder then the limit of single level hierarchy is a million
which is more than enough for foreseeable future. The list of 500
directories takes 15kB when using full file names and will be couple of
times smaller when using only unique prefixes.
---
Andrew
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-28 22:00 ` Andrew Barchuk
@ 2018-01-28 22:13 ` Gordon Pettey
2018-01-28 22:14 ` Zac Medico
1 sibling, 0 replies; 44+ messages in thread
From: Gordon Pettey @ 2018-01-28 22:13 UTC (permalink / raw
To: gentoo-dev
On Sun, Jan 28, 2018 at 4:00 PM, Andrew Barchuk <andrew@raindev.io> wrote:
> I don't see a reason to organize distfiles in a
> multi-level hierarchy: e.g. if the goal is to keep no more than 1000
> files in a folder then the limit of single level hierarchy is a million
> which is more than enough for foreseeable future. The list of 500
> directories takes 15kB when using full file names and will be couple of
> times smaller when using only unique prefixes.
Then don't. Using one level of first-4-characters-of-filename-hash is
still more efficient.
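A minimal sketch of that lookup, assuming SHA512 as the hash and UTF-8 encoded filenames (neither is fixed by anything in this thread; `hash_dir` is a made-up name). Four hex characters allow for up to 65536 directories.

```python
import hashlib


def hash_dir(filename, digits=4):
    """Directory name: the first `digits` hex characters of the hash of
    the distfile *name* (not of the file contents)."""
    digest = hashlib.sha512(filename.encode("utf-8")).hexdigest()
    return digest[:digits]


name = "xvinfo-1.0.1.tar.bz2"
print(hash_dir(name) + "/" + name)
```

No directory listing or stored index is needed: the name alone determines the location.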
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-28 22:00 ` Andrew Barchuk
2018-01-28 22:13 ` Gordon Pettey
@ 2018-01-28 22:14 ` Zac Medico
2018-01-28 22:46 ` Andrew Barchuk
1 sibling, 1 reply; 44+ messages in thread
From: Zac Medico @ 2018-01-28 22:14 UTC (permalink / raw
To: gentoo-dev, Andrew Barchuk
On 01/28/2018 02:00 PM, Andrew Barchuk wrote:
>> To the contrary, that would not remain balanced, because your
>> boundaries are entirely dependent on exactly what is in the tree at
>> the moment you run your script. Now the package manager has to perform
>> directory listing, sort and find the file name that's closest, open
>> that directory, find the next closest filename (assuming multiple
>> levels of hierarchy), and so on, or you have to store yet another
>> index that duplicates information and takes additional space. Locating
>> by distfile name hash is effectively free.
>
> Sure, the tree won't be perfectly balanced but it will be pretty close.
> E.g. if texlive-* dominates the tree today it will likely continue
> dominating it for another 5 years. Statistical distribution of distfile
> names will likely be changing very slowly.
>
> Doing a binary search through a list of couple of hundred of directories
> is really cheap. I don't see a reason to organize distfiles in a
> multi-level hierarchy: e.g. if the goal is to keep no more than 1000
> files in a folder then the limit of single level hierarchy is a million
> which is more than enough for foreseeable future. The list of 500
> directories takes 15kB when using full file names and will be couple of
> times smaller when using only unique prefixes.
In order to use that for distfiles mirrors, such that clients could know
where to fetch the files from, you'd need the mirror's http server to
redirect the request to the appropriate location (since the location
would not be predictable from the client side).
--
Thanks,
Zac
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-28 22:14 ` Zac Medico
@ 2018-01-28 22:46 ` Andrew Barchuk
0 siblings, 0 replies; 44+ messages in thread
From: Andrew Barchuk @ 2018-01-28 22:46 UTC (permalink / raw
To: Zac Medico, gentoo-dev
> In order to use that for distfiles mirrors, such that clients could know
> where to fetch the files from, you'd need the mirror's http server to
> redirect the request to the appropriate location (since the location
> would not be predictable from the client side).
That's not necessary: the client can fetch listing of the top level distfiles/
which now would be hundreds of directories, not tens of thousands
of files. That would allow the client to calculate the correct directory to
fetch a distfile from locally. List of directories can be cached until an
attempt to fetch a distfile will result in a miss (meaning that distfiles were
redistributed).
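Roughly, the client-side logic could look like the sketch below. Everything here is hypothetical: `fetch_listing` and `exists` stand in for whatever the client uses to list the top-level directory and to probe for a file, and directory names are assumed to be stored without the trailing slash.

```python
from bisect import bisect_right


def pick_dir(dirs, distfile):
    """Pick the last directory whose name sorts at or before distfile."""
    index = bisect_right(dirs, distfile) - 1
    return dirs[max(index, 0)]


class DirListingCache:
    """Caches the directory listing and refreshes it once after a miss."""

    def __init__(self, fetch_listing):
        self._fetch_listing = fetch_listing  # callable returning dir names
        self._dirs = None

    def locate(self, distfile, exists):
        # First try the cached listing; on a miss, refetch and retry once.
        for _attempt in range(2):
            if self._dirs is None:
                self._dirs = sorted(self._fetch_listing())
            candidate = pick_dir(self._dirs, distfile)
            if exists(candidate, distfile):
                return candidate
            self._dirs = None  # listing may be stale, drop it
        return None
```

A miss costs one extra listing fetch, so a genuinely missing distfile is the worst case.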
--
Andrew
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-28 20:43 ` Andrew Barchuk
2018-01-28 21:17 ` Gordon Pettey
@ 2018-01-29 5:36 ` Michał Górny
2018-01-29 9:22 ` Andrew Barchuk
1 sibling, 1 reply; 44+ messages in thread
From: Michał Górny @ 2018-01-29 5:36 UTC (permalink / raw
To: gentoo-dev
On Sun, 2018-01-28 at 21:43 +0100, Andrew Barchuk wrote:
> [my apologies for posting the message to a wrong thread before]
>
> Hi everyone,
>
> > three possible solutions for splitting distfiles were listed:
> >
> > a. using initial portion of filename,
> >
> > b. using initial portion of file hash,
> >
> > c. using initial portion of filename hash.
> >
> > The significant advantage of the filename option was simplicity. With
> > that solution, the users could easily determine the correct subdirectory
> > themselves. However, its significant disadvantage was very uneven
> > shuffling of data. In particular, the TeΧ Live packages alone count
> > almost 23500 distfiles and all use a common prefix, making it impossible
> > to split them further.
> >
> > The alternate option of using file hash has the advantage of having
> > a more balanced split.
>
>
> There's another option to use character ranges for each directory
> computed in a way to have the files distributed evenly. One way to do
> that is to use filename prefix of dynamic length so that each range
> holds the same number of files. E.g. we would have Ab/, Ap/, Ar/ but
> texlive-module-te/, texlive-module-th/, texlive-module-ti/. A similar
> but simpler option is to use file names as range bounds (the same way
> dictionaries use words to demarcate page bounds): each directory will
> have a name of the first file located inside. This way files will be
> distributed evenly and it's still easy to pick a correct directory where
> a file will be located manually.
What you're talking about is pretty much an adaptive algorithm. It may
look like a good idea at first but it's really hard to predict how it'll work
in the future because you can't really predict what will happen to
distfiles in the future.
A few major events that could result in it going completely off:
a. we stop using split texlive packages and distribute a few big
tarballs instead,
b. texlive packages are renamed to use date before subpackage name,
c. someone adds another big package set.
That said, you don't need a big event for that. Many small events may
(or may not) cause it to gradually go off. Whenever that happens, we
would have to have a contingency plan -- and I don't really like
the idea of having to reshuffle all the mirrors all of a sudden.
I think the cryptographic hash algorithms are a better choice. They may
not be perfect but they can cope with a lot of very different data
by design. Yes, we could technically accidentally hit a data set that is
completely uneven. But it is rather unlikely, compared to home-made
algorithms.
--
Best regards,
Michał Górny
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-28 7:01 ` Jason Zaman
2018-01-28 9:10 ` Michał Górny
@ 2018-01-29 7:33 ` Robin H. Johnson
1 sibling, 0 replies; 44+ messages in thread
From: Robin H. Johnson @ 2018-01-29 7:33 UTC (permalink / raw
To: gentoo-dev
On Sun, Jan 28, 2018 at 03:01:11PM +0800, Jason Zaman wrote:
> Another thing im wondering is if we can just use the same dir layout as
> the packages themselves. that would fix texlive since it has a whole lot
> of separate packages. eg /usr/portage/distfiles/app-cat/pkg/pkg-1.0.tgz
Texlive is worse than that:
dev-texlive/texlive-latexextra/Manifest contains 8556 DIST entries, ALL
starting with 'texlive-module-'.
> there is a problem if many packages use the same distfiles (quite
> extensive for SELinux, every single of the sec-policy/selinux-* packages
> has identical distfiles) so im not sure how to deal with it.
The new MetaManifest proposed that common distfiles could be moved to
the category level Manifest (but needs a long transition period).
> do you have an easy way to calculate how big the distfiles are per
> category or cat/pkg? i'd be interested to see.
very quick awk:
per-package, no-dedupe:
gawk '/^DIST/{f=gensub("/Manifest","",1,FILENAME); sum[f]+=$3}END{for(f in sum){print f,sum[f]}}' */*/Manifest
(games-board/tablebase-syzygy is NOT a typo, it has ~150GiB of distfiles)
per-category, no-dedupe:
gawk '/^DIST/{f=gensub("/[^/]+/Manifest","",1,FILENAME); sum[f]+=$3}END{for(f in sum){print f,sum[f]}}' */*/Manifest |sort -k +2n
--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail : robbat2@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-29 5:36 ` Michał Górny
@ 2018-01-29 9:22 ` Andrew Barchuk
0 siblings, 0 replies; 44+ messages in thread
From: Andrew Barchuk @ 2018-01-29 9:22 UTC (permalink / raw
To: gentoo-dev
> I don't really like the idea of having to reshuffle all the mirrors all
> of a sudden.
If keeping invariable directory structure is the goal (even though it
should be possible to shuffle files among the directories without having
to retransfer the whole tree with hard links) in addition to be able to
determine the target directory without knowledge of directory layout as
Robin mentioned in another thread[1]:
> As for the problem you describe, one of the requirements in the
> discussion is that given ONLY the file or filename, and NOTHING ELSE, it
> should be possible to determine where in a hierarchy it should go. No
> prior knowledge about the hierarchy was permitted.
then hashing is indeed a better option.
1. https://archives.gentoo.org/gentoo-dev/message/1d69d65b47a18fdbe36f5edac58c7591
---
Andrew
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure (draft v2)
2018-01-26 23:24 [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure Michał Górny
` (3 preceding siblings ...)
2018-01-28 20:43 ` Andrew Barchuk
@ 2018-01-29 19:37 ` Michał Górny
2018-01-29 20:00 ` Robin H. Johnson
2018-01-29 20:26 ` R0b0t1
4 siblings, 2 replies; 44+ messages in thread
From: Michał Górny @ 2018-01-29 19:37 UTC (permalink / raw
To: gentoo-dev
Here's an updated version. I've tried to incorporate most
of the feedback so far.
---
GLEP: 75
Title: Split distfile mirror directory structure
Author: Michał Górny <mgorny@gentoo.org>,
Robin H. Johnson <robbat2@gentoo.org>
Type: Standards Track
Status: Draft
Version: 1
Created: 2018-01-26
Last-Modified: 2018-01-27
Post-History: 2018-01-27
Content-Type: text/x-rst
---
Abstract
========
This GLEP describes the procedure for splitting the distfiles on mirrors
into multiple directories with the goal of reducing the number of files
in a single directory.
Motivation
==========
At the moment, both the package manager and Gentoo mirrors use flat
directory structure to store files. While this solution usually works,
it does not scale well. Directories with large number of files usually
have significant performance penalty, unless using filesystems
specifically designed for that purpose.
According to the Gentoo repository state at 2018-01-26 16:23, there
was a total of 62652 unique distfiles in the repository. While
the users realistically hit around 10% of that, distfile mirrors often
hold even more files --- more so if old distfiles are not wiped
immediately.
While all filesystems used on Linux boxes should be able to cope with
a number that large, they may suffer a performance penalty with even
a few thousand files. Additionally, if mirrors enable directory indexes
then generating the index imposes both a significant server overhead
and a significant data transfer. At this moment, the index
of distfiles.gentoo.org has around 17 MiB.
Splitting the distfiles into multiple directories makes it possible
to avoid those problems by reducing the number of files in a single
directory. For example, splitting the aforementioned set of distfiles
into 16 directories that are roughly balanced reduces
the number of files in a single directory to around 4000. Splitting
them further into 256 directories (16x16) results in 200-300 files
per directory which should avoid any performance problems long-term,
even assuming 300% growth of number of distfiles.
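
The figures above can be checked with a few lines of arithmetic. The sketch below assumes a perfectly even split and reads "300% growth" as the total number of distfiles becoming four times larger; ``files_per_directory`` is an illustrative name only.

```python
def files_per_directory(total_files, directories):
    """Average number of files per directory for an even split."""
    return total_files / directories


TOTAL = 62652  # unique distfiles on 2018-01-26, as quoted above

for directories in (16, 256):
    now = files_per_directory(TOTAL, directories)
    grown = files_per_directory(TOTAL * 4, directories)
    print(f"{directories:>3} dirs: {now:7.1f} now, {grown:8.1f} after growth")
```

With 256 directories this gives roughly 245 files per directory today and just under 1000 after such growth.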
Specification
=============
Mirror layout file
------------------
A mirror adhering to this specification should include a ``layout.conf``
file in the top distfile directory. This file uses the format
derived from the freedesktop Desktop Entry Specification file format
[#DESKTOP_FORMAT]_.
Before using each Gentoo mirror, the package manager should attempt
to fetch (update) its ``layout.conf`` file and process it to determine
how to use the mirror. If the file is not present, the package manager
should behave as if it were empty.
The package manager should recognize the sections and keys listed below.
It should ignore any unrecognized sections or keys --- the format
is intended to account for future extensions.
This specification currently defines one section: ``[structure]``.
This section defines one or more repository structure definitions
using non-negative sequential integer keys. The definition with
the ``0`` key is the most preferred structure. The package manager
should ignore any formats it does not recognize. If this section
is not present, the package manager should behave as if only ``flat``
structure were specified.
The following structure definitions are supported:
* ``flat`` to indicate the traditional flat structure where all
distfiles are located in the top directory,
* ``filename-hash <algorithm> <cutoffs>`` to indicate the `filename
hash structure`_ explained below.
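
A minimal sketch of how a client might read the ``[structure]`` section, using Python's ``configparser`` as a stand-in for a Desktop Entry parser. The sample file contents, the ``parse_layout`` name and the ``some-future-format`` entry are assumptions made for illustration; only the section name, the integer keys and the two structure names come from the text above.

```python
import configparser

KNOWN_STRUCTURES = {"flat", "filename-hash"}

SAMPLE_LAYOUT = """\
[structure]
0=filename-hash SHA512 8
1=some-future-format x
2=flat
"""


def parse_layout(text):
    """Return recognized structure definitions, most preferred first."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep keys exactly as written
    parser.read_string(text)
    if not parser.has_section("structure"):
        # Missing section (or empty file): behave as if only flat was listed.
        return [["flat"]]
    structures = []
    index = 0
    while parser.has_option("structure", str(index)):
        fields = parser.get("structure", str(index)).split()
        if fields and fields[0] in KNOWN_STRUCTURES:
            structures.append(fields)  # unrecognized formats are skipped
        index += 1
    return structures


print(parse_layout(SAMPLE_LAYOUT))
print(parse_layout(""))
```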
Filename hash structure
-----------------------
When using the filename hash structure, the distfiles are split
into directories whose names are derived from the hash of distfile
filename. This structure has two parameters: *algorithm name*
and *cutoffs* list.
The algorithm name must correspond to a valid Manifest hash name.
An informational list of hashes is included in GLEP 74 [#GLEP74]_,
and the policies for introducing new hashes are covered by GLEP 59
[#GLEP59]_.
The cutoffs list specifies one or more integers separated by colons
(``:``), indicating the number of bits (starting with the most
significant bit) of the hash used to form subsequent subdirectory names.
For example, the list of ``2:4`` would indicate that top-level directory
names are formed using 2 most significant bits of the hash (resulting
in 2² = 4 directories), and each of these directories would have
subdirectories formed using the next 4 bits of the hash (resulting
in 2⁴ = 16 subdirectories each).
The exact algorithm for determining the distfile location follows:
1. Let the distfile filename be **F**.
2. Compute the hash of **F** and store its binary value as **H**.
3. For each integer **C** in cutoff list:
a. Take **C** most significant bits of **H** and store them as **V**.
b. Convert **V** into hexadecimal integer, left padded with zeros
to **C/4** digits (rounded up) and append it to the path, followed
by the path separator.
c. Shift **H** left **C** bits.
4. Finally, append **F** to the obtained path.
In particular, note that when using nested directories
the subdirectories do not repeat the hash bits used in parent directory.
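
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than a reference implementation: SHA512 is picked arbitrarily as the algorithm, the filename is assumed to be hashed as UTF-8 bytes, ``/`` is used as the path separator, and ``distfile_path`` is a made-up name. Instead of shifting **H**, the sketch tracks how many bits were already consumed, which is equivalent.

```python
import hashlib


def distfile_path(filename, algorithm="sha512", cutoffs=(4, 4)):
    """Map a distfile name to its location in the filename-hash structure."""
    digest = hashlib.new(algorithm, filename.encode("utf-8")).digest()
    total_bits = len(digest) * 8
    value = int.from_bytes(digest, "big")  # H as one big integer
    parts = []
    used = 0  # bits already consumed by parent directories
    for cutoff in cutoffs:
        # Take the next `cutoff` most significant unused bits (V).
        v = (value >> (total_bits - used - cutoff)) & ((1 << cutoff) - 1)
        width = -(-cutoff // 4)  # C/4 hex digits, rounded up
        parts.append(f"{v:0{width}x}")
        used += cutoff
    parts.append(filename)
    return "/".join(parts)


print(distfile_path("foo-1.0.tar.gz", cutoffs=(2, 4)))
print(distfile_path("foo-1.0.tar.gz", cutoffs=(8,)))
```

With cutoffs that are multiples of 4, the directory names are simply consecutive hex digits of the usual hexadecimal digest.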
Migrating mirrors to the hashed structure
-----------------------------------------
Since all distfile mirrors sync to the master Gentoo mirror, it should
be enough to perform all the needed changes on the master mirror
and wait for other mirrors to sync. The following procedure
is recommended:
1. Include the initial ``layout.conf`` listing only ``flat`` layout.
2. Create the new structure alongside the flat structure. Wait for
mirrors to sync.
3. Once all mirrors receive the new structure, update ``layout.conf``
to list the ``filename-hash`` structure.
4. Once a version of Portage supporting the new structure is stable long
enough, remove the fallback ``flat`` structure from ``layout.conf``
and duplicate distfiles.
This implies that during the migration period the distfiles will
be stored duplicated on the mirrors and therefore will occupy twice
as much space. Technically, this could be avoided by using either
hard links or symbolic links.
The hard link solution allows us to save space on the master mirror.
Additionally, if ``-H`` option is used by the mirrors it avoids
transferring existing files again. However, this option is known
to be expensive and could cause significant server load. Without it,
all mirrors need to transfer a second copy of all the existing files.
The symbolic link solution could be more reliable if we could rely
on mirrors using the ``--links`` rsync option. Without that, symbolic
links are not transferred at all.
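
As an illustration of the hard link variant, the sketch below populates the hashed structure alongside the flat one on a single filesystem. It assumes SHA512 with a single 8-bit cutoff (two hex digits); the function names are hypothetical and this is not the actual mirror tooling.

```python
import hashlib
import os


def hashed_subdir(filename):
    """Two hex digits of the filename hash, i.e. a single cutoff of 8 bits."""
    return hashlib.sha512(filename.encode("utf-8")).hexdigest()[:2]


def link_into_hashed_tree(distdir):
    """Hard-link every flat distfile into its hashed subdirectory."""
    for name in sorted(os.listdir(distdir)):
        source = os.path.join(distdir, name)
        # Skip subdirectories and the layout file, which stays at the top.
        if not os.path.isfile(source) or name == "layout.conf":
            continue
        target_dir = os.path.join(distdir, hashed_subdir(name))
        os.makedirs(target_dir, exist_ok=True)
        target = os.path.join(target_dir, name)
        if not os.path.exists(target):
            os.link(source, target)  # same inode, no extra space used
```

Running it repeatedly is harmless, since files that are already linked are skipped.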
Using hashed structure for local distfiles
------------------------------------------
The hashed structure defined above could also be used for local distfile
storage as used by the package manager. For this to work, the package
manager authors need to ensure that:
a. The ``${DISTDIR}`` variable in the ebuild scope points to a temporary
directory where distfiles specific to the package are linked
in a flat structure.
b. All tools are updated to support the nested structure.
c. The package manager provides a tool for users to easily manipulate
distfiles, in particular to add distfiles for fetch-restricted
packages into an appropriate subdirectory.
For extended compatibility, the package manager may support finding
distfiles in flat and nested structure simultaneously.
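
A lookup supporting both layouts could be as simple as the following sketch, which again assumes SHA512 with a single 8-bit cutoff and uses a hypothetical ``find_distfile`` helper. The nested location is tried first and the flat one serves as a fallback.

```python
import hashlib
import os


def find_distfile(distdir, filename):
    """Return the path of a local distfile, or None if it is not present."""
    subdir = hashlib.sha512(filename.encode("utf-8")).hexdigest()[:2]
    candidates = (
        os.path.join(distdir, subdir, filename),  # nested structure
        os.path.join(distdir, filename),          # traditional flat structure
    )
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None
```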
Rationale
=========
Algorithm for splitting distfiles
---------------------------------
The possible algorithms were considered with the following goals
in mind:
- the number of files in a single directory should not exceed 1000,
- the total size of files in a single directory is not considered
relevant,
- the solution should preferably be future-proof,
- moving distfiles should be avoided once it is deployed.
It should also be noted that at the moment the package with the most
distfiles in Gentoo is dev-texlive/texlive-latexextra,
with 8556 distfiles. All of them start with a common
prefix of ``texlive-module-``. This specific prefix is used by a total
of 23435 distfiles.
In the original debate that occurred in bug #534528 [#BUG534528]_
and the mailing list review of the initial version of this GLEP [#ML1]_,
four fundamental ideas for splitting distfiles were listed:
a. using initial portion of filename,
b. using initial portion of file hash,
c. using initial portion of filename hash,
d. using package category (and package name).
The initial filename idea was to use the first character of filename,
possibly followed by a longer part which was the idea historically
used e.g. by PyPI Python package hosting. Its main advantage is
simplicity. The users can easily determine the correct subdirectory
by just looking at the distfile name. Sadly, this solution is not only
very uneven but does not solve the problem. As mentioned above,
the TeΧ Live packages share a long common prefix that makes it impossible
to split them properly using fixed-length prefixes.
This idea has been followed by an adaptive proposal by Andrew Barchuk
[#ADAPTIVE_FILENAME]_. In this proposal, the filenames are not strictly
mapped to groups by a common prefix but instead each group contains
all files between two prefixes being used (like in a dictionary).
However, it has been pointed out that while this option can provide
very even results initially, it is impossible to predict how it would
be affected by future distfile changes and there will be a risk of
needing to change the groups in the future. Furthermore, it is
relatively complex and requires explicitly listing or obtaining used
groups.
Another option was to use an initial portion of distfile hashes. Its
main advantage is that cryptographic hash algorithms can provide
a more balanced split with random data. Furthermore, since hashes are
stored in Manifests using them has no cost for users. However, this
solution has three disadvantages:
1. Not all files in the distfile tree are covered by package Manifests.
Additional files are injected into the mirrors, and those will
not have a clearly-defined location.
2. User-provided distfiles (e.g. for fetch-restricted packages) with
hash mismatches would be placed in the wrong subdirectory,
potentially causing confusing errors.
3. The hash values are unknown for newly-downloaded distfiles, so
``repoman`` (or an equivalent tool) would have to use a temporary
directory before locating the file in appropriate subdirectory.
Using filename hashes has proven to provide a similar balance to using
file hashes. Furthermore, since filenames are known up front this
solution does not suffer from the listed problems. While hashes need
to be computed manually, hashing short strings should not cause
any performance problems.
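
The difference between splitting on filenames and on filename hashes is easy to reproduce on synthetic data. The sketch below does not use the real distfile list: it generates made-up names sharing the ``texlive-module-`` prefix (as many as the 23435 mentioned above) and assumes SHA512 with two hex digits.

```python
import hashlib
from collections import Counter

# Synthetic stand-ins for the TeX Live distfiles, not the real names.
names = [f"texlive-module-pkg{i}-2017.tar.xz" for i in range(23435)]

by_first_char = Counter(name[0] for name in names)
by_name_hash = Counter(
    hashlib.sha512(name.encode("utf-8")).hexdigest()[:2] for name in names
)

for label, buckets in (("first char", by_first_char),
                       ("name hash", by_name_hash)):
    print(f"{label:>10}: {len(buckets)} directories, "
          f"largest holds {max(buckets.values())} files")
```

All synthetic names land in a single first-character bucket, while the hash spreads them over 256 directories of roughly equal size.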
Jason Zaman has suggested using package categories (and package names)
[#PKGNAME]_. However, this solution has multiple problems:
a. it does not solve the problem for large packages such as TeΧ Live,
b. it introduces many unnecessarily small directories,
c. it requires an explicit knowledge of which package distfiles
belong to,
d. it does not provide an explicit solution to the problem of distfiles
shared by multiple packages,
e. it does not provide a solution to the problem of injected distfiles.
All the options considered, the filename hash solution was selected
as one that solves all the aforementioned problems while introducing
relatively low complexity and being reasonably future-proof.
.. figure:: glep-0075-extras/by-filename.png
Distribution of distfiles by first character of filenames
(note: y axis is on log scale)
.. figure:: glep-0075-extras/by-csum.png
Distribution of distfiles by first hex-digit of checksum
(x --- content checksum, + --- filename checksum)
.. figure:: glep-0075-extras/by-csum2.png
Distribution of distfiles by two first hex-digits of checksum
(x --- content checksum, + --- filename checksum)
Layout file
-----------
The presence of control file has been suggested in the original
discussion. Its main purpose is to let package managers cleanly handle
the migration and detect how to correctly query the mirrors throughout
it. Furthermore, it makes future changes easier.
The format lines are specifically meant to hardcode as little about
the actual algorithm as possible. Therefore, we can easily change
the hash used or the exact split structure without having to update
the package managers or even provide a compatibility layout.
The file is also open for future extensions to provide additional mirror
metadata. However, no clear use for that has been determined so far.
Hash algorithm
--------------
The hash algorithm support is fully deferred to the existing code
in the package managers that is required to handle Manifests.
In particular, it is recommended to reuse one of the hashes that are
used in Manifest entries at the time. This avoids code duplication
and reuses an existing mechanism to handle hash upgrades.
During the discussion, it has been pointed out that this particular use case
does not require a cryptographically strong hash and a faster algorithm
could be used instead. However, given the short length of hashed
strings performance is not a problem, and speed does not justify
the resulting code duplication.
It has also been pointed out that e.g. the BLAKE2 hash family provides
the ability of creating arbitrary length hashes instead of truncating
the standard-length hash. However, not all implementations of BLAKE2
support that and relying on it could reduce portability for no apparent
gain.
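
For illustration, the location algorithm from the Specification can be
sketched in Python on top of the standard ``hashlib`` module, truncating
a standard-length digest as described above. The function name
``distfile_path``, its default arguments, the UTF-8 encoding of the filename,
the lowercase hexadecimal output, and the use of ``hashlib`` algorithm names
(rather than Manifest hash names) are assumptions made for this sketch.

```python
import hashlib
import posixpath


def distfile_path(filename, algorithm="sha512", cutoffs=(8,)):
    """Map a distfile name to its relative path in the hashed layout."""
    digest = hashlib.new(algorithm, filename.encode("utf-8")).digest()
    total_bits = len(digest) * 8
    if sum(cutoffs) > total_bits:
        raise ValueError("cutoffs use more bits than the hash provides")

    value = int.from_bytes(digest, "big")
    parts = []
    used = 0
    for c in cutoffs:
        # Take the next c most significant bits that were not used yet;
        # this is equivalent to shifting the hash left after each step.
        shift = total_bits - used - c
        v = (value >> shift) & ((1 << c) - 1)
        width = -(-c // 4)  # ceil(c / 4) hexadecimal digits
        parts.append(format(v, "0{}x".format(width)))
        used += c
    parts.append(filename)
    return posixpath.join(*parts)


if __name__ == "__main__":
    # Hypothetical filename; prints something like "<xx>/foo-1.0.tar.gz".
    print(distfile_path("foo-1.0.tar.gz", "sha512", (8,)))
```

With cutoffs that are multiples of 4, the directory names are simply
consecutive slices of the hexadecimal digest, so the result is easy to check
by hand against the output of a command-line checksum tool.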
Backwards Compatibility
=======================
Mirror compatibility
--------------------
The mirrored files are propagated to other mirrors as opaque directory
structure. Therefore, there are no backwards compatibility concerns
on the mirroring side.
Backwards compatibility with existing clients is detailed
in the `migrating mirrors to the hashed structure`_ section. Old clients
remain supported because the flat structure is preserved
during the transitional period.
The new clients will fetch the ``layout.conf`` file to avoid backwards
compatibility concerns in the future. In case of hitting an old mirror,
the package manager will default to the ``flat`` structure.
Package manager storage compatibility
-------------------------------------
The exact means of preserving backwards compatibility in package manager
storage are left to the package manager authors. However, it is
recommended that package managers continue to support the flat layout
even if it is no longer the default. The package manager may either
continue to read files from this location or automatically move them
to an appropriate subdirectory.
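
A minimal sketch of the first option, reading from both locations, is shown
below. The function name ``find_distfile``, its parameters, and the order of
preference (nested location first, flat location second) are assumptions made
for this example and are not mandated by this GLEP.

```python
import os


def find_distfile(distdir, filename, nested_relpath):
    """Return the first existing copy of a distfile, or None.

    nested_relpath is the path of the file relative to distdir in
    the hashed layout, computed elsewhere from the filename hash.
    """
    candidates = (
        os.path.join(distdir, nested_relpath),  # hashed layout
        os.path.join(distdir, filename),        # legacy flat layout
    )
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```

A package manager that prefers the second option could use the same lookup
and move a file found in the flat location into its subdirectory.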
Reference Implementation
========================
TODO.
References
==========
.. [#DESKTOP_FORMAT] Desktop Entry Specification: Basic format of the file
(https://standards.freedesktop.org/desktop-entry-spec/latest/ar01s03.html)
.. [#GLEP74] GLEP 74: Full-tree verification using Manifest files:
Checksum algorithms (informational)
(https://www.gentoo.org/glep/glep-0074.html#checksum-algorithms-informational)
.. [#GLEP59] GLEP 59: Manifest2 hash policies and security implications
(https://www.gentoo.org/glep/glep-0059.html)
.. [#BUG534528] Bug 534528 - distfiles should be sorted into subdirectories
of DISTDIR
(https://bugs.gentoo.org/534528)
.. [#ML1] [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
(https://archives.gentoo.org/gentoo-dev/message/cfc4f8595df2edf9a25ba9ecae2463ba)
.. [#ADAPTIVE_FILENAME] Andrew Barchuk's reply on 'using character ranges
for each directory computed in a way to have the files distributed evenly'
(https://archives.gentoo.org/gentoo-dev/message/611bdaa76be049c1d650e8995748e7b8)
.. [#PKGNAME] Jason Zaman's reply including 'using the same dir layout
as the packages themselves'
(https://archives.gentoo.org/gentoo-dev/message/f26ed870c3a6d4ecf69a821723642975)
Copyright
=========
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
Unported License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.
--
Best regards,
Michał Górny
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure (draft v2)
From: Robin H. Johnson @ 2018-01-29 20:00 UTC
To: gentoo-dev
On Mon, Jan 29, 2018 at 08:37:47PM +0100, Michał Górny wrote:
> Migrating mirrors to the hashed structure
> -----------------------------------------
...
> The hard link solution allows us to save space on the master mirror.
> Additionally, if ``-H`` option is used by the mirrors it avoids
> transferring existing files again. However, this option is known
> to be expensive and could cause significant server load. Without it,
> all mirrors need to transfer a second copy of all the existing files.
Informational only, not for addition to GLEP:
I have surveyed some Gentoo mirrors (via the -mirrors mailing list and
direct email), and everybody so far was already using hard links. Older
releng practices for the 2004/2005 releases used hard links on mirrors,
which is where this came from. The mirror documentation should be
updated to say that mirrors should use both --links and --hard-links.
--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail : robbat2@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure (draft v2)
From: R0b0t1 @ 2018-01-29 20:26 UTC
To: gentoo-dev@lists.gentoo.org
On Monday, January 29, 2018, Michał Górny <mgorny@gentoo.org> wrote:
> Here's an updated version. I've tried to incorporate most
> of the feedback so far.
>
>
> ---
> GLEP: 75
> Title: Split distfile mirror directory structure
> Author: Michał Górny <mgorny@gentoo.org>,
> Robin H. Johnson <robbat2@gentoo.org>
> Type: Standards Track
> Status: Draft
> Version: 1
> Created: 2018-01-26
> Last-Modified: 2018-01-27
> Post-History: 2018-01-27
> Content-Type: text/x-rst
> ---
>
> Abstract
> ========
> This GLEP describes the procedure for splitting the distfiles on mirrors
> into multiple directories with the goal of reducing the number of files
> in a single directory.
>
>
> Motivation
> ==========
> At the moment, both the package manager and Gentoo mirrors use flat
> directory structure to store files. While this solution usually works,
> it does not scale well. Directories with large number of files usually
> have significant performance penalty, unless using filesystems
> specifically designed for that purpose.
>
> According to the Gentoo repository state at 2018-01-26 16:23, there
> was a total of 62652 unique distfiles in the repository. While
> the users realistically hit around 10% of that, distfile mirrors often
> hold even more files --- more so if old distfiles are not wiped
> immediately.
>
> While all filesystems used on Linux boxes should be able to cope with
> a number that large, they may suffer a performance penalty with even
> a few thousand files. Additionally, if mirrors enable directory indexes
> then generating the index imposes both a significant server overhead
> and a significant data transfer. At this moment, the index
> of distfiles.gentoo.org has around 17 MiB.
>
> Splitting the distfiles into multiple directories makes it possible
> to avoid those problems by reducing the number of files in a single
> directory. For example, splitting the aforementioned set of distfiles
> into 16 roughly balanced directories reduces
> the number of files in a single directory to around 4000. Splitting
> them further into 256 directories (16x16) results in 200-300 files
> per directory which should avoid any performance problems long-term,
> even assuming 300% growth of number of distfiles.
>
>
> Specification
> =============
> Mirror layout file
> ------------------
> A mirror adhering to this specification should include a ``layout.conf``
> file in the top distfile directory. This file uses the format
> derived from the freedesktop Desktop Entry Specification file format
> [#DESKTOP_FORMAT]_.
>
> Before using each Gentoo mirror, the package manager should attempt
> to fetch (update) its ``layout.conf`` file and process it to determine
> how to use the mirror. If the file is not present, the package manager
> should behave as if it were empty.
>
> The package manager should recognize the sections and keys listed below.
> It should ignore any unrecognized sections or keys --- the format
> is intended to account for future extensions.
>
> This specification currently defines one section: ``[structure]``.
> This section defines one or more repository structure definitions
> using non-negative sequential integer keys. The definition with
> the ``0`` key is the most preferred structure. The package manager
> should ignore any formats it does not recognize. If this section
> is not present, the package manager should behave as if only ``flat``
> structure were specified.
>
> The following structure definitions are supported:
>
> * ``flat`` to indicate the traditional flat structure where all
> distfiles are located in the top directory,
>
> * ``filename-hash <algorithm> <cutoffs>`` to indicate the `filename
> hash structure`_ explained below.
>
>
> Filename hash structure
> -----------------------
> When using the filename hash structure, the distfiles are split
> into directories whose names are derived from the hash of distfile
> filename. This structure has two parameters: *algorithm name*
> and *cutoffs* list.
>
> The algorithm name must correspond to a valid Manifest hash name.
> An informational list of hashes is included in GLEP 74 [#GLEP74]_,
> and the policies for introducing new hashes are covered by GLEP 59
> [#GLEP59]_.
>
> The cutoffs list specifies one or more integers separated by colons
> (``:``), indicating the number of bits (starting with the most
> significant bit) of the hash used to form subsequent subdirectory names.
> For example, the list of ``2:4`` would indicate that top-level directory
> names are formed using 2 most significant bits of the hash (resulting
> in 2² = 4 directories), and each of these directories would have
> subdirectories formed using the next 4 bits of the hash (resulting
> in 2⁴ = 16 subdirectories each).
>
> The exact algorithm for determining the distfile location follows:
>
> 1. Let the distfile filename be **F**.
>
> 2. Compute the hash of **F** and store its binary value as **H**.
>
> 3. For each integer **C** in cutoff list:
>
> a. Take **C** most significant bits of **H** and store them as **V**.
>
> b. Convert **V** into hexadecimal integer, left padded with zeros
> to **C/4** digits (rounded up) and append it to the path, followed
> by the path separator.
>
> c. Shift **H** left **C** bits.
>
> 4. Finally, append **F** to the obtained path.
>
> In particular, note that when using nested directories
> the subdirectories do not repeat the hash bits used in parent directory.
>
>
> Migrating mirrors to the hashed structure
> -----------------------------------------
> Since all distfile mirrors sync to the master Gentoo mirror, it should
> be enough to perform all the needed changes on the master mirror
> and wait for other mirrors to sync. The following procedure
> is recommended:
>
> 1. Include the initial ``layout.conf`` listing only ``flat`` layout.
>
> 2. Create the new structure alongside the flat structure. Wait for
> mirrors to sync.
>
> 3. Once all mirrors receive the new structure, update ``layout.conf``
> to list the ``filename-hash`` structure.
>
> 4. Once a version of Portage supporting the new structure is stable long
> enough, remove the fallback ``flat`` structure from ``layout.conf``
> and duplicate distfiles.
>
> This implies that during the migration period the distfiles will
> be stored duplicated on the mirrors and therefore will occupy twice
> as much space. Technically, this could be avoided either by using
> hard links or symbolic links.
>
> The hard link solution allows us to save space on the master mirror.
> Additionally, if ``-H`` option is used by the mirrors it avoids
> transferring existing files again. However, this option is known
> to be expensive and could cause significant server load. Without it,
> all mirrors need to transfer a second copy of all the existing files.
>
> The symbolic link solution could be more reliable if we could rely
> on mirrors using the ``--links`` rsync option. Without that, symbolic
> links are not transferred at all.
>
>
> Using hashed structure for local distfiles
> ------------------------------------------
> The hashed structure defined above could also be used for local distfile
> storage as used by the package manager. For this to work, the package
> manager authors need to ensure that:
>
> a. The ``${DISTDIR}`` variable in the ebuild scope points to a temporary
> directory where distfiles specific to the package are linked
> in a flat structure.
>
> b. All tools are updated to support the nested structure.
>
> c. The package manager provides a tool for users to easily manipulate
> distfiles, in particular to add distfiles for fetch-restricted
> packages into an appropriate subdirectory.
>
> For extended compatibility, the package manager may support finding
> distfiles in flat and nested structure simultaneously.
>
>
> Rationale
> =========
> Algorithm for splitting distfiles
> ---------------------------------
> The possible algorithms were considered with the following goals
> in mind:
>
> - the number of files in a single directory should not exceed 1000,
>
> - the total size of files in a single directory is not considered
> relevant,
>
> - the solution should preferably be future-proof,
>
> - moving distfiles should be avoided once it is deployed.
>
> It should also be noted that the package with the most
> distfiles in Gentoo at this moment is dev-texlive/texlive-latexextra,
> with 8556 distfiles. All of them start with a common
> prefix of ``texlive-module-``. This specific prefix is used by a total
> of 23435 distfiles.
>
> In the original debate that occurred in bug #534528 [#BUG534528]_
> and the mailing list review of the initial version of this GLEP [#ML1]_,
> four fundamental ideas for splitting distfiles were listed:
>
> a. using initial portion of filename,
>
> b. using initial portion of file hash,
>
> c. using initial portion of filename hash,
>
> d. using package category (and package name).
>
> The initial filename idea was to use the first character of the filename,
> possibly followed by a longer part, a scheme historically
> used e.g. by PyPI Python package hosting. Its main advantage is
> simplicity. The users can easily determine the correct subdirectory
> by just looking at the distfile name. Sadly, this solution is not only
> very uneven but also does not solve the problem. As mentioned above,
> the TeΧ Live packages share a long common prefix that makes it impossible
> to split them properly on fixed-length prefixes.
>
> This idea has been followed by an adaptive proposal by Andrew Barchuk
> [#ADAPTIVE_FILENAME]_. In this proposal, the filenames are not strictly
> mapped to groups by a common prefix but instead each group contains
> all files between two prefixes being used (like in a dictionary).
> However, it has been pointed out that while this option can provide
> very even results initially, it is impossible to predict how it would
> be affected by future distfile changes and there will be a risk of
> needing to change the groups in the future. Furthermore, it is
> relatively complex and requires explicitly listing or obtaining used
> groups.
>
> Another option was to use an initial portion of distfile hashes. Its
> main advantage is that cryptographic hash algorithms can provide
> a more balanced split with random data. Furthermore, since hashes are
> stored in Manifests using them has no cost for users. However, this
> solution has three disadvantages:
>
> 1. Not all files in the distfile tree are covered by package Manifests.
> Additional files are injected into the mirrors, and those will
> not have a clearly-defined location.
>
> 2. User-provided distfiles (e.g. for fetch-restricted packages) with
> hash mismatches would be placed in the wrong subdirectory,
> potentially causing confusing errors.
>
> 3. The hash values are unknown for newly-downloaded distfiles, so
> ``repoman`` (or an equivalent tool) would have to use a temporary
> directory before locating the file in appropriate subdirectory.
>
> Using filename hashes has proven to provide a similar balance to using
> file hashes. Furthermore, since filenames are known up front this
> solution does not suffer from the listed problems. While hashes need
> to be computed manually, hashing a short string should not cause
> any performance problems.
>
> Jason Zaman has suggested using package categories (and package names)
> [#PKGNAME]_. However, this solution has multiple problems:
>
> a. it does not solve the problem for large packages such as TeΧ Live,
>
> b. it introduces many unnecessarily small directories,
>
> c. it requires an explicit knowledge of which package distfiles
> belong to,
>
> d. it does not provide an explicit solution to the problem of distfiles
> shared by multiple packages,
>
> e. it does not provide a solution to the problem of injected distfiles.
>
> All the options considered, the filename hash solution was selected
> as one that solves all the aforementioned problems while introducing
> relatively low complexity and being reasonably future-proof.
>
> .. figure:: glep-0075-extras/by-filename.png
>
> Distribution of distfiles by first character of filenames
> (note: y axis is on log scale)
>
> .. figure:: glep-0075-extras/by-csum.png
>
> Distribution of distfiles by first hex-digit of checksum
> (x --- content checksum, + --- filename checksum)
>
> .. figure:: glep-0075-extras/by-csum2.png
>
> Distribution of distfiles by two first hex-digits of checksum
> (x --- content checksum, + --- filename checksum)
>
>
> Layout file
> -----------
> The presence of a control file has been suggested in the original
> discussion. Its main purpose is to let package managers cleanly handle
> the migration and detect how to correctly query the mirrors throughout
> it. Furthermore, it makes future changes easier.
>
> The format lines are specifically meant to hardcode as little about
> the actual algorithm as possible. Therefore, we can easily change
> the hash used or the exact split structure without having to update
> the package managers or even provide a compatibility layout.
>
> The file is also open for future extensions to provide additional mirror
> metadata. However, no clear use for that has been determined so far.
>
>
> Hash algorithm
> --------------
> The hash algorithm support is fully deferred to the existing code
> in the package managers that is required to handle Manifests.
> In particular, it is recommended to reuse one of the hashes that are
> used in Manifest entries at the time. This avoids code duplication
> and reuses an existing mechanism to handle hash upgrades.
>
> During the discussion, it has been pointed out that this particular use
> case does not require a cryptographically strong hash and a faster
> algorithm could be used instead. However, given the short length of
> the hashed strings, performance is not a problem, and speed does not
> justify the resulting code duplication.
>
> It has also been pointed out that e.g. the BLAKE2 hash family provides
> the ability of creating arbitrary length hashes instead of truncating
> the standard-length hash. However, not all implementations of BLAKE2
> support that and relying on it could reduce portability for no apparent
> gain.
>
>
> Backwards Compatibility
> =======================
> Mirror compatibility
> --------------------
> The mirrored files are propagated to other mirrors as opaque directory
> structure. Therefore, there are no backwards compatibility concerns
> on the mirroring side.
>
> Backwards compatibility with existing clients is detailed
> in `migrating mirrors to the hashed structure`_ section. Backwards
> compatibility with the old clients will be provided by preserving
> the flat structure during the transitional period.
>
> The new clients will fetch the ``layout.conf`` file to avoid backwards
> compatibility concerns in the future. In case of hitting an old mirror,
> the package manager will default to the ``flat`` structure.
>
>
> Package manager storage compatibility
> -------------------------------------
> The exact means of preserving backwards compatibility in package manager
> storage are left to the package manager authors. However, it is
> recommended that package managers continue to support the flat layout
> even if it is no longer the default. The package manager may either
> continue to read files from this location or automatically move them
> to an appropriate subdirectory.
>
>
> Reference Implementation
> ========================
> TODO.
>
>
> References
> ==========
> .. [#DESKTOP_FORMAT] Desktop Entry Specification: Basic format of the file
> (https://standards.freedesktop.org/desktop-entry-spec/latest/ar01s03.html)
>
> .. [#GLEP74] GLEP 74: Full-tree verification using Manifest files:
> Checksum algorithms (informational)
> (https://www.gentoo.org/glep/glep-0074.html#checksum-algorithms-informational)
>
> .. [#GLEP59] GLEP 59: Manifest2 hash policies and security implications
> (https://www.gentoo.org/glep/glep-0059.html)
>
> .. [#BUG534528] Bug 534528 - distfiles should be sorted into subdirectories
> of DISTDIR
> (https://bugs.gentoo.org/534528)
>
> .. [#ML1] [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
> (https://archives.gentoo.org/gentoo-dev/message/cfc4f8595df2edf9a25ba9ecae2463ba)
>
> .. [#ADAPTIVE_FILENAME] Andrew Barchuk's reply on 'using character ranges
> for each directory computed in a way to have the files distributed evenly'
> (https://archives.gentoo.org/gentoo-dev/message/611bdaa76be049c1d650e8995748e7b8)
>
> .. [#PKGNAME] Jason Zaman's reply including 'using the same dir layout
> as the packages themselves'
> (https://archives.gentoo.org/gentoo-dev/message/f26ed870c3a6d4ecf69a821723642975)
>
>
> Copyright
> =========
> This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
> Unported License. To view a copy of this license, visit
> http://creativecommons.org/licenses/by-sa/3.0/.
>
> --
> Best regards,
> Michał Górny
>
It's going to be hash based? Why? I tried to follow the conversation but
there's now close to 5 of these posts in the mailing list with different
conversations in each.
Using filename prefixes is boring and not uniform, but I feel I should
point out that most distfile hosts are still doing fine. Microoptimizing
this seems like wasted effort.
Cheers,
R0b0t1
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure (draft v2)
From: Alec Warner @ 2018-01-29 20:55 UTC
To: Gentoo Dev
On Mon, Jan 29, 2018 at 3:26 PM, R0b0t1 <r030t1@gmail.com> wrote:
> On Monday, January 29, 2018, Michał Górny <mgorny@gentoo.org> wrote:
> > Here's an updated version. I've tried to incorporate most
> > of the feedback so far.
> >
> >
> > ---
> > GLEP: 75
> > Title: Split distfile mirror directory structure
> > Author: Michał Górny <mgorny@gentoo.org>,
> > Robin H. Johnson <robbat2@gentoo.org>
> > Type: Standards Track
> > Status: Draft
> > Version: 1
> > Created: 2018-01-26
> > Last-Modified: 2018-01-27
> > Post-History: 2018-01-27
> > Content-Type: text/x-rst
> > ---
> >
> > Abstract
> > ========
> > This GLEP describes the procedure for splitting the distfiles on mirrors
> > into multiple directories with the goal of reducing the number of files
> > in a single directory.
> >
> >
> > Motivation
> > ==========
> > At the moment, both the package manager and Gentoo mirrors use flat
> > directory structure to store files. While this solution usually works,
> > it does not scale well. Directories with large number of files usually
> > have significant performance penalty, unless using filesystems
> > specifically designed for that purpose.
> >
> > According to the Gentoo repository state at 2018-01-26 16:23, there
> > was a total of 62652 unique distfiles in the repository. While
> > the users realistically hit around 10% of that, distfile mirrors often
> > hold even more files --- more so if old distfiles are not wiped
> > immediately.
> >
> > While all filesystems used on Linux boxes should be able to cope with
> > a number that large, they may suffer a performance penalty with even
> > a few thousand files. Additionally, if mirrors enable directory indexes
> > then generating the index imposes both a significant server overhead
> > and a significant data transfer. At this moment, the index
> > of distfiles.gentoo.org has around 17 MiB.
> >
> > Splitting the distfiles into multiple directories makes it possible
> > to avoid those problems by reducing the number of files in a single
> > directory. For example, splitting the aforementioned set of distfiles
> > into 16 roughly balanced directories reduces
> > the number of files in a single directory to around 4000. Splitting
> > them further into 256 directories (16x16) results in 200-300 files
> > per directory which should avoid any performance problems long-term,
> > even assuming 300% growth of number of distfiles.
> >
> >
> > Specification
> > =============
> > Mirror layout file
> > ------------------
> > A mirror adhering to this specification should include a ``layout.conf``
> > file in the top distfile directory. This file uses the format
> > derived from the freedesktop Desktop Entry Specification file format
> > [#DESKTOP_FORMAT]_.
> >
> > Before using each Gentoo mirror, the package manager should attempt
> > to fetch (update) its ``layout.conf`` file and process it to determine
> > how to use the mirror. If the file is not present, the package manager
> > should behave as if it were empty.
> >
> > The package manager should recognize the sections and keys listed below.
> > It should ignore any unrecognized sections or keys --- the format
> > is intended to account for future extensions.
> >
> > This specification currently defines one section: ``[structure]``.
> > This section defines one or more repository structure definitions
> > using non-negative sequential integer keys. The definition with
> > the ``0`` key is the most preferred structure. The package manager
> > should ignore any formats it does not recognize. If this section
> > is not present, the package manager should behave as if only ``flat``
> > structure were specified.
> >
> > The following structure definitions are supported:
> >
> > * ``flat`` to indicate the traditional flat structure where all
> > distfiles are located in the top directory,
> >
> > * ``filename-hash <algorithm> <cutoffs>`` to indicate the `filename
> > hash structure`_ explained below.
> >
> >
> > Filename hash structure
> > -----------------------
> > When using the filename hash structure, the distfiles are split
> > into directories whose names are derived from the hash of distfile
> > filename. This structure has two parameters: *algorithm name*
> > and *cutoffs* list.
> >
> > The algorithm name must correspond to a valid Manifest hash name.
> > An informational list of hashes is included in GLEP 74 [#GLEP74]_,
> > and the policies for introducing new hashes are covered by GLEP 59
> > [#GLEP59]_.
> >
> > The cutoffs list specifies one or more integers separated by colons
> > (``:``), indicating the number of bits (starting with the most
> > significant bit) of the hash used to form subsequent subdirectory names.
> > For example, the list of ``2:4`` would indicate that top-level directory
> > names are formed using 2 most significant bits of the hash (resulting
> > in 2² = 4 directories), and each of these directories would have
> > subdirectories formed using the next 4 bits of the hash (resulting
> > in 2⁴ = 16 subdirectories each).
> >
> > The exact algorithm for determining the distfile location follows:
> >
> > 1. Let the distfile filename be **F**.
> >
> > 2. Compute the hash of **F** and store its binary value as **H**.
> >
> > 3. For each integer **C** in cutoff list:
> >
> > a. Take **C** most significant bits of **H** and store them as **V**.
> >
> > b. Convert **V** into hexadecimal integer, left padded with zeros
> > to **C/4** digits (rounded up) and append it to the path, followed
> > by the path separator.
> >
> > c. Shift **H** left **C** bits.
> >
> > 4. Finally, append **F** to the obtained path.
> >
> > In particular, note that when using nested directories
> > the subdirectories do not repeat the hash bits used in parent directory.
> >
> >
> > Migrating mirrors to the hashed structure
> > -----------------------------------------
> > Since all distfile mirrors sync to the master Gentoo mirror, it should
> > be enough to perform all the needed changes on the master mirror
> > and wait for other mirrors to sync. The following procedure
> > is recommended:
> >
> > 1. Include the initial ``layout.conf`` listing only ``flat`` layout.
> >
> > 2. Create the new structure alongside the flat structure. Wait for
> > mirrors to sync.
> >
> > 3. Once all mirrors receive the new structure, update ``layout.conf``
> > to list the ``filename-hash`` structure.
> >
> > 4. Once a version of Portage supporting the new structure is stable long
> > enough, remove the fallback ``flat`` structure from ``layout.conf``
> > and duplicate distfiles.
> >
> > This implies that during the migration period the distfiles will
> > be stored duplicated on the mirrors and therefore will occupy twice
> > as much space. Technically, this could be avoided either by using
> > hard links or symbolic links.
> >
> > The hard link solution allows us to save space on the master mirror.
> > Additionally, if ``-H`` option is used by the mirrors it avoids
> > transferring existing files again. However, this option is known
> > to be expensive and could cause significant server load. Without it,
> > all mirrors need to transfer a second copy of all the existing files.
> >
> > The symbolic link solution could be more reliable if we could rely
> > on mirrors using the ``--links`` rsync option. Without that, symbolic
> > links are not transferred at all.
> >
> >
> > Using hashed structure for local distfiles
> > ------------------------------------------
> > The hashed structure defined above could also be used for local distfile
> > storage as used by the package manager. For this to work, the package
> > manager authors need to ensure that:
> >
> > a. The ``${DISTDIR}`` variable in the ebuild scope points to a temporary
> > directory where distfiles specific to the package are linked
> > in a flat structure.
> >
> > b. All tools are updated to support the nested structure.
> >
> > c. The package manager provides a tool for users to easily manipulate
> > distfiles, in particular to add distfiles for fetch-restricted
> > packages into an appropriate subdirectory.
> >
> > For extended compatibility, the package manager may support finding
> > distfiles in flat and nested structure simultaneously.
> >
> >
> > Rationale
> > =========
> > Algorithm for splitting distfiles
> > ---------------------------------
> > The possible algorithms were considered with the following goals
> > in mind:
> >
> > - the number of files in a single directory should not exceed 1000,
> >
> > - the total size of files in a single directory is not considered
> > relevant,
> >
> > - the solution should preferably be future-proof,
> >
> > - moving distfiles should be avoided once it is deployed.
> >
> > It should also be noted that the package with the most
> > distfiles in Gentoo at this moment is dev-texlive/texlive-latexextra,
> > with 8556 distfiles. All of them start with a common
> > prefix of ``texlive-module-``. This specific prefix is used by a total
> > of 23435 distfiles.
> >
> > In the original debate that occurred in bug #534528 [#BUG534528]_
> > and the mailing list review of the initial version of this GLEP [#ML1]_,
> > four fundamental ideas for splitting distfiles were listed:
> >
> > a. using initial portion of filename,
> >
> > b. using initial portion of file hash,
> >
> > c. using initial portion of filename hash,
> >
> > d. using package category (and package name).
> >
> > The initial filename idea was to use the first character of the filename,
> > possibly followed by a longer part; this scheme was historically
> > used e.g. by PyPI Python package hosting. Its main advantage is
> > simplicity. The users can easily determine the correct subdirectory
> > by just looking at the distfile name. Sadly, this solution is not only
> > very uneven but also does not solve the problem. As mentioned above,
> > the TeΧ Live packages share a long common prefix that makes it impossible
> > to split them properly using fixed-length prefixes.
> >
> > This idea has been followed by an adaptive proposal by Andrew Barchuk
> > [#ADAPTIVE_FILENAME]_. In this proposal, the filenames are not strictly
> > mapped to groups by a common prefix but instead each group contains
> > all files between two prefixes being used (like in a dictionary).
> > However, it has been pointed out that while this option can provide
> > very even results initially, it is impossible to predict how it would
> > be affected by future distfile changes and there will be a risk of
> > needing to change the groups in the future. Furthermore, it is
> > relatively complex and requires explicitly listing or obtaining used
> > groups.
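For readers who have not followed that subthread, the dictionary-style lookup could look roughly like the Python sketch below. The boundary strings are invented for the example; in a real deployment they would have to be computed from the distfile set and published somewhere, which is exactly the complexity noted above.

```python
import bisect

# Hypothetical range boundaries, sorted; group i holds every name that
# sorts before BOUNDARIES[i] and not before BOUNDARIES[i - 1].
BOUNDARIES = ["g", "p", "texlive-module-f", "texlive-module-p", "u"]


def group_for(filename, boundaries=BOUNDARIES):
    """Return the index of the dictionary range containing filename."""
    return bisect.bisect_right(boundaries, filename)
```

If the distfile set later shifts (say, a new large family of files appears between two boundaries), the list has to change and files have to move.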
> >
> > Another option was to use an initial portion of distfile hashes. Its
> > main advantage is that cryptographic hash algorithms can provide
> > a more balanced split with random data. Furthermore, since hashes are
> > stored in Manifests, using them has no cost for users. However, this
> > solution has three disadvantages:
> >
> > 1. Not all files in the distfile tree are covered by package Manifests.
> > Additional files are injected into the mirrors, and those will
> > not have a clearly-defined location.
> >
> > 2. User-provided distfiles (e.g. for fetch-restricted packages) with
> > hash mismatches would be placed in the wrong subdirectory,
> > potentially causing confusing errors.
> >
> > 3. The hash values are unknown for newly-downloaded distfiles, so
> > ``repoman`` (or an equivalent tool) would have to use a temporary
> > directory before locating the file in appropriate subdirectory.
> >
> > Using filename hashes has proven to provide a similar balance to using
> > file hashes. Furthermore, since filenames are known up front this
> > solution does not suffer from the listed problems. While hashes need
> > to be computed manually, hashing a short string should not cause
> > any performance problems.
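A minimal Python sketch of the filename-hash idea follows. The choice of BLAKE2b and of a two-hex-digit prefix are assumptions for illustration, and distfile_subdir is a made-up name.

```python
import hashlib


def distfile_subdir(filename, hex_digits=2):
    """Map a distfile name to a subdirectory named after its hash prefix."""
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    return digest[:hex_digits]


# Files sharing a long common prefix still spread over many directories.
names = ["texlive-module-%d.tar.xz" % i for i in range(1000)]
buckets = {}
for name in names:
    key = distfile_subdir(name)
    buckets[key] = buckets.get(key, 0) + 1
print(len(buckets), "directories used, largest holds", max(buckets.values()))
```

Only the name is hashed, so the location is known before the file is fetched and is unaffected by the file's content.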
> >
> > Jason Zaman has suggested using package categories (and package names)
> > [#PKGNAME]_. However, this solution has multiple problems:
> >
> > a. it does not solve the problem for large packages such as TeΧ Live,
> >
> > b. it introduces many unnecessarily small directories,
> >
> > c. it requires an explicit knowledge of which package distfiles
> > belong to,
> >
> > d. it does not provide an explicit solution to the problem of distfiles
> > shared by multiple packages,
> >
> > e. it does not provide a solution to the problem of injected distfiles.
> >
> > All the options considered, the filename hash solution was selected
> > as one that solves all the aforementioned problems while introducing
> > relatively low complexity and being reasonably future-proof.
> >
> > .. figure:: glep-0075-extras/by-filename.png
> >
> > Distribution of distfiles by first character of filenames
> > (note: y axis is on log scale)
> >
> > .. figure:: glep-0075-extras/by-csum.png
> >
> > Distribution of distfiles by first hex-digit of checksum
> > (x --- content checksum, + --- filename checksum)
> >
> > .. figure:: glep-0075-extras/by-csum2.png
> >
> > Distribution of distfiles by two first hex-digits of checksum
> > (x --- content checksum, + --- filename checksum)
> >
> >
> > Layout file
> > -----------
> > The presence of a control file has been suggested in the original
> > discussion. Its main purpose is to let package managers cleanly handle
> > the migration and detect how to correctly query the mirrors throughout
> > it. Furthermore, it makes future changes easier.
> >
> > The format lines are specifically meant to hardcode as little about
> > the actual algorithm as possible. Therefore, we can easily change
> > the hash used or the exact split structure without having to update
> > the package managers or even provide a compatibility layout.
> >
> > The file is also open for future extensions to provide additional mirror
> > metadata. However, no clear use for that has been determined so far.
> >
> >
> > Hash algorithm
> > --------------
> > The hash algorithm support is fully deferred to the existing code
> > in the package managers that is required to handle Manifests.
> > In particular, it is recommended to reuse one of the hashes that are
> > used in Manifest entries at the time. This avoids code duplication
> > and reuses an existing mechanism to handle hash upgrades.
> >
> > During the discussion, it has been pointed out that this particular use case
> > does not require a cryptographically strong hash and a faster algorithm
> > could be used instead. However, given the short length of hashed
> > strings, performance is not a problem, and speed does not justify
> > the resulting code duplication.
> >
> > It has also been pointed out that e.g. the BLAKE2 hash family provides
> > the ability to create arbitrary-length hashes instead of truncating
> > the standard-length hash. However, not all implementations of BLAKE2
> > support that and relying on it could reduce portability for no apparent
> > gain.
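One practical consequence worth illustrating: in BLAKE2 the requested digest length is part of the hash parameters, so a natively short digest is not simply a prefix of the full-length one, and the two approaches are not interchangeable. A small Python sketch using hashlib (the helper names are invented for the example):

```python
import hashlib


def truncated(name):
    """First byte (two hex digits) of the standard 64-byte BLAKE2b digest."""
    return hashlib.blake2b(name.encode("utf-8")).hexdigest()[:2]


def native_short(name):
    """One-byte BLAKE2b digest requested directly via digest_size."""
    return hashlib.blake2b(name.encode("utf-8"), digest_size=1).hexdigest()


names = ["file-%d.tar.gz" % i for i in range(32)]
differing = [n for n in names if truncated(n) != native_short(n)]
print(len(differing), "of", len(names), "names map to different directories")
```

Whichever variant is chosen, every implementation has to use the same one.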
> >
> >
> > Backwards Compatibility
> > =======================
> > Mirror compatibility
> > --------------------
> > The mirrored files are propagated to other mirrors as opaque directory
> > structure. Therefore, there are no backwards compatibility concerns
> > on the mirroring side.
> >
> > Backwards compatibility with existing clients is detailed
> > in the `migrating mirrors to the hashed structure`_ section. Backwards
> > compatibility with the old clients will be provided by preserving
> > the flat structure during the transitional period.
> >
> > The new clients will fetch the ``layout.conf`` file to avoid backwards
> > compatibility concerns in the future. In case of hitting an old mirror,
> > the package manager will default to the ``flat`` structure.
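The fallback described here could be sketched as below, in Python. The exact file syntax is defined elsewhere in the GLEP and not in this section, so the INI-style ``[structure]`` section with integer keys, and the value strings in the usage example, are assumptions for this sketch.

```python
import configparser


def mirror_structures(layout_text):
    """Return the structures a mirror advertises, most preferred first.

    layout_text is the content of layout.conf, or None when the mirror
    does not provide the file (an old mirror).
    """
    if layout_text is None:
        return ["flat"]
    parser = configparser.ConfigParser()
    parser.read_string(layout_text)
    if not parser.has_section("structure"):  # assumed section name
        return ["flat"]
    entries = sorted(parser.items("structure"), key=lambda kv: int(kv[0]))
    return [value for _, value in entries]
```

For example, mirror_structures("[structure]\n0=filename-hash BLAKE2B 8\n1=flat\n") yields the hashed structure first and ``flat`` second, while mirror_structures(None) yields just ``flat``.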
> >
> >
> > Package manager storage compatibility
> > -------------------------------------
> > The exact means of preserving backwards compatibility in package manager
> > storage are left to the package manager authors. However, it is
> > recommended that package managers continue to support the flat layout
> > even if it is no longer the default. The package manager may either
> > continue to read files from this location or automatically move them
> > to an appropriate subdirectory.
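The "automatically move" option might look something like this Python sketch. The BLAKE2b hash, the two-hex-digit split and the function name are assumptions for illustration; a real package manager would also need locking and collision handling.

```python
import hashlib
import os


def migrate_flat(distdir, hex_digits=2):
    """Move regular files from the top of distdir into hash subdirectories."""
    moved = []
    for name in sorted(os.listdir(distdir)):
        source = os.path.join(distdir, name)
        if not os.path.isfile(source):
            continue  # leave existing subdirectories (e.g. git3-src/) alone
        subdir = hashlib.blake2b(name.encode("utf-8")).hexdigest()[:hex_digits]
        os.makedirs(os.path.join(distdir, subdir), exist_ok=True)
        os.rename(source, os.path.join(distdir, subdir, name))
        moved.append(name)
    return moved
```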
> >
> >
> > Reference Implementation
> > ========================
> > TODO.
> >
> >
> > References
> > ==========
> > .. [#DESKTOP_FORMAT] Desktop Entry Specification: Basic format of the file
> > (https://standards.freedesktop.org/desktop-entry-spec/latest/ar01s03.html)
> >
> > .. [#GLEP74] GLEP 74: Full-tree verification using Manifest files:
> > Checksum algorithms (informational)
> > (https://www.gentoo.org/glep/glep-0074.html#checksum-algorithms-informational)
> >
> > .. [#GLEP59] GLEP 59: Manifest2 hash policies and security implications
> > (https://www.gentoo.org/glep/glep-0059.html)
> >
> > .. [#BUG534528] Bug 534528 - distfiles should be sorted into subdirectories
> > of DISTDIR
> > (https://bugs.gentoo.org/534528)
> >
> > .. [#ML1] [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
> > (https://archives.gentoo.org/gentoo-dev/message/cfc4f8595df2edf9a25ba9ecae2463ba)
> >
> > .. [#ADAPTIVE_FILENAME] Andrew Barchuk's reply on 'using character ranges
> > for each directory computed in a way to have the files distributed evenly'
> > (https://archives.gentoo.org/gentoo-dev/message/611bdaa76be049c1d650e8995748e7b8)
> >
> > .. [#PKGNAME] Jason Zaman's reply including 'using the same dir layout
> > as the packages themselves'
> > (https://archives.gentoo.org/gentoo-dev/message/f26ed870c3a6d4ecf69a821723642975)
> >
> >
> > Copyright
> > =========
> > This work is licensed under the Creative Commons Attribution-ShareAlike
> > 3.0 Unported License. To view a copy of this license, visit
> > http://creativecommons.org/licenses/by-sa/3.0/.
> >
> > --
> > Best regards,
> > Michał Górny
> >
>
> It's going to be hash based? Why? I tried to follow the conversation but
> there's now close to 5 of these posts in the mailing list with different
> conversations in each.
>
The reasoning is embedded in the proposal; search for "Algorithm for
splitting distfiles". Read that section. I think it summarizes the space
well.
-A
>
> Using filename prefixes is boring and not uniform, but I feel I should
> point out that most distfile hosts are still doing fine. Microoptimizing
> this seems like wasted effort.
>
> Cheers,
> R0b0t1
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure (draft v2)
2018-01-29 20:00 ` Robin H. Johnson
@ 2018-01-29 21:09 ` Michał Górny
0 siblings, 0 replies; 44+ messages in thread
From: Michał Górny @ 2018-01-29 21:09 UTC (permalink / raw
To: gentoo-dev
On Mon, 29.01.2018 at 20:00 +0000, Robin H. Johnson wrote:
> On Mon, Jan 29, 2018 at 08:37:47PM +0100, Michał Górny wrote:
> > Migrating mirrors to the hashed structure
> > -----------------------------------------
>
> ...
> > The hard link solution allows us to save space on the master mirror.
> > Additionally, if ``-H`` option is used by the mirrors it avoids
> > transferring existing files again. However, this option is known
> > to be expensive and could cause significant server load. Without it,
> > all mirrors need to transfer a second copy of all the existing files.
>
> Informational only, not for addition to GLEP:
> I have surveyed some Gentoo mirrors (via the -mirrors mailing list, and
> direct email), and everybody so far was already using hard-links.
That's great news! So now we just have to figure out a tool that will
create the hardlinks ;-). Are we still using emirrordist?
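As a starting point for such a tool, here is a minimal Python sketch that hard-links every flat file into a hashed subdirectory. It is not tied to emirrordist; the BLAKE2b hash, the two-hex-digit split and the function name are assumptions for the example.

```python
import hashlib
import os


def link_into_hashed(distdir, hex_digits=2):
    """Hard-link every flat distfile into its hashed subdirectory."""
    for name in os.listdir(distdir):
        source = os.path.join(distdir, name)
        if not os.path.isfile(source):
            continue  # skip directories, including ones created earlier
        subdir = hashlib.blake2b(name.encode("utf-8")).hexdigest()[:hex_digits]
        target = os.path.join(distdir, subdir, name)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        if not os.path.exists(target):
            os.link(source, target)  # second name for the same inode
```

Both names point at the same inode, so the master mirror stores the data once; mirrors syncing with rsync -H (--hard-links) would preserve that.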
> Older
> releng practices for the 2004/2005 releases used hard-links on mirrors,
> which is where this came from. The Mirror documentation should be
> updated to say that mirrors should use both --links and --hard-links.
>
Does this mean you'll update it /have updated it/ or you're suggesting
that somebody else should do it? ;-)
--
Best regards,
Michał Górny
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-27 11:41 ` Michał Górny
2018-01-27 16:42 ` Gordon Pettey
@ 2018-01-30 1:21 ` Kent Fredric
2018-01-30 2:53 ` Robin H. Johnson
2018-01-30 7:25 ` Michał Górny
1 sibling, 2 replies; 44+ messages in thread
From: Kent Fredric @ 2018-01-30 1:21 UTC (permalink / raw
To: gentoo-dev
On Sat, 27 Jan 2018 12:41:58 +0100
Michał Górny <mgorny@gentoo.org> wrote:
> find -name 'foo.tar.gz'
Other than being *worse* than the current "ls" situation due to the
existence of distfiles/git3-src/ and distfiles/git-src/
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-27 18:24 ` Michael Orlitzky
2018-01-27 19:47 ` Michał Górny
@ 2018-01-30 1:27 ` Kent Fredric
2018-01-30 7:17 ` Ulrich Mueller
1 sibling, 1 reply; 44+ messages in thread
From: Kent Fredric @ 2018-01-30 1:27 UTC (permalink / raw
To: gentoo-dev
On Sat, 27 Jan 2018 13:24:44 -0500
Michael Orlitzky <mjo@gentoo.org> wrote:
> Fetch instructions for app-cat/pkg:
> *
> * Please download file1 from:
> * wherever file1 can be found
> * and move it to $DISTDIR/subdir1
> *
> * Please download file2 from
> * wherever file2 can be found
> * and move it to $DISTDIR/subdir2
> *
> * ...
> *
> * Please download fileN from
> * wherever fileN can be found
> * and move it to $DISTDIR/subdirN
Or ...:
Please download:
src_url1/filename1.tar.gz
src_url2/filename2.tar.gz
src_url3/filename3.tar.gz
And install them to your $DISTDIR with
edistadd ./filename1.tar.gz ./filename2.tar.gz ./filename3.tar.gz
An extra perk is edistadd could also ensure the permissions are correct
for portage to read the file afterwards, which is a bit of an annoyance
sometimes as it is.
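To make the idea concrete, a rough Python sketch of what such a helper could do. edistadd does not exist; the hash, the two-hex-digit split and the 0644 mode are assumptions for the example.

```python
import hashlib
import os
import shutil


def edistadd(distdir, paths, mode=0o644):
    """Copy files into distdir under their hashed subdirectory."""
    installed = []
    for path in paths:
        name = os.path.basename(path)
        subdir = hashlib.blake2b(name.encode("utf-8")).hexdigest()[:2]
        target = os.path.join(distdir, subdir, name)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        shutil.copyfile(path, target)
        os.chmod(target, mode)  # make the file readable by the package manager
        installed.append(target)
    return installed
```

The user never needs to know which subdirectory a file belongs in, and the permissions are fixed in the same step.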
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-28 13:03 ` Ulrich Mueller
@ 2018-01-30 1:41 ` Kent Fredric
2018-01-30 7:11 ` Ulrich Mueller
0 siblings, 1 reply; 44+ messages in thread
From: Kent Fredric @ 2018-01-30 1:41 UTC (permalink / raw
To: gentoo-dev
On Sun, 28 Jan 2018 14:03:07 +0100
Ulrich Mueller <ulm@gentoo.org> wrote:
> How about "consecutive non-negative integer keys"?
"Unsigned" ?
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-30 1:21 ` Kent Fredric
@ 2018-01-30 2:53 ` Robin H. Johnson
2018-01-30 7:25 ` Michał Górny
1 sibling, 0 replies; 44+ messages in thread
From: Robin H. Johnson @ 2018-01-30 2:53 UTC (permalink / raw
To: gentoo-dev
On Tue, Jan 30, 2018 at 02:21:34PM +1300, Kent Fredric wrote:
> On Sat, 27 Jan 2018 12:41:58 +0100
> Michał Górny <mgorny@gentoo.org> wrote:
>
> > find -name 'foo.tar.gz'
>
> Other than being *worse* than the current "ls" situation due to the
> existence of distfiles/git3-src/ and distfiles/git-src/
find -name 'foo.tar.gz' ! -path '*-src/*'
--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail : robbat2@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-30 1:41 ` Kent Fredric
@ 2018-01-30 7:11 ` Ulrich Mueller
0 siblings, 0 replies; 44+ messages in thread
From: Ulrich Mueller @ 2018-01-30 7:11 UTC (permalink / raw
To: gentoo-dev
>>>>> On Tue, 30 Jan 2018, Kent Fredric wrote:
>> How about "consecutive non-negative integer keys"?
> "Unsigned" ?
Even better.
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-30 1:27 ` Kent Fredric
@ 2018-01-30 7:17 ` Ulrich Mueller
0 siblings, 0 replies; 44+ messages in thread
From: Ulrich Mueller @ 2018-01-30 7:17 UTC (permalink / raw
To: gentoo-dev
>>>>> On Tue, 30 Jan 2018, Kent Fredric wrote:
> Please download:
>
> src_url1/filename1.tar.gz
> src_url2/filename2.tar.gz
> src_url3/filename3.tar.gz
>
> And install them to your $DISTDIR with
DISTDIR is not valid in pkg_nofetch(), though.
>
> edistadd ./filename1.tar.gz ./filename2.tar.gz ./filename3.tar.gz
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-30 1:21 ` Kent Fredric
2018-01-30 2:53 ` Robin H. Johnson
@ 2018-01-30 7:25 ` Michał Górny
2018-01-30 19:46 ` Kent Fredric
1 sibling, 1 reply; 44+ messages in thread
From: Michał Górny @ 2018-01-30 7:25 UTC (permalink / raw
To: gentoo-dev
On Tue, 30.01.2018 at 14:21 +1300, Kent Fredric wrote:
> On Sat, 27 Jan 2018 12:41:58 +0100
> Michał Górny <mgorny@gentoo.org> wrote:
>
> > find -name 'foo.tar.gz'
>
> Other than being *worse* than the current "ls" situation due to the
> existence of distfiles/git3-src/ and distfiles/git-src/
>
Wait... so people actually don't override those locations?
--
Best regards,
Michał Górny
* Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
2018-01-30 7:25 ` Michał Górny
@ 2018-01-30 19:46 ` Kent Fredric
0 siblings, 0 replies; 44+ messages in thread
From: Kent Fredric @ 2018-01-30 19:46 UTC (permalink / raw
To: gentoo-dev
On Tue, 30 Jan 2018 08:25:28 +0100
Michał Górny <mgorny@gentoo.org> wrote:
> On Tue, 30.01.2018 at 14:21 +1300, Kent Fredric wrote:
> > On Sat, 27 Jan 2018 12:41:58 +0100
> > Michał Górny <mgorny@gentoo.org> wrote:
> >
> > > find -name 'foo.tar.gz'
> >
> > Other than being *worse* than the current "ls" situation due to the
> > existence of distfiles/git3-src/ and distfiles/git-src/
> >
>
> Wait... so people actually don't override those locations?
>
I'm installing "foo-9999.ebuild". Why would I read its ebuild, and then
read its eclass, and then, read the documentation on that eclass, and
then, override its defaults?