From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gentoo-dev+bounces-83693-garchives=archives.gentoo.org@lists.gentoo.org>
Received: from lists.gentoo.org (pigeon.gentoo.org [208.92.234.80])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by finch.gentoo.org (Postfix) with ESMTPS id 2F10F1382C5
	for <garchives@archives.gentoo.org>; Mon, 29 Jan 2018 21:00:35 +0000 (UTC)
Received: from pigeon.gentoo.org (localhost [127.0.0.1])
	by pigeon.gentoo.org (Postfix) with SMTP id C7954E0ADE;
	Mon, 29 Jan 2018 21:00:28 +0000 (UTC)
Received: from mail-vk0-x232.google.com (mail-vk0-x232.google.com [IPv6:2607:f8b0:400c:c05::232])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(No client certificate requested)
	by pigeon.gentoo.org (Postfix) with ESMTPS id 37FF3E0A8E
	for <gentoo-dev@lists.gentoo.org>; Mon, 29 Jan 2018 21:00:27 +0000 (UTC)
Received: by mail-vk0-x232.google.com with SMTP id z9so5417595vkd.5
        for <gentoo-dev@lists.gentoo.org>; Mon, 29 Jan 2018 13:00:27 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=scriptkitty-com.20150623.gappssmtp.com; s=20150623;
        h=mime-version:sender:in-reply-to:references:from:date:message-id
         :subject:to;
        bh=L4vR9IkzQ1hsBv06XGt/esLxprXfdVkWDwLw3OIDpbI=;
        b=NZ04Wx9A7E30dL7Tlqhdk/dkELTxTYaXqt0csCP/Znetd3tiQmryFEsF8xo3trxX3i
         CgltPBHO/GCgCvi0miigIbQjNfVBlPJk4D3FlYySvfPuOVWOF9GYLrLg2QPm3Ltrre+d
         TCDkHZbSEznnHdzCUC0rOIRbe1R4ulBvYNmIrDZObEFW29d021DIwyk5D3G/bjuLK2G7
         Zkh/0sh8ATWzAOPikLZVL7IugFg8NE4sIzHkCSittsIY9MDteNQxDZs0izcXO359X98v
         wiKN1P04NmtgGyL/+bM8RYBoPkO3y/8j5HfDGG8Jk4cZtn7LFZioiDMUGpTMBhclNiXd
         SwVw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20161025;
        h=x-gm-message-state:mime-version:sender:in-reply-to:references:from
         :date:message-id:subject:to;
        bh=L4vR9IkzQ1hsBv06XGt/esLxprXfdVkWDwLw3OIDpbI=;
        b=rVl6Y3LfkpUlX5P8iiBwax3OjdaNtjlu64kMWlYIKanujv2qZmyh8BlMiZ2/gfUIFC
         5pxcM6wR1Bv/y52MiZY/2k7MXyTnhQGIAMmQ9X+1If3WoOVo8yWyPIAykZ8AYLQHI3fD
         cp4k7ke2sV4I+ddzb6faOPXigim7l5GGg1s2DYHL3F6ndBeZUWwvyvFKIYmlXr4HKFRA
         hadw3ipWHmrNtv1+M0fZa38BEmvNaQjF6YNCQf9byLJMFUwdeBdZm4pmt9yEFheScPiA
         8M1yy3f9LoXHjW4asAtaoxDEmN3lpqn0NevH/o8U9wtZOJdy1RO1Pb0oOb1HMtTb/mRI
         i35w==
X-Gm-Message-State: AKwxyteuBFxiHG6OfMZ4OS1bRbxhGvGC+ExVloSi30n8VSr+9SCJRQmw
	QMKHUmjo2qxP0M4dslilAm4obZ8+2EoX2qQblyLmWdk6
X-Google-Smtp-Source: AH8x227s7p8fu9B88wQ62G9fDAvtoX4l/wBBczl7bEdk3u2MOOVcfsjKRFEdvks2lhM3/NPGVaNXeAwBun3XRB2amlQ=
X-Received: by 10.31.52.12 with SMTP id b12mr19198682vka.178.1517259626701;
 Mon, 29 Jan 2018 13:00:26 -0800 (PST)
Precedence: bulk
List-Post: <mailto:gentoo-dev@lists.gentoo.org>
List-Help: <mailto:gentoo-dev+help@lists.gentoo.org>
List-Unsubscribe: <mailto:gentoo-dev+unsubscribe@lists.gentoo.org>
List-Subscribe: <mailto:gentoo-dev+subscribe@lists.gentoo.org>
List-Id: Gentoo Linux mail <gentoo-dev.gentoo.org>
X-BeenThere: gentoo-dev@lists.gentoo.org
Reply-to: gentoo-dev@lists.gentoo.org
MIME-Version: 1.0
Sender: antarus@scriptkitty.com
Received: by 10.176.13.139 with HTTP; Mon, 29 Jan 2018 12:55:43 -0800 (PST)
X-Originating-IP: [2620:0:1003:512:4092:948:1f63:8004]
In-Reply-To: <CAAD4mYj=ZDr3WFopvrqHoh7Cg+MeDS+yUKtTMn+-p=8DJuymgw@mail.gmail.com>
References: <1517009079.31015.3.camel@gentoo.org> <1517254667.1187.25.camel@gentoo.org>
 <CAAD4mYj=ZDr3WFopvrqHoh7Cg+MeDS+yUKtTMn+-p=8DJuymgw@mail.gmail.com>
From: Alec Warner <antarus@gentoo.org>
Date: Mon, 29 Jan 2018 15:55:43 -0500
X-Google-Sender-Auth: jsi0v7HGeuUrSUHK79LFIC1Opu4
Message-ID: <CAAr7Pr-1PRaeK2--rnNnTFmKhH6WitztoLambUq1tRCK1wjwUg@mail.gmail.com>
Subject: Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
 (draft v2)
To: Gentoo Dev <gentoo-dev@lists.gentoo.org>
Content-Type: multipart/alternative; boundary="001a1143fa4014fcaf0563f08962"
X-Archives-Salt: 1513201d-ef48-490e-a36a-5ea6c86279f5
X-Archives-Hash: e539e3f564031d130d7ac773fb91ce89

--001a1143fa4014fcaf0563f08962
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Mon, Jan 29, 2018 at 3:26 PM, R0b0t1 <r030t1@gmail.com> wrote:

> On Monday, January 29, 2018, Micha=C5=82 G=C3=B3rny <mgorny@gentoo.org> w=
rote:
> > Here's an updated version. I've tried to incorporate most
> > of the feedback so far.
> >
> >
> > ---
> > GLEP: 75
> > Title: Split distfile mirror directory structure
> > Author: Micha=C5=82 G=C3=B3rny <mgorny@gentoo.org>,
> >         Robin H. Johnson <robbat2@gentoo.org>
> > Type: Standards Track
> > Status: Draft
> > Version: 1
> > Created: 2018-01-26
> > Last-Modified: 2018-01-27
> > Post-History: 2018-01-27
> > Content-Type: text/x-rst
> > ---
> >
> > Abstract
> > =3D=3D=3D=3D=3D=3D=3D=3D
> > This GLEP describes the procedure for splitting the distfiles on mirror=
s
> > into multiple directories with the goal of reducing the number of files
> > in a single directory.
> >
> >
> > Motivation
> > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> > At the moment, both the package manager and Gentoo mirrors use a flat
> > directory structure to store files.  While this solution usually works,
> > it does not scale well.  Directories with a large number of files
> > usually incur a significant performance penalty, unless using
> > filesystems specifically designed for that purpose.
> >
> > According to the Gentoo repository state at 2018-01-26 16:23, there
> > was a total of 62652 unique distfiles in the repository.  While
> > the users realistically hit around 10% of that, distfile mirrors often
> > hold even more files --- more so if old distfiles are not wiped
> > immediately.
> >
> > While all filesystems used on Linux boxes should be able to cope with
> > a number that large, they may suffer a performance penalty with even
> > a few thousand files.  Additionally, if mirrors enable directory indexe=
s
> > then generating the index imposes both a significant server overhead
> > and a significant data transfer.  At this moment, the index
> > of distfiles.gentoo.org has around 17 MiB.
> >
> > Splitting the distfiles into multiple directories makes it possible
> > to avoid those problems by reducing the number of files in a single
> > directory.  For example, splitting the aforementioned set of distfiles
> > into 16 directories that are roughly balanced reduces
> > the number of files in a single directory to around 4000.  Splitting
> > them further into 256 directories (16x16) results in 200-300 files
> > per directory which should avoid any performance problems long-term,
> > even assuming 300% growth of number of distfiles.
> >
> >
> > Specification
> > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> > Mirror layout file
> > ------------------
> > A mirror adhering to this specification should include a ``layout.conf`=
`
> > file in the top distfile directory.  This file uses the format
> > derived from the freedesktop Desktop Entry Specification file format
> > [#DESKTOP_FORMAT]_.
> >
> > Before using each Gentoo mirror, the package manager should attempt
> > to fetch (update) its ``layout.conf`` file and process it to determine
> > how to use the mirror.  If the file is not present, the package manager
> > should behave as if it were empty.
> >
> > The package manager should recognize the sections and keys listed below=
.
> > It should ignore any unrecognized sections or keys --- the format
> > is intended to account for future extensions.
> >
> > This specification currently defines one section: ``[structure]``.
> > This section defines one or more repository structure definitions
> > using non-negative sequential integer keys.  The definition with
> > the ``0`` key is the most preferred structure.  The package manager
> > should ignore any formats it does not recognize.  If this section
> > is not present, the package manager should behave as if only ``flat``
> > structure were specified.
> >
> > The following structure definitions are supported:
> >
> > * ``flat`` to indicate the traditional flat structure where all
> >   distfiles are located in the top directory,
> >
> > * ``filename-hash <algorithm> <cutoffs>`` to indicate the `filename
> >   hash structure`_ explained below.
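
To make the format concrete, here is a rough Python sketch of how a client
might read the ``[structure]`` section.  This is only my reading of the
text above, not Portage code: parse_layout() is a made-up name, I lean on
configparser because the format is INI-like, and the sample values
(SHA512, a single 8-bit cutoff) are illustrative rather than anything the
draft mandates.

```python
import configparser


def parse_layout(text):
    """Return structure definitions, most preferred first."""
    parser = configparser.RawConfigParser()
    parser.read_string(text)
    if not parser.has_section("structure"):
        # Missing section (or empty file): behave as if only flat.
        return [("flat",)]
    entries = []
    for key, value in parser.items("structure"):
        if not key.isdecimal():
            continue  # keys are expected to be non-negative integers
        fields = value.split()
        if fields == ["flat"]:
            entries.append((int(key), ("flat",)))
        elif len(fields) == 3 and fields[0] == "filename-hash":
            cutoffs = tuple(int(c) for c in fields[2].split(":"))
            layout = ("filename-hash", fields[1], cutoffs)
            entries.append((int(key), layout))
        # Any other definition is unrecognized and ignored.
    return [layout for _, layout in sorted(entries)]


# Hypothetical file contents, for illustration only.
SAMPLE = """
[structure]
0 = filename-hash SHA512 8
1 = flat
"""
print(parse_layout(SAMPLE))
```

Sorting by the integer key keeps the ``0`` entry first, and unknown
sections or definitions simply drop out, which is what the draft asks for.
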
> >
> >
> > Filename hash structure
> > -----------------------
> > When using the filename hash structure, the distfiles are split
> > into directories whose names are derived from the hash of distfile
> > filename.  This structure has two parameters: *algorithm name*
> > and *cutoffs* list.
> >
> > The algorithm name must correspond to a valid Manifest hash name.
> > An informational list of hashes is included in GLEP 74 [#GLEP74]_,
> > and the policies for introducing new hashes are covered by GLEP 59
> > [#GLEP59]_.
> >
> > The cutoffs list specifies one or more integers separated by colons
> > (``:``), indicating the number of bits (starting with the most
> > significant bit) of the hash used to form subsequent subdirectory names=
.
> > For example, the list of ``2:4`` would indicate that top-level director=
y
> > names are formed using 2 most significant bits of the hash (resulting
> > in 2=C2=B2 =3D 4 directories), and each of these directories would have
> > subdirectories formed using the next 4 bits of the hash (resulting
> > in 2=E2=81=B4 =3D 16 subdirectories each).
> >
> > The exact algorithm for determining the distfile location follows:
> >
> > 1. Let the distfile filename be **F**.
> >
> > 2. Compute the hash of **F** and store its binary value as **H**.
> >
> > 3. For each integer **C** in cutoff list:
> >
> >    a. Take **C** most significant bits of **H** and store them as **V**=
.
> >
> >    b. Convert **V** into hexadecimal integer, left padded with zeros
> >       to **C/4** digits (rounded up) and append it to the path, followe=
d
> >       by the path separator.
> >
> >    c. Shift **H** left **C** bits.
> >
> > 4. Finally, append **F** to the obtained path.
> >
> > In particular, note that when using nested directories
> > the subdirectories do not repeat the hash bits used in parent directory=
.
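
The steps above are short enough to sketch in Python.  Treat this as an
illustration under my own assumptions, not a reference implementation:
distfile_path() is a hypothetical helper, filenames are hashed as UTF-8
bytes, and Manifest hash names are mapped to hashlib names by lowercasing,
none of which the draft spells out.  Instead of literally shifting H, the
sketch tracks how many leading bits were already consumed, which selects
the same bits.

```python
import hashlib


def distfile_path(filename, algorithm, cutoffs):
    """Return the relative mirror path for a distfile name."""
    # Assumption: hash the UTF-8 bytes of the name; lowercase the
    # Manifest-style algorithm name so hashlib recognizes it.
    data = filename.encode("utf-8")
    digest = hashlib.new(algorithm.lower(), data).digest()
    total_bits = len(digest) * 8
    value = int.from_bytes(digest, "big")
    consumed = 0
    parts = []
    for bits in cutoffs:
        # Next `bits` most significant bits not used by a parent directory.
        shift = total_bits - consumed - bits
        chunk = (value >> shift) & ((1 << bits) - 1)
        width = (bits + 3) // 4  # hex digits, rounded up
        parts.append(format(chunk, "x").zfill(width))
        consumed += bits
    parts.append(filename)
    return "/".join(parts)


# Hypothetical distfile name, cutoffs as in the 2:4 example.
print(distfile_path("foo-1.0.tar.gz", "SHA512", [2, 4]))
```

With cutoffs 2:4 the first component ranges over 0..3 and the second over
0..f, that is 4 top-level directories with 16 subdirectories each.
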
> >
> >
> > Migrating mirrors to the hashed structure
> > -----------------------------------------
> > Since all distfile mirrors sync to the master Gentoo mirror, it should
> > be enough to perform all the needed changes on the master mirror
> > and wait for other mirrors to sync.  The following procedure
> > is recommended:
> >
> > 1. Include the initial ``layout.conf`` listing only ``flat`` layout.
> >
> > 2. Create the new structure alongside the flat structure. Wait for
> >    mirrors to sync.
> >
> > 3. Once all mirrors receive the new structure, update ``layout.conf``
> >    to list the ``filename-hash`` structure.
> >
> > 4. Once a version of Portage supporting the new structure is stable lon=
g
> >    enough, remove the fallback ``flat`` structure from ``layout.conf``
> >    and duplicate distfiles.
> >
> > This implies that during the migration period the distfiles will
> > be stored duplicated on the mirrors and therefore will occupy twice
> > as much space.  Technically, this could be avoided either by using
> > hard links or symbolic links.
> >
> > The hard link solution allows us to save space on the master mirror.
> > Additionally, if ``-H`` option is used by the mirrors it avoids
> > transferring existing files again.  However, this option is known
> > to be expensive and could cause significant server load.  Without it,
> > all mirrors need to transfer a second copy of all the existing files.
> >
> > The symbolic link solution could be more reliable if we could rely
> > on mirrors using the ``--links`` rsync option.  Without that, symbolic
> > links are not transferred at all.
> >
> >
> > Using hashed structure for local distfiles
> > ------------------------------------------
> > The hashed structure defined above could also be used for local distfil=
e
> > storage as used by the package manager.  For this to work, the package
> > manager authors need to ensure that:
> >
> > a. The ``${DISTDIR}`` variable in the ebuild scope points to a temporar=
y
> >    directory where distfiles specific to the package are linked
> >    in a flat structure.
> >
> > b. All tools are updated to support the nested structure.
> >
> > c. The package manager provides a tool for users to easily manipulate
> >    distfiles, in particular to add distfiles for fetch-restricted
> >    packages into an appropriate subdirectory.
> >
> > For extended compatibility, the package manager may support finding
> > distfiles in flat and nested structure simultaneously.
> >
> >
> > Rationale
> > =3D=3D=3D=3D=3D=3D=3D=3D=3D
> > Algorithm for splitting distfiles
> > ---------------------------------
> > The possible algorithms were considered with the following goals
> > in mind:
> >
> > - the number of files in a single directory should not exceed 1000,
> >
> > - the total size of files in a single directory is not considered
> >   relevant,
> >
> > - the solution should preferably be future-proof,
> >
> > - moving distfiles should be avoided once it is deployed.
> >
> > It should also be noted that at this moment the package with the most
> > distfiles in Gentoo is dev-texlive/texlive-latexextra,
> > with 8556 distfiles.  All of them start with a common
> > prefix of ``texlive-module-``.  This specific prefix is used by a total
> > of 23435 distfiles.
> >
> > In the original debate that occurred in bug #534528 [#BUG534528]_
> > and the mailing list review of the initial version of this GLEP [#ML1]_=
,
> > four fundamental ideas for splitting distfiles were listed:
> >
> > a. using initial portion of filename,
> >
> > b. using initial portion of file hash,
> >
> > c. using initial portion of filename hash,
> >
> > d. using package category (and package name).
> >
> > The initial filename idea was to use the first character of filename,
> > possibly followed by a longer part which was the idea historically
> > used e.g. by PyPI Python package hosting.  Its main advantage is
> > simplicity.  The users can easily determine the correct subdirectory
> > by just looking at the distfile name.  Sadly, this solution is not only
> > very uneven but also fails to solve the problem.  As mentioned above,
> > the Te=CE=A7 Live packages share a long common prefix that makes it
> > impossible to split them properly from other packages on fixed-length
> > prefixes.
> >
> > This idea has been followed by an adaptive proposal by Andrew Barchuk
> > [#ADAPTIVE_FILENAME]_.  In this proposal, the filenames are not strictl=
y
> > mapped to groups by a common prefix but instead each group contains
> > all files between two prefixes being used (like in a dictionary).
> > However, it has been pointed out that while this option can provide
> > very even results initially, it is impossible to predict how it would
> > be affected by future distfile changes and there will be a risk of
> > needing to change the groups in the future.  Furthermore, it is
> > relatively complex and requires explicitly listing or obtaining used
> > groups.
> >
> > Another option was to use an initial portion of distfile hashes.  Its
> > main advantage is that cryptographic hash algorithms can provide
> > a more balanced split with random data.  Furthermore, since hashes are
> > stored in Manifests using them has no cost for users.  However, this
> > solution has three disadvantages:
> >
> > 1. Not all files in the distfile tree are covered by package Manifests.
> >    Additional files are injected into the mirrors, and those will
> >    not have a clearly-defined location.
> >
> > 2. User-provided distfiles (e.g. for fetch-restricted packages) with
> >    hash mismatches would be placed in the wrong subdirectory,
> >    potentially causing confusing errors.
> >
> > 3. The hash values are unknown for newly-downloaded distfiles, so
> >    ``repoman`` (or an equivalent tool) would have to use a temporary
> >    directory before locating the file in appropriate subdirectory.
> >
> > Using filename hashes has proven to provide a similar balance to using
> > file hashes.  Furthermore, since filenames are known up front this
> > solution does not suffer from the listed problems.  While hashes need
> > to be computed manually, hashing a short string should not cause
> > any performance problems.
> >
> > Jason Zaman has suggested to use package categories (and package names)
> > [#PKGNAME]_.  However, this solution has multiple problems:
> >
> > a. it does not solve the problem for large packages such as Te=CE=A7 Li=
ve,
> >
> > b. it introduces many unnecessarily small directories,
> >
> > c. it requires an explicit knowledge of which package distfiles
> >    belong to,
> >
> > d. it does not provide an explicit solution to the problem of distfiles
> >    shared by multiple packages,
> >
> > e. it does not provide a solution to the problem of injected distfiles.
> >
> > All the options considered, the filename hash solution was selected
> > as one that solves all the aforementioned problems while introducing
> > relatively low complexity and being reasonably future-proof.
> >
> > .. figure:: glep-0075-extras/by-filename.png
> >
> >    Distribution of distfiles by first character of filenames
> >    (note: y axis is on log scale)
> >
> > .. figure:: glep-0075-extras/by-csum.png
> >
> >    Distribution of distfiles by first hex-digit of checksum
> >    (x --- content checksum, + --- filename checksum)
> >
> > .. figure:: glep-0075-extras/by-csum2.png
> >
> >    Distribution of distfiles by two first hex-digits of checksum
> >    (x --- content checksum, + --- filename checksum)
> >
> >
> > Layout file
> > -----------
> > The presence of a control file has been suggested in the original
> > discussion.  Its main purpose is to let package managers cleanly handle
> > the migration and detect how to correctly query the mirrors throughout
> > it.  Furthermore, it makes future changes easier.
> >
> > The format lines are specifically meant to hardcode as little about
> > the actual algorithm as possible.  Therefore, we can easily change
> > the hash used or the exact split structure without having to update
> > the package managers or even provide a compatibility layout.
> >
> > The file is also open for future extensions to provide additional mirro=
r
> > metadata.  However, no clear use for that has been determined so far.
> >
> >
> > Hash algorithm
> > --------------
> > The hash algorithm support is fully deferred to the existing code
> > in the package managers that is required to handle Manifests.
> > In particular, it is recommended to reuse one of the hashes that are
> > used in Manifest entries at the time.  This avoids code duplication
> > and reuses an existing mechanism to handle hash upgrades.
> >
> > During the discussion, it has been pointed out that this particular use
> > case does not require a cryptographically strong hash and a faster
> > algorithm could be used instead.  However, given the short length
> > of hashed strings performance is not a problem, and speed does not
> > justify the resulting code duplication.
> >
> > It has also been pointed out that e.g. the BLAKE2 hash family provides
> > the ability of creating arbitrary length hashes instead of truncating
> > the standard-length hash.  However, not all implementations of BLAKE2
> > support that and relying on it could reduce portability for no apparent
> > gain.
> >
> >
> > Backwards Compatibility
> > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> > Mirror compatibility
> > --------------------
> > The mirrored files are propagated to other mirrors as opaque directory
> > structure.  Therefore, there are no backwards compatibility concerns
> > on the mirroring side.
> >
> > Backwards compatibility with existing clients is detailed
> > in `migrating mirrors to the hashed structure`_ section.  Backwards
> > compatibility with the old clients will be provided by preserving
> > the flat structure during the transitional period.
> >
> > The new clients will fetch the ``layout.conf`` file to avoid backwards
> > compatibility concerns in the future.  In case of hitting an old mirror=
,
> > the package manager will default to the ``flat`` structure.
> >
> >
> > Package manager storage compatibility
> > -------------------------------------
> > The exact means of preserving backwards compatibility in package manage=
r
> > storage are left to the package manager authors.  However, it is
> > recommended that package managers continue to support the flat layout
> > even if it is no longer the default.  The package manager may either
> > continue to read files from this location or automatically move them
> > to an appropriate subdirectory.
> >
> >
> > Reference Implementation
> > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D
> > TODO.
> >
> >
> > References
> > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
> > .. [#DESKTOP_FORMAT] Desktop Entry Specification: Basic format of the=
 file
> >    (https://standards.freedesktop.org/desktop-entry-spec/latest/=
ar01s03.html)
> >
> > .. [#GLEP74] GLEP 74: Full-tree verification using Manifest files:
> >    Checksum algorithms (informational)
> >    (https://www.gentoo.org/glep/glep-0074.html#checksum-algorithms-=
informational)
> >
> > .. [#GLEP59] GLEP 59: Manifest2 hash policies and security implications
> >    (https://www.gentoo.org/glep/glep-0059.html)
> >
> > .. [#BUG534528] Bug 534528 - distfiles should be sorted into=
 subdirectories
> >    of DISTDIR
> >    (https://bugs.gentoo.org/534528)
> >
> > .. [#ML1] [gentoo-dev] [pre-GLEP] Split distfile mirror directory=
 structure
> >    (https://archives.gentoo.org/gentoo-dev/message/=
cfc4f8595df2edf9a25ba9ecae2463ba)
> >
> > .. [#ADAPTIVE_FILENAME] Andrew Barchuk's reply on 'using character rang=
es
> >    for each directory computed in a way to have the files distributed=
 evenly'
> >    (https://archives.gentoo.org/gentoo-dev/message/=
611bdaa76be049c1d650e8995748e7b8)
> >
> > .. [#PKGNAME] Jason Zaman's reply including 'using the same dir layout
> >    as the packages themselves'
> >    (https://archives.gentoo.org/gentoo-dev/message/=
f26ed870c3a6d4ecf69a821723642975)
> >
> >
> > Copyright
> > =3D=3D=3D=3D=3D=3D=3D=3D=3D
> > This work is licensed under the Creative Commons Attribution-ShareAlike=
 3.0
> > Unported License. To view a copy of this license, visit
> > http://creativecommons.org/licenses/by-sa/3.0/.
> >
> > --
> > Best regards,
> > Micha=C5=82 G=C3=B3rny
> >
>
> It's going to be hash based? Why? I tried to follow the conversation but
> there's now close to 5 of these posts in the mailing list with different
> conversations in each.
>

The reasoning is embedded in the proposal; search for "Algorithm for
splitting distfiles". Read that section. I think it summarizes the space
well.

-A


>
> Using filename prefixes is boring and not uniform, but I feel I should
> point out that most distfile hosts are still doing fine. Microoptimizing
> this seems like wasted effort.
>
> Cheers,
>     R0b0t1

--001a1143fa4014fcaf0563f08962
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

<div dir=3D"ltr"><div class=3D"gmail_extra"><div class=3D"gmail_quote">On M=
on, Jan 29, 2018 at 3:26 PM, R0b0t1 <span dir=3D"ltr">&lt;<a href=3D"mailto=
:r030t1@gmail.com" target=3D"_blank">r030t1@gmail.com</a>&gt;</span> wrote:=
<br><blockquote class=3D"gmail_quote" style=3D"margin:0px 0px 0px 0.8ex;bor=
der-left:1px solid rgb(204,204,204);padding-left:1ex"><div class=3D"gmail-H=
OEnZb"><div class=3D"gmail-h5">On Monday, January 29, 2018, Micha=C5=82 G=
=C3=B3rny &lt;<a href=3D"mailto:mgorny@gentoo.org" target=3D"_blank">mgorny=
@gentoo.org</a>&gt; wrote:<br>&gt; Here&#39;s an updated version. I&#39;ve =
tried to incorporate most<br>&gt; of the feedback so far.<br>&gt;<br>&gt;<b=
r>&gt; ---<br>&gt; GLEP: 75<br>&gt; Title: Split distfile mirror directory =
structure<br>&gt; Author: Micha=C5=82 G=C3=B3rny &lt;<a href=3D"mailto:mgor=
ny@gentoo.org" target=3D"_blank">mgorny@gentoo.org</a>&gt;,<br>&gt; =C2=A0 =
=C2=A0 =C2=A0 =C2=A0 Robin H. Johnson &lt;<a href=3D"mailto:robbat2@gentoo.=
org" target=3D"_blank">robbat2@gentoo.org</a>&gt;<br>&gt; Type: Standards T=
rack<br>&gt; Status: Draft<br>&gt; Version: 1<br>&gt; Created: 2018-01-26<b=
r>&gt; Last-Modified: 2018-01-27<br>&gt; Post-History: 2018-01-27<br>&gt; C=
ontent-Type: text/x-rst<br>&gt; ---<br>&gt;<br>&gt; Abstract<br>&gt; =3D=3D=
=3D=3D=3D=3D=3D=3D<br>&gt; This GLEP describes the procedure for splitting =
the distfiles on mirrors<br>&gt; into multiple directories with the goal of=
 reducing the number of files<br>&gt; in a single directory.<br>&gt;<br>&gt=
;<br>&gt; Motivation<br>&gt; =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D<br>&gt; At the =
moment, both the package manager and Gentoo mirrors use flat<br>&gt; direct=
ory structure to store files.=C2=A0 While this solution usually works,<br>&=
gt; it does not scale well.=C2=A0 Directories with large number of files us=
ually<br>&gt; have significant performance penalty, unless using filesystem=
s<br>&gt; specifically designed for that purpose.<br>&gt;<br>&gt; According=
 to the Gentoo repository state at 2018-01-26 16:23, there<br>&gt; was a to=
tal of 62652 unique distfiles in the repository.=C2=A0 While<br>&gt; the us=
ers realistically hit around 10% of that, distfile mirrors often<br>&gt; ho=
ld even more files --- more so if old distfiles are not wiped<br>&gt; immed=
iately.<br>&gt;<br>&gt; While all filesystems used on Linux boxes should be=
 able to cope with<br>&gt; a number that large, they may suffer a performan=
ce penalty with even<br>&gt; a few thousand files.=C2=A0 Additionally, if m=
irrors enable directory indexes<br>&gt; then generating the index imposes b=
oth a significant server overhead<br>&gt; and a significant data transfer.=
=C2=A0 At this moment, the index<br>&gt; of <a href=3D"http://distfiles.gen=
too.org" target=3D"_blank">distfiles.gentoo.org</a> has around 17 MiB.<br>&=
gt;<br>&gt; Splitting the distfiles into multiple directories makes it poss=
ible<br>&gt; to avoid those problems by reducing the number of files in a s=
ingle<br>&gt; directory.=C2=A0 For example, splitting the forementioned set=
 of distfiles<br>&gt; into 16 directories that are roughly balanced allows =
to reduce<br>&gt; the number of files in a single directory to around 4000.=
=C2=A0 Splitting<br>&gt; them further into 256 directories (16x16) results =
in 200-300 files<br>&gt; per directory which should avoid any performance p=
roblems long-term,<br>&gt; even assuming 300% growth of number of distfiles=
.<br>&gt;<br>&gt;<br>&gt; Specification<br>&gt; =3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D<br>&gt; Mirror layout file<br>&gt; ------------------<br>&gt; =
A mirror adhering to this specification should include a ``layout.conf``<br=
>&gt; file in the top distfile directory.=C2=A0 This file uses the format<b=
r>&gt; derived from the freedesktop Desktop Entry Specification file format=
<br>&gt; [#DESKTOP_FORMAT]_.<br>&gt;<br>&gt; Before using each Gentoo mirro=
r, the package manager should attempt<br>&gt; to fetch (update) its ``layou=
t.conf`` file and process it to determine<br>&gt; how to use the mirror.=C2=
=A0 If the file is not present, the package manager<br>&gt; should behave a=
s if it were empty.<br>&gt;<br>&gt; The package manager should recognize th=
e sections and keys listed below.<br>&gt; It should ignore any unrecognized=
 sections or keys --- the format<br>&gt; is intended to account for future =
extensions.<br>&gt;<br>&gt; This specification currently defines one sectio=
n: ``[structure]``.<br>&gt; This section defines one or more repository str=
ucture definitions<br>&gt; using non-negative sequential integer keys.=C2=
=A0 The definition with<br>&gt; the ``0`` key is the most preferred structu=
re.=C2=A0 The package manager<br>&gt; should ignore any formats it does not=
 recognize.=C2=A0 If this section<br>&gt; is not present, the package manag=
er should behave as if only ``flat``<br>&gt; structure were specified.<br>&=
gt;<br>&gt; The following structure definitions are supported:<br>&gt;<br>&=
gt; * ``flat`` to indicate the traditional flat structure where all<br>&gt;=
 =C2=A0 distfiles are located in the top directory,<br>&gt;<br>&gt; * ``fil=
ename-hash &lt;algorithm&gt; &lt;cutoffs&gt;`` to indicate the `filename<br=
>&gt; =C2=A0 hash structure`_ explained below.<br>&gt;<br>&gt;<br>&gt; File=
name hash structure<br>&gt; -----------------------<br>&gt; When using the =
filename hash structure, the distfiles are split<br>&gt; into directories w=
hose names are derived from the hash of distfile<br>&gt; filename.=C2=A0 Th=
is structure has two parameters: *algorithm name*<br>&gt; and *cutoffs* lis=
t.<br>&gt;<br>&gt; The algorithm name must correspond to a valid Manifest h=
ash name.<br>&gt; An informational list of hashes is included in GLEP 74 [#=
GLEP74]_,<br>&gt; and the policies for introducing new hashes are covered b=
y GLEP 59<br>&gt; [#GLEP59]_.<br>&gt;<br>&gt; The cutoffs list specifies on=
e or more integers separated by colons<br>&gt; (``:``), indicating the numb=
er of bits (starting with the most<br>&gt; significant bit) of the hash use=
d to form subsequent subdirectory names.<br>&gt; For example, the list of `=
`2:4`` would indicate that top-level directory<br>&gt; names are formed usi=
ng 2 most significant bits of the hash (resulting<br>&gt; in 2=C2=B2 =3D 4 =
directories), and each of these directories would have<br>&gt;=
 subdirectories formed using the next 4 bits of the hash (resulting<br>&gt;=
 in 2=E2=81=B4 =3D 16 subdirectories each).<br>&gt;<br>&gt; The exact=
 algorithm for determining the distfile location follows:<br>&gt;<br>&gt;=
 1. Let the distfile fi=
lename be **F**.<br>&gt;<br>&gt; 2. Compute the hash of **F** and store its=
 binary value as **H**.<br>&gt;<br>&gt; 3. For each integer **C** in cutoff=
 list:<br>&gt;<br>&gt; =C2=A0 =C2=A0a. Take **C** most significant bits of =
**H** and store them as **V**.<br>&gt;<br>&gt; =C2=A0 =C2=A0b. Convert **V*=
* into hexadecimal integer, left padded with zeros<br>&gt; =C2=A0 =C2=A0 =
=C2=A0 to **C/4** digits (rounded up) and append it to the path, followed<b=
r>&gt; =C2=A0 =C2=A0 =C2=A0 by the path separator.<br>&gt;<br>&gt; =C2=A0 =
=C2=A0c. Shift **H** left **C** bits.<br>&gt;<br>&gt; 4. Finally, append **=
F** to the obtained path.<br>&gt;<br>&gt; In particular, note that when usi=
ng nested directories<br>&gt; the subdirectories do not repeat the hash bit=
s used in parent directory.<br>&gt;<br>&gt;<br>&gt; Migrating mirrors to th=
e hashed structure<br>&gt; ------------------------------<wbr>-----------<b=
r>&gt; Since all distfile mirrors sync to the master Gentoo mirror, it shou=
ld<br>&gt; be enough to perform all the needed changes on the master mirror=
<br>&gt; and wait for other mirrors to sync.=C2=A0 The following procedure<=
br>&gt; is recommended:<br>&gt;<br>&gt; 1. Include the initial ``layout.con=
f`` listing only ``flat`` layout.<br>&gt;<br>&gt; 2. Create the new structu=
re alongside the flat structure. Wait for<br>&gt; =C2=A0 =C2=A0mirrors to s=
ync.<br>&gt;<br>&gt; 3. Once all mirrors receive the new structure, update =
``layout.conf``<br>&gt; =C2=A0 =C2=A0to list the ``filename-hash`` structur=
e.<br>&gt;<br>&gt; 4. Once a version of Portage supporting the new structur=
e is stable long<br>&gt; =C2=A0 =C2=A0enough, remove the fallback ``flat`` =
structure from ``layout.conf``<br>&gt; =C2=A0 =C2=A0and duplicate distfiles=
.<br>&gt;<br>&gt; This implies that during the migration period the distfil=
es will<br>&gt; be stored duplicated on the mirrors and therefore will occu=
py twice<br>&gt; as much space.=C2=A0 Technically, this could be avoided ei=
ther by using<br>&gt; hard links or symbolic links.<br>&gt;<br>&gt; The har=
d link solution allows us to save space on the master mirror.<br>&gt; Addit=
ionally, if ``-H`` option is used by the mirrors it avoids<br>&gt; transfer=
ring existing files again.=C2=A0 However, this option is known<br>&gt; to b=
e expensive and could cause significant server load.=C2=A0 Without it,<br>&=
gt; all mirrors need to transfer a second copy of all the existing files.<b=
r>&gt;<br>&gt; The symbolic link solution could be more reliable if we coul=
d rely<br>&gt; on mirrors using the ``--links`` rsync option.=C2=A0 Without=
 that, symbolic<br>&gt; links are not transferred at all.<br>&gt;<br>&gt;<b=
r>&gt; Using hashed structure for local distfiles<br>&gt; -----------------=
-------------<wbr>------------<br>&gt; The hashed structure defined above c=
ould also be used for local distfile<br>&gt; storage as used by the package=
 manager.=C2=A0 For this to work, the package<br>&gt; manager authors need =
to ensure that:<br>&gt;<br>&gt; a. The ``${DISTDIR}`` variable in the ebuil=
d scope points to a temporary<br>&gt; =C2=A0 =C2=A0directory where distfile=
s specific to the package are linked<br>&gt; =C2=A0 =C2=A0in a flat structu=
re.<br>&gt;<br>&gt; b. All tools are updated to support the nested structur=
e.<br>&gt;<br>&gt; c. The package manager provides a tool for users to easi=
ly manipulate<br>&gt; =C2=A0 =C2=A0distfiles, in particular to add distfile=
s for fetch-restricted<br>&gt; =C2=A0 =C2=A0packages into an appropriate su=
bdirectory.<br>&gt;<br>&gt; For extended compatibility, the package manager=
 may support finding<br>&gt; distfiles in flat and nested structure simulta=
neously.<br>&gt;<br>&gt;<br>&gt; Rationale<br>&gt; =3D=3D=3D=3D=3D=3D=3D=3D=
=3D<br>&gt; Algorithm for splitting distfiles<br>&gt; ---------------------=
---------<wbr>---<br>&gt; The possible algorithms were considered with the =
following goals<br>&gt; in mind:<br>&gt;<br>&gt; - the number of files in a=
 single directory should not exceed 1000,<br>&gt;<br>&gt; - the total size =
of files in a single directory is not considered<br>&gt; =C2=A0 relevant,<b=
r>&gt;<br>&gt; - the solution should preferably be future-proof,<br>&gt;<br=
>&gt; - moving distfiles should be avoided once it is deployed.<br>&gt;<br>=
&gt; It should also be noted that at this moment the package with the most<=
br>&gt; distfiles in Gentoo is dev-texlive/texlive-<wbr>latexextr=
a,<br>&gt; with the number of 8556 distfiles.=C2=A0 All of them start with =
a common<br>&gt; prefix of ``texlive-module-``.=C2=A0 This specific prefix =
is used by a total<br>&gt; of 23435 distfiles.<br>&gt;<br>&gt; In the origi=
nal debate that occurred in bug #534528 [#BUG534528]_<br>&gt; and the maili=
ng list review of the initial version of this GLEP [#ML1]_,<br>&gt; four fu=
ndamental ideas for splitting distfiles were listed:<br>&gt;<br>&gt; a. usi=
ng initial portion of filename,<br>&gt;<br>&gt; b. using initial portion of=
 file hash,<br>&gt;<br>&gt; c. using initial portion of filename hash,<br>&=
gt;<br>&gt; d. using package category (and package name).<br>&gt;<br>&gt; T=
he initial filename idea was to use the first character of filename,<br>&gt=
; possibly followed by a longer part which was the idea historically<br>&gt=
; used e.g. by PyPI Python package hosting.=C2=A0 Its main advantage is<br>=
&gt; simplicity.=C2=A0 The users can easily determine the correct subdirect=
ory<br>&gt; by just looking at the distfile name.=C2=A0 Sadly, this solutio=
n is not only<br>&gt; very uneven but does not solve the problem.=C2=A0 As =
mentioned above,<br>&gt; the Te=CE=A7 Live packages share a long common pre=
fix that makes it impossible<br>&gt; to split it properly with other =
package=
s on fixed-length prefixes.<br>&gt;<br>&gt; This idea has been followed by =
an adaptive proposal by Andrew Barchuk<br>&gt; [#ADAPTIVE_FILENAME]_.=C2=A0=
 In this proposal, the filenames are not strictly<br>&gt; mapped to groups =
by a common prefix but instead each group contains<br>&gt; all files betwee=
n two prefixes being used (like in a dictionary).<br>&gt; However, it has b=
een pointed out that while this option can provide<br>&gt; very even result=
s initially, it is impossible to predict how it would<br>&gt; be affected b=
y future distfile changes and there will be a risk of<br>&gt; needing to ch=
ange the groups in the future.=C2=A0 Furthermore, it is<br>&gt; relatively =
complex and requires explicitly listing or obtaining used<br>&gt; groups.<b=
r>&gt;<br>&gt; Another option was to use an initial portion of distfile has=
hes.=C2=A0 Its<br>&gt; main advantage is that cryptographic hash algorithms=
 can provide<br>&gt; a more balanced split with random data.=C2=A0 Furtherm=
ore, since hashes are<br>&gt; stored in Manifests using them has no cost fo=
r users.=C2=A0 However, this<br>&gt; solution has three disadvantages:<br>&=
gt;<br>&gt; 1. Not all files in the distfile tree are covered by package Ma=
nifests.<br>&gt; =C2=A0 =C2=A0Additional files are injected into the mirror=
s, and those will<br>&gt; =C2=A0 =C2=A0not have a clearly-defined location.=
<br>&gt;<br>&gt; 2. User-provided distfiles (e.g. for fetch-restricted pack=
ages) with<br>&gt; =C2=A0 =C2=A0hash mismatches would be placed in the wron=
g subdirectory,<br>&gt; =C2=A0 =C2=A0potentially causing confusing errors.<=
br>&gt;<br>&gt; 3. The hash values are unknown for newly-downloaded distfil=
es, so<br>&gt; =C2=A0 =C2=A0``repoman`` (or an equivalent tool) would have =
to use a temporary<br>&gt; =C2=A0 =C2=A0directory before locating the file =
in appropriate subdirectory.<br>&gt;<br>&gt; Using filename hashes has prov=
en to provide a similar balance to using<br>&gt; file hashes.=C2=A0 Further=
more, since filenames are known up front this<br>&gt; solution does not suf=
fer from the listed problems.=C2=A0 While hashes need<br>&gt; to be compute=
d manually, hashing a short string should not cause<br>&gt; any =
performance p=
roblems.<br>&gt;<br>&gt; Jason Zaman has suggested using package categorie=
s (and package names)<br>&gt; [#PKGNAME]_.=C2=A0 However, this solution has=
 multiple problems:<br>&gt;<br>&gt; a. it does not solve the problem for la=
rge packages such as Te=CE=A7 Live,<br>&gt;<br>&gt; b. it introduces many u=
nnecessarily small directories,<br>&gt;<br>&gt; c. it requires an explicit =
knowledge of which package distfiles<br>&gt; =C2=A0 =C2=A0belong to,<br>&gt=
;<br>&gt; d. it does not provide an explicit solution to the problem of dis=
tfiles<br>&gt; =C2=A0 =C2=A0shared by multiple packages,<br>&gt;<br>&gt; e.=
 it does not provide a solution to the problem of injected distfiles.<br>&g=
t;<br>&gt; All the options considered, the filename hash solution was selec=
ted<br>&gt; as one that solves all the aforementioned problems while intro=
ducing<br>&gt; relatively low complexity and being reasonably future-proof.=
<b=
r>&gt;<br>&gt; .. figure:: glep-0075-extras/by-filename.<wbr>png<br>&gt;<br=
>&gt; =C2=A0 =C2=A0Distribution of distfiles by first character of filename=
s<br>&gt; =C2=A0 =C2=A0(note: y axis is on log scale)<br>&gt;<br>&gt; .. fi=
gure:: glep-0075-extras/by-csum.png<br>&gt;<br>&gt; =C2=A0 =C2=A0Distributi=
on of distfiles by first hex-digit of checksum<br>&gt; =C2=A0 =C2=A0(x --- =
content checksum, + --- filename checksum)<br>&gt;<br>&gt; .. figure:: glep=
-0075-extras/by-csum2.png<br>&gt;<br>&gt; =C2=A0 =C2=A0Distribution of dist=
files by two first hex-digits of checksum<br>&gt; =C2=A0 =C2=A0(x --- conte=
nt checksum, + --- filename checksum)<br>&gt;<br>&gt;<br>&gt; Layout file<b=
r>&gt; -----------<br>&gt; The presence of a control file has been =
suggested =
in the original<br>&gt; discussion.=C2=A0 Its main purpose is to let packag=
e managers cleanly handle<br>&gt; the migration and detect how to correctly=
 query the mirrors throughout<br>&gt; it.=C2=A0 Furthermore, it makes futur=
e changes easier.<br>&gt;<br>&gt; The format lines are specifically meant =
to har=
dcode as little about<br>&gt; the actual algorithm as possible.=C2=A0 There=
fore, we can easily change<br>&gt; the hash used or the exact split structu=
re without having to update<br>&gt; the package managers or even provide a =
compatibility layout.<br>&gt;<br>&gt; The file is also open for future exte=
nsions to provide additional mirror<br>&gt; metadata.=C2=A0 However, no cle=
ar use for that has been determined so far.<br>&gt;<br>&gt;<br>&gt; Hash al=
gorithm<br>&gt; --------------<br>&gt; The hash algorithm support is fully =
deferred to the existing code<br>&gt; in the package managers that is requi=
red to handle Manifests.<br>&gt; In particular, it is recommended to reuse =
one of the hashes that are<br>&gt; used in Manifest entries at the time.=C2=
=A0 This avoids code duplication<br>&gt; and reuses an existing mechanism t=
o handle hash upgrades.<br>&gt;<br>&gt; During the discussion, it has been =
pointed out that this particular use case<br>&gt; does not require a =
cryptograp=
hically strong hash and a faster algorithm<br>&gt; could be used instead.=
=C2=A0 However, given the short length of hashed<br>&gt; strings performanc=
e is not a problem, and speed does not justify<br>&gt; the resulting code d=
uplication.<br>&gt;<br>&gt; It has also been pointed out that e.g. the BLAK=
E2 hash family provides<br>&gt; the ability to create arbitrary-length ha=
shes instead of truncating<br>&gt; the standard-length hash.=C2=A0 However,=
 not all implementations of BLAKE2<br>&gt; support that and relying on it c=
ould reduce portability for no apparent<br>&gt; gain.<br>&gt;<br>&gt;<br>&g=
t; Backwards Compatibility<br>&gt; =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D<br>&gt; Mirror compatibility<br>&gt; -------=
-------------<br>&gt; The mirrored files are propagated to other mirrors as=
 opaque directory<br>&gt; structure.=C2=A0 Therefore, there are no backward=
s compatibility concerns<br>&gt; on the mirroring side.<br>&gt;<br>&gt; Bac=
kwards compatibility with existing clients is detailed<br>&gt; in `migratin=
g mirrors to the hashed structure`_ section.=C2=A0 Backwards<br>&gt; compat=
ibility with the old clients will be provided by preserving<br>&gt; the fla=
t structure during the transitional period.<br>&gt;<br>&gt; The new clients=
 will fetch the ``layout.conf`` file to avoid backwards<br>&gt; compatibili=
ty concerns in the future.=C2=A0 In case of hitting an old mirror,<br>&gt; =
the package manager will default to the ``flat`` structure.<br>&gt;<br>&gt;=
<br>&gt; Package manager storage compatibility<br>&gt; --------------------=
----------<wbr>-------<br>&gt; The exact means of preserving backwards comp=
atibility in package manager<br>&gt; storage are left to the package manage=
r authors.=C2=A0 However, it is<br>&gt; recommended that package managers c=
ontinue to support the flat layout<br>&gt; even if it is no longer the defa=
ult.=C2=A0 The package manager may either<br>&gt; continue to read files fr=
om this location or automatically move them<br>&gt; to an appropriate subdi=
rectory.<br>&gt;<br>&gt;<br>&gt; Reference Implementation<br>&gt; =3D=3D=3D=
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D<br>&gt; TOD=
O.<br>&gt;<br>&gt;<br>&gt; References<br>&gt; =3D=3D=3D=3D=3D=3D=3D=3D=3D=
=3D<br>&gt; .. [#DESKTOP_FORMAT] Desktop Entry Specification: Basic format =
of the file<br>&gt; =C2=A0 =C2=A0(<a href=3D"https://standards.freedesktop.=
org/desktop-entry-spec/latest/ar01s03.html" target=3D"_blank">https://stand=
ards.<wbr>freedesktop.org/desktop-entry-<wbr>spec/latest/ar01s03.html</a>)<=
br>&gt;<br>&gt; .. [#GLEP74] GLEP 74: Full-tree verification using Manifest=
 files:<br>&gt; =C2=A0 =C2=A0Checksum algorithms (informational)<br>&gt; =
=C2=A0 =C2=A0(<a href=3D"https://www.gentoo.org/glep/glep-0074.html#checksu=
m-algorithms-informational" target=3D"_blank">https://www.gentoo.org/glep/<=
wbr>glep-0074.html#checksum-<wbr>algorithms-informational</a>)<br>&gt;<br>&=
gt; .. [#GLEP59] GLEP 59: Manifest2 hash policies and security implications=
<br>&gt; =C2=A0 =C2=A0(<a href=3D"https://www.gentoo.org/glep/glep-0059.htm=
l" target=3D"_blank">https://www.gentoo.org/glep/<wbr>glep-0059.html</a>)<b=
r>&gt;<br>&gt; .. [#BUG534528] Bug 534528 - distfiles should be sorted into=
 subdirectories<br>&gt; =C2=A0 =C2=A0of DISTDIR<br>&gt; =C2=A0 =C2=A0(<a hr=
ef=3D"https://bugs.gentoo.org/534528" target=3D"_blank">https://bugs.gentoo=
.org/<wbr>534528</a>)<br>&gt;<br>&gt; .. [#ML1] [gentoo-dev] [pre-GLEP] Spl=
it distfile mirror directory structure<br>&gt; =C2=A0 =C2=A0(<a href=3D"htt=
ps://archives.gentoo.org/gentoo-dev/message/cfc4f8595df2edf9a25ba9ecae2463b=
a" target=3D"_blank">https://archives.gentoo.org/<wbr>gentoo-dev/message/<w=
br>cfc4f8595df2edf9a25ba9ecae2463<wbr>ba</a>)<br>&gt;<br>&gt; .. [#ADAPTIVE=
_FILENAME] Andrew Barchuk&#39;s reply on &#39;using character ranges<br>&gt=
; =C2=A0 =C2=A0for each directory computed in a way to have the files distr=
ibuted evenly&#39;<br>&gt; =C2=A0 =C2=A0(<a href=3D"https://archives.gentoo=
.org/gentoo-dev/message/611bdaa76be049c1d650e8995748e7b8" target=3D"_blank"=
>https://archives.gentoo.org/<wbr>gentoo-dev/message/<wbr>611bdaa76be049c1d=
650e8995748e7<wbr>b8</a>)<br>&gt;<br>&gt; .. [#PKGNAME] Jason Zaman&#39;s r=
eply including &#39;using the same dir layout<br>&gt; =C2=A0 =C2=A0as the p=
ackages themselves&#39;<br>&gt; =C2=A0 =C2=A0(<a href=3D"https://archives.=
gento=
o.org/gentoo-dev/message/f26ed870c3a6d4ecf69a821723642975" target=3D"_blank=
">https://archives.gentoo.org/<wbr>gentoo-dev/message/<wbr>f26ed870c3a6d4ec=
f69a8217236429<wbr>75</a>)<br>&gt;<br>&gt;<br>&gt; Copyright<br>&gt; =3D=3D=
=3D=3D=3D=3D=3D=3D=3D<br>&gt; This work is licensed under the Creative Comm=
ons Attribution-ShareAlike 3.0<br>&gt; Unported License. To view a copy of =
this license, visit<br>&gt; <a href=3D"http://creativecommons.org/licenses/=
by-sa/3.0/" target=3D"_blank">http://creativecommons.org/<wbr>licenses/by-s=
a/3.0/</a>.<br>&gt;<br>&gt; --<br>&gt; Best regards,<br>&gt; Micha=C5=82 G=
=C3=B3rny<br>&gt;<br><br></div></div>It&#39;s going to be hash based? Why? =
I tried to follow the conversation but there&#39;s now close to 5 of these =
posts in the mailing list with different conversations in each.<br></blockq=
uote><div><br></div>The reasoning is embedded in the proposal; search for &=
quot;Algorithm for splitting distfiles&quot;. Read that section. I think it=
 summarizes the space well.</div><div class=3D"gmail_quote"><div><br></div>=
<div>-A</div><div>=C2=A0</div><blockquote class=3D"gmail_quote" style=3D"ma=
rgin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:=
1ex"><br>Using filename prefixes is boring and not uniform, but I feel I sh=
ould point out that most distfile hosts are still doing fine. Microoptimizi=
ng this seems like wasted effort.<br><br>Cheers,<br> =C2=A0 =C2=A0 R0b0t1
</blockquote></div><br></div></div>

--001a1143fa4014fcaf0563f08962--