* [gentoo-commits] data/glep:glep-mirrors commit in: /
From: Michał Górny @ 2018-02-07 13:22 UTC
To: gentoo-commits
commit: 5ee8a0c8e93784d04f983168ceefe102bc6d759f
Author: Michał Górny <mgorny <AT> gentoo <DOT> org>
AuthorDate: Sun Jan 28 10:38:24 2018 +0000
Commit: Michał Górny <mgorny <AT> gentoo <DOT> org>
CommitDate: Wed Feb 7 13:22:08 2018 +0000
URL: https://gitweb.gentoo.org/data/glep.git/commit/?id=5ee8a0c8
glep-0075: Clarify structure key description
Clarify structure key description. Suggested by Ulrich Müller.
glep-0075.rst | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/glep-0075.rst b/glep-0075.rst
index c098987..157514e 100644
--- a/glep-0075.rst
+++ b/glep-0075.rst
@@ -70,11 +70,11 @@ is intended to account for future extensions.
This specification currently defines one section: ``[structure]``.
This section defines one or more repository structure definitions
-using sequential integer keys. The definition keyed as ``0``
-is the most preferred structure. The package manager should use
-the first structure format it recognizes as supported, and ignore any
-it does not recognize. If this section is not present, the package
-manager should behave as if only ``flat`` structure were supported.
+using non-negative sequential integer keys. The definition with
+the ``0`` key is the most preferred structure. The package manager
+should ignore any formats it does not recognize. If this section
+is not present, the package manager should behave as if only ``flat``
+structure were specified.
The following structure definitions are supported:
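The selection rule described in the revised paragraph can be illustrated with a minimal sketch. It is not taken from the GLEP or from any package manager. The names pick_structure and SUPPORTED_FORMATS, the placeholder format name hypothetical-format, and the choice to treat the first word of a value as the format name are all assumptions made for illustration. Picking the lowest-numbered recognized entry is one reading of "the 0 key is the most preferred".

```python
import configparser

# Assumption: the set of structure formats this hypothetical client understands.
SUPPORTED_FORMATS = {"flat"}


def pick_structure(layout_text):
    """Return the most preferred structure definition we recognize.

    Returns "flat" when the [structure] section is absent, and None when
    the section exists but contains no recognized format.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(layout_text)

    # A missing section means: behave as if only "flat" were specified.
    if not parser.has_section("structure"):
        return "flat"

    # Keep only non-negative integer keys and order them by preference;
    # any other key is silently ignored.
    numbered = sorted(
        (int(key), value)
        for key, value in parser.items("structure")
        if key.isascii() and key.isdigit()
    )

    for _, definition in numbered:
        words = definition.split()
        # Assumption: the first word of the value names the format.
        if words and words[0] in SUPPORTED_FORMATS:
            return definition
    return None


example = "[structure]\n0=hypothetical-format 8\n1=flat\n"
print(pick_structure(example))  # the unrecognized entry 0 is skipped
print(pick_structure(""))       # an empty layout file falls back to flat
```

Returning None for "section present, nothing recognized" is a design choice of this sketch. The quoted text does not say what a package manager should do in that case.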
* [gentoo-commits] data/glep:glep-mirrors commit in: /
From: Michał Górny @ 2018-02-07 13:22 UTC
To: gentoo-commits
commit: e4dc2627c8107339b13e20709125e2d9fc91ffde
Author: Michał Górny <mgorny <AT> gentoo <DOT> org>
AuthorDate: Wed Feb 7 13:20:45 2018 +0000
Commit: Michał Górny <mgorny <AT> gentoo <DOT> org>
CommitDate: Wed Feb 7 13:22:08 2018 +0000
URL: https://gitweb.gentoo.org/data/glep.git/commit/?id=e4dc2627
glep-0075: Extend rationale for splitting algorithm
Extend and refactor the rationale for splitting algorithm. Explicitly
state the goals, list all the options that occurred during the ml
discussion.
glep-0075.rst | 116 +++++++++++++++++++++++++++++++++++++++++++++-------------
1 file changed, 91 insertions(+), 25 deletions(-)
diff --git a/glep-0075.rst b/glep-0075.rst
index 157514e..00d14c3 100644
--- a/glep-0075.rst
+++ b/glep-0075.rst
@@ -187,43 +187,98 @@ Rationale
=========
Algorithm for splitting distfiles
---------------------------------
-In the original debate that occurred in bug #534528 [#BUG534528]_,
-three possible solutions for splitting distfiles were listed:
+The possible algorithms were considered with the following goals
+in mind:
-a. using initial portion of filename,
+- the number of files in a single directory should not exceed 1000,
-b. using initial portion of file hash,
+- the total size of files in a single directory is not considered
+ relevant,
-c. using initial portion of filename hash.
+- the solution should preferably be future-proof,
-The significant advantage of the filename option was simplicity. With
-that solution, the users could easily determine the correct subdirectory
-themselves. However, it's significant disadvantage was very uneven
-shuffling of data. In particular, the TeΧ Live packages alone count
-almost 23500 distfiles and all use a common prefix, making it impossible
-to split them further.
+- moving distfiles should be avoided once it is deployed.
-The alternate option of using file hash has the advantage of having
-a more balanced split. Furthermore, since hashes are stored
-in Manifests using them is zero-cost. However, this solution has three
-significant disadvantages:
+It should also be noted that the package with the most distfiles
+in Gentoo at the time of writing is dev-texlive/texlive-latexextra,
+with 8556 distfiles. All of them start with the common prefix
+``texlive-module-``. This prefix is used by a total of 23435
+distfiles.
-1. The hash values are unknown for newly-downloaded distfiles, so
- ``repoman`` (or an equivalent tool) would have to use a temporary
- directory before locating the file in appropriate subdirectory.
+In the original debate that occurred in bug #534528 [#BUG534528]_
+and the mailing list review of the initial version of this GLEP [#ML1]_,
+four fundamental ideas for splitting distfiles were listed:
+
+a. using initial portion of filename,
+
+b. using initial portion of file hash,
+
+c. using initial portion of filename hash,
+
+d. using package category (and package name).
+
+The initial filename idea was to use the first character of the filename,
+possibly followed by a longer part, an approach historically used
+e.g. by the PyPI Python package hosting. Its main advantage is
+simplicity: users can easily determine the correct subdirectory
+by just looking at the distfile name. Sadly, this solution is not only
+very uneven but also does not solve the problem. As mentioned above,
+the TeΧ Live packages share a long common prefix that makes it impossible
+to split them properly using fixed-length prefixes.
+
+This idea has been followed by an adaptive proposal by Andrew Barchuk
+[#ADAPTIVE_FILENAME]_. In this proposal, the filenames are not strictly
+mapped to groups by a common prefix but instead each group contains
+all files between two prefixes being used (like in a dictionary).
+However, it has been pointed out that while this option can provide
+very even results initially, it is impossible to predict how it would
+be affected by future distfile changes and there will be a risk of
+needing to change the groups in the future. Furthermore, it is
+relatively complex and requires explicitly listing or obtaining used
+groups.
+
+Another option was to use an initial portion of distfile hashes. Its
+main advantage is that cryptographic hash algorithms can provide
+a more balanced split with random data. Furthermore, since hashes are
+stored in Manifests using them has no cost for users. However, this
+solution has three disadvantages:
+
+1. Not all files in the distfile tree are covered by package Manifests.
+ Additional files are injected into the mirrors, and those will
+ not have a clearly-defined location.
2. User-provided distfiles (e.g. for fetch-restricted packages) with
hash mismatches would be placed in the wrong subdirectory,
potentially causing confusing errors.
-3. Not all files in the distfiles tree are covered by package Manifests
- --- there are additional files that are injected into distfiles.
+3. The hash values are unknown for newly-downloaded distfiles, so
+ ``repoman`` (or an equivalent tool) would have to use a temporary
+ directory before placing the file in the appropriate subdirectory.
-Using filename hashes has proven to provide a similar balance
-to using file hashes. Furthermore, since filenames are known up front
-this solution does not suffer from the both listed problems. While
-hashes need to be computed manually, hashing short string should not
-cause any performance problems.
+Using filename hashes has proven to provide a similar balance to using
+file hashes. Furthermore, since filenames are known up front, this
+solution does not suffer from the listed problems. While hashes need
+to be computed manually, hashing a short string should not cause
+any performance problems.
+
+Jason Zaman has suggested using package categories (and package names)
+[#PKGNAME]_. However, this solution has multiple problems:
+
+a. it does not solve the problem for large packages such as TeΧ Live,
+
+b. it introduces many unnecessarily small directories,
+
+c. it requires an explicit knowledge of which package distfiles
+ belong to,
+
+d. it does not provide an explicit solution to the problem of distfiles
+ shared by multiple packages,
+
+e. it does not provide a solution to the problem of injected distfiles.
+
+All the options considered, the filename hash solution was selected
+as one that solves all the aforementioned problems while introducing
+relatively low complexity and being reasonably future-proof.
.. figure:: glep-0075-extras/by-filename.png
@@ -327,6 +382,17 @@ References
of DISTDIR
(https://bugs.gentoo.org/534528)
+.. [#ML1] [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
+ (https://archives.gentoo.org/gentoo-dev/message/cfc4f8595df2edf9a25ba9ecae2463ba)
+
+.. [#ADAPTIVE_FILENAME] Andrew Barchuk's reply on 'using character ranges
+ for each directory computed in a way to have the files distributed evenly'
+ (https://archives.gentoo.org/gentoo-dev/message/611bdaa76be049c1d650e8995748e7b8)
+
+.. [#PKGNAME] Jason Zaman's reply including 'using the same dir layout
+ as the packages themselves'
+ (https://archives.gentoo.org/gentoo-dev/message/f26ed870c3a6d4ecf69a821723642975)
+
Copyright
=========
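To make the comparison in the rationale concrete, here is a minimal sketch contrasting a filename-prefix split with a filename-hash split. It is an illustration only. The quoted text does not fix the hash algorithm or the number of digits, so the use of BLAKE2b, the two-hex-digit cutoff, and the helper names prefix_bucket, hash_bucket and distfile_path are all assumptions. The file names are made up, following the ``texlive-module-`` prefix mentioned above.

```python
import hashlib


def prefix_bucket(filename, length=1):
    """Bucket by the initial portion of the filename (option a)."""
    return filename[:length]


def hash_bucket(filename, hex_digits=2):
    """Bucket by the initial portion of the filename hash (option c).

    Assumption: BLAKE2b over the UTF-8 encoded name, first two hex digits.
    """
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    return digest[:hex_digits]


def distfile_path(filename, hex_digits=2):
    """Relative mirror path under the hashed layout of this sketch."""
    return f"{hash_bucket(filename, hex_digits)}/{filename}"


# Hypothetical distfile names sharing the long common prefix.
names = [f"texlive-module-example{i}.tar.xz" for i in range(1000)]

prefix_groups = {prefix_bucket(name) for name in names}
hash_groups = {hash_bucket(name) for name in names}

print(len(prefix_groups))  # every name lands in the same prefix bucket
print(len(hash_groups))    # the hashed names spread over many buckets
print(distfile_path(names[0]))
```

Because the bucket depends only on the filename, it can be computed before the file is downloaded, which is the property the rationale relies on when it says this option avoids the problems listed for content hashes.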
* [gentoo-commits] data/glep:glep-mirrors commit in: /
From: Michał Górny @ 2018-02-07 13:22 UTC
To: gentoo-commits
commit: a9971615ef69dbb26f04af2c9fd1cddd1ec6d626
Author: Michał Górny <mgorny <AT> gentoo <DOT> org>
AuthorDate: Sun Jan 28 08:55:35 2018 +0000
Commit: Michał Górny <mgorny <AT> gentoo <DOT> org>
CommitDate: Wed Feb 7 13:22:07 2018 +0000
URL: https://gitweb.gentoo.org/data/glep.git/commit/?id=a9971615
glep-0075: Include argument on injected distfiles
Include the argument that we have injected distfiles as noted
on the bug by Robin. Inclusion suggested by Michael Orlitzky.
glep-0075.rst | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/glep-0075.rst b/glep-0075.rst
index c0a1d5c..ced1231 100644
--- a/glep-0075.rst
+++ b/glep-0075.rst
@@ -205,7 +205,7 @@ to split them further.
The alternate option of using file hash has the advantage of having
a more balanced split. Furthermore, since hashes are stored
-in Manifests using them is zero-cost. However, this solution has two
+in Manifests using them is zero-cost. However, this solution has three
significant disadvantages:
1. The hash values are unknown for newly-downloaded distfiles, so
@@ -216,6 +216,9 @@ significant disadvantages:
hash mismatches would be placed in the wrong subdirectory,
potentially causing confusing errors.
+3. Not all files in the distfiles tree are covered by package Manifests
+ --- there are additional files that are injected into distfiles.
+
Using filename hashes has proven to provide a similar balance
to using file hashes. Furthermore, since filenames are known up front
this solution does not suffer from the both listed problems. While
* [gentoo-commits] data/glep:glep-mirrors commit in: /
From: Michał Górny @ 2018-02-07 14:56 UTC
To: gentoo-commits
commit: 0183148733177dab0a2a1cdbdc748744553102c3
Author: Michał Górny <mgorny <AT> gentoo <DOT> org>
AuthorDate: Wed Feb 7 14:55:23 2018 +0000
Commit: Michał Górny <mgorny <AT> gentoo <DOT> org>
CommitDate: Wed Feb 7 14:55:36 2018 +0000
URL: https://gitweb.gentoo.org/data/glep.git/commit/?id=01831487
glep-0075: Use Unicode dashes
glep-0075.rst | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/glep-0075.rst b/glep-0075.rst
index 00d14c3..38d57e2 100644
--- a/glep-0075.rst
+++ b/glep-0075.rst
@@ -30,7 +30,7 @@ specifically designed for that purpose.
According to the Gentoo repository state at 2018-01-26 16:23, there
was a total of 62652 unique distfiles in the repository. While
the users realistically hit around 10% of that, distfile mirrors often
-hold even more files --- more so if old distfiles are not wiped
+hold even more files — more so if old distfiles are not wiped
immediately.
While all filesystems used on Linux boxes should be able to cope with
@@ -65,7 +65,7 @@ how to use the mirror. If the file is not present, the package manager
should behave as if it were empty.
The package manager should recognize the sections and keys listed below.
-It should ignore any unrecognized sections or keys --- the format
+It should ignore any unrecognized sections or keys — the format
is intended to account for future extensions.
This specification currently defines one section: ``[structure]``.
@@ -288,12 +288,12 @@ relatively low complexity and being reasonably future-proof.
.. figure:: glep-0075-extras/by-csum.png
Distribution of distfiles by first hex-digit of checksum
- (x --- content checksum, + --- filename checksum)
+ (x — content checksum, + — filename checksum)
.. figure:: glep-0075-extras/by-csum2.png
Distribution of distfiles by two first hex-digits of checksum
- (x --- content checksum, + --- filename checksum)
+ (x — content checksum, + — filename checksum)
Layout file