From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from lists.gentoo.org (pigeon.gentoo.org [208.92.234.80])
    (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
    (No client certificate requested)
    by finch.gentoo.org (Postfix) with ESMTPS id 72E5E1382C5
    for ; Wed, 7 Feb 2018 15:01:00 +0000 (UTC)
Received: from pigeon.gentoo.org (localhost [127.0.0.1])
    by pigeon.gentoo.org (Postfix) with SMTP id 41F97E0ABF;
    Wed, 7 Feb 2018 15:00:59 +0000 (UTC)
Received: from smtp.gentoo.org (dev.gentoo.org [IPv6:2001:470:ea4a:1:5054:ff:fec7:86e4])
    (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
    (No client certificate requested)
    by pigeon.gentoo.org (Postfix) with ESMTPS id 124B3E0ABF
    for ; Wed, 7 Feb 2018 15:00:58 +0000 (UTC)
Received: from oystercatcher.gentoo.org (unknown [IPv6:2a01:4f8:202:4333:225:90ff:fed9:fc84])
    (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
    (No client certificate requested)
    by smtp.gentoo.org (Postfix) with ESMTPS id B755C335C4C
    for ; Wed, 7 Feb 2018 15:00:57 +0000 (UTC)
Received: from localhost.localdomain (localhost [IPv6:::1])
    by oystercatcher.gentoo.org (Postfix) with ESMTP id 4543A1E5
    for ; Wed, 7 Feb 2018 15:00:56 +0000 (UTC)
From: "Ulrich Müller"
To: gentoo-commits@lists.gentoo.org
Content-Transfer-Encoding: 8bit
Content-type: text/plain; charset=UTF-8
Reply-To: gentoo-dev@lists.gentoo.org, "Ulrich Müller"
Message-ID: <1518009728.e4dc2627c8107339b13e20709125e2d9fc91ffde.ulm@gentoo>
Subject: [gentoo-commits] data/glep:master commit in: /
X-VCS-Repository: data/glep
X-VCS-Files: glep-0075.rst
X-VCS-Directories: /
X-VCS-Committer: ulm
X-VCS-Committer-Name: Ulrich Müller
X-VCS-Revision: e4dc2627c8107339b13e20709125e2d9fc91ffde
X-VCS-Branch: master
Date: Wed, 7 Feb 2018 15:00:56 +0000 (UTC)
Precedence: bulk
List-Post:
List-Help:
List-Unsubscribe:
List-Subscribe:
List-Id: Gentoo Linux mail
X-BeenThere: gentoo-commits@lists.gentoo.org
X-Archives-Salt:
 b585565f-23eb-4033-8166-ff86f39596f7
X-Archives-Hash: 7e106aeb170c73e590396f8a269a97d6

commit: e4dc2627c8107339b13e20709125e2d9fc91ffde
Author: Michał Górny gentoo org>
AuthorDate: Wed Feb 7 13:20:45 2018 +0000
Commit: Ulrich Müller gentoo org>
CommitDate: Wed Feb 7 13:22:08 2018 +0000
URL: https://gitweb.gentoo.org/data/glep.git/commit/?id=e4dc2627

glep-0075: Extend rationale for splitting algorithm

Extend and refactor the rationale for splitting algorithm. Explicitly
state the goals, list all the options that occurred during the ml
discussion.

 glep-0075.rst | 116 +++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 91 insertions(+), 25 deletions(-)

diff --git a/glep-0075.rst b/glep-0075.rst
index 157514e..00d14c3 100644
--- a/glep-0075.rst
+++ b/glep-0075.rst
@@ -187,43 +187,98 @@ Rationale
 =========
 Algorithm for splitting distfiles
 ---------------------------------
-In the original debate that occurred in bug #534528 [#BUG534528]_,
-three possible solutions for splitting distfiles were listed:
+The possible algorithms were considered with the following goals
+in mind:
 
-a. using initial portion of filename,
+- the number of files in a single directory should not exceed 1000,
 
-b. using initial portion of file hash,
+- the total size of files in a single directory is not considered
+  relevant,
 
-c. using initial portion of filename hash.
+- the solution should preferably be future-proof,
 
-The significant advantage of the filename option was simplicity. With
-that solution, the users could easily determine the correct subdirectory
-themselves. However, it's significant disadvantage was very uneven
-shuffling of data. In particular, the TeΧ Live packages alone count
-almost 23500 distfiles and all use a common prefix, making it impossible
-to split them further.
+- moving distfiles should be avoided once it is deployed.
 
-The alternate option of using file hash has the advantage of having
-a more balanced split. Furthermore, since hashes are stored
-in Manifests using them is zero-cost. However, this solution has three
-significant disadvantages:
+It should also be noted that at this moment the package having most
+distfiles in Gentoo is dev-texlive/texlive-latexextra,
+with the number of 8556 distfiles. All of them start with a common
+prefix of ``texlive-module-``. This specific prefix is used by a total
+of 23435 distfiles.
 
-1. The hash values are unknown for newly-downloaded distfiles, so
-   ``repoman`` (or an equivalent tool) would have to use a temporary
-   directory before locating the file in appropriate subdirectory.
+In the original debate that occurred in bug #534528 [#BUG534528]_
+and the mailing list review of the initial version of this GLEP [#ML1]_,
+four fundamental ideas for splitting distfiles were listed:
+
+a. using initial portion of filename,
+
+b. using initial portion of file hash,
+
+c. using initial portion of filename hash,
+
+d. using package category (and package name).
+
+The initial filename idea was to use the first character of filename,
+possibly followed by a longer part which was the idea historically
+used e.g. by PyPI Python package hosting. Its main advantage is
+simplicity. The users can easily determine the correct subdirectory
+by just looking at the distfile name. Sadly, this solution is not only
+very uneven but does not solve the problem. As mentioned above,
+the TeΧ Live packages share a long common prefix that makes it impossible
+to split it properly with other packages on fixed-length prefixes.
+
+This idea has been followed by an adaptive proposal by Andrew Barchuk
+[#ADAPTIVE_FILENAME]_. In this proposal, the filenames are not strictly
+mapped to groups by a common prefix but instead each group contains
+all files between two prefixes being used (like in a dictionary).
+However, it has been pointed out that while this option can provide
+very even results initially, it is impossible to predict how it would
+be affected by future distfile changes and there will be a risk of
+needing to change the groups in the future. Furthermore, it is
+relatively complex and requires explicitly listing or obtaining used
+groups.
+
+Another option was to use an initial portion of distfile hashes. Its
+main advantage is that cryptographic hash algorithms can provide
+a more balanced split with random data. Furthermore, since hashes are
+stored in Manifests using them has no cost for users. However, this
+solution has three disadvantages:
 
+1. Not all files in the distfile tree are covered by package Manifests.
+   Additional files are injected into the mirrors, and those will
+   not have a clearly-defined location.
 
 2. User-provided distfiles (e.g. for fetch-restricted packages) with
    hash mismatches would be placed in the wrong subdirectory,
    potentially causing confusing errors.
 
-3. Not all files in the distfiles tree are covered by package Manifests
-   --- there are additional files that are injected into distfiles.
+3. The hash values are unknown for newly-downloaded distfiles, so
+   ``repoman`` (or an equivalent tool) would have to use a temporary
+   directory before locating the file in appropriate subdirectory.
 
-Using filename hashes has proven to provide a similar balance
-to using file hashes. Furthermore, since filenames are known up front
-this solution does not suffer from the both listed problems. While
-hashes need to be computed manually, hashing short string should not
-cause any performance problems.
+Using filename hashes has proven to provide a similar balance to using
+file hashes. Furthermore, since filenames are known up front this
+solution does not suffer from the listed problems. While hashes need
+to be computed manually, hashing short string should not cause
+any performance problems.
+
+Jason Zaman has suggested to use package categories (and package names)
+[#PKGNAME]_. However, this solution has multiple problems:
+
+a. it does not solve the problem for large packages such as TeΧ Live,
+
+b. it introduces many unnecessarily small directories,
+
+c. it requires an explicit knowledge of which package distfiles
+   belong to,
+
+d. it does not provide an explicit solution to the problem of distfiles
+   shared by multiple packages,
+
+e. it does not provide a solution to the problem of injected distfiles.
+
+All the options considered, the filename hash solution was selected
+as one that solves all the aforementioned problems while introducing
+relatively low complexity and being reasonably future-proof.
 
 .. figure:: glep-0075-extras/by-filename.png
 
@@ -327,6 +382,17 @@ References
    of DISTDIR
    (https://bugs.gentoo.org/534528)
 
+.. [#ML1] [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
+   (https://archives.gentoo.org/gentoo-dev/message/cfc4f8595df2edf9a25ba9ecae2463ba)
+
+.. [#ADAPTIVE_FILENAME] Andrew Barchuk's reply on 'using character ranges
+   for each directory computed in a way to have the files distributed evenly'
+   (https://archives.gentoo.org/gentoo-dev/message/611bdaa76be049c1d650e8995748e7b8)
+
+.. [#PKGNAME] Jason Zaman's reply including 'using the same dir layout
+   as the packages themselves'
+   (https://archives.gentoo.org/gentoo-dev/message/f26ed870c3a6d4ecf69a821723642975)
+
 Copyright
 =========
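
For illustration only, the difference between option (a), the initial
portion of the filename, and option (c), the initial portion of a filename
hash, can be sketched as below. This is a minimal sketch and not part of the
commit: the hash function (BLAKE2b), the two-hex-digit cutoff, the function
names and the synthetic file names are all assumptions made for the example.
Only the ``texlive-module-`` prefix and the count of 23435 distfiles come
from the rationale text.

```python
import hashlib
from collections import Counter


def subdir_by_prefix(filename: str, length: int = 1) -> str:
    """Option (a): use the initial portion of the filename itself."""
    return filename[:length]


def subdir_by_filename_hash(filename: str, hex_chars: int = 2) -> str:
    """Option (c): use the initial portion of a hash of the filename.

    Assumption: BLAKE2b and a 2-hex-digit (256-way) cutoff are
    illustrative choices; the rationale text fixes neither.
    """
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    return digest[:hex_chars]


# Synthetic names sharing the TeX Live prefix (hypothetical file names;
# only the prefix and the count are taken from the rationale).
names = [f"texlive-module-{i}.tar.xz" for i in range(23435)]

by_prefix = Counter(subdir_by_prefix(n) for n in names)
by_hash = Counter(subdir_by_filename_hash(n) for n in names)

print("prefix split:", len(by_prefix), "dir(s), largest:", max(by_prefix.values()))
print("hash split:  ", len(by_hash), "dir(s), largest:", max(by_hash.values()))
```

With a one-character prefix every synthetic name lands in the same
directory, while the filename hash spreads the names over at most 256
directories. The subdirectory depends only on the name, so it can be
computed before a file is downloaded.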