Message-ID: <1517204212.867.5.camel@gentoo.org>
Subject: Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
From: Michał Górny
To: gentoo-dev@lists.gentoo.org
Date: Mon, 29 Jan 2018 06:36:52 +0100
In-Reply-To: <1517172228.2114973.1251027256.0A9C8F3C@webmail.messagingengine.com>
References: <1517009079.31015.3.camel@gentoo.org>
	 <1517172228.2114973.1251027256.0A9C8F3C@webmail.messagingengine.com>

On Sun, 2018-01-28 at 21:43 +0100, Andrew Barchuk wrote:
> [my apologies for posting the message to a wrong thread before]
>
> Hi everyone,
>
> > three possible solutions for splitting distfiles were listed:
> >
> > a. using initial portion of filename,
> >
> > b. using initial portion of file hash,
> >
> > c. using initial portion of filename hash.
> >
> > The significant advantage of the filename option was simplicity. With
> > that solution, the users could easily determine the correct
> > subdirectory themselves. However, its significant disadvantage was
> > very uneven shuffling of data. In particular, the TeX Live packages
> > alone count almost 23500 distfiles and all use a common prefix,
> > making it impossible to split them further.
> >
> > The alternate option of using file hash has the advantage of having
> > a more balanced split.
>
> There's another option: use character ranges for each directory,
> computed in a way that distributes the files evenly. One way to do
> that is to use a filename prefix of dynamic length, so that each range
> holds the same number of files. E.g. we would have Ab/, Ap/, Ar/ but
> texlive-module-te/, texlive-module-th/, texlive-module-ti/. A similar
> but simpler option is to use file names as range bounds (the same way
> dictionaries use words to demarcate page bounds): each directory would
> have the name of the first file located inside it. This way files will
> be distributed evenly and it's still easy to manually pick the correct
> directory for a given file.
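If I read the range-bound variant right, picking a directory would boil
down to something like the sketch below (the bound names are invented,
and I'm assuming each directory is literally named after the first
distfile placed in it):

    import bisect

    # Sorted directory names; each is the name of the first distfile
    # stored in that directory (invented examples).
    bounds = [
        "Abc-1.0.tar.gz",
        "apache-tomcat-8.5.24.tar.gz",
        "texlive-module-te.tar.xz",
        "texlive-module-th.tar.xz",
    ]

    def pick_directory(distfile):
        # Last bound that sorts <= the distfile name.
        i = bisect.bisect_right(bounds, distfile) - 1
        return bounds[max(i, 0)]

    # pick_directory("texlive-module-tex4ht-2017.tar.xz")
    #   -> "texlive-module-te.tar.xz"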
What you're talking about is pretty much an adaptive algorithm. It may
look like a good idea at first, but it's really hard to predict how it
will work in the future, because you can't really predict what will
happen to distfiles. A few major events could throw it completely off:

a. we stop using split texlive packages and distribute a few big
   tarballs instead,

b. texlive packages are renamed to put the date before the subpackage
   name,

c. someone adds another big package set.

That said, you don't need a big event for that. Many small events may
(or may not) cause it to gradually drift off balance. Whenever that
happens, we would have to have a contingency plan -- and I don't really
like the idea of having to reshuffle all the mirrors all of a sudden.

I think the cryptographic hash algorithms are a better choice. They may
not be perfect, but by design they cope well with a lot of very
different data. Yes, we could technically hit a data set that is
completely uneven by accident, but that is rather unlikely compared to
home-made algorithms.

-- 
Best regards,
Michał Górny
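PS. For comparison, a filename-hash split needs no shared state at all.
A minimal sketch -- assuming a two-hex-character prefix of BLAKE2 over
the distfile name, which is only a placeholder here, not something the
GLEP has settled on:

    import hashlib

    def pick_directory(distfile):
        # Hash the file name and use the first two hex digits as the
        # subdirectory: 256 buckets, roughly uniform no matter how the
        # distfiles are named.
        digest = hashlib.blake2b(distfile.encode("utf-8")).hexdigest()
        return digest[:2]

    # Every distfile maps to one of 00..ff, independently of any common
    # name prefix.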