Message-ID: <1517204212.867.5.camel@gentoo.org>
Subject: Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
From: Michał Górny
To: gentoo-dev@lists.gentoo.org
Date: Mon, 29 Jan 2018 06:36:52 +0100
In-Reply-To: <1517172228.2114973.1251027256.0A9C8F3C@webmail.messagingengine.com>
References: <1517009079.31015.3.camel@gentoo.org>
	 <1517172228.2114973.1251027256.0A9C8F3C@webmail.messagingengine.com>

On Sun, 2018-01-28 at 21:43 +0100, Andrew Barchuk wrote:
> [my apologies for posting the message to a wrong thread before]
>
> Hi everyone,
>
> > three possible solutions for splitting distfiles were listed:
> >
> > a. using initial portion of filename,
> >
> > b. using initial portion of file hash,
> >
> > c. using initial portion of filename hash.
> >
> > The significant advantage of the filename option was simplicity. With
> > that solution, the users could easily determine the correct
> > subdirectory themselves. However, its significant disadvantage was
> > very uneven shuffling of data. In particular, the TeX Live packages
> > alone count almost 23500 distfiles and all use a common prefix,
> > making it impossible to split them further.
> >
> > The alternate option of using file hash has the advantage of having
> > a more balanced split.
>
> There's another option: use character ranges for each directory,
> computed in a way that distributes the files evenly. One way to do
> that is to use a filename prefix of dynamic length, so that each range
> holds the same number of files. E.g. we would have Ab/, Ap/, Ar/ but
> texlive-module-te/, texlive-module-th/, texlive-module-ti/. A similar
> but simpler option is to use file names as range bounds (the same way
> dictionaries use words to demarcate page bounds): each directory would
> have the name of the first file located inside it. This way files will
> be distributed evenly and it's still easy to manually pick the correct
> directory for a given file.
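If I read the range-bound variant right, picking a directory would boil
down to something like the sketch below (the bound names are invented,
and I'm assuming each directory is literally named after the first
distfile placed in it):

    import bisect

    # Sorted directory names; each is the name of the first distfile
    # stored in that directory (invented examples).
    bounds = [
        "Abc-1.0.tar.gz",
        "apache-tomcat-8.5.24.tar.gz",
        "texlive-module-te.tar.xz",
        "texlive-module-th.tar.xz",
    ]

    def pick_directory(distfile):
        # Last bound that sorts <= the distfile name.
        i = bisect.bisect_right(bounds, distfile) - 1
        return bounds[max(i, 0)]

    # pick_directory("texlive-module-tex4ht-2017.tar.xz")
    #   -> "texlive-module-te.tar.xz"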
What you're talking about is pretty much an adaptive algorithm. It may
look like a good idea at first, but it's really hard to predict how it
will work in the future, because you can't really predict what will
happen to distfiles. A few major events could throw it completely off:

a. we stop using split texlive packages and distribute a few big
   tarballs instead,

b. texlive packages are renamed to put the date before the subpackage
   name,

c. someone adds another big package set.

That said, you don't need a big event for that. Many small events may
(or may not) cause it to gradually drift off balance. Whenever that
happens, we would have to have a contingency plan -- and I don't really
like the idea of having to reshuffle all the mirrors all of a sudden.

I think the cryptographic hash algorithms are a better choice. They may
not be perfect, but by design they cope well with a lot of very
different data. Yes, we could technically hit a data set that is
completely uneven by accident, but that is rather unlikely compared to
home-made algorithms.

-- 
Best regards,
Michał Górny
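PS. For comparison, a filename-hash split needs no shared state at all.
A minimal sketch -- assuming a two-hex-character prefix of BLAKE2 over
the distfile name, which is only a placeholder here, not something the
GLEP has settled on:

    import hashlib

    def pick_directory(distfile):
        # Hash the file name and use the first two hex digits as the
        # subdirectory: 256 buckets, roughly uniform no matter how the
        # distfiles are named.
        digest = hashlib.blake2b(distfile.encode("utf-8")).hexdigest()
        return digest[:2]

    # Every distfile maps to one of 00..ff, independently of any common
    # name prefix.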