public inbox for gentoo-dev@lists.gentoo.org
From: "Michał Górny" <mgorny@gentoo.org>
To: gentoo-dev@lists.gentoo.org
Subject: Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
Date: Mon, 29 Jan 2018 06:36:52 +0100
Message-ID: <1517204212.867.5.camel@gentoo.org>
In-Reply-To: <1517172228.2114973.1251027256.0A9C8F3C@webmail.messagingengine.com>

On Sun, 28.01.2018 at 21:43 +0100, Andrew Barchuk wrote:
> [my apologies for posting the message to a wrong thread before]
> 
> Hi everyone,
> 
> > three possible solutions for splitting distfiles were listed:
> > 
> > a. using initial portion of filename,
> > 
> > b. using initial portion of file hash,
> > 
> > c. using initial portion of filename hash.
> > 
> > The significant advantage of the filename option was simplicity.  With
> > that solution, the users could easily determine the correct subdirectory
> > themselves.  However, its significant disadvantage was very uneven
> > shuffling of data.  In particular, the TeΧ Live packages alone count
> > almost 23500 distfiles and all use a common prefix, making it impossible
> > to split them further.
> > 
> > The alternate option of using file hash has the advantage of having
> > a more balanced split.
> 
> 
> There's another option to use character ranges for each directory
> computed in a way to have the files distributed evenly. One way to do
> that is to use filename prefix of dynamic length so that each range
> holds the same number of files. E.g. we would have Ab/, Ap/, Ar/ but
> texlive-module-te/, texlive-module-th/, texlive-module-ti/. A similar
> but simpler option is to use file names as range bounds (the same way
> dictionaries use words to demarcate page bounds): each directory will
> have a name of the first file located inside. This way files will be
> distributed evenly and it's still easy to pick a correct directory where
> a file will be located manually.

What you're talking about is pretty much an adaptive algorithm. It may
look like a good idea at first, but it's really hard to tell how well
it'll work later on, because you can't really predict what will happen
to distfiles in the future.

A few major events that could result in it going completely off:

a. we stop using split texlive packages and distribute a few big
tarballs instead,

b. texlive packages are renamed to use date before subpackage name,

c. someone adds another big package set.

That said, you don't need a big event for that. Many small events may
(or may not) cause it to gradually go off. Whenever that happens, we
would have to have a contingency plan -- and I don't really like
the idea of having to reshuffle all the mirrors all of a sudden.
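
To illustrate the reshuffling concern, here is a minimal sketch of the
"file names as range bounds" idea. Everything in it is an assumption
made up for the example: the helper names boundaries() and directory(),
the synthetic file names, and the choice of 10 directories. It
recomputes the bounds after event (c), a new big package set, and
counts how many existing files end up in a differently named directory.

```python
from bisect import bisect_right


def boundaries(names, n_dirs):
    """Pick boundary names so each range holds about the same number of files."""
    ordered = sorted(names)
    step = -(-len(ordered) // n_dirs)  # ceiling division
    return ordered[::step]


def directory(name, bounds):
    """A file lives in the directory named after the last boundary <= name."""
    return bounds[max(bisect_right(bounds, name) - 1, 0)]


# Hypothetical mirror contents before and after a new big package set.
old = [f"pkg-{i:04d}.tar.gz" for i in range(1000)]
old_bounds = boundaries(old, 10)

new = old + [f"bigset-{i:04d}.tar.gz" for i in range(1000)]
new_bounds = boundaries(new, 10)

# Existing files whose directory name changed after rebalancing.
moved = sum(directory(n, old_bounds) != directory(n, new_bounds) for n in old)
print(f"{moved} of {len(old)} existing files changed directory")
```

In this toy data set half of the existing files move once the bounds
are recomputed. Not recomputing them avoids the move but leaves the
split uneven, which is the trade-off described above.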

I think the cryptographic hash algorithms are a better choice. They may
not be perfect but they can cope with a lot of very different data
by design. Yes, we could technically hit a data set that happens to
split completely unevenly. But that is rather unlikely compared to
home-made algorithms.
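
As a rough illustration of the difference, the sketch below compares
option (a), the filename prefix, with option (c), the filename hash
prefix, on a synthetic data set dominated by one common prefix. The
choice of SHA-256, the one-character bucket length, the helper names
and the file names are all assumptions for the example, not something
decided in this thread.

```python
import hashlib
from collections import Counter


def prefix_bucket(filename, length=1):
    """Option (a): bucket by the first characters of the filename."""
    return filename[:length].lower()


def hash_bucket(filename, length=1):
    """Option (c): bucket by the first hex digits of the filename hash."""
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    return digest[:length]


def spread(buckets):
    """Return (size of the largest bucket, number of buckets used)."""
    counts = Counter(buckets)
    return max(counts.values()), len(counts)


# Synthetic names: one large set sharing a prefix, plus a few small ones.
names = [f"texlive-module-{i:05d}.tar.xz" for i in range(2000)]
names += [f"{c}pkg-{i}.tar.gz" for c in "abcdefgh" for i in range(25)]

prefix_max, prefix_dirs = spread(prefix_bucket(n) for n in names)
hash_max, hash_dirs = spread(hash_bucket(n) for n in names)

print(f"prefix: largest dir {prefix_max} files across {prefix_dirs} dirs")
print(f"hash:   largest dir {hash_max} files across {hash_dirs} dirs")
```

With the prefix split, all 2000 files sharing the common prefix land in
a single directory; with the hash split they are scattered over up to
16 directories regardless of how the files are named.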

-- 
Best regards,
Michał Górny


