From: "Michał Górny" <mgorny@gentoo.org>
To: gentoo-dev@lists.gentoo.org
Subject: Re: [gentoo-dev] [pre-GLEP] Split distfile mirror directory structure
Date: Mon, 29 Jan 2018 06:36:52 +0100
Message-ID: <1517204212.867.5.camel@gentoo.org>
In-Reply-To: <1517172228.2114973.1251027256.0A9C8F3C@webmail.messagingengine.com>
On Sun, 28.01.2018 at 21:43 +0100, Andrew Barchuk wrote:
> [my apologies for posting the message to a wrong thread before]
>
> Hi everyone,
>
> > three possible solutions for splitting distfiles were listed:
> >
> > a. using initial portion of filename,
> >
> > b. using initial portion of file hash,
> >
> > c. using initial portion of filename hash.
> >
> > The significant advantage of the filename option was simplicity. With
> > that solution, the users could easily determine the correct subdirectory
> > themselves. However, its significant disadvantage was very uneven
> > shuffling of data. In particular, the TeΧ Live packages alone count
> > almost 23500 distfiles and all use a common prefix, making it impossible
> > to split them further.
> >
> > The alternate option of using file hash has the advantage of having
> > a more balanced split.
>
>
> There's another option to use character ranges for each directory
> computed in a way to have the files distributed evenly. One way to do
> that is to use filename prefix of dynamic length so that each range
> holds the same number of files. E.g. we would have Ab/, Ap/, Ar/ but
> texlive-module-te/, texlive-module-th/, texlive-module-ti/. A similar
> but simpler option is to use file names as range bounds (the same way
> dictionaries use words to demarcate page bounds): each directory will
> have a name of the first file located inside. This way files will be
> distributed evenly and it's still easy to pick a correct directory where
> a file will be located manually.

What you're describing is essentially an adaptive algorithm. It may
look like a good idea at first, but it's really hard to predict how it
will behave, because you can't predict what will happen to distfiles in
the future.

A few major events that could throw it completely off balance:

a. we stop using split texlive packages and distribute a few big
   tarballs instead,

b. texlive packages are renamed to put the date before the subpackage
   name,

c. someone adds another big package set.

That said, you don't need a big event for that. Many small events may
(or may not) cause it to drift off balance gradually. Whenever that
happens, we would need a contingency plan -- and I don't really like
the idea of having to reshuffle all the mirrors all of a sudden.
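
To illustrate event (c), here is a minimal sketch of the "file names as
range bounds" idea, under assumptions of my own: the file names are
synthetic, and make_bounds(), bucket_for() and sizes() are hypothetical
helpers, not part of any existing tool. Bounds are computed once from
the current file list; a new package set then lands in a single
directory.

```python
from bisect import bisect_right
from collections import Counter

def make_bounds(names, buckets):
    """Use the first name of each equally sized slice as a directory bound."""
    ordered = sorted(names)
    step = -(-len(ordered) // buckets)  # ceiling division
    return [ordered[i] for i in range(0, len(ordered), step)]

def bucket_for(name, bounds):
    """Index of the last bound <= name (0 if name sorts before all bounds)."""
    return max(bisect_right(bounds, name) - 1, 0)

def sizes(names, bounds):
    """Number of files per directory, in bound order."""
    counts = Counter(bucket_for(n, bounds) for n in names)
    return [counts.get(i, 0) for i in range(len(bounds))]

# Synthetic, made-up distfile names (not real mirror contents).
initial = [f"pkg{i:04d}-1.0.tar.gz" for i in range(400)]
bounds = make_bounds(initial, 4)
before = sizes(initial, bounds)

# Event (c): someone adds another big package set.
added = [f"newset-module-{i:04d}.tar.xz" for i in range(600)]
after = sizes(initial + added, bounds)

print(before)  # [100, 100, 100, 100]
print(after)   # [700, 100, 100, 100]
```

Restoring the balance means recomputing the bounds, and that moves
files between directories on every mirror.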

I think cryptographic hash algorithms are a better choice. They may
not be perfect, but by design they cope with a lot of very different
data. Yes, we could technically hit a data set that splits completely
unevenly by accident, but that is far less likely than with home-made
algorithms.
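
A rough sketch of the difference, comparing option (a) with option (c)
from the quoted list. The assumptions are mine: the names are
synthetic, SHA-256 is an arbitrary stand-in (no specific hash has been
chosen here), and by_prefix() and by_name_hash() are hypothetical
helpers.

```python
import hashlib
from collections import Counter

def by_prefix(name, length=1):
    """Option (a): directory is the initial portion of the filename."""
    return name[:length]

def by_name_hash(name, length=1):
    """Option (c): directory is the initial portion of the filename hash.

    SHA-256 is an arbitrary choice for this sketch.
    """
    return hashlib.sha256(name.encode("utf-8")).hexdigest()[:length]

# Synthetic names: one large set sharing a prefix, plus ten small sets.
names = [f"texlive-module-{i:05d}.tar.xz" for i in range(2000)]
names += [f"{c}lib-{i}.tar.gz" for c in "abcdefghij" for i in range(100)]

prefix_counts = Counter(by_prefix(n) for n in names)
hash_counts = Counter(by_name_hash(n) for n in names)

# Largest directory and number of directories for each scheme.
print(max(prefix_counts.values()), len(prefix_counts))
print(max(hash_counts.values()), len(hash_counts))
```

With the prefix split, the 't' directory alone holds all 2000 of the
synthetic texlive files. With the hash split, the 3000 files spread
over the 16 hex-digit directories without any knowledge of the names.
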
--
Best regards,
Michał Górny