From: "Robin H. Johnson" <robbat2@gentoo.org>
To: gentoo-dev@lists.gentoo.org
Subject: Re: [gentoo-dev] [News item review] Portage rsync tree verification (v4)
Date: Mon, 29 Jan 2018 07:21:14 +0000
Message-ID: <robbat2-20180129T070612-080997113Z@orbis-terrarum.net>
In-Reply-To: <1517171431.2109764.1251018832.6F16557B@webmail.messagingengine.com>

On Sun, Jan 28, 2018 at 09:30:31PM +0100, Andrew Barchuk wrote:
> Hi everyone,
> 
> > three possible solutions for splitting distfiles were listed:
> There's another option to use character ranges for each directory
> computed in a way to have the files distributed evenly. One way to do
> that is to use filename prefix of dynamic length so that each range
> holds the same number of files. E.g. we would have Ab/, Ap/, Ar/ but
> texlive-module-te/, texlive-module-th/, texlive-module-ti/. A similar
> but simpler option is to use file names as range bounds (the same way
> dictionaries use words to demarcate page bounds): each directory will
> have a name of the first file located inside. This way files will be
> distributed evenly and it's still easy to pick a correct directory where
> a file will be located manually.
This was discussed early on, but thank you for the reminder, as it got
dropped from later discussions.

> [snip code]
> Using the approach above the files will be distributed evenly among the
> directories keeping the possibility to determine the directory for a
> specific file by hand. It's possible if necessary to keep the directory
> structure unchanged for very long time and it will likely stay
> well-balanced. Picking a directory for a file is very cheap. The only
> obvious downside I see is that it's necessary to know list of
> directories to pick the correct one (can be mitigated by caching the
> list of directories if important). If it's desirable to make directory
> names shorter or to look less like file names it's fairly easy to
> achieve by keeping only unique prefixes of directories. For example:
As for the problem you describe, one of the requirements in the
discussion is that given ONLY the file or filename, and NOTHING ELSE, it
should be possible to determine where in a hierarchy it should go. No
prior knowledge about the hierarchy was permitted. Some parties might
answer that you just need an index file then, but that means the index
file has to be kept in sync, and resynced frequently.
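
To illustrate that requirement, here is a minimal sketch of a placement
rule that needs nothing but the filename. The function name
distfile_dir, the choice of SHA-256 and the two-character prefix width
are all assumptions made for the example, not something agreed in this
thread.

```python
import hashlib


def distfile_dir(filename: str, width: int = 2) -> str:
    """Return a directory name computed from the filename alone.

    Hypothetical helper: hash algorithm and prefix width are
    illustrative assumptions, not a decided layout.
    """
    digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
    # The first `width` hex characters select one of 16**width buckets.
    return digest[:width]


if __name__ == "__main__":
    for name in ("scratch-1.0.tar.gz", "texlive-module-te.tar.xz"):
        print(distfile_dir(name), name)
```

No directory listing or index file is consulted, so any mirror or
client computes the same answer independently. The cost is that the
directory names no longer look anything like the filenames.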

It's a superbly readable result (in the general class of perfect hashes
based on lots of well-known input). This class of solution suffers from
another problem in addition to the one you noted: if the input changes
sufficiently, then rebalancing is expensive/hard.

As a concrete example, say we add a new category whose distfiles share
a long common prefix: dev-scratch/, where all distfiles start with
'scratch-'.
Unless we know up-front that we're going to add a thousand distfiles
here (not unreasonable, dev-python is ~1800 packages), they might start
by going into the 'sc' directory, but later we want them to be in their
own 'scratch' directory, as the tree is unbalanced otherwise.
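
A rough sketch of that scenario, assuming directories are named after
the first file they hold and looked up by binary search. The bounds
list, the sample filenames and the helper pick_dir are made up for the
illustration.

```python
import bisect
from collections import Counter


def pick_dir(bounds, filename):
    """Pick the range directory for filename.

    bounds must be sorted; each entry is the lower bound (name) of one
    directory. Note that the caller has to know the full list.
    """
    i = bisect.bisect_right(bounds, filename) - 1
    return bounds[max(i, 0)]


# Hypothetical layout that was balanced before dev-scratch/ existed.
bounds = ["a", "sc", "se", "t"]
existing = [
    "apache-2.4.tar.gz",
    "scons-3.0.tar.gz",
    "sed-4.4.tar.xz",
    "tar-1.30.tar.xz",
]
# A thousand new distfiles sharing the 'scratch-' prefix.
new = [f"scratch-{n}.tar.gz" for n in range(1000)]

counts = Counter(pick_dir(bounds, f) for f in existing + new)

if __name__ == "__main__":
    for name in bounds:
        print(name, counts[name])
```

Every new file lands in 'sc'. Fixing that means adding a 'scratch'
bound and moving the files that were already placed, on every mirror
that holds them.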

-- 
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail   : robbat2@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
