From: Andrew Barchuk
To: gentoo-dev@lists.gentoo.org
Reply-to: gentoo-dev@lists.gentoo.org
Subject: Re: [gentoo-dev] [News item review] Portage rsync tree verification (v4)
Date: Sun, 28 Jan 2018 21:30:31 +0100
Message-Id: <1517171431.2109764.1251018832.6F16557B@webmail.messagingengine.com>
In-Reply-To: <1517129917.1270.1.camel@gentoo.org>
References: <1516874667.1833.4.camel@gentoo.org> <1517129917.1270.1.camel@gentoo.org>

Hi everyone,

> three possible solutions for splitting distfiles were listed:
>
> a. using initial portion of filename,
>
> b. using initial portion of file hash,
>
> c. using initial portion of filename hash.
>
> The significant advantage of the filename option was simplicity. With
> that solution, the users could easily determine the correct subdirectory
> themselves. However, it's significant disadvantage was very uneven
> shuffling of data. In particular, the TeΧ Live packages alone count
> almost 23500 distfiles and all use a common prefix, making it impossible
> to split them further.
>
> The alternate option of using file hash has the advantage of having
> a more balanced split.
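To make the comparison concrete, here is how I read the three quoted
options, as a minimal Python sketch. The function names, the
two-character prefix length and the choice of BLAKE2b are my assumptions
for illustration only; the quoted text doesn't fix any of them.

```python
#!/usr/bin/env python3
import hashlib


def by_name_prefix(filename, length=2):
    """Option (a): the subdirectory is the first `length` characters of the name."""
    return filename[:length]


def by_content_hash(data, length=2):
    """Option (b): the subdirectory is a prefix of the hash of the file contents."""
    # BLAKE2b is an assumed choice; any stable hash would do for the sketch.
    return hashlib.blake2b(data).hexdigest()[:length]


def by_name_hash(filename, length=2):
    """Option (c): the subdirectory is a prefix of the hash of the file name."""
    return hashlib.blake2b(filename.encode("utf-8")).hexdigest()[:length]


if __name__ == "__main__":
    name = "texlive-module-example.tar.xz"  # made-up file name
    print(by_name_prefix(name))
    print(by_content_hash(b"example contents"))
    print(by_name_hash(name))
```

With (a), every name starting with texlive-module- lands in the same
directory, which is the imbalance described in the quote. With (b) and
(c) the directory can't be guessed from the name without computing a
hash.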

There's another option: use character ranges for each directory,
computed so that the files are distributed evenly. One way to do that is
to use a filename prefix of dynamic length so that each range holds the
same number of files. E.g. we would have Ab/, Ap/, Ar/ but
texlive-module-te/, texlive-module-th/, texlive-module-ti/. A similar but
simpler option is to use file names as range bounds (the same way
dictionaries use words to demarcate page bounds): each directory is named
after the first file located inside it. This way files are distributed
evenly, and it's still easy to pick the correct directory for a file by
hand.

I have implemented a sketch of distfiles splitting that uses file names
as bounds in Python to demonstrate the idea (excuse possibly
non-idiomatic code, I'm not well versed in Python):

$ cat distfile-dirs.py
#!/usr/bin/env python3
"""
Builds the list of dictionary-style directories to split the input files
into evenly. Each directory is named after the first file located in it.
Takes the number of directories as an argument and reads the list of
files from stdin. The resulting list of directories is printed to stdout.
"""
import sys

dir_num = int(sys.argv[1])
# sort here so that the input does not have to be pre-sorted
distfiles = sorted(sys.stdin.read().splitlines())
distfile_num = len(distfiles)
dir_size = distfile_num / dir_num

# "0" as the first directory allows adding files in the beginning
# without repartitioning
dirs = ["0"]
for i in range(1, dir_num):
    # multiply instead of accumulating to avoid floating-point drift;
    # clamp so that rounding can never index past the last file
    index = min(round(i * dir_size), distfile_num - 1)
    dirs.append(distfiles[index])
print("/\n".join(dirs) + "/")

$ cat pick-distfiles-dir.py
#!/usr/bin/env python3
"""
Picks the directory for a given file name. Takes a distfile name as an
argument. Reads the sorted list of directories from stdin; the name of
each directory is assumed to be the name of the first file located
inside.
"""
import bisect
import sys

distfile = sys.argv[1]
# strip the trailing "/" so that a directory name compares equal to the
# name of its first file
dirs = [d.rstrip("/") for d in sys.stdin.read().splitlines()]

# pick the rightmost directory whose name is <= distfile; names sorting
# before the first directory fall into the first one
index = max(bisect.bisect_right(dirs, distfile) - 1, 0)
print(dirs[index] + "/")

$ # distfiles.txt contains all the distfile names
$ head -n5 distfiles.txt
0CD9CDDE3F56BB5250D87C54592F04CBC24F03BF-wagon-provider-api-2.10.jar
0CE1EDB914C94EBC388F086C6827E8BDEEC71AC2-commons-lang-2.6.jar
0DCC973606CBD9737541AA5F3E76DED6E3F4D0D0-iri.jar
0ad-0.0.22-alpha-unix-build.tar.xz
0ad-0.0.22-alpha-unix-data.tar.xz

$ # calculate 500 directories to split distfiles into evenly
$ cat distfiles.txt | ./distfile-dirs.py 500 > dirs.txt
$ tail -n5 dirs.txt
xrmap-2.29.tar.bz2/
xview-3.2p1.4-18c.tar.gz/
yasat-700.tar.gz/
yubikey-manager-qt-0.4.0.tar.gz/
zimg-2.5.1.tar.gz/

$ # pick a directory for xvinfo-1.0.1.tar.bz2
$ cat dirs.txt | ./pick-distfiles-dir.py xvinfo-1.0.1.tar.bz2
xview-3.2p1.4-18c.tar.gz/

Using the approach above, the files will be distributed evenly among the
directories while keeping the possibility to determine the directory for
a specific file by hand. If necessary, the directory structure can be
kept unchanged for a very long time and it will likely stay
well-balanced. Picking a directory for a file is very cheap: it is a
binary search over the directory list. The only obvious downside I see is
that it's necessary to know the list of directories to pick the correct
one (this can be mitigated by caching the list of directories if
important).

If it's desirable to make directory names shorter or to look less like
file names, that's fairly easy to achieve by keeping only the unique
prefixes of the directory names. For example:

xrmap-2.29.tar.bz2/
xview-3.2p1.4-18c.tar.gz/
yasat-700.tar.gz/
yubikey-manager-qt-0.4.0.tar.gz/
zimg-2.5.1.tar.gz/

will become

xr/
xv/
ya/
yu/
z/

Thanks for taking the time to consider the suggestion.

---
Andrew