From: Richard Yao
To: gentoo-dev@lists.gentoo.org
Date: Mon, 21 Oct 2019 12:42:37 -0400
Subject: Re: [gentoo-dev] New distfile mirror layout

> On Oct 20, 2019, at 2:51 AM, Michał Górny wrote:
>
> On Sat, 2019-10-19 at 19:24 -0400, Joshua Kinard wrote:
>>> On 10/18/2019 09:41, Michał Górny wrote:
>>> Hi, everybody.
>>>
>>> It is my pleasure to announce that yesterday (EU) evening we've switched
>>> to a new distfile mirror layout. Users will be switching to the new
>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>> already -- as their caches expire (24hrs).
>>>
>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>> having 60000+ files in a single directory has been a problem.
>>> However, I suppose some of you also found e.g. the directory index
>>> hardly usable due to its size.
>>>
>>> Throughout a transitional period (whose exact length hasn't been decided
>>> yet), both layouts will be available. Afterwards, the old layout will
>>> be removed from mirrors. This has a few implications:
>>>
>>> 1. Users who don't upgrade their package managers in time will lose
>>> the ability to fetch from Gentoo mirrors. This shouldn't be that
>>> much of a problem given that the core software needed to upgrade Portage
>>> should all have reliable upstream SRC_URIs.
>>>
>>> 2. mirror://gentoo/file URIs will stop working. While technically you
>>> could use mirror://gentoo/XX/file, I'd rather recommend finally
>>> discarding its usage and moving distfiles to devspace.
>>>
>>> 3. Directly fetching files from distfiles.gentoo.org will become
>>> a little harder. To fetch a distfile named 'foo-1.tar.gz', you'd have
>>> to use something like:
>>>
>>> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
>>> 1b
>>> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
>>> ...
>>>
>>>
>>> Alternatively, you can:
>>>
>>> $ wget http://distfiles.gentoo.org/distfiles/INDEX
>>>
>>> and grep for the right path there. This INDEX is also a more
>>> lightweight alternative to HTML indexes generated by the servers.
>>>
>>>
>>> If you're interested in more background details and some plots, see [1].
>>>
>>> [1] https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html
>>>
>>
>> So the answer I didn't really see directly stated here is, where do new
>> distfiles need to go //now//? E.g., if on woodpecker, I currently cp a
>> distfile to /space/distfiles-local. What is the new directory I need to
>> use? And if mirror://gentoo/${FOO} is going away, for the new distfiles
>> target, what would be the applicable prefix to use?
>>
>> Directly using devspace seems like a bad idea, IMHO. Once long ago, we all
>> got chastised for doing exactly that. Too much possibility of fragmentation
>> as devs retire or package maintainership changes hands.
>
> Today you get chastised for using /space/distfiles-local and not
> following policy changes. The devmanual states that it has been deprecated
> since at least 2011, and talks of using d.g.o [1].
>
>> I looked at the whitepaper'ish-like writeup, and I kinda don't like using a
>> hash-based naming scheme on the new distfiles layout. I really kind of prefer
>> breaking the directories up based on the first letter of the distfiles in
>> question, factoring case-sensitivity in (so you'd have 52 top-level
>> directories for A-Z and a-z, plus 10 more for 0-9). Under each of those
>> directories, additional subdirectories for the next few letters (say,
>> letters 2-3). Yes, this leads to some orphan cases where a distfile might
>> live on its own, but from a direct navigation standpoint, it's easy to find
>> for someone browsing the distfiles server and easy to predict where a
>> distfile is at.
>>
>> No math, statistical analysis, or deep-rooted knowledge of filesystems
>> behind that paragraph. Just a plain old unfiltered opinion. Sometimes, I
>> need to go get a distfile off the Gentoo mirrors, and being able to quickly
>> find it in the mirror root is great.
>> Having to do hash calculations to work
>> out the file path will be *really* annoying.
>
> Your solution still doesn't solve the problem of having 8k-24k files
> in a single directory, even if you use 7 letters of prefix. So it just
> creates a lot of tiny directory noise for no practical gain.
>
> [1] https://devmanual.gentoo.org/general-concepts/mirrors/index.html#suitable-download-hosts

If we consider the access frequency, it might actually not be that bad. Consider a simple example with 500 files and two directory buckets. If we put 250 in each, then every lookup scans a directory of 250 files. However, if 50 files account for 90% of accesses, then putting those 50 into one directory and the other 450 into another gives an access-weighted average directory size of 0.9 * 50 + 0.1 * 450 = 90, so the O(n) directory lookup performs as if each directory held only 90 files.

I am not sure if we should be discarding all other considerations to make changes to benefit filesystems with O(n) directory lookup, but if we are, then the hashing approach is not necessarily the best one. It is only the best when all files are accessed with equal frequency, which would be an incorrect assumption. A more human-friendly approach might still be better. I doubt that we have the data to determine that, though.

Also, another idea is to use a cheap hash function (e.g. Fletcher) and just have the mirrors do the hashing behind the scenes. Then we would have the best of both worlds.

>
> --
> Best regards,
> Michał Górny
>
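
As a rough illustration of the access-frequency argument above, the sketch below computes the access-weighted average directory size for a given split of files across directories. The function name expected_dir_size and the (file count, access share) representation are assumptions made for this example and are not part of Portage or the mirror tooling. The 50/50 access shares given to the even layout are also arbitrary, since both directories there have the same size.

```python
def expected_dir_size(buckets):
    """Return the access-weighted average directory size.

    buckets is a list of (file_count, access_share) pairs, one per
    directory. access_share is the fraction of all lookups that land
    in that directory, so the shares must sum to 1.
    This is a hypothetical helper for illustration only.
    """
    total_share = sum(share for _, share in buckets)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("access shares must sum to 1")
    # With O(n) directory lookup, the cost of a lookup scales with the
    # size of the directory it lands in, so weight each size by how
    # often that directory is hit.
    return sum(count * share for count, share in buckets)


# Even split: 250 files per directory (access shares assumed equal).
even_split = [(250, 0.5), (250, 0.5)]

# Skewed split: the 50 hot files (90% of accesses) in one directory,
# the remaining 450 files (10% of accesses) in the other.
skewed_split = [(50, 0.9), (450, 0.1)]

print("even split:  ", expected_dir_size(even_split))
print("skewed split:", expected_dir_size(skewed_split))
```

With the numbers from the example, the even split comes out at 250 and the skewed split at 90. Real access patterns would have to be measured on the mirrors before drawing any conclusion from this.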