From: Richard Yao
Date: Sat, 19 Oct 2019 18:48:39 -0400
Subject: Re: [gentoo-dev] New distfile mirror layout
To: gentoo-dev@lists.gentoo.org
Reply-to: gentoo-dev@lists.gentoo.org

> On Oct 19, 2019, at 4:03 PM, Michał Górny wrote:
>
> On Sat, 2019-10-19 at 15:26 -0400, Richard Yao wrote:
>>>> On Oct 18, 2019, at 9:10 PM, Richard Yao wrote:
>>>
>>>>> On Oct 18, 2019, at 4:49 PM, Michał Górny wrote:
>>>> On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
>>>>>>>>>> On Oct 18, 2019, at 9:42 AM, Michał Górny wrote:
>>>>>>>>> Hi, everybody.
>>>>>>>>> It is my pleasure to announce that yesterday (EU) evening we've switched
>>>>>>>>> to a new distfile mirror layout. Users will be switching to the new
>>>>>>>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>>>>>>>> already -- as their caches expire (24hrs).
>>>>>>>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>>>>>>>> having 60000+ files in a single directory has been a problem.
>>>>>>>>> However, I suppose some of you also found e.g. the directory index
>>>>>>>>> hardly usable due to its size.
>>>>> This sounds like a filesystem issue. Do we know which filesystems are suffering?
>>>>> ZFS should be fine. I believe ext2/ext3 have problems with this many files. ext4 is probably okay, but don’t quote me on that.
>>>> Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
>>>> may apply only to older ntfs versions. NFS has been mentioned too.
>>>
>>> ext2 and vfat are not surprises to me (outside of the idea that anyone would use them for a mirror). NTFS and NFS are though.
>>>> However, just because modern filesystems can handle them efficiently, it
>>>> doesn't mean having directories that huge comes with zero cost.
>>> While I am okay with the change, what do you mean when you say that having huge directories does not come with zero cost?
>>>
>>> Filesystems with O(1) directory lookups like ZFS would probably be hurt by this, but the impact should be negligible. Filesystems with O(log n) directory lookups would see faster directory lookups.
>>>
>>> Outside of directory lookups, this could speed up searches and sort operations when listing everything, with just about any filesystem benefiting from the improvement.
>>>
>>> Listing directories on such filesystems should not benefit from this unless you are using ls, where the default behavior is to sort the directory contents (which is where the improvement when sorting comes into play). The need to sort the directory contents by default keeps ls from displaying anything until it has scanned the entire directory. The asymptotic complexity of a fast comparison-based sort improves in this situation from O(n log n) to O(n log(n/b)) for n files spread evenly over b subdirectories, provided that you sort each subdirectory independently. A further speedup could be obtained by using multithreading to parallelize the sort operations.
>> I read your original email late at night and I misread the description of how this works.
>>
>> At an initial glance, I thought we were doing a prefix approach (with the caveat that buckets are unbalanced). In reality, we are doing a cryptographic hash of the filenames.
>>
>> That would keep all buckets balanced, which gives the best directory lookup times on O(log n) lookup filesystems, but I think there is something to be gained from using the less optimal approach of using filename prefixes:
>>
>> * some regex searches on distfiles can be accelerated
>> * generating a sorted list of all distfiles becomes asymptotically faster
>> * it is easy for a user to find all versions of a given distfile
>> * no need to calculate a cryptographic hash
>>
>> I realize that I am late to propose it, but could we consider a switch to this alternative arrangement?
>
> No, we can't. Please read either the original discussion on the bug, or
> the linked article. It's explained in detail why this won't work.

Alright. I am convinced. Thanks.

>
> --
> Best regards,
> Michał Górny
>
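
For readers comparing the two bucketing schemes discussed in this thread, here is a minimal sketch in Python. It is not Portage's implementation. The thread only says that a cryptographic hash of the filename is used, so the choice of BLAKE2b, the two-hex-digit bucket name, and every function and file name below are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def hash_bucket(filename: str, hex_chars: int = 2) -> str:
    """Bucket named after the leading hex digits of a digest of the filename.

    Assumption: BLAKE2b and two hex digits are illustrative choices only.
    """
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    return digest[:hex_chars]


def prefix_bucket(filename: str, chars: int = 1) -> str:
    """Bucket named after the first characters of the filename itself."""
    return filename[:chars]


def group(filenames, bucket_fn):
    """Map each bucket name to the list of filenames that fall into it."""
    buckets = defaultdict(list)
    for name in filenames:
        buckets[bucket_fn(name)].append(name)
    return dict(buckets)


def sorted_via_buckets(filenames, bucket_fn):
    """Sort each bucket independently, then concatenate buckets in key order."""
    buckets = group(filenames, bucket_fn)
    result = []
    for key in sorted(buckets):
        result.extend(sorted(buckets[key]))
    return result


if __name__ == "__main__":
    # Hypothetical distfile names, for illustration only.
    names = [
        "zlib-1.2.11.tar.gz",
        "bash-5.0.tar.gz",
        "zlib-1.2.8.tar.gz",
        "gcc-9.2.0.tar.xz",
        "bash-4.4.tar.gz",
    ]
    for label, fn in (("prefix", prefix_bucket), ("hash", hash_bucket)):
        print(label, group(names, fn))
    # With prefix buckets the concatenation is already globally sorted.
    print(sorted_via_buckets(names, prefix_bucket) == sorted(names))
```

With prefix buckets, all versions of one distfile land in the same directory, and concatenating the independently sorted buckets yields a globally sorted list. With hash buckets, related files scatter across directories and a global listing needs a final merge or sort, but bucket sizes stay balanced regardless of how the filenames are distributed.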