From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from lists.gentoo.org (pigeon.gentoo.org [208.92.234.80]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by finch.gentoo.org (Postfix) with ESMTPS id 254C1138334 for ; Sat, 19 Oct 2019 01:10:00 +0000 (UTC) Received: from pigeon.gentoo.org (localhost [127.0.0.1]) by pigeon.gentoo.org (Postfix) with SMTP id 59927E0972; Sat, 19 Oct 2019 01:09:48 +0000 (UTC) Received: from smtp.gentoo.org (mail.gentoo.org [IPv6:2001:470:ea4a:1:5054:ff:fec7:86e4]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by pigeon.gentoo.org (Postfix) with ESMTPS id C1460E0900 for ; Sat, 19 Oct 2019 01:09:47 +0000 (UTC) Received: from [192.168.5.101] (pool-96-232-115-28.nycmny.fios.verizon.net [96.232.115.28]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) (Authenticated sender: ryao) by smtp.gentoo.org (Postfix) with ESMTPSA id E1FF234C032 for ; Sat, 19 Oct 2019 01:09:46 +0000 (UTC) Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable From: Richard Yao Precedence: bulk List-Post: List-Help: List-Unsubscribe: List-Subscribe: List-Id: Gentoo Linux mail X-BeenThere: gentoo-dev@lists.gentoo.org Reply-to: gentoo-dev@lists.gentoo.org X-Auto-Response-Suppress: DR, RN, NRN, OOF, AutoReply Mime-Version: 1.0 (1.0) Date: Fri, 18 Oct 2019 21:09:44 -0400 Message-Id: <43F5DC91-E5D0-40EA-A1DA-A354C1CB7A16@gentoo.org> Subject: Re: [gentoo-dev] New distfile mirror layout References: <02507080f1e18d0382f551239819fb784cb0c05d.camel@gentoo.org> In-Reply-To: <02507080f1e18d0382f551239819fb784cb0c05d.camel@gentoo.org> To: gentoo-dev@lists.gentoo.org X-Mailer: iPhone Mail (17A860) X-Archives-Salt: 5f4955ff-2d1d-433c-9447-25c89fe23e02 X-Archives-Hash: 1fdd5aa19cecb1511896f50f1bb20212 > On Oct 18, 2019, at 4:49 PM, Micha=C5=82 G=C3=B3rny wr= ote: >=20 > =EF=BB=BFOn Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote: >>>>>> On Oct 18, 2019, at 9:42 AM, Micha=C5=82 G=C3=B3rny wrote: >>>>> =EF=BB=BFHi, everybody. >>>>> It is my pleasure to announce that yesterday (EU) evening we've switch= ed >>>>> to a new distfile mirror layout. Users will be switching to the new >>>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded= >>>>> already -- as their caches expire (24hrs). >>>>> The new layout is mostly a bow towards mirror admins, for some of whom= >>>>> having a 60000+ files in a single directory have been a problem. >>>>> However, I suppose some of you also found e.g. the directory index >>>>> hardly usable due to its size. >> This sounds like a filesystem issue. Do we know which filesystems are suf= fering? >> ZFS should be fine. I believe ext2/ext3 have problems with this many file= s. ext4 is probably okay, but don=E2=80=99t quote me on that. >=20 > Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this > may apply only to older ntfs versions. NFS has been mentioned too. ext2 and vfat are not surprises to me (outside of the idea that anyone would= use them for a mirror). NTFS and NFS are though. >=20 > However, just because modern filesystems can handle them efficiently, it > doesn't mean having directories that huge comes with zero cost. While I am okay with the change, what do you mean when you say that having h= uge directories does not come with zero cost? Filesystems with O(1) directory lookups like ZFS would probably be hurt by t= his, but the impact should be negligible. Filesystems with O(log n) director= y lookups would see faster directory lookups. Outside of directory lookups, this could speed up up searches and sort opera= tions when listing everything with just about any filesystem benefiting from= the improvement. Listing directories on such filesystems should not benefit from this unless y= ou are using ls where the default behavior is to sort the directory contents= (which is where the improvement when sorting comes into play). The need to s= ort the directory contents by default keeps ls from displaying anything unti= l it has scanned the entire directory. The asymptotic complexity of a fast c= omparison based sort improves in this situation from O(nlogn) to O(nlog(n/b)= ) provided that you sort each subdirectory independently. A further speed up= could be obtained by doing multithreading to parallelize the sort operation= s. Since I know someone will call me out on that comment, I will explain. Each b= ucket has roughly n/b items in it where n is the total number and b is the n= umber of buckets. Sorting one bucket is O(n/b * log(n/b)). Loop to sort each= of the b buckets. The buckets are pre-sorted by prefix, so the result is no= w sorted. You therefore get O(nlog(n/b)) time complexity out of an O(nlogn) c= omparison sort on this very special case where you call it multiple times on= data that has been persorted by prefix into buckets. Is there any other benefit to this or did I get everything? By the way, it is offtopic for the thread, but it occurs to me that a hybrid= of radix sort and A comparison based sort could give us a general sorting a= lgorithm that is asymptotically faster than O(nlogn). >=20 > [1] https://bugs.gentoo.org/534528 >=20 > --=20 > Best regards, > Micha=C5=82 G=C3=B3rny