From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from lists.gentoo.org (pigeon.gentoo.org [208.92.234.80]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by finch.gentoo.org (Postfix) with ESMTPS id B9A3B138334 for ; Sat, 19 Oct 2019 19:26:38 +0000 (UTC) Received: from pigeon.gentoo.org (localhost [127.0.0.1]) by pigeon.gentoo.org (Postfix) with SMTP id 4E230E0901; Sat, 19 Oct 2019 19:26:34 +0000 (UTC) Received: from smtp.gentoo.org (woodpecker.gentoo.org [IPv6:2001:470:ea4a:1:5054:ff:fec7:86e4]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by pigeon.gentoo.org (Postfix) with ESMTPS id EC166E08EB for ; Sat, 19 Oct 2019 19:26:33 +0000 (UTC) Received: from [IPv6:2600:1:f40b:903f:38d9:a5c0:d5da:a361] (unknown [IPv6:2600:1:f40b:903f:38d9:a5c0:d5da:a361]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) (Authenticated sender: ryao) by smtp.gentoo.org (Postfix) with ESMTPSA id 4879A34C06D for ; Sat, 19 Oct 2019 19:26:31 +0000 (UTC) Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable From: Richard Yao Precedence: bulk List-Post: List-Help: List-Unsubscribe: List-Subscribe: List-Id: Gentoo Linux mail X-BeenThere: gentoo-dev@lists.gentoo.org Reply-to: gentoo-dev@lists.gentoo.org X-Auto-Response-Suppress: DR, RN, NRN, OOF, AutoReply Mime-Version: 1.0 (1.0) Date: Sat, 19 Oct 2019 15:26:26 -0400 Message-Id: Subject: Re: [gentoo-dev] New distfile mirror layout In-Reply-To: <43F5DC91-E5D0-40EA-A1DA-A354C1CB7A16@gentoo.org> References: <43F5DC91-E5D0-40EA-A1DA-A354C1CB7A16@gentoo.org> To: gentoo-dev@lists.gentoo.org X-Mailer: iPhone Mail (17A860) X-Archives-Salt: 7d8be566-4546-49b3-b660-48c280c448e3 X-Archives-Hash: 2ca6d2d69ac92172366d2a3581dbab71 > On Oct 18, 2019, at 9:10 PM, Richard Yao wrote: >=20 > =EF=BB=BF >>> On Oct 18, 2019, at 4:49 PM, Micha=C5=82 G=C3=B3rny w= rote: >> =EF=BB=BFOn Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote: >>>>>>>> On Oct 18, 2019, at 9:42 AM, Micha=C5=82 G=C3=B3rny wrote: >>>>>>> =EF=BB=BFHi, everybody. >>>>>>> It is my pleasure to announce that yesterday (EU) evening we've swit= ched >>>>>>> to a new distfile mirror layout. Users will be switching to the new= >>>>>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgrad= ed >>>>>>> already -- as their caches expire (24hrs). >>>>>>> The new layout is mostly a bow towards mirror admins, for some of wh= om >>>>>>> having a 60000+ files in a single directory have been a problem. >>>>>>> However, I suppose some of you also found e.g. the directory index >>>>>>> hardly usable due to its size. >>> This sounds like a filesystem issue. Do we know which filesystems are su= ffering? >>> ZFS should be fine. I believe ext2/ext3 have problems with this many fil= es. ext4 is probably okay, but don=E2=80=99t quote me on that. >> Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this >> may apply only to older ntfs versions. NFS has been mentioned too. >=20 > ext2 and vfat are not surprises to me (outside of the idea that anyone wou= ld use them for a mirror). NTFS and NFS are though. >> However, just because modern filesystems can handle them efficiently, it >> doesn't mean having directories that huge comes with zero cost. > While I am okay with the change, what do you mean when you say that having= huge directories does not come with zero cost? >=20 > Filesystems with O(1) directory lookups like ZFS would probably be hurt by= this, but the impact should be negligible. Filesystems with O(log n) direct= ory lookups would see faster directory lookups. >=20 > Outside of directory lookups, this could speed up up searches and sort ope= rations when listing everything with just about any filesystem benefiting fr= om the improvement. >=20 > Listing directories on such filesystems should not benefit from this unles= s you are using ls where the default behavior is to sort the directory conte= nts (which is where the improvement when sorting comes into play). The need t= o sort the directory contents by default keeps ls from displaying anything u= ntil it has scanned the entire directory. The asymptotic complexity of a fas= t comparison based sort improves in this situation from O(nlogn) to O(nlog(n= /b)) provided that you sort each subdirectory independently. A further speed= up could be obtained by doing multithreading to parallelize the sort operat= ions. I read your original email late at night and I misread the description of ho= w this works. At an initial glance, I thought we were doing a prefix approach (with the ca= veat that buckets are unbalanced). In reality, we are doing a cryptographic h= ash of the filenames. That would keep all buckets balanced, which gives the best directory lookup t= imes on O(log n) lookup filesystems, but I think there is something to be ga= ined from using the less optimal approach of using filename prefixes: * some regex searches on distfiles can be accelerated * generating a sorted list of all distfiles becomes asymptotically faster * it is easy for a user to find all versions of a given distfile * no need to calculate a cryptographic hash I realize that I am late to propose it, but could we consider a switch to th= is alternative arrangement? The bulk of the performance gain should be realized with either approach. > Since I know someone will call me out on that comment, I will explain. Eac= h bucket has roughly n/b items in it where n is the total number and b is th= e number of buckets. Sorting one bucket is O(n/b * log(n/b)). Loop to sort e= ach of the b buckets. The buckets are pre-sorted by prefix, so the result is= now sorted. You therefore get O(nlog(n/b)) time complexity out of an O(nlog= n) comparison sort on this very special case where you call it multiple time= s on data that has been persorted by prefix into buckets. >=20 > Is there any other benefit to this or did I get everything? >=20 > By the way, it is offtopic for the thread, but it occurs to me that a hybr= id of radix sort and A comparison based sort could give us a general sorting= algorithm that is asymptotically faster than O(nlogn). >> [1] https://bugs.gentoo.org/534528 >> -- >> Best regards, >> Micha=C5=82 G=C3=B3rny