On Sat, 2019-10-19 at 15:26 -0400, Richard Yao wrote:
> > On Oct 18, 2019, at 9:10 PM, Richard Yao wrote:
> >
> > > > On Oct 18, 2019, at 4:49 PM, Michał Górny wrote:
> > > On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
> > > > > > > > > On Oct 18, 2019, at 9:42 AM, Michał Górny wrote:
> > > > > > > > Hi, everybody.
> > > > > > > >
> > > > > > > > It is my pleasure to announce that yesterday (EU) evening we've switched
> > > > > > > > to a new distfile mirror layout. Users will be switching to the new
> > > > > > > > layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
> > > > > > > > already -- as their caches expire (24hrs).
> > > > > > > >
> > > > > > > > The new layout is mostly a bow towards mirror admins, for some of whom
> > > > > > > > having 60000+ files in a single directory has been a problem.
> > > > > > > > However, I suppose some of you also found e.g. the directory index
> > > > > > > > hardly usable due to its size.
> > > > This sounds like a filesystem issue. Do we know which filesystems are suffering?
> > > > ZFS should be fine. I believe ext2/ext3 have problems with this many files. ext4 is probably okay, but don’t quote me on that.
> > > Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
> > > may apply only to older NTFS versions. NFS has been mentioned too.
> >
> > ext2 and vfat are not surprises to me (outside of the idea that anyone would use them for a mirror). NTFS and NFS are, though.
> > > However, just because modern filesystems can handle them efficiently, it
> > > doesn't mean having directories that huge comes with zero cost.
> > While I am okay with the change, what do you mean when you say that having huge directories does not come with zero cost?
> >
> > Filesystems with O(1) directory lookups like ZFS would probably be hurt by this, but the impact should be negligible. Filesystems with O(log n) directory lookups would see faster directory lookups.
> >
> > Outside of directory lookups, this could speed up searches and sort operations when listing everything, with just about any filesystem benefiting from the improvement.
> >
> > Listing directories on such filesystems should not benefit from this unless you are using ls, where the default behavior is to sort the directory contents (which is where the improvement when sorting comes into play). The need to sort the directory contents by default keeps ls from displaying anything until it has scanned the entire directory. The asymptotic complexity of a fast comparison-based sort improves in this situation from O(n log n) to O(n log(n/b)), with b being the number of subdirectories, provided that you sort each subdirectory independently. A further speedup could be obtained by using multithreading to parallelize the sort operations.
> I read your original email late at night and I misread the description of how this works.
>
> At an initial glance, I thought we were doing a prefix approach (with the caveat that buckets are unbalanced). In reality, we are doing a cryptographic hash of the filenames.
>
> That would keep all buckets balanced, which gives the best directory lookup times on O(log n) lookup filesystems, but I think there is something to be gained from using the less optimal approach of using filename prefixes:
>
> * some regex searches on distfiles can be accelerated
> * generating a sorted list of all distfiles becomes asymptotically faster
> * it is easy for a user to find all versions of a given distfile
> * no need to calculate a cryptographic hash
>
> I realize that I am late to propose it, but could we consider a switch to this alternative arrangement?

No, we can't. Please read either the original discussion on the bug,
or the linked article. It's explained in detail why this won't work.

-- 
Best regards,
Michał Górny
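
For readers who want to see the bucket-balance caveat from the quoted discussion in concrete terms, here is a minimal Python sketch. It is an illustration under stated assumptions, not the mirror implementation: the file names are made up, and taking the first two hex digits of a BLAKE2b digest as the directory name is assumed here only because the thread says a cryptographic hash of the filename is used. The `sort_cost` helper simply evaluates the n log(n/b) estimate quoted above for evenly sized buckets.

```python
import hashlib
import math
from collections import Counter


def prefix_bucket(filename: str) -> str:
    """Bucket by the first character of the file name (the prefix idea)."""
    return filename[0].lower()


def hash_bucket(filename: str) -> str:
    """Bucket by the first two hex digits of a BLAKE2b digest (assumed scheme)."""
    return hashlib.blake2b(filename.encode("utf-8")).hexdigest()[:2]


def bucket_sizes(filenames, bucket_of):
    """Count how many file names land in each bucket."""
    return Counter(bucket_of(name) for name in filenames)


def sort_cost(n: int, b: int = 1) -> float:
    """Rough comparison count for sorting n names split evenly into b buckets."""
    return n * math.log2(n / b)


# Hypothetical distfile names: many packages share a few popular prefixes.
names = (
    [f"python-{i}.tar.gz" for i in range(3000)]
    + [f"perl-{i}.tar.gz" for i in range(2000)]
    + [f"zlib-{i}.tar.xz" for i in range(50)]
)

by_prefix = bucket_sizes(names, prefix_bucket)
by_hash = bucket_sizes(names, hash_bucket)

print("prefix buckets:", len(by_prefix), "largest:", max(by_prefix.values()))
print("hash buckets:  ", len(by_hash), "largest:", max(by_hash.values()))
print("sort cost, flat:   ", round(sort_cost(60000)))
print("sort cost, b = 256:", round(sort_cost(60000, 256)))
```

In this made-up sample the prefix scheme puts every file starting with "p" into a single directory, which is the imbalance the caveat refers to, while the hash scheme spreads the same names across up to 256 directories of similar size.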