public inbox for gentoo-dev@lists.gentoo.org
 help / color / mirror / Atom feed
From: Richard Yao <ryao@gentoo.org>
To: gentoo-dev@lists.gentoo.org
Subject: Re: [gentoo-dev] New distfile mirror layout
Date: Sat, 19 Oct 2019 04:20:48 -0400	[thread overview]
Message-ID: <988EB80E-A853-400F-91A9-17BEA86C71F3@gentoo.org> (raw)
In-Reply-To: <b5f42e8a60de6b5ff950bbbe1de4d97c433e88f0.camel@gentoo.org>



> On Oct 19, 2019, at 2:17 AM, Michał Górny <mgorny@gentoo.org> wrote:
> 
> On Fri, 2019-10-18 at 21:09 -0400, Richard Yao wrote:
>>>> On Oct 18, 2019, at 4:49 PM, Michał Górny <mgorny@gentoo.org> wrote:
>>> 
>>> On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
>>>>>>>> On Oct 18, 2019, at 9:42 AM, Michał Górny <mgorny@gentoo.org> wrote:
>>>>>>> Hi, everybody.
>>>>>>> It is my pleasure to announce that yesterday (EU) evening we've switched
>>>>>>> to a new distfile mirror layout.  Users will be switching to the new
>>>>>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>>>>>> already -- as their caches expire (24hrs).
>>>>>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>>>>>> having a 60000+ files in a single directory have been a problem.
>>>>>>> However, I suppose some of you also found e.g. the directory index
>>>>>>> hardly usable due to its size.
>>>> This sounds like a filesystem issue. Do we know which filesystems are suffering?
>>>> ZFS should be fine. I believe ext2/ext3 have problems with this many files. ext4 is probably okay, but don’t quote me on that.
>>> 
>>> Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
>>> may apply only to older ntfs versions.  NFS has been mentioned too.
>> 
>> ext2 and vfat are not surprises to me (outside of the idea that anyone would use them for a mirror). NTFS and NFS are though.
> 
> Are you surprised that people use NTFS on Windows?  Or that they use
> local mirrors over NFS?  The latter still needs to be addressed
> separatel, provided that they mount it on DISTDIR.
I am surprised that it was an issue on NTFS because it uses B-trees. As for NFS, I had expected that to be more dependent on the local filesystem than on NFS itself. If it has a slowdown when used on a filesystem that had fast directory operations, that might be a bug.
> 
>>> However, just because modern filesystems can handle them efficiently, it
>>> doesn't mean having directories that huge comes with zero cost.
>> While I am okay with the change, what do you mean when you say that having huge directories does not come with zero cost?
>> 
>> Filesystems with O(1) directory lookups like ZFS would probably be hurt by this
> 
> O(1) or O(n)?
ZFS uses extendible hashing for its directories, so the data structure used is amortized O(1). You might consider it O(log n) due to the indirect tree traversal needed to find the direct block containing the hash table entry. With caching of indirect blocks, it should be amortized O(1) to find the direct block in practice as far as read IOs are considered. In addition, the base of the logarithm is 128 or 1024 depending on the pool feature flags.
> 
>> , but the impact should be negligible. Filesystems with O(log n) directory lookups would see faster directory lookups.
>> 
>> Outside of directory lookups, this could speed up up searches and sort operations when listing everything with just about any filesystem benefiting from the improvement.
>> 
>> Listing directories on such filesystems should not benefit from this unless you are using ls where the default behavior is to sort the directory contents (which is where the improvement when sorting comes into play). The need to sort the directory contents by default keeps ls from displaying anything until it has scanned the entire directory. The asymptotic complexity of a fast comparison based sort improves in this situation from O(nlogn) to O(nlog(n/b)) provided that you sort each subdirectory independently. A further speed up could be obtained by doing multithreading to parallelize the sort operations.
>> 
>> Since I know someone will call me out on that comment, I will explain. Each bucket has roughly n/b items in it where n is the total number and b is the number of buckets. Sorting one bucket is O(n/b * log(n/b)). Loop to sort each of the b buckets. The buckets are pre-sorted by prefix, so the result is now sorted. You therefore get O(nlog(n/b)) time complexity out of an O(nlogn) comparison sort on this very special case where you call it multiple times on data that has been persorted by prefix into buckets.
>> 
>> Is there any other benefit to this or did I get everything?
> 
> Listings for individual directories won't cause major pain to browsers
> anymore.  Not that there's much reason to do them.
That makes sense.
> 
> All kinds of per-direction operations will consume less memory
> and be potentially faster.
Userland would save memory when sorting or grepping a directory listing by virtue of having to process less data for grep and less data at a time for sorting (if it takes advantage of this). That would have performance benefits in userland.

The kernel would have little memory savings and in some cases might be slightly worse. It is negligible. Performance in the kernel ought to be slightly better on filesystems with O(log n) directory operations, but I would only expect the really bad ones to show much improvement.
> -- 
> Best regards,
> Michał Górny
> 



  reply	other threads:[~2019-10-19  8:20 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-10-18 13:41 [gentoo-dev] New distfile mirror layout Michał Górny
2019-10-18 19:53 ` Richard Yao
2019-10-18 20:49   ` Michał Górny
2019-10-19  1:09     ` Richard Yao
2019-10-19  6:17       ` Michał Górny
2019-10-19  8:20         ` Richard Yao [this message]
2019-10-19 19:26       ` Richard Yao
2019-10-19 20:02         ` Michał Górny
2019-10-19 22:48           ` Richard Yao
2019-10-22  0:46   ` James Cloos
2019-10-19 13:31 ` Fabian Groffen
2019-10-19 13:53   ` Michał Górny
2019-10-19 23:24 ` Joshua Kinard
2019-10-19 23:57   ` Alec Warner
2019-10-20  0:14     ` Joshua Kinard
2019-10-20  6:51   ` Michał Górny
2019-10-20  8:25     ` Joshua Kinard
2019-10-20  8:32       ` Michał Górny
2019-10-20  9:21         ` Joshua Kinard
2019-10-20  9:44           ` Michał Górny
2019-10-20 20:57             ` Joshua Kinard
2019-10-21  0:05               ` Joshua Kinard
2019-10-21  5:51                 ` Ulrich Mueller
2019-10-21 10:17                 ` Kent Fredric
2019-10-21 21:34                 ` Mikle Kolyada
2019-10-21 10:13               ` Kent Fredric
2019-10-23  5:16                 ` Joshua Kinard
2019-10-29 16:35                   ` Kent Fredric
2019-10-20 17:09       ` Matt Turner
2019-10-21 16:42     ` Richard Yao
2019-10-21 23:36       ` Matt Turner
2019-10-23  5:18         ` Joshua Kinard
2019-10-23 17:06           ` William Hubbs
2019-10-23 18:38             ` William Hubbs
2019-10-23 22:04           ` William Hubbs
2019-10-24  4:30             ` Michał Górny
2019-10-22  6:51       ` Jaco Kroon
2019-10-22  8:43         ` Ulrich Mueller
2019-10-22  8:46           ` Jaco Kroon
2019-10-23 23:47         ` ext4 readdir performance - was " Richard Yao
2019-10-24  0:01           ` Richard Yao
2019-10-23  1:21       ` Rich Freeman
2019-10-28 23:24     ` Chí-Thanh Christopher Nguyễn
2019-10-29  4:27       ` Michał Górny
2019-10-29  9:34         ` Fabian Groffen
2019-10-29 11:11           ` Michał Górny
2019-10-29 12:23             ` Ulrich Mueller
2019-10-29 12:43               ` Michał Górny
2019-10-29 13:03                 ` Ulrich Mueller
2019-10-29 13:09                   ` Ulrich Mueller
2019-10-29 13:52                     ` Michał Górny
2019-10-29 14:17                       ` Ulrich Mueller
2019-10-29 14:33                         ` Fabian Groffen
2019-10-29 14:45                           ` Michał Górny
2019-10-29 14:56                             ` Fabian Groffen
2019-10-29 13:51                   ` Michał Górny

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=988EB80E-A853-400F-91A9-17BEA86C71F3@gentoo.org \
    --to=ryao@gentoo.org \
    --cc=gentoo-dev@lists.gentoo.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox