Subject: Re: [gentoo-dev] New distfile mirror layout
To: gentoo-dev@lists.gentoo.org, Richard Yao
From: Jaco Kroon
Organization: Ultimate Linux Solutions (Pty) Ltd
Date: Tue, 22 Oct 2019 08:51:58 +0200

Hi All,


On 2019/10/21 18:42, Richard Yao wrote:

> If we consider the access frequency, it might actually not be that bad. Consider a simple example with 500 files and two directory buckets. If we have 250 in each, then the size of the directory is always 250. However, if 50 files are accessed 90% of the time, then putting 450 into one directory and that 50 into another directory, we end up with the performance of the O(n) directory lookup being consistent with there being only 90 files in each directory.
>
> I am not sure if we should be discarding all other considerations to make changes to benefit O(n) directory lookup filesystems, but if we are, then the hashing approach is not necessarily the best one. It is only the best when all files are accessed with equal frequency, which would be an incorrect assumption. A more human friendly approach might still be better. I doubt that we have the data to determine that though.
>
> Also, another idea is to use a cheap hash function (e.g. fletcher) and just have the mirrors do the hashing behind the scenes. Then we would have the best of both worlds.
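(For reference, the 90 in the example above is the expected scan length: 90% of lookups scan the 50-entry directory and 10% scan the 450-entry one, so 0.9 × 50 + 0.1 × 450 = 45 + 45 = 90 entries on average.)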


Experience:

ext4 sucks at targeted name lookups without the dir_index feature (O(n) lookups - it scans all entries in the folder).  With dir_index, readdir performance is crap.  Pick your poison I guess.  On most of our larger filesystems (2TB+, but especially the 80TB+ ones) we've reverted to disabling dir_index, as the benefit is outweighed by the crappy readdir() and glob() performance.
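For anyone wanting to experiment with this themselves, roughly how we toggle it (a sketch only - the device path is a placeholder, and the e2fsck step needs the filesystem unmounted):

tune2fs -l /dev/sdX1 | grep dir_index   # check whether dir_index is enabled
tune2fs -O ^dir_index /dev/sdX1         # clear the feature flag (existing
                                        # directories keep their hash trees)
e2fsck -fD /dev/sdX1                    # rebuild/optimise directories so the
                                        # change actually takes effect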

There doesn't seem to be a real specific tip-over point, and it seems to depend a lot on RAM availability and harddrive speed (obviously).  So if dentries get cached, disk speed becomes less of an issue.  However, on large folders (where I typically use 10k entries as the threshold for "large", based on "gut feeling" and "unquantifiable experience" and "nothing scientific at all") I find that even with lots of RAM two consecutive ls commands remain terribly slow.  Switch off dir_index and that becomes an order of magnitude faster.
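If you want to see the dentry-cache effect for yourself, something along these lines works (directory name and file count are made up; dropping caches needs root):

mkdir /tmp/bigdir && cd /tmp/bigdir
seq 1 100000 | xargs touch                  # create 100k empty files
cd /
sync && echo 3 > /proc/sys/vm/drop_caches   # drop page/dentry/inode caches
time ls /tmp/bigdir > /dev/null             # cold cache: disk-bound
time ls /tmp/bigdir > /dev/null             # warm cache: dentries now cached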

I don't have a great deal of experience with XFS, but on those systems where we do use it, it's generally in a VM, and our experience (again, not scientific, just perception) has been that it feels slower.

I'm in support of the change.  This will bucket into 256 folders and should give a reasonably even split between folders.  If required, a second layer could be introduced using the 3rd and 4th hex digits of the hash.  Any hash should be fine; it really doesn't need to be cryptographically strong, it just needs to provide a good spread and be really fast.  Generally a hash table should have a prime number of buckets to counteract hash bias, but frankly, that's overcomplicating the situation here.
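As a concrete sketch of the bucketing (the filename is illustrative, and b2sum is just a convenient fast hash that ships with coreutils - per the above, the exact hash doesn't matter much):

f="openssl-1.1.1d.tar.gz"
h=$(printf '%s' "$f" | b2sum | cut -c1-4)   # first four hex digits of the hash
echo "distfiles/${h:0:2}/${f}"              # first layer: 256 buckets
echo "distfiles/${h:0:2}/${h:2:2}/${f}"     # optional second layer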

I also agree with others that it used to be easy to get distfiles as and when needed, so an alternative structure could mirror that of the portage tree itself, in other words "cat/pkg/distfile".  This perhaps just shifts the issue:

jkroon@plastiekpoot /usr/portage $ find . -maxdepth 1 -type d -name "*-*" | wc -l
167
jkroon@plastiekpoot /usr/portage $ find *-* -maxdepth 1 -type d | wc -l
19412
jkroon@plastiekpoot /usr/portage $ for i in *-*; do echo $(find $i -maxdepth 1 -type d | wc -l) $i; done | sort -g | tail -n10
347 net-misc
373 media-sound
395 media-libs
399 dev-util
505 dev-libs
528 dev-java
684 dev-haskell
690 dev-ruby
1601 dev-perl
1889 dev-python

So that's an average of 116 sub folders under the top layer (only two categories over 1000), and then presumably less than 100 distfiles maximum per package?  Probably overkill, but it would (should) solve both the too-many-files-per-folder problem and the easy-lookup-by-hand issue.

I don't have a strong preference for either solution, but I do agree that "easy finding of distfiles" is handy.  The INDEX mechanism is fine for me.

Kind Regards,

Jaco