From mboxrd@z Thu Jan 1 00:00:00 1970
From: Richard Yao
To: gentoo-dev@lists.gentoo.org, Jaco Kroon
Reply-to: gentoo-dev@lists.gentoo.org
Date: Wed, 23 Oct 2019 20:01:12 -0400
Subject: Re: ext4 readdir performance - was Re: [gentoo-dev] New distfile mirror layout
Message-Id: <3E4DB3BE-8222-430D-AC50-E9B4441EE959@gentoo.org>

> On Oct 23, 2019, at 7:48 PM, Richard Yao wrote:
>
> On 10/22/19 2:51 AM, Jaco Kroon wrote:
>> Hi All,
>>
>>> On 2019/10/21 18:42, Richard Yao wrote:
>>>
>>> If we consider the access frequency, it might actually not be that
>>> bad. Consider a simple example with 500 files and two directory
>>> buckets. If we have 250 in each, then the size of the directory is
>>> always 250. However, if 50 files are accessed 90% of the time, then
>>> putting 450 into one directory and that 50 into another directory, we
>>> end up with the performance of the O(n) directory lookup being
>>> consistent with there being only 90 files in each directory.
>>>
>>> I am not sure if we should be discarding all other considerations to
>>> make changes to benefit O(n) directory lookup filesystems, but if we
>>> are, then the hashing approach is not necessarily the best one. It is
>>> only the best when all files are accessed with equal frequency, which
>>> would be an incorrect assumption. A more human friendly approach might
>>> still be better. I doubt that we have the data to determine that though.
>>>
>>> Also, another idea is to use a cheap hash function (e.g. fletcher) and
>>> just have the mirrors do the hashing behind the scenes. Then we would
>>> have the best of both worlds.
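
As a quick sanity check of the 90-entry figure in the example above, here
is a minimal sketch (the 90%/10% access split is the assumption stated in
that example; for an O(n) directory, lookup cost scales with the size of
the directory being scanned):

buckets = {"hot": 50, "cold": 450}      # files per directory
access = {"hot": 0.90, "cold": 0.10}    # fraction of lookups hitting each

# Expected number of entries an O(n) lookup has to scan, weighted by how
# often each directory is consulted.
expected_scan = sum(access[b] * buckets[b] for b in buckets)
print(expected_scan)  # 0.9*50 + 0.1*450 = 90.0

# The even 250/250 split makes every lookup pay for a 250-entry directory,
# while the skewed split behaves as if each directory held only ~90 files.
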
>>
>> Experience:
>>
>> ext4 sucks at targeted name lookups without the dir_index feature (O(n)
>> lookups - scans all entries in the folder). With dir_index, readdir
>> performance is crap. Pick your poison I guess. On most of our larger
>> filesystems (2TB+, but especially the 80TB+ ones) we've reverted to
>> disabling dir_index, as the benefit is outweighed by the crappy readdir()
>> and glob() performance.
>
> My read of the ext4 disk layout documentation is that the read operation
> should work mostly the same way, except with a penalty from reading
> larger directories caused by the addition of the tree's metadata and
> from having more partially filled blocks:
>
> https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Directory_Entries
>
> The code itself is the same traversal code:
>
> https://github.com/torvalds/linux/blob/v5.3/fs/ext4/dir.c#L106
>
> However, a couple of things stand out to me here at a glance:
>
> 1. `cond_resched()` adds scheduler delay for no apparent reason.
> `cond_resched()` is meant to be used in places where we could block
> excessively on non-PREEMPT kernels, but this doesn't strike me as one of
> those places. The fact that we block on disk on uncached reads naturally
> serves the same purpose, so an explicit rescheduling point here is
> redundant. PREEMPT kernels should perform better in readdir() on ext4 by
> virtue of making `cond_resched()` a no-op.

I just realized that the way that I worded this could be confusing, so
please allow me to clarify what I meant. cond_resched() is meant for when
a kernel thread will tie up a CPU for a long period of time. Blocking on
disk will cause the CPU to be released to another thread. When we do not
block on disk, this operation is quick. There is no good reason for
putting cond_resched() here as far as I can tell.

> 2. read-ahead is implemented in a way that appears to be over-reading
> the directory whenever the needed information is not cached. This is
> technically read-ahead, although it is not a great way of doing it. A
> much better way to do this would be to pipeline `readdir()` by
> initiating asynchronous read operations in anticipation of future reads.
>
> Both of these should affect both variants of ext4's directories, but the
> penalties I mentioned earlier mean that the dir_index variant would be
> affected more.
>
> If you have a way to benchmark things, a simple idea to evaluate would
> be deleting the `cond_resched()` line. If we had data showing an
> improvement, I would be happy to send a small one-line patch deleting
> the line to Ted to get the change into mainline.

The more I think about this, the more absurd having cond_resched() here
seems to me. I think I will sit on it for a few days. If it still seems
absurd to me after sitting on it, I'll send Ted a patch to delete that
line with the remark that the use of cond_resched() is redundant with
blocking on disk.

>> There doesn't seem to be a real specific tip-over point, and it seems to
>> depend a lot on RAM availability and hard drive speed (obviously). So if
>> dentries get cached, disk speed becomes less of an issue. However, on
>> large folders (where I typically use 10k as a value for large based on
>> "gut feeling" and "unquantifiable experience" and "nothing scientific at
>> all") I find that even with lots of RAM two consecutive ls commands
>> remain terribly slow. Switch off dir_index and that becomes an order of
>> magnitude faster.
>>
>> I don't have a great deal of experience with XFS, but on those systems
>> where we do it's generally on a VM, and perceivably (again, not
>> scientific) our experience has been that it feels slower. Again, not
>> scientific, just perception.
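
If it helps with the benchmarking mentioned above, here is a minimal
sketch of the "two consecutive listings" measurement (the directory path
is a placeholder; run it against a large folder with dir_index on and off,
or before and after a kernel change, and compare):

import os
import time

DIRECTORY = "/path/to/large/dir"  # placeholder for a large folder

def time_listing(path):
    # Time a full directory scan, roughly what a plain ls has to do.
    start = time.perf_counter()
    names = [entry.name for entry in os.scandir(path)]
    return time.perf_counter() - start, len(names)

# For the uncached case, drop the page/dentry caches first (needs root):
#   echo 3 > /proc/sys/vm/drop_caches
for attempt in ("first listing (possibly uncached)", "second listing (cached)"):
    elapsed, count = time_listing(DIRECTORY)
    print(f"{attempt}: {count} entries in {elapsed:.3f}s")
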
>>
>> I'm in support of the change. This will bucket to 256 folders and
>> should have a reasonably even split between folders. If required, a
>> second layer could be introduced by using the 3rd and 4th digits of the
>> hash. Any hash should be fine; it really doesn't need to be
>> cryptographically strong, it just needs to provide a good spread and be
>> really fast. Generally a hash table should have a prime number of
>> buckets to mitigate hash bias, but frankly, that's overcomplicating the
>> situation here.
>>
>> I also agree with others that it used to be easy to get distfiles as and
>> when needed, so an alternative structure could mirror that of the
>> portage tree itself, in other words "cat/pkg/distfile". This perhaps
>> just shifts the issue:
>>
>> jkroon@plastiekpoot /usr/portage $ find . -maxdepth 1 -type d -name
>> "*-*" | wc -l
>> 167
>> jkroon@plastiekpoot /usr/portage $ find *-* -maxdepth 1 -type d | wc -l
>> 19412
>> jkroon@plastiekpoot /usr/portage $ for i in *-*; do echo $(find $i
>> -maxdepth 1 -type d | wc -l) $i; done | sort -g | tail -n10
>> 347 net-misc
>> 373 media-sound
>> 395 media-libs
>> 399 dev-util
>> 505 dev-libs
>> 528 dev-java
>> 684 dev-haskell
>> 690 dev-ruby
>> 1601 dev-perl
>> 1889 dev-python
>>
>> So that's an average of 116 subfolders under the top layer (only two
>> over 1000), and then presumably fewer than 100 distfiles maximum per
>> package? Probably overkill, but it would (should) solve both the too
>> many files per folder issue and the easy lookup by hand issue.
>>
>> I don't have a preference for either solution though, but do agree that
>> "easy finding of distfiles" is handy. The INDEX mechanism is fine for me.
>>
>> Kind Regards,
>>
>> Jaco
>>
>
>
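
For concreteness, here is a minimal sketch of the two-level bucketing Jaco
describes above (the first two hex digits of a hash pick one of 256
top-level folders, and the 3rd and 4th digits can form an optional second
layer; the hash and the file names below are only illustrative, since the
thread leaves the actual hash choice open):

import hashlib

def bucket_path(distfile, layers=1):
    # Hash the distfile name and take two hex digits per layer of bucketing.
    digest = hashlib.blake2b(distfile.encode("utf-8")).hexdigest()
    parts = [digest[2 * i : 2 * i + 2] for i in range(layers)]
    return "/".join(parts + [distfile])

for name in ("gcc-9.2.0.tar.xz", "zfs-0.8.2.tar.gz"):
    print(bucket_path(name), "|", bucket_path(name, layers=2))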