From mboxrd@z Thu Jan 1 00:00:00 1970
From: Richard Yao
To: gentoo-dev@lists.gentoo.org, Jaco Kroon
Reply-to: gentoo-dev@lists.gentoo.org
Date: Wed, 23 Oct 2019 20:01:12 -0400
Subject: Re: ext4 readdir performance - was Re: [gentoo-dev] New distfile mirror layout
Message-Id: <3E4DB3BE-8222-430D-AC50-E9B4441EE959@gentoo.org>

> On Oct 23, 2019, at 7:48 PM, Richard Yao wrote:
>
> On 10/22/19 2:51 AM, Jaco Kroon wrote:
>> Hi All,
>>
>>> On 2019/10/21 18:42, Richard Yao wrote:
>>>
>>> If we consider the access frequency, it might actually not be that
>>> bad. Consider a simple example with 500 files and two directory
>>> buckets. If we have 250 in each, then the size of the directory is
>>> always 250. However, if 50 files are accessed 90% of the time, then
>>> putting 450 into one directory and that 50 into another directory, we
>>> end up with the performance of the O(n) directory lookup being
>>> consistent with there being only 90 files in each directory.
>>>
>>> I am not sure if we should be discarding all other considerations to
>>> make changes to benefit O(n) directory lookup filesystems, but if we
>>> are, then the hashing approach is not necessarily the best one. It is
>>> only the best when all files are accessed with equal frequency, which
>>> would be an incorrect assumption. A more human friendly approach might
>>> still be better. I doubt that we have the data to determine that though.
>>>
>>> Also, another idea is to use a cheap hash function (e.g. fletcher) and
>>> just have the mirrors do the hashing behind the scenes. Then we would
>>> have the best of both worlds.
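
As a quick sanity check of the 90-entry figure in the example above, here
is a minimal sketch (the 90%/10% access split is the assumption stated in
that example; for an O(n) directory, lookup cost scales with the size of
the directory being scanned):

buckets = {"hot": 50, "cold": 450}      # files per directory
access = {"hot": 0.90, "cold": 0.10}    # fraction of lookups hitting each

# Expected number of entries an O(n) lookup has to scan, weighted by how
# often each directory is consulted.
expected_scan = sum(access[b] * buckets[b] for b in buckets)
print(expected_scan)  # 0.9*50 + 0.1*450 = 90.0

# The even 250/250 split makes every lookup pay for a 250-entry directory,
# while the skewed split behaves as if each directory held only ~90 files.
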
>>
>> Experience:
>>
>> ext4 sucks at targeted name lookups without the dir_index feature (O(n)
>> lookups - scans all entries in the folder). With dir_index, readdir
>> performance is crap. Pick your poison I guess. On most of our larger
>> filesystems (2TB+, but especially the 80TB+ ones) we've reverted to
>> disabling dir_index, as the benefit is outweighed by the crappy readdir()
>> and glob() performance.
>
> My read of the ext4 disk layout documentation is that the read operation
> should work mostly the same way, except with a penalty from reading
> larger directories caused by the addition of the tree's metadata and
> from having more partially filled blocks:
>
> https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Directory_Entries
>
> The code itself is the same traversal code:
>
> https://github.com/torvalds/linux/blob/v5.3/fs/ext4/dir.c#L106
>
> However, a couple of things stand out to me here at a glance:
>
> 1. `cond_resched()` adds scheduler delay for no apparent reason.
> `cond_resched()` is meant to be used in places where we could block
> excessively on non-PREEMPT kernels, but this doesn't strike me as one of
> those places. The fact that we block on disk on uncached reads naturally
> serves the same purpose, so an explicit rescheduling point here is
> redundant. PREEMPT kernels should perform better in readdir() on ext4 by
> virtue of making `cond_resched()` a no-op.

I just realized that the way that I worded this could be confusing, so
please allow me to clarify what I meant. cond_resched() is meant for when
a kernel thread will tie up a CPU for a long period of time. Blocking on
disk will cause the CPU to be released to another thread. When we do not
block on disk, this operation is quick. There is no good reason for
putting cond_resched() here as far as I can tell.

> 2. read-ahead is implemented in a way that appears to be over-reading
> the directory whenever the needed information is not cached. This is
> technically read-ahead, although it is not a great way of doing it. A
> much better way to do this would be to pipeline `readdir()` by
> initiating asynchronous read operations in anticipation of future reads.
>
> Both of these should affect both variants of ext4's directories, but the
> penalties I mentioned earlier mean that the dir_index variant would be
> affected more.
>
> If you have a way to benchmark things, a simple idea to evaluate would
> be deleting the `cond_resched()` line. If we had data showing an
> improvement, I would be happy to send a small one-line patch deleting
> the line to Ted to get the change into mainline.

The more I think about this, the more absurd having cond_resched() here
seems to me. I think I will sit on it for a few days. If it still seems
absurd to me after sitting on it, I'll send Ted a patch to delete that
line with the remark that the use of cond_resched() is redundant with
blocking on disk.

>> There doesn't seem to be a real specific tip-over point, and it seems to
>> depend a lot on RAM availability and hard drive speed (obviously). So if
>> dentries get cached, disk speed becomes less of an issue. However, on
>> large folders (where I typically use 10k as a value for large based on
>> "gut feeling" and "unquantifiable experience" and "nothing scientific at
>> all") I find that even with lots of RAM two consecutive ls commands
>> remain terribly slow. Switch off dir_index and that becomes an order of
>> magnitude faster.
>>
>> I don't have a great deal of experience with XFS, but on those systems
>> where we do it's generally on a VM, and perceivably (again, not
>> scientific) our experience has been that it feels slower. Again, not
>> scientific, just perception.
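
If it helps with the benchmarking mentioned above, here is a minimal
sketch of the "two consecutive listings" measurement (the directory path
is a placeholder; run it against a large folder with dir_index on and off,
or before and after a kernel change, and compare):

import os
import time

DIRECTORY = "/path/to/large/dir"  # placeholder for a large folder

def time_listing(path):
    # Time a full directory scan, roughly what a plain ls has to do.
    start = time.perf_counter()
    names = [entry.name for entry in os.scandir(path)]
    return time.perf_counter() - start, len(names)

# For the uncached case, drop the page/dentry caches first (needs root):
#   echo 3 > /proc/sys/vm/drop_caches
for attempt in ("first listing (possibly uncached)", "second listing (cached)"):
    elapsed, count = time_listing(DIRECTORY)
    print(f"{attempt}: {count} entries in {elapsed:.3f}s")
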
>>
>> I'm in support of the change. This will bucket to 256 folders and
>> should have a reasonably even split between folders. If required, a
>> second layer could be introduced by using the 3rd and 4th digits of the
>> hash. Any hash should be fine; it really doesn't need to be
>> cryptographically strong, it just needs to provide a good spread and be
>> really fast. Generally a hash table should have a prime number of
>> buckets to mitigate hash bias, but frankly, that's overcomplicating the
>> situation here.
>>
>> I also agree with others that it used to be easy to get distfiles as and
>> when needed, so an alternative structure could mirror that of the
>> portage tree itself, in other words "cat/pkg/distfile". This perhaps
>> just shifts the issue:
>>
>> jkroon@plastiekpoot /usr/portage $ find . -maxdepth 1 -type d -name
>> "*-*" | wc -l
>> 167
>> jkroon@plastiekpoot /usr/portage $ find *-* -maxdepth 1 -type d | wc -l
>> 19412
>> jkroon@plastiekpoot /usr/portage $ for i in *-*; do echo $(find $i
>> -maxdepth 1 -type d | wc -l) $i; done | sort -g | tail -n10
>> 347 net-misc
>> 373 media-sound
>> 395 media-libs
>> 399 dev-util
>> 505 dev-libs
>> 528 dev-java
>> 684 dev-haskell
>> 690 dev-ruby
>> 1601 dev-perl
>> 1889 dev-python
>>
>> So that's an average of 116 subfolders under the top layer (only two
>> over 1000), and then presumably fewer than 100 distfiles maximum per
>> package? Probably overkill, but it would (should) solve both the too
>> many files per folder issue and the easy lookup by hand issue.
>>
>> I don't have a preference for either solution though, but do agree that
>> "easy finding of distfiles" is handy. The INDEX mechanism is fine for me.
>>
>> Kind Regards,
>>
>> Jaco
>>
>
>
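
For concreteness, here is a minimal sketch of the two-level bucketing Jaco
describes above (the first two hex digits of a hash pick one of 256
top-level folders, and the 3rd and 4th digits can form an optional second
layer; the hash and the file names below are only illustrative, since the
thread leaves the actual hash choice open):

import hashlib

def bucket_path(distfile, layers=1):
    # Hash the distfile name and take two hex digits per layer of bucketing.
    digest = hashlib.blake2b(distfile.encode("utf-8")).hexdigest()
    parts = [digest[2 * i : 2 * i + 2] for i in range(layers)]
    return "/".join(parts + [distfile])

for name in ("gcc-9.2.0.tar.xz", "zfs-0.8.2.tar.gz"):
    print(bucket_path(name), "|", bucket_path(name, layers=2))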