On Sun, 7 Nov 2004, Carsten Lohrke wrote:

> On Sunday 07 November 2004 22:39, Stuart Herbert wrote:
>> If Portage supporting arbitrary-depth category trees, then we could
>> organise things a lot easier.  But until that happens, devs are
>> going to have to accept the need for more directories in
>> /usr/portage.
>
> I don't think an arbitrary depths would be so helpful. Most likely
> it'd slowdown portage. How about flatten the whole beast!? The
> categorization hasn't to be done via directories.

Whyever would flat-tree be better than arbitrary-depth?

When I started being more dilligent about reading the gentoo mailing
lists, I saw a number of threads on the topic of adding sub-categories,
and the only consistent reason that was given for not moving forward
was, "we need to benchmark that."

Initially, I saw this as a good thing, because while I already knew the
answer, performing tests to verify what one suspects tends to be a good
thing.  However, as time went on, I continued not seeing the benchmarks.
After a while, I became disheartened.  But there were other things I was
wanting to focus on, so I didn't get invovled.


However, knowing the answer to the directory performance question, I
could not let this comment alone.

I've attached a benchmark script, written in perl, which will find all
of the files in the specified directory tree(s), and then randomly
selects [count] files (where count is either specified by the --count
option, or 10,000), and reads the first line of each of these files.
This script can be utilized to benchmark any directory layout methods
that people wish to consider for Gentoo.

What they will find is: for ext2 and ext3 systems, there is an optimal
number of files per directory, performance falls linearly beyond this
point; for reiserfs systems, it doesn't matter.

I performed my own tests with this script; doing a split at the most
obvious point (the - in the category names), I received marginally
*improved* performance - Gentoo is already slightly over the optimal
number of files in /usr/portage.  (Don't get me started on dev-perl.)

More specifically, my average time for 10,000 random file reads in
/usr/portage (by changing to /usr/portage and using '.' as the argument
to benchaccess) was a little over 60 seconds, although as more tests were
performed, the Linux file cache started optimizing that result.  My
average time for the split categories, on the other hand, averaged at 55
seconds.


Some people may wonder why this is - after all, to access multiple
directory trees is clearly a lot more work.  This may be true for a
human, but the computer doesn't see it that way - under ext2 and ext3,
it has to read all of the filenames in the directory, until it finds the
one you're looking for.  Having fewer files, but more directories, means
that it gets to the file at each level much quicker.  Each of the
directory changes adds some time, but that's negligable compared to the
time it takes to read through a large directory.  Note that this, of
course, assumes that sanity is maintained, and we don't have many
categories or subcategories with fewer than a dozen packages and/or
subcategories.

Another way to think of it, it's similar to the difference between
searches on an unsorted array, and searches on a sorted array.

Anyone who wonders why reiserfs does not have an issue with either
layout does not know what reiserfs is - it was designed specifically to
avoid this problem.

Ed