On Sun, 7 Nov 2004, Carsten Lohrke wrote: > On Sunday 07 November 2004 22:39, Stuart Herbert wrote: >> If Portage supporting arbitrary-depth category trees, then we could >> organise things a lot easier. But until that happens, devs are >> going to have to accept the need for more directories in >> /usr/portage. > > I don't think an arbitrary depths would be so helpful. Most likely > it'd slowdown portage. How about flatten the whole beast!? The > categorization hasn't to be done via directories. Whyever would flat-tree be better than arbitrary-depth? When I started being more dilligent about reading the gentoo mailing lists, I saw a number of threads on the topic of adding sub-categories, and the only consistent reason that was given for not moving forward was, "we need to benchmark that." Initially, I saw this as a good thing, because while I already knew the answer, performing tests to verify what one suspects tends to be a good thing. However, as time went on, I continued not seeing the benchmarks. After a while, I became disheartened. But there were other things I was wanting to focus on, so I didn't get invovled. However, knowing the answer to the directory performance question, I could not let this comment alone. I've attached a benchmark script, written in perl, which will find all of the files in the specified directory tree(s), and then randomly selects [count] files (where count is either specified by the --count option, or 10,000), and reads the first line of each of these files. This script can be utilized to benchmark any directory layout methods that people wish to consider for Gentoo. What they will find is: for ext2 and ext3 systems, there is an optimal number of files per directory, performance falls linearly beyond this point; for reiserfs systems, it doesn't matter. I performed my own tests with this script; doing a split at the most obvious point (the - in the category names), I received marginally *improved* performance - Gentoo is already slightly over the optimal number of files in /usr/portage. (Don't get me started on dev-perl.) More specifically, my average time for 10,000 random file reads in /usr/portage (by changing to /usr/portage and using '.' as the argument to benchaccess) was a little over 60 seconds, although as more tests were performed, the Linux file cache started optimizing that result. My average time for the split categories, on the other hand, averaged at 55 seconds. Some people may wonder why this is - after all, to access multiple directory trees is clearly a lot more work. This may be true for a human, but the computer doesn't see it that way - under ext2 and ext3, it has to read all of the filenames in the directory, until it finds the one you're looking for. Having fewer files, but more directories, means that it gets to the file at each level much quicker. Each of the directory changes adds some time, but that's negligable compared to the time it takes to read through a large directory. Note that this, of course, assumes that sanity is maintained, and we don't have many categories or subcategories with fewer than a dozen packages and/or subcategories. Another way to think of it, it's similar to the difference between searches on an unsorted array, and searches on a sorted array. Anyone who wonders why reiserfs does not have an issue with either layout does not know what reiserfs is - it was designed specifically to avoid this problem. Ed