* [gentoo-dev] New distfile mirror layout
@ 2019-10-18 13:41 Michał Górny
  2019-10-18 19:53 ` Richard Yao
                   ` (2 more replies)
  0 siblings, 3 replies; 56+ messages in thread
From: Michał Górny @ 2019-10-18 13:41 UTC (permalink / raw)
  To: gentoo-dev-announce; +Cc: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 1884 bytes --]

Hi, everybody.

It is my pleasure to announce that yesterday (EU) evening we switched
to a new distfile mirror layout.  Users will be switching to the new
layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
already -- as their caches expire (24hrs).

The new layout is mostly a bow towards mirror admins, for some of whom
having 60000+ files in a single directory has been a problem.
However, I suppose some of you also found e.g. the directory index
hardly usable due to its size.

Throughout a transitional period (whose exact length hasn't been decided
yet), both layouts will be available.  Afterwards, the old layout will
be removed from mirrors.  This has a few implications:

1. Users who don't upgrade their package managers in time will lose
the ability to fetch from Gentoo mirrors.  This shouldn't be that
much of a problem given that the core software needed to upgrade Portage
should all have reliable upstream SRC_URIs.

2. mirror://gentoo/file URIs will stop working.  While technically you
could use mirror://gentoo/XX/file, I'd rather recommend finally
discarding its usage and moving distfiles to devspace.

3. Directly fetching files from distfiles.gentoo.org will become
a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
to use something like:

    $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
    1b
    $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
    ...

Alternatively, you can:

    $ wget http://distfiles.gentoo.org/distfiles/INDEX

and grep for the right path there.  This INDEX is also a more
lightweight alternative to HTML indexes generated by the servers.
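
The lookup in item 3 can also be scripted.  Below is a minimal Python
sketch, assuming that hashlib.blake2b with its default 64-byte digest
yields the same hex string that b2sum prints.  The helper names bucket()
and distfile_url() are made up for illustration and are not part of
Portage.

```python
import hashlib

# Assumption: base URL copied from the wget example in the announcement.
BASE_URL = "http://distfiles.gentoo.org/distfiles"


def bucket(filename: str) -> str:
    """Return the two-hex-digit bucket for a distfile name.

    Mirrors `printf '%s' NAME | b2sum | cut -c1-2`: BLAKE2b-512 over the
    bare file name (no trailing newline), first two hex characters.
    """
    return hashlib.blake2b(filename.encode("utf-8")).hexdigest()[:2]


def distfile_url(filename: str) -> str:
    """Build the new-layout URL for a distfile (hypothetical helper)."""
    return f"{BASE_URL}/{bucket(filename)}/{filename}"


if __name__ == "__main__":
    print(distfile_url("foo-1.tar.gz"))
```

Hashing only the file name means the bucket can be computed without
contacting the mirror at all, which is also what the shell pipeline does.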
If you're interested in more background details and some plots, see [1].

[1] https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html

-- 
Best regards,
Michał Górny

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 618 bytes --]

^ permalink raw reply	[flat|nested] 56+ messages in thread
* Re: [gentoo-dev] New distfile mirror layout
From: Richard Yao @ 2019-10-18 19:53 UTC
  To: gentoo-dev; +Cc: gentoo-dev-announce

> On Oct 18, 2019, at 9:42 AM, Michał Górny <mgorny@gentoo.org> wrote:
>
> The new layout is mostly a bow towards mirror admins, for some of whom
> having 60000+ files in a single directory has been a problem.
> However, I suppose some of you also found e.g. the directory index
> hardly usable due to its size.

This sounds like a filesystem issue. Do we know which filesystems are
suffering?

ZFS should be fine. I believe ext2/ext3 have problems with this many
files. ext4 is probably okay, but don’t quote me on that.

> [...]

* Re: [gentoo-dev] New distfile mirror layout
From: Michał Górny @ 2019-10-18 20:49 UTC
  To: gentoo-dev

On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
> This sounds like a filesystem issue. Do we know which filesystems are
> suffering?
>
> ZFS should be fine. I believe ext2/ext3 have problems with this many
> files. ext4 is probably okay, but don’t quote me on that.

Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
may apply only to older NTFS versions.  NFS has been mentioned too.

However, just because modern filesystems can handle them efficiently, it
doesn't mean having directories that huge comes with zero cost.

[1] https://bugs.gentoo.org/534528

-- 
Best regards,
Michał Górny

* Re: [gentoo-dev] New distfile mirror layout
From: Richard Yao @ 2019-10-19 1:09 UTC
  To: gentoo-dev

> On Oct 18, 2019, at 4:49 PM, Michał Górny <mgorny@gentoo.org> wrote:
>
> Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
> may apply only to older NTFS versions.  NFS has been mentioned too.

ext2 and vfat are not surprises to me (outside of the idea that anyone
would use them for a mirror). NTFS and NFS are though.

> However, just because modern filesystems can handle them efficiently, it
> doesn't mean having directories that huge comes with zero cost.

While I am okay with the change, what do you mean when you say that
having huge directories does not come with zero cost?

Filesystems with O(1) directory lookups like ZFS would probably be hurt
by this, but the impact should be negligible. Filesystems with O(log n)
directory lookups would see faster directory lookups.

Outside of directory lookups, this could speed up searches and sort
operations when listing everything, with just about any filesystem
benefiting from the improvement.

Listing directories on such filesystems should not benefit from this
unless you are using ls, where the default behavior is to sort the
directory contents (which is where the improvement when sorting comes
into play). The need to sort the directory contents by default keeps ls
from displaying anything until it has scanned the entire directory. The
asymptotic complexity of a fast comparison-based sort improves in this
situation from O(n log n) to O(n log(n/b)), provided that you sort each
subdirectory independently. A further speedup could be obtained by
using multithreading to parallelize the sort operations.

Since I know someone will call me out on that comment, I will explain.
Each bucket has roughly n/b items in it, where n is the total number and
b is the number of buckets. Sorting one bucket is O(n/b * log(n/b)).
Loop to sort each of the b buckets. The buckets are pre-sorted by
prefix, so the result is now sorted. You therefore get O(n log(n/b))
time complexity out of an O(n log n) comparison sort in this very
special case where you call it multiple times on data that has been
presorted by prefix into buckets.

Is there any other benefit to this or did I get everything?

By the way, it is off-topic for the thread, but it occurs to me that a
hybrid of radix sort and a comparison-based sort could give us a general
sorting algorithm that is asymptotically faster than O(n log n).

> [1] https://bugs.gentoo.org/534528

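
The bucket argument in this message can be written out as a short
derivation.  This is only a sketch: it assumes, as the message does,
b evenly filled buckets keyed by filename prefix, and the figures
n = 60000 and b = 256 are illustrative values, not measurements.  It
needs the amsmath package.

```latex
\begin{align*}
T_{\text{bucketed}}(n, b)
  &= b \cdot O\!\left(\frac{n}{b}\log\frac{n}{b}\right)
   = O\!\left(n\log\frac{n}{b}\right)
   = O(n\log n - n\log b) \\
% Illustrative figures only: n = 60000 files, b = 256 buckets.
\log_2 60000 &\approx 15.9, \qquad
\log_2\frac{60000}{256} \approx 7.9
\end{align*}
```

With those figures the estimated number of comparisons roughly halves.
For a fixed b the n log b term is of lower order than n log n, so the
saving is a constant factor unless b grows with n.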
* Re: [gentoo-dev] New distfile mirror layout
From: Michał Górny @ 2019-10-19 6:17 UTC
  To: gentoo-dev

On Fri, 2019-10-18 at 21:09 -0400, Richard Yao wrote:
> ext2 and vfat are not surprises to me (outside of the idea that anyone
> would use them for a mirror). NTFS and NFS are though.

Are you surprised that people use NTFS on Windows?  Or that they use
local mirrors over NFS?  The latter still needs to be addressed
separately, provided that they mount it on DISTDIR.

> Filesystems with O(1) directory lookups like ZFS would probably be hurt
> by this

O(1) or O(n)?

> [...]
> Is there any other benefit to this or did I get everything?

Listings for individual directories won't cause major pain to browsers
anymore.  Not that there's much reason to do them.

All kinds of per-directory operations will consume less memory
and be potentially faster.

-- 
Best regards,
Michał Górny

* Re: [gentoo-dev] New distfile mirror layout
From: Richard Yao @ 2019-10-19 8:20 UTC
  To: gentoo-dev

> On Oct 19, 2019, at 2:17 AM, Michał Górny <mgorny@gentoo.org> wrote:
>
> Are you surprised that people use NTFS on Windows?  Or that they use
> local mirrors over NFS?  The latter still needs to be addressed
> separately, provided that they mount it on DISTDIR.

I am surprised that it was an issue on NTFS because it uses B-trees. As
for NFS, I had expected that to be more dependent on the local
filesystem than on NFS itself. If it has a slowdown when used on a
filesystem that has fast directory operations, that might be a bug.

>> Filesystems with O(1) directory lookups like ZFS would probably be hurt
>> by this
>
> O(1) or O(n)?

ZFS uses extendible hashing for its directories, so the data structure
used is amortized O(1). You might consider it O(log n) due to the
indirect tree traversal needed to find the direct block containing the
hash table entry. With caching of indirect blocks, it should be
amortized O(1) to find the direct block in practice as far as read IOs
are concerned. In addition, the base of the logarithm is 128 or 1024
depending on the pool feature flags.

> Listings for individual directories won't cause major pain to browsers
> anymore.  Not that there's much reason to do them.

That makes sense.

> All kinds of per-directory operations will consume less memory
> and be potentially faster.

Userland would save memory when sorting or grepping a directory listing
by virtue of having to process less data for grep and less data at a
time for sorting (if it takes advantage of this). That would have
performance benefits in userland.

The kernel would have little memory savings and in some cases might be
slightly worse. It is negligible. Performance in the kernel ought to be
slightly better on filesystems with O(log n) directory operations, but I
would only expect the really bad ones to show much improvement.

* Re: [gentoo-dev] New distfile mirror layout
From: Richard Yao @ 2019-10-19 19:26 UTC
  To: gentoo-dev

> On Oct 18, 2019, at 9:10 PM, Richard Yao <ryao@gentoo.org> wrote:
>
> [...]
> The asymptotic complexity of a fast comparison-based sort improves in
> this situation from O(n log n) to O(n log(n/b)), provided that you sort
> each subdirectory independently.

I read your original email late at night and I misread the description
of how this works.

At an initial glance, I thought we were doing a prefix approach (with
the caveat that buckets are unbalanced). In reality, we are doing a
cryptographic hash of the filenames.

That would keep all buckets balanced, which gives the best directory
lookup times on O(log n) lookup filesystems, but I think there is
something to be gained from using the less optimal approach of using
filename prefixes:

 * some regex searches on distfiles can be accelerated
 * generating a sorted list of all distfiles becomes asymptotically faster
 * it is easy for a user to find all versions of a given distfile
 * no need to calculate a cryptographic hash

I realize that I am late to propose it, but could we consider a switch
to this alternative arrangement? The bulk of the performance gain should
be realized with either approach.

> [...]

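
The trade-off between the two schemes discussed in this message can be
seen in a small experiment.  The sketch below is illustrative only: the
file names are invented rather than taken from a real distfile listing,
a two-character prefix is an assumed form of the prefix layout, and
bucket_sizes() is a hypothetical helper.

```python
import hashlib
from collections import Counter
from typing import Callable, Iterable


def hash_bucket(name: str) -> str:
    """First two hex digits of BLAKE2b-512 over the file name."""
    return hashlib.blake2b(name.encode("utf-8")).hexdigest()[:2]


def prefix_bucket(name: str) -> str:
    """First two characters of the file name (assumed prefix layout)."""
    return name[:2]


def bucket_sizes(names: Iterable[str],
                 key: Callable[[str], str]) -> Counter:
    """Count how many names land in each bucket under the given scheme."""
    return Counter(key(n) for n in names)


# Invented names: six packages with 500 versions each.
NAMES = [f"{pkg}-{ver}.tar.gz"
         for pkg in ("python", "perl", "py-foo", "gcc", "glibc", "zlib")
         for ver in range(500)]

for label, key in (("hash", hash_bucket), ("prefix", prefix_bucket)):
    sizes = bucket_sizes(NAMES, key)
    print(f"{label:6s} buckets={len(sizes):3d} "
          f"largest={max(sizes.values())}")
```

On this input the prefix scheme puts every "python-*" and "py-foo-*"
file into a single "py" bucket, while the hash scheme spreads the same
names over many small buckets.  Real distfile names may behave
differently.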
* Re: [gentoo-dev] New distfile mirror layout
From: Michał Górny @ 2019-10-19 20:02 UTC
  To: gentoo-dev

On Sat, 2019-10-19 at 15:26 -0400, Richard Yao wrote:
> That would keep all buckets balanced, which gives the best directory
> lookup times on O(log n) lookup filesystems, but I think there is
> something to be gained from using the less optimal approach of using
> filename prefixes:
>
>  * some regex searches on distfiles can be accelerated
>  * generating a sorted list of all distfiles becomes asymptotically faster
>  * it is easy for a user to find all versions of a given distfile
>  * no need to calculate a cryptographic hash
>
> I realize that I am late to propose it, but could we consider a switch
> to this alternative arrangement?

No, we can't.  Please read either the original discussion on the bug, or
the linked article.  It's explained in detail why this won't work.

-- 
Best regards,
Michał Górny

* Re: [gentoo-dev] New distfile mirror layout 2019-10-19 20:02 ` Michał Górny @ 2019-10-19 22:48 ` Richard Yao 0 siblings, 0 replies; 56+ messages in thread From: Richard Yao @ 2019-10-19 22:48 UTC (permalink / raw To: gentoo-dev > On Oct 19, 2019, at 4:03 PM, Michał Górny <mgorny@gentoo.org> wrote: > > On Sat, 2019-10-19 at 15:26 -0400, Richard Yao wrote: >>>> On Oct 18, 2019, at 9:10 PM, Richard Yao <ryao@gentoo.org> wrote: >>> >>> >>>>> On Oct 18, 2019, at 4:49 PM, Michał Górny <mgorny@gentoo.org> wrote: >>>> On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote: >>>>>>>>>> On Oct 18, 2019, at 9:42 AM, Michał Górny <mgorny@gentoo.org> wrote: >>>>>>>>> Hi, everybody. >>>>>>>>> It is my pleasure to announce that yesterday (EU) evening we've switched >>>>>>>>> to a new distfile mirror layout. Users will be switching to the new >>>>>>>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded >>>>>>>>> already -- as their caches expire (24hrs). >>>>>>>>> The new layout is mostly a bow towards mirror admins, for some of whom >>>>>>>>> having a 60000+ files in a single directory have been a problem. >>>>>>>>> However, I suppose some of you also found e.g. the directory index >>>>>>>>> hardly usable due to its size. >>>>> This sounds like a filesystem issue. Do we know which filesystems are suffering? >>>>> ZFS should be fine. I believe ext2/ext3 have problems with this many files. ext4 is probably okay, but don’t quote me on that. >>>> Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this >>>> may apply only to older ntfs versions. NFS has been mentioned too. >>> >>> ext2 and vfat are not surprises to me (outside of the idea that anyone would use them for a mirror). NTFS and NFS are though. >>>> However, just because modern filesystems can handle them efficiently, it >>>> doesn't mean having directories that huge comes with zero cost. 
>>> While I am okay with the change, what do you mean when you say that having huge directories does not come with zero cost? >>> >>> Filesystems with O(1) directory lookups like ZFS would probably be hurt by this, but the impact should be negligible. Filesystems with O(log n) directory lookups would see faster directory lookups. >>> >>> Outside of directory lookups, this could speed up up searches and sort operations when listing everything with just about any filesystem benefiting from the improvement. >>> >>> Listing directories on such filesystems should not benefit from this unless you are using ls where the default behavior is to sort the directory contents (which is where the improvement when sorting comes into play). The need to sort the directory contents by default keeps ls from displaying anything until it has scanned the entire directory. The asymptotic complexity of a fast comparison based sort improves in this situation from O(nlogn) to O(nlog(n/b)) provided that you sort each subdirectory independently. A further speed up could be obtained by doing multithreading to parallelize the sort operations. >> I read your original email late at night and I misread the description of how this works. >> >> At an initial glance, I thought we were doing a prefix approach (with the caveat that buckets are unbalanced). In reality, we are doing a cryptographic hash of the filenames. >> >> That would keep all buckets balanced, which gives the best directory lookup times on O(log n) lookup filesystems, but I think there is something to be gained from using the less optimal approach of using filename prefixes: >> >> * some regex searches on distfiles can be accelerated >> * generating a sorted list of all distfiles becomes asymptotically faster >> * it is easy for a user to find all versions of a given distfile >> * no need to calculate a cryptographic hash >> >> I realize that I am late to propose it, but could we consider a switch to this alternative arrangement? 
>
> No, we can't. Please read either the original discussion on the bug, or
> the linked article. It's explained in detail why this won't work.

Alright. I am convinced. Thanks.

>
> --
> Best regards,
> Michał Górny
>

^ permalink raw reply [flat|nested] 56+ messages in thread
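The sorting estimate quoted in this message can be illustrated with a rough worked example. This is only a sketch under assumed numbers: n = 60000 files (the figure from the announcement), b = 256 equally sized buckets (an assumption, based on the two-hex-digit directory name in the announcement's example), and a comparison sort costing roughly n log2 n comparisons.

```latex
\begin{align*}
\text{one directory:}\quad
  & n \log_2 n = 60000 \times \log_2 60000
    \approx 60000 \times 15.87 \approx 9.5 \times 10^{5} \\
\text{$b$ buckets, sorted independently:}\quad
  & b \cdot \frac{n}{b} \log_2 \frac{n}{b} = n \log_2 \frac{n}{b}
    = 60000 \times \log_2 234.4
    \approx 60000 \times 7.87 \approx 4.7 \times 10^{5}
\end{align*}
```

Under these assumptions the saving is about a factor of two. Note that the per-bucket results only concatenate into one globally sorted list when bucket order follows name order, as with prefixes; with hash buckets a merge step costing about n log2 b comparisons would bring the total back to roughly n log2 n.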
* Re: [gentoo-dev] New distfile mirror layout
  2019-10-18 19:53 ` Richard Yao
  2019-10-18 20:49   ` Michał Górny
@ 2019-10-22  0:46   ` James Cloos
  1 sibling, 0 replies; 56+ messages in thread
From: James Cloos @ 2019-10-22 0:46 UTC (permalink / raw)
  To: gentoo-dev; +Cc: Richard Yao

>>>>> "RY" == Richard Yao <ryao@gentoo.org> writes:

RY> ext4 is probably okay, but don’t quote me on that.

Ext4 works fine here for a local distfiles mirror.

-JimC
--
James Cloos <cloos@jhcloos.com> OpenPGP: 0x997A9F17ED7DAEA6

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [gentoo-dev] New distfile mirror layout
  2019-10-18 13:41 [gentoo-dev] New distfile mirror layout Michał Górny
  2019-10-18 19:53 ` Richard Yao
@ 2019-10-19 13:31 ` Fabian Groffen
  2019-10-19 13:53   ` Michał Górny
  2019-10-19 23:24 ` Joshua Kinard
  2 siblings, 1 reply; 56+ messages in thread
From: Fabian Groffen @ 2019-10-19 13:31 UTC (permalink / raw)
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 963 bytes --]

Hi,

On 18-10-2019 15:41:32 +0200, Michał Górny wrote:
> 3. Directly fetching files from distfiles.gentoo.org will become
> a little harder. To fetch a distfile named 'foo-1.tar.gz', you'd have
> to use something like:
>
>   $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
>   1b
>   $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
>   ...
>
>
> Alternatively, you can:
>
>   $ wget http://distfiles.gentoo.org/distfiles/INDEX
>
> and grep for the right path there. This INDEX is also a more
> lightweight alternative to HTML indexes generated by the servers.

Would it be possible to run a service that sends a 302 for
distfiles/foo-1.tar.gz to the appropriate bucket, so that manual
fetching doesn't require calculating the hash?

I prototyped this myself for distfiles.prefix, and it seems like a nice
gesture for at least the transition period?

Thanks,
Fabian

--
Fabian Groffen
Gentoo on a different level

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [gentoo-dev] New distfile mirror layout
  2019-10-19 13:31 ` Fabian Groffen
@ 2019-10-19 13:53   ` Michał Górny
  0 siblings, 0 replies; 56+ messages in thread
From: Michał Górny @ 2019-10-19 13:53 UTC (permalink / raw)
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 1291 bytes --]

On Sat, 2019-10-19 at 15:31 +0200, Fabian Groffen wrote:
> Hi,
>
> On 18-10-2019 15:41:32 +0200, Michał Górny wrote:
> > 3. Directly fetching files from distfiles.gentoo.org will become
> > a little harder. To fetch a distfile named 'foo-1.tar.gz', you'd have
> > to use something like:
> >
> >   $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
> >   1b
> >   $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
> >   ...
> >
> >
> > Alternatively, you can:
> >
> >   $ wget http://distfiles.gentoo.org/distfiles/INDEX
> >
> > and grep for the right path there. This INDEX is also a more
> > lightweight alternative to HTML indexes generated by the servers.
>
> Would it be possible to run a service that sends a 302 for the
> distfiles/foo-1.tar.gz to the appropriate bucket such that manual
> fetching doesn't require to calculate the hash?
>
> I prototyped this myself for distfiles.prefix, and seems like a nice
> gesture for at least the transition period?
>

That would only work for servers whose admins explicitly install
the service, i.e. not for anyone using GENTOO_MIRRORS. If you're
talking purely about distfiles.gentoo.org, we may add something like
that by the end of the transitional period.

--
Best regards,
Michał Górny

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 618 bytes --]

^ permalink raw reply [flat|nested] 56+ messages in thread
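The redirect idea discussed in this exchange boils down to mapping a flat filename to its bucketed path. Below is a minimal Python sketch of that mapping, not the code of any actual service. It assumes the layout is exactly what the announcement's shell example shows (first two hex digits of the BLAKE2b digest of the filename, as printed by `b2sum`) and that filenames are encoded as UTF-8. The names `bucket_for` and `redirect_target` are hypothetical.

```python
import hashlib

# Base URL as given in the announcement.
BASE_URL = "http://distfiles.gentoo.org/distfiles"


def bucket_for(filename: str) -> str:
    """Mirror `printf '%s' NAME | b2sum | cut -c1-2` from the announcement."""
    # b2sum computes BLAKE2b with a 512-bit digest, which is also the
    # default digest size of hashlib.blake2b. UTF-8 is an assumption here.
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    return digest[:2]


def redirect_target(filename: str, base_url: str = BASE_URL) -> str:
    """Hypothetical helper: the Location a 302 response could point at."""
    return f"{base_url}/{bucket_for(filename)}/{filename}"


if __name__ == "__main__":
    print(redirect_target("foo-1.tar.gz"))
```

A real service would additionally have to validate the requested name and run on each mirror that wants to offer it, which is the limitation raised in the reply above.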
* Re: [gentoo-dev] New distfile mirror layout 2019-10-18 13:41 [gentoo-dev] New distfile mirror layout Michał Górny 2019-10-18 19:53 ` Richard Yao 2019-10-19 13:31 ` Fabian Groffen @ 2019-10-19 23:24 ` Joshua Kinard 2019-10-19 23:57 ` Alec Warner 2019-10-20 6:51 ` Michał Górny 2 siblings, 2 replies; 56+ messages in thread From: Joshua Kinard @ 2019-10-19 23:24 UTC (permalink / raw To: gentoo-dev On 10/18/2019 09:41, Michał Górny wrote: > Hi, everybody. > > It is my pleasure to announce that yesterday (EU) evening we've switched > to a new distfile mirror layout. Users will be switching to the new > layout either as they upgrade Portage to 2.3.77 or -- if they upgraded > already -- as their caches expire (24hrs). > > The new layout is mostly a bow towards mirror admins, for some of whom > having a 60000+ files in a single directory have been a problem. > However, I suppose some of you also found e.g. the directory index > hardly usable due to its size. > > Throughout a transitional period (whose exact length hasn't been decided > yet), both layouts will be available. Afterwards, the old layout will > be removed from mirrors. This has a few implications: > > 1. Users who don't upgrade their package managers in time will lose > the ability of fetching from Gentoo mirrors. This shouldn't be that > much of a problem given that the core software needed to upgrade Portage > should all have reliable upstream SRC_URIs. > > 2. mirror://gentoo/file URIs will stop working. While technically you > could use mirror://gentoo/XX/file, I'd rather recommend finally > discarding its usage and moving distfiles to devspace. > > 3. Directly fetching files from distfiles.gentoo.org will become > a little harder. To fetch a distfile named 'foo-1.tar.gz', you'd have > to use something like: > > $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2 > 1b > $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz > ... 
>
>
> Alternatively, you can:
>
>   $ wget http://distfiles.gentoo.org/distfiles/INDEX
>
> and grep for the right path there. This INDEX is also a more
> lightweight alternative to HTML indexes generated by the servers.
>
>
> If you're interested in more background details and some plots, see [1].
>
> [1] https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html
>

So the answer I didn't really see directly stated here is, where do new
distfiles need to go //now//? E.g., if on woodpecker, I currently cp a
distfile to /space/distfiles-local. What is the new directory I need to
use? And if mirror://gentoo/${FOO} is going away, for the new distfiles
target, what would be the applicable prefix to use?

Directly using devspace seems like a bad idea, IMHO. Once long ago, we all
got chastised for doing exactly that. Too much possibility of fragmentation
as devs retire or package maintainership changes hands.

I looked at the whitepaper-ish writeup, and I kinda don't like using a
hash-based naming scheme on the new distfiles layout. I really kind of prefer
breaking the directories up based on the first letter of the distfiles in
question, factoring case-sensitivity in (so you'd have 52 top-level
directories for A-Z and a-z, plus 10 more for 0-9). Under each of those
directories, additional subdirectories for the next few letters (say,
letters 2-3). Yes, this leads to some orphan cases where a distfile might
live on its own, but from a direct navigation standpoint, it's easy to find
for someone browsing the distfiles server and easy to predict where a
distfile is at.

No math, statistical analysis, or deep-rooted knowledge of filesystems
behind that paragraph. Just a plain old unfiltered opinion. Sometimes, I
need to go get a distfile off the Gentoo mirrors, and being able to quickly
find it in the mirror root is great. Having to do hash calculations to work
out the file path will be *really* annoying.
-- Joshua Kinard Gentoo/MIPS kumba@gentoo.org rsa6144/5C63F4E3F5C6C943 2015-04-27 177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943 "The past tempts us, the present confuses us, the future frightens us. And our lives slip away, moment by moment, lost in that vast, terrible in-between." --Emperor Turhan, Centauri Republic ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [gentoo-dev] New distfile mirror layout 2019-10-19 23:24 ` Joshua Kinard @ 2019-10-19 23:57 ` Alec Warner 2019-10-20 0:14 ` Joshua Kinard 2019-10-20 6:51 ` Michał Górny 1 sibling, 1 reply; 56+ messages in thread From: Alec Warner @ 2019-10-19 23:57 UTC (permalink / raw To: Gentoo Dev [-- Attachment #1: Type: text/plain, Size: 4490 bytes --] On Sat, Oct 19, 2019 at 4:24 PM Joshua Kinard <kumba@gentoo.org> wrote: > On 10/18/2019 09:41, Michał Górny wrote: > > Hi, everybody. > > > > It is my pleasure to announce that yesterday (EU) evening we've switched > > to a new distfile mirror layout. Users will be switching to the new > > layout either as they upgrade Portage to 2.3.77 or -- if they upgraded > > already -- as their caches expire (24hrs). > > > > The new layout is mostly a bow towards mirror admins, for some of whom > > having a 60000+ files in a single directory have been a problem. > > However, I suppose some of you also found e.g. the directory index > > hardly usable due to its size. > > > > Throughout a transitional period (whose exact length hasn't been decided > > yet), both layouts will be available. Afterwards, the old layout will > > be removed from mirrors. This has a few implications: > > > > 1. Users who don't upgrade their package managers in time will lose > > the ability of fetching from Gentoo mirrors. This shouldn't be that > > much of a problem given that the core software needed to upgrade Portage > > should all have reliable upstream SRC_URIs. > > > > 2. mirror://gentoo/file URIs will stop working. While technically you > > could use mirror://gentoo/XX/file, I'd rather recommend finally > > discarding its usage and moving distfiles to devspace. > > > > 3. Directly fetching files from distfiles.gentoo.org will become > > a little harder. 
To fetch a distfile named 'foo-1.tar.gz', you'd have > > to use something like: > > > > $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2 > > 1b > > $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz > > ... > > > > > > Alternatively, you can: > > > > $ wget http://distfiles.gentoo.org/distfiles/INDEX > > > > and grep for the right path there. This INDEX is also a more > > lightweight alternative to HTML indexes generated by the servers. > > > > > > If you're interested in more background details and some plots, see [1]. > > > > [1] > https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html > > > > So the answer I didn't really see directly stated here is, where do new > distfiles need to go //now//? E.g., if on woodpecker, I currently cp a > distfile to /space/distfiles-local. What is the new directory I need to > use? And if mirror://gentoo/${FOO} is going away, for the new distfiles > target, what would be the applicable prefix to use? > > > Directly using devspace seems like a bad idea, IMHO. Once long ago, we all > got chastised for doing exactly that. Too much possibility of > fragmentation > as devs retire or package maintainership changes hands. > > I looked at the whitepaper'ish-like writeup, and I kinda don't like using a > hash-based naming scheme on the new distfiles layout. I really kind prefer > breaking the directories up based on the first letter of the distfiles in > question, factoring case-sensitivity in (so you'd have 52 top-level > directories for A-Z and a-z, plus 10 more for 0-9). Under each of those > directories, additional subdirectories for the next few letters (say, > letters 2-3). Yes, this leads to some orphan cases where a distfile might > live on its own, but from a direct navigation standpoint, it's easy to find > for someone browsing the distfiles server and easy to predict where a > distfile is at. > > No math, statistical analysis, or deep-rooted knowledge of filesystems > behind that paragraph. 
Just a plain old unfiltered opinion. Sometimes, I
> need to go get a distfile off the Gentoo mirrors, and being able to quickly
> find it in the mirror root is great. Having to do hash calculations to work
> out the file path will be *really* annoying.
>

So if you want a tool that "downloads a distfile off of the mirrors", we
should be able to build such a utility.

I'm not really sure why that tool needs to be:
*copy DISTFILENAME*
wget distfiles.gentoo.org/$PASTE

It could just be `ebuild portageq download $DISTFILENAME` or similar.

-A

>
> --
> Joshua Kinard
> Gentoo/MIPS
> kumba@gentoo.org
> rsa6144/5C63F4E3F5C6C943 2015-04-27
> 177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943
>
> "The past tempts us, the present confuses us, the future frightens us. And
> our lives slip away, moment by moment, lost in that vast, terrible
> in-between."
>
> --Emperor Turhan, Centauri Republic
>
>

[-- Attachment #2: Type: text/html, Size: 6188 bytes --]

^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [gentoo-dev] New distfile mirror layout 2019-10-19 23:57 ` Alec Warner @ 2019-10-20 0:14 ` Joshua Kinard 0 siblings, 0 replies; 56+ messages in thread From: Joshua Kinard @ 2019-10-20 0:14 UTC (permalink / raw To: gentoo-dev On 10/19/2019 19:57, Alec Warner wrote: > On Sat, Oct 19, 2019 at 4:24 PM Joshua Kinard <kumba@gentoo.org> wrote: > >> On 10/18/2019 09:41, Michał Górny wrote: >>> Hi, everybody. >>> >>> It is my pleasure to announce that yesterday (EU) evening we've switched >>> to a new distfile mirror layout. Users will be switching to the new >>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded >>> already -- as their caches expire (24hrs). >>> >>> The new layout is mostly a bow towards mirror admins, for some of whom >>> having a 60000+ files in a single directory have been a problem. >>> However, I suppose some of you also found e.g. the directory index >>> hardly usable due to its size. >>> >>> Throughout a transitional period (whose exact length hasn't been decided >>> yet), both layouts will be available. Afterwards, the old layout will >>> be removed from mirrors. This has a few implications: >>> >>> 1. Users who don't upgrade their package managers in time will lose >>> the ability of fetching from Gentoo mirrors. This shouldn't be that >>> much of a problem given that the core software needed to upgrade Portage >>> should all have reliable upstream SRC_URIs. >>> >>> 2. mirror://gentoo/file URIs will stop working. While technically you >>> could use mirror://gentoo/XX/file, I'd rather recommend finally >>> discarding its usage and moving distfiles to devspace. >>> >>> 3. Directly fetching files from distfiles.gentoo.org will become >>> a little harder. To fetch a distfile named 'foo-1.tar.gz', you'd have >>> to use something like: >>> >>> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2 >>> 1b >>> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz >>> ... 
>>> >>> >>> Alternatively, you can: >>> >>> $ wget http://distfiles.gentoo.org/distfiles/INDEX >>> >>> and grep for the right path there. This INDEX is also a more >>> lightweight alternative to HTML indexes generated by the servers. >>> >>> >>> If you're interested in more background details and some plots, see [1]. >>> >>> [1] >> https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html >>> >> >> So the answer I didn't really see directly stated here is, where do new >> distfiles need to go //now//? E.g., if on woodpecker, I currently cp a >> distfile to /space/distfiles-local. What is the new directory I need to >> use? And if mirror://gentoo/${FOO} is going away, for the new distfiles >> target, what would be the applicable prefix to use? >> > > > > >> >> Directly using devspace seems like a bad idea, IMHO. Once long ago, we all >> got chastised for doing exactly that. Too much possibility of >> fragmentation >> as devs retire or package maintainership changes hands. >> >> I looked at the whitepaper'ish-like writeup, and I kinda don't like using a >> hash-based naming scheme on the new distfiles layout. I really kind prefer >> breaking the directories up based on the first letter of the distfiles in >> question, factoring case-sensitivity in (so you'd have 52 top-level >> directories for A-Z and a-z, plus 10 more for 0-9). Under each of those >> directories, additional subdirectories for the next few letters (say, >> letters 2-3). Yes, this leads to some orphan cases where a distfile might >> live on its own, but from a direct navigation standpoint, it's easy to find >> for someone browsing the distfiles server and easy to predict where a >> distfile is at. >> >> No math, statistical analysis, or deep-rooted knowledge of filesystems >> behind that paragraph. Just a plain old unfiltered opinion. Sometimes, I >> need to go get a distfile off the Gentoo mirrors, and being able to quickly >> find it in the mirror root is great. 
Having to do hash calculations to work
>> out the file path will be *really* annoying.
>>
>
> So if you want a tool that "downloads a distfile off of the mirrors" we
> should be able to build such a utility.
>
> I'm not really sure why that tool needs to be:
> *copy DISTFILENAME*
> wget distfiles.gentoo.org/$PASTE
>
> It could just `ebuild portageq download $DISTFILENAME or similar.`
>
> -A

Sometimes, I'm not on a Gentoo system, or even a Linux/Unix platform, when I
go to fetch a distfile. I could fetch (and have fetched) as such off of
Debian's mirrors before, but Gentoo is what I know, and fetching a distfile
off of those mirrors manually was generally very straightforward.

Not a common case, and certainly not a blocker. I was just pointing out
that hash-based naming is decidedly a lot less human-friendly. But that's
been the general trend for all things technology these last few years.

--
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us. And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic

^ permalink raw reply [flat|nested] 56+ messages in thread
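For the non-Gentoo case described in this message, the announcement's "download INDEX and grep" route can be scripted on any platform with Python. The sketch below is only an illustration: the thread does not specify the INDEX format, so it assumes plain text with one relative path per line (for example `1b/foo-1.tar.gz`), and the helper name `find_in_index` and the sample data are made up.

```python
from typing import List


def find_in_index(index_text: str, filename: str) -> List[str]:
    """Return INDEX entries whose last path component equals `filename`.

    Assumes one relative path per line; the real INDEX format may differ.
    """
    matches = []
    for line in index_text.splitlines():
        entry = line.strip()
        # Compare only the final path component, so 'foo-1.tar.gz'
        # does not also match 'notfoo-1.tar.gz'.
        if entry and entry.rsplit("/", 1)[-1] == filename:
            matches.append(entry)
    return matches


if __name__ == "__main__":
    # Made-up sample standing in for a downloaded INDEX file.
    sample = "1b/foo-1.tar.gz\nc4/bar-2.0.tar.xz\n"
    print(find_in_index(sample, "foo-1.tar.gz"))
```

Matching on the final path component rather than a substring is a deliberate choice here, since distfile names frequently share long common prefixes and suffixes.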
* Re: [gentoo-dev] New distfile mirror layout 2019-10-19 23:24 ` Joshua Kinard 2019-10-19 23:57 ` Alec Warner @ 2019-10-20 6:51 ` Michał Górny 2019-10-20 8:25 ` Joshua Kinard ` (2 more replies) 1 sibling, 3 replies; 56+ messages in thread From: Michał Górny @ 2019-10-20 6:51 UTC (permalink / raw To: gentoo-dev [-- Attachment #1: Type: text/plain, Size: 4322 bytes --] On Sat, 2019-10-19 at 19:24 -0400, Joshua Kinard wrote: > On 10/18/2019 09:41, Michał Górny wrote: > > Hi, everybody. > > > > It is my pleasure to announce that yesterday (EU) evening we've switched > > to a new distfile mirror layout. Users will be switching to the new > > layout either as they upgrade Portage to 2.3.77 or -- if they upgraded > > already -- as their caches expire (24hrs). > > > > The new layout is mostly a bow towards mirror admins, for some of whom > > having a 60000+ files in a single directory have been a problem. > > However, I suppose some of you also found e.g. the directory index > > hardly usable due to its size. > > > > Throughout a transitional period (whose exact length hasn't been decided > > yet), both layouts will be available. Afterwards, the old layout will > > be removed from mirrors. This has a few implications: > > > > 1. Users who don't upgrade their package managers in time will lose > > the ability of fetching from Gentoo mirrors. This shouldn't be that > > much of a problem given that the core software needed to upgrade Portage > > should all have reliable upstream SRC_URIs. > > > > 2. mirror://gentoo/file URIs will stop working. While technically you > > could use mirror://gentoo/XX/file, I'd rather recommend finally > > discarding its usage and moving distfiles to devspace. > > > > 3. Directly fetching files from distfiles.gentoo.org will become > > a little harder. 
To fetch a distfile named 'foo-1.tar.gz', you'd have > > to use something like: > > > > $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2 > > 1b > > $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz > > ... > > > > > > Alternatively, you can: > > > > $ wget http://distfiles.gentoo.org/distfiles/INDEX > > > > and grep for the right path there. This INDEX is also a more > > lightweight alternative to HTML indexes generated by the servers. > > > > > > If you're interested in more background details and some plots, see [1]. > > > > [1] https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html > > > > So the answer I didn't really see directly stated here is, where do new > distfiles need to go //now//? E.g., if on woodpecker, I currently cp a > distfile to /space/distfiles-local. What is the new directory I need to > use? And if mirror://gentoo/${FOO} is going away, for the new distfiles > target, what would be the applicable prefix to use? > > Directly using devspace seems like a bad idea, IMHO. Once long ago, we all > got chastised for doing exactly that. Too much possibility of fragmentation > as devs retire or package maintainership changes hands. Today you get chastised for using /space/distfiles-local and not following policy changes. The devmanual states that it's deprecated since at least 2011, and talks of using d.g.o [1]. > I looked at the whitepaper'ish-like writeup, and I kinda don't like using a > hash-based naming scheme on the new distfiles layout. I really kind prefer > breaking the directories up based on the first letter of the distfiles in > question, factoring case-sensitivity in (so you'd have 52 top-level > directories for A-Z and a-z, plus 10 more for 0-9). Under each of those > directories, additional subdirectories for the next few letters (say, > letters 2-3). 
Yes, this leads to some orphan cases where a distfile might > live on its own, but from a direct navigation standpoint, it's easy to find > for someone browsing the distfiles server and easy to predict where a > distfile is at. > > No math, statistical analysis, or deep-rooted knowledge of filesystems > behind that paragraph. Just a plain old unfiltered opinion. Sometimes, I > need to go get a distfile off the Gentoo mirrors, and being able to quickly > find it in the mirror root is great. Having to do hash calculations to work > out the file path will be *really* annoying. Your solution still doesn't solve the problem of having 8k-24k files in a single directory, even if you use 7 letters of prefix. So it just creates a lot of tiny directory noise for no practical gain. [1] https://devmanual.gentoo.org/general-concepts/mirrors/index.html#suitable-download-hosts -- Best regards, Michał Górny [-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 618 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
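The objection to prefix-based directories in this reply can be illustrated with a small experiment. The Python sketch below compares the most crowded directory under a 7-character prefix scheme with the hash scheme from the announcement. The filenames are synthetic and invented for illustration (a large family sharing one prefix, loosely modeled on the TeX Live case mentioned in the thread); they are not taken from the real mirror, and the helper names are hypothetical.

```python
import hashlib
from collections import Counter


def hash_bucket(name: str) -> str:
    # First two hex digits of BLAKE2b, as in the announcement's b2sum example.
    return hashlib.blake2b(name.encode("utf-8")).hexdigest()[:2]


def prefix_bucket(name: str, length: int = 7) -> str:
    # Prefix-based alternative: the first `length` characters of the name.
    return name[:length]


def largest_bucket(names, bucket_fn) -> int:
    """Size of the most crowded directory under the given bucketing function."""
    return max(Counter(bucket_fn(n) for n in names).values())


if __name__ == "__main__":
    # Synthetic names: 8000 files sharing a long prefix, plus 2000 others.
    names = [f"texlive-module-{i}.tar.xz" for i in range(8000)]
    names += [f"pkg{i}-1.0.tar.gz" for i in range(2000)]
    print("largest prefix bucket:", largest_bucket(names, prefix_bucket))
    print("largest hash bucket:  ", largest_bucket(names, hash_bucket))
```

With this input every `texlive-module-*` file lands in the same `texlive` directory under the prefix scheme, while the hash scheme spreads the same names over up to 256 directories regardless of how similar the names are.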
* Re: [gentoo-dev] New distfile mirror layout 2019-10-20 6:51 ` Michał Górny @ 2019-10-20 8:25 ` Joshua Kinard 2019-10-20 8:32 ` Michał Górny 2019-10-20 17:09 ` Matt Turner 2019-10-21 16:42 ` Richard Yao 2019-10-28 23:24 ` Chí-Thanh Christopher Nguyễn 2 siblings, 2 replies; 56+ messages in thread From: Joshua Kinard @ 2019-10-20 8:25 UTC (permalink / raw To: gentoo-dev On 10/20/2019 02:51, Michał Górny wrote: > On Sat, 2019-10-19 at 19:24 -0400, Joshua Kinard wrote: >> On 10/18/2019 09:41, Michał Górny wrote: >>> Hi, everybody. >>> >>> It is my pleasure to announce that yesterday (EU) evening we've switched >>> to a new distfile mirror layout. Users will be switching to the new >>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded >>> already -- as their caches expire (24hrs). >>> >>> The new layout is mostly a bow towards mirror admins, for some of whom >>> having a 60000+ files in a single directory have been a problem. >>> However, I suppose some of you also found e.g. the directory index >>> hardly usable due to its size. >>> >>> Throughout a transitional period (whose exact length hasn't been decided >>> yet), both layouts will be available. Afterwards, the old layout will >>> be removed from mirrors. This has a few implications: >>> >>> 1. Users who don't upgrade their package managers in time will lose >>> the ability of fetching from Gentoo mirrors. This shouldn't be that >>> much of a problem given that the core software needed to upgrade Portage >>> should all have reliable upstream SRC_URIs. >>> >>> 2. mirror://gentoo/file URIs will stop working. While technically you >>> could use mirror://gentoo/XX/file, I'd rather recommend finally >>> discarding its usage and moving distfiles to devspace. >>> >>> 3. Directly fetching files from distfiles.gentoo.org will become >>> a little harder. 
To fetch a distfile named 'foo-1.tar.gz', you'd have >>> to use something like: >>> >>> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2 >>> 1b >>> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz >>> ... >>> >>> >>> Alternatively, you can: >>> >>> $ wget http://distfiles.gentoo.org/distfiles/INDEX >>> >>> and grep for the right path there. This INDEX is also a more >>> lightweight alternative to HTML indexes generated by the servers. >>> >>> >>> If you're interested in more background details and some plots, see [1]. >>> >>> [1] https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html >>> >> >> So the answer I didn't really see directly stated here is, where do new >> distfiles need to go //now//? E.g., if on woodpecker, I currently cp a >> distfile to /space/distfiles-local. What is the new directory I need to >> use? And if mirror://gentoo/${FOO} is going away, for the new distfiles >> target, what would be the applicable prefix to use? >> >> Directly using devspace seems like a bad idea, IMHO. Once long ago, we all >> got chastised for doing exactly that. Too much possibility of fragmentation >> as devs retire or package maintainership changes hands. > > Today you get chastised for using /space/distfiles-local and not > following policy changes. The devmanual states that it's deprecated > since at least 2011, and talks of using d.g.o [1]. I don't recall this change being added as far back as 2011. Maybe my memory is bad, but if it was done that long ago, it was done quietly, and it was not enforced. I checked my local mailing list archives for gentoo-dev and don't see any mention of distfiles-local being deprecated back then. Why has it taken 8 years for this to get addressed? In any event, I still think using devspace is a bad idea. A centralized distfiles repo is what most other distros use, and it's what we should use. 
>> I looked at the whitepaper-ish writeup, and I kinda don't like using a
>> hash-based naming scheme on the new distfiles layout. I really kind of prefer
>> breaking the directories up based on the first letter of the distfiles in
>> question, factoring case-sensitivity in (so you'd have 52 top-level
>> directories for A-Z and a-z, plus 10 more for 0-9). Under each of those
>> directories, additional subdirectories for the next few letters (say,
>> letters 2-3). Yes, this leads to some orphan cases where a distfile might
>> live on its own, but from a direct navigation standpoint, it's easy to find
>> for someone browsing the distfiles server and easy to predict where a
>> distfile is at.
>>
>> No math, statistical analysis, or deep-rooted knowledge of filesystems
>> behind that paragraph. Just a plain old unfiltered opinion. Sometimes, I
>> need to go get a distfile off the Gentoo mirrors, and being able to quickly
>> find it in the mirror root is great. Having to do hash calculations to work
>> out the file path will be *really* annoying.
>
> Your solution still doesn't solve the problem of having 8k-24k files
> in a single directory, even if you use 7 letters of prefix. So it just
> creates a lot of tiny directory noise for no practical gain.

Why is having a max ~24k files in a directory a bad idea? Modern
filesystems are more than capable of handling that.

- ext4: unlimited files in a directory
- xfs: virtually unlimited (hard limit of 2^64-1 total files per volume)
- ntfs: 4,294,967,295

And 24k is a bit more than 1/3rd of all distfiles that we currently have.

Under which scenario do you wind up with 24k files in a single directory? I
consider the tex package an outlier in this case (one package should not be
the sole dictator of policy).

--
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.
And our lives slip away, moment by moment, lost in that vast, terrible in-between." --Emperor Turhan, Centauri Republic ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [gentoo-dev] New distfile mirror layout 2019-10-20 8:25 ` Joshua Kinard @ 2019-10-20 8:32 ` Michał Górny 2019-10-20 9:21 ` Joshua Kinard 2019-10-20 17:09 ` Matt Turner 1 sibling, 1 reply; 56+ messages in thread From: Michał Górny @ 2019-10-20 8:32 UTC (permalink / raw To: gentoo-dev [-- Attachment #1: Type: text/plain, Size: 6309 bytes --] On Sun, 2019-10-20 at 04:25 -0400, Joshua Kinard wrote: > On 10/20/2019 02:51, Michał Górny wrote: > > On Sat, 2019-10-19 at 19:24 -0400, Joshua Kinard wrote: > > > On 10/18/2019 09:41, Michał Górny wrote: > > > > Hi, everybody. > > > > > > > > It is my pleasure to announce that yesterday (EU) evening we've switched > > > > to a new distfile mirror layout. Users will be switching to the new > > > > layout either as they upgrade Portage to 2.3.77 or -- if they upgraded > > > > already -- as their caches expire (24hrs). > > > > > > > > The new layout is mostly a bow towards mirror admins, for some of whom > > > > having a 60000+ files in a single directory have been a problem. > > > > However, I suppose some of you also found e.g. the directory index > > > > hardly usable due to its size. > > > > > > > > Throughout a transitional period (whose exact length hasn't been decided > > > > yet), both layouts will be available. Afterwards, the old layout will > > > > be removed from mirrors. This has a few implications: > > > > > > > > 1. Users who don't upgrade their package managers in time will lose > > > > the ability of fetching from Gentoo mirrors. This shouldn't be that > > > > much of a problem given that the core software needed to upgrade Portage > > > > should all have reliable upstream SRC_URIs. > > > > > > > > 2. mirror://gentoo/file URIs will stop working. While technically you > > > > could use mirror://gentoo/XX/file, I'd rather recommend finally > > > > discarding its usage and moving distfiles to devspace. > > > > > > > > 3. Directly fetching files from distfiles.gentoo.org will become > > > > a little harder. 
To fetch a distfile named 'foo-1.tar.gz', you'd have > > > > to use something like: > > > > > > > > $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2 > > > > 1b > > > > $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz > > > > ... > > > > > > > > > > > > Alternatively, you can: > > > > > > > > $ wget http://distfiles.gentoo.org/distfiles/INDEX > > > > > > > > and grep for the right path there. This INDEX is also a more > > > > lightweight alternative to HTML indexes generated by the servers. > > > > > > > > > > > > If you're interested in more background details and some plots, see [1]. > > > > > > > > [1] https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html > > > > > > > > > > So the answer I didn't really see directly stated here is, where do new > > > distfiles need to go //now//? E.g., if on woodpecker, I currently cp a > > > distfile to /space/distfiles-local. What is the new directory I need to > > > use? And if mirror://gentoo/${FOO} is going away, for the new distfiles > > > target, what would be the applicable prefix to use? > > > > > > Directly using devspace seems like a bad idea, IMHO. Once long ago, we all > > > got chastised for doing exactly that. Too much possibility of fragmentation > > > as devs retire or package maintainership changes hands. > > > > Today you get chastised for using /space/distfiles-local and not > > following policy changes. The devmanual states that it's deprecated > > since at least 2011, and talks of using d.g.o [1]. > > I don't recall this change being added as far back as 2011. Maybe my memory > is bad, but if it was done that long ago, it was done quietly, and it was > not enforced. I checked my local mailing list archives for gentoo-dev and > don't see any mention of distfiles-local being deprecated back then. Why > has it taken 8 years for this to get addressed? Don't ask me. I think I was already taught to use d.g.o back when I was recruited. 
> In any event, I still think using devspace is a bad idea. A centralized > distfiles repo is what most other distros use, and it's what we should use. Talking doesn't make things happen. Coming up with good proposals that address all the problems (e.g. those listed in devmanual) does. > > > I looked at the whitepaper'ish-like writeup, and I kinda don't like using a > > > hash-based naming scheme on the new distfiles layout. I really kind prefer > > > breaking the directories up based on the first letter of the distfiles in > > > question, factoring case-sensitivity in (so you'd have 52 top-level > > > directories for A-Z and a-z, plus 10 more for 0-9). Under each of those > > > directories, additional subdirectories for the next few letters (say, > > > letters 2-3). Yes, this leads to some orphan cases where a distfile might > > > live on its own, but from a direct navigation standpoint, it's easy to find > > > for someone browsing the distfiles server and easy to predict where a > > > distfile is at. > > > > > > No math, statistical analysis, or deep-rooted knowledge of filesystems > > > behind that paragraph. Just a plain old unfiltered opinion. Sometimes, I > > > need to go get a distfile off the Gentoo mirrors, and being able to quickly > > > find it in the mirror root is great. Having to do hash calculations to work > > > out the file path will be *really* annoying. > > > > Your solution still doesn't solve the problem of having 8k-24k files > > in a single directory, even if you use 7 letters of prefix. So it just > > creates a lot of tiny directory noise for no practical gain. > > Why is having a max ~24k files in a directory a bad idea? Modern > filesystems are more than capable of handling that. > > - ext4: unlimited files in a directory > - xfs: virtually unlimited (hard limit of 2^64-1 total files per volume) > - ntfs: 4,294,967,295 > > And 24k is a bit more than 1/3rd of all distfiles that we currently have. 
For the same reason having ~60k files in a directory was a problem. There is really no point in changing anything if you change BIG_NUMBER to SMALLER_BIG_NUMBER. > Under which scenario do you wind up with 24k files in a single directory? I > consider the tex package an outlier in this case (one package should not be > the sole dictator of policy). Three versions of TeXLive living simultaneously. If one package falls completely out of bounds, no problem is solved by the change, so what's the point of making it? -- Best regards, Michał Górny
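The fetch recipe quoted from the announcement (hash the file name with b2sum, keep the first two hex digits, use them as the directory) can be wrapped in a small helper. This is only a sketch: `MIRROR_BASE` and `distfile_url` are illustrative names, not part of Portage, and it assumes GNU coreutils' `b2sum` is installed. The announcement gives `1b` as the directory for `foo-1.tar.gz`.

```shell
#!/usr/bin/env bash
# Sketch: build the new-layout URL for a distfile name.
# MIRROR_BASE and distfile_url are illustrative names (assumptions).
MIRROR_BASE="http://distfiles.gentoo.org/distfiles"

distfile_url() {
    local name=$1 dir
    # First two hex digits of the BLAKE2b hash of the file *name*
    # (printf '%s' avoids hashing a trailing newline).
    dir=$(printf '%s' "$name" | b2sum | cut -c1-2)
    printf '%s/%s/%s\n' "$MIRROR_BASE" "$dir" "$name"
}

distfile_url foo-1.tar.gz
```

The resulting URL can be passed straight to wget, e.g. `wget "$(distfile_url foo-1.tar.gz)"`.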
* Re: [gentoo-dev] New distfile mirror layout 2019-10-20 8:32 ` Michał Górny @ 2019-10-20 9:21 ` Joshua Kinard 2019-10-20 9:44 ` Michał Górny 0 siblings, 1 reply; 56+ messages in thread From: Joshua Kinard @ 2019-10-20 9:21 UTC (permalink / raw To: gentoo-dev On 10/20/2019 04:32, Michał Górny wrote: > On Sun, 2019-10-20 at 04:25 -0400, Joshua Kinard wrote: >> On 10/20/2019 02:51, Michał Górny wrote: >>> On Sat, 2019-10-19 at 19:24 -0400, Joshua Kinard wrote: >>>> On 10/18/2019 09:41, Michał Górny wrote: >>>>> Hi, everybody. >>>>> >>>>> It is my pleasure to announce that yesterday (EU) evening we've switched >>>>> to a new distfile mirror layout. Users will be switching to the new >>>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded >>>>> already -- as their caches expire (24hrs). >>>>> >>>>> The new layout is mostly a bow towards mirror admins, for some of whom >>>>> having a 60000+ files in a single directory have been a problem. >>>>> However, I suppose some of you also found e.g. the directory index >>>>> hardly usable due to its size. >>>>> >>>>> Throughout a transitional period (whose exact length hasn't been decided >>>>> yet), both layouts will be available. Afterwards, the old layout will >>>>> be removed from mirrors. This has a few implications: >>>>> >>>>> 1. Users who don't upgrade their package managers in time will lose >>>>> the ability of fetching from Gentoo mirrors. This shouldn't be that >>>>> much of a problem given that the core software needed to upgrade Portage >>>>> should all have reliable upstream SRC_URIs. >>>>> >>>>> 2. mirror://gentoo/file URIs will stop working. While technically you >>>>> could use mirror://gentoo/XX/file, I'd rather recommend finally >>>>> discarding its usage and moving distfiles to devspace. >>>>> >>>>> 3. Directly fetching files from distfiles.gentoo.org will become >>>>> a little harder. 
To fetch a distfile named 'foo-1.tar.gz', you'd have >>>>> to use something like: >>>>> >>>>> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2 >>>>> 1b >>>>> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz >>>>> ... >>>>> >>>>> >>>>> Alternatively, you can: >>>>> >>>>> $ wget http://distfiles.gentoo.org/distfiles/INDEX >>>>> >>>>> and grep for the right path there. This INDEX is also a more >>>>> lightweight alternative to HTML indexes generated by the servers. >>>>> >>>>> >>>>> If you're interested in more background details and some plots, see [1]. >>>>> >>>>> [1] https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html >>>>> >>>> >>>> So the answer I didn't really see directly stated here is, where do new >>>> distfiles need to go //now//? E.g., if on woodpecker, I currently cp a >>>> distfile to /space/distfiles-local. What is the new directory I need to >>>> use? And if mirror://gentoo/${FOO} is going away, for the new distfiles >>>> target, what would be the applicable prefix to use? >>>> >>>> Directly using devspace seems like a bad idea, IMHO. Once long ago, we all >>>> got chastised for doing exactly that. Too much possibility of fragmentation >>>> as devs retire or package maintainership changes hands. >>> >>> Today you get chastised for using /space/distfiles-local and not >>> following policy changes. The devmanual states that it's deprecated >>> since at least 2011, and talks of using d.g.o [1]. >> >> I don't recall this change being added as far back as 2011. Maybe my memory >> is bad, but if it was done that long ago, it was done quietly, and it was >> not enforced. I checked my local mailing list archives for gentoo-dev and >> don't see any mention of distfiles-local being deprecated back then. Why >> has it taken 8 years for this to get addressed? > > Don't ask me. I think I was already taught to use d.g.o back when I was > recruited. > >> In any event, I still think using devspace is a bad idea. 
A centralized >> distfiles repo is what most other distros use, and it's what we should use. > > Talking doesn't make things happen. Coming up with good proposals that > address all the problems (e.g. those listed in devmanual) does. Proposing changes when a direction has already been decided, the rudder position changed, and engines put to full power is equally pointless. You're the de facto captain of this ship lately. I expect you to not rock the boat too hard. This change is a pretty hard jolt, IMHO. >>>> I looked at the whitepaper'ish-like writeup, and I kinda don't like using a >>>> hash-based naming scheme on the new distfiles layout. I really kind prefer >>>> breaking the directories up based on the first letter of the distfiles in >>>> question, factoring case-sensitivity in (so you'd have 52 top-level >>>> directories for A-Z and a-z, plus 10 more for 0-9). Under each of those >>>> directories, additional subdirectories for the next few letters (say, >>>> letters 2-3). Yes, this leads to some orphan cases where a distfile might >>>> live on its own, but from a direct navigation standpoint, it's easy to find >>>> for someone browsing the distfiles server and easy to predict where a >>>> distfile is at. >>>> >>>> No math, statistical analysis, or deep-rooted knowledge of filesystems >>>> behind that paragraph. Just a plain old unfiltered opinion. Sometimes, I >>>> need to go get a distfile off the Gentoo mirrors, and being able to quickly >>>> find it in the mirror root is great. Having to do hash calculations to work >>>> out the file path will be *really* annoying. >>> >>> Your solution still doesn't solve the problem of having 8k-24k files >>> in a single directory, even if you use 7 letters of prefix. So it just >>> creates a lot of tiny directory noise for no practical gain. >> >> Why is having a max ~24k files in a directory a bad idea? Modern >> filesystems are more than capable of handling that.
>> >> - ext4: unlimited files in a directory >> - xfs: virtually unlimited (hard limit of 2^64-1 total files per volume) >> - ntfs: 4,294,967,295 >> >> And 24k is a bit more than 1/3rd of all distfiles that we currently have. > > For the same reason having ~60k files in a directory was a problem. > There is really no point in changing anything if you change BIG_NUMBER > to SMALLER_BIG_NUMBER. That doesn't answer my question. Why is it a problem? What criteria are you using to decide that 24k is a "smaller big number"? Is there some issue highlighted by the mirror admins where having 24k files in a single directory offers no significant relief versus the current 60k files? >> Under which scenario do you wind up with 24k files in a single directory? I >> consider the tex package an outlier in this case (one package should not be >> the sole dictator of policy). > > Three versions of TeXLive living simultaneously. If one package falls > completely out of bounds, no problem is solved by the change, so what's > the point of making it? The problem in this case is with texlive, not our current, or future, distfiles methodology. Has anyone looked at how other distros deal with texlive? Has anyone complained or filed a bug to texlive developers upstream about their excessive amount of distfiles and the burden it places on distro maintainers? -- Joshua Kinard Gentoo/MIPS kumba@gentoo.org rsa6144/5C63F4E3F5C6C943 2015-04-27 177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943 "The past tempts us, the present confuses us, the future frightens us. And our lives slip away, moment by moment, lost in that vast, terrible in-between." --Emperor Turhan, Centauri Republic ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [gentoo-dev] New distfile mirror layout 2019-10-20 9:21 ` Joshua Kinard @ 2019-10-20 9:44 ` Michał Górny 2019-10-20 20:57 ` Joshua Kinard 0 siblings, 1 reply; 56+ messages in thread From: Michał Górny @ 2019-10-20 9:44 UTC (permalink / raw To: gentoo-dev [-- Attachment #1: Type: text/plain, Size: 2665 bytes --] On Sun, 2019-10-20 at 05:21 -0400, Joshua Kinard wrote: > On 10/20/2019 04:32, Michał Górny wrote: > > On Sun, 2019-10-20 at 04:25 -0400, Joshua Kinard wrote: > > > Why is having a max ~24k files in a directory a bad idea? Modern > > > filesystems are more than capable of handling that. > > > > > > - ext4: unlimited files in a directory > > > - xfs: virtually unlimited (hard limit of 2^64-1 total files per volume) > > > - ntfs: 4,294,967,295 > > > > > > And 24k is a bit more than 1/3rd of all distfiles that we currently have. > > > > For the same reason having ~60k files in a directory was a problem. > > There is really no point in changing anything if you change BIG_NUMBER > > to SMALLER_BIG_NUMBER. > > That doesn't answer my question. Why is it a problem? What criteria are > you using to decide that 24k is a "smaller big number"? Is there some issue > highlighted by the mirror admins where having 24k files in a single > directory offers no significant relief versus the current 60k files? IIRC Robin set the goal as: | the number of files in a single directory should not exceed 1000, [1] I don't recall how that number was chosen but it's probably pretty arbitrary. In any case, I can notice the difference between working with a listing of 1k files and 24k files, on the hardware running masterdist. > > > Under which scenario do you wind up with 24k files in a single directory? I > > > consider the tex package an outlier in this case (one package should not be > > > the sole dictator of policy). > > > > Three versions of TeXLive living simultaneously. 
If one package falls > > completely out of bounds, no problem is solved by the change, so what's > > the point of making it? > > The problem in this case is with texlive, not our current, or future, > distfiles methodology. Is it? Are you suggesting we should ban upstream from using multiple distfiles with a similar prefix? What about other potential packages that may suffer from the same problem in the future? Go packages have good potential for that, given that the majority of them start with 'github.com'. > Has anyone looked at how other distros deal with texlive? Other distros don't mirror original distfiles. > Has anyone complained or filed a bug to texlive developers > upstream about their excessive amount of distfiles and the burden it places > on distro maintainers? You believe it to be a problem. Don't expect others to bother upstream with your preferences. [1] https://www.gentoo.org/glep/glep-0075.html#algorithm-for-splitting-distfiles -- Best regards, Michał Górny
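The common-prefix concern can be illustrated with a short sketch. It compares the bucket a file would land in under a name-prefix scheme with the bucket from the two-hex-digit b2sum scheme shown in the announcement. The function names and sample file names are hypothetical, and `b2sum` from GNU coreutils is assumed to be available.

```shell
#!/usr/bin/env bash
# Bucket by the first N characters of the file name (prefix scheme).
# prefix_bucket and hash_bucket are illustrative names (assumptions).
prefix_bucket() {
    local name=$1 len=${2:-3}
    printf '%s\n' "${name:0:len}"
}

# Bucket by the first two hex digits of the BLAKE2b hash of the name,
# as in the announcement's b2sum example.
hash_bucket() {
    printf '%s' "$1" | b2sum | cut -c1-2
}

# Hypothetical file names that share a long common prefix.
names=(
    texlive-module-aaa-2019.tar.xz
    texlive-module-bbb-2019.tar.xz
    texlive-module-ccc-2019.tar.xz
)

for n in "${names[@]}"; do
    printf '%s prefix=%s hash=%s\n' \
        "$n" "$(prefix_bucket "$n" 7)" "$(hash_bucket "$n")"
done
```

Even with seven letters of prefix, every sample name falls into the same `texlive` bucket, whereas the hash bucket depends on the whole name.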
* Re: [gentoo-dev] New distfile mirror layout 2019-10-20 9:44 ` Michał Górny @ 2019-10-20 20:57 ` Joshua Kinard 2019-10-21 0:05 ` Joshua Kinard 2019-10-21 10:13 ` Kent Fredric 0 siblings, 2 replies; 56+ messages in thread From: Joshua Kinard @ 2019-10-20 20:57 UTC (permalink / raw To: gentoo-dev On 10/20/2019 05:44, Michał Górny wrote: > On Sun, 2019-10-20 at 05:21 -0400, Joshua Kinard wrote: >> On 10/20/2019 04:32, Michał Górny wrote: >>> On Sun, 2019-10-20 at 04:25 -0400, Joshua Kinard wrote: >>>> Why is having a max ~24k files in a directory a bad idea? Modern >>>> filesystems are more than capable of handling that. >>>> >>>> - ext4: unlimited files in a directory >>>> - xfs: virtually unlimited (hard limit of 2^64-1 total files per volume) >>>> - ntfs: 4,294,967,295 >>>> >>>> And 24k is a bit more than 1/3rd of all distfiles that we currently have. >>> >>> For the same reason having ~60k files in a directory was a problem. >>> There is really no point in changing anything if you change BIG_NUMBER >>> to SMALLER_BIG_NUMBER. >> >> That doesn't answer my question. Why is it a problem? What criteria are >> you using to decide that 24k is a "smaller big number"? Is there some issue >> highlighted by the mirror admins where having 24k files in a single >> directory offers no significant relief versus the current 60k files? > > IIRC Robin set the goal as: > > | the number of files in a single directory should not exceed 1000, [1] > > I don't recall how that number was chosen but it's probably pretty > arbitrary. In any case, I can notice the difference between working > with a listing of 1k files and 24k files, on the hardware running > masterdist. I think it would be prudent then to get some data to help underpin why that number was chosen and add that to the GLEP, possibly as one of the references at the bottom. 
Your personal observations of a system (masterdist) that few of us have access to are not good enough, especially for future developers who may revisit this topic long after you or I are gone. > >>>> Under which scenario do you wind up with 24k files in a single directory? I >>>> consider the tex package an outlier in this case (one package should not be >>>> the sole dictator of policy). >>> >>> Three versions of TeXLive living simultaneously. If one package falls >>> completely out of bounds, no problem is solved by the change, so what's >>> the point of making it? >> >> The problem in this case is with texlive, not our current, or future, >> distfiles methodology. > > Is it? Are you suggesting we should ban upstream from using multiple > distfiles with similar prefix? What about other potential packages that > may suffer from the same problem in the future? Go packages have a good > potential, given that majority of them starts with 'github.com'. Please highlight which of my words imply in any way that I want to ban something. I simply said texlive's significant number of distfiles is a problem. That doesn't mean that I want to resolve the problem by banning it, or future packages that employ that method. My concern is that out of the tens of thousands of packages we have, we're allowing ONE package to dictate how we shape a major piece of Gentoo infrastructure, and I don't feel that the proposed solution seeks to address it. Rather, it seeks to band-aid it by wrapping the entire distro up like a mummy. >> Has anyone looked at how other distros deal with texlive? > > Other distros don't mirror original distfiles. Has thought been given to doing the same? This is arguably a better approach than mirroring original distfiles in devspace. This would significantly reduce the infrastructure burden on the project.
>> Has anyone complained or filed a bug to texlive developers >> upstream about their excessive amount of distfiles and the burden it places >> on distro maintainers? > > You believe it to be a problem. Don't expect others to bother upstream > with your preferences. Hah. So you consider texlive having 16k+ distfiles to be completely within operating norms then? I did a quick look, and it looks like the TeX project has a fairly comprehensive mirroring system distributed around the world. In fact, it looks like they emulate Perl's CPAN system with "CTAN": https://ctan.org/ I don't know the history of the texlive and other associated tex packages in Gentoo, but my guess is instead of doing what our Perl packages do, someone just decided to mirror the CTAN archive directly on the Gentoo distfiles system. It seems to me that what should actually happen is that we leverage CTAN itself, much like CPAN, and use their mirroring system instead of burdening our infrastructure as an unofficial CTAN archive. I know we've got a ton of Perl packages for the core set of Perl modules, but doesn't the CPAN eclass also have the capability to auto-generate an ebuild package for virtually any Perl package distributed via CPAN? Can that logic be used with the CTAN system in its own eclass and then we remove the 16k+ texlive modules off of our mirrors completely? Or at the worst, we might just have to generate ebuilds for texlive modules and treat them as discrete, installed packages. -- Joshua Kinard Gentoo/MIPS kumba@gentoo.org rsa6144/5C63F4E3F5C6C943 2015-04-27 177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943 "The past tempts us, the present confuses us, the future frightens us. And our lives slip away, moment by moment, lost in that vast, terrible in-between." --Emperor Turhan, Centauri Republic ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [gentoo-dev] New distfile mirror layout 2019-10-20 20:57 ` Joshua Kinard @ 2019-10-21 0:05 ` Joshua Kinard 2019-10-21 5:51 ` Ulrich Mueller ` (2 more replies) 2019-10-21 10:13 ` Kent Fredric 1 sibling, 3 replies; 56+ messages in thread From: Joshua Kinard @ 2019-10-21 0:05 UTC (permalink / raw To: gentoo-dev On 10/20/2019 16:57, Joshua Kinard wrote:> On 10/20/2019 05:44, Michal Górny wrote: >> On Sun, 2019-10-20 at 05:21 -0400, Joshua Kinard wrote: >>> On 10/20/2019 04:32, Michal Górny wrote: [snip] >> You believe it to be a problem. Don't expect others to bother upstream >> with your preferences. > > Hah. So you consider texlive having 16k+ distfiles to be completely within > operating norms then? > > I did a quick look, and it looks like the TeX project has a fairly > comprehensive mirroring system distributed around the world. In fact, it > looks like they emulate Perl's CPAN system with "CTAN": > > https://ctan.org/ > > I don't know the history of the texlive and other associated tex packages in > Gentoo, but my guess is instead of doing what our Perl packages do, someone > just decided to mirror the CTAN archive directly on the Gentoo distfiles > system. It seems to me that what should actually happen is that we leverage > CTAN itself, much like CPAN, and use their mirroring system instead of > burdening our infrastructure as an unofficial CTAN archive. > > I know we've got a ton of Perl packages for the core set of Perl modules, > but doesn't the CPAN eclass also have the capability to auto-generate an > ebuild package for virtually any Perl package distributed via CPAN? Can > that logic be used with the CTAN system in its own eclass and then we remove > the 16k+ texlive modules off of our mirrors completely? Or at the worst, we > might just have to generate ebuilds for texlive modules and treat them as > discrete, installed packages. 

So looking at texlive-latexextra-2019-r2.ebuild, it defines three variables:

- TEXLIVE_MODULE_CONTENTS, with 1,241 space-delimited module names
- TEXLIVE_MODULE_DOC_CONTENTS, with 1,227 space-delimited doc names
- TEXLIVE_MODULE_SRC_CONTENTS, with 745 space-delimited src names

Then, in texlive-module.eclass, there are these loops:

    for i in ${TEXLIVE_MODULE_CONTENTS}; do
        SRC_URI="${SRC_URI} mirror://gentoo/texlive-module-${i}-${PV}.${PKGEXT}"
    done

    # Forge doc SRC_URI
    [ -n "${TEXLIVE_MODULE_DOC_CONTENTS}" ] && SRC_URI="${SRC_URI} doc? ("
    for i in ${TEXLIVE_MODULE_DOC_CONTENTS}; do
        SRC_URI="${SRC_URI} mirror://gentoo/texlive-module-${i}-${PV}.${PKGEXT}"
    done
    [ -n "${TEXLIVE_MODULE_DOC_CONTENTS}" ] && SRC_URI="${SRC_URI} )"

    # Forge source SRC_URI
    if [ -n "${TEXLIVE_MODULE_SRC_CONTENTS}" ] ; then
        SRC_URI="${SRC_URI} source? ("
        for i in ${TEXLIVE_MODULE_SRC_CONTENTS}; do
            SRC_URI="${SRC_URI} mirror://gentoo/texlive-module-${i}-${PV}.${PKGEXT}"
        done
        SRC_URI="${SRC_URI} )"
    fi

I think this is definitely an issue with how this package is laying out its needed distfiles. It really should be leveraging the CTAN system at a minimum to fetch all of the needed distfiles so we can get them off of our distfiles mirror. Then it would be interesting to re-run the math on the distfiles distribution using the different schemes highlighted in the GLEP-75 paper.

Longer-term, I think this entire approach should be revisited by the TeX team to make it behave more like Perl or Python packages by having discrete ebuilds for these modules. That's not exactly a small undertaking, but this current approach feels very kludgy in its design and is probably asking for trouble. I looked at several of the modules on CTAN, and they each have their own version and even have different licenses. E.g.,

- altfont is licensed under "GNU General Public License" (version ??)
- achemso is licensed under "The LaTeX Project Public License 1.3c"
- arraysort is licensed under "The LaTeX Project Public License 1.2"
- amsfonts is licensed under "The SIL Open Font License"
- a0poster is licensed under "The LaTeX Project Public License" (ver ??)
- arydshln is licensed under "The LaTeX Project Public License 1"
- aurl is licensed under "Public Domain Software"

That's just a random selection from the 'a' category. Do we have copies of those licenses in the tree? Do they allow redistribution of the distfiles? For the users that want "free" software, do any of the licenses in any of the TeX modules put up any disagreeable restrictions? Etc...

-- 
Joshua Kinard
Gentoo/MIPS
kumba@gentoo.org
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us. And our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic
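To see how the quoted eclass loops turn module lists into distfile references, the logic can be run standalone on a tiny input. This is a sketch under stated assumptions: the `PV`, `PKGEXT` and module values are made-up placeholders, and `append_uris` is a hypothetical helper that folds the three loops into one function. With the real lists from texlive-latexextra-2019-r2 the count would be 1,241 + 1,227 + 745 = 3,213 distfiles for that single ebuild.

```shell
#!/usr/bin/env bash
# Placeholder values (assumptions); a real ebuild sets these itself.
PV=2019
PKGEXT=tar.xz
TEXLIVE_MODULE_CONTENTS="achemso arydshln"
TEXLIVE_MODULE_DOC_CONTENTS="achemso.doc"
TEXLIVE_MODULE_SRC_CONTENTS="achemso.source"

# Hypothetical helper: append one mirror://gentoo URI per module name.
append_uris() {
    local i
    for i in "$@"; do
        SRC_URI="${SRC_URI} mirror://gentoo/texlive-module-${i}-${PV}.${PKGEXT}"
    done
}

SRC_URI=""
append_uris ${TEXLIVE_MODULE_CONTENTS}

# Doc and source archives are wrapped in USE-conditional groups.
if [ -n "${TEXLIVE_MODULE_DOC_CONTENTS}" ]; then
    SRC_URI="${SRC_URI} doc? ("
    append_uris ${TEXLIVE_MODULE_DOC_CONTENTS}
    SRC_URI="${SRC_URI} )"
fi
if [ -n "${TEXLIVE_MODULE_SRC_CONTENTS}" ]; then
    SRC_URI="${SRC_URI} source? ("
    append_uris ${TEXLIVE_MODULE_SRC_CONTENTS}
    SRC_URI="${SRC_URI} )"
fi

# Count the distfiles referenced; quoting SRC_URI keeps "doc?" from globbing.
count=$(printf '%s\n' "$SRC_URI" | tr ' ' '\n' | grep -c '^mirror://gentoo/')
printf '%s distfiles referenced\n' "$count"
```

Every module name becomes its own `texlive-module-*` file on the mirror, which is why a handful of ebuilds can account for so many distfiles sharing one prefix.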
* Re: [gentoo-dev] New distfile mirror layout 2019-10-21 0:05 ` Joshua Kinard @ 2019-10-21 5:51 ` Ulrich Mueller 2019-10-21 10:17 ` Kent Fredric 2019-10-21 21:34 ` Mikle Kolyada 2 siblings, 0 replies; 56+ messages in thread From: Ulrich Mueller @ 2019-10-21 5:51 UTC (permalink / raw To: Joshua Kinard; +Cc: gentoo-dev [-- Attachment #1: Type: text/plain, Size: 1058 bytes --] >>>>> On Mon, 21 Oct 2019, Joshua Kinard wrote: > - altfont is licensed under "GNU General Public License" (version ??) > - achemso is licensed under "The LaTeX Project Public License 1.3c" > - arraysort is licensed under "The LaTeX Project Public License 1.2" > - amsfonts is licensed under "The SIL Open Font License" > - a0poster is licensed under "The LaTeX Project Public License" (ver ??) > - arydshln is licensed under "The LaTeX Project Public License 1" > - aurl is licensed under "Public Domain Software" > That's just a random selection from the 'a' category. Do we have > copies of those licenses in the tree? Yes. > Do they allow redistribution of the distfiles? Yes. > For the users that want "free" software, do any of the licenses in any > of the TeX modules put up any disagreeable restrictions? All of TeXLive should be free software. Upstream doesn't accept anything that is non-free. (Mistakes can happen, though. There was one non-free module in texlive-latexextra-2019, which was sorted out in bug 687328.) Ulrich [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 487 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [gentoo-dev] New distfile mirror layout 2019-10-21 0:05 ` Joshua Kinard 2019-10-21 5:51 ` Ulrich Mueller @ 2019-10-21 10:17 ` Kent Fredric 2019-10-21 21:34 ` Mikle Kolyada 2 siblings, 0 replies; 56+ messages in thread From: Kent Fredric @ 2019-10-21 10:17 UTC (permalink / raw To: gentoo-dev [-- Attachment #1: Type: text/plain, Size: 1026 bytes --] On Sun, 20 Oct 2019 20:05:40 -0400 Joshua Kinard <kumba@gentoo.org> wrote: > Longer-term, I think this entire approach should be revisited by the TeX > team to make it behave more like Perl or Python packages by having discrete > ebuilds for these modules. That's not exactly a small undertaking, but > this current approach feels very kludgy in its design and is probably > asking for trouble. I looked at several of the modules on CTAN, and they > each have their own version and even have different licenses. With the current state of the portage dependency resolver, and with regards to the constant problems end users face with it, I really can't advise this unless you need to. Currently working on vendoring rust in an overlay, and 128 ebuilds just to satisfy the dependencies enough to test *one* package is a bit of a piss-take. I'd suggest waiting a few years for portage to see some improvements here before taking on something that ambitious when the current approach works well enough. [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [gentoo-dev] New distfile mirror layout
From: Mikle Kolyada @ 2019-10-21 21:34 UTC
To: gentoo-dev

On 21.10.2019 3:05, Joshua Kinard wrote:
> So looking at texlive-latexextra-2019-r2.ebuild, it defines three variables:
>
> - TEXLIVE_MODULE_CONTENTS, with 1,241 space-delimited module names
> - TEXLIVE_MODULE_DOC_CONTENTS, with 1,227 space-delimited doc names
> - TEXLIVE_MODULE_SRC_CONTENTS, with 745 space-delimited src names
>
> Then, in texlive-module.eclass, there's these loops:
>
>     for i in ${TEXLIVE_MODULE_CONTENTS}; do
>         SRC_URI="${SRC_URI} mirror://gentoo/texlive-module-${i}-${PV}.${PKGEXT}"
>     done
>
>     # Forge doc SRC_URI
>     [ -n "${TEXLIVE_MODULE_DOC_CONTENTS}" ] && SRC_URI="${SRC_URI} doc? ("
>     for i in ${TEXLIVE_MODULE_DOC_CONTENTS}; do
>         SRC_URI="${SRC_URI} mirror://gentoo/texlive-module-${i}-${PV}.${PKGEXT}"
>     done
>     [ -n "${TEXLIVE_MODULE_DOC_CONTENTS}" ] && SRC_URI="${SRC_URI} )"
>
>     # Forge source SRC_URI
>     if [ -n "${TEXLIVE_MODULE_SRC_CONTENTS}" ] ; then
>         SRC_URI="${SRC_URI} source? ("
>         for i in ${TEXLIVE_MODULE_SRC_CONTENTS}; do
>             SRC_URI="${SRC_URI} mirror://gentoo/texlive-module-${i}-${PV}.${PKGEXT}"
>         done
>         SRC_URI="${SRC_URI} )"
>     fi
>
> I think this is definitely an issue with how this package is laying out its
> needed distfiles. It really should be leveraging CTAN system at a minimum
> to fetch all of the needed distfiles so we can get them off of our
> distfiles mirror. Then it would be interesting to re-run the math on
> the distfiles distribution using the different schemes highlighted in the
> GLEP-75 paper.

TexLive distributes collections of macros rather than separate packages, and bases its packaging on CTAN. Meanwhile, CTAN packages are not versioned: they only have an internal release number, with no tags, releases and so on; see [1].

I also fail to see what problem you are trying to solve by suggesting that we fetch macros from CTAN; you are going to have the same amount of data mirrored as a result.

[1] - https://ctan.org/tex-archive/systems/texlive/tlnet/archive
* Re: [gentoo-dev] New distfile mirror layout
From: Kent Fredric @ 2019-10-21 10:13 UTC
To: gentoo-dev

On Sun, 20 Oct 2019 16:57:54 -0400 Joshua Kinard <kumba@gentoo.org> wrote:
> I know we've got a ton of Perl packages for the core set of Perl modules,
> but doesn't the CPAN eclass also have the capability to auto-generate an
> ebuild package for virtually any Perl package distributed via CPAN? Can
> that logic be used with the CTAN system in its own eclass and then we remove
> the 16k+ texlive modules off of our mirrors completely? Or at the worst, we
> might just have to generate ebuilds for texlive modules and treat them as
> discrete, installed packages.

- Perl packages never have more than 1:1 source archives per ebuild
- Perl upstream naming doesn't habitually use "perl-" as an archive prefix
- Everything that is packaged for Perl in Gentoo is mirrored to the Gentoo distfiles mirror, and this causes no issues.

So I don't think any comparison here makes sense.
* Re: [gentoo-dev] New distfile mirror layout 2019-10-21 10:13 ` Kent Fredric @ 2019-10-23 5:16 ` Joshua Kinard 2019-10-29 16:35 ` Kent Fredric 0 siblings, 1 reply; 56+ messages in thread From: Joshua Kinard @ 2019-10-23 5:16 UTC (permalink / raw To: gentoo-dev On 10/21/2019 06:13, Kent Fredric wrote: > On Sun, 20 Oct 2019 16:57:54 -0400 > Joshua Kinard <kumba@gentoo.org> wrote: > >> I know we've got a ton of Perl packages for the core set of Perl modules, >> but doesn't the CPAN eclass also have the capability to auto-generate an >> ebuild package for virtually any Perl package distributed via CPAN? Can >> that logic be used with the CTAN system in its own eclass and then we remove >> the 16k+ texlive modules off of our mirrors completely? Or at the worst, we >> might just have to generate ebuilds for texlive modules and treat them as >> discrete, installed packages. > > - Perl packages never have more than 1:1 source archives per ebuild > - Perl upstream naming doesn't habitually use "perl-" as an archive prefix > - Everything that is packaged for Perl in Gentoo is mirrored to the > Gentoo distfiles mirror, and this causes no issues. > > So I don't think any comparison here makes sense. I have to disagree on the "doesn't make sense" bit. Regardless of what it is that TexLive is packaging, the problem that I feel exists is storing these macro packages on our mirrors is what is responsible for 20% of *all* distfiles that we store. That's lopsided that a small collection of ebuilds, due to the way their build logic is architected, has that many distfiles on the mirrors. It's scenarios like that which led to Michał developing the GLEP the way he did. His approach is more broad, seeking to future-proof the mirroring issue regardless of package mirroring decisions, whereas I'm more curious why texlive needs all of those packages on our mirrors when it appears to have a fairly comprehensive mirroring system of its own. Why reinvent the wheel? 
Since CTAN exists as a worldwide mirroring system, I think at a minimum, we should try to fetch from that directly instead of mirroring them on our own systems and partner mirrors. Or we could go the other way and become an official CTAN mirror ourselves. After all, if we're going to reinvent the wheel, do all four instead of just one. And for Perl or Python, I think we should be making an effort to leverage their respective mirroring systems first before putting their distfiles onto our mirrors. Perl's got CPAN, and Python has pypi. For things that don't exist on those systems, then we use our mirrors. -- Joshua Kinard Gentoo/MIPS kumba@gentoo.org rsa6144/5C63F4E3F5C6C943 2015-04-27 177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943 "The past tempts us, the present confuses us, the future frightens us. And our lives slip away, moment by moment, lost in that vast, terrible in-between." --Emperor Turhan, Centauri Republic ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [gentoo-dev] New distfile mirror layout 2019-10-23 5:16 ` Joshua Kinard @ 2019-10-29 16:35 ` Kent Fredric 0 siblings, 0 replies; 56+ messages in thread From: Kent Fredric @ 2019-10-29 16:35 UTC (permalink / raw To: gentoo-dev [-- Attachment #1: Type: text/plain, Size: 1246 bytes --]

On Wed, 23 Oct 2019 01:16:51 -0400 Joshua Kinard <kumba@gentoo.org> wrote:

> And for Perl or Python, I think we should be making an effort to leverage
> their respective mirroring systems first before putting their distfiles onto
> our mirrors. Perl's got CPAN, and Python has pypi. For things that don't
> exist on those systems, then we use our mirrors.

We still have to mirror them, because upstream has a tendency to nuke things so that they can't be fetched any more from these primary sources.

So whether end users fetch from the distfiles mirror for the first hit, or as a fallback, the cost is still there.

The packages aren't broken and upstream hasn't stopped shipping them; some upstreams just have a fetish for nuking everything but the latest-and-greatest, at a pace that is absolutely ridiculous and that we can't be expected to keep up with, given all the stabilization rigmarole.

Yes, backpan does exist, but it's neither perfect nor fast. And the faster upstream nukes things, the more likely it is that they won't even be mirrored on backpan! (I wish I were imagining this circumstance, but it's happened far too often.)

And we're not doing our users any service by burdening them with this madness.

[-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [gentoo-dev] New distfile mirror layout 2019-10-20 8:25 ` Joshua Kinard 2019-10-20 8:32 ` Michał Górny @ 2019-10-20 17:09 ` Matt Turner 1 sibling, 0 replies; 56+ messages in thread From: Matt Turner @ 2019-10-20 17:09 UTC (permalink / raw To: gentoo development On Sun, Oct 20, 2019 at 1:25 AM Joshua Kinard <kumba@gentoo.org> wrote: > In any event, I still think using devspace is a bad idea. A centralized > distfiles repo is what most other distros use, and it's what we should use. I agree, but let's discuss that in a separate topic. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [gentoo-dev] New distfile mirror layout 2019-10-20 6:51 ` Michał Górny 2019-10-20 8:25 ` Joshua Kinard @ 2019-10-21 16:42 ` Richard Yao 2019-10-21 23:36 ` Matt Turner ` (2 more replies) 2019-10-28 23:24 ` Chí-Thanh Christopher Nguyễn 2 siblings, 3 replies; 56+ messages in thread From: Richard Yao @ 2019-10-21 16:42 UTC (permalink / raw To: gentoo-dev > On Oct 20, 2019, at 2:51 AM, Michał Górny <mgorny@gentoo.org> wrote: > > On Sat, 2019-10-19 at 19:24 -0400, Joshua Kinard wrote: >>> On 10/18/2019 09:41, Michał Górny wrote: >>> Hi, everybody. >>> >>> It is my pleasure to announce that yesterday (EU) evening we've switched >>> to a new distfile mirror layout. Users will be switching to the new >>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded >>> already -- as their caches expire (24hrs). >>> >>> The new layout is mostly a bow towards mirror admins, for some of whom >>> having a 60000+ files in a single directory have been a problem. >>> However, I suppose some of you also found e.g. the directory index >>> hardly usable due to its size. >>> >>> Throughout a transitional period (whose exact length hasn't been decided >>> yet), both layouts will be available. Afterwards, the old layout will >>> be removed from mirrors. This has a few implications: >>> >>> 1. Users who don't upgrade their package managers in time will lose >>> the ability of fetching from Gentoo mirrors. This shouldn't be that >>> much of a problem given that the core software needed to upgrade Portage >>> should all have reliable upstream SRC_URIs. >>> >>> 2. mirror://gentoo/file URIs will stop working. While technically you >>> could use mirror://gentoo/XX/file, I'd rather recommend finally >>> discarding its usage and moving distfiles to devspace. >>> >>> 3. Directly fetching files from distfiles.gentoo.org will become >>> a little harder. 
To fetch a distfile named 'foo-1.tar.gz', you'd have >>> to use something like: >>> >>> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2 >>> 1b >>> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz >>> ... >>> >>> >>> Alternatively, you can: >>> >>> $ wget http://distfiles.gentoo.org/distfiles/INDEX >>> >>> and grep for the right path there. This INDEX is also a more >>> lightweight alternative to HTML indexes generated by the servers. >>> >>> >>> If you're interested in more background details and some plots, see [1]. >>> >>> [1] https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html >>> >> >> So the answer I didn't really see directly stated here is, where do new >> distfiles need to go //now//? E.g., if on woodpecker, I currently cp a >> distfile to /space/distfiles-local. What is the new directory I need to >> use? And if mirror://gentoo/${FOO} is going away, for the new distfiles >> target, what would be the applicable prefix to use? >> >> Directly using devspace seems like a bad idea, IMHO. Once long ago, we all >> got chastised for doing exactly that. Too much possibility of fragmentation >> as devs retire or package maintainership changes hands. > > Today you get chastised for using /space/distfiles-local and not > following policy changes. The devmanual states that it's deprecated > since at least 2011, and talks of using d.g.o [1]. > >> I looked at the whitepaper'ish-like writeup, and I kinda don't like using a >> hash-based naming scheme on the new distfiles layout. I really kind prefer >> breaking the directories up based on the first letter of the distfiles in >> question, factoring case-sensitivity in (so you'd have 52 top-level >> directories for A-Z and a-z, plus 10 more for 0-9). Under each of those >> directories, additional subdirectories for the next few letters (say, >> letters 2-3). 
Yes, this leads to some orphan cases where a distfile might >> live on its own, but from a direct navigation standpoint, it's easy to find >> for someone browsing the distfiles server and easy to predict where a >> distfile is at. >> >> No math, statistical analysis, or deep-rooted knowledge of filesystems >> behind that paragraph. Just a plain old unfiltered opinion. Sometimes, I >> need to go get a distfile off the Gentoo mirrors, and being able to quickly >> find it in the mirror root is great. Having to do hash calculations to work >> out the file path will be *really* annoying. > > Your solution still doesn't solve the problem of having 8k-24k files > in a single directory, even if you use 7 letters of prefix. So it just > creates a lot of tiny directory noise for no practical gain. > > [1] https://devmanual.gentoo.org/general-concepts/mirrors/index.html#suitable-download-hosts If we consider the access frequency, it might actually not be that bad. Consider a simple example with 500 files and two directory buckets. If we have 250 in each, then the size of the directory is always 250. However, if 50 files are accessed 90% of the time, then putting 450 into one directory and that 50 into another directory, we end up with the performance of the O(n) directory lookup being consistent with there being only 90 files in each directory. I am not sure if we should be discarding all other considerations to make changes to benefit O(n) directory lookup filesystems, but if we are, then the hashing approach is not necessarily the best one. It is only the best when all files are accessed with equal frequency, which would be an incorrect assumption. A more human friendly approach might still be better. I doubt that we have the data to determine that though. Also, another idea is to use a cheap hash function (e.g. fletcher) and just have the mirrors do the hashing behind the scenes. Then we would have the best of both worlds. 
> > -- > Best regards, > Michał Górny > ^ permalink raw reply [flat|nested] 56+ messages in thread
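The two-bucket arithmetic in Richard Yao's message above can be checked with a short sketch. It assumes, as the message does, that the cost of an O(n) lookup is proportional to the number of entries in the directory being searched; the function name and the percentage-based input format are illustrative only and come from neither Portage nor the GLEP.

```python
def expected_entries_scanned(buckets):
    """Access-weighted directory size for an O(n) lookup.

    buckets is a list of (files_in_bucket, percent_of_accesses) pairs.
    Assumption: lookup cost is proportional to the size of the directory
    that holds the requested file.
    """
    total_percent = sum(percent for _, percent in buckets)
    if total_percent != 100:
        raise ValueError("access percentages must sum to 100")
    return sum(size * percent for size, percent in buckets) / 100


# Even split: 250 files per directory, regardless of access pattern.
even_split = expected_entries_scanned([(250, 50), (250, 50)])

# Skewed split: the 50 hot files (90% of accesses) get their own directory.
skewed_split = expected_entries_scanned([(50, 90), (450, 10)])

print(even_split, skewed_split)
```

Under this model the skewed split costs 0.9 * 50 + 0.1 * 450 = 90 entries per lookup on average, against 250 for the even split, which is the figure quoted in the message. Whether real mirror traffic is that skewed is not something the thread provides data for.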
* Re: [gentoo-dev] New distfile mirror layout 2019-10-21 16:42 ` Richard Yao @ 2019-10-21 23:36 ` Matt Turner 2019-10-23 5:18 ` Joshua Kinard 2019-10-22 6:51 ` Jaco Kroon 2019-10-23 1:21 ` Rich Freeman 2 siblings, 1 reply; 56+ messages in thread From: Matt Turner @ 2019-10-21 23:36 UTC (permalink / raw To: gentoo development On Mon, Oct 21, 2019 at 9:42 AM Richard Yao <ryao@gentoo.org> wrote: > Also, another idea is to use a cheap hash function (e.g. fletcher) and just have the mirrors do the hashing behind the scenes. Then we would have the best of both worlds. It probably would have been better to make these suggestions when the GLEP was discussed close to two years ago. I'm glad that we have ideas for improvements but I worry that we're just backseat driving at this point given that the GLEP's now implemented. ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [gentoo-dev] New distfile mirror layout 2019-10-21 23:36 ` Matt Turner @ 2019-10-23 5:18 ` Joshua Kinard 2019-10-23 17:06 ` William Hubbs 2019-10-23 22:04 ` William Hubbs 0 siblings, 2 replies; 56+ messages in thread From: Joshua Kinard @ 2019-10-23 5:18 UTC (permalink / raw To: gentoo-dev On 10/21/2019 19:36, Matt Turner wrote: > On Mon, Oct 21, 2019 at 9:42 AM Richard Yao <ryao@gentoo.org> wrote: >> Also, another idea is to use a cheap hash function (e.g. fletcher) and just have the mirrors do the hashing behind the scenes. Then we would have the best of both worlds. > > It probably would have been better to make these suggestions when the > GLEP was discussed close to two years ago. > > I'm glad that we have ideas for improvements but I worry that we're > just backseat driving at this point given that the GLEP's now > implemented. Agreed, although, I don't even remember this coming up two years ago. But, I was tied up with a lot of work-related stress and tasks, so probably just my memory storage backend not having enough cycles to commit it to...neurons. IMHO, perhaps future GLEPs should have a defined window to implement them following discussion. Having the discussion, then waiting a few years before implementing them leads to discussions like this where we're arguing about the color of the boat after the boat has sailed off into the distance. -- Joshua Kinard Gentoo/MIPS kumba@gentoo.org rsa6144/5C63F4E3F5C6C943 2015-04-27 177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943 "The past tempts us, the present confuses us, the future frightens us. And our lives slip away, moment by moment, lost in that vast, terrible in-between." --Emperor Turhan, Centauri Republic ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [gentoo-dev] New distfile mirror layout 2019-10-23 5:18 ` Joshua Kinard @ 2019-10-23 17:06 ` William Hubbs 2019-10-23 18:38 ` William Hubbs 2019-10-23 22:04 ` William Hubbs 1 sibling, 1 reply; 56+ messages in thread From: William Hubbs @ 2019-10-23 17:06 UTC (permalink / raw To: gentoo-dev [-- Attachment #1: Type: text/plain, Size: 1149 bytes --] On Wed, Oct 23, 2019 at 01:18:02AM -0400, Joshua Kinard wrote: > On 10/21/2019 19:36, Matt Turner wrote: > > On Mon, Oct 21, 2019 at 9:42 AM Richard Yao <ryao@gentoo.org> wrote: > >> Also, another idea is to use a cheap hash function (e.g. fletcher) and just have the mirrors do the hashing behind the scenes. Then we would have the best of both worlds. > > > > It probably would have been better to make these suggestions when the > > GLEP was discussed close to two years ago. > > > > I'm glad that we have ideas for improvements but I worry that we're > > just backseat driving at this point given that the GLEP's now > > implemented. Nothing is really etched in stone, so we could change it again if we see fit. *snip* > IMHO, perhaps future GLEPs should have a defined window to implement them > following discussion. Having the discussion, then waiting a few years > before implementing them leads to discussions like this where we're arguing > about the color of the boat after the boat has sailed off into the distance. Agreed. I will work on a proposal for the next council meeting. Thanks, William [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [gentoo-dev] New distfile mirror layout 2019-10-23 17:06 ` William Hubbs @ 2019-10-23 18:38 ` William Hubbs 0 siblings, 0 replies; 56+ messages in thread From: William Hubbs @ 2019-10-23 18:38 UTC (permalink / raw To: gentoo-dev [-- Attachment #1: Type: text/plain, Size: 1155 bytes --] On Wed, Oct 23, 2019 at 12:06:24PM -0500, William Hubbs wrote: > On Wed, Oct 23, 2019 at 01:18:02AM -0400, Joshua Kinard wrote: > > On 10/21/2019 19:36, Matt Turner wrote: > > > On Mon, Oct 21, 2019 at 9:42 AM Richard Yao <ryao@gentoo.org> wrote: > > >> Also, another idea is to use a cheap hash function (e.g. fletcher) and just have the mirrors do the hashing behind the scenes. Then we would have the best of both worlds. > > > > > > It probably would have been better to make these suggestions when the > > > GLEP was discussed close to two years ago. > > > > > > I'm glad that we have ideas for improvements but I worry that we're > > > just backseat driving at this point given that the GLEP's now > > > implemented. > > Nothing is really etched in stone, so we could change it again if we > see fit. Actually, which glep are we talking about? If we are talking about glep 75, I don't see where the council approved it [1], so it definitely should be discussed/approved before any implementation changes are made, or we should see where it was approved. William [1] https://www.gentoo.org/glep/glep-0075.html [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [gentoo-dev] New distfile mirror layout 2019-10-23 5:18 ` Joshua Kinard 2019-10-23 17:06 ` William Hubbs @ 2019-10-23 22:04 ` William Hubbs 2019-10-24 4:30 ` Michał Górny 1 sibling, 1 reply; 56+ messages in thread From: William Hubbs @ 2019-10-23 22:04 UTC (permalink / raw To: gentoo-dev [-- Attachment #1: Type: text/plain, Size: 2056 bytes --] On Wed, Oct 23, 2019 at 01:18:02AM -0400, Joshua Kinard wrote: > On 10/21/2019 19:36, Matt Turner wrote: > > On Mon, Oct 21, 2019 at 9:42 AM Richard Yao <ryao@gentoo.org> wrote: > >> Also, another idea is to use a cheap hash function (e.g. fletcher) and just have the mirrors do the hashing behind the scenes. Then we would have the best of both worlds. > > > > It probably would have been better to make these suggestions when the > > GLEP was discussed close to two years ago. > > > > I'm glad that we have ideas for improvements but I worry that we're > > just backseat driving at this point given that the GLEP's now > > implemented. > > Agreed, although, I don't even remember this coming up two years ago. But, > I was tied up with a lot of work-related stress and tasks, so probably just > my memory storage backend not having enough cycles to commit it to...neurons. After looking at this further, I found that the glep was presented to us in Jan 2018 on the dev ml [1]. I checked all council meeting logs and discovered that this was never brought to us formally for approval. It looks like the developers decided to do this as an infrastructure/portage project and because of that they felt like they didn't need a glep. Thanks, William [1] https://archives.gentoo.org/gentoo-dev/message/cfc4f8595df2edf9a25ba9ecae2463ba > IMHO, perhaps future GLEPs should have a defined window to implement them > following discussion. Having the discussion, then waiting a few years > before implementing them leads to discussions like this where we're arguing > about the color of the boat after the boat has sailed off into the distance. 
> > -- > Joshua Kinard > Gentoo/MIPS > kumba@gentoo.org > rsa6144/5C63F4E3F5C6C943 2015-04-27 > 177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943 > > "The past tempts us, the present confuses us, the future frightens us. And > our lives slip away, moment by moment, lost in that vast, terrible in-between." > > --Emperor Turhan, Centauri Republic > [-- Attachment #2: Digital signature --] [-- Type: application/pgp-signature, Size: 195 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [gentoo-dev] New distfile mirror layout 2019-10-23 22:04 ` William Hubbs @ 2019-10-24 4:30 ` Michał Górny 0 siblings, 0 replies; 56+ messages in thread From: Michał Górny @ 2019-10-24 4:30 UTC (permalink / raw To: gentoo-dev [-- Attachment #1: Type: text/plain, Size: 1560 bytes --]

On Wed, 2019-10-23 at 17:04 -0500, William Hubbs wrote:
> On Wed, Oct 23, 2019 at 01:18:02AM -0400, Joshua Kinard wrote:
> > On 10/21/2019 19:36, Matt Turner wrote:
> > > On Mon, Oct 21, 2019 at 9:42 AM Richard Yao <ryao@gentoo.org> wrote:
> > > > Also, another idea is to use a cheap hash function (e.g. fletcher) and just have the mirrors do the hashing behind the scenes. Then we would have the best of both worlds.
> > >
> > > It probably would have been better to make these suggestions when the
> > > GLEP was discussed close to two years ago.
> > >
> > > I'm glad that we have ideas for improvements but I worry that we're
> > > just backseat driving at this point given that the GLEP's now
> > > implemented.
> >
> > Agreed, although, I don't even remember this coming up two years ago. But,
> > I was tied up with a lot of work-related stress and tasks, so probably just
> > my memory storage backend not having enough cycles to commit it to...neurons.
>
> After looking at this further, I found that the glep was presented to
> us in Jan 2018 on the dev ml [1].
>
> I checked all council meeting logs and discovered that this was never
> brought to us formally for approval.
>
> It looks like the developers decided to do this as an
> infrastructure/portage project and because of that they felt like they
> didn't need a glep.
>

...or it was simply forgotten whether it was approved or not, after waiting almost two years for the Portage team to provide a reference implementation.

--
Best regards,
Michał Górny

[-- Attachment #2: This is a digitally signed message part --] [-- Type: application/pgp-signature, Size: 618 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [gentoo-dev] New distfile mirror layout 2019-10-21 16:42 ` Richard Yao 2019-10-21 23:36 ` Matt Turner @ 2019-10-22 6:51 ` Jaco Kroon 2019-10-22 8:43 ` Ulrich Mueller 2019-10-23 23:47 ` ext4 readdir performance - was " Richard Yao 2019-10-23 1:21 ` Rich Freeman 2 siblings, 2 replies; 56+ messages in thread From: Jaco Kroon @ 2019-10-22 6:51 UTC (permalink / raw To: gentoo-dev, Richard Yao [-- Attachment #1: Type: text/plain, Size: 3918 bytes --] Hi All, On 2019/10/21 18:42, Richard Yao wrote: > > If we consider the access frequency, it might actually not be that bad. Consider a simple example with 500 files and two directory buckets. If we have 250 in each, then the size of the directory is always 250. However, if 50 files are accessed 90% of the time, then putting 450 into one directory and that 50 into another directory, we end up with the performance of the O(n) directory lookup being consistent with there being only 90 files in each directory. > > I am not sure if we should be discarding all other considerations to make changes to benefit O(n) directory lookup filesystems, but if we are, then the hashing approach is not necessarily the best one. It is only the best when all files are accessed with equal frequency, which would be an incorrect assumption. A more human friendly approach might still be better. I doubt that we have the data to determine that though. > > Also, another idea is to use a cheap hash function (e.g. fletcher) and just have the mirrors do the hashing behind the scenes. Then we would have the best of both worlds. Experience: ext4 sucks at targeting name lookups without dir_index feature (O(n) lookups - scans all entries in the folder). With dir_index readdir performance is crap. Pick your poison I guess. Most of our larger filesystems (2TB+, but especially the 80TB+ ones) we've reverted to disabling dir_index as the benefit is outweighed by the crappy readdir() and glob() performance. 

There doesn't seem to be a real specific tip-over point, and it seems to depend a lot on RAM availability and hard drive speed (obviously). So if dentries get cached, disk speed becomes less of an issue. However, on large folders (where I typically use 10k as a value for large based on "gut feeling" and "unquantifiable experience" and "nothing scientific at all") I find that even with lots of RAM two consecutive ls commands remain terribly slow. Switch off dir_index and that becomes an order of magnitude faster.

I don't have a great deal of experience with XFS, but on those systems where we do use it, it's generally on a VM, and perceivably (again, not scientific) our experience has been that it feels slower. Again, not scientific, just perception.

I'm in support of the change. This will bucket to 256 folders and should have a reasonably even split between folders. If required, a second layer could be introduced by using the 3rd and 4th digits of the hash. Any hash should be fine: it really doesn't need to be cryptographically strong, it just needs to provide a good spread and be really fast. Generally a hash table should have a prime number of buckets to assist with hash bias, but frankly, that's overcomplicating the situation here.

I also agree with others that it used to be easy to get distfiles as and when needed, so an alternative structure could mirror that of the portage tree itself, in other words "cat/pkg/distfile". This perhaps just shifts the issue:

jkroon@plastiekpoot /usr/portage $ find . -maxdepth 1 -type d -name "*-*" | wc -l
167
jkroon@plastiekpoot /usr/portage $ find *-* -maxdepth 1 -type d | wc -l
19412
jkroon@plastiekpoot /usr/portage $ for i in *-*; do echo $(find $i -maxdepth 1 -type d | wc -l) $i; done | sort -g | tail -n10
347 net-misc
373 media-sound
395 media-libs
399 dev-util
505 dev-libs
528 dev-java
684 dev-haskell
690 dev-ruby
1601 dev-perl
1889 dev-python

So that's an average of 116 subfolders under the top layer (only two over 1000), and then presumably fewer than 100 distfiles maximum per package? Probably overkill, but it would (should) solve both the too-many-files-per-folder issue and the easy-lookup-by-hand issue.

I don't have a preference for either solution, though I do agree that "easy finding of distfiles" is handy. The INDEX mechanism is fine for me.

Kind Regards,
Jaco

[-- Attachment #2: Type: text/html, Size: 5146 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
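The bucketing described above (256 folders from the first two hex digits of the hash, optionally a second layer from the 3rd and 4th digits) can be sketched as follows. This is a minimal illustration, assuming BLAKE2b over the bare file name as in the `b2sum` example from the original announcement; the function name `distfile_relpath` and its `levels` parameter are hypothetical and not taken from Portage.

```python
import hashlib


def distfile_relpath(filename, levels=1):
    """Return a hashed relative path such as 'ab/foo-1.tar.gz'.

    Assumption: the bucket is the leading hex digits of the BLAKE2b digest
    of the file name, two digits (8 bits, 256 buckets) per directory level.
    hashlib.blake2b defaults to a 64-byte digest.
    """
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    # Level 0 uses hex digits 1-2, level 1 uses digits 3-4, and so on.
    buckets = [digest[2 * i:2 * i + 2] for i in range(levels)]
    return "/".join(buckets + [filename])


one_level = distfile_relpath("foo-1.tar.gz")
two_levels = distfile_relpath("foo-1.tar.gz", levels=2)
print(one_level)
print(two_levels)
```

With `levels=2` the same digest yields 256 * 256 = 65536 possible leaf directories, which is the second layer suggested in the message; the deployed layout discussed in this thread uses a single level.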
* Re: [gentoo-dev] New distfile mirror layout 2019-10-22 6:51 ` Jaco Kroon @ 2019-10-22 8:43 ` Ulrich Mueller 2019-10-22 8:46 ` Jaco Kroon 2019-10-23 23:47 ` ext4 readdir performance - was " Richard Yao 1 sibling, 1 reply; 56+ messages in thread From: Ulrich Mueller @ 2019-10-22 8:43 UTC (permalink / raw To: Jaco Kroon; +Cc: gentoo-dev, Richard Yao [-- Attachment #1: Type: text/plain, Size: 478 bytes --] >>>>> On Tue, 22 Oct 2019, Jaco Kroon wrote: > I also agree with others that it used to be easy to get distfiles as > and when needed, so an alternative structure could mirror that of the > portage tree itself, in other words "cat/pkg/distfile". Not a good idea, because some distfiles are shared between packages. For example, sys-kernel/*-sources use the same distfiles. (It won't work with categories either, e.g., there are dev-lang/ruby and app-emacs/ruby-mode.) Ulrich [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 487 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
* Re: [gentoo-dev] New distfile mirror layout 2019-10-22 8:43 ` Ulrich Mueller @ 2019-10-22 8:46 ` Jaco Kroon 0 siblings, 0 replies; 56+ messages in thread From: Jaco Kroon @ 2019-10-22 8:46 UTC (permalink / raw To: gentoo-dev, Ulrich Mueller; +Cc: Richard Yao Hi, On 2019/10/22 10:43, Ulrich Mueller wrote: >>>>>> On Tue, 22 Oct 2019, Jaco Kroon wrote: >> I also agree with others that it used to be easy to get distfiles as >> and when needed, so an alternative structure could mirror that of the >> portage tree itself, in other words "cat/pkg/distfile". > Not a good idea, because some distfiles are shared between packages. > For example, sys-kernel/*-sources use the same distfiles. (It won't > work with categories either, e.g., there are dev-lang/ruby and > app-emacs/ruby-mode.) > > Ulrich You are absolutely correct. I then fully agree with current implementation. Kind Regards, Jaco ^ permalink raw reply [flat|nested] 56+ messages in thread
* ext4 readdir performance - was Re: [gentoo-dev] New distfile mirror layout 2019-10-22 6:51 ` Jaco Kroon 2019-10-22 8:43 ` Ulrich Mueller @ 2019-10-23 23:47 ` Richard Yao 2019-10-24 0:01 ` Richard Yao 1 sibling, 1 reply; 56+ messages in thread From: Richard Yao @ 2019-10-23 23:47 UTC (permalink / raw To: Jaco Kroon, gentoo-dev [-- Attachment #1.1: Type: text/plain, Size: 5971 bytes --] On 10/22/19 2:51 AM, Jaco Kroon wrote: > Hi All, > > > On 2019/10/21 18:42, Richard Yao wrote: >> >> If we consider the access frequency, it might actually not be that >> bad. Consider a simple example with 500 files and two directory >> buckets. If we have 250 in each, then the size of the directory is >> always 250. However, if 50 files are accessed 90% of the time, then >> putting 450 into one directory and that 50 into another directory, we >> end up with the performance of the O(n) directory lookup being >> consistent with there being only 90 files in each directory. >> >> I am not sure if we should be discarding all other considerations to >> make changes to benefit O(n) directory lookup filesystems, but if we >> are, then the hashing approach is not necessarily the best one. It is >> only the best when all files are accessed with equal frequency, which >> would be an incorrect assumption. A more human friendly approach might >> still be better. I doubt that we have the data to determine that though. >> >> Also, another idea is to use a cheap hash function (e.g. fletcher) and >> just have the mirrors do the hashing behind the scenes. Then we would >> have the best of both worlds. > > > Experience: > > ext4 sucks at targeting name lookups without dir_index feature (O(n) > lookups - scans all entries in the folder). With dir_index readdir > performance is crap. Pick your poison I guess. Most of our larger > filesystems (2TB+, but especially the 80TB+ ones) we've reverted to > disabling dir_index as the benefit is outweighed by the crappy readdir() > and glob() performance. 

My read of the ext4 disk layout documentation is that the read operation should work mostly the same way, except with a penalty from reading larger directories caused by the addition of the tree's metadata and from having more partially filled blocks:

https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Directory_Entries

The code itself is the same traversal code:

https://github.com/torvalds/linux/blob/v5.3/fs/ext4/dir.c#L106

However, a couple of things stand out to me here at a glance:

1. `cond_resched()` adds scheduler delay for no apparent reason. `cond_resched()` is meant to be used in places where we could block excessively on non-PREEMPT kernels, but this doesn't strike me as one of those places. The fact that we block on disk on uncached reads naturally serves the same purpose, so an explicit rescheduling point here is redundant. PREEMPT kernels should perform better in readdir() on ext4 by virtue of making `cond_resched()` a no-op.

2. read-ahead is implemented in a way that appears to be over-reading the directory whenever the needed information is not cached. This is technically read-ahead, although it is not a great way of doing it. A much better way to do this would be to pipeline `readdir()` by initiating asynchronous read operations in anticipation of future reads.

Both of these should affect both variants of ext4's directories, but the penalties I mentioned earlier mean that the dir_index variant would be affected more.

If you have a way to benchmark things, a simple idea to evaluate would be deleting the `cond_resched()` line. If we had data showing an improvement, I would be happy to send a small one-line patch deleting the line to Ted to get the change into mainline.

> There doesn't seem to be a real specific tip-over point, and it seems to
> depend a lot on RAM availability and harddrive speed (obviously). So if
> dentries gets cached, disk speeds becomes less of an issue.
However, on > large folders (where I typically use 10k as a value for large based on > "gut feeling" and "unquantifiable experience" and "nothing scientific at > all") I find that even with lots of RAM two consecutive ls commands > remains terribly slow. Switch off dir_index and that becomes an order of > magnitude faster. > > I don't have a great deal of experience with XFS, but on those systems > where we do it's generally on a VM, and perceivably (again, not > scientific) our experience has been that it feels slower. Again, not > scientific, just perception. > > I'm in support for the change. This will bucket to 256 folders and > should have a reasonably even split between folders. If required a > second layer could be introduced by using the 3rd and 4th digits of the > hash for a second layer. Any hash should be fine, it really doesn't > need to be cryptographically strong, it just needs to provide a good > spread and be really fast. Generally a hash table should have a prime > number of buckets to assist with hash bias, but frankly, that's over > complicating the situation here. > > I also agree with others that it used to be easy to get distfiles as and > when needed, so an alternative structure could mirror that of the > portage tree itself, in other words "cat/pkg/distfile". This perhaps > just shifts the issue: > > jkroon@plastiekpoot /usr/portage $ find . -maxdepth 1 -type d -name > "*-*" | wc -l > 167 > jkroon@plastiekpoot /usr/portage $ find *-* -maxdepth 1 -type d | wc -l > 19412 > jkroon@plastiekpoot /usr/portage $ for i in *-*; do echo $(find $i > -maxdepth 1 -type d | wc -l) $i; done | sort -g | tail -n10 > 347 net-misc > 373 media-sound > 395 media-libs > 399 dev-util > 505 dev-libs > 528 dev-java > 684 dev-haskell > 690 dev-ruby > 1601 dev-perl > 1889 dev-python > > So that's average 116 sub folders under the top layer (only two over > 1000), and then presumably less than 100 distfiles maximum per package? 
> Probably overkill but would (should) solve both the too many files per > folder as well as the easy lookup by hand issue. > > I don't have a preference on either solution though but do agree that > "easy finding of distfiles" are handy. The INDEX mechanism is fine for me. > > Kind Regards, > > Jaco > [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 833 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
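For anyone wanting numbers rather than impressions on the readdir question raised in this message, a rough userspace timing harness might look like the sketch below. It is an assumption-laden illustration, not a kernel benchmark: the function name `time_listing` and its parameters are made up here, it measures warm-cache listings only, and it exercises whatever filesystem backs the chosen directory, so `base` would need to point at an ext4 mount (with or without dir_index) to say anything about the cases discussed above.

```python
import os
import tempfile
import time


def time_listing(n_files, base=None, repeats=3):
    """Create n_files empty files in a scratch directory and time listing it.

    Returns (entries_seen, best_seconds). base selects the parent directory,
    and therefore the filesystem under test; None uses the system default
    temporary directory.
    """
    with tempfile.TemporaryDirectory(dir=base) as scratch:
        for i in range(n_files):
            path = os.path.join(scratch, f"distfile-{i:06d}.tar.gz")
            open(path, "w").close()
        timings = []
        entries_seen = 0
        for _ in range(repeats):
            start = time.perf_counter()
            with os.scandir(scratch) as entries:
                entries_seen = sum(1 for _ in entries)
            timings.append(time.perf_counter() - start)
    return entries_seen, min(timings)


if __name__ == "__main__":
    for n in (1000, 10000):
        seen, seconds = time_listing(n)
        print(f"{n} files: listed {seen} entries in {seconds:.6f}s")
```

Taking the minimum over a few repeats reduces noise from unrelated system activity. A cold-cache comparison would additionally require dropping the page and dentry caches between runs, which this sketch deliberately leaves out.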
* Re: ext4 readdir performance - was Re: [gentoo-dev] New distfile mirror layout
From: Richard Yao @ 2019-10-24 0:01 UTC
To: gentoo-dev, Jaco Kroon, gentoo-dev@lists.gentoo.org

> On Oct 23, 2019, at 7:48 PM, Richard Yao <ryao@gentoo.org> wrote:
>
> On 10/22/19 2:51 AM, Jaco Kroon wrote:
>> Hi All,
>>
>>> On 2019/10/21 18:42, Richard Yao wrote:
>>>
>>> If we consider the access frequency, it might actually not be that
>>> bad. Consider a simple example with 500 files and two directory
>>> buckets. If we have 250 in each, then the size of the directory is
>>> always 250. However, if 50 files are accessed 90% of the time, then
>>> putting 450 into one directory and that 50 into another directory, we
>>> end up with the performance of the O(n) directory lookup being
>>> consistent with there being only 90 files in each directory.
>>>
>>> I am not sure if we should be discarding all other considerations to
>>> make changes to benefit O(n) directory lookup filesystems, but if we
>>> are, then the hashing approach is not necessarily the best one. It is
>>> only the best when all files are accessed with equal frequency, which
>>> would be an incorrect assumption. A more human friendly approach might
>>> still be better. I doubt that we have the data to determine that though.
>>>
>>> Also, another idea is to use a cheap hash function (e.g. fletcher) and
>>> just have the mirrors do the hashing behind the scenes. Then we would
>>> have the best of both worlds.
>>
>> Experience:
>>
>> ext4 sucks at targeting name lookups without dir_index feature (O(n)
>> lookups - scans all entries in the folder). With dir_index readdir
>> performance is crap. Pick your poison I guess.
>> Most of our larger filesystems (2TB+, but especially the 80TB+ ones)
>> we've reverted to disabling dir_index as the benefit is outweighed by
>> the crappy readdir() and glob() performance.
> My read of the ext4 disk layout documentation is that the read operation
> should work mostly the same way, except with a penalty from reading
> larger directories caused by the addition of the tree's metadata and
> from having more partially filled blocks:
>
> https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Directory_Entries
>
> The code itself is the same traversal code:
>
> https://github.com/torvalds/linux/blob/v5.3/fs/ext4/dir.c#L106
>
> However, a couple of things stand out to me here at a glance:
>
> 1. `cond_resched()` adds scheduler delay for no apparent reason.
> `cond_resched()` is meant to be used in places where we could block
> excessively on non-PREEMPT kernels, but this doesn't strike me as one of
> those places. The fact that we block on disk on uncached reads naturally
> serves the same purpose, so an explicit rescheduling point here is
> redundant. PREEMPT kernels should perform better in readdir() on ext4 by
> virtue of making `cond_resched()` a no-op.

I just realized that the way that I worded this could be confusing, so
please allow me to clarify what I meant.

cond_resched() is meant for when a kernel thread will tie up a CPU for a
long period of time. Blocking on disk will cause the CPU to be released
to another thread. When we do not block on disk, this operation is quick.
There is no good reason for putting cond_resched() here as far as I can
tell.

> 2. read-ahead is implemented in a way that appears to be over-reading
> the directory whenever the needed information is not cached. This is
> technically read-ahead, although it is not a great way of doing it. A
> much better way to do this would be to pipeline `readdir()` by
> initiating asynchronous read operations in anticipation of future reads.
>
> Both of these should affect both variants of ext4's directories, but the
> penalties I mentioned earlier mean that the dir_index variant would be
> affected more.
>
> If you have a way to benchmark things, a simple idea to evaluate would
> be deleting the `cond_resched()` line. If we had data showing an
> improvement, I would be happy to send a small one-line patch deleting
> the line to Ted to get the change into mainline.

The more I think about this, the more absurd having cond_resched() here
seems to me. I think I will sit on it for a few days. If it still seems
absurd to me after sitting on it, I'll send Ted a patch to delete that
line with the remark that the use of cond_resched() is redundant with
blocking on disk.

>> There doesn't seem to be a real specific tip-over point, and it seems to
>> depend a lot on RAM availability and harddrive speed (obviously). So if
>> dentries gets cached, disk speeds becomes less of an issue. However, on
>> large folders (where I typically use 10k as a value for large based on
>> "gut feeling" and "unquantifiable experience" and "nothing scientific at
>> all") I find that even with lots of RAM two consecutive ls commands
>> remain terribly slow. Switch off dir_index and that becomes an order of
>> magnitude faster.
>>
>> I don't have a great deal of experience with XFS, but on those systems
>> where we do it's generally on a VM, and perceivably (again, not
>> scientific) our experience has been that it feels slower. Again, not
>> scientific, just perception.
>>
>> I'm in support of the change. This will bucket to 256 folders and
>> should have a reasonably even split between folders. If required a
>> second layer could be introduced by using the 3rd and 4th digits of the
>> hash for a second layer. Any hash should be fine, it really doesn't
>> need to be cryptographically strong, it just needs to provide a good
>> spread and be really fast.
>> Generally a hash table should have a prime number of buckets to assist
>> with hash bias, but frankly, that's overcomplicating the situation here.
>>
>> I also agree with others that it used to be easy to get distfiles as and
>> when needed, so an alternative structure could mirror that of the
>> portage tree itself, in other words "cat/pkg/distfile". This perhaps
>> just shifts the issue:
>>
>> jkroon@plastiekpoot /usr/portage $ find . -maxdepth 1 -type d -name "*-*" | wc -l
>> 167
>> jkroon@plastiekpoot /usr/portage $ find *-* -maxdepth 1 -type d | wc -l
>> 19412
>> jkroon@plastiekpoot /usr/portage $ for i in *-*; do echo $(find $i -maxdepth 1 -type d | wc -l) $i; done | sort -g | tail -n10
>> 347 net-misc
>> 373 media-sound
>> 395 media-libs
>> 399 dev-util
>> 505 dev-libs
>> 528 dev-java
>> 684 dev-haskell
>> 690 dev-ruby
>> 1601 dev-perl
>> 1889 dev-python
>>
>> So that's average 116 sub folders under the top layer (only two over
>> 1000), and then presumably less than 100 distfiles maximum per package?
>> Probably overkill but would (should) solve both the too many files per
>> folder as well as the easy lookup by hand issue.
>>
>> I don't have a preference on either solution though but do agree that
>> "easy finding of distfiles" is handy. The INDEX mechanism is fine for me.
>>
>> Kind Regards,
>>
>> Jaco
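The access-frequency example quoted in this message (500 files, two directories, 50 files receiving 90% of the accesses) can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: it assumes that the cost of a lookup on an O(n) filesystem is proportional to the number of entries in the directory being searched, and the `expected_scan_size` helper is a hypothetical name introduced for the illustration.

```python
def expected_scan_size(groups):
    """Average directory size seen per lookup.

    groups: iterable of (files_in_directory, percent_of_accesses).
    Assumption: lookup cost grows linearly with directory size.
    """
    return sum(size * percent for size, percent in groups) / 100


# Even split: 250 files per directory, accesses split evenly as well.
even = expected_scan_size([(250, 50), (250, 50)])

# Skewed split: the 50 hot files get their own directory.
skewed = expected_scan_size([(50, 90), (450, 10)])

print(even, skewed)
```

The skewed layout works out to 0.9 * 50 + 0.1 * 450 = 90, which is where the "only 90 files in each directory" figure comes from, compared with 250 for the even split.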
* Re: [gentoo-dev] New distfile mirror layout
From: Rich Freeman @ 2019-10-23 1:21 UTC
To: gentoo-dev

On Mon, Oct 21, 2019 at 12:42 PM Richard Yao <ryao@gentoo.org> wrote:
>
> Also, another idea is to use a cheap hash function (e.g. fletcher) and just have the mirrors do the hashing behind the scenes. Then we would have the best of both worlds.

I think something that is getting missed in this discussion is that we
don't control all of our mirrors, and they're generally donated
resources. Somebody has some webserver, and they stick a Debian mirror
in one directory tree, and an Arch one in another, and they're kind
enough to give us one too.

That is why we're seeing odder situations like ntfs and so on being
mentioned. They're not necessarily even running Linux, let alone zfs or
some other optimized filesystem. And their webserver might be set up to
do browsable directory indexes which could perform terribly even if the
filesystem itself is fine with direct filename lookups. It doesn't
matter if you have hashed b-trees or whatever for filename lookups if
you're going to ask the filesystem to give you a list of every file in a
large directory - it is going to have to traverse whatever data
structure it uses entirely to do so.

If we want to start putting requirements on hosting a mirror, then we'll
end up with fewer mirrors, and with mirrors more is usually better.
Ideally a mirror should just be a black box to us - we don't really care
what they're running because we don't depend on any mirror individually.

Likewise if we negatively impact mirror hosts we'll end up with fewer
mirrors. Sure, maybe those hosts have odd configurations, but we're
still better off with them than without.
That said we do seem to have a lot of mirrors so it probably isn't the
end of the world if we lose a limited number.

And there is nothing to say that we can't have some infra mirror set up
more for interactive browsing that we don't have people fetch from but
which dispenses with all the hashing or which bins by the first letter
of the filename/etc. It seems like most of the use cases where hashing
is inconvenient are for more casual use.

To avoid another reply, people are talking about having utilities that
can fetch distfiles using the new scheme. I'd think that "ebuild
foo.ebuild fetch" is probably the simplest solution for this. Chances
are that you're dealing with SRC_URI strings that have variable
substitution in them anyway, so just letting ebuild do the fetching
means you're not substituting ${PV} and so on, let alone all the stuff
versionator and its ilk do. And of course you can always just fetch from
upstream anyway if you do have a clean URI.

--
Rich
* Re: [gentoo-dev] New distfile mirror layout
From: Chí-Thanh Christopher Nguyễn @ 2019-10-28 23:24 UTC
To: gentoo-dev

Hi!

> Today you get chastised for using /space/distfiles-local and not
> following policy changes. The devmanual states that it's deprecated
> since at least 2011, and talks of using d.g.o [1].
> [1] https://devmanual.gentoo.org/general-concepts/mirrors/index.html#suitable-download-hosts

Sorry I'm late to the party, but I would like to enquire about what
happens if a file with existing filename but different b2sum gets
uploaded to /space/distfiles-local now?

Doing so and updating the Manifest used to be another (not necessarily
preferred) method to address upstream remaking release packages.

Best regards,
Chí-Thanh Christopher Nguyễn
* Re: [gentoo-dev] New distfile mirror layout
From: Michał Górny @ 2019-10-29 4:27 UTC
To: gentoo-dev

On Tue, 2019-10-29 at 00:24 +0100, Chí-Thanh Christopher Nguyễn wrote:
> Hi!
>
> > Today you get chastised for using /space/distfiles-local and not
> > following policy changes. The devmanual states that it's deprecated
> > since at least 2011, and talks of using d.g.o [1].
> > [1] https://devmanual.gentoo.org/general-concepts/mirrors/index.html#suitable-download-hosts
>
> Sorry I'm late to the party, but I would like to enquire about what happens
> if a file with existing filename but different b2sum gets uploaded to
> /space/distfiles-local now?

The same as before. It gets put in the top-level distfiles directory.
Hashes are calculated from filenames, so a different b2sum wouldn't
affect the location. Not that it matters: files from distfiles-local are
not put in subdirectories in the first place.

> Doing so and updating the Manifest used to be another (not necessarily
> preferred) method to address upstream remaking release packages.

It's no longer valid.

--
Best regards,
Michał Górny
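The point that the directory is derived from the file name, not from the file's contents, can be shown with a small Python sketch. It assumes the scheme from the announcement (first two hex digits of the BLAKE2b hash of the name, under `distfiles/`); `mirror_path` and `content_b2sum` are hypothetical helper names, and the byte strings stand in for an original and a re-rolled upstream tarball.

```python
import hashlib


def mirror_path(filename: str) -> str:
    """Path in the hashed layout; depends on the file name only.

    Assumption: BLAKE2b-512 of the name, first two hex digits as the
    subdirectory, as in the announcement's b2sum example.
    """
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    return f"distfiles/{digest[:2]}/{filename}"


def content_b2sum(data: bytes) -> str:
    """BLAKE2b checksum of the file contents (what a Manifest records)."""
    return hashlib.blake2b(data).hexdigest()


name = "foo-1.tar.gz"
original = b"original upstream tarball"           # stand-in contents
rerolled = b"silently re-rolled upstream tarball"  # stand-in contents

# The content checksums differ ...
print(content_b2sum(original) == content_b2sum(rerolled))
# ... but there is only one path, because only the name is hashed.
print(mirror_path(name))
```

So a re-rolled tarball with an unchanged name would map to the same location as the old one, which is why renaming the file is the cleaner way to handle upstream remaking a release.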
* Re: [gentoo-dev] New distfile mirror layout
From: Fabian Groffen @ 2019-10-29 9:34 UTC
To: gentoo-dev

On 29-10-2019 05:27:37 +0100, Michał Górny wrote:
> On Tue, 2019-10-29 at 00:24 +0100, Chí-Thanh Christopher Nguyễn wrote:
> > Hi!
> >
> > > Today you get chastised for using /space/distfiles-local and not
> > > following policy changes. The devmanual states that it's deprecated
> > > since at least 2011, and talks of using d.g.o [1].
> > > [1] https://devmanual.gentoo.org/general-concepts/mirrors/index.html#suitable-download-hosts
> >
> > Sorry I'm late to the party, but I would like to enquire about what happens
> > if a file with existing filename but different b2sum gets uploaded to
> > /space/distfiles-local now?
>
> The same as before. It gets put in top-level distfiles directory.
> Hashes are calculated from filenames, so this wouldn't affect it. That
> is, if it put those files in subdirectories in the first place because
> it doesn't.

/space/distfiles-local is no longer copied to the mirrors? or just not
copied in the subdir-hierarchy?

> > Doing so and updating the Manifest used to be another (not necessarily
> > preferred) method to address upstream remaking release packages.
>
> It's no longer valid.

Just wondering. Do you mean it isn't valid that some upstreams do this
(yes horror)? We surely need a way to work around that ...

Thanks,
Fabian

--
Fabian Groffen
Gentoo on a different level
* Re: [gentoo-dev] New distfile mirror layout
From: Michał Górny @ 2019-10-29 11:11 UTC
To: gentoo-dev, Fabian Groffen

On October 29, 2019 9:34:01 AM UTC, Fabian Groffen <grobian@gentoo.org> wrote:
> On 29-10-2019 05:27:37 +0100, Michał Górny wrote:
>> On Tue, 2019-10-29 at 00:24 +0100, Chí-Thanh Christopher Nguyễn wrote:
>> > Hi!
>> >
>> > > Today you get chastised for using /space/distfiles-local and not
>> > > following policy changes. The devmanual states that it's deprecated
>> > > since at least 2011, and talks of using d.g.o [1].
>> > > [1] https://devmanual.gentoo.org/general-concepts/mirrors/index.html#suitable-download-hosts
>> >
>> > Sorry I'm late to the party, but I would like to enquire about what
>> > happens if a file with existing filename but different b2sum gets
>> > uploaded to /space/distfiles-local now?
>>
>> The same as before. It gets put in top-level distfiles directory.
>> Hashes are calculated from filenames, so this wouldn't affect it. That
>> is, if it put those files in subdirectories in the first place because
>> it doesn't.
>
> /space/distfiles-local is no longer copied to the mirrors? or just not
> copied in the subdir-hierarchy?

The latter.

>> > Doing so and updating the Manifest used to be another (not
>> > necessarily preferred) method to address upstream remaking release
>> > packages.
>>
>> It's no longer valid.
>
> Just wondering. Do you mean it isn't valid that some upstreams do this
> (yes horror)? We surely need a way to work around that ...

I mean the method of reusing the same filename and expecting
distfiles-local to overwrite it. It is preferable to just rename the
file.

> Thanks,
> Fabian

--
Best regards,
Michał Górny
* Re: [gentoo-dev] New distfile mirror layout
From: Ulrich Mueller @ 2019-10-29 12:23 UTC
To: Michał Górny
Cc: gentoo-dev, Fabian Groffen

>>>>> On Tue, 29 Oct 2019, Michał Górny wrote:

> On October 29, 2019 9:34:01 AM UTC, Fabian Groffen <grobian@gentoo.org> wrote:
>> /space/distfiles-local is no longer copied to the mirrors? or just
>> not copied in the subdir-hierarchy?

> The latter.

So, what has to be done to have it appear in the proper place? Should
the file be placed in a subdir of /space/distfiles-local/? That seems to
be error prone, and certainly could be automated?

>> Just wondering. Do you mean it isn't valid that some upstreams do
>> this (yes horror)? We surely need a way to work around that ...

> I mean the method using same filename and expecting distfiles-local to
> overwrite it. It is preferable to just rename it.

Looks like this will break backwards compatibility. IIUC, backwards
compatibility is also broken on the receiving side, that is,
mirror://gentoo/ in SRC_URI will no longer work as expected?

Shouldn't GLEP 75 have mentioned this? It's certainly something that
needs to be discussed before the GLEP is implemented.

Ulrich
* Re: [gentoo-dev] New distfile mirror layout
From: Michał Górny @ 2019-10-29 12:43 UTC
To: gentoo-dev
Cc: Fabian Groffen

On Tue, 2019-10-29 at 13:23 +0100, Ulrich Mueller wrote:
> >>>>> On Tue, 29 Oct 2019, Michał Górny wrote:
> > On October 29, 2019 9:34:01 AM UTC, Fabian Groffen <grobian@gentoo.org> wrote:
> > > /space/distfiles-local is no longer copied to the mirrors? or just
> > > not copied in the subdir-hierarchy?
> > The latter.
>
> So, what has to be done to have it appear in the proper place? Should
> the file be placed in a subdir of /space/distfiles-local/? That seems to
> be error prone, and certainly could be automated?

The file should be placed in SRC_URI, and emirrordist will take care of
fetching it.

> > > Just wondering. Do you mean it isn't valid that some upstreams do
> > > this (yes horror)? We surely need a way to work around that ...
> > I mean the method using same filename and expecting distfiles-local to
> > overwrite it. It is preferable to just rename it.
>
> Looks like this will break backwards compatibility. IIUC, backwards
> compatibility is also broken on the receiving side, that is,
> mirror://gentoo/ in SRC_URI will no longer work as expected?

Yes, this was noted in the top mail.

> Shouldn't GLEP 75 have mentioned this? It's certainly something that
> needs to be discussed before the GLEP is implemented.

GLEP only covers how regular distfile fetching works. Third-party
mirrors are out of scope, and all the people working on it and reviewing
it have missed the problem. That said, this can't be fixed within bounds
defined by PMS.

Given that mirror://gentoo has been discouraged since at least 2011, I
don't see a big deal here. One day it'll stop working; we should stop
using it before then.

--
Best regards,
Michał Górny
* Re: [gentoo-dev] New distfile mirror layout
From: Ulrich Mueller @ 2019-10-29 13:03 UTC
To: Michał Górny
Cc: gentoo-dev, Fabian Groffen

>>>>> On Tue, 29 Oct 2019, Michał Górny wrote:

> On Tue, 2019-10-29 at 13:23 +0100, Ulrich Mueller wrote:
>> So, what has to be done to have it appear in the proper place?
>> Should the file be placed in a subdir of /space/distfiles-local/?
>> That seems to be error prone, and certainly could be automated?

> The file should be placed in SRC_URI, and emirrordist will take care
> of fetching it.

What if the file is hosted at a non-standard tcp port upstream (like
http://example.org:8080/)? The devmanual says that it _must_ be manually
uploaded to /space/distfiles-local/ in such cases.

Ulrich
* Re: [gentoo-dev] New distfile mirror layout
From: Ulrich Mueller @ 2019-10-29 13:09 UTC
To: Michał Górny
Cc: gentoo-dev, Fabian Groffen

>>>>> On Tue, 29 Oct 2019, Ulrich Mueller wrote:

>>>>> On Tue, 29 Oct 2019, Michał Górny wrote:
>> The file should be placed in SRC_URI, and emirrordist will take care
>> of fetching it.

> What if the file is hosted at a non-standard tcp port upstream (like
> http://example.org:8080/)? The devmanual says that it _must_ be manually
> uploaded to /space/distfiles-local/ in such cases.

Or another example, app-emacs/vhdl-mode-3.38.1, where (incompetent,
or nasty?) upstream blocks wget for some reason, but other methods
(e.g., curl, firefox) work? How would I get the file onto the mirrors
there?

Ulrich
* Re: [gentoo-dev] New distfile mirror layout
From: Michał Górny @ 2019-10-29 13:52 UTC
To: gentoo-dev
Cc: Fabian Groffen

On Tue, 2019-10-29 at 14:09 +0100, Ulrich Mueller wrote:
> >>>>> On Tue, 29 Oct 2019, Ulrich Mueller wrote:
> >>>>> On Tue, 29 Oct 2019, Michał Górny wrote:
> > > The file should be placed in SRC_URI, and emirrordist will take care
> > > of fetching it.
> > What if the file is hosted at a non-standard tcp port upstream (like
> > http://example.org:8080/)? The devmanual says that it _must_ be manually
> > uploaded to /space/distfiles-local/ in such cases.
>
> Or another example, app-emacs/vhdl-mode-3.38.1, where (incompetent,
> or nasty?) upstream blocks wget for some reason, but other methods
> (e.g., curl, firefox) work? How would I get the file onto the mirrors
> there?

If I were you, I would've explicitly mirrored the file anyway.
If upstream blocks wget, then users who do not use GENTOO_MIRRORS will
also suffer due to it.

--
Best regards,
Michał Górny
* Re: [gentoo-dev] New distfile mirror layout
From: Ulrich Mueller @ 2019-10-29 14:17 UTC
To: Michał Górny
Cc: gentoo-dev, Fabian Groffen

>>>>> On Tue, 29 Oct 2019, Michał Górny wrote:

> On Tue, 2019-10-29 at 14:09 +0100, Ulrich Mueller wrote:
>> > What if the file is hosted at a non-standard tcp port upstream
>> > (like http://example.org:8080/)? The devmanual says that it _must_
>> > be manually uploaded to /space/distfiles-local/ in such cases.

>> Or another example, app-emacs/vhdl-mode-3.38.1, where (incompetent,
>> or nasty?) upstream blocks wget for some reason, but other methods
>> (e.g., curl, firefox) work? How would I get the file onto the mirrors
>> there?

> If I were you, I would've explicitly mirrored the file anyway.
> If upstream blocks wget, then users who do not use GENTOO_MIRRORS will
> also suffer due to it.

All I'm saying is that there can be unusual circumstances where manual
uploading of a file is useful. So please don't take that possibility
away.

Ulrich
* Re: [gentoo-dev] New distfile mirror layout
From: Fabian Groffen @ 2019-10-29 14:33 UTC
To: gentoo-dev

On 29-10-2019 15:17:38 +0100, Ulrich Mueller wrote:
> >>>>> On Tue, 29 Oct 2019, Michał Górny wrote:
>
> > On Tue, 2019-10-29 at 14:09 +0100, Ulrich Mueller wrote:
> >> > What if the file is hosted at a non-standard tcp port upstream
> >> > (like http://example.org:8080/)? The devmanual says that it _must_
> >> > be manually uploaded to /space/distfiles-local/ in such cases.
>
> >> Or another example, app-emacs/vhdl-mode-3.38.1, where (incompetent,
> >> or nasty?) upstream blocks wget for some reason, but other methods
> >> (e.g., curl, firefox) work? How would I get the file onto the mirrors
> >> there?
>
> > If I were you, I would've explicitly mirrored the file anyway.
> > If upstream blocks wget, then users who do not use GENTOO_MIRRORS will
> > also suffer due to it.
>
> All I'm saying is that there can be unusual circumstances where
> manual uploading of a file is useful. So please don't take that
> possibility away.

In addition, there are currently files there that aren't referenced from
ebuilds. Prefix uses these files during bootstrap; local mirrors are
often much faster than dev.g.o.

If the files don't get mirrored anymore, I guess I can create a dummy
ebuild that has the files in SRC_URI.

If the files get mirrored, but put in a subdir based on the filename
hash, the original query endpoint on distfiles.g.o changes, much like
the SRC_URI approach.

Now I can use distfiles.prefix.b.n which redirects to the distfiles.g.o
URL with subdir for the most part I think, but it's sub-optimal from my
point of view. Calculating the hash is not always feasible due to the
lack of b2sum or other means.
Hence my earlier request to have such an official translation service on
Gentoo hardware.

(I just wrote a small wsgi script that calculates the hash and generates
the redirect from Python, served via uwsgi/nginx, but there should be
many ways to achieve the same goals, if and only if a blake2b
implementation were available for it.)

Thanks,
Fabian

--
Fabian Groffen
Gentoo on a different level
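Fabian's actual script is not part of the thread, so the following is only a guess at what such a translation service could look like: a minimal WSGI application that hashes the requested file name and redirects to the hashed path. The base URL is the one from the announcement; `application`, `BASE_URL` and the choice of a 302 response are assumptions made for the sketch. Python's standard `hashlib` has provided `blake2b` since version 3.6, so no external `b2sum` is needed.

```python
import hashlib
from urllib.parse import quote
from wsgiref.util import setup_testing_defaults

# Base URL taken from the announcement; adjust for a real deployment.
BASE_URL = "http://distfiles.gentoo.org/distfiles"


def application(environ, start_response):
    """Redirect /<filename> to <BASE_URL>/<xx>/<filename>.

    Assumption: <xx> is the first two hex digits of the BLAKE2b-512
    hash of the file name, as in the announcement's b2sum example.
    """
    filename = environ.get("PATH_INFO", "").lstrip("/")
    if not filename or "/" in filename:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"not found\n"]
    prefix = hashlib.blake2b(filename.encode("utf-8")).hexdigest()[:2]
    location = f"{BASE_URL}/{prefix}/{quote(filename)}"
    start_response("302 Found", [("Location", location)])
    return [b""]


# Exercise the application once without starting a server.
environ = {"PATH_INFO": "/foo-1.tar.gz"}
setup_testing_defaults(environ)
captured = {}


def fake_start_response(status, headers):
    captured["status"] = status
    captured["headers"] = dict(headers)


application(environ, fake_start_response)
print(captured["status"], captured["headers"]["Location"])
```

The same callable can be served by uwsgi or any other WSGI server. It deliberately does nothing about mirror selection, so something like the geodns concern raised elsewhere in the thread would have to be handled separately.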
* Re: [gentoo-dev] New distfile mirror layout
From: Michał Górny @ 2019-10-29 14:45 UTC
To: gentoo-dev

On Tue, 2019-10-29 at 15:33 +0100, Fabian Groffen wrote:
> On 29-10-2019 15:17:38 +0100, Ulrich Mueller wrote:
> > >>>>> On Tue, 29 Oct 2019, Michał Górny wrote:
> > > On Tue, 2019-10-29 at 14:09 +0100, Ulrich Mueller wrote:
> > > > > What if the file is hosted at a non-standard tcp port upstream
> > > > > (like http://example.org:8080/)? The devmanual says that it _must_
> > > > > be manually uploaded to /space/distfiles-local/ in such cases.
> > > > Or another example, app-emacs/vhdl-mode-3.38.1, where (incompetent,
> > > > or nasty?) upstream blocks wget for some reason, but other methods
> > > > (e.g., curl, firefox) work? How would I get the file onto the mirrors
> > > > there?
> > > If I were you, I would've explicitly mirrored the file anyway.
> > > If upstream blocks wget, then users who do not use GENTOO_MIRRORS will
> > > also suffer due to it.
> >
> > All I'm saying is that there can be unusual circumstances where
> > manual uploading of a file is useful. So please don't take that
> > possibility away.
>
> In addition, there are currently files there that aren't referenced from
> ebuilds. Prefix uses these files during bootstrap, local mirrors are
> often much faster than dev.g.o.
>
> If the files don't get mirrored anymore, I guess I can create a dummy
> ebuild that has the files in SRC_URI.

Ok, this is something I wasn't aware of. I agree that a dummy ebuild
should not be necessary here. However, I'm also not sure that
distfiles-local is really the proper way either, especially as I don't
see such files on woodpecker right now.

I don't think the matter is urgent right now, so let's ponder on it
a bit.
In particular, I think we should have a clear indication of who added
which files, when, what for and where they came from. Those are
precisely the things that the current distfiles-local approach misses.

> If the files get mirrored, but put in a subdir based on the filename
> hash, the original query endpoint on distfiles.g.o changes, much like
> the SRC_URI approach.
>
> Now I can use distfiles.prefix.b.n which redirects to the distfiles.g.o
> URL with subdir for most part I think, but it's sub-optimal from my
> point of view. Calculating the hash is not always feasible due to the
> lack of b2sum or other means. Hence my earlier request to have such
> official translation service on Gentoo hardware.
>
> (I just wrote a small wsgi script that calculates the hash and generates
> the redirect from Python, served via uwsgi/nginx, but there should be
> many ways to achieve the same goals, if and only if a blake2b
> implementation were available for it.)

This is also something that needs thinking. I personally don't mind
having one but it would be nice if it was able to account for geodns
and such.

--
Best regards,
Michał Górny
* Re: [gentoo-dev] New distfile mirror layout
From: Fabian Groffen @ 2019-10-29 14:56 UTC
To: gentoo-dev

On 29-10-2019 15:45:34 +0100, Michał Górny wrote:
> On Tue, 2019-10-29 at 15:33 +0100, Fabian Groffen wrote:
> > In addition, there are currently files there that aren't referenced from
> > ebuilds. Prefix uses these files during bootstrap, local mirrors are
> > often much faster than dev.g.o.
> >
> > If the files don't get mirrored anymore, I guess I can create a dummy
> > ebuild that has the files in SRC_URI.
>
> Ok, this is something I wasn't aware of. I agree that dummy ebuild
> should not be necessary here. However, I'm also not sure if
> distfiles-local is really the proper way either, especially that I
> don't see such files on woodpecker right now.

There should be /space/distfiles-local and
/space/distfiles-whitelist/prefix with a list of files to retain on the
mirror.

Thanks,
Fabian

> I don't think the matter is urgent right now, so let's ponder on it
> a bit. In particular, I think we should have a clear indication of who
> added which files, when, what for and where they came from. Those are
> precisely the things that the current distfiles-local approach misses.
>
> > If the files get mirrored, but put in a subdir based on the filename
> > hash, the original query endpoint on distfiles.g.o changes, much like
> > the SRC_URI approach.
> >
> > Now I can use distfiles.prefix.b.n which redirects to the distfiles.g.o
> > URL with subdir for most part I think, but it's sub-optimal from my
> > point of view. Calculating the hash is not always feasible due to the
> > lack of b2sum or other means. Hence my earlier request to have such
> > official translation service on Gentoo hardware.
> > > > (I just wrote a small wsgi script that calculates the hash and generates > > the redirect from Python, served via uwsgi/nginx, but there should be > > many ways to achieve the same goals, if and only if a blake2b > > implementation were available for it.) > > This is also something that needs thinking. I personally don't mind > having one but it would be nice if it was able to account for geodns > and such. > > -- > Best regards, > Michał Górny > -- Fabian Groffen Gentoo on a different level [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 56+ messages in thread
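For readers wondering what such a translation service might look like, here is a minimal sketch. It is not the script mentioned in the message above. It assumes the layout from the announcement at the top of the thread: the subdirectory is the first two hex characters of the BLAKE2b hash of the file name, matching the default output of b2sum, under http://distfiles.gentoo.org/distfiles. The names hash_prefix, distfile_url and application are made up for illustration, and the sketch does nothing about geodns or mirror selection.

```python
import hashlib
from urllib.parse import quote

# Base URL taken from the announcement; a real service might redirect to a
# nearby mirror instead (assumption: a single fixed target is good enough here).
BASE_URL = "http://distfiles.gentoo.org/distfiles"


def hash_prefix(filename: str) -> str:
    """Return the two-character subdirectory for a distfile name.

    hashlib.blake2b defaults to a 64-byte digest, which corresponds to the
    default output of b2sum. The hash is taken over the file name itself,
    not over the file contents.
    """
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    return digest[:2]


def distfile_url(filename: str, base_url: str = BASE_URL) -> str:
    """Build the new-layout URL for a distfile name."""
    return f"{base_url}/{hash_prefix(filename)}/{quote(filename)}"


def application(environ, start_response):
    """Hypothetical WSGI entry point: redirect /<name> to the hashed path."""
    # WSGI hands over PATH_INFO as bytes decoded with latin-1; recover UTF-8.
    path = environ.get("PATH_INFO", "").encode("latin-1").decode("utf-8", "replace")
    filename = path.rsplit("/", 1)[-1]
    if not filename:
        start_response("404 Not Found", [("Content-Type", "text/plain")])
        return [b"no distfile name given\n"]
    start_response("302 Found", [("Location", distfile_url(filename))])
    return [b""]
```

Any WSGI server should be able to serve application, including the uwsgi/nginx combination mentioned above. As noted in the message, this only works where a blake2b implementation is available, which for Python's hashlib means version 3.6 or newer.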
* Re: [gentoo-dev] New distfile mirror layout
From: Michał Górny @ 2019-10-29 13:51 UTC
To: gentoo-dev; +Cc: Fabian Groffen

On Tue, 2019-10-29 at 14:03 +0100, Ulrich Mueller wrote:
> >>>>> On Tue, 29 Oct 2019, Michał Górny wrote:
> > On Tue, 2019-10-29 at 13:23 +0100, Ulrich Mueller wrote:
> > > So, what has to be done to have it appear in the proper place?
> > > Should the file be placed in a subdir of /space/distfiles-local/?
> > > That seems to be error prone, and certainly could be automated?
> > The file should be placed in SRC_URI, and emirrordist will take care
> > of fetching it.
>
> What if the file is hosted at a non-standard tcp port upstream (like
> http://example.org:8080/)? The devmanual says that it _must_ be manually
> uploaded to /space/distfiles-local/ in such cases.

I can't really see why this wouldn't work. I just did an experiment
using app-benchmarks/forkbomb, and emirrordist fetched it just fine.

--
Best regards,
Michał Górny