* [gentoo-portage-dev] Package compression header for binhosts
@ 2010-06-01 3:32 Zac Medico
2010-06-01 5:16 ` Brian Harring
0 siblings, 1 reply; 8+ messages in thread
From: Zac Medico @ 2010-06-01 3:32 UTC
To: gentoo-portage-dev
Hi,
In order to support alternative compression types for binhost
packages, I was thinking about adding support for a header field in
the Packages index file. For example, a header line like
"PACKAGE_EXTENSION: txz" could be used to indicate that clients
should download files with txz extensions instead of tbz2
extensions. I'm planning to add support for both tgz [1] and txz
extensions.
[1] http://bugs.gentoo.org/show_bug.cgi?id=142579
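To make the proposal concrete, here is a minimal Python sketch of how a client might read such a header. The sample index text, the helper names `parse_header` and `package_filename`, and the fallback to tbz2 when the field is absent are assumptions made for illustration; this is not portage code.

```python
def parse_header(index_text):
    """Return the header fields of a Packages-style index as a dict."""
    header = {}
    for line in index_text.splitlines():
        if not line.strip():
            break  # a blank line ends the header block
        key, sep, value = line.partition(":")
        if sep:
            header[key.strip()] = value.strip()
    return header


def package_filename(header, pf):
    """Build the file name a client would request for package `pf`."""
    # Assumption: tbz2 remains the default when no extension is declared.
    ext = header.get("PACKAGE_EXTENSION", "tbz2")
    return "%s.%s" % (pf, ext)


# Made-up sample index: a header block, a blank line, then one entry.
SAMPLE_INDEX = """VERSION: 1
PACKAGE_EXTENSION: txz

CPV: app-misc/foo-1.0
"""

if __name__ == "__main__":
    print(package_filename(parse_header(SAMPLE_INDEX), "foo-1.0"))
```

With the sample above the client would ask for `foo-1.0.txz`; an index without the field would still resolve to `foo-1.0.tbz2`.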
--
Thanks,
Zac
* Re: [gentoo-portage-dev] Package compression header for binhosts
2010-06-01 3:32 [gentoo-portage-dev] Package compression header for binhosts Zac Medico
@ 2010-06-01 5:16 ` Brian Harring
2010-06-01 20:01 ` Ned Ludd
0 siblings, 1 reply; 8+ messages in thread
From: Brian Harring @ 2010-06-01 5:16 UTC
To: gentoo-portage-dev
On Mon, May 31, 2010 at 08:32:34PM -0700, Zac Medico wrote:
> Hi,
>
> In order to support alternative compression types for binhost
> packages, I was thinking about adding support for a header field in
> the Packages index file. For example, a header line like
> "PACKAGE_EXTENSION: txz" could be used to indicate that clients
> should download files with txz extensions instead of tbz2
> extensions. I'm planning to add support for both tgz [1] and txz
> extensions.
>
> [1] http://bugs.gentoo.org/show_bug.cgi?id=142579
1) requires a version header bump
2) a header alone isn't useful unless it's specifiable per cpv entry;
thus it must be inheritable
3) PACKAGE_EXTENSION is overly verbose, and it's unclear that it's
specifying the compressor too; its intention is for compression, so
state it as such (I mention this in light of URI's existence, where
PACKAGE_EXTENSION would only be a hint of the compressor)
Re: #1, there is a decent set of optimizations I'm kicking around in
pkgcore for the next version- a discussion should probably be started
there.
Offhand, having a compression specific header (a simple enumeration
of known compressors) and a DEFAULT_URI that is python string
interpolation assembled (for example,
DEFAULT_URI="%(host)s/%(category)s/%(pf)s.txz") seems wiser. Via
doing what I'm suggesting, it would be possible to do binpkg
repository 'views' w/out having to map each binpkg into the url space
for it.
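For reference, the interpolation described above is ordinary Python `%`-formatting against a mapping. The following is a small sketch; the host value and the `expand_uri` helper are hypothetical, and only the DEFAULT_URI pattern itself comes from the message.

```python
# Pattern as given in the message; the client supplies the field values.
DEFAULT_URI = "%(host)s/%(category)s/%(pf)s.txz"


def expand_uri(template, **fields):
    """Fill a %(name)s style template from keyword arguments."""
    return template % fields


if __name__ == "__main__":
    print(expand_uri(
        DEFAULT_URI,
        host="http://binhost.example",  # hypothetical host
        category="app-misc",
        pf="foo-1.0",
    ))
```

Because the pattern lives in the index rather than in the client, changing where packages are stored only requires publishing a new pattern.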
~harring
* Re: [gentoo-portage-dev] Package compression header for binhosts
2010-06-01 5:16 ` Brian Harring
@ 2010-06-01 20:01 ` Ned Ludd
2010-06-01 21:22 ` Brian Harring
0 siblings, 1 reply; 8+ messages in thread
From: Ned Ludd @ 2010-06-01 20:01 UTC
To: gentoo-portage-dev
On Mon, 2010-05-31 at 22:16 -0700, Brian Harring wrote:
> On Mon, May 31, 2010 at 08:32:34PM -0700, Zac Medico wrote:
> > Hi,
> >
> > In order to support alternative compression types for binhost
> > packages, I was thinking about adding support for a header field in
> > the Packages index file. For example, a header line like
> > "PACKAGE_EXTENSION: txz" could be used to indicate that clients
> > should download files with txz extensions instead of tbz2
> > extensions. I'm planning to add support for both tgz [1] and txz
> > extensions.
> >
> > [1] http://bugs.gentoo.org/show_bug.cgi?id=142579
>
> 1) requires a version header bump
Agreed. But there were some other pending changes for "VERSION: 1"
Any planned changes to the format should be documented on
https://bugs.gentoo.org/show_bug.cgi?id=263994
> 2) a header alone isn't useful unless it's specifiable per cpv entry;
> thus it must be inheritable
Per-CPV entries are going to bloat the format and make me carry around
more data on a per-pkg basis than I'd want to. How about we run with
Zac's idea but use tools to convert a full repo over to $EXTENSION?
This should keep the portage code fast as well, since it checks for invalid
binpkgs all the time. Having portage process a ton of ever-growing
extensions is just going to be slow.
> 3) PACKAGE_EXTENSION is overly verbose and unclear it's specifying
> the compressor too; it's intention is for compression, state it as
> such (I mention this in light of URI's existance where
> PACKAGE_EXTENSION would only be a hint of compressor)
>
> Re: #1, there is a decent set of optimizations I'm kicking around in
> pkgcore for the next version- a discussion should probably be started
> there.
>
> Offhand, having a compression specific header (a simple enumeration
> of known compressors) and a DEFAULT_URI that is python string
No go bro. The 'Packages' format should be independent of python.
> interpolation assembled (for example,
> DEFAULT_URI="%(host)s/%(category)s/%(pf)s.txz") seems wiser. Via
> doing what I'm suggesting, it would be possible to do binpkg
> repository 'views' w/out having to map each binpkg into the url space
> for it.
>
> ~harring
--
Ned Ludd <solar@gentoo.org>
Gentoo Linux
* Re: [gentoo-portage-dev] Package compression header for binhosts
2010-06-01 20:01 ` Ned Ludd
@ 2010-06-01 21:22 ` Brian Harring
2010-06-01 21:37 ` Zac Medico
0 siblings, 1 reply; 8+ messages in thread
From: Brian Harring @ 2010-06-01 21:22 UTC
To: gentoo-portage-dev
On Tue, Jun 1, 2010 at 1:01 PM, Ned Ludd <solar@gentoo.org> wrote:
> On Mon, 2010-05-31 at 22:16 -0700, Brian Harring wrote:
> > On Mon, May 31, 2010 at 08:32:34PM -0700, Zac Medico wrote:
> > > Hi,
> > >
> > > In order to support alternative compression types for binhost
> > > packages, I was thinking about adding support for a header field in
> > > the Packages index file. For example, a header line like
> > > "PACKAGE_EXTENSION: txz" could be used to indicate that clients
> > > should download files with txz extensions instead of tbz2
> > > extensions. I'm planning to add support for both tgz [1] and txz
> > > extensions.
> > >
> > > [1] http://bugs.gentoo.org/show_bug.cgi?id=142579
> >
> > 1) requires a version header bump
>
> Agreed. But there were some other pending changes for "VERSION: 1"
>
> Any planned changes to the format should be documented on
> https://bugs.gentoo.org/show_bug.cgi?id=263994
>
>
> > 2) a header alone isn't useful unless it's specifiable per cpv entry;
> > thus it must be inheritable
>
> Per CPV entries is going to bloat the format and make me carry around a
> more data on a per pkg basis then I'd want to. How about we run with
> zac's idea but use tools to convert a full repo over to $EXTENTION
> This should keep the portage code fast as well as it checks for invalid
> binpkgs all the time. Having to have portage process a ton of ever
> growing extentions is just going to be slow.
>
Note I said 'inheritable'; one of the main flaws w/ version 0 is that it
requires quite a few entries per CPV, instead of setting a default in the
preamble and then overriding as needed at the CPV level.
What I'm suggesting is a COMPRESSOR in the preamble, and individual cpv's
override it if they're not that compressor.
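A rough Python sketch of that inheritance rule follows: look in the cpv entry first, then fall back to the preamble. The index text, the compressor names, and the helper names `parse_index` and `effective` are assumptions for illustration, not an agreed format.

```python
def _fields(block):
    """Parse 'KEY: value' lines of one block into a dict."""
    out = {}
    for line in block.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            out[key.strip()] = value.strip()
    return out


def parse_index(text):
    """Split an index into (preamble, entries) on blank lines."""
    blocks = [b for b in text.strip().split("\n\n") if b.strip()]
    return _fields(blocks[0]), [_fields(b) for b in blocks[1:]]


def effective(preamble, entry, key):
    """Entry value if present, otherwise the inherited preamble default."""
    return entry.get(key, preamble.get(key))


# Made-up index: the default is set once, only the exception overrides it.
SAMPLE = """COMPRESSOR: xz

CPV: app-misc/foo-1.0

CPV: app-misc/bar-2.0
COMPRESSOR: bzip2
"""

if __name__ == "__main__":
    preamble, entries = parse_index(SAMPLE)
    for entry in entries:
        print(entry["CPV"], effective(preamble, entry, "COMPRESSOR"))
```

Only packages that differ from the default carry an extra line, so the per-entry cost stays near zero for a uniformly compressed repository.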
As for Zac's tool to try and generate new views of a repository via
hardlinking/recreating the tree... frankly it's a bit of a hack. Via
DEFAULT_URI and relying on the hash, you can make a stable repository that
is able to be updated in place without corrupting ongoing downloads- simply
put, new additions to the repo don't perturb current DL's since the md5 is
the same (hash collision chance is low enough that I don't care about it
here).
> > 3) PACKAGE_EXTENSION is overly verbose and unclear it's specifying
> > the compressor too; it's intention is for compression, state it as
> > such (I mention this in light of URI's existance where
> > PACKAGE_EXTENSION would only be a hint of compressor)
> >
> > Re: #1, there is a decent set of optimizations I'm kicking around in
> > pkgcore for the next version- a discussion should probably be started
> > there.
> >
> > Offhand, having a compression specific header (a simple enumeration
> > of known compressors) and a DEFAULT_URI that is python string
>
> No go bro. The 'Packages' format should be independent of python.
>
> > interpolation assembled (for example,
> > DEFAULT_URI="%(host)s/%(category)s/%(pf)s.txz") seems wiser. Via
> > doing what I'm suggesting, it would be possible to do binpkg
> > repository 'views' w/out having to map each binpkg into the url space
> > for it.
>
Then come up w/ an alternative w/ the same power as DEFAULT_URI that isn't
python specific. Think through the potential of it: I could very easily
centralize the binpkgs for an arch, use the hash as their lookup value,
then use the Packages cache as a 'view' into that binpkg repository.
Differing use flag combinations, differing license views, hell, differing
ACCEPT_KEYWORDS, all of that can have the raw pkgs stored centrally while
just providing differing views into it- DEFAULT_URI lays the groundwork for
it.
* Re: [gentoo-portage-dev] Package compression header for binhosts
2010-06-01 21:22 ` Brian Harring
@ 2010-06-01 21:37 ` Zac Medico
2010-06-01 21:52 ` Brian Harring
0 siblings, 1 reply; 8+ messages in thread
From: Zac Medico @ 2010-06-01 21:37 UTC
To: gentoo-portage-dev
On 06/01/2010 02:22 PM, Brian Harring wrote:
> As for zacs tool to try and generate new views of a repository via
> hardlinking/recreating the tree... frankly it's a bit of a hack. Via
> DEFAULT_URI and relying on the hash, you can make a stable repository that
> is able to be updated in place without corrupting ongoing downloads- simply
> put, new additions to the repo don't perturb current DL's since the md5 is
> the same (hash collision chance is low enough that I don't care about it
> here).
When you say "hash collision" are you talking about
http://crosbug.com/3225? Maybe that behavior is acceptable for
small-scale private use, but for large scale public repositories I'd
say it's totally unacceptable. Eventually, I'd like to see gentoo
officially distributing binary packages, so that we'll be able to
get a slice of the binary distribution pie. When that happens, we're
certainly not going to want to have race conditions like these in
our public binhosts.
--
Thanks,
Zac
* Re: [gentoo-portage-dev] Package compression header for binhosts
2010-06-01 21:37 ` Zac Medico
@ 2010-06-01 21:52 ` Brian Harring
2010-06-01 23:53 ` Zac Medico
0 siblings, 1 reply; 8+ messages in thread
From: Brian Harring @ 2010-06-01 21:52 UTC
To: gentoo-portage-dev
On Tue, Jun 1, 2010 at 2:37 PM, Zac Medico <zmedico@gentoo.org> wrote:
> On 06/01/2010 02:22 PM, Brian Harring wrote:
> > As for zacs tool to try and generate new views of a repository via
> > hardlinking/recreating the tree... frankly it's a bit of a hack. Via
> > DEFAULT_URI and relying on the hash, you can make a stable repository
> that
> > is able to be updated in place without corrupting ongoing downloads-
> simply
> > put, new additions to the repo don't perturb current DL's since the md5
> is
> > the same (hash collision chance is low enough that I don't care about it
> > here).
>
> When you say "hash collision" are you talking about
> http://crosbug.com/3225? Maybe that behavior is acceptable for
> small-scale private use, but for large scale public repositories I'd
> say it's totally unacceptable.
That bug isn't about a collision, it's about files being replaced underneath
Packages' feet. Even with the tricks you've leveled, the issue of things
changing underfoot is still possible- you've just made the race less
likely.
What I was talking about was solving this issue once and for all via
restructuring, and specifically referring to the potential of an md5
collision in the URI space- specifically what I'm implementing for pkgcore
is the ability to do stupid stuff like this-
http://host/binpkg-store/$MD5.{txz,tbz2,tgz}
then have multiple views accessible just via pointing the binpkg repo remote
url at
http://host/views/license/oss-approved/
http://host/views/keywords/amd64/stable/
http://host/views/raw/ # no filtering on the view of the binpkg repo, see
everything.
Via restructuring where the binpkgs are stored and doing this approach,
multiple views can be had easily into the repo. An additional benefit of
this approach is that via making URI able to point outside the host, you
could combine multiple separate repositories into one just via a view.
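The store-plus-views layout can be sketched in a few lines of Python. Everything concrete here is assumed for illustration: the payloads are fake, the view names are made up, and an in-memory dict stands in for the binpkg-store directory.

```python
import hashlib


def store_name(data, ext):
    """Name a binpkg by the MD5 of its complete contents."""
    return "%s.%s" % (hashlib.md5(data).hexdigest(), ext)


# Two builds of the same cpv with different USE settings (fake payloads).
build_a = b"app-misc/foo-1.0 built with USE=ssl"
build_b = b"app-misc/foo-1.0 built with USE=-ssl"

# One central store, keyed by content hash rather than by cpv.
store = {}
for data in (build_a, build_b, build_a):  # adding build_a twice is harmless
    store[store_name(data, "txz")] = data

# Each view is just an index mapping cpv -> name in the shared store.
# The view names below are hypothetical.
views = {
    "views/use/ssl": {"app-misc/foo-1.0": store_name(build_a, "txz")},
    "views/use/nossl": {"app-misc/foo-1.0": store_name(build_b, "txz")},
}

if __name__ == "__main__":
    for view, index in sorted(views.items()):
        for cpv, name in index.items():
            print(view, cpv, "->", "binpkg-store/" + name)
```

Both views list the same cpv yet resolve to different files, and neither view needs its own copy of the package tree.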
> Eventually, I'd like to see gentoo
> officially distributing binary packages, so that we'll be able to
> get a slice of the binary distribution pie. When that happens, we're
> certainly not going to want to have race conditions like these in
> our public binhosts.
>
I'd suggest abandoning the current repository layout of Packages then, since
it's irrevocably flawed. You can hack around it via jamming timestamp/md5
info into URI, but that's not a sane solution.
~harring
* Re: [gentoo-portage-dev] Package compression header for binhosts
2010-06-01 21:52 ` Brian Harring
@ 2010-06-01 23:53 ` Zac Medico
2010-06-02 3:42 ` Brian Harring
0 siblings, 1 reply; 8+ messages in thread
From: Zac Medico @ 2010-06-01 23:53 UTC
To: gentoo-portage-dev
On 06/01/2010 02:52 PM, Brian Harring wrote:
> That bug isn't about a collision, it's about files being replaced underneath
> Packages feet. Even with the tricks you've leveled the issue of things
> changing under foot still is possible- you've just made the race less
> likely.
AFAIK the race is completely eliminated by the RCU-like snapshot
mechanism. I think I like your hash-in-the-filename idea better
though, since it seems simpler to implement and maintain. It can
even be done with the existing version 0 format by abusing the
per-package PATH attribute to refer to a filename that contains a
hash (maybe something like $CATEGORY/$PF.$MD5.tbz2). I wouldn't want
to abuse PATH for compression in version 0 though, since clients are
only required to support tbz2.
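As a sketch of that PATH idea in Python, the fields for one entry could be generated like this. The `index_stanza` helper and the payload are hypothetical; the only parts taken from the discussion are the per-package PATH and MD5 fields and the `$CATEGORY/$PF.$MD5.tbz2` shape.

```python
import hashlib


def index_stanza(category, pf, data):
    """Build version 0 style fields with the hash embedded in PATH."""
    md5 = hashlib.md5(data).hexdigest()
    return {
        "CPV": "%s/%s" % (category, pf),
        "MD5": md5,
        # A rebuilt package gets a new file name instead of replacing the old one.
        "PATH": "%s/%s.%s.tbz2" % (category, pf, md5),
    }


if __name__ == "__main__":
    stanza = index_stanza("app-misc", "foo-1.0", b"fake tbz2 payload")
    for key in ("CPV", "MD5", "PATH"):
        print("%s: %s" % (key, stanza[key]))
```

The extension stays tbz2, so version 0 clients see nothing beyond an unusual file name.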
> What I was talking about was solving this issue once and for all via
> restructuring, and specifically refering to the potential of an md5
> collision in the URI space- specifically what I'm implementing for pkgcore
> is the ability to do stupid stuff like this-
>
> http://host/binpkg-store/$MD5.{txz,tbz2,tgz}
That would be the MD5 of the entire file, after compression and
having the xpak segment appended, right?
> then have multiple views accessible just via pointing the binpkg repo remote
> url at
>
> http://host/views/license/oss-approved/
> http://host/views/keywords/amd64/stable/
> http://host/views/raw/ # no filtering on the view of the binpkg repo, see
> everything.
So, the default path of a package would come from looking at the MD5
in the Packages file and then mapping that to a path?
> Via restructuring where the binpkgs are stored and doing this approach,
> multiple views can be had easily into the repo. An additional benefit of
> this approach is that via making URI able to point outside the host, you
> could combine multiple seperate repositories into one just via a view.
This might also be useful for creating per-profile views while
allowing packages to be shared between profiles in cases when
hosting a separate build would be redundant. It might be possible to
save lots of build time, disk space, and testing that way.
Being able to have multiple builds of the same package with
different USE settings also solves bug 150031 [1].
>> Eventually, I'd like to see gentoo
>> officially distributing binary packages, so that we'll be able to
>> get a slice of the binary distribution pie. When that happens, we're
>> certainly not going to want to have race conditions like these in
>> our public binhosts.
>>
>
> I'd suggest abandoning the current repository layout of Packages then, since
> it's irrevocably flawed. You can hack around it via jamming timestamp/md5
> info into URI, but that's not a sane solution.
Shrug, it's a handy way to solve race conditions given the existing
version 0 format. It's not optimal, so we'll surely want something
better in version 1.
[1] http://bugs.gentoo.org/show_bug.cgi?id=150031
--
Thanks,
Zac
* Re: [gentoo-portage-dev] Package compression header for binhosts
2010-06-01 23:53 ` Zac Medico
@ 2010-06-02 3:42 ` Brian Harring
0 siblings, 0 replies; 8+ messages in thread
From: Brian Harring @ 2010-06-02 3:42 UTC
To: gentoo-portage-dev
On Tue, Jun 01, 2010 at 04:53:31PM -0700, Zac Medico wrote:
> On 06/01/2010 02:52 PM, Brian Harring wrote:
> > That bug isn't about a collision, it's about files being replaced underneath
> > Packages feet. Even with the tricks you've leveled the issue of things
> > changing under foot still is possible- you've just made the race less
> > likely.
>
> AFAIK the race is completely eliminated by the RCU-like snapshot
> mechanism. I think I like your hash-in-the-filename idea better
> though, since it seems simpler to implement and maintain.
You're forgetting about how one actually updates the snapshot- client
grabs the packages cache, starts pulling binpkgs. During that time
the snapshot is being updated- the client now has a stale view of the
repo, and since the repo's structure is based on cpv (which doesn't
change regardless of metadata changing like use configuration) they
can grab a binpkg that has the wrong metadata/checksums.
It's racy in exactly the same way as before, the only difference is
you switched it to rewrite the tbz2 in a temp file instead of directly
to the tbz2. That narrows the window, but the same level of risk remains
for any form of updates.
Snapshot script just duct-tapes around the issue, while leaving the
core flaw intact.
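The stale-index scenario can be simulated in Python. This is a toy model under stated assumptions: dicts stand in for the server's file tree, the payloads are fake, and the hash-keyed store is assumed to keep old files around while clients may still reference them.

```python
import hashlib


def md5(data):
    return hashlib.md5(data).hexdigest()


old = b"foo-1.0 built with USE=-ssl"
new = b"foo-1.0 built with USE=ssl"
cpv = "app-misc/foo-1.0"

# The client fetched the index before the server-side update.
client_index = {cpv: md5(old)}

# cpv-keyed layout: the rebuild lands on the same path.
cpv_store = {cpv + ".tbz2": old}
cpv_store[cpv + ".tbz2"] = new  # server update replaces the file
fetched = cpv_store[cpv + ".tbz2"]
cpv_ok = md5(fetched) == client_index[cpv]

# hash-keyed layout: the rebuild is added under a new name.
hash_store = {md5(old) + ".tbz2": old}
hash_store[md5(new) + ".tbz2"] = new  # old file is left in place
fetched = hash_store[client_index[cpv] + ".tbz2"]
hash_ok = md5(fetched) == client_index[cpv]

if __name__ == "__main__":
    print("cpv layout checksum matches:", cpv_ok)
    print("hash layout checksum matches:", hash_ok)
```

In the cpv-keyed case the client ends up with a file whose checksum no longer matches its index; in the hash-keyed case the stale index still resolves to exactly the file it describes.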
> > What I was talking about was solving this issue once and for all via
> > restructuring, and specifically refering to the potential of an md5
> > collision in the URI space- specifically what I'm implementing for pkgcore
> > is the ability to do stupid stuff like this-
> >
> > http://host/binpkg-store/$MD5.{txz,tbz2,tgz}
>
> That would be the MD5 of the entire file, after compression and
> having the xpak segment appended, right?
Yep. The only potential issue here is the unlikely case of a CHF
collision. There is a way to resolve that one too, although it is
outside of what I'm willing to do format wise (namely a secondary url
fallback).
> > then have multiple views accessible just via pointing the binpkg repo remote
> > url at
> >
> > http://host/views/license/oss-approved/
> > http://host/views/keywords/amd64/stable/
> > http://host/views/raw/ # no filtering on the view of the binpkg repo, see
> > everything.
>
> So, the default path of a package would come from looking at the MD5
> in the Packages file and then mapping that to a path?
default path would be defined in the preamble by a string interpolated
pattern; whatever folk wanted to use.
A sane default is %(host)s/raw-pkgs/%(md5)s.%(compressor_ext)s imo.
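Expanded in Python, that pattern might be used as follows. The compressor names, the mapping to extensions, the host, and the `binpkg_uri` helper are assumptions for illustration; only the pattern string and the txz/tbz2/tgz extensions come from this thread.

```python
DEFAULT_URI = "%(host)s/raw-pkgs/%(md5)s.%(compressor_ext)s"

# Assumed enumeration of known compressors and their file extensions.
COMPRESSOR_EXT = {"xz": "txz", "bzip2": "tbz2", "gzip": "tgz"}


def binpkg_uri(host, md5, compressor, template=DEFAULT_URI):
    """Expand the preamble pattern for one package entry."""
    fields = {
        "host": host,
        "md5": md5,
        "compressor_ext": COMPRESSOR_EXT[compressor],
    }
    return template % fields


if __name__ == "__main__":
    # Placeholder digest; a real one would come from the package's MD5 field.
    print(binpkg_uri("http://binhost.example", "0" * 32, "xz"))
```

An unknown compressor raises KeyError here, which matches the idea of a closed enumeration rather than free-form extensions.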
> > Via restructuring where the binpkgs are stored and doing this approach,
> > multiple views can be had easily into the repo. An additional benefit of
> > this approach is that via making URI able to point outside the host, you
> > could combine multiple seperate repositories into one just via a view.
>
> This might also be useful for creating per-profile views while
> allowing packages to be shared between profiles in cases when
> hosting a separate build would be redundant. It might be possible to
> save lots of build time, disk space, and testing that way.
>
> Being able to have multiple builds of the same package with
> different USE settings is also solves bug 150031 [1].
Yep and Yep.
> >> Eventually, I'd like to see gentoo
> >> officially distributing binary packages, so that we'll be able to
> >> get a slice of the binary distribution pie. When that happens, we're
> >> certainly not going to want to have race conditions like these in
> >> our public binhosts.
> >>
> >
> > I'd suggest abandoning the current repository layout of Packages then, since
> > it's irrevocably flawed. You can hack around it via jamming timestamp/md5
> > info into URI, but that's not a sane solution.
>
> Shrug, it's a handy way to solve race conditions given the existing
> version 0 format. It's not optimal, so we'll surely want something
> better in version 1.
The problem here is that version 0 still maps down to the existing
binpkg on-disk layout. That layout is the core flaw here- as long as
binpkgs are stored cpv-oriented, version 0 isn't able to do the crazy
things I'm intending.
~harring
Thread overview: 8+ messages
2010-06-01 3:32 [gentoo-portage-dev] Package compression header for binhosts Zac Medico
2010-06-01 5:16 ` Brian Harring
2010-06-01 20:01 ` Ned Ludd
2010-06-01 21:22 ` Brian Harring
2010-06-01 21:37 ` Zac Medico
2010-06-01 21:52 ` Brian Harring
2010-06-01 23:53 ` Zac Medico
2010-06-02 3:42 ` Brian Harring