* [gentoo-dev] proposed md5sum change
From: Brian Harring @ 2003-06-11 16:02 UTC
To: Gentoo-dev
Hola all,
Straight to the point: I propose that instead of md5summing the compressed
distfile, we md5sum the actual data, i.e. the uncompressed tarball. There are
a couple of reasons/benefits for this:
1) users are currently tied to a specific compression on the tarball-
for those who would want to convert their distfiles to bzip2 rather than
gzip (for space reasons), they're a bit out of luck- yes, they can
attempt to update md5sum digests or force it to ignore the incorrect
sums, but that gets old *real* quick.
2) Say for whatever reason, the tarball gets inflated- if the original
tarball was compressed w/ say bzip2 0.90, and the user has bzip2 1.x,
even if they recompress it they're out of luck- the bzip2 algorithm was
tweaked for better compression after .90, resulting in a different
md5sum than the original. Yet the distfile is still data-correct- it's
just compressed slightly differently.
3) For anyone making a serious attempt at distfile diffs, the
reconstruction process is seriously borked by the possibility that it's
data-correct, but the compression has changed/been improved resulting in
a different md5sum. I do know JJW's deltup attempt ran smack dab into
this problem w/ the openoffice tarballs. I've also run into the
problem, and I'd prefer not to use the deltup method of having both old
bzip2 and current bzip2 installed.
In terms of performance of the md5summing, it would still likely be i/o
limited- decompression would be done in memory after all.
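To make the proposal concrete, here is a minimal sketch in Python. The helper names (`file_md5`, `data_md5`) and the suffix table are illustrative assumptions, not existing portage code. It streams the distfile through the matching decompressor in fixed-size chunks, so only a small buffer is held in memory, and hashes the uncompressed bytes rather than the file as stored.

```python
import bz2
import gzip
import hashlib
import os

# Hypothetical suffix table: files with any other suffix are hashed as-is.
OPENERS = {".gz": gzip.open, ".tgz": gzip.open, ".bz2": bz2.open}


def _md5_of_stream(stream, chunk_size=64 * 1024):
    """Hash a binary stream in fixed-size chunks to keep memory use small."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def file_md5(path):
    """MD5 of the distfile exactly as stored on disk (the current scheme)."""
    with open(path, "rb") as stream:
        return _md5_of_stream(stream)


def data_md5(path):
    """MD5 of the uncompressed data inside the distfile (the proposal)."""
    opener = OPENERS.get(os.path.splitext(path)[1], open)
    with opener(path, "rb") as stream:
        return _md5_of_stream(stream)
```

With this, the same tarball stored as `.tar.gz` and as `.tar.bz2` gives two different `file_md5` values but one shared `data_md5` value, which is the property the three points above rely on.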
That said and done, I'm not after bludgeoning someone into implementing
this- assuming people don't have any major criticisms of it and it
has more than a snowball's chance in hell of being used, I'm more than
willing to code it myself.
Comments/Flames/Death Threats?
~Brian
--
gentoo-dev@gentoo.org mailing list
* Re: [gentoo-dev] proposed md5sum change
From: Seemant Kulleen @ 2003-06-12 6:56 UTC
To: gentoo-dev
Hi Brian & everyone,
I like the beginnings of this idea, in terms of what it could lead to. But before you start coding (that's not an official sanction, just me speaking :P), there are a few concerns. I've been talking with John extensively about his deltup package (which incidentally is in portage already).
The first test I did was on two ~500K tarballs which produced a dtu of ~400K (zsh). The second test I did was on ~17MB tarballs, which produced a dtu of 184K. Now *THAT* knocked my socks off. I'm still looking for them, they flew so far.
So, tonight John and I were talking about this very proposal (and a few nights ago I was thinking on this idea as well, except my thoughts were towards an md5sum of the uncompressed directory (didn't know and still don't know if that's even possible)). However, your approach has merit over mine.
Now, the promised concern bit. Unfortunately, while the majority of the packages do come in a compressed tarball format, there are many (enough to make it a corner case of some concern) packages which do not. Off the top of my head, I can think of .Z (forget which package), .rpm (redhat-artwork), .bin (realplayer). And in some cases, we just get an uncompressed README file in the SRC_URI (or the wacom.c file in xfree, though I'm not certain of it right this moment).
Anyway, the current approach keeps it simple in that the md5sum is of the *item(s) that is/are downloaded*. The first reason I can see is what I stated above. There are other reasons I can see as well. You know, immediately upon fetching the set of source items that they are bad. So, no disk i/o or cpu cycles are spent in the unpacking; and no potentially nasty code is even untarred on the system, yet.
I'm not terribly technically inclined, but I am certain that Daniel or Nicholas (carpaski) can fill in the holes of my reasoning.
So, please understand, that I am not shooting your idea down at all, because I really do like it, and I am definitely a fan of deltup and would like to see it integrated as an official thingywhatsit. However, we must think upon these concerns.
Thanks,
--
Seemant Kulleen
Developer and Project Co-ordinator,
Gentoo Linux http://www.gentoo.org/~seemant
Public Key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x3458780E
Key fingerprint = 23A9 7CB5 9BBB 4F8D 549B 6593 EDA2 65D8 3458 780E
* Re: [gentoo-dev] proposed md5sum change
From: Seemant Kulleen @ 2003-06-12 7:06 UTC
To: gentoo-dev
On Wed, 11 Jun 2003 23:56:49 -0700
Seemant Kulleen <seemant@gentoo.org> wrote:
> The first test I did was on two ~500K tarballs which produced a dtu of ~400K (zsh). The second test I did was on ~17MB tarballs, which produced a dtu of 184K. Now *THAT* knocked my socks off. I'm still looking for them, they flew so far.
>
Oh, by the way, for the sake of completeness, the socks-knocking package was abiword's recent upgrade (1.0.4 to 1.0.5 I believe). If you want to check it out, the dtu is on the mirrors and also http://cvs.gentoo.org/~seemant
--
Seemant Kulleen
Developer and Project Co-ordinator,
Gentoo Linux http://www.gentoo.org/~seemant
Public Key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x3458780E
Key fingerprint = 23A9 7CB5 9BBB 4F8D 549B 6593 EDA2 65D8 3458 780E
* Re: [gentoo-dev] proposed md5sum change
From: Evan Powers @ 2003-06-12 7:53 UTC
To: gentoo-dev
On Thursday 12 June 2003 02:56 am, Seemant Kulleen wrote:
> Anyway, the current approach keeps it simple in that the md5sum is of the
> *item(s) that is/are downloaded*. The first reason I can see is what I
> stated above. There are other reasons I can see as well. You know,
> immediately upon fetching the set of source items that they are bad. So,
> no disk i/o or cpu cycles are spent in the unpacking; and no potentially
> nasty code is even untarred on the system, yet.
Well, I can think of a way to address this part of the issue, anyway.
* The portage tree still has MD5 digests for the item(s) which is/are
downloaded.
* After downloading, emerge executes a (user specified?) program on the newly
downloaded file. This program applies some transform to the file; maybe it
decompresses whatever format the file is in and re-compresses it with bzip2,
or maybe it only format-shifts files which are over a certain size threshold,
whatever.
* Next, emerge adds a new record to a database (text, one record per line, for
example) somewhere in /var. This database has the original name of the
downloaded file, the original MD5 digest, the new name, and the new MD5
digest.
* When emerge wants a new file, it checks the database to see if the desired
file has been mapped to a new transformed name and MD5.
You'd probably want to make the digest file readable only by the portage user
or something.
With infrastructure like this you could even add more interesting
functionality to portage pretty easily. Like maybe the transform program
uploads the file to the corporate internal FTP mirror, and the database maps
the original name to the URI which locates it.
Or, if emerge exported sufficient context to the transform program, you could
fix the case where a particular braindead package is available only as
package.tar.gz, not package-version.tar.gz. The transform program would add
the version to the filename, and the database would be allowed to have
multiple entries for each original file name (provided they had different MD5
digests). Then emerge would just pick the record with both the desired
original name and the desired original MD5 digest.
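The record format described above can be sketched as follows, in Python. This is only an illustration under stated assumptions: the tab-separated layout and the function names `add_record` and `lookup` are hypothetical and not part of emerge. Each line holds the original name, the original MD5, the new name and the new MD5, and a lookup matches on both the original name and the original MD5, so several records may share one original name.

```python
def add_record(db_path, orig_name, orig_md5, new_name, new_md5):
    """Append one tab-separated record to the text database."""
    with open(db_path, "a", encoding="utf-8") as db:
        db.write("\t".join((orig_name, orig_md5, new_name, new_md5)) + "\n")


def lookup(db_path, orig_name, orig_md5):
    """Return (new_name, new_md5) for a matching record, or None."""
    try:
        with open(db_path, encoding="utf-8") as db:
            for line in db:
                fields = line.rstrip("\n").split("\t")
                if len(fields) != 4:
                    continue  # skip malformed lines rather than fail
                if fields[0] == orig_name and fields[1] == orig_md5:
                    return fields[2], fields[3]
    except FileNotFoundError:
        pass  # no database yet means nothing has been transformed
    return None
```

Matching on the name and digest pair is what lets an unversioned package.tar.gz map to different versioned files, one per upstream release.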
Evan
* Re: [gentoo-dev] proposed md5sum change
From: Paul de Vrieze @ 2003-06-12 8:15 UTC
To: gentoo-dev
On Thursday 12 June 2003 09:53, Evan Powers wrote:
> On Thursday 12 June 2003 02:56 am, Seemant Kulleen wrote:
> > Anyway, the current approach keeps it simple in that the md5sum is of
> > the *item(s) that is/are downloaded*. The first reason I can see is what
> > I stated above. There are other reasons I can see as well. You know,
> > immediately upon fetching the set of source items that they are bad. So,
> > no disk i/o or cpu cycles are spent in the unpacking; and no potentially
> > nasty code is even untarred on the system, yet.
>
> Well, I can think of a way to address this part of the issue, anyway.
>
> * The portage tree still has MD5 digests for the item(s) which is/are
> downloaded.
>
This would make things easy for most people, and would also save lots of cpu
cycles.
> * After downloading, emerge executes a (user specified?) program on the
> newly downloaded file. This program applies some transform to the file;
> maybe it decompresses whatever format the file is in and re-compresses it
> with bzip2, or maybe it only format-shifts files which are over a certain
> size threshold, whatever.
>
Always fun, maybe we would also have a list of allowed/blacklisted extensions
like RPM files which are already compressed. (Or openoffice files/jar-balls
which lose validity if compressed in a different format, but could be
recompressed in zip format)
> * Next, emerge adds a new record to a database (text, one record per line,
> for example) somewhere in /var. This database has the original name of the
> downloaded file, the original MD5 digest, the new name, and the new MD5
> digest.
>
Maybe extending the serverside digest to also include an unpacked digest
(where applicable) would be smarter in validation of patch based and oddball
cases. (This doesn't mean that client-side digesting isn't useful, it is for
not having to unpack first).
> With infrastructure like this you could even add more interesting
> functionality to portage pretty easily. Like maybe the transform program
> uploads the file to the corporate internal FTP mirror, and the database
> maps the original name to the URI which locates it.
>
> Or, if emerge exported sufficient context to the transform program, you
> could fix the case where a particular braindead package is available only
> as package.tar.gz, not package-version.tar.gz. The transform program would
> add the version to the filename, and the database would be allowed to have
> multiple entries for each original file name (provided they had different
> MD5 digests). Then emerge would just pick the record with both the desired
> original name and the desired original MD5 digest.
>
I do not know whether package (name) transformation is the same thing, but I
support it fully. I believe that name transformation should also be specified
in the ebuild. That would lead to a variable like:
TRANSFORM="foo.tgz:foo-0.0.1.tar.gz
bar-0.1Beta1.2.tar.bz2:bar-0.1_beta102.tar.bz2"
Which would automatically transform filenames. This would also apply to the
mirrors, so portage needs to be changed to try to fetch the transformed name
from the mirrors while trying to fetch the original name from the source. I
do believe though that name transformation is a separate issue.
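As a rough illustration of how such a variable could be read, here is a Python sketch. TRANSFORM is only the variable proposed in this mail, not an existing portage feature, and the function names `parse_transform` and `local_name` are hypothetical. The sketch assumes entries are separated by whitespace and that the first colon separates the upstream name from the local name.

```python
def parse_transform(value):
    """Parse whitespace-separated "upstream:local" pairs into a dict."""
    mapping = {}
    for pair in value.split():
        upstream, sep, local = pair.partition(":")
        if not (sep and upstream and local):
            raise ValueError(f"malformed TRANSFORM entry: {pair!r}")
        mapping[upstream] = local
    return mapping


def local_name(upstream_name, mapping):
    """Name to use in distfiles and on mirrors; unchanged if not listed."""
    return mapping.get(upstream_name, upstream_name)


# The example value from the mail above.
TRANSFORM = """foo.tgz:foo-0.0.1.tar.gz
bar-0.1Beta1.2.tar.bz2:bar-0.1_beta102.tar.bz2"""
```

A fetcher could then request `local_name(name, mapping)` from the mirrors and fall back to the upstream name at the original source.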
Paul
--
Paul de Vrieze
Researcher
Mail: pauldv@cs.kun.nl
Homepage: http://www.cs.kun.nl/~pauldv
* Re: [gentoo-dev] proposed md5sum change
From: Brian Harring @ 2003-06-12 16:31 UTC
To: gentoo-dev; +Cc: Paul de Vrieze
Replies are below...
On Thu, 2003-06-12 at 03:15, Paul de Vrieze wrote:
> On Thursday 12 June 2003 09:53, Evan Powers wrote:
> > On Thursday 12 June 2003 02:56 am, Seemant Kulleen wrote:
> > > Anyway, the current approach keeps it simple in that the md5sum is of
> > > the *item(s) that is/are downloaded*. The first reason I can see is what
> > > I stated above. There are other reasons I can see as well. You know,
> > > immediately upon fetching the set of source items that they are bad. So,
> > > no disk i/o or cpu cycles are spent in the unpacking; and no potentially
> > > nasty code is even untarred on the system, yet.
> >
> > Well, I can think of a way to address this part of the issue, anyway.
> >
> > * The portage tree still has MD5 digests for the item(s) which is/are
> > downloaded.
> >
> This would make things easy for most people, and would also save lots of cpu
> cycles.
Agreed. Ignoring the cpu cost of decompression, gzip decompressing and
md5ing is fairly close to i/o at least for my machine, which is an
xp1700 w/ about 40mb/s throughput on my hd. Bzip2 (as jjw pointed out)
is an entirely different beast, decompressing it is not close to i/o
speeds. At this stage of the game, recompressed/patched tarballs would
be the minority- down the line assuming diffing/patching takes off, this
might be something to think about.
> > * After downloading, emerge executes a (user specified?) program on the
> > newly downloaded file. This program applies some transform to the file;
> > maybe it decompresses whatever format the file is in and re-compresses it
> > with bzip2, or maybe it only format-shifts files which are over a certain
> > size threshold, whatever.
> >
> Always fun, maybe we would also have a list of allowed/blacklisted extensions
> like RPM files which are already compressed. (Or openoffice files/jar-balls
> which lose validity if compressed in a different format, but could be
> recompressed in zip format)
>
> > * Next, emerge adds a new record to a database (text, one record per line,
> > for example) somewhere in /var. This database has the original name of the
> > downloaded file, the original MD5 digest, the new name, and the new MD5
> > digest.
> >
> Maybe extending the serverside digest to also include an unpacked digest
> (where applicable) would be smarter in validation of patch based and oddball
> cases. (This doesn't mean that client-side digesting isn't useful, it is for
> not having to unpack first).
How about this, and mind you this is just for dealing w/ md5sums-
instead of doing any db-style stuff, just create a file along side (w/in
the distfile dir most likely) that contains the uncompressed data's
md5sum. If you go about creating a db type setup, you're going to run
into major issues in an environment where the distfiles dir is shared
out to other systems since you're not going to be sharing the db.
Basically I could see this- a simple script that a user can use to
convert non-gz distfiles to bzip2 tarballs which creates the file (think
linux-2.4.19.tar.bz2 and linux-2.4.19.tar.md5)... for distfile
diffing/patching, it uses the same method. If the reconstructed and
recompressed version's md5 matches what portage has, hoozah, no need to
create the file- if not (say upgrading openoffice), we create the file.
Also, it dawned on me that md5summing the data has an added bonus of
being indifferent to the patching/differencing method. In other words,
we could use the standard unified diffs that are provided for the
kernel versions for instance.
As for transforming, would it really be needed? I've spent a bit of
time rooting through the distfile dir and I don't recall seeing
non-versioned names, although as always, I could be wrong.
* Re: [gentoo-dev] proposed md5sum change
From: Paul de Vrieze @ 2003-06-12 19:39 UTC
To: gentoo-dev
On Thursday 12 June 2003 18:31, Brian Harring wrote:
> Replies are below...
>
<cut>
> instead of doing any db-style stuff, just create a file along side (w/in
> the distfile dir most likely) that contains the uncompressed data's
> md5sum. If you go about creating a db type setup, you're going to run
> into major issues in an environment where the distfiles dir is shared
> out to other systems since you're not going to be sharing the db.
Keeping them in the distdir might create a mess, so I don't know about this
one, but keeping the md5sums in an easily exported area is certainly not a bad
idea.
> Basically I could see this- a simple script that a user can use to
> convert non-gz distfiles to bzip2 tarballs which creates the file (think
> linux-2.4.19.tar.bz2 and linux-2.4.19.tar.md5)... for distfile
> diffing/patching, it uses the same method. If the reconstructed and
> recompressed version's md5 matches what portage has, hoozah, no need to
> create the file- if not (say upgrading openoffice), we create the file.
> Also, it dawned on me that md5summing the data has an added bonus of
> being indifferent to the patching/differencing method. In other words,
> we could use the standard unified diffs that are provided for the
> kernel versions for instance.
>
> As for transforming, would it really be needed? I've spent a bit of
> time rooting through the distfile dir and I don't recall seeing
> non-versioned names, although as always, I could be wrong.
>
You are wrong; there are even auxiliary files in the distfiles repository.
Unversioned tarballs, though, are certainly not helping to get a package into
the repository. Sometimes we provide a versioned version of a tarball on the
portage mirrors.
Paul
--
Paul de Vrieze
Researcher
Mail: pauldv@cs.kun.nl
Homepage: http://www.devrieze.net
* Re: [gentoo-dev] proposed md5sum change
From: Brian Harring @ 2003-06-12 21:24 UTC
To: gentoo-dev; +Cc: pauldv
<snip>
> > As for transforming, would it really be needed? I've spent a bit of
> > time rooting through the distfile dir and I don't recall seeing
> > non-versioned names, although as always, I could be wrong.
> >
> You are wrong; there are even auxiliary files in the distfiles repository.
> Unversioned tarballs, though, are certainly not helping to get a package into
> the repository. Sometimes we provide a versioned version of a tarball on the
> portage mirrors.
I'm guessing what you're saying is that I'm wrong on the 'auxiliary
files' aspect? Example? Unless you were referencing my comment about
non-versioned tarballs... I guess I don't follow on what I'm wrong
about.
As for changing the name of a tarball locally so it's versioned, that
doesn't strike me as the best solution- the name change ought to take
place at the server/mirror level. Course that's my opinion...
~Brian
* Re: [gentoo-dev] proposed md5sum change
From: Paul de Vrieze @ 2003-06-13 7:52 UTC
To: gentoo-dev
On Thursday 12 June 2003 23:24, Brian Harring wrote:
> <snip>
>
> > > As for transforming, would it really be needed? I've spent a bit of
> > > time rooting through the distfile dir and I don't recall seeing
> > > non-versioned names, although as always, I could be wrong.
> >
> > You are wrong; there are even auxiliary files in the distfiles repository.
> > Unversioned tarballs, though, are certainly not helping to get a package
> > into the repository. Sometimes we provide a versioned version of a tarball
> > on the portage mirrors.
>
> I'm guessing what you're saying is that I'm wrong on the 'auxiliary
> files' aspect? Example? Unless you were referencing my comment about
No, on the point that there are no unversioned files in the distfiles dir.
Unfortunately there are.
> non-versioned tarballs... I guess I don't follow on what I'm wrong
> about.
> As for changing the name of a tarball locally so it's versioned, that
> doesn't strike me as the best solution- the name change ought to take
> place at the server/mirror level. Course that's my opinion...
> ~Brian
The server of the package unfortunately is not under control of gentoo, as
each package is developed by their respective independent developers. In some
cases it is not possible to change that name, in other cases the version is
specified by a directory name.
My solution would have the mirrors provide the versioned files only, but
specifying a transformation would enable automatic transformation of name on
both the mirrors, and at the user, for files that are not (or will not be)
mirrored (yet).
Paul
--
Paul de Vrieze
Researcher
Mail: pauldv@cs.kun.nl
Homepage: http://www.cs.kun.nl/~pauldv