public inbox for gentoo-dev@lists.gentoo.org
* [gentoo-dev] Re: proposed md5sum change
@ 2003-06-23  2:41 Martin Pool
From: Martin Pool @ 2003-06-23  2:41 UTC
  To: gentoo-dev

On Wed, 11 Jun 2003 11:02:02 -0500, Brian Harring wrote:

> Hola all,
> Straight to the point, I propose instead of md5summing the compressed
> distfile, we md5sum the actual data, the tarball.  

Speaking as somebody who has worked on rsync and librsync: I agree, I
think that would be a big improvement.

The uncompressed form is the natural and efficient place to do delta
compression.

This implies that the client, after applying a patch, ends up with an
uncompressed (e.g. .tar) file.  Making the client recompress it is
wasteful, because compression is expensive and in any case it's just
going to be uncompressed and extracted.

Not only is it wasteful, but it's hard to do correctly.  As other
people have noted, compression is not very reproducible.
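
To illustrate the reproducibility point, here is a minimal Python sketch
(not part of any existing Gentoo tool). It compresses the same payload
twice with gzip, changing only the timestamp stored in the gzip header.
The payload and variable names are made up for the example; the `mtime`
keyword of `gzip.compress` needs Python 3.8 or later.

```python
import gzip
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hex MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical stand-in for the contents of a .tar file.
payload = b"example tarball contents\n" * 1000

# Same data, same compressor, but a different timestamp in the gzip header.
first = gzip.compress(payload, mtime=0)
second = gzip.compress(payload, mtime=1)

# The compressed byte streams differ, so their MD5 sums differ...
compressed_differ = md5_hex(first) != md5_hex(second)

# ...while the uncompressed data, and hence its MD5 sum, is identical.
uncompressed_match = (
    md5_hex(gzip.decompress(first)) == md5_hex(gzip.decompress(second))
)

print(compressed_differ, uncompressed_match)
```

Compression level and compressor version can change the output in the
same way, which is why a checksum of the uncompressed data is the more
stable thing to compare against.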

This implies that the script which unpacks and builds the source needs
to be able to accept the unpacked form rather than the packed form as
at present.  That doesn't sound terribly hard.

Some people might want to store packages in compressed form because
they're low on disk, and so might want to bzip them up again after
applying the patch.  On the other hand, some people might want to
keep them uncompressed because their CPU is slow.  On the third hand,
some people might want to *recompress* everything into bz2 even if it
was originally .gz.  Any of these can be supported through some future
mechanism; they don't need to determine the download system.

Seemant Kulleen wrote:

> Now, the promised concern bit.  Unfortunately, while the majority of the
> packages do come in a compressed tarball format, there are many (enough to
> make it a corner case of some concern) packages which do not.  Off the top
> of my head, I can think of .Z (forget which package), .rpm
> (redhat-artwork), .bin (realplayer).  And in some cases, we just get an
> uncompressed README file in the SRC_URI (or the wacom.c file in xfree,
> though I'm not certain of it right this moment).

.Z files can be uncompressed and handled as for gzip (I think gzip
handles them in fact.)

.zip, .rpm, or self-extracting .exe files can also be uncompressed and
diffed, at least in principle.

Uncompressed READMEs, patches or .c files are just too easy. :-)

If you don't recognize the format, you can try to do a delta on the
binary form.  If the delta is too big, drop it.
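
A rough Python sketch of that fallback logic follows, assuming formats
are recognized by their leading magic bytes. The function names and the
0.5 size ratio are hypothetical choices for illustration, not anything
proposed in this thread.

```python
from typing import Optional

# Well-known leading magic bytes for some container/compression formats.
MAGIC = (
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"\x1f\x9d", "compress"),        # .Z files
    (b"PK\x03\x04", "zip"),
    (b"\xed\xab\xee\xdb", "rpm"),
)


def sniff_format(head: bytes) -> Optional[str]:
    """Guess the format from the first few bytes; None if unrecognized."""
    for magic, name in MAGIC:
        if head.startswith(magic):
            return name
    return None


def delta_worth_keeping(delta_size: int, full_size: int,
                        max_ratio: float = 0.5) -> bool:
    """Keep a delta only if it is clearly smaller than the full file.

    max_ratio is an arbitrary threshold chosen for this sketch.
    """
    return full_size > 0 and delta_size <= max_ratio * full_size
```

If `sniff_format` returns `None`, the delta would be computed on the raw
bytes and then kept or dropped according to `delta_worth_keeping`.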

Experience on Debian has shown that compiled binaries in general do
not delta-compress very well, so I think not being able to uncompress
them is not a terrible thing.

The point:

Gentoo should distribute the md5sums for both the compressed and
uncompressed forms of packages.  They are checked in that order;
either is sufficient.
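
As a minimal sketch of that check order in Python: the function name
`verify_distfile` and the suffix-based choice of decompressor are
assumptions for this example, not existing Portage behaviour.

```python
import bz2
import gzip
import hashlib
from typing import Optional

# Decompressors selected by file suffix (an assumption of this sketch).
OPENERS = {".gz": gzip.open, ".tgz": gzip.open, ".bz2": bz2.open}


def stream_md5(fileobj, chunk_size: int = 1 << 16) -> str:
    """MD5 of everything readable from a binary file object."""
    digest = hashlib.md5()
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_distfile(path: str, compressed_md5: str,
                    uncompressed_md5: str) -> Optional[str]:
    """Return which checksum matched, or None if neither did."""
    # 1. Cheap check on the file exactly as stored on disk.
    with open(path, "rb") as raw:
        raw_md5 = stream_md5(raw)
    if raw_md5 == compressed_md5:
        return "compressed"

    # 2. Otherwise compare the uncompressed data.
    opener = next((o for s, o in OPENERS.items() if path.endswith(s)), None)
    if opener is None:
        # File is stored uncompressed (e.g. a .tar left behind by a patch).
        return "uncompressed" if raw_md5 == uncompressed_md5 else None
    with opener(path, "rb") as plain:
        return "uncompressed" if stream_md5(plain) == uncompressed_md5 else None
```

A regular download matches at step 1 and never pays for decompression;
a file rebuilt from a delta, or recompressed locally, is accepted at
step 2.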

Regular non-delta downloads will proceed as usual, and the md5sum can
be checked immediately after download.  There is no added cost.

Patch downloads can be done as follows:

 - download the xdelta
 - uncompress the old file, pipe it into 'xdelta patch', store the result
 - check the result against the uncompressed MD5sum
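
The steps above could be sketched in Python roughly as follows. Several
things here are assumptions: the helper names are hypothetical, the
command line follows the xdelta 1.x form `xdelta patch PATCH FROMFILE
TOFILE`, and the old file is decompressed to a temporary file rather
than piped, since it is not certain that xdelta accepts its source on
stdin.

```python
import bz2
import gzip
import hashlib
import os
import shutil
import subprocess
import tempfile


def run_xdelta_patch(delta_path: str, old_path: str, new_path: str) -> None:
    # Assumed xdelta 1.x invocation: xdelta patch PATCH FROMFILE TOFILE
    subprocess.run(["xdelta", "patch", delta_path, old_path, new_path],
                   check=True)


def apply_delta(old_compressed: str, delta_path: str, new_path: str,
                expected_md5: str, patcher=run_xdelta_patch) -> str:
    """Rebuild the new uncompressed file and verify its MD5 sum."""
    opener = bz2.open if old_compressed.endswith(".bz2") else gzip.open

    # Uncompress the old distfile to a temporary file.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        with opener(old_compressed, "rb") as src:
            shutil.copyfileobj(src, tmp)
        old_plain = tmp.name
    try:
        patcher(delta_path, old_plain, new_path)
    finally:
        os.unlink(old_plain)

    # Check the result against the MD5 sum of the uncompressed form.
    digest = hashlib.md5()
    with open(new_path, "rb") as result:
        for chunk in iter(lambda: result.read(1 << 16), b""):
            digest.update(chunk)
    if digest.hexdigest() != expected_md5:
        os.unlink(new_path)
        raise ValueError("MD5 mismatch after patching")
    return new_path
```

The `patcher` argument exists only so a different delta tool can be
swapped in; the result is left uncompressed, with no recompression step.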

As far as I can see this removes any need for a special deltup file
format.  Just simply send xdeltas.

A great advantage is that xdeltas are useful to people other than
Gentoo, so people upstream or mirrors may be more willing to
distribute them alongside the original source.  

Much as I love the idea of deltup, I think the current code is a bit
messy and making up a new format is unjustified.

> In terms of performance of the md5summing, it would still likely be i/o
> limited- decompression would be done in memory after all.

The approach above is much *more* efficient than deltup, which makes
an extra roundtrip to bz2 format.

What have I missed?

-- 
Martin

If you don't know how to code, then you don't know how to design the
software either. Period. You can only cause trouble.
                -- Havoc Pennington, http://ometer.com/hacking.html

--
gentoo-dev@gentoo.org mailing list

