public inbox for gentoo-dev@lists.gentoo.org
* [gentoo-dev] Re: proposed md5sum change
From: Martin Pool @ 2003-06-23  2:41 UTC
  To: gentoo-dev

On Wed, 11 Jun 2003 11:02:02 -0500, Brian Harring wrote:

> Hola all,
> Straight to the point, I propose instead of md5summing the compressed
> distfile, we md5sum the actual data, the tarball.  

Speaking as somebody who has worked on rsync and librsync: I agree, I
think that would be a big improvement.

The uncompressed form is the natural and efficient place to do delta
compression.

This implies that the client, after applying a patch, ends up with an
uncompressed (e.g. .tar) file.  Making the client recompress it is
wasteful, because compression is expensive and in any case it's just
going to be uncompressed and extracted.

Not only is it wasteful, but it's hard to do correctly.  As other
people have noted, compression is not very reproducible.

This implies that the script which unpacks and builds the source needs
to be able to accept the unpacked form rather than the packed form as
at present.  That doesn't sound terribly hard.

Some people might want to store packages in compressed form because
they're low on disk, and so might want to bzip them up again after
applying the patch.  On the other hand, some people might want to
keep them uncompressed because their CPU is slow.  On the third hand,
some people might want to *recompress* everything into bz2 even if it
was originally .gz.  Any of these can be supported through some future
mechanism; they don't need to determine the download system.

Seemant Kulleen wrote:

> Now, the promised concern bit.  Unfortunately, while the majority of the
> packages do come in a compressed tarball format, there are many (enough to
> make it a corner case of some concern) packages which do not.  Off the top
> of my head, I can think of .Z (forget which package), .rpm
> (redhat-artwork), .bin (realplayer).  And in some cases, we just get an
> uncompressed README file in the SRC_URI (or the wacom.c file in xfree,
> though I'm not certain of it right this moment).

.Z files can be uncompressed and handled as for gzip (I think gzip
handles them in fact.)

.zip, .rpm, or self-extracting .exe files can also be uncompressed and
diffed, at least in principle.

Uncompressed READMEs, patches or .c files are just too easy. :-)

If you don't recognize the format, you can try to do a delta on the
binary form.  If the delta is too big, drop it.
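
The "drop it if too big" rule could be sketched as a plain size check.
The function name and the 50% cutoff are illustrative assumptions, not
something proposed in this thread:

```python
def delta_is_worthwhile(delta_size: int, full_size: int,
                        max_ratio: float = 0.5) -> bool:
    """Return True if shipping the delta saves enough over a full download.

    max_ratio is a made-up cutoff: a delta bigger than this fraction of
    the full distfile is treated as not worth keeping.
    """
    if full_size <= 0:
        return False
    return delta_size <= full_size * max_ratio
```

A mirror would run a check like this once per generated delta and simply
not publish the ones that fail it.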

Experience on Debian has shown that compiled binaries in general do
not delta-compress very well, so I think not being able to uncompress
them is not a terrible thing.

The point:

Gentoo should distribute the md5sums for both the compressed and
uncompressed forms of packages.  They are checked in that order;
either is sufficient.
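
A minimal sketch of that check order, assuming the two md5sums are
already known to the caller.  verify_distfile and the suffix table are
hypothetical names for illustration, not existing portage code:

```python
import bz2
import gzip
import hashlib

# Suffix -> opener that yields the uncompressed data (assumed set of formats).
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def md5_of_stream(stream, chunk_size=1 << 20):
    """MD5 a file-like object in chunks so large files are not held in RAM."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_distfile(path, compressed_md5, uncompressed_md5):
    """Return "compressed", "uncompressed", or None if neither sum matches."""
    path = str(path)
    # Cheap check first: hash the file exactly as it was downloaded.
    with open(path, "rb") as raw:
        if md5_of_stream(raw) == compressed_md5:
            return "compressed"
    # Fall back to hashing the decompressed data.
    for suffix, opener in OPENERS.items():
        if path.endswith(suffix):
            with opener(path, "rb") as data:
                if md5_of_stream(data) == uncompressed_md5:
                    return "uncompressed"
    return None
```

A file that was recompressed locally would fail the first check and pass
the second, which is the case the second md5sum exists for.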

Regular non-delta downloads will proceed as usual, and the md5sum can
be checked immediately after download.  There is no added cost.

Patch downloads can be done by 

 - download xdelta
 - uncompress old file, pipe it into 'xdelta patch', store the result
 - check result against uncompressed MD5sum
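
The three steps above might look roughly like this.  It is a sketch
under assumptions: it takes the xdelta 1.x command line to be
"xdelta patch <patchfile> <fromfile> <tofile>", it writes the old
tarball to a temporary file instead of piping it, and apply_patch is a
hypothetical helper name:

```python
import bz2
import gzip
import hashlib
import shutil
import subprocess
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def decompress_to(src, dst):
    """Write the uncompressed form of src to dst (plain copy if not .gz/.bz2)."""
    src = Path(src)
    opener = {".gz": gzip.open, ".bz2": bz2.open}.get(src.suffix, open)
    with opener(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)


def apply_patch(old_distfile, delta, new_tar, uncompressed_md5,
                patch_cmd=("xdelta", "patch")):
    """Patch the uncompressed old distfile and check the uncompressed md5sum."""
    with tempfile.TemporaryDirectory() as tmp:
        old_tar = Path(tmp) / "old.tar"
        decompress_to(old_distfile, old_tar)
        # Assumed argument order: patch file, old file, output file.
        subprocess.run([*patch_cmd, str(delta), str(old_tar), str(new_tar)],
                       check=True)
    if md5_of_file(new_tar) != uncompressed_md5:
        Path(new_tar).unlink()
        raise ValueError("patched file does not match the uncompressed MD5")
    return Path(new_tar)
```

Nothing is recompressed at any point; the result stays a plain .tar.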

As far as I can see this removes any need for a special deltup file
format.  Just simply send xdeltas.

A great advantage is that xdeltas are useful to people other than
Gentoo, so people upstream or mirrors may be more willing to
distribute them alongside the original source.  

Much as I love the idea of deltup, I think the current code is a bit
messy and making up a new format is unjustified.

> In terms of performance of the md5summing, it would still likely be i/o
> limited- decompression would be done in memory after all.

The approach above is much *more* efficient than deltup, which makes
an extra roundtrip to bz2 format.

What have I missed?

-- 
Martin

If you don't know how to code, then you don't know how to design the
software either. Period. You can only cause trouble.
                -- Havoc Pennington, http://ometer.com/hacking.html

--
gentoo-dev@gentoo.org mailing list



* Re: [gentoo-dev] Re: proposed md5sum change
From: bdharring @ 2003-06-23  4:00 UTC
  To: gentoo-dev, mbp

Responses/cheer-leading littered liberally below...

On Sunday, June 22, 2003, at 09:41 PM, Martin Pool wrote:

> On Wed, 11 Jun 2003 11:02:02 -0500, Brian Harring wrote:
>
>> Hola all,
>> Straight to the point, I propose instead of md5summing the compressed
>> distfile, we md5sum the actual data, the tarball.
>
> Speaking as somebody who has worked on rsync and librsync: I agree, I
> think that would be an big improvement.
Heh, small world.  I'd actually read the original complaint about it in
Tridgell's thesis while researching delta compression for my own little
prog...

> The uncompressed form is the natural and efficient place to do delta
> compression.
Agreed, although I would posit that decompressing a large bzip2 for
md5summing in memory makes it a substantially longer affair than if you
just md5'd the compressed tarball.  On my personal system, compressed =
3-5s, bzip2 decompressing piped to md5 = 1-2 minutes.  More below...
> Seemant Kulleen wrote:
>
>> Now, the promised concern bit.  Unfortunately, while the majority of 
>> the
>> packages do come in a compressed tarball format, there are many 
>> (enough to
>> make it a corner case of some concern) packages which do not.  Off 
>> the top
>> of my head, I can think of .Z (forget which package), .rpm
>> (redhat-artwork), .bin (realplayer).  And in some cases, we just get 
>> an
>> uncompressed README file in the SRC_URI (or the wacom.c file in xfree,
>> though I'm not certain of it right this moment).
>
> .Z files can be uncompressed and handled as for gzip (I think gzip
> handles them in fact.)
>
> .zip, .rpm, or self-extracting .exe files can also be uncompressed and
> diffd, at least in principle.
Summing it up, if we can pull it apart and get the uncompressed data, 
we md5 that data.  If we can't, well I've yet to see any diff prog 
(aside from xdelta's lackluster gzip support) that even does 
decompression of data, so it's a non-issue for the moment...
>
> Experience on Debian has shown that compiled binaries in general do
> not delta-compress very well, so I think not being able to uncompress
> them is not a terrible thing.
Horribly, actually.  The problem, of course, is that once you change
offset x, everything after x is different... 'tis the reason I was
looking at md5ing the data, since to get any decent delta compression 
you have to decompress... but you likely know that so I'll shut up now.
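
A quick way to see the effect on compressed data, as a sketch with
made-up input: change two bytes near the start of a file and count how
many byte positions differ before and after bzip2 compression.

```python
import bz2


def differing_positions(a: bytes, b: bytes) -> int:
    """Count positions where a and b differ, plus any length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


# Two versions of a fake source file that differ in a single word.
old = b"".join(b"line %d of some source file\n" % i for i in range(2000))
new = old.replace(b"line 10 of", b"line 10 in", 1)

old_bz2 = bz2.compress(old)
new_bz2 = bz2.compress(new)

print("uncompressed bytes that differ:", differing_positions(old, new))
print("compressed bytes that differ:  ", differing_positions(old_bz2, new_bz2),
      "of", len(old_bz2))
```

The exact counts depend on the input and the compressor, so none are
quoted here; the point is only to compare the two numbers.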
>
> The point:
>
> Gentoo should distribute the md5sums for both the compressed and
> uncompressed forms of packages.  They are checked in that order;
> either is sufficient.
That would solve the initial complaint I had mentioned about speed 
above.  I like it, and it's a general solution allowing the user more 
control over how their distfiles are stored (aside from making delta 
compression much easier to do).
>
> Regular non-delta downloads will proceed as usual, and the md5sum can
> be checked immediately after download.  There is no added cost.
>
> Patch downloads can be done by
>
>  - download xdelta
>  - uncompress old file, pipe it into 'xdelta patch', store the result
>  - check result against uncompressed MD5sum
>
> As far as I can see this removes any need for a special deltup file
> format.  Just simply send xdeltas.
I'd agree.  My understanding of why the deltup format exists, from what
I've gathered trolling the forums, is that jjw is attempting to build his
own differencing/encoding setup, which is a fair amount of work, speaking
from experience.  A side note on gentoo delta patching: (imo) it ought
to provide for standard diffs in some form, since any version patches
that are distributed currently are typically diffs (look at the kernel
for instance).
Either way, back to adult swim...
~brian



* Re: [gentoo-dev] Re: proposed md5sum change
From: Martin Pool @ 2003-06-23  4:43 UTC
  To: gentoo-dev

On 22 Jun 2003, bdharring <bdharring@wisc.edu> wrote:

> >The uncompressed form is the natural and efficient place to do delta
> >compression.

> Agreed, although I would posit that decompressing a large bzip2 for
> md5suming in memory makes it a substantially longer affair then if
> you just md5'd the compressed tarball.  On my personal system,
> compressed=>3-5s, bzip2 decompressing piped to md5 = 1-2 minutes.
> More below...

Yes, if the user is downloading a compressed form, then it makes sense
to calculate the hash of the compressed form when checking if
e.g. they got an interrupted or corrupt download.  

But aside from that, including the time to decompress as a cost of
checking the MD5 sum is a furphy.  It has to be decompressed at some
point whether to patch it or to build it.  You can check the MD5sum
then.
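
Checking "then" can be a single pass: hash the data as it comes out of
the decompressor on its way to whatever consumes it.  A sketch, assuming
a single-stream bzip2 input; decompress_and_hash is a hypothetical name:

```python
import bz2
import hashlib


def decompress_and_hash(compressed_stream, sink, chunk_size=1 << 16):
    """Decompress bzip2 data into sink and return the MD5 of what was written."""
    decomp = bz2.BZ2Decompressor()
    digest = hashlib.md5()
    for chunk in iter(lambda: compressed_stream.read(chunk_size), b""):
        if decomp.eof:
            break  # ignore anything after the end of the bzip2 stream
        data = decomp.decompress(chunk)
        digest.update(data)   # hash the uncompressed bytes...
        sink.write(data)      # ...while passing them on to the consumer
    return digest.hexdigest()
```

Here sink could be a file, or the stdin pipe of tar or of a patch
program, so the decompression is only paid for once.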

Note that xdelta patches in fact include the MD5 checksum of the
output file, so checking it is a bit redundant.

> >.zip, .rpm, or self-extracting .exe files can also be uncompressed and
> >diffd, at least in principle.
> Summing it up, if we can pull it apart and get the uncompressed data, 
> we md5 that data.  If we can't, well I've yet to see any diff prog 
> (aside from xdelta's lackluster gzip support) that even does 
> decompression of data, so it's a non-issue for the moment...

Yes, if we can decompress it then we do.  Otherwise we just do the
xdelta across the whole file.  In either case, if the delta is
ridiculously large, then we discard it.

> I'd agree.  My understanding for why the deltup format, from what I've 
> gathered trolling the forums, jjw's attempting to build his own 
> differencing/encoding setup which is a fair amount of work speaking 
> from experience.

I think the right thing is to use the VCDIFF format, which allows
standard expression of deltas regardless of the algorithm that
generates them.  I understand that xdelta is moving towards this and
librsync will too eventually.
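
To illustrate why the format can be independent of the generator:
VCDIFF (RFC 3284) describes a delta as a sequence of ADD, COPY and RUN
instructions, and any tool can emit them.  The sketch below applies such
a list using a made-up tuple representation; it is not the binary
encoding from the RFC and not xdelta's or librsync's code.

```python
def apply_instructions(source: bytes, instructions) -> bytes:
    """Rebuild the target from source plus (op, ...) tuples.

    ("ADD", data)        append literal bytes
    ("RUN", byte, count) append one byte value repeated count times
    ("COPY", addr, size) copy from source followed by the target so far
    """
    target = bytearray()
    for op in instructions:
        kind = op[0]
        if kind == "ADD":
            target += op[1]
        elif kind == "RUN":
            target += bytes([op[1]]) * op[2]
        elif kind == "COPY":
            addr, size = op[1], op[2]
            # Byte by byte, so a copy may overlap the data it is producing.
            for i in range(addr, addr + size):
                if i < len(source):
                    target.append(source[i])
                else:
                    target.append(target[i - len(source)])
        else:
            raise ValueError(f"unknown instruction: {kind!r}")
    return bytes(target)
```

How the instructions were found (rsync-style rolling checksums, suffix
sorting, anything else) does not matter to the code that applies them.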

> A side note for doing gentoo delta patching is that (imo) it ought
> to in some form provide for standard diff's since any version
> patches that are distributed currently are typically diff (look at
> the kernel for instance).

That would be OK, but I'm actually inclined to think that it would be
better to recode diffs into xdeltas.  xdeltas are often 5-10x smaller
than a compressed diff, because they don't include redundant context.
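
The context overhead is easy to see with Python's difflib.  This sketch
uses a made-up 100-line file with one changed line and compares a
unified diff with the usual three lines of context against one with
none:

```python
import difflib

# A fake 100-line file and a copy with exactly one line changed.
old = [f"line {i}\n" for i in range(100)]
new = list(old)
new[50] = "line fifty\n"


def diff_size(a, b, context):
    """Total size in characters of a unified diff with `context` context lines."""
    return sum(len(line)
               for line in difflib.unified_diff(a, b, "old", "new", n=context))


with_context = diff_size(old, new, 3)
without_context = diff_size(old, new, 0)
print("3 lines of context:", with_context, "characters")
print("no context:        ", without_context, "characters")
```

Even with no context a diff still carries the full text of every removed
line, which a binary delta can replace with a copy instruction.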

diffs are great for humans or for fuzzy merges.  As a
delta-compression mechanism they're pretty lame.

-- 
Martin 
