* [gentoo-dev] Having fun with compression
From: Patrick Lauer @ 2006-04-30 16:30 UTC
To: gentoo-dev
Hi all,
I had this random idea that many of our distfiles are .tar.gz while more
efficient compression methods exist. So I did some testing for fun:
We have ~15k .tar.gz in distfiles. ~6500 .tar.bz2, ~2000 others.
A short run over 477 distfiles spanning 833M gave me 586M of .tar.bz2 -
roughly 30% more efficient!
A comparison run with 7zip gave me 590M files, so bzip2 seems to be
quite good.
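A conversion of this kind could be sketched roughly as follows. This is not the script used for the numbers above; the function name recompress_to_bz2 and its output format are assumptions made for illustration, and it needs gzip and bzip2 installed.

```shell
#!/bin/sh
# Sketch: recompress one .tar.gz as .tar.bz2 and report both sizes in bytes.
# recompress_to_bz2 is a hypothetical helper, not something from the thread.
recompress_to_bz2() {
    src=$1                          # path to an existing .tar.gz
    dest="${src%.gz}.bz2"           # foo.tar.gz -> foo.tar.bz2
    # Decompress to stdout and recompress; the original file is left in place.
    gzip -dc "$src" | bzip2 -9 > "$dest" || return 1
    old=$(wc -c < "$src")
    new=$(wc -c < "$dest")
    # $((...)) strips any padding that wc may put around the number.
    echo "$src $((old)) $dest $((new))"
}
```

Called in a loop over /usr/portage/distfiles/*.tar.gz, it prints one line per file with the old and new sizes, which can then be summed. Note that it writes the .tar.bz2 next to the original.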
I don't think repackaging every .tar.gz as .tar.bz2 is a reasonable
option (breaks MD5 digests, we lose the fallback download from the
homepage), but maybe this motivates people to save bandwidth and migrate
their packaging to bzip2.
Happy hacking,
Patrick
--
Stand still, and let the rest of the universe move
* Re: [gentoo-dev] Having fun with compression
From: Robin H. Johnson @ 2006-04-30 17:30 UTC
To: gentoo-dev
On Sun, Apr 30, 2006 at 06:30:23PM +0200, Patrick Lauer wrote:
> We have ~15k .tar.gz in distfiles. ~6500 .tar.bz2, ~2000 others.
> A short run over 477 distfiles spanning 833M gave me 586M of .tar.bz2 -
> roughly 30% more efficient!
> A comparison run with 7zip gave me 590M files, so bzip2 seems to be
> quite good.
Try rzip, esp. on the larger files, and see a serious improvement, with
the cost of one major penalty [*].
* rzip cannot handle streams, it seeks across the file multiple times
for what it does.
--
Robin Hugh Johnson
E-Mail : robbat2@gentoo.org
GnuPG FP : 11AC BA4F 4778 E3F6 E4ED F38E B27B 944E 3488 4E85
* Re: [gentoo-dev] Having fun with compression
From: Patrick Lauer @ 2006-04-30 19:03 UTC
To: gentoo-dev
On Sun, 2006-04-30 at 10:30 -0700, Robin H. Johnson wrote:
> On Sun, Apr 30, 2006 at 06:30:23PM +0200, Patrick Lauer wrote:
> > We have ~15k .tar.gz in distfiles. ~6500 .tar.bz2, ~2000 others.
> > A short run over 477 distfiles spanning 833M gave me 586M of .tar.bz2 -
> > roughly 30% more efficient!
> > A comparison run with 7zip gave me 590M files, so bzip2 seems to be
> > quite good.
> Try rzip, esp. on the larger files, and see a serious improvement, with
> the cost of one major penalty [*].
>
> * rzip cannot handle streams, it seeks across the file multiple times
> for what it does.
>
642M on my data set - not bad, but bzip2 seems to be better on the files
I have tested.
Patrick
--
Stand still, and let the rest of the universe move
* Re: [gentoo-dev] Having fun with compression
From: Jon Hood @ 2006-05-01 3:36 UTC
To: gentoo-dev
Hey Patrick,
I agree, tar.bz2 is the way to go when possible, but I have many
friends on old bsd-based systems and some old linux boxes I must
maintain that don't have bzip2 support. Normally if I know a package I
write is going to need to go on an older system, I'll package it in both
formats, but there are times when bz2 is just not an option.
That having been said, it IS an option in 95%+ of the cases I deal
with, and for being on a cable modem, bzip2 has saved quite a bit of
time (and money) in the past.
-Jon
Patrick Lauer wrote:
> Hi all,
>
> I had this random idea that many of our distfiles are .tar.gz while more
> efficient compression methods exist. So I did some testing for fun:
>
> We have ~15k .tar.gz in distfiles. ~6500 .tar.bz2, ~2000 others.
> A short run over 477 distfiles spanning 833M gave me 586M of .tar.bz2 -
> roughly 30% more efficient!
> A comparison run with 7zip gave me 590M files, so bzip2 seems to be
> quite good.
>
> I don't think repackaging every .tar.gz as .tar.bz2 is a reasonable
> option (breaks MD5 digests, we lose the fallback download from the
> homepage), but maybe this motivates people to save bandwidth and migrate
> their packaging to bzip2.
>
> Happy hacking,
>
> Patrick
>
>
--
gentoo-dev@gentoo.org mailing list
* Re: [gentoo-dev] Having fun with compression
From: Patrick Lauer @ 2006-05-01 8:56 UTC
To: gentoo-dev
On Sun, 2006-04-30 at 22:36 -0500, Jon Hood wrote:
> Hey Patrick,
> I agree, tar.bz2 is the way to go when possible, but I have many
> friends on old bsd-based systems and some old linux boxes I must
> maintain that don't have bzip2 support. Normally if I know a package I
> write is going to need to go on an older system, I'll package it in both
> formats, but there are times when bz2 is just not an option.
Is that a problem in the sense "it doesn't run at all" or is it "they'd
need to install extra dependencies" ?
> That having been said, it IS an option in 95%+ of the cases I deal
> with, and for being on a cable modem, bzip2 has saved quite a bit of
> time (and money) in the past.
I just did a conversion run over all of distfiles just for fun (~10h on
an AMD64)
Input: 15634581 kB
Output: 13462050 kB
Difference: ~14%
The gain is smaller than in my earlier run over ~830M, but I think
users would appreciate a 10-30% reduction in their download size.
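Both percentages in this thread can be checked with a small helper. This is only a sketch; percent_saved is a hypothetical name, and it relies on awk for the floating-point arithmetic.

```shell
#!/bin/sh
# percent_saved BEFORE AFTER: print the size reduction in percent, one decimal.
# The helper name is an assumption made for this example.
percent_saved() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f\n", (before - after) * 100 / before }'
}

percent_saved 833 586             # sample of 477 distfiles, sizes in MB
percent_saved 15634581 13462050   # full distfiles run, sizes in kB
```

This prints 29.7 for the 477-file sample and 13.9 for the full run, matching the "roughly 30%" and "~14%" figures.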
Patrick
--
Stand still, and let the rest of the universe move
* Re: [gentoo-dev] Having fun with compression
From: Chris Bainbridge @ 2006-05-01 12:11 UTC
To: gentoo-dev
On 30/04/06, Patrick Lauer <patrick@gentoo.org> wrote:
> Hi all,
>
> I had this random idea that many of our distfiles are .tar.gz while more
> efficient compression methods exist. So I did some testing for fun:
If you already have an old copy of the distfile it's much more
bandwidth efficient to transfer deltas. Many Gentoo users rarely clean
out /usr/portage/distfiles so it could be quite a bandwidth saving to
use something like zsync http://zsync.moria.org.uk/ .
I did some tests a long time ago and found that a version bump of a
package like kdegraphics produced a 300k uncompressed diff, which was
25x more bandwidth efficient to transfer with rsync than to download
the full bz2 file. I haven't played with zsync yet, but the technical
paper suggests it is close to 'rsync -z' in terms of bandwidth
efficiency, and it removes some of the drawbacks of rsync, such as
high server load and the requirement to run a special daemon.
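A rough way to repeat this kind of measurement is to unpack two versions and measure the size of a unified diff between the trees. The sketch below assumes a diff that supports -r, -u and -N; delta_bytes is a hypothetical helper name. It only estimates how much changed between versions and does not reproduce what rsync or zsync actually send over the wire.

```shell
#!/bin/sh
# delta_bytes OLD_DIR NEW_DIR: size in bytes of a recursive unified diff
# between two unpacked source trees. The helper name is an assumption.
delta_bytes() {
    # diff exits with status 1 when the trees differ; that is expected here,
    # and the pipeline's status comes from wc.
    n=$(diff -ruN "$1" "$2" | wc -c)
    # $((...)) strips any padding that wc may put around the number.
    echo "$((n))"
}
```

Comparing the result with the size of the new .tar.bz2 gives a first idea of whether delta transfer would pay off for a given version bump.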
* Re: [gentoo-dev] Having fun with compression
From: Francesco Riosa @ 2006-05-02 15:33 UTC
To: gentoo-dev
Robin H. Johnson wrote:
> On Sun, Apr 30, 2006 at 06:30:23PM +0200, Patrick Lauer wrote:
>> We have ~15k .tar.gz in distfiles. ~6500 .tar.bz2, ~2000 others.
>> A short run over 477 distfiles spanning 833M gave me 586M of .tar.bz2 -
>> roughly 30% more efficient!
>> A comparison run with 7zip gave me 590M files, so bzip2 seems to be
>> quite good.
> Try rzip, esp. on the larger files, and see a serious improvement, with
> the cost of one major penalty [*].
>
> * rzip cannot handle streams, it seeks across the file multiple times
> for what it does.
>
/me for fun too; the values are consistent between runs.
For each of:
export CMD='rzip -9 gcc-4.2-20060429.tar'
export CMD='rzip -d gcc-4.2-20060429.tar.rz'
export CMD='bzip2 -d gcc-4.2-20060429.tar.bz2'
export CMD='bzip2 -9 gcc-4.2-20060429.tar'
export CMD='gzip gcc-4.2-20060429.tar'
export CMD='gunzip gcc-4.2-20060429.tar.gz'
$CMD &>/dev/null & \
J=$(jobs -l 1 | cut -c 6- ) ; J=${J% Running*} \
; while [[ -d /proc/${J} ]] ; do sleep 0.05 ; echo -n "$J " ; grep VmPeak /proc/${J}/status ; done
file: gcc-4.2-20060429.tar size 268160 kB
Compression:
rzip -9
VmPeak 345368 kB
Size 34437.87 kB
bzip2 -9
VmPeak 9224 kB
Size 38024.95 kB
gzip
VmPeak 1940 kB
Size 50368.28 kB
De-compression:
bzip2 -d
VmPeak 5448 kB
rzip -d
VmPeak 7892 kB
gunzip
VmPeak 1940 kB
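The same sampling idea can be wrapped in a function that takes the PID from $! instead of parsing the output of jobs, and prints only the last value seen. This is a sketch under assumptions: peak_vm is a hypothetical name, it needs a Linux /proc that exposes VmPeak, and a command that exits before the first sample yields "unknown".

```shell
#!/bin/sh
# peak_vm CMD [ARGS...]: run a command in the background and print the last
# VmPeak value (kB) read from /proc/PID/status. Linux only; the name is
# an assumption made for this example.
peak_vm() {
    "$@" >/dev/null 2>&1 &
    pid=$!                          # PID of the background job
    peak=
    while kill -0 "$pid" 2>/dev/null; do
        # VmPeak is a high-water mark, so the last successful read is
        # the largest value seen so far.
        v=$(awk '/^VmPeak:/ { print $2 }' "/proc/$pid/status" 2>/dev/null)
        [ -n "$v" ] && peak=$v
        sleep 0.05
    done
    wait "$pid" 2>/dev/null || :    # reap the job; its exit status is ignored
    echo "${peak:-unknown}"
}
```

For example, `peak_vm bzip2 -9 gcc-4.2-20060429.tar` would print a single figure comparable to the VmPeak lines above.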
* Re: [gentoo-dev] Having fun with compression
From: Ryan Phillips @ 2006-05-02 15:50 UTC
To: Patrick Lauer; +Cc: gentoo-dev
Patrick Lauer <patrick@gentoo.org> said:
> Hi all,
>
> I had this random idea that many of our distfiles are .tar.gz while more
> efficient compression methods exist. So I did some testing for fun:
>
> We have ~15k .tar.gz in distfiles. ~6500 .tar.bz2, ~2000 others.
> A short run over 477 distfiles spanning 833M gave me 586M of .tar.bz2 -
> roughly 30% more efficient!
> A comparison run with 7zip gave me 590M files, so bzip2 seems to be
> quite good.
>
> I don't think repackaging every .tar.gz as .tar.bz2 is a reasonable
> option (breaks MD5 digests, we lose the fallback download from the
> homepage), but maybe this motivates people to save bandwidth and migrate
> their packaging to bzip2.
Patrick,
did you benchmark CPU load? bzip2 often takes 3x as long as gzip to
uncompress a package. Often, the space savings don't justify the
extra CPU time needed to decompress the archive.
-ryan
* Re: [gentoo-dev] Having fun with compression
From: Patrick Lauer @ 2006-05-02 16:27 UTC
To: Ryan Phillips; +Cc: gentoo-dev
On Tue, 2006-05-02 at 08:50 -0700, Ryan Phillips wrote:
> Patrick,
>
> did you benchmark CPU load? bzip2 often takes 3x as long as gzip to
> uncompress a package. Often, the space savings don't justify the
> extra CPU time needed to decompress the archive.
I did not compare CPU load. Maybe I should do that :-)
But the average user will take longer to download than to uncompress, so
my rationale for this experiment was space reduction "at all costs". The
15% average gain should outweigh CPU issues.
Also - on my test system I managed to _compress_ at 1,5MB/s. If anyone
could provide some performance numbers for slower systems it'd be
easier to evaluate the tradeoff.
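The tradeoff can be put into numbers with a back-of-the-envelope helper: time saved on the download minus the extra time spent decompressing. This is a sketch; net_gain_seconds is a hypothetical name and the example inputs are made up, not measurements from this thread.

```shell
#!/bin/sh
# net_gain_seconds SAVED_KB LINK_KBPS EXTRA_CPU_S
# Seconds gained overall: download time saved minus extra decompression time.
# A negative result means the smaller archive costs more time than it saves.
net_gain_seconds() {
    awk -v saved="$1" -v rate="$2" -v cpu="$3" \
        'BEGIN { printf "%.1f\n", saved / rate - cpu }'
}

# Hypothetical inputs: 1500 kB saved, 50 kB/s link, 4 s extra decompression.
net_gain_seconds 1500 50 4
```

With these assumed inputs the result is 26.0 seconds in favour of the smaller archive. On a fast link and a slow CPU the sign flips, which is why numbers from slower systems would help.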
Patrick
--
Stand still, and let the rest of the universe move
* Re: [gentoo-dev] Having fun with compression
From: Chris Gianelloni @ 2006-05-02 17:36 UTC
To: gentoo-dev
On Tue, 2006-05-02 at 18:27 +0200, Patrick Lauer wrote:
> On Tue, 2006-05-02 at 08:50 -0700, Ryan Phillips wrote:
>
> > Patrick,
> >
> > did you benchmark CPU load? bzip2 often takes 3x as long as gzip to
> > uncompress a package. Often, the space savings don't justify the
> > extra CPU time needed to decompress the archive.
>
> I did not compare CPU load. Maybe I should do that :-)
>
> But the average user will take longer to download than to uncompress, so
> my rationale for this experiment was space reduction "at all costs". The
> 15% average gain should outweigh CPU issues.
>
> Also - on my test system I managed to _compress_ at 1,5MB/s. If anyone
> could provide some performance numbers for slower systems it'd be
> easier to evaluate the tradeoff.
I'm sure somebody *cough* vapier *cough* could find you a slow enough
machine to compare against. Perhaps something in the double-digits of
MHz. *grin*
--
Chris Gianelloni
Release Engineering - Strategic Lead
x86 Architecture Team
Games - Developer
Gentoo Linux
* Re: [gentoo-dev] Having fun with compression
From: Jan Kundrát @ 2006-05-02 17:59 UTC
To: gentoo-dev
Ryan Phillips wrote:
> did you benchmark CPU load? bzip2 often takes 3x as long as gzip to
> uncompress a package. Often, the space savings don't justify the
> extra CPU time needed to decompress the archive.
How long does it take in time units defined as "the time required to
compile the package in question"?
Cheers,
-jkt
--
cd /local/pub && more beer > /dev/mouth