public inbox for gentoo-dev@lists.gentoo.org
* [gentoo-dev] RFC: using .xz for doc/man/info compression
@ 2014-05-11 17:46 Michał Górny
  2014-05-11 19:37 ` Alexander Tsoy
                   ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: Michał Górny @ 2014-05-11 17:46 UTC
  To: gentoo-dev

Hello, developers.

I'd like to raise the following item for discussion: making .xz
the default compressor used by portage for documentation, man pages
and info files. That is, the equivalent of:

  PORTAGE_COMPRESS=xz

in make.globals.

Rationale: xz-utils is quite widespread nowadays and it is a part
of @system set. It can achieve better compression ratio than bzip2,
and faster decompression at the same time.

I have confirmed that both sys-apps/man and sys-apps/man-db can
handle .xz compressed man pages, and sys-apps/texinfo can handle .xz
compressed info pages. Major text editors and pagers support .xz
just like .bz2 (i.e. usually they support both or neither :)).

The additional question is: what preset to use? To help discuss
this, I'd like to quote the tables from 'man xz':

     Preset   DictSize   CompCPU   CompMem   DecMem
       -0     256 KiB       0        3 MiB    1 MiB
       -1       1 MiB       1        9 MiB    2 MiB
       -2       2 MiB       2       17 MiB    3 MiB
       -3       4 MiB       3       32 MiB    5 MiB
       -4       4 MiB       4       48 MiB    5 MiB
       -5       8 MiB       5       94 MiB    9 MiB
       -6       8 MiB       6       94 MiB    9 MiB
       -7      16 MiB       6      186 MiB   17 MiB
       -8      32 MiB       6      370 MiB   33 MiB
       -9      64 MiB       6      674 MiB   65 MiB 

     Preset   DictSize   CompCPU   CompMem   DecMem
      -0e     256 KiB       8        4 MiB    1 MiB
      -1e       1 MiB       8       13 MiB    2 MiB
      -2e       2 MiB       8       25 MiB    3 MiB
      -3e       4 MiB       7       48 MiB    5 MiB
      -4e       4 MiB       8       48 MiB    5 MiB
      -5e       8 MiB       7       94 MiB    9 MiB
      -6e       8 MiB       8       94 MiB    9 MiB
      -7e      16 MiB       8      186 MiB   17 MiB
      -8e      32 MiB       8      370 MiB   33 MiB
      -9e      64 MiB       8      674 MiB   65 MiB

I'd like to note here that increasing the dictionary size beyond the
file size does not improve compression. However, the options involved
in CompCPU may.

Depending on the expected amount of complexity, I'd either go for:

1) -6e (or -6, the default) -- max CompCPU, reasonable use of memory,
and dictionary larger than most (or all?) documents that are going to
be compressed,

2) -Ne with minimal 'N' for CompCPU==8 and DictSize > filesize -- still
max compression ratio while keeping lowest memory requirements possible.
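Option 2 can be read as a small selection rule over the tables quoted above. A minimal sketch in Python, assuming the DictSize and CompCPU values are exactly those from the 'man xz' tables; `pick_preset` is a hypothetical helper name, not something portage or xz provides, and the fallback to -9e for files larger than every dictionary is an assumption of this sketch:

```python
# DictSize per preset in KiB, copied from the 'man xz' table quoted above.
DICT_KIB = {0: 256, 1: 1024, 2: 2048, 3: 4096, 4: 4096,
            5: 8192, 6: 8192, 7: 16384, 8: 32768, 9: 65536}

# CompCPU for the extreme (-Ne) presets, from the second table.
COMP_CPU_EXTREME = {0: 8, 1: 8, 2: 8, 3: 7, 4: 8,
                    5: 7, 6: 8, 7: 8, 8: 8, 9: 8}


def pick_preset(file_size_bytes):
    """Return the smallest -Ne preset with CompCPU == 8 and DictSize > file size."""
    for n in range(10):
        dict_bytes = DICT_KIB[n] * 1024
        if COMP_CPU_EXTREME[n] == 8 and dict_bytes > file_size_bytes:
            return f"-{n}e"
    # Assumption: if no dictionary is larger than the file, use the largest one.
    return "-9e"


if __name__ == "__main__":
    for size in (10_000, 300 * 1024, 3 * 1024 * 1024):
        print(size, pick_preset(size))
```

Under this rule most documentation files would land on -0e, since a 256 KiB dictionary already exceeds their size; -3e and -5e are skipped because the table lists CompCPU 7 for them.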

Your thoughts?

-- 
Best regards,
Michał Górny


* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-11 17:46 [gentoo-dev] RFC: using .xz for doc/man/info compression Michał Górny
@ 2014-05-11 19:37 ` Alexander Tsoy
  2014-05-11 21:27 ` Pacho Ramos
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 28+ messages in thread
From: Alexander Tsoy @ 2014-05-11 19:37 UTC
  To: gentoo-dev

On Sun, 11 May 2014 19:46:50 +0200
Michał Górny <mgorny@gentoo.org> wrote:

> Hello, developers.
> 
> I'd like to raise the following item for discussion: making .xz
> the default compressor used by portage for documentation, man pages
> and info files. That is, the equivalent of:
> 
>   PORTAGE_COMPRESS=xz
> 
> in make.globals.
> 
> Rationale: xz-utils is quite widespread nowadays and it is a part
> of @system set. It can achieve better compression ratio than bzip2,
> and faster decompression at the same time.

I tried it recently. Actually, for doc/man/info and any other relatively
small text files, xz has a worse compression ratio than bzip2. See also:

https://bugs.gentoo.org/show_bug.cgi?id=372653
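
A quick way to check this claim on any given text is to compress the same bytes with both codecs and compare the output sizes. This is a minimal sketch using Python's standard bz2 and lzma modules rather than the command-line tools; `compare_sizes` and the sample text are made up for illustration, and the preset (6 with the extreme flag) is chosen to mirror -6e:

```python
import bz2
import lzma


def compare_sizes(data):
    """Return the raw, bzip2 and xz sizes in bytes for the same input."""
    return {
        "raw": len(data),
        "bz2": len(bz2.compress(data, 9)),
        "xz": len(lzma.compress(data, preset=6 | lzma.PRESET_EXTREME)),
    }


if __name__ == "__main__":
    # Hypothetical man-page-like text; real pages should be used for real numbers.
    line = b".SH OPTIONS\n.TP\n\\-h\nShow a short help text and exit.\n"
    for repeats in (1, 10, 100, 1000):
        print(repeats, compare_sizes(line * repeats))
```

Running it over inputs of increasing size gives a rough picture of where, if anywhere, the two formats cross over for a given kind of text.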


-- 
Alexander Tsoy



* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-11 17:46 [gentoo-dev] RFC: using .xz for doc/man/info compression Michał Górny
  2014-05-11 19:37 ` Alexander Tsoy
@ 2014-05-11 21:27 ` Pacho Ramos
  2014-05-11 23:26   ` Gordon Pettey
  2014-05-12  9:31   ` Marcin Mirosław
  2014-05-12  3:24 ` Samuli Suominen
  2014-05-12  9:35 ` Tom Wijsman
  3 siblings, 2 replies; 28+ messages in thread
From: Pacho Ramos @ 2014-05-11 21:27 UTC
  To: gentoo-dev

On Sun, 11-05-2014 at 19:46 +0200, Michał Górny wrote:
> Hello, developers.
> 
> I'd like to raise the following item for discussion: making .xz
> the default compressor used by portage for documentation, man pages
> and info files. That is, the equivalent of:
> 
>   PORTAGE_COMPRESS=xz
> 
> in make.globals.
> 
> Rationale: xz-utils is quite widespread nowadays and it is a part
> of @system set. It can achieve better compression ratio than bzip2,
> and faster decompression at the same time.

Per:
https://bugs.gentoo.org/show_bug.cgi?id=372653

Looks like bzip2 was still better for small files :/




* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-11 21:27 ` Pacho Ramos
@ 2014-05-11 23:26   ` Gordon Pettey
  2014-05-12 10:47     ` Alexander Tsoy
  2014-05-12  9:31   ` Marcin Mirosław
  1 sibling, 1 reply; 28+ messages in thread
From: Gordon Pettey @ 2014-05-11 23:26 UTC
  To: gentoo-dev

A lot of small files (e.g. AUTHORS, ChangeLog

FWIW: On my system, I have 59M of bz2 files in /usr/share/man and
/usr/share/doc. A short script to decompress those and recompress with xz
-6e reduced that to 36M. I don't have a comparison for individual file
differences.

I posted the short bash scripts at
https://gist.github.com/petteyg/96c71fa3c4680552f5c4



On Sun, May 11, 2014 at 4:27 PM, Pacho Ramos <pacho@gentoo.org> wrote:

> Per:
> https://bugs.gentoo.org/show_bug.cgi?id=372653
>
> Looks like bzip2 was still better for small files :/
>
>
>


* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-11 17:46 [gentoo-dev] RFC: using .xz for doc/man/info compression Michał Górny
  2014-05-11 19:37 ` Alexander Tsoy
  2014-05-11 21:27 ` Pacho Ramos
@ 2014-05-12  3:24 ` Samuli Suominen
  2014-05-12  9:35 ` Tom Wijsman
  3 siblings, 0 replies; 28+ messages in thread
From: Samuli Suominen @ 2014-05-12  3:24 UTC
  To: gentoo-dev


On 11/05/14 20:46, Michał Górny wrote:
> Hello, developers.
>
> I'd like to raise the following item for discussion: making .xz
> the default compressor used by portage for documentation, man pages
> and info files. That is, the equivalent of:
>
>   PORTAGE_COMPRESS=xz
>
> in make.globals.
>
>

I like it, I've been using it myself from make.conf with the current
install on this machine.

But no, I don't have size or speed comparison to give :/

- Samuli



* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-11 21:27 ` Pacho Ramos
  2014-05-11 23:26   ` Gordon Pettey
@ 2014-05-12  9:31   ` Marcin Mirosław
  2014-05-12  9:45     ` Tom Wijsman
  1 sibling, 1 reply; 28+ messages in thread
From: Marcin Mirosław @ 2014-05-12  9:31 UTC
  To: gentoo-dev

On 11.05.2014 23:27, Pacho Ramos wrote:
> Per:
> https://bugs.gentoo.org/show_bug.cgi?id=372653
> 
> Looks like bzip2 was still better for small files :/

Hi!
I ran a test on a medium-sized man page (bash):
$ man -a -w bash
/usr/share/man/man1/bash.1.bz2
$ stat --printf=%s\\n  /usr/share/man/man1/bash.1.bz2
62606
$ time man -c -P /bin/cat bash >/dev/null

real    0m0.248s
user    0m0.316s
sys     0m0.012s
$ time man -c -P /bin/cat bash >/dev/null

real    0m0.252s
user    0m0.324s
sys     0m0.016s
$ time man -c -P /bin/cat bash >/dev/null

real    0m0.249s
user    0m0.320s
sys     0m0.012s

Next, I recompressed it using xz -6:
$ stat --printf=%s\\n  /usr/share/man/man1/bash.1.xz
66628
$ time man -c -P /bin/cat bash >/dev/null

real    0m0.234s
user    0m0.304s
sys     0m0.004s
$ time man -c -P /bin/cat bash >/dev/null

real    0m0.244s
user    0m0.288s
sys     0m0.024s
$ time man -c -P /bin/cat bash >/dev/null

real    0m0.239s
user    0m0.308s
sys     0m0.012s

And with file compressed using '-6e':
$ stat --printf=%s\\n  /usr/share/man/man1/bash.1.xz
66700
$ time man -c -P /bin/cat bash >/dev/null

real    0m0.233s
user    0m0.292s
sys     0m0.016s
$ time man -c -P /bin/cat bash >/dev/null

real    0m0.234s
user    0m0.300s
sys     0m0.008s

IMHO there is no real advantage in changing the current compressor for man files.
Regards



* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-11 17:46 [gentoo-dev] RFC: using .xz for doc/man/info compression Michał Górny
                   ` (2 preceding siblings ...)
  2014-05-12  3:24 ` Samuli Suominen
@ 2014-05-12  9:35 ` Tom Wijsman
  2014-05-13  2:08   ` Andrew Savchenko
                     ` (2 more replies)
  3 siblings, 3 replies; 28+ messages in thread
From: Tom Wijsman @ 2014-05-12  9:35 UTC
  To: gentoo-dev; +Cc: mgorny

On Sun, 11 May 2014 19:46:50 +0200
Michał Górny <mgorny@gentoo.org> wrote:

> Rationale: xz-utils is quite widespread nowadays and it is a part
> of @system set. It can achieve better compression ratio than bzip2,
> and faster decompression at the same time.

Some thoughts:

What about putting multiple doc / man / info files in a single .xz file
for each package? Would that further improve the situation?

(As they can share dictionary, instead of having multiple dictionaries)

Some algorithms tend to work better for smaller files, whereas others
work better for larger files; might this be the case for bzip2 vs. xz?

-- 
With kind regards,

Tom Wijsman (TomWij)
Gentoo Developer

E-mail address  : TomWij@gentoo.org
GPG Public Key  : 6D34E57D
GPG Fingerprint : C165 AF18 AB4C 400B C3D2  ABF0 95B2 1FCD 6D34 E57D


* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-12  9:31   ` Marcin Mirosław
@ 2014-05-12  9:45     ` Tom Wijsman
  0 siblings, 0 replies; 28+ messages in thread
From: Tom Wijsman @ 2014-05-12  9:45 UTC
  To: gentoo-dev; +Cc: marcin

On Mon, 12 May 2014 11:31:45 +0200
Marcin Mirosław <marcin@mejor.pl> wrote:

> Imho there is no real advantages to change current compressor for man
> files.

A single-file experiment is insufficient to support such a claim;
you may very well have found a file that works equally well with multiple
compression algorithms. It would be nice to script this over all
man / ... files and draw up a table for further comparison.
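
Such a script could look roughly like the following. This is only a sketch: it walks a directory for regular *.bz2 files, decompresses each in memory, recompresses with the equivalent of xz -6e through Python's lzma module, and collects one row per file. The function names `compare_tree` and `summarize` are hypothetical, and in-memory library calls will not reproduce the timing of the command-line tools:

```python
import bz2
import lzma
from pathlib import Path


def compare_tree(root):
    """Return (path, raw size, bz2 size, xz size) for every regular *.bz2 file."""
    rows = []
    for path in sorted(Path(root).rglob("*.bz2")):
        # Skip symlinks and anything that is not a regular file.
        if path.is_symlink() or not path.is_file():
            continue
        bz2_bytes = path.read_bytes()
        raw = bz2.decompress(bz2_bytes)
        xz_size = len(lzma.compress(raw, preset=6 | lzma.PRESET_EXTREME))
        rows.append((str(path), len(raw), len(bz2_bytes), xz_size))
    return rows


def summarize(rows):
    """Aggregate totals and count the files where xz came out larger."""
    return {
        "files": len(rows),
        "bz2_total": sum(r[2] for r in rows),
        "xz_total": sum(r[3] for r in rows),
        "xz_larger": sum(1 for r in rows if r[3] > r[2]),
    }


if __name__ == "__main__":
    table = compare_tree("/usr/share/man")
    print(summarize(table))
```

The per-file rows make it possible to bucket results by uncompressed size, which is what the small-file versus large-file question really needs.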

-- 
With kind regards,

Tom Wijsman (TomWij)
Gentoo Developer

E-mail address  : TomWij@gentoo.org
GPG Public Key  : 6D34E57D
GPG Fingerprint : C165 AF18 AB4C 400B C3D2  ABF0 95B2 1FCD 6D34 E57D


* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-11 23:26   ` Gordon Pettey
@ 2014-05-12 10:47     ` Alexander Tsoy
  2014-05-12 10:55       ` Alexander Tsoy
                         ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: Alexander Tsoy @ 2014-05-12 10:47 UTC
  To: gentoo-dev

On Sun, 11 May 2014 18:26:32 -0500
Gordon Pettey <petteyg359@gmail.com> wrote:

> A lot of small files (e.g. AUTHORS, ChangeLog
> 
> FWIW: On my system, I have 59M of bz2 files in /usr/share/man and
> /usr/share/doc. A short script to decompress those and recompress with xz
> -6e reduced that to 36M.

Very strange o_O 

Here are my test results. xz options: "--lzma2=preset=6e,dict=4MiB".
A larger dictionary size does not improve the compression ratio; I get
even worse results with just "-6e" or "-9e". man-bz2 is a full copy of
my /usr/share/man, man-xz is a recompressed one.

Size comparison:

$ du -s man-bz2/ man-xz/
82032	man-bz2/
82308	man-xz/


Decompression speed:

$ time find man-bz2/ -type f -name "*.bz2" -exec bzcat '{}' > /dev/null \;

real	0m35.110s
user	0m14.509s
sys	0m15.227s
$ time find man-bz2/ -type f -name "*.bz2" -exec bzcat '{}' > /dev/null \;

real	0m35.407s
user	0m14.432s
sys	0m15.186s
$ time find man-xz/ -type f -name "*.xz" -exec xzcat '{}' > /dev/null \;

real	0m46.571s
user	0m17.077s
sys	0m23.906s
$ time find man-xz/ -type f -name "*.xz" -exec xzcat '{}' > /dev/null \;

real	0m46.137s
user	0m17.276s
sys	0m23.426s


As you can see, xz is actually worse in speed and compression ratio.

-- 
Alexander Tsoy



* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-12 10:47     ` Alexander Tsoy
@ 2014-05-12 10:55       ` Alexander Tsoy
  2014-05-12 12:17       ` Tom Wijsman
                         ` (2 subsequent siblings)
  3 siblings, 0 replies; 28+ messages in thread
From: Alexander Tsoy @ 2014-05-12 10:55 UTC
  To: gentoo-dev

On Mon, 12 May 2014 14:47:36 +0400
Alexander Tsoy <alexander@tsoy.me> wrote:

> On Sun, 11 May 2014 18:26:32 -0500
> Gordon Pettey <petteyg359@gmail.com> wrote:
> 
> > A lot of small files (e.g. AUTHORS, ChangeLog
> > 
> > FWIW: On my system, I have 59M of bz2 files in /usr/share/man and
> > /usr/share/doc. A short script to decompress those and recompress with xz
> > -6e reduced that to 36M.
> 
> Very strange o_O 
> 
> Here is my test results. xz options: "--lzma2=preset=6e,dict=4MiB".
> Larger dictionary size does not improve compression ratio, I get
> even worse results with just "-6e" or "-9e". man-bz2 is a full copy of
> my /usr/share/man, man-xz is a recompressed one.
> 
> Size comparison:
> 
> $ du -s man-bz2/ man-xz/
> 82032	man-bz2/
> 82308	man-xz/

Note that a lot of files in these directories are non-compressed text files
or symlinks:

$ find man-bz2/ \( ! -name "*.bz2" -o -type l \) -a ! -type d | wc -l
8434
$ find man-bz2/ -name "*.bz2" -type f | wc -l
11243
$ find man-xz/ \( ! -name "*.xz" -o -type l \) -a ! -type d | wc -l
8434
$ find man-xz/ -name "*.xz" -type f | wc -l
11243

After removing them and adding the -b option:

$ du -bs man-bz2/ man-xz/
32158286	man-bz2/
32550305	man-xz/

-- 
Alexander Tsoy



* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-12 10:47     ` Alexander Tsoy
  2014-05-12 10:55       ` Alexander Tsoy
@ 2014-05-12 12:17       ` Tom Wijsman
  2014-05-12 12:40         ` Alexander Tsoy
  2014-05-12 22:55       ` Gordon Pettey
  2014-05-13  5:01       ` Andrew Savchenko
  3 siblings, 1 reply; 28+ messages in thread
From: Tom Wijsman @ 2014-05-12 12:17 UTC
  To: gentoo-dev

On Mon, 12 May 2014 14:47:36 +0400
Alexander Tsoy <alexander@tsoy.me> wrote:

> Here is my test results. xz options: "--lzma2=preset=6e,dict=4MiB".
> Larger dictionary size does not improve compression ratio, I get
> even worse results with just "-6e" or "-9e". man-bz2 is a full copy of
> my /usr/share/man, man-xz is a recompressed one.

Picking a random post to reply to; if you don't already, please consider
doing these tests in tmpfs to cancel out any fs / storage differences.

-- 
With kind regards,

Tom Wijsman (TomWij)
Gentoo Developer

E-mail address  : TomWij@gentoo.org
GPG Public Key  : 6D34E57D
GPG Fingerprint : C165 AF18 AB4C 400B C3D2  ABF0 95B2 1FCD 6D34 E57D


* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-12 12:17       ` Tom Wijsman
@ 2014-05-12 12:40         ` Alexander Tsoy
  0 siblings, 0 replies; 28+ messages in thread
From: Alexander Tsoy @ 2014-05-12 12:40 UTC
  To: gentoo-dev

On Mon, 12 May 2014 14:17:11 +0200
Tom Wijsman <TomWij@gentoo.org> wrote:

> On Mon, 12 May 2014 14:47:36 +0400
> Alexander Tsoy <alexander@tsoy.me> wrote:
> 
> > Here is my test results. xz options: "--lzma2=preset=6e,dict=4MiB".
> > Larger dictionary size does not improve compression ratio, I get
> > even worse results with just "-6e" or "-9e". man-bz2 is a full copy of
> > my /usr/share/man, man-xz is a recompressed one.
> 
> Picking a random post to reply; if you don't already, please consider
> to do these tests in tmpfs to cancel out any fs / storage differences.
> 

The same test in tmpfs.

$ time find man-bz2/ -type f -name "*.bz2" -exec bzcat '{}' > /dev/null \;

real	0m35.895s
user	0m14.232s
sys	0m14.121s
$ time find man-xz/ -type f -name "*.xz" -exec xzcat '{}' > /dev/null \;

real	0m44.342s
user	0m16.842s
sys	0m21.459s


And here is an additional test. It shows where the bottleneck actually is:
xz is faster at decompression, but it appears to have slower process
startup, so it is slower at decompressing a single small file.

$ time find man-bz2/ -type f -name "*.bz2" -exec bzcat '{}' > /dev/null \+

real	0m10.096s
user	0m9.000s
sys	0m0.787s
$ time find man-xz/ -type f -name "*.xz" -exec xzcat '{}' > /dev/null \+

real	0m7.846s
user	0m7.108s
sys	0m0.487s
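
The two sets of runs differ only in how find invokes the command: with `-exec ... \;` it starts one process per file, while with `-exec ... +` it passes as many file names as fit to each invocation. The effect of that difference can be sketched in isolation. The example below is an illustration only, assuming a do-nothing child Python process as a stand-in for bzcat or xzcat; `run_per_item` and `run_batched` are hypothetical names:

```python
import subprocess
import sys
import time

# Stand-in for a small tool such as bzcat or xzcat: a child process that
# ignores its arguments and exits. (An assumption of this sketch; the thread
# used the real decompressors.)
CHILD = [sys.executable, "-c", "import sys"]


def run_per_item(items):
    """Start one child process per item, like find's -exec terminated by ';'."""
    start = time.perf_counter()
    for item in items:
        subprocess.run(CHILD + [item], check=True)
    return len(items), time.perf_counter() - start


def run_batched(items):
    """Start a single child for all items, like find's -exec terminated by '+'."""
    start = time.perf_counter()
    subprocess.run(CHILD + list(items), check=True)
    return 1, time.perf_counter() - start


if __name__ == "__main__":
    names = [f"file{i}" for i in range(50)]
    for label, fn in (("per item", run_per_item), ("batched", run_batched)):
        spawned, seconds = fn(names)
        print(f"{label}: {spawned} process(es), {seconds:.3f}s")
```

Since the child does no work here, whatever time the per-item variant takes is startup overhead, which is the cost that dominates when each file is small.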

-- 
Alexander Tsoy



* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-12 10:47     ` Alexander Tsoy
  2014-05-12 10:55       ` Alexander Tsoy
  2014-05-12 12:17       ` Tom Wijsman
@ 2014-05-12 22:55       ` Gordon Pettey
  2014-05-13  5:01       ` Andrew Savchenko
  3 siblings, 0 replies; 28+ messages in thread
From: Gordon Pettey @ 2014-05-12 22:55 UTC
  To: gentoo-dev

On Mon, May 12, 2014 at 5:47 AM, Alexander Tsoy <alexander@tsoy.me> wrote:

> On Sun, 11 May 2014 18:26:32 -0500
> Gordon Pettey <petteyg359@gmail.com> wrote:
>
> > A lot of small files (e.g. AUTHORS, ChangeLog
> >
> > FWIW: On my system, I have 59M of bz2 files in /usr/share/man and
> > /usr/share/doc. A short script to decompress those and recompress with xz
> > -6e reduced that to 36M.
>
> Very strange o_O
>
> Here is my test results. xz options: "--lzma2=preset=6e,dict=4MiB".
> Larger dictionary size does not improve compression ratio, I get
> even worse results with just "-6e" or "-9e". man-bz2 is a full copy of
> my /usr/share/man, man-xz is a recompressed one.
>
> Size comparison:
>
> $ du -s man-bz2/ man-xz/
> 82032   man-bz2/
> 82308   man-xz


Did you skip all the files that weren't bz2 in the first place, and
decompress bz2 before compressing with xz? My comparison script does not
include uncompressed files. It copies all the bz2 files to a new folder,
pipes those through bzip2 -d to xz -6e to files in another new folder, then
compares the total size of those folders. Out of 8576 compressed files,
only 464 were larger in xz than in bz2. A very rough timing test I just did
showed the total decompression time of all the xz files to be half that of
decompressing all the bz2 files. I'm working on getting that data per file,
along with averages.


* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-12  9:35 ` Tom Wijsman
@ 2014-05-13  2:08   ` Andrew Savchenko
  2014-05-13 16:33     ` Tom Wijsman
  2014-05-14  3:29   ` Kent Fredric
  2014-05-14 16:53   ` Roy Bamford
  2 siblings, 1 reply; 28+ messages in thread
From: Andrew Savchenko @ 2014-05-13  2:08 UTC
  To: gentoo-dev

On Mon, 12 May 2014 11:35:00 +0200 Tom Wijsman wrote:
> On Sun, 11 May 2014 19:46:50 +0200
> Michał Górny <mgorny@gentoo.org> wrote:
> 
> > Rationale: xz-utils is quite widespread nowadays and it is a
> > part of @system set. It can achieve better compression ratio
> > than bzip2, and faster decompression at the same time.
> 
> Some thoughts:
> 
> What about putting multiple doc / man / info files in a
> single .xz file for each package? Would that further improve the
> situation?
> 
> (As they can share dictionary, instead of having multiple
> dictionaries)

1. How are tools like man or info supposed to work with such a
bundle? They do not expect multiple man/info files in a
single xz bundle.

2. This will put stress on the decompression procedure: in order to
extract one file, the whole xz bundle will have to be decompressed.
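
The trade-off between the shared dictionary and point 2 can be made concrete with a small experiment. This is a sketch under assumptions: the pages are simply concatenated and compressed as one default xz stream via Python's lzma module, the sample pages are invented, and `separate_vs_solid` and `extract_one` are hypothetical helper names:

```python
import lzma


def separate_vs_solid(pages):
    """Return (sum of per-page xz sizes, size of one xz stream over all pages)."""
    separate = sum(len(lzma.compress(page)) for page in pages)
    solid = len(lzma.compress(b"".join(pages)))
    return separate, solid


def extract_one(solid_blob, offset, length):
    """Fetch one page from the bundle; the whole stream is decompressed first."""
    everything = lzma.decompress(solid_blob)
    return everything[offset:offset + length]


if __name__ == "__main__":
    # Invented, highly similar pages, as the man pages of one package often are.
    pages = [
        b".SH NAME\ntool%d \\- hypothetical tool\n.SH OPTIONS\n\\-h  show help\n" % i
        for i in range(20)
    ]
    print(separate_vs_solid(pages))
```

The bundle wins on size because similar pages share matches and pay the container overhead once, but `extract_one` shows the cost: reading a single page means decompressing everything stored with it, and man or info would need an extra layer to locate the page inside the bundle at all.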

> Some algorithms tend to work better for smaller files, whereas others 
> work better for larger files; might this be the case for bzip2 vs. xz?

It doesn't really matter, because small files will still require one
filesystem block for the majority of users. For people with reiserfs or
squashfs this may matter, of course.

Best regards,
Andrew Savchenko


* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-12 10:47     ` Alexander Tsoy
                         ` (2 preceding siblings ...)
  2014-05-12 22:55       ` Gordon Pettey
@ 2014-05-13  5:01       ` Andrew Savchenko
  2014-05-13  5:55         ` Ulrich Mueller
  3 siblings, 1 reply; 28+ messages in thread
From: Andrew Savchenko @ 2014-05-13  5:01 UTC
  To: gentoo-dev

Hello,

On Mon, 12 May 2014 14:47:36 +0400 Alexander Tsoy wrote:
> On Sun, 11 May 2014 18:26:32 -0500
> Gordon Pettey <petteyg359@gmail.com> wrote:
> 
> > A lot of small files (e.g. AUTHORS, ChangeLog
> > 
> > FWIW: On my system, I have 59M of bz2 files in /usr/share/man and
> > /usr/share/doc. A short script to decompress those and recompress with xz
> > -6e reduced that to 36M.
> 
> Very strange o_O 
> 
> Here are my test results. xz options: "--lzma2=preset=6e,dict=4MiB".
> Larger dictionary size does not improve compression ratio, I get
> even worse results with just "-6e" or "-9e". man-bz2 is a full copy of
> my /usr/share/man, man-xz is a recompressed one.
> 
> Size comparison:
> 
> $ du -s man-bz2/ man-xz/
> 82032	man-bz2/
> 82308	man-xz/

Please note that by default du reports allocated disk space, not the
byte size. That means that a file which is actually 1234 bytes long is
still counted as 4096 bytes on a 4K-block filesystem unless -b is given.
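
To make the distinction concrete, here is a minimal Python sketch. It is
an illustration only, not the method used for any measurement in this
thread; the helper name is made up, and it assumes Linux, where st_blocks
is counted in 512-byte units:

```python
import os
import tempfile


def apparent_and_allocated(path):
    """Return (apparent size, allocated size) of one file, both in bytes."""
    st = os.stat(path)
    # Assumption: st_blocks uses 512-byte units, as on Linux.
    return st.st_size, st.st_blocks * 512


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"x" * 1234)
        tmp.flush()
        apparent, allocated = apparent_and_allocated(tmp.name)
        print(f"apparent: {apparent} bytes, allocated: {allocated} bytes")
```

The first number corresponds to what du -b sums up, the second to what
plain du reports; the allocated value depends on the filesystem.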

Here are my results:

1. With bzip2 -9:
find -O3 /usr/share/man -type f -name "*.bz2" -print0 | du -bhc --files0-from -
63M
find -O3 /usr/share/man -type f -name "*.bz2" -print0 | du -hc --files0-from -
146M

find -O3 /usr/share/doc -type f -name "*.bz2" -print0 | du -bhc --files0-from -
151M    total
find -O3 /usr/share/doc -type f -name "*.bz2" -print0 | du -hc --files0-from -
249M    total

2. With xz -9e:
find -O3 /usr/share/man -type f -name "*.xz" -print0 | du -bhc --files0-from -
64M
find -O3 /usr/share/man -type f -name "*.xz" -print0 | du -hc --files0-from -
146M

find -O3 /usr/share/doc -type f -name "*.xz" -print0 | du -bhc --files0-from -
147M    total
find -O3 /usr/share/doc -type f -name "*.xz" -print0 | du -hc --files0-from -
245M    total

As one can see, on man pages xz is slightly worse in apparent file size
and makes no difference in real disk usage. On docs xz is better on both
measures.

As for decompression speed, xz is about twice as fast as bzip2 on large man
pages (bash, mplayer, cmake, zshall), though this speed gain still needs to be
measured directly with the bunzip2 and unxz applications. I'll publish
statistically meaningful results later; both scripting and testing require time.
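
As a rough idea of how such a timing could be scripted, here is a sketch
using Python's bz2 and lzma modules instead of the bunzip2 and unxz
binaries, so its numbers are not directly comparable to the command-line
tools. The function name and the sample text are made up, and the xz
preset defaults to 6e to keep memory use modest (pass
9 | lzma.PRESET_EXTREME for -9e):

```python
import bz2
import lzma
import timeit


def time_decompress(data, xz_preset=6 | lzma.PRESET_EXTREME, repeat=5):
    """Return the best-of-N decompression time in seconds per compressor."""
    blobs = {
        "bzip2": (bz2.decompress, bz2.compress(data, 9)),
        "xz": (lzma.decompress, lzma.compress(data, preset=xz_preset)),
    }
    results = {}
    for name, (decompress, blob) in blobs.items():
        # Sanity check: the round trip must be lossless.
        assert decompress(blob) == data
        timings = timeit.repeat(lambda: decompress(blob), number=1, repeat=repeat)
        results[name] = min(timings)
    return results


if __name__ == "__main__":
    # Synthetic stand-in for a large man page; real pages will behave differently.
    sample = b".SH OPTIONS\nSome repetitive man page text.\n" * 2000
    for name, seconds in time_decompress(sample).items():
        print(f"{name}: {seconds:.6f} s")
```

Taking the minimum over several runs reduces noise from other processes,
but this is still far from a statistically meaningful benchmark.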

Best regards,
Andrew Savchenko

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-13  5:01       ` Andrew Savchenko
@ 2014-05-13  5:55         ` Ulrich Mueller
  2014-05-13 11:01           ` Andrew Savchenko
  0 siblings, 1 reply; 28+ messages in thread
From: Ulrich Mueller @ 2014-05-13  5:55 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 478 bytes --]

>>>>> On Tue, 13 May 2014, Andrew Savchenko wrote:

> Please consider that by default du shows block size, not byte size.
> That means that if file is actually 1234 bytes large, without -b it
> will be still accounted for 4096 bytes on 4K-block filesystem.

This raises another question, namely if files with <= 4096 bytes size
should be compressed at all? Portage already has a fixed size limit of
128 bytes (see bug 169260), but maybe this could be made configurable.
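
A configurable limit could look roughly like the following sketch. This
is not Portage's actual code: the names are hypothetical, treating the
limit as "skip files smaller than N bytes" is an assumption, and the
extra check that keeps the original when compression does not help is
my own addition:

```python
import lzma

SIZE_LIMIT = 128  # bytes; the current fixed limit, used here as a default


def maybe_compress(data, size_limit=SIZE_LIMIT):
    """Return (payload, was_compressed), skipping inputs below size_limit."""
    if len(data) < size_limit:
        return data, False
    packed = lzma.compress(data)
    # Assumption: keep the original if compression does not shrink it.
    if len(packed) >= len(data):
        return data, False
    return packed, True
```

With size_limit=4096, a 2000-byte file would be left alone, while the
default of 128 would still compress it.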

Ulrich

[-- Attachment #2: Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-13  5:55         ` Ulrich Mueller
@ 2014-05-13 11:01           ` Andrew Savchenko
  2014-05-13 12:18             ` Rich Freeman
  2014-05-14 13:16             ` vivo75
  0 siblings, 2 replies; 28+ messages in thread
From: Andrew Savchenko @ 2014-05-13 11:01 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 1397 bytes --]

On Tue, 13 May 2014 07:55:56 +0200 Ulrich Mueller wrote:
> >>>>> On Tue, 13 May 2014, Andrew Savchenko wrote:
> 
> > Please consider that by default du shows block size, not byte size.
> > That means that if file is actually 1234 bytes large, without -b it
> > will be still accounted for 4096 bytes on 4K-block filesystem.
> 
> This raises another question, namely if files with <= 4096 bytes size
> should be compressed at all? Portage already has a fixed size limit of
> 128 bytes (see bug 169260), but maybe this could be made configurable.

No doubt this limit should be configurable, because defaults that
are fine for one setup may harm another.

If we are trying to consider all possible cases, some filesystems
may benefit even from compression of very small files (e.g. from
140 to 100 bytes) due to packing of multiple small files in the
same inode. ReiserFS is a good example, but there may be others.

If we are trying to serve the majority of users (and thus to
select reasonable defaults), then from a disk usage plus decompression
overhead point of view it would be best to store files compressed
only if the result is at least one filesystem block smaller than the
original file. The FS block size can be determined at runtime for any
man, doc, or similar directory in use, so this is doable. But this
approach may overcomplicate the implementation.
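
The block-based rule can be sketched in a few lines of Python. The
function names are made up for illustration, and using f_frsize from
statvfs as the allocation unit is an assumption that holds on common
Linux filesystems but not necessarily everywhere:

```python
import os


def fs_block_size(directory):
    """Block size of the filesystem holding directory (POSIX only)."""
    st = os.statvfs(directory)
    # Assumption: f_frsize is the allocation unit; fall back to f_bsize.
    return st.f_frsize or st.f_bsize


def blocks_needed(size, block_size):
    """Number of whole blocks needed to store size bytes."""
    return -(-size // block_size)  # ceiling division


def saves_a_block(original_size, compressed_size, block_size):
    """True if compression frees at least one whole filesystem block."""
    return blocks_needed(compressed_size, block_size) < blocks_needed(
        original_size, block_size
    )
```

For example, with 4096-byte blocks a 5000-byte file compressed to 3000
bytes drops from two blocks to one, whereas a 3000-byte file compressed
to 1000 bytes still occupies one block and gains nothing.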

Best regards,
Andrew Savchenko

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-13 11:01           ` Andrew Savchenko
@ 2014-05-13 12:18             ` Rich Freeman
  2014-05-13 13:42               ` Ulrich Mueller
                                 ` (2 more replies)
  2014-05-14 13:16             ` vivo75
  1 sibling, 3 replies; 28+ messages in thread
From: Rich Freeman @ 2014-05-13 12:18 UTC (permalink / raw
  To: gentoo-dev

On Tue, May 13, 2014 at 7:01 AM, Andrew Savchenko <bircoph@gmail.com> wrote:
>
> If we are trying to consider all possible cases, some filesystems
> may benefit even from compression of very small files (e.g. from
> 140 to 100 bytes) due to packing of multiple small files in the
> same inode. ReiserFS is a good example, but more may be somewhere
> there.
>

Btrfs also supports file inlining, so every byte saved on small files
does actually help (I believe the data structure that stores the
inlined data doesn't have a fixed record size).  Then again, btrfs
also supports lzo compression and I believe this is fairly widely
used, so I'm not sure that the impact of not compressing small files
will be felt.

I don't think ext4 supports inlining, but I see some discussions of
attempts to add it.

For VERY small files I would think that overhead would become an issue.

Unless we have a bunch of 30-byte man pages I'd think that both
simplicity and some potential for utility would lead us to use the
best algorithm possible.

Rich


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-13 12:18             ` Rich Freeman
@ 2014-05-13 13:42               ` Ulrich Mueller
  2014-05-14 13:42                 ` Andreas K. Huettel
  2014-05-13 17:27               ` [gentoo-dev] " Duncan
  2014-05-14  2:38               ` [gentoo-dev] " Andrew Savchenko
  2 siblings, 1 reply; 28+ messages in thread
From: Ulrich Mueller @ 2014-05-13 13:42 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 942 bytes --]

>>>>> On Tue, 13 May 2014, Rich Freeman wrote:

> Btrfs also supports file inlining, so every byte saved on small files
> does actually help (I believe the data structure that stores the
> inlined data doesn't have a fixed record size).  Then again, btrfs
> also supports lzo compression and I believe this is fairly widely
> used, so I'm not sure that the impact of not compressing small files
> will be felt.

> I don't think ext4 supports inlining, but I see some discussions of
> attempts to add it.

> For VERY small files I would think that overhead would become an issue.

> Unless we have a bunch of 30-byte man pages I'd think that both
> simplicity and some potential for utility would lead us to use the
> best algorithm possible.

Compression for very small files was systematically studied by vapier
in bug 169260, which led to the current threshold of 128 bytes. Files
smaller than that "usually don't compress at all".
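
The effect is easy to reproduce with a short Python sketch. This is only
an illustration with a made-up 29-byte page, not the test from the bug;
it uses the bz2 and lzma modules at their default settings:

```python
import bz2
import lzma

# A made-up, tiny "man page" (29 bytes), for illustration only.
page = b".TH FOO 1\n.SH NAME\nfoo - bar\n"

for name, compress in (("bzip2", bz2.compress), ("xz", lzma.compress)):
    packed = compress(page)
    print(f"{name}: {len(page)} -> {len(packed)} bytes")
```

For inputs this small the container headers and checksums outweigh any
savings, so the "compressed" output comes out larger than the input.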

Ulrich

[-- Attachment #2: Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-13  2:08   ` Andrew Savchenko
@ 2014-05-13 16:33     ` Tom Wijsman
  0 siblings, 0 replies; 28+ messages in thread
From: Tom Wijsman @ 2014-05-13 16:33 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 1389 bytes --]

On Tue, 13 May 2014 06:08:52 +0400
Andrew Savchenko <bircoph@gmail.com> wrote:

> 1. How tools like man or info are supposed to work with such
> bundle? They are not expecting to have multiple man/info files into
> single xz bundle.

Hmm, true; they would need to be adapted, which involves talking to
upstream. Benchmarking and working on this would be a ML thread on its
own; it would be nice to check it out, if someone is interested in it.

> 2. This will put a stress on decompression procedure: in order to
> extract one file whole xz will have to be decompressed.

This would only work for compression algorithms that would allow you to
seek to a specific file to extract; I don't know about xz, but from a
quick look at the man page it indeed seems that that isn't possible.
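
For anyone who wants to experiment, here is a small Python sketch of the
trade-off. It uses a .tar.xz as the bundle format, which is only one
possible choice, and all page names and contents are invented:

```python
import io
import lzma
import tarfile

# Twenty made-up, similar pages standing in for one package's man pages.
pages = {
    f"foo{i}.1": (
        f".TH FOO{i} 1\n.SH NAME\nfoo{i} \\- hypothetical example command {i}\n"
        ".SH DESCRIPTION\nThis page exists only for the size comparison.\n"
    ).encode()
    for i in range(20)
}


def separate_size(pages):
    """Total size when every page gets its own .xz stream."""
    return sum(len(lzma.compress(body)) for body in pages.values())


def make_bundle(pages):
    """Pack all pages into one in-memory .tar.xz and return its bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:xz") as tar:
        for name, body in pages.items():
            info = tarfile.TarInfo(name)
            info.size = len(body)
            tar.addfile(info, io.BytesIO(body))
    return buf.getvalue()


def read_member(bundle, name):
    """Fetch one page; the stream is decompressed sequentially to reach it."""
    with tarfile.open(fileobj=io.BytesIO(bundle), mode="r:xz") as tar:
        return tar.extractfile(name).read()


if __name__ == "__main__":
    bundle = make_bundle(pages)
    print("separate:", separate_size(pages), "bundle:", len(bundle))
```

The bundle avoids per-file container overhead and shares one dictionary,
but read_member() shows the cost: reaching a single page means
decompressing the stream up to that member.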
 
> > Some algorithms tend to work better for smaller files, whereas
> > others work better for larger files; might this be the case for
> > bzip2 vs. xz?
> 
> It doesn't really matter because small files will still require one
> filesystem block for majority of users. For people with reiserfs or
> squashfs this may matter of course.

Thank you for sharing this insight.

-- 
With kind regards,

Tom Wijsman (TomWij)
Gentoo Developer

E-mail address  : TomWij@gentoo.org
GPG Public Key  : 6D34E57D
GPG Fingerprint : C165 AF18 AB4C 400B C3D2  ABF0 95B2 1FCD 6D34 E57D

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [gentoo-dev] Re: RFC: using .xz for doc/man/info compression
  2014-05-13 12:18             ` Rich Freeman
  2014-05-13 13:42               ` Ulrich Mueller
@ 2014-05-13 17:27               ` Duncan
  2014-05-14  2:38               ` [gentoo-dev] " Andrew Savchenko
  2 siblings, 0 replies; 28+ messages in thread
From: Duncan @ 2014-05-13 17:27 UTC (permalink / raw
  To: gentoo-dev

Rich Freeman posted on Tue, 13 May 2014 08:18:25 -0400 as excerpted:

> Btrfs also supports file inlining, so every byte saved on small files
> does actually help (I believe the data structure that stores the inlined
> data doesn't have a fixed record size).

There's an option for it, altho I've not screwed with it and don't know 
the default without looking it up.

The overall metadata node size (set at mkfs.btrfs time) originally 
defaulted to the filesystem block size, which is the memory page size, 
thus 4096 bytes on x86/amd64 and I believe arm.  However, the metadata 
node size default recently changed to 16KiB (or page size where that is 
larger than 16KiB), altho I'd guess there's still more 4KiB node size 
users due to all the legacy btrfs out there, but 16KiB will certainly be 
the majority at some point.

Individual file inline size is certainly smaller than metadata node size, 
but again, I've not messed with that so don't know the actual default for 
it.

> Then again, btrfs also supports lzo compression and I believe this is
> fairly widely used, so I'm not sure that the impact of not compressing
> small files will be felt.

Of course there's gzip as well, and it's the (now legacy) default if 
compression is specified but not type, altho lzo is recommended as faster 
with "good enough" compression.

The other factor to consider is replication mode.  On a single device 
filesystem data replication mode is single by default, with metadata dup 
(two copies), except on detected ssd, where the metadata default is 
(somewhat controversially) single due to some ssds doing internal 
deduplication.  On multi-device filesystems the metadata default is (two-
copy, regardless of the number of devices) raid1, while the data default 
remains single.

So from a size perspective, assuming defaults of single data, dup or 
raid1 metadata, uncompressed, the cutover should be near 2048 bytes, 
since under that, duplicated metadata inlining will still be smaller than 
the 4096 byte data block size, while over that, sticking it in a single-
mode data extent should be more efficient.

Bottom line, there's enough btrfs variables including inlining size, data 
vs. metadata replication modes, metadata node sizes and compression and 
compression type, and the chances that gentoo btrfs users are likely to 
be tweaking at least one of those variables is high enough, that I'm not 
sure a generic ideal cutover makes a lot of sense, but to the extent that 
there is one, it's likely to be near 2048 bytes.
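
The arithmetic behind that 2048-byte figure, as a rough Python sketch.
It is a deliberately simplified model: it ignores metadata item headers,
the actual btrfs inline size limit and compression, and all names are 
made up for illustration:

```python
BLOCK = 4096  # data block size in bytes, as assumed above


def inline_cost(size, metadata_copies=2):
    """Bytes used if the file is inlined into metadata stored N times."""
    return size * metadata_copies


def extent_cost(size, data_copies=1, block=BLOCK):
    """Bytes used if the file goes into whole data blocks."""
    blocks = -(-size // block)  # ceiling division
    return blocks * block * data_copies


def inline_is_smaller(size):
    """True if dup/raid1 metadata inlining beats a single-mode data extent."""
    return inline_cost(size) < extent_cost(size)


if __name__ == "__main__":
    for size in (1000, 2047, 2048, 3000):
        print(size, inline_cost(size), extent_cost(size), inline_is_smaller(size))
```

With two metadata copies and single data, inlining wins only while
2 * size < 4096, which is where the cutover near 2048 bytes comes from.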

FWIW I believe I'm still using portage bzip2 docs compression by default 
here, altho in the context of this thread I should really examine that 
since I use compress=lzo at the filesystem level.  Both data and metadata 
are raid1 here, so inlining doesn't matter except that AFAIK inlining is 
NOT compressed while data extents can be, so portage level compression is 
likely to make even less difference if it's in the range that portage 
level bzip2 compression makes it small enough to be inlined, vs not 
portage level compressed but then big enough to not be inlined, thus 
btrfs-level transparent lzo compressed as a data extent.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-13 12:18             ` Rich Freeman
  2014-05-13 13:42               ` Ulrich Mueller
  2014-05-13 17:27               ` [gentoo-dev] " Duncan
@ 2014-05-14  2:38               ` Andrew Savchenko
  2 siblings, 0 replies; 28+ messages in thread
From: Andrew Savchenko @ 2014-05-14  2:38 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 1762 bytes --]

On Tue, 13 May 2014 08:18:25 -0400 Rich Freeman wrote:
> On Tue, May 13, 2014 at 7:01 AM, Andrew Savchenko <bircoph@gmail.com> wrote:
> >
> > If we are trying to consider all possible cases, some filesystems
> > may benefit even from compression of very small files (e.g. from
> > 140 to 100 bytes) due to packing of multiple small files in the
> > same inode. ReiserFS is a good example, but more may be somewhere
> > there.
> >
> 
> Btrfs also supports file inlining, so every byte saved on small files
> does actually help (I believe the data structure that stores the
> inlined data doesn't have a fixed record size).  Then again, btrfs
> also supports lzo compression and I believe this is fairly widely
> used, so I'm not sure that the impact of not compressing small files
> will be felt.

I did not mean inlining. I was talking about block suballocation,
which allows small files to be stored in the underused blocks of
other files:
http://en.wikipedia.org/wiki/Block_suballocation

> I don't think ext4 supports inlining, but I see some discussions of
> attempts to add it.

Ext4 supports inlining for files up to 59 bytes:
https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Inline_Data
 
> For VERY small files I would think that overhead would become an issue.
> 
> Unless we have a bunch of 30-byte man pages I'd think that both
> simplicity and some potential for utility would lead us to use the
> best algorithm possible.

Agreed, though performance should be considered still. I doubt
paq8l -9 will be used for this task, though it is about 1.5 times
more effective than xz -9e on text files, even on small ones like
man pages; on large files it is at least 2 times better.

Best regards,
Andrew Savchenko

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-12  9:35 ` Tom Wijsman
  2014-05-13  2:08   ` Andrew Savchenko
@ 2014-05-14  3:29   ` Kent Fredric
  2014-05-14 16:53   ` Roy Bamford
  2 siblings, 0 replies; 28+ messages in thread
From: Kent Fredric @ 2014-05-14  3:29 UTC (permalink / raw
  To: gentoo-dev; +Cc: Michał Górny

[-- Attachment #1: Type: text/plain, Size: 807 bytes --]

On 12 May 2014 21:35, Tom Wijsman <TomWij@gentoo.org> wrote:

> What about putting multiple doc / man / info files in a single .xz file
> for each package?
>


How would one use them if they're installed as a single .xz file per
package?

Is there a trick that exists to allow this to even work for "man man" ?

I'm guessing you *could* have an extension wrapper that handles a symlink
to such a file to extract the desired content, but that seems messy.

( ie: less /path/to/bar.polyxz  => polyxz /path/to/bar.polyxz  => polyxz
reads symlink target, decodes the .xz file and returns "bar" from it based
on the symlink source name )

Though I'd imagine the marginal savings made by unifying them as a
single file would be negated by the extra file system allocations
needed to house the symlinks.

-- 
Kent

[-- Attachment #2: Type: text/html, Size: 1473 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-13 11:01           ` Andrew Savchenko
  2014-05-13 12:18             ` Rich Freeman
@ 2014-05-14 13:16             ` vivo75
  1 sibling, 0 replies; 28+ messages in thread
From: vivo75 @ 2014-05-14 13:16 UTC (permalink / raw
  To: gentoo-dev

On 05/13/14 13:01, Andrew Savchenko wrote:
> If we are trying to consider a majority of users (and thus to
> select reasonable defaults), from disk usage + decompression
> overhead point of view it will be the best to store compressed files
> if they are at least one filesystem block smaller than original
> file. FS block size may be extracted runtime for any man or doc, or
> alike directory used, so this is doable. But this approach may
> overcomplicate implementation.
The filesystem on which the files will end up is totally unknown to
portage, since it could be on a different machine when using binpkg.



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-13 13:42               ` Ulrich Mueller
@ 2014-05-14 13:42                 ` Andreas K. Huettel
  2014-05-14 14:01                   ` Ulrich Mueller
  0 siblings, 1 reply; 28+ messages in thread
From: Andreas K. Huettel @ 2014-05-14 13:42 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: Text/Plain, Size: 905 bytes --]

On Tuesday, 13 May 2014, 15:42:11, Ulrich Mueller wrote:
> 
> Compression for very small files was systematically studied by vapier
> in bug 169260, which led to the current threshold of 128 bytes. Files
> smaller than that "usually don't compress at all".
> 

As long as this concerns man pages (where the code handles both
formats transparently), this is fine.

However, I'm not so happy with a "semi-random" compress/don't compress
decision for other files. Maybe some program expects a certain filename in
order to display a README? If there is a clear-cut decision, then the code
can be adapted, but if the portage behaviour changes as soon as the file
grows over a size limit, this is difficult...

Not so important though, since this seems to be a more academic problem.

-- 
Andreas K. Huettel
Gentoo Linux developer (council, kde)
dilfridge@gentoo.org
http://www.akhuettel.de/

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-14 13:42                 ` Andreas K. Huettel
@ 2014-05-14 14:01                   ` Ulrich Mueller
  0 siblings, 0 replies; 28+ messages in thread
From: Ulrich Mueller @ 2014-05-14 14:01 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 795 bytes --]

>>>>> On Wed, 14 May 2014, Andreas K Huettel wrote:

> However, I'm not so happy with a "semi-random" compress/don't compress
> decision for other files. Maybe some program expects a certain
> filename to display a README? If there is a clear-cut decision, then
> the code can be adapted, but if the portage behaviour changes as
> soon as the file grows over a size limit, this is difficult...

For any files that are installed in /usr/share/doc you cannot assume
anything about the compression scheme. It also depends on user
configuration which is not accessible from inside an ebuild.

If it must be ensured that a file is not being compressed or that a
certain program for compression is used, then call "docompress -x" for
that file, and (if necessary) compress it with that program.

Ulrich

[-- Attachment #2: Type: application/pgp-signature, Size: 490 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-12  9:35 ` Tom Wijsman
  2014-05-13  2:08   ` Andrew Savchenko
  2014-05-14  3:29   ` Kent Fredric
@ 2014-05-14 16:53   ` Roy Bamford
  2014-05-14 17:59     ` Rich Freeman
  2 siblings, 1 reply; 28+ messages in thread
From: Roy Bamford @ 2014-05-14 16:53 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 1541 bytes --]

On 2014.05.12 10:35, Tom Wijsman wrote:
> On Sun, 11 May 2014 19:46:50 +0200
> Michał Górny <mgorny@gentoo.org> wrote:
> 
> > Rationale: xz-utils is quite widespread nowadays and it is a part
> > of @system set. It can achieve better compression ratio than bzip2,
> > and faster decompression at the same time.
> 
> Some thoughts:
> 
> What about putting multiple doc / man / info files in a single .xz
> file
> for each package? Would that further improve the situation?
> 
> (As they can share dictionary, instead of having multiple
> dictionaries)
> 
> Some algorithms tend to work better for smaller files, whereas others
> work better for larger files; might this be the case for bzip2 vs. 
> xz?
> 
> -- 
> With kind regards,
> 
> Tom Wijsman (TomWij)
> Gentoo Developer
> 
> E-mail address  : TomWij@gentoo.org
> GPG Public Key  : 6D34E57D
> GPG Fingerprint : C165 AF18 AB4C 400B C3D2  ABF0 95B2 1FCD 6D34 E57D
> 

Some more thoughts ...

What about not compressing files smaller than the filesystem block size
at all?  In my case it's 4k.  Any file gets allocated 4k on disc anyway,
so compression/decompression is just a waste of resources for files
<=4k.

I'm not suggesting dynamically determining the output filesystem block 
size (unless you really want to), choose a static limit below which 
compression will not be applied.

That eliminates the discussion about small files.
-- 
Regards,

Roy Bamford
(Neddyseagoon) a member of
elections
gentoo-ops
forum-mods
trustees

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
  2014-05-14 16:53   ` Roy Bamford
@ 2014-05-14 17:59     ` Rich Freeman
  0 siblings, 0 replies; 28+ messages in thread
From: Rich Freeman @ 2014-05-14 17:59 UTC (permalink / raw
  To: gentoo-dev

On Wed, May 14, 2014 at 12:53 PM, Roy Bamford <neddyseagoon@gentoo.org> wrote:
> What about not compressing files smaller than the filesystem block size
> at all?  In my case it's 4k.  Any file gets allocated 4k on disc anyway,
> so compression/decompression is just a waste of resource for files
> <=4k.
>
> I'm not suggesting dynamically determining the output filesystem block
> size (unless you really want to), choose a static limit below which
> compression will not be applied.
>
> That eliminates the discussion about small files.

See existing discussion around this very topic - for some filesystems
that threshold is apparently as low as about 150 bytes.

Rich


^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2014-05-14 17:59 UTC | newest]

Thread overview: 28+ messages
2014-05-11 17:46 [gentoo-dev] RFC: using .xz for doc/man/info compression Michał Górny
2014-05-11 19:37 ` Alexander Tsoy
2014-05-11 21:27 ` Pacho Ramos
2014-05-11 23:26   ` Gordon Pettey
2014-05-12 10:47     ` Alexander Tsoy
2014-05-12 10:55       ` Alexander Tsoy
2014-05-12 12:17       ` Tom Wijsman
2014-05-12 12:40         ` Alexander Tsoy
2014-05-12 22:55       ` Gordon Pettey
2014-05-13  5:01       ` Andrew Savchenko
2014-05-13  5:55         ` Ulrich Mueller
2014-05-13 11:01           ` Andrew Savchenko
2014-05-13 12:18             ` Rich Freeman
2014-05-13 13:42               ` Ulrich Mueller
2014-05-14 13:42                 ` Andreas K. Huettel
2014-05-14 14:01                   ` Ulrich Mueller
2014-05-13 17:27               ` [gentoo-dev] " Duncan
2014-05-14  2:38               ` [gentoo-dev] " Andrew Savchenko
2014-05-14 13:16             ` vivo75
2014-05-12  9:31   ` Marcin Mirosław
2014-05-12  9:45     ` Tom Wijsman
2014-05-12  3:24 ` Samuli Suominen
2014-05-12  9:35 ` Tom Wijsman
2014-05-13  2:08   ` Andrew Savchenko
2014-05-13 16:33     ` Tom Wijsman
2014-05-14  3:29   ` Kent Fredric
2014-05-14 16:53   ` Roy Bamford
2014-05-14 17:59     ` Rich Freeman
