* [gentoo-dev] RFC: using .xz for doc/man/info compression
From: Michał Górny @ 2014-05-11 17:46 UTC
To: gentoo-dev

Hello, developers.

I'd like to raise the following item for discussion: making .xz
the default compressor used by portage for documentation, man pages
and info files. That is, the equivalent of:

  PORTAGE_COMPRESS=xz

in make.globals.

Rationale: xz-utils is quite widespread nowadays and it is a part
of @system set. It can achieve better compression ratio than bzip2,
and faster decompression at the same time.

I have confirmed that both sys-apps/man and sys-apps/man-db can
handle .xz compressed man pages, and sys-apps/texinfo can handle .xz
compressed info pages. Major text editors and pagers support .xz
just as they support .bz2 (i.e. usually they support both or
neither :)).

The additional question is: what preset to use? To help discuss
this, I'd like to quote the tables from 'man xz':

  Preset   DictSize   CompCPU   CompMem   DecMem
    -0     256 KiB       0        3 MiB    1 MiB
    -1       1 MiB       1        9 MiB    2 MiB
    -2       2 MiB       2       17 MiB    3 MiB
    -3       4 MiB       3       32 MiB    5 MiB
    -4       4 MiB       4       48 MiB    5 MiB
    -5       8 MiB       5       94 MiB    9 MiB
    -6       8 MiB       6       94 MiB    9 MiB
    -7      16 MiB       6      186 MiB   17 MiB
    -8      32 MiB       6      370 MiB   33 MiB
    -9      64 MiB       6      674 MiB   65 MiB

  Preset   DictSize   CompCPU   CompMem   DecMem
    -0e    256 KiB       8        4 MiB    1 MiB
    -1e      1 MiB       8       13 MiB    2 MiB
    -2e      2 MiB       8       25 MiB    3 MiB
    -3e      4 MiB       7       48 MiB    5 MiB
    -4e      4 MiB       8       48 MiB    5 MiB
    -5e      8 MiB       7       94 MiB    9 MiB
    -6e      8 MiB       8       94 MiB    9 MiB
    -7e     16 MiB       8      186 MiB   17 MiB
    -8e     32 MiB       8      370 MiB   33 MiB
    -9e     64 MiB       8      674 MiB   65 MiB

I'd like to note here that increasing dictionary size over file size
does not improve compression. However, the options involved in CompCPU
may.

Depending on the expected amount of complexity, I'd either go for:

1) -6e (or -6, the default) -- max CompCPU, reasonable use of memory,
and dictionary larger than most (or all?) documents that are going to
be compressed,

2) -Ne with minimal 'N' for CompCPU==8 and DictSize > filesize --
still max compression ratio while keeping lowest memory requirements
possible.

Your thoughts?

--
Best regards,
Michał Górny
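Option 2 above can be sketched as a small shell function. This is only
an illustration under stated assumptions: the function name
pick_xz_preset is hypothetical (it is not part of portage or xz-utils),
the dictionary sizes are taken from the 'man xz' tables quoted above,
and -3e and -5e are skipped because the table lists CompCPU 7 for them.

```shell
#!/bin/sh
# Hypothetical helper: print the smallest "-Ne" preset whose dictionary
# is larger than the file, considering only presets with CompCPU == 8
# in the quoted 'man xz' table (-0e, -1e, -2e, -4e, -6e, -7e, -8e, -9e).
pick_xz_preset() {
    size=$1                      # file size in bytes
    kib=1024
    mib=$((1024 * 1024))
    if   [ "$size" -lt $((256 * kib)) ]; then echo "-0e"
    elif [ "$size" -lt $((1 * mib)) ];   then echo "-1e"
    elif [ "$size" -lt $((2 * mib)) ];   then echo "-2e"
    elif [ "$size" -lt $((4 * mib)) ];   then echo "-4e"
    elif [ "$size" -lt $((8 * mib)) ];   then echo "-6e"
    elif [ "$size" -lt $((16 * mib)) ];  then echo "-7e"
    elif [ "$size" -lt $((32 * mib)) ];  then echo "-8e"
    else
        # No preset has a dictionary above 64 MiB; fall back to the largest.
        echo "-9e"
    fi
}
```

A caller would pass the byte size of the file (for example from
"wc -c") and hand the printed preset to xz. Whether the extra
complexity is worth it compared to a fixed -6e is exactly the question
raised in option 1 versus option 2.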
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
From: Alexander Tsoy @ 2014-05-11 19:37 UTC
To: gentoo-dev

On Sun, 11 May 2014 19:46:50 +0200, Michał Górny <mgorny@gentoo.org>
wrote:

> Hello, developers.
>
> I'd like to raise the following item for discussion: making .xz
> the default compressor used by portage for documentation, man pages
> and info files. That is, the equivalent of:
>
>   PORTAGE_COMPRESS=xz
>
> in make.globals.
>
> Rationale: xz-utils is quite widespread nowadays and it is a part
> of @system set. It can achieve better compression ratio than bzip2,
> and faster decompression at the same time.

I tried it recently. Actually, for doc/man/info and any other
relatively small text files, xz has a worse compression ratio than
bzip2. See also:
https://bugs.gentoo.org/show_bug.cgi?id=372653

> [...]

--
Alexander Tsoy
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
From: Pacho Ramos @ 2014-05-11 21:27 UTC
To: gentoo-dev

On Sun, 11-05-2014 at 19:46 +0200, Michał Górny wrote:
> I'd like to raise the following item for discussion: making .xz
> the default compressor used by portage for documentation, man pages
> and info files.
> [...]
> Rationale: xz-utils is quite widespread nowadays and it is a part
> of @system set. It can achieve better compression ratio than bzip2,
> and faster decompression at the same time.
> [...]
> Your thoughts?
>

Per:
https://bugs.gentoo.org/show_bug.cgi?id=372653

Looks like bzip2 was still better for small files :/
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
From: Gordon Pettey @ 2014-05-11 23:26 UTC
To: gentoo-dev

A lot of small files (e.g. AUTHORS, ChangeLog

FWIW: On my system, I have 59M of bz2 files in /usr/share/man and
/usr/share/doc. A short script to decompress those and recompress with
xz -6e reduced that to 36M. I don't have a comparison for individual
file differences. I posted the short bash scripts at
https://gist.github.com/petteyg/96c71fa3c4680552f5c4

On Sun, May 11, 2014 at 4:27 PM, Pacho Ramos <pacho@gentoo.org> wrote:
> [...]
> Per:
> https://bugs.gentoo.org/show_bug.cgi?id=372653
>
> Looks like bzip2 was still better for small files :/
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
From: Alexander Tsoy @ 2014-05-12 10:47 UTC
To: gentoo-dev

On Sun, 11 May 2014 18:26:32 -0500, Gordon Pettey
<petteyg359@gmail.com> wrote:

> A lot of small files (e.g. AUTHORS, ChangeLog
>
> FWIW: On my system, I have 59M of bz2 files in /usr/share/man and
> /usr/share/doc. A short script to decompress those and recompress
> with xz -6e reduced that to 36M.

Very strange o_O

Here are my test results. xz options: "--lzma2=preset=6e,dict=4MiB".
A larger dictionary size does not improve the compression ratio; I get
even worse results with just "-6e" or "-9e". man-bz2 is a full copy of
my /usr/share/man, man-xz is a recompressed one.

Size comparison:

$ du -s man-bz2/ man-xz/
82032   man-bz2/
82308   man-xz/

Decompression speed:

$ time find man-bz2/ -type f -name "*.bz2" -exec bzcat '{}' > /dev/null \;

real    0m35.110s
user    0m14.509s
sys     0m15.227s

$ time find man-bz2/ -type f -name "*.bz2" -exec bzcat '{}' > /dev/null \;

real    0m35.407s
user    0m14.432s
sys     0m15.186s

$ time find man-xz/ -type f -name "*.xz" -exec xzcat '{}' > /dev/null \;

real    0m46.571s
user    0m17.077s
sys     0m23.906s

$ time find man-xz/ -type f -name "*.xz" -exec xzcat '{}' > /dev/null \;

real    0m46.137s
user    0m17.276s
sys     0m23.426s

As you can see, xz is actually worse in both speed and compression
ratio.

--
Alexander Tsoy
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
From: Alexander Tsoy @ 2014-05-12 10:55 UTC
To: gentoo-dev

On Mon, 12 May 2014 14:47:36 +0400, Alexander Tsoy <alexander@tsoy.me>
wrote:

> [...]
> Size comparison:
>
> $ du -s man-bz2/ man-xz/
> 82032   man-bz2/
> 82308   man-xz/

Note that a lot of files in these directories are uncompressed text
files or symlinks:

$ find man-bz2/ \( ! -name "*.bz2" -o -type l \) -a ! -type d | wc -l
8434
$ find man-bz2/ -name "*.bz2" -type f | wc -l
11243
$ find man-xz/ \( ! -name "*.xz" -o -type l \) -a ! -type d | wc -l
8434
$ find man-xz/ -name "*.xz" -type f | wc -l
11243

After removing them and adding the -b option:

$ du -bs man-bz2/ man-xz/
32158286        man-bz2/
32550305        man-xz/

--
Alexander Tsoy
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
From: Tom Wijsman @ 2014-05-12 12:17 UTC
To: gentoo-dev

On Mon, 12 May 2014 14:47:36 +0400
Alexander Tsoy <alexander@tsoy.me> wrote:

> Here are my test results. xz options: "--lzma2=preset=6e,dict=4MiB".
> A larger dictionary size does not improve the compression ratio; I get
> even worse results with just "-6e" or "-9e". man-bz2 is a full copy of
> my /usr/share/man, man-xz is a recompressed one.

Picking a random post to reply to; if you don't already, please
consider doing these tests in tmpfs to cancel out any fs / storage
differences.

--
With kind regards,

Tom Wijsman (TomWij)
Gentoo Developer

E-mail address  : TomWij@gentoo.org
GPG Public Key  : 6D34E57D
GPG Fingerprint : C165 AF18 AB4C 400B C3D2  ABF0 95B2 1FCD 6D34 E57D
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
From: Alexander Tsoy @ 2014-05-12 12:40 UTC
To: gentoo-dev

On Mon, 12 May 2014 14:17:11 +0200, Tom Wijsman <TomWij@gentoo.org>
wrote:

> Picking a random post to reply to; if you don't already, please
> consider doing these tests in tmpfs to cancel out any fs / storage
> differences.
>

The same test in tmpfs:

$ time find man-bz2/ -type f -name "*.bz2" -exec bzcat '{}' > /dev/null \;

real    0m35.895s
user    0m14.232s
sys     0m14.121s

$ time find man-xz/ -type f -name "*.xz" -exec xzcat '{}' > /dev/null \;

real    0m44.342s
user    0m16.842s
sys     0m21.459s

And here is an additional test. It shows where the bottleneck actually
is. xz is faster at decompression, but it looks like it just has slower
process initialization. So it's slower at decompressing a single small
file when one process is started per file.

$ time find man-bz2/ -type f -name "*.bz2" -exec bzcat '{}' > /dev/null \+

real    0m10.096s
user    0m9.000s
sys     0m0.787s

$ time find man-xz/ -type f -name "*.xz" -exec xzcat '{}' > /dev/null \+

real    0m7.846s
user    0m7.108s
sys     0m0.487s

--
Alexander Tsoy
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
From: Gordon Pettey @ 2014-05-12 22:55 UTC
To: gentoo-dev

On Mon, May 12, 2014 at 5:47 AM, Alexander Tsoy <alexander@tsoy.me>
wrote:

> Very strange o_O
>
> Here are my test results. xz options: "--lzma2=preset=6e,dict=4MiB".
> A larger dictionary size does not improve the compression ratio; I get
> even worse results with just "-6e" or "-9e". man-bz2 is a full copy of
> my /usr/share/man, man-xz is a recompressed one.
>
> Size comparison:
>
> $ du -s man-bz2/ man-xz/
> 82032   man-bz2/
> 82308   man-xz/

Did you skip all the files that weren't bz2 in the first place, and
decompress bz2 before compressing with xz? My comparison script does
not include uncompressed files. It copies all the bz2 files to a new
folder, pipes those through bzip2 -d to xz -6e into files in another
new folder, then compares the total size of those folders. Out of 8576
compressed files, only 464 were larger in xz than in bz2.

A very rough timing test I just did showed the total decompression
time of all the xz files to be half that of decompressing all the bz2
files. I'm working on getting that data per file, along with averages.
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
From: Andrew Savchenko @ 2014-05-13 5:01 UTC
To: gentoo-dev

Hello,

On Mon, 12 May 2014 14:47:36 +0400 Alexander Tsoy wrote:
> Size comparison:
>
> $ du -s man-bz2/ man-xz/
> 82032   man-bz2/
> 82308   man-xz/

Please consider that by default du shows block size, not byte size.
That means that if a file is actually 1234 bytes large, without -b it
will still be accounted as 4096 bytes on a 4K-block filesystem.

Here are my results:

1. With bzip2 -9:

find -O3 /usr/share/man -type f -name "*.bz2" -print0 | du -bhc --files0-from -
63M
find -O3 /usr/share/man -type f -name "*.bz2" -print0 | du -hc --files0-from -
146M
find -O3 /usr/share/doc -type f -name "*.bz2" -print0 | du -bhc --files0-from -
151M total
find -O3 /usr/share/doc -type f -name "*.bz2" -print0 | du -hc --files0-from -
249M total

2. With xz -9e:

find -O3 /usr/share/man -type f -name "*.xz" -print0 | du -bhc --files0-from -
64M
find -O3 /usr/share/man -type f -name "*.xz" -print0 | du -hc --files0-from -
146M
find -O3 /usr/share/doc -type f -name "*.xz" -print0 | du -bhc --files0-from -
147M total
find -O3 /usr/share/doc -type f -name "*.xz" -print0 | du -hc --files0-from -
245M total

As one can see, on man pages xz is slightly worse on apparent file
sizes and makes no difference in real disk usage. On docs xz is better
for both sizes.

As for decompression speed, xz is about twice as fast as bzip2 for
large man pages (bash, mplayer, cmake, zshall). Though this speed
gain needs to be measured directly for the bunzip2 and unxz
applications.

I'll publish statistically meaningful results later. Both scripting
and testing require time.

Best regards,
Andrew Savchenko
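The gap between "du -b" (apparent size) and plain du (allocated size)
described above comes down to rounding every file up to whole
filesystem blocks. A minimal sketch of that arithmetic follows; the
function name allocated_size is hypothetical, the 4096-byte default is
only the block size used in the example above, and the model assumes a
filesystem with no inlining, tail packing or sparse files.

```shell
#!/bin/sh
# Hypothetical helper: round an apparent file size (bytes) up to whole
# filesystem blocks, which is roughly what du without -b accounts for.
allocated_size() {
    size=$1
    block=${2:-4096}             # assumed block size, 4 KiB by default
    echo $(( (size + block - 1) / block * block ))
}

allocated_size 1234              # the 1234-byte file from the example
```

Under this model a man page compressed from 3000 to 1200 bytes still
occupies one 4096-byte block either way, which is why the apparent
sizes differ between bzip2 and xz while the real disk usage does not.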
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
From: Ulrich Mueller @ 2014-05-13 5:55 UTC
To: gentoo-dev

>>>>> On Tue, 13 May 2014, Andrew Savchenko wrote:

> Please consider that by default du shows block size, not byte size.
> That means that if a file is actually 1234 bytes large, without -b it
> will still be accounted as 4096 bytes on a 4K-block filesystem.

This raises another question, namely whether files with <= 4096 bytes
size should be compressed at all. Portage already has a fixed size
limit of 128 bytes (see bug 169260), but maybe this could be made
configurable.

Ulrich
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
From: Andrew Savchenko @ 2014-05-13 11:01 UTC
To: gentoo-dev

On Tue, 13 May 2014 07:55:56 +0200 Ulrich Mueller wrote:
> >>>>> On Tue, 13 May 2014, Andrew Savchenko wrote:
>
> > Please consider that by default du shows block size, not byte size.
> > That means that if a file is actually 1234 bytes large, without -b
> > it will still be accounted as 4096 bytes on a 4K-block filesystem.
>
> This raises another question, namely whether files with <= 4096 bytes
> size should be compressed at all. Portage already has a fixed size
> limit of 128 bytes (see bug 169260), but maybe this could be made
> configurable.

No doubt this limit should be configurable, because defaults that are
fine for one setup may harm another.

If we are trying to consider all possible cases, some filesystems
may benefit even from compression of very small files (e.g. from
140 to 100 bytes) due to packing of multiple small files in the
same inode. ReiserFS is a good example, but there may be others.

If we are trying to consider the majority of users (and thus to select
reasonable defaults), from a disk usage + decompression overhead
point of view it would be best to store files compressed only if they
end up at least one filesystem block smaller than the original file.
The FS block size may be determined at runtime for any man, doc, or
similar directory used, so this is doable. But this approach may
overcomplicate the implementation.

Best regards,
Andrew Savchenko
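The "at least one filesystem block smaller" rule suggested above could
look roughly like the following. This is a sketch, not portage code:
the names blocks_needed and saves_a_block are hypothetical, the rule is
read here as "the compressed file occupies fewer blocks than the
original", and the 4096-byte default block size is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: number of filesystem blocks needed for a size.
blocks_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Hypothetical check: succeed only if compressing from orig to comp
# bytes frees at least one whole block on a filesystem with the given
# block size (simple model, no inlining or tail packing).
saves_a_block() {
    orig=$1
    comp=$2
    block=${3:-4096}
    [ "$(blocks_needed "$comp" "$block")" -lt "$(blocks_needed "$orig" "$block")" ]
}

if saves_a_block 9000 3000; then
    echo "keep compressed"
else
    echo "keep uncompressed"
fi
```

With GNU coreutils the block size of the target directory can be
queried with something like "stat -f -c %S /usr/share/man"; other
stat implementations use different options, so that part is left out
of the sketch.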
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
From: Rich Freeman @ 2014-05-13 12:18 UTC
To: gentoo-dev

On Tue, May 13, 2014 at 7:01 AM, Andrew Savchenko <bircoph@gmail.com>
wrote:
>
> If we are trying to consider all possible cases, some filesystems
> may benefit even from compression of very small files (e.g. from
> 140 to 100 bytes) due to packing of multiple small files in the
> same inode. ReiserFS is a good example, but there may be others.
>

Btrfs also supports file inlining, so every byte saved on small files
does actually help (I believe the data structure that stores the
inlined data doesn't have a fixed record size). Then again, btrfs
also supports lzo compression and I believe this is fairly widely
used, so I'm not sure that the impact of not compressing small files
will be felt.

I don't think ext4 supports inlining, but I see some discussions of
attempts to add it.

For VERY small files I would think that overhead would become an
issue.

Unless we have a bunch of 30-byte man pages I'd think that both
simplicity and some potential for utility would lead us to use the
best algorithm possible.

Rich
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
From: Ulrich Mueller @ 2014-05-13 13:42 UTC
To: gentoo-dev

>>>>> On Tue, 13 May 2014, Rich Freeman wrote:

> Btrfs also supports file inlining, so every byte saved on small files
> does actually help (I believe the data structure that stores the
> inlined data doesn't have a fixed record size). Then again, btrfs
> also supports lzo compression and I believe this is fairly widely
> used, so I'm not sure that the impact of not compressing small files
> will be felt.

> I don't think ext4 supports inlining, but I see some discussions of
> attempts to add it.

> For VERY small files I would think that overhead would become an
> issue.

> Unless we have a bunch of 30-byte man pages I'd think that both
> simplicity and some potential for utility would lead us to use the
> best algorithm possible.

Compression for very small files was systematically studied by vapier
in bug 169260, which led to the current threshold of 128 bytes. Files
smaller than that "usually don't compress at all".

Ulrich
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
From: Andreas K. Huettel @ 2014-05-14 13:42 UTC
To: gentoo-dev

On Tuesday, 13 May 2014, 15:42:11, Ulrich Mueller wrote:
>
> Compression for very small files was systematically studied by vapier
> in bug 169260, which led to the current threshold of 128 bytes. Files
> smaller than that "usually don't compress at all".
>

As long as this concerns manpages (where the code transparently
handles both formats) this is fine.

However, I'm not so happy with a "semi-random" compress/don't compress
decision for other files. Maybe some program expects a certain
filename to display a README? If there is a clear-cut decision, then
the code can be adapted, but if the portage behaviour changes as soon
as the file grows over a size limit, this is difficult...

Not so important though, since this seems to be a more academic
problem.

--
Andreas K. Huettel
Gentoo Linux developer (council, kde)
dilfridge@gentoo.org
http://www.akhuettel.de/
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression
From: Ulrich Mueller @ 2014-05-14 14:01 UTC
To: gentoo-dev

>>>>> On Wed, 14 May 2014, Andreas K Huettel wrote:

> However, I'm not so happy with a "semi-random" compress/don't compress
> decision for other files. Maybe some program expects a certain
> filename to display a README? If there is a clear-cut decision, then
> the code can be adapted, but if the portage behaviour changes as
> soon as the file grows over a size limit, this is difficult...

For any files that are installed in /usr/share/doc you cannot assume
anything about the compression scheme. It also depends on user
configuration which is not accessible from inside an ebuild.

If it must be ensured that a file is not being compressed or that a
certain program for compression is used, then call "docompress -x" for
that file, and (if necessary) compress it with that program.

Ulrich
* [gentoo-dev] Re: RFC: using .xz for doc/man/info compression 2014-05-13 12:18 ` Rich Freeman 2014-05-13 13:42 ` Ulrich Mueller @ 2014-05-13 17:27 ` Duncan 2014-05-14 2:38 ` [gentoo-dev] " Andrew Savchenko 2 siblings, 0 replies; 28+ messages in thread From: Duncan @ 2014-05-13 17:27 UTC (permalink / raw To: gentoo-dev Rich Freeman posted on Tue, 13 May 2014 08:18:25 -0400 as excerpted: > Btrfs also supports file inlining, so every byte saved on small files > does actually help (I believe the data structure that stores the inlined > data doesn't have a fixed record size). There's an option for it, altho I've not screwed with it and don't know the default without looking it up. The overall metadata node size (set at mkfs.btrfs time) originally defaulted to the filesystem block size, which is the memory page size, thus 4096 bytes on x86/amd64 and I believe arm. However, the metadata node size default recently changed to 16KiB (or page size where that is larger than 16KiB), altho I'd guess there's still more 4KiB node size users due to all the legacy btrfs out there, but 16KiB will certainly be the majority at some point. Individual file inline size is certainly smaller than metadata node size, but again, I've not messed with that so don't know the actual default for it. > Then again, btrfs also supports lzo compression and I believe this is > fairly widely used, so I'm not sure that the impact of not compressing > small files will be felt. Of course there's gzip as well, and it's the (now legacy) default if compression is specified but not type, altho lzo is recommended as faster with "good enough" compression. The other factor to consider is replication mode. On a single device filesystem data replication mode is single by default, with metadata dup (two copies), except on detected ssd, where the metadata default is (somewhat controversially) single due to some ssds doing internal deduplication. 
On multi-device filesystems the metadata default is (two-copy, regardless of the number of devices) raid1, while the data default remains single. So from a size perspective, assuming defaults of single data, dup or raid1 metadata, uncompressed, the cutover should be near 2048 bytes, since under that, duplicated metadata inlining will still be smaller than the 4096 byte data block size, while over that, sticking it in a single-mode data extent should be more efficient. Bottom line, there's enough btrfs variables including inlining size, data vs. metadata replication modes, metadata node sizes and compression and compression type, and the chances that gentoo btrfs users are likely to be tweaking at least one of those variables is high enough, that I'm not sure a generic ideal cutover makes a lot of sense, but to the extent that there is one, it's likely to be near 2048 bytes. FWIW I believe I'm still using portage bzip2 docs compression by default here, altho in the context of this thread I should really examine that since I use compress=lzo at the filesystem level. Both data and metadata are raid1 here, so inlining doesn't matter except that AFAIK inlining is NOT compressed while data extents can be, so portage level compression is likely to make even less difference if it's in the range that portage level bzip2 compression makes it small enough to be inlined, vs not portage level compressed but then big enough to not be inlined, thus btrfs-level transparent lzo compressed as a data extent. -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman ^ permalink raw reply [flat|nested] 28+ messages in thread
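The 2048-byte cutover estimated in the message above can be checked with a little arithmetic. The following is only a rough sketch of that reasoning, not a model of real btrfs behaviour: the function names are made up for illustration, and it ignores per-item metadata overhead, the configured inline size limit, and filesystem-level compression.

```python
def inline_cost(size: int, metadata_copies: int = 2) -> int:
    """Bytes used when the file body is inlined into replicated metadata."""
    return size * metadata_copies


def extent_cost(size: int, block_size: int = 4096, data_copies: int = 1) -> int:
    """Bytes used when the file body is stored in whole data blocks."""
    blocks = -(-size // block_size)  # ceiling division
    return blocks * block_size * data_copies


def inline_is_smaller(size: int, block_size: int = 4096,
                      metadata_copies: int = 2, data_copies: int = 1) -> bool:
    """True if inlining costs fewer bytes than a separate data extent."""
    return inline_cost(size, metadata_copies) < extent_cost(
        size, block_size, data_copies)


if __name__ == "__main__":
    for size in (100, 2047, 2048, 3000):
        print(size, inline_is_smaller(size))
```

With the assumed defaults (two metadata copies, one data copy, 4096-byte blocks) the answer flips between 2047 and 2048 bytes, which matches the estimate. Passing data_copies=2 models a setup where data is replicated as well, and moves the cutover up to the block size.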
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression 2014-05-13 12:18 ` Rich Freeman 2014-05-13 13:42 ` Ulrich Mueller 2014-05-13 17:27 ` [gentoo-dev] " Duncan @ 2014-05-14 2:38 ` Andrew Savchenko 2 siblings, 0 replies; 28+ messages in thread From: Andrew Savchenko @ 2014-05-14 2:38 UTC (permalink / raw To: gentoo-dev [-- Attachment #1: Type: text/plain, Size: 1762 bytes --] On Tue, 13 May 2014 08:18:25 -0400 Rich Freeman wrote: > On Tue, May 13, 2014 at 7:01 AM, Andrew Savchenko <bircoph@gmail.com> wrote: > > > > If we are trying to consider all possible cases, some filesystems > > may benefit even from compression of very small files (e.g. from > > 140 to 100 bytes) due to packing of multiple small files in the > > same inode. ReiserFS is a good example, but more may be somewhere > > there. > > > > Btrfs also supports file inlining, so every byte saved on small files > does actually help (I believe the data structure that stores the > inlined data doesn't have a fixed record size). Then again, btrfs > also supports lzo compression and I believe this is fairly widely > used, so I'm not sure that the impact of not compressing small files > will be felt. I did not mean inlining. I was talking about block suballocation, which allows small files to be stored in the underused blocks of other files: http://en.wikipedia.org/wiki/Block_suballocation > I don't think ext4 supports inlining, but I see some discussions of > attempts to add it. Ext4 supports inlining for files up to 59 bytes: https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Inline_Data > For VERY small files I would think that overhead would become an issue. > > Unless we have a bunch of 30-byte man pages I'd think that both > simplicity and some potential for utility would lead us to use the > best algorithm possible. Agreed, though performance should still be considered.
I doubt paq8l -9 will be used for this task, though it is about 1.5 times more effective than xz -9e on text files, even on small ones like man pages; on large files it is at least 2 times better. Best regards, Andrew Savchenko [-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression 2014-05-13 11:01 ` Andrew Savchenko 2014-05-13 12:18 ` Rich Freeman @ 2014-05-14 13:16 ` vivo75 1 sibling, 0 replies; 28+ messages in thread From: vivo75 @ 2014-05-14 13:16 UTC (permalink / raw To: gentoo-dev On 05/13/14 13:01, Andrew Savchenko wrote: > If we are trying to consider a majority of users (and thus to > select reasonable defaults), from disk usage + decompression > overhead point of view it will be the best to store compressed files > if they are at least one filesystem block smaller than original > file. FS block size may be extracted runtime for any man or doc, or > alike directory used, so this is doable. But this approach may > overcomplicate implementation. The filesystem on which the files will end up is totally unknown to portage, since it could be on a different machine using binpkg. ^ permalink raw reply [flat|nested] 28+ messages in thread
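The block-based criterion quoted above (keep the compressed file only if it is at least one filesystem block smaller) can be sketched as follows. The function names and the 4096-byte default are assumptions made for illustration; as the reply points out, the real block size of the target filesystem may not be known at build time, so it is left as a parameter here.

```python
def blocks_needed(size: int, block_size: int = 4096) -> int:
    """Number of whole blocks needed to hold `size` bytes."""
    return -(-size // block_size)  # ceiling division


def blocks_saved(raw_size: int, compressed_size: int,
                 block_size: int = 4096) -> int:
    """How many blocks compression frees (may be zero or negative)."""
    return (blocks_needed(raw_size, block_size)
            - blocks_needed(compressed_size, block_size))


def worth_storing_compressed(raw_size: int, compressed_size: int,
                             block_size: int = 4096) -> bool:
    """Apply the 'at least one block smaller' rule."""
    return blocks_saved(raw_size, compressed_size, block_size) >= 1


if __name__ == "__main__":
    # A 3000-byte page compressed to 1200 bytes still occupies one block.
    print(worth_storing_compressed(3000, 1200))
    # A 9000-byte page compressed to 4000 bytes drops from 3 blocks to 1.
    print(worth_storing_compressed(9000, 4000))
```

This simple model assumes a filesystem without inlining or tail packing; on filesystems that pack small files, smaller savings can still matter.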
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression 2014-05-11 21:27 ` Pacho Ramos 2014-05-11 23:26 ` Gordon Pettey @ 2014-05-12 9:31 ` Marcin Mirosław 2014-05-12 9:45 ` Tom Wijsman 1 sibling, 1 reply; 28+ messages in thread From: Marcin Mirosław @ 2014-05-12 9:31 UTC (permalink / raw To: gentoo-dev On 11.05.2014 23:27, Pacho Ramos wrote: > On Sun, 11-05-2014 at 19:46 +0200, Michał Górny wrote: >> Hello, developers. >> >> I'd like to raise the following item for discussion: making .xz >> the default compressor used by portage for documentation, man pages >> and info files. That is, the equivalent of: >> >> PORTAGE_COMPRESS=xz >> >> in make.globals. >> >> Rationale: xz-utils is quite widespread nowadays and it is a part >> of @system set. It can achieve better compression ratio than bzip2, >> and faster decompression at the same time. >> >> I have confirmed that both sys-apps/man and sys-apps/man-db can >> handle .xz compressed man pages, and sys-apps/texinfo can handle .xz >> compressed info pages. Major text editors and pagers support .xz >> alike .bz2 (i.e. usually they support both or neither :)). >> >> The additional question is: what preset to use? 
To help discussing >> this, I'd like to quote the tables from 'man xz': >> >> Preset DictSize CompCPU CompMem DecMem >> -0 256 KiB 0 3 MiB 1 MiB >> -1 1 MiB 1 9 MiB 2 MiB >> -2 2 MiB 2 17 MiB 3 MiB >> -3 4 MiB 3 32 MiB 5 MiB >> -4 4 MiB 4 48 MiB 5 MiB >> -5 8 MiB 5 94 MiB 9 MiB >> -6 8 MiB 6 94 MiB 9 MiB >> -7 16 MiB 6 186 MiB 17 MiB >> -8 32 MiB 6 370 MiB 33 MiB >> -9 64 MiB 6 674 MiB 65 MiB >> >> Preset DictSize CompCPU CompMem DecMem >> -0e 256 KiB 8 4 MiB 1 MiB >> -1e 1 MiB 8 13 MiB 2 MiB >> -2e 2 MiB 8 25 MiB 3 MiB >> -3e 4 MiB 7 48 MiB 5 MiB >> -4e 4 MiB 8 48 MiB 5 MiB >> -5e 8 MiB 7 94 MiB 9 MiB >> -6e 8 MiB 8 94 MiB 9 MiB >> -7e 16 MiB 8 186 MiB 17 MiB >> -8e 32 MiB 8 370 MiB 33 MiB >> -9e 64 MiB 8 674 MiB 65 MiB >> >> I'd like to note here that increasing dictionary size over file size >> does not improve compression. However, the options involved in CompCPU >> may. >> >> Depending on the expected amount of complexity, I'd either go for: >> >> 1) -6e (or -6, the default) -- max CompCPU, reasonable use of memory, >> and dictionary larger than most (or all?) documents that are going to >> be compressed, >> >> 2) -Ne with minimal 'N' for CompCPU==8 and DictSize > filesize -- still >> max compression ratio while keeping lowest memory requirements possible. >> >> Your thoughts? >> > > Per: > https://bugs.gentoo.org/show_bug.cgi?id=372653 > > Looks like bzip2 was still better for small files :/ Hi! 
I ran a test on a medium-sized man page (bash):

$ man -a -w bash
/usr/share/man/man1/bash.1.bz2
$ stat --printf=%s\\n /usr/share/man/man1/bash.1.bz2
62606
$ time man -c -P /bin/cat bash >/dev/null
real 0m0.248s
user 0m0.316s
sys 0m0.012s
$ time man -c -P /bin/cat bash >/dev/null
real 0m0.252s
user 0m0.324s
sys 0m0.016s
$ time man -c -P /bin/cat bash >/dev/null
real 0m0.249s
user 0m0.320s
sys 0m0.012s

Then I recompressed it using xz -6:

$ stat --printf=%s\\n /usr/share/man/man1/bash.1.xz
66628
$ time man -c -P /bin/cat bash >/dev/null
real 0m0.234s
user 0m0.304s
sys 0m0.004s
$ time man -c -P /bin/cat bash >/dev/null
real 0m0.244s
user 0m0.288s
sys 0m0.024s
$ time man -c -P /bin/cat bash >/dev/null
real 0m0.239s
user 0m0.308s
sys 0m0.012s

And with the file compressed using '-6e':

$ stat --printf=%s\\n /usr/share/man/man1/bash.1.xz
66700
$ time man -c -P /bin/cat bash >/dev/null
real 0m0.233s
user 0m0.292s
sys 0m0.016s
$ time man -c -P /bin/cat bash >/dev/null
real 0m0.234s
user 0m0.300s
sys 0m0.008s

IMHO there is no real advantage in changing the current compressor for man files. Regards ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression 2014-05-12 9:31 ` Marcin Mirosław @ 2014-05-12 9:45 ` Tom Wijsman 0 siblings, 0 replies; 28+ messages in thread From: Tom Wijsman @ 2014-05-12 9:45 UTC (permalink / raw To: gentoo-dev; +Cc: marcin [-- Attachment #1: Type: text/plain, Size: 641 bytes --] On Mon, 12 May 2014 11:31:45 +0200 Marcin Mirosław <marcin@mejor.pl> wrote: > Imho there is no real advantages to change current compressor for man > files. Experimenting on a single file is insufficient to support such a claim; you may very well have found a file that works equally well with multiple compression algorithms. It would be nice to script this over all man / ... files and draw up a table for further comparison. -- With kind regards, Tom Wijsman (TomWij) Gentoo Developer E-mail address : TomWij@gentoo.org GPG Public Key : 6D34E57D GPG Fingerprint : C165 AF18 AB4C 400B C3D2 ABF0 95B2 1FCD 6D34 E57D [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 490 bytes --] ^ permalink raw reply [flat|nested] 28+ messages in thread
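A survey along the lines suggested above could be scripted roughly as follows. This is a minimal sketch, not something posted in the thread: it uses Python's standard bz2 and lzma modules rather than the bzip2 and xz command-line tools, so sizes may differ slightly from what portage would produce, and the names compare_sizes and survey are hypothetical. It also assumes the directory passed to survey holds uncompressed documents, since files on an installed system are usually compressed already.

```python
import bz2
import lzma
from pathlib import Path


def compare_sizes(data: bytes, xz_preset: int = 6) -> dict:
    """Return the raw, bzip2, xz and xz-extreme sizes of one document."""
    extreme = xz_preset | lzma.PRESET_EXTREME  # like `xz -6e` for preset 6
    return {
        "raw": len(data),
        "bz2": len(bz2.compress(data, 9)),
        "xz": len(lzma.compress(data, format=lzma.FORMAT_XZ,
                                preset=xz_preset)),
        "xz_extreme": len(lzma.compress(data, format=lzma.FORMAT_XZ,
                                        preset=extreme)),
    }


def survey(root: str, pattern: str = "*") -> list:
    """Collect (path, sizes) rows for every regular file under `root`."""
    rows = []
    for path in sorted(Path(root).rglob(pattern)):
        if path.is_file():
            rows.append((str(path), compare_sizes(path.read_bytes())))
    return rows


if __name__ == "__main__":
    import sys
    for name, sizes in survey(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(name, sizes["raw"], sizes["bz2"], sizes["xz"],
              sizes["xz_extreme"])
```

Running it over a directory of uncompressed man pages prints one row per file, which can then be sorted by size to see where either compressor wins.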
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression 2014-05-11 17:46 [gentoo-dev] RFC: using .xz for doc/man/info compression Michał Górny 2014-05-11 19:37 ` Alexander Tsoy 2014-05-11 21:27 ` Pacho Ramos @ 2014-05-12 3:24 ` Samuli Suominen 2014-05-12 9:35 ` Tom Wijsman 3 siblings, 0 replies; 28+ messages in thread From: Samuli Suominen @ 2014-05-12 3:24 UTC (permalink / raw To: gentoo-dev On 11/05/14 20:46, Michał Górny wrote: > Hello, developers. > > I'd like to raise the following item for discussion: making .xz > the default compressor used by portage for documentation, man pages > and info files. That is, the equivalent of: > > PORTAGE_COMPRESS=xz > > in make.globals. > > I like it, I've been using it myself from make.conf with the current install on this machine. But no, I don't have size or speed comparison to give :/ - Samuli ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression 2014-05-11 17:46 [gentoo-dev] RFC: using .xz for doc/man/info compression Michał Górny ` (2 preceding siblings ...) 2014-05-12 3:24 ` Samuli Suominen @ 2014-05-12 9:35 ` Tom Wijsman 2014-05-13 2:08 ` Andrew Savchenko ` (2 more replies) 3 siblings, 3 replies; 28+ messages in thread From: Tom Wijsman @ 2014-05-12 9:35 UTC (permalink / raw To: gentoo-dev; +Cc: mgorny [-- Attachment #1: Type: text/plain, Size: 847 bytes --] On Sun, 11 May 2014 19:46:50 +0200 Michał Górny <mgorny@gentoo.org> wrote: > Rationale: xz-utils is quite widespread nowadays and it is a part > of @system set. It can achieve better compression ratio than bzip2, > and faster decompression at the same time. Some thoughts: What about putting multiple doc / man / info files in a single .xz file for each package? Would that further improve the situation? (As they can share dictionary, instead of having multiple dictionaries) Some algorithms tend to work better for smaller files, whereas others work better for larger files; might this be the case for bzip2 vs. xz? -- With kind regards, Tom Wijsman (TomWij) Gentoo Developer E-mail address : TomWij@gentoo.org GPG Public Key : 6D34E57D GPG Fingerprint : C165 AF18 AB4C 400B C3D2 ABF0 95B2 1FCD 6D34 E57D [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 490 bytes --] ^ permalink raw reply [flat|nested] 28+ messages in thread
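Whether a shared dictionary would help can be estimated before touching any tools. The sketch below compares compressing each document separately with compressing their plain concatenation in one xz stream. It is only an upper-bound estimate under stated assumptions: concatenation is not a usable bundle format (there is no index to find individual files), and separate_vs_bundled is a hypothetical helper name.

```python
import lzma
from typing import Sequence, Tuple


def separate_vs_bundled(docs: Sequence[bytes],
                        preset: int = 6) -> Tuple[int, int]:
    """Return (total size of per-file .xz streams, size of one bundled stream)."""
    separate = sum(
        len(lzma.compress(doc, format=lzma.FORMAT_XZ, preset=preset))
        for doc in docs
    )
    bundled = len(
        lzma.compress(b"".join(docs), format=lzma.FORMAT_XZ, preset=preset)
    )
    return separate, bundled


if __name__ == "__main__":
    # Five small, similar documents stand in for one package's man pages.
    pages = [b".TH EXAMPLE 1\n.SH NAME\nexample \\- a sample page\n" * 20] * 5
    print(separate_vs_bundled(pages))
```

For small, similar files the bundled stream should come out smaller, because each separate .xz file pays its own container overhead and cannot reuse matches from its siblings. The cost is that reading one document means decompressing the stream up to that document.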
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression 2014-05-12 9:35 ` Tom Wijsman @ 2014-05-13 2:08 ` Andrew Savchenko 2014-05-13 16:33 ` Tom Wijsman 2014-05-14 3:29 ` Kent Fredric 2014-05-14 16:53 ` Roy Bamford 2 siblings, 1 reply; 28+ messages in thread From: Andrew Savchenko @ 2014-05-13 2:08 UTC (permalink / raw To: gentoo-dev [-- Attachment #1: Type: text/plain, Size: 1223 bytes --] On Mon, 12 May 2014 11:35:00 +0200 Tom Wijsman wrote: > On Sun, 11 May 2014 19:46:50 +0200 > Michał Górny <mgorny@gentoo.org> wrote: > > > Rationale: xz-utils is quite widespread nowadays and it is a > > part of @system set. It can achieve better compression ratio > > than bzip2, and faster decompression at the same time. > > Some thoughts: > > What about putting multiple doc / man / info files in a > single .xz file for each package? Would that further improve the > situation? > > (As they can share dictionary, instead of having multiple > dictionaries) 1. How are tools like man or info supposed to work with such a bundle? They do not expect multiple man/info files in a single xz bundle. 2. This will put stress on the decompression procedure: in order to extract one file, the whole xz bundle will have to be decompressed. > Some algorithms tend to work better for smaller files, whereas others > work better for larger files; might this be the case for bzip2 vs. xz? It doesn't really matter because small files will still require one filesystem block for the majority of users. For people with reiserfs or squashfs this may matter of course. Best regards, Andrew Savchenko [-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression 2014-05-13 2:08 ` Andrew Savchenko @ 2014-05-13 16:33 ` Tom Wijsman 0 siblings, 0 replies; 28+ messages in thread From: Tom Wijsman @ 2014-05-13 16:33 UTC (permalink / raw To: gentoo-dev [-- Attachment #1: Type: text/plain, Size: 1389 bytes --] On Tue, 13 May 2014 06:08:52 +0400 Andrew Savchenko <bircoph@gmail.com> wrote: > 1. How are tools like man or info supposed to work with such a > bundle? They do not expect multiple man/info files in a > single xz bundle. Hmm, true; they would need to be adapted, which involves talking to upstream. Benchmarking and working on this would be a ML thread on its own; it would be nice to check it out, if someone is interested in it. > 2. This will put stress on the decompression procedure: in order to > extract one file, the whole xz bundle will have to be decompressed. This would only work for compression algorithms that would allow you to seek to a specific file to extract; I don't know about xz, but from a quick look at the man page it indeed seems that that isn't possible. > > Some algorithms tend to work better for smaller files, whereas > > others work better for larger files; might this be the case for > > bzip2 vs. xz? > > It doesn't really matter because small files will still require one > filesystem block for the majority of users. For people with reiserfs or > squashfs this may matter of course. Thank you for sharing this insight. -- With kind regards, Tom Wijsman (TomWij) Gentoo Developer E-mail address : TomWij@gentoo.org GPG Public Key : 6D34E57D GPG Fingerprint : C165 AF18 AB4C 400B C3D2 ABF0 95B2 1FCD 6D34 E57D [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 490 bytes --] ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression 2014-05-12 9:35 ` Tom Wijsman 2014-05-13 2:08 ` Andrew Savchenko @ 2014-05-14 3:29 ` Kent Fredric 2014-05-14 16:53 ` Roy Bamford 2 siblings, 0 replies; 28+ messages in thread From: Kent Fredric @ 2014-05-14 3:29 UTC (permalink / raw To: gentoo-dev; +Cc: Michał Górny [-- Attachment #1: Type: text/plain, Size: 807 bytes --] On 12 May 2014 21:35, Tom Wijsman <TomWij@gentoo.org> wrote: > What about putting multiple doc / man / info files in a single .xz file > for each package? > How would one use them if they're installed as a single .xz file per package? Is there a trick that exists to allow this to even work for "man man" ? I'm guessing you *could* have an extension wrapper that handles a symlink to such a file to extract the desired content, but that seems messy. ( ie: less /path/to/bar.polyxz => polyxz /path/to/bar.polyxz => polyxz reads symlink target, decodes the .xz file and returns "bar" from it based on the symlink source name ) Though I'd imagine that would mitigate the marginal savings made by unifying them as a single file by needing extra file system allocations to house the symlinks. -- Kent [-- Attachment #2: Type: text/html, Size: 1473 bytes --] ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression 2014-05-12 9:35 ` Tom Wijsman 2014-05-13 2:08 ` Andrew Savchenko 2014-05-14 3:29 ` Kent Fredric @ 2014-05-14 16:53 ` Roy Bamford 2014-05-14 17:59 ` Rich Freeman 2 siblings, 1 reply; 28+ messages in thread From: Roy Bamford @ 2014-05-14 16:53 UTC (permalink / raw To: gentoo-dev [-- Attachment #1: Type: text/plain, Size: 1541 bytes --] On 2014.05.12 10:35, Tom Wijsman wrote: > On Sun, 11 May 2014 19:46:50 +0200 > Michał Górny <mgorny@gentoo.org> wrote: > > > Rationale: xz-utils is quite widespread nowadays and it is a part > > of @system set. It can achieve better compression ratio than bzip2, > > and faster decompression at the same time. > > Some thoughts: > > What about putting multiple doc / man / info files in a single .xz > file > for each package? Would that further improve the situation? > > (As they can share dictionary, instead of having multiple > dictionaries) > > Some algorithms tend to work better for smaller files, whereas others > work better for larger files; might this be the case for bzip2 vs. > xz? > > -- > With kind regards, > > Tom Wijsman (TomWij) > Gentoo Developer > > E-mail address : TomWij@gentoo.org > GPG Public Key : 6D34E57D > GPG Fingerprint : C165 AF18 AB4C 400B C3D2 ABF0 95B2 1FCD 6D34 E57D > Some more thoughts ... What about not compressing files smaller than the filesystem block size at all? In my case it's 4k. Any file gets allocated 4k on disc anyway, so compression/decompression is just a waste of resources for files <=4k. I'm not suggesting dynamically determining the output filesystem block size (unless you really want to); just choose a static limit below which compression will not be applied. That eliminates the discussion about small files. -- Regards, Roy Bamford (Neddyseagoon) a member of elections gentoo-ops forum-mods trustees [-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --] ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [gentoo-dev] RFC: using .xz for doc/man/info compression 2014-05-14 16:53 ` Roy Bamford @ 2014-05-14 17:59 ` Rich Freeman 0 siblings, 0 replies; 28+ messages in thread From: Rich Freeman @ 2014-05-14 17:59 UTC (permalink / raw To: gentoo-dev On Wed, May 14, 2014 at 12:53 PM, Roy Bamford <neddyseagoon@gentoo.org> wrote: > What about not compressing files smaller than the filesystem block size > at all? In my case it's 4k. Any file gets allocated 4k on disc anyway, > so compression/decompression is just a waste of resources for files > <=4k. > > I'm not suggesting dynamically determining the output filesystem block > size (unless you really want to); just choose a static limit below which > compression will not be applied. > > That eliminates the discussion about small files. See the existing discussion around this very topic: for some filesystems that threshold is apparently as low as about 150 bytes. Rich ^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2014-05-14 17:59 UTC | newest] Thread overview: 28+ messages 2014-05-11 17:46 [gentoo-dev] RFC: using .xz for doc/man/info compression Michał Górny 2014-05-11 19:37 ` Alexander Tsoy 2014-05-11 21:27 ` Pacho Ramos 2014-05-11 23:26 ` Gordon Pettey 2014-05-12 10:47 ` Alexander Tsoy 2014-05-12 10:55 ` Alexander Tsoy 2014-05-12 12:17 ` Tom Wijsman 2014-05-12 12:40 ` Alexander Tsoy 2014-05-12 22:55 ` Gordon Pettey 2014-05-13 5:01 ` Andrew Savchenko 2014-05-13 5:55 ` Ulrich Mueller 2014-05-13 11:01 ` Andrew Savchenko 2014-05-13 12:18 ` Rich Freeman 2014-05-13 13:42 ` Ulrich Mueller 2014-05-14 13:42 ` Andreas K. Huettel 2014-05-14 14:01 ` Ulrich Mueller 2014-05-13 17:27 ` [gentoo-dev] " Duncan 2014-05-14 2:38 ` [gentoo-dev] " Andrew Savchenko 2014-05-14 13:16 ` vivo75 2014-05-12 9:31 ` Marcin Mirosław 2014-05-12 9:45 ` Tom Wijsman 2014-05-12 3:24 ` Samuli Suominen 2014-05-12 9:35 ` Tom Wijsman 2014-05-13 2:08 ` Andrew Savchenko 2014-05-13 16:33 ` Tom Wijsman 2014-05-14 3:29 ` Kent Fredric 2014-05-14 16:53 ` Roy Bamford 2014-05-14 17:59 ` Rich Freeman