public inbox for gentoo-dev@lists.gentoo.org
* [gentoo-dev] [pre-GLEP] Gentoo binary package container format
@ 2018-11-17 11:21 Michał Górny
From: Michał Górny @ 2018-11-17 11:21 UTC
  To: gentoo-dev


Hi,

Here's a pre-GLEP draft based on the earlier discussion on the
gentoo-portage-dev mailing list.  The specification uses the GLEP form
because it provides for cleanly specifying the motivation and rationale.

(Note: the number assignment is not official; I just took the next
number to satisfy the GLEP converter script.)

Also available via HTTPS:

rst:  https://dev.gentoo.org/~mgorny/tmp/glep-0078.rst
html: https://dev.gentoo.org/~mgorny/tmp/glep-0078.html

---
GLEP: 78
Title: Gentoo binary package container format
Author: Michał Górny <mgorny@gentoo.org>
Type: Standards Track
Status: Draft
Version: 1
Created: 2018-11-15
Last-Modified: 2018-11-16
Post-History: 2018-11-17
Content-Type: text/x-rst
---

Abstract
========

This GLEP proposes a new binary package container format for Gentoo.
The current tbz2/XPAK format is briefly described, and its deficiencies
are listed.  Accordingly, the requirements for a new format are set
and a gpkg format satisfying them is proposed.  The rationale for
various design decisions is provided.


Motivation
==========

The current Portage binary package format
-----------------------------------------

The historical ``.tbz2`` binary package format used by Portage is
a concatenation of two distinct formats: a header-oriented compressed
.tar format (used to hold package files) and a trailer-oriented custom
XPAK format (used to hold metadata) [#MAN-XPAK]_.  The format has
already been extended incompatibly twice.

The first time, support for storing multiple successive builds of
a binary package for a single ebuild version was added.  This feature
relies on appending a hyphen followed by an integer to the package
filename.  It is disabled by default (preserving backwards
compatibility) and controlled by the ``binpkg-multi-instance`` feature.

The second time, support for additional compression formats was
added.  When a format other than bzip2 is used, the ``.tbz2`` suffix
is replaced by ``.xpak`` and Portage relies on magic bytes to detect
the compression used.  For backwards compatibility, Portage still
defaults to bzip2; the compression program can be switched using
the ``BINPKG_COMPRESS`` configuration variable.

Additionally, there have been minor changes to the stored metadata
and file storage policies.  In particular, behavior regarding
``INSTALL_MASK``, controllable file compression and stripping has
changed over time.


Problems with the current binary package format
-----------------------------------------------

The following problems were identified with the package format currently
in use:

1. **The packages rely on a custom binary archive format to store
   metadata.**  It is entirely Gentoo-invented and requires dedicated
   tooling to work with.  In fact, the reference implementation
   in Portage does not even include a CLI tool to work with tbz2
   packages; an unofficial implementation is provided as part
   of the portage-utils toolkit [#PORTAGE-UTILS]_.

2. **The format relies on an obscure compressor feature of ignoring
   trailing garbage.**  While this behavior is traditionally implemented
   by many compressors, the original reasons for it have long become
   irrelevant, and it is not surprising that new compressors do not
   support it.  In particular, Portage has already hit this problem
   twice: once when users replaced bzip2 with the parallel-capable
   pbzip2 implementation [#PBZIP2]_, and again when support for
   the zstd compressor was added [#ZSTD]_.

3. **Placing metadata at the end of the file makes partial fetches
   complex.**  While it is technically possible to obtain package
   metadata remotely without fetching the whole package, it usually
   requires 2-3 HTTP requests and a rather complex driver.  For
   comparison, if metadata were placed at the beginning of the file,
   an early-terminated pipeline with a single fetch request would
   suffice.

4. **Extending the format with OpenPGP signatures is non-trivial.**
   Depending on the implementation details, it requires either fetching
   an additional detached signature, breaking backwards compatibility,
   or introducing more custom logic to reassemble OpenPGP packets.

5. **Metadata is not compressed.**  This is not a significant problem;
   it is listed only for completeness.


Goals for a new container format
--------------------------------

The following goals have been set for a replacement format:

1. **The packages must remain contained in a single file.**  As a matter
   of user convenience, it should be possible to transfer binary
   packages without having to use multiple files, and to install them
   from any location.

2. **The file format must be entirely based on common file formats,
   respecting best practices, with as little customization as necessary
   to satisfy the requirements.**  In particular, it is unacceptable
   to create new binary formats.

3. **The file format should provide for partial fetching of binary
   packages.**  It should be possible to easily fetch and read
   the package metadata without having to download the whole package.

4. **The file format must provide support for OpenPGP signatures.**
   Preferably, it should use standard OpenPGP message formats.

5. **The file format must allow for efficient metadata updates.**
   In particular, it should be possible to update the metadata without
   having to recompress package files.

6. **The file format should account for easy recognition both through
   filename and through contents.**  Preferably, it should have distinct
   features making it possible to detect it via file(1).

7. **The file format should allow for metadata compression.**

8. **The file format should make future extensions easily possible
   without breaking backwards compatibility.**


Specification
=============

The container format
--------------------

The gpkg package container is an uncompressed .tar archive whose
filename uses the ``.gpkg.tar`` suffix.  This archive contains
the following members, in order:

1. A volume label: ``gpkg: ${full_package_identifier}`` (optional).

2. A signature for the metadata archive: ``metadata.tar${comp}.sig``
   (optional).

3. The metadata archive ``metadata.tar${comp}``, optionally compressed
   (required).

4. A signature for the filesystem image archive:
   ``image.tar${comp}.sig`` (optional).

5. The filesystem image archive ``image.tar${comp}``, optionally
   compressed (required).

It is recommended that the relative order of the archive members is
preserved.  However, implementations must support archives with members
out of order.

The container may be extended with additional members in the future.
The implementations should ignore unrecognized members and preserve
them across package updates.
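
As an illustration only, the member layout listed above can be sketched
with Python's standard ``tarfile`` module.  The helper names
(``make_inner_tar``, ``make_gpkg``) and the sample package contents are
assumptions made for this example and are not part of the specification;
the optional volume label, signatures and compression are left out for
brevity.

```python
import io
import tarfile


def make_inner_tar(top_dir, files):
    """Return the bytes of an uncompressed tar holding files under top_dir/."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(f"{top_dir}/{name}")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def make_gpkg(metadata, image):
    """Return the bytes of a container with the metadata member first."""
    members = [
        ("metadata.tar", make_inner_tar("metadata", metadata)),
        ("image.tar", make_inner_tar("image", image)),
    ]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as outer:
        for name, payload in members:
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            outer.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


# Hypothetical sample package, used only to exercise the helpers.
gpkg = make_gpkg({"CATEGORY": b"app-misc\n"}, {"usr/bin/hello": b"#!/bin/sh\n"})
with tarfile.open(fileobj=io.BytesIO(gpkg)) as outer:
    member_names = outer.getnames()
print(member_names)
```

The inner archives are built in memory here; a real implementation would
more likely stream them from temporary files.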


The volume label
----------------

The volume label provides an easy way for users to identify the binary
package without dedicated tooling or specific format knowledge.

The implementations should include a volume label consisting of
the fixed string ``gpkg:``, followed by a single space, followed by
the full package identifier.  However, the implementations must not
rely on the volume label being present or attempt to parse its value
when it is.

Furthermore, since the volume label is included in the .tar archive
as the first member, it provides a magic string at a fixed location
that can be used by tools such as file(1) to easily distinguish Gentoo
binary packages from regular .tar archives.
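
A rough sketch of such a content-based check follows.  It assumes that
the label is written as a GNU tar volume header (typeflag ``V``)
occupying the first 512-byte block, with the label text in the name
field; the function name ``looks_like_gpkg`` and the sample package
identifier are made up for the example.  Since the label is optional,
a negative result does not prove that a file is not a gpkg package.

```python
import tarfile

BLOCK = 512          # size of one tar header block
MAGIC = b"gpkg: "    # fixed string plus the single space


def looks_like_gpkg(first_block: bytes) -> bool:
    """Check the first tar header block for a gpkg volume label."""
    if len(first_block) < BLOCK:
        return False
    name = first_block[:100]          # name field of the tar header
    typeflag = first_block[156:157]   # typeflag field of the tar header
    return typeflag == b"V" and name.startswith(MAGIC)


# Build two sample headers with tarfile to exercise the check.
label = tarfile.TarInfo("gpkg: app-misc/hello-1.0")
label.type = b"V"  # assumed: GNU volume header typeflag
labelled = label.tobuf(format=tarfile.GNU_FORMAT)
plain = tarfile.TarInfo("README").tobuf(format=tarfile.GNU_FORMAT)

print(looks_like_gpkg(labelled), looks_like_gpkg(plain))
```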


The metadata archive
--------------------

The metadata archive stores the package metadata needed for the package
manager to process it.  The archive should be included at the beginning
of the binary package in order to make it possible to read it out of
a partially fetched binary package, and to avoid fetching the remaining
part of the package if not necessary.

The archive contains a single directory called ``metadata``.  In this
directory, the individual metadata keys are stored as files.  The exact
keys and metadata format are outside the scope of this specification.
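
To make the layout concrete, here is a minimal sketch of reading
the keys back with Python's ``tarfile`` module.  The function name
``read_metadata_keys`` is hypothetical, and the two sample keys are
chosen only for the demonstration, since the specification does not
define the key set.

```python
import io
import tarfile


def read_metadata_keys(metadata_archive: bytes) -> dict:
    """Map each file under metadata/ to its raw contents."""
    keys = {}
    # Mode "r" lets tarfile detect gzip, bzip2 or xz compression itself.
    with tarfile.open(fileobj=io.BytesIO(metadata_archive), mode="r") as tar:
        for member in tar:
            if member.isfile() and member.name.startswith("metadata/"):
                key = member.name[len("metadata/"):]
                keys[key] = tar.extractfile(member).read()
    return keys


# Build a tiny bzip2-compressed sample archive for the demo.
sample = io.BytesIO()
with tarfile.open(fileobj=sample, mode="w:bz2") as tar:
    for key, value in {"CATEGORY": b"app-misc\n", "PF": b"hello-1.0\n"}.items():
        info = tarfile.TarInfo(f"metadata/{key}")
        info.size = len(value)
        tar.addfile(info, io.BytesIO(value))

keys = read_metadata_keys(sample.getvalue())
print(sorted(keys))
```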

The package manager may need to modify the package metadata.  In this
case, it should replace the metadata archive without having to alter
other package members.

The metadata archive can optionally be compressed.  It can also be
supplemented with a detached OpenPGP signature.


The image archive
-----------------

The image archive stores all the files to be installed by the binary
package.  It should be included as the last of the files in the binary
package container.

The archive contains a single directory called ``image``.  Inside this
directory, all package files are stored in filesystem layout, relative
to the root directory.

The image archive can optionally be compressed.  It can also be
supplemented with a detached OpenPGP signature.


Archive member compression
--------------------------

The archive members outlined above support optional compression using
one of the compressed file formats supported by the package manager.
The exact list of compression types is outside the scope of this
specification.

The implementations must support archive members being uncompressed,
and must support using different compression types for different files.

When compressing an archive member, the member filename should be
suffixed using the standard suffix for the particular compressed file
type (e.g. ``.bz2`` for bzip2 format).
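
The suffix rule can be sketched as follows.  The three formats in
the table are an assumption picked because Python's standard library
supports them; the specification deliberately leaves the supported list
to the package manager.  ``pack_member`` and ``unpack_member`` are
hypothetical helper names.

```python
import bz2
import gzip
import lzma

# Suffix -> (compress, decompress).  Assumed set of formats for the demo.
CODECS = {
    ".gz": (gzip.compress, gzip.decompress),
    ".bz2": (bz2.compress, bz2.decompress),
    ".xz": (lzma.compress, lzma.decompress),
}


def pack_member(base_name, data, suffix=""):
    """Return (member filename, payload) for an optionally compressed member."""
    if not suffix:
        return base_name, data
    compress, _ = CODECS[suffix]
    return base_name + suffix, compress(data)


def unpack_member(name, payload):
    """Undo pack_member() by looking at the filename suffix."""
    for suffix, (_, decompress) in CODECS.items():
        if name.endswith(suffix):
            return decompress(payload)
    return payload  # no known suffix: treated as uncompressed


# Different members of one package may use different compression types.
meta_name, meta_payload = pack_member("metadata.tar", b"m" * 1000, ".bz2")
image_name, image_payload = pack_member("image.tar", b"i" * 1000, ".xz")
print(meta_name, image_name)
```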


OpenPGP member signatures
-------------------------

The archive members support optional OpenPGP signatures.
The implementations must allow the user to specify whether OpenPGP
signatures are to be expected in remotely fetched packages.

If the signatures are expected and the archive member is unsigned, the
package manager must reject processing it.  If the signature does not
verify, the package manager must reject processing the corresponding
archive member.  In particular, it must not attempt decompressing
compressed members in those circumstances.

If the implementation needs to manipulate archive members, it must
either create a new signature or discard the existing signature.

The signatures are created as binary detached OpenPGP signature files,
with filename corresponding to the member filename with ``.sig`` suffix
appended.
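
The acceptance rules above can be summarised in a short sketch.  It
reads the second rule as applying whenever a signature is present,
whether or not signatures are expected; that reading is an assumption.
The actual OpenPGP verification is passed in as a callable, and
``fake_verify`` below is only a stand-in for the demonstration, not
cryptography.  All names are hypothetical.

```python
class RejectedMember(Exception):
    """Raised when a member must not be processed."""


def checked_member(name, members, signatures_expected, verify):
    """Return the payload of `name` only if the signature policy allows it.

    members: dict mapping member filenames to their bytes
    verify:  callable(payload, signature) -> bool, supplied by the caller
    """
    payload = members[name]
    signature = members.get(name + ".sig")
    if signature is None:
        if signatures_expected:
            raise RejectedMember(f"{name}: unsigned but signature expected")
        return payload
    if not verify(payload, signature):
        raise RejectedMember(f"{name}: signature does not verify")
    return payload  # only now may the caller decompress it


def fake_verify(payload, signature):
    """Demo stand-in: accepts a fixed marker instead of a real signature."""
    return signature == b"good"


members = {
    "metadata.tar.bz2": b"metadata bytes",
    "metadata.tar.bz2.sig": b"good",
    "image.tar.bz2": b"image bytes",  # deliberately unsigned
}
accepted = checked_member("metadata.tar.bz2", members, True, fake_verify)
print(accepted)
```

The check runs on the compressed bytes, so a rejected member is never
handed to a decompressor.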


Rationale
=========

Nested archive format
---------------------

The basic problem in designing the new format was how to embed multiple
data streams (metadata, image) into a single file.  Traditionally, this
has been done by combining two non-conflicting file formats.  However,
while such a solution is clever, it suffers in terms of transparency.

Therefore, it has been established that the new format should really
consist of a single archive format, with all necessary data
transparently accessible inside the file.  Consequently, it has been
debated how different parts of binary package data should be stored
inside that archive.

The proposal to continue storing image data as top-level data
in the package format, and to store metadata as a special directory
in that structure, was discarded as a case of in-band signalling.

Finally, the proposal has been shaped to store different kinds of data
as nested archives in the outer binary package container.  Besides
providing a clean way of accessing different kinds of information, it
makes it possible to add separate OpenPGP signatures to them.


Inner vs. outer compression
---------------------------

One of the points in the new format debate was whether the binary
package should be compressed as a whole, or whether its individual
members should be compressed.  The first option may seem an obvious
choice, especially given that compression tends to be more effective
on a larger data set.  However, it has a single strong disadvantage:
compression prevents random access and manipulation of the binary
package members.

For the purpose of reading binary packages, the problem could be
circumvented through convenient member ordering and avoiding disjoint
reads of the binary package.  Metadata updates, however, would require
either recompressing the whole package (which could be really
time-consuming with large packages) or applying complex techniques such
as splitting the compressed archive into multiple compressed streams.

This considered, the simplest solution is to apply compression to
the individual package members, while leaving the container format
uncompressed.  It provides fast random access to the individual members,
as well as the capability of updating them without the necessity of
recompressing other files in the container.
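
A minimal sketch of such an update is given below.  It copies every
other member byte for byte, so the compressed image is never
decompressed, and it drops the old metadata signature because that
signature no longer matches.  The helper names are hypothetical,
the new metadata is assumed to be stored uncompressed, and the member
payloads in the demo are placeholder bytes rather than real archives.

```python
import io
import tarfile


def add_bytes(tar, name, payload):
    """Append one regular-file member holding `payload`."""
    info = tarfile.TarInfo(name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def replace_metadata(container, new_metadata):
    """Return a new container with only the metadata member replaced."""
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(container)) as src, \
            tarfile.open(fileobj=out, mode="w") as dst:
        for member in src:
            if member.name.startswith("metadata.tar"):
                if not member.name.endswith(".sig"):
                    add_bytes(dst, "metadata.tar", new_metadata)
                # The old metadata signature is stale, so it is dropped.
            else:
                # Copied verbatim: no decompression, no recompression.
                dst.addfile(member, src.extractfile(member))
    return out.getvalue()


def read_members(container):
    """Map member names to payloads, preserving archive order."""
    with tarfile.open(fileobj=io.BytesIO(container)) as tar:
        return {m.name: tar.extractfile(m).read() for m in tar}


old = io.BytesIO()
with tarfile.open(fileobj=old, mode="w") as tar:
    add_bytes(tar, "metadata.tar.gz.sig", b"old signature")
    add_bytes(tar, "metadata.tar.gz", b"old metadata")
    add_bytes(tar, "image.tar.xz", b"compressed image bytes")

before = read_members(old.getvalue())
after = read_members(replace_metadata(old.getvalue(), b"new metadata"))
print(list(after))
```

This sketch rewrites the container into a new buffer for simplicity;
the point is only that the image member passes through untouched.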

This also makes it possible to easily protect compressed files using
the standard OpenPGP detached signature format.  All this combined,
the package manager may perform a partial fetch of a binary package,
verify the signature of its metadata member and process it without
having to fetch the potentially large image part.


Container and archive formats
-----------------------------

During the debate, the actual archive formats to use were considered.
The .tar format seemed an obvious choice for the image archive since
it is the only widely deployed archive format that stores all kinds
of file metadata on POSIX systems.  However, multiple options for
the outer format were debated.

Firstly, the ZIP format was proposed as the only commonly supported
format that allows adding files from stdin (i.e. making it possible to
pipe the inner archives straight into the container without using
temporary files).  However, this format was clearly rejected because
it is not present in the system set, and because it is trailer-based
and therefore unusable without fetching the whole file.

Secondly, the ar and cpio formats were considered.  The former is used
for binary packages by Debian and its derivatives; the latter is used
by Red Hat derivatives.  Both formats have the advantage of having less
historical baggage than .tar, and less overhead.  However, both are
also rather obscure (especially given that ar is actually provided
by GNU binutils rather than as a stand-alone archiver), both are
considered obsolete by POSIX, and both have file size limitations
smaller than .tar.

All that considered, it has been decided that there is no purpose
in using a second archive format in the specification unless it has
a significant advantage over .tar.  Therefore, .tar has also been used
as the outer package format, even though it has larger overhead than
other formats (mostly due to padding).


Member ordering
---------------

The member ordering is explicitly specified in order to provide for
trivially reading metadata from partially fetched archives.
By requiring the metadata archive to be stored before the image archive,
the package manager may stop fetching after reading it and save
bandwidth and/or space.
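
The effect of the ordering can be sketched with ``tarfile`` in stream
mode, which reads strictly forwards much like an HTTP response body
would be consumed.  ``CountingReader`` and ``fetch_metadata`` are
hypothetical names, the in-memory buffer stands in for a network
stream, and the member payloads are placeholder bytes.

```python
import io
import os
import tarfile


class CountingReader:
    """Minimal file-like wrapper that counts the bytes handed out."""

    def __init__(self, raw):
        self.raw = raw
        self.consumed = 0

    def read(self, size=-1):
        data = self.raw.read(size)
        self.consumed += len(data)
        return data


def add_bytes(tar, name, payload):
    """Append one regular-file member holding `payload`."""
    info = tarfile.TarInfo(name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))


def fetch_metadata(stream):
    """Return the metadata member's payload, reading no further than needed."""
    # "r|" makes tarfile read strictly forwards, as from a pipe or socket.
    with tarfile.open(fileobj=stream, mode="r|") as outer:
        for member in outer:
            if member.name.startswith("metadata.tar") \
                    and not member.name.endswith(".sig"):
                return outer.extractfile(member).read()
    return None


# Demo container: small metadata first, 1 MiB of image data last.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    add_bytes(tar, "metadata.tar", b"metadata payload")
    add_bytes(tar, "image.tar", os.urandom(1024 * 1024))
container = buf.getvalue()

reader = CountingReader(io.BytesIO(container))
metadata = fetch_metadata(reader)
print(reader.consumed, "of", len(container), "bytes read")
```

With the members in the opposite order, the same loop would have to
read past the whole image before reaching the metadata.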


Detached OpenPGP signatures
---------------------------

Detached OpenPGP signatures are used to provide authenticity checks
for binary packages.  Covering the complete members with signatures
provides for trivial verification of all metadata and image contents
respectively, without having to invent custom mechanisms for combining
them.  Covering the compressed archives helps to prevent zip bomb
attacks.  Covering the individual members rather than the whole package
provides for verification of partially fetched binary packages.


Backwards Compatibility
=======================

The format does not preserve backwards compatibility with the tbz2
packages.  It has been established that preserving compatibility with
the old format was impossible without making the new format even worse
than the old one was.

For example, adding any visible members to the tarball would cause
them to be installed to the filesystem by old Portage versions.  Working
around this would require awful hacks that would go against the goal
of using a simple and transparent package format.


Reference Implementation
========================

A proof-of-concept implementation of a binary package format converter
is available as xpak2gpkg [#XPAK2GPKG]_.  It can be used to easily
create packages in the new format for early inspection.


References
==========

.. [#MAN-XPAK] xpak - The XPAK Data Format used with Portage binary
   packages
   (https://dev.gentoo.org/~zmedico/portage/doc/man/xpak.5.html)

.. [#PORTAGE-UTILS] portage-utils: Small and fast Portage helper tools
   written in C
   (https://packages.gentoo.org/packages/app-portage/portage-utils)

.. [#PBZIP2] PBZIP2 - a parallel implementation of the bzip2
   block-sorting file compressor
   (https://launchpad.net/pbzip2)

.. [#ZSTD] Zstandard - Real-time data compression algorithm
   (https://facebook.github.io/zstd/)

.. [#XPAK2GPKG] xpak2gpkg: Proof-of-concept converter from tbz2/xpak
   to gpkg binpkg format
   (https://github.com/mgorny/xpak2gpkg)


Copyright
=========
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
Unported License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.

-- 
Best regards,
Michał Górny

