Message-ID: <1542652504.26086.4.camel@gentoo.org>
Subject: Re: [gentoo-dev] [pre-GLEP r1] Gentoo binary package container format
From: Michał Górny
To: gentoo-dev@lists.gentoo.org
Date: Mon, 19 Nov 2018 19:35:04 +0100
In-Reply-To: <1542453700.31427.2.camel@gentoo.org>

Hi,

On Sat, 2018-11-17 at 12:21
+0100, Michał Górny wrote:
> Here's a pre-GLEP draft based on the earlier discussion on
> gentoo-portage-dev mailing list. The specification uses GLEP form as
> it provides for cleanly specifying the motivation and rationale.

Changes in -r1: took into account the feedback and restructured the
motivation into pointing out advantages of the existing format, and
focusing on the two real issues of non-transparency and OpenPGP
implementation deficiencies. Also added a section on why there's no
explicit version number.

> Also available via HTTPS:
>
> rst: https://dev.gentoo.org/~mgorny/tmp/glep-0078.rst
> html: https://dev.gentoo.org/~mgorny/tmp/glep-0078.html
>

---
GLEP: 9999
Title: Gentoo binary package container format
Author: Michał Górny
Type: Standards Track
Status: Draft
Version: 1
Created: 2018-11-15
Last-Modified: 2018-11-16
Post-History: 2018-11-17
Content-Type: text/x-rst
---

Abstract
========

This GLEP proposes a new binary package container format for Gentoo.
The current tbz2/XPAK format is briefly described, and its deficiencies
are explained. Accordingly, the requirements for a new format are set
and a gpkg format satisfying them is proposed. The rationale for
the design decisions is provided.


Motivation
==========

The current Portage binary package format
-----------------------------------------

The historical ``.tbz2`` binary package format used by Portage is
a concatenation of two distinct formats: header-oriented compressed .tar
format (used to hold package files) and trailer-oriented custom XPAK
format (used to hold metadata) [#MAN-XPAK]_.

The format has already been extended incompatibly twice.

The first time, support for storing multiple successive builds of binary
package for a single ebuild version has been added. This feature relies
on appending an additional hyphen, followed by an integer, to the package
filename.
It is disabled by default (preserving backwards compatibility)
and controlled by the ``binpkg-multi-instance`` feature.

The second time, support for additional compression formats has been
added. When a format other than bzip2 is used, the ``.tbz2`` suffix is
replaced by ``.xpak`` and Portage relies on magic bytes to detect
the compression used. For backwards compatibility, Portage still
defaults to using bzip2; the compression program can be switched using
the ``BINPKG_COMPRESS`` configuration variable.

Additionally, there have been minor changes to the stored metadata
and file storage policies. In particular, behavior regarding
``INSTALL_MASK``, controllable file compression and stripping has
changed over time.


The advantages of tbz2/XPAK format
----------------------------------

The tbz2/XPAK format used by Portage has three interesting features:

1. **Each binary package is fully contained within a single file.**
   While this might seem unnecessary, it makes it easier for the user
   to transfer binary packages without having to be concerned about
   finding all the necessary files to transfer.

2. **The binary packages are compatible with regular compressed
   tarballs, most of the time.** With notable exceptions of historical
   versions of pbzip2 and the recent zstd compressor, tbz2/XPAK
   packages can be extracted using regular tar utility with
   a compressor implementation that discards trailing garbage.

3. **The metadata is uncompressed, and can be efficiently accessed
   without decompressing package contents.** This includes
   the possibility of rewriting it (e.g. as a result of package moves)
   without the necessity of repacking the files.


Transparency problem with the current binary package format
-----------------------------------------------------------

Notwithstanding its advantages, the tbz2/XPAK format has a significant
design fault that consists of two issues:

1. **The XPAK format is a custom binary format with explicit use
   of binary-encoded file offsets and field lengths.** As such, it is
   non-trivial to read or edit without specialized tools. Such tools
   are currently implemented separately from the package manager,
   as part of the portage-utils toolkit, written in C
   [#PORTAGE-UTILS]_.

2. **The tarball compatibility feature relies on the obscure feature
   of ignoring trailing garbage in compressed files.** While this is
   implemented consistently in most of the compressors, this feature
   is not really a part of the specification but rather traditional
   behavior. Given that the original reasons for this no longer apply,
   new compressor implementations are likely to miss support for this.

Both of the issues make the format hard to use without dedicated tools,
or when the tools misbehave. This impacts the following scenarios:

A. **Using binary packages for system recovery.** In case of serious
   breakage, it is really preferable that the format depends on as few
   tools as possible, and especially not on Gentoo-specific tools.

B. **Inspecting binary packages in detail exceeding standard package
   manager facilities.**

C. **Modifying binary packages in ways not predicted by the package
   manager authors.** A real-life example of this is working around
   broken ``pkg_*`` phases which prevent the package from being
   installed.


OpenPGP extensibility problem
-----------------------------

There are at least three obvious ways in which the current format could
be extended to support OpenPGP signatures, and each of them has its own
distinct problem:

1. **Adding a detached signature.** This option is non-intrusive
   but causes the format to no longer be contained in a single file.

2. **Wrapping the package in OpenPGP message format.** This would use
   a standard format and make verification and unpacking relatively
   easy. However, it would break backwards compatibility and add
   explicit dependency on OpenPGP implementation in order to unpack
   the package.

3. **Adding OpenPGP signature as extra XPAK member.** This is
   the clever solution. It implies strengthening the dependency
   on custom tooling, now additionally necessary to extract
   the signature and reconstruct the original file to accommodate
   verification.


Goals for a new container format
--------------------------------

All of the above considered, the new format should combine
the advantages of the existing format and at the same time address its
deficiencies whenever possible. Furthermore, since a format replacement
is taking place it is worthwhile to consider additional goals that could
be satisfied with little change.

The following obligatory goals have been set for a replacement format:

1. **The packages must remain contained in a single file.** As a matter
   of user convenience, it should be possible to transfer binary
   packages without having to use multiple files, and to install them
   from any location.

2. **The file format must be entirely based on common file formats,
   respecting best practices, with as little customization as necessary
   to satisfy the requirements.** The format should be transparent
   enough to let the user inspect and manipulate it without special
   tooling or detailed knowledge.

3. **The file format must provide support for OpenPGP signatures.**
   Preferably, it should use standard OpenPGP message formats.

4. **The file format must allow for efficient metadata updates.**
   In particular, it should be possible to update the metadata without
   having to recompress package files.

Additionally, the following optional goals have been noted:

A. **The file format should account for easy recognition both through
   filename and through contents.** Preferably, it should have distinct
   features making it possible to detect it via file(1).

B. **The file format should provide for partial fetching of binary
   packages.** It should be possible to easily fetch and read
   the package metadata without having to download the whole package.

C. **The file format should allow for metadata compression.**

D. **The file format should make future extensions easily possible
   without breaking backwards compatibility.**


Specification
=============

The container format
--------------------

The gpkg package container is an uncompressed .tar archive whose
filename uses ``.gpkg.tar`` suffix. This archive contains the following
members, in order:

1. A volume label: ``gpkg: ${full_package_identifier}`` (optional).

2. A signature for the metadata archive: ``metadata.tar${comp}.sig``
   (optional).

3. The metadata archive ``metadata.tar${comp}``, optionally compressed
   (required).

4. A signature for the filesystem image archive:
   ``image.tar${comp}.sig`` (optional).

5. The filesystem image archive ``image.tar${comp}``, optionally
   compressed (required).

It is recommended that relative order of the archive members is
preserved. However, implementations must support archives with members
out of order.

The container may be extended with additional members in the future.
The implementations should ignore unrecognized members and preserve
them across package updates.


The volume label
----------------

The volume label provides an easy way for users to identify the binary
package without dedicated tooling or specific format knowledge.

The implementations should include a volume label consisting of fixed
string ``gpkg:``, followed by a single space, followed by full package
identifier. However, the implementations must not rely on the volume
label being present or attempt to parse its value when it is.

Furthermore, since the volume label is included in the .tar archive
as the first member, it provides a magic string at a fixed location
that can be used by tools such as file(1) to easily distinguish Gentoo
binary packages from regular .tar archives.


The metadata archive
--------------------

The metadata archive stores the package metadata needed for the package
manager to process it.
The archive should be included at the beginning of the binary package
in order to make it possible to read it out of a partially fetched
binary package, and to avoid fetching the remaining part of the package
if not necessary.

The archive contains a single directory called ``metadata``. In this
directory, the individual metadata keys are stored as files. The exact
keys and metadata format are outside the scope of this specification.

The package manager may need to modify the package metadata. In this
case, it should replace the metadata archive without having to alter
other package members.

The metadata archive can optionally be compressed. It can also be
supplemented with a detached OpenPGP signature.


The image archive
-----------------

The image archive stores all the files to be installed by the binary
package. It should be included as the last of the files in the binary
package container.

The archive contains a single directory called ``image``. Inside this
directory, all package files are stored in filesystem layout, relative
to the root directory.

The image archive can optionally be compressed. It can also be
supplemented with a detached OpenPGP signature.


Archive member compression
--------------------------

The archive members outlined above support optional compression using
one of the compressed file formats supported by the package manager.
The exact list of compression types is outside the scope of this
specification.

The implementations must support archive members being uncompressed,
and must support using different compression types for different files.

When compressing an archive member, the member filename should be
suffixed using the standard suffix for the particular compressed file
type (e.g. ``.bz2`` for bzip2 format).


OpenPGP member signatures
-------------------------

The archive members support optional OpenPGP signatures.
The implementations must allow the user to specify whether OpenPGP
signatures are to be expected in remotely fetched packages.

If the signatures are expected and the archive member is unsigned,
the package manager must reject processing it. If the signature does
not verify, the package manager must reject processing
the corresponding archive member. In particular, it must not attempt
decompressing compressed members in those circumstances.

If the implementation needs to manipulate archive members, it must
either create a new signature or discard the existing signature.

The signatures are created as binary detached OpenPGP signature files,
with filename corresponding to the member filename with ``.sig`` suffix
appended.


Rationale
=========

Nested archive format
---------------------

The basic problem in designing the new format was how to embed multiple
data streams (metadata, image) into a single file. Traditionally, this
has been done by using two non-conflicting file formats. However,
while such a solution is clever, it suffers in terms of transparency.

Therefore, it has been established that the new format should really
consist of a single archive format, with all necessary data
transparently accessible inside the file. Consequently, it has been
debated how different parts of binary package data should be stored
inside that archive.

The proposal to continue storing image data as top-level data
in the package format, and store metadata as a special directory
in that structure has been discarded as a case of in-band signalling.

Finally, the proposal has been shaped to store different kinds of data
as nested archives in the outer binary package container. Besides
providing a clean way of accessing different kinds of information, it
makes it possible to add separate OpenPGP signatures to them.


Inner vs. outer compression
---------------------------

One of the points in the new format debate was whether the binary
package as a whole should be compressed vs. compressing individual
members. The first option may seem an obvious choice, especially given
that with a larger data set, the compression may proceed more
effectively. However, it has a single strong disadvantage: compression
prevents random access and manipulation of the binary package members.

While for the purpose of reading binary packages, the problem could be
circumvented through convenient member ordering and avoiding disjoint
reads of the binary package, metadata updates would either require
recompressing the whole package (which could be really time consuming
with large packages) or applying complex techniques such as splitting
the compressed archive into multiple compressed streams.

This considered, the simplest solution is to apply compression
to the individual package members, while leaving the container format
uncompressed. It provides fast random access to the individual
members, as well as the capability of updating them without
the necessity of recompressing other files in the container.

This also makes it possible to easily protect compressed files using
standard OpenPGP detached signature format. All this combined,
the package manager may perform partial fetch of binary package, verify
the signature of its metadata member and process it without having
to fetch the potentially-large image part.


Container and archive formats
-----------------------------

During the debate, the actual archive formats to use were considered.
The .tar format seemed an obvious choice for the image archive since it
is the only widely deployed archive format that stores all kinds
of file metadata on POSIX systems. However, multiple options for
the outer format have been debated.

Firstly, the ZIP format has been proposed as the only commonly supported
format supporting adding files from stdin (i.e. making it possible
to pipe the inner archives straight into the container without using
temporary files). However, this format has been clearly rejected
as both not being present in the system set, and being trailer-based
and therefore unusable without having to fetch the whole file.

Secondly, the ar and cpio formats were considered. The former is used
by Debian and its derivative binary packages; the latter is used by
Red Hat derivatives. Both formats have the advantage of having less
historical baggage than .tar, and having less overhead. However, both
are also rather obscure (especially given that ar is actually provided
by GNU binutils rather than as a stand-alone archiver), considered
obsolete by POSIX and both have file size limitations smaller than
.tar.

All that considered, it has been decided that there is no purpose
in using a second archive format in the specification unless it has
a significant advantage over .tar. Therefore, .tar has also been used
as the outer package format, even though it has larger overhead than
other formats (mostly due to padding).


Member ordering
---------------

The member ordering is explicitly specified in order to provide for
trivially reading metadata from partially fetched archives.
By requiring the metadata archive to be stored before the image
archive, the package manager may stop fetching after reading it
and save bandwidth and/or space.


Detached OpenPGP signatures
---------------------------

The use of detached OpenPGP signatures is to provide authenticity checks
for binary packages. Covering the complete members with signatures
provides for trivial verification of all metadata and image contents
respectively, without having to invent custom mechanisms for combining
them. Covering the compressed archives helps to prevent zipbomb
attacks. Covering the individual members rather than the whole package
provides for verification of partially fetched binary packages.
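
The member-ordering argument above can be illustrated with a minimal
sketch in Python, using only the standard ``tarfile`` module. It is not
the reference implementation: the helper names (``build_container``,
``read_metadata``, ``CountingReader``) are hypothetical, the inner
archives are left uncompressed and unsigned, and no OpenPGP
verification is attempted. It only shows that a reader consuming the
container as a forward-only stream can obtain the metadata without
reading the image member.

```python
import io
import tarfile


def add_member(tar, name, data):
    """Append one in-memory regular file to an open tar archive."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))


def make_tar(files):
    """Return the bytes of an uncompressed .tar built from {name: bytes}."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            add_member(tar, name, data)
    return buf.getvalue()


def build_container(metadata, image):
    """Nest the two inner archives, metadata first and image last."""
    return make_tar({
        "metadata.tar": make_tar(metadata),
        "image.tar": make_tar(image),
    })


class CountingReader:
    """Forward-only reader that counts the bytes pulled from the source."""

    def __init__(self, raw):
        self.raw = raw
        self.consumed = 0

    def read(self, size=-1):
        chunk = self.raw.read(size)
        self.consumed += len(chunk)
        return chunk


def read_metadata(stream):
    """Return {path: bytes} from the metadata archive, then stop reading."""
    # Mode "r|" treats the input as a non-seekable stream of tar blocks.
    with tarfile.open(fileobj=stream, mode="r|") as outer:
        for member in outer:
            name = member.name
            if name.startswith("metadata.tar") and not name.endswith(".sig"):
                inner_bytes = outer.extractfile(member).read()
                # The default mode "r" detects inner compression, if any.
                with tarfile.open(fileobj=io.BytesIO(inner_bytes)) as inner:
                    return {m.name: inner.extractfile(m).read()
                            for m in inner if m.isfile()}
    raise ValueError("metadata archive not found")
```

Wrapping the container in ``CountingReader`` and calling
``read_metadata`` on it leaves ``consumed`` far below the container
size when the image is large, which is the saving the ordering is meant
to enable. A real package manager would additionally verify the
``.sig`` member before touching the metadata archive.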


Format versioning
-----------------

It has been requested that an explicit version identifier is added
into the binary package containers in order to account for possible
incompatible changes in the format. However, such an explicit notion
does not seem necessary.

Firstly, the format is meant to be extensible while preserving backwards
compatibility. If a backwards-incompatible change needs to be done,
and that change does not make the packages implicitly incompatible by
design, the incompatibility can be easily forced e.g. via renaming
the metadata archive to ``metadata-v2.tar*``.

Secondly, the only really clean place for such a version would be
an additional file which would unnecessarily grow the uncompressed
tarball. The label is non-obligatory and user-oriented, and as such
cannot be used to carry information significant to the package manager.

Finally, such a version number can be added into the metadata archive
which needs to be processed by the package manager to extract all
significant binary package information.


Backwards Compatibility
=======================

The format does not preserve backwards compatibility with the tbz2
packages. It has been established that preserving compatibility with
the old format was impossible without making the new format even worse
than the old one was.

For example, adding any visible members to the tarball would cause them
to be installed to the filesystem by old Portage versions. Working
around this would require some kind of awful hacks that would oppose
the goal of using a simple and transparent package format.


Reference Implementation
========================

The proof-of-concept implementation of binary package format converter
is available as xpak2gpkg [#XPAK2GPKG]_. It can be used to easily
create packages in the new format for early inspection.


References
==========

.. [#MAN-XPAK] xpak - The XPAK Data Format used with Portage binary
   packages
   (https://dev.gentoo.org/~zmedico/portage/doc/man/xpak.5.html)

.. [#PORTAGE-UTILS] portage-utils: Small and fast Portage helper tools
   written in C
   (https://packages.gentoo.org/packages/app-portage/portage-utils)

.. [#XPAK2GPKG] xpak2gpkg: Proof-of-concept converter from tbz2/xpak
   to gpkg binpkg format
   (https://github.com/mgorny/xpak2gpkg)


Copyright
=========

This work is licensed under the Creative Commons Attribution-ShareAlike
3.0 Unported License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.

-- 
Best regards,
Michał Górny