From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from lists.gentoo.org (pigeon.gentoo.org [208.92.234.80]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by finch.gentoo.org (Postfix) with ESMTPS id 74DE1138334 for ; Wed, 21 Nov 2018 09:33:31 +0000 (UTC) Received: from pigeon.gentoo.org (localhost [127.0.0.1]) by pigeon.gentoo.org (Postfix) with SMTP id D0D2EE08B0; Wed, 21 Nov 2018 09:33:26 +0000 (UTC) Received: from smtp.gentoo.org (smtp.gentoo.org [140.211.166.183]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by pigeon.gentoo.org (Postfix) with ESMTPS id 6AEF4E0886 for ; Wed, 21 Nov 2018 09:33:24 +0000 (UTC) Received: from pomiot (d202-252.icpnet.pl [109.173.202.252]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) (Authenticated sender: mgorny) by smtp.gentoo.org (Postfix) with ESMTPSA id 9601B335C07; Wed, 21 Nov 2018 09:33:22 +0000 (UTC) Message-ID: <1542792798.16894.17.camel@gentoo.org> Subject: Re: [gentoo-dev] [pre-GLEP] Gentoo binary package container format From: =?UTF-8?Q?Micha=C5=82_G=C3=B3rny?= To: gentoo-dev@lists.gentoo.org Date: Wed, 21 Nov 2018 10:33:18 +0100 In-Reply-To: <20181118110048.GB880@gentoo.org> References: <1542453700.31427.2.camel@gentoo.org> <20181118091644.GA880@gentoo.org> <1542533931.1293.23.camel@gentoo.org> <20181118110048.GB880@gentoo.org> Organization: Gentoo Content-Type: multipart/signed; micalg="pgp-sha512"; protocol="application/pgp-signature"; boundary="=-BLNrOf25xxZBbqkZhJ1g" X-Mailer: Evolution 3.26.6 Precedence: bulk List-Post: List-Help: List-Unsubscribe: List-Subscribe: List-Id: Gentoo Linux mail X-BeenThere: gentoo-dev@lists.gentoo.org Reply-to: gentoo-dev@lists.gentoo.org Mime-Version: 1.0 X-Archives-Salt: c536d1da-4bce-4207-89b5-b98ba4bb1c19 X-Archives-Hash: 7cdc77353bb41f4cda74d3cef4d76498 --=-BLNrOf25xxZBbqkZhJ1g Content-Type: 
text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable On Sun, 2018-11-18 at 12:00 +0100, Fabian Groffen wrote: > On 18-11-2018 10:38:51 +0100, Micha=C5=82 G=C3=B3rny wrote: > > On Sun, 2018-11-18 at 10:16 +0100, Fabian Groffen wrote: > > > On 17-11-2018 12:21:40 +0100, Micha=C5=82 G=C3=B3rny wrote: > > > > Problems with the current binary package format > > > > ----------------------------------------------- > > > >=20 > > > > The following problems were identified with the package format curr= ently > > > > in use: > > > >=20 > > > > 1. **The packages rely on custom binary archive format to store > > > > metadata.** It is entirely Gentoo invented, and requires dedica= ted > > > > tooling to work with it. In fact, the reference implementation > > > > in Portage does not even include a CLI tool to work with tbz2 > > > > packages; an unofficial implementation is provided as part > > > > of portage-utils toolkit [#PORTAGE-UTILS]_. > > >=20 > > > I think you should rewrite this section to the argument that the > > > metadata is hard to edit, and that there is only one tool to do so > > > (except a python interface from Portage?). > > > On a separate note, I don't think portage-utils can be considered > > > "unofficial", it is a Gentoo official project as far as I am aware. > >=20 > > In this context, Portage is 'official'. Portage-utils is a project > > that's developed entirely separately from Portage and doesn't use > > Portage APIs but instead reinvents everything. As such, it is easy for > > the two to go out of sync. Or for one of them to have bugs that > > the other one doesn't have (say, with endianness). >=20 > I'm not sure if it's actually true, I was under the impression the same > author(s) worked on the Portage as well as portage-utils code. Anyway, > aren't quickpkg and emerge enough from a user's perspective? Gentoo users have a wide perspective. 
Assuming that you can think of all things the users need and you don't need to care beyond that is plain wrong and results in Windows. > > > > 2. **The format relies on obscure compressor feature of ignoring > > > > trailing garbage**. While this behavior is traditionally implem= ented > > > > by many compressors, the original reasons for it have become lon= g > > > > irrelevant and it is not surprising that new compressors do not > > > > support it. In particular, Portage already hit this problem twi= ce: > > > > once when users replaced bzip2 with parallel-capable pbzip2 > > > > implementation [#PBZIP2]_, and the second time when support for = zstd > > > > compressor was added [#ZSTD]_. > > >=20 > > > I think this is actually the result of a rather opportunistic > > > implementation. The fault is that we chose to use an extension that > > > suggests the file is a regular compressed tarball. > > > When one detects that a file is xpak padded, it is trivial to feed th= e > > > decompressor just the relevant part of the datastream. The format > > > itself isn't bad, and doesn't rely on obscure behaviour. > >=20 > > Except if you don't have the proper tools installed. In which case > > the 'opportunistic' behavior made it possible to extract the contents > > without special tools... except when it actually happens not to work > > anymore. Roy's reply indicates that there is actually interest in this > > design feature. >=20 > Your point is that the format is broken (=3D=3D relies on obscure compres= sor > feature). My point is that the format simply requires a special tool. > The fact that we prefer to use existing tools doesn't imply in any way > that the format is broken to me. > I think you should rewrite your point to mention that you don't want to > use a tool that doesn't exist in @system (?) to unpack a binpkg. My > guess is that you could use some head/tail magic in a script if the > trailing block is upsetting the decompressor. 
>=20 > I'm not saying this may look ugly, I'm just saying that your point seems > biased. I've spent significant effort rewriting those points to make it clear what the problem is, and separating them from other changes 'worth doing while we're changing stuff'. Hope that satisfies your nitpicking. > > > > 3. **Placing metadata at the end of file makes partial fetches > > > > complex.** While it is technically possible to obtain package > > > > metadata remotely without fetching the whole package, it usually > > > > requires e.g. 2-3 HTTP requests with rather complex driver. For > > > > comparison, if metadata was placed at the beginning of the file, > > > > early-terminated pipeline with a single fetch request would suff= ice. > > >=20 > > > I think this point needs to be quantified somewhat why it is so > > > important. > > > I may be wrong, but the average binpkg is small, <1MiB, bigger packag= es > > > are <50MiB. > > > So what is the gain to be saved here? A "few" MiBs for what operatio= n > > > exactly? I say "few" because I know for some users this is actually = not > > > just a blib before it's downloaded. So if this is possible to achiev= e, > > > in what scenarios is this going to be used (and is this often?). > >=20 > > Last I checked, Gentoo aimed to support more users than the 'majority' > > of people with high-throughput Internet access. If there's no cost > > in doing things better, why not do them better? >=20 > You didn't address the critical question, but instead just repeated what > I said. > So again, why do you need to read just the metadata? The original idea was to make it possible to index remote packages without a server-side cache being available (or up-to-date). To do that, the package manager would need to fetch the metadata of all packages (but there is no need to fetch the whole packages).=20 However, that's merely a possible future idea. It's not worth debating today.
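To make the single-fetch argument concrete, here is a minimal Python sketch. It rests on assumptions that are not taken from this thread: the member names `metadata.tar` and `image.tar`, the sample payloads, and the helper names are all hypothetical, and an in-memory buffer stands in for an HTTP response body. It only illustrates that, when the metadata member comes first in an uncompressed tar container, a streaming reader can stop after the first block of data instead of downloading the whole package.

```python
import io
import tarfile


class CountingReader:
    """Minimal file-like wrapper that counts the bytes pulled from a stream."""

    def __init__(self, raw):
        self.raw = raw
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = self.raw.read(size)
        self.bytes_read += len(chunk)
        return chunk


def build_container(metadata, image):
    """Build an uncompressed tar with the (hypothetical) metadata member first."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in (("metadata.tar", metadata), ("image.tar", image)):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_leading_member(stream, wanted):
    """Read one member from a non-seekable tar stream, then stop reading."""
    # Mode "r|" treats the input as a forward-only stream, like a network fetch.
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        for member in tar:
            if member.name == wanted:
                return tar.extractfile(member).read()
    return None


# Hypothetical payloads: a tiny metadata blob and a 1 MiB package image.
container = build_container(b"CATEGORY=app-misc\n", b"\0" * (1 << 20))
reader = CountingReader(io.BytesIO(container))  # stands in for an HTTP body
metadata = read_leading_member(reader, "metadata.tar")
print(metadata, reader.bytes_read, len(container))
```

With the metadata at the end of the file instead, the same loop would have to consume the whole image member before reaching it, which is the cost the quoted point 3 describes.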
Today I really understood the point of avoiding premature optimization.=20 Even if the change is practically zero-cost and harmless (as it's simply reordering files), it's going to cost you a lot of time because someone will keep nitpicking on it, even though any other order will not change anything. > > > > 4. **Extending the format with OpenPGP signatures is non-trivial.** > > > > Depending on the implementation details, it either requires fetc= hing > > > > additional detached signature, breaking backwards compatibility = or > > > > introducing more custom logic to reassemble OpenPGP packets. > > >=20 > > > I think one could add an extra key to the xpak that holds a gpg sig o= r > > > something. Perhaps this point is better phrased as that current binp= kgs > > > don't have any validation options defined. > >=20 > > ...which extra key would mean that the two disjoint implementations > > in use would need more custom code that extracts the signature, > > reconstructs signed data for verification and verifies it. Or, in othe= r > > words, that user needs even more custom tooling to manually verify > > the package he just fetched. >=20 > I don't see your point. If you define what the package format looks > like, you just need to implement that. There is no point in having a > binpkg format that Portage doesn't implement properly. Portage is > well-equipped to implement any of the approaches. A user should use > Portage to install a package. A poweruser could use a separate tool for > a scenario where he/she's in charge of keeping things sane. Relevancy? >=20 > I just don't agree that extending the format is non-trivial. You seem > to have no arguments other than adding "custom logic", which is what you > eventually also do in the reference implementation of your new approach. The difference is that my format is transparent. You file(1) it, you see a .tar archive. You extract the archive, you see subarchives and .sig which are widely recognized. 
You don't have to read the spec, you don't have to get special tools. If you have ever verified a detached signature, you know how to proceed. If you haven't, you'll learn something you can reuse. Now, implementing signatures on top of XPAK is more effort, and yields something that is more fragile and in the end doesn't benefit anyone. >=20 > > > > 5. **Metadata is not compressed.** This is not a significant probl= em, > > > > it is just listed for completeness. > > > >=20 > > > >=20 > > > > Goals for a new container format > > > > -------------------------------- > > > >=20 > > > > The following goals have been set for a replacement format: > > > >=20 > > > > 1. **The packages must remain contained in a single file.** As a m= atter > > > > of user convenience, it should be possible to transfer binary > > > > packages without having to use multiple files, and to install th= em > > > > from any location. > > > >=20 > > > > 2. **The file format must be entirely based on common file formats, > > > > respecting best practices, with as little customization as neces= sary > > > > to satisfy the requirements.** In particular, it is unacceptabl= e > > > > to create new binary formats. > > >=20 > > > I take this as your personal opinion. I don't quite get why it is > > > unacceptable to create a new binary format though. In particular whe= n > > > you're looking for efficiency, such format could serve your purposes. > > > As long as it's clearly defined, I don't see the problem with a binar= y > > > format either. > > > Could you add why it is you think binary formats are unacceptable her= e? > >=20 > > Because custom binary formats require specialized tooling, and are > > a royal PITA when the user wants to do something that the author of > > specialized tooling just happened not to think worthwhile, or when > > the tooling is not available for some reason. And before you ask reall= y > > silly questions, yes, I did fight binary packages over hex editor > > at some point.
>=20 > Which I still don't understand, to be frank. I think even Portage > exposes python APIs to get to the data. Compare the time needed to make a trivial (but unforeseen) change to a format that's transparent with the time needed for a format that requires you to learn its spec and/or API, write a program and debug it. > > The most trivial case is an attempted recovery of a broken system. > > If you don't have Portage working and don't have portage-utils > > installed, do you really prefer a custom format which will require you > > to fetch and compile special tools? Or is one that can be processed > > with tools you're quite likely to have on every system, like tar? >=20 > Well, I think the idea behind the original binpkg format was to use tar > directly on the files in emergency scenarios like these... > The assumption was bzip2 decompressor and tar being available. > I think it is an example of how you add something, while still allowing > to fallback on existing tools. Except progress in compressors has made it work less and less reliably.=20 It's mostly an example of how to be *clever*. However, being clever usually doesn't pay off in the long term, compared to doing things *in a simple way*. > > > > 3. **The file format should provide for partial fetching of binary > > > > packages.** It should be possible to easily fetch and read > > > > the package metadata without having to download the whole packag= e. > > >=20 > > > Like above, what is the use-case here? Why would you want this? I > > > think I'm missing something here. > >=20 > > Does this harm anything? Even if there's little real use for this, is > > there any harm in supporting it? Are we supposed to do things the othe= r > > way around with no benefit just because you don't see any real use for > > it? >=20 > Well, you make a huge point out of it. And if it isn't used, then why > bother so much about it. Then it just looks like you want to use it as > an argument to get rid of something you just don't like.
>=20 > In my opinion you better just say "hey I would like to implement this > binpkg format, because I think it would be easier to support with > minimal tools since it doesn't have custom features". I would have > nothing against that. Simple and elegant is nice, you don't need to > invent arguments for that, in my opinion. The spec is now more focused on that. >=20 > Fabian >=20 > > > > 4. **The file format must provide support for OpenPGP signatures.** > > > > Preferably, it should use standard OpenPGP message formats. > > > >=20 > > > > 5. **The file format must allow for efficient metadata updates.** > > > > In particular, it should be possible to update the metadata with= out > > > > having to recompress package files. > > > >=20 > > > > 6. **The file format should account for easy recognition both throu= gh > > > > filename and through contents.** Preferably, it should have dis= tinct > > > > features making it possible to detect it via file(1). > > > >=20 > > > > 7. **The file format should allow for metadata compression.** > > > >=20 > > > > 8. 
**The file format should make future extensions easily possible > > > > without breaking backwards compatibility.** > > >=20 > > >=20 > >=20 > > --=20 > > Best regards, > > Micha=C5=82 G=C3=B3rny >=20 >=20 >=20 --=20 Best regards, Micha=C5=82 G=C3=B3rny --=-BLNrOf25xxZBbqkZhJ1g Content-Type: application/pgp-signature; name="signature.asc" Content-Description: This is a digitally signed message part Content-Transfer-Encoding: 7bit -----BEGIN PGP SIGNATURE----- iQKTBAABCgB9FiEEXr8g+Zb7PCLMb8pAur8dX/jIEQoFAlv1Jl5fFIAAAAAALgAo aXNzdWVyLWZwckBub3RhdGlvbnMub3BlbnBncC5maWZ0aGhvcnNlbWFuLm5ldDVF QkYyMEY5OTZGQjNDMjJDQzZGQ0E0MEJBQkYxRDVGRjhDODExMEEACgkQur8dX/jI EQqAYRAA04to8YFqy7VNd2Ygjoqia/RlsXh+vKSa+lKVgWBCm/Az/TLaaG8sXK5w FlUv6+B+udsP9CnWSMvNkVHlVyhZCiHzOgsHHgIknXX7fQiI8pitNNq8qVoHSu5W 32FJwvy108lHJ46AWLhFzf4eKFpy3iJlWE69paXAolSfF6XUcwxw2xZbRBjVIo/2 6XdULCkzycsH7WT9jOfTuolpmafhEVOmCED62w8Q9pvNW4uwEsMU+cvZIuPo9uo1 rekfwoUi+cUsjNye8gqEUOuJb0KhKdz371uCZN/6mg/i3swlSXjCQFVeu06SCES0 cORYzIm0OaVIu7Sc+tWsoahpD77LDB/LcPfDvduE+Txte6v7k3zUw/AMvsJPxq7t PyfoYCjR+u7SkoigbSW+LCBnrybBWMOBlbH1qQERzE/KuFaLvmesQpxrZY3PTVn+ 5uDOXpuzeP3R4cHbDpSXXpIbLJjgfUSd+6JqLnPBOKOPL/KVj59mAPppbWMHgyBP V4WyWluU48iINkI/tHbtOl8Mh06lQwvuJKs+20R/HJcp1jB0y0xOvwjzwaB43Q3B mNIgpkAUWulc1HPtdlxgw9Y6qmmmx0qe6LQ1I46BZjxUyJP9lu1ve1qi6i37/VUW szf7FFoLA58lHz68DmVX8c0xhwSg+f1kr0ktJee+irmNf1AKkmE= =hFSA -----END PGP SIGNATURE----- --=-BLNrOf25xxZBbqkZhJ1g--