public inbox for gentoo-dev@lists.gentoo.org
* [gentoo-dev] [RFC] GLEP 74 post-Council review update
@ 2017-11-16 10:19 Michał Górny
From: Michał Górny @ 2017-11-16 10:19 UTC
  To: gentoo-dev

Hi, everyone.

Here's the updated version of GLEP 74 taking into consideration
the points made during the Council pre-review.

ReST: https://dev.gentoo.org/~mgorny/tmp/glep-0074.rst
HTML: https://dev.gentoo.org/~mgorny/tmp/glep-0074.html

Changes:

09ed01f glep-0074: Explain combining multiple Manifest trees
9de0840 glep-0074: Clarify timestamp handling of sub-Manifests
516c2ec glep-0074: Forbid compressing top-level Manifest
b01783e glep-0074: Clarify sub-Manifest signing paragraph


---
GLEP: 74
Title: Full-tree verification using Manifest files
Author: Michał Górny <mgorny@gentoo.org>,
        Robin Hugh Johnson <robbat2@gentoo.org>,
        Ulrich Müller <ulm@gentoo.org>
Type: Standards Track
Status: Draft
Version: 1
Created: 2017-10-21
Last-Modified: 2017-11-16
Post-History: 2017-10-26, 2017-11-16
Content-Type: text/x-rst
Requires: 59, 61
Replaces: 44, 58, 60
---

Abstract
========

This GLEP extends the Manifest file format to cover full-tree file
integrity and authenticity checks. The format aims to be future-proof
and efficient, and to provide means of backwards compatibility.


Motivation
==========

The Manifest files as defined by GLEP 44 [#GLEP44]_ provide the current
means of verifying the integrity of distfiles and package files
in Gentoo. Combined with OpenPGP signatures, they provide means to
ensure the authenticity of the covered files. However, as noted
in GLEP 57 [#GLEP57]_ they lack the ability to provide full-tree
authenticity verification as they do not cover any files outside
the package directory. In particular, they provide multiple ways
for a third party to inject malicious code into the ebuild environment.

Historically, the topic of providing authenticity coverage for the whole
repository has been mentioned multiple times. The most noteworthy efforts
are GLEPs 58 [#GLEP58]_ and 60 [#GLEP60]_ by Robin H. Johnson from 2008.
They were accepted by the Council in 2010 but have never been
implemented. When potential implementation work started in 2017, a new
discussion about the specification arose. It prompted the creation
of a competing GLEP that would provide a redesigned alternative to
the old GLEPs.

This specification is designed with the following goals in mind:

1. It should provide means to ensure the authenticity of the complete
   repository, including preventing the injection of additional files.

2. The format should be universal enough to work both for the Gentoo
   repository and third-party repositories of different characteristics.

3. The Manifest files should be verifiable stand-alone, that is without
   knowing any details about the underlying repository format.


Specification
=============

Manifest file format
--------------------

This specification reuses and extends the Manifest file format defined
in GLEP 44 [#GLEP44]_. For this purpose, the *file type* field is
repurposed as a generic *tag* that could also indicate additional
(non-checksum) metadata. Appropriately, those tags can be followed by
other space-separated values.

Unless specified otherwise, the paths used in the Manifest files
are relative to the directory containing the Manifest file. The paths
must not reference the parent directory (``..``).
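
As an illustration of the tag-plus-values layout, the following is
a minimal sketch of a line parser. The function names ``split_entry``
and ``parse_checksum_values`` are hypothetical, and the sketch assumes
that fields are separated by plain whitespace and that checksums come
as alternating name/value pairs, as in GLEP 44.

```python
from typing import Dict, List, Optional, Tuple


def split_entry(line: str) -> Optional[Tuple[str, List[str]]]:
    """Split one Manifest line into its tag and space-separated values."""
    fields = line.split()
    if not fields:
        return None  # blank line
    return fields[0], fields[1:]


def parse_checksum_values(values: List[str]) -> Tuple[str, int, Dict[str, str]]:
    """Parse '<path> <size> <checksums>...' as used by DATA, MANIFEST, etc."""
    if len(values) < 2 or len(values) % 2 != 0:
        raise ValueError("expected <path> <size> followed by name/value pairs")
    path, size, rest = values[0], int(values[1]), values[2:]
    # The paths must not reference the parent directory.
    if ".." in path.split("/"):
        raise ValueError("path must not reference '..'")
    return path, size, dict(zip(rest[0::2], rest[1::2]))
```

Tags that carry other metadata (``TIMESTAMP``, ``IGNORE``) would only
use ``split_entry`` and interpret the values themselves.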


Manifest file locations and nesting
-----------------------------------

The ``Manifest`` file located in the root directory of the repository
is called top-level Manifest, and it is used to perform the full-tree
verification. In order to verify the authenticity, it must be signed
using OpenPGP, using the armored cleartext format.

The top-level Manifest may reference sub-Manifests contained
in subdirectories of the repository. The sub-Manifests are traditionally
named ``Manifest``; however, the implementation must support arbitrary
names, including the possibility of multiple (split) Manifests
for a single directory. The sub-Manifest can only cover the files inside
the directory tree where it resides.

The sub-Manifest can also be signed using OpenPGP armored cleartext
format. However, the signature verification can be omitted since it
already is covered by the signed top-level Manifest.


Directory tree coverage
-----------------------

The specification provides three ways of skipping Manifest verification
of specific files and directories (recursively):

1. explicit ``IGNORE`` entries in Manifest files,

2. injected ignore paths via package manager configuration,

3. using names starting with a dot (``.``) which are always skipped.

All files that are not ignored must be covered by at least one
of the Manifests.
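
The three ways of skipping verification listed above could be combined
into a single check roughly as follows. This is only a sketch:
``is_ignored`` and its parameters are hypothetical names, paths are
assumed to be relative to the top-level Manifest, and injected paths
are treated exactly like ``IGNORE`` entries (no wildcards).

```python
from pathlib import PurePosixPath


def is_ignored(path, ignore_entries, injected=()):
    """Return True if the repository-relative path is skipped."""
    parts = PurePosixPath(path).parts
    # Names starting with a dot are always skipped, recursively.
    if any(part.startswith(".") for part in parts):
        return True
    # IGNORE entries and injected paths cover the path itself
    # and everything below it.
    for entry in (*ignore_entries, *injected):
        entry_parts = PurePosixPath(entry).parts
        if parts[:len(entry_parts)] == entry_parts:
            return True
    return False
```

Comparing whole path components rather than string prefixes keeps
``IGNORE distfiles`` from accidentally matching a sibling such as
``distfiles-extra``.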

A single file may be matched by multiple identical or equivalent
Manifest entries, if and only if the entries have the same semantics,
specify the same size and the checksums common to both entries match.
It is an error for a single file to be matched by multiple entries
of different semantics, file size or checksum values. It is an error
to specify another entry for a file matching ``IGNORE``, or one of its
subdirectories.
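
The duplicate-entry rule can be expressed as a small predicate. In this
sketch the name ``entries_compatible`` is hypothetical, entries are
``(tag, size, checksums)`` tuples, and the deprecated ``EBUILD``,
``MISC`` and ``AUX`` tags are assumed to share the semantics of
``DATA``, since they are described as equivalent to it.

```python
DATA_LIKE = frozenset({"DATA", "EBUILD", "MISC", "AUX"})


def semantics(tag):
    """Collapse tags that are equivalent to DATA into one class."""
    return "DATA" if tag in DATA_LIKE else tag


def entries_compatible(a, b):
    """Check whether two entries for the same file may coexist."""
    tag_a, size_a, sums_a = a
    tag_b, size_b, sums_b = b
    if semantics(tag_a) != semantics(tag_b):
        return False
    if size_a != size_b:
        return False
    # Only the checksums present in both entries need to match.
    common = sums_a.keys() & sums_b.keys()
    return all(sums_a[name] == sums_b[name] for name in common)
```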

The file entries (except for ``IGNORE``) can be specified for regular
files only. Symbolic links are followed when opening files
and traversing directories. It is an error to specify an entry for
a different file type. If the tree contains files of other types
that are not otherwise ignored, they need to be covered by an explicit
``IGNORE``.

All the local (non-``DIST``) files covered by a Manifest tree must
reside on the same filesystem. It is an error to specify entries
applying to files on another filesystem. If files or directories that
are not otherwise ignored reside on a different filesystem, or symbolic
links point to targets on a different filesystem, they must
be explicitly excluded via ``IGNORE``.


File verification
-----------------

When verifying a file against the Manifest, the following rules are
used:

1. If the file is covered directly or indirectly by an entry
   of the ``IGNORE`` type, the verification always succeeds.

2. If the file is covered by an entry of the ``MANIFEST``, ``DATA``,
   ``MISC``, ``EBUILD`` or ``AUX`` type:

   a. if the file is not present, then the verification fails,

   b. if the file is present but has a different size or one
      of the checksums does not match, the verification fails,

   c. otherwise, the verification succeeds.

3. If the file is present but not listed in Manifest, the verification
   fails.

Unless specified otherwise, the package manager must not allow using
any files for which the verification failed. The package manager may
reject any package or even the whole repository if it may refer to files
for which the verification failed.
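
The rules above can be sketched for a single file as follows. The name
``verify_entry`` and the way an entry is passed in are hypothetical.
The sketch also assumes that the lowercased checksum name is accepted
by Python's ``hashlib``, which does not hold for every reserved
algorithm name.

```python
import hashlib
import os


def verify_entry(path, entry):
    """Apply the verification rules to one file.

    entry is None (not listed), the string "IGNORE",
    or a (size, checksums) tuple.
    """
    if entry == "IGNORE":
        return True                      # rule 1
    if entry is None:
        return not os.path.exists(path)  # rule 3: present but unlisted
    size, checksums = entry
    if not os.path.isfile(path):
        return False                     # rule 2a
    if os.path.getsize(path) != size:
        return False                     # rule 2b, size mismatch
    hashers = {name: hashlib.new(name.lower()) for name in checksums}
    with open(path, "rb") as f:
        # Read once and feed every hash to keep the check I/O friendly.
        for chunk in iter(lambda: f.read(1 << 16), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return all(hashers[name].hexdigest() == value.lower()
               for name, value in checksums.items())  # rules 2b and 2c
```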


Timestamp verification
----------------------

The top-level Manifest file can contain a ``TIMESTAMP`` entry to account
for attacks against tree update distribution. If such an entry
is present, it should be updated every time at least one
of the Manifests changes. Every unique timestamp value must correspond
to a single tree state.

During the verification process, the client should compare the timestamp
against the update time obtained from a local clock or a trusted time
source. If the comparison result indicates that the Manifest at the time
of receiving was already significantly outdated, the client should
either fail the verification or require manual confirmation from
the user.

Furthermore, the Manifest provider may employ additional methods
of distributing the timestamps of recently generated Manifests
using a secure channel from a trusted source for exact comparison.
The exact details of such a solution are outside the scope of this
specification.

``TIMESTAMP`` entries may also be present in sub-Manifests. Those
timestamps must not be newer than the timestamp of the top-level
Manifest (if present). This specification does not define any specific
use for them.
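
A client-side check along these lines might look as follows. This is
a sketch under assumptions: the function names are hypothetical, and
the one-day threshold for "significantly outdated" is an arbitrary
placeholder, since this specification does not define a value. The
format string is the one given for the ``TIMESTAMP`` tag.

```python
from datetime import datetime, timedelta, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_timestamp(value):
    """Parse a TIMESTAMP value into an aware UTC datetime."""
    parsed = datetime.strptime(value, TIMESTAMP_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


def is_outdated(value, now, max_age=timedelta(days=1)):
    """Return True if the Manifest was older than max_age at time 'now'."""
    return now - parse_timestamp(value) > max_age


def sub_timestamp_valid(sub_value, top_value):
    """A sub-Manifest timestamp must not be newer than the top-level one."""
    return parse_timestamp(sub_value) <= parse_timestamp(top_value)
```

Here ``now`` would be the update time taken from the local clock or
a trusted time source, e.g. ``datetime.now(timezone.utc)``.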


Modern Manifest tags
--------------------

The Manifest files can specify the following tags:

``TIMESTAMP <iso8601>``
  Specifies a timestamp of when the Manifest file was last updated.
  The timestamp must be a valid second-precision ISO8601 extended format
  combined date and time in UTC timezone, i.e. using the following
  ``strftime()`` format string: ``%Y-%m-%dT%H:%M:%SZ``. Optional.
  The package manager can use it to detect an outdated repository
  checkout as described in `Timestamp verification`_.

``MANIFEST <path> <size> <checksums>...``
  Specifies a sub-Manifest. The sub-Manifest must be verified like
  a regular file. If the verification succeeds, the entries from
  the sub-Manifest are included for verification as described
  in `Manifest file locations and nesting`_.

``IGNORE <path>``
  Ignores a subdirectory or file from Manifest checks. If the specified
  path is present, it and its contents are omitted from the Manifest
  verification (always pass). *Path* must be a plain file or directory
  path without a trailing slash, and must not contain wildcards.

``DATA <path> <size> <checksums>...``
  Specifies a regular file subject to Manifest verification. The file
  is required to pass verification. Used for all files that do not match
  any other type.

``DIST <filename> <size> <checksums>...``
  Specifies a distfile entry used to verify files fetched as part
  of ``SRC_URI``. The filename must match the filename used to store
  the fetched file as specified in the PMS [#PMS-FETCH]_. The package
  manager must reject the fetched file if it fails verification.
  ``DIST`` entries apply to all packages below the Manifest file
  specifying them.


Deprecated Manifest tags
------------------------

For backwards compatibility, the following tags are additionally
allowed at the package directory level:

``EBUILD <filename> <size> <checksums>...``
  Equivalent to the ``DATA`` type.

``MISC <path> <size> <checksums>...``
  Equivalent to the ``DATA`` type. Historically indicated that
  the package manager may ignore a verification failure if operating
  in non-strict mode. However, that behavior is deprecated.

``AUX <filename> <size> <checksums>...``
  Equivalent to the ``DATA`` type, except that the filename is relative
  to ``files/`` subdirectory.
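
A verification tool could fold the deprecated tags into ``DATA`` while
loading a Manifest. The helper below is a hypothetical sketch of that
normalization; the only non-trivial case is the implicit ``files/``
prefix of ``AUX``.

```python
def normalize_entry(tag, path):
    """Map deprecated tags to DATA, returning (tag, path)."""
    if tag == "AUX":
        # AUX filenames are relative to the files/ subdirectory.
        return "DATA", "files/" + path
    if tag in ("EBUILD", "MISC"):
        return "DATA", path
    return tag, path
```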


Algorithm for full-tree verification
------------------------------------

In order to perform full-tree verification, the following algorithm
can be used:

1. Collect all files present in the repository into *present* set.

2. Start at the top-level Manifest file. Verify its OpenPGP signature.
   Optionally verify the ``TIMESTAMP`` entry if present as specified
   in `timestamp verification`_. Remove the top-level Manifest
   from the *present* set.

3. Process all ``MANIFEST`` entries, recursively. Verify the Manifest
   files according to `file verification`_ section, and include their
   entries in the current Manifest entry list (using paths relative
   to directories containing the Manifests).

4. Process all ``IGNORE`` entries. Remove any paths matching them
   from the *present* set.

5. Collect all files covered by ``DATA``, ``MISC``, ``EBUILD``
   and ``AUX`` entries into the *covered* set.

6. Verify the entries in *covered* set for incompatible duplicates
   and collisions with ignored files as explained in `Directory tree
   coverage`_.

7. Verify all the files in the union of the *present* and *covered*
   sets, according to `file verification`_ section.
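
The steps above can be sketched in Python as shown below. This is
an illustration and not a reference implementation. All function names
are hypothetical. OpenPGP and ``TIMESTAMP`` checks, compressed
Manifests and injected ignore paths are left out. Duplicate entries are
required to be identical, which is stricter than the rule in this
specification, and lowercased checksum names are assumed to be valid
``hashlib`` names.

```python
import hashlib
import os

FILE_TAGS = {"MANIFEST", "DATA", "MISC", "EBUILD", "AUX"}


def file_ok(path, size, checksums):
    """Check presence, size and every listed checksum of one file."""
    if not os.path.isfile(path) or os.path.getsize(path) != size:
        return False
    with open(path, "rb") as f:
        data = f.read()
    return all(hashlib.new(name.lower(), data).hexdigest() == value
               for name, value in checksums.items())


def load_manifest(root, rel, ignores, covered, conflicts):
    """Steps 3 to 5: collect entries of a Manifest and its sub-Manifests."""
    base = os.path.dirname(rel)
    with open(os.path.join(root, rel)) as f:
        lines = [line.split() for line in f]
    for fields in lines:
        if len(fields) == 2 and fields[0] == "IGNORE":
            ignores.add(os.path.normpath(os.path.join(base, fields[1])))
        elif len(fields) >= 3 and fields[0] in FILE_TAGS:
            name = "files/" + fields[1] if fields[0] == "AUX" else fields[1]
            path = os.path.normpath(os.path.join(base, name))
            entry = (int(fields[2]), dict(zip(fields[3::2], fields[4::2])))
            if covered.setdefault(path, entry) != entry:
                conflicts.add(path)  # simplified duplicate check
            if fields[0] == "MANIFEST" and file_ok(
                    os.path.join(root, path), *entry):
                load_manifest(root, path, ignores, covered, conflicts)


def verify_tree(root):
    """Return the sorted list of paths that fail verification."""
    present = set()  # step 1; symbolic links are followed
    for dirpath, dirnames, filenames in os.walk(root, followlinks=True):
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            if not name.startswith("."):
                full = os.path.join(dirpath, name)
                present.add(os.path.relpath(full, root))
    present.discard("Manifest")  # step 2, signature check omitted
    ignores, covered, conflicts = set(), {}, set()
    load_manifest(root, "Manifest", ignores, covered, conflicts)

    def ignored(path):
        return any(path == i or path.startswith(i + os.sep) for i in ignores)

    present = {p for p in present if not ignored(p)}  # step 4
    failed = set(conflicts)  # step 6
    failed.update(p for p in covered if ignored(p))
    for path in present | covered.keys():  # step 7
        entry = covered.get(path)
        if entry is None or not file_ok(os.path.join(root, path), *entry):
            failed.add(path)
    return sorted(failed)
```

A sub-Manifest that fails its own verification is not loaded, so both
the sub-Manifest and every file it would have covered end up in
the returned list.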


Algorithm for finding parent Manifests
--------------------------------------

In order to find the top-level Manifest from the current directory
the following algorithm can be used:

1. Store the current directory as *original* and the device ID
   of the containing filesystem (``st_dev``) as *startdev*.

2. If the device ID of the containing filesystem (``st_dev``)
   of the current directory is different from *startdev*, stop.

3. If the current directory contains a ``Manifest`` file:

   a. If an ``IGNORE`` entry in the ``Manifest`` file covers
      the *original* directory (or one of the parent directories), stop.

   b. Otherwise, store the current directory as *last_found*.

4. If the current directory is the root system directory (``/``), stop.

5. Otherwise, enter the parent directory and jump to step 2.

Once the algorithm stops, *last_found* will contain the relevant
top-level Manifest. If *last_found* is null, then the directory tree
does not contain any valid top-level Manifest candidates and one should
be created in the *original* directory.

Once the top-level Manifest is found, its ``MANIFEST`` entries should
be used to find any sub-Manifests below the top-level Manifest,
up to and including the *original* directory. Note that those
sub-Manifests can use different filenames than ``Manifest``.
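
The upward search can be sketched as follows. The names
``find_top_level`` and ``manifest_ignores`` are hypothetical, and
the sketch only looks at uncompressed files named ``Manifest``. It
returns the directory holding the top-level Manifest, or ``None`` if
no candidate was found.

```python
import os


def manifest_ignores(manifest_dir, target):
    """Return True if an IGNORE entry in the Manifest covers target."""
    rel = os.path.relpath(target, manifest_dir)
    with open(os.path.join(manifest_dir, "Manifest")) as f:
        for line in f:
            fields = line.split()
            if len(fields) == 2 and fields[0] == "IGNORE":
                if rel == fields[1] or rel.startswith(fields[1] + os.sep):
                    return True
    return False


def find_top_level(original):
    """Walk up from 'original' and return the top-level Manifest dir."""
    original = os.path.abspath(original)
    startdev = os.stat(original).st_dev            # step 1
    current, last_found = original, None
    while True:
        if os.stat(current).st_dev != startdev:    # step 2
            break
        if os.path.isfile(os.path.join(current, "Manifest")):
            if manifest_ignores(current, original):  # step 3a
                break
            last_found = current                   # step 3b
        parent = os.path.dirname(current)
        if parent == current:                      # step 4, reached "/"
            break
        current = parent                           # step 5
    return last_found
```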


Checksum algorithms
-------------------

This section is informational only. Specifying the exact set
of supported algorithms is outside the scope of this specification.

The algorithm names reserved at the time of writing are:

- ``MD5`` [#MD5]_,
- ``RMD160`` -- RIPEMD-160 [#RIPEMD160]_,
- ``SHA1`` [#SHS]_,
- ``SHA256`` and ``SHA512`` -- SHA-2 family of hashes [#SHS]_,
- ``WHIRLPOOL`` [#WHIRLPOOL]_,
- ``BLAKE2B`` and ``BLAKE2S`` -- BLAKE2 family of hashes [#BLAKE2]_,
- ``SHA3_256`` and ``SHA3_512`` -- SHA-3 family of hashes [#SHA3]_,
- ``STREEBOG256`` and ``STREEBOG512`` -- Streebog family of hashes
  [#STREEBOG]_.

The method of introducing new hashes is defined by GLEP 59 [#GLEP59]_.
It is recommended that any new hashes are named after the Python
``hashlib`` module algorithm names, transformed into uppercase.
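
With that naming recommendation, the mapping between Manifest names and
``hashlib`` reduces to a case conversion. The helper names below are
hypothetical, and the sketch only accepts algorithms that ``hashlib``
guarantees on every platform. Older names such as ``RMD160`` do not
follow the convention and would need an explicit mapping.

```python
import hashlib


def manifest_name(hashlib_name):
    """Derive the recommended Manifest name from a hashlib name."""
    return hashlib_name.upper()


def new_hasher(name):
    """Create a hashlib object for a Manifest algorithm name."""
    lowered = name.lower()
    if lowered not in hashlib.algorithms_guaranteed:
        raise ValueError("unsupported algorithm: " + name)
    return hashlib.new(lowered)
```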


Manifest compression
--------------------

The topic of Manifest file compression is covered by GLEP 61 [#GLEP61]_.
This section merely addresses interoperability issues between Manifest
compression and this specification.

The compressed Manifest files are required to be suffixed for their
compression algorithm. This suffix should be used to recognize
the compression and decompress Manifests transparently. The exact list
of algorithms and their corresponding suffixes are outside the scope
of this specification.
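
Transparent decompression keyed on the suffix could be sketched like
this. The suffix table is an assumption made for illustration only,
since the actual list of algorithms and suffixes is not defined here,
and ``open_manifest`` is a hypothetical name.

```python
import bz2
import gzip
import lzma

# Assumed suffixes; the real list is defined elsewhere.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_manifest(path):
    """Open a Manifest for reading text, decompressing by suffix."""
    for suffix, opener in OPENERS.items():
        if path.endswith(suffix):
            return opener(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

Note that a sub-Manifest should be checked against its ``MANIFEST``
entry in compressed form before a helper like this is used to read it.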

The top-level Manifest file must not be compressed. Since the OpenPGP
signature covers the uncompressed text and is compressed itself,
the data would have to be decompressed without any prior verification.
This could expose users e.g. to zip bombs or exploits on decompressor
vulnerabilities.

Whenever this specification refers to sub-Manifests, they can use any
names but are also required to use a specific compression suffix.
The ``MANIFEST`` entries are required to specify the full name including
compression suffix, and the verification is performed on the compressed
file.

The specification permits uncompressed Manifests to exist alongside
their compressed counterparts, and multiple compressed formats
to coexist. If that is the case, the files must have the same
uncompressed content and the implementation is free to choose either
of the files using the same base name.


Combining multiple Manifest trees (informational)
-------------------------------------------------

This specification permits nesting multiple hierarchical Manifest trees.
In this layout, the specific directories of the Manifest tree can
be verified both as a part of another top-level Manifest,
and as an independent Manifest tree (when obtained without the parent
directory).

For this to work, the sub-Manifest file in the directory must also
satisfy the requirements for the top-level Manifest file. That is:

- it must be named ``Manifest`` and not compressed,

- it must cover all the files in this directory and its subdirectories
  (i.e. no files from the directory tree can be covered by parent
  Manifest),

- if authenticity verification is desired, it must be OpenPGP-signed.

It should be noted that if such a directory is a subdirectory of a valid
Manifest tree, the sub-Manifest needs to be valid according
to the top-level Manifest and the OpenPGP signature is disregarded
as detailed in `Manifest file locations and nesting`_. The top-level
behavior is exhibited only when the directory is obtained without parent
directories.


An example Manifest file (informational)
----------------------------------------

An example top-level Manifest file for the Gentoo repository would have
the following content::

    TIMESTAMP 2017-10-30T10:11:12Z
    IGNORE distfiles
    IGNORE local
    IGNORE lost+found
    IGNORE packages
    MANIFEST app-accessibility/Manifest 14821 SHA256 1b5f.. SHA512 f7eb..
    ...
    MANIFEST eclass/Manifest.gz 50812 SHA256 8c55.. SHA512 2915..
    ...

An example modern Manifest (disregarding backwards compatibility)
for a package directory would have the following content::

    DATA SphinxTrain-0.9.1-r1.ebuild 932 SHA256 3d3b.. SHA512 be4d..
    DATA SphinxTrain-1.0.8.ebuild 912 SHA256 f681.. SHA512 0749..
    DATA metadata.xml 664 SHA256 97c6.. SHA512 1175..
    DATA files/gcc.patch 816 SHA256 b56e.. SHA512 2468..
    DATA files/gcc34.patch 333 SHA256 c107.. SHA512 9919..
    DIST SphinxTrain-0.9.1-beta.tar.gz 469617 SHA256 c1a4.. SHA512 1b33..
    DIST sphinxtrain-1.0.8.tar.gz 8925803 SHA256 548e.. SHA512 465d..


Rationale
=========

Stand-alone format
------------------

The first question that needed to be asked before proceeding with
the design was whether the Manifest file format was supposed to be
stand-alone, or tightly bound to the repository format.

The stand-alone format has been selected because of its three
advantages:

1. It is more future-proof. If an incompatible change to the repository
   format is introduced, only developers need to upgrade the tools
   they use to generate the Manifests. The tools used to verify
   the updated Manifests will continue to work.

2. It is more flexible and universal. With a dedicated tool,
   the Manifest files can be used to sign and verify arbitrary file
   sets.

3. It keeps the verification tool simpler. In particular, we can easily
   write an independent verification tool that could work on any
   distribution without needing to depend on a package manager
   implementation or rewrite parts of it.

Designing a stand-alone format requires that the Manifest carries enough
information to perform the verification following all the rules specific
to the Gentoo repository.


Tree design
-----------

The second important point of the design was determining whether
the Manifest files should be structured hierarchically, or independently.
Both options have their advantages.

In the hierarchical model, each sub-Manifest file is covered by a higher
level Manifest. As a result, only the top-level Manifest has to be
OpenPGP-signed, and subsequent Manifests need to be only verified by
checksum stored in the parent Manifest. This has the following
implications:

- Verifying any set of files in the repository requires using checksums
  from the most relevant Manifests and the parent Manifests.

- The OpenPGP signature of the top-level Manifest needs to be verified
  only once per process.

- Altering any set of files requires updating the relevant Manifests,
  and their parent Manifests up to the top-level Manifest, and signing
  the last one.

- As a result, the top-level Manifest changes on every commit,
  and various middle-level Manifests change (and need to be transferred)
  frequently.

In the independent model, each sub-Manifest file is independent
of the parent Manifests. As a result, each of them needs to be signed
and verified independently. However, the parent Manifests still need
to list sub-Manifests (albeit without verification data) in order
to detect removal or replacement of subdirectories. This has
the following implications:

- Verifying any set of files in the repository requires using checksums
  and verifying signatures of the most relevant Manifest files.

- Altering any set of files requires updating the relevant Manifests
  and signing them again.

- Parent Manifests are updated only when Manifests are added or removed
  from subdirectories. As a result, they change infrequently.

While both models have their advantages, the hierarchical model was
selected because it reduces the number of OpenPGP operations, which are
comparatively costly, to the minimum.


Tree layout restrictions
------------------------

The algorithm is meant to work primarily with ebuild repositories which
normally contain only files and directories. Directories provide
no useful metadata for verification, and specifying special entries
for additional file types is purposeless. Therefore, the specification
is restricted to dealing with regular files.

The Gentoo repository does not use symbolic links. Some Gentoo
repositories do, however. To provide a simple solution for dealing with
symlinks without having to take care to implement special handling for
them, the common behavior of implicitly resolving them is used.
Therefore, symbolic links to files are stored as if they were regular
files, and symbolic links to directories are followed as if they were
regular directories.

Dotfiles are implicitly ignored as that is a common notion used
in software written for POSIX systems. All other filenames require
explicit ``IGNORE`` lines.

An ability to inject additional ignore entries is provided to account
for site configuration affecting the repository tree -- placing
additional files in it, skipping some of the categories from syncing.
This configuration can extend beyond the limits of this GLEP,
e.g. by allowing wildcards or regular expressions.

The algorithm is restricted to work on a single filesystem. This is
mostly relevant when scanning for top-level Manifest -- we do not want
to cross filesystem boundaries then. However, to ensure consistent
bidirectional behavior we also need to ban crossing them when operating
down the tree.

The directories and files on different filesystems need to be ignored
explicitly as implicitly skipping them would cause confusion.
In particular, tools might then claim that a file does not exist when
it clearly does because it was skipped due to filesystem boundaries.


File verification model
-----------------------

The verification model aims to provide full coverage against different
forms of attack. In particular, three different kinds of manipulation
are considered:

1. Alteration of the file content.

2. Removal of a file.

3. Addition of a new file.

In order to protect against all three, the system requires that all
files in the repository are listed in Manifests and verified against
them.

As a special case, ignores are allowed to account for directories
that are not part of the repository but were traditionally placed inside
it. Those directories were ``distfiles``, ``local`` and ``packages``.
Ignores could also be used for VCS directories such as ``CVS``.


Non-strict Manifest verification
--------------------------------

Originally the Manifest2 format provided a special ``MISC`` tag that
was used for ``metadata.xml`` and ``ChangeLog`` files. This tag
indicated that the Manifest verification failures could be ignored for
those files unless the package manager was working in strict mode.

The first versions of this specification continued the use of this tag.
However, after a long debate it was decided to deprecate it along with
the non-strict behavior, and require all files to strictly match.

Two arguments were mentioned for the usefulness of a ``MISC`` type:

1. being able to reduce the checkout size by stripping unnecessary
   files out, and

2. being able to update automatically generated files locally
   without causing unnecessary verification failures.

However, the usefulness of ``MISC`` in both cases is doubtful.

The cases for stripping unnecessary files mostly focused around space
savings. For this purpose, stripping ``metadata.xml`` and similar files
has little value. It is much more common for users to strip whole
packages or categories. The ``MISC`` type is not suitable for that,
and so a dedicated package manager mechanism needs to be developed
instead. The same mechanism can also handle files that historically used
the ``MISC`` type. As an example, the package manager may choose
to generate both the rsync exclusion list and Manifest ignore list
using a single source list.

The cases for autogenerated files involve such cache files
as ``use.local.desc``. However, we cannot include ``md5-cache`` there
due to security concerns, which results in inconsistent cache handling.
Furthermore, the tools were historically modified to provide stable
output, which means that their content cannot change without
non-``MISC`` content being changed first. This practically defeats
the purpose of using ``MISC``.

Finally, the non-strict mode could be used as a means of attack.
Allowing missing or modified documentation files could be used
to spread misinformation, resulting in bad decisions made by the user.
A modified file could also be used e.g. to exploit vulnerabilities
of an XML parser.


Timestamp field
---------------

The top-level Manifest optionally allows using a ``TIMESTAMP`` tag
to include a generation timestamp in the Manifest. A similar feature
was originally proposed in GLEP 58 [#GLEP58]_.

A malicious third party may use the principles of exclusion or replay
[#C08]_ to deny an update to clients, while at the same time recording
the identity of clients to attack. The timestamp field can be used to
detect that.

In order to provide a more complete protection, the Gentoo
Infrastructure should provide an ability to obtain the timestamps
of all Manifests from a recent timeframe over a secure channel
from a trusted source for comparison.

Strictly speaking, this information is already provided by the various
``metadata/timestamp*`` files that are already present. However,
including the value in the Manifest itself has little cost
and provides the ability to perform the verification stand-alone.

Furthermore, some of the timestamp files are added very late
in the distribution process, past the Manifest generation phase. Those
files will most likely receive ``IGNORE`` entries and therefore
not be suitable for safe use.

The specification permits additional timestamps in sub-Manifest files
for local use. A generic testing tool should ignore them.


New vs deprecated tags
----------------------

Out of the four types defined by Manifest2, only one is reused
and the remaining three are replaced by a single, universal ``DATA``
type.

The ``DIST`` tag is reused since the specification does not change
anything with regard to distfile handling.

The ``EBUILD`` tag could potentially be reused for generic file
verification data. However, it would be confusing if all the different
data files were marked as ``EBUILD``. Therefore, an equivalent ``DATA``
type was introduced as a replacement.

The ``MISC`` tag and the relevant non-strict mode have been removed
as being of little value, as detailed in the `Non-strict Manifest
verification`_ section.

The ``AUX`` tag is deprecated as it is redundant to ``DATA``, and has
the limiting property of implicit ``files/`` path prefix.


Finding top-level Manifest
--------------------------

The development of a reference implementation for this GLEP has brought
the following problem: how to find all the relevant Manifests when
the Manifest tool is run inside a subdirectory of the repository?

One of the options would be to provide a bi-directional linking
of Manifests via a ``PARENT`` tag. However, that would not solve
the problem when a new Manifest file is being created.

Instead, an algorithm for iterating over parent directories is proposed.
Since there is no obligatory explicit indicator for the top-level
Manifest, the algorithm assumes that the top-level Manifest
is the highest ``Manifest`` in the directory hierarchy that can cover
the current directory. This generally makes sense since the Manifest
files are required to provide coverage for all subdirectories, so all
Manifests starting from that one need to be updated.

If independent Manifest trees are nested in the directory structure,
then an ``IGNORE`` entry needs to be used to separate them.

Since sub-Manifests can use any filenames, the Manifest finding
algorithm must not short-cut the procedure by storing all ``Manifest``
files along the parent directories. Instead, it needs to retrace
the relevant sub-Manifest files along ``MANIFEST`` entries
in the top-level Manifest.


Injecting ChangeLogs into the checkout
--------------------------------------

One of the problems considered in the new Manifest format was that
of injecting historical and autogenerated ChangeLogs into the repository.
Normally we are not including those files to reduce the checkout size.
However, some users have shown interest in them and Infra is working
on providing them via an additional rsync module.

If such files were injected into the repository, they would cause
verification failures of Manifests. To account for this, Infra could
provide ``IGNORE`` entries to allow them to exist.


Splitting distfile checksums from file checksums
------------------------------------------------

Another problem with the current Manifest format is that the checksums
for fetched files are combined with checksums for local files
in a single file inside the package directory. It has been specifically
pointed out that:

- since distfiles are sometimes reused across different packages,
  the repeating checksums are redundant [#DIST]_.
  
- mirror admins were interested in the possibility of verifying all
  the distfiles with a single tool.

This specification does not provide a clean solution to this problem.
It technically permits moving ``DIST`` entries to higher-level Manifests
but the usefulness of such a solution is doubtful.

However, for the second problem we will probably deliver a dedicated
tool working with this Manifest format.


Hash algorithms
---------------

While maintaining a consistent supported hash set is important
for interoperability, it is not a good fit for the generic layout
of this GLEP. Furthermore, it would require updating the GLEP in the
future every time the used algorithms change.

Instead, the specification focuses on listing the currently used
algorithm names for interoperability, and sets a recommendation
for consistent naming of algorithms in the future. The Python
``hashlib`` module is used as a reference since it is used
as the provider of hash functions for most of the Python software,
including Portage and PkgCore.

The basic rules for changing hash algorithms are defined in GLEP 59
[#GLEP59]_. The implementations can focus only on those algorithms
that are actually used or planned on being used. It may be feasible
to devise a new GLEP that specifies the currently used hashes (or update
GLEP 59 accordingly).


Manifest compression
--------------------

The support for Manifest compression is introduced with minimal changes
to the file format. The ``MANIFEST`` entries are required to provide
the real (compressed) file path for compatibility with other file
entries and to avoid confusion.

The compression of top-level Manifest file has been prohibited
as the specification currently does not provide any means of verifying
the file prior to decompression. This would make it possible for
a malicious third party to provide a compressed Manifest exposing
decompressor vulnerabilities, or being a zip bomb, and the tooling
would have to unpack it before being able to verify the contents.

The OpenPGP cleartext signature covers the contents of the Manifest,
and is therefore compressed along with them. The possibility of using
detached signature has been considered but it was rejected as
unnecessary complexity for minor gain.

Technically, a similar result could be effected via moving all the data
into a compressed sub-Manifest in the top directory (e.g.
``Manifest.sub.gz``), and including a ``MANIFEST`` entry for this file
in a signed, uncompressed top-level Manifest.

The existence of additional entries for uncompressed Manifest checksums
was debated. However, plain entries for the uncompressed file would
be confusing if only the compressed file existed, and conflicting if both
uncompressed and compressed variants existed. Furthermore, it has been
pointed out that ``DIST`` entries do not have an uncompressed variant
either.


Performance considerations
--------------------------

Performing a full-tree verification on every sync raises some
performance concerns for end-user systems. The initial testing has shown
that a cold-cache verification on a btrfs file system can take around
4 minutes, with the process being mostly I/O bound. On the other hand,
it can be expected that the verification will be performed directly
after syncing, taking advantage of a warm filesystem cache.

To improve speed on I/O- and/or CPU-constrained systems even further,
the algorithms can be easily extended to perform incremental
verification. Given that rsync does not preserve mtimes by default,
the tool can take advantage of mtime and Manifest comparisons to recheck
only the parts of the repository that have changed.

Furthermore, the package manager implementations can restrict checking
only to the parts of the repository that are actually being used.


Backwards Compatibility
=======================

This GLEP provides optional means of preserving backwards compatibility.
To preserve the backwards compatibility, the following needs to hold
for the ``Manifest`` file in every package directory:

- all files must be covered by the single ``Manifest`` file,

- all distfiles used by the package must be included,

- all files inside the ``files/`` subdirectory need to use
  the ``AUX`` tag (rather than ``DATA``),

- all ``.ebuild`` files need to use the ``EBUILD`` tag,

- the ``metadata.xml`` and ``ChangeLog`` files need to use
  the ``MISC`` tag,

- the Manifest can be signed to provide authenticity verification,

- an uncompressed Manifest must always exist, and a compressed Manifest
  of identical content may be present.

Once the backwards compatibility is no longer a concern, the above
no longer needs to hold and the deprecated tags can be removed.


Reference Implementation
========================

The reference implementation for this GLEP is being developed
as the gemato project [#GEMATO]_.


Credits
=======

Thanks to all the people whose contributions were invaluable
to the creation of this GLEP. This includes but is not limited to:

- Robin Hugh Johnson,
- Ulrich Müller.

Additionally, thanks to Robin Hugh Johnson for the original
MetaManifest GLEP series which served both as inspiration and source
of many concepts used in this GLEP. Recursively, also thanks to all
the people who contributed to the original GLEPs.


References
==========

.. [#GLEP44] GLEP 44: Manifest2 format
   (https://www.gentoo.org/glep/glep-0044.html)

.. [#GLEP57] GLEP 57: Security of distribution of Gentoo software
   - Overview
   (https://www.gentoo.org/glep/glep-0057.html)

.. [#GLEP58] GLEP 58: Security of distribution of Gentoo software
   - Infrastructure to User distribution - MetaManifest
   (https://www.gentoo.org/glep/glep-0058.html)

.. [#GLEP59] GLEP 59: Manifest2 hash policies and security implications
   (https://www.gentoo.org/glep/glep-0059.html)

.. [#GLEP60] GLEP 60: Manifest2 filetypes
   (https://www.gentoo.org/glep/glep-0060.html)

.. [#GLEP61] GLEP 61: Manifest2 compression
   (https://www.gentoo.org/glep/glep-0061.html)

.. [#PMS-FETCH] Package Manager Specification: Dependency Specification
   Format - SRC_URI
   (https://projects.gentoo.org/pms/6/pms.html#x1-940008.2.10)

.. [#MD5] RFC1321: The MD5 Message-Digest Algorithm
   (https://www.ietf.org/rfc/rfc1321.txt)

.. [#RIPEMD160] The hash function RIPEMD-160
   (https://homes.esat.kuleuven.be/~bosselae/ripemd160.html)

.. [#SHS] FIPS PUB 180-4: Secure Hash Standard (SHS)
   (http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf)

.. [#WHIRLPOOL] The WHIRLPOOL Hash Function
   (http://www.larc.usp.br/~pbarreto/WhirlpoolPage.html)

.. [#BLAKE2] BLAKE2 -- fast secure hashing
   (https://blake2.net/)

.. [#SHA3] FIPS PUB 202: SHA-3 Standard: Permutation-Based Hash
   and Extendable-Output Functions
   (http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.202.pdf)

.. [#STREEBOG] GOST R 34.11-2012: Streebog Hash Function
   (https://www.streebog.net/)

.. [#C08] Cappos, J et al. (2008). "Attacks on Package Managers"
   (https://www2.cs.arizona.edu/stork/packagemanagersecurity/attacks-on-package-managers.html)

.. [#DIST] According to Robin H. Johnson, 8.4% of all DIST entries
   at the time of writing are duplicates, representing 2 MiB
   out of 25 MiB of DIST entries altogether.

.. [#GEMATO] gemato: Gentoo Manifest Tool
   (https://github.com/mgorny/gemato/)

Copyright
=========
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
Unported License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.


-- 
Best regards,
Michał Górny




* Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update
  2017-11-16 10:19 [gentoo-dev] [RFC] GLEP 74 post-Council review update Michał Górny
@ 2017-11-17 20:37 ` Daniel Campbell
  2017-11-20 17:24   ` Michał Górny
  2017-11-20 18:42 ` [gentoo-dev] [RFC] GLEP 74 post-Council review update [v2] Michał Górny
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 23+ messages in thread
From: Daniel Campbell @ 2017-11-17 20:37 UTC (permalink / raw
  To: gentoo-dev


On Thu, Nov 16, 2017 at 11:19:54AM +0100, Michał Górny wrote:
> [snip]
> Abstract
> ========
> 
> This GLEP extends the Manifest file format to cover full-tree file
> integrity and authenticity checks.The format aims to be future-proof,

Missing a space after the first sentence, between "checks." and "The".

> efficient and provide means of backwards compatibility.

Could use an Oxford comma after "efficient", but it's a style choice. Up
to you.

> 
> 
> Motivation
> ==========
> 
> The Manifest files as defined by GLEP 44 [#GLEP44]_ provide the current
> means of verifying the integrity of distfiles and package files
> in Gentoo. Combined with OpenPGP signatures, they provide means to
> ensure the authenticity of the covered files. However, as noted
> in GLEP 57 [#GLEP57]_ they lack the ability to provide full-tree
> authenticity verification as they do not cover any files outside
> the package directory. In particular, they provide multiple ways
> for a third party to inject malicious code into the ebuild environment.
> 
> Historically, the topic of providing authenticity coverage for the whole
> repository has been mentioned multiple times. The most noteworthy effort
> are GLEPs 58 [#GLEP58]_ and 60 [#GLEP60]_ by Robin H. Johnson from 2008.
> They were accepted by the Council in 2010 but have never been
> implemented. When potential implementation work started in 2017, a new
> discussion about the specification arose. It prompted the creation
> of a competing GLEP that would provide a redesigned alternative to
> the old GLEPs.

No correction, but I really like the inclusion of history here. It gives
the reader more context, should they have questions about prior
discussions.

> [snip]
> 1. It is more future-proof. If an incompatible change to the repository
>    format is introduced, only developers need to be upgrade the tools
>    they use to generate the Manifests. The tools used to verify
>    the updated Manifests will continue to work.

"be upgrade" -> "upgrade"

> 
> [snip]
> While both models have their advantages, the hierarchical model was
> selected because it reduces the number of OpenPGP operations
> which are comparatively costly to the minimum.

It seems like "which are comparatively costly" should be in parentheses,
or separated by some other punctuation like en- or em-dashes. e.g.

"... because it reduces the number of OpenPGP operations (which are
comparatively costly) to the minimum."

or

"... because it reduces the number of OpenPGP operations – which are
comparatively costly – to the minimum."

(Note en-dash was used (0x2013), not a regular hyphen (0x2D).)

Or something like that.

> 
> [snip]
> Non-strict Manifest verification
> --------------------------------
> 
> Originally the Manifest2 format provided a special ``MISC`` tag that
> was used for ``metadata.xml`` and ``ChangeLog`` files. This tag
> indicated that the Manifest verification failures could be ignored for
> those files unless the package manager was working in strict mode.
> 
> The first versions of this specification continued the use of this tag.
> However, after a long debate it was decided to deprecate it along with
> the non-strict behavior, and require all files to strictly match.

It may be outside the scope of the GLEP, but a link to said long debate
might be relevant to the reader, especially if they have suggestions or
points that have already been discussed in the debate.

> [snip]
> Finally, the non-strict mode could be used as means to an attack.
> The allowance of missing or modified documentation file could be used
> to spread misinformation, resulting in bad decisions made by the user.
> A modified file could also be used e.g. to exploit vulnerabilities
> of an XML parser.

"used e.g." -> "used, e.g."

Helps it reflect the way it would be spoken.

> 
> 
> Timestamp field
> ---------------
> 
> The top-level Manifests optionally allows using a ``TIMESTAMP`` tag
> to include a generation timestamp in the Manifest. A similar feature
> was originally proposed in GLEP 58 [#GLEP58]_.

"Manifests" and "allows" disagree grammatically -- one of them needs to
drop the "s". Context clues indicate a singular top-level Manifest.

> 
> A malicious third-party may use the principles of exclusion or replay
> [#C08]_ to deny an update to clients, while at the same time recording
> the identity of clients to attack. The timestamp field can be used to
> detect that.
> 
> In order to provide a more complete protection, the Gentoo
> Infrastructure should provide an ability to obtain the timestamps
> of all Manifests from a recent timeframe over a secure channel
> from a trusted source for comparison.

"a more complete protection"; should probably drop the "a".

> 
> Strictly speaking, this information is already provided by the various
> ``metadata/timestamp*`` files that are already present. However,
> including the value in the Manifest itself has a little cost
> and provides the ability to perform the verification stand-alone.
> 
> Furthermore, some of the timestamp files are added very late
> in the distribution process, past the Manifest generation phase. Those
> files will most likely receive ``IGNORE`` entries and therefore
> be not suitable to safe use.

Looks like a few extra words in the last sentence. Here's my attempt:

"These files will likely receive ``IGNORE`` entries and therefore be
unsafe to use."

("unsuitable" may replace "unsafe", up to you)

> 
> The specification permits additional timestamps in sub-Manifest files
> for local use. A generic testing tool should ignore them.
> 
> 
> New vs deprecated tags
> ----------------------
> 
> Out of the four types defined by Manifest2, only one is reused
> and the remaining three is replaced by a single, universal ``DATA``
> type.

"the remaining three is" -> "the remaining three are"

> [snip]
> Injecting ChangeLogs into the checkout
> --------------------------------------
> 
> One of the problems considered in the new Manifest format was that
> of injecting historical and autogenerated ChangeLog into the repository.
> Normally we are not including those files to reduce the checkout size.
> However, some users have shown interest in them and Infra is working
> on providing them via an additional rsync module.

"that of" is extraneous here.

The second sentence should read something like "We normally don't
include these files, to reduce checkout size."

> 
> [snip]
> Hash algorithms
> ---------------
> 
> While maintaining a consistent supported hash set is important
> for interoperability, it is no good fit for the generic layout of this
> GLEP. Furthermore, it would require updating the GLEP in the future
> every time the used algorithms change.

"it is no good fit" -> "it is not a good fit"

> 
> [snip]
> The compression of top-level Manifest file has been prohibited
> as the specification currently does not provide any means of verifying
> the file prior to decompression. This would make it possibly for
> a malicious third party to provide a compressed Manifest exposing
> decompressor vulnerabilities, or being a zip bomb, and the tooling
> would have to unpack it before being able to verify the contents.

The latter half of the paragraph is a little scattered. Here's my
attempt, after the first sentence:

"If the top-level Manifest is compressed, tooling will have to unpack
the file before being able to verify the contents. This makes it
possible for a malicious third party to attack a system by providing a
compressed Manifest that exposes decompressor vulnerabilities, or a zip
bomb."

(Maybe 'zip bomb' should be a link or a footnote, describing what it
is.)

> [snip]
> 
> The existence of additional entries for uncompressed Manifest checksums
> was debated. However, plain entries for the uncompressed file would
> be confusing if only compressed file existed, and conflicting if both

"only compressed" -> "only the compressed"

> uncompressed and compressed variants existed. Furthermore, it has been
> pointed out that ``DIST`` entries do not have uncompressed variant

"uncompressed variant" -> "an uncompressed variant"

> either.
> 
> 
> Performance considerations
> --------------------------
> 
> Performing a full-tree verification on every sync raises some
> performance concerns for end-user systems. The initial testing has shown
> that a cold-cache verification on a btrfs file system can take up around
> 4 minutes, with the process being mostly I/O bound. On the other hand,
> it can be expected that the verification will be performed directly
> after syncing, taking advantage of warm filesystem cache.

"warm" -> "a warm"

> 
> [snip]
> Thanks to all the people whose contributions were invaluable
> to the creation of this GLEP. This includes but is not limited to:
> 
> - Robin Hugh Johnson,
> - Ulrich Müller.
> 
> Additionally, thanks to Robin Hugh Johnson for the original
> MataManifest GLEP series which served both as inspiration and source

"MataManifest" -> "MetaManifest"

>
> [snip]
> 

Aside from the few nitpicks this looks good. Hope this helps.

-- 
Daniel Campbell - Gentoo Developer, Trustee, Treasurer
OpenPGP Key: 0x1EA055D6 @ hkp://keys.gnupg.net
fpr: AE03 9064 AE00 053C 270C  1DE4 6F7A 9091 1EA0 55D6



* Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update
  2017-11-17 20:37 ` Daniel Campbell
@ 2017-11-20 17:24   ` Michał Górny
  0 siblings, 0 replies; 23+ messages in thread
From: Michał Górny @ 2017-11-20 17:24 UTC (permalink / raw
  To: gentoo-dev

On Fri, 2017-11-17 at 12:37 -0800, Daniel Campbell wrote:
> > 
> > [snip]
> > Non-strict Manifest verification
> > --------------------------------
> > 
> > Originally the Manifest2 format provided a special ``MISC`` tag that
> > was used for ``metadata.xml`` and ``ChangeLog`` files. This tag
> > indicated that the Manifest verification failures could be ignored for
> > those files unless the package manager was working in strict mode.
> > 
> > The first versions of this specification continued the use of this tag.
> > However, after a long debate it was decided to deprecate it along with
> > the non-strict behavior, and require all files to strictly match.
> 
> It may be outside the scope of the GLEP, but a link to said long debate
> might be relevant to the reader, especially if they have suggestions or
> points that have already been discussed in the debate.

It's been on IRC.

> Aside from the few nitpicks this looks good. Hope this helps.

I think I've fixed every single one of them. I'm going to fix one issue
I've noticed (lack of filename whitespace restriction) and resubmit.

-- 
Best regards,
Michał Górny




* Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update [v2]
  2017-11-16 10:19 [gentoo-dev] [RFC] GLEP 74 post-Council review update Michał Górny
  2017-11-17 20:37 ` Daniel Campbell
@ 2017-11-20 18:42 ` Michał Górny
  2017-11-20 21:37   ` Ulrich Mueller
  2017-11-22  2:59   ` R0b0t1
  2017-11-21 17:26 ` [gentoo-dev] [RFC] GLEP 74 post-Council review update [v3] Michał Górny
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 23+ messages in thread
From: Michał Górny @ 2017-11-20 18:42 UTC (permalink / raw
  To: gentoo-dev

On Thu, 2017-11-16 at 11:19 +0100, Michał Górny wrote:
> Hi, everyone.
> 
> Here's the updated version of GLEP 74 taking into consideration
> the points made during the Council pre-review.
> 
> ReST: https://dev.gentoo.org/~mgorny/tmp/glep-0074.rst
> HTML: https://dev.gentoo.org/~mgorny/tmp/glep-0074.html
> 

New changes:

9d819c9 glep-0074: Disallow filenames containing whitespace
4124b2f glep-0074: Explicitly specify UTF-8 encoding
7f9bd9f glep-0074: Include suggestions from Daniel Campbell


---
GLEP: 74
Title: Full-tree verification using Manifest files
Author: Michał Górny <mgorny@gentoo.org>,
        Robin Hugh Johnson <robbat2@gentoo.org>,
        Ulrich Müller <ulm@gentoo.org>
Type: Standards Track
Status: Draft
Version: 1
Created: 2017-10-21
Last-Modified: 2017-11-16
Post-History: 2017-10-26, 2017-11-16
Content-Type: text/x-rst
Requires: 59, 61
Replaces: 44, 58, 60
---

Abstract
========

This GLEP extends the Manifest file format to cover full-tree file
integrity and authenticity checks. The format aims to be future-proof,
efficient and provide means of backwards compatibility.


Motivation
==========

The Manifest files as defined by GLEP 44 [#GLEP44]_ provide the current
means of verifying the integrity of distfiles and package files
in Gentoo. Combined with OpenPGP signatures, they provide means to
ensure the authenticity of the covered files. However, as noted
in GLEP 57 [#GLEP57]_ they lack the ability to provide full-tree
authenticity verification as they do not cover any files outside
the package directory. In particular, they provide multiple ways
for a third party to inject malicious code into the ebuild environment.

Historically, the topic of providing authenticity coverage for the whole
repository has been mentioned multiple times. The most noteworthy efforts
are GLEPs 58 [#GLEP58]_ and 60 [#GLEP60]_ by Robin H. Johnson from 2008.
They were accepted by the Council in 2010 but have never been
implemented. When potential implementation work started in 2017, a new
discussion about the specification arose. It prompted the creation
of a competing GLEP that would provide a redesigned alternative to
the old GLEPs.

This specification is designed with the following goals in mind:

1. It should provide means to ensure the authenticity of the complete
   repository, including preventing the injection of additional files.

2. The format should be universal enough to work both for the Gentoo
   repository and third-party repositories of different characteristics.

3. The Manifest files should be verifiable stand-alone, that is without
   knowing any details about the underlying repository format.


Specification
=============

Manifest file format
--------------------

This specification reuses and extends the Manifest file format defined
in GLEP 44 [#GLEP44]_. For this purpose, the *file type* field is
repurposed as a generic *tag* that could also indicate additional
(non-checksum) metadata. Appropriately, those tags can be followed by
other space-separated values.

Unless specified otherwise, the paths used in the Manifest files
are relative to the directory containing the Manifest file. The paths
must not reference the parent directory (``..``).

The Manifest files use UTF-8 encoding.


Manifest file locations and nesting
-----------------------------------

The ``Manifest`` file located in the root directory of the repository
is called the top-level Manifest, and it is used to perform the full-tree
verification. In order to verify the authenticity, it must be signed
using OpenPGP, using the armored cleartext format.

The top-level Manifest may reference sub-Manifests contained
in subdirectories of the repository. The sub-Manifests are traditionally
named ``Manifest``; however, the implementation must support arbitrary
names, including the possibility of multiple (split) Manifests
for a single directory. The sub-Manifest can only cover the files inside
the directory tree where it resides.

The sub-Manifest can also be signed using OpenPGP armored cleartext
format. However, the signature verification can be omitted since it
already is covered by the signed top-level Manifest.


Directory tree coverage
-----------------------

The specification provides three ways of skipping Manifest verification
of specific files and directories (recursively):

1. explicit ``IGNORE`` entries in Manifest files,

2. injected ignore paths via package manager configuration,

3. using names starting with a dot (``.``) which are always skipped.

All files that are not ignored must be covered by at least one
of the Manifests.

A single file may be matched by multiple identical or equivalent
Manifest entries, if and only if the entries have the same semantics,
specify the same size and the checksums common to both entries match.
It is an error for a single file to be matched by multiple entries
of different semantics, file size or checksum values. It is an error
to specify another entry for a file matching ``IGNORE``, or one of its
subdirectories.
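
The duplicate-entry rule above can be sketched in Python as follows. This is
only an illustration: the ``Entry`` tuple and the ``entries_compatible`` name
are hypothetical, and treating the deprecated tags as having the same
semantics as ``DATA`` is an assumption based on their description as
equivalent to the ``DATA`` type.

```python
from typing import Dict, NamedTuple


class Entry(NamedTuple):
    """Hypothetical in-memory form of one Manifest file entry."""
    tag: str                   # e.g. "DATA", "MANIFEST", "EBUILD"
    path: str                  # path already resolved relative to the tree root
    size: int
    checksums: Dict[str, str]  # algorithm name -> hex digest


# Assumption: the deprecated tags share the semantics of DATA.
_DATA_LIKE = {"DATA", "EBUILD", "MISC", "AUX"}


def entries_compatible(a: Entry, b: Entry) -> bool:
    """Return True if two entries matching the same file may coexist."""
    tag_a = "DATA" if a.tag in _DATA_LIKE else a.tag
    tag_b = "DATA" if b.tag in _DATA_LIKE else b.tag
    if tag_a != tag_b or a.size != b.size:
        return False
    # Only the algorithms present in both entries have to agree.
    common = a.checksums.keys() & b.checksums.keys()
    return all(a.checksums[name] == b.checksums[name] for name in common)
```

Note that two entries with no algorithm in common are accepted by this
sketch as long as the tag and size agree.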

The file entries (except for ``IGNORE``) can be specified for regular
files only. Symbolic links are followed when opening files
and traversing directories. It is an error to specify an entry for
a different file type. If the tree contains files of other types
that are not otherwise ignored, they need to be covered by an explicit
``IGNORE``.

All the local (non-``DIST``) files covered by a Manifest tree must
reside on the same filesystem. It is an error to specify entries
applying to files on another filesystem. If files or directories that
are not otherwise ignored reside on a different filesystem, or symbolic
links point to targets on a different filesystem, they must
be explicitly excluded via ``IGNORE``.

All paths specified in the Manifest file must consist of characters
corresponding to valid UTF-8 code points excluding the NULL character
(``U+0000``) and characters classified as whitespace in the current
version of the Unicode standard [#UNICODE]_. It is an error to use
Manifest files in directories containing files whose names contain
the disallowed characters.
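
A rough Python sketch of the path restrictions follows. The function name
``is_valid_manifest_path`` is hypothetical, and ``str.isspace()`` is used
only as an approximation of the Unicode whitespace classification; Python's
definition may differ slightly from the Unicode standard's. The check for
``..`` reflects the rule from `Manifest file format`_.

```python
def is_valid_manifest_path(path: str) -> bool:
    """Reject paths that cannot be represented in a Manifest file."""
    for ch in path:
        # NUL is excluded explicitly; str.isspace() approximates the
        # Unicode whitespace classification (an assumption of this sketch).
        if ch == "\x00" or ch.isspace():
            return False
    # Paths must not reference the parent directory.
    return ".." not in path.split("/")
```

A file name such as ``a..b`` remains valid here since only a whole path
component equal to ``..`` is rejected.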


File verification
-----------------

When verifying a file against the Manifest, the following rules are
used:

1. If the file is covered directly or indirectly by an entry
   of the ``IGNORE`` type, the verification always succeeds.

2. If the file is covered by an entry of the ``MANIFEST``, ``DATA``,
   ``MISC``, ``EBUILD`` or ``AUX`` type:

   a. if the file is not present, then the verification fails,

   b. if the file is present but has a different size or one
      of the checksums does not match, the verification fails,

   c. otherwise, the verification succeeds.

3. If the file is present but not listed in Manifest, the verification
   fails.

Unless specified otherwise, the package manager must not allow using
any files for which the verification failed. The package manager may
reject any package or even the whole repository if it may refer to files
for which the verification failed.
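
The three verification rules can be illustrated with a short Python sketch.
It is not the reference implementation: the ``verify_file`` name and its
argument layout are hypothetical, and the checksum dictionary is assumed to
be keyed by ``hashlib`` algorithm names (e.g. ``sha256``) rather than
by Manifest names.

```python
import hashlib
import os


def verify_file(path, entry, ignored=False):
    """Verify one file.

    entry is None when no Manifest lists the file, otherwise a tuple
    (size, {hashlib_name: hex_digest}).
    """
    if ignored:
        return True  # rule 1: covered by IGNORE, always succeeds
    if entry is None:
        # Rule 3: a file that is present but not listed fails.
        return not os.path.exists(path)
    size, checksums = entry
    if not os.path.isfile(path):  # rule 2a (symbolic links are followed)
        return False
    if os.path.getsize(path) != size:  # rule 2b: size mismatch
        return False
    with open(path, "rb") as f:
        data = f.read()
    for algo, expected in checksums.items():  # rule 2b: checksum mismatch
        if hashlib.new(algo, data).hexdigest() != expected.lower():
            return False
    return True  # rule 2c
```

Reading the whole file into memory keeps the sketch short; a real tool
would more likely hash the file in chunks.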


Timestamp verification
----------------------

The top-level Manifest file can contain a ``TIMESTAMP`` entry to account
for attacks against tree update distribution. If such an entry
is present, it should be updated every time at least one
of the Manifests changes. Every unique timestamp value must correspond
to a single tree state.

During the verification process, the client should compare the timestamp
against the update time obtained from a local clock or a trusted time
source. If the comparison result indicates that the Manifest at the time
of receiving was already significantly outdated, the client should
either fail the verification or require manual confirmation from the user.

Furthermore, the Manifest provider may employ additional methods
of distributing the timestamps of recently generated Manifests
using a secure channel from a trusted source for exact comparison.
The exact details of such a solution are outside the scope of this
specification.

``TIMESTAMP`` entries may also be present in sub-Manifests. Those
timestamps must not be newer than the timestamp of the top-level
Manifest (if present). This specification does not define any specific
use for them.
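
As an illustration, the timestamp comparison could look like the Python
sketch below. The format string is the one given for the ``TIMESTAMP`` tag
in `Modern Manifest tags`_; the function names and the seven-day threshold
are assumptions, since the specification does not define what counts as
significantly outdated.

```python
from datetime import datetime, timedelta, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_timestamp(value):
    """Parse a TIMESTAMP value into an aware UTC datetime."""
    parsed = datetime.strptime(value, TIMESTAMP_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


def is_outdated(value, now=None, max_age=timedelta(days=7)):
    """Return True if the Manifest is older than max_age (assumed policy)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - parse_timestamp(value) > max_age
```

A client would call ``is_outdated()`` right after syncing and then either
fail the verification or ask the user for confirmation.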


Modern Manifest tags
--------------------

The Manifest files can specify the following tags:

``TIMESTAMP <iso8601>``
  Specifies a timestamp of when the Manifest file was last updated.
  The timestamp must be a valid second-precision ISO8601 extended format
  combined date and time in UTC timezone, i.e. using the following
  ``strftime()`` format string: ``%Y-%m-%dT%H:%M:%SZ``. Optional.
  The package manager can use it to detect an outdated repository
  checkout as described in `Timestamp verification`_.

``MANIFEST <path> <size> <checksums>...``
  Specifies a sub-Manifest. The sub-Manifest must be verified like
  a regular file. If the verification succeeds, the entries from
  the sub-Manifest are included for verification as described
  in `Manifest file locations and nesting`_.

``IGNORE <path>``
  Ignores a subdirectory or file from Manifest checks. If the specified
  path is present, it and its contents are omitted from the Manifest
  verification (always pass). *Path* must be a plain file or directory
  path without a trailing slash, and must not contain wildcards.

``DATA <path> <size> <checksums>...``
  Specifies a regular file subject to Manifest verification. The file
  is required to pass verification. Used for all files that do not match
  any other type.

``DIST <filename> <size> <checksums>...``
  Specifies a distfile entry used to verify files fetched as part
  of ``SRC_URI``. The filename must match the filename used to store
  the fetched file as specified in the PMS [#PMS-FETCH]_. The package
  manager must reject the fetched file if it fails verification.
  ``DIST`` entries apply to all packages below the Manifest file
  specifying them.
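
To make the entry layout concrete, here is a minimal Python sketch that
parses a single Manifest line carrying one of the modern tags. The ``Entry``
tuple and ``parse_line`` function are hypothetical names, the deprecated tags
are deliberately left out, and no attempt is made to handle the OpenPGP
cleartext armor.

```python
from typing import Dict, NamedTuple, Optional


class Entry(NamedTuple):
    """Hypothetical parsed form of one Manifest line."""
    tag: str
    path: Optional[str] = None
    size: Optional[int] = None
    checksums: Optional[Dict[str, str]] = None
    timestamp: Optional[str] = None


def parse_line(line: str) -> Entry:
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    tag, args = fields[0], fields[1:]
    if tag == "TIMESTAMP":
        if len(args) != 1:
            raise ValueError("TIMESTAMP takes exactly one value")
        return Entry(tag, timestamp=args[0])
    if tag == "IGNORE":
        if len(args) != 1:
            raise ValueError("IGNORE takes exactly one path")
        return Entry(tag, path=args[0])
    if tag in ("MANIFEST", "DATA", "DIST"):
        # <path> <size> followed by (algorithm, digest) pairs.
        if len(args) < 2 or len(args) % 2 != 0:
            raise ValueError("malformed %s entry" % tag)
        pairs = args[2:]
        checksums = dict(zip(pairs[0::2], pairs[1::2]))
        return Entry(tag, args[0], int(args[1]), checksums)
    raise ValueError("unsupported tag: %s" % tag)
```

Splitting on whitespace is sufficient here because paths containing
whitespace are disallowed by this specification.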


Deprecated Manifest tags
------------------------

For backwards compatibility, the following tags are additionally
allowed at the package directory level:

``EBUILD <filename> <size> <checksums>...``
  Equivalent to the ``DATA`` type.

``MISC <path> <size> <checksums>...``
  Equivalent to the ``DATA`` type. Historically indicated that
  the package manager may ignore a verification failure if operating
  in non-strict mode. However, that behavior is deprecated.

``AUX <filename> <size> <checksums>...``
  Equivalent to the ``DATA`` type, except that the filename is relative
  to ``files/`` subdirectory.


Algorithm for full-tree verification
------------------------------------

In order to perform full-tree verification, the following algorithm
can be used:

1. Collect all files present in the repository into *present* set.

2. Start at the top-level Manifest file. Verify its OpenPGP signature.
   Optionally verify the ``TIMESTAMP`` entry if present as specified
   in `Timestamp verification`_. Remove the top-level Manifest
   from the *present* set.

3. Process all ``MANIFEST`` entries, recursively. Verify the Manifest
   files according to `file verification`_ section, and include their
   entries in the current Manifest entry list (using paths relative
   to directories containing the Manifests).

4. Process all ``IGNORE`` entries. Remove any paths matching them
   from the *present* set.

5. Collect all files covered by ``DATA``, ``MISC``, ``EBUILD``
   and ``AUX`` entries into the *covered* set.

6. Verify the entries in the *covered* set for incompatible duplicates
   and collisions with ignored files as explained in `Directory tree
   coverage`_.

7. Verify all the files in the union of the *present* and *covered*
   sets, according to `file verification`_ section.
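
The set handling in steps 4 to 7 can be sketched in Python as below. This is
a simplified illustration under several assumptions: the function names are
hypothetical, the caller has already performed steps 1 to 3 (so *present*
no longer contains the top-level Manifest and all sub-Manifest entries are
merged), paths use ``/`` separators relative to the repository root, and
names starting with a dot are not handled.

```python
from typing import Iterable, Set


def is_ignored(path: str, ignore_paths: Iterable[str]) -> bool:
    """True if path equals an IGNORE path or lies below one."""
    return any(path == ign or path.startswith(ign + "/")
               for ign in ignore_paths)


def paths_to_verify(present: Set[str], covered: Set[str],
                    ignore_paths: Set[str]) -> Set[str]:
    """Return the paths that need per-file verification."""
    # Step 4: drop everything matched by IGNORE from the present set.
    present = {p for p in present if not is_ignored(p, ignore_paths)}
    # Step 6 (collision part only): an entry must not exist for an
    # ignored path. Duplicate checking is out of scope for this sketch.
    clashes = {p for p in covered if is_ignored(p, ignore_paths)}
    if clashes:
        raise ValueError("entries collide with IGNORE: %s" % sorted(clashes))
    # Step 7: verify the union, so that both stray files (present only)
    # and missing files (covered only) are caught.
    return present | covered
```

Each returned path would then be checked individually according to
the file verification rules.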


Algorithm for finding parent Manifests
--------------------------------------

In order to find the top-level Manifest from the current directory
the following algorithm can be used:

1. Store the current directory as *original* and the device ID
   of the containing filesystem (``st_dev``) as *startdev*.

2. If the device ID of the containing filesystem (``st_dev``)
   of the current directory is different than *startdev*, stop.

3. If the current directory contains a ``Manifest`` file:

   a. If an ``IGNORE`` entry in the ``Manifest`` file covers
      the *original* directory (or one of the parent directories), stop.

   b. Otherwise, store the current directory as *last_found*.

4. If the current directory is the root system directory (``/``), stop.

5. Otherwise, enter the parent directory and jump to step 2.

Once the algorithm stops, *last_found* will contain the relevant
top-level Manifest. If *last_found* is null, then the directory tree
does not contain any valid top-level Manifest candidates and one should
be created in the *original* directory.

Once the top-level Manifest is found, its ``MANIFEST`` entries should
be used to find any sub-Manifests below the top-level Manifest,
up to and including the *original* directory. Note that those
sub-Manifests can use different filenames than ``Manifest``.
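
The upward search can be sketched in Python as follows. The function names
are hypothetical, the sketch assumes a POSIX system with ``/`` as the path
separator, and it returns the directory holding the top-level Manifest
(or ``None``) rather than the file itself. Locating sub-Manifests through
``MANIFEST`` entries is not covered.

```python
import os
from typing import Optional


def _ignores(manifest_dir: str, original: str) -> bool:
    """True if an IGNORE entry in manifest_dir/Manifest covers original."""
    rel = os.path.relpath(original, manifest_dir)
    manifest = os.path.join(manifest_dir, "Manifest")
    with open(manifest, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) == 2 and fields[0] == "IGNORE":
                if rel == fields[1] or rel.startswith(fields[1] + "/"):
                    return True
    return False


def find_top_level_manifest_dir(start: str) -> Optional[str]:
    original = os.path.realpath(start)          # step 1
    startdev = os.stat(original).st_dev
    current = original
    last_found = None
    while True:
        if os.stat(current).st_dev != startdev:  # step 2
            break
        if os.path.isfile(os.path.join(current, "Manifest")):  # step 3
            if _ignores(current, original):      # step 3a
                break
            last_found = current                 # step 3b
        parent = os.path.dirname(current)
        if parent == current:                    # step 4: reached "/"
            break
        current = parent                         # step 5
    return last_found
```

A ``None`` result corresponds to the case where no valid top-level Manifest
candidate exists and one should be created in the original directory.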


Checksum algorithms
-------------------

This section is informational only. Specifying the exact set
of supported algorithms is outside the scope of this specification.

The algorithm names reserved at the time of writing are:

- ``MD5`` [#MD5]_,
- ``RMD160`` -- RIPEMD-160 [#RIPEMD160]_,
- ``SHA1`` [#SHS]_,
- ``SHA256`` and ``SHA512`` -- SHA-2 family of hashes [#SHS]_,
- ``WHIRLPOOL`` [#WHIRLPOOL]_,
- ``BLAKE2B`` and ``BLAKE2S`` -- BLAKE2 family of hashes [#BLAKE2]_,
- ``SHA3_256`` and ``SHA3_512`` -- SHA-3 family of hashes [#SHA3]_,
- ``STREEBOG256`` and ``STREEBOG512`` -- Streebog family of hashes
  [#STREEBOG]_.

The method of introducing new hashes is defined by GLEP 59 [#GLEP59]_.
It is recommended that any new hashes are named after the Python
``hashlib`` module algorithm names, transformed into uppercase.
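
The naming recommendation makes the mapping to ``hashlib`` trivial for names
that follow it, as the Python sketch below shows. The function names are
hypothetical. Lowercasing works for names such as ``SHA256``, ``SHA3_256``
or ``BLAKE2B``; names like ``RMD160`` do not map this way and would need
an explicit table, which the sketch omits.

```python
import hashlib


def hashlib_name(manifest_name):
    """Map a Manifest algorithm name to a hashlib name by lowercasing."""
    return manifest_name.lower()


def file_checksums(path, manifest_names):
    """Compute all requested checksums in a single pass over the file."""
    hashers = {name: hashlib.new(hashlib_name(name))
               for name in manifest_names}
    with open(path, "rb") as f:
        # Read in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

Computing every hash during one read matters because verification is
expected to be mostly I/O bound.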


Manifest compression
--------------------

The topic of Manifest file compression is covered by GLEP 61 [#GLEP61]_.
This section merely addresses interoperability issues between Manifest
compression and this specification.

The compressed Manifest files are required to be suffixed for their
compression algorithm. This suffix should be used to recognize
the compression and decompress Manifests transparently. The exact list
of algorithms and their corresponding suffixes is outside the scope
of this specification.

The top-level Manifest file must not be compressed. Since the OpenPGP
signature covers the uncompressed text and is compressed itself,
the data would have to be decompressed without any prior verification.
This could expose users, e.g., to zip bombs or exploits of decompressor
vulnerabilities.

Whenever this specification refers to sub-Manifests, they can use any
names but are also required to use a specific compression suffix.
The ``MANIFEST`` entries are required to specify the full name including
compression suffix, and the verification is performed on the compressed
file.

The specification permits uncompressed Manifests to exist alongside
their compressed counterparts, and multiple compressed formats
to coexist. If that is the case, the files must have the same
uncompressed content and an implementation is free to choose any
of the files using the same base name.
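
Transparent decompression by suffix could look like the following
Python sketch. The suffix table (``.gz``, ``.bz2``, ``.xz``) is purely
an assumption, since the list of suffixes is out of scope here, and
``open_manifest`` is a hypothetical helper name. Any checksum
verification is expected to happen on the compressed file before this
function is called.

```python
import bz2
import gzip
import lzma

# Assumed suffix table; the real list is defined outside this GLEP.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_manifest(path: str):
    """Open a Manifest for reading text, decompressing by suffix."""
    for suffix, opener in OPENERS.items():
        if path.endswith(suffix):
            return opener(path, "rt", encoding="utf-8")
    # No known suffix: treat the file as an uncompressed Manifest.
    return open(path, "r", encoding="utf-8")
```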


Combining multiple Manifest trees (informational)
-------------------------------------------------

This specification permits nesting multiple hierarchical Manifest trees.
In this layout, the specific directories of the Manifest tree can
be verified both as a part of another top-level Manifest,
and as an independent Manifest tree (when obtained without the parent
directory).

For this to work, the sub-Manifest file in the directory must also
satisfy the requirements for the top-level Manifest file. That is:

- it must be named ``Manifest`` and not compressed,

- it must cover all the files in this directory and its subdirectories
  (i.e. no files from the directory tree can be covered by parent
  Manifest),

- if authenticity verification is desired, it must be OpenPGP-signed.

It should be noted that if such a directory is a subdirectory of a valid
Manifest tree, the sub-Manifest needs to be valid according
to the top-level Manifest and the OpenPGP signature is disregarded
as detailed in `Manifest file locations and nesting`_. The top-level
behavior is exhibited only when the directory is obtained without parent
directories.


An example Manifest file (informational)
----------------------------------------

An example top-level Manifest file for the Gentoo repository would have
the following content::

    TIMESTAMP 2017-10-30T10:11:12Z
    IGNORE distfiles
    IGNORE local
    IGNORE lost+found
    IGNORE packages
    MANIFEST app-accessibility/Manifest 14821 SHA256 1b5f.. SHA512 f7eb..
    ...
    MANIFEST eclass/Manifest.gz 50812 SHA256 8c55.. SHA512 2915..
    ...

An example modern Manifest (disregarding backwards compatibility)
for a package directory would have the following content::

    DATA SphinxTrain-0.9.1-r1.ebuild 932 SHA256 3d3b.. SHA512 be4d..
    DATA SphinxTrain-1.0.8.ebuild 912 SHA256 f681.. SHA512 0749..
    DATA metadata.xml 664 SHA256 97c6.. SHA512 1175..
    DATA files/gcc.patch 816 SHA256 b56e.. SHA512 2468..
    DATA files/gcc34.patch 333 SHA256 c107.. SHA512 9919..
    DIST SphinxTrain-0.9.1-beta.tar.gz 469617 SHA256 c1a4.. SHA512 1b33..
    DIST sphinxtrain-1.0.8.tar.gz 8925803 SHA256 548e.. SHA512 465d..
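
To make the field layout of the examples above concrete, here is
a minimal Python sketch that splits one checksum-bearing entry
(``DATA``, ``DIST``, ``MANIFEST`` and similar) into its parts.
``Entry`` and ``parse_entry`` are hypothetical names. The sketch does
not handle ``TIMESTAMP`` or ``IGNORE`` lines and performs no
validation of tag names or checksum values.

```python
from typing import Dict, NamedTuple


class Entry(NamedTuple):
    tag: str
    path: str
    size: int
    checksums: Dict[str, str]


def parse_entry(line: str) -> Entry:
    """Parse '<TAG> <path> <size> <ALGO> <value>...' into an Entry."""
    fields = line.split()
    if len(fields) < 3:
        raise ValueError("not a checksum entry: %r" % line)
    tag, path, size = fields[0], fields[1], int(fields[2])
    rest = fields[3:]
    if len(rest) % 2:
        raise ValueError("checksum name without a value: %r" % line)
    # Pair up algorithm names with their values.
    checksums = dict(zip(rest[0::2], rest[1::2]))
    return Entry(tag, path, size, checksums)
```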


Rationale
=========

Stand-alone format
------------------

The first question that needed to be asked before proceeding with
the design was whether the Manifest file format was supposed to be
stand-alone, or tightly bound to the repository format.

The stand-alone format has been selected because of its three
advantages:

1. It is more future-proof. If an incompatible change to the repository
   format is introduced, only developers need to upgrade the tools
   they use to generate the Manifests. The tools used to verify
   the updated Manifests will continue to work.

2. It is more flexible and universal. With a dedicated tool,
   the Manifest files can be used to sign and verify arbitrary file
   sets.

3. It keeps the verification tool simpler. In particular, we can easily
   write an independent verification tool that could work on any
   distribution without needing to depend on a package manager
   implementation or rewrite parts of it.

Designing a stand-alone format requires that the Manifest carries enough
information to perform the verification following all the rules specific
to the Gentoo repository.


Tree design
-----------

The second important point of the design was determining whether
the Manifest files should be structured hierarchically, or independently.
Both options have their advantages.

In the hierarchical model, each sub-Manifest file is covered by a higher
level Manifest. As a result, only the top-level Manifest has to be
OpenPGP-signed, and subsequent Manifests need only be verified by
the checksum stored in the parent Manifest. This has the following
implications:

- Verifying any set of files in the repository requires using checksums
  from the most relevant Manifests and the parent Manifests.

- The OpenPGP signature of the top-level Manifest needs to be verified
  only once per process.

- Altering any set of files requires updating the relevant Manifests,
  and their parent Manifests up to the top-level Manifest, and signing
  the last one.

- As a result, the top-level Manifest changes on every commit,
  and various middle-level Manifests change (and need to be transferred)
  frequently.

In the independent model, each sub-Manifest file is independent
of the parent Manifests. As a result, each of them needs to be signed
and verified independently. However, the parent Manifests still need
to list sub-Manifests (albeit without verification data) in order
to detect removal or replacement of subdirectories. This has
the following implications:

- Verifying any set of files in the repository requires using checksums
  and verifying signatures of the most relevant Manifest files.

- Altering any set of files requires updating the relevant Manifests
  and signing them again.

- Parent Manifests are updated only when Manifests are added or removed
  from subdirectories. As a result, they change infrequently.

While both models have their advantages, the hierarchical model was
selected because it reduces the number of OpenPGP operations
(which are comparatively costly) to the minimum.


Tree layout restrictions
------------------------

The algorithm is meant to work primarily with ebuild repositories which
normally contain only files and directories. Directories provide
no useful metadata for verification, and specifying special entries
for additional file types is purposeless. Therefore, the specification
is restricted to dealing with regular files.

The Gentoo repository does not use symbolic links. Some Gentoo
repositories do, however. To handle symlinks simply, without
implementing any special handling for them, the common behavior
of implicitly resolving them is used.
Therefore, symbolic links to files are stored as if they were regular
files, and symbolic links to directories are followed as if they were
regular directories.

Dotfiles are implicitly ignored as that is a common notion used
in software written for POSIX systems. All other filenames require
explicit ``IGNORE`` lines.

An ability to inject additional ignore entries is provided to account
for site configuration affecting the repository tree -- placing
additional files in it, skipping some of the categories from syncing.
This configuration can extend beyond the limits of this GLEP,
e.g. by allowing wildcards or regular expressions.

The algorithm is restricted to work on a single filesystem. This is
mostly relevant when scanning for the top-level Manifest -- we do not
want to cross filesystem boundaries then. However, to ensure consistent
bidirectional behavior we also need to forbid crossing them when
descending the tree.

The directories and files on different filesystems need to be ignored
explicitly as implicitly skipping them would cause confusion.
In particular, tools might then claim that a file does not exist when
it clearly does because it was skipped due to filesystem boundaries.
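
The layout rules of this section can be sketched in Python as a tree
walk that collects the files a verifier would need to account for.
``list_present`` is a hypothetical helper, and ``ignored`` is assumed
to be a set of paths relative to the root taken from ``IGNORE``
entries. For brevity, the sketch only checks directories (not
individual files) for filesystem boundaries.

```python
import os


def _rel(rel_dir, name):
    """Join a directory-relative name onto a root-relative directory."""
    return name if rel_dir == "." else os.path.join(rel_dir, name)


def list_present(root, ignored=frozenset()):
    """Return the set of root-relative files subject to verification."""
    root_dev = os.stat(root).st_dev
    present = set()
    # followlinks=True: symlinks to directories are treated as directories.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=True):
        rel_dir = os.path.relpath(dirpath, root)
        kept = []
        for name in sorted(dirnames):
            if name.startswith(".") or _rel(rel_dir, name) in ignored:
                continue  # dotfiles are implicit, the rest explicit
            if os.stat(os.path.join(dirpath, name)).st_dev != root_dev:
                # Not skipped silently: it must be ignored explicitly.
                raise RuntimeError("filesystem boundary at %s" % name)
            kept.append(name)
        dirnames[:] = kept  # prune the walk in place
        for name in filenames:
            if name.startswith(".") or _rel(rel_dir, name) in ignored:
                continue
            present.add(_rel(rel_dir, name))
    return present
```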


Filename character set restriction
----------------------------------

The valid set of filename characters for the Gentoo repository
is restricted by the devmanual 'File Naming Rules' section
[#FILE-NAMING-RULES]_, and enforced via a git hook. The valid distfile
names are not restricted explicitly -- however, the PMS dependency
specification syntax [#PMS-FETCH]_ implicitly makes it impossible to use
filenames containing whitespace.

This specification aims to avoid arbitrary restrictions. For this
reason, filename characters are only restricted by excluding two
technically problematic groups:

1. The NULL character (``U+0000``) is normally used to indicate the end
   of a null-terminated string. Its use could therefore break programs
   written using C. Furthermore, it is not allowed in any known
   filesystem.

2. Whitespace characters are used to separate Manifest fields. While
   technically it would be enough to restrict the space (``U+0020``)
   character that is normally used as the separator, all whitespace
   characters are forbidden to avoid confusion and implementation
   errors.

While the specification could be extended to allow such filenames
by using some form of escaping, there is currently no apparent need
for such a feature.

Historically, Portage attempted to overcome the whitespace limitation
by attempting to locate the size field and take everything before it
as filename. This was terribly fragile and even if it worked, it would
solve the problem only partially.

Since the same restrictions apply to ``IGNORE`` rules, it is currently
not possible to either list or ignore a file whose name contains
whitespace characters. Therefore, the presence of such files is
forbidden entirely.
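
A check for the two excluded groups might be sketched in Python as
follows. ``is_valid_manifest_path`` is a hypothetical name. Note
the assumption: ``str.isspace()`` is used as an approximation
of "whitespace"; it is close to, but not identical with,
the Unicode ``White_Space`` property (for instance, it also matches
the control characters ``U+001C`` to ``U+001F``).

```python
def is_valid_manifest_path(path: str) -> bool:
    """Reject empty paths, NULL characters and (approximate) whitespace."""
    if not path:
        return False
    return not any(ch == "\0" or ch.isspace() for ch in path)
```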


File verification model
-----------------------

The verification model aims to provide full coverage against different
forms of attack. In particular, three different kinds of manipulation
are considered:

1. Alteration of the file content.

2. Removal of a file.

3. Addition of a new file.

In order to protect against all three, the system requires that all
files in the repository are listed in Manifests and verified against
them.

As a special case, ignores are allowed to account for directories
that are not part of the repository but were traditionally placed inside
it. Those directories were ``distfiles``, ``local`` and ``packages``.
Ignores could also be used for VCS directories such as ``CVS``.
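
The three kinds of manipulation map naturally onto set operations.
The following Python sketch assumes both inputs are dictionaries
mapping a path to a ``(size, checksum)`` pair, with ignored paths
already removed; ``compare_sets`` is a hypothetical name.

```python
def compare_sets(present, covered):
    """Classify differences between on-disk files and Manifest entries.

    present: {path: (size, checksum)} computed from the filesystem
    covered: {path: (size, checksum)} read from the Manifests
    """
    common = present.keys() & covered.keys()
    return {
        # 1. content differs from what the Manifest lists
        "altered": sorted(p for p in common if present[p] != covered[p]),
        # 2. listed in a Manifest but absent on disk
        "missing": sorted(covered.keys() - present.keys()),
        # 3. on disk but not listed in any Manifest
        "unexpected": sorted(present.keys() - covered.keys()),
    }
```

Verification passes only if all three lists are empty.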


Non-strict Manifest verification
--------------------------------

Originally the Manifest2 format provided a special ``MISC`` tag that
was used for ``metadata.xml`` and ``ChangeLog`` files. This tag
indicated that the Manifest verification failures could be ignored for
those files unless the package manager was working in strict mode.

The first versions of this specification continued the use of this tag.
However, after a long debate it was decided to deprecate it along with
the non-strict behavior, and require all files to strictly match.

Two arguments were mentioned for the usefulness of a ``MISC`` type:

1. being able to reduce the checkout size by stripping unnecessary
   files out, and

2. being able to update automatically generated files locally
   without causing unnecessary verification failures.

However, the usefulness of ``MISC`` in both cases is doubtful.

The cases for stripping unnecessary files mostly focused around space
savings. For this purpose, stripping ``metadata.xml`` and similar files
has little value. It is much more common for users to strip whole
packages or categories. The ``MISC`` type is not suitable for that,
and so a dedicated package manager mechanism needs to be developed
instead. The same mechanism can also handle files that historically used
the ``MISC`` type. As an example, the package manager may choose
to generate both the rsync exclusion list and Manifest ignore list
using a single source list.

The cases for autogenerated files involve such cache files
as ``use.local.desc``. However, we cannot include ``md5-cache`` there
due to security concerns, which results in inconsistent cache handling.
Furthermore, the tools were historically modified to provide stable
output, which means that their content cannot change without
a non-``MISC`` content being changed first. This practically defeats
the purpose of using ``MISC``.

Finally, the non-strict mode could be used as a means of attack.
Allowing a documentation file to be missing or modified could be used
to spread misinformation, resulting in bad decisions made by the user.
A modified file could also be used, e.g. to exploit vulnerabilities
of an XML parser.


Timestamp field
---------------

The top-level Manifest optionally allows using a ``TIMESTAMP`` tag
to include a generation timestamp in the Manifest. A similar feature
was originally proposed in GLEP 58 [#GLEP58]_.

A malicious third party may use the principles of exclusion or replay
[#C08]_ to deny an update to clients, while at the same time recording
the identity of clients to attack. The timestamp field can be used to
detect that.

In order to provide more complete protection, the Gentoo Infrastructure
should provide an ability to obtain the timestamps of all Manifests
from a recent timeframe over a secure channel from a trusted source
for comparison.

Strictly speaking, this information is already provided by the various
``metadata/timestamp*`` files that are present in the repository.
However, including the value in the Manifest itself has little cost
and provides the ability to perform the verification stand-alone.

Furthermore, some of the timestamp files are added very late
in the distribution process, past the Manifest generation phase. Those
files will most likely receive ``IGNORE`` entries and therefore
be unsafe to use.

The specification permits additional timestamps in sub-Manifest files
for local use. A generic testing tool should ignore them.
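
A staleness check based on the ``TIMESTAMP`` entry could be sketched
in Python as below. The function names are hypothetical, and
the 7-day threshold is an arbitrary assumption, as this GLEP does not
define what counts as significantly outdated. The format string
matches the second-precision UTC form used in the example Manifest.

```python
from datetime import datetime, timedelta, timezone


def parse_timestamp(line: str) -> datetime:
    """Parse 'TIMESTAMP 2017-10-30T10:11:12Z' into an aware datetime."""
    tag, value = line.split()
    if tag != "TIMESTAMP":
        raise ValueError("not a TIMESTAMP entry: %r" % line)
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def is_stale(timestamp, now, max_age=timedelta(days=7)):
    """Return True if the Manifest is older than max_age (assumed policy)."""
    return now - timestamp > max_age
```

A client would pass the time of the sync as ``now`` and, on a stale
result, either fail or ask the user for confirmation.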


New vs deprecated tags
----------------------

Out of the four types defined by Manifest2, only one is reused
and the remaining three are replaced by a single, universal ``DATA``
type.

The ``DIST`` tag is reused since the specification does not change
anything with regard to distfile handling.

The ``EBUILD`` tag could potentially be reused for generic file
verification data. However, it would be confusing if all the different
data files were marked as ``EBUILD``. Therefore, an equivalent ``DATA``
type was introduced as a replacement.

The ``MISC`` tag and the relevant non-strict mode have been removed
as being of little value, as detailed in the `Non-strict Manifest
verification`_ section.

The ``AUX`` tag is deprecated as it is redundant to ``DATA``, and has
the limiting property of implicit ``files/`` path prefix.


Finding top-level Manifest
--------------------------

The development of a reference implementation for this GLEP has raised
the following question: how to find all the relevant Manifests when
the Manifest tool is run inside a subdirectory of the repository?

One of the options would be to provide a bi-directional linking
of Manifests via a ``PARENT`` tag. However, that would not solve
the problem when a new Manifest file is being created.

Instead, an algorithm for iterating over parent directories is proposed.
Since there is no obligatory explicit indicator for the top-level
Manifest, the algorithm assumes that the top-level Manifest
is the highest ``Manifest`` in the directory hierarchy that can cover
the current directory. This generally makes sense since the Manifest
files are required to provide coverage for all subdirectories, so all
Manifests starting from that one need to be updated.

If independent Manifest trees are nested in the directory structure,
then an ``IGNORE`` entry needs to be used to separate them.

Since sub-Manifests can use any filenames, the Manifest finding
algorithm must not short-cut the procedure by storing all ``Manifest``
files along the parent directories. Instead, it needs to retrace
the relevant sub-Manifest files along ``MANIFEST`` entries
in the top-level Manifest.
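
The upward search can be sketched in Python as follows. This is
a simplified illustration under stated assumptions, not the full
algorithm: ``find_top_level_manifest`` is a hypothetical name,
the sketch stops at filesystem boundaries, and it does not inspect
``IGNORE`` entries, which a complete implementation needs to do
to separate nested independent trees.

```python
import os


def find_top_level_manifest(start_dir):
    """Return the highest 'Manifest' at or above start_dir, or None."""
    current = os.path.abspath(start_dir)
    start_dev = os.stat(current).st_dev
    found = None
    while True:
        if os.stat(current).st_dev != start_dev:
            break  # never cross onto another filesystem
        candidate = os.path.join(current, "Manifest")
        if os.path.isfile(candidate):
            found = candidate  # keep going: a higher one may exist
        parent = os.path.dirname(current)
        if parent == current:
            break  # reached the filesystem root
        current = parent
    return found
```

Once the top-level Manifest is known, the sub-Manifests are located
by following its ``MANIFEST`` entries downwards rather than by reusing
the ``Manifest`` files seen on the way up.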


Injecting ChangeLogs into the checkout
--------------------------------------

One of the problems considered in the new Manifest format was injecting
historical and autogenerated ChangeLogs into the repository. We normally
don't include those files, to reduce the checkout size. However, some
users have shown interest in them and Infra is working on providing them
via an additional rsync module.

If such files were injected into the repository, they would cause
verification failures of Manifests. To account for this, Infra could
provide ``IGNORE`` entries to allow them to exist.


Splitting distfile checksums from file checksums
------------------------------------------------

Another problem with the current Manifest format is that the checksums
for fetched files are combined with checksums for local files
in a single file inside the package directory. It has been specifically
pointed out that:

- since distfiles are sometimes reused across different packages,
  the repeating checksums are redundant [#DIST]_,

- mirror admins were interested in the possibility of verifying all
  the distfiles with a single tool.

This specification does not provide a clean solution to this problem.
It technically permits moving ``DIST`` entries to higher-level Manifests
but the usefulness of such a solution is doubtful.

However, for the second problem we will probably deliver a dedicated
tool working with this Manifest format.


Hash algorithms
---------------

While maintaining a consistent supported hash set is important
for interoperability, it is not a good fit for the generic layout
of this GLEP. Furthermore, it would require updating the GLEP
in the future every time the used algorithms change.

Instead, the specification focuses on listing the currently used
algorithm names for interoperability, and sets a recommendation
for consistent naming of algorithms in the future. The Python
``hashlib`` module is used as a reference since it is used
as the provider of hash functions for most of the Python software,
including Portage and PkgCore.

The basic rules for changing hash algorithms are defined in GLEP 59
[#GLEP59]_. The implementations can focus only on those algorithms
that are actually used or planned on being used. It may be feasible
to devise a new GLEP that specifies the currently used hashes (or update
GLEP 59 accordingly).


Manifest compression
--------------------

The support for Manifest compression is introduced with minimal changes
to the file format. The ``MANIFEST`` entries are required to provide
the real (compressed) file path for compatibility with other file
entries and to avoid confusion.

The compression of the top-level Manifest file has been prohibited
as the specification currently does not provide any means of verifying
the file prior to decompression. If the top-level Manifest is
compressed, tooling will have to unpack the file before being able
to verify the contents. This makes it possible for a malicious third
party to attack the system by providing a compressed Manifest that
exposes decompressor vulnerabilities, or a zip bomb.

The OpenPGP cleartext signature covers the contents of the Manifest,
and is therefore compressed along with them. The possibility of using
a detached signature has been considered but it was rejected as
unnecessary complexity for minor gain.

Technically, a similar result could be effected via moving all the data
into a compressed sub-Manifest in the top directory (e.g.
``Manifest.sub.gz``), and including a ``MANIFEST`` entry for this file
in a signed, uncompressed top-level Manifest.

The existence of additional entries for uncompressed Manifest checksums
was debated. However, plain entries for the uncompressed file would
be confusing if only the compressed file existed, and conflicting
if both uncompressed and compressed variants existed. Furthermore,
it has been pointed out that ``DIST`` entries do not have
an uncompressed variant either.


Performance considerations
--------------------------

Performing a full-tree verification on every sync raises some
performance concerns for end-user systems. The initial testing has shown
that a cold-cache verification on a btrfs file system can take around
4 minutes, with the process being mostly I/O bound. On the other hand,
it can be expected that the verification will be performed directly
after syncing, taking advantage of a warm filesystem cache.

To improve speed on I/O and/or CPU-constrained systems even further,
the algorithms can be easily extended to perform incremental
verification. Given that rsync does not preserve mtimes by default,
the tool can take advantage of mtime and Manifest comparisons to recheck
only the parts of the repository that have changed.

Furthermore, the package manager implementations can restrict checking
only to the parts of the repository that are actually being used.
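
The incremental idea can be sketched in Python as a comparison against
a cache saved after the previous successful verification.
``select_for_recheck`` and the cache layout are hypothetical:
both mappings are assumed to go from a path to a pair of the file's
mtime and its current Manifest entry.

```python
def select_for_recheck(current, cache):
    """Return paths whose mtime or Manifest entry changed since last run.

    current: {path: (mtime_ns, manifest_entry)} for this run
    cache:   the same mapping saved after the last verified run
    """
    return sorted(
        path for path, state in current.items()
        if cache.get(path) != state  # new file, new mtime or new entry
    )
```

Only the returned paths would need their checksums recomputed;
removed files still have to be detected by comparing the key sets.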


Backwards Compatibility
=======================

This GLEP provides optional means of preserving backwards compatibility.
To preserve the backwards compatibility, the following needs to hold
for the ``Manifest`` file in every package directory:

- all files must be covered by the single ``Manifest`` file,

- all distfiles used by the package must be included,

- all files inside the ``files/`` subdirectory need to use
  the ``AUX`` tag (rather than ``DATA``),

- all ``.ebuild`` files need to use the ``EBUILD`` tag,

- the ``metadata.xml`` and ``ChangeLog`` files need to use
  the ``MISC`` tag,

- the Manifest can be signed to provide authenticity verification,

- an uncompressed Manifest must always exist, and a compressed Manifest
  of identical content may be present.

Once the backwards compatibility is no longer a concern, the above
no longer needs to hold and the deprecated tags can be removed.
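
The tag selection rules listed above can be sketched in Python
as follows. ``compat_entry`` is a hypothetical name; the path is
assumed to be relative to the package directory, and the ``AUX``
filename is returned relative to ``files/`` since that tag implies
the prefix.

```python
def compat_entry(path: str):
    """Return (tag, filename) for a package file in compatibility mode."""
    if path.startswith("files/"):
        # AUX filenames carry an implicit files/ prefix.
        return "AUX", path[len("files/"):]
    if path.endswith(".ebuild"):
        return "EBUILD", path
    if path in ("metadata.xml", "ChangeLog"):
        return "MISC", path
    return "DATA", path
```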


Reference Implementation
========================

The reference implementation for this GLEP is being developed
as the gemato project [#GEMATO]_.


Credits
=======

Thanks to all the people whose contributions were invaluable
to the creation of this GLEP. This includes but is not limited to:

- Robin Hugh Johnson,
- Ulrich Müller.

Additionally, thanks to Robin Hugh Johnson for the original
MetaManifest GLEP series which served both as inspiration and source
of many concepts used in this GLEP. Recursively, also thanks to all
the people who contributed to the original GLEPs.


References
==========

.. [#GLEP44] GLEP 44: Manifest2 format
   (https://www.gentoo.org/glep/glep-0044.html)

.. [#GLEP57] GLEP 57: Security of distribution of Gentoo software
   - Overview
   (https://www.gentoo.org/glep/glep-0057.html)

.. [#GLEP58] GLEP 58: Security of distribution of Gentoo software
   - Infrastructure to User distribution - MetaManifest
   (https://www.gentoo.org/glep/glep-0058.html)

.. [#GLEP59] GLEP 59: Manifest2 hash policies and security implications
   (https://www.gentoo.org/glep/glep-0059.html)

.. [#GLEP60] GLEP 60: Manifest2 filetypes
   (https://www.gentoo.org/glep/glep-0060.html)

.. [#GLEP61] GLEP 61: Manifest2 compression
   (https://www.gentoo.org/glep/glep-0061.html)

.. [#UNICODE] The Unicode standard
   (https://unicode.org/versions/latest/)

.. [#PMS-FETCH] Package Manager Specification: Dependency Specification
   Format - SRC_URI
   (https://projects.gentoo.org/pms/6/pms.html#x1-940008.2.10)

.. [#FILE-NAMING-RULES] Ebuild File Format -- Gentoo Development Guide
   (https://devmanual.gentoo.org/ebuild-writing/file-format/#file-naming-rules)

.. [#MD5] RFC1321: The MD5 Message-Digest Algorithm
   (https://www.ietf.org/rfc/rfc1321.txt)

.. [#RIPEMD160] The hash function RIPEMD-160
   (https://homes.esat.kuleuven.be/~bosselae/ripemd160.html)

.. [#SHS] FIPS PUB 180-4: Secure Hash Standard (SHS)
   (http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf)

.. [#WHIRLPOOL] The WHIRLPOOL Hash Function
   (http://www.larc.usp.br/~pbarreto/WhirlpoolPage.html)

.. [#BLAKE2] BLAKE2 -- fast secure hashing
   (https://blake2.net/)

.. [#SHA3] FIPS PUB 202: SHA-3 Standard: Permutation-Based Hash
   and Extendable-Output Functions
   (http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.202.pdf)

.. [#STREEBOG] GOST R 34.11-2012: Streebog Hash Function
   (https://www.streebog.net/)

.. [#C08] Cappos, J et al. (2008). "Attacks on Package Managers"
   (https://www2.cs.arizona.edu/stork/packagemanagersecurity/attacks-on-package-managers.html)

.. [#DIST] According to Robin H. Johnson, 8.4% of all DIST entries
   at the time of writing are duplicate, representing 2 MiB
   out of 25 MiB of DIST entries altogether.

.. [#GEMATO] gemato: Gentoo Manifest Tool
   (https://github.com/mgorny/gemato/)


Copyright
=========
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
Unported License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.

-- 
Best regards,
Michał Górny




* Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update [v2]
  2017-11-20 18:42 ` [gentoo-dev] [RFC] GLEP 74 post-Council review update [v2] Michał Górny
@ 2017-11-20 21:37   ` Ulrich Mueller
  2017-11-21  6:30     ` Ulrich Mueller
  2017-11-21 17:14     ` Michał Górny
  2017-11-22  2:59   ` R0b0t1
  1 sibling, 2 replies; 23+ messages in thread
From: Ulrich Mueller @ 2017-11-20 21:37 UTC (permalink / raw
  To: gentoo-dev


>>>>> On Mon, 20 Nov 2017, Michał Górny wrote:

> New changes:

> 9d819c9 glep-0074: Disallow filenames containing whitespace
> 4124b2f glep-0074: Explicitly specify UTF-8 encoding
> 7f9bd9f glep-0074: Include suggestions from Daniel Campbell

Here are a few comments (quoting below only the parts of the text
referenced by them):

> The Manifest files use UTF-8 encoding.

I don't understand the purpose of that requirement. The only place
where bytes outside of the ASCII range can occur are names of
distfiles, and these should simply be passed transparently. Otherwise,
you would have to reject any sequence of non-ASCII bytes that doesn't
form a valid UTF-8 sequence, which looks like an arbitrary restriction
to me.

> It is an error for a single file to be matched by multiple entries
> of different semantics, file size or checksum values. It is an error
> to specify another entry for a file matching ``IGNORE``, or one of its
> subdirectories.

What about regular files in a directory (or subdirectory) matched by
IGNORE? Looks like this case is not covered (?).

> All paths specified in the Manifest file must consist of characters
> corresponding to valid UTF-8 code points excluding the NULL character
> (``U+0000``) and characters classified as whitespace in the current
> version of the Unicode standard [#UNICODE]_. It is an error to use
> Manifest files in directories containing files whose names contain
> the disallowed characters.

See above. I believe that NUL and ASCII whitespace (i.e. characters 09
0a 0b 0c 0d 20) should be excluded, but excluding byte sequences like
"e1 9a 80" (which is the UTF-8 encoding for U+1680 "OGHAM SPACE MARK")
doesn't make sense.

> During the verification process, the client should compare the timestamp
> against the update time obtained from a local clock or a trusted time
> source. If the comparison result indicates that the Manifest at the time
> of receiving was already significantly outdated, the client should
> either fail the verification or require manual confirmation from user.

s/from user./from the user./

> ``TIMESTAMP <iso8601>``
>   Specifies a timestamp of when the Manifest file was last updated.
>   The timestamp must be a valid second-precision ISO8601 extended format

s/ISO8601/ISO 8601/

> ``IGNORE <path>``
>   Ignores a subdirectory or file from Manifest checks. If the specified
>   path is present, it and its contents are omitted from the Manifest
>   verification (always pass). *Path* must be a plain file or directory
>   path without a trailing slash, and must not contain wildcards.

What does that mean? Wildcards are not special (so "foo*" will match
literally), or wildcard characters like "*" are not allowed at all?

> ``AUX <filename> <size> <checksums>...``
>   Equivalent to the ``DATA`` type, except that the filename is relative
>   to ``files/`` subdirectory.

s/to/to the/

> 3. Process all ``MANIFEST`` entries, recursively. Verify the Manifest
>    files according to `file verification`_ section, and include their

s/according to/according to the/

> 6. Verify the entries in *covered* set for incompatible duplicates

s/in *covered* set/in the *covered* set/

> 7. Verify all the files in the union of the *present* and *covered*
>    sets, according to `file verification`_ section.

s/to/to the/

>    a. If a ``IGNORE`` entry in the ``Manifest`` file covers
>       the *original* directory (or one of the parent directories), stop.

s/a ``IGNORE`` entry/an ``IGNORE`` entry/

> An example top-level Manifest file for the Gentoo repository would have
> the following content::

>     TIMESTAMP 2017-10-30T10:11:12Z
>     IGNORE distfiles
>     IGNORE local
>     IGNORE lost+found
>     IGNORE packages
>     MANIFEST app-accessibility/Manifest 14821 SHA256 1b5f.. SHA512 f7eb..
>     ...
>     MANIFEST eclass/Manifest.gz 50812 SHA256 8c55.. SHA512 2915..
>     ...

> An example modern Manifest (disregarding backwards compatibility)
> for a package directory would have the following content::

>     DATA SphinxTrain-0.9.1-r1.ebuild 932 SHA256 3d3b.. SHA512 be4d..
>     DATA SphinxTrain-1.0.8.ebuild 912 SHA256 f681.. SHA512 0749..
>     DATA metadata.xml 664 SHA256 97c6.. SHA512 1175..
>     DATA files/gcc.patch 816 SHA256 b56e.. SHA512 2468..
>     DATA files/gcc34.patch 333 SHA256 c107.. SHA512 9919..
>     DIST SphinxTrain-0.9.1-beta.tar.gz 469617 SHA256 c1a4.. SHA512 1b33..
>     DIST sphinxtrain-1.0.8.tar.gz 8925803 SHA256 548e.. SHA512 465d..

Update hashes to BLAKE2B SHA512?

> This specification aims to avoid arbitrary restrictions. For this
> reason, the filename characters are only restricted by excluding two

s/the filename characters/filename characters/

> technically problematic groups:

> 1. The NULL character (``U+0000``) is normally used to indicate the end
>    of a null-terminated string. Its use could therefore break programs
>    written using C. Furthermore, it is not allowed in any known
>    filesystem.

> 2. The whitespace characters are used to separate Manifest fields. While

s/The whitespace characters/Whitespace characters/

> 2. being able to run update automatically generated files locally
>    without causing unnecessary verification failures.

Strike the word "run"?

> Strictly speaking, this information is already provided by the various
> ``metadata/timestamp*`` files that are already present. However,

Twice "already" in this sentence.

> The OpenPGP cleartext signature covers the contents of the Manifest,
> and is therefore compressed along with them. The possibility of using
> detached signature has been considered but it was rejected as

s/detached signature/a detached signature/

> The existence of additional entries for uncompressed Manifest checksums
> was debated. However, plain entries for the uncompressed file would
> be confusing if only the compressed file existed, and conflicting
> if both uncompressed and compressed variants existed. Furthermore,
> it has been pointed out that ``DIST`` entries do not have uncompressed
> variant either.

s/uncompressed variant/an uncompressed variant/

> .. [#DIST] According to Robin H. Johnson, 8.4% of all DIST entries
>    at the time of writing are duplicate, representing a 2 MiB
>    out of 25 MiB of DIST entries altogether.

s/a 2 MiB/2 MiB/

> Copyright
> =========

There should be two blank lines before this section heading (as
required by GLEP 2).

Ulrich



* Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update [v2]
  2017-11-20 21:37   ` Ulrich Mueller
@ 2017-11-21  6:30     ` Ulrich Mueller
  2017-11-21 17:14     ` Michał Górny
  1 sibling, 0 replies; 23+ messages in thread
From: Ulrich Mueller @ 2017-11-21  6:30 UTC (permalink / raw
  To: Ulrich Mueller; +Cc: gentoo-dev


>>>>> On Mon, 20 Nov 2017, Ulrich Mueller wrote:

>>>>> On Mon, 20 Nov 2017, Michał Górny wrote:
>> All paths specified in the Manifest file must consist of characters
>> corresponding to valid UTF-8 code points excluding the NULL character
>> (``U+0000``) and characters classified as whitespace in the current
>> version of the Unicode standard [#UNICODE]_. It is an error to use
>> Manifest files in directories containing files whose names contain
>> the disallowed characters.

> See above. I believe that NUL and ASCII whitespace (i.e. characters
> 09 0a 0b 0c 0d 20) should be excluded, but excluding byte sequences
> like "e1 9a 80" (which is the UTF-8 encoding for U+1680 "OGHAM SPACE
> MARK") doesn't make sense.

Thinking about it, this still looks too complicated. So, exclude only
SPACE (0x20) which is used as separator between fields. (NUL can be
excluded too, but it won't occur anyway.)

In fact, all Manifest files in the tree are ASCII only.
So alternatively, filenames could be restricted to printable ASCII.
This is also what GLEP 31 [1] says:

| Suitable Characters for File and Directory Names
|
| Characters outside the ASCII 0..127 range cannot safely be used for
| file or directory names. (Of course, not all characters inside the
| ASCII 0..127 range can be used safely either.)

Ulrich


[1] Character Sets for Portage Tree Items
    https://www.gentoo.org/glep/glep-0031.html


* Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update [v2]
  2017-11-20 21:37   ` Ulrich Mueller
  2017-11-21  6:30     ` Ulrich Mueller
@ 2017-11-21 17:14     ` Michał Górny
  2017-11-21 20:28       ` Ulrich Mueller
  1 sibling, 1 reply; 23+ messages in thread
From: Michał Górny @ 2017-11-21 17:14 UTC
  To: gentoo-dev

On Mon, 20.11.2017 at 22:37 +0100, Ulrich Mueller wrote:
> > > > > > On Mon, 20 Nov 2017, Michał Górny wrote:
> > New changes:
> > 9d819c9 glep-0074: Disallow filenames containing whitespace
> > 4124b2f glep-0074: Explicitly specify UTF-8 encoding
> > 7f9bd9f glep-0074: Include suggestions from Daniel Campbell
> 
> Here are a few comments (quoting below only the parts of the text
> referenced by them):
> 
> > The Manifest files use UTF-8 encoding.
> 
> I don't understand the purpose of that requirement. The only place
> where bytes outside of the ASCII range can occur are names of
> distfiles, and these should simply be passed transparently. Otherwise,
> you would have to reject any sequence of non-ASCII bytes that doesn't
> form a valid UTF-8 sequence, which looks like an arbitrary restriction
> to me.

Let me reply in parts.

Why not plain ASCII? Because the spec tries to avoid entirely arbitrary
restrictions, and forcing everyone to use just ASCII entirely counts
as such.

Why not plain bytestrings? Mostly because it's a real PITA to work
with them in Python. Besides, you can't allow arbitrary bytestrings since
you still need to apply restrictions that make them safe to parse in
a text context, i.e. forbid 0x20, 0x0A, possibly more. Which makes
the definition kinda silly in the end. Not to mention transferring files
between systems which can recode filenames but will not recode Manifest
contents.

Why UTF-8 then? Because it's quite reliable and widely established.
It works for most people out of the box. Those who use other
encodings can usually transcode reliably. It's what we're using
in ebuilds and everywhere else wrt GLEP 31, so I don't think we should
make Manifests any different.

> > It is an error for a single file to be matched by multiple entries
> > of different semantics, file size or checksum values. It is an error
> > to specify another entry for a file matching ``IGNORE``, or one of its
> > subdirectories.
> 
> What about regular files in a directory (or subdirectory) matched by
> IGNORE? Looks like this case is not covered (?).

Ignored regular files must not have any other (e.g. DATA) entries.
Otherwise the expected behavior is unclear -- are we supposed to verify
the file or ignore it?

> > All paths specified in the Manifest file must consist of characters
> > corresponding to valid UTF-8 code points excluding the NULL character
> > (``U+0000``) and characters classified as whitespace in the current
> > version of the Unicode standard [#UNICODE]_. It is an error to use
> > Manifest files in directories containing files whose names contain
> > the disallowed characters.
> 
> See above. I believe that NUL and ASCII whitespace (i.e. characters 09
> 0a 0b 0c 0d 20) should be excluded, but excluding byte sequences like
> "e1 9a 80" (which is the UTF-8 encoding for U+1680 "OGHAM SPACE MARK")
> doesn't make sense.

The restriction is meant to be intentionally wider to prevent problems
with implementations which e.g. use Python's str.split() or '\S' regular
expression character (Portage). When working in Unicode-compliant mode,
those can match additional whitespace characters, and I'm rejecting them
to be on the safe side.
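
To illustrate the concern, here is a minimal Python sketch (not taken
from Portage; the sample name is made up). It shows that text-mode
splitting treats U+1680 as a separator, while byte-mode splitting only
honours ASCII whitespace:

```python
import re

# U+1680 OGHAM SPACE MARK sits between the two halves of this name.
name = "foo\u1680bar"

# In text (str) mode both str.split() and the \S class are Unicode-aware,
# so the single name is cut in two.
str_fields = name.split()
regex_fields = re.findall(r"\S+", name)

# In bytes mode only ASCII whitespace counts, so the name stays whole.
byte_fields = name.encode("utf-8").split()

print(str_fields, regex_fields, byte_fields)
```

A parser working on decoded text would therefore see two fields where
the Manifest author meant one, unless such characters are rejected up
front.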

> > During the verification process, the client should compare the timestamp
> > against the update time obtained from a local clock or a trusted time
> > source. If the comparison result indicates that the Manifest at the time
> > of receiving was already significantly outdated, the client should
> > either fail the verification or require manual confirmation from user.
> 
> s/from user./from the user./
> 
> > ``TIMESTAMP <iso8601>``
> >   Specifies a timestamp of when the Manifest file was last updated.
> >   The timestamp must be a valid second-precision ISO8601 extended format
> 
> s/ISO8601/ISO 8601/

Both done.

> 
> > ``IGNORE <path>``
> >   Ignores a subdirectory or file from Manifest checks. If the specified
> >   path is present, it and its contents are omitted from the Manifest
> >   verification (always pass). *Path* must be a plain file or directory
> >   path without a trailing slash, and must not contain wildcards.
> 
> What does that mean? Wildcards are not special (so "foo*" will match
> literally), or wildcard characters like "*" are not allowed at all?

Not special. Will reword to:

| Wildcards are not supported and wildcard characters are interpreted
| literally.

> 
> > ``AUX <filename> <size> <checksums>...``
> >   Equivalent to the ``DATA`` type, except that the filename is relative
> >   to ``files/`` subdirectory.
> 
> s/to/to the/
> 
> > 3. Process all ``MANIFEST`` entries, recursively. Verify the Manifest
> >    files according to `file verification`_ section, and include their
> 
> s/according to/according to the/
> 
> > 6. Verify the entries in *covered* set for incompatible duplicates
> 
> s/in *covered* set/in the *covered* set/
> 
> > 7. Verify all the files in the union of the *present* and *covered*
> >    sets, according to `file verification`_ section.
> 
> s/to/to the/
> 
> >    a. If a ``IGNORE`` entry in the ``Manifest`` file covers
> >       the *original* directory (or one of the parent directories), stop.
> 
> s/a ``IGNORE`` entry/an ``IGNORE`` entry/

All done.

> 
> > An example top-level Manifest file for the Gentoo repository would have
> > the following content::
> >     TIMESTAMP 2017-10-30T10:11:12Z
> >     IGNORE distfiles
> >     IGNORE local
> >     IGNORE lost+found
> >     IGNORE packages
> >     MANIFEST app-accessibility/Manifest 14821 SHA256 1b5f.. SHA512 f7eb..
> >     ...
> >     MANIFEST eclass/Manifest.gz 50812 SHA256 8c55.. SHA512 2915..
> >     ...
> > An example modern Manifest (disregarding backwards compatibility)
> > for a package directory would have the following content::
> >     DATA SphinxTrain-0.9.1-r1.ebuild 932 SHA256 3d3b.. SHA512 be4d..
> >     DATA SphinxTrain-1.0.8.ebuild 912 SHA256 f681.. SHA512 0749..
> >     DATA metadata.xml 664 SHA256 97c6.. SHA512 1175..
> >     DATA files/gcc.patch 816 SHA256 b56e.. SHA512 2468..
> >     DATA files/gcc34.patch 333 SHA256 c107.. SHA512 9919..
> >     DIST SphinxTrain-0.9.1-beta.tar.gz 469617 SHA256 c1a4.. SHA512 1b33..
> >     DIST sphinxtrain-1.0.8.tar.gz 8925803 SHA256 548e.. SHA512 465d..
> 
> Update hashes to BLAKE2B SHA512?

I don't really want to jump through the hoops of updating the first two
bytes of each hash, and I don't think it'd be nice to replace the key
while keeping an incorrect value. Especially since it would not serve
any purpose.

> 
> > This specification aims to avoid arbitrary restrictions. For this
> > reason, the filename characters are only restricted by excluding two
> 
> s/the filename characters/filename characters/
> 
> > technically problematic groups:
> > 1. The NULL character (``U+0000``) is normally used to indicate the end
> >    of a null-terminated string. Its use could therefore break programs
> >    written using C. Furthermore, it is not allowed in any known
> >    filesystem.
> > 2. The whitespace characters are used to separate Manifest fields. While
> 
> s/The whitespace characters/Whitespace characters/
> 
> > 2. being able to run update automatically generated files locally
> >    without causing unnecessary verification failures.
> 	
> Strike the word "run"?
> 
> > Strictly speaking, this information is already provided by the various
> > ``metadata/timestamp*`` files that are already present. However,
> 
> Twice "already" in this sentence.
> 
> > The OpenPGP cleartext signature covers the contents of the Manifest,
> > and is therefore compressed along with them. The possibility of using
> > detached signature has been considered but it was rejected as
> 
> s/detached signature/a detached signature/
> 
> > The existence of additional entries for uncompressed Manifest checksums
> > was debated. However, plain entries for the uncompressed file would
> > be confusing if only the compressed file existed, and conflicting
> > if both uncompressed and compressed variants existed. Furthermore,
> > it has been pointed out that ``DIST`` entries do not have uncompressed
> > variant either.
> 
> s/uncompressed variant/an uncompressed variant/
> 
> > .. [#DIST] According to Robin H. Johnson, 8.4% of all DIST entries
> >    at the time of writing are duplicate, representing a 2 MiB
> >    out of 25 MiB of DIST entries altogether.
> 
> s/a 2 MiB/2 MiB/
> 
> > Copyright
> > =========
> 
> There should be two blank lines before this section heading (as
> required by GLEP 2).
> 
> Ulrich

All fixed.

-- 
Best regards,
Michał Górny




* Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update [v3]
  2017-11-16 10:19 [gentoo-dev] [RFC] GLEP 74 post-Council review update Michał Górny
  2017-11-17 20:37 ` Daniel Campbell
  2017-11-20 18:42 ` [gentoo-dev] [RFC] GLEP 74 post-Council review update [v2] Michał Górny
@ 2017-11-21 17:26 ` Michał Górny
  2017-11-21 18:20   ` Ulrich Mueller
  2017-11-22 16:54 ` [gentoo-dev] [RFC] GLEP 74 post-Council review update [v4] Michał Górny
  2017-11-23 20:53 ` [gentoo-dev] [RFC] GLEP 74 post-Council review update [v5] Michał Górny
  4 siblings, 1 reply; 23+ messages in thread
From: Michał Górny @ 2017-11-21 17:26 UTC
  To: gentoo-dev

On Thu, 16.11.2017 at 11:19 +0100, Michał Górny wrote:
> Hi, everyone.
> 
> Here's the updated version of GLEP 74 taking into consideration
> the points made during the Council pre-review.
> 
> ReST: https://dev.gentoo.org/~mgorny/tmp/glep-0074.rst
> HTML: https://dev.gentoo.org/~mgorny/tmp/glep-0074.html
> 
> Changes:
> 

5ba0654 glep-0074: Specify slash as path separator, disallow backwards slash
d3b65ba glep-0074: Mention that newline needs to be restricted too in rationale
54cc3ef glep-0074: Apply suggestions from Ulrich Müller


---
GLEP: 74
Title: Full-tree verification using Manifest files
Author: Michał Górny <mgorny@gentoo.org>,
        Robin Hugh Johnson <robbat2@gentoo.org>,
        Ulrich Müller <ulm@gentoo.org>
Type: Standards Track
Status: Draft
Version: 1
Created: 2017-10-21
Last-Modified: 2017-11-16
Post-History: 2017-10-26, 2017-11-16
Content-Type: text/x-rst
Requires: 59, 61
Replaces: 44, 58, 60
---

Abstract
========

This GLEP extends the Manifest file format to cover full-tree file
integrity and authenticity checks. The format aims to be future-proof,
efficient and provide means of backwards compatibility.


Motivation
==========

The Manifest files as defined by GLEP 44 [#GLEP44]_ provide the current
means of verifying the integrity of distfiles and package files
in Gentoo. Combined with OpenPGP signatures, they provide means to
ensure the authenticity of the covered files. However, as noted
in GLEP 57 [#GLEP57]_ they lack the ability to provide full-tree
authenticity verification as they do not cover any files outside
the package directory. In particular, they provide multiple ways
for a third party to inject malicious code into the ebuild environment.

Historically, the topic of providing authenticity coverage for the whole
repository has been mentioned multiple times. The most noteworthy efforts
are GLEPs 58 [#GLEP58]_ and 60 [#GLEP60]_ by Robin H. Johnson from 2008.
They were accepted by the Council in 2010 but have never been
implemented. When potential implementation work started in 2017, a new
discussion about the specification arose. It prompted the creation
of a competing GLEP that would provide a redesigned alternative to
the old GLEPs.

This specification is designed with the following goals in mind:

1. It should provide means to ensure the authenticity of the complete
   repository, including preventing the injection of additional files.

2. The format should be universal enough to work both for the Gentoo
   repository and third-party repositories of different characteristics.

3. The Manifest files should be verifiable stand-alone, that is without
   knowing any details about the underlying repository format.


Specification
=============

Manifest file format
--------------------

This specification reuses and extends the Manifest file format defined
in GLEP 44 [#GLEP44]_. For the purpose of it, the *file type* field is
repurposed as a generic *tag* that could also indicate additional
(non-checksum) metadata. Appropriately, those tags can be followed by
other space-separated values.

Unless specified otherwise, the paths used in the Manifest files
are relative to the directory containing the Manifest file. The paths
must not reference the parent directory (``..``).

The Manifest files use UTF-8 encoding.


Manifest file locations and nesting
-----------------------------------

The ``Manifest`` file located in the root directory of the repository
is called the top-level Manifest, and it is used to perform the full-tree
verification. In order to verify the authenticity, it must be signed
using OpenPGP, using the armored cleartext format.

The top-level Manifest may reference sub-Manifests contained
in subdirectories of the repository. The sub-Manifests are traditionally
named ``Manifest``; however, the implementation must support arbitrary
names, including the possibility of multiple (split) Manifests
for a single directory. The sub-Manifest can only cover the files inside
the directory tree where it resides.

The sub-Manifest can also be signed using the OpenPGP armored cleartext
format. However, the signature verification can be omitted since
the sub-Manifest is already covered by the signed top-level Manifest.


Directory tree coverage
-----------------------

The specification provides three ways of skipping Manifest verification
of specific files and directories (recursively):

1. explicit ``IGNORE`` entries in Manifest files,

2. injected ignore paths via package manager configuration,

3. using names starting with a dot (``.``) which are always skipped.

All files that are not ignored must be covered by at least one
of the Manifests.

A single file may be matched by multiple identical or equivalent
Manifest entries, if and only if the entries have the same semantics,
specify the same size and the checksums common to both entries match.
It is an error for a single file to be matched by multiple entries
of different semantics, file size or checksum values. It is an error
to specify another entry for a file matching ``IGNORE``, or one of its
subdirectories.
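
(Informational.) The duplicate rule can be sketched in Python as follows.
This is only an illustration under stated assumptions: the ``Entry`` type
and the helper names are hypothetical, paths are assumed to be already
resolved to the same file, and "same semantics" is approximated by
treating the deprecated ``EBUILD``, ``MISC`` and ``AUX`` tags as ``DATA``.

```python
from typing import Dict, NamedTuple


class Entry(NamedTuple):
    """Hypothetical in-memory form of one checksum-carrying Manifest entry."""
    tag: str
    path: str
    size: int
    checksums: Dict[str, str]


# Assumption: these tags all verify a regular file the same way.
DATA_LIKE = {"DATA", "EBUILD", "MISC", "AUX"}


def semantics(tag: str) -> str:
    """Normalize deprecated tags so they compare equal to DATA."""
    return "DATA" if tag in DATA_LIKE else tag


def entries_compatible(a: Entry, b: Entry) -> bool:
    """True if two entries for the same file may coexist."""
    if semantics(a.tag) != semantics(b.tag) or a.size != b.size:
        return False
    # Only the checksum algorithms present in both entries are compared.
    common = a.checksums.keys() & b.checksums.keys()
    return all(a.checksums[name] == b.checksums[name] for name in common)
```

Note that two entries sharing no checksum algorithm at all pass this
check as long as tag semantics and size agree, which follows from
the wording "checksums common to both entries".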

The file entries (except for ``IGNORE``) can be specified for regular
files only. Symbolic links are followed when opening files
and traversing directories. It is an error to specify an entry for
a different file type. If the tree contains files of other types
that are not otherwise ignored, they need to be covered by an explicit
``IGNORE``.

All the local (non-``DIST``) files covered by a Manifest tree must
reside on the same filesystem. It is an error to specify entries
applying to files on another filesystem. If files or directories that
are not otherwise ignored reside on a different filesystem, or symbolic
links point to targets on a different filesystem, they must
be explicitly excluded via ``IGNORE``.

All paths specified in the Manifest file must consist of characters
corresponding to valid UTF-8 code points excluding the NULL character
(``U+0000``), the backwards slash (``\``) and characters classified
as whitespace in the current version of the Unicode standard
[#UNICODE]_. It is an error to use Manifest files in directories
containing files whose names contain the disallowed characters.
The forward slash (``/``) must be used as the path separator.
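
(Informational.) A minimal Python sketch of such a path check follows.
The function name is hypothetical, and ``str.isspace()`` is used only as
an approximation of the Unicode whitespace class (it also rejects a few
control characters), so this is an illustration rather than a normative
definition.

```python
def is_valid_manifest_path(path: str) -> bool:
    """Return True if `path` passes the character rules described above."""
    if not path:
        return False
    try:
        # Lone surrogates cannot be encoded, so they are not valid UTF-8.
        path.encode("utf-8")
    except UnicodeEncodeError:
        return False
    for ch in path:
        # NUL and the backwards slash are rejected outright.
        if ch == "\0" or ch == "\\":
            return False
        # Approximation of "classified as whitespace by Unicode".
        if ch.isspace():
            return False
    # Paths must not reference the parent directory.
    if ".." in path.split("/"):
        return False
    return True
```

For example, ``files/gcc.patch`` passes, while ``foo bar``, ``a\b`` and
``../x`` are all rejected.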


File verification
-----------------

When verifying a file against the Manifest, the following rules are
used:

1. If the file is covered directly or indirectly by an entry
   of the ``IGNORE`` type, the verification always succeeds.

2. If the file is covered by an entry of the ``MANIFEST``, ``DATA``,
   ``MISC``, ``EBUILD`` or ``AUX`` type:

   a. if the file is not present, then the verification fails,

   b. if the file is present but has a different size or one
      of the checksums does not match, the verification fails,

   c. otherwise, the verification succeeds.

3. If the file is present but not listed in Manifest, the verification
   fails.

Unless specified otherwise, the package manager must not allow using
any files for which the verification failed. The package manager may
reject any package or even the whole repository if it may refer to files
for which the verification failed.
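
(Informational.) The three rules above could be sketched in Python as
shown below. The function name and the ``(size, checksums)`` tuple layout
are assumptions made for this example. Algorithm names are mapped to
``hashlib`` by lowercasing them, which matches the naming recommendation
in `Checksum algorithms`_ but is not guaranteed for every reserved name.

```python
import hashlib
import os
from typing import Dict, Optional, Tuple


def verify_file(path: str,
                entry: Optional[Tuple[int, Dict[str, str]]],
                ignored: bool = False) -> bool:
    """Apply the verification rules to one file.

    `entry` is (size, {ALGORITHM: hexdigest}) or None if the file is
    not listed in any Manifest.
    """
    if ignored:                      # rule 1: IGNORE always passes
        return True
    present = os.path.isfile(path)
    if entry is None:                # rule 3: present but unlisted fails
        return not present
    if not present:                  # rule 2a
        return False
    size, checksums = entry
    if os.path.getsize(path) != size:    # rule 2b (size)
        return False
    # Feed all requested hashes in a single pass over the file.
    hashers = {name: hashlib.new(name.lower()) for name in checksums}
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    # rule 2b (checksums) and 2c
    return all(hashers[name].hexdigest() == expected.lower()
               for name, expected in checksums.items())
```

The size comparison comes first because it is cheap and lets the check
fail without reading the file.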


Timestamp verification
----------------------

The top-level Manifest file can contain a ``TIMESTAMP`` entry to account
for attacks against tree update distribution. If such an entry
is present, it should be updated every time at least one
of the Manifests changes. Every unique timestamp value must correspond
to a single tree state.

During the verification process, the client should compare the timestamp
against the update time obtained from a local clock or a trusted time
source. If the comparison result indicates that the Manifest at the time
of receiving was already significantly outdated, the client should
either fail the verification or require manual confirmation from
the user.

Furthermore, the Manifest provider may employ additional methods
of distributing the timestamps of recently generated Manifests
using a secure channel from a trusted source for exact comparison.
The exact details of such a solution are outside the scope of this
specification.

``TIMESTAMP`` entries may also be present in sub-Manifests. Those
timestamps must not be newer than the timestamp of the top-level
Manifest (if present). This specification does not define any specific
use for them.
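
(Informational.) The staleness check described above could look roughly
like the following Python sketch. The function names and the ``max_age``
threshold are assumptions; this specification does not define what
"significantly outdated" means. The format string is the one given for
the ``TIMESTAMP`` tag.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_timestamp(value: str) -> datetime:
    """Parse a TIMESTAMP value into an aware UTC datetime."""
    parsed = datetime.strptime(value, TIMESTAMP_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


def is_outdated(value: str, max_age: timedelta,
                now: Optional[datetime] = None) -> bool:
    """True if the Manifest is older than `max_age` relative to `now`."""
    if now is None:
        # Assumption: the local clock is trusted.
        now = datetime.now(timezone.utc)
    return now - parse_timestamp(value) > max_age
```

A client would call ``is_outdated()`` right after fetching the tree and
then either abort or ask the user for confirmation.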


Modern Manifest tags
--------------------

The Manifest files can specify the following tags:

``TIMESTAMP <iso8601>``
  Specifies a timestamp of when the Manifest file was last updated.
  The timestamp must be a valid second-precision ISO 8601 extended
  format combined date and time in UTC timezone, i.e. using
  the following ``strftime()`` format string: ``%Y-%m-%dT%H:%M:%SZ``.
  Optional. The package manager can use it to detect an outdated
  repository checkout as described in `Timestamp verification`_.

``MANIFEST <path> <size> <checksums>...``
  Specifies a sub-Manifest. The sub-Manifest must be verified like
  a regular file. If the verification succeeds, the entries from
  the sub-Manifest are included for verification as described
  in `Manifest file locations and nesting`_.

``IGNORE <path>``
  Ignores a subdirectory or file from Manifest checks. If the specified
  path is present, it and its contents are omitted from the Manifest
  verification (always pass). *Path* must be a plain file or directory
  path without a trailing slash. Wildcards are not supported
  and wildcard characters are interpreted literally.

``DATA <path> <size> <checksums>...``
  Specifies a regular file subject to Manifest verification. The file
  is required to pass verification. Used for all files that do not match
  any other type.

``DIST <filename> <size> <checksums>...``
  Specifies a distfile entry used to verify files fetched as part
  of ``SRC_URI``. The filename must match the filename used to store
  the fetched file as specified in the PMS [#PMS-FETCH]_. The package
  manager must reject the fetched file if it fails verification.
  ``DIST`` entries apply to all packages below the Manifest file
  specifying them.


Deprecated Manifest tags
------------------------

For backwards compatibility, the following tags are additionally
allowed at the package directory level:

``EBUILD <filename> <size> <checksums>...``
  Equivalent to the ``DATA`` type.

``MISC <path> <size> <checksums>...``
  Equivalent to the ``DATA`` type. Historically indicated that
  the package manager may ignore a verification failure if operating
  in non-strict mode. However, that behavior is deprecated.

``AUX <filename> <size> <checksums>...``
  Equivalent to the ``DATA`` type, except that the filename is relative
  to the ``files/`` subdirectory.
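
(Informational.) A line parser for the tags listed above could be
sketched in Python as follows. The ``ManifestEntry`` type and the function
name are hypothetical. The sketch assumes the OpenPGP armor has already
been stripped, and it rejects unknown tags, which is a choice made for
this example and not a requirement of this specification.

```python
from typing import Dict, NamedTuple, Optional

CHECKSUM_TAGS = {"MANIFEST", "DATA", "DIST", "EBUILD", "MISC", "AUX"}


class ManifestEntry(NamedTuple):
    tag: str
    path: Optional[str]
    size: Optional[int]
    checksums: Dict[str, str]
    value: Optional[str] = None  # only used by TIMESTAMP


def parse_manifest_line(line: str) -> Optional[ManifestEntry]:
    """Parse one Manifest line; return None for an empty line."""
    fields = line.split()
    if not fields:
        return None
    tag = fields[0]
    if tag == "TIMESTAMP":
        return ManifestEntry(tag, None, None, {}, fields[1])
    if tag == "IGNORE":
        return ManifestEntry(tag, fields[1], None, {})
    if tag in CHECKSUM_TAGS:
        path, size, rest = fields[1], int(fields[2]), fields[3:]
        if len(rest) % 2:
            raise ValueError("checksum name without a value: %r" % line)
        checksums = dict(zip(rest[0::2], rest[1::2]))
        if tag == "AUX":
            # AUX filenames are relative to the files/ subdirectory.
            path = "files/" + path
        return ManifestEntry(tag, path, size, checksums)
    raise ValueError("unknown Manifest tag: %r" % tag)
```

Splitting on any whitespace is safe here only because whitespace is
disallowed in paths.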


Algorithm for full-tree verification
------------------------------------

In order to perform full-tree verification, the following algorithm
can be used:

1. Collect all files present in the repository into the *present* set.

2. Start at the top-level Manifest file. Verify its OpenPGP signature.
   Optionally verify the ``TIMESTAMP`` entry if present as specified
   in `timestamp verification`_. Remove the top-level Manifest
   from the *present* set.

3. Process all ``MANIFEST`` entries, recursively. Verify the Manifest
   files according to the `file verification`_ section, and include
   their entries in the current Manifest entry list (using paths
   relative to directories containing the Manifests).

4. Process all ``IGNORE`` entries. Remove any paths matching them
   from the *present* set.

5. Collect all files covered by ``DATA``, ``MISC``, ``EBUILD``
   and ``AUX`` entries into the *covered* set.

6. Verify the entries in the *covered* set for incompatible duplicates
   and collisions with ignored files as explained in `Manifest file
   locations and nesting`_.

7. Verify all the files in the union of the *present* and *covered*
   sets, according to the `file verification`_ section.
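
(Informational.) The steps above can be sketched in Python as shown
below. This is a simplified illustration, not a reference
implementation: all names are hypothetical, POSIX path separators are
assumed, OpenPGP and ``TIMESTAMP`` checks, compressed Manifests and
``DIST`` entries are left out, and files are hashed in memory.

```python
import hashlib
import itertools
import os

DATA_TAGS = {"DATA", "MISC", "EBUILD", "AUX"}


def _matches(root, rel, entry):
    """Check size and every listed checksum of one file."""
    size, sums = entry
    full = os.path.join(root, rel)
    if not os.path.isfile(full) or os.path.getsize(full) != size:
        return False
    with open(full, "rb") as f:
        data = f.read()
    return all(hashlib.new(name.lower(), data).hexdigest() == value.lower()
               for name, value in sums.items())


def _load(root, rel_manifest, entries, ignores):
    """Read one Manifest, recursing into verified sub-Manifests (step 3)."""
    base = os.path.dirname(rel_manifest)
    with open(os.path.join(root, rel_manifest), encoding="utf-8") as f:
        lines = f.readlines()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        tag = fields[0]
        if tag == "IGNORE":
            ignores.add(os.path.normpath(os.path.join(base, fields[1])))
        elif tag == "MANIFEST" or tag in DATA_TAGS:
            name = "files/" + fields[1] if tag == "AUX" else fields[1]
            rel = os.path.normpath(os.path.join(base, name))
            entry = (int(fields[2]), dict(zip(fields[3::2], fields[4::2])))
            entries.setdefault(rel, []).append(entry)
            if tag == "MANIFEST" and _matches(root, rel, entry):
                _load(root, rel, entries, ignores)


def verify_tree(root):
    """Return the sorted list of relative paths that fail verification."""
    # Step 1: collect present files; dotfiles are always skipped.
    present = set()
    for dirpath, dirnames, filenames in os.walk(root, followlinks=True):
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        for name in filenames:
            if not name.startswith("."):
                present.add(os.path.relpath(os.path.join(dirpath, name), root))
    # Step 2: the top-level Manifest itself (signature check omitted here).
    present.discard("Manifest")
    entries, ignores = {}, set()
    _load(root, "Manifest", entries, ignores)

    def ignored(path):
        return any(path == i or path.startswith(i + "/") for i in ignores)

    # Step 4: drop ignored paths.
    present = {p for p in present if not ignored(p)}
    # Step 5: covered files (sub-Manifests are kept so they get verified too).
    covered = set(entries)
    failed = set()
    # Step 6: incompatible duplicates and collisions with IGNORE.
    for path, dupes in entries.items():
        if ignored(path):
            failed.add(path)
        for (s1, c1), (s2, c2) in itertools.combinations(dupes, 2):
            common = c1.keys() & c2.keys()
            if s1 != s2 or any(c1[k] != c2[k] for k in common):
                failed.add(path)
    # Step 7: verify everything that is present or covered.
    for path in present | covered:
        if path not in entries or not _matches(root, path, entries[path][0]):
            failed.add(path)
    return sorted(failed)
```

An empty result means the tree verified cleanly; a stray file that no
Manifest lists shows up in the result because it is in the *present*
set but has no entry.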


Algorithm for finding parent Manifests
--------------------------------------

In order to find the top-level Manifest from the current directory
the following algorithm can be used:

1. Store the current directory as *original* and the device ID
   of the containing filesystem (``st_dev``) as *startdev*.

2. If the device ID of the containing filesystem (``st_dev``)
   of the current directory differs from *startdev*, stop.

3. If the current directory contains a ``Manifest`` file:

   a. If an ``IGNORE`` entry in the ``Manifest`` file covers
      the *original* directory (or one of the parent directories), stop.

   b. Otherwise, store the current directory as *last_found*.

4. If the current directory is the root system directory (``/``), stop.

5. Otherwise, enter the parent directory and jump to step 2.

Once the algorithm stops, *last_found* will contain the relevant
top-level Manifest. If *last_found* is null, then the directory tree
does not contain any valid top-level Manifest candidates and one should
be created in the *original* directory.

Once the top-level Manifest is found, its ``MANIFEST`` entries should
be used to find any sub-Manifests below the top-level Manifest,
up to and including the *original* directory. Note that those
sub-Manifests can use different filenames than ``Manifest``.
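
(Informational.) The upward search can be sketched in Python as follows.
The function names are hypothetical, and the ``IGNORE`` lookup is
a simplified scan that reads the candidate Manifest as plain text
without verifying it.

```python
import os
from typing import Optional


def _ignores(manifest_path: str, rel_target: str) -> bool:
    """True if an IGNORE entry covers rel_target or one of its parents."""
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2 and fields[0] == "IGNORE":
                ign = fields[1]
                if rel_target == ign or rel_target.startswith(ign + "/"):
                    return True
    return False


def find_top_level_manifest(start: str) -> Optional[str]:
    """Return the directory holding the top-level Manifest, or None."""
    original = os.path.realpath(start)
    startdev = os.stat(original).st_dev           # step 1
    current = original
    last_found = None
    while True:
        if os.stat(current).st_dev != startdev:   # step 2
            break
        candidate = os.path.join(current, "Manifest")
        if os.path.isfile(candidate):             # step 3
            rel = os.path.relpath(original, current)
            if rel != "." and _ignores(candidate, rel):
                break                             # step 3a
            last_found = current                  # step 3b
        parent = os.path.dirname(current)
        if parent == current:                     # step 4: reached "/"
            break
        current = parent                          # step 5
    return last_found
```

A ``None`` result corresponds to the case where *last_found* is null and
a new top-level Manifest should be created in the *original* directory.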


Checksum algorithms
-------------------

This section is informational only. Specifying the exact set
of supported algorithms is outside the scope of this specification.

The algorithm names reserved at the time of writing are:

- ``MD5`` [#MD5]_,
- ``RMD160`` -- RIPEMD-160 [#RIPEMD160]_,
- ``SHA1`` [#SHS]_,
- ``SHA256`` and ``SHA512`` -- SHA-2 family of hashes [#SHS]_,
- ``WHIRLPOOL`` [#WHIRLPOOL]_,
- ``BLAKE2B`` and ``BLAKE2S`` -- BLAKE2 family of hashes [#BLAKE2]_,
- ``SHA3_256`` and ``SHA3_512`` -- SHA-3 family of hashes [#SHA3]_,
- ``STREEBOG256`` and ``STREEBOG512`` -- Streebog family of hashes
  [#STREEBOG]_.

The method of introducing new hashes is defined by GLEP 59 [#GLEP59]_.
It is recommended that any new hashes are named after the Python
``hashlib`` module algorithm names, transformed into uppercase.


Manifest compression
--------------------

The topic of Manifest file compression is covered by GLEP 61 [#GLEP61]_.
This section merely addresses interoperability issues between Manifest
compression and this specification.

The compressed Manifest files are required to be suffixed for their
compression algorithm. This suffix should be used to recognize
the compression and decompress Manifests transparently. The exact list
of algorithms and their corresponding suffixes is outside the scope
of this specification.
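
(Informational.) Suffix-based transparent decompression could be
sketched in Python as below. The suffix table is purely an assumption
for this example, since the list of algorithms is left to GLEP 61, and
the function name is hypothetical.

```python
import bz2
import gzip
import lzma

# Assumed suffix table; the authoritative list is outside this sketch.
_OPENERS = {
    ".gz": gzip.open,
    ".bz2": bz2.open,
    ".xz": lzma.open,
}


def open_manifest(path: str):
    """Open a Manifest for reading as UTF-8 text, decompressing by suffix."""
    for suffix, opener in _OPENERS.items():
        if path.endswith(suffix):
            return opener(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```

A caller would verify the size and checksums of the compressed file
first and only then pass it to ``open_manifest()``.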

The top-level Manifest file must not be compressed. Since the OpenPGP
signature covers the uncompressed text and is compressed itself,
the data would have to be decompressed without any prior verification.
This could expose users e.g. to zip bombs or exploits on decompressor
vulnerabilities.

Whenever this specification refers to sub-Manifests, they can use any
names but are also required to use a specific compression suffix.
The ``MANIFEST`` entries are required to specify the full name including
the compression suffix, and the verification is performed on the compressed
file.

The specification permits uncompressed Manifests to exist alongside
their compressed counterparts, and multiple compressed formats
to coexist. If that is the case, the files must have the same
uncompressed content and the implementation is free to choose either
of the files using the same base name.


Combining multiple Manifest trees (informational)
-------------------------------------------------

This specification permits nesting multiple hierarchical Manifest trees.
In this layout, the specific directories of the Manifest tree can
be verified both as a part of another top-level Manifest,
and as an independent Manifest tree (when obtained without the parent
directory).

For this to work, the sub-Manifest file in the directory must also
satisfy the requirements for the top-level Manifest file. That is:

- it must be named ``Manifest`` and not compressed,

- it must cover all the files in this directory and its subdirectories
  (i.e. no files from the directory tree can be covered by parent
  Manifest),

- if authenticity verification is desired, it must be OpenPGP-signed.

It should be noted that if such a directory is a subdirectory of a valid
Manifest tree, the sub-Manifest needs to be valid according
to the top-level Manifest and the OpenPGP signature is disregarded
as detailed in `Manifest file locations and nesting`_. The top-level
behavior is exhibited only when the directory is obtained without parent
directories.


An example Manifest file (informational)
----------------------------------------

An example top-level Manifest file for the Gentoo repository would have
the following content::

    TIMESTAMP 2017-10-30T10:11:12Z
    IGNORE distfiles
    IGNORE local
    IGNORE lost+found
    IGNORE packages
    MANIFEST app-accessibility/Manifest 14821 SHA256 1b5f.. SHA512 f7eb..
    ...
    MANIFEST eclass/Manifest.gz 50812 SHA256 8c55.. SHA512 2915..
    ...

An example modern Manifest (disregarding backwards compatibility)
for a package directory would have the following content::

    DATA SphinxTrain-0.9.1-r1.ebuild 932 SHA256 3d3b.. SHA512 be4d..
    DATA SphinxTrain-1.0.8.ebuild 912 SHA256 f681.. SHA512 0749..
    DATA metadata.xml 664 SHA256 97c6.. SHA512 1175..
    DATA files/gcc.patch 816 SHA256 b56e.. SHA512 2468..
    DATA files/gcc34.patch 333 SHA256 c107.. SHA512 9919..
    DIST SphinxTrain-0.9.1-beta.tar.gz 469617 SHA256 c1a4.. SHA512 1b33..
    DIST sphinxtrain-1.0.8.tar.gz 8925803 SHA256 548e.. SHA512 465d..


Rationale
=========

Stand-alone format
------------------

The first question that needed to be asked before proceeding with
the design was whether the Manifest file format was supposed to be
stand-alone, or tightly bound to the repository format.

The stand-alone format has been selected because of its three
advantages:

1. It is more future-proof. If an incompatible change to the repository
   format is introduced, only developers need to upgrade the tools
   they use to generate the Manifests. The tools used to verify
   the updated Manifests will continue to work.

2. It is more flexible and universal. With a dedicated tool,
   the Manifest files can be used to sign and verify arbitrary file
   sets.

3. It keeps the verification tool simpler. In particular, we can easily
   write an independent verification tool that could work on any
   distribution without needing to depend on a package manager
   implementation or rewrite parts of it.

Designing a stand-alone format requires that the Manifest carries enough
information to perform the verification following all the rules specific
to the Gentoo repository.


Tree design
-----------

The second important point of the design was determining whether
the Manifest files should be structured hierarchically, or independent.
Both options have their advantages.

In the hierarchical model, each sub-Manifest file is covered by a higher
level Manifest. As a result, only the top-level Manifest has to be
OpenPGP-signed, and subsequent Manifests need only be verified against
the checksum stored in the parent Manifest. This has the following
implications:

- Verifying any set of files in the repository requires using checksums
  from the most relevant Manifests and the parent Manifests.

- The OpenPGP signature of the top-level Manifest needs to be verified
  only once per process.

- Altering any set of files requires updating the relevant Manifests,
  and their parent Manifests up to the top-level Manifest, and signing
  the last one.

- As a result, the top-level Manifest changes on every commit,
  and various middle-level Manifests change (and need to be transferred)
  frequently.

In the independent model, each sub-Manifest file is independent
of the parent Manifests. As a result, each of them needs to be signed
and verified independently. However, the parent Manifests still need
to list sub-Manifests (albeit without verification data) in order
to detect removal or replacement of subdirectories. This has
the following implications:

- Verifying any set of files in the repository requires using checksums
  and verifying signatures of the most relevant Manifest files.

- Altering any set of files requires updating the relevant Manifests
  and signing them again.

- Parent Manifests are updated only when Manifests are added or removed
  from subdirectories. As a result, they change infrequently.

While both models have their advantages, the hierarchical model was
selected because it reduces the number of OpenPGP operations
(which are comparatively costly) to the minimum.


Tree layout restrictions
------------------------

The algorithm is meant to work primarily with ebuild repositories which
normally contain only files and directories. Directories provide
no useful metadata for verification, and specifying special entries
for additional file types is purposeless. Therefore, the specification
is restricted to dealing with regular files.

The Gentoo repository does not use symbolic links. Some Gentoo
repositories do, however. To provide a simple solution for dealing with
symlinks without having to take care to implement special handling for
them, the common behavior of implicitly resolving them is used.
Therefore, symbolic links to files are stored as if they were regular
files, and symbolic links to directories are followed as if they were
regular directories.

Dotfiles are implicitly ignored as that is a common notion used
in software written for POSIX systems. All other filenames require
explicit ``IGNORE`` lines.

An ability to inject additional ignore entries is provided to account
for site configuration affecting the repository tree -- placing
additional files in it, skipping some of the categories from syncing.
This configuration can extend beyond the limits of this GLEP,
e.g. by allowing wildcards or regular expressions.

The algorithm is restricted to work on a single filesystem. This is
mostly relevant when scanning for top-level Manifest -- we do not want
to cross filesystem boundaries then. However, to ensure consistent
bidirectional behavior we need to also ban them when operating downwards
the tree.

The directories and files on different filesystems need to be ignored
explicitly as implicitly skipping them would cause confusion.
In particular, tools might then claim that a file does not exist when
it clearly does because it was skipped due to filesystem boundaries.
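
The layout rules above can be sketched as a directory walk. This is
an illustration under stated assumptions, not the algorithm of
the reference implementation: the function name ``iter_repo_files`` is
hypothetical, ignore entries are modeled as a plain set of relative
paths, POSIX path separators are assumed, and symlink loops are not
handled.

```python
import os


def iter_repo_files(root: str, ignored: frozenset = frozenset()):
    """Yield relative paths of the files that would need verification."""
    root_dev = os.stat(root).st_dev
    # followlinks=True resolves directory symlinks implicitly
    for dirpath, dirnames, filenames in os.walk(root, followlinks=True):
        rel_dir = os.path.relpath(dirpath, root)
        kept = []
        for d in sorted(dirnames):
            rel = os.path.normpath(os.path.join(rel_dir, d))
            if d.startswith(".") or rel in ignored:
                continue  # dotfiles and explicitly ignored paths
            if os.stat(os.path.join(dirpath, d)).st_dev != root_dev:
                # never skipped silently: it must be ignored explicitly
                raise RuntimeError(rel + " is on a different filesystem")
            kept.append(d)
        dirnames[:] = kept  # prune in place so os.walk does not descend
        for f in sorted(filenames):
            rel = os.path.normpath(os.path.join(rel_dir, f))
            if f.startswith(".") or rel in ignored:
                continue
            yield rel
```

Raising an error at a filesystem boundary, instead of skipping
the directory, mirrors the reasoning that implicit skipping would
confuse users.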


Filename character set restriction
----------------------------------

The valid set of filename characters for the Gentoo repository
is restricted by the devmanual 'File Naming Rules' section
[#FILE-NAMING-RULES]_, and enforced via a git hook. The valid distfile
names are not restricted explicitly -- however, the PMS dependency
specification syntax [#PMS-FETCH]_ implicitly makes it impossible to use
filenames containing whitespace.

This specification aims to avoid arbitrary restrictions. For this
reason, filename characters are only restricted by excluding three
technically problematic groups:

1. The NULL character (``U+0000``) is normally used to indicate the end
   of a null-terminated string. Its use could therefore break programs
   written using C. Furthermore, it is not allowed in any known
   filesystem.

2. The backwards slash character (``\``) is frequently used as an escape
   character, in particular in the languages derived from C and in shell
   script. Furthermore, it is used as path separator on Windows systems.
   It is forbidden to avoid implementation mistakes (in particular,
   attempting to use it to escape whitespace or as path separator
   on Windows) but also reserved for possible future extension.

3. Whitespace characters are used to separate Manifest fields
   and entries. While technically it would be enough to restrict
   the space character (``U+0020``) that is normally used as the field
   separator and the newline character (``U+000A``) that is used
   to separate lines, all whitespace characters are forbidden to avoid
   confusion and implementation errors.

While the specification could be extended to allow such filenames
by using some form of escaping, there is currently no apparent need
for such a feature.

Historically, Portage tried to overcome the whitespace limitation
by locating the size field and taking everything before it
as the filename. This was terribly fragile and, even if it worked,
it would solve the problem only partially.

Since the same restrictions apply to ``IGNORE`` rules, it is currently
not possible to either list or ignore a file whose name contains
whitespace characters. Therefore, the presence of such files is
forbidden entirely.
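
A check for the three excluded groups could look roughly like
the following sketch. The function name ``invalid_path_chars`` is
hypothetical. Note that Python's ``str.isspace()`` is used here only as
an approximation of the Unicode whitespace classification; the two are
not guaranteed to agree for every code point.

```python
def invalid_path_chars(path: str) -> list:
    """Return the disallowed characters found in a Manifest path."""
    return [
        ch for ch in path
        # NULL, backslash, and (approximately) Unicode whitespace
        if ch == "\0" or ch == "\\" or ch.isspace()
    ]
```

A tool could refuse to create or update a Manifest whenever this
returns a non-empty list for any filename in the directory.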


File verification model
-----------------------

The verification model aims to provide full coverage against different
forms of attack. In particular, three different kinds of manipulation
are considered:

1. Alteration of the file content.

2. Removal of a file.

3. Addition of a new file.

In order to protect against all three, the system requires that all
files in the repository are listed in Manifests and verified against
them.

As a special case, ignores are allowed to account for directories
that are not part of the repository but were traditionally placed inside
it. Those directories were ``distfiles``, ``local`` and ``packages``.
Ignores could also be used for VCS directories such as ``CVS``.
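
The three kinds of manipulation map onto a simple comparison between
the Manifest contents and the files found on disk. The sketch below
assumes both sides have already been reduced to ``path -> checksum``
dictionaries; the function name ``compare_tree`` and this data model
are illustrative only.

```python
def compare_tree(expected: dict, actual: dict,
                 ignored: frozenset = frozenset()):
    """Return (altered, removed, added) path lists.

    `expected` comes from the Manifests, `actual` from the filesystem.
    """
    def is_ignored(path: str) -> bool:
        # an ignore entry matches the path itself or anything inside it
        return any(path == i or path.startswith(i + "/") for i in ignored)

    altered = sorted(p for p in expected
                     if p in actual and expected[p] != actual[p])
    removed = sorted(p for p in expected if p not in actual)
    added = sorted(p for p in actual
                   if p not in expected and not is_ignored(p))
    return altered, removed, added
```

Verification succeeds only if all three lists are empty.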


Non-strict Manifest verification
--------------------------------

Originally the Manifest2 format provided a special ``MISC`` tag that
was used for ``metadata.xml`` and ``ChangeLog`` files. This tag
indicated that the Manifest verification failures could be ignored for
those files unless the package manager was working in strict mode.

The first versions of this specification continued the use of this tag.
However, after a long debate it was decided to deprecate it along with
the non-strict behavior, and require all files to strictly match.

Two arguments were mentioned for the usefulness of a ``MISC`` type:

1. being able to reduce the checkout size by stripping unnecessary
   files out, and

2. being able to update automatically generated files locally
   without causing unnecessary verification failures.

However, the usefulness of ``MISC`` in both cases is doubtful.

The cases for stripping unnecessary files mostly focused around space
savings. For this purpose, stripping ``metadata.xml`` and similar files
has little value. It is much more common for users to strip whole
packages or categories. The ``MISC`` type is not suitable for that,
and so a dedicated package manager mechanism needs to be developed
instead. The same mechanism can also handle files that historically used
the ``MISC`` type. As an example, the package manager may choose
to generate both the rsync exclusion list and Manifest ignore list
using a single source list.

The cases for autogenerated files involve such cache files
as ``use.local.desc``. However, we cannot include ``md5-cache`` there
due to security concerns, which results in inconsistent cache handling.
Furthermore, the tools were historically modified to provide stable
output, which means that their content cannot change without
non-``MISC`` content being changed first. This practically defeats
the purpose of using ``MISC``.

Finally, the non-strict mode could be used as a means of attack.
Allowing a documentation file to be missing or modified could be used
to spread misinformation, resulting in bad decisions made by the user.
A modified file could also be used e.g. to exploit vulnerabilities
of an XML parser.


Timestamp field
---------------

The top-level Manifest optionally allows using a ``TIMESTAMP`` tag
to include a generation timestamp in the Manifest. A similar feature
was originally proposed in GLEP 58 [#GLEP58]_.

A malicious third party may use the principles of exclusion or replay
[#C08]_ to deny an update to clients, while at the same time recording
the identity of clients to attack. The timestamp field can be used to
detect such attacks.

In order to provide more complete protection, the Gentoo Infrastructure
should provide an ability to obtain the timestamps of all Manifests
from a recent timeframe over a secure channel from a trusted source
for comparison.

Strictly speaking, this information is provided by the various
``metadata/timestamp*`` files that are already present. However,
including the value in the Manifest itself has little cost
and provides the ability to perform the verification stand-alone.

Furthermore, some of the timestamp files are added very late
in the distribution process, past the Manifest generation phase. Those
files will most likely receive ``IGNORE`` entries and therefore
be unsafe to use.

The specification permits additional timestamps in sub-Manifest files
for local use. A generic testing tool should ignore them.
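
As a rough illustration of how a client might use the timestamp,
the following sketch parses a ``TIMESTAMP`` entry and flags replayed or
stale Manifests. The entry syntax (ISO 8601 UTC,
``%Y-%m-%dT%H:%M:%SZ``) is assumed here rather than taken from this
section, the 7-day staleness limit is an arbitrary example value,
and both function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_timestamp(line: str) -> datetime:
    """Parse an assumed 'TIMESTAMP <ISO 8601 UTC>' Manifest entry."""
    tag, value = line.split()
    if tag != "TIMESTAMP":
        raise ValueError("not a TIMESTAMP entry")
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def check_freshness(new, previous, now, max_age=timedelta(days=7)):
    """Return a list of detected problems (empty if none)."""
    problems = []
    if previous is not None and new < previous:
        problems.append("replay: Manifest is older than the last one seen")
    if now - new > max_age:  # max_age is an illustrative threshold
        problems.append("stale: Manifest has not been updated recently")
    return problems
```

The ``previous`` value could come from the last successful sync or from
a trusted source, as suggested above.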


New vs deprecated tags
----------------------

Out of the four types defined by Manifest2, only one is reused
and the remaining three are replaced by a single, universal ``DATA``
type.

The ``DIST`` tag is reused since the specification does not change
anything with regard to distfile handling.

The ``EBUILD`` tag could potentially be reused for generic file
verification data. However, it would be confusing if all the different
data files were marked as ``EBUILD``. Therefore, an equivalent ``DATA``
type was introduced as a replacement.

The ``MISC`` tag and the relevant non-strict mode have been removed
as being of little value, as detailed in the `Non-strict Manifest
verification`_ section.

The ``AUX`` tag is deprecated as it is redundant with ``DATA``, and has
the limiting property of an implicit ``files/`` path prefix.


Finding top-level Manifest
--------------------------

The development of a reference implementation for this GLEP has brought
the following problem: how to find all the relevant Manifests when
the Manifest tool is run inside a subdirectory of the repository?

One of the options would be to provide a bi-directional linking
of Manifests via a ``PARENT`` tag. However, that would not solve
the problem when a new Manifest file is being created.

Instead, an algorithm for iterating over parent directories is proposed.
Since there is no obligatory explicit indicator for the top-level
Manifest, the algorithm assumes that the top-level Manifest
is the highest ``Manifest`` in the directory hierarchy that can cover
the current directory. This generally makes sense since the Manifest
files are required to provide coverage for all subdirectories, so all
Manifests starting from that one need to be updated.

If independent Manifest trees are nested in the directory structure,
then an ``IGNORE`` entry needs to be used to separate them.

Since sub-Manifests can use any filenames, the Manifest finding
algorithm must not short-cut the procedure by storing all ``Manifest``
files along the parent directories. Instead, it needs to retrace
the relevant sub-Manifest files along ``MANIFEST`` entries
in the top-level Manifest.
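
The parent-directory iteration can be sketched as follows. This is
a simplified illustration, not the reference algorithm: the function
name ``find_top_level_manifest`` is hypothetical, and the check whether
a candidate Manifest actually covers the starting directory (i.e. that
no ``IGNORE`` entry separates the trees) is omitted.

```python
import os


def find_top_level_manifest(start: str):
    """Return the highest `Manifest` at or above `start`, or None."""
    current = os.path.abspath(start)
    start_dev = os.stat(current).st_dev
    found = None
    while True:
        if os.stat(current).st_dev != start_dev:
            break  # do not cross filesystem boundaries
        candidate = os.path.join(current, "Manifest")
        if os.path.isfile(candidate):
            found = candidate  # keep going: a higher one may exist
        parent = os.path.dirname(current)
        if parent == current:
            break  # reached the filesystem root
        current = parent
    return found
```

Once the top-level Manifest is known, the relevant sub-Manifests would
be located by following ``MANIFEST`` entries downwards rather than by
reusing the ``Manifest`` files seen on the way up.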


Injecting ChangeLogs into the checkout
--------------------------------------

One of the problems considered in the new Manifest format was injecting
historical and autogenerated ChangeLogs into the repository. We normally
don't include those files, to reduce the checkout size. However, some
users have shown interest in them and Infra is working on providing them
via an additional rsync module.

If such files were injected into the repository, they would cause
verification failures of Manifests. To account for this, Infra could
provide ``IGNORE`` entries to allow them to exist.


Splitting distfile checksums from file checksums
------------------------------------------------

Another problem with the current Manifest format is that the checksums
for fetched files are combined with checksums for local files
in a single file inside the package directory. It has been specifically
pointed out that:

- since distfiles are sometimes reused across different packages,
  the repeating checksums are redundant [#DIST]_.

- mirror admins were interested in the possibility of verifying all
  the distfiles with a single tool.

This specification does not provide a clean solution to this problem.
It technically permits moving ``DIST`` entries to higher-level Manifests
but the usefulness of such a solution is doubtful.

However, for the second problem we will probably deliver a dedicated
tool working with this Manifest format.


Hash algorithms
---------------

While maintaining a consistent supported hash set is important
for interoperability, it is not a good fit for the generic layout
of this GLEP. Furthermore, it would require updating the GLEP
in the future every time the used algorithms change.

Instead, the specification focuses on listing the currently used
algorithm names for interoperability, and sets a recommendation
for consistent naming of algorithms in the future. The Python
``hashlib`` module is used as a reference since it is used
as the provider of hash functions for most of the Python software,
including Portage and PkgCore.

The basic rules for changing hash algorithms are defined in GLEP 59
[#GLEP59]_. The implementations can focus only on those algorithms
that are actually used or planned on being used. It may be feasible
to devise a new GLEP that specifies the currently used hashes (or update
GLEP 59 accordingly).
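
The naming recommendation can be illustrated with ``hashlib`` itself.
The sketch below assumes that the Manifest name is simply
the upper-cased ``hashlib`` name, which holds for names such as
``SHA256``, ``SHA512``, ``BLAKE2B`` or ``SHA3_256`` but is not
guaranteed for every historical name. The function name and the default
hash set are illustrative choices, not a statement of policy.

```python
import hashlib


def manifest_hashes(data: bytes, names=("SHA512", "BLAKE2B")) -> dict:
    """Compute hashes keyed by Manifest name.

    The hashlib algorithm name is derived by lower-casing the Manifest
    name (an assumption that only fits hashlib-style names).
    """
    return {
        name: hashlib.new(name.lower(), data).hexdigest()
        for name in names
    }
```

With consistent naming, supporting a new algorithm in such a tool would
require no mapping table at all.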


Manifest compression
--------------------

The support for Manifest compression is introduced with minimal changes
to the file format. The ``MANIFEST`` entries are required to provide
the real (compressed) file path for compatibility with other file
entries and to avoid confusion.

The compression of top-level Manifest file has been prohibited
as the specification currently does not provide any means of verifying
the file prior to decompression. If the top-level Manifest is
compressed, tooling will have to unpack the file before being able
to verify the contents. This makes it possible for a malicious third
party to attack the system by providing a compressed Manifest that
exposes decompressor vulnerabilities, or a zip bomb.

The OpenPGP cleartext signature covers the contents of the Manifest,
and is therefore compressed along with them. The possibility of using
a detached signature has been considered but it was rejected as
unnecessary complexity for minor gain.

Technically, a similar result could be effected via moving all the data
into a compressed sub-Manifest in the top directory (e.g.
``Manifest.sub.gz``), and including a ``MANIFEST`` entry for this file
in a signed, uncompressed top-level Manifest.

The existence of additional entries for uncompressed Manifest checksums
was debated. However, plain entries for the uncompressed file would
be confusing if only the compressed file existed, and conflicting
if both uncompressed and compressed variants existed. Furthermore,
it has been pointed out that ``DIST`` entries do not have
an uncompressed variant either.
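
For sub-Manifests, listing the real (compressed) path means that
the compressed bytes can be checked against the parent entry before
any decompressor touches them. A minimal sketch, assuming gzip
compression, a single SHA512 checksum and a hypothetical function name:

```python
import gzip
import hashlib


def load_sub_manifest(blob: bytes, size: int, sha512: str) -> str:
    """Verify compressed bytes against the parent entry, then unpack."""
    if len(blob) != size or hashlib.sha512(blob).hexdigest() != sha512:
        raise ValueError("compressed Manifest does not match parent entry")
    # only verified data ever reaches the decompressor
    return gzip.decompress(blob).decode("utf-8")
```

No such check is possible for a compressed top-level Manifest, since
nothing above it provides a checksum.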


Performance considerations
--------------------------

Performing a full-tree verification on every sync raises some
performance concerns for end-user systems. The initial testing has shown
that a cold-cache verification on a btrfs file system can take around
4 minutes, with the process being mostly I/O bound. On the other hand,
it can be expected that the verification will be performed directly
after syncing, taking advantage of a warm filesystem cache.

To improve speed on I/O and/or CPU-constrained systems even further,
the algorithms can be easily extended to perform incremental
verification. Given that rsync does not preserve mtimes by default,
the tool can take advantage of mtime and Manifest comparisons to recheck
only the parts of the repository that have changed.

Furthermore, the package manager implementations can restrict checking
to the parts of the repository that are actually being used.
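
An incremental check based on mtimes might look like the sketch below.
It assumes the tool keeps a record of the mtimes observed at the last
successful verification; that record, its format and the function name
``files_to_recheck`` are hypothetical and not part of this
specification.

```python
import os


def files_to_recheck(root: str, paths, last_verified: dict) -> list:
    """Return paths whose mtime differs from the recorded one.

    `last_verified` maps relative path -> st_mtime_ns as recorded
    after the previous successful verification.
    """
    stale = []
    for rel in paths:
        mtime = os.stat(os.path.join(root, rel)).st_mtime_ns
        if last_verified.get(rel) != mtime:
            stale.append(rel)  # new or touched since last verification
    return stale
```

Files whose Manifest entries changed would need rechecking as well,
regardless of their mtime.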


Backwards Compatibility
=======================

This GLEP provides optional means of preserving backwards compatibility.
To preserve the backwards compatibility, the following needs to hold
for the ``Manifest`` file in every package directory:

- all files must be covered by the single ``Manifest`` file,

- all distfiles used by the package must be included,

- all files inside the ``files/`` subdirectory need to use
  the ``AUX`` tag (rather than ``DATA``),

- all ``.ebuild`` files need to use the ``EBUILD`` tag,

- the ``metadata.xml`` and ``ChangeLog`` files need to use
  the ``MISC`` tag,

- the Manifest can be signed to provide authenticity verification,

- an uncompressed Manifest must always exist, and a compressed Manifest
  of identical content may be present.

Once the backwards compatibility is no longer a concern, the above
no longer needs to hold and the deprecated tags can be removed.
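
The tag selection rules for a backwards-compatible package Manifest
can be summarized in a few lines. The function name ``compat_entry`` is
hypothetical, and stripping the ``files/`` prefix for ``AUX`` follows
from the implicit prefix mentioned in the rationale; treat it as
a sketch, not as the behavior of any particular tool.

```python
def compat_entry(path: str):
    """Return (tag, listed_path) for a path relative to the package dir."""
    if path.startswith("files/"):
        # AUX carries an implicit files/ prefix, so it is stripped here
        return "AUX", path[len("files/"):]
    if path.endswith(".ebuild"):
        return "EBUILD", path
    if path in ("metadata.xml", "ChangeLog"):
        return "MISC", path
    return "DATA", path
```

Once compatibility is dropped, every branch would collapse into
the final ``DATA`` case.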


Reference Implementation
========================

The reference implementation for this GLEP is being developed
as the gemato project [#GEMATO]_.


Credits
=======

Thanks to all the people whose contributions were invaluable
to the creation of this GLEP. This includes but is not limited to:

- Robin Hugh Johnson,
- Ulrich Müller.

Additionally, thanks to Robin Hugh Johnson for the original
MetaManifest GLEP series which served both as inspiration and source
of many concepts used in this GLEP. Recursively, also thanks to all
the people who contributed to the original GLEPs.


References
==========

.. [#GLEP44] GLEP 44: Manifest2 format
   (https://www.gentoo.org/glep/glep-0044.html)

.. [#GLEP57] GLEP 57: Security of distribution of Gentoo software
   - Overview
   (https://www.gentoo.org/glep/glep-0057.html)

.. [#GLEP58] GLEP 58: Security of distribution of Gentoo software
   - Infrastructure to User distribution - MetaManifest
   (https://www.gentoo.org/glep/glep-0058.html)

.. [#GLEP59] GLEP 59: Manifest2 hash policies and security implications
   (https://www.gentoo.org/glep/glep-0059.html)

.. [#GLEP60] GLEP 60: Manifest2 filetypes
   (https://www.gentoo.org/glep/glep-0060.html)

.. [#GLEP61] GLEP 61: Manifest2 compression
   (https://www.gentoo.org/glep/glep-0061.html)

.. [#UNICODE] The Unicode standard
   (https://unicode.org/versions/latest/)

.. [#PMS-FETCH] Package Manager Specification: Dependency Specification
   Format - SRC_URI
   (https://projects.gentoo.org/pms/6/pms.html#x1-940008.2.10)

.. [#FILE-NAMING-RULES] Ebuild File Format -- Gentoo Development Guide
   (https://devmanual.gentoo.org/ebuild-writing/file-format/#file-naming-rules)

.. [#MD5] RFC1321: The MD5 Message-Digest Algorithm
   (https://www.ietf.org/rfc/rfc1321.txt)

.. [#RIPEMD160] The hash function RIPEMD-160
   (https://homes.esat.kuleuven.be/~bosselae/ripemd160.html)

.. [#SHS] FIPS PUB 180-4: Secure Hash Standard (SHS)
   (http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf)

.. [#WHIRLPOOL] The WHIRLPOOL Hash Function
   (http://www.larc.usp.br/~pbarreto/WhirlpoolPage.html)

.. [#BLAKE2] BLAKE2 -- fast secure hashing
   (https://blake2.net/)

.. [#SHA3] FIPS PUB 202: SHA-3 Standard: Permutation-Based Hash
   and Extendable-Output Functions
   (http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.202.pdf)

.. [#STREEBOG] GOST R 34.11-2012: Streebog Hash Function
   (https://www.streebog.net/)

.. [#C08] Cappos, J et al. (2008). "Attacks on Package Managers"
   (https://www2.cs.arizona.edu/stork/packagemanagersecurity/attacks-on-package-managers.html)

.. [#DIST] According to Robin H. Johnson, 8.4% of all DIST entries
   at the time of writing are duplicate, representing 2 MiB
   out of 25 MiB of DIST entries altogether.

.. [#GEMATO] gemato: Gentoo Manifest Tool
   (https://github.com/mgorny/gemato/)


Copyright
=========
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
Unported License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.

-- 
Best regards,
Michał Górny




* Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update [v3]
  2017-11-21 17:26 ` [gentoo-dev] [RFC] GLEP 74 post-Council review update [v3] Michał Górny
@ 2017-11-21 18:20   ` Ulrich Mueller
  2017-11-21 18:22     ` Michał Górny
  0 siblings, 1 reply; 23+ messages in thread
From: Ulrich Mueller @ 2017-11-21 18:20 UTC
  To: gentoo-dev

>>>>> On Tue, 21 Nov 2017, Michał Górny wrote:

> All paths specified in the Manifest file must consist of characters
> corresponding to valid UTF-8 code points excluding the NULL character
> (``U+0000``), the backwards slash (``\``) and characters classified
> as whitespace in the current version of the Unicode standard
> [#UNICODE]_. It is an error to use Manifest files in directories
> containing files whose names contain the disallowed characters.
> The forward slash (``/``) must be used as path separator.

In addition to whitespace, you should also exclude C0 controls (U+0000
to U+001F), DEL (U+007F), and C1 controls (U+0080 to U+009F).

Rationale: these control characters can leave the user's terminal
in an unusable state when a package manager tries to output such a
filename in a message. As you reserve the backslash for a future
escape mechanism, this shouldn't be too severe a restriction.

Ulrich


* Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update [v3]
  2017-11-21 18:20   ` Ulrich Mueller
@ 2017-11-21 18:22     ` Michał Górny
  0 siblings, 0 replies; 23+ messages in thread
From: Michał Górny @ 2017-11-21 18:22 UTC
  To: gentoo-dev

On Tue, 21.11.2017 at 19:20 +0100, Ulrich Mueller wrote:
> > > > > > On Tue, 21 Nov 2017, Michał Górny wrote:
> > All paths specified in the Manifest file must consist of characters
> > corresponding to valid UTF-8 code points excluding the NULL character
> > (``U+0000``), the backwards slash (``\``) and characters classified
> > as whitespace in the current version of the Unicode standard
> > [#UNICODE]_. It is an error to use Manifest files in directories
> > containing files whose names contain the disallowed characters.
> > The forward slash (``/``) must be used as path separator.
> 
> In addition to whitespace, you should also exclude C0 controls (U+0000
> to U+001F), DEL (U+007F), and C1 controls (U+0080 to U+009F).
> 
> Rationale: these control characters can leave the user's terminal
> in an unusable state when a package manager tries to output such a
> filename in a message. As you reserve the backslash for a future
> escape mechanism, this shouldn't be too severe a restriction.
> 

Works for me. I'll update the spec later. Can you think of any other
sequences that should be explicitly forbidden?

-- 
Best regards,
Michał Górny




* Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update [v2]
  2017-11-21 17:14     ` Michał Górny
@ 2017-11-21 20:28       ` Ulrich Mueller
  2017-11-21 21:13         ` Michał Górny
  0 siblings, 1 reply; 23+ messages in thread
From: Ulrich Mueller @ 2017-11-21 20:28 UTC
  To: gentoo-dev

>>>>> On Tue, 21 Nov 2017, Michał Górny wrote:

>> > It is an error for a single file to be matched by multiple entries
>> > of different semantics, file size or checksum values. It is an error
>> > to specify another entry for a file matching ``IGNORE``, or one of its
>> > subdirectories.
>> 
>> What about regular files in a directory (or subdirectory) matched
>> by IGNORE? Looks like this case is not covered (?).

> Ignored regular files must not have any other (e.g. DATA) entries.
> Otherwise the expected behavior is unclear -- are we supposed to
> verify the file or ignore it?

I still believe that the wording doesn't convey that. Maybe an example
will clarify what I mean.

There is a directory foo/bar and a regular file foo/bar/quux in it.
Now in Manifest there are these entries:

   IGNORE foo/bar
   DATA foo/bar/quux <size> <checksums>

The spec says: "It is an error to specify another entry for a file
matching ``IGNORE``, or one of its subdirectories." However, file
foo/bar/quux neither matches IGNORE nor is a subdirectory of it.

Ulrich


* Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update [v2]
  2017-11-21 20:28       ` Ulrich Mueller
@ 2017-11-21 21:13         ` Michał Górny
  2017-11-21 21:48           ` Ulrich Mueller
  0 siblings, 1 reply; 23+ messages in thread
From: Michał Górny @ 2017-11-21 21:13 UTC
  To: gentoo-dev

On Tue, 21.11.2017 at 21:28 +0100, Ulrich Mueller wrote:
> > > > > > On Tue, 21 Nov 2017, Michał Górny wrote:
> > > > It is an error for a single file to be matched by multiple entries
> > > > of different semantics, file size or checksum values. It is an error
> > > > to specify another entry for a file matching ``IGNORE``, or one of its
> > > > subdirectories.
> > > 
> > > What about regular files in a directory (or subdirectory) matched
> > > by IGNORE? Looks like this case is not covered (?).
> > Ignored regular files must not have any other (e.g. DATA) entries.
> > Otherwise the expected behavior is unclear -- are we supposed to
> > verify the file or ignore it?
> 
> I still believe that the wording doesn't convey that. Maybe an example
> will clarify what I mean.
> 
> There is a directory foo/bar and a regular file foo/bar/quux in it.
> Now in Manifest there are these entries:
> 
>    IGNORE foo/bar
>    DATA foo/bar/quux <size> <checksums>
> 
> The spec says: "It is an error to specify another entry for a file
> matching ``IGNORE``, or one of its subdirectories." However, file
> foo/bar/quux neither matches IGNORE nor is a subdirectory of it.

Indeed, the second part of that sentence needs to change. Do you have
a suggestion how to word it best?

-- 
Best regards,
Michał Górny




* Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update [v2]
  2017-11-21 21:13         ` Michał Górny
@ 2017-11-21 21:48           ` Ulrich Mueller
  2017-11-21 23:51             ` Michał Górny
  0 siblings, 1 reply; 23+ messages in thread
From: Ulrich Mueller @ 2017-11-21 21:48 UTC
  To: gentoo-dev

>>>>> On Tue, 21 Nov 2017, Michał Górny wrote:

>> > > > It is an error for a single file to be matched by multiple
>> > > > entries of different semantics, file size or checksum values.
>> > > > It is an error to specify another entry for a file matching
>> > > > ``IGNORE``, or one of its subdirectories.

>> [...]

> Indeed, the second part of that sentence needs to change. Do you
> have a suggestion how to word it best?

"It is an error to specify another entry for a file that matches
``IGNORE`` or that is covered by an ignored directory."

Ulrich


* Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update [v2]
  2017-11-21 21:48           ` Ulrich Mueller
@ 2017-11-21 23:51             ` Michał Górny
  2017-11-22  5:43               ` Ulrich Mueller
  0 siblings, 1 reply; 23+ messages in thread
From: Michał Górny @ 2017-11-21 23:51 UTC
  To: gentoo-dev

On Tue, 21.11.2017 at 22:48 +0100, Ulrich Mueller wrote:
> > > > > > On Tue, 21 Nov 2017, Michał Górny wrote:
> > > > > > It is an error for a single file to be matched by multiple
> > > > > > entries of different semantics, file size or checksum values.
> > > > > > It is an error to specify another entry for a file matching
> > > > > > ``IGNORE``, or one of its subdirectories.
> > > [...]
> > Indeed, the second part of that sentence needs to change. Do you
> > have a suggestion how to word it best?
> 
> "It is an error to specify another entry for a file that matches
> ``IGNORE`` or that is covered by an ignored directory."
> 

I'm not sure if 'covered' wouldn't be confusing here. Maybe:

| It is an error to specify another entry for a file that matches
| ``IGNORE``, or that is located inside an ignored directory.

?

-- 
Best regards,
Michał Górny




* Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update [v2]
  2017-11-20 18:42 ` [gentoo-dev] [RFC] GLEP 74 post-Council review update [v2] Michał Górny
  2017-11-20 21:37   ` Ulrich Mueller
@ 2017-11-22  2:59   ` R0b0t1
  2017-11-22  8:02     ` Michał Górny
  1 sibling, 1 reply; 23+ messages in thread
From: R0b0t1 @ 2017-11-22  2:59 UTC
  To: gentoo-dev

On Mon, Nov 20, 2017 at 12:42 PM, Michał Górny <mgorny@gentoo.org> wrote:
> On Thu, 16.11.2017 at 11:19 +0100, Michał Górny wrote:
>> Hi, everyone.
>>
>> Here's the updated version of GLEP 74 taking into consideration
>> the points made during the Council pre-review.
>>
>> ReST: https://dev.gentoo.org/~mgorny/tmp/glep-0074.rst
>> HTML: https://dev.gentoo.org/~mgorny/tmp/glep-0074.html
>>
>
> New changes:
>
> 9d819c9 glep-0074: Disallow filenames containing whitespace

This seems like a bad idea. I apologize if this is covered in more
detail somewhere, but the only justification I can see is that the
current grammar does not permit quoting or some other method of
specifying whitespace as part of a field value.

Is there any way to assure that this won't break things in a
non-obvious way? I'm having a hard time imagining how it would be an
inflexible requirement to use a space in a filename, but it could come
up if it was necessary to use Portage on a non-Gentoo distribution.

It seems very arbitrary. I think the better solution is to use a better parser.

Cheers,
      R0b0t1



* Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update [v2]
  2017-11-21 23:51             ` Michał Górny
@ 2017-11-22  5:43               ` Ulrich Mueller
  0 siblings, 0 replies; 23+ messages in thread
From: Ulrich Mueller @ 2017-11-22  5:43 UTC
  To: gentoo-dev

>>>>> On Wed, 22 Nov 2017, Michał Górny wrote:

> On Tue, 21.11.2017 at 22:48 +0100, Ulrich Mueller wrote:
>> > > > > > On Tue, 21 Nov 2017, Michał Górny wrote:
>> > > > > > It is an error for a single file to be matched by multiple
>> > > > > > entries of different semantics, file size or checksum values.
>> > > > > > It is an error to specify another entry for a file matching
>> > > > > > ``IGNORE``, or one of its subdirectories.
>> > > [...]
>> > Indeed, the second part of that sentence needs to change. Do you
>> > have a suggestion how to word it best?
>>
>> "It is an error to specify another entry for a file that matches
>> ``IGNORE`` or that is covered by an ignored directory."

> I'm not sure if 'covered' wouldn't be confusing here.

Indeed, that verb can have many meanings.

> Maybe:

> | It is an error to specify another entry for a file that matches
> | ``IGNORE``, or that is located inside an ignored directory.

> ?

Works for me.

Ulrich


* Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update [v2]
  2017-11-22  2:59   ` R0b0t1
@ 2017-11-22  8:02     ` Michał Górny
  2017-11-22 16:38       ` R0b0t1
  0 siblings, 1 reply; 23+ messages in thread
From: Michał Górny @ 2017-11-22  8:02 UTC
  To: gentoo-dev

On Tue, 21.11.2017 at 20:59 -0600, R0b0t1 wrote:
> On Mon, Nov 20, 2017 at 12:42 PM, Michał Górny <mgorny@gentoo.org> wrote:
> > On Thu, 16.11.2017 at 11:19 +0100, Michał Górny wrote:
> > > Hi, everyone.
> > > 
> > > Here's the updated version of GLEP 74 taking into consideration
> > > the points made during the Council pre-review.
> > > 
> > > ReST: https://dev.gentoo.org/~mgorny/tmp/glep-0074.rst
> > > HTML: https://dev.gentoo.org/~mgorny/tmp/glep-0074.html
> > > 
> > 
> > New changes:
> > 
> > 9d819c9 glep-0074: Disallow filenames containing whitespace
> 
> This seems like a bad idea. I apologize if this is covered in more
> detail somewhere, but the only justification I can see is that the
> current grammar does not permit quoting or some other method of
> specifying whitespace as part of a field value.
> 
> Is there any way to assure that this won't break things in a
> non-obvious way? I'm having a hard time imagining how it would be an
> inflexible requirement to use a space in a filename, but it could come
> up if it was necessary to use Portage on a non-Gentoo distribution.

Having a whitespace there *will* break the parser. Until a better parser
is provided, we need to reject it to prevent tools from accidentally
generating broken files. It's better to tell straight away 'sorry, you
can't use Manifest here' than cause completely unexpected behavior
in the parser.

Using whitespace in filenames is going to break Portage in horrible
ways. Half of shell script in it is based on whitespace-separated lists.
PMS doesn't provide any means to replace some of them. It's not going to
happen.

> It seems very arbitrary. I think the better solution is to use a better parser.
> 

The parser is already there for 15 years or more. We can't just replace
it without breaking all old Portage versions.

-- 
Best regards,
Michał Górny




* Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update [v2]
  2017-11-22  8:02     ` Michał Górny
@ 2017-11-22 16:38       ` R0b0t1
  0 siblings, 0 replies; 23+ messages in thread
From: R0b0t1 @ 2017-11-22 16:38 UTC (permalink / raw
  To: gentoo-dev

On Wed, Nov 22, 2017 at 2:02 AM, Michał Górny <mgorny@gentoo.org> wrote:
> On Tue, 2017-11-21 at 20:59 -0600, R0b0t1 wrote:
>> On Mon, Nov 20, 2017 at 12:42 PM, Michał Górny <mgorny@gentoo.org> wrote:
>> > On Thu, 2017-11-16 at 11:19 +0100, Michał Górny wrote:
>> > > Hi, everyone.
>> > >
>> > > Here's the updated version of GLEP 74 taking into consideration
>> > > the points made during the Council pre-review.
>> > >
>> > > ReST: https://dev.gentoo.org/~mgorny/tmp/glep-0074.rst
>> > > HTML: https://dev.gentoo.org/~mgorny/tmp/glep-0074.html
>> > >
>> >
>> > New changes:
>> >
>> > 9d819c9 glep-0074: Disallow filenames containing whitespace
>>
>> This seems like a bad idea. I apologize if this is covered in more
>> detail somewhere, but the only justification I can see is that the
>> current grammar does not permit quoting or some other method of
>> specifying whitespace as part of a field value.
>>
>> Is there any way to assure that this won't break things in a
>> non-obvious way? I'm having a hard time imagining how it would be an
>> inflexible requirement to use a space in a filename, but it could come
>> up if it was necessary to use Portage on a non-Gentoo distribution.
>
> Having a whitespace there *will* break the parser. Until a better parser
> is provided, we need to reject it to prevent tools from accidentally
> generating broken files. It's better to tell straight away 'sorry, you
> can't use Manifest here' than cause completely unexpected behavior
> in the parser.
>
> Using whitespace in filenames is going to break Portage in horrible
> ways. Half of shell script in it is based on whitespace-separated lists.
> PMS doesn't provide any means to replace some of them. It's not going to
> happen.
>

Yes, I was talking about providing a better parser. I understand it is
as it is now because whitespace is a delimiter.

If it's not possible to know where all code that has this as a
requirement is, that's fairly bad.

http://langsec.org/occupy/

>> It seems very arbitrary. I think the better solution is to use a better parser.
>>
>
> The parser is already there for 15 years or more. We can't just replace
> it without breaking all old Portage versions.
>

It sounds like portage is already broken.

Cheers,
     R0b0t1



* Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update [v4]
  2017-11-16 10:19 [gentoo-dev] [RFC] GLEP 74 post-Council review update Michał Górny
                   ` (2 preceding siblings ...)
  2017-11-21 17:26 ` [gentoo-dev] [RFC] GLEP 74 post-Council review update [v3] Michał Górny
@ 2017-11-22 16:54 ` Michał Górny
  2017-11-22 20:41   ` Ulrich Mueller
  2017-11-23 20:53 ` [gentoo-dev] [RFC] GLEP 74 post-Council review update [v5] Michał Górny
  4 siblings, 1 reply; 23+ messages in thread
From: Michał Górny @ 2017-11-22 16:54 UTC (permalink / raw
  To: gentoo-dev

On Thu, 2017-11-16 at 11:19 +0100, Michał Górny wrote:
> Hi, everyone.
> 
> Here's the updated version of GLEP 74 taking into consideration
> the points made during the Council pre-review.
> 
> ReST: https://dev.gentoo.org/~mgorny/tmp/glep-0074.rst
> HTML: https://dev.gentoo.org/~mgorny/tmp/glep-0074.html
> 
> Changes:
> 

b3964b6 glep-0074: Recommend escaping control characters, suggested by
ulm
11f19f9 glep-0074: Provide encoding for disallowed characters
da2aace glep-0074: Clarify ignoring directories


---
GLEP: 74
Title: Full-tree verification using Manifest files
Author: Michał Górny <mgorny@gentoo.org>,
        Robin Hugh Johnson <robbat2@gentoo.org>,
        Ulrich Müller <ulm@gentoo.org>
Type: Standards Track
Status: Draft
Version: 1
Created: 2017-10-21
Last-Modified: 2017-11-16
Post-History: 2017-10-26, 2017-11-16
Content-Type: text/x-rst
Requires: 59, 61
Replaces: 44, 58, 60
---

Abstract
========

This GLEP extends the Manifest file format to cover full-tree file
integrity and authenticity checks. The format aims to be future-proof,
efficient and provide means of backwards compatibility.


Motivation
==========

The Manifest files as defined by GLEP 44 [#GLEP44]_ provide the current
means of verifying the integrity of distfiles and package files
in Gentoo. Combined with OpenPGP signatures, they provide means to
ensure the authenticity of the covered files. However, as noted
in GLEP 57 [#GLEP57]_ they lack the ability to provide full-tree
authenticity verification as they do not cover any files outside
the package directory. In particular, they provide multiple ways
for a third party to inject malicious code into the ebuild environment.

Historically, the topic of providing authenticity coverage for the whole
repository has been mentioned multiple times. The most noteworthy efforts
are GLEPs 58 [#GLEP58]_ and 60 [#GLEP60]_ by Robin H. Johnson from 2008.
They were accepted by the Council in 2010 but have never been
implemented. When potential implementation work started in 2017, a new
discussion about the specification arose. It prompted the creation
of a competing GLEP that would provide a redesigned alternative to
the old GLEPs.

This specification is designed with the following goals in mind:

1. It should provide means to ensure the authenticity of the complete
   repository, including preventing the injection of additional files.

2. The format should be universal enough to work both for the Gentoo
   repository and third-party repositories of different characteristics.

3. The Manifest files should be verifiable stand-alone, that is without
   knowing any details about the underlying repository format.


Specification
=============

Manifest file format
--------------------

This specification reuses and extends the Manifest file format defined
in GLEP 44 [#GLEP44]_. For the purpose of it, the *file type* field is
repurposed as a generic *tag* that could also indicate additional
(non-checksum) metadata. Appropriately, those tags can be followed by
other space-separated values.

Unless specified otherwise, the paths used in the Manifest files
are relative to the directory containing the Manifest file. The paths
must not reference the parent directory (``..``). Forward slash (``/``)
is used as path component separator.

The Manifest files use UTF-8 encoding.


Manifest file locations and nesting
-----------------------------------

The ``Manifest`` file located in the root directory of the repository
is called top-level Manifest, and it is used to perform the full-tree
verification. In order to verify the authenticity, it must be signed
using OpenPGP, using the armored cleartext format.

The top-level Manifest may reference sub-Manifests contained
in subdirectories of the repository. The sub-Manifests are traditionally
named ``Manifest``; however, the implementation must support arbitrary
names, including the possibility of multiple (split) Manifests
for a single directory. The sub-Manifest can only cover the files inside
the directory tree where it resides.

The sub-Manifest can also be signed using OpenPGP armored cleartext
format. However, the signature verification can be omitted since it
already is covered by the signed top-level Manifest.


Directory tree coverage
-----------------------

The specification provides three ways of skipping Manifest verification
of specific files and directories (recursively):

1. explicit ``IGNORE`` entries in Manifest files,

2. injected ignore paths via package manager configuration,

3. using names starting with a dot (``.``) which are always skipped.

All files that are not ignored must be covered by at least one
of the Manifests.

A single file may be matched by multiple identical or equivalent
Manifest entries, if and only if the entries have the same semantics,
specify the same size and the checksums common to both entries match.
It is an error for a single file to be matched by multiple entries
of different semantics, file size or checksum values. It is an error
to specify another entry for a file that matches ``IGNORE``, or that
is located inside an ignored directory.
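
The compatibility rule for duplicate entries can be sketched as follows.
This is only an illustration and not part of the specification: the
``Entry`` tuple and ``compatible()`` are hypothetical names, paths are
assumed to be already normalized relative to the top-level Manifest,
and "same semantics" is approximated by treating the deprecated tags
as ``DATA``.

```python
from typing import Dict, NamedTuple


class Entry(NamedTuple):
    tag: str                   # e.g. "DATA", "MANIFEST", "EBUILD"
    path: str                  # path relative to the top-level Manifest
    size: int
    checksums: Dict[str, str]  # algorithm name -> hex digest


# Assumption: the deprecated tags share the semantics of DATA.
_DATA_LIKE = {"DATA", "EBUILD", "MISC", "AUX"}


def _semantics(tag: str) -> str:
    return "DATA" if tag in _DATA_LIKE else tag


def compatible(a: Entry, b: Entry) -> bool:
    """Return True if two entries for the same file may coexist."""
    if _semantics(a.tag) != _semantics(b.tag) or a.size != b.size:
        return False
    # Only the checksums present in both entries have to agree.
    common = a.checksums.keys() & b.checksums.keys()
    return all(a.checksums[alg] == b.checksums[alg] for alg in common)
```

Two entries that share no checksum algorithm are accepted by this
sketch as long as tag semantics and size agree.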

The file entries (except for ``IGNORE``) can be specified for regular
files only. Symbolic links are followed when opening files
and traversing directories. It is an error to specify an entry for
a different file type. If the tree contains files of other types
that are not otherwise ignored, they need to be covered by an explicit
``IGNORE``.

All the local (non-``DIST``) files covered by a Manifest tree must
reside on the same filesystem. It is an error to specify entries
applying to files on another filesystem. If files or directories that
are not otherwise ignored reside on a different filesystem, or symbolic
links point to targets on a different filesystem, they must
be explicitly excluded via ``IGNORE``.


Path and filename encoding
--------------------------

The path fields in the Manifest file must consist of characters
corresponding to valid UTF-8 code points excluding the NULL character
(``U+0000``), the backwards slash (``\``) and characters classified
as whitespace in the current version of the Unicode standard
[#UNICODE]_.

Any of the excluded characters that are present in a path must be encoded
using one of the following escape sequences:

- characters in the ``U+0000`` to ``U+007F`` range can be encoded
  as ``\xHH`` where ``HH`` specifies the zero-padded, hexadecimal
  character code,

- characters in the ``U+0000`` to ``U+FFFF`` range can be encoded
  as ``\uHHHH`` where ``HHHH`` specifies the zero-padded, hexadecimal
  character code,

- characters in the UCS-4 range can be encoded as ``\UHHHHHHHH``
  where ``HHHHHHHH`` specifies the zero-padded, hexadecimal character
  code.

It is invalid for backwards slash to be used in any other context,
and a backwards slash present in a filename must be encoded. Backwards
slash used as path component separator should be replaced by forward
slash instead.

The encoding can be used for other characters as well. In particular,
escaping control characters is recommended to ensure that the file
works correctly in text editors.
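
A minimal sketch of the escaping scheme is given below. The function
names are hypothetical, and Python's ``str.isspace()`` is used as
an approximation of the Unicode whitespace classification, which is
an assumption of this example rather than a requirement of the
specification. ASCII control characters are escaped as well, following
the recommendation above.

```python
import re

# \xHH is limited to 00..7F; \uHHHH and \UHHHHHHHH carry code points.
_ESCAPE_RE = re.compile(
    r"\\(?:x([0-7][0-9A-Fa-f])|u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8}))")


def encode_path(path: str) -> str:
    """Escape backslash, whitespace and ASCII control characters."""
    out = []
    for ch in path:
        cp = ord(ch)
        if ch != "\\" and not ch.isspace() and cp >= 0x20 and cp != 0x7F:
            out.append(ch)
        elif cp <= 0x7F:
            out.append("\\x%02X" % cp)
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)
        else:
            out.append("\\U%08X" % cp)
    return "".join(out)


def decode_path(field: str) -> str:
    """Reverse encode_path(); reject any other use of backslash."""
    # Whatever backslash remains after removing the valid escape
    # sequences is used in a context the format does not allow.
    if "\\" in _ESCAPE_RE.sub("", field):
        raise ValueError("invalid backslash in path: %r" % field)

    def replace(match):
        digits = match.group(1) or match.group(2) or match.group(3)
        return chr(int(digits, 16))

    return _ESCAPE_RE.sub(replace, field)
```

For example, ``encode_path("my file.txt")`` yields ``my\x20file.txt``,
and ``decode_path()`` raises ``ValueError`` for shorthand forms such
as ``\t``.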


File verification
-----------------

When verifying a file against the Manifest, the following rules are
used:

1. If the file is covered directly or indirectly by an entry
   of the ``IGNORE`` type, the verification always succeeds.

2. If the file is covered by an entry of the ``MANIFEST``, ``DATA``,
   ``MISC``, ``EBUILD`` or ``AUX`` type:

   a. if the file is not present, then the verification fails,

   b. if the file is present but has a different size or one
      of the checksums does not match, the verification fails,

   c. otherwise, the verification succeeds.

3. If the file is present but not listed in Manifest, the verification
   fails.

Unless specified otherwise, the package manager must not allow using
any files for which the verification failed. The package manager may
reject any package or even the whole repository if it may refer to files
for which the verification failed.
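
The three rules can be sketched for a single file as follows.
``verify_file()`` and its parameters are hypothetical; the sketch
assumes that each checksum name, once lowercased, is a valid Python
``hashlib`` algorithm name, and it reads the whole file into memory
for brevity.

```python
import hashlib
import os
from typing import Dict, Optional, Tuple

# (size, {algorithm name: hex digest}) taken from a Manifest entry
FileEntry = Tuple[int, Dict[str, str]]


def verify_file(path: str, entry: Optional[FileEntry],
                ignored: bool = False) -> bool:
    """Apply the verification rules to one file."""
    if ignored:
        return True                      # rule 1: IGNORE always passes
    if entry is None:
        # Rule 3: a file that is present but not listed fails.
        return not os.path.lexists(path)
    size, checksums = entry
    if not os.path.isfile(path):
        return False                     # rule 2a: file is missing
    with open(path, "rb") as f:
        data = f.read()
    if len(data) != size:
        return False                     # rule 2b: size mismatch
    for alg, expected in checksums.items():
        actual = hashlib.new(alg.lower(), data).hexdigest()
        if actual != expected.lower():
            return False                 # rule 2b: checksum mismatch
    return True                          # rule 2c
```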


Timestamp verification
----------------------

The top-level Manifest file can contain a ``TIMESTAMP`` entry to account
for attacks against tree update distribution. If such an entry
is present, it should be updated every time at least one
of the Manifests changes. Every unique timestamp value must correspond
to a single tree state.

During the verification process, the client should compare the timestamp
against the update time obtained from a local clock or a trusted time
source. If the comparison result indicates that the Manifest at the time
of receiving was already significantly outdated, the client should
either fail the verification or require manual confirmation from
the user.

Furthermore, the Manifest provider may employ additional methods
of distributing the timestamps of recently generated Manifests
using a secure channel from a trusted source for exact comparison.
The exact details of such a solution are outside the scope of this
specification.

``TIMESTAMP`` entries may also be present in sub-Manifests. Those
timestamps must not be newer than the timestamp of the top-level
Manifest (if present). This specification does not define any specific
use for them.
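
As an illustration, the staleness check could look like the sketch
below. The specification does not define what "significantly outdated"
means, so the one-day ``max_age`` default is purely an assumption
of this example, as are the function names.

```python
from datetime import datetime, timedelta, timezone

# The strftime()/strptime() format required for TIMESTAMP entries.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_timestamp(value: str) -> datetime:
    """Parse a TIMESTAMP value into an aware UTC datetime."""
    parsed = datetime.strptime(value, TIMESTAMP_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


def is_outdated(value: str, now: datetime,
                max_age: timedelta = timedelta(days=1)) -> bool:
    """True if the Manifest was older than max_age when received."""
    return now - parse_timestamp(value) > max_age
```

A client would pass the update time from its local clock or a trusted
time source as ``now``, for example
``datetime.now(timezone.utc)``.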


Modern Manifest tags
--------------------

The Manifest files can specify the following tags:

``TIMESTAMP <iso8601>``
  Specifies a timestamp of when the Manifest file was last updated.
  The timestamp must be a valid second-precision ISO 8601 extended
  format combined date and time in UTC timezone, i.e. using
  the following ``strftime()`` format string: ``%Y-%m-%dT%H:%M:%SZ``.
  Optional. The package manager can use it to detect an outdated
  repository checkout as described in `Timestamp verification`_.

``MANIFEST <path> <size> <checksums>...``
  Specifies a sub-Manifest. The sub-Manifest must be verified like
  a regular file. If the verification succeeds, the entries from
  the sub-Manifest are included for verification as described
  in `Manifest file locations and nesting`_.

``IGNORE <path>``
  Ignores a subdirectory or file from Manifest checks. If the specified
  path is present, it and its contents are omitted from the Manifest
  verification (always pass). *Path* must be a plain file or directory
  path without a trailing slash. Wildcards are not supported
  and wildcard characters are interpreted literally.

``DATA <path> <size> <checksums>...``
  Specifies a regular file subject to Manifest verification. The file
  is required to pass verification. Used for all files that do not match
  any other type.

``DIST <filename> <size> <checksums>...``
  Specifies a distfile entry used to verify files fetched as part
  of ``SRC_URI``. The filename must match the filename used to store
  the fetched file as specified in the PMS [#PMS-FETCH]_. The package
  manager must reject the fetched file if it fails verification.
  ``DIST`` entries apply to all packages below the Manifest file
  specifying them.
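
The entry layout of the tags above can be illustrated with a small
line parser. ``ManifestLine`` and ``parse_line()`` are hypothetical
names; the sketch does not decode escape sequences in paths
and rejects unknown tags, which is a choice of this example
and not mandated by the specification.

```python
from typing import Dict, NamedTuple, Optional

CHECKSUM_TAGS = {"MANIFEST", "DATA", "DIST"}


class ManifestLine(NamedTuple):
    tag: str
    value: str                                  # path, filename or timestamp
    size: Optional[int] = None
    checksums: Optional[Dict[str, str]] = None


def parse_line(line: str) -> Optional[ManifestLine]:
    """Parse one Manifest line; return None for an empty line."""
    fields = line.split()
    if not fields:
        return None
    tag, args = fields[0], fields[1:]
    if tag in ("TIMESTAMP", "IGNORE"):
        if len(args) != 1:
            raise ValueError("%s takes exactly one value" % tag)
        return ManifestLine(tag, args[0])
    if tag in CHECKSUM_TAGS:
        # <path> <size> followed by (algorithm, digest) pairs
        if len(args) < 2 or len(args) % 2 != 0:
            raise ValueError("malformed %s entry" % tag)
        pairs = dict(zip(args[2::2], args[3::2]))
        return ManifestLine(tag, args[0], int(args[1]), pairs)
    raise ValueError("unknown tag: %s" % tag)
```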


Deprecated Manifest tags
------------------------

For backwards compatibility, the following tags are additionally
allowed at the package directory level:

``EBUILD <filename> <size> <checksums>...``
  Equivalent to the ``DATA`` type.

``MISC <path> <size> <checksums>...``
  Equivalent to the ``DATA`` type. Historically indicated that
  the package manager may ignore a verification failure if operating
  in non-strict mode. However, that behavior is deprecated.

``AUX <filename> <size> <checksums>...``
  Equivalent to the ``DATA`` type, except that the filename is relative
  to the ``files/`` subdirectory.
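
A verifier could map the deprecated tags onto ``DATA`` before further
processing, as in this small sketch (``normalize_entry()`` is
a hypothetical helper, not something defined by this specification):

```python
import posixpath


def normalize_entry(tag: str, path: str):
    """Map a deprecated tag and its path onto the DATA equivalent."""
    if tag == "AUX":
        # AUX filenames are relative to the files/ subdirectory.
        return "DATA", posixpath.join("files", path)
    if tag in ("EBUILD", "MISC"):
        return "DATA", path
    return tag, path
```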


Algorithm for full-tree verification
------------------------------------

In order to perform full-tree verification, the following algorithm
can be used:

1. Collect all files present in the repository into *present* set.

2. Start at the top-level Manifest file. Verify its OpenPGP signature.
   Optionally verify the ``TIMESTAMP`` entry if present as specified
   in `timestamp verification`_. Remove the top-level Manifest
   from the *present* set.

3. Process all ``MANIFEST`` entries, recursively. Verify the Manifest
   files according to the `file verification`_ section, and include
   their entries in the current Manifest entry list (using paths
   relative to directories containing the Manifests).

4. Process all ``IGNORE`` entries. Remove any paths matching them
   from the *present* set.

5. Collect all files covered by ``DATA``, ``MISC``, ``EBUILD``
   and ``AUX`` entries into the *covered* set.

6. Verify the entries in the *covered* set for incompatible duplicates
   and collisions with ignored files as explained in `Directory tree
   coverage`_.

7. Verify all the files in the union of the *present* and *covered*
   sets, according to the `file verification`_ section.
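
The steps above can be sketched as follows. This is a simplified
illustration under several assumptions: OpenPGP and ``TIMESTAMP``
checks are left out, Manifests are uncompressed, paths contain
no escape sequences, checksum names map to ``hashlib`` names
by lowercasing, and duplicate entries must be identical (stricter than
the specification). All function names are hypothetical.

```python
import hashlib
import os

FILE_TAGS = {"MANIFEST", "DATA", "MISC", "EBUILD", "AUX"}


def _check(path, size, checksums):
    """Size and checksum comparison for one regular file."""
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as f:
        data = f.read()
    return len(data) == size and all(
        hashlib.new(alg.lower(), data).hexdigest() == digest.lower()
        for alg, digest in checksums.items())


def _load(root, manifest, covered, ignored):
    """Steps 3 and 5: read a Manifest, recursing into MANIFEST entries."""
    base = os.path.dirname(manifest)
    with open(os.path.join(root, manifest), encoding="utf-8") as f:
        lines = [line.split() for line in f]
    for fields in lines:
        if len(fields) == 2 and fields[0] == "IGNORE":
            ignored.add(os.path.normpath(os.path.join(base, fields[1])))
        elif len(fields) >= 3 and fields[0] in FILE_TAGS:
            name = "files/" + fields[1] if fields[0] == "AUX" else fields[1]
            rel = os.path.normpath(os.path.join(base, name))
            entry = (int(fields[2]), dict(zip(fields[3::2], fields[4::2])))
            if covered.setdefault(rel, entry) != entry:
                raise ValueError("incompatible duplicate entry: " + rel)
            if fields[0] == "MANIFEST":
                if not _check(os.path.join(root, rel), *entry):
                    raise ValueError("sub-Manifest failed: " + rel)
                _load(root, rel, covered, ignored)


def _under(rel, ignored):
    """True if rel or one of its parent directories is ignored."""
    parts = rel.split(os.sep)
    return any(os.sep.join(parts[:i + 1]) in ignored
               for i in range(len(parts)))


def verify_tree(root):
    """Return the sorted relative paths that fail verification."""
    present = set()                                        # step 1
    for dirpath, _dirs, files in os.walk(root, followlinks=True):
        for name in files:
            present.add(os.path.relpath(os.path.join(dirpath, name), root))
    present.discard("Manifest")                            # step 2
    covered, ignored = {}, set()
    _load(root, "Manifest", covered, ignored)              # steps 3, 5
    present = {p for p in present                          # step 4
               if not _under(p, ignored)
               and not any(s.startswith(".") for s in p.split(os.sep))}
    for rel in covered:                                    # step 6
        if _under(rel, ignored):
            raise ValueError("entry collides with IGNORE: " + rel)
    return [rel for rel in sorted(present | set(covered))  # step 7
            if rel not in covered
            or not _check(os.path.join(root, rel), *covered[rel])]
```

An empty list returned by ``verify_tree()`` means every present
or listed file passed. Note that ``os.walk(..., followlinks=True)``
does not guard against symbolic link cycles.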


Algorithm for finding parent Manifests
--------------------------------------

In order to find the top-level Manifest from the current directory
the following algorithm can be used:

1. Store the current directory as *original* and the device ID
   of the containing filesystem (``st_dev``) as *startdev*.

2. If the device ID of the containing filesystem (``st_dev``)
   of the current directory is different from *startdev*, stop.

3. If the current directory contains a ``Manifest`` file:

   a. If an ``IGNORE`` entry in the ``Manifest`` file covers
      the *original* directory (or one of the parent directories), stop.

   b. Otherwise, store the current directory as *last_found*.

4. If the current directory is the root system directory (``/``), stop.

5. Otherwise, enter the parent directory and jump to step 2.

Once the algorithm stops, *last_found* will contain the relevant
top-level Manifest. If *last_found* is null, then the directory tree
does not contain any valid top-level Manifest candidates and one should
be created in the *original* directory.

Once the top-level Manifest is found, its ``MANIFEST`` entries should
be used to find any sub-Manifests below the top-level Manifest,
up to and including the *original* directory. Note that those
sub-Manifests can use different filenames than ``Manifest``.
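
The upward search can be sketched as shown below. The function names
are hypothetical, and the sketch assumes an uncompressed ``Manifest``
whose ``IGNORE`` paths contain no escape sequences. It returns
the directory holding the top-level Manifest, or ``None`` if there
is no candidate.

```python
import os
from typing import Optional


def _ignores(manifest_path: str, rel_original: str) -> bool:
    """True if an IGNORE entry covers rel_original or a parent of it."""
    ignored = set()
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) == 2 and fields[0] == "IGNORE":
                ignored.add(fields[1])
    parts = [] if rel_original == "." else rel_original.split("/")
    return any("/".join(parts[:i + 1]) in ignored
               for i in range(len(parts)))


def find_top_level(start: str) -> Optional[str]:
    """Return the directory containing the top-level Manifest."""
    original = os.path.realpath(start)
    startdev = os.stat(original).st_dev              # step 1
    current, last_found = original, None
    while True:
        if os.stat(current).st_dev != startdev:      # step 2
            break
        manifest = os.path.join(current, "Manifest")
        if os.path.isfile(manifest):                 # step 3
            rel = os.path.relpath(original, current).replace(os.sep, "/")
            if _ignores(manifest, rel):              # step 3a
                break
            last_found = current                     # step 3b
        parent = os.path.dirname(current)
        if parent == current:                        # step 4: reached "/"
            break
        current = parent                             # step 5
    return last_found
```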


Checksum algorithms
-------------------

This section is informational only. Specifying the exact set
of supported algorithms is outside the scope of this specification.

The algorithm names reserved at the time of writing are:

- ``MD5`` [#MD5]_,
- ``RMD160`` -- RIPEMD-160 [#RIPEMD160]_,
- ``SHA1`` [#SHS]_,
- ``SHA256`` and ``SHA512`` -- SHA-2 family of hashes [#SHS]_,
- ``WHIRLPOOL`` [#WHIRLPOOL]_,
- ``BLAKE2B`` and ``BLAKE2S`` -- BLAKE2 family of hashes [#BLAKE2]_,
- ``SHA3_256`` and ``SHA3_512`` -- SHA-3 family of hashes [#SHA3]_,
- ``STREEBOG256`` and ``STREEBOG512`` -- Streebog family of hashes
  [#STREEBOG]_.

The method of introducing new hashes is defined by GLEP 59 [#GLEP59]_.
It is recommended that any new hashes are named after the Python
``hashlib`` module algorithm names, transformed into uppercase.
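
The naming convention can be illustrated with the following sketch.
The helper names and the alias table are assumptions of this example;
whether a given algorithm (for instance RIPEMD-160, Whirlpool
or Streebog) is actually available depends on the local ``hashlib``
build.

```python
import hashlib

# Assumption: reserved names that differ from their hashlib spelling.
_ALIASES = {"RMD160": "ripemd160"}


def hashlib_name(manifest_name: str) -> str:
    """Map a Manifest checksum name to a hashlib algorithm name."""
    return _ALIASES.get(manifest_name, manifest_name.lower())


def manifest_name(hashlib_algorithm: str) -> str:
    """Recommended Manifest name for a new hashlib algorithm."""
    return hashlib_algorithm.upper()


def compute_checksums(data: bytes, algorithms) -> dict:
    """Return {Manifest name: hex digest} for the given data."""
    return {alg: hashlib.new(hashlib_name(alg), data).hexdigest()
            for alg in algorithms}
```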


Manifest compression
--------------------

The topic of Manifest file compression is covered by GLEP 61 [#GLEP61]_.
This section merely addresses interoperability issues between Manifest
compression and this specification.

The compressed Manifest files are required to be suffixed for their
compression algorithm. This suffix should be used to recognize
the compression and decompress Manifests transparently. The exact list
of algorithms and their corresponding suffixes are outside the scope
of this specification.

The top-level Manifest file must not be compressed. Since the OpenPGP
signature covers the uncompressed text and would itself be stored
inside the compressed file, the data would have to be decompressed
before any verification could take place.
This could expose users e.g. to zip bombs or exploits on decompressor
vulnerabilities.

Whenever this specification refers to sub-Manifests, they can use any
names but are also required to use a specific compression suffix.
The ``MANIFEST`` entries are required to specify the full name including
compression suffix, and the verification is performed on the compressed
file.

The specification permits uncompressed Manifests to exist alongside
their compressed counterparts, and multiple compressed formats
to coexist. If that is the case, the files must have the same
uncompressed content and the implementation is free to choose either
of the files using the same base name.
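
Transparent decompression by suffix could be sketched as follows.
The suffix table is only an assumption of this example, since the list
of algorithms and suffixes is outside the scope of this specification,
and ``open_manifest()`` is a hypothetical helper. The checksum from
the ``MANIFEST`` entry should be verified against the compressed file
before it is opened this way.

```python
import bz2
import gzip
import lzma

# Assumption: illustrative suffixes only; see GLEP 61 for the real list.
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_manifest(path: str):
    """Open a possibly compressed Manifest as UTF-8 text."""
    for suffix, opener in _OPENERS.items():
        if path.endswith(suffix):
            return opener(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")
```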


Combining multiple Manifest trees (informational)
-------------------------------------------------

This specification permits nesting multiple hierarchical Manifest trees.
In this layout, the specific directories of the Manifest tree can
be verified both as a part of another top-level Manifest,
and as an independent Manifest tree (when obtained without the parent
directory).

For this to work, the sub-Manifest file in the directory must also
satisfy the requirements for the top-level Manifest file. That is:

- it must be named ``Manifest`` and not compressed,

- it must cover all the files in this directory and its subdirectories
  (i.e. no files from the directory tree can be covered by parent
  Manifest),

- if authenticity verification is desired, it must be OpenPGP-signed.

It should be noted that if such a directory is a subdirectory of a valid
Manifest tree, the sub-Manifest needs to be valid according
to the top-level Manifest and the OpenPGP signature is disregarded
as detailed in `Manifest file locations and nesting`_. The top-level
behavior is exhibited only when the directory is obtained without parent
directories.


An example Manifest file (informational)
----------------------------------------

An example top-level Manifest file for the Gentoo repository would have
the following content::

    TIMESTAMP 2017-10-30T10:11:12Z
    IGNORE distfiles
    IGNORE local
    IGNORE lost+found
    IGNORE packages
    MANIFEST app-accessibility/Manifest 14821 SHA256 1b5f.. SHA512 f7eb..
    ...
    MANIFEST eclass/Manifest.gz 50812 SHA256 8c55.. SHA512 2915..
    ...

An example modern Manifest (disregarding backwards compatibility)
for a package directory would have the following content::

    DATA SphinxTrain-0.9.1-r1.ebuild 932 SHA256 3d3b.. SHA512 be4d..
    DATA SphinxTrain-1.0.8.ebuild 912 SHA256 f681.. SHA512 0749..
    DATA metadata.xml 664 SHA256 97c6.. SHA512 1175..
    DATA files/gcc.patch 816 SHA256 b56e.. SHA512 2468..
    DATA files/gcc34.patch 333 SHA256 c107.. SHA512 9919..
    DIST SphinxTrain-0.9.1-beta.tar.gz 469617 SHA256 c1a4.. SHA512 1b33..
    DIST sphinxtrain-1.0.8.tar.gz 8925803 SHA256 548e.. SHA512 465d..


Rationale
=========

Stand-alone format
------------------

The first question that needed to be asked before proceeding with
the design was whether the Manifest file format was supposed to be
stand-alone, or tightly bound to the repository format.

The stand-alone format has been selected because of its three
advantages:

1. It is more future-proof. If an incompatible change to the repository
   format is introduced, only developers need to upgrade the tools
   they use to generate the Manifests. The tools used to verify
   the updated Manifests will continue to work.

2. It is more flexible and universal. With a dedicated tool,
   the Manifest files can be used to sign and verify arbitrary file
   sets.

3. It keeps the verification tool simpler. In particular, we can easily
   write an independent verification tool that could work on any
   distribution without needing to depend on a package manager
   implementation or rewrite parts of it.

Designing a stand-alone format requires that the Manifest carries enough
information to perform the verification following all the rules specific
to the Gentoo repository.


Tree design
-----------

The second important point of the design was determining whether
the Manifest files should be structured hierarchically or independently.
Both options have their advantages.

In the hierarchical model, each sub-Manifest file is covered by a higher
level Manifest. As a result, only the top-level Manifest has to be
OpenPGP-signed, and subsequent Manifests need to be only verified by
checksum stored in the parent Manifest. This has the following
implications:

- Verifying any set of files in the repository requires using checksums
  from the most relevant Manifests and the parent Manifests.

- The OpenPGP signature of the top-level Manifest needs to be verified
  only once per process.

- Altering any set of files requires updating the relevant Manifests,
  and their parent Manifests up to the top-level Manifest, and signing
  the last one.

- As a result, the top-level Manifest changes on every commit,
  and various middle-level Manifests change (and need to be transferred)
  frequently.

In the independent model, each sub-Manifest file is independent
of the parent Manifests. As a result, each of them needs to be signed
and verified independently. However, the parent Manifests still need
to list sub-Manifests (albeit without verification data) in order
to detect removal or replacement of subdirectories. This has
the following implications:

- Verifying any set of files in the repository requires using checksums
  and verifying signatures of the most relevant Manifest files.

- Altering any set of files requires updating the relevant Manifests
  and signing them again.

- Parent Manifests are updated only when Manifests are added or removed
  from subdirectories. As a result, they change infrequently.

While both models have their advantages, the hierarchical model was
selected because it reduces the number of OpenPGP operations
(which are comparatively costly) to the minimum.


Tree layout restrictions
------------------------

The algorithm is meant to work primarily with ebuild repositories which
normally contain only files and directories. Directories provide
no useful metadata for verification, and specifying special entries
for additional file types is purposeless. Therefore, the specification
is restricted to dealing with regular files.

The Gentoo repository does not use symbolic links. Some Gentoo
repositories do, however. To provide a simple solution for dealing with
symlinks without having to take care to implement special handling for
them, the common behavior of implicitly resolving them is used.
Therefore, symbolic links to files are stored as if they were regular
files, and symbolic links to directories are followed as if they were
regular directories.

Dotfiles are implicitly ignored as that is a common notion used
in software written for POSIX systems. All other filenames require
explicit ``IGNORE`` lines.

An ability to inject additional ignore entries is provided to account
for site configuration affecting the repository tree -- placing
additional files in it, skipping some of the categories from syncing.
This configuration can extend beyond the limits of this GLEP,
e.g. by allowing wildcards or regular expressions.

The algorithm is restricted to work on a single filesystem. This is
mostly relevant when scanning for top-level Manifest -- we do not want
to cross filesystem boundaries then. However, to ensure consistent
bidirectional behavior we also need to ban them when operating down
the tree.

The directories and files on different filesystems need to be ignored
explicitly as implicitly skipping them would cause confusion.
In particular, tools might then claim that a file does not exist when
it clearly does because it was skipped due to filesystem boundaries.


Filename character set restriction
----------------------------------

The valid set of filename characters for the Gentoo repository
is restricted by the devmanual 'File Naming Rules' section
[#FILE-NAMING-RULES]_, and enforced via a git hook. The valid distfile
names are not restricted explicitly -- however, the PMS dependency
specification syntax [#PMS-FETCH]_ implicitly makes it impossible to use
filenames containing whitespace.

This specification aims to avoid arbitrary restrictions. For this
reason, filename characters are only restricted by excluding three
technically problematic groups:

1. The NULL character (``U+0000``) is normally used to indicate the end
   of a null-terminated string. Its use could therefore break programs
   written using C. Furthermore, it is not allowed in any known
   filesystem.

2. The backwards slash character (``\``) is used as path separator
   on Windows systems, so it's extremely unlikely to be used in real
   filenames. For this reason it is used to implement character
   encoding with minimal risk of breaking backwards compatibility.

3. Whitespace characters are used to separate Manifest fields
   and entries. While technically it would be enough to restrict space
   (``U+0020``) character that is normally used as the separator
   and newline (``U+000A``) character that is used to separate lines,
   all whitespace characters are forbidden to avoid confusion
   and implementation errors.

Historically, Portage attempted to overcome the whitespace limitation
by attempting to locate the size field and take everything before it
as filename. This was terribly fragile and even if it worked, it would
solve the problem only partially.

The character encoding method provides means to overcome the character
restrictions to extend the tool usability beyond immediate Gentoo uses.
The backslash escape form based on Python unicode strings is used
since it can encode all characters within the Unicode range, the syntax
is familiar to many programmers and the backwards slash character
is extremely unlikely to appear in real filenames.

Syntax is limited to the minimum necessary to implement the encoding.
Shorthand forms (e.g. ``\t`` or ``\\``) are omitted to avoid unnecessary
complexity, and to reduce the risk of shell users using backslash
to escape space directly. The ``\x`` form is limited to ``\x00..\x7F``
range to avoid ambiguity of higher values which might be interpreted
either as UCS-2 code points or part of a UTF-8 encoded character.

Encoding stores UCS-2/UCS-4 characters directly rather than hex-encoded
UTF-8 string to simplify the implementation. In particular, it makes it
possible to process the Manifest file as UTF-8 encoded text without
having to perform additional UTF-8 decoding (and verification)
of the escaped data.

URL-encoding was considered as an alternative. However, it could collide
with ``DIST`` entries that are implicitly named after the URL filename
part where URL-encoding is pretty common.


File verification model
-----------------------

The verification model aims to provide full coverage against different
forms of attack. In particular, three different kinds of manipulation
are considered:

1. Alteration of the file content.

2. Removal of a file.

3. Addition of a new file.

In order to protect against all three, the system requires that all
files in the repository are listed in Manifests and verified against
them.

As a special case, ignores are allowed to account for directories
that are not part of the repository but were traditionally placed inside
it. Those directories were ``distfiles``, ``local`` and ``packages``.
Ignores could also be used to exclude VCS directories such as ``CVS``.


Non-strict Manifest verification
--------------------------------

Originally the Manifest2 format provided a special ``MISC`` tag that
was used for ``metadata.xml`` and ``ChangeLog`` files. This tag
indicated that the Manifest verification failures could be ignored for
those files unless the package manager was working in strict mode.

The first versions of this specification continued the use of this tag.
However, after a long debate it was decided to deprecate it along with
the non-strict behavior, and require all files to strictly match.

Two arguments were mentioned for the usefulness of a ``MISC`` type:

1. being able to reduce the checkout size by stripping unnecessary
   files out, and

2. being able to update automatically generated files locally
   without causing unnecessary verification failures.

However, the usefulness of ``MISC`` in both cases is doubtful.

The cases for stripping unnecessary files mostly focused around space
savings. For this purpose, stripping ``metadata.xml`` and similar files
has little value. It is much more common for users to strip whole
packages or categories. The ``MISC`` type is not suitable for that,
and so a dedicated package manager mechanism needs to be developed
instead. The same mechanism can also handle files that historically used
the ``MISC`` type. As an example, the package manager may choose
to generate both the rsync exclusion list and Manifest ignore list
using a single source list.

The cases for autogenerated files involve such cache files
as ``use.local.desc``. However, we cannot include ``md5-cache`` there
due to security concerns, which results in inconsistent cache handling.
Furthermore, the tools were historically modified to provide stable
output, which means that their content cannot change without
non-``MISC`` content being changed first. This practically defeats
the purpose of using ``MISC``.

Finally, the non-strict mode could be used as means to an attack.
The allowance of missing or modified documentation file could be used
to spread misinformation, resulting in bad decisions made by the user.
A modified file could also be used, e.g. to exploit vulnerabilities
of an XML parser.


Timestamp field
---------------

The top-level Manifest optionally allows using a ``TIMESTAMP`` tag
to include a generation timestamp in the Manifest. A similar feature
was originally proposed in GLEP 58 [#GLEP58]_.

A malicious third party may use the principles of exclusion or replay
[#C08]_ to deny an update to clients, while at the same time recording
the identity of clients to attack. The timestamp field can be used to
detect that.

In order to provide more complete protection, the Gentoo Infrastructure
should provide an ability to obtain the timestamps of all Manifests
from a recent timeframe over a secure channel from a trusted source
for comparison.

Strictly speaking, this information is provided by the various
``metadata/timestamp*`` files that are already present. However,
including the value in the Manifest itself has little cost
and provides the ability to perform the verification stand-alone.

Furthermore, some of the timestamp files are added very late
in the distribution process, past the Manifest generation phase. Those
files will most likely receive ``IGNORE`` entries and therefore
be unsafe to use.

The specification permits additional timestamps in sub-Manifest files
for local use. A generic testing tool should ignore them.


New vs deprecated tags
----------------------

Out of the four types defined by Manifest2, only one is reused
and the remaining three are replaced by a single, universal ``DATA``
type.

The ``DIST`` tag is reused since the specification does not change
anything with regard to distfile handling.

The ``EBUILD`` tag could potentially be reused for generic file
verification data. However, it would be confusing if all the different
data files were marked as ``EBUILD``. Therefore, an equivalent ``DATA``
type was introduced as a replacement.

The ``MISC`` tag and the relevant non-strict mode have been removed
as being of little value, as detailed in the `Non-strict Manifest
verification`_ section.

The ``AUX`` tag is deprecated as it is redundant to ``DATA``, and has
the limiting property of implicit ``files/`` path prefix.


Finding top-level Manifest
--------------------------

The development of a reference implementation for this GLEP has brought
the following problem: how to find all the relevant Manifests when
the Manifest tool is run inside a subdirectory of the repository?

One of the options would be to provide a bi-directional linking
of Manifests via a ``PARENT`` tag. However, that would not solve
the problem when a new Manifest file is being created.

Instead, an algorithm for iterating over parent directories is proposed.
Since there is no obligatory explicit indicator for the top-level
Manifest, the algorithm assumes that the top-level Manifest
is the highest ``Manifest`` in the directory hierarchy that can cover
the current directory. This generally makes sense since the Manifest
files are required to provide coverage for all subdirectories, so all
Manifests starting from that one need to be updated.

If independent Manifest trees are nested in the directory structure,
then an ``IGNORE`` entry needs to be used to separate them.

Since sub-Manifests can use any filenames, the Manifest finding
algorithm must not short-cut the procedure by storing all ``Manifest``
files along the parent directories. Instead, it needs to retrace
the relevant sub-Manifest files along ``MANIFEST`` entries
in the top-level Manifest.


Injecting ChangeLogs into the checkout
--------------------------------------

One of the problems considered in the new Manifest format was injecting
historical and autogenerated ChangeLog into the repository. We normally
don't include those files, to reduce the checkout size. However, some
users have shown interest in them and Infra is working on providing them
via an additional rsync module.

If such files were injected into the repository, they would cause
verification failures of Manifests. To account for this, Infra could
provide ``IGNORE`` entries to allow them to exist.


Splitting distfile checksums from file checksums
------------------------------------------------

Another problem with the current Manifest format is that the checksums
for fetched files are combined with checksums for local files
in a single file inside the package directory. It has been specifically
pointed out that:

- since distfiles are sometimes reused across different packages,
  the repeating checksums are redundant [#DIST]_.
  
- mirror admins were interested in the possibility of verifying all
  the distfiles with a single tool.

This specification does not provide a clean solution to this problem.
It technically permits moving ``DIST`` entries to higher-level Manifests
but the usefulness of such a solution is doubtful.

However, for the second problem we will probably deliver a dedicated
tool working with this Manifest format.


Hash algorithms
---------------

While maintaining a consistent supported hash set is important
for interoperability, it is not a good fit for the generic layout
of this GLEP. Furthermore, it would require updating the GLEP
in the future every time the used algorithms change.

Instead, the specification focuses on listing the currently used
algorithm names for interoperability, and sets a recommendation
for consistent naming of algorithms in the future. The Python
``hashlib`` module is used as a reference since it is used
as the provider of hash functions for most of the Python software,
including Portage and PkgCore.

The basic rules for changing hash algorithms are defined in GLEP 59
[#GLEP59]_. The implementations can focus only on those algorithms
that are actually used or planned on being used. It may be feasible
to devise a new GLEP that specifies the currently used hashes (or update
GLEP 59 accordingly).


Manifest compression
--------------------

The support for Manifest compression is introduced with minimal changes
to the file format. The ``MANIFEST`` entries are required to provide
the real (compressed) file path for compatibility with other file
entries and to avoid confusion.

The compression of top-level Manifest file has been prohibited
as the specification currently does not provide any means of verifying
the file prior to decompression. If the top-level Manifest is
compressed, tooling will have to unpack the file before being able
to verify the contents. This makes it possible for a malicious third
party to attack the system by providing a compressed Manifest that
exposes decompressor vulnerabilities, or a zip bomb.

The OpenPGP cleartext signature covers the contents of the Manifest,
and is therefore compressed along with them. The possibility of using
a detached signature has been considered but it was rejected as
unnecessary complexity for minor gain.

Technically, a similar result could be effected via moving all the data
into a compressed sub-Manifest in the top directory (e.g.
``Manifest.sub.gz``), and including a ``MANIFEST`` entry for this file
in a signed, uncompressed top-level Manifest.

The existence of additional entries for uncompressed Manifest checksums
was debated. However, plain entries for the uncompressed file would
be confusing if only the compressed file existed, and conflicting
if both uncompressed and compressed variants existed. Furthermore,
it has been pointed out that ``DIST`` entries do not have
an uncompressed variant either.


Performance considerations
--------------------------

Performing a full-tree verification on every sync raises some
performance concerns for end-user systems. The initial testing has shown
that a cold-cache verification on a btrfs file system can take around
4 minutes, with the process being mostly I/O bound. On the other hand,
it can be expected that the verification will be performed directly
after syncing, taking advantage of a warm filesystem cache.

To improve speed on I/O and/or CPU-constrained systems even further,
the algorithms can be easily extended to perform incremental
verification. Given that rsync does not preserve mtimes by default,
the tool can take advantage of mtime and Manifest comparisons to recheck
only the parts of the repository that have changed.

Furthermore, the package manager implementations can restrict checking
only to the parts of the repository that are actually being used.


Backwards Compatibility
=======================

This GLEP provides optional means of preserving backwards compatibility.
To preserve the backwards compatibility, the following needs to hold
for the ``Manifest`` file in every package directory:

- all files must be covered by the single ``Manifest`` file,

- all distfiles used by the package must be included,

- all files inside the ``files/`` subdirectory need to use
  the ``AUX`` tag (rather than ``DATA``),

- all ``.ebuild`` files need to use the ``EBUILD`` tag,

- the ``metadata.xml`` and ``ChangeLog`` files need to use
  the ``MISC`` tag,

- the Manifest can be signed to provide authenticity verification,

- an uncompressed Manifest must always exist, and a compressed Manifest
  of identical content may be present.

Once the backwards compatibility is no longer a concern, the above
no longer needs to hold and the deprecated tags can be removed.


Reference Implementation
========================

The reference implementation for this GLEP is being developed
as the gemato project [#GEMATO]_.


Credits
=======

Thanks to all the people whose contributions were invaluable
to the creation of this GLEP. This includes but is not limited to:

- Robin Hugh Johnson,
- Ulrich Müller.

Additionally, thanks to Robin Hugh Johnson for the original
MetaManifest GLEP series which served both as inspiration and source
of many concepts used in this GLEP. Recursively, also thanks to all
the people who contributed to the original GLEPs.


References
==========

.. [#GLEP44] GLEP 44: Manifest2 format
   (https://www.gentoo.org/glep/glep-0044.html)

.. [#GLEP57] GLEP 57: Security of distribution of Gentoo software
   - Overview
   (https://www.gentoo.org/glep/glep-0057.html)

.. [#GLEP58] GLEP 58: Security of distribution of Gentoo software
   - Infrastructure to User distribution - MetaManifest
   (https://www.gentoo.org/glep/glep-0058.html)

.. [#GLEP59] GLEP 59: Manifest2 hash policies and security implications
   (https://www.gentoo.org/glep/glep-0059.html)

.. [#GLEP60] GLEP 60: Manifest2 filetypes
   (https://www.gentoo.org/glep/glep-0060.html)

.. [#GLEP61] GLEP 61: Manifest2 compression
   (https://www.gentoo.org/glep/glep-0061.html)

.. [#UNICODE] The Unicode standard
   (https://unicode.org/versions/latest/)

.. [#PMS-FETCH] Package Manager Specification: Dependency Specification
   Format - SRC_URI
   (https://projects.gentoo.org/pms/6/pms.html#x1-940008.2.10)

.. [#FILE-NAMING-RULES] Ebuild File Format -- Gentoo Development Guide
   (https://devmanual.gentoo.org/ebuild-writing/file-format/#file-naming-rules)

.. [#MD5] RFC1321: The MD5 Message-Digest Algorithm
   (https://www.ietf.org/rfc/rfc1321.txt)

.. [#RIPEMD160] The hash function RIPEMD-160
   (https://homes.esat.kuleuven.be/~bosselae/ripemd160.html)

.. [#SHS] FIPS PUB 180-4: Secure Hash Standard (SHS)
   (http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf)

.. [#WHIRLPOOL] The WHIRLPOOL Hash Function
   (http://www.larc.usp.br/~pbarreto/WhirlpoolPage.html)

.. [#BLAKE2] BLAKE2 -- fast secure hashing
   (https://blake2.net/)

.. [#SHA3] FIPS PUB 202: SHA-3 Standard: Permutation-Based Hash
   and Extendable-Output Functions
   (http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.202.pdf)

.. [#STREEBOG] GOST R 34.11-2012: Streebog Hash Function
   (https://www.streebog.net/)

.. [#C08] Cappos, J et al. (2008). "Attacks on Package Managers"
   (https://www2.cs.arizona.edu/stork/packagemanagersecurity/attacks-on-package-managers.html)

.. [#DIST] According to Robin H. Johnson, 8.4% of all DIST entries
   at the time of writing are duplicate, representing 2 MiB
   out of 25 MiB of DIST entries altogether.

.. [#GEMATO] gemato: Gentoo Manifest Tool
   (https://github.com/mgorny/gemato/)


Copyright
=========
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
Unported License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.

-- 
Best regards,
Michał Górny




* Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update [v4]
  2017-11-22 16:54 ` [gentoo-dev] [RFC] GLEP 74 post-Council review update [v4] Michał Górny
@ 2017-11-22 20:41   ` Ulrich Mueller
  0 siblings, 0 replies; 23+ messages in thread
From: Ulrich Mueller @ 2017-11-22 20:41 UTC (permalink / raw
  To: gentoo-dev


>>>>> On Wed, 22 Nov 2017, Michał Górny wrote:

> Path and filename encoding
> --------------------------

> The path fields in the Manifest file must consist of characters
> corresponding to valid UTF-8 code points excluding the NULL character
> (``U+0000``), the backwards slash (``\``) and characters classified
> as whitespace in the current version of the Unicode standard
> [#UNICODE]_.

As I said before, all C0 and C1 control characters and DEL should be
excluded as well, i.e. 0x00 to 0x1f, 0x7f, and 0x80 to 0x9f. Allowing
such characters in what is basically a text file is only asking for
trouble.

> Any of the excluded characters that are present in path must be encoded
> using one of the following escape sequences:

> - characters in the ``U+0000`` to ``U+007F`` range can be encoded
>   as ``\xHH`` where ``HH`` specifies the zero-padded, hexadecimal
>   character code,

> - characters in the ``U+0000`` to ``U+FFFF`` range can be encoded
>   as ``\uHHHH`` where ``HHHH`` specifies the zero-padded, hexadecimal
>   character code,

> - characters in the UCS-4 range can be encoded as ``\UHHHHHHHH``
>   where ``HHHHHHHH`` specifies the zero-padded, hexadecimal character
>   code.

> It is invalid for backwards slash to be used in any other context,
> and a backwards slash present in filename must be encoded. Backwards
> slash used as path component separator should be replaced by forward
> slash instead.

This entire section about the escape mechanism should be clearly
labelled as being purely optional, as it is not relevant for Gentoo
(and would break backwards compatibility with existing package
manager implementations). Maybe add a reference to GLEP 31 too?

> The encoding can be used for other characters as well. In particular,
> escaping control characters is recommended to ensure that the file
> works correctly in text editors.

See above, this should not be "recommended", but literal control chars
should be strictly forbidden.

Ulrich


* Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update [v5]
  2017-11-16 10:19 [gentoo-dev] [RFC] GLEP 74 post-Council review update Michał Górny
                   ` (3 preceding siblings ...)
  2017-11-22 16:54 ` [gentoo-dev] [RFC] GLEP 74 post-Council review update [v4] Michał Górny
@ 2017-11-23 20:53 ` Michał Górny
  2017-12-01 11:30   ` Fabian Groffen
  4 siblings, 1 reply; 23+ messages in thread
From: Michał Górny @ 2017-11-23 20:53 UTC (permalink / raw
  To: gentoo-dev

On Thu, 2017-11-16 at 11:19 +0100, Michał Górny wrote:
> Hi, everyone.
> 
> Here's the updated version of GLEP 74 taking into consideration
> the points made during the Council pre-review.
> 
> ReST: https://dev.gentoo.org/~mgorny/tmp/glep-0074.rst
> HTML: https://dev.gentoo.org/~mgorny/tmp/glep-0074.html
> 
> Changes:

27c2a9e glep-0074: Grammar corrections from Ulrich Müller
d39f865 glep-0074: Make extended filename encoding optional
ed111f8 glep-0074: Always exclude control characters

---
GLEP: 74
Title: Full-tree verification using Manifest files
Author: Michał Górny <mgorny@gentoo.org>,
        Robin Hugh Johnson <robbat2@gentoo.org>,
        Ulrich Müller <ulm@gentoo.org>
Type: Standards Track
Status: Draft
Version: 1
Created: 2017-10-21
Last-Modified: 2017-11-23
Post-History: 2017-10-26, 2017-11-16
Content-Type: text/x-rst
Requires: 59, 61
Replaces: 44, 58, 60
---

Abstract
========

This GLEP extends the Manifest file format to cover full-tree file
integrity and authenticity checks. The format aims to be future-proof,
efficient and provide means of backwards compatibility.


Motivation
==========

The Manifest files as defined by GLEP 44 [#GLEP44]_ provide the current
means of verifying the integrity of distfiles and package files
in Gentoo. Combined with OpenPGP signatures, they provide means to
ensure the authenticity of the covered files. However, as noted
in GLEP 57 [#GLEP57]_ they lack the ability to provide full-tree
authenticity verification as they do not cover any files outside
the package directory. In particular, they provide multiple ways
for a third party to inject malicious code into the ebuild environment.

Historically, the topic of providing authenticity coverage for the whole
repository has been mentioned multiple times. The most noteworthy efforts
are GLEPs 58 [#GLEP58]_ and 60 [#GLEP60]_ by Robin H. Johnson from 2008.
They were accepted by the Council in 2010 but have never been
implemented. When potential implementation work started in 2017, a new
discussion about the specification arose. It prompted the creation
of a competing GLEP that would provide a redesigned alternative to
the old GLEPs.

This specification is designed with the following goals in mind:

1. It should provide means to ensure the authenticity of the complete
   repository, including preventing the injection of additional files.

2. The format should be universal enough to work both for the Gentoo
   repository and third-party repositories of different characteristics.

3. The Manifest files should be verifiable stand-alone, that is without
   knowing any details about the underlying repository format.


Specification
=============

Manifest file format
--------------------

This specification reuses and extends the Manifest file format defined
in GLEP 44 [#GLEP44]_. For the purpose of it, the *file type* field is
repurposed as a generic *tag* that could also indicate additional
(non-checksum) metadata. Appropriately, those tags can be followed by
other space-separated values.

Unless specified otherwise, the paths used in the Manifest files
are relative to the directory containing the Manifest file. The paths
must not reference the parent directory (``..``). Forward slash (``/``)
is used as path component separator.

The Manifest files use UTF-8 encoding.


Manifest file locations and nesting
-----------------------------------

The ``Manifest`` file located in the root directory of the repository
is called top-level Manifest, and it is used to perform the full-tree
verification. In order to verify the authenticity, it must be signed
using OpenPGP, using the armored cleartext format.

The top-level Manifest may reference sub-Manifests contained
in subdirectories of the repository. The sub-Manifests are traditionally
named ``Manifest``; however, the implementation must support arbitrary
names, including the possibility of multiple (split) Manifests
for a single directory. The sub-Manifest can only cover the files inside
the directory tree where it resides.

The sub-Manifest can also be signed using OpenPGP armored cleartext
format. However, the signature verification can be omitted since it
already is covered by the signed top-level Manifest.


Directory tree coverage
-----------------------

The specification provides three ways of skipping Manifest verification
of specific files and directories (recursively):

1. explicit ``IGNORE`` entries in Manifest files,

2. injected ignore paths via package manager configuration,

3. using names starting with a dot (``.``) which are always skipped.

All files that are not ignored must be covered by at least one
of the Manifests.

A single file may be matched by multiple identical or equivalent
Manifest entries, if and only if the entries have the same semantics,
specify the same size and the checksums common to both entries match.
It is an error for a single file to be matched by multiple entries
of different semantics, file size or checksum values. It is an error
to specify another entry for a file that matches ``IGNORE``, or that
is located inside an ignored directory.

The file entries (except for ``IGNORE``) can be specified for regular
files only. Symbolic links are followed when opening files
and traversing directories. It is an error to specify an entry for
a different file type. If the tree contains files of other types
that are not otherwise ignored, they need to be covered by an explicit
``IGNORE``.

All the local (non-``DIST``) files covered by a Manifest tree must
reside on the same filesystem. It is an error to specify entries
applying to files on another filesystem. If files or directories that
are not otherwise ignored reside on a different filesystem, or symbolic
links point to targets on a different filesystem, they must
be explicitly excluded via ``IGNORE``.


Path and filename encoding
--------------------------

The path fields in the Manifest file must consist of characters
corresponding to valid UTF-8 code points excluding the backwards slash
(``\``) and characters classified as control characters or as whitespace
in the current version of the Unicode standard [#UNICODE]_.

The implementation can optionally support extended filename encoding
to support paths containing the excluded characters. If encoding is not
supported, the implementation must reject directories containing any
files using non-compliant names, as well as Manifest files whose
filename field contains such filenames.

If encoding is supported, then all of the excluded characters that
are present in paths must be encoded using one of the following escape
sequences:

- characters in the ``U+0000`` to ``U+007F`` range can be encoded
  as ``\xHH`` where ``HH`` specifies the zero-padded, hexadecimal
  character code,

- characters in the ``U+0000`` to ``U+FFFF`` range can be encoded
  as ``\uHHHH`` where ``HHHH`` specifies the zero-padded, hexadecimal
  character code,

- characters in the UCS-4 range can be encoded as ``\UHHHHHHHH``
  where ``HHHHHHHH`` specifies the zero-padded, hexadecimal character
  code.

It is invalid for the backwards slash to be used in any other context,
and a backwards slash present in filename must be encoded. A backwards
slash used as a path component separator should be replaced by a forward
slash instead.

The encoding can be used for other characters as well. In particular,
escaping non-printable characters might be desirable.
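
The escape sequences above can be illustrated with a minimal sketch. It
is not taken from any existing implementation; the function names
``encode_path`` and ``decode_path`` are hypothetical, and Python's
``str.isspace()`` plus the ``Cc`` general category are used only as
an approximation of the "whitespace or control character" rule.

```python
import string
import unicodedata

ESCAPE_LENGTHS = {"x": 2, "u": 4, "U": 8}


def _needs_escape(ch):
    # Backslash, whitespace and control characters may not appear literally.
    # str.isspace() and category "Cc" approximate the Unicode classification.
    return ch == "\\" or ch.isspace() or unicodedata.category(ch) == "Cc"


def encode_path(path):
    """Escape excluded characters using the shortest applicable form."""
    out = []
    for ch in path:
        cp = ord(ch)
        if not _needs_escape(ch):
            out.append(ch)
        elif cp <= 0x7F:
            out.append("\\x%02X" % cp)
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)
        else:
            out.append("\\U%08X" % cp)
    return "".join(out)


def decode_path(field):
    """Reverse encode_path(); reject any other use of the backslash."""
    out = []
    i = 0
    while i < len(field):
        if field[i] != "\\":
            out.append(field[i])
            i += 1
            continue
        kind = field[i + 1:i + 2]
        if kind not in ESCAPE_LENGTHS:
            raise ValueError("invalid escape at offset %d" % i)
        digits = field[i + 2:i + 2 + ESCAPE_LENGTHS[kind]]
        if len(digits) != ESCAPE_LENGTHS[kind] or any(
                c not in string.hexdigits for c in digits):
            raise ValueError("malformed escape at offset %d" % i)
        cp = int(digits, 16)
        if kind == "x" and cp > 0x7F:
            raise ValueError("\\x is limited to the 00..7F range")
        out.append(chr(cp))
        i += 2 + ESCAPE_LENGTHS[kind]
    return "".join(out)
```

For example, a file named ``a b`` would be written as ``a\x20b``
in the path field under this sketch.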


File verification
-----------------

When verifying a file against the Manifest, the following rules are
used:

1. If the file is covered directly or indirectly by an entry
   of the ``IGNORE`` type, the verification always succeeds.

2. If the file is covered by an entry of the ``MANIFEST``, ``DATA``,
   ``MISC``, ``EBUILD`` or ``AUX`` type:

   a. if the file is not present, then the verification fails,

   b. if the file is present but has a different size or one
      of the checksums does not match, the verification fails,

   c. otherwise, the verification succeeds.

3. If the file is present but not listed in Manifest, the verification
   fails.

Unless specified otherwise, the package manager must not allow using
any files for which the verification failed. The package manager may
reject any package or even the whole repository if it may refer to files
for which the verification failed.
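
The three rules above could be sketched as follows. This is only
an illustration: the ``verify_file`` name, the ``ignored`` flag and
the dictionary layout of ``entry`` are assumptions made for
the example, and checksum names are assumed to be already translated
to Python ``hashlib`` names.

```python
import hashlib
import os


def verify_file(path, entry, ignored=False):
    """Return True if `path` passes verification.

    entry is None when the file is not listed, otherwise a dict such as
    {"size": 5, "checksums": {"sha256": "<hex digest>"}} (assumed layout).
    """
    if ignored:
        # Rule 1: covered by IGNORE, verification always succeeds.
        return True
    if entry is None:
        # Rule 3: a file that exists but is not listed fails.
        return not os.path.lexists(path)
    # Rule 2a: a listed file must be present.
    if not os.path.isfile(path):
        return False
    # Rule 2b: size and every listed checksum must match.
    if os.path.getsize(path) != entry["size"]:
        return False
    for algo, expected in entry["checksums"].items():
        digest = hashlib.new(algo)
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected.lower():
            return False
    # Rule 2c: everything matched.
    return True
```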


Timestamp verification
----------------------

The top-level Manifest file can contain a ``TIMESTAMP`` entry to account
for attacks against tree update distribution. If such an entry
is present, it should be updated every time at least one
of the Manifests changes. Every unique timestamp value must correspond
to a single tree state.

During the verification process, the client should compare the timestamp
against the update time obtained from a local clock or a trusted time
source. If the comparison result indicates that the Manifest at the time
of receiving was already significantly outdated, the client should
either fail the verification or require manual confirmation from
the user.

Furthermore, the Manifest provider may employ additional methods
of distributing the timestamps of recently generated Manifests
using a secure channel from a trusted source for exact comparison.
The exact details of such a solution are outside the scope of this
specification.

``TIMESTAMP`` entries may also be present in sub-Manifests. Those
timestamps must not be newer than the timestamp of the top-level
Manifest (if present). This specification does not define any specific
use for them.
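
A client-side check along these lines might look like the sketch
below. It assumes the second-precision UTC format
``%Y-%m-%dT%H:%M:%SZ`` used by the ``TIMESTAMP`` tag; the function
names and the seven-day ``max_age`` default are hypothetical, since
the specification does not define what "significantly outdated" means.

```python
from datetime import datetime, timedelta, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_timestamp(value):
    """Parse a TIMESTAMP value into an aware UTC datetime."""
    parsed = datetime.strptime(value, TIMESTAMP_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


def is_outdated(value, now, max_age=timedelta(days=7)):
    """True if the Manifest was older than max_age when it was received."""
    return now - parse_timestamp(value) > max_age


def sub_timestamp_valid(sub_value, top_value):
    """A sub-Manifest timestamp must not be newer than the top-level one."""
    return parse_timestamp(sub_value) <= parse_timestamp(top_value)
```

A caller would pass ``datetime.now(timezone.utc)`` (or a time obtained
from a trusted source) as ``now`` and either fail or ask the user for
confirmation when ``is_outdated()`` returns true.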


Modern Manifest tags
--------------------

The Manifest files can specify the following tags:

``TIMESTAMP <iso8601>``
  Specifies a timestamp of when the Manifest file was last updated.
  The timestamp must be a valid second-precision ISO 8601 extended
  format combined date and time in UTC timezone, i.e. using
  the following ``strftime()`` format string: ``%Y-%m-%dT%H:%M:%SZ``.
  Optional. The package manager can use it to detect an outdated
  repository checkout as described in `Timestamp verification`_.

``MANIFEST <path> <size> <checksums>...``
  Specifies a sub-Manifest. The sub-Manifest must be verified like
  a regular file. If the verification succeeds, the entries from
  the sub-Manifest are included for verification as described
  in `Manifest file locations and nesting`_.

``IGNORE <path>``
  Ignores a subdirectory or file from Manifest checks. If the specified
  path is present, it and its contents are omitted from the Manifest
  verification (always pass). *Path* must be a plain file or directory
  path without a trailing slash. Wildcards are not supported
  and wildcard characters are interpreted literally.

``DATA <path> <size> <checksums>...``
  Specifies a regular file subject to Manifest verification. The file
  is required to pass verification. Used for all files that do not match
  any other type.

``DIST <filename> <size> <checksums>...``
  Specifies a distfile entry used to verify files fetched as part
  of ``SRC_URI``. The filename must match the filename used to store
  the fetched file as specified in the PMS [#PMS-FETCH]_. The package
  manager must reject the fetched file if it fails verification.
  ``DIST`` entries apply to all packages below the Manifest file
  specifying them.
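
As a rough illustration of how the tags above could be parsed,
the sketch below splits one Manifest line into its fields. It assumes
the Manifest2 convention of alternating checksum names and values after
the size field; ``parse_entry`` and the returned dictionary layout are
hypothetical, and path unescaping and OpenPGP armor handling are left
out.

```python
CHECKSUM_TAGS = {"MANIFEST", "DATA", "DIST", "EBUILD", "MISC", "AUX"}


def parse_entry(line):
    """Parse a single Manifest line into a dict describing the entry."""
    fields = line.split()
    if not fields:
        raise ValueError("empty line")
    tag, args = fields[0], fields[1:]
    if tag == "TIMESTAMP" and len(args) == 1:
        return {"tag": tag, "timestamp": args[0]}
    if tag == "IGNORE" and len(args) == 1:
        return {"tag": tag, "path": args[0]}
    # <path> <size> followed by (name, value) checksum pairs.
    if tag in CHECKSUM_TAGS and len(args) >= 2 and len(args) % 2 == 0:
        return {
            "tag": tag,
            "path": args[0],
            "size": int(args[1]),
            "checksums": dict(zip(args[2::2], args[3::2])),
        }
    raise ValueError("malformed Manifest entry: %r" % line)
```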


Deprecated Manifest tags
------------------------

For backwards compatibility, the following tags are additionally
allowed at the package directory level:

``EBUILD <filename> <size> <checksums>...``
  Equivalent to the ``DATA`` type.

``MISC <path> <size> <checksums>...``
  Equivalent to the ``DATA`` type. Historically indicated that
  the package manager may ignore a verification failure if operating
  in non-strict mode. However, that behavior is deprecated.

``AUX <filename> <size> <checksums>...``
  Equivalent to the ``DATA`` type, except that the filename is relative
  to the ``files/`` subdirectory.


Algorithm for full-tree verification
------------------------------------

In order to perform full-tree verification, the following algorithm
can be used:

1. Collect all files present in the repository into *present* set.

2. Start at the top-level Manifest file. Verify its OpenPGP signature.
   Optionally verify the ``TIMESTAMP`` entry if present as specified
   in `timestamp verification`_. Remove the top-level Manifest
   from the *present* set.

3. Process all ``MANIFEST`` entries, recursively. Verify the Manifest
   files according to the `file verification`_ section, and include
   their entries in the current Manifest entry list (using paths
   relative to directories containing the Manifests).

4. Process all ``IGNORE`` entries. Remove any paths matching them
   from the *present* set.

5. Collect all files covered by ``DATA``, ``MISC``, ``EBUILD``
   and ``AUX`` entries into the *covered* set.

6. Verify the entries in the *covered* set for incompatible duplicates
   and collisions with ignored files as explained in `Directory tree
   coverage`_.

7. Verify all the files in the union of the *present* and *covered*
   sets, according to the `file verification`_ section.
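
Steps 4 to 7 can be sketched on in-memory data as shown below. This is
a simplified illustration, not the reference implementation: it assumes
that the Manifests were already loaded and flattened into
``(tag, path, data)`` tuples with repository-relative paths, that
``MANIFEST`` entries count as coverage for the sub-Manifest files
themselves, and that duplicate entries must be exactly equal (the
specification is more lenient). ``check_tree`` and ``checker`` are
hypothetical names.

```python
FILE_TAGS = {"MANIFEST", "DATA", "MISC", "EBUILD", "AUX"}


def is_ignored(path, ignores):
    """True if path equals an ignored path or lies below one."""
    return any(path == ign or path.startswith(ign + "/") for ign in ignores)


def check_tree(present, entries, checker):
    """Return a sorted-by-discovery list of (path, reason) failures.

    present: set of repository-relative file paths found on disk
    entries: iterable of (tag, path, data) tuples from all Manifests
    checker: callable(path, data) -> bool comparing size and checksums
    """
    entries = list(entries)
    errors = []
    # Step 4: drop everything matched by IGNORE from the present set.
    ignores = {path for tag, path, _ in entries if tag == "IGNORE"}
    present = {p for p in present if not is_ignored(p, ignores)}
    # Steps 5-6: build the covered set, rejecting bad duplicates.
    covered = {}
    for tag, path, data in entries:
        if tag not in FILE_TAGS:
            continue
        if is_ignored(path, ignores):
            errors.append((path, "entry for ignored path"))
        elif path in covered and covered[path] != data:
            errors.append((path, "conflicting entries"))
        else:
            covered[path] = data
    # Step 7: verify the union of both sets.
    for path in sorted(present | set(covered)):
        if path not in covered:
            errors.append((path, "not listed"))
        elif path not in present:
            errors.append((path, "missing"))
        elif not checker(path, covered[path]):
            errors.append((path, "mismatch"))
    return errors
```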


Algorithm for finding parent Manifests
--------------------------------------

In order to find the top-level Manifest from the current directory
the following algorithm can be used:

1. Store the current directory as *original* and the device ID
   of the containing filesystem (``st_dev``) as *startdev*.

2. If the device ID of the containing filesystem (``st_dev``)
   of the current directory is different from *startdev*, stop.

3. If the current directory contains a ``Manifest`` file:

   a. If an ``IGNORE`` entry in the ``Manifest`` file covers
      the *original* directory (or one of the parent directories), stop.

   b. Otherwise, store the current directory as *last_found*.

4. If the current directory is the root system directory (``/``), stop.

5. Otherwise, enter the parent directory and jump to step 2.

Once the algorithm stops, *last_found* will contain the relevant
top-level Manifest. If *last_found* is null, then the directory tree
does not contain any valid top-level Manifest candidates and one should
be created in the *original* directory.

Once the top-level Manifest is found, its ``MANIFEST`` entries should
be used to find any sub-Manifests below the top-level Manifest,
up to and including the *original* directory. Note that those
sub-Manifests can use different filenames than ``Manifest``.


Checksum algorithms
-------------------

This section is informational only. Specifying the exact set
of supported algorithms is outside the scope of this specification.

The algorithm names reserved at the time of writing are:

- ``MD5`` [#MD5]_,
- ``RMD160`` -- RIPEMD-160 [#RIPEMD160]_,
- ``SHA1`` [#SHS]_,
- ``SHA256`` and ``SHA512`` -- SHA-2 family of hashes [#SHS]_,
- ``WHIRLPOOL`` [#WHIRLPOOL]_,
- ``BLAKE2B`` and ``BLAKE2S`` -- BLAKE2 family of hashes [#BLAKE2]_,
- ``SHA3_256`` and ``SHA3_512`` -- SHA-3 family of hashes [#SHA3]_,
- ``STREEBOG256`` and ``STREEBOG512`` -- Streebog family of hashes
  [#STREEBOG]_.

The method of introducing new hashes is defined by GLEP 59 [#GLEP59]_.
It is recommended that any new hashes are named after the Python
``hashlib`` module algorithm names, transformed into uppercase.
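
As an illustration of the naming recommendation, the following sketch
derives the ``hashlib`` name by lowercasing the Manifest name and builds
a ``DATA`` entry from it. The helper names are hypothetical. Older
reserved names such as ``RMD160`` predate the recommendation and would
need an explicit mapping, which is not shown here.

```python
import hashlib


def hashlib_name(manifest_name):
    """Map a Manifest algorithm name to a hashlib name by lowercasing it."""
    return manifest_name.lower()


def checksum_fields(data, manifest_names):
    """Return [NAME, hexdigest, NAME, hexdigest, ...] for the given data."""
    fields = []
    for name in manifest_names:
        digest = hashlib.new(hashlib_name(name), data).hexdigest()
        fields += [name, digest]
    return fields


def data_entry(path, data, manifest_names=("SHA256", "SHA512")):
    """Format a DATA entry: tag, path, size in bytes, then checksum pairs."""
    fields = ["DATA", path, str(len(data))]
    return " ".join(fields + checksum_fields(data, manifest_names))
```

For example, ``data_entry("metadata.xml", content)`` yields a line
of the same shape as the ``DATA`` entries used elsewhere in this
document.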


Manifest compression
--------------------

The topic of Manifest file compression is covered by GLEP 61 [#GLEP61]_.
This section merely addresses interoperability issues between Manifest
compression and this specification.

The compressed Manifest files are required to be suffixed for their
compression algorithm. This suffix should be used to recognize
the compression and decompress Manifests transparently. The exact list
of algorithms and their corresponding suffixes are outside the scope
of this specification.

The top-level Manifest file must not be compressed. Since the OpenPGP
signature covers the uncompressed text and is compressed itself,
the data would have to be decompressed without any prior verification.
This could expose users e.g. to zip bombs or exploits of decompressor
vulnerabilities.

Whenever this specification refers to sub-Manifests, they can use any
names but are also required to use a specific compression suffix.
The ``MANIFEST`` entries are required to specify the full name including
compression suffix, and the verification is performed on the compressed
file.

The specification permits uncompressed Manifests to exist alongside
their compressed counterparts, and multiple compressed formats
to coexist. If that is the case, the files must have the same
uncompressed content and the implementation is free to choose any
of the files using the same base name.
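
A sketch of how a tool might load a sub-Manifest under these rules:
the on-disk (possibly compressed) bytes are verified against
the ``MANIFEST`` entry first, and only then decompressed according
to the suffix. The suffix table is an assumption made for illustration
(the actual list is left to GLEP 61), only SHA256 is checked,
and ``load_sub_manifest`` is a hypothetical name.

```python
import bz2
import gzip
import hashlib
import lzma

# Assumed suffix-to-decompressor table; not defined by this specification.
DECOMPRESSORS = {
    ".gz": gzip.decompress,
    ".bz2": bz2.decompress,
    ".xz": lzma.decompress,
}


def load_sub_manifest(path, expected_size, expected_sha256):
    """Verify the file as stored on disk, then return its decoded text."""
    with open(path, "rb") as f:
        raw = f.read()
    # Verification is performed on the compressed file, before unpacking.
    if len(raw) != expected_size:
        raise ValueError("size mismatch for %s" % path)
    if hashlib.sha256(raw).hexdigest() != expected_sha256:
        raise ValueError("checksum mismatch for %s" % path)
    for suffix, decompress in DECOMPRESSORS.items():
        if path.endswith(suffix):
            raw = decompress(raw)
            break
    return raw.decode("utf-8")
```

Because the checksum is checked before the decompressor runs, untrusted
data never reaches the decompressor for sub-Manifests.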


Combining multiple Manifest trees (informational)
-------------------------------------------------

This specification permits nesting multiple hierarchical Manifest trees.
In this layout, the specific directories of the Manifest tree can
be verified both as a part of another top-level Manifest,
and as an independent Manifest tree (when obtained without the parent
directory).

For this to work, the sub-Manifest file in the directory must also
satisfy the requirements for the top-level Manifest file. That is:

- it must be named ``Manifest`` and not compressed,

- it must cover all the files in this directory and its subdirectories
  (i.e. no files from the directory tree can be covered by parent
  Manifest),

- if authenticity verification is desired, it must be OpenPGP-signed.

It should be noted that if such a directory is a subdirectory of a valid
Manifest tree, the sub-Manifest needs to be valid according
to the top-level Manifest and the OpenPGP signature is disregarded
as detailed in `Manifest file locations and nesting`_. The top-level
behavior is exhibited only when the directory is obtained without parent
directories.


An example Manifest file (informational)
----------------------------------------

An example top-level Manifest file for the Gentoo repository would have
the following content::

    TIMESTAMP 2017-10-30T10:11:12Z
    IGNORE distfiles
    IGNORE local
    IGNORE lost+found
    IGNORE packages
    MANIFEST app-accessibility/Manifest 14821 SHA256 1b5f.. SHA512 f7eb..
    ...
    MANIFEST eclass/Manifest.gz 50812 SHA256 8c55.. SHA512 2915..
    ...

An example modern Manifest (disregarding backwards compatibility)
for a package directory would have the following content::

    DATA SphinxTrain-0.9.1-r1.ebuild 932 SHA256 3d3b.. SHA512 be4d..
    DATA SphinxTrain-1.0.8.ebuild 912 SHA256 f681.. SHA512 0749..
    DATA metadata.xml 664 SHA256 97c6.. SHA512 1175..
    DATA files/gcc.patch 816 SHA256 b56e.. SHA512 2468..
    DATA files/gcc34.patch 333 SHA256 c107.. SHA512 9919..
    DIST SphinxTrain-0.9.1-beta.tar.gz 469617 SHA256 c1a4.. SHA512 1b33..
    DIST sphinxtrain-1.0.8.tar.gz 8925803 SHA256 548e.. SHA512 465d..
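
Lines such as the ones above could be split into fields along these
lines. This is only a sketch: ``parse_manifest_line`` is a hypothetical
helper, the set of checksum-carrying tags is taken from the tags named
in this document, and rejecting unknown tags is a choice made for
the example rather than a requirement stated here.

```python
# Tags that are followed by a path, a size and NAME/hash pairs.
CHECKSUM_TAGS = {"MANIFEST", "DATA", "DIST", "EBUILD", "AUX", "MISC"}


def parse_manifest_line(line):
    """Parse one Manifest line into a dict; return None for blank lines."""
    fields = line.split()
    if not fields:
        return None
    tag, args = fields[0], fields[1:]
    if tag == "TIMESTAMP" and len(args) == 1:
        return {"tag": tag, "timestamp": args[0]}
    if tag == "IGNORE" and len(args) == 1:
        return {"tag": tag, "path": args[0]}
    # args is [path, size, NAME, hash, NAME, hash, ...], so its length is even.
    if tag in CHECKSUM_TAGS and len(args) >= 2 and len(args) % 2 == 0:
        return {
            "tag": tag,
            "path": args[0],
            "size": int(args[1]),
            "checksums": dict(zip(args[2::2], args[3::2])),
        }
    raise ValueError("unrecognized Manifest line: %r" % line)
```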


Rationale
=========

Stand-alone format
------------------

The first question that needed to be asked before proceeding with
the design was whether the Manifest file format was supposed to be
stand-alone, or tightly bound to the repository format.

The stand-alone format has been selected because of its three
advantages:

1. It is more future-proof. If an incompatible change to the repository
   format is introduced, only developers need to upgrade the tools
   they use to generate the Manifests. The tools used to verify
   the updated Manifests will continue to work.

2. It is more flexible and universal. With a dedicated tool,
   the Manifest files can be used to sign and verify arbitrary file
   sets.

3. It keeps the verification tool simpler. In particular, we can easily
   write an independent verification tool that could work on any
   distribution without needing to depend on a package manager
   implementation or rewrite parts of it.

Designing a stand-alone format requires that the Manifest carries enough
information to perform the verification following all the rules specific
to the Gentoo repository.


Tree design
-----------

The second important point of the design was determining whether
the Manifest files should be structured hierarchically or independently.
Both options have their advantages.

In the hierarchical model, each sub-Manifest file is covered by a higher
level Manifest. As a result, only the top-level Manifest has to be
OpenPGP-signed, and subsequent Manifests need to be only verified by
checksum stored in the parent Manifest. This has the following
implications:

- Verifying any set of files in the repository requires using checksums
  from the most relevant Manifests and the parent Manifests.

- The OpenPGP signature of the top-level Manifest needs to be verified
  only once per process.

- Altering any set of files requires updating the relevant Manifests,
  and their parent Manifests up to the top-level Manifest, and signing
  the last one.

- As a result, the top-level Manifest changes on every commit,
  and various middle-level Manifests change (and need to be transferred)
  frequently.

In the independent model, each sub-Manifest file is independent
of the parent Manifests. As a result, each of them needs to be signed
and verified independently. However, the parent Manifests still need
to list sub-Manifests (albeit without verification data) in order
to detect removal or replacement of subdirectories. This has
the following implications:

- Verifying any set of files in the repository requires using checksums
  and verifying signatures of the most relevant Manifest files.

- Altering any set of files requires updating the relevant Manifests
  and signing them again.

- Parent Manifests are updated only when Manifests are added or removed
  from subdirectories. As a result, they change infrequently.

While both models have their advantages, the hierarchical model was
selected because it reduces the number of OpenPGP operations
(which are comparatively costly) to the minimum.


Tree layout restrictions
------------------------

The algorithm is meant to work primarily with ebuild repositories which
normally contain only files and directories. Directories provide
no useful metadata for verification, and specifying special entries
for additional file types is purposeless. Therefore, the specification
is restricted to dealing with regular files.

The Gentoo repository does not use symbolic links. Some Gentoo
repositories do, however. To provide a simple solution for dealing with
symlinks without having to take care to implement special handling for
them, the common behavior of implicitly resolving them is used.
Therefore, symbolic links to files are stored as if they were regular
files, and symbolic links to directories are followed as if they were
regular directories.

Dotfiles are implicitly ignored as that is a common notion used
in software written for POSIX systems. All other filenames require
explicit ``IGNORE`` lines.

An ability to inject additional ignore entries is provided to account
for site configuration affecting the repository tree -- placing
additional files in it, skipping some of the categories from syncing.
This configuration can extend beyond the limits of this GLEP,
e.g. by allowing wildcards or regular expressions.

The algorithm is restricted to work on a single filesystem. This is
mostly relevant when scanning for the top-level Manifest -- we do not
want to cross filesystem boundaries then. However, to ensure consistent
bidirectional behavior we need to also ban them when descending
the tree.

The directories and files on different filesystems need to be ignored
explicitly as implicitly skipping them would cause confusion.
In particular, tools might then claim that a file does not exist when
it clearly does because it was skipped due to filesystem boundaries.


Filename character set restriction
----------------------------------

The valid set of filename characters for the Gentoo repository
is restricted by the devmanual 'File Naming Rules' section
[#FILE-NAMING-RULES]_, and enforced via a git hook. The valid distfile
names are not restricted explicitly -- however, the PMS dependency
specification syntax [#PMS-FETCH]_ implicitly makes it impossible to use
filenames containing whitespace.

This specification aims to avoid arbitrary restrictions. For this
reason, filename characters are only restricted by excluding three
technically problematic groups:

1. The backslash character (``\``) is used as path separator
   on Windows systems, so it's extremely unlikely to be used in real
   filenames. For this reason it is used to implement character
   encoding with minimal risk of breaking backwards compatibility.

2. The control characters can trigger special behavior in various
   programs and prevent them from recognizing text files. In particular,
   the NULL character (``U+0000``) is normally used to indicate the end
   of a null-terminated string. Its use could therefore break
   implementations written in the C language. Other control characters
   could trigger various formatting routines, garbling text output.

3. Whitespace characters are used to separate Manifest fields
   and entries. While technically it would be enough to restrict
   the space character (``U+0020``) that is normally used as the field
   separator and the newline character (``U+000A``) that is used
   to separate lines, all whitespace characters are forbidden to avoid
   confusion and implementation errors.

Historically, Portage tried to overcome the whitespace limitation
by locating the size field and taking everything before it
as the filename. This was terribly fragile and even if it worked,
it would solve the problem only partially.

To preserve compatibility with the current implementations and given
that none of the listed characters are allowed in the foreseeable
Gentoo uses, extended encoding support is optional. If such support
is not provided, the implementation must unconditionally reject any
such files. Ignoring them implicitly would be confusing, and it is
not possible to use them in explicit ``IGNORE`` entries.

The character encoding method provides means to overcome the character
restrictions to extend the tool usability beyond immediate Gentoo uses.
The backslash escape form based on Python unicode strings is used
since it can encode all characters within the Unicode range, the syntax
is familiar to many programmers and the backslash character
is extremely unlikely to appear in real filenames.

Syntax is limited to the minimum necessary to implement the encoding.
Shorthand forms (e.g. ``\t`` or ``\\``) are omitted to avoid unnecessary
complexity, and to reduce the risk of shell users using backslash
to escape space directly. The ``\x`` form is limited to ``\x00..\x7F``
range to avoid ambiguity of higher values which might be interpreted
either as UCS-2 code points or part of a UTF-8 encoded character.

Encoding stores UCS-2/UCS-4 characters directly rather than hex-encoded
UTF-8 string to simplify the implementation. In particular, it makes it
possible to process the Manifest file as UTF-8 encoded text without
having to perform additional UTF-8 decoding (and verification)
of the escaped data.
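
The encoding discussed above could be sketched as follows. The ``\x``
form and its ``00..7F`` limit come from this section. The ``\uHHHH``
and ``\UHHHHHHHH`` forms are assumed here by analogy with Python string
syntax, and the helper names are hypothetical.

```python
import re
import unicodedata

# A path is valid if every backslash starts one of the supported escapes.
_VALID = re.compile(
    r"(?:[^\\]|\\x[0-7][0-9A-Fa-f]|\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8})*")
_ESCAPE = re.compile(
    r"\\(?:x([0-9A-Fa-f]{2})|u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8}))")


def needs_escape(ch):
    """Backslash, whitespace and control characters must be encoded."""
    return ch == "\\" or ch.isspace() or unicodedata.category(ch) == "Cc"


def encode_path(path):
    """Replace problematic characters with backslash escapes."""
    out = []
    for ch in path:
        cp = ord(ch)
        if not needs_escape(ch):
            out.append(ch)
        elif cp < 0x80:
            out.append("\\x%02X" % cp)
        elif cp < 0x10000:
            out.append("\\u%04X" % cp)
        else:
            out.append("\\U%08X" % cp)
    return "".join(out)


def decode_path(text):
    """Reverse encode_path(); reject shorthand or out-of-range escapes."""
    if not _VALID.fullmatch(text):
        raise ValueError("invalid backslash escape in %r" % text)
    return _ESCAPE.sub(
        lambda m: chr(int(m.group(1) or m.group(2) or m.group(3), 16)), text)
```

Shorthand forms such as ``\t`` are rejected by ``decode_path`` in this
sketch, in line with the rationale above.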

URL-encoding was considered as an alternative. However, it could collide
with ``DIST`` entries that are implicitly named after the URL filename
part where URL-encoding is pretty common.


File verification model
-----------------------

The verification model aims to provide full coverage against different
forms of attack. In particular, three different kinds of manipulation
are considered:

1. Alteration of the file content.

2. Removal of a file.

3. Addition of a new file.

In order to protect against all three, the system requires that all
files in the repository are listed in Manifests and verified against
them.

As a special case, ignores are allowed to account for directories
that are not part of the repository but were traditionally placed inside
it. Those directories were ``distfiles``, ``local`` and ``packages``.
Ignores could also be used for VCS directories such as ``CVS``.
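
The three kinds of manipulation map onto a comparison between the files
present on disk and the files covered by Manifests. The sketch below
assumes the Manifest entries were already collected into a dict mapping
each path to a ``(size, sha256)`` pair. It checks SHA256 only and leaves
out symbolic link and filesystem boundary handling. All names are
hypothetical.

```python
import hashlib
import os


def walk_present(root, ignores):
    """Return relative paths of files on disk, skipping dotfiles and ignores."""
    present = set()
    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root)
        rel_dir = "" if rel_dir == "." else rel_dir.replace(os.sep, "/") + "/"
        # Prune ignored and dot directories so they are not descended into.
        dirnames[:] = [d for d in dirnames
                       if not d.startswith(".") and rel_dir + d not in ignores]
        for name in filenames:
            rel = rel_dir + name
            if not name.startswith(".") and rel not in ignores:
                present.add(rel)
    return present


def verify_tree(root, covered, ignores=()):
    """Return {path: problem} over the union of present and covered files."""
    present = walk_present(root, set(ignores))
    problems = {}
    for rel in sorted(present | set(covered)):
        if rel not in covered:
            problems[rel] = "not covered by Manifest"   # addition
        elif rel not in present:
            problems[rel] = "missing"                   # removal
        else:
            with open(os.path.join(root, rel), "rb") as f:
                data = f.read()
            size, sha256 = covered[rel]
            if len(data) != size or hashlib.sha256(data).hexdigest() != sha256:
                problems[rel] = "content mismatch"      # alteration
    return problems
```

An empty result means that no file was altered, removed or added
relative to the covered set.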


Non-strict Manifest verification
--------------------------------

Originally the Manifest2 format provided a special ``MISC`` tag that
was used for ``metadata.xml`` and ``ChangeLog`` files. This tag
indicated that the Manifest verification failures could be ignored for
those files unless the package manager was working in strict mode.

The first versions of this specification continued the use of this tag.
However, after a long debate it was decided to deprecate it along with
the non-strict behavior, and require all files to strictly match.

Two arguments were mentioned for the usefulness of a ``MISC`` type:

1. being able to reduce the checkout size by stripping unnecessary
   files out, and

2. being able to update automatically generated files locally
   without causing unnecessary verification failures.

However, the usefulness of ``MISC`` in both cases is doubtful.

The cases for stripping unnecessary files mostly focused around space
savings. For this purpose, stripping ``metadata.xml`` and similar files
has little value. It is much more common for users to strip whole
packages or categories. The ``MISC`` type is not suitable for that,
and so a dedicated package manager mechanism needs to be developed
instead. The same mechanism can also handle files that historically used
the ``MISC`` type. As an example, the package manager may choose
to generate both the rsync exclusion list and Manifest ignore list
using a single source list.

The cases for autogenerated files involve such cache files
as ``use.local.desc``. However, we cannot include ``md5-cache`` there
due to security concerns, which results in inconsistent cache handling.
Furthermore, the tools were historically modified to provide stable
output, which means that their content cannot change without
a non-``MISC`` content being changed first. This practically defeats
the purpose of using ``MISC``.

Finally, the non-strict mode could be used as a means of attack.
Allowing a missing or modified documentation file could be used
to spread misinformation, resulting in bad decisions made by the user.
A modified file could also be used, e.g. to exploit vulnerabilities
of an XML parser.


Timestamp field
---------------

The top-level Manifest optionally allows using a ``TIMESTAMP`` tag
to include a generation timestamp in the Manifest. A similar feature
was originally proposed in GLEP 58 [#GLEP58]_.

A malicious third party may use the principles of exclusion or replay
[#C08]_ to deny an update to clients, while at the same time recording
the identity of clients to attack. The timestamp field can be used to
detect that.

In order to provide more complete protection, the Gentoo Infrastructure
should provide an ability to obtain the timestamps of all Manifests
from a recent timeframe over a secure channel from a trusted source
for comparison.
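
A client-side check based on the ``TIMESTAMP`` value could look roughly
like the sketch below. The timestamp format follows the example
Manifest in this document. The one-day ``max_age`` default and
the function names are assumptions made for illustration, not values
set by this GLEP.

```python
from datetime import datetime, timedelta, timezone


def parse_timestamp(value):
    """Parse a TIMESTAMP value such as 2017-10-30T10:11:12Z as UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def check_timestamp(manifest_ts, last_seen_ts=None, now=None,
                    max_age=timedelta(days=1)):
    """Return None if the timestamp looks sane, else a short reason."""
    current = parse_timestamp(manifest_ts)
    now = now or datetime.now(timezone.utc)
    # An older Manifest than the one verified before suggests a replay.
    if last_seen_ts is not None and current < parse_timestamp(last_seen_ts):
        return "replay: Manifest is older than the previously verified one"
    # A Manifest that never gets newer suggests updates are being withheld.
    if now - current > max_age:
        return "stale: Manifest is older than the accepted maximum age"
    return None
```

The ``last_seen_ts`` argument stands for whatever value the client
stored after its previous successful verification.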

Strictly speaking, this information is provided by the various
``metadata/timestamp*`` files that are already present. However,
including the value in the Manifest itself has little cost
and provides the ability to perform the verification stand-alone.

Furthermore, some of the timestamp files are added very late
in the distribution process, past the Manifest generation phase. Those
files will most likely receive ``IGNORE`` entries and therefore
be unsafe to use.

The specification permits additional timestamps in sub-Manifest files
for local use. A generic testing tool should ignore them.


New vs deprecated tags
----------------------

Out of the four types defined by Manifest2, only one is reused
and the remaining three are replaced by a single, universal ``DATA``
type.

The ``DIST`` tag is reused since the specification does not change
anything with regard to distfile handling.

The ``EBUILD`` tag could potentially be reused for generic file
verification data. However, it would be confusing if all the different
data files were marked as ``EBUILD``. Therefore, an equivalent ``DATA``
type was introduced as a replacement.

The ``MISC`` tag and the related non-strict mode have been removed
as being of little value, as detailed in the `Non-strict Manifest
verification`_ section.

The ``AUX`` tag is deprecated as it is redundant to ``DATA``, and has
the limiting property of implicit ``files/`` path prefix.


Finding top-level Manifest
--------------------------

The development of a reference implementation for this GLEP has brought
the following problem: how to find all the relevant Manifests when
the Manifest tool is run inside a subdirectory of the repository?

One of the options would be to provide a bi-directional linking
of Manifests via a ``PARENT`` tag. However, that would not solve
the problem when a new Manifest file is being created.

Instead, an algorithm for iterating over parent directories is proposed.
Since there is no obligatory explicit indicator for the top-level
Manifest, the algorithm assumes that the top-level Manifest
is the highest ``Manifest`` in the directory hierarchy that can cover
the current directory. This generally makes sense since the Manifest
files are required to provide coverage for all subdirectories, so all
Manifests starting from that one need to be updated.

If independent Manifest trees are nested in the directory structure,
then an ``IGNORE`` entry needs to be used to separate them.

Since sub-Manifests can use any filenames, the Manifest finding
algorithm must not short-cut the procedure by storing all ``Manifest``
files along the parent directories. Instead, it needs to retrace
the relevant sub-Manifest files along ``MANIFEST`` entries
in the top-level Manifest.
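
Retracing the sub-Manifests from the top-level Manifest down
to the current directory could be sketched as below. The sketch follows
only ``MANIFEST`` entries, assumes all Manifests involved are
uncompressed, and performs no checksum verification. The function names
are hypothetical.

```python
import os
import posixpath


def manifest_entries(manifest_path):
    """Yield the paths named by MANIFEST entries of one Manifest file."""
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) >= 2 and fields[0] == "MANIFEST":
                yield fields[1]


def relevant_manifests(top_dir, original):
    """List Manifests (relative to top_dir) on the way down to original."""
    target = os.path.relpath(original, top_dir).replace(os.sep, "/")
    target_dir = "" if target == "." else target + "/"
    found, queue, seen = [], ["Manifest"], {"Manifest"}
    while queue:
        rel = queue.pop(0)
        found.append(rel)
        base = posixpath.dirname(rel)
        for entry in manifest_entries(os.path.join(top_dir, rel)):
            # Entry paths are relative to the directory of their Manifest.
            sub = posixpath.join(base, entry)
            sub_dir = posixpath.dirname(sub)
            prefix = sub_dir + "/" if sub_dir else ""
            # Keep sub-Manifests located in original or one of its parents.
            if target_dir.startswith(prefix) and sub not in seen:
                seen.add(sub)
                queue.append(sub)
    return found
```

Note that a file named ``Manifest`` in an intermediate directory is
not picked up unless a ``MANIFEST`` entry actually refers to it.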


Injecting ChangeLogs into the checkout
--------------------------------------

One of the problems considered in the new Manifest format was injecting
historical and autogenerated ChangeLog into the repository. We normally
don't include those files, to reduce the checkout size. However, some
users have shown interest in them and Infra is working on providing them
via an additional rsync module.

If such files were injected into the repository, they would cause
verification failures of Manifests. To account for this, Infra could
provide ``IGNORE`` entries to allow them to exist.


Splitting distfile checksums from file checksums
------------------------------------------------

Another problem with the current Manifest format is that the checksums
for fetched files are combined with checksums for local files
in a single file inside the package directory. It has been specifically
pointed out that:

- since distfiles are sometimes reused across different packages,
  the repeating checksums are redundant [#DIST]_,

- mirror admins were interested in the possibility of verifying all
  the distfiles with a single tool.

This specification does not provide a clean solution to this problem.
It technically permits moving ``DIST`` entries to higher-level Manifests
but the usefulness of such a solution is doubtful.

However, for the second problem we will probably deliver a dedicated
tool working with this Manifest format.


Hash algorithms
---------------

While maintaining a consistent supported hash set is important
for interoperability, it is not a good fit for the generic layout
of this GLEP. Furthermore, it would require updating the GLEP
in the future every time the used algorithms change.

Instead, the specification focuses on listing the currently used
algorithm names for interoperability, and sets a recommendation
for consistent naming of algorithms in the future. The Python
``hashlib`` module is used as a reference since it is used
as the provider of hash functions for most of the Python software,
including Portage and PkgCore.

The basic rules for changing hash algorithms are defined in GLEP 59
[#GLEP59]_. The implementations can focus only on those algorithms
that are actually used or planned on being used. It may be feasible
to devise a new GLEP that specifies the currently used hashes (or update
GLEP 59 accordingly).


Manifest compression
--------------------

The support for Manifest compression is introduced with minimal changes
to the file format. The ``MANIFEST`` entries are required to provide
the real (compressed) file path for compatibility with other file
entries and to avoid confusion.

Compression of the top-level Manifest file has been prohibited
as the specification currently does not provide any means of verifying
the file prior to decompression. If the top-level Manifest were
compressed, tooling would have to unpack the file before being able
to verify the contents. This would make it possible for a malicious
third party to attack the system by providing a compressed Manifest that
exploits decompressor vulnerabilities, or a zip bomb.

The OpenPGP cleartext signature covers the contents of the Manifest,
and is therefore compressed along with them. The possibility of using
a detached signature has been considered but it was rejected as
unnecessary complexity for minor gain.

Technically, a similar result could be effected via moving all the data
into a compressed sub-Manifest in the top directory (e.g.
``Manifest.sub.gz``), and including a ``MANIFEST`` entry for this file
in a signed, uncompressed top-level Manifest.

The existence of additional entries for uncompressed Manifest checksums
was debated. However, plain entries for the uncompressed file would
be confusing if only the compressed file existed, and conflicting
if both uncompressed and compressed variants existed. Furthermore,
it has been pointed out that ``DIST`` entries do not have
an uncompressed variant either.


Performance considerations
--------------------------

Performing a full-tree verification on every sync raises some
performance concerns for end-user systems. The initial testing has shown
that a cold-cache verification on a btrfs file system can take around
4 minutes, with the process being mostly I/O bound. On the other hand,
it can be expected that the verification will be performed directly
after syncing, taking advantage of a warm filesystem cache.

To improve speed on I/O and/or CPU-constrained systems even further,
the algorithms can be easily extended to perform incremental
verification. Given that rsync does not preserve mtimes by default,
the tool can take advantage of mtime and Manifest comparisons to recheck
only the parts of the repository that have changed.
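
One possible shape of such an incremental check is sketched below.
It assumes the tool keeps a cache from the last successful
verification, mapping each path to its mtime, size and Manifest entry,
and rechecks a file only when any of those differ. The cache layout
and function names are hypothetical.

```python
import os


def stat_key(path):
    """Return the (mtime in ns, size) pair used to detect local changes."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size)


def files_to_recheck(root, covered, cache):
    """List covered paths whose file or Manifest entry changed since caching."""
    recheck = []
    for rel, entry in covered.items():
        try:
            key = stat_key(os.path.join(root, rel)) + (entry,)
        except FileNotFoundError:
            key = None  # missing files always need a full check
        if key is None or cache.get(rel) != key:
            recheck.append(rel)
    return sorted(recheck)


def remember_verified(root, covered, cache, verified):
    """Record successfully verified files so they can be skipped next time."""
    for rel in verified:
        cache[rel] = stat_key(os.path.join(root, rel)) + (covered[rel],)
```

Files reported by ``files_to_recheck`` would then go through the full
checksum verification, and only the ones that pass are recorded again.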

Furthermore, the package manager implementations can restrict checking
only to the parts of the repository that are actually being used.


Backwards Compatibility
=======================

This GLEP provides optional means of preserving backwards compatibility.
To preserve the backwards compatibility, the following needs to hold
for the ``Manifest`` file in every package directory:

- all files must be covered by the single ``Manifest`` file,

- all distfiles used by the package must be included,

- all files inside the ``files/`` subdirectory need to use
  the ``AUX`` tag (rather than ``DATA``),

- all ``.ebuild`` files need to use the ``EBUILD`` tag,

- the ``metadata.xml`` and ``ChangeLog`` files need to use
  the ``MISC`` tag,

- the Manifest can be signed to provide authenticity verification,

- an uncompressed Manifest must always exist, and a compressed Manifest
  of identical content may be present.

Once the backwards compatibility is no longer a concern, the above
no longer needs to hold and the deprecated tags can be removed.
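
The tag selection rules listed above can be summarized in a short
sketch. ``compat_entry`` is a hypothetical helper that takes a path
relative to the package directory. Stripping the ``files/`` prefix
for ``AUX`` reflects the implicit prefix of that tag mentioned earlier
in this document.

```python
def compat_entry(path):
    """Return the (tag, name) pair to use in a backwards-compatible Manifest."""
    if path.startswith("files/"):
        # AUX names are implicitly relative to the files/ subdirectory.
        return "AUX", path[len("files/"):]
    if "/" not in path and path.endswith(".ebuild"):
        return "EBUILD", path
    if path in ("metadata.xml", "ChangeLog"):
        return "MISC", path
    return "DATA", path
```

For instance, ``compat_entry("files/gcc.patch")`` gives
``("AUX", "gcc.patch")``, while the modern form would list the same
file as ``DATA files/gcc.patch``.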


Reference Implementation
========================

The reference implementation for this GLEP is being developed
as the gemato project [#GEMATO]_.


Credits
=======

Thanks to all the people whose contributions were invaluable
to the creation of this GLEP. This includes but is not limited to:

- Robin Hugh Johnson,
- Ulrich Müller.

Additionally, thanks to Robin Hugh Johnson for the original
MetaManifest GLEP series which served both as inspiration and source
of many concepts used in this GLEP. Recursively, also thanks to all
the people who contributed to the original GLEPs.


References
==========

.. [#GLEP44] GLEP 44: Manifest2 format
   (https://www.gentoo.org/glep/glep-0044.html)

.. [#GLEP57] GLEP 57: Security of distribution of Gentoo software
   - Overview
   (https://www.gentoo.org/glep/glep-0057.html)

.. [#GLEP58] GLEP 58: Security of distribution of Gentoo software
   - Infrastructure to User distribution - MetaManifest
   (https://www.gentoo.org/glep/glep-0058.html)

.. [#GLEP59] GLEP 59: Manifest2 hash policies and security implications
   (https://www.gentoo.org/glep/glep-0059.html)

.. [#GLEP60] GLEP 60: Manifest2 filetypes
   (https://www.gentoo.org/glep/glep-0060.html)

.. [#GLEP61] GLEP 61: Manifest2 compression
   (https://www.gentoo.org/glep/glep-0061.html)

.. [#UNICODE] The Unicode standard
   (https://unicode.org/versions/latest/)

.. [#PMS-FETCH] Package Manager Specification: Dependency Specification
   Format - SRC_URI
   (https://projects.gentoo.org/pms/6/pms.html#x1-940008.2.10)

.. [#FILE-NAMING-RULES] Ebuild File Format -- Gentoo Development Guide
   (https://devmanual.gentoo.org/ebuild-writing/file-format/#file-naming-rules)

.. [#MD5] RFC1321: The MD5 Message-Digest Algorithm
   (https://www.ietf.org/rfc/rfc1321.txt)

.. [#RIPEMD160] The hash function RIPEMD-160
   (https://homes.esat.kuleuven.be/~bosselae/ripemd160.html)

.. [#SHS] FIPS PUB 180-4: Secure Hash Standard (SHS)
   (http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf)

.. [#WHIRLPOOL] The WHIRLPOOL Hash Function
   (http://www.larc.usp.br/~pbarreto/WhirlpoolPage.html)

.. [#BLAKE2] BLAKE2 -- fast secure hashing
   (https://blake2.net/)

.. [#SHA3] FIPS PUB 202: SHA-3 Standard: Permutation-Based Hash
   and Extendable-Output Functions
   (http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.202.pdf)

.. [#STREEBOG] GOST R 34.11-2012: Streebog Hash Function
   (https://www.streebog.net/)

.. [#C08] Cappos, J et al. (2008). "Attacks on Package Managers"
   (https://www2.cs.arizona.edu/stork/packagemanagersecurity/attacks-on-package-managers.html)

.. [#DIST] According to Robin H. Johnson, 8.4% of all DIST entries
   at the time of writing are duplicate, representing 2 MiB
   out of 25 MiB of DIST entries altogether.

.. [#GEMATO] gemato: Gentoo Manifest Tool
   (https://github.com/mgorny/gemato/)


Copyright
=========
This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
Unported License. To view a copy of this license, visit
http://creativecommons.org/licenses/by-sa/3.0/.

-- 
Best regards,
Michał Górny




* Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update [v5]
  2017-11-23 20:53 ` [gentoo-dev] [RFC] GLEP 74 post-Council review update [v5] Michał Górny
@ 2017-12-01 11:30   ` Fabian Groffen
  2017-12-01 12:32     ` Michał Górny
  0 siblings, 1 reply; 23+ messages in thread
From: Fabian Groffen @ 2017-12-01 11:30 UTC (permalink / raw
  To: gentoo-dev


Hi,

While trying to implement full tree Manifests for the Prefix tree, I ran
into the following:

Would it be possible to add a section to define what directories receive
what kind of Manifest?

I mean in particular what is encoded in gemato/profile.py, the metadata
directory is an interesting mix and match of subdirectories that have a
Manifest of their own, and subdirectories whose content is included in
the Manifest at the metadata level.

More specifically, the current GLEP doesn't seem to mention
what directories should have their own Manifest or not.  It would be
good to know if for instance adding Manifest(.gz) to
metadata/install-qa-check.d is ok as per GLEP or not (and if so, the
consumer of that directory should be fixed to ignore the Manifest*
files, instead of barking it can't source the gz file or doesn't get
it).  Also, what if someone would want to include all entries in the
top-level Manifest, would that be OK (albeit stupid I guess)?

I think it would be a good addition to specify (for a Gentoo tree) what
directories receive a Manifest file and what their content is.

In addition to this, because it is related, it would be nice to also
document the IGNORE entries that seem present at the top-level and
metadata-level, or specify where they would come from for the Gentoo
case.

Thanks!
Fabian

On 23-11-2017 21:53:57 +0100, Michał Górny wrote:
> On Thu, 16.11.2017 at 11:19 +0100, Michał Górny wrote:
> > Hi, everyone.
> > 
> > Here's the updated version of GLEP 74 taking into consideration
> > the points made during the Council pre-review.
> > 
> > ReST: https://dev.gentoo.org/~mgorny/tmp/glep-0074.rst
> > HTML: https://dev.gentoo.org/~mgorny/tmp/glep-0074.html
> > 
> > Changes:
> 
> 27c2a9e glep-0074: Grammar corrections from Ulrich Müller
> d39f865 glep-0074: Make extended filename encoding optional
> ed111f8 glep-0074: Always exclude control characters
> 
> ---
> GLEP: 74
> Title: Full-tree verification using Manifest files
> Author: Michał Górny <mgorny@gentoo.org>,
>         Robin Hugh Johnson <robbat2@gentoo.org>,
>         Ulrich Müller <ulm@gentoo.org>
> Type: Standards Track
> Status: Draft
> Version: 1
> Created: 2017-10-21
> Last-Modified: 2017-11-23
> Post-History: 2017-10-26, 2017-11-16
> Content-Type: text/x-rst
> Requires: 59, 61
> Replaces: 44, 58, 60
> ---
> 
> Abstract
> ========
> 
> This GLEP extends the Manifest file format to cover full-tree file
> integrity and authenticity checks. The format aims to be future-proof,
> efficient and provide means of backwards compatibility.
> 
> 
> Motivation
> ==========
> 
> The Manifest files as defined by GLEP 44 [#GLEP44]_ provide the current
> means of verifying the integrity of distfiles and package files
> in Gentoo. Combined with OpenPGP signatures, they provide means to
> ensure the authenticity of the covered files. However, as noted
> in GLEP 57 [#GLEP57]_ they lack the ability to provide full-tree
> authenticity verification as they do not cover any files outside
> the package directory. In particular, they provide multiple ways
> for a third party to inject malicious code into the ebuild environment.
> 
> Historically, the topic of providing authenticity coverage for the whole
> repository has been mentioned multiple times. The most noteworthy effort
> are GLEPs 58 [#GLEP58]_ and 60 [#GLEP60]_ by Robin H. Johnson from 2008.
> They were accepted by the Council in 2010 but have never been
> implemented. When potential implementation work started in 2017, a new
> discussion about the specification arose. It prompted the creation
> of a competing GLEP that would provide a redesigned alternative to
> the old GLEPs.
> 
> This specification is designed with the following goals in mind:
> 
> 1. It should provide means to ensure the authenticity of the complete
>    repository, including preventing the injection of additional files.
> 
> 2. The format should be universal enough to work both for the Gentoo
>    repository and third-party repositories of different characteristics.
> 
> 3. The Manifest files should be verifiable stand-alone, that is without
>    knowing any details about the underlying repository format.
> 
> 
> Specification
> =============
> 
> Manifest file format
> --------------------
> 
> This specification reuses and extends the Manifest file format defined
> in GLEP 44 [#GLEP44]_. For this purpose, the *file type* field is
> repurposed as a generic *tag* that could also indicate additional
> (non-checksum) metadata. Appropriately, those tags can be followed by
> other space-separated values.
> 
> Unless specified otherwise, the paths used in the Manifest files
> are relative to the directory containing the Manifest file. The paths
> must not reference the parent directory (``..``). Forward slash (``/``)
> is used as path component separator.
> 
> The Manifest files use UTF-8 encoding.
> 
> 
> Manifest file locations and nesting
> -----------------------------------
> 
> The ``Manifest`` file located in the root directory of the repository
> is called top-level Manifest, and it is used to perform the full-tree
> verification. In order to verify the authenticity, it must be signed
> using OpenPGP, using the armored cleartext format.
> 
> The top-level Manifest may reference sub-Manifests contained
> in subdirectories of the repository. The sub-Manifests are traditionally
> named ``Manifest``; however, the implementation must support arbitrary
> names, including the possibility of multiple (split) Manifests
> for a single directory. The sub-Manifest can only cover the files inside
> the directory tree where it resides.
> 
> The sub-Manifest can also be signed using OpenPGP armored cleartext
> format. However, the signature verification can be omitted since it
> already is covered by the signed top-level Manifest.
> 
> 
> Directory tree coverage
> -----------------------
> 
> The specification provides three ways of skipping Manifest verification
> of specific files and directories (recursively):
> 
> 1. explicit ``IGNORE`` entries in Manifest files,
> 
> 2. injected ignore paths via package manager configuration,
> 
> 3. using names starting with a dot (``.``) which are always skipped.
> 
> All files that are not ignored must be covered by at least one
> of the Manifests.
> 
> A single file may be matched by multiple identical or equivalent
> Manifest entries, if and only if the entries have the same semantics,
> specify the same size and the checksums common to both entries match.
> It is an error for a single file to be matched by multiple entries
> of different semantics, file size or checksum values. It is an error
> to specify another entry for a file that matches ``IGNORE``, or that
> is located inside an ignored directory.
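
For illustration, the duplicate-entry rule above could be checked roughly as follows. This is only a sketch: the `entries_compatible` name and the `(tag, size, checksums)` tuple layout are hypothetical, and treating `EBUILD`, `MISC` and `AUX` as having the same semantics as `DATA` is an assumption based on the "Equivalent to the ``DATA`` type" wording in the deprecated tags section.

```python
# Tags this sketch assumes to share DATA semantics (see "Deprecated Manifest tags").
_DATA_LIKE = {"DATA", "EBUILD", "MISC", "AUX"}


def entries_compatible(a, b):
    """Return True if two entries matching the same file may coexist.

    a and b are hypothetical (tag, size, checksums) tuples, where
    checksums maps an algorithm name to a hex digest.
    """
    tag_a, size_a, sums_a = a
    tag_b, size_b, sums_b = b
    sem_a = "DATA" if tag_a in _DATA_LIKE else tag_a
    sem_b = "DATA" if tag_b in _DATA_LIKE else tag_b
    if sem_a != sem_b or size_a != size_b:
        return False
    # Only the checksums present in both entries have to agree.
    common = set(sums_a) & set(sums_b)
    return all(sums_a[n].lower() == sums_b[n].lower() for n in common)
```

Two entries with no checksum algorithm in common are accepted here as long as tag semantics and size agree, which is how the "checksums common to both entries" wording reads.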
> 
> The file entries (except for ``IGNORE``) can be specified for regular
> files only. Symbolic links are followed when opening files
> and traversing directories. It is an error to specify an entry for
> a different file type. If the tree contains files of other types
> that are not otherwise ignored, they need to be covered by an explicit
> ``IGNORE``.
> 
> All the local (non-``DIST``) files covered by a Manifest tree must
> reside on the same filesystem. It is an error to specify entries
> applying to files on another filesystem. If files or directories that
> are not otherwise ignored reside on a different filesystem, or symbolic
> links point to targets on a different filesystem, they must
> be explicitly excluded via ``IGNORE``.
> 
> 
> Path and filename encoding
> --------------------------
> 
> The path fields in the Manifest file must consist of characters
> corresponding to valid UTF-8 code points excluding the backwards slash
> (``\``) and characters classified as control characters or as whitespace
> in the current version of the Unicode standard [#UNICODE]_.
> 
> The implementation can optionally support extended filename encoding
> to support those paths. If encoding is not supported, the implementation
> must reject directories containing any files using non-compliant names,
> as well as Manifest files whose filename field contains such filenames.
> 
> If encoding is supported, then all of the excluded characters that
> are present in paths must be encoded using one of the following escape
> sequences:
> 
> - characters in the ``U+0000`` to ``U+007F`` range can be encoded
>   as ``\xHH`` where ``HH`` specifies the zero-padded, hexadecimal
>   character code,
> 
> - characters in the ``U+0000`` to ``U+FFFF`` range can be encoded
>   as ``\uHHHH`` where ``HHHH`` specifies the zero-padded, hexadecimal
>   character code,
> 
> - characters in the UCS-4 range can be encoded as ``\UHHHHHHHH``
>   where ``HHHHHHHH`` specifies the zero-padded, hexadecimal character
>   code.
> 
> It is invalid for the backwards slash to be used in any other context,
> and a backwards slash present in filename must be encoded. A backwards
> slash used as a path component separator should be replaced by a forward
> slash instead.
> 
> The encoding can be used for other characters as well. In particular,
> escaping non-printable characters might be desirable.
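
The escaping rules above can be sketched as a small encoder/decoder pair. The function names are hypothetical, `str.isspace()` is used as an approximation of "whitespace in the current Unicode standard", and emitting uppercase hex digits while accepting both cases is an assumption, since the text does not specify digit case.

```python
import re
import unicodedata


def needs_escape(ch):
    """True for characters excluded from raw path fields."""
    if ch == "\\":
        return True
    # Cc is the Unicode "control" category; isspace() approximates whitespace.
    return unicodedata.category(ch) == "Cc" or ch.isspace()


def encode_path(path):
    """Escape excluded characters using the shortest applicable form."""
    out = []
    for ch in path:
        if not needs_escape(ch):
            out.append(ch)
            continue
        cp = ord(ch)
        if cp <= 0x7F:
            out.append("\\x%02X" % cp)
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)
        else:
            out.append("\\U%08X" % cp)
    return "".join(out)


# Either a well-formed escape, or a lone backslash (which is invalid).
_ESCAPE = re.compile(
    r"\\(?:x([0-9A-Fa-f]{2})|u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8}))|\\"
)


def decode_path(field):
    """Reverse encode_path(); reject backslashes used in any other context."""
    def repl(match):
        digits = match.group(1) or match.group(2) or match.group(3)
        if digits is None:
            raise ValueError("stray backslash in path field")
        cp = int(digits, 16)
        if match.group(1) and cp > 0x7F:
            raise ValueError("\\x escape outside the U+0000..U+007F range")
        return chr(cp)

    return _ESCAPE.sub(repl, field)
```

A filename containing a space, such as `my file.patch`, would be written as `my\x20file.patch` in the path field under this sketch.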
> 
> 
> File verification
> -----------------
> 
> When verifying a file against the Manifest, the following rules are
> used:
> 
> 1. If the file is covered directly or indirectly by an entry
>    of the ``IGNORE`` type, the verification always succeeds.
> 
> 2. If the file is covered by an entry of the ``MANIFEST``, ``DATA``,
>    ``MISC``, ``EBUILD`` or ``AUX`` type:
> 
>    a. if the file is not present, then the verification fails,
> 
>    b. if the file is present but has a different size or one
>       of the checksums does not match, the verification fails,
> 
>    c. otherwise, the verification succeeds.
> 
> 3. If the file is present but not listed in Manifest, the verification
>    fails.
> 
> Unless specified otherwise, the package manager must not allow using
> any files for which the verification failed. The package manager may
> reject any package or even the whole repository if it may refer to files
> for which the verification failed.
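
Rule 2 above (presence, size, then checksums) could look roughly like this for a single file. The `verify_file` name and its arguments are hypothetical, and the sketch assumes that the lowercased Manifest algorithm name is a valid Python `hashlib` name, which holds for e.g. `SHA256`, `SHA512` and `BLAKE2B` but not for older names such as `RMD160`.

```python
import hashlib
import os


def verify_file(path, size, checksums):
    """Return True if path exists and matches size and every checksum.

    checksums maps Manifest algorithm names (e.g. "SHA256") to hex digests.
    """
    if not os.path.isfile(path):  # follows symlinks, as the spec requires
        return False
    if os.path.getsize(path) != size:
        return False
    # Assumption: Manifest name lowercased == hashlib name.
    hashers = {name: hashlib.new(name.lower()) for name in checksums}
    with open(path, "rb") as f:
        # Read once, feeding every requested hash from the same chunks.
        for chunk in iter(lambda: f.read(65536), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return all(
        hashers[name].hexdigest() == checksums[name].lower()
        for name in checksums
    )
```

Checking the size before hashing lets a mismatch be rejected without reading the file at all.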
> 
> 
> Timestamp verification
> ----------------------
> 
> The top-level Manifest file can contain a ``TIMESTAMP`` entry to account
> for attacks against tree update distribution. If such an entry
> is present, it should be updated every time at least one
> of the Manifests changes. Every unique timestamp value must correspond
> to a single tree state.
> 
> During the verification process, the client should compare the timestamp
> against the update time obtained from a local clock or a trusted time
> source. If the comparison result indicates that the Manifest at the time
> of receiving was already significantly outdated, the client should
> either fail the verification or require manual confirmation from
> the user.
> 
> Furthermore, the Manifest provider may employ additional methods
> of distributing the timestamps of recently generated Manifests
> using a secure channel from a trusted source for exact comparison.
> The exact details of such a solution are outside the scope of this
> specification.
> 
> ``TIMESTAMP`` entries may also be present in sub-Manifests. Those
> timestamps must not be newer than the timestamp of the top-level
> Manifest (if present). This specification does not define any specific
> use for them.
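
A client-side timestamp check might be sketched as below, using the `strftime()` format string given for the ``TIMESTAMP`` tag. The function names are hypothetical, and the seven-day default is purely an illustrative assumption: the text only says "significantly outdated" and leaves the threshold to the implementation.

```python
from datetime import datetime, timedelta, timezone

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_timestamp(value):
    """Parse a TIMESTAMP value into an aware UTC datetime."""
    parsed = datetime.strptime(value, TIMESTAMP_FORMAT)
    return parsed.replace(tzinfo=timezone.utc)


def is_outdated(value, now, max_age=timedelta(days=7)):
    """True if the Manifest was already older than max_age at time 'now'."""
    return now - parse_timestamp(value) > max_age


def sub_timestamp_ok(sub_value, top_value):
    """A sub-Manifest timestamp must not be newer than the top-level one."""
    return parse_timestamp(sub_value) <= parse_timestamp(top_value)
```

Here `now` is passed in explicitly so that the caller decides whether it comes from the local clock or from a trusted time source.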
> 
> 
> Modern Manifest tags
> --------------------
> 
> The Manifest files can specify the following tags:
> 
> ``TIMESTAMP <iso8601>``
>   Specifies a timestamp of when the Manifest file was last updated.
>   The timestamp must be a valid second-precision ISO 8601 extended
>   format combined date and time in UTC timezone, i.e. using
>   the following ``strftime()`` format string: ``%Y-%m-%dT%H:%M:%SZ``.
>   Optional. The package manager can use it to detect an outdated
>   repository checkout as described in `Timestamp verification`_.
> 
> ``MANIFEST <path> <size> <checksums>...``
>   Specifies a sub-Manifest. The sub-Manifest must be verified like
>   a regular file. If the verification succeeds, the entries from
>   the sub-Manifest are included for verification as described
>   in `Manifest file locations and nesting`_.
> 
> ``IGNORE <path>``
>   Ignores a subdirectory or file from Manifest checks. If the specified
>   path is present, it and its contents are omitted from the Manifest
>   verification (always pass). *Path* must be a plain file or directory
>   path without a trailing slash. Wildcards are not supported
>   and wildcard characters are interpreted literally.
> 
> ``DATA <path> <size> <checksums>...``
>   Specifies a regular file subject to Manifest verification. The file
>   is required to pass verification. Used for all files that do not match
>   any other type.
> 
> ``DIST <filename> <size> <checksums>...``
>   Specifies a distfile entry used to verify files fetched as part
>   of ``SRC_URI``. The filename must match the filename used to store
>   the fetched file as specified in the PMS [#PMS-FETCH]_. The package
>   manager must reject the fetched file if it fails verification.
>   ``DIST`` entries apply to all packages below the Manifest file
>   specifying them.
> 
> 
> Deprecated Manifest tags
> ------------------------
> 
> For backwards compatibility, the following tags are additionally
> allowed at the package directory level:
> 
> ``EBUILD <filename> <size> <checksums>...``
>   Equivalent to the ``DATA`` type.
> 
> ``MISC <path> <size> <checksums>...``
>   Equivalent to the ``DATA`` type. Historically indicated that
>   the package manager may ignore a verification failure if operating
>   in non-strict mode. However, that behavior is deprecated.
> 
> ``AUX <filename> <size> <checksums>...``
>   Equivalent to the ``DATA`` type, except that the filename is relative
>   to the ``files/`` subdirectory.
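
To make the tag layouts concrete, a line parser covering both the modern and the deprecated tags could be sketched like this. The `parse_entry` name and the returned dictionary layout are hypothetical, and rejecting unknown tags outright is a choice made for the sketch, not something the text mandates.

```python
# Tags followed by <path> <size> <checksums>...
CHECKSUM_TAGS = {"MANIFEST", "DATA", "DIST", "EBUILD", "MISC", "AUX"}


def parse_entry(line):
    """Split one Manifest line into a dictionary describing the entry."""
    fields = line.split()  # fields are whitespace-separated
    if not fields:
        raise ValueError("empty line")
    tag, args = fields[0], fields[1:]
    if tag in ("TIMESTAMP", "IGNORE"):
        if len(args) != 1:
            raise ValueError("%s takes exactly one value" % tag)
        return {"tag": tag, "value": args[0]}
    if tag not in CHECKSUM_TAGS:
        raise ValueError("unknown tag: %s" % tag)
    # Expect path and size, then zero or more (algorithm, digest) pairs.
    if len(args) < 2 or len(args) % 2 != 0:
        raise ValueError("malformed %s entry" % tag)
    path, size, rest = args[0], int(args[1]), args[2:]
    if tag == "AUX":
        # AUX filenames are relative to the files/ subdirectory.
        path = "files/" + path
    checksums = dict(zip(rest[0::2], rest[1::2]))
    return {"tag": tag, "path": path, "size": size, "checksums": checksums}
```

Normalizing `AUX` paths at parse time means the rest of a tool can treat the deprecated tags exactly like `DATA`.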
> 
> 
> Algorithm for full-tree verification
> ------------------------------------
> 
> In order to perform full-tree verification, the following algorithm
> can be used:
> 
> 1. Collect all files present in the repository into *present* set.
> 
> 2. Start at the top-level Manifest file. Verify its OpenPGP signature.
>    Optionally verify the ``TIMESTAMP`` entry if present as specified
>    in `timestamp verification`_. Remove the top-level Manifest
>    from the *present* set.
> 
> 3. Process all ``MANIFEST`` entries, recursively. Verify the Manifest
>    files according to the `file verification`_ section, and include
>    their entries in the current Manifest entry list (using paths
>    relative to directories containing the Manifests).
> 
> 4. Process all ``IGNORE`` entries. Remove any paths matching them
>    from the *present* set.
> 
> 5. Collect all files covered by ``DATA``, ``MISC``, ``EBUILD``
>    and ``AUX`` entries into the *covered* set.
> 
> 6. Verify the entries in the *covered* set for incompatible duplicates
>    and collisions with ignored files as explained in `Directory tree
>    coverage`_.
> 
> 7. Verify all the files in the union of the *present* and *covered*
>    sets, according to the `file verification`_ section.
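
Steps 4 to 7 of the algorithm above reduce to set operations once the Manifest entries have been collected. The sketch below assumes that signature checking and recursive `MANIFEST` loading (steps 2 and 3) have already happened; all names are hypothetical, and `verify` stands in for whatever per-file check the implementation uses.

```python
def matches_ignore(path, ignores):
    """True if path is an IGNORE path or lies below an ignored directory."""
    return any(path == ig or path.startswith(ig + "/") for ig in ignores)


def has_dot_component(path):
    """True if any path component starts with a dot (always skipped)."""
    return any(part.startswith(".") for part in path.split("/"))


def check_tree(present, ignores, covered, verify):
    """Return a list of (path, reason) verification failures.

    present: set of relative paths found on disk (top-level Manifest removed)
    ignores: collection of IGNORE paths
    covered: dict mapping path -> entry from DATA/MISC/EBUILD/AUX/MANIFEST
    verify:  callable(path, entry) -> bool, e.g. a size and checksum check
    """
    # Step 6 (in part): entries must not collide with ignored paths.
    failures = [
        (p, "entry collides with IGNORE")
        for p in sorted(covered)
        if matches_ignore(p, ignores)
    ]
    # Step 4: drop ignored paths and dotfiles from the present set.
    to_check = {
        p for p in present
        if not matches_ignore(p, ignores) and not has_dot_component(p)
    }
    # Step 7: verify the union of the present and covered sets.
    to_check |= {p for p in covered if not matches_ignore(p, ignores)}
    for path in sorted(to_check):
        if path not in covered:
            failures.append((path, "not listed in any Manifest"))
        elif path not in present:
            failures.append((path, "missing"))
        elif not verify(path, covered[path]):
            failures.append((path, "size or checksum mismatch"))
    return failures
```

Taking the union rather than only the covered set is what catches injected files: a stray file is in *present* but has no entry, so it fails.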
> 
> 
> Algorithm for finding parent Manifests
> --------------------------------------
> 
> In order to find the top-level Manifest from the current directory
> the following algorithm can be used:
> 
> 1. Store the current directory as *original* and the device ID
>    of the containing filesystem (``st_dev``) as *startdev*.
> 
> 2. If the device ID of the containing filesystem (``st_dev``)
>    of the current directory is different than *startdev*, stop.
> 
> 3. If the current directory contains a ``Manifest`` file:
> 
>    a. If an ``IGNORE`` entry in the ``Manifest`` file covers
>       the *original* directory (or one of the parent directories), stop.
> 
>    b. Otherwise, store the current directory as *last_found*.
> 
> 4. If the current directory is the root system directory (``/``), stop.
> 
> 5. Otherwise, enter the parent directory and jump to step 2.
> 
> Once the algorithm stops, *last_found* will contain the relevant
> top-level Manifest. If *last_found* is null, then the directory tree
> does not contain any valid top-level Manifest candidates and one should
> be created in the *original* directory.
> 
> Once the top-level Manifest is found, its ``MANIFEST`` entries should
> be used to find any sub-Manifests below the top-level Manifest,
> up to and including the *original* directory. Note that those
> sub-Manifests can use different filenames than ``Manifest``.
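
The upward search above could be sketched as follows. The function name is hypothetical, and the `IGNORE` test of step 3a is delegated to a caller-supplied `ignored_by` callback (also hypothetical), since it requires parsing the candidate Manifest; the default callback never reports an ignore.

```python
import os


def find_top_level_manifest(start, ignored_by=lambda manifest_dir, original: False):
    """Return the directory holding the top-level Manifest, or None.

    ignored_by(manifest_dir, original) should return True if the Manifest
    in manifest_dir has an IGNORE entry covering the original directory.
    """
    original = os.path.realpath(start)
    startdev = os.stat(original).st_dev          # step 1
    current = original
    last_found = None
    while True:
        if os.stat(current).st_dev != startdev:  # step 2: filesystem boundary
            break
        if os.path.isfile(os.path.join(current, "Manifest")):  # step 3
            if ignored_by(current, original):    # step 3a
                break
            last_found = current                 # step 3b
        parent = os.path.dirname(current)
        if parent == current:                    # step 4: reached "/"
            break
        current = parent                         # step 5
    return last_found
```

A `None` result corresponds to the case where no valid candidate exists and a new top-level Manifest should be created in the original directory.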
> 
> 
> Checksum algorithms
> -------------------
> 
> This section is informational only. Specifying the exact set
> of supported algorithms is outside the scope of this specification.
> 
> The algorithm names reserved at the time of writing are:
> 
> - ``MD5`` [#MD5]_,
> - ``RMD160`` -- RIPEMD-160 [#RIPEMD160]_,
> - ``SHA1`` [#SHS]_,
> - ``SHA256`` and ``SHA512`` -- SHA-2 family of hashes [#SHS]_,
> - ``WHIRLPOOL`` [#WHIRLPOOL]_,
> - ``BLAKE2B`` and ``BLAKE2S`` -- BLAKE2 family of hashes [#BLAKE2]_,
> - ``SHA3_256`` and ``SHA3_512`` -- SHA-3 family of hashes [#SHA3]_,
> - ``STREEBOG256`` and ``STREEBOG512`` -- Streebog family of hashes
>   [#STREEBOG]_.
> 
> The method of introducing new hashes is defined by GLEP 59 [#GLEP59]_.
> It is recommended that any new hashes are named after the Python
> ``hashlib`` module algorithm names, transformed into uppercase.
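
The naming recommendation above can be illustrated with a small helper pair. Both function names are hypothetical, and the lowercase mapping only works for names that follow the recommendation; older reserved names such as `RMD160` or `WHIRLPOOL` predate it and would need an explicit table.

```python
import hashlib


def manifest_hash_name(hashlib_name):
    """Derive the recommended Manifest name from a hashlib algorithm name."""
    return hashlib_name.upper()


def hasher_for(manifest_name):
    """Return a hashlib object for a Manifest name that follows the rule."""
    name = manifest_name.lower()
    if name not in hashlib.algorithms_available:
        raise ValueError("no direct hashlib equivalent for %s" % manifest_name)
    return hashlib.new(name)
```

For example, the `hashlib` name `sha3_256` yields the Manifest name `SHA3_256`, matching the reserved list.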
> 
> 
> Manifest compression
> --------------------
> 
> The topic of Manifest file compression is covered by GLEP 61 [#GLEP61]_.
> This section merely addresses interoperability issues between Manifest
> compression and this specification.
> 
> The compressed Manifest files are required to be suffixed for their
> compression algorithm. This suffix should be used to recognize
> the compression and decompress Manifests transparently. The exact list
> of algorithms and their corresponding suffixes are outside the scope
> of this specification.
> 
> The top-level Manifest file must not be compressed. Since the OpenPGP
> signature covers the uncompressed text and would itself be compressed,
> the data would have to be decompressed without any prior verification.
> This could expose users e.g. to zip bombs or exploits on decompressor
> vulnerabilities.
> 
> Whenever this specification refers to sub-Manifests, they can use any
> names but are also required to use a specific compression suffix.
> The ``MANIFEST`` entries are required to specify the full name including
> compression suffix, and the verification is performed on the compressed
> file.
> 
> The specification permits uncompressed Manifests to exist alongside
> their compressed counterparts, and multiple compressed formats
> to coexist. If that is the case, the files must have the same
> uncompressed content and the implementation is free to choose any
> of the files using the same base name.
> 
> 
> Combining multiple Manifest trees (informational)
> -------------------------------------------------
> 
> This specification permits nesting multiple hierarchical Manifest trees.
> In this layout, the specific directories of the Manifest tree can
> be verified both as a part of another top-level Manifest,
> and as an independent Manifest tree (when obtained without the parent
> directory).
> 
> For this to work, the sub-Manifest file in the directory must also
> satisfy the requirements for the top-level Manifest file. That is:
> 
> - it must be named ``Manifest`` and not compressed,
> 
> - it must cover all the files in this directory and its subdirectories
>   (i.e. no files from the directory tree can be covered by parent
>   Manifest),
> 
> - if authenticity verification is desired, it must be OpenPGP-signed.
> 
> It should be noted that if such a directory is a subdirectory of a valid
> Manifest tree, the sub-Manifest needs to be valid according
> to the top-level Manifest and the OpenPGP signature is disregarded
> as detailed in `Manifest file locations and nesting`_. The top-level
> behavior is exhibited only when the directory is obtained without parent
> directories.
> 
> 
> An example Manifest file (informational)
> ----------------------------------------
> 
> An example top-level Manifest file for the Gentoo repository would have
> the following content::
> 
>     TIMESTAMP 2017-10-30T10:11:12Z
>     IGNORE distfiles
>     IGNORE local
>     IGNORE lost+found
>     IGNORE packages
>     MANIFEST app-accessibility/Manifest 14821 SHA256 1b5f.. SHA512 f7eb..
>     ...
>     MANIFEST eclass/Manifest.gz 50812 SHA256 8c55.. SHA512 2915..
>     ...
> 
> An example modern Manifest (disregarding backwards compatibility)
> for a package directory would have the following content::
> 
>     DATA SphinxTrain-0.9.1-r1.ebuild 932 SHA256 3d3b.. SHA512 be4d..
>     DATA SphinxTrain-1.0.8.ebuild 912 SHA256 f681.. SHA512 0749..
>     DATA metadata.xml 664 SHA256 97c6.. SHA512 1175..
>     DATA files/gcc.patch 816 SHA256 b56e.. SHA512 2468..
>     DATA files/gcc34.patch 333 SHA256 c107.. SHA512 9919..
>     DIST SphinxTrain-0.9.1-beta.tar.gz 469617 SHA256 c1a4.. SHA512 1b33..
>     DIST sphinxtrain-1.0.8.tar.gz 8925803 SHA256 548e.. SHA512 465d..
> 
> 
> Rationale
> =========
> 
> Stand-alone format
> ------------------
> 
> The first question that needed to be asked before proceeding with
> the design was whether the Manifest file format was supposed to be
> stand-alone, or tightly bound to the repository format.
> 
> The stand-alone format has been selected because of its three
> advantages:
> 
> 1. It is more future-proof. If an incompatible change to the repository
>    format is introduced, only developers need to upgrade the tools
>    they use to generate the Manifests. The tools used to verify
>    the updated Manifests will continue to work.
> 
> 2. It is more flexible and universal. With a dedicated tool,
>    the Manifest files can be used to sign and verify arbitrary file
>    sets.
> 
> 3. It keeps the verification tool simpler. In particular, we can easily
>    write an independent verification tool that could work on any
>    distribution without needing to depend on a package manager
>    implementation or rewrite parts of it.
> 
> Designing a stand-alone format requires that the Manifest carries enough
> information to perform the verification following all the rules specific
> to the Gentoo repository.
> 
> 
> Tree design
> -----------
> 
> The second important point of the design was determining whether
> the Manifest files should be structured hierarchically, or independently.
> Both options have their advantages.
> 
> In the hierarchical model, each sub-Manifest file is covered by a higher
> level Manifest. As a result, only the top-level Manifest has to be
> OpenPGP-signed, and subsequent Manifests need to be only verified by
> checksum stored in the parent Manifest. This has the following
> implications:
> 
> - Verifying any set of files in the repository requires using checksums
>   from the most relevant Manifests and the parent Manifests.
> 
> - The OpenPGP signature of the top-level Manifest needs to be verified
>   only once per process.
> 
> - Altering any set of files requires updating the relevant Manifests,
>   and their parent Manifests up to the top-level Manifest, and signing
>   the last one.
> 
> - As a result, the top-level Manifest changes on every commit,
>   and various middle-level Manifests change (and need to be transferred)
>   frequently.
> 
> In the independent model, each sub-Manifest file is independent
> of the parent Manifests. As a result, each of them needs to be signed
> and verified independently. However, the parent Manifests still need
> to list sub-Manifests (albeit without verification data) in order
> to detect removal or replacement of subdirectories. This has
> the following implications:
> 
> - Verifying any set of files in the repository requires using checksums
>   and verifying signatures of the most relevant Manifest files.
> 
> - Altering any set of files requires updating the relevant Manifests
>   and signing them again.
> 
> - Parent Manifests are updated only when Manifests are added or removed
>   from subdirectories. As a result, they change infrequently.
> 
> While both models have their advantages, the hierarchical model was
> selected because it reduces the number of OpenPGP operations
> (which are comparatively costly) to the minimum.
> 
> 
> Tree layout restrictions
> ------------------------
> 
> The algorithm is meant to work primarily with ebuild repositories which
> normally contain only files and directories. Directories provide
> no useful metadata for verification, and specifying special entries
> for additional file types is purposeless. Therefore, the specification
> is restricted to dealing with regular files.
> 
> The Gentoo repository does not use symbolic links. Some Gentoo
> repositories do, however. To provide a simple solution for dealing with
> symlinks without having to take care to implement special handling for
> them, the common behavior of implicitly resolving them is used.
> Therefore, symbolic links to files are stored as if they were regular
> files, and symbolic links to directories are followed as if they were
> regular directories.
> 
> Dotfiles are implicitly ignored as that is a common notion used
> in software written for POSIX systems. All other filenames require
> explicit ``IGNORE`` lines.
> 
> An ability to inject additional ignore entries is provided to account
> for site configuration affecting the repository tree -- placing
> additional files in it, skipping some of the categories from syncing.
> This configuration can extend beyond the limits of this GLEP,
> e.g. by allowing wildcards or regular expressions.
> 
> The algorithm is restricted to work on a single filesystem. This is
> mostly relevant when scanning for top-level Manifest -- we do not want
> to cross filesystem boundaries then. However, to ensure consistent
> bidirectional behavior we also need to ban crossing them when descending
> the tree.
> 
> The directories and files on different filesystems need to be ignored
> explicitly as implicitly skipping them would cause confusion.
> In particular, tools might then claim that a file does not exist when
> it clearly does because it was skipped due to filesystem boundaries.
> 
> 
> Filename character set restriction
> ----------------------------------
> 
> The valid set of filename characters for the Gentoo repository
> is restricted by the devmanual 'File Naming Rules' section
> [#FILE-NAMING-RULES]_, and enforced via a git hook. The valid distfile
> names are not restricted explicitly -- however, the PMS dependency
> specification syntax [#PMS-FETCH]_ implicitly makes it impossible to use
> filenames containing whitespace.
> 
> This specification aims to avoid arbitrary restrictions. For this
> reason, filename characters are only restricted by excluding three
> technically problematic groups:
> 
> 1. The backwards slash character (``\``) is used as path separator
>    on Windows systems, so it's extremely unlikely to be used in real
>    filenames. For this reason it is used to implement character
>    encoding with minimal risk of breaking backwards compatibility.
> 
> 2. The control characters can trigger special behavior in various
>    programs and prevent them from recognizing text files. In particular,
>    the NULL character (``U+0000``) is normally used to indicate the end
>    of a null-terminated string. Its use could therefore break
>    implementations written in the C language. Other control characters
>    could trigger various formatting routines, garbling text output.
> 
> 3. Whitespace characters are used to separate Manifest fields
>    and entries. While technically it would be enough to restrict the space
>    (``U+0020``) character that is normally used as the separator
>    and the newline (``U+000A``) character that is used to separate lines,
>    all whitespace characters are forbidden to avoid confusion
>    and implementation errors.
> 
> Historically, Portage tried to overcome the whitespace limitation
> by locating the size field and taking everything before it
> as filename. This was terribly fragile and even if it worked, it would
> solve the problem only partially.
> 
> To preserve compatibility with the current implementations and given
> that none of the listed characters is allowed in the foreseeable
> Gentoo uses, extended encoding support is optional. If such support
> is not provided, the implementation must unconditionally reject any
> such files. Ignoring them implicitly would be confusing, and it is
> not possible to use them in explicit ``IGNORE`` entries.
> 
> The character encoding method provides means to overcome the character
> restrictions to extend the tool usability beyond immediate Gentoo uses.
> The backslash escape form based on Python unicode strings is used
> since it can encode all characters within the Unicode range, the syntax
> is familiar to many programmers and the backwards slash character
> is extremely unlikely to appear in real filenames.
> 
> Syntax is limited to the minimum necessary to implement the encoding.
> Shorthand forms (e.g. ``\t`` or ``\\``) are omitted to avoid unnecessary
> complexity, and to reduce the risk of shell users using backslash
> to escape space directly. The ``\x`` form is limited to ``\x00..\x7F``
> range to avoid ambiguity of higher values which might be interpreted
> either as UCS-2 code points or part of a UTF-8 encoded character.
> 
> Encoding stores UCS-2/UCS-4 characters directly rather than hex-encoded
> UTF-8 string to simplify the implementation. In particular, it makes it
> possible to process the Manifest file as UTF-8 encoded text without
> having to perform additional UTF-8 decoding (and verification)
> of the escaped data.
> 
> URL-encoding was considered as an alternative. However, it could collide
> with ``DIST`` entries that are implicitly named after the URL filename
> part where URL-encoding is pretty common.
> 
> 
> File verification model
> -----------------------
> 
> The verification model aims to provide full coverage against different
> forms of attack. In particular, three different kinds of manipulation
> are considered:
> 
> 1. Alteration of the file content.
> 
> 2. Removal of a file.
> 
> 3. Addition of a new file.
> 
> In order to protect against all three, the system requires that all
> files in the repository are listed in Manifests and verified against
> them.
> 
> As a special case, ignores are allowed to account for directories
> that are not part of the repository but were traditionally placed inside
> it. Those directories were ``distfiles``, ``local`` and ``packages``.
> Ignores could also be used to skip VCS directories such as ``CVS``.
> 
> 
> Non-strict Manifest verification
> --------------------------------
> 
> Originally the Manifest2 format provided a special ``MISC`` tag that
> was used for ``metadata.xml`` and ``ChangeLog`` files. This tag
> indicated that the Manifest verification failures could be ignored for
> those files unless the package manager was working in strict mode.
> 
> The first versions of this specification continued the use of this tag.
> However, after a long debate it was decided to deprecate it along with
> the non-strict behavior, and require all files to strictly match.
> 
> Two arguments were mentioned for the usefulness of a ``MISC`` type:
> 
> 1. being able to reduce the checkout size by stripping unnecessary
>    files out, and
> 
> 2. being able to update automatically generated files locally
>    without causing unnecessary verification failures.
> 
> However, the usefulness of ``MISC`` in both cases is doubtful.
> 
> The cases for stripping unnecessary files mostly focused around space
> savings. For this purpose, stripping ``metadata.xml`` and similar files
> has little value. It is much more common for users to strip whole
> packages or categories. The ``MISC`` type is not suitable for that,
> and so a dedicated package manager mechanism needs to be developed
> instead. The same mechanism can also handle files that historically used
> the ``MISC`` type. As an example, the package manager may choose
> to generate both the rsync exclusion list and Manifest ignore list
> using a single source list.
> 
> The cases for autogenerated files involve such cache files
> as ``use.local.desc``. However, we cannot include ``md5-cache`` there
> due to security concerns which results in inconsistent cache handling.
> Furthermore, the tools were historically modified to provide stable
> output which means that their content cannot change without
> a non-``MISC`` content being changed first. This practically defeats
> the purpose of using ``MISC``.
> 
> Finally, the non-strict mode could be used as means to an attack.
> Allowing missing or modified documentation files could be used
> to spread misinformation, resulting in bad decisions made by the user.
> A modified file could also be used, e.g. to exploit vulnerabilities
> of an XML parser.
> 
> 
> Timestamp field
> ---------------
> 
> The top-level Manifest optionally allows using a ``TIMESTAMP`` tag
> to include a generation timestamp in the Manifest. A similar feature
> was originally proposed in GLEP 58 [#GLEP58]_.
> 
> A malicious third party may use the principles of exclusion or replay
> [#C08]_ to deny an update to clients, while at the same time recording
> the identity of clients to attack. The timestamp field can be used to
> detect that.
> 
> In order to provide more complete protection, the Gentoo Infrastructure
> should provide an ability to obtain the timestamps of all Manifests
> from a recent timeframe over a secure channel from a trusted source
> for comparison.
> 
> Strictly speaking, this information is provided by the various
> ``metadata/timestamp*`` files that are already present. However,
> including the value in the Manifest itself has little cost
> and provides the ability to perform the verification stand-alone.
> 
> Furthermore, some of the timestamp files are added very late
> in the distribution process, past the Manifest generation phase. Those
> files will most likely receive ``IGNORE`` entries and therefore
> be unsafe to use.
> 
> The specification permits additional timestamps in sub-Manifest files
> for local use. A generic testing tool should ignore them.
> 
> 
> New vs deprecated tags
> ----------------------
> 
> Out of the four types defined by Manifest2, only one is reused
> and the remaining three are replaced by a single, universal ``DATA``
> type.
> 
> The ``DIST`` tag is reused since the specification does not change
> anything with regard to distfile handling.
> 
> The ``EBUILD`` tag could potentially be reused for generic file
> verification data. However, it would be confusing if all the different
> data files were marked as ``EBUILD``. Therefore, an equivalent ``DATA``
> type was introduced as a replacement.
> 
> The ``MISC`` tag and the related non-strict mode have been removed
> as being of little value, as detailed in the `Non-strict Manifest
> verification`_ section.
> 
> The ``AUX`` tag is deprecated as it is redundant to ``DATA``, and has
> the limiting property of implicit ``files/`` path prefix.
> 
> 
> Finding top-level Manifest
> --------------------------
> 
> The development of a reference implementation for this GLEP has brought
> the following problem: how to find all the relevant Manifests when
> the Manifest tool is run inside a subdirectory of the repository?
> 
> One of the options would be to provide a bi-directional linking
> of Manifests via a ``PARENT`` tag. However, that would not solve
> the problem when a new Manifest file is being created.
> 
> Instead, an algorithm for iterating over parent directories is proposed.
> Since there is no obligatory explicit indicator for the top-level
> Manifest, the algorithm assumes that the top-level Manifest
> is the highest ``Manifest`` in the directory hierarchy that can cover
> the current directory. This generally makes sense since the Manifest
> files are required to provide coverage for all subdirectories, so all
> Manifests starting from that one need to be updated.
> 
> If independent Manifest trees are nested in the directory structure,
> then an ``IGNORE`` entry needs to be used to separate them.
> 
> Since sub-Manifests can use any filenames, the Manifest finding
> algorithm must not take a shortcut by simply collecting all ``Manifest``
> files found in the parent directories. Instead, it needs to retrace
> the relevant sub-Manifest files by following ``MANIFEST`` entries,
> starting from the top-level Manifest.
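
A minimal sketch of the parent-directory walk described above, in Python. The function name `find_top_level_manifest` and the `stop_at` parameter are hypothetical and not taken from gemato. The sketch only looks for files literally named ``Manifest``; it does not handle ``IGNORE`` entries separating nested trees, nor does it retrace sub-Manifests via ``MANIFEST`` entries, both of which a real tool would need.

```python
import os


def find_top_level_manifest(start_dir, stop_at=None):
    """Return the path of the highest 'Manifest' at or above start_dir.

    stop_at (hypothetical) bounds the search, e.g. at the repository
    root; without it the walk continues to the filesystem root.
    Returns None if no Manifest file is found.
    """
    current = os.path.abspath(start_dir)
    stop = os.path.abspath(stop_at) if stop_at is not None else None
    top = None
    while True:
        candidate = os.path.join(current, "Manifest")
        if os.path.isfile(candidate):
            # Keep overwriting: the last hit is the highest one.
            top = candidate
        parent = os.path.dirname(current)
        if current == stop or parent == current:
            break
        current = parent
    return top
```

Called from a package directory, this would return the repository-level ``Manifest`` rather than the package one, provided both exist and nothing above the repository carries a ``Manifest`` of its own.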
> 
> 
> Injecting ChangeLogs into the checkout
> --------------------------------------
> 
> One of the problems considered in the new Manifest format was injecting
> historical and autogenerated ChangeLogs into the repository. We normally
> don't include those files, to reduce the checkout size. However, some
> users have shown interest in them and Infra is working on providing them
> via an additional rsync module.
> 
> If such files were injected into the repository, they would cause
> verification failures of Manifests. To account for this, Infra could
> provide ``IGNORE`` entries to allow them to exist.
> 
> 
> Splitting distfile checksums from file checksums
> ------------------------------------------------
> 
> Another problem with the current Manifest format is that the checksums
> for fetched files are combined with checksums for local files
> in a single file inside the package directory. It has been specifically
> pointed out that:
> 
> - since distfiles are sometimes reused across different packages,
>   the repeating checksums are redundant [#DIST]_.
>   
> - mirror admins were interested in the possibility of verifying all
>   the distfiles with a single tool.
> 
> This specification does not provide a clean solution to this problem.
> It technically permits moving ``DIST`` entries to higher-level Manifests
> but the usefulness of such a solution is doubtful.
> 
> However, for the second problem we will probably deliver a dedicated
> tool working with this Manifest format.
> 
> 
> Hash algorithms
> ---------------
> 
> While maintaining a consistent supported hash set is important
> for interoperability, it is not a good fit for the generic layout
> of this GLEP. Furthermore, it would require updating the GLEP
> in the future every time the used algorithms change.
> 
> Instead, the specification focuses on listing the currently used
> algorithm names for interoperability, and sets a recommendation
> for consistent naming of algorithms in the future. The Python
> ``hashlib`` module is used as a reference since it is used
> as the provider of hash functions for most of the Python software,
> including Portage and PkgCore.
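
To illustrate the naming recommendation, here is a small Python sketch. It assumes that a Manifest hash name is simply the upper-cased ``hashlib`` name and that an entry has the layout ``TAG path size NAME value ...``; neither is restated in this section, so treat both as assumptions. The helper names and the default hash set are hypothetical.

```python
import hashlib


def manifest_checksums(data, hash_names=("BLAKE2B", "SHA512")):
    """Return [(NAME, hexdigest), ...] for the given bytes."""
    result = []
    for name in hash_names:
        # Assumption: Manifest name == upper-cased hashlib name.
        digest = hashlib.new(name.lower())
        digest.update(data)
        result.append((name, digest.hexdigest()))
    return result


def format_entry(tag, path, data, hash_names=("BLAKE2B", "SHA512")):
    """Build one entry line in the assumed 'TAG path size NAME value' layout."""
    fields = [tag, path, str(len(data))]
    for name, value in manifest_checksums(data, hash_names):
        fields += [name, value]
    return " ".join(fields)
```

Because the lookup goes through ``hashlib.new()``, supporting another algorithm that ``hashlib`` provides would need no code change under this naming assumption.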
> 
> The basic rules for changing hash algorithms are defined in GLEP 59
> [#GLEP59]_. The implementations can focus only on those algorithms
> that are actually used or planned to be used. It may be feasible
> to devise a new GLEP that specifies the currently used hashes (or update
> GLEP 59 accordingly).
> 
> 
> Manifest compression
> --------------------
> 
> The support for Manifest compression is introduced with minimal changes
> to the file format. The ``MANIFEST`` entries are required to provide
> the real (compressed) file path for compatibility with other file
> entries and to avoid confusion.
> 
> The compression of the top-level Manifest file has been prohibited
> as the specification currently does not provide any means of verifying
> the file prior to decompression. If the top-level Manifest is
> compressed, tooling will have to unpack the file before being able
> to verify the contents. This makes it possible for a malicious third
> party to attack the system by providing a compressed Manifest that
> exposes decompressor vulnerabilities, or a zip bomb.
> 
> The OpenPGP cleartext signature covers the contents of the Manifest,
> and is therefore compressed along with them. The possibility of using
> a detached signature has been considered but it was rejected as
> unnecessary complexity for minor gain.
> 
> Technically, a similar result could be effected via moving all the data
> into a compressed sub-Manifest in the top directory (e.g.
> ``Manifest.sub.gz``), and including a ``MANIFEST`` entry for this file
> in a signed, uncompressed top-level Manifest.
> 
> The existence of additional entries for uncompressed Manifest checksums
> was debated. However, plain entries for the uncompressed file would
> be confusing if only the compressed file existed, and conflicting
> if both uncompressed and compressed variants existed. Furthermore,
> it has been pointed out that ``DIST`` entries do not have
> an uncompressed variant either.
> 
> 
> Performance considerations
> --------------------------
> 
> Performing a full-tree verification on every sync raises some
> performance concerns for end-user systems. Initial testing has shown
> that a cold-cache verification on a btrfs file system can take around
> 4 minutes, with the process being mostly I/O bound. On the other hand,
> it can be expected that the verification will be performed directly
> after syncing, taking advantage of a warm filesystem cache.
> 
> To improve speed on I/O and/or CPU-constrained systems even further,
> the algorithms can be easily extended to perform incremental
> verification. Given that rsync does not preserve mtimes by default,
> the tool can take advantage of mtime and Manifest comparisons to recheck
> only the parts of the repository that have changed.
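
One possible shape of such an mtime-based recheck, sketched in Python. The function names and the cache layout (relative path mapped to mtime and size) are hypothetical. Note that skipping files on the basis of mtime trusts the local filesystem metadata, so this trades some assurance for speed.

```python
import os


def snapshot(root):
    """Map each file's path (relative to root) to (mtime_ns, size)."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            full = os.path.join(dirpath, filename)
            st = os.stat(full)
            state[os.path.relpath(full, root)] = (st.st_mtime_ns, st.st_size)
    return state


def files_to_recheck(root, cache):
    """Return the sorted paths whose mtime or size differs from the cache.

    cache is a snapshot() taken after the last successful verification.
    New files are reported too; files that disappeared are not handled
    here and would need a separate comparison against the Manifests.
    """
    current = snapshot(root)
    return sorted(path for path, sig in current.items()
                  if cache.get(path) != sig)
```

A tool would then run the full checksum verification only for the returned paths and refresh the cache once they pass.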
> 
> Furthermore, the package manager implementations can restrict checking
> only to the parts of the repository that are actually being used.
> 
> 
> Backwards Compatibility
> =======================
> 
> This GLEP provides optional means of preserving backwards compatibility.
> To preserve it, the following needs to hold
> for the ``Manifest`` file in every package directory:
> 
> - all files must be covered by the single ``Manifest`` file,
> 
> - all distfiles used by the package must be included,
> 
> - all files inside the ``files/`` subdirectory need to use
>   the ``AUX`` tag (rather than ``DATA``),
> 
> - all ``.ebuild`` files need to use the ``EBUILD`` tag,
> 
> - the ``metadata.xml`` and ``ChangeLog`` files need to use
>   the ``MISC`` tag,
> 
> - the Manifest can be signed to provide authenticity verification,
> 
> - an uncompressed Manifest must always exist, and a compressed Manifest
>   of identical content may be present.
> 
> Once the backwards compatibility is no longer a concern, the above
> no longer needs to hold and the deprecated tags can be removed.
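
The tag rules in the list above can be sketched as a small Python helper. The function name `compat_entry` is hypothetical, paths are assumed to be relative to the package directory with ``/`` separators, and falling back to ``DATA`` for any other file is an assumption of this sketch. Distfiles (``DIST``) are not files in the package directory and are left out.

```python
def compat_entry(rel_path):
    """Return (tag, path) for a file in a package directory.

    Follows the backwards-compatible tag choice: AUX for files/,
    EBUILD for ebuilds, MISC for metadata.xml and ChangeLog.
    """
    if rel_path.startswith("files/"):
        # AUX carries an implicit files/ prefix, so strip it here.
        return ("AUX", rel_path[len("files/"):])
    if rel_path.endswith(".ebuild"):
        return ("EBUILD", rel_path)
    if rel_path in ("metadata.xml", "ChangeLog"):
        return ("MISC", rel_path)
    # Assumption: anything else falls back to the generic type.
    return ("DATA", rel_path)
```

For example, ``compat_entry("files/fix.patch")`` yields the ``AUX`` tag with the path ``fix.patch``.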
> 
> 
> Reference Implementation
> ========================
> 
> The reference implementation for this GLEP is being developed
> as the gemato project [#GEMATO]_.
> 
> 
> Credits
> =======
> 
> Thanks to all the people whose contributions were invaluable
> to the creation of this GLEP. This includes but is not limited to:
> 
> - Robin Hugh Johnson,
> - Ulrich Müller.
> 
> Additionally, thanks to Robin Hugh Johnson for the original
> MetaManifest GLEP series which served both as inspiration and source
> of many concepts used in this GLEP. Recursively, also thanks to all
> the people who contributed to the original GLEPs.
> 
> 
> References
> ==========
> 
> .. [#GLEP44] GLEP 44: Manifest2 format
>    (https://www.gentoo.org/glep/glep-0044.html)
> 
> .. [#GLEP57] GLEP 57: Security of distribution of Gentoo software
>    - Overview
>    (https://www.gentoo.org/glep/glep-0057.html)
> 
> .. [#GLEP58] GLEP 58: Security of distribution of Gentoo software
>    - Infrastructure to User distribution - MetaManifest
>    (https://www.gentoo.org/glep/glep-0058.html)
> 
> .. [#GLEP59] GLEP 59: Manifest2 hash policies and security implications
>    (https://www.gentoo.org/glep/glep-0059.html)
> 
> .. [#GLEP60] GLEP 60: Manifest2 filetypes
>    (https://www.gentoo.org/glep/glep-0060.html)
> 
> .. [#GLEP61] GLEP 61: Manifest2 compression
>    (https://www.gentoo.org/glep/glep-0061.html)
> 
> .. [#UNICODE] The Unicode standard
>    (https://unicode.org/versions/latest/)
> 
> .. [#PMS-FETCH] Package Manager Specification: Dependency Specification
>    Format - SRC_URI
>    (https://projects.gentoo.org/pms/6/pms.html#x1-940008.2.10)
> 
> .. [#FILE-NAMING-RULES] Ebuild File Format -- Gentoo Development Guide
>    (https://devmanual.gentoo.org/ebuild-writing/file-format/#file-naming-rules)
> 
> .. [#MD5] RFC1321: The MD5 Message-Digest Algorithm
>    (https://www.ietf.org/rfc/rfc1321.txt)
> 
> .. [#RIPEMD160] The hash function RIPEMD-160
>    (https://homes.esat.kuleuven.be/~bosselae/ripemd160.html)
> 
> .. [#SHS] FIPS PUB 180-4: Secure Hash Standard (SHS)
>    (http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf)
> 
> .. [#WHIRLPOOL] The WHIRLPOOL Hash Function
>    (http://www.larc.usp.br/~pbarreto/WhirlpoolPage.html)
> 
> .. [#BLAKE2] BLAKE2 -- fast secure hashing
>    (https://blake2.net/)
> 
> .. [#SHA3] FIPS PUB 202: SHA-3 Standard: Permutation-Based Hash
>    and Extendable-Output Functions
>    (http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.202.pdf)
> 
> .. [#STREEBOG] GOST R 34.11-2012: Streebog Hash Function
>    (https://www.streebog.net/)
> 
> .. [#C08] Cappos, J et al. (2008). "Attacks on Package Managers"
>    (https://www2.cs.arizona.edu/stork/packagemanagersecurity/attacks-on-package-managers.html)
> 
> .. [#DIST] According to Robin H. Johnson, 8.4% of all DIST entries
>    at the time of writing are duplicate, representing 2 MiB
>    out of 25 MiB of DIST entries altogether.
> 
> .. [#GEMATO] gemato: Gentoo Manifest Tool
>    (https://github.com/mgorny/gemato/)
> 
> 
> Copyright
> =========
> This work is licensed under the Creative Commons Attribution-ShareAlike 3.0
> Unported License. To view a copy of this license, visit
> http://creativecommons.org/licenses/by-sa/3.0/.
> 
> -- 
> Best regards,
> Michał Górny
> 
> 

-- 
Fabian Groffen
Gentoo on a different level

* Re: [gentoo-dev] [RFC] GLEP 74 post-Council review update [v5]
  2017-12-01 11:30   ` Fabian Groffen
@ 2017-12-01 12:32     ` Michał Górny
  0 siblings, 0 replies; 23+ messages in thread
From: Michał Górny @ 2017-12-01 12:32 UTC (permalink / raw
  To: gentoo-dev

On Fri, 01.12.2017 at 12:30 +0100, Fabian Groffen wrote:
> Hi,
> 
> While trying to implement full tree Manifests for the Prefix tree, I ran
> into the following:
> 
> Would it be possible to add a section to define what directories receive
> what kind of Manifest?
> 
> I mean in particular what is encoded in gemato/profile.py, the metadata
> directory is an interesting mix and match of subdirectories that have a
> Manifest of their own, and subdirectories whose content is included in
> the Manifest at the metadata level.
> 
> More specifically, the current GLEP doesn't seem to mention
> which directories should have their own Manifest.  It would be
> good to know if for instance adding Manifest(.gz) to
> metadata/install-qa-check.d is ok as per GLEP or not (and if so, the
> consumer of that directory should be fixed to ignore the Manifest*
> files, instead of barking it can't source the gz file or doesn't get
> it).

It's on purpose, to allow us to create Manifests as we see a need for
them. The GLEP permits every directory to have its own Manifest file.
If some directory can't receive one, it's a limitation enforced
by something else and tracking all of those does not really fit
the purpose of this GLEP.

>   Also, what if someone would want to include all entries in the
> top-level Manifest, would that be OK (albeit stupid I guess)?

It is ok, although it will probably be quite slow and memory-consuming
when doing only partial validation.

> I think it would be a good addition to specify (for a Gentoo tree) what
> directories receive a Manifest file and what their content is.

No, it wouldn't. It would add a lot of complexity and prevent us from
doing minor modifications without having to update the GLEP. It is
flexible by design, and it should stay that way.

Furthermore, if we specified that, I'm pretty sure some people would
decide it's set in stone and start writing stupid implementations
that rely on the presence of Manifest files in some directories and
their absence in others.

> In addition to this, because it is related, it would be nice to also
> document the IGNORE entries that seem present at the top-level and
> metadata-level, or specify where they would come from for the Gentoo
> case.

-- 
Best regards,
Michał Górny



end of thread, other threads:[~2017-12-01 12:32 UTC | newest]

Thread overview: 23+ messages
2017-11-16 10:19 [gentoo-dev] [RFC] GLEP 74 post-Council review update Michał Górny
2017-11-17 20:37 ` Daniel Campbell
2017-11-20 17:24   ` Michał Górny
2017-11-20 18:42 ` [gentoo-dev] [RFC] GLEP 74 post-Council review update [v2] Michał Górny
2017-11-20 21:37   ` Ulrich Mueller
2017-11-21  6:30     ` Ulrich Mueller
2017-11-21 17:14     ` Michał Górny
2017-11-21 20:28       ` Ulrich Mueller
2017-11-21 21:13         ` Michał Górny
2017-11-21 21:48           ` Ulrich Mueller
2017-11-21 23:51             ` Michał Górny
2017-11-22  5:43               ` Ulrich Mueller
2017-11-22  2:59   ` R0b0t1
2017-11-22  8:02     ` Michał Górny
2017-11-22 16:38       ` R0b0t1
2017-11-21 17:26 ` [gentoo-dev] [RFC] GLEP 74 post-Council review update [v3] Michał Górny
2017-11-21 18:20   ` Ulrich Mueller
2017-11-21 18:22     ` Michał Górny
2017-11-22 16:54 ` [gentoo-dev] [RFC] GLEP 74 post-Council review update [v4] Michał Górny
2017-11-22 20:41   ` Ulrich Mueller
2017-11-23 20:53 ` [gentoo-dev] [RFC] GLEP 74 post-Council review update [v5] Michał Górny
2017-12-01 11:30   ` Fabian Groffen
2017-12-01 12:32     ` Michał Górny
