From: "Michał Górny"
To: gentoo-commits@lists.gentoo.org
Reply-To: gentoo-dev@lists.gentoo.org, "Michał Górny"
Message-ID: <1511642957.b00ade7b6467a3ae066d66f6e4ce71fb10309710.mgorny@gentoo>
Subject: [gentoo-commits] data/glep:master commit in: /
X-VCS-Repository: data/glep
X-VCS-Files: glep-0074.rst
X-VCS-Directories: /
X-VCS-Committer: mgorny
X-VCS-Committer-Name: Michał Górny
X-VCS-Revision: b00ade7b6467a3ae066d66f6e4ce71fb10309710
X-VCS-Branch: master
Date: Sat, 25 Nov 2017 20:49:36 +0000 (UTC)

commit:     b00ade7b6467a3ae066d66f6e4ce71fb10309710
Author:     Michał Górny gentoo org>
AuthorDate: Wed Nov 22 11:40:34 2017 +0000
Commit:     Michał Górny gentoo org>
CommitDate: Sat Nov 25 20:49:17 2017 +0000
URL:        https://gitweb.gentoo.org/data/glep.git/commit/?id=b00ade7b

glep-0074: Provide encoding for disallowed characters

 glep-0074.rst | 75 ++++++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 56 insertions(+), 19 deletions(-)

diff --git a/glep-0074.rst b/glep-0074.rst
index b0daa05..3dc6730 100644
--- a/glep-0074.rst
+++ b/glep-0074.rst
@@ -70,7 +70,8 @@ other space-separated values.

 Unless specified otherwise, the paths used in the Manifest files are
 relative to the directory containing the Manifest file. The paths
-must not reference the parent directory (``..``).
+must not reference the parent directory (``..``). Forward slash (``/``)
+is used as path component separator.

 The Manifest files use UTF-8 encoding.

@@ -132,13 +133,35 @@ are not otherwise ignored reside on a different filesystem, or symbolic
 links point to targets on a different filesystem, they must be
 explicitly excluded via ``IGNORE``.

-All paths specified in the Manifest file must consist of characters
+
+Path and filename encoding
+--------------------------
+
+The path fields in the Manifest file must consist of characters
 corresponding to valid UTF-8 code points excluding the NULL character
 (``U+0000``), the backwards slash (``\``) and characters classified
 as whitespace in the current version of the Unicode standard
-[#UNICODE]_. It is an error to use Manifest files in directories
-containing files whose names contain the disallowed characters.
-The forward slash (``/``) must be used as path separator.
+[#UNICODE]_.
+
+Any of the excluded characters that are present in path must be encoded
+using one of the following escape sequences:
+
+- characters in the ``U+0000`` to ``U+007F`` range can be encoded
+  as ``\xHH`` where ``HH`` specifies the zero-padded, hexadecimal
+  character code,
+
+- characters in the ``U+0000`` to ``U+FFFF`` range can be encoded
+  as ``\uHHHH`` where ``HHHH`` specifies the zero-padded, hexadecimal
+  character code,
+
+- characters in the UCS-4 range can be encoded as ``\UHHHHHHHH``
+  where ``HHHHHHHH`` specifies the zero-padded, hexadecimal character
+  code.
+
+It is invalid for backwards slash to be used in any other context,
+and a backwards slash present in filename must be encoded. Backwards
+slash used as path component separator should be replaced by forward
+slash instead.


 File verification
@@ -563,7 +586,7 @@ specification syntax [#PMS-FETCH]_ implicitly makes it impossible to use
 filenames containing whitespace.

 This specification aims to avoid arbitrary restrictions. For this
-reason, filename characters are only restricted by excluding two
+reason, filename characters are only restricted by excluding three
 technically problematic groups:

 1. The NULL character (``U+0000``) is normally used to indicate the end
@@ -571,12 +594,10 @@ technically problematic groups:
    written using C. Furthermore, it is not allowed in any known
    filesystem.

-2. The backwards slash character (``\``) is frequently used as an escape
-   character, in particular in the languages derived from C and in shell
-   script. Furthermore, it is used as path separator on Windows systems.
-   It is forbidden to avoid implementation mistakes (in particular,
-   attempting to use it to escape whitespace or as path separator
-   on Windows) but also reserved for possible future extension.
+2. The backwards slash character (``\``) is used as path separator
+   on Windows systems, so it's extremely unlikely to be used in real
+   filenames. For this reason it is used to implement character
+   encoding with minimal risk of breaking backwards compatibility.

 3. Whitespace characters are used to separate Manifest fields and
    entries. While technically it would be enough to restrict space
@@ -585,18 +606,34 @@ technically problematic groups:
    all whitespace characters are forbidden to avoid confusion and
    implementation errors.

-While the specification could be extended to allow such filenames
-by using some form of escaping, there is currently no apparent need
-for such a feature.
-
 Historically, Portage attempted to overcome the whitespace limitation
 by attempting to locate the size field and take everything before it
 as filename. This was terribly fragile and even if it worked, it would
 solve the problem only partially.

-Since the same restrictions apply to ``IGNORE`` rules, it is currently
-not possible to either list or ignore the file using whitespace
-characters. Therefore, the presence of such files is forbidden entirely.
+The character encoding method provides means to overcome the character
+restrictions to extend the tool usability beyond immediate Gentoo uses.
+The backslash escape form based on Python unicode strings is used
+since it can encode all characters within the Unicode range, the syntax
+is familiar to many programmers and the backwards slash character
+is extremely unlikely to appear in real filenames.
+
+Syntax is limited to the minimum necessary to implement the encoding.
+Shorthand forms (e.g. ``\t`` or ``\\``) are omitted to avoid unnecessary
+complexity, and to reduce the risk of shell users using backslash
+to escape space directly. The ``\x`` form is limited to ``\x00..\x7F``
+range to avoid ambiguity of higher values which might be interpreted
+either as UCS-2 code points or part of a UTF-8 encoded character.
+
+Encoding stores UCS-2/UCS-4 characters directly rather than hex-encoded
+UTF-8 string to simplify the implementation. In particular, it makes it
+possible to process the Manifest file as UTF-8 encoded text without
+having to perform additional UTF-8 decoding (and verification)
+of the escaped data.
+
+URL-encoding was considered as an alternative. However, it could collide
+with ``DIST`` entries that are implicitly named after the URL filename
+part where URL-encoding is pretty common.


 File verification model