public inbox for gentoo-dev@lists.gentoo.org
* [gentoo-dev] News item: LINGUAS USE_EXPAND renamed to L10N
@ 2016-06-06  0:22 Mart Raudsepp
  2016-06-06  6:18 ` Michał Górny
                   ` (3 more replies)
  0 siblings, 4 replies; 16+ messages in thread
From: Mart Raudsepp @ 2016-06-06  0:22 UTC (permalink / raw
  To: gentoo-dev; +Cc: pr, mgorny

First draft of news item for proceeding with LINGUAS USE_EXPAND rename
to L10N independently of the INSTALL_MASK feature additions.

I hope English natives will improve the sentence flow and grammar here
:)
Perhaps there's also a better title than with the technical USE_EXPAND
mention.


Title: LINGUAS USE_EXPAND renamed to L10N
Author: Mart Raudsepp <leio@gentoo.org>
Content-Type: text/plain
Posted: 2016-06-06
Revision: 1
News-Item-Format: 1.0

The LINGUAS USE_EXPAND has been renamed to L10N, to avoid a conceptual
clash with the standard gettext LINGUAS behaviour.
L10N controls which extra localization support will be installed.
This is usually used in case of extra downloads of language packs.

If you have set LINGUAS in your make.conf, you should either copy or
rename it to L10N, depending on whether you want to filter the supported
languages at build time via the gettext LINGUAS environment variable
behaviour described below. Note that this filtering affects not only
the installed gettext catalog files (*.mo), but also translation lines
in files that are always shipped (e.g. *.desktop).

LINGUAS maintains the standard gettext behaviour and will now work as
expected with all package managers. It controls which language
translations are built and installed. An unset value means all
available, an empty value means none, and a value can be an unordered
list of gettext language codes, with or without country codes.
Usually two-letter language codes suffice, but they can be narrowed with
country codes in 'll_CC' format, where 'll' is the language code
and 'CC' is the country code, e.g. en_GB. Some rare languages also have
three-letter language codes.
If you want English with a set LINGUAS, it is suggested to list it with
the desired country code, in case the default is not the usual en_US.
It is also common to list "en" then, in case a package is natively
written in a different language, but does provide an English translation
for whichever country.
A list of LINGUAS language codes is available at
http://www.gnu.org/software/gettext/manual/gettext.html#Language-Codes
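
A minimal shell sketch of the three cases just described (unset, empty,
list). The function name select_catalogs and the catalog names are made
up for illustration, and the prefix match is a simplification of what
gettext-based build systems do, so that a requested en_GB also keeps a
plain en catalog:

```shell
# Hypothetical helper: select_catalogs CATALOG...
# Prints the catalogs that would be kept under the current LINGUAS.
select_catalogs() {
    # Unset LINGUAS: keep every available catalog.
    if [ -z "${LINGUAS+set}" ]; then
        printf '%s\n' "$@"
        return 0
    fi
    # Set LINGUAS (possibly empty): keep a catalog only if some
    # requested entry equals it or is a more specific form of it.
    for catalog in "$@"; do
        for wanted in $LINGUAS; do
            case "$wanted" in
                "$catalog"*) printf '%s\n' "$catalog"; break ;;
            esac
        done
    done
}

LINGUAS="en_GB de"
select_catalogs de en en_GB fr   # prints de, en and en_GB, one per line
```

With LINGUAS set but empty, the inner loop never runs and nothing is
kept, which matches "an empty value means none".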

Note that LINGUAS affects build time, and thus filters what ends up
in binary packages. If you are building generic binary packages that
should support all available languages, you should not set LINGUAS.

If you have per-package customizations of LINGUAS USE_EXPAND, you
should also rename those from LINGUAS to L10N. This typically means
renaming linguas_* to l10n_*.
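
As a rough sketch of that per-package rename, assuming the
customizations live in a package.use-style file where each flag is
spelled linguas_<code>, a small sed filter could do the rewrite. The
function name rename_linguas_flags and the sample entry are
hypothetical:

```shell
# Hypothetical helper: rewrite linguas_* USE flags to l10n_* on stdin.
# The optional leading "-" of a disabled flag is preserved.
rename_linguas_flags() {
    sed -E 's/(^|[[:space:]])(-?)linguas_/\1\2l10n_/g'
}

printf 'app-office/libreoffice linguas_de -linguas_fr\n' | rename_linguas_flags
# prints: app-office/libreoffice l10n_de -l10n_fr
```

Reviewing the output before writing it back over the original file is
advisable, since this only renames the prefix and does not check the
language codes themselves.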

https://wiki.gentoo.org/wiki/Localization/Guide has also been updated
to reflect this change.


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [gentoo-dev] News item: LINGUAS USE_EXPAND renamed to L10N
  2016-06-06  0:22 [gentoo-dev] News item: LINGUAS USE_EXPAND renamed to L10N Mart Raudsepp
@ 2016-06-06  6:18 ` Michał Górny
  2016-06-06  6:47 ` Ulrich Mueller
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 16+ messages in thread
From: Michał Górny @ 2016-06-06  6:18 UTC (permalink / raw
  To: Mart Raudsepp; +Cc: gentoo-dev, pr

[-- Attachment #1: Type: text/plain, Size: 3333 bytes --]

On Mon, 06 Jun 2016 03:22:34 +0300
Mart Raudsepp <leio@gentoo.org> wrote:

> First draft of news item for proceeding with LINGUAS USE_EXPAND rename
> to L10N independently of the INSTALL_MASK feature additions.
> 
> I hope English natives will improve the sentence flow and grammar here
> :)
> Perhaps there's also a better title than with the technical USE_EXPAND
> mention.
> 
> 
> Title: LINGUAS USE_EXPAND renamed to L10N
> Author: Mart Raudsepp <leio@gentoo.org>
> Content-Type: text/plain
> Posted: 2016-06-06
> Revision: 1
> News-Item-Format: 1.0
> 
> The LINGUAS USE_EXPAND has been renamed to L10N, to avoid a conceptual
> clash with the standard gettext LINGUAS behaviour.
> L10N controls which extra localization support will be installed.
> This is usually used in case of extra downloads of language packs.
> 
> If you have set LINGUAS in your make.conf, you should either copy or
> rename it to L10N, depending on if you want to filter the supported
> languages at build time or not via the gettext LINGUAS environment
> variable behaviour as described below. Note that this filtering does not
> affect only installed gettext catalog files (*.mo), but also lines of
> translations in an always shipped file (e.g *.desktop).
> 
> LINGUAS maintains the standard gettext behaviour and will now work as
> expected with all package managers. It controls which language
> translations are built and installed. An unset value means all
> available, an empty value means none, and a value can be an unordered
> list of gettext language codes, with or without country codes.
> Usually only two letter language codes suffice, but can be limited with
> country codes with a 'll_CC' formatting, where 'll' is the language code
> and 'CC' is the country code, e.g en_GB. Some rare languages also have
> three letter language codes.
> If you want English with a set LINGUAS, it is suggested to list it with
> the desired country code, in case the default is not the usual en_US.
> It is also common to list "en" then, in case a package is natively
> written in a different language, but does provide an English translation
> for whichever country.
> A list of LINGUAS language codes is available at
> http://www.gnu.org/software/gettext/manual/gettext.html#Language-Codes
> 
> Note that LINGUAS affects build time, and thus filters what ends up
> in binary packages. If you are building generic binary packages that
> should support all available language, you should not set LINGUAS.

After such a long explanation of how LINGUAS works, you almost
naturally expect an explanation of what goes into L10N and how it works.

And while at it, you might also mention that with a new enough Portage
you can do an exclusive INSTALL_MASK, and that it does not affect binary
packages.

> If you have per-package customizations of LINGUAS USE_EXPAND, you
> should also rename those from LINGUAS to L10N. This typically means
> renaming linguas_* to l10n_*.
> 
> https://wiki.gentoo.org/wiki/Localization/Guide has also been updated
> to reflect this change.

...or alternatively, reduce the news item to a paragraph on each,
and direct to wiki (and info gettext) for more detailed explanations.

-- 
Best regards,
Michał Górny
<http://dev.gentoo.org/~mgorny/>

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 949 bytes --]


* Re: [gentoo-dev] News item: LINGUAS USE_EXPAND renamed to L10N
  2016-06-06  0:22 [gentoo-dev] News item: LINGUAS USE_EXPAND renamed to L10N Mart Raudsepp
  2016-06-06  6:18 ` Michał Górny
@ 2016-06-06  6:47 ` Ulrich Mueller
  2016-06-06  8:43   ` Chí-Thanh Christopher Nguyễn
  2016-06-07 18:28 ` [gentoo-dev] News item: LINGUAS USE_EXPAND renamed to L10N Michał Górny
  2016-06-19 22:18 ` Ulrich Mueller
  3 siblings, 1 reply; 16+ messages in thread
From: Ulrich Mueller @ 2016-06-06  6:47 UTC (permalink / raw
  To: gentoo-dev; +Cc: pr, mgorny

[-- Attachment #1: Type: text/plain, Size: 873 bytes --]

>>>>> On Mon, 06 Jun 2016, Mart Raudsepp wrote:

> Usually only two letter language codes suffice, but can be limited with
> country codes with a 'll_CC' formatting, where 'll' is the language code
> and 'CC' is the country code, e.g en_GB. Some rare languages also have
> three letter language codes.

s/country code/territory code/g

Question related to this, do we take the opportunity to standardise
the values? Looks like the vast majority follows
language[_territory][@modifier] specified by POSIX [1] but some don't.

Also there are a few duplicates, like sr@Latn / sr@latin and uz@Cyrl /
uz@cyrillic. I suggest that we adhere to the BCP 47 [2] names if
possible (which would be Latn and Cyrl for the examples mentioned).

Ulrich


[1] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08_02
[2] http://www.rfc-editor.org/rfc/bcp/bcp47.txt

[-- Attachment #2: Type: application/pgp-signature, Size: 490 bytes --]


* Re: [gentoo-dev] News item: LINGUAS USE_EXPAND renamed to L10N
  2016-06-06  6:47 ` Ulrich Mueller
@ 2016-06-06  8:43   ` Chí-Thanh Christopher Nguyễn
  2016-06-06  9:17     ` Ulrich Mueller
  0 siblings, 1 reply; 16+ messages in thread
From: Chí-Thanh Christopher Nguyễn @ 2016-06-06  8:43 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 1022 bytes --]

Ulrich Mueller wrote:
>>>>>> On Mon, 06 Jun 2016, Mart Raudsepp wrote:
>
>> Usually only two letter language codes suffice, but can be limited with
>> country codes with a 'll_CC' formatting, where 'll' is the language code
>> and 'CC' is the country code, e.g en_GB. Some rare languages also have
>> three letter language codes.
>
> s/country code/territory code/g
>
> Question related to this, do we take the opportunity to standardise
> the values? Looks like the vast majority follows
> language[_territory][@modifier] specified by POSIX [1] but some don't.

What do we do with locales that don't fit into this scheme? Catalan Valencian
is one such locale.
Packages currently use modifiers (ca@valencia), an ISO 3166-1 reserved area
(ca_XV), or something entirely different (ca_valencia).
ISO 3166-2:ES defines ES-VC as the region code, so maybe ca_ES-VC would be
best, though a quick Google search didn't find any major usage of that either.


Best regards,
Chí-Thanh Christopher Nguyễn



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]


* Re: [gentoo-dev] News item: LINGUAS USE_EXPAND renamed to L10N
  2016-06-06  8:43   ` Chí-Thanh Christopher Nguyễn
@ 2016-06-06  9:17     ` Ulrich Mueller
  2016-06-06 15:18       ` Chí-Thanh Christopher Nguyễn
  0 siblings, 1 reply; 16+ messages in thread
From: Ulrich Mueller @ 2016-06-06  9:17 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 998 bytes --]

>>>>> On Mon, 6 Jun 2016, Chí-Thanh Christopher Nguyễn wrote:

> Ulrich Mueller wrote:
>> Question related to this, do we take the opportunity to standardise
>> the values? Looks like the vast majority follows
>> language[_territory][@modifier] specified by POSIX [1] but some
>> don't.

> What do we do with locales that don't fit into this scheme? Catalan
> Valencian is one such locale.
> Packages currently use modifiers (ca@valencia) or ISO 3166-1
> reserved area (ca_XV) or something entirely different (ca_valencia).

According to [1], "valencia" is a valid variant subtag, therefore
ca@valencia should be fine.

> ISO 3166-1:ES defines ES-VC as region code, so maybe ca_ES-VC would
> be best. Though a quick Google search didn't find any major usage of
> that either.

Neither XV nor ES-VC are registered as a subtag though, so presumably
these should be avoided.

Ulrich

[1] http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry

[-- Attachment #2: Type: application/pgp-signature, Size: 490 bytes --]


* Re: [gentoo-dev] News item: LINGUAS USE_EXPAND renamed to L10N
  2016-06-06  9:17     ` Ulrich Mueller
@ 2016-06-06 15:18       ` Chí-Thanh Christopher Nguyễn
  2016-06-06 18:07         ` Ulrich Mueller
  0 siblings, 1 reply; 16+ messages in thread
From: Chí-Thanh Christopher Nguyễn @ 2016-06-06 15:18 UTC (permalink / raw
  To: gentoo-dev

Ulrich Mueller wrote:
>>>>>> On Mon, 6 Jun 2016, Chí-Thanh Christopher Nguyễn wrote:
>> Ulrich Mueller wrote:
>>> Question related to this, do we take the opportunity to standardise
>>> the values? Looks like the vast majority follows
>>> language[_territory][@modifier] specified by POSIX [1] but some
>>> don't.
>> What do we do with locales that don't fit into this scheme? Catalan
>> Valencian is one such locale.
>> Packages currently use modifiers (ca@valencia) or ISO 3166-1
>> reserved area (ca_XV) or something entirely different (ca_valencia).
> According to [1], "valencia" is a valid variant subtag, therefore
> ca@valencia should be fine.
>
>> ISO 3166-1:ES defines ES-VC as region code, so maybe ca_ES-VC would
>> be best. Though a quick Google search didn't find any major usage of
>> that either.
> Neither XV nor ES-VC are registered as a subtag though, so presumably
> these should be avoided.

I'm not totally convinced yet.
Following the BCP-47 spec the format is

Language-Tag  = langtag             ; normal language tags
langtag       = language
                  ["-" script]
                  ["-" region]
                  *("-" variant)
                  *("-" extension)
                  ["-" privateuse]

So using the language ca, region es, and variant valencia, the BCP-47 
language tag is ca-es-valencia (or ca-valencia if you omit the region).

POSIX.1-2008[2] as you mentioned defines a slightly different format for 
locales

language[_territory][.codeset]

Only LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and 
LC_TIME additionally accept specification of a modifier.

[language[_territory][.codeset][@modifier]]

Where territory is implementation defined and the modifier "select[s] a 
specific instance of localization data within a single category". Which 
I think does not match what we want with "valencia" variant of the "ca" 
language.

Hence I think a POSIX locale cannot handle Catalan Valencian, unless
territory is made to accept ISO 3166-2 region subdivisions.


Best regards,
Chí-Thanh Christopher Nguyễn

[1] https://tools.ietf.org/rfc/bcp/bcp47.txt
[2] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08_02




* Re: [gentoo-dev] News item: LINGUAS USE_EXPAND renamed to L10N
  2016-06-06 15:18       ` Chí-Thanh Christopher Nguyễn
@ 2016-06-06 18:07         ` Ulrich Mueller
  2016-06-07 11:12           ` Ulrich Mueller
  2016-06-07 20:19           ` Chí-Thanh Christopher Nguyễn
  0 siblings, 2 replies; 16+ messages in thread
From: Ulrich Mueller @ 2016-06-06 18:07 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 2231 bytes --]

>>>>> On Mon, 6 Jun 2016, Chí-Thanh Christopher Nguyễn wrote:

> I'm not totally convinced yet.
> Following the BCP-47 spec the format is

> Language-Tag  = langtag             ; normal language tags
> langtag       = language
>                   ["-" script]
>                   ["-" region]
>                   *("-" variant)
>                   *("-" extension)
>                   ["-" privateuse]

> So using the language ca, region es, and variant valencia, the
> BCP-47 language tag is ca-es-valencia (or ca-valencia if you omit
> the region).

Right. Or rather, ca-ES-valencia for the former, because all caps are
preferred for the region tag.
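
To make the composition concrete, here is a small shell sketch that
joins language, region and variant in the order the grammar gives and
upper-cases the region. The function name bcp47_tag is hypothetical,
and script, extension and private-use subtags are left out:

```shell
# Hypothetical helper: bcp47_tag LANGUAGE [REGION] [VARIANT]
# Joins the non-empty subtags with "-", upper-casing the region.
bcp47_tag() {
    tag=$1
    if [ -n "${2:-}" ]; then
        tag="$tag-$(printf '%s' "$2" | tr '[:lower:]' '[:upper:]')"
    fi
    if [ -n "${3:-}" ]; then
        tag="$tag-$3"
    fi
    printf '%s\n' "$tag"
}

bcp47_tag ca es valencia   # prints ca-ES-valencia
bcp47_tag ca "" valencia   # prints ca-valencia
```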

> POSIX.1-2008[2] as you mentioned defines a slightly different format
> for locales

> language[_territory][.codeset]

> Only LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and 
> LC_TIME additionally accept specification of a modifier.

> [language[_territory][.codeset][@modifier]]

> Where territory is implementation defined and the modifier
> "select[s] a specific instance of localization data within a single
> category". Which I think does not match what we want with "valencia"
> variant of the "ca" language.

As I understand it:

1. Gettext documentation says that locale names can be LL_CC or
LL_CC@VARIANT. The natural mapping to the (implementation defined)
format mentioned by POSIX seems to be that LL, CC, and VARIANT
correspond to language, territory, and modifier, respectively.

2. Language codes are taken from ISO 639, namely the two-letter code
if one exists, otherwise the three-letter code.

3. Territory codes are taken from ISO 3166-1, usually the two-letter
country codes.

4. According to Gettext documentation, "'@VARIANT' can denote any kind
of characteristics that is not already implied by the language LL and
the country CC." (So IIUC the BCP-47 variant "valencia" would become
"@valencia".)
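
The mapping in points 1 to 4 can be sketched in shell as a split of
LL_CC@VARIANT into its three parts. The function name split_locale is
hypothetical, a "-" stands for an absent part, and a .codeset suffix is
not handled:

```shell
# Hypothetical helper: split_locale NAME
# Prints "language territory modifier", using "-" for an absent part.
split_locale() {
    name=$1
    modifier=-
    territory=-
    # Everything after the first "@" is the modifier (gettext's VARIANT).
    case "$name" in
        *@*) modifier=${name#*@}; name=${name%%@*} ;;
    esac
    # Everything after the first "_" is the territory (gettext's CC).
    case "$name" in
        *_*) territory=${name#*_}; name=${name%%_*} ;;
    esac
    printf '%s %s %s\n' "$name" "$territory" "$modifier"
}

split_locale en_GB         # prints: en GB -
split_locale ca@valencia   # prints: ca - valencia
split_locale sr_RS@latin   # prints: sr RS latin
```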

> Hence I think POSIX locale cannot handle Catalan Valencian, unless
> territory is made accept ISO3166-2 region subdivisions.

I haven't found any mention or usage of ISO 3166-2 region subdivisions
in the context of locale. Can you provide any references for this?

Ulrich

[-- Attachment #2: Type: application/pgp-signature, Size: 490 bytes --]


* Re: [gentoo-dev] News item: LINGUAS USE_EXPAND renamed to L10N
  2016-06-06 18:07         ` Ulrich Mueller
@ 2016-06-07 11:12           ` Ulrich Mueller
  2016-06-07 20:19           ` Chí-Thanh Christopher Nguyễn
  1 sibling, 0 replies; 16+ messages in thread
From: Ulrich Mueller @ 2016-06-07 11:12 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 1912 bytes --]

>>>>> On Mon, 6 Jun 2016, Ulrich Mueller wrote:

>>>>> On Mon, 6 Jun 2016, Chí-Thanh Christopher Nguyễn wrote:
>> I'm not totally convinced yet.
>> Following the BCP-47 spec the format is

>> Language-Tag  = langtag             ; normal language tags
>> langtag       = language
>> ["-" script]
>> ["-" region]
>> *("-" variant)
>> *("-" extension)
>> ["-" privateuse]

>> [...]

> As I understand it:

> 1. Gettext documentation says that locale names can be LL_CC or
> LL_CC@VARIANT. The natural mapping to the (implementation defined)
> format mentioned by POSIX seems to be that LL, CC, and VARIANT
> correspond to language, territory, and modifier, respectively.

> 2. Language codes are taken from ISO 639, namely the two-letter code
> if one exists, otherwise the three-letter code.

> 3. Territory codes are taken from ISO 3166-1, usually the two-letter
> country codes.

> 4. According to Gettext documentation, "'@VARIANT' can denote any
> kind of characteristics that is not already implied by the language
> LL and the country CC." (So IIUC the BCP-47 variant "valencia" would
> become "@valencia".)

Of course, we could also say that Gettext/POSIX syntax (especially its
variant/modifier part) is ill-defined, and use BCP-47 syntax for the
L10N USE_EXPAND instead (except that the separator would be an
underscore instead of a hyphen).

AFAICS, there would be no change at all for any of the LL or LL_CC
entries. The only ones that would change would be the (about 10) ones
containing an @ sign. For example, ca@valencia would become
ca_valencia, and sr@ijekavianlatin would become sr_Latn_ijekavsk.

Not sure how much additional code for remapping would be required.
However, my impression is that upstream usage of @VARIANT is not at
all standardised, so some remapping would be required in any case if
we want unique entries for L10N.
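
A sketch of what such a remapping could look like in shell, limited to
the names that came up in this thread. The function name
gettext_to_l10n is hypothetical, and the table is not a complete or
agreed list:

```shell
# Hypothetical remapping from gettext-style names to BCP 47 tags written
# with underscores. Only pairs mentioned in this thread are listed.
gettext_to_l10n() {
    case "$1" in
        ca@valencia)          echo ca_valencia ;;
        sr@ijekavianlatin)    echo sr_Latn_ijekavsk ;;
        sr@latin|sr@Latn)     echo sr_Latn ;;
        uz@cyrillic|uz@Cyrl)  echo uz_Cyrl ;;
        # Any other @VARIANT needs an explicit decision, so fail loudly.
        *@*)                  echo "no mapping for $1" >&2; return 1 ;;
        # LL and LL_CC entries stay unchanged.
        *)                    echo "$1" ;;
    esac
}

gettext_to_l10n sr@latin   # prints sr_Latn
gettext_to_l10n pt_BR      # prints pt_BR
```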

Ulrich

[-- Attachment #2: Type: application/pgp-signature, Size: 490 bytes --]


* Re: [gentoo-dev] News item: LINGUAS USE_EXPAND renamed to L10N
  2016-06-06  0:22 [gentoo-dev] News item: LINGUAS USE_EXPAND renamed to L10N Mart Raudsepp
  2016-06-06  6:18 ` Michał Górny
  2016-06-06  6:47 ` Ulrich Mueller
@ 2016-06-07 18:28 ` Michał Górny
  2016-06-13 18:25   ` Mart Raudsepp
  2016-06-19 22:18 ` Ulrich Mueller
  3 siblings, 1 reply; 16+ messages in thread
From: Michał Górny @ 2016-06-07 18:28 UTC (permalink / raw
  To: Mart Raudsepp; +Cc: gentoo-dev, pr

[-- Attachment #1: Type: text/plain, Size: 4890 bytes --]

On Mon, 06 Jun 2016 03:22:34 +0300
Mart Raudsepp <leio@gentoo.org> wrote:

> First draft of news item for proceeding with LINGUAS USE_EXPAND rename
> to L10N independently of the INSTALL_MASK feature additions.
> 
> I hope English natives will improve the sentence flow and grammar here
> :)
> Perhaps there's also a better title than with the technical USE_EXPAND
> mention.
> 
> 
> Title: LINGUAS USE_EXPAND renamed to L10N
> Author: Mart Raudsepp <leio@gentoo.org>
> Content-Type: text/plain
> Posted: 2016-06-06
> Revision: 1
> News-Item-Format: 1.0
> 
> The LINGUAS USE_EXPAND has been renamed to L10N, to avoid a conceptual
> clash with the standard gettext LINGUAS behaviour.
> L10N controls which extra localization support will be installed.
> This is usually used in case of extra downloads of language packs.
> 
> If you have set LINGUAS in your make.conf, you should either copy or
> rename it to L10N, depending on if you want to filter the supported
> languages at build time or not via the gettext LINGUAS environment
> variable behaviour as described below. Note that this filtering does not
> affect only installed gettext catalog files (*.mo), but also lines of
> translations in an always shipped file (e.g *.desktop).
> 
> LINGUAS maintains the standard gettext behaviour and will now work as
> expected with all package managers. It controls which language
> translations are built and installed. An unset value means all
> available, an empty value means none, and a value can be an unordered
> list of gettext language codes, with or without country codes.
> Usually only two letter language codes suffice, but can be limited with
> country codes with a 'll_CC' formatting, where 'll' is the language code
> and 'CC' is the country code, e.g en_GB. Some rare languages also have
> three letter language codes.
> If you want English with a set LINGUAS, it is suggested to list it with
> the desired country code, in case the default is not the usual en_US.
> It is also common to list "en" then, in case a package is natively
> written in a different language, but does provide an English translation
> for whichever country.
> A list of LINGUAS language codes is available at
> http://www.gnu.org/software/gettext/manual/gettext.html#Language-Codes
> 
> Note that LINGUAS affects build time, and thus filters what ends up
> in binary packages. If you are building generic binary packages that
> should support all available language, you should not set LINGUAS.
> 
> If you have per-package customizations of LINGUAS USE_EXPAND, you
> should also rename those from LINGUAS to L10N. This typically means
> renaming linguas_* to l10n_*.
> 
> https://wiki.gentoo.org/wiki/Localization/Guide has also been updated
> to reflect this change.

So here's how I would word it. I think if we combine a few different
texts, we may end up with something good ;-).

---
The LINGUAS USE flag group has been renamed to L10N, in order to avoid
a conceptual clash between the Gentoo use of the name, and a standard
environment variable used by multiple gettext-based packages. Therefore,
from now on filtering localizations is supported on three independent
levels: L10N, LINGUAS and INSTALL_MASK.

The L10N flags affect built and installed localizations of the packages
listing those flags explicitly. They are fully controlled by
the package manager, and their values are defined globally. They do not
affect the packages not listing them explicitly.

The LINGUAS variable is now passed through verbatim to the build
system. It controls the localizations built and installed by packages
that use it, and that do not override it using L10N flags. Note that
due to the design, the localization stripping is done implicitly
and the package manager cannot determine which localizations were
actually provided.

Additionally, the INSTALL_MASK improvements available in Portage 2.3.0
make it possible to filter localizations at the package merge stage. In
this case, the filtering is done on installed directories transparently,
and the build process and binary packages are not affected.

If you were using LINGUAS before, you most likely want to replace it
with L10N. If you need to strip localizations further (e.g. for embedded
systems), you may also want to set LINGUAS and/or INSTALL_MASK.
However, if you intend to provide or use binary packages, you will most
likely want to leave L10N and LINGUAS unset in order to build the most
portable binary packages, and use INSTALL_MASK to transparently strip
installed localizations on the hosts using them.

For more information, please see:
https://wiki.gentoo.org/wiki/Localization/Guide
---

Of course, we'd need to update the guide to explain all three layers
in detail.
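
For illustration, the three levels from the draft above might look like
this in make.conf. All values are examples only, a real setup would
rarely need all three at once, and the "-" exclusion syntax for
INSTALL_MASK is an assumption about how the Portage 2.3.0 feature is
spelled, so it should be checked against make.conf(5):

```shell
# /etc/portage/make.conf (illustrative values only)

# Level 1: USE_EXPAND flags, tracked by the package manager.
L10N="de pl"

# Level 2: passed through to gettext-based build systems. Leave it unset
# when building binary packages meant to carry all translations.
LINGUAS="de pl"

# Level 3: filtering at merge time, without touching binary packages.
# ASSUMPTION: a leading "-" excludes a path from the mask.
INSTALL_MASK="/usr/share/locale -/usr/share/locale/de -/usr/share/locale/pl"
```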

-- 
Best regards,
Michał Górny
<http://dev.gentoo.org/~mgorny/>

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 949 bytes --]


* Re: [gentoo-dev] News item: LINGUAS USE_EXPAND renamed to L10N
  2016-06-06 18:07         ` Ulrich Mueller
  2016-06-07 11:12           ` Ulrich Mueller
@ 2016-06-07 20:19           ` Chí-Thanh Christopher Nguyễn
  2016-06-10  9:29             ` [gentoo-dev] RFC: BCP 47 for L10N? (was: News item: LINGUAS USE_EXPAND renamed to L10N) Ulrich Mueller
  1 sibling, 1 reply; 16+ messages in thread
From: Chí-Thanh Christopher Nguyễn @ 2016-06-07 20:19 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 2545 bytes --]

Ulrich Mueller wrote:
>> POSIX.1-2008[2] as you mentioned defines a slightly different format
>> for locales
>
>> language[_territory][.codeset]
>
>> Only LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and
>> LC_TIME additionally accept specification of a modifier.
>
>> [language[_territory][.codeset][@modifier]]
>
>> Where territory is implementation defined and the modifier
>> "select[s] a specific instance of localization data within a single
>> category". Which I think does not match what we want with "valencia"
>> variant of the "ca" language.
>
> As I understand it:
>
> 1. Gettext documentation says that locale names can be LL_CC or
> LL_CC@VARIANT. The natural mapping to the (implementation defined)
> format mentioned by POSIX seems to be that LL, CC, and VARIANT
> correspond to language, territory, and modifier, respectively.
>
> 2. Language codes are taken from ISO 639, namely the two-letter code
> if one exists, otherwise the three-letter code.

Yes.

> 3. Territory codes are taken from ISO 3166-1, usually the two-letter
> country codes.

Yes.

> 4. According to Gettext documentation, "'@VARIANT' can denote any kind
> of characteristics that is not already implied by the language LL and
> the country CC." (So IIUC the BCP-47 variant "valencia" would become
> "@valencia".)

This I think is wrong and collides with POSIX.
Modifiers are not allowed for LANG or LC_ALL in POSIX.1-2008 [1].
Section 8.2 says you can have at most one modifier field to "select a
specific instance of localization data within a single category", which I
don't think applies because Catalan Valencian is its own locale, not an
instance of an existing one. Furthermore (though that doesn't apply in our
use case), the POSIX spec lists the example
LC_COLLATE=De_DE@dict
So what if you want Catalan Valencian with dictionary order? Or if someone 
hypothetically came up with a different script?

>> Hence I think POSIX locale cannot handle Catalan Valencian, unless
>> territory is made accept ISO3166-2 region subdivisions.
>
> I haven't found any mention or usage of ISO 3166-2 region subdivisions
> in the context of locale. Can you provide any references for this?

As I wrote before, it is not used. But I think it is the only spec-compliant 
way to marry POSIX locales with Catalan Valencian. BCP-47 does it in a more 
natural way.


Best regards,
Chí-Thanh Christopher Nguyễn

[1] http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08_02


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 181 bytes --]


* [gentoo-dev] RFC: BCP 47 for L10N? (was: News item: LINGUAS USE_EXPAND renamed to L10N)
  2016-06-07 20:19           ` Chí-Thanh Christopher Nguyễn
@ 2016-06-10  9:29             ` Ulrich Mueller
  2016-06-10 10:26               ` Michał Górny
  0 siblings, 1 reply; 16+ messages in thread
From: Ulrich Mueller @ 2016-06-10  9:29 UTC (permalink / raw
  To: gentoo-dev

[-- Attachment #1: Type: text/plain, Size: 2954 bytes --]

>>>>> On Tue, 7 Jun 2016, Chí-Thanh Christopher Nguyễn wrote:

>> 4. According to Gettext documentation, "'@VARIANT' can denote any
>> kind of characteristics that is not already implied by the language
>> LL and the country CC." (So IIUC the BCP-47 variant "valencia"
>> would become "@valencia".)

> This I think is wrong and collides with POSIX.
> POSIX modifiers are not allowed for LANG or LC_ALL in
> POSIX.1-2008[1] Section 8.2 says you can have at most one modifier
> field to "select a specific instance of localization data within a
> single category", which I don't think applies because it is its own
> locale, not an instance of an existing one. Furthermore (but that
> doesn't apply in our use case), POSIX spec lists the example
> LC_COLLATE=De_DE@dict
> So what if you want Catalan Valencian with dictionary order? Or if
> someone hypothetically came up with a different script?

>> I haven't found any mention or usage of ISO 3166-2 region
>> subdivisions in the context of locale. Can you provide any
>> references for this?

> As I wrote before, it is not used. But I think it is the only
> spec-compliant way to marry POSIX locales with Catalan Valencian.
> BCP-47 does it in a more natural way.

So, trying to summarise: We cannot follow strict POSIX syntax, so our
two choices are either to stick to Gettext LL_CC@VARIANT syntax or
to change to BCP 47.

Using BCP 47 would have some advantages:
- It is a well defined standard [1] and tools for validation of
  language tags exist, e.g. [2].
- The L10N USE_EXPAND could follow usual USE flag syntax, as BCP 47
  tags contain neither underscores (which are supposed to be reserved
  as USE_EXPAND separators) nor @ signs (which PMS explicitly
  mentions as an exception for LINGUAS).
- Gettext's @VARIANT is ill-defined and conflates different
  characteristics like script and variant. There is no further
  subdivision within @VARIANT, which leads to locale names like
  sr@ijekavianlatin. Also, different upstreams use different
  conventions, like @latin and @Latn for the Latin script.
- For the vast majority of languages, identifiers are either identical
  ("de" -> "de") or they can be converted by simple shell substitution
  ("pt-BR" -> "pt_BR").
- IIUC, L10N is primarily intended to control things like additional
  language bundles of packages. Some upstreams like libreoffice
  already use BCP 47 for these.

On the other hand, there will be some cost:
- If BCP 47 tags containing a script or a variant should be used to
  generate LINGUAS, they will require explicit mapping. (OTOH, such
  mapping will also be needed if we stick to Gettext syntax but unify
  variants like "sr@latin" and "sr@Latn".)
- Different syntax for LINGUAS and L10N might be confusing to users,
  so additional documentation will be needed.
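
A sketch of how LINGUAS entries could be generated from BCP 47 tags
under this proposal: plain substitution for the common case and an
explicit table for tags with a script or variant. The function name
l10n_to_linguas is hypothetical, and the gettext spellings on the right
(e.g. sr@latin rather than sr@Latn) are assumptions, since upstreams
differ:

```shell
# Hypothetical helper: l10n_to_linguas TAG
# Maps a BCP 47 tag to a gettext-style locale name.
l10n_to_linguas() {
    case "$1" in
        # Tags with a script or variant need explicit mapping.
        ca-valencia)  echo ca@valencia ;;
        sr-Latn)      echo sr@latin ;;
        # Everything else: replace hyphens with underscores.
        *)            printf '%s\n' "$1" | sed 's/-/_/g' ;;
    esac
}

l10n_to_linguas de      # prints de
l10n_to_linguas pt-BR   # prints pt_BR
```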

Comments?

Ulrich

[1] https://tools.ietf.org/html/bcp47
[2] http://schneegans.de/lv/

[-- Attachment #2: Type: application/pgp-signature, Size: 490 bytes --]


* Re: [gentoo-dev] RFC: BCP 47 for L10N? (was: News item: LINGUAS USE_EXPAND renamed to L10N)
  2016-06-10  9:29             ` [gentoo-dev] RFC: BCP 47 for L10N? (was: News item: LINGUAS USE_EXPAND renamed to L10N) Ulrich Mueller
@ 2016-06-10 10:26               ` Michał Górny
  2016-06-10 12:09                 ` [gentoo-dev] RFC: BCP 47 for L10N? Chí-Thanh Christopher Nguyễn
  0 siblings, 1 reply; 16+ messages in thread
From: Michał Górny @ 2016-06-10 10:26 UTC (permalink / raw
  To: gentoo-dev, Ulrich Mueller

On 10 June 2016 11:29:41 CEST, Ulrich Mueller <ulm@gentoo.org> wrote:
>>>>>> On Tue, 7 Jun 2016, Chí-Thanh Christopher Nguyễn wrote:
>
>>> 4. According to Gettext documentation, "'@VARIANT' can denote any
>>> kind of characteristics that is not already implied by the language
>>> LL and the country CC." (So IIUC the BCP-47 variant "valencia"
>>> would become "@valencia".)
>
>> This I think is wrong and collides with POSIX.
>> POSIX modifiers are not allowed for LANG or LC_ALL in
>> POSIX.1-2008[1] Section 8.2 says you can have at most one modifier
>> field to "select a specific instance of localization data within a
>> single category", which I don't think applies because it is its own
>> locale, not an instance of an existing one. Furthermore (but that
>> doesn't apply in our use case), POSIX spec lists the example
>> LC_COLLATE=De_DE@dict
>> So what if you want Catalan Valencian with dictionary order? Or if
>> someone hypothetically came up with a different script?
>
>>> I haven't found any mention or usage of ISO 3166-2 region
>>> subdivisions in the context of locale. Can you provide any
>>> references for this?
>
>> As I wrote before, it is not used. But I think it is the only
>> spec-compliant way to marry POSIX locales with Catalan Valencian.
>> BCP-47 does it in a more natural way.
>
>So, trying to summarise: We cannot follow strict POSIX syntax, so our
>two choices are either to stick to Gettext LL_CC@VARIANT syntax or
>to change to BCP 47.
>
>Using BCP 47 would have some advantages:
>- It is a well defined standard [1] and tools for validation of
>  language tags exist, e.g. [2].
>- The L10N USE_EXPAND could follow usual USE flag syntax, as BCP 47
>  tags contain neither underscores (which are supposed to be reserved
>  as USE_EXPAND separators) nor @ signs (which PMS explicitly
>  mentions as an exception for LINGUAS).
>- Gettext's @VARIANT is ill-defined and conflates different
>  characteristics like script and variant. There is no further
>  subdivision within @VARIANT, which leads to locale names like
>  sr@ijekavianlatin. Also different upstreams use different
>  conventions, like @latin and @Latn for the latin script.
>- For the vast majority of languages, identifiers are either identical
>  ("de" -> "de") or they can be converted by simple shell substitution
>  ("pt-BR" -> "pt_BR").
>- IIUC, L10N is primarily intended to control things like additional
>  language bundles of packages. Some upstreams like libreoffice
>  already use BCP 47 for these.
>
>On the other hand, there will be some cost:
>- If BCP 47 tags containing a script or a variant should be used to
>  generate LINGUAS, they will require explicit mapping. (OTOH, such
>  mapping will also be needed if we stick to Gettext syntax but unify
>  variants like "sr@latin" and "sr@Latn".)
>- Different syntax for LINGUAS and L10N might be confusing to users,
>  so additional documentation will be needed.
>
>Comments?

I'd say BCP-47. The gettext tags aren't 100% defined anyway, so we'd end up having to choose between one upstream and another eventually, and map to the other.

Also, when it makes mapping L10N to LINGUAS harder, it will discourage people from abusing the latter.

>
>Ulrich
>
>[1] https://tools.ietf.org/html/bcp47
>[2] http://schneegans.de/lv/


-- 
Best regards,
Michał Górny (by phone)


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [gentoo-dev] RFC: BCP 47 for L10N?
  2016-06-10 10:26               ` Michał Górny
@ 2016-06-10 12:09                 ` Chí-Thanh Christopher Nguyễn
  0 siblings, 0 replies; 16+ messages in thread
From: Chí-Thanh Christopher Nguyễn @ 2016-06-10 12:09 UTC (permalink / raw
  To: gentoo-dev

Michał Górny wrote:
>> On the other hand, there will be some cost:
>> - If BCP 47 tags containing a script or a variant should be used to
>>   generate LINGUAS, they will require explicit mapping. (OTOH, such
>>   mapping will also be needed if we stick to Gettext syntax but unify
>>   variants like "sr@latin" and "sr@Latn".)
>> - Different syntax for LINGUAS and L10N might be confusing to users,
>>   so additional documentation will be needed.

As pointed out below, users better not mess with LINGUAS anyway. But one 
thing which might still cause confusion is that LANG and L10N use 
different syntax if we decide for BCP 47.

>>
>> Comments?
> I'd say BCP-47.

+1 for BCP-47

> The gettext tags aren't 100% defined anyway, so we'd end up having to choose between one upstream and another eventually, and map to the other.

Worse, gettext locales, while apparently designed to resemble POSIX 
locales, can change at any time without notice and may be different 
between glibc versions.

> Also, when it makes mapping L10N to LINGUAS harder, it will discourage people from abusing the latter.


Best regards,
Chí-Thanh Christopher Nguyễn



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [gentoo-dev] News item: LINGUAS USE_EXPAND renamed to L10N
  2016-06-07 18:28 ` [gentoo-dev] News item: LINGUAS USE_EXPAND renamed to L10N Michał Górny
@ 2016-06-13 18:25   ` Mart Raudsepp
  2016-06-14 21:02     ` Ulrich Mueller
  0 siblings, 1 reply; 16+ messages in thread
From: Mart Raudsepp @ 2016-06-13 18:25 UTC (permalink / raw
  To: gentoo-dev

On Tue, 2016-06-07 at 20:28, Michał Górny wrote:
> 
> So here's how I would word it. I think if we combine a few different
> texts, we may end up with something good ;-).
> 
> ---
> The LINGUAS USE flag group has been renamed to L10N, in order to
> avoid
> a conceptual clash between the Gentoo use of the name, and a standard
> environment variable used by multiple gettext-based packages. 

Where "multiple" is pretty much the whole tree of autoconf based
packages (and probably cmake too?). Pretty much anything that has a
translation :)

> Therefore,
> from now on filtering localizations is supported on three independent
> levels: L10N, LINGUAS and INSTALL_MASK.
> 
> The L10N flags affect built and installed localizations of the
> packages
> listing those flags explicitly. They are fully controlled by
> the package manager, and their values are defined globally. They do
> not
> affect the packages not listing them explicitly.
> 
> The LINGUAS variable is now verbosely passed through to the build
> system.

If we have a transition phase for the USE_EXPAND, where we'll have both
for a while, then this is not strictly true immediately. It also means
we can't roll out your portage changes to stop the special casing
before we are finished with the transition and LINGUAS is removed from
the USE_EXPAND set.


> It controls the localizations built and installed by packages
> that use it, and that do not override it using L10N flags.

Packages should not be exporting LINGUAS values based on the L10N
USE_EXPAND anyway, in my opinion; I'd maybe even make such an approach a
QA violation, though we have some odd cases in a very limited set of
packages, iirc.

> Note that
> due to the design, the localization stripping is done implicitly
> and the package manager can not determine which localizations were
> actually provided.


> Additionally, the INSTALL_MASK improvements available in Portage
> 2.3.0
> make it possible to filter localizations at package merge stage. In
> this
> case, the filtering is done on installed directories transparently,
> and the build process and binary packages are not affected.

So I take it that these install_mask groups are in for upcoming 2.3.0.
Before that, you can still do it though, just need to list paths
manually yourself.
Info on what groups are pre-shipped or whatnot would have to be on the
wiki page then, I suppose.

> If you were using LINGUAS before, you most likely want to replace it
> with L10N. If you need to strip localizations more (e.g. for embedded
> systems), you may also want to set LINGUAS and/or INSTALL_MASK.
> However, if you intend to provide or use binary package, you will
> most

I still don't like this shunning of the LINGUAS feature and relegating
it to some sort of embedded systems use case.
Most of us build systems used only by ourselves, I believe, and there
is nothing wrong with getting a gettext feature applied for free, which
reduces translation lines in .desktop, .schema and other files, and
with them the runtime mmap caches of those files.
Making it clear in the appropriate place that this is a build time
thing is of course quite fine.

If we go with BCP47, then users will want to revise their values based
on the available options in the new l10n.desc I suppose.

> likely want to leave L10N and LINGUAS unset in order to build most
> portable binary packages, and use INSTALL_MASK to transparently strip
> installed localizations on the hosts using them.

L10N unset would mean no language packs at all, unless we have a wide
default set in base profile. So unless we set a default set of all of
them in profile, this would mean opposite behavior for L10N and LINGUAS
when unset.

> For more information, please see:
> https://wiki.gentoo.org/wiki/Localization/Guide
> ---
> 
> Of course, we'd need to update the guide to explain all three layers
> in detail.


This was just my random set of thoughts, so Ulrich knows them while
writing a new version of the news item tomorrow ;)


Mart


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [gentoo-dev] News item: LINGUAS USE_EXPAND renamed to L10N
  2016-06-13 18:25   ` Mart Raudsepp
@ 2016-06-14 21:02     ` Ulrich Mueller
  0 siblings, 0 replies; 16+ messages in thread
From: Ulrich Mueller @ 2016-06-14 21:02 UTC (permalink / raw
  To: gentoo-dev

>>>>> On Mon, 13 Jun 2016, Mart Raudsepp wrote:

> This was just my random set of thoughts, so Ulrich knows them while
> writing a new version of the news item tomorrow ;)

Yeah, sorry, but I haven't found the time yet. I was wondering though,
maybe we should have two news items? One now, announcing L10N and
outlining that the plan is to retire LINGUAS (as a USE_EXPAND) after a
transition period. The second item could be sent when the transition
is complete, and could explain behaviour of LINGUAS as a gettext
variable.

Ulrich

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [gentoo-dev] News item: LINGUAS USE_EXPAND renamed to L10N
  2016-06-06  0:22 [gentoo-dev] News item: LINGUAS USE_EXPAND renamed to L10N Mart Raudsepp
                   ` (2 preceding siblings ...)
  2016-06-07 18:28 ` [gentoo-dev] News item: LINGUAS USE_EXPAND renamed to L10N Michał Górny
@ 2016-06-19 22:18 ` Ulrich Mueller
  3 siblings, 0 replies; 16+ messages in thread
From: Ulrich Mueller @ 2016-06-19 22:18 UTC (permalink / raw
  To: gentoo-dev; +Cc: pr

>>>>> On Mon, 06 Jun 2016, Mart Raudsepp wrote:

> First draft of news item for proceeding with LINGUAS USE_EXPAND rename
> to L10N independently of the INSTALL_MASK feature additions.

> I hope English natives will improve the sentence flow and grammar here
> :)
> Perhaps there's also a better title than with the technical USE_EXPAND
> mention.

Since I've seen no objections against using BCP 47 (aka IETF language
tags) for L10N, find my attempt of an updated wording below.

Ulrich


Title: L10N USE_EXPAND variable replacing LINGUAS
Author: Mart Raudsepp <leio@gentoo.org>
Author: Ulrich Müller <ulm@gentoo.org>
Content-Type: text/plain
Posted: 2016-06-19
Revision: 1
News-Item-Format: 1.0

The L10N variable is replacing LINGUAS as a USE_EXPAND, to avoid a
conceptual clash with the standard gettext LINGUAS behaviour.

L10N controls which extra localization support will be installed.
This is commonly used for downloads of additional language packs.

If you have set LINGUAS in your make.conf, you most likely want to add
its entries to L10N as well. Note that while the common two letter language
codes (like "de" or "fr") are identical, more complex entries have a
different syntax because L10N now uses IETF language tags. (For example,
"pt_BR" becomes "pt-BR" and "sr@latin" becomes "sr-Latn".) You can look
up the available codes in profiles/desc/l10n.desc in the gentoo tree.
A detailed description of language tags (aka BCP 47) can be found at: 
https://www.w3.org/International/articles/language-tags/
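
As a rough illustration of the syntax change, the sketch below converts
a LINGUAS-style value into L10N-style tags. The helper name and the
small table for "@" modifiers are hypothetical; the authoritative list
of available codes remains profiles/desc/l10n.desc:

```python
# Hypothetical sketch: convert LINGUAS entries (gettext syntax) to
# L10N entries (IETF language tags). MODIFIER_MAP is illustrative only.
MODIFIER_MAP = {
    "sr@latin": "sr-Latn",
    "sr@Latn": "sr-Latn",
}


def linguas_to_l10n(linguas: str) -> str:
    """Return a space-separated L10N value for a LINGUAS value."""
    tags = []
    for entry in linguas.split():
        if entry in MODIFIER_MAP:
            tag = MODIFIER_MAP[entry]
        elif "@" in entry:
            # Modifiers have no mechanical equivalent; ask for a manual lookup.
            raise ValueError("look up " + entry + " in l10n.desc manually")
        else:
            # "de" stays "de", "pt_BR" becomes "pt-BR".
            tag = entry.replace("_", "-")
        if tag not in tags:
            tags.append(tag)
    return " ".join(tags)
```

For example, linguas_to_l10n("de pt_BR sr@latin") yields
"de pt-BR sr-Latn", which could then be checked against l10n.desc before
being written to make.conf.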

After a transition time for packages to be converted, the LINGUAS
environment variable will maintain the standard gettext behaviour and
will work as expected with all package managers. It controls which
language translations are built and installed. An unset value means all
available, an empty value means none, and a value can be an unordered
list of gettext language codes, with or without territory codes. Usually
two letter language codes suffice, but they can be narrowed down by
territory codes in the form "ll_CC", where "ll" is the language code and
"CC" is the territory code, e.g., "en_GB". Some rare languages also have
three letter language codes. Note that LINGUAS affects not only
installed gettext catalog files (*.mo), but also translated lines in
files that are always shipped (e.g., *.desktop).
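
The unset/empty/list semantics can be modelled roughly as follows. This
is a simplified sketch, not gettext's actual code: it assumes that a
catalog is kept when some requested code starts with the catalog's name
(so "de_DE" also selects a plain "de" catalog), and the function name is
hypothetical:

```python
from typing import Iterable, List, Optional


def select_catalogs(available: Iterable[str], linguas: Optional[str]) -> List[str]:
    """Return the catalogs that would be built for a given LINGUAS value.

    linguas=None models an unset variable, "" models an empty one.
    """
    catalogs = list(available)
    # Unset: everything the package ships is built.
    if linguas is None:
        return catalogs
    # Empty value: the wanted list is empty, so nothing matches below.
    wanted = linguas.split()
    # Assumed rule: keep a catalog if a requested code equals it or is a
    # more specific variant of it (prefix match).
    return [cat for cat in catalogs if any(w.startswith(cat) for w in wanted)]
```

Under this assumed rule, a package shipping only "en_GB" would not be
matched by LINGUAS="en", which is one reason to list English together
with the desired territory code.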

If you set LINGUAS and want English, it is suggested to list it with
the desired country code, in case the default is not the usual "en_US".
It is also common to list "en" as well, in case a package is natively
written in a different language but provides an English translation
for some country. A list of LINGUAS language codes is available at:
http://www.gnu.org/software/gettext/manual/gettext.html#Language-Codes

If you have per-package customizations of the LINGUAS USE_EXPAND, you
should also rename those. This typically means changing linguas_* to
l10n_*, and possibly updating the syntax as described above.
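
A sketch of how such a per-package line could be rewritten, assuming
the usual "category/package flag flag ..." layout of package.use. The
helper name and the modifier table are hypothetical, and the result
should still be checked against l10n.desc:

```python
# Hypothetical sketch: rename linguas_* flags to l10n_* in one package.use line.
# MODIFIER_MAP is illustrative only.
MODIFIER_MAP = {"sr@latin": "sr-Latn"}


def convert_use_line(line: str) -> str:
    """Rewrite linguas_* tokens as l10n_* tokens; other tokens are kept."""
    converted = []
    for token in line.split():
        sign = "-" if token.startswith("-") else ""
        name = token[len(sign):]
        if name.startswith("linguas_"):
            code = name[len("linguas_"):]
            if code in MODIFIER_MAP:
                token = sign + "l10n_" + MODIFIER_MAP[code]
            elif "@" not in code:
                # "pt_BR" becomes "pt-BR"; plain codes stay as they are.
                token = sign + "l10n_" + code.replace("_", "-")
            # Unknown "@" modifiers are left untouched for manual review.
        converted.append(token)
    return " ".join(converted)
```

Note that this normalizes whitespace within the line, so it is meant as
an illustration of the renaming rule and not as a drop-in migration
tool.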

https://wiki.gentoo.org/wiki/Localization/Guide has also been updated to
reflect this change.

^ permalink raw reply	[flat|nested] 16+ messages in thread
