Date: Thu, 9 Mar 2006 21:25:11 +0100
From: "Kevin F. Quinn (Gentoo)" <kevquinn@gentoo.org>
To: gentoo-dev@lists.gentoo.org
Subject: Re: [gentoo-dev] enable UTF8 per default?
Message-ID: <20060309212511.2b92a73d@c1358217.kevquinn.com>
In-Reply-To: <1141124283.7962.74.camel@localhost>

On Tue, 28 Feb 2006 11:58:03 +0100
Patrick Lauer <patrick@gentoo.org> wrote:

> During that discussion we realized that having utf-8 not enabled by
> default and no utf8 fonts available by default causes lots of
> recompilation and reconfiguration.
>
> Enabling the unicode useflag in the profiles should help our
> international users and should not cause any problems. Are there any
> known bugs / problems this would trigger? Any reasons against that?

Enabling support for UTF-8 should be fine, but I'd like to sound a note
of caution about using a UTF-8 locale as a system-wide setting.

Since UTF-8 contains "holes" in the representation (i.e. some sequences
of 8-bit values are invalid), a tool that is asked to parse such invalid
data can produce unexpected results. For an example, see bug #125375 -
it turns out that invalid sequences do not match '.' in sed regular
expressions (sed-4.1.4). The other GNU tools probably behave similarly.

Up to a point this is in line with the Unicode Standard, which says,
"When a process interprets a code unit sequence which purports to be in
a Unicode character encoding form, it shall treat ill-formed code unit
sequences as an error condition, and shall not interpret such sequences
as characters." (chapter 3, section 3.2, rule C12a). This clearly means
that the invalid bytes cannot match "." (or anything else for that
matter). However, sed should either generate an error, filter the
illegal bytes out of its input, or replace them with a marker
(replacement character) - instead it leaves the non-conformant bytes
alone.

--
Kevin F. Quinn
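
To illustrate the three conformant behaviours named above (raise an error, filter the illegal bytes, or substitute a replacement character), here is a minimal Python sketch. It is not taken from bug #125375 and says nothing about how sed is implemented; the input bytes and the function names are assumptions chosen only for the example. It relies on the standard error handlers of Python's built-in UTF-8 codec.

```python
# Hypothetical input: "ab", then the byte 0xFF (which can never occur in
# well-formed UTF-8), then "cd".
data = b"ab\xffcd"


def decode_strict(raw: bytes) -> str:
    """Option 1: treat the ill-formed sequence as an error condition."""
    return raw.decode("utf-8", errors="strict")


def decode_filtered(raw: bytes) -> str:
    """Option 2: filter the illegal bytes out of the input."""
    return raw.decode("utf-8", errors="ignore")


def decode_marked(raw: bytes) -> str:
    """Option 3: replace illegal bytes with U+FFFD, the replacement character."""
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    try:
        decode_strict(data)
    except UnicodeDecodeError as exc:
        print("strict decoding failed:", exc.reason)

    print(repr(decode_filtered(data)))
    print(repr(decode_marked(data)))
```

In all three cases the stray byte is never interpreted as a character, which is the point of the rule quoted above; what differs is whether the caller is told about it.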