From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [gentoo-user] ncurses; I think I wrecked my fresh install
To: gentoo-user@lists.gentoo.org
References: <9095281.eNJFYEL58v@noumea> <4587102.OV4Wx5bFTl@noumea>
From: antlists
Date: Wed, 30 Dec 2020 17:42:52 +0000
In-Reply-To: <4587102.OV4Wx5bFTl@noumea>
Reply-to: gentoo-user@lists.gentoo.org

On 30/12/2020 16:35, Andreas K.
Huettel wrote:
>> I don't know if this has improved over the years, but my initial
>> experience with unicode was rather negative. The fact that text
>> files were twice as large wasn't a major problem in itself. The
>> real showstopper was that importing text files into spreadsheets
>> and text-editors and word processors failed miserably.
>>
>> I looked at a unicode text file with a binary viewer. It turns out
>> that a simple text string like "1234" was actually...
>> "1" binary-zero "2" binary-zero "3" binary-zero "4" binary-zero, etc.
>
> That's (as someone has already pointed out) UTF-16, which is the default for
> some Windows tools (but understood in Linux too). (Even UTF-32 exists where
> all characters are 4 bytes wide, but I've never seen it in the wild.)
>
> UTF-8 is normally used on Linux (and ASCII chars look exactly the same there);
> even for "long characters" outside the ASCII range spreadsheets and word
> processors should not be a problem anymore.
>

Following up on my previous answer, you need to separate in your mind
Unicode the character set, and UTF-x the representation.

When Unicode was introduced MS - in accordance with the thoughts of the
time - thought the future was a 16-bit char, which can store 65,536
characters. (Note that the original ISO 10646 code space was 31 bits, so
the high bit of a 32-bit character value *must* be zero. Just like
standard ASCII in an 8-bit byte.) So MS and Windows use UTF-16 as their
encoding.

Unix LATER went down the route of UTF-8, which can only encode about two
thousand characters in one or two bytes (everything up to U+07FF), but
because most (western) text encodes successfully in one byte per
character it is actually a major saving in network operations such as
email, web etc, which is where Unix has traditionally been very strong.

But UTF-16 works very well for MS, because they are primarily desktop,
and UTF-16 means that there are very few characters that need more than
one 16-bit unit. That reduces pressure on CPU, which is a limited
resource on the desktop.
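For anyone who wants to see the byte layouts described above for themselves, here is a minimal Python sketch. It assumes Python 3.8 or later (for the separator argument to bytes.hex), and the helper names show_bytes and encoded_sizes are made up for illustration, not taken from any tool mentioned in this thread.

```python
# Illustrative sketch only: show_bytes and encoded_sizes are hypothetical
# helper names. The little-endian codecs are used so that no byte-order
# mark is prepended to the output.

def show_bytes(text: str, encoding: str) -> str:
    """Return the encoded form of text as space-separated hex bytes."""
    return text.encode(encoding).hex(" ")


def encoded_sizes(ch: str) -> dict:
    """Return how many bytes one character needs in each encoding."""
    return {enc: len(ch.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


if __name__ == "__main__":
    # The "1234" example from the quoted message.
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        print(f"{enc:10} {show_bytes('1234', enc)}")

    # ASCII letter, e-acute, euro sign, and an emoji outside the 16-bit range.
    for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
        print(f"U+{ord(ch):04X} {encoded_sizes(ch)}")
```

For "1234" the UTF-8 line shows 31 32 33 34, identical to ASCII, while the UTF-16 line shows 31 00 32 00 33 00 34 00, which is the "binary zero" pattern seen in the binary viewer. The emoji needs four bytes in all three encodings, since in UTF-16 it takes two 16-bit units.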
And lastly, very importantly, given that AT PRESENT all characters can
be encoded in 31 bits (in fact the code space currently stops at
U+10FFFF, which needs only 21), UTF-32 the representation is equivalent
to Unicode the character set: each 32-bit unit simply holds the
character's number. But should we ever need more than 2 billion
characters, there is in principle nothing stopping us rolling out
characters encoded in two 32-bit chars, and UTF-64.

Cheers,
Wol