public inbox for gentoo-user@lists.gentoo.org
From: antlists <antlists@youngman.org.uk>
To: gentoo-user@lists.gentoo.org
Subject: Re: [gentoo-user] ncurses; I think I wrecked my fresh install
Date: Wed, 30 Dec 2020 17:42:52 +0000
Message-ID: <da2b63e1-2c9c-33ee-6601-f4568e0eb901@youngman.org.uk>
In-Reply-To: <4587102.OV4Wx5bFTl@noumea>

On 30/12/2020 16:35, Andreas K. Huettel wrote:
>>    I don't know if this has improved over the years, but my initial
>> experience with unicode was rather negative.  The fact that text
>> files were twice as large wasn't a major problem in itself.  The
>> real showstopper was that importing text files into spreadsheets
>> and text-editors and word processors failed miserably.
>>
>>    I looked at a unicode text file with a binary viewer.  It turns out
>> that a simple text string like "1234" was actually...
>> "1" binary-zero "2" binary-zero "3" binary-zero "4" binary zero, etc.
> 
> That's (as someone has already pointed out) UTF-16, which is the default for
> some Windows tools (but understood in Linux too). (Even UTF-32 exists where
> all characters are 4 byte wide, but I've never seen it in the wild.)
> 
> UTF-8 is normally used on Linux (and ASCII chars look exactly the same there);
> even for "long characters" outside the ASCII range spreadsheets and word
> processors should not be a problem anymore.
> 
Following up on my previous answer, you need to separate in your mind 
Unicode the character set, and UTF-x the representation. When Unicode 
was introduced MS - in accordance with the thoughts of the time - 
thought the future was a 16-bit char, which can store 65,536 
characters. (Note that Unicode has since outgrown that: code points now 
run up to U+10FFFF, so anything above U+FFFF needs two 16-bit units, a 
surrogate pair, in UTF-16.)
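
If it helps to see the character set / representation split concretely, 
here is a quick Python sketch (my own illustration, not something posted 
earlier in the thread). It takes the "1234" string from the quoted 
message and shows the same code points in three encodings. The variable 
names are just placeholders I picked.

```python
# One string, three representations of the same code points.
text = "1234"

# Code points are the character-set side: plain integers.
code_points = [ord(ch) for ch in text]

# Encodings are the representation side: bytes on disk or on the wire.
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")  # little-endian, no byte order mark
utf32 = text.encode("utf-32-le")  # little-endian, no byte order mark

print(code_points)  # [49, 50, 51, 52]
print(utf8)         # b'1234'
print(utf16)        # b'1\x002\x003\x004\x00'
print(len(utf32))   # 16
```

The UTF-16 line is the "1" binary-zero "2" binary-zero pattern seen in 
the binary viewer.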

So MS and Windows use UTF-16 as their encoding. Unix LATER went down the 
route of UTF-8, which can only encode the first 2,048 code points (up to 
U+07FF) in two bytes or fewer, but because most (western) text encodes 
at one byte per character it is actually a major saving in network 
operations such as email, web etc, which is where Unix has traditionally 
been very strong.
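
As a rough check on the size trade-off, this small Python sketch 
compares how many bytes one character takes in UTF-8 and in UTF-16. The 
helper name `encoded_sizes` and the sample characters are my own choice, 
purely for illustration.

```python
def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

# ASCII letter, accented Latin letter, euro sign, an emoji outside the BMP.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# The two-byte range of UTF-8 ends at U+07FF; U+0800 needs three bytes.
print(len(chr(0x7FF).encode("utf-8")), len(chr(0x800).encode("utf-8")))
```

Plain ASCII is half the size in UTF-8, accented Latin letters come out 
even, and most other BMP characters (the euro sign here) are where 
UTF-16 is the smaller of the two.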

But UTF-16 works very well for MS, because they are primarily desktop, 
and UTF-16 means that very few characters need more than one 16-bit 
unit. That reduces pressure on the CPU, which is the limited resource 
on a desktop.
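
To show what a multi-unit character looks like in UTF-16, here is a 
minimal Python sketch that splits an encoded character into its 16-bit 
units. `utf16_units` is a hypothetical helper written for this mail, not 
a library function.

```python
import struct

def utf16_units(ch):
    """Split a character's UTF-16 encoding into its 16-bit code units."""
    raw = ch.encode("utf-16-le")
    # "<" = little-endian, "H" = unsigned 16-bit; one H per two bytes.
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))

# A BMP character fits in a single unit.
print([hex(u) for u in utf16_units("\u20ac")])      # ['0x20ac']

# A character above U+FFFF becomes a surrogate pair.
print([hex(u) for u in utf16_units("\U0001F600")])  # ['0xd83d', '0xde00']
```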

And lastly, very importantly, given that AT PRESENT all characters can 
be encoded in 21 bits (the code space stops at U+10FFFF), UTF-32 the 
representation is equivalent to Unicode the character set. Should we 
ever need more than the 1.1 million or so code points that allows, 
there is nothing in principle stopping us rolling out characters encoded 
in two 32-bit chars, and UTF-64, though the standard would have to lift 
that cap first.
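
A small Python sketch of that equivalence, assuming CPython 3.3 or 
later: the 32-bit value stored by UTF-32 is the code point itself, and 
the interpreter reports the highest code point it accepts in 
`sys.maxunicode`. The helper `utf32_value` is my own name for 
illustration.

```python
import sys

def utf32_value(ch):
    """Read a character's UTF-32 encoding back as one 32-bit integer."""
    return int.from_bytes(ch.encode("utf-32-le"), "little")

# The stored 32-bit value is the code point, with no further translation.
for ch in ["A", "\u20ac", "\U0001F600"]:
    assert utf32_value(ch) == ord(ch)

# Highest code point the interpreter will accept.
print(hex(sys.maxunicode))  # 0x10ffff

# Anything past that is rejected rather than encoded.
try:
    chr(sys.maxunicode + 1)
except ValueError as exc:
    print("rejected:", exc)
```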

Cheers,
Wol

