On Thu, 2003-11-13 at 09:05, Paul de Vrieze wrote:

> Isn't it true that it is possible to encode the few 4-byte characters 
> into a number of 2-byte sequences? I think that is more than enough for 
> most cases (who needs to read/write cjk anyway ;-) )

According to my understanding of UCS, it doesn't seem to be the case. I
believe that UCS is the internal representation of a Unicode character,
whereas UTF is the encoding of that character into octets for
storage or transmission on a computer.

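As an example, for a UCS4 character on one of the higher planes (such as
the one where the extra CJK characters are placed), UTF-8 needs 4 bytes
(the original UTF-8 scheme allows up to 6 bytes for the very top of the
31-bit UCS4 range). UTF-8, UTF-16 and UTF-32 can all represent every
code point up to U+10FFFF, which covers all the planes in actual use;
UTF-16 does it by chaining two 16-bit units (a surrogate pair). UCS2
does not do any "chaining", so it can only have, at most, 16-bit
characters (i.e. code points up to 65535). UCS2 is a subset of UCS4[1].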
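
For anyone who wants to check the byte counts, here is a rough sketch
in Python 3 (my choice of language, nothing Gentoo-specific). It takes
U+20000, the first code point of CJK Extension B on plane 2, as the
sample character; the helper names are made up for illustration.

```python
# U+20000 is the first code point of CJK Unified Ideographs Extension B (plane 2).
ch = "\U00020000"

def encoded_lengths(char):
    """Return how many octets each UTF encoding needs for one character."""
    return {
        "utf-8": len(char.encode("utf-8")),
        # Explicit byte order, so no byte-order mark is added to the count.
        "utf-16": len(char.encode("utf-16-be")),
        "utf-32": len(char.encode("utf-32-be")),
    }

def fits_in_ucs2(char):
    """UCS2 has a single 16-bit unit per character and no chaining."""
    return ord(char) <= 0xFFFF

def surrogate_pair(code_point):
    """Split a code point above U+FFFF into the two 16-bit units UTF-16 uses."""
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

lengths = encoded_lengths(ch)
print(lengths)
print("fits in UCS2:", fits_in_ucs2(ch))
print("UTF-16 units:", [hex(u) for u in surrogate_pair(ord(ch))])
```

The surrogate pair is the "number of 2-byte sequences" trick, but it
belongs to UTF-16; plain UCS2 has no way to express it.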

Of course, I've left out a lot of details, such as the fact that UCS2
doesn't actually have 64K chars, etc. 

With that said, most Linux machines that have wchar support define
wchar_t as 4 bytes (int). So anything with wchar support probably
already uses 4 bytes. Maybe someone who has used wchar support can
comment on this.
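
One quick way to check this on a given box, sketched in Python 3 using
the standard ctypes module (this only reports what the local C library
uses for wchar_t; I am assuming that is what the applications in
question were built against):

```python
import ctypes

# Size in bytes of the platform's C wchar_t, as seen through ctypes.
wchar_size = ctypes.sizeof(ctypes.c_wchar)
print(f"wchar_t is {wchar_size} bytes ({wchar_size * 8} bits)")

# A 4-byte wchar_t can hold any UCS4 value directly; a 2-byte one cannot.
can_hold_ucs4 = wchar_size >= 4
print("can hold a full UCS4 value:", can_hold_ucs4)
```

If it reports 4 bytes, a single wchar_t has room for any UCS4 value
without chaining.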

Cheers,

[1]
http://www.gnuenterprise.org/doc/console-tools-libs/html/lct-4.html#sec-unicode

-- 
Alastair 'liquidx' Tse
 >> Gentoo Developer
 >> http://www.liquidx.net/ | http://dev.gentoo.org/~liquidx/