On Thu, 2003-11-13 at 09:05, Paul de Vrieze wrote:
> Isn't it true that it is possible to encode the few 4 byte characters
> into a number of 2byte sequences. I think that is more than enough for
> most cases (who needs to read/write cjk anyway ;-) )

According to my understanding of UCS, that doesn't seem to be the case. I
believe that UCS is the internal representation of a Unicode character (its
code point), whereas UTF is the encoding of that character into octets for
representation on a computer.

As an example, for a UCS4 character on one of the higher planes (such as the
one where the extra CJK characters are placed), UTF-8 would require 4 bytes.
The original UTF-8 design allows sequences of up to 6 bytes to cover the full
31-bit UCS4 range. UTF-8, UTF-16 and UTF-32 can all represent characters from
those higher planes. UCS2 does not do any "chaining", so it can only hold
16-bit characters (i.e. code points up to 65535). UCS2 is a subset of UCS4[1].

Of course, I've left out a lot of details, such as the fact that UCS2 doesn't
actually have 64K chars, etc.

With that said, most Linux machines that have wchar support have wchar_t
defined as 4 bytes (int). So anything with wchar support probably already
uses 4 bytes. Maybe someone who has used wchar support can comment on this.

Cheers,

[1] http://www.gnuenterprise.org/doc/console-tools-libs/html/lct-4.html#sec-unicode

-- 
Alastair 'liquidx' Tse
>> Gentoo Developer
>> http://www.liquidx.net/ | http://dev.gentoo.org/~liquidx/
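
For anyone who wants to check the sizes discussed above, here is a minimal Python sketch. It is only an illustration: the helper names `encoded_sizes` and `fits_in_ucs2` are made up for this example, and U+20000 is simply one ideograph from the supplementary CJK plane picked as a test case. It shows how many bytes each UTF encoding needs for a character, and whether the code point fits in the 16 bits that UCS2 has available.

```python
# Hypothetical helpers for illustration; not part of any library.

def encoded_sizes(char):
    """Return the number of bytes `char` occupies in each UTF encoding.

    The "-be" codec variants are used so that no byte order mark is
    prepended and counted in the length.
    """
    return {
        "utf-8": len(char.encode("utf-8")),
        "utf-16": len(char.encode("utf-16-be")),
        "utf-32": len(char.encode("utf-32-be")),
    }


def fits_in_ucs2(char):
    """True if the code point fits in a single 16-bit unit."""
    return ord(char) <= 0xFFFF


if __name__ == "__main__":
    samples = {
        "ASCII letter": "A",
        "BMP CJK ideograph (U+4E2D)": "\u4e2d",
        "plane 2 CJK ideograph (U+20000)": "\U00020000",
    }
    for label, ch in samples.items():
        print(label, encoded_sizes(ch), "fits in UCS2:", fits_in_ucs2(ch))
```

With this sketch, the plane 2 character comes out as 4 bytes in each of UTF-8, UTF-16 and UTF-32, and it does not fit in a single 16-bit unit. In UTF-16 those 4 bytes are a pair of 2-byte units (a surrogate pair), which is the kind of "chaining" that UCS2 lacks.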