[LUG] "default encoding for the user"

Zlatko Trajceski zlat at bagra.net.mk
Wed Nov 27 15:41:09 UTC 2002


On Tue, 26 Nov 2002 22:26:17 +0100, you wrote:

>On Monday 25 November 2002 11:17, Дамјан Георгеивски wrote:
>> Except for plain text files, compatibility does not matter.
>> Windows uses UTF-16, and about that choice all I can say is WHY?!?
>
>I don't understand. What exactly does it use utf-16 for?
>
>(I read somewhere a while ago that utf-8 is at a disadvantage for Asian
>languages compared to utf-16)

Here is what http://www.pango.org/design.shtml says about that:

(btw, these are good links too:
http://www.pango.org/resources.shtml
http://www.pango.org/input-resources.shtml
http://www.pango.org/font-resources.shtml)



 Text Handling

Text in Pango is represented, in most cases, as UTF-8 encoded strings.
This representation has a number of advantages over a fixed-width
representation such as UCS-2:

    * Compatibility with existing Unix APIs is maximized. The
standard C library functions widely available now (such as sprintf())
continue to work.
    * Two sets of API entry points are not needed, since functions
continue to take a char *.
    * The character set is extensible to the full range of ISO10646
without requiring escape mechanisms such as surrogate pairs.
    * UTF-8 requires no extra space for storing ASCII text, and has
only a 50% penalty as opposed to UCS-2 for double-byte character sets.
    * UTF-8 is independent of byte order.
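
Two of the points in the list above (no surrogate pairs, and independence
from byte order) can be checked with a short sketch. Python is used here
only for illustration, since the quoted text itself is about C APIs, and
the sample character U+1D11E is an arbitrary choice, not something taken
from the Pango document.

```python
# Illustrative check only; the sample character is an arbitrary choice.
clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the Basic Multilingual Plane

utf8 = clef.encode("utf-8")
utf16_be = clef.encode("utf-16-be")
utf16_le = clef.encode("utf-16-le")

# UTF-8 encodes the code point directly as a single 4-byte sequence,
# and the byte sequence is the same on every machine.
print(utf8.hex())      # f09d849e

# A 16-bit encoding needs a surrogate pair (two 16-bit units) here,
# and the resulting bytes depend on the byte order chosen.
print(utf16_be.hex())  # d834dd1e
print(utf16_le.hex())  # 34d81edd
```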

There are some disadvantages as well:

    * Other popular systems such as Microsoft products and Java have
adopted UCS-2 as their encoding, necessitating conversions. (But UTF-8
appears to be the emerging standard for open-source applications.)
    * UTF-8 has a 50% penalty in space as opposed to UCS-2 for
storing double-byte character sets.
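
The space trade-off mentioned in both lists can be measured directly. The
sketch below is only an illustration: the sample strings are made up, and
Python's "utf-16-le" codec stands in for UCS-2 (the two coincide for
characters in the Basic Multilingual Plane, which is all these samples use).

```python
def sizes(text):
    """Return (UTF-8 bytes, 16-bit-encoding bytes) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Hypothetical sample strings, chosen only to cover three script classes.
samples = {
    "ASCII": "hello",      # 1 byte per character in UTF-8
    "Cyrillic": "здраво",  # 2 bytes per character in UTF-8
    "CJK": "日本語",        # 3 bytes per character in UTF-8
}

for name, text in samples.items():
    u8, u16 = sizes(text)
    print(f"{name}: UTF-8 {u8} bytes, 16-bit encoding {u16} bytes")
```

For the CJK sample the ratio is 9 bytes to 6, which is the 50% penalty the
text refers to; for ASCII the 16-bit encoding is twice as large, and for
Cyrillic the two are the same size.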

Where individual characters are represented, they are represented as
32-bit wide characters. This again provides forward compatibility with
the full range of ISO10646, and should incur minimal cost for local
variables and parameter passing. This agrees with the type of wchar_t
in the GNU libc library.
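
As a rough illustration of the split between UTF-8 strings and 32-bit
individual characters, the sketch below decodes a UTF-8 byte string into
integer code points and packs them as 32-bit values. The helper names
to_ucs4 and pack_ucs4 are hypothetical and are not Pango functions.

```python
import struct

def to_ucs4(utf8_bytes):
    """Decode UTF-8 bytes and return the code points as a list of integers."""
    return [ord(ch) for ch in utf8_bytes.decode("utf-8")]

def pack_ucs4(code_points):
    """Pack code points as unsigned 32-bit integers in native byte order."""
    return struct.pack(f"={len(code_points)}I", *code_points)

code_points = to_ucs4("aж€".encode("utf-8"))
print([hex(cp) for cp in code_points])  # ['0x61', '0x436', '0x20ac']
print(len(pack_ucs4(code_points)))      # 12, i.e. 4 bytes per character
```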

Offsets into a utf-8 string are represented as byte offsets, not
character offsets. This is more convenient for processing, and
although there is the problem of having invalid offsets into the data,
note that given a string of Unicode text with combining characters,
character positions may already be invalid, and break-iteration is
needed to determine valid positions.
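
The point about byte offsets and combining characters can be sketched as
follows. The helper names are hypothetical, and the check shown is the
generic UTF-8 continuation-byte test, not Pango's actual break-iteration
code.

```python
import unicodedata

def is_char_boundary(data, offset):
    """True if offset does not point into the middle of a UTF-8 sequence."""
    if offset < 0 or offset > len(data):
        return False
    if offset == len(data):
        return True
    # Continuation bytes have the bit pattern 10xxxxxx.
    return (data[offset] & 0xC0) != 0x80

def char_to_byte_offset(text, char_index):
    """Convert a character (code point) index into a UTF-8 byte offset."""
    return len(text[:char_index].encode("utf-8"))

# 'e' followed by COMBINING ACUTE ACCENT, then 't', then precomposed 'é'.
text = "e\u0301t\u00e9"
data = text.encode("utf-8")

print(len(text), len(data))           # 4 code points, 6 bytes
print(is_char_boundary(data, 2))      # False: inside the 2-byte accent
print(char_to_byte_offset(text, 2))   # 3: byte offset of 't'

# Byte offset 1 is a valid code point boundary, yet it falls between 'e'
# and its combining accent, so it is still not a sensible cursor position.
print(is_char_boundary(data, 1))                # True
print(unicodedata.combining(text[1]) != 0)      # True: a combining mark
```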

Conversion between character sets will be handled via iconv. A
lot of new systems provide a decent implementation of iconv
(notably GNU libc-2.1), and these systems can be used as "reference
platforms"; for other platforms, it shouldn't be hard to write a
simple table-driven iconv implementation that can handle the small
amounts of data in a typical GUI reasonably efficiently. (Various
implementations of this are available, e.g., Tom Tromey's libunicode
and Bruno Haible's libiconv.)
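
To give an idea of what "table-driven" means here, the sketch below
converts a single-byte character set to UTF-8 through a 256-entry lookup
table. It is not iconv and does not use the iconv API; Python's built-in
"iso8859_5" codec is used only to populate the table, whereas a real
implementation would presumably ship static tables. The function names are
hypothetical.

```python
def build_table(codec_name):
    """Build a 256-entry table mapping each byte value to its UTF-8 encoding."""
    table = []
    for value in range(256):
        # Undefined byte values, if any, become U+FFFD REPLACEMENT CHARACTER.
        char = bytes([value]).decode(codec_name, errors="replace")
        table.append(char.encode("utf-8"))
    return table

def convert_to_utf8(data, table):
    """Convert single-byte encoded data to UTF-8 with one lookup per byte."""
    return b"".join(table[byte] for byte in data)

ISO_8859_5 = build_table("iso8859_5")

# 0xB4 and 0xD0 are the Cyrillic letters Д and а in ISO 8859-5.
print(convert_to_utf8(b"\xb4\xd0", ISO_8859_5).decode("utf-8"))  # Да
```

A table of this kind only covers single-byte source encodings; multi-byte
character sets need a more elaborate scheme.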


-- 
m o reandmore and morea nd moreandmo r eandm ore andm ore