Re: Use of UTF-8 and UTF-16
On Wed, 2 Nov 2005, Philippe Poulard wrote:

> Elliotte Harold wrote:
> > Rick Jelliffe wrote:
> >
> >> For CJK (Chinese, Japanese, Korean) XML documents, where three (or six)
> >> bytes may be used by UTF-8 instead of UCS-16's two (or four), UTF-16
> >> files will usually be smaller.
> >
> > First a correction: UTF-8 never uses six bytes for anything. The largest
> > UTF-8 character you'll ever see is 4 bytes wide.
>
> hi,
>
> I read somewhere that :
>
> UTF-8 uses 6 bytes for ISO/IEC 10646
> UTF-8 uses 4 bytes for Unicode
>
> Unicode is a subset of ISO/IEC 10646 (in terms of addressing)
> ISO/IEC 10646 is a subset of Unicode (in terms of semantic)
>
> XML uses Unicode

10646 reserves the codes U+D800..U+DFFF for use in pairs to address
characters with codes up to 20 bits long (U-00010000..U-0010FFFF).

These reserved values (U+D800..U+DFFF) get encoded at 3 bytes each in
UTF-8, so it takes 6 bytes to address the values 17 to 20 bits long via
the 10646 scheme.

However, UTF-8 can encode the Unicode values U-00010000..U-0010FFFF as
4 bytes.

<http://czyborra.com/utf/> explains some of the details.

Chris Gray
University of Waterloo Library
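
The 4-byte versus 6-byte counts above can be checked with a short Python sketch. It assumes U+10400 as an arbitrary sample character, and the helper names `surrogate_pair` and `three_byte_form` are made up for illustration. Note that the 6-byte surrogate form is not accepted by strict UTF-8 decoders; it is shown only to reproduce the arithmetic.

```python
def surrogate_pair(cp):
    """Split a code point in U+10000..U+10FFFF into its two surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000                      # 20-bit value
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def three_byte_form(value):
    """Apply the 3-byte UTF-8 bit pattern to a value in 0x0800..0xFFFF."""
    if not 0x0800 <= value <= 0xFFFF:
        raise ValueError("value does not fit the 3-byte pattern")
    return bytes([
        0xE0 | (value >> 12),
        0x80 | ((value >> 6) & 0x3F),
        0x80 | (value & 0x3F),
    ])


cp = 0x10400                              # sample character (assumption)

# Direct UTF-8 encoding of the code point: 4 bytes.
direct = chr(cp).encode("utf-8")

# Each surrogate encoded separately with the 3-byte pattern: 6 bytes total.
high, low = surrogate_pair(cp)
via_surrogates = three_byte_form(high) + three_byte_form(low)

print(len(direct), direct.hex())
print(len(via_surrogates), via_surrogates.hex())
```

For the same character, UTF-16 uses the two surrogates directly, which is 4 bytes.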