RE: MSXML and Encoding
> UTF-8 characters are between 1 and 3 bytes long, mapping
> approximately as follows (it's a while since I did this, and this is
> from memory, so apologies if I've not got it exactly right, but it's
> similar).
>
> UCS-2 char       UTF-8 mapping (bits)
> -------------    --------------------------
> 0x0000-0x007f    0nnnnnnn
> 0x0080-0x07ff    110nnnnn 10nnnnnn
> 0x0800-0xffff    1110nnnn 10nnnnnn 10nnnnnn
>
> where nnnnn... are the bits which build up the UCS-2 value.
>
> Note:
> You can tell what type of byte you have from the first 1-4 bits
>   0    - single-byte
>   10   - continuation
>   110  - 2-byte
>   1110 - 3-byte

Which means you can easily find the nearest character boundary -
search for the next byte starting with 0, 110 or 1110.

> This means that (eg) é (0xe9 => 11101001) is interpreted as
> the start of a 3-byte character in the range 0x9000-0x9fff.

The UTF-8 encoding for é is

0080-07ff -> 110nnnnn 10nnnnnn

where nnnnn nnnnnn are 00011 101001

ie 11000011 10101001

or 0xc3 0xa9

or Ã© when those two bytes are displayed as Latin-1.

HTH,

Ian
--
Ian Brockbank, Indigo Active Vision Systems,
The Edinburgh Technopole, Bush Loan, Edinburgh EH26 0PJ
Tel: 0131-475-7234 Fax: 0131-475-7201
work: ian@xxxxxxxxxxxxxx
personal: Ian.Brockbank@xxxxxxxxxxx
web: ScottishDance@xxxxxxxxxxx http://www.scottishdance.net/

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
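For anyone who wants to check the bit-shuffling in the message above, here is a minimal Python sketch of the 1-, 2- and 3-byte mapping, using the standard range boundaries of 0x7f and 0x7ff. The function names `utf8_encode_bmp` and `next_boundary` are made up for this illustration and do not come from MSXML or any library. The sketch covers only values up to 0xffff and does not reject surrogates, which a real encoder would.

```python
def utf8_encode_bmp(cp: int) -> bytes:
    """Encode one value in 0x0000-0xFFFF with the 1-, 2- or 3-byte UTF-8 form.

    Illustrative only: surrogate values (0xD800-0xDFFF) are not rejected here.
    """
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("this sketch only handles values up to 0xFFFF")
    if cp <= 0x7F:
        # 0nnnnnnn
        return bytes([cp])
    if cp <= 0x7FF:
        # 110nnnnn 10nnnnnn
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    # 1110nnnn 10nnnnnn 10nnnnnn
    return bytes([
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def next_boundary(data: bytes, pos: int) -> int:
    """Return the index of the first byte at or after pos that is not a
    continuation byte (10nnnnnn), or len(data) if there is none."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


encoded = utf8_encode_bmp(0xE9)          # é
print(encoded.hex(" "))                  # the two bytes worked out by hand above
print(encoded.decode("latin-1"))         # what a Latin-1 reader displays
print(next_boundary(b"a\xc3\xa9b", 2))   # skips the continuation byte 0xa9
```

Skipping every byte that starts with 10 is the same test as looking for a byte that starts with 0, 110 or 1110, as long as only 1- to 3-byte sequences are present.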