Re: UTF-8 vs UTF-16...? (Was: Feeling good about SML)
At 17 Nov 1999 14:29 GMT, Steve Schafer wrote:
> On 17 Nov 1999 13:24:27 +0100, you wrote:
>
> >Not sure if I understand the UTF-16 bit above, but I'm reading this:
> >
> >  <URL:http://www.unicode.org/unicode/faq/#UTF-16 and UCS-4>
> >
> >to UTF-16 being able to represent the full UCS-4, which is what you
> >say UTF-8 can do, if I interpret you correctly...?
>
> Section C.3 of the Unicode 2.0 spec, paragraph 4:
>
> "UTF-16 does not support the representation of all the UCS-4 code
> space but is limited to the BMP and the next 16 planes...."

True, but that's more code values than anybody expects to ever
standardise (although that's the opinion of the same people who thought
that they'd never need more than the BMP).

All of the currently defined Unicode and ISO/IEC 10646 characters (both
standards define the same characters) are in the BMP. It won't be long
until characters are defined in Plane 1 and Plane 2 (with possible
spill-over into Plane 3), plus planes 15 and 16 are reserved for private
use. Currently the only thing defined for the code space beyond Plane 16
of Group 00 (i.e. beyond the characters addressable with UTF-16) is more
areas available for private use.

The fuss over UTF-8 or UTF-16 is over the number of bytes used to
represent the characters in the BMP, i.e. the currently defined
characters. UTF-16 uses two bytes per character for every character in
the BMP. UTF-8 uses one byte per character for the ASCII characters, two
bytes per character for a comparatively small further range, and three
bytes per character for most of the characters in the BMP. Both UTF-8
and UTF-16 use four bytes per character to represent the characters in
planes 1 to 16. (There's also UTF-32, which is four bytes per character
for all the characters that you can represent with UTF-16.)

UTF-8 is efficient if you use a lot of ASCII, e.g. if you're an English
speaker and all you use is ASCII, but it's more bytes per character than
UTF-16 for a whole lot of other scripts (plus it's more bytes per
character than a lot of current script-specific encodings).

So the issue isn't how many characters the different encodings can
represent, but how efficiently (or how uniformly) they represent the
currently defined characters.

Regards,

Tony Graham
======================================================================
Tony Graham                            mailto:tgraham@m...
Mulberry Technologies, Inc.            http://www.mulberrytech.com
17 West Jefferson Street               Direct Phone: 301/315-9632
Suite 207                              Phone:        301/315-9631
Rockville, MD 20850                    Fax:          301/315-8285
----------------------------------------------------------------------
Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To unsubscribe, mailto:majordomo@i... the following message;
unsubscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)
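
The byte counts discussed in the message above can be checked with a short sketch. This is only an illustration, not part of the original post: it assumes a current Python 3 interpreter, the sample characters and the names SAMPLES, bytes_per_char and UTF16_LIMIT are chosen here for the example, and U+10400 stands in for "a character in Plane 1" purely as a code point.

```python
# Illustrative sketch: count the bytes each encoding needs per character.
# The sample characters and all names below are assumptions for this example.

SAMPLES = {
    "ASCII letter (U+0041)": "A",
    "Latin e with acute (U+00E9)": "\u00e9",
    "CJK ideograph in the BMP (U+4E2D)": "\u4e2d",
    "Plane 1 code point (U+10400)": "\U00010400",
}


def bytes_per_char(ch):
    """Return the encoded length of one character in UTF-8, UTF-16 and UTF-32.

    The big-endian codec variants are used so that no byte order mark
    is added to the count.
    """
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


# UTF-16 reaches the BMP plus the next 16 planes: 17 planes of 65,536
# code points each, so the highest addressable code point is 0x10FFFF.
UTF16_LIMIT = 17 * 0x10000 - 1

if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        print(label, bytes_per_char(ch))
    print("Highest UTF-16 code point:", hex(UTF16_LIMIT))
```

For these samples UTF-8 needs 1, 2, 3 and 4 bytes, UTF-16 needs 2, 2, 2 and 4, and UTF-32 needs 4 throughout, which matches the trade-off described in the message: UTF-8 wins for ASCII-heavy text, while UTF-16 is smaller for most other BMP scripts.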