Re: Unicode surrogate block in XML?
At 17 Sep 1999 22:16 -0400, Paul W. Abrahams wrote:
> Tony Graham (tgraham@m...)
> Fri, 17 Sep 1999 01:15:51 -0400 (EST)
>
> >> In any XML document, you can make numeric references to any Unicode
> character in the range #x10000 to #x10FFFF (as well as to any other
> legal character number). These references are independent of the
> encoding used in the XML document. <<
>
> Is it really correct to refer to #x10FFFF, say, as a Unicode
> character, since Unicode characters are limited to 16 bits? I'd think
> it's necessary here to refer to that as a UCS-4 character.

The Unicode Standard started out with the design principle that all
characters have a uniform width of 16 bits. The expectation was that
the 65,000 or so characters that you can address with 16 bits would
far exceed the requirements. However, reality intruded, and the
practicalities (and possibly the political realities) of defining a
universal character set have meant that there are more characters to
be defined than can fit in a 16-bit address space.

Unicode 2.0, published in 1996, defines the Surrogate block and a
mechanism for using two code values from the surrogate block to
address over one million extra characters.

The Unicode Standard, Version 2.0, supports surrogates, but doesn't
quite know what to do about them. Section 3.7 of the Unicode Standard,
Version 2.0, defines surrogates, and they are mentioned again in
section C.3, but you're left with the impression that they and UTF-16
are really an ISO/IEC 10646 thing. UTF-16 was initially defined in
Amendment 1 of ISO/IEC 10646-1:1993, so that impression isn't far off
the mark.

Planes 15 and 16 are reserved for private use, so there's been a
legitimate use for surrogates, or, more broadly, for using characters
outside Plane 0, since 1996.

Since 1996, however, there have been numerous proposals for scripts to
be included in the Unicode Standard and ISO/IEC 10646, and many of
these are slated for definition in Plane 1, i.e.
they'll need more than 16 bits to address the characters. As far as I
know, none have been assigned code values yet, but it won't be too
long after the release of the Unicode Standard, Version 3.0, and
ISO/IEC 10646-1:2000. Furthermore, Plane 2 is reserved as the CJK
Unified Ideographs Supplementary Plane, and it already has 41,000
characters lined up for inclusion.

> >> The sequence of #xD800 #xDC00 is the two Surrogate code values that
> address #x10000. That four-byte sequence may occur in a UTF-16
> encoded file to represent #x10000. In contrast, "&#xD800;&#xDC00;" in
> an XML document is two illegal character references in a row. <<
>
> I've been trying to fathom the distinction between Unicode and UTF-16,
> if there is one, and how these in turn relate to the UCS-2 encoding of

There isn't one anymore. The Unicode Standard used to say that it
corresponded to UCS-2, but now it has embraced UTF-16 (and given us
UTF-16BE and UTF-16LE for big-endian and little-endian representations
without the BOM, respectively).

The Unicode Consortium now also defines UTF-32, which is a 32-bit
representation of the characters that you can address with UTF-16.
There is no difference between the UTF-32 representation of a
character and the UCS-4 representation of a character over the range
of characters that you can address with UTF-32. The only difference is
that when you say that your document is UTF-32, you're saying that it
comes with the Unicode character semantics and conformance
requirements rather than the different requirements of UCS-4.

UTF-8 has also come into the fold since 1996. In the Unicode Standard,
Version 2.0, UTF-8 was relegated to section A.2, but now it's an
accepted alternative to UTF-16.

> ISO 10646.
> There's also the question of whether an XML document can
> be stored directly in Unicode, or whether instead it must be stored in
> either UTF-8 or UTF-16, as Section 2.2 seems to imply when it says
> ``all XML processors must accept the UTF-8 and UTF-16 encodings of
> 10646''. The latter appears to be the case; but if it isn't, then
> how would an XML document be stored directly in Unicode? I've

UTF-8 and UTF-16 can encode the characters of the Unicode Standard.

The Unicode Standard used to leave out one aspect of how some people,
e.g. some ISO standards, define a character set. Roughly speaking,
the base aspect is the character repertoire, which is a collection of
abstract characters. The next aspect is a mapping of the character
repertoire onto a set of numbers. The third aspect is mapping the
character numbers onto some representation as bits or bytes.

The Unicode Standard used to conflate the second and third aspects,
since the character numbers are identical to the values of the 16-bit
quantities that you can use to represent the characters. Hence it
seems like a Unicode character is its 16-bit character number. This
simplification falls down when you have character numbers that you
can't express in 16 bits and you allow other bit representations for
the characters.

You'll find that the Unicode Consortium now speaks about UTF-8,
UTF-16, UTF-32, and UTF-EBCDIC. The favourite is probably still
UTF-16, but even UTF-16 isn't one 16-bit quantity to one character.
Also, the Unicode character encoding model
(http://www.unicode.org/unicode/reports/tr17/) now has five levels.

> pondered both Appendix C of the Unicode Standard and the relevant part
> of the FAQ on the Unicode website, and I'm still unclear about all of
> this. (By the way, the FAQ erroneously refers to UTF as the Unicode
> Transformation Format rather than the UCS transformation format.)

There are two definitions for UTF.
ISO/IEC 10646 always defines it as "UCS transformation format", and
the Unicode Consortium mostly defines it as "Unicode transformation
format" (see section C.3 of the Unicode Standard, Version 2.0, for an
exception). They mean the same thing.

> In any event, thanks, Tony, for your very enlightening response to my
> original query.

I hope this remains enlightening, and not overwhelming.

Regards,

Tony Graham
======================================================================
Tony Graham                            mailto:tgraham@m...
Mulberry Technologies, Inc.            http://www.mulberrytech.com
17 West Jefferson Street               Direct Phone: 301/315-9632
Suite 207                              Phone:        301/315-9631
Rockville, MD 20850                    Fax:          301/315-8285
----------------------------------------------------------------------
Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on
CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following
message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)
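
The message above says that #xD800 #xDC00 is the surrogate pair for
#x10000, and that a single character number can have several byte
representations. The following is a minimal Python sketch of that
arithmetic, not code from the original post. The helper names
to_surrogates, from_surrogates, and parses are hypothetical, and the
last two lines assume that Python's bundled expat-based parser rejects
character references into the surrogate block, as the XML
Recommendation requires.

```python
import xml.etree.ElementTree as ET


def to_surrogates(code_point):
    """Split a character number in #x10000..#x10FFFF into two surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("character number is not in Planes 1 to 16")
    offset = code_point - 0x10000       # a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits select the low surrogate
    return high, low


def from_surrogates(high, low):
    """Recombine a high/low surrogate pair into one character number."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def parses(xml_text):
    """Return True if the (hypothetical) test document is accepted."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    high, low = to_surrogates(0x10000)
    print(f"#x10000 -> #x{high:04X} #x{low:04X}")

    # One character number, three byte representations.
    for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
        print(encoding, "\U00010000".encode(encoding).hex())

    # A reference to the character number itself versus a pair of
    # references to the two surrogate code values.
    print(parses("<doc>&#x10000;</doc>"))
    print(parses("<doc>&#xD800;&#xDC00;</doc>"))
```

The point of keeping to_surrogates separate from the encoding calls is
the one made in the message: the character number is one thing, and
the 16-bit (or 8-bit, or 32-bit) quantities used to store it are
another.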