Re: Unicode surrogate block in XML?
At 16 Sep 1999 18:12 -0400, Paul W. Abrahams wrote:
> The XML 1.0 spec explicitly excludes the Unicode surrogate characters
> from XML documents (production 2). It now seems, from information
> I've picked up on the Unicode web site, that surrogate characters are
> likely to play a more important role in the future, since the
> available 16-bit characters are almost all used up. (Unicode 2.0 has
> 18,134 spares but Unicode 3.0 has only 7827 spares. The trend is
> clear.)
>
> Is any thought being given in W3C to allowing surrogate characters in
> XML documents?

The code values from the Surrogate block (soon to be the High Surrogates, High Private Use Surrogates, and Low Surrogates) are not allowed in XML documents, but the characters that you reference with the two parts of a Surrogate Pair are definitely allowed.

The characters that you can address with a Surrogate Pair are in the range #x10000 to #x10FFFF. In Unicode terminology, this is the Unicode Scalar Value of the Surrogate Pair. Production 2 from the XML Recommendation shows that these are legal characters:

  Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

In a UTF-16 encoded document, you can use the code values from the Surrogate block to refer to these characters. It would be an error if, for example, you used an unpaired Surrogate code value, but any UTF-16 application is going to complain about or ignore an unpaired surrogate.

In a UTF-8 encoded document, you can refer to the characters in the range #x10000 to #x10FFFF using a four-byte sequence that has no relationship to the code values in the Surrogate block.

In UCS-4 (or the new UTF-32) you can directly represent characters in the range #x10000 to #x10FFFF.

In any XML document, you can make numeric references to any Unicode character in the range #x10000 to #x10FFFF (as well as to any other legal character number). These references are independent of the encoding used in the XML document.
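The points above can be checked with a minimal Python sketch. It assumes Python 3 and its built-in codecs; the helper names (is_xml_char, to_surrogate_pair, from_surrogate_pair) are hypothetical and come from neither the XML Recommendation nor the Unicode Standard. The first helper tests a code point against production 2. The other two convert between a scalar value and its Surrogate Pair.

```python
# Sketch only: the helper names below are hypothetical illustrations.

def is_xml_char(cp: int) -> bool:
    """True if code point cp matches production 2 (Char) of XML 1.0."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def to_surrogate_pair(cp: int) -> tuple:
    """Split a scalar value in #x10000-#x10FFFF into (high, low) code values."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not addressable with a Surrogate Pair")
    offset = cp - 0x10000              # 20 significant bits
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Compute the Unicode Scalar Value of a Surrogate Pair."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid Surrogate Pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    cp = 0x10000
    ch = chr(cp)

    print(is_xml_char(cp))        # True: in [#x10000-#x10FFFF]
    print(is_xml_char(0xD800))    # False: Surrogate code values are excluded

    print([hex(v) for v in to_surrogate_pair(cp)])   # ['0xd800', '0xdc00']
    print(hex(from_surrogate_pair(0xDBFF, 0xDFFF)))  # 0x10ffff

    # The same character in three encodings (big-endian, no byte order mark).
    print(ch.encode("utf-16-be").hex())  # d800dc00
    print(ch.encode("utf-8").hex())      # f0908080
    print(ch.encode("utf-32-be").hex())  # 00010000
```

The UTF-8 bytes show no trace of the Surrogate code values, and the UTF-32 form is simply the scalar value itself.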
#x10000 is the first code value outside the Basic Multilingual Plane (the ISO/IEC 10646 term for the characters in the range #x0 to #xFFFF). "&#x10000;" is the hexadecimal numeric reference for this code value.

The sequence of #xD800 #xDC00 is the two Surrogate code values that address #x10000. That four-byte sequence may occur in a UTF-16 encoded file to represent #x10000. In contrast, "&#xD800;&#xDC00;" in an XML document is two illegal character references in a row.

Regards,

Tony Graham
======================================================================
Tony Graham                            mailto:tgraham@m...
Mulberry Technologies, Inc.            http://www.mulberrytech.com
17 West Jefferson Street               Direct Phone: 301/315-9632
Suite 207                              Phone: 301/315-9631
Rockville, MD 20850                    Fax: 301/315-8285
----------------------------------------------------------------------
Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================