Re: fuzzy end of this lolly-pop OR Why Latin Rocks
Tedd wrote:
> Now, to draw this thread back to on-topic, I know how code-points are
> used in url's, html, and such, but I would like to see how xml
> incorporates/uses Unicode code-points. Anyone? Please enlighten me.

XML character handling is best thought of as having two parts:

1) A parsing function: it has Unicode characters in text entities as its input, and a document containing data and markup as its output. This is XML proper. So:

* A numeric character reference (e.g., &#xABCD;) in XML is always in terms of Unicode characters (not UTF-8, ISO 8859-1, UTF-16, etc.).

* XML does not constrain the actual coded character set that an implementation uses internally.

* Being defined in terms of Unicode, XML is text, not binary. In other words, the character € means exactly what ISO 10646 or the Unicode Consortium says it means. If I provide an XML document in CP1252, as used in the US and here in Australia, and I encode the byte 0x80 as binary data, it will be mapped to U+20AC, the Euro character, before it reaches the XML parser (assuming that the XML processor accepts CP1252). In that case, I will need to know which encoding was actually used for my data in order for my application to map the data back to the original byte (a small Java sketch of this follows below). Consequently, attempting to overload XML characters as binary code points is probably unworkable except for tightly coupled processes or where UTF-16 is used. Even using UTF-16 for binary overloaded transmission is probably not reliable for Japanese systems (see the Japanese XML Profile at the W3C technical report site). And XML 1.0 does not allow every Unicode code point, notably U+0000. (The lack of U+0000 is often portrayed, typically by Microsoft users, as an antique carbuncle that should be removed; however, I see that protecting \00 is alive in the Java JNI interface: see the second-last paragraph of http://www.dil.univ-mrs.fr/docs/j2sdk/1.5/guide/jni/spec/types.html)

2) An algorithm that will typically be used to select the transcoding function: it has bytes from a Web resource as its input, and Unicode characters in an entity as its output. This is the auto-detection algorithm of Appendix F (also sketched below). So:

* The auto-detection algorithm is never required when your XML implementation already has the entity available as Unicode: for example, an XML document held in a single Java String does not require (indeed, should ignore) any encoding information in the XML header. Or when a text resource is accessed over the Web, and all the different protocols' encoding defaults match, and the server and intermediate caches etc. are configured correctly, and your client converts the bytes correctly into a form your system can treat as Unicode. (The unworkability of this parallel chain of metadata is what makes sending XML as application/xml, using the XML header and auto-detection, more prudent than sending the XML as text/xml and relying on the Web infrastructure and protocols to get it right.)

* When people discover that the MIME infrastructure is rather broken (for reliable transmission of non-ASCII text across different locales, or when using a character encoding different from the locale default, notably UTF-8), the typical reaction is not to pull together and make sure everything is configured correctly, but to hack together something that seems to work. Invocations of the incompetence of the people who make standards are never particularly convincing when made by people who deliberately break them.
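To make the CP1252/Euro point above concrete, here is a minimal Java sketch (the class name and the choice of the JDK's bundled DOM parser are mine for illustration, not anything the XML spec mandates): a raw 0x80 byte and the reference &#x20AC; in a CP1252 document both reach the application as U+20AC, and the original byte value cannot be recovered without knowing which encoding was used.

    import java.io.ByteArrayInputStream;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;

    public class Cp1252Sketch {
        public static void main(String[] args) throws Exception {
            // A windows-1252 (CP1252) document whose content is one raw 0x80 byte
            // followed by the numeric character reference &#x20AC;
            byte[] head = "<?xml version=\"1.0\" encoding=\"windows-1252\"?><d>"
                    .getBytes("windows-1252");
            byte[] tail = "&#x20AC;</d>".getBytes("windows-1252");
            byte[] doc  = new byte[head.length + 1 + tail.length];
            System.arraycopy(head, 0, doc, 0, head.length);
            doc[head.length] = (byte) 0x80;               // meant as "binary" by the sender
            System.arraycopy(tail, 0, doc, head.length + 1, tail.length);

            Document d = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(doc));
            String text = d.getDocumentElement().getTextContent();

            // Both characters arrive as the Euro sign; the byte value 0x80 is gone.
            System.out.printf("U+%04X U+%04X%n", (int) text.charAt(0), (int) text.charAt(1));
        }
    }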
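And here is a much-simplified sketch of the Appendix F auto-detection mentioned above. It only shows the initial byte-sniffing; the real algorithm also covers UCS-4 byte orders and EBCDIC, and goes on to read the encoding declaration in the detected family to pin down the exact encoding, so treat this as an illustration rather than a conforming implementation.

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class AppendixFSketch {
        // Guess the encoding family from the first bytes of the entity. The real
        // algorithm then reads the "<?xml ... encoding=...?>" declaration in that
        // family to get the exact encoding; that second step is omitted here.
        static Charset guessFamily(byte[] b) {
            if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB
                    && (b[2] & 0xFF) == 0xBF)
                return StandardCharsets.UTF_8;                                  // UTF-8 BOM
            if (b.length >= 2) {
                int b0 = b[0] & 0xFF, b1 = b[1] & 0xFF;
                if (b0 == 0xFE && b1 == 0xFF) return StandardCharsets.UTF_16BE; // BOM
                if (b0 == 0xFF && b1 == 0xFE) return StandardCharsets.UTF_16LE; // BOM
                if (b0 == 0x00 && b1 == 0x3C) return StandardCharsets.UTF_16BE; // "<" as 00 3C
                if (b0 == 0x3C && b1 == 0x00) return StandardCharsets.UTF_16LE; // "<" as 3C 00
            }
            return StandardCharsets.UTF_8;  // no BOM and no other clue: assume UTF-8
        }

        public static void main(String[] args) {
            String decl = "<?xml version=\"1.0\"?><a/>";
            System.out.println(guessFamily(decl.getBytes(StandardCharsets.UTF_16BE))); // UTF-16BE
            System.out.println(guessFamily(decl.getBytes(StandardCharsets.UTF_8)));    // UTF-8
        }
    }

The point of the BOM and the "<?xml" sniffing is that the entity carries its own encoding information in-band, which is what makes auto-detection more robust than the external chain of metadata described above.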
Two further points:

* The nail in the coffin for external transmission of the character encoding (rather than using auto-detection) is that standard APIs for writing strings to files do not provide a built-in mechanism for transmitting the encoding (see the sketch in the P.S.). (This would require something like my XText format, which basically generalizes Appendix F for use in almost any textual data format.)

* XML does not constrain the actual coded character set that an implementation accepts externally, except that all implementations should accept UTF-8 and UTF-16.

As with many standards, XML describes its input in concrete terms, but its output only in abstract terms. Indeed, the output of an XML parser function was defined in such vague terms that an ancillary standard, XML Infoset, was written to provide help for subsequent standards.

I hope this is useful, even for aspiring trolls,

Rick Jelliffe
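P.S. A minimal Java sketch of the point about file APIs (the file name euro.txt is just an example): the encoding is a parameter you pass when writing, but nothing in the resulting file records which encoding was chosen, so a reader that guesses differently gets mojibake.

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;

    public class EncodingNotTransmitted {
        public static void main(String[] args) throws Exception {
            // The writer chooses UTF-8, but that choice is recorded nowhere in the file.
            Writer w = new OutputStreamWriter(new FileOutputStream("euro.txt"), "UTF-8");
            w.write("\u20AC");   // the Euro sign
            w.close();

            // A reader that assumes the locale default (say, windows-1252) gets mojibake.
            byte[] bytes = new byte[3];
            FileInputStream in = new FileInputStream("euro.txt");
            in.read(bytes);
            in.close();
            System.out.println(new String(bytes, "windows-1252")); // "â‚¬" rather than "€"
        }
    }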