Re: Where does a parser get the replacement text for a characterreferenc
> | I assume that it would depend on what encoding the xml that you are
> | parsing has.
>
> Actually, no.

More like: "sort of yes". Java developers tend to assume Unicode is the universal way to represent character data, but folk working in other languages may not be so fortunate. Parser APIs aren't required to transcode into a UTF (UTF-8, UTF-16, UTF-32); they may deliver characters in other encodings, including the input encoding.

Using the original U+E311 private-use character as an example, it could be natural to have some component transcode it to the local character set. That may be preferred for Klingon, or for other characters that don't have code points in Unicode. (A while back, I think Taiwan needed to use that approach; dunno if that's less of an issue in Unicode 3.1.)

> Character references always refer to Unicode characters.

Or surrogate pairs: they refer to ISO 10646 characters, which can be represented in UTF-16 as one or two 16-bit units. It's explicitly illegal to have references to the surrogates themselves, but characters in the "Astral Planes" expand to two UTF-16 code units (or one UTF-32 code unit).

- Dave
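
To make the last point concrete, here is a minimal sketch using Python's `xml.etree.ElementTree`, which is only one parser among many and happens to deliver Unicode strings. The helper names `expand` and `utf16_units` are made up for this illustration, and U+1D11E is just an arbitrary character outside the BMP. It expands the U+E311 reference from the thread, expands an astral-plane reference, and then tries a reference to a lone surrogate, which a conforming parser should reject.

```python
import xml.etree.ElementTree as ET


def utf16_units(text: str) -> int:
    """Count the 16-bit code units needed to represent text in UTF-16."""
    return len(text.encode("utf-16-be")) // 2


def expand(reference: str) -> str:
    """Parse a tiny document whose only content is the given character reference."""
    return ET.fromstring(f"<doc>{reference}</doc>").text


# Private-use character from the thread: inside the BMP, one UTF-16 unit.
bmp = expand("&#xE311;")

# A character outside the BMP (illustrative choice): two UTF-16 units.
astral = expand("&#x1D11E;")

print(hex(ord(bmp)), utf16_units(bmp))
print(hex(ord(astral)), utf16_units(astral))

# A reference to a surrogate code point is not a legal character reference,
# so the parser is expected to raise rather than return text.
try:
    expand("&#xD834;")
    surrogate_rejected = False
except ET.ParseError:
    surrogate_rejected = True

print("surrogate reference rejected:", surrogate_rejected)
```

Note that the parser here hands back one character for the astral reference; the "two units" only appear once something encodes it as UTF-16. A parser API that delivers UTF-16 directly would hand the application both units.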