Re: Where does a parser get the replacement text for a characterreferenc
* Ben Ryan
|
| I assume that it would depend on what encoding the xml that you are
| parsing has.

* Lars Marius Garshol
|
| Actually, no.

* David Brownell
|
| More like: "sort of yes". Java developers tend to assume Unicode is
| the universal way to represent character data, but folk working in other
| languages may not be so fortunate.

That is true. I must admit that I've worked enough with Unicode to
have brainwashed myself into thinking that Unicode is the one true way
to represent text.

| Parser APIs aren't required to transcode into a UTF (UTF-8, UTF-16,
| UTF-32); they may deliver characters in other encodings, including
| the input encoding.

They may. The interpretation of the character reference is determined
by Unicode, however, and is completely independent of the input
encoding of the document. So in that sense my statement stands. You
are of course right that this does not necessarily mean that your
application will receive this character encoded as a Unicode
character.

| Using the original U+E311 private-use character as an example, it
| could be natural to have some component transcode it to the local
| character set. That may be preferred for Klingon, or for other
| characters that don't have code points in Unicode.

That is true, though one would assume that this would not necessarily
be possible. If the character could be expressed in the local
character encoding, why was it encoded with a character reference in
the first place?

| (A while back, I think Taiwan needed to use that approach; dunno if
| that's less of an issue in 3.1 Unicode.)

One would assume so, given the addition of more than 40,000 new
Chinese characters in Unicode 3.1. :-)

This issue is not likely to ever go away completely for living
ideographic scripts, however, since new characters keep being created
all the time, although at a slow pace.

* Lars Marius Garshol
|
| Character references always refer to Unicode characters.

* David Brownell
|
| Or surrogate pairs

No.
Surrogate pairs are an artifact of the UTF-16 character encoding and
conceptually they do not exist outside it. In other words &#x10416;
does not refer to a surrogate pair; it refers to U+10416, DESERET
CAPITAL LETTER JEE.

| -- they refer to ISO-10646 characters, which can be represented in
| Unicode as one or two 16-bit units.

They can be represented in UTF-16 as one or two 16-bit units, but
UTF-16 and Unicode are not the same. Unicode is the character set,
UTF-16 is one of its (too) many encodings.

| It's explicitly illegal to have references to surrogate pairs,

I guess that by this you mean that "it's explicitly illegal to refer
to characters as a pair of character references each referring to a
surrogate". That is so because it does not make sense to import the
UTF-16 kluge that surrogate pairs are into XML when one can refer
directly to the code point instead.

| but characters in the "Astral Planes" expand to two UTF-16
| characters

No, they are single characters, in UTF-16 represented by a pair of
16-bit code units.

| (or one UTF-32).

They are represented as a single 32-bit code unit in UTF-32, yes.

--Lars M.
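
The points above can be checked with a small sketch. This is only an
illustration in Python, assuming the standard xml.etree.ElementTree
module (which sits on top of expat); the helper names referenced_char,
utf16_code_units and is_well_formed are made up for this example. It
parses the same character reference from documents stored in several
encodings, shows the UTF-16 code units for the resulting character,
and tries a pair of references to surrogates.

```python
import xml.etree.ElementTree as ET


def referenced_char(encoding: str) -> str:
    """Parse a one-element document stored in `encoding`; return its text."""
    doc = f'<?xml version="1.0" encoding="{encoding}"?><c>&#x10416;</c>'
    # The bytes handed to the parser differ per encoding; the reference does not.
    return ET.fromstring(doc.encode(encoding)).text


def utf16_code_units(ch: str) -> list:
    """Return the 16-bit code units that UTF-16 uses to represent `ch`."""
    raw = ch.encode("utf-16-be")
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


def is_well_formed(doc: str) -> bool:
    """Report whether the parser accepts `doc`."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    for enc in ("us-ascii", "iso-8859-1", "utf-8", "utf-16"):
        ch = referenced_char(enc)
        # One character, code point U+10416, whatever the input encoding was.
        print(enc, len(ch), f"U+{ord(ch):04X}")

    ch = referenced_char("utf-8")
    # Two code units in UTF-16 (a surrogate pair), one 32-bit unit in UTF-32.
    print([hex(u) for u in utf16_code_units(ch)])
    print(len(ch.encode("utf-32-be")) // 4)

    # Referring to the two surrogates separately is not well-formed XML.
    print(is_well_formed("<c>&#xD801;&#xDC16;</c>"))
    print(is_well_formed("<c>&#x10416;</c>"))
```

In each case the application sees one character with code point
U+10416. The surrogate pair 0xD801 0xDC16 shows up only when that
character is encoded as UTF-16, and the document that spells out the
two surrogates as separate references is rejected.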








