Re: Where does a parser get the replacement text for a character reference?
* David Brownell
|
| I think Lars and I are agreeing ...

It does sound suspiciously like it, yes. No reason to be disappointed, though. I'm sure we can find something we do disagree on. :-)

* Lars Marius Garshol
|
| That is true, though one would assume that this would not necessarily
| be possible. If the character could be expressed in the local
| character encoding, why was it encoded with a character reference in
| the first place?

* David Brownell
|
| If the text were encoded in UTF-8 for interchange purposes, then any
| given local system might use different encodings ...

It might, and indeed I've written code to decode UTF-8 into local encodings several times. When doing this, however, one always runs the risk that there will be characters in the input that cannot be represented in the output.

| there must be some convention to establish agreement on what a given
| private-use character means. Presumably folk who work with systems
| using those characters could describe how they work. A few years
| back, I heard questions about how such conventions ought to be
| structured.

The Unicode standard does subdivide the private use area into different parts for different uses, but I don't know enough about this to say much more.

* Lars Marius Garshol
|
| No. Surrogate pairs are an artifact of the UTF-16 character encoding
| and conceptually they do not exist outside it.

* David Brownell
|
| More or less; the Unicode spec defines surrogates, and what pairing
| them means.

The definition of the UTF-16 encoding does, yes. Surrogates are not Unicode characters, however, and encoding a pair of them using UTF-8 or UTF-32 is not (AFAIR) legal, much less meaningful.

The recent UTF-8S proposal requires using surrogates instead of encoding code points directly, but this is controversial for several reasons, one of which is that it simply imports the problems of UTF-16 into UTF-8, which previously did not have them.
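The risk described above is easy to demonstrate. The following is a minimal sketch (class name, charset choice, and messages are mine, not from the thread): transcoding a decoded UTF-8 string into a narrower local encoding such as ISO-8859-1 fails when the text contains characters the target charset cannot represent.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

public class Transcode {
    public static void main(String[] args) {
        // U+10416 (DESERET CAPITAL LETTER JEE) as a Java surrogate pair
        String text = "Deseret \uD801\uDC16";
        CharsetEncoder encoder = Charset.forName("ISO-8859-1").newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(text));
            System.out.println("encoded cleanly");
        } catch (CharacterCodingException e) {
            // U+10416 has no ISO-8859-1 representation, so we end up here
            System.out.println("unrepresentable in ISO-8859-1");
        }
    }
}
```

With `CodingErrorAction.REPLACE` instead of `REPORT`, the encoder would silently substitute a replacement byte, which is the other common way local systems cope with this.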
* David Brownell
|
| But equating Unicode with UTF-16, to match common usage (and clearly
| not wearing my pedantic hat :) that point is not going to be
| understood very widely, because ...

* Lars Marius Garshol
|
| In other words &#x10416; does not refer to a surrogate pair; it
| refers to U+10416, DESERET CAPITAL LETTER JEE.

* David Brownell
|
| ... that is _represented_ as a "surrogate pair" in Java and many other
| programming environments: two Java "char" values are needed to
| represent a single (up one level) "character".

I agree that most people thoroughly confuse UTF-16, UCS-2 and Unicode, and I think that dates from the time when the Unicode people themselves did not distinguish between the encodings and the character set. Probably the lack of a need for such a distinction when working with Western encodings has contributed to the problem.

This is the very reason I responded to your message, though, since I think that confusion needs to be corrected.

* Lars Marius Garshol
|
| They can be represented in UTF-16 as one or two 16-byte units, but
| UTF-16 and Unicode are not the same. Unicode is the character set,
| UTF-16 is one of its (too) many encodings.

* David Brownell
|
| But a "char"acter in Java (or wchar_t on Win32) is a 16-bit (not
| byte :) unit, hence the semantic confusion when you talk about a
| "character".

It is a source of confusion, I agree, and all the more reason to clear it up. :-)

--Lars M.
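The distinction above between one Unicode character and two Java "char" values can be shown directly. A minimal sketch (class and variable names are mine): U+10416 is a single code point, but Java's 16-bit char type forces it into a surrogate pair.

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        int codePoint = 0x10416;  // DESERET CAPITAL LETTER JEE
        String s = new String(Character.toChars(codePoint));

        System.out.println(s.length());                      // 2 "char" units
        System.out.println(s.codePointCount(0, s.length())); // 1 character
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
    }
}
```

So `String.length()` counts UTF-16 code units, not characters; code-point-aware methods like `codePointCount` and `codePointAt` are needed to talk about characters "up one level".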