Re: Java/Unicode brain damage
At 08:24 PM 26/07/01 -0700, David Brownell wrote:
>> A Java 'char' is a 16 bit data type, so it simply isn't possible for
>> it to directly represent a Unicode character.
>
>Could you elaborate? There's a section in my Unicode book
>(in another city :) that talks about surrogates. There's a sense
>in which "if it's listed there, it's a kind of character".
>
>The word "character" is heavily overloaded, but I think it's
>clear that in at least one sense a Java "char" _is_ what folk
>call a "character". That's just how the word is used, even
>if it's arguably sloppy usage for other contexts.
>
>It would likely be instructive to have someone explain
>the senses in which "char" is, and isn't, a character.

It is clear that a Java "char" (hereinafter jchar) cannot represent an XML character (xchar), for two reasons: a jchar can be in the surrogate range and an XML character can't, and a jchar can't represent a value outside of the BMP, but such values are legal xchars.

As for combiners and so on, XML and Java agree that COMBINING ACUTE ACCENT and the like are characters. Yes, there's a problem in that there are multiple ways to represent things that will render identically; that's why the W3C published a canonical character composition model.

I think it's clear that a jchar can represent a UTF-16 encoding unit, but Java currently doesn't know about the semantics associated with surrogates, i.e. that they have to appear in pairs, each pair representing one non-BMP char. I think I still believe that a jchar is really trying to represent UCS-2.

>ISO-10646 code points
>are (as I understand) not necessarily going to be able
>to represent a "character" either (32 bits v. 16).

Well, an xchar is by definition a Unicode/ISO10646 code point (hereinafter uchar). Yes, there are things that a typographer would consider a "character" that can't be represented in a single xchar or uchar. But damn few, actually; there are uchars for pretty well anything you're apt to encounter outside the domain of bleeding-edge math research.

The worrying thing is that for 99.9999999999% of all real-world XML processing, if you pretend that a jchar represents an xchar, you won't get in any trouble. So I bet there's a huge amount of Java code out there right now that makes this assumption. I don't think we have much understanding now as to what flavor of breakage is apt to occur when (if) non-BMP data starts flowing through such code.

-Tim
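To make the jchar/xchar mismatch concrete, here is a minimal sketch in Java. It assumes the code-point methods (String.codePointAt, Character.charCount) that were added in J2SE 5.0, so it postdates this message. The class and method names are hypothetical, chosen for illustration only. The range test follows the Char production of XML 1.0.

```java
// Hypothetical illustration; none of these names come from the thread.
class JcharVsXchar {

    // XML 1.0 "Char" production, applied to a full code point (uchar).
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // The tempting assumption: every 16-bit jchar is one xchar.
    // Any surrogate code unit fails the range test, so valid
    // non-BMP text is wrongly rejected.
    static boolean naiveAllXmlChars(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (!isXmlChar(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Surrogate-aware version: decode each pair into one code point
    // before testing. An unpaired surrogate decodes to itself and is
    // rejected by isXmlChar.
    static boolean codePointAwareAllXmlChars(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isXmlChar(cp)) {
                return false;
            }
            i += Character.charCount(cp); // 1 for BMP, 2 for non-BMP
        }
        return true;
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF, written as its surrogate pair.
        String clef = "\uD834\uDD1E";
        System.out.println("jchars:  " + clef.length());
        System.out.println("uchars:  " + clef.codePointCount(0, clef.length()));
        System.out.println("naive:   " + naiveAllXmlChars(clef));
        System.out.println("decoded: " + codePointAwareAllXmlChars(clef));
    }
}
```

For BMP-only input the two checks agree, which is why the per-jchar assumption so rarely causes trouble; they diverge only once a surrogate pair shows up.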