Re: Java/Unicode brain damage

  • From: Tim Bray <tbray@t...>
  • To: xml-dev@l...
  • Date: Thu, 26 Jul 2001 21:11:08 -0700

At 08:24 PM 26/07/01 -0700, David Brownell wrote:
>> A Java 'char' is a 16 bit data type, so it simply isn't possible for
>> it to directly represent a Unicode character. 
>
>Could you elaborate?  There's a section in my Unicode book
>(in another city :) that talks about surrogates.  There's a sense
>in which "if it's listed there, it's a kind of character".
>
>The word "character" is heavily overloaded, but I think it's
>clear that in at least one sense a Java "char" _is_ what folk
>call a "character".  That's just how the word is used, even
>if it's arguably sloppy usage for other contexts.
>
>It would likely be instructive to have someone explain
>the senses in which "char" is, and isn't, a character.

It is clear that a Java "char" (hereinafter jchar) cannot
represent an XML character (xchar), simply because a jchar
can be in the surrogate range and an XML character can't; 
also because a jchar can't represent a value outside of
the BMP, but such values are legal xchars.
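
A minimal sketch of those two mismatches, assuming the Char production from XML 1.0 as the definition of an xchar. The class and method names (JcharVsXchar, isXmlChar) are made up for illustration and come from no library.

```java
public class JcharVsXchar {
    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // A jchar is 16 bits wide, so it can only hold values up to 0xFFFF.
    static boolean fitsInJchar(int codePoint) {
        return codePoint >= 0 && codePoint <= Character.MAX_VALUE;
    }

    public static void main(String[] args) {
        char loneSurrogate = '\uD800';   // a perfectly legal jchar value
        int nonBmp = 0x10400;            // a code point outside the BMP

        // A jchar in the surrogate range is not an xchar.
        System.out.println(isXmlChar(loneSurrogate));   // false
        // A non-BMP value is a legal xchar but does not fit in a jchar.
        System.out.println(isXmlChar(nonBmp));          // true
        System.out.println(fitsInJchar(nonBmp));        // false
    }
}
```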

As for combiners and so on, XML and Java agree that 
COMBINING ACUTE ACCENT and so on are characters.  Yes,
there's a problem in that there are multiple ways to
represent things that will render identically; that's
why the W3C published a canonical character composition
model.
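
To make the "multiple ways" problem concrete, here is a small sketch. It assumes Unicode Normalization Form C as a stand-in for the composition model mentioned above, and it uses java.text.Normalizer, which was added to Java (in Java 6) well after this message was written. ComposeDemo and compose are illustrative names only.

```java
import java.text.Normalizer;

public class ComposeDemo {
    // Compose base letter + combining marks into precomposed form (NFC).
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301";   // 'e' followed by COMBINING ACUTE ACCENT
        String precomposed = "\u00E9";   // LATIN SMALL LETTER E WITH ACUTE

        // Renders the same, but the jchar sequences differ.
        System.out.println(decomposed.equals(precomposed));           // false
        // After composition the two spellings compare equal.
        System.out.println(compose(decomposed).equals(precomposed));  // true
    }
}
```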

I think it's clear that a jchar can represent a UTF-16
encoding unit, but Java currently doesn't know about
the semantics associated with surrogates, i.e. that they
have to appear in pairs, each pair representing one
non-BMP char.  I think I still believe that a jchar is
really trying to represent UCS-2.
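
The pairing rule can be sketched by hand, assuming the standard UTF-16 ranges (high surrogates D800-DBFF, low surrogates DC00-DFFF). SurrogatePairs and its methods are hypothetical helpers written for this illustration; later Java releases (5 and up) ship equivalents on java.lang.Character, but none are used here.

```java
public class SurrogatePairs {
    static boolean isHigh(char c) { return c >= 0xD800 && c <= 0xDBFF; }
    static boolean isLow(char c)  { return c >= 0xDC00 && c <= 0xDFFF; }

    // Combine a high/low surrogate pair into the non-BMP code point it encodes.
    static int decode(char high, char low) {
        if (!isHigh(high) || !isLow(low)) {
            throw new IllegalArgumentException("not a surrogate pair");
        }
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    // Split a non-BMP code point into its two UTF-16 code units.
    static char[] encode(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a non-BMP code point");
        }
        int offset = codePoint - 0x10000;
        return new char[] {
            (char) (0xD800 + (offset >> 10)),
            (char) (0xDC00 + (offset & 0x3FF))
        };
    }

    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF is the pair D834 DD1E.
        System.out.println(Integer.toHexString(decode('\uD834', '\uDD1E'))); // 1d11e
        char[] units = encode(0x10400);
        System.out.println(Integer.toHexString(units[0]) + " "
            + Integer.toHexString(units[1]));                               // d801 dc00
    }
}
```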

>ISO-10646 code points
>are (as I understand) not necessarily going to be able
>to represent a "character" either (32 bits v. 16).

Well, an xchar is by definition a Unicode/ISO10646 code
point (hereinafter uchar).  Yes, there are things that 
a typographer would consider a "character" that can't be 
represented in a single xchar or uchar.  But damn few,
actually: there are uchars for pretty well anything
you're apt to encounter outside the domain of
bleeding-edge math research.

The worrying thing is that for 99.9999999999% of all
real-world XML processing, if you pretend that a jchar
represents an xchar, you won't get in any trouble.  So
I bet there's a huge amount of Java code out there right
now that makes this assumption.  I don't think we have
much understanding now as to what flavor of breakage is
apt to occur when (if) non-BMP data starts flowing 
through such code.  -Tim
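
One plausible flavor of that breakage, sketched under the assumption that the code counts and truncates by jchar. NonBmpBreakage and its helpers are invented for this illustration, and String.codePointCount is a later addition to Java (Java 5), shown only for comparison.

```java
public class NonBmpBreakage {
    // U+1D11E MUSICAL SYMBOL G CLEF, written as its UTF-16 surrogate pair.
    static final String CLEF = "\uD834\uDD1E";

    // Naive "character count": really a count of jchars (UTF-16 code units).
    static int naiveLength(String s) {
        return s.length();
    }

    // Naive truncation to n "characters"; may cut a surrogate pair in half.
    static String naiveTruncate(String s, int n) {
        return s.substring(0, Math.min(n, s.length()));
    }

    // True if s contains a surrogate that is not part of a well-formed pair.
    static boolean hasLoneSurrogate(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0xD800 && c <= 0xDBFF) {
                boolean paired = i + 1 < s.length()
                    && s.charAt(i + 1) >= 0xDC00 && s.charAt(i + 1) <= 0xDFFF;
                if (!paired) return true;
                i++;                       // skip the low half of the pair
            } else if (c >= 0xDC00 && c <= 0xDFFF) {
                return true;               // low surrogate with no high before it
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(naiveLength(CLEF));                         // 2
        System.out.println(CLEF.codePointCount(0, CLEF.length()));     // 1
        System.out.println(hasLoneSurrogate(naiveTruncate(CLEF, 1)));  // true
    }
}
```

The truncated string holds a lone surrogate, which is not a legal XML character, so serializing it would yield a document that a conforming parser should reject.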

