Java/Unicode brain damage

  • From: Elliotte Rusty Harold <elharo@m...>
  • To: xml-dev@l...
  • Date: Wed, 25 Jul 2001 10:22:40 -0400

At 11:13 PM -0700 7/24/01, Tim Bray wrote:

>Which means in effect that Dave's right, basically you just
>totally can't use a java's String or char in dealing with
>Blueberry docs.  Or am I missing something... please?  Or
>re-open the door to the UTF-16 hack by putting the 
>surrogate blocks back into [2] as part of the Blueberry
>update.
>
>Er, is anyone in the Java language team on top of what
>Unicode's up to?  This is a real problem.
>
>Somebody ship some Prozac over to Elliote before he goes
>critical... -Tim

I'm afraid the Java mess sent me over the edge a long time ago. :-) I've actually given quite a lot of thought to that problem in other forums, and for a while I was even arguing that JDOM needed to replace the String class in order to be XML compliant. However, on further reflection I decided maybe the problem wasn't quite that bad. It's still pretty bad, but it's not insurmountable.

The Java way to handle this is to stop thinking of a Java char as representing a Unicode character. It doesn't. A Java char represents a UTF-16 code unit, which may be one half of a surrogate pair. The public API to java.lang.String is essentially a UTF-16 API. For example, the length() method of a string does not return the number of Unicode characters in the string. Rather, it returns the number of UTF-16 code units. A string containing a single Plane-1 character has length 2 in Java.
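To make the length() point concrete, here is a minimal sketch. The class name SurrogateLength and the choice of U+1D11E (MUSICAL SYMBOL G CLEF) as the sample Plane-1 character are mine, purely for illustration. The codePointCount() call is an assumption about your JDK: it was only added to java.lang.String in Java 5, so it is not available on the JDKs current as of this message.

```java
final class SurrogateLength {
    // U+1D11E MUSICAL SYMBOL G CLEF, a Plane-1 character, spelled out
    // as its UTF-16 surrogate pair: high D834, low DD1E.
    static final String G_CLEF = "\uD834\uDD1E";

    // What String.length() reports: the number of 16-bit chars.
    static int charCount(String s) {
        return s.length();
    }

    // The number of Unicode characters. Assumes Java 5 or later,
    // where String.codePointCount(int, int) exists.
    static int characterCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        System.out.println("chars:      " + charCount(G_CLEF));
        System.out.println("characters: " + characterCount(G_CLEF));
    }
}
```

One Unicode character, two Java chars: any code that treats s.charAt(i) as "the i-th character" is off by one for everything after the clef.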

This is inconvenient as all get out, but as long as you realize what's going on and code carefully, it's not necessarily wrong. Java is just providing a less than ideal representation of strings. For example, when a parser or other method is checking a string to see if it's a legal XML Blueberry name, it cannot simply pass each char in the String to an isBlueberryNameCharacter() method. Instead, it has to look at the whole string in toto and do its own decoding of surrogate pairs into Unicode characters before checking. The logic is much more complex, but it is doable, and it does work with existing Java APIs for processing XML.
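A rough sketch of what that whole-string check might look like, using nothing beyond charAt() and length(). The class name NameScanner is hypothetical, and the body of isBlueberryNameCharacter() is only a placeholder (ASCII letters plus anything outside the BMP), not the actual Blueberry name rules. The sketch also ignores the distinction between name-start and name characters. Only the surrogate decoding is the point here.

```java
final class NameScanner {
    // Placeholder predicate, NOT the real name-character rules.
    // It takes an int because a char cannot hold a non-BMP character.
    static boolean isBlueberryNameCharacter(int codePoint) {
        return (codePoint >= 'a' && codePoint <= 'z')
            || (codePoint >= 'A' && codePoint <= 'Z')
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    // Walks the string, combining surrogate pairs into single
    // Unicode characters before asking the predicate about them.
    static boolean isName(String s) {
        if (s.length() == 0) return false;
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i++);
            int codePoint = c;
            if (c >= 0xD800 && c <= 0xDBFF) {
                // High surrogate: a low surrogate must follow.
                if (i == s.length()) return false;
                char low = s.charAt(i++);
                if (low < 0xDC00 || low > 0xDFFF) return false;
                codePoint = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
            } else if (c >= 0xDC00 && c <= 0xDFFF) {
                // Low surrogate with no preceding high surrogate.
                return false;
            }
            if (!isBlueberryNameCharacter(codePoint)) return false;
        }
        return true;
    }
}
```

Note that unpaired surrogates are rejected outright rather than passed to the predicate. A lone surrogate is not a Unicode character at all, so it can never be a legal name character, whatever the final rules turn out to be.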

FYI, I deliberately didn't bring this up previously, because even though Blueberry makes the problem worse, the problem still exists for element content and attribute values in XML 1.0. Furthermore, I think Java is broken enough here that Java needs to change. I don't think XML should be limited by this brain damage in Java. One silver lining to the Blueberry cloud might be that it could convince Sun to use a four-byte char like they should have back in 1995. 

Although Java's the only language I'm intimately familiar with these days, I do think it would be informative to see how other languages handle these issues. Would anyone care to address the handling of non-BMP text in Python, Perl, C, C++, Fortran, AppleScript, Rexx, Delphi, Visual Basic, etc.?
-- 

+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo@m... | Writer/Programmer |
+-----------------------+------------------------+-------------------+ 
|          The XML Bible, 2nd Edition (Hungry Minds, 2001)           |
|              http://www.ibiblio.org/xml/books/bible2/              |
|   http://www.amazon.com/exec/obidos/ISBN=0764547607/cafeaulaitA/   |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/      | 
|  Read Cafe con Leche for XML News: http://www.ibiblio.org/xml/     |
+----------------------------------+---------------------------------+
