
Re: Java/Unicode brain damage

  • From: Joel Rees <rees@s...>
  • To: xml-dev@l...
  • Date: Thu, 26 Jul 2001 15:17:42 +0900

Inquiring minds and Elliotte Rusty Harold want to know:

[snipped]

> The Java way to handle this is to stop thinking of a Java char as
> representing a Unicode character. It doesn't. A Java char represents
> a UTF-16 code point, which may be a surrogate.

[snipped]

> Although Java's the only language I'm intimately familiar with these
> days, I do think it would be informative to see how other languages
> handle these issues. Would anyone care to address the handling of
> non-BMP text in Python, Perl, C, C++, Fortran, AppleScript, Rexx,
> Delphi, Visual Basic, etc?

Here's what I know about it:

(Hint -- somebody who really knows, correct me!)

CoBOL, as I understand it, hides (from) the problem by not telling anyone
what is happening underneath. Where you need anything beyond 7 bits, you use
"NATIONAL" characters, and you get whatever functionality the system gives
you. String functions are predefined; you just use them.

I once knew something about ForTran.

Delphi has wide characters that are (presently) 16 bits. If we try to deal
with anything beyond the BMP, we usually use surrogate pairs. For some
intermediate operations, we do convert to UTF-32 (with Unicode 3.1, it's
official now!). For file I/O, we usually convert between UTF-16 Unicode and
(Shift-)JIS.
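
As a rough illustration of the surrogate arithmetic involved (a sketch in
Python rather than Delphi; the function names are made up for this example),
a code point beyond the BMP splits into a UTF-16 surrogate pair and
recombines into a single UTF-32 value like this:

```python
def to_surrogates(cp):
    """Split a code point above U+FFFF into a (high, low) UTF-16 pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    v = cp - 0x10000                  # 20 significant bits remain
    high = 0xD800 + (v >> 10)         # top 10 bits go in the high surrogate
    low = 0xDC00 + (v & 0x3FF)        # bottom 10 bits go in the low surrogate
    return high, low


def from_surrogates(high, low):
    """Recombine a surrogate pair into the single UTF-32 value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x10400)
print(hex(high), hex(low))               # 0xd801 0xdc00
print(hex(from_surrogates(high, low)))   # 0x10400
```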

The char in C is a byte, and most C libraries assume strings are built of
bytes, so C tends to use variable-width characters. (Read that as UTF-8 for
Unicode.) You can't back up safely with Shift-JIS, so you sometimes dump
things temporarily to fixed-width buffers when you need random access.
Although you can back up safely with UTF-8, it's still sometimes convenient
to temporarily dump a UTF-8 string to a fixed-width buffer. Since these
buffers are rather local in nature (can't be worked on by most of the
standard libraries at this time), widening them from 16 bits to 32 bits
does not usually cause any ripples. Note that UTF-32 was not official
Unicode until 3.1, so the typing and other machinery for the 32-bit
temporary buffers is somewhat ad hoc, not that it matters much.
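
A small sketch of both ideas, written in Python for brevity rather than C
(the helper names are hypothetical, and the input is assumed to be
well-formed UTF-8). Backing up works in UTF-8 because continuation bytes
always have the bit pattern 10xxxxxx and can never be mistaken for the start
of a character; Shift-JIS trail bytes have no such distinguishing pattern.

```python
def prev_char_start(buf, i):
    """Return the index where the UTF-8 character before offset i begins.

    Assumes buf is well-formed UTF-8 and 0 < i <= len(buf).
    """
    i -= 1
    # Continuation bytes look like 10xxxxxx; skip back over them.
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i


def to_fixed_width(buf):
    """Decode UTF-8 bytes into a list of integers, one per character."""
    return [ord(ch) for ch in buf.decode("utf-8")]


# One character each of 1, 2, 3 and 4 bytes: 10 bytes in total.
text = "a\u00e9\u65e5\U00010400".encode("utf-8")
print(prev_char_start(text, len(text)))   # 6
print(prev_char_start(text, 6))           # 3
wide = to_fixed_width(text)
print(len(wide), hex(wide[3]))            # 4 0x10400
```

The list returned by to_fixed_width plays the role of the temporary 32-bit
buffer: every element is one character, so random access is just indexing.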

I think I have seen 16-bit character string classes in C++. But these are
classes.

The manuals for Objective-C say that NSString conversion to UTF-8 is just a
copy. Apparently the general assumption is UTF-8.

Perl and Ruby (the language) both declare that you don't really know what a
string looks like inside, but they both are built on C and make heavy use of
pre-existing RE code, so we can assume they are presently handling things at
the byte string level. What I have heard in the Perl forums indicates they
are going with UTF-8 internally in most of the Unicode support, but I may be
wrong. One thing about Perl: you can get just about anything you want.

BTW, variable-width byte strings fit naturally with building character
classification tables in small chunks, which is useful for eliminating
redundant subtables.
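
One way to picture this, as a sketch only: index the tables by the UTF-8
bytes themselves, so each lead byte selects a 64-entry chunk and identical
chunks are stored once. The "is it a letter?" property, the restriction to
one- and two-byte sequences, and all names below are assumptions made for
the example; the data comes from Python's unicodedata module.

```python
import unicodedata


def is_letter(cp):
    """True if the code point's general category is one of the L* classes."""
    return unicodedata.category(chr(cp)).startswith("L")


def build_tables():
    """Build byte-indexed tables for code points below U+0800."""
    pool = {}   # chunk contents -> the single shared copy of that chunk
    ascii_table = tuple(is_letter(cp) for cp in range(0x80))
    by_lead = {}
    for lead in range(0xC2, 0xE0):            # valid two-byte lead bytes
        base = (lead & 0x1F) << 6             # first code point in the chunk
        chunk = tuple(is_letter(base + low) for low in range(64))
        by_lead[lead] = pool.setdefault(chunk, chunk)   # reuse duplicates
    return ascii_table, by_lead, len(pool)


def classify(tables, encoded):
    """Look up one UTF-8 encoded character (one or two bytes long)."""
    ascii_table, by_lead, _ = tables
    first = encoded[0]
    if first < 0x80:
        return ascii_table[first]
    return by_lead[first][encoded[1] & 0x3F]


tables = build_tables()
print(classify(tables, "A".encode("utf-8")))        # True
print(classify(tables, "\u00e9".encode("utf-8")))   # True
print(classify(tables, "\u00d7".encode("utf-8")))   # False
print(tables[2] < 30)                               # True
```

The last line reports that fewer than 30 distinct chunks back the 30 lead
bytes, because runs that are entirely letters (Latin Extended-A, for
instance) collapse into one shared chunk.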

HTH

Joel Rees
programmer -- rees@m...
----------------------------------------------------
To be a tree supporting all information,
  giving root to the chaos
    and branches to the trivia,
      information breathing anew --
        This is the aim of Yggdrasill.
============================XML as Best Solution===
Media Fusion Co., Ltd.
Amagasaki  TEL 81-6-6415-2560    FAX 81-6-6415-2556
    Tokyo  TEL 81-3-3516-2566    FAX 81-3-3516-2567
                       http://www.mediafusion.co.jp
===================================================


