
Re: Non-Unicode Character Sets

  • From: roddey@u...
  • To: xml-dev@i...
  • Date: Mon, 31 Jan 2000 15:52:07 -0700


>I am told that conversion of some character sets through Unicode is
>lossy and cannot be round-tripped. But it occurs to me that as long as
>one has the private use area, "unknown" characters can always be
>preserved. If a particular mapping loses information, isn't that more a
>weakness in the mapping than in Unicode itself? Are there some
>standardized national character sets with so many non-Unicode characters
>that they cannot fit into the PUA? Even with planes 15 and 16?

Don't know the answer to that, but just as a related aside...

In some cases the problem isn't round-tripping, it's 'half-tripping', due to
the weird design of the encoding. For instance, we have had some problems with
some Japanese and Korean encodings because of ambiguity between the
backslash and the Yen sign. When you transcode that code point to Unicode, you
have to know the context of the text being transcoded in order to know
which translation is the correct one. If you transcode it to the Yen sign, and
you then turn around and pass that text to, say, a 'file open' Unicode API on a
system that is inherently Unicode enabled, it breaks, because the
Unicode Yen sign probably isn't a legal path separator on that platform. If
you transcode it to backslash, and the text was a monetary value, then it
will be incorrect in its Unicode incarnation as well.
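For illustration, here is the 'half-trip' choice in action. This is a sketch using Python's shift_jis codec as a stand-in for any context-free transcoder (the codec and the sample strings are my own, not from the original post); most such transcoders commit byte 0x5C to U+005C backslash:

```python
# Byte 0x5C is the ambiguous code point: JIS X 0201 defines it as the
# yen sign, but in practice it also serves as the path separator.
# A context-free transcoder must commit to one Unicode target;
# Python's shift_jis codec commits to U+005C (backslash).
raw = b"C:\x5cdata\x5cprices.txt"   # a Windows-style path
print(raw.decode("shift_jis"))      # C:\data\prices.txt -- fine for 'file open'

price = b"\x5c1000"                 # the same byte used as a currency marker
print(price.decode("shift_jis"))    # \1000 -- wrong: should read as yen 1000
```

One fixed mapping serves the path case and breaks the currency case, or vice versa; no single choice serves both.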

If you round-trip it, it's probably OK, because both Unicode code points can get
translated back to the single, ambiguous code point, and the text is then
processed by an API that knows it's dealing with this situation and can use
its context sensitivity to do the right thing (i.e. the file open knows
what that ambiguous code point means in that situation).

It's all due to a psycho encoding design, I guess, which could mostly be
dealt with when the code handling it was specific to that locale and
was working in the original encoding. But once you move to a
Unicode world, and you have to make a choice between the two Unicode code
points to transcode to, it gets weird, and I don't see how it could really
be made to work consistently, since no one is going to write entire
software systems that carry around context information with the text
wherever it goes.

If some of you folks who deal with these encodings think I'm just confused,
please say so. But this is the best we can figure out with these types of
encodings.

Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ or CD-ROM/ISBN 981-02-3594-1
Please note: New list subscriptions and unsubscriptions
are  now ***CLOSED*** in preparation for list transfer to OASIS.

