Re: Non-Unicode Character Sets
>I am told that conversion of some character sets through Unicode is
>lossy and cannot be round-tripped. But it occurs to me that as long as
>one has the private use area, "unknown" characters can always be
>preserved. If a particular mapping loses information, isn't that more a
>weakness in the mapping than in Unicode itself? Are there some
>standardized national character sets with so many non-Unicode characters
>that they cannot fit into the PUA? Even with planes 15 and 16?

I don't know the answer to that, but just as a related aside: in some cases the problem isn't round-tripping, it's 'half-tripping', due to the weird design of the encoding. For instance, we have had some problems with some Japanese and Korean encodings because of the ambiguity between the backslash and the Yen sign. When you transcode that code point to Unicode, you have to know the context of the text being transcoded in order to know which translation is the correct one.

If you transcode it to the Yen sign and then turn around and pass that text to, say, a 'file open' Unicode API on a system that is inherently Unicode-enabled, it breaks, because the Unicode Yen sign probably isn't a legal path separator on that platform. If you transcode it to backslash and the text was a monetary value, then it will be incorrect in its Unicode incarnation as well.

If you round-trip it, it's probably OK, because both Unicode code points can be translated back to the single, ambiguous code point, and the text is then processed by an API that knows it's dealing with this situation and can use its context sensitivity to do the right thing (i.e. the file open knows what that ambiguous code point means in that situation). It's all due to a psycho encoding design, I guess, which could mostly be dealt with when the code handling it was specific to that locale and was working with the text in the original encoding.
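To make the 'half-tripping' problem concrete, here is a minimal Python sketch. It is a toy model, not a real codec: the names `decode_legacy` and `encode_legacy` and the mapping itself are hypothetical, it handles only the single-byte ASCII range, and it assumes the ambiguous code point is byte 0x5C. It only illustrates that the round trip survives either choice while each one-way choice breaks one kind of text.

```python
# Toy model only: a hypothetical legacy encoding in which byte 0x5C may mean
# either a backslash or a Yen sign. Real codecs and their tables differ.
AMBIGUOUS_BYTE = 0x5C
BACKSLASH = "\\"      # U+005C REVERSE SOLIDUS
YEN = "\u00a5"        # U+00A5 YEN SIGN


def decode_legacy(data: bytes, ambiguous_as: str) -> str:
    """Decode to Unicode; the caller must pick a meaning for byte 0x5C."""
    out = []
    for b in data:
        if b > 0x7F:
            raise ValueError("toy model covers the single-byte ASCII range only")
        out.append(ambiguous_as if b == AMBIGUOUS_BYTE else chr(b))
    return "".join(out)


def encode_legacy(text: str) -> bytes:
    """Encode back; both Unicode code points collapse onto byte 0x5C."""
    out = bytearray()
    for ch in text:
        if ch in (BACKSLASH, YEN):
            out.append(AMBIGUOUS_BYTE)
        elif ord(ch) <= 0x7F:
            out.append(ord(ch))
        else:
            raise ValueError("toy model covers the single-byte ASCII range only")
    return bytes(out)


path_bytes = b"C:\\docs\\a.txt"   # here 0x5C is meant as a path separator
price_bytes = b"\\1000"           # here 0x5C is meant as a currency sign

# Round trip: whichever choice is made, the original bytes come back.
for choice in (BACKSLASH, YEN):
    assert encode_legacy(decode_legacy(path_bytes, choice)) == path_bytes
    assert encode_legacy(decode_legacy(price_bytes, choice)) == price_bytes

# Half trip: the Unicode text is used directly, so the choice matters.
path_as_yen = decode_legacy(path_bytes, YEN)
print(len(path_as_yen.split(BACKSLASH)))     # 1: no path separators are left

price_as_backslash = decode_legacy(price_bytes, BACKSLASH)
print(price_as_backslash.startswith(YEN))    # False: the currency sign is gone
```

The decoder takes the intended meaning as an explicit argument precisely because the bytes alone do not carry it; that argument stands in for the context information the text would have to travel with.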
But once you move to a Unicode world and have to make a choice between the two Unicode code points to transcode to, it gets weird, and I don't see how it could really be made to work consistently, since no one is going to write entire software systems that carry context information around with the text wherever it goes.

If some of you folks who deal with these encodings think I'm just confused, please say so. But this is the best we can figure out with these types of encodings.

----------------------------------------
Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley
roddey@u...

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ or CD-ROM/ISBN 981-02-3594-1
Please note: New list subscriptions and unsubscriptions are now ***CLOSED*** in preparation for list transfer to OASIS.