Unicode confusion

  • From: roddey@u...
  • To: xml-dev@i...
  • Date: Mon, 10 Jan 2000 12:44:25 -0700

>> If anything, it should go the other way. Unicode should be the core
>> API, and there should be helper API to allow the use of local code
>> page chars where necessary. Everything should be set up to optimize
>> use of the Unicode API, with local code page use paying the price,
>> since Unicode is the more desirable format.
>No one's disagreeing with the use of Unicode; we're talking about
>which character encoding we'll use to represent it.  You can represent
>Unicode in variable-width 8-bit or 16-bit encodings or in fixed-width
>32-bit encodings.
>Note that Java uses UTF-16, which isn't quite fixed-width, though no
>one really notices.

Our parser already adapts to whether the native wchar_t is 16 or 32 bits,
though it still uses surrogates and stores 16-bit code units in the 32-bit
values when it's 32 bits. However, it could also pretty reasonably adapt to
not using surrogates if the local wchar_t is 32 bits. I guess it comes down
to whatever the local system's wide character APIs expect. If they expect
32-bit values without surrogates, then it would be kind of necessary to give
them that. If they expect 16-bit code units with surrogates, regardless of
the fact that the wchar_t is 32 bits, then it would be best to give them
that.
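
To make the two layouts concrete, here is a rough sketch of going from
"16-bit code units with surrogates, each sitting in a 32-bit slot" to "one
full code point per 32-bit slot". The name collapseSurrogates is made up
for illustration and is not part of Xerces/XML4C; unpaired surrogates are
simply passed through, which a real transcoder might well treat as an error.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a Xerces/XML4C API): takes UTF-16 code units,
// each stored in a 32-bit value, and returns one code point per element.
std::vector<std::uint32_t> collapseSurrogates(const std::vector<std::uint32_t>& units)
{
    std::vector<std::uint32_t> out;
    out.reserve(units.size());
    for (std::size_t i = 0; i < units.size(); ++i) {
        const std::uint32_t hi = units[i];
        const bool isHigh = (hi >= 0xD800 && hi <= 0xDBFF);
        if (isHigh && i + 1 < units.size()) {
            const std::uint32_t lo = units[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                // Combine the pair into a single supplementary code point.
                out.push_back(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
                ++i; // the low surrogate has been consumed
                continue;
            }
        }
        // BMP character, or an unpaired surrogate passed through unchanged.
        out.push_back(hi);
    }
    return out;
}
```

Going the other way (splitting code points above 0xFFFF back into pairs) is
the mirror image, so supporting either local convention is mostly a matter
of which form the transcoder hands back.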

Going this far would require some support in parsers that might not be
common, but I think that we could do that reasonably in the Xerces/XML4C
stuff without too much pulling out of hair or added complexity. The
internalization of text into the local format is pretty constrained. The
big issue, though, is that you are kind of dependent upon what transcoding
package you use. For those encodings that we handle intrinsically, we could
do this well enough. But we allow each platform to use its own transcoding
mechanism if it chooses to, and each is probably going to support one
scheme or the other. Hopefully it would support the local scheme, but you
could also choose to use some portable package such as ICU, which is going
to do one thing.

So, perhaps the question is: Are there any systems out there which use 32
bit wchar_t *and* expect that surrogates will not be used?
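
Part of why this has to be asked rather than looked up: the width of
wchar_t is easy to detect at compile time, but the width alone says nothing
about whether the local wide character APIs expect surrogates. A minimal
sketch, with made-up names (kWideCharBits, wideCharCanHoldAnyCodePoint):

```cpp
#include <climits>
#include <cstddef>

// Width of the native wchar_t in bits, known at compile time.
constexpr std::size_t kWideCharBits = sizeof(wchar_t) * CHAR_BIT;

// Unicode code points go up to 0x10FFFF, which needs 21 bits. A true result
// only means wchar_t is wide enough to hold one; it does NOT tell us whether
// the platform's APIs actually expect surrogate-free data.
constexpr bool wideCharCanHoldAnyCodePoint()
{
    return kWideCharBits >= 21;
}
```

So the surrogate convention would still have to be a per-platform setting
rather than something derived from the type size.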

Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley


