
Re: Where does a parser get the replacement text for a characterreferenc

  • From: David Brownell <david-b@p...>
  • To: xml-dev <xml-dev@l...>
  • Date: Wed, 04 Jul 2001 19:39:14 -0700

I think Lars and I are agreeing ... at which point this thread can become
a digression about "private use" characters and other ways that Unicode
wants to extend itself, or perhaps a discussion about how confusing the
word "character" can be.


> | Using the original U+E311 private-use character as an example, it
> | could be natural to have some component transcode it to the local
> | character set.  That may be preferred for Klingon, or for other
> | characters that don't have code points in Unicode.
> 
> That is true, though one would assume that this would not necessarily
> be possible. If the character could be expressed in the local
> character encoding, why was it encoded with a character reference in
> the first place?

If the text were encoded in UTF-8 for interchange purposes, then any
given local system might use different encodings ... there must be some
convention to establish agreement on what a given private-use character
means.  Presumably folk who work with systems using those characters
could describe how they work.  A few years back, I heard questions
about how such conventions ought to be structured.
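For anyone following along in Java, here is a minimal sketch of the two checks implied above: whether a code point is private-use at all, and whether some local encoding can carry it. The class and method names are made up for illustration, and ISO-8859-1 is only a stand-in for "the local character set".

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not from the thread: names are illustrative only.
class PrivateUseCheck {
    // True if Unicode assigns no meaning to this code point (general category Co).
    static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    // True if the character can be written directly in ISO-8859-1,
    // used here as an assumed example of a "local" encoding.
    static boolean fitsLatin1(char c) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char c = (char) 0xE311;
        System.out.println(isPrivateUse(c)); // true
        System.out.println(fitsLatin1(c));   // false
    }
}
```

Neither check says what U+E311 _means_; that still has to come from an agreement outside Unicode.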



> * Lars Marius Garshol
> |
> | Character references always refer to Unicode characters.
>  
> * David Brownell
> |
> | Or surrogate pairs
> 
> No. Surrogate pairs are an artifact of the UTF-16 character encoding
> and conceptually they do not exist outside it.

More or less; the Unicode spec defines surrogates, and what pairing them
means.  But since common usage equates Unicode with UTF-16 (and I was
clearly not wearing my pedantic hat :) that point is not going to be
understood very widely, because ...

>     In other words
> &#x10416; does not refer to a surrogate pair; it refers to U+10416,
> DESERET CAPITAL LETTER JEE.

... that is _represented_ as a "surrogate pair" in Java and many other
programming environments:  two Java "char" values are needed to
represent a single (up one level) "character".
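To make the two levels concrete, a small sketch in Java (the class and method names are illustrative, not from any library): the one character U+10416 occupies two "char" values in a String, and the pair can be computed by hand with the UTF-16 arithmetic.

```java
// Hypothetical demo class; only java.lang.Character and String calls are real API.
class SurrogateDemo {
    // Number of 16-bit Java "char" units needed for one code point.
    static int utf16Units(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    // Manual UTF-16 split; assumes codePoint is above U+FFFF.
    static char[] split(int codePoint) {
        int offset = codePoint - 0x10000;
        char high = (char) (0xD800 + (offset >> 10));   // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int jee = 0x10416; // DESERET CAPITAL LETTER JEE
        String s = new String(Character.toChars(jee));
        System.out.println(s.length());                      // 2 "char" values
        System.out.println(s.codePointCount(0, s.length())); // 1 character
        char[] pair = split(jee);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D801 DC16
    }
}
```

So the character reference names U+10416, and the pair D801 DC16 only appears once something stores it as UTF-16.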


> | -- they refer to ISO-10646 characters, which can be represented in
> | Unicode as one or two 16-byte units.  

("they" being expanded character refs ... there are 10646 code points
that can't be represented in UTF-16, such as those using 5 and 6 byte
UTF-8 encodings ...)
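A rough sketch of where that limit comes from, again in Java with made-up names: the largest surrogate pair decodes to U+10FFFF, while the original UTF-8 definition's 5 and 6 byte forms cover values from 0x200000 up to 0x7FFFFFFF, which no pair can reach.

```java
// Hypothetical demo class; names are illustrative only.
class Utf16Range {
    // Largest value a surrogate pair can express: highest high + highest low.
    static int maxPairValue() {
        return Character.toCodePoint((char) 0xDBFF, (char) 0xDFFF);
    }

    // True if the value lies in 0..0x10FFFF, the range UTF-16 can address.
    static boolean inUtf16Range(int value) {
        return Character.isValidCodePoint(value);
    }

    public static void main(String[] args) {
        System.out.printf("%X%n", maxPairValue());   // 10FFFF
        System.out.println(inUtf16Range(0x10416));   // true
        System.out.println(inUtf16Range(0x200000));  // false: first 5-byte UTF-8 value
    }
}
```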


> They can be represented in UTF-16 as one or two 16-byte units, but
> UTF-16 and Unicode are not the same. Unicode is the character set,
> UTF-16 is one of its (too) many encodings.

But a "char"acter in Java (or wchar_t on Win32) is a 16-bit (not byte :)
unit, hence the semantic confusion when you talk about a "character".
And it doesn't stop there ... :)

- Dave

