Re: Where does a parser get the replacement text for a character reference?
* David Brownell
|
| I think Lars and I are agreeing ...

It does sound suspiciously like it, yes. No reason to be disappointed, though. I'm sure we can find something we do disagree on. :-)

* Lars Marius Garshol
|
| That is true, though one would assume that this would not necessarily
| be possible. If the character could be expressed in the local
| character encoding, why was it encoded with a character reference in
| the first place?

* David Brownell
|
| If the text were encoded in UTF-8 for interchange purposes, then any
| given local system might use different encodings ...

It might, and indeed I've written code to decode UTF-8 into local encodings several times. When doing this, however, one always runs the risk that there will be characters in the input that cannot be represented in the output.

| there must be some convention to establish agreement on what a given
| private-use character means. Presumably folk who work with systems
| using those characters could describe how they work. A few years
| back, I heard questions about how such conventions ought to be
| structured.

The Unicode standard does subdivide the private use area into different parts for different uses, but I don't know enough about this to say much more.

* Lars Marius Garshol
|
| No. Surrogate pairs are an artifact of the UTF-16 character encoding
| and conceptually they do not exist outside it.

* David Brownell
|
| More or less; the Unicode spec defines surrogates, and what pairing
| them means.

The definition of the UTF-16 encoding does, yes. Surrogates are not Unicode characters, however, and encoding a pair of them using UTF-8 or UTF-32 is not (AFAIR) legal, much less meaningful.

The recent UTF-8S proposal requires using surrogates instead of encoding code points directly, but this is controversial for several reasons, one of which is that it simply imports the problems of UTF-16 into UTF-8, which previously did not have them.
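The risk described above is easy to demonstrate. The following is a minimal sketch (class name, charset choice, and messages are mine, not from the thread): transcoding a decoded UTF-8 string into a narrower local encoding such as ISO-8859-1 fails when the text contains characters the target charset cannot represent.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;

public class Transcode {
    public static void main(String[] args) {
        // U+10416 (DESERET CAPITAL LETTER JEE) as a Java surrogate pair
        String text = "Deseret \uD801\uDC16";
        CharsetEncoder encoder = Charset.forName("ISO-8859-1").newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap(text));
            System.out.println("encoded cleanly");
        } catch (CharacterCodingException e) {
            // U+10416 has no ISO-8859-1 representation, so we end up here
            System.out.println("unrepresentable in ISO-8859-1");
        }
    }
}
```

With `CodingErrorAction.REPLACE` instead of `REPORT`, the encoder would silently substitute a replacement byte, which is the other common way local systems cope with this.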
* David Brownell
|
| But equating Unicode with UTF-16, to match common usage (and clearly
| not wearing my pedantic hat :) that point is not going to be
| understood very widely, because ...

* Lars Marius Garshol
|
| In other words &#x10416; does not refer to a surrogate pair; it
| refers to U+10416, DESERET CAPITAL LETTER JEE.

* David Brownell
|
| ... that is _represented_ as a "surrogate pair" in Java and many other
| programming environments: two Java "char" values are needed to
| represent a single (up one level) "character".

I agree that most people thoroughly confuse UTF-16, UCS-2 and Unicode, and I think that dates from the time when the Unicode people themselves did not distinguish between the encodings and the character set. Probably the lack of a need for such a distinction when working with Western encodings has contributed to the problem.

This is the very reason I responded to your message, though, since I think that confusion needs to be corrected.

* Lars Marius Garshol
|
| They can be represented in UTF-16 as one or two 16-byte units, but
| UTF-16 and Unicode are not the same. Unicode is the character set,
| UTF-16 is one of its (too) many encodings.

* David Brownell
|
| But a "char"acter in Java (or wchar_t on Win32) is a 16-bit (not
| byte :) unit, hence the semantic confusion when you talk about a
| "character".

It is a source of confusion, I agree, and all the more reason to clear it up. :-)

--Lars M.
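The distinction above between one Unicode character and two Java "char" values can be shown directly. A minimal sketch (class and variable names are mine): U+10416 is a single code point, but Java's 16-bit char type forces it into a surrogate pair.

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        int codePoint = 0x10416;  // DESERET CAPITAL LETTER JEE
        String s = new String(Character.toChars(codePoint));

        System.out.println(s.length());                      // 2 "char" units
        System.out.println(s.codePointCount(0, s.length())); // 1 character
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
    }
}
```

So `String.length()` counts UTF-16 code units, not characters; code-point-aware methods like `codePointCount` and `codePointAt` are needed to talk about characters "up one level".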