RE: UTF-8+names

To: "'Tim Bray'" <tbray@t...>
Subject: RE: UTF-8+names
From: "Alessandro Triglia" <sandro@m...>
Date: Sun, 19 Oct 2003 04:26:05 -0400
Cc: <xml-dev@l...>
Importance: Normal
In-reply-to: <3F91FAD6.8090601@t...>

> -----Original Message-----
> From: Tim Bray [mailto:tbray@t...] 
> Sent: Saturday, October 18, 2003 22:46
> To: Simon St.Laurent
> Cc: xml-dev@l...
> Subject: Re:  UTF-8+names
> 
> 
> Simon St.Laurent wrote:
> 
> >>Of course it's cunningly designed to look like an architectural 
> >>change, that allows such syntax as: <&eacute;/>
> 
> Yow.  I hadn't thought of that.  (Hmm, somehow I missed David's
> message; xml-dev acting up again?)
> 
> > That is therefore an enormous processing model change.  This is way 
> > beyond surrogates. The potential for further disruption on this 
> > precedent seems downright boundless.
> 
> Hmm, it's just an idiotically simple filter that replaces a bunch of 
> hardwired patterns with hardwired Unicode code points.  Hardly feels 
> like a processing model change.

I have another problem with it.

In UTF-8 and UTF-16, there is a single bit pattern for each Unicode
character.  UTF-8+names introduces a kind of non-canonicality in the
encoding itself, which concerns me a little.

There are two cases:

1) a character such as NON-BREAK SPACE can be encoded in two different
ways, either as in UTF-8 (0xC2 0xA0), or as the replacement
0x26 0x6E 0x62 0x73 0x70 0x3B (the bytes of "&nbsp;")

2) AMPERSAND can be encoded in two different ways, either as in UTF-8
(0x26), or as the replacement 0x26 0x26 0x3B (the bytes of "&&;")

While (1) holds for every character that has a replacement defined for it
(other than AMPERSAND), (2) holds if and only if the AMPERSAND is NOT
followed by certain characters and then by a SEMICOLON such that the entire
sequence is the same as one of the defined replacements.  In that situation
the plain UTF-8 form would be read as the replacement, so the AMPERSAND has
only one valid encoding, 0x26 0x26 0x3B.

This lack of canonicality in the encoding implies that a conversion from
UTF-8 (or UTF-16) to UTF-8+names need not produce the same result for the
same input: two encoders can emit different byte sequences for the same
string of characters, and both are correct.
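
To make the two cases concrete, here is a minimal sketch of a decoder in
Python.  It is not taken from the UTF-8+names proposal: the table NAMES is a
hypothetical two-entry excerpt, and both the assumption that names consist of
ASCII letters and digits and the rule that undefined names pass through
unchanged are mine.

```python
import re

# Hypothetical excerpt of the replacement table (assumption: the real
# UTF-8+names table is not reproduced in this message).
NAMES = {b"nbsp": "\u00a0", b"eacute": "\u00e9"}

# Assumption: a replacement is "&", then either "&" or an ASCII
# alphanumeric name, then ";".
_REPLACEMENT = re.compile(rb"&(&|[A-Za-z0-9]+);")

def decode_utf8_names(data: bytes) -> str:
    """Resolve replacements, decode everything else as plain UTF-8."""
    out = []
    pos = 0
    for m in _REPLACEMENT.finditer(data):
        name = m.group(1)
        if name == b"&":
            char = "&"
        elif name in NAMES:
            char = NAMES[name]
        else:
            continue  # not a defined replacement: leave the bytes alone
        out.append(data[pos:m.start()].decode("utf-8"))
        out.append(char)
        pos = m.end()
    out.append(data[pos:].decode("utf-8"))
    return "".join(out)

# Case 1: two byte sequences, one NON-BREAK SPACE.
assert decode_utf8_names(b"\xc2\xa0") == decode_utf8_names(b"\x26\x6e\x62\x73\x70\x3b")
# Case 2: two byte sequences, one AMPERSAND.
assert decode_utf8_names(b"\x26") == decode_utf8_names(b"\x26\x26\x3b")
```

Under these assumptions the five literal characters of "&nbsp;" can only be
written as the bytes of "&&;nbsp;", which is the exception described in (2).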

Also, I wonder about current XML tools.  If a program uses an internal
representation of Unicode characters, how should it generate a UTF-8+names
encoding?  Unlike the characters that make up entity references and numeric
character references (which are individual Unicode characters), the *bytes*
that make up the replacement names of UTF-8+names are not individual Unicode
characters and so don't have a representation as such.

If you view an XML document as a string of Unicode characters, the entity
references and numeric character references are there, but the UTF-8+names
replacements are not there (they are resolved on decoding and are generated
on encoding).
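
A short Python sketch of the encoding side may help show the difficulty.
Everything in it is an assumption made for illustration: the table
NAMES_BY_CHAR is a hypothetical excerpt, the flag use_names is an invented
policy switch, and always writing AMPERSAND as "&&;" is just one conservative
choice.

```python
# Hypothetical excerpt of the replacement table, keyed by character
# (assumption: not the real UTF-8+names table).
NAMES_BY_CHAR = {"\u00a0": b"nbsp", "\u00e9": b"eacute"}

def encode_utf8_names(text: str, use_names: bool) -> bytes:
    """Encode a string of Unicode characters as UTF-8+names bytes.

    The string itself carries no hint about which characters should be
    written as replacements, so the caller has to supply a policy.
    """
    out = bytearray()
    for ch in text:
        if ch == "&":
            out += b"&&;"  # always escaped here; a conservative choice
        elif use_names and ch in NAMES_BY_CHAR:
            out += b"&" + NAMES_BY_CHAR[ch] + b";"
        else:
            out += ch.encode("utf-8")
    return bytes(out)

text = "caf\u00e9 & co"
plain = encode_utf8_names(text, use_names=False)  # b"caf\xc3\xa9 &&; co"
named = encode_utf8_names(text, use_names=True)   # b"caf&eacute; &&; co"
assert plain != named  # same characters, two different byte streams
```

The replacement bytes exist only in the output.  Nothing in the in-memory
string records which of the two forms a document originally used, so a tool
that reads and rewrites a document cannot preserve that choice without extra
bookkeeping.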

I think the introduction of UTF-8+names would be a *big* change indeed, with
a serious impact on existing XML tools and (some) applications.

Alessandro

> 
> > I wrote a piece on XML as a disruptive technology a few years ago
> > [1], but I can't say I expected XML to drill into the Unicode layer
> > and modify the very notion of a character encoding.
> 
> UTF-8+names doesn't depend on XML, I can think of other applications
> for it.  Anyhow Unicode character encodings in widespread use have
> been cooked up by ANSI, ISO, JIS, and even Bell Labs (that's where
> UTF-8 came from).  The notion of inventing a new encoding to better
> serve application needs is hardly radical.  The bar to entry is that
> you have to have a clear and transparent mapping to Unicode code
> points, which UTF-8+names does.
> 
> -- 
> Cheers, Tim Bray (http://www.tbray.org/ongoing/)
> 
> 
> 
> -----------------------------------------------------------------
> The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> initiative of OASIS <http://www.oasis-open.org>

The list archives are at http://lists.xml.org/archives/xml-dev/

To subscribe or unsubscribe from this list use the subscription
manager: <http://lists.xml.org/ob/adm.pl>
