RE: UTF-8+names
> -----Original Message-----
> From: Tim Bray [mailto:tbray@t...]
> Sent: Saturday, October 18, 2003 22:46
> To: Simon St.Laurent
> Cc: xml-dev@l...
> Subject: Re: UTF-8+names
>
> Simon St.Laurent wrote:
>
> >>Of course it's cunningly designed to look like an architectural
> >>change, that allows such syntax as: <é/>
>
> Yow. I hadn't thought of that. (Hmm, somehow I missed David's message;
> xml-dev acting up again?)
>
> > That is therefore an enormous processing model change. This is way
> > beyond surrogates. The potential for further disruption on this
> > precedent seems downright boundless.
>
> Hmm, it's just an idiotically simple filter that replaces a bunch of
> hardwired patterns with hardwired Unicode code points. Hardly feels
> like a processing model change.

I have another problem with it. In UTF-8 and UTF-16, there is a single bit pattern for each Unicode character. UTF-8+names introduces a kind of non-canonicality in the encoding itself, which concerns me a little. There are two cases:

1) a character such as NO-BREAK SPACE can be encoded in two different ways, either as in UTF-8, or as the replacement 0x26 0x6E 0x62 0x73 0x70 0x3B

2) AMPERSAND can be encoded in two different ways, either as in UTF-8, or as the replacement 0x26 0x26 0x3B

While (1) is always true for all characters that have a replacement defined for them (except AMPERSAND), (2) is true if and only if the AMPERSAND is NOT followed by certain characters and then by a SEMICOLON, the entire sequence being the same as one of the defined replacements.

This lack of canonicality in the encoding implies that a conversion from UTF-8 (or UTF-16) to UTF-8+names does not always produce the same result for the same input.

Also, I wonder about current XML tools. If a program uses an internal representation of Unicode characters, how should it generate a UTF-8+names encoding?
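
To make cases (1) and (2) concrete, here is a minimal Python sketch of such a decoding filter. The two-entry table and the name `decode_utf8_names` are assumptions made purely for illustration; they are not taken from any published definition of UTF-8+names, which would presumably define many more replacements.

```python
# Hypothetical replacement table (illustration only): byte pattern -> character.
REPLACEMENTS = {
    b"&nbsp;": "\u00a0",  # 0x26 0x6E 0x62 0x73 0x70 0x3B -> NO-BREAK SPACE
    b"&&;": "&",          # 0x26 0x26 0x3B -> AMPERSAND
}

def decode_utf8_names(data: bytes) -> str:
    """Replace each known byte pattern with the UTF-8 bytes of its
    character, then decode the result as ordinary UTF-8."""
    out = []
    i = 0
    while i < len(data):
        for pattern, char in REPLACEMENTS.items():
            if data.startswith(pattern, i):
                out.append(char.encode("utf-8"))
                i += len(pattern)
                break
        else:
            # No replacement starts here: copy the byte through unchanged.
            out.append(data[i:i + 1])
            i += 1
    return b"".join(out).decode("utf-8")

# Case (1): two different byte sequences, one character.
plain_nbsp = b"a\xc2\xa0b"   # NO-BREAK SPACE as in UTF-8
named_nbsp = b"a&nbsp;b"     # NO-BREAK SPACE as a replacement
print(decode_utf8_names(plain_nbsp) == decode_utf8_names(named_nbsp))

# Case (2): a lone AMPERSAND may be written either way...
print(decode_utf8_names(b"a&b") == decode_utf8_names(b"a&&;b"))

# ...but the literal text "&nbsp;" must escape its AMPERSAND,
# otherwise it collides with the defined replacement.
print(decode_utf8_names(b"&&;nbsp;"))
```

Under these assumptions an encoder has a free choice each time it meets a NO-BREAK SPACE or a lone AMPERSAND, so two conforming encoders can emit different bytes for the same character string.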
Unlike the characters that make up entity references and numeric character references (which are individual Unicode characters), the *bytes* that make up the replacement names of UTF-8+names are not individual Unicode characters and so don't have a representation as such. If you view an XML document as a string of Unicode characters, the entity references and numeric character references are there, but the UTF-8+names replacements are not there (they are resolved on decoding and are generated on encoding).

I think the introduction of UTF-8+names would be a *big* change indeed, with a serious impact on existing XML tools and (some) applications.

Alessandro

> > I wrote a piece on XML as a disruptive technology a few years ago [1],
> > but I can't say I expected XML to drill into the Unicode layer and
> > modify the very notion of a character encoding.
>
> UTF-8+names doesn't depend on XML, I can think of other applications for
> it. Anyhow Unicode character encodings in widespread use have been
> cooked up by ANSI, ISO, JIS, and even Bell Labs (that's where UTF-8 came
> from). The notion of inventing a new encoding to better serve
> application needs is hardly radical. The bar to entry is that you have
> to have a clear and transparent mapping to Unicode code points, which
> UTF-8+names does.
>
> --
> Cheers, Tim Bray (http://www.tbray.org/ongoing/)

-----------------------------------------------------------------
The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
initiative of OASIS <http://www.oasis-open.org>

The list archives are at http://lists.xml.org/archives/xml-dev/

To subscribe or unsubscribe from this list use the subscription
manager: <http://lists.xml.org/ob/adm.pl>