Politics, and UTF-8+names considered harmful for text

To: XML Dev <xml-dev@l...>
Subject: Politics, and UTF-8+names considered harmful for text
From: Rick Jelliffe <ricko@a...>
Date: Tue, 21 Oct 2003 13:25:05 +1000
In-reply-to: <3F9417CA.9040902@t...>
References: <3F917A29.1000708@t...> <3f96f447.23635425@s...> <3F9417CA.9040902@t...>
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.3.1) Gecko/20030428

Tim Bray wrote:

> Only because such a revision is not politically viable.  The only 
> advantage of the +names approach is that it doesn't touch XML. 

But because this is a new encoding (and there have been no successful
new encodings for years AFAIK), it will take at best about 3-5 years to
have deployment as part of standard distributions such as Java etc,
depending on the attitude of the vendors, and vendors such as MS and Sun
will probably see it as a waste of time that does not fit in with their
Unicode strategy and tools.

So the only likely implementation route is for parser writers to add it
(or for implementers to add it to entity management) on a
product-by-product basis. But if you have a majority of parser vendors
supporting it as an XML add-on, you already have the quorum for getting
an XML revision.

So arguments for it on the basis of realistic pragmatism don't make any
sense to me.

Adding together the W3C HTML/XHTML people + the W3C Schema people
+ the MathML people + the XSLT people (all of whom have languages that
are being held back by named character references being tied to DTDs)
+ the I18n WG gives a group that is hardly without political clout in
the W3C.  This is a very different issue to the Unicode upgrade issue
of 1.1.

Furthermore, adopting XML's entity or NCR mechanism without also adopting
a header mechanism for non-XML uses, to allow in-band signalling that
that encoding is currently in use, is positively damaging, because it
creates a dialect of UTF-8 that can only be detected by someone who
already knows that the data may be using this convention, checking to see
whether it has things that look like delimiters and judging that they
are being used as delimiters.
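
To illustrate the ambiguity, here is a minimal Python sketch. It is not
taken from the UTF-8+names proposal: the function names are hypothetical,
and html.unescape (which expands HTML entity names) is only a stand-in
for whatever name table such an encoding would define. The point is that
the same bytes are valid UTF-8 under both readings, and nothing in the
bytes says which reading was intended.

```python
import html


def decode_plain(data: bytes) -> str:
    """Read the bytes as ordinary UTF-8; '&' and ';' are just characters."""
    return data.decode("utf-8")


def decode_with_names(data: bytes) -> str:
    """Read the bytes as UTF-8 and then expand named references.

    Assumption: html.unescape stands in for the name-expansion step of a
    hypothetical UTF-8+names decoder.
    """
    return html.unescape(data.decode("utf-8"))


# A sentence that talks *about* a reference rather than using one.
sample = "Write &eacute; to get an e-acute".encode("utf-8")

plain = decode_plain(sample)
named = decode_with_names(sample)

# Both decodes succeed, but they disagree about what the text says.
print(plain)
print(named)
print(plain == named)
```

Without a header or other out-of-band label, a receiver has to guess
which of the two decoders to apply.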

At the moment, life is simple: you can look at the byte patterns in a
file and know that it is UTF-8: there is very little chance of a
misdiagnosis because no other encoding really has the same modified
Huffman signature. I don't know why on earth we would want to put
ourselves in the same kind of position as the Japanese have with text:
they have a couple of alternate mappings in some vendors' versions of
various encodings, which add complication.[1]  Why would we want to get
into a similar situation?
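
As a rough sketch of the byte-pattern check described above (the
function name is hypothetical, and this is one simple way to do it, not
a definitive detector): strict UTF-8 decoding rejects most text written
in other encodings, because their high bytes rarely form valid UTF-8
lead and continuation sequences.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 with at least one
    multi-byte sequence.

    Pure ASCII is reported as False: it is valid UTF-8, but it carries
    no evidence that distinguishes UTF-8 from other ASCII-compatible
    encodings.
    """
    try:
        data.decode("utf-8")  # strict decoding raises on malformed bytes
    except UnicodeDecodeError:
        return False
    return any(b >= 0x80 for b in data)


utf8_text = "caf\u00e9".encode("utf-8")
latin1_text = "caf\u00e9".encode("latin-1")
sjis_text = "\u65e5\u672c\u8a9e".encode("shift_jis")

print(looks_like_utf8(utf8_text))
print(looks_like_utf8(latin1_text))
print(looks_like_utf8(sjis_text))
print(looks_like_utf8(b"plain ascii"))
```

A check like this says nothing about whether delimiter-like sequences
inside the text are meant as named references; that information is not
in the byte patterns.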

Cheers
Rick Jelliffe

[1] http://www.w3.org/TR/2000/NOTE-japanese-xml-20000414/
