Re: relax UTF-8 default?

  • From: Jim DeLaHunt <from.xml-dev@jdlh.com>
  • To: Dave Pawson <davep@dpawson.co.uk>, xml-dev@l...
  • Date: Wed, 15 Dec 2010 19:06:19 -0800

On Fri, 10 Dec 2010 10:43:32 +0000 Andrew Welch 
<andrew.j.welch@gmail.com> wrote:
>>  Yep - the "UTF-8/16 only" suggestion is to solve the problem of the
>>  potential mismatch between the encoding in the prolog and the actual
>>  encoding.. add to that the content-type when http is involved and you
>>  have 3 areas to look at to determine the encoding...

At 12:12 PM +0000 12/10/10, Dave Pawson wrote:
>What are the alternatives... if any?
>Some app to analyse/guess the encoding and propose changes/
>set the encoding? Is such a beast possible?

I think the experiment on analysing/guessing encodings has been 
conducted, in the form of HTML files on the public Web.  Similar to 
XML, the specification allows files to be stored in a wide range of 
encodings; similar to XML, there are in-band (and also out-of-band) 
ways to state the file's encoding.  Web browsers like Firefox and web 
crawlers like Google's have code to analyse and guess the encoding of 
the pages they encounter.

Good implementations to examine are:
* "Character Set Detection" feature of the Internationalization 
Classes for Unicode (ICU). 
* "Mozilla Charset Detectors" 
* There are others, easily found by a web search, but none that I saw 
struck me as more authoritative than these two.

My impression from being in the internationalization arena is that 
the history of encoding declarations has been fraught with error, 
especially the case of documents labelled as ISO 8859-1 which are 
actually encoded in Windows CP-1252.  Detection is difficult and 
equally fraught with error, especially for short documents.
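
To make the ISO 8859-1 versus CP-1252 mislabelling concrete, here is a 
small Python sketch. The sample octets are my own illustration, not 
taken from any real document. Both decodes succeed, which is exactly why 
the wrong label goes unnoticed until someone looks at the quotes.

```python
# Octets 0x93 and 0x94 are curly quotes in Windows CP-1252,
# but C1 control characters in ISO 8859-1.
data = b"\x93quoted\x94"

as_latin1 = data.decode("iso-8859-1")  # label says ISO 8859-1
as_cp1252 = data.decode("cp1252")      # what the author's tool really wrote

print(repr(as_latin1))
print(repr(as_cp1252))
```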

Consider this document
where "C2A0" stands for the two octets with values 0xC2 and 0xA0. 
Does C2A0 represent the UTF-8 sequence for U+00A0 "non-breaking space 
character", or the two Windows CP-1252 characters "A with circumflex" 
followed by "non-breaking space character"? The short document gives 
very little context for a detection algorithm to use.
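
The ambiguity is easy to reproduce. A minimal Python sketch, using only 
the two octets named above (the variable names are mine):

```python
# The two octets from the example: 0xC2 0xA0.
data = bytes([0xC2, 0xA0])

as_utf8 = data.decode("utf-8")      # one character
as_cp1252 = data.decode("cp1252")   # two characters

# Show the code points each interpretation yields.
print([f"U+{ord(c):04X}" for c in as_utf8])
print([f"U+{ord(c):04X}" for c in as_cp1252])
```

Neither decode raises an error, so validity alone cannot tell a detector 
which reading the author intended.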

The Unicode UTFs bypass all of these problems.  They can represent 
any character from the older code pages.  It is now reasonable to 
expect that authoring tools can save in UTF-8 or UTF-16{BE|LE}.  With 
UTFs as the only encoding option, there is no ambiguity, and it is 
straightforward to distinguish between octet streams containing UTF-8 
and UTF-16{BE|LE}.
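
As a rough sketch of how simple that distinction can be, assuming the 
stream is XML and that UTF-8 and UTF-16 are the only encodings allowed: 
the function name sniff_utf is hypothetical, and this is not taken from 
any particular parser. It checks for a byte order mark first, then falls 
back on the fact that an XML document begins with '<'.

```python
def sniff_utf(data: bytes) -> str:
    """Guess which UTF an XML octet stream uses (hypothetical helper).

    Assumes the only candidates are UTF-8 and UTF-16{BE|LE}.
    """
    # A byte order mark, when present, settles the question.
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    # No BOM: the first character is '<' (0x3C), so UTF-16 shows
    # a zero octet immediately before or after it.
    if data[:2] == b"\x00<":
        return "utf-16-be"
    if data[:2] == b"<\x00":
        return "utf-16-le"
    # Anything else is treated as UTF-8 under the stated assumption.
    return "utf-8"


print(sniff_utf("<a/>".encode("utf-8")))
print(sniff_utf("<a/>".encode("utf-16-be")))
print(sniff_utf("<a/>".encode("utf-16-le")))
```

A real parser would still validate the whole stream against the chosen 
encoding; the point is only that no statistical guessing is needed.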

There's a reason why more than half of the public web is 
Unicode-encoded. My opinion is that it would be wise for NextXML to 
require either UTF-8 or UTF-16 encoding, and offer no other choices. 
The spec will be simpler, and interchange will be more reliable.
     --Jim DeLaHunt, jdlh@jdlh.com     http://blog.jdlh.com/ (http://jdlh.com/)
       multilingual websites consultant

       157-2906 West Broadway, Vancouver BC V6K 2G8, Canada
          Canada mobile +1-604-376-8953
