
Re: [Summary] Why is Encoding Metadata (e.g. encoding="UTF

  • From: "Rick Jelliffe" <rjelliffe@a...>
  • To: "Philippe Poulard" <philippe.poulard@s...>
  • Date: Fri, 21 Sep 2007 11:30:08 +1000 (EST)

Philippe Poulard said:
>
> I guess some parsers have additional heuristics for reading successfully
> the sequence <?xml encoding="blah-blah"?> ; maybe some try-catch to
> apply with the set of charset they know ?

I hope they don't, unless they are specific tools for repairing broken
documents.

Guessing encoding is the *opposite* of the XML approach and should be
strongly resisted. The XML approach is based on explicit labeling as the
only approach that is reliable (which is not the same as not-stuff-up-able
of course).

There are many problems with guessing:

 * most platforms provide hundreds of character sets
 * most character sets belong to families which are ASCII or EBCDIC
supersets, so there is not enough redundant (in the engineering-theoretic
sense) information or orthogonality to know which specific set is
actually being used
 * most transcoders don't actually generate exceptions when an unknown
byte sequence is found: older ones just ignore the sequence, others
replace it with "?" or some other character, and some more recent
transcoders are a little better, so you cannot know whether the decoding
really succeeded
 * detecting encoding from statistical patterns in the text relies on the
document conforming, to a certain extent, to the corpus the statistics
were drawn from, and may even be skewed by the use of native language
markup
 * guessing prevents error detection
 * guessing can corrupt the database
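
To make the second and third points concrete, here is a minimal Python sketch. The sample name and the list of candidate encodings are my own choices for illustration, not anything taken from a real parser. It shows that the same bytes decode "successfully" under several ASCII supersets, and that a lenient decoder hides the failure instead of reporting it.

```python
# Bytes for a Hungarian name as written by a system that uses ISO-8859-2.
data = "Erdős".encode("iso-8859-2")

# Several single-byte ASCII supersets accept these bytes without complaint.
# Each gives a plausible-looking string, so "it decoded" proves nothing.
candidates = ["iso-8859-1", "iso-8859-2", "iso-8859-15", "cp1252"]
decoded = {name: data.decode(name) for name in candidates}

# A strict decoder reports the problem when the bytes are not valid UTF-8 ...
try:
    data.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# ... but lenient modes silently substitute or drop the offending byte.
replaced = data.decode("utf-8", errors="replace")  # U+FFFD in place of the byte
ignored = data.decode("utf-8", errors="ignore")    # the byte simply disappears

for name, text in decoded.items():
    print(name, text)
print(strict_failed, replaced, ignored)
```

Only the ISO-8859-2 result is the name that was written; the other three candidates quietly turn the "ő" into "õ", and nothing in the bytes themselves says which reading is right.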

So the XML system is then based on solving the problem "How do we read
that label reliably?"  The UTF-8 default is just low-hanging fruit,
because it also accepts ISO646-US (ASCII), but again it is not in any
sense guessed.
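
As a rough illustration of reading the label rather than guessing, the sketch below checks for a byte order mark, then for the encoding pseudo-attribute in the XML declaration, and otherwise falls back to the UTF-8 default. It is a deliberate simplification and not a conforming implementation: the helper name `declared_encoding` is hypothetical, and it ignores UTF-32, EBCDIC families, and UTF-16 without a byte order mark, all of which a real parser has to handle.

```python
import codecs
import re

# The XML declaration uses only ASCII characters, so for any ASCII-compatible
# encoding it can be matched directly on the raw bytes.
_DECL = re.compile(
    rb'^<\?xml[^>]*?\sencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def declared_encoding(data: bytes) -> str:
    """Return the labelled encoding of an XML entity; never a guess."""
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF-16"
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii")
    return "UTF-8"  # the default when no label is present

doc = b'<?xml version="1.0" encoding="ISO-8859-2"?><name>Erd\xf5s</name>'
label = declared_encoding(doc)
text = doc.decode(label)  # strict decoding: bad bytes raise, nothing is repaired
print(label, text)
```

The point of the design is that every branch returns something the document (or the specification's default) states explicitly; if the label is wrong or the bytes do not match it, strict decoding fails loudly instead of producing plausible garbage.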

A system that guesses encoding is unsuitable for critical data. In a
hospital record, you don't want your name to be rejected because it has
some Hungarian character but you are in a German hospital, etc.

Cheers
Rick Jelliffe
