
  • From: "Rick Jelliffe" <rjelliffe@a...>
  • To: "Philippe Poulard" <philippe.poulard@s...>
  • Date: Fri, 21 Sep 2007 11:30:08 +1000 (EST)

Philippe Poulard said:
>
> I guess some parsers have additional heuristics for reading successfully
> the sequence <?xml encoding="blah-blah"?> ; maybe some try-catch to
> apply with the set of charset they know ?

I hope they don't, unless they are specific tools for repairing broken
documents.

Guessing encoding is the *opposite* of the XML approach and should be
strongly resisted. The XML approach is based on explicit labeling as the
only approach that is reliable (which is not the same as not-stuff-up-able
of course).

There are many problems with guessing:

 * most platforms provide hundreds of character sets
 * most character sets belong to families which are ASCII or EBCDIC
supersets, so there is not enough redundant (in the engineering-theoretic
sense) information or orthogonality to know which specific set is
actually being used
 * most transcoders don't actually generate exceptions when an unknown
byte sequence is found: older ones just ignore the sequence, others
replace it with "?" or some other character, and some more recent
transcoders are a little better, so you cannot tell whether a decode
actually succeeded
 * detecting encoding from statistical patterns in the text relies on the
document conforming, to a certain extent, to the corpus the detector was
built from, and may even be skewed by the use of native-language markup
 * guessing prevents error detection
 * guessing can corrupt the database
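
To make the ambiguity concrete, here is a minimal Python sketch. The
helper name decodings and the sample name are made up for illustration;
the codec names are standard Python ones. It takes a name encoded in
ISO 8859-2 and tries a handful of ASCII-superset encodings. Every one of
them "succeeds", so a try-catch loop has nothing to catch.

```python
def decodings(data: bytes, candidates):
    """Return {encoding: text} for every candidate that decodes without error."""
    results = {}
    for enc in candidates:
        try:
            results[enc] = data.decode(enc)
        except UnicodeDecodeError:
            pass  # a try-catch heuristic would move on to the next charset
    return results


name = "Erd\u0151s"                      # Hungarian name with o-double-acute
data = name.encode("iso8859_2")          # the bytes as actually stored

candidates = ["iso8859_1", "iso8859_2", "cp1250", "cp1252", "koi8_r"]
results = decodings(data, candidates)

for enc, text in results.items():
    marker = "correct" if text == name else "WRONG, but no error raised"
    print(f"{enc:10} {text!r:12} {marker}")

# Lenient error handling hides the problem instead of reporting it:
dropped = data.decode("ascii", errors="ignore")    # 'Erds'
replaced = data.decode("ascii", errors="replace")  # 'Erd\ufffds'
```

None of the wrong decodings raises an exception, and the lenient modes
silently alter the name, which is how guessed data ends up corrupting a
database.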

So the XML system is then based on solving the problem "How do we read
that label reliably?"  The UTF-8 default is just low-hanging fruit,
because a UTF-8 decoder also accepts ISO646-US (ASCII), but again it is
not in any sense guessed.
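
As a rough sketch of "reading the label" rather than guessing, loosely
following the non-normative autodetection appendix (Appendix F) of the
XML 1.0 spec: the function names declared_encoding and read_xml_text are
hypothetical, and this simplified version assumes an ASCII-compatible
encoding or a byte order mark. It does not handle EBCDIC, UCS-4 or
BOM-less UTF-16, and it does not check that a BOM and the declaration
agree.

```python
import re

# Matches an encoding pseudo-attribute inside an XML declaration at the
# very start of the entity. Declaration characters are all ASCII, so the
# bytes can be inspected before the real encoding is known.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._\-]*)["\']'
)


def declared_encoding(head: bytes) -> str:
    """Return the labelled encoding of an XML entity, never a guess."""
    if head.startswith(b"\xef\xbb\xbf"):
        return "UTF-8"                    # UTF-8 byte order mark
    if head.startswith((b"\xff\xfe", b"\xfe\xff")):
        return "UTF-16"                   # UTF-16 byte order mark
    m = _DECL.match(head)
    if m:
        return m.group(1).decode("ascii")  # explicit label wins
    return "UTF-8"                        # the default, not a guess


def read_xml_text(data: bytes) -> str:
    """Decode strictly with the labelled encoding; mislabelled data fails."""
    return data.decode(declared_encoding(data), errors="strict")


labelled = b'<?xml version="1.0" encoding="ISO-8859-2"?><n>Erd\xf5s</n>'
unlabelled = b"<n>Erd\xf5s</n>"           # same bytes, no label

print(declared_encoding(labelled))        # ISO-8859-2
print(read_xml_text(labelled))

try:
    read_xml_text(unlabelled)
    failed = False
except UnicodeDecodeError:
    failed = True                         # the error is detected, not hidden
print("unlabelled document rejected:", failed)
```

The point of the strict decode is that a missing or wrong label shows up
as an error at read time instead of as quietly mangled text.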

A system that guesses encoding is unsuitable for critical data. In a
hospital record, you don't want your name to be rejected because it has
some Hungarian character but you are in a German hospital, etc.

Cheers
Rick Jelliffe

