Re: Encoding detection again ...

From: David Brownell <db@E...>
To: Miles Sabin <msabin@c...>
Date: Tue, 02 Mar 1999 13:19:59 -0800

Miles Sabin wrote:
> 
> Appendix F of the spec says that given a document
> starting with the 4 octet sequence,
> 
>   00 3C 00 3F
> 
> I'm to infer BOM-less big-endian UTF-16, and
> given a document starting with,
> 
>   3C 00 3F 00
> 
> I'm to infer BOM-less little-endian UTF-16.

That is, the appendix _suggests_ (in a non-normative
fashion) that's the way to go.
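
As a rough sketch of that heuristic (Python; the function name is
made up for illustration, and it covers only the two BOM-less
patterns quoted above, not the rest of Appendix F):

```python
from typing import Optional


def sniff_bomless_utf16(head: bytes) -> Optional[str]:
    """Guess the byte order from the first four octets of a document.

    Hypothetical helper: it recognises only the two sequences
    discussed in this thread and returns None for anything else.
    """
    prefix = head[:4]
    if prefix == b"\x00\x3c\x00\x3f":   # 00 3C 00 3F -> "<?" big-endian
        return "utf-16-be"
    if prefix == b"\x3c\x00\x3f\x00":   # 3C 00 3F 00 -> "<?" little-endian
        return "utf-16-le"
    return None                         # no guess; try other heuristics
```

The result is only a guess to start decoding with; the encoding
declaration, once readable, still gets the final say.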

> What I want to know is: why could these
> sequences not equally represent (respectively)
> big-endian UCS-2 or little-endian UCS-2?

They could ...

> 
> 1. Unicode == UTF-16
> 2. UCS-2 != UTF-16 (because UCS-2 lacks UTF-16's
>    support for characters outside the BMP).

Put it this way:  if you assume UTF-16, you're
safe either way because UTF-16 is a superset
of UCS-2.

It'd be reasonable for an autodetecting algorithm
to support "downgrading" its guess from UTF-16 to
UCS-2, and it should probably do so if it's reporting
encoding mismatches as fatal errors.
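
One way such a downgrade might look, as a sketch only (Python;
the function name and the returned labels are assumptions for
illustration): scan the 16-bit code units and keep the UTF-16
guess only if a surrogate shows up, since surrogates are what
UTF-16 uses to reach characters outside the BMP.

```python
import struct


def refine_utf16_guess(data: bytes, big_endian: bool) -> str:
    """Return "UTF-16" if any surrogate code unit is present, else "UCS-2".

    Hypothetical helper: it does not check that surrogates are
    correctly paired, it only looks for their presence.
    """
    fmt = (">" if big_endian else "<") + "H"
    usable = len(data) - (len(data) % 2)    # ignore a trailing odd octet
    for (unit,) in struct.iter_unpack(fmt, data[:usable]):
        if 0xD800 <= unit <= 0xDFFF:        # surrogate range
            return "UTF-16"
    return "UCS-2"
```

A parser that treats a mismatch with a declared "UCS-2" encoding
as fatal could then compare against the refined label instead of
the initial UTF-16 guess.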

- Dave
