Re: Encoding detection again ...
Miles Sabin wrote:
>
> Appendix F of the spec says that given a document
> starting with the 4 octet sequence,
>
>   00 3C 00 3F
>
> I'm to infer BOM-less big-endian UTF-16, and
> given a document starting with,
>
>   3C 00 3F 00
>
> I'm to infer BOM-less little-endian UTF-16.

That is, the appendix _suggests_ (in a non-normative fashion)
that's the way to go.

> What I want to know is: why could these
> sequences not equally represent (respectively)
> big-endian UCS-2 or little-endian UCS-2?

They could ...

>
> 1. Unicode == UTF-16
> 2. UCS-2 != UTF-16 (because UCS-2 lacks UTF-16's
>    support for characters outside the BMP).

Put it this way: if you assume UTF-16, you're safe either way,
because UTF-16 is a superset of UCS-2.

It'd be reasonable for an autodetecting algorithm to support
"downgrading" its guess from UTF-16 to UCS-2, and it should
probably do so if it's reporting encoding mismatches as fatal
errors.

- Dave

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)
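For illustration, the guess-then-downgrade idea discussed above might be sketched roughly as follows in Python. This is not the algorithm from Appendix F, only one reading of it: the function name guess_utf16_family and the returned labels are hypothetical, and the sketch assumes that "no surrogate code units anywhere in the input" is an acceptable test for "could equally be UCS-2".

```python
def guess_utf16_family(data: bytes):
    """Guess a BOM-less 16-bit encoding from the first four octets.

    Hypothetical helper: returns (byte_order, label), or None when the
    input does not start with '<?' encoded as two 16-bit code units.
    """
    head = data[:4]
    if head == b"\x00\x3c\x00\x3f":
        order = "big"
    elif head == b"\x3c\x00\x3f\x00":
        order = "little"
    else:
        return None

    # Assume UTF-16 first (the superset), then see whether the guess can be
    # "downgraded": data with no surrogate code units reads the same as UCS-2.
    usable = len(data) - (len(data) % 2)  # ignore a trailing odd octet
    for i in range(0, usable, 2):
        unit = int.from_bytes(data[i:i + 2], order)
        if 0xD800 <= unit <= 0xDFFF:
            # A surrogate means a character outside the BMP: UTF-16 only.
            return order, "UTF-16"
    return order, "UTF-16 or UCS-2"
```

A caller that treats encoding mismatches as fatal could accept a declared UCS-2 encoding when the label is "UTF-16 or UCS-2" and reject it only when the label is "UTF-16".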