David Brownell wrote,
> Put it this way: if you assume UTF-16, you're
> safe either way because UTF-16 is a superset.

Err ... is that true? Maybe I'm being a bit obsessive about my interpretation of the various standards docs, but as far as I can see UCS-2 isn't a subset of UTF-16. The BMP S-zone codes (D800-DFFF) are undefined but reserved in UCS-2, and so should not occur in a purportedly UCS-2 stream. I would expect a processor which encountered such codes to either,

1. Spit out an error and give up.

or,

2. Quietly ignore them and continue processing with the next 2 octets.

Obviously these codes are defined and legal in UTF-16, so an incorrect assumption of UTF-16 when the stream was in fact broken UCS-2 would produce unpredictably incorrect behaviour (i.e. the processor might continue processing a broken doc in an indeterminate way).

In any case, on a less finickety note, I'd quite like to be able to compute string lengths UCS-2 style where that's appropriate, because 2*byte-length is a bit simpler than the UTF-16 equivalent ;-)

Anyway, here's a slightly updated version of a proposal I mailed to Tim Bray yesterday ...

In the absence of an appropriate MIME header the octet sequences,

1. FE FF
2. FF FE
3. 00 3C 00 3F
4. 3C 00 3F 00

may be inferred to be,

1. big-endian indeterminately encoded 2 octet characters.
2. little-endian indeterminately encoded 2 octet characters.
3. BOM-less big-endian indeterminately encoded 2 octet characters.
4. BOM-less little-endian indeterminately encoded 2 octet characters.

If either of the following PIs is found,

<?xml version="1.0" ?>
<?xml version="1.0" encoding="UTF-16"?>

or, in cases (1) and (2), if *no* PI is found, then encoding is resolved to UTF-16. Otherwise if,

<?xml version="1.0" encoding="ISO-10646-UCS-2"?>

is found then encoding is resolved to UCS-2.

This isn't very complicated, and isn't a zillion miles away from the current handling of UTF-8 vs. ISO 8859-x vs. US-ASCII.
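For illustration, the rules above could be sketched in Python roughly as follows. This is only a sketch under assumptions, not part of the proposal itself: the names infer_encoding and find_surrogate are hypothetical, the declaration is matched with a deliberately narrow regex that accepts only the forms listed above, and a declaration naming any other encoding yields None because the proposal does not say what should happen then.

```python
import re

# Leading octets from the proposal, mapped to the byte order they imply.
_BOMS = {b"\xfe\xff": "big", b"\xff\xfe": "little"}
_BOMLESS = {b"\x00\x3c\x00\x3f": "big", b"\x3c\x00\x3f\x00": "little"}

# Narrow match for the declaration forms listed in the proposal (assumption:
# a real parser would accept more whitespace and quoting variants).
_DECL = re.compile(r'^<\?xml\s+version="1\.0"(?:\s+encoding="([^"]+)")?\s*\?>')


def infer_encoding(data):
    """Return (byte_order, encoding), or None when the rules do not apply."""
    if data[:2] in _BOMS:
        order, has_bom, body = _BOMS[data[:2]], True, data[2:]
    elif data[:4] in _BOMLESS:
        order, has_bom, body = _BOMLESS[data[:4]], False, data
    else:
        return None
    codec = "utf-16-be" if order == "big" else "utf-16-le"
    # The declaration is ASCII-range text, so UCS-2 and UTF-16 agree on it;
    # decoding a short prefix is enough to read the label.
    head = body[:200].decode(codec, errors="replace")
    match = _DECL.match(head)
    if match is None:
        # Cases (1) and (2): a BOM with no declaration resolves to UTF-16.
        return (order, "UTF-16") if has_bom else None
    label = match.group(1)
    if label is None or label.upper() == "UTF-16":
        return (order, "UTF-16")
    if label.upper() == "ISO-10646-UCS-2":
        return (order, "UCS-2")
    return None  # Some other label: outside the scope of the proposal.


def find_surrogate(data, order):
    """Return the index of the first 2-octet unit in D800-DFFF, or None."""
    for i in range(0, len(data) - 1, 2):
        unit = int.from_bytes(data[i:i + 2], order)
        if 0xD800 <= unit <= 0xDFFF:
            return i // 2
    return None
```

A processor taking option 1 above could call find_surrogate on a stream resolved to UCS-2 and report an error if it returns an index; on a stream resolved to UTF-16 the same units are legal surrogate halves.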
Cheers,

Miles

--
Miles Sabin                       Cromwell Media
Internet Systems Architect        5/6 Glenthorne Mews
+44 (0)181 410 2230               London, W6 0LJ
msabin@c...                       England

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)



