Richard Tobin wrote:
> I don't think it's wrong for you to accept "UTF8", but I think it's
> wrong that the test uses it. It's not required that a parser
> recognize it, and one that doesn't will reject the document at that
> point.

Yes, and the XML spec even hints that it is wrong to accept "UTF8" as being synonymous with "UTF-8". Section 4.3.3 of the XML Rec is pretty clear on this point, but uses "should" language instead of "must", unfortunately:

  All XML processors must be able to read entities in both the UTF-8 and UTF-16 encodings. The terms "UTF-8" and "UTF-16" in this specification do not apply to character encodings with any other labels, even if the encodings or labels are very similar to UTF-8 or UTF-16.

  [...] In an encoding declaration, the values "UTF-8", "UTF-16", [...] should be used for the various encodings and transformations of Unicode / ISO/IEC 10646 [...]

  [...] It is recommended that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just listed, be referred to using their registered names; other encodings should use names starting with an "x-" prefix. XML processors should match character encoding names in a case-insensitive way and should either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown [...]

Given that only "UTF-8" -- not "UTF8" -- is listed in http://www.iana.org/assignments/character-sets, "UTF8" violates the first "should" recommendation here (it should be "x-UTF8"). Furthermore the processor that accepts it as if it were "UTF-8" is violating the third "should" recommendation that the non-IANA-registered encoding actually be treated as unknown, and thus produce a fatal error.

My question is, must the XML parser developer honor these "shoulds" as if they were "musts" and produce a fatal error rather than accepting "UTF8"?

The IANA registry is for character maps that may be used on the Internet. An XML parser is not necessarily "on the Internet", so I can see an argument, especially in light of the fact that the EncName production is not constrained to IANA-registered values, for the acceptance of unregistered charset names. Other opinions appreciated.

- Mike
_____________________________________________________________________________
mike j. brown, software engineer at | xml/xslt: http://skew.org/xml/
webb.net in denver, colorado, USA   | personal: http://hyperreal.org/~mike/
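
For illustration only, the two behaviours being debated could be sketched roughly as follows in Python. This is not taken from any real parser: `resolve_encoding`, `UnknownEncoding`, `IANA_SUBSET` and `LENIENT_ALIASES` are hypothetical names, and the table is a tiny hand-picked subset standing in for the IANA registry. Strict mode follows the "should" recommendations quoted above (case-insensitive match, unregistered names treated as unknown); lenient mode additionally accepts "UTF8".

```python
import re

# EncName production from XML 1.0 section 4.3.3:
#   [A-Za-z] ([A-Za-z0-9._] | '-')*
ENC_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")

# Assumption: a tiny illustrative subset of registered names, upper-cased.
# A real processor would consult the full IANA character-sets registry.
IANA_SUBSET = {"UTF-8", "UTF-16", "ISO-8859-1", "US-ASCII"}

# Assumption: unregistered labels a lenient processor might choose to accept.
LENIENT_ALIASES = {"UTF8": "UTF-8", "UTF16": "UTF-16"}


class UnknownEncoding(ValueError):
    """The name is a valid EncName but is not one this processor knows."""


def resolve_encoding(name, lenient=False):
    """Map a declared encoding name to a canonical label, or raise."""
    if not ENC_NAME.fullmatch(name):
        raise ValueError("not a valid EncName: %r" % name)
    key = name.upper()  # names are matched case-insensitively
    if key in IANA_SUBSET:
        return key
    if lenient and key in LENIENT_ALIASES:
        return LENIENT_ALIASES[key]
    # Strict reading: anything unregistered is unknown, which a processor
    # would then report as a fatal error.
    raise UnknownEncoding(name)


if __name__ == "__main__":
    print(resolve_encoding("utf-8"))
    print(resolve_encoding("UTF8", lenient=True))
    try:
        resolve_encoding("UTF8")
    except UnknownEncoding as exc:
        print("unknown encoding:", exc)
```

The `lenient` flag makes the trade-off explicit: the EncName grammar admits "UTF8", so whether to accept it is a policy decision layered on top of the syntax check rather than something the grammar settles.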



