A potentially useful data point: the open source ICU project (International Components for Unicode) [1], which provides a large character encoding conversion API in C/C++, has the following policy for matching names of character encodings (from the distribution file icu/data/convrtrs.txt):

    Name matching is case-insensitive. Also, dashes '-', underscores '_'
    and spaces ' ' are ignored in names (thus cs-iso-latin-1 and
    csisolatin1 are the same).

Under this regime, "UTF-8" = "utf-8" = "utf_8" = "UTF8" = ...

It seems to me that it is exactly these variations that humans are likely to produce; given the human-legible/producible aspect of the design of XML, it's nice to see an algorithmically simple and unambiguous method to accept authors' expressed intent.

Steve Rowe
MNIS-TextWise Labs

[1] http://oss.software.ibm.com/developerworks/opensource/icu/

Mike Brown wrote:
> Richard Tobin wrote:
> > I don't think it's wrong for you to accept "UTF8", but I
> > think it's wrong that the test uses it. It's not required
> > that a parser recognize it, and one that doesn't will
> > reject the document at that point.
>
> Yes, and the XML spec even hints that it is wrong to accept
> "UTF8" as being synonymous with "UTF-8". Section 4.3.3 of
> the XML Rec is pretty clear on this point, but uses "should"
> language instead of "must", unfortunately:
>
>    All XML processors must be able to read entities in both
>    the UTF-8 and UTF-16 encodings. The terms "UTF-8" and
>    "UTF-16" in this specification do not apply to character
>    encodings with any other labels, even if the encodings or
>    labels are very similar to UTF-8 or UTF-16.
>
> [...]
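
For illustration, the matching rule quoted from convrtrs.txt can be sketched in a few lines of Python. This is only a sketch of the policy as worded in that quote, not ICU's actual code or API; the function names normalize_charset_name and same_charset_name are hypothetical.

```python
def normalize_charset_name(name: str) -> str:
    """Fold case and drop '-', '_' and ' ', per the quoted policy.

    Hypothetical helper: it mirrors the wording of the quoted rule,
    not ICU's real implementation.
    """
    ignored = "-_ "
    return "".join(ch for ch in name.lower() if ch not in ignored)


def same_charset_name(a: str, b: str) -> bool:
    """Return True if two encoding labels match under the quoted rule."""
    return normalize_charset_name(a) == normalize_charset_name(b)


if __name__ == "__main__":
    # The variants listed in the message all collapse to one key.
    for label in ("UTF-8", "utf-8", "utf_8", "UTF8"):
        print(label, "->", normalize_charset_name(label))
    print(same_charset_name("cs-iso-latin-1", "csisolatin1"))
```

Under this sketch "UTF-8" and "UTF8" compare equal, which is exactly the leniency the quoted passage from the XML Rec does not grant; a parser that adopts such a rule is choosing to accept more than it is required to.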