  • From: Steve Rowe <sarowe@t...>
  • To: xml-dev@l...
  • Date: Mon, 11 Jun 2001 17:17:50 -0400

A potentially useful data point: the open source ICU project
(International Components for Unicode) [1], which provides a large
character encoding conversion API in C/C++, has the following policy
for matching names of character encodings (from the distribution file
icu/data/convrtrs.txt):

   Name matching is case-insensitive. Also, dashes '-',
   underscores '_' and spaces ' ' are ignored in names
   (thus cs-iso-latin-1 and csisolatin1 are the same).

Under this regime, "UTF-8" = "utf-8" = "utf_8" = "UTF8" = ...
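
As a rough illustration of the quoted rule only (not ICU's actual
code; the function names below are made up for this sketch), the
comparison can be written in a few lines of Python:

```python
def normalize_encoding_name(name: str) -> str:
    """Apply the quoted rule: fold case and drop '-', '_' and ' '.

    Hypothetical helper for illustration; ICU's own matcher may
    differ in details not covered by the quoted text.
    """
    return "".join(ch for ch in name.lower() if ch not in "-_ ")


def same_encoding_name(a: str, b: str) -> bool:
    """Return True if two encoding labels match under the quoted rule."""
    return normalize_encoding_name(a) == normalize_encoding_name(b)


if __name__ == "__main__":
    for label in ("UTF-8", "utf-8", "utf_8", "UTF8"):
        print(label, same_encoding_name(label, "UTF-8"))
    print(same_encoding_name("cs-iso-latin-1", "csisolatin1"))
```

Normalizing once and comparing the results keeps the check
symmetric, and the normalized form could also serve as a lookup key
in an alias table.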

It seems to me that these are exactly the variations humans are
likely to produce. Given that XML is designed to be human-legible
and human-producible, it's nice to see an algorithmically simple and
unambiguous method for accepting an author's expressed intent.

Steve Rowe
MNIS-TextWise Labs

[1] http://oss.software.ibm.com/developerworks/opensource/icu/

Mike Brown wrote:
> Richard Tobin wrote:
> > I don't think it's wrong for you to accept "UTF8", but I
> > think it's wrong that the test uses it.  It's not required
> > that a parser recognize it, and one that doesn't will
> > reject the document at that point.
>
> Yes, and the XML spec even hints that it is wrong to accept
> "UTF8" as being synonymous with "UTF-8". Section 4.3.3 of
> the XML Rec is pretty clear on this point, but uses "should"
> language instead of "must", unfortunately:
>
>    All XML processors must be able to read entities in both
>    the UTF-8 and UTF-16 encodings. The terms "UTF-8" and
>    "UTF-16" in this specification do not apply to character
>    encodings with any other labels, even if the encodings or
>    labels are very similar to UTF-8 or UTF-16.
>
>    [...]
