
Unrecognized encodings (was Re: XML 1.0 Conformance Test Results)

  • From: Mike Brown <mike@s...>
  • To: xml-dev@l...
  • Date: Mon, 11 Jun 2001 11:12:28 -0600 (MDT)

Richard Tobin wrote:
> I don't think it's wrong for you to accept "UTF8", but I think it's
> wrong that the test uses it.  It's not required that a parser
> recognize it, and one that doesn't will reject the document at that
> point.

Yes, and the XML spec even hints that it is wrong to accept "UTF8" as
being synonymous with "UTF-8". Section 4.3.3 of the XML Rec is pretty 
clear on this point, but uses "should" language instead of "must", 
unfortunately:

   All XML processors must be able to read entities in both the UTF-8 and 
   UTF-16 encodings. The terms "UTF-8" and "UTF-16" in this specification 
   do not apply to character encodings with any other labels, even if the 
   encodings or labels are very similar to UTF-8 or UTF-16.

   [...]

   In an encoding declaration, the values "UTF-8", "UTF-16", [...]
   should be used for the various encodings and transformations of
   Unicode / ISO/IEC 10646 [...]

   [...]

   It is recommended that character encodings registered (as charsets) 
   with the Internet Assigned Numbers Authority [IANA-CHARSETS], other 
   than those just listed, be referred to using their registered names; 
   other encodings should use names starting with an "x-" prefix. XML 
   processors should match character encoding names in a case-insensitive 
   way and should either interpret an IANA-registered name as the 
   encoding registered at IANA for that name or treat it as unknown [...]

Given that only "UTF-8" -- not "UTF8" -- is listed in
http://www.iana.org/assignments/character-sets, "UTF8" violates the first
"should" recommendation here (it should be "x-UTF8"). Furthermore, a
processor that accepts it as if it were "UTF-8" is violating the third
"should" recommendation, that a name not registered with IANA actually be
treated as unknown, and thus produce a fatal error.
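
To make the strict reading concrete, here is a minimal sketch of how a
processor might classify an encoding label under it. The function name and
the tiny REGISTERED set are hypothetical, hand-picked for illustration; a
real processor would consult the full IANA registry, aliases included.

```python
# Hypothetical, hand-picked subset of IANA-registered charset names.
# Not the full registry; for illustration only.
REGISTERED = {"UTF-8", "UTF-16", "US-ASCII", "ISO-8859-1"}


def classify_encoding_name(label: str) -> str:
    """Return 'registered', 'private', or 'unknown' for an encoding label."""
    upper = label.upper()  # the spec asks for case-insensitive matching
    if upper in REGISTERED:
        return "registered"
    if upper.startswith("X-"):
        return "private"  # unregistered names should carry an "x-" prefix
    return "unknown"  # strict reading: the processor reports a fatal error
```

Under these assumptions "utf-8" comes back as registered, "x-UTF8" as
private, and "UTF8" as unknown, which is the case in dispute.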

My question is, must the XML parser developer honor these "shoulds" as if
they were "musts" and produce a fatal error rather than accepting "UTF8"?

The IANA registry is for character maps that may be used on the Internet.  
An XML parser is not necessarily "on the Internet", so I can see an
argument, especially in light of the fact that the EncName production is
not constrained to IANA-registered values, for the acceptance of
unregistered charset names.
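
The point about EncName can be checked mechanically. The sketch below
transcribes the EncName production from section 4.3.3 into a regular
expression; the helper name is my own, and it tests only the syntax of a
label, not whether the label is registered anywhere.

```python
import re

# EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')*   (XML 1.0, section 4.3.3)
ENC_NAME = re.compile(r"[A-Za-z][A-Za-z0-9._-]*")


def is_valid_enc_name(label: str) -> bool:
    """True if the label is syntactically a legal EncName."""
    return ENC_NAME.fullmatch(label) is not None
```

Both "UTF-8" and "UTF8" pass this check, so the grammar alone cannot be
what rules "UTF8" out; only the "should" recommendations can.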

Other opinions appreciated.

   - Mike
_____________________________________________________________________________
mike j. brown, software engineer at  |  xml/xslt: http://skew.org/xml/
webb.net in denver, colorado, USA    |  personal: http://hyperreal.org/~mike/
