XML-DEV Mailing List Archive: RE: Text/xml with omitted charset parameter
> From: Bjoern Hoehrmann [mailto:derhoermi@g...]
> Sent: Thursday, October 25, 2001 6:07 PM
> To: ietf-xml-mime@i...
> Cc: xml-dev@l...
> Subject: Text/xml with omitted charset parameter
>
> Hi,
>
> Quoting RFC 3023, section 8.5:
>
> | 8.5 Text/xml with Omitted Charset
> |
> |    Content-type: text/xml
> |
> |    {BOM}<?xml version="1.0" encoding="utf-16"?>
> |
> |    or
> |
> |    {BOM}<?xml version="1.0"?>
> |
> |    This example shows text/xml with the charset parameter omitted. In
> |    this case, MIME and XML processors MUST assume the charset is "us-
> |    ascii",
>
> ... and issue a fatal error, since there is no BOM in US-ASCII. Mentioning
> UTF-16 in this example is absurd; XML documents labeled as text/xml without
> a charset parameter can never, ever be UTF-16 encoded. So, who tells me I
> am wrong, and that text/xml documents without a charset parameter may still
> be UTF-8 encoded (and use non-ASCII characters)? Apache uses text/xml as
> the default type for .xml documents -- are they asking for interoperability
> problems, or what?

Mentioning UTF-16 in this example is not absurd at all. It describes a scenario that could easily arise in the real world: a UTF-16 encoded XML document encapsulated in a MIME envelope whose Content-Type header does not include a charset parameter. The RFC states that in such a scenario, compliant processors must treat the document as US-ASCII, which, as you correctly point out, would lead to a processing error.

The key point is that for the text/xml media type, the charset parameter is authoritative. Omitting it in any instance where the document uses a character encoding other than US-ASCII is an error, regardless of any BOM or encoding declaration in the XML document itself.

And yes, just serving up "text/xml" with no charset parameter is asking for interoperability problems. But to be honest, this RFC is widely violated by many software packages and products on the market.
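To make the "charset parameter is authoritative" rule concrete, here is a minimal sketch (not a full MIME parser; the function names are my own) of how an RFC 3023-compliant consumer would determine the encoding of a text/xml entity. Note that the BOM and the XML encoding declaration play no role at all:

```python
def textxml_charset(content_type: str) -> str:
    """Return the effective charset for a text/xml Content-Type value.

    Per RFC 3023, the charset parameter is authoritative for text/xml;
    if it is omitted, the default is us-ascii -- regardless of any BOM
    or encoding declaration inside the document.
    """
    parts = [p.strip() for p in content_type.split(";")]
    for part in parts[1:]:
        name, _, value = part.partition("=")
        if name.strip().lower() == "charset":
            return value.strip().strip('"').lower()
    return "us-ascii"  # omitted charset => us-ascii default

def decode_textxml(body: bytes, content_type: str) -> str:
    """Decode a text/xml entity body using only the MIME-level charset."""
    charset = textxml_charset(content_type)
    # For a UTF-16 document served as bare "text/xml", the BOM bytes
    # (0xFF 0xFE or 0xFE 0xFF) are not valid us-ascii, so this raises
    # UnicodeDecodeError -- the fatal-error scenario discussed above.
    return body.decode(charset)
```

For example, `textxml_charset("text/xml")` yields `"us-ascii"`, while `decode_textxml` on a BOM-prefixed UTF-16 body served as bare `text/xml` fails with a decoding error, exactly as the thread describes.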
Many products ignore the headers and just go by the encoding declaration in the XML (or assume UTF-8 if none is present). So serving up XML documents in the US-ASCII character encoding and omitting the charset parameter would also be asking for interoperability problems, even though it complies with this RFC. Unfortunately, there is a hell of a lot of software out there that just uses "text/xml" with no charset parameter; Apache certainly isn't the only offender. So when you want to write software that can accept XML via HTTP or within MIME envelopes, you are going to encounter interoperability headaches no matter what.

The best way to ensure interoperability is:

* Always use UTF-8
* Always include the appropriate charset parameter
* If, for some reason, you must use another character encoding, include the appropriate charset parameter as well as a redundant encoding declaration in the XML

Of course, building a default media type with an appropriate charset parameter into a web server product poses an obvious challenge: how is the product supposed to know what character encoding is used for the text documents on the server where it will be installed? I suppose the product could special-case XML documents and parse each one before serving it, to check for an encoding declaration or auto-detect the encoding. But it's probably better for the server administrator to ensure that a consistent character encoding is used for the documents, and that the XML media type configured on the server includes the charset parameter.
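The three recommendations above can be sketched from the producer's side. This is a hypothetical helper of my own devising (assuming the input XML text does not already carry a declaration), showing a response where the charset parameter and the in-document encoding declaration agree:

```python
def serialize_xml_response(xml_text: str, encoding: str = "utf-8"):
    """Build (headers, body) for serving XML interoperably.

    Follows the recommendations above: prefer UTF-8, always send the
    charset parameter, and include a redundant encoding declaration
    so receivers that ignore the MIME headers still decode correctly.
    Assumes xml_text does not already start with an XML declaration.
    """
    decl = '<?xml version="1.0" encoding="%s"?>\n' % encoding
    body = (decl + xml_text).encode(encoding)
    headers = {"Content-Type": "text/xml; charset=%s" % encoding}
    return headers, body
```

With the defaults, the Content-Type header comes out as `text/xml; charset=utf-8`, which satisfies RFC 3023 and also keeps header-ignoring consumers working, since the encoding declaration in the body says the same thing.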