Re: Why is Encoding Metadata (e.g. encoding="UTF-8) put Inside

  • From: Jonathan Robie <jonathan.robie@r...>
  • To: "Costello, Roger L." <costello@m...>
  • Date: Wed, 19 Sep 2007 08:56:56 -0400

Costello, Roger L. wrote:
> Typically XML and HTML documents are exchanged on the Internet using
> the HTTP protocol.  

When they are, software that sends an existing XML document can use the 
encoding declaration to determine how to set the charset parameter of 
the MIME type. But XML documents live in many other places. They may be 
stored in repositories or on hard disks, for instance, where they are 
not accompanied by a MIME type.
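As a rough illustration of the first point, here is a minimal Python 
sketch of how sending software might derive a Content-Type value from 
the document's own encoding declaration. The function name 
content_type_for and the UTF-8 fallback are assumptions made for this 
example, and the sketch only handles ASCII-compatible encodings with no 
byte order mark.

```python
import re

# Matches an XML declaration at the very start of the data and captures
# the value of its encoding pseudo-attribute, if one is present.
_ENCODING_DECL = re.compile(
    rb'^<\?xml\s+version\s*=\s*["\'][^"\']*["\']'
    rb'\s+encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def content_type_for(xml_bytes, default="UTF-8"):
    """Build a Content-Type value from the document's encoding declaration.

    Hypothetical helper: assumes an ASCII-compatible encoding and no BOM,
    and falls back to `default` when no encoding is declared.
    """
    match = _ENCODING_DECL.match(xml_bytes)
    charset = match.group(1).decode("ascii") if match else default
    return f"application/xml; charset={charset}"


print(content_type_for(b'<?xml version="1.0" encoding="EUC-JP"?><doc/>'))
print(content_type_for(b"<doc/>"))
```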

Also, XML parsers generally don't have access to the MIME type. They do 
have access to the document.

Of course, many parsers also have no trouble parsing XML documents that 
don't declare their encoding, at least for the expected character 
sets. The prolog is not required to have an XML declaration, and the XML 
declaration is not required to have an encoding declaration:

[1] document ::= prolog element Misc*
[22] prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
[23] XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
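To see these productions in practice, the following sketch feeds three 
small documents to Python's standard xml.etree.ElementTree parser: one 
with no XML declaration, one with a declaration but no encoding 
declaration, and one that declares ISO-8859-1. The sample documents and 
the helper name parse_texts are made up for this example.

```python
import xml.etree.ElementTree as ET

# The first two samples are UTF-8 bytes; the third is ISO-8859-1 bytes
# and says so in its encoding declaration.
samples = {
    "no XML declaration": b"<greeting>caf\xc3\xa9</greeting>",
    "declaration without encoding":
        b'<?xml version="1.0"?><greeting>caf\xc3\xa9</greeting>',
    "declaration with encoding":
        b'<?xml version="1.0" encoding="ISO-8859-1"?>'
        b"<greeting>caf\xe9</greeting>",
}


def parse_texts(docs):
    """Parse each document and return the text of its root element."""
    return {label: ET.fromstring(data).text for label, data in docs.items()}


results = parse_texts(samples)
for label, text in results.items():
    print(f"{label}: {text}")
```

All three documents should yield the same text, even though only the 
last one says anything about its encoding.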

> But that raises an intriguing question: in order to read the document
> you need to know what its encoding is, but to know what the encoding is
> you must read the document! 
>   

Autodetection of character encodings in XML documents is discussed in 
some detail here:

http://www.w3.org/TR/2006/REC-xml-20060816/#sec-guessing

> These are all ASCII characters.  

The XML encoding declaration is restricted to characters taken from the 
ASCII repertoire specifically to make this kind of character encoding 
guessing easier, as discussed in the appendix referenced above.
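A simplified sketch of that guessing procedure in Python: look at the 
first few bytes to pick an encoding family, decode just enough to read 
the ASCII-only declaration, and then trust the declared name if there is 
one. The function name detect_xml_encoding is hypothetical, and the 
signature table covers only a few common cases (no EBCDIC, no UTF-32 
without a byte order mark), so treat it as an illustration rather than a 
complete implementation of the appendix.

```python
import re

# (leading bytes, codec used to read the declaration, BOM length to skip).
# Longer signatures come first so UTF-32 is not mistaken for UTF-16.
_SIGNATURES = [
    (b"\x00\x00\xfe\xff", "utf-32-be", 4),
    (b"\xff\xfe\x00\x00", "utf-32-le", 4),
    (b"\xef\xbb\xbf", "utf-8", 3),
    (b"\xfe\xff", "utf-16-be", 2),
    (b"\xff\xfe", "utf-16-le", 2),
    (b"\x00<\x00?", "utf-16-be", 0),  # "<?" in UTF-16BE, no BOM
    (b"<\x00?\x00", "utf-16-le", 0),  # "<?" in UTF-16LE, no BOM
    (b"<?xm", "ascii", 0),            # some ASCII-compatible encoding
]

_DECL = re.compile(
    r'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def detect_xml_encoding(data):
    """Guess the encoding of an XML entity from its first bytes."""
    for signature, codec, bom_len in _SIGNATURES:
        if data.startswith(signature):
            # The declaration is pure ASCII, so a lossy decode is enough.
            head = data[bom_len:bom_len + 400].decode(codec, errors="ignore")
            match = _DECL.match(head)
            if match:
                return match.group(1)
            # Declaration absent or silent: fall back on the family.
            return "utf-8" if codec == "ascii" else codec
    # No BOM and no XML declaration: assume UTF-8.
    return "utf-8"


print(detect_xml_encoding(b'<?xml version="1.0" encoding="EUC-JP"?><doc/>'))
```

Once the name is known, the parser can decode the rest of the entity 
with it.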

> From then on, it interprets the rest of the document
> using the encoding it found in the XML declaration.
>   

Yes.

> Likewise, all HTML documents must begin with a header section:
>
> <html>
>     <head>
>         <meta http-equiv="Content-Type" content="text/html;
> Charset="UTF-8"  />
>   

Here's a useful excerpt from the XHTML spec:

C.9. Character Encoding

Historically, the character encoding of an HTML document is either
specified by a web server via the charset parameter of the HTTP
Content-Type header, or via a meta element in the document itself. In
an XML document, the character encoding of the document is specified on
the XML declaration (e.g., <?xml version="1.0" encoding="EUC-JP"?>). In
order to portably present documents with specific character encodings,
the best approach is to ensure that the web server provides the correct
headers. If this is not possible, a document that wants to set its
character encoding explicitly must include both the XML declaration an
encoding declaration and a meta http-equiv statement (e.g.,
<meta http-equiv="Content-type" content="text/html; charset=EUC-JP" />).
In XHTML-conforming user agents, the value of the encoding declaration
of the XML declaration takes precedence.

Note: be aware that if a document must include the character encoding
declaration in a meta http-equiv statement, that document may always be
interpreted by HTTP servers and/or user agents as being of the internet
media type defined in that statement. If a document is to be served as
multiple media types, the HTTP server must be used to set the encoding
of the document.

Hope this is helpful!

Jonathan

