Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Docum

  • From: "Costello, Roger L." <costello@m...>
  • To: <xml-dev@l...>
  • Date: Tue, 18 Sep 2007 19:53:35 -0400

Hi Folks,

Below I describe my understanding of:
1. Why the indication of how an XML document is encoded is placed
"within" the document, and
2. How an XML parser is able to parse an XML document before it even
knows its encoding.

I would appreciate any comments on where I err.  

------------------------------------------

It is considered best practice to embed within your document an
indication of the encoding used to create the document.

For example, in XML documents you put encoding information in the XML
declaration:

     <?xml version="1.0" encoding="UTF-8"?>

In HTML documents you put encoding information in the header section:

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Why? Shouldn't metadata be external to a document?

Typically XML and HTML documents are exchanged on the Internet using
HTTP.  The HTTP Content-Type header has a charset parameter to indicate
the encoding of its payload (i.e. the XML document or the HTML
document), e.g.

    Content-Type: text/xml; charset=UTF-8
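
As a minimal sketch, here is how a receiver might pull the charset
parameter out of a Content-Type header value in Python.  The function
name charset_from_content_type is hypothetical; it leans on the
standard library's email.message.Message class, which parses
MIME-style header parameters.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(header_value: str) -> Optional[str]:
    """Return the charset parameter (lowercased), or None if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() handles quoting and parameter-name case.
    return msg.get_content_charset()


print(charset_from_content_type('text/xml; charset="UTF-8"'))
print(charset_from_content_type("text/xml"))
```

The second call has no charset parameter to report, which is the
situation the rest of this post is concerned with.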

Isn't the HTTP header sufficient to specify a document's encoding?

Suppose you have a big web server with lots of sites and hundreds of
pages, contributed by lots of people in lots of different languages.
The web server wouldn't know the encoding of each document.  

So it is considered best practice to specify the encoding within the
document itself.

But that raises an intriguing question: in order to read the document
you need to know what its encoding is, but to know what the encoding is
you must read the document! 

Stated differently, for an XML parser to know how to interpret the bit
strings in a document it must know the encoding, but to know the
encoding it must read the document!

We seem to have a chicken-and-egg situation.  How is this handled?

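Here's how: an XML document that declares its encoding begins with this
XML declaration (it may be omitted only when the document is UTF-8 or
UTF-16, or when the encoding is supplied externally):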

    <?xml version="1.0" encoding="..."?>

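These are all ASCII characters.  Thus, an XML parser opens the
document and interprets the bytes as ASCII characters up to the closing
"?>" of the declaration.  From then on, it interprets the rest of the
document using the encoding it found in the XML declaration.  (This
works as described only for encodings that represent these characters
with the same bytes as ASCII, such as UTF-8 or ISO-8859-1; for others,
such as UTF-16, a parser first has to recognize the encoding family
from the document's opening bytes, e.g. a byte order mark.)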
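
The bootstrapping step for XML can be sketched in Python as follows.
This is only an illustration under simplifying assumptions: it handles
ASCII-compatible encodings only, it ignores byte order marks, and the
names sniff_xml_encoding and decode_xml are hypothetical.  A real
parser does more (see the autodetection appendix of the XML
specification).

```python
import re

# Matches the encoding pseudo-attribute inside an XML declaration.
_DECL_ENCODING = re.compile(
    r'encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding named in the XML declaration, else `default`."""
    if not data.startswith(b"<?xml"):
        return default              # no declaration at all
    end = data.find(b"?>")
    if end == -1:
        return default              # declaration never closes
    # Read only the declaration, treating its bytes as ASCII.
    decl = data[:end + 2].decode("ascii", errors="replace")
    match = _DECL_ENCODING.search(decl)
    return match.group(1) if match else default


def decode_xml(data: bytes) -> str:
    """Decode the whole document using the sniffed encoding.

    Raises LookupError if Python does not know the declared encoding.
    """
    return data.decode(sniff_xml_encoding(data))


doc = ('<?xml version="1.0" encoding="ISO-8859-1"?>'
       '<name>Jos\u00e9</name>').encode("iso-8859-1")
print(sniff_xml_encoding(doc))
print(decode_xml(doc))
```

The declaration is read before the encoding is known, and only then is
the non-ASCII content of the name element decoded.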

Likewise, an HTML document that declares its encoding does so in its
header section:

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

These are all ASCII characters.  Thus, an HTML parser opens the
document and interprets the bytes as ASCII characters until it finds
the meta tag in the header section.  From then on, it interprets the
rest of the document using the encoding it found in the meta tag.
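
The same idea for HTML, again as a rough Python sketch rather than a
faithful implementation of what browsers do.  The function name
sniff_html_encoding is hypothetical, and the 1024-byte scan limit is an
assumption chosen for the example.

```python
import re
from typing import Optional

# Looks for charset=... inside a <meta ...> tag, quoted or unquoted.
_META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?\s*([A-Za-z0-9._-]+)',
    re.IGNORECASE,
)


def sniff_html_encoding(data: bytes, limit: int = 1024) -> Optional[str]:
    """Scan the first `limit` bytes for a meta charset; None if absent."""
    match = _META_CHARSET.search(data[:limit])
    return match.group(1).decode("ascii") if match else None


page = (b'<html>\n    <head>\n        <meta http-equiv="Content-Type" '
        b'content="text/html; charset=UTF-8" />\n')
print(sniff_html_encoding(page))
```

Scanning the raw bytes avoids having to decode anything before the
encoding is known, which is the chicken-and-egg problem in miniature.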

---------------------

Do you agree?  /Roger

