Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?

From: "Costello, Roger L." <costello@m...>
To: <xml-dev@l...>
Date: Tue, 18 Sep 2007 19:53:35 -0400

Hi Folks,

Below I describe my understanding of:
1. Why the indication of how an XML document is encoded is placed
"within" the document, and
2. How an XML parser is able to parse an XML document before it even
knows its encoding.

I would appreciate any comments on where I err.  

------------------------------------------

It is considered best practice to embed within your document an
indication of the encoding used to create the document.

For example, in XML documents you put encoding information in the XML
declaration:

     <?xml version="1.0" encoding="UTF-8"?>

In HTML documents you put encoding information in the header section:

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Why? Shouldn't metadata be external to a document?

Typically XML and HTML documents are exchanged on the Internet over
HTTP.  The HTTP Content-Type header has a charset parameter that
indicates the encoding of its payload (i.e. the XML document or the
HTML document), e.g.

    Content-Type: text/xml; Charset="UTF-8"
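
To make the external route concrete, here is a minimal Python sketch of
how a receiver might pull the charset out of such a header value.  The
function name charset_from_content_type is my own, purely illustrative;
the parsing is delegated to the standard library's email.message
module, which lowercases the result and returns None when no charset
parameter is present.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    # Message knows how to split a header value into its parameters;
    # parameter names are matched case-insensitively and quotes are removed.
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


print(charset_from_content_type('text/xml; Charset="UTF-8"'))
print(charset_from_content_type("text/xml"))
```

When the second call yields None, the receiver has nothing external to
go on, which is the situation discussed next.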

Isn't the HTTP header sufficient to specify a document's encoding?

Suppose you have a big web server with lots of sites and hundreds of
pages, contributed by lots of people in lots of different languages.
The web server wouldn't know the encoding of each document.  

So it is considered best practice to specify the encoding within the
document itself.

But that raises an intriguing question: in order to read the document
you need to know what its encoding is, but to know what the encoding is
you must read the document! 

Stated differently, for an XML parser to know how to interpret the bit
strings in a document it must know the encoding, but to know the
encoding it must read the document!

We seem to have a chicken-and-egg situation.  How is this handled?

Here's how: all XML documents must begin with this XML declaration:

    <?xml version="1.0" encoding="..."?>

These are all ASCII characters.  Thus, an XML parser opens the
document, interprets the bit strings as ASCII characters up to the
first ">" symbol.  From then on, it interprets the rest of the document
using the encoding it found in the XML declaration.
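
The procedure just described can be sketched in a few lines of Python.
This is only an illustration under stated assumptions, not how any
particular parser is implemented: the names sniff_xml_encoding and
decode_xml are hypothetical, the document is assumed to be in an
ASCII-compatible encoding, and a document with no declaration is
assumed to be UTF-8.  Documents in UTF-16 or EBCDIC would need the
byte-order-mark and byte-pattern detection outlined in the XML
Recommendation's appendix on autodetection, which is omitted here.

```python
import re

# Matches encoding="..." or encoding='...' inside the XML declaration.
_DECL_ENCODING = re.compile(
    r'encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes, default: str = "utf-8") -> str:
    """Guess the encoding by reading the declaration as ASCII."""
    if not data.startswith(b"<?xml"):
        return default  # assumption: no declaration means the default
    end = data.find(b">")
    if end == -1:
        return default
    # Read only up to the first ">" and treat those bytes as ASCII.
    declaration = data[: end + 1].decode("ascii", errors="replace")
    match = _DECL_ENCODING.search(declaration)
    return match.group(1) if match else default


def decode_xml(data: bytes) -> str:
    """Decode the whole document using the sniffed encoding."""
    return data.decode(sniff_xml_encoding(data))


doc = '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'.encode("latin-1")
print(sniff_xml_encoding(doc))
print(decode_xml(doc))
```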

Likewise, all HTML documents must begin with a header section:

<html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

These are all ASCII characters.  Thus, an HTML parser opens the
document, interprets the bit string as ASCII characters up to the end
of the header section.  From then on, it interprets the rest of the
document using the encoding it found in the meta tag.
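
The HTML case can be sketched the same way.  Again this is a rough
illustration under assumptions rather than a description of how
browsers actually behave: MetaCharsetFinder and sniff_html_encoding are
hypothetical names, the bytes up to the end of the header section are
assumed to be readable as ASCII, and the "utf-8" fallback is an
arbitrary choice for when no meta tag is found.

```python
import re
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Records the charset from the first Content-Type meta tag seen."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("http-equiv") or "").lower() != "content-type":
            return
        # Pull "charset=..." out of the content attribute value.
        match = re.search(
            r'charset\s*=\s*["\']?([\w.:-]+)',
            attributes.get("content") or "",
            re.IGNORECASE,
        )
        if match:
            self.charset = match.group(1)


def sniff_html_encoding(data: bytes, default: str = "utf-8") -> str:
    """Read the header section as ASCII and look for a meta charset."""
    head_end = data.lower().find(b"</head>")
    prefix = data if head_end == -1 else data[:head_end]
    finder = MetaCharsetFinder()
    finder.feed(prefix.decode("ascii", errors="replace"))
    return finder.charset or default


page = (
    '<html><head><meta http-equiv="Content-Type" '
    'content="text/html; charset=ISO-8859-1" /></head>'
    "<body>caf\u00e9</body></html>"
).encode("latin-1")
print(sniff_html_encoding(page))
print(page.decode(sniff_html_encoding(page)))
```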

---------------------

Do you agree?  /Roger

Follow-Ups:
- Re: Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?
  - From: Rick Jelliffe <rjelliffe@a...>
- Re: Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?
  - From: Philippe Poulard <philippe.poulard@s...>
- Re: Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?
  - From: David Carlisle <davidc@n...>
- Re: Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?
  - From: Jonathan Robie <jonathan.robie@r...>
- Re: Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?
  - From: richard@i... (Richard Tobin)
