Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Docum
Hi Folks,

Below I describe my understanding of:

1. Why the indication of how an XML document is encoded is placed "within" the document, and
2. How an XML parser is able to parse an XML document before it even knows its encoding.

I would appreciate any comments on where I err.

------------------------------------------

It is considered best practice to embed within your document an indication of the encoding used to create the document. For example, in XML documents you put encoding information in the XML declaration:

    <?xml version="1.0" encoding="UTF-8"?>

In HTML documents you put encoding information in the header section:

    <html>
      <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

Why? Shouldn't metadata be external to a document?

Typically XML and HTML documents are exchanged on the Internet using HTTP. The HTTP Content-Type header has a charset parameter to indicate the encoding of its payload (i.e. the XML document or the HTML document), e.g.

    Content-Type: text/xml; charset="UTF-8"

Isn't the HTTP header sufficient to specify a document's encoding?

Suppose you have a big web server with lots of sites and hundreds of pages, contributed by lots of people in lots of different languages. The web server wouldn't know the encoding of each document. So it is considered best practice to specify the encoding within the document itself.

But that raises an intriguing question: in order to read the document you need to know what its encoding is, but to know what the encoding is you must read the document! Stated differently, for an XML parser to know how to interpret the bit strings in a document it must know the encoding, but to know the encoding it must read the document! We seem to have a chicken-and-egg situation.

How is this handled?

Here's how: all XML documents must begin with this XML declaration:

    <?xml version="1.0" encoding="..."?>

These are all ASCII characters.
Thus, an XML parser opens the document and interprets the bit strings as ASCII characters up to the first ">" symbol. From then on, it interprets the rest of the document using the encoding it found in the XML declaration.

Likewise, all HTML documents must begin with a header section:

    <html>
      <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

These are all ASCII characters. Thus, an HTML parser opens the document and interprets the bit strings as ASCII characters up to the end of the header section. From then on, it interprets the rest of the document using the encoding it found in the meta tag.

---------------------

Do you agree?

/Roger
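
P.S. To make the XML procedure above concrete, here is a minimal Python sketch of it. It is an illustration of the steps as I described them, not how any real parser is implemented. The function name `sniff_and_decode` and the UTF-8 fallback are my own assumptions. The sketch also assumes the document's first bytes coincide with ASCII (as they do for UTF-8 or ISO-8859-1), so a UTF-16 document would not be handled.

```python
import re
from typing import Tuple

# Matches the encoding pseudo-attribute inside an XML declaration, as bytes.
DECL_ENCODING = re.compile(rb'encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def sniff_and_decode(data: bytes, default: str = "utf-8") -> Tuple[str, str]:
    """Return (encoding, text) for an XML document given as raw bytes.

    Hypothetical helper: reads the bytes as ASCII up to the first ">",
    looks for encoding="..." there, then decodes the whole document with
    that encoding. Falls back to `default` (an assumption) if no
    declaration or no encoding pseudo-attribute is found.
    """
    encoding = default
    if data.startswith(b"<?xml"):
        end = data.find(b">")  # end of the XML declaration
        if end != -1:
            match = DECL_ENCODING.search(data[: end + 1])
            if match:
                encoding = match.group(1).decode("ascii")
    # bytes.decode raises LookupError for an encoding name Python does not know.
    return encoding, data.decode(encoding)


if __name__ == "__main__":
    doc = '<?xml version="1.0" encoding="ISO-8859-1"?><a>caf\xe9</a>'.encode("iso-8859-1")
    print(sniff_and_decode(doc))
```

The point of the sketch is only that the declaration can be located by byte comparison before any real decoding takes place. That is what breaks the chicken-and-egg cycle in the ASCII-compatible case.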