Costello, Roger L. wrote:

> Typically XML and HTML documents are exchanged on the Internet using
> the HTTP protocol.

When they are, software that sends an existing XML document can use the encoding to determine how to set the MIME type. But XML documents live in many other places; they may be stored in repositories or on hard disks, for instance, where they are not accompanied by a MIME type.

Also, XML parsers generally don't have access to the MIME type. They do have access to the document.

Of course, many parsers also manage to parse XML documents that don't declare their encoding just fine, at least for the expected character sets. The prolog is not required to have an XML declaration, and the XML declaration is not required to have an encoding declaration:

[1]  document ::= prolog element Misc*
[22] prolog   ::= XMLDecl? Misc* (doctypedecl Misc*)?
[23] XMLDecl  ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'

> But that raises an intriguing question: in order to read the document
> you need to know what its encoding is, but to know what the encoding is
> you must read the document!
>

Autodetection of character encodings in XML documents is discussed in some detail here:

http://www.w3.org/TR/2006/REC-xml-20060816/#sec-guessing

> These are all ASCII characters.

The XML encoding declaration is restricted to characters taken from the ASCII repertoire specifically to make this kind of character encoding guessing easier, as discussed in the appendix referenced above.

> From then on, it interprets the rest of the document
> using the encoding it found in the XML declaration.
>

Yes.

> Likewise, all HTML documents must begin with a header section:
>
> <html>
> <head>
> <meta http-equiv="Content-Type" content="text/html;
> Charset="UTF-8" />
>

Here's a useful excerpt from the XHTML spec:

C.9. Character Encoding

Historically, the character encoding of an HTML document is either specified by a web server via the charset parameter of the HTTP Content-Type header, or via a meta element in the document itself. In an XML document, the character encoding of the document is specified on the XML declaration (e.g., <?xml version="1.0" encoding="EUC-JP"?>). In order to portably present documents with specific character encodings, the best approach is to ensure that the web server provides the correct headers. If this is not possible, a document that wants to set its character encoding explicitly must include both the XML declaration [with] an encoding declaration and a meta http-equiv statement (e.g., <meta http-equiv="Content-type" content="text/html; charset=EUC-JP" />). In XHTML-conforming user agents, the value of the encoding declaration of the XML declaration takes precedence.

Note: be aware that if a document must include the character encoding declaration in a meta http-equiv statement, that document may always be interpreted by HTTP servers and/or user agents as being of the internet media type defined in that statement. If a document is to be served as multiple media types, the HTTP server must be used to set the encoding of the document.

Hope this is helpful!

Jonathan
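
P.S. For anyone who wants to see the chicken-and-egg problem resolved in code, here is a rough Python sketch of the kind of guessing the appendix linked above describes: look at the first few bytes for a byte-order mark or for the byte pattern that "<?" would produce in a wide encoding, and only then read the (ASCII-only) encoding declaration. The function name sniff_xml_encoding and the tables are my own illustration, not anything defined by the spec. It is deliberately simplified: it skips EBCDIC, treats a byte-order mark as decisive, and falls back to UTF-8 when nothing is declared.

```python
import re

# Byte-order marks, longest first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

# What a leading "<" or "<?" looks like in wide encodings without a BOM.
_WIDE_PATTERNS = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
]

# The encoding declaration uses only ASCII characters, so one regex suffices
# for every ASCII-compatible encoding.
_DECL = re.compile(
    r"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_xml_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its first bytes (sketch)."""
    head = data[:1024]

    # 1. A byte-order mark settles the question (a simplification).
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name

    # 2. No BOM: a wide encoding shows up as NUL bytes around "<" and "?".
    for prefix, name in _WIDE_PATTERNS:
        if head.startswith(prefix):
            return name

    # 3. ASCII-compatible family: read the declaration, if there is one.
    #    latin-1 maps every byte to a character, so this decode cannot fail.
    if head.startswith(b"<?xml"):
        match = _DECL.match(head.decode("latin-1"))
        if match:
            return match.group(1).lower()

    # 4. No BOM and no declaration: assume UTF-8.
    return "utf-8"
```

Calling sniff_xml_encoding(raw) on the raw bytes gives a codec name you could then pass to raw.decode(...). A real parser does more than this, and any MIME type charset supplied by the transport would normally be consulted before falling back to guessing.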



