RE: [Summary] Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?
Very interesting! Here's the description I now have:

Assuming external information did not decide the encoding ...

An XML parser will make an initial "guess" of the encoding based upon the presence or absence of a Byte Order Mark (BOM). The XML parser then interprets the bit strings using that guess up to the first ">" character (the end of the XML declaration). Now that it knows the "real" encoding, it interprets the rest of the document using the encoding it found in the XML declaration.

Do I have it correct?

/Roger

-----Original Message-----
From: David Carlisle [mailto:davidc@n...]
Sent: Thursday, September 20, 2007 9:08 AM
To: Costello, Roger L.
Cc: xml-dev@l...
Subject: Re: [Summary] Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?

> These are all ASCII characters. Thus, an XML parser opens the
> document, interprets the bit strings as ASCII characters up to the
> first ">"

No, as was said earlier, the first few bytes of the file do not need to be read as ASCII (and must not be for several popular encodings, such as UTF-16). It's true that the characters that appear in an encoding declaration are characters that do have an ASCII encoding, but there is no requirement that the byte sequence that represents the encoding declaration uses the ASCII encoding.

> These are all ASCII characters. Thus, an XML parser opens the
> document, interprets the bit strings as ASCII characters up to the
> first ">" character. From then on, it interprets the rest of the
> document using the encoding it finds in the XML declaration.

The entire document, including the encoding declaration, is read using the same encoding.

> Algorithm for Detection of the Character Encoding when there is no
> Internal Encoding Label

That isn't the same as the algorithm given in XML.
There, if there is no external metadata or XML declaration, the file has to be in UTF-16 or UTF-8, and the BOM is optional for UTF-8, so if the file has no BOM the parser does not "give up": the file is treated as if UTF-8 is specified.

> Recommendation 3 HTTP Header: specifying the encoding in an HTTP
> header is unreliable. When exchanging XML or HTML documents using
> the HTTP protocol, don't specify the Content-Type in the HTTP
> header. This will force applications to look inside the document
> for encoding information.

is explicitly the opposite of the RFC that defines the XML MIME types, so while there are arguments on both sides I think it's dangerous to state it as such a clear recommendation. In the case of text/* MIME types (at least) I believe that the default charset is latin-1, so effectively you _can't_ omit the charset: even if you don't specify it explicitly, the receiver is supposed to act as if ISO-8859-1 is specified (which will mean that if you don't specify a charset in the MIME headers, then any UTF-8 document that has a non-ASCII character in it will be parsed as ISO-8859-1 and generate a fatal encoding error...).

David
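The detection steps discussed in this message (guess an encoding family from the first bytes, read the XML declaration using that guess, and fall back to UTF-8 when there is neither a BOM nor a declaration) might be sketched roughly as follows. This is only an illustration in Python, not the normative algorithm from the XML Recommendation. The names `sniff_family` and `detect_xml_encoding` are made up for this example, EBCDIC is left out, and external metadata such as an HTTP charset parameter is ignored.

```python
import re

# Byte order marks, longest first so UTF-32 is not mistaken for UTF-16.
_BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

# How a leading "<" or "<?" looks in the wide encodings when there is no BOM.
_NO_BOM = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
]

# Matches an XML declaration that carries an encoding pseudo-attribute.
_DECL = re.compile(
    r'''<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']'''
)


def sniff_family(data: bytes) -> tuple[str, int]:
    """Return (guessed codec, number of BOM bytes to skip)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    for pattern, name in _NO_BOM:
        if data.startswith(pattern):
            return name, 0
    # No BOM and no wide-encoding pattern: assume UTF-8, the XML default.
    return "utf-8", 0


def detect_xml_encoding(data: bytes) -> str:
    """Return the declared encoding if present, else the sniffed guess."""
    guess, skip = sniff_family(data)
    # Decode only a prefix; bytes that do not fit the guess are replaced,
    # which is harmless because the declaration itself uses ASCII characters.
    head = data[skip:skip + 512].decode(guess, errors="replace")
    match = _DECL.match(head)
    return match.group(1) if match else guess
```

For instance, `detect_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>')` returns `"ISO-8859-1"`, while `detect_xml_encoding(b"<a/>")` returns `"utf-8"`. Note that the initial guess is used to decode the declaration itself, not ASCII: a UTF-16 declaration is read as UTF-16 from the first byte.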