RE: [Summary] Why is Encoding Metadata (e.g. encoding="UTF-8")
Hello All,

I have been following this thread regarding XML documents and their character encodings. I still don't quite understand how to tell what the encoding of an XML document is when there is no external information to go on.

As discussed, you can specify an encoding either via HTTP headers (externally) or in the XML document itself (internally). If the HTTP headers do not indicate the encoding of the document, you must read the document (at least the first line) and figure out what the encoding is. However, how is this accomplished? If you don't know the encoding of the document to begin with, how can you read even the first line?

After reading http://www.w3.org/TR/REC-xml/#sec-guessing, it seems that instead of reading what <?xml encoding="utf-8"?> has to say, parsers simply look at the first few octets of the document and compare them to several known encodings of the text <?xml. Then they just continue to read the rest of the document.

If parsers never actually use the encoding attribute, is there any reason to have it other than for human readability? Are there any encodings that have the same encoding of <?xml but completely different encodings for other characters?

Does anyone have any further information on how exactly XML parsers auto-detect character encodings within XML documents?

Thanks,
-Tom

-----Original Message-----
From: David Carlisle [mailto:davidc@n...]
Sent: Thursday, September 20, 2007 10:03 AM
To: Costello, Roger L.
Cc: xml-dev@l...
Subject: Re: [Summary] Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?

> An XML Parser will make an initial "guess" of the encoding based upon
> the presence or absence of a Byte Order Mark (BOM). The XML parser then
> interprets the bit strings using that guess up to the first ">"
> character (the end of the XML declaration).

If the encoding isn't known in advance then (in theory) you don't know where the first > is (as you don't know how > is encoded).

> Now that it knows the "real" encoding it interprets the rest of the
> document using the encoding it found in the XML declaration.

That still makes it sound as if the encoding declaration is read using a different encoding from the rest of the document. Once an encoding has been determined, the encoding declaration line itself must be consistent with that encoding. You can't use one-byte-per-character ascii

    <?xml version="1.0" encoding="utf-16"?>

and then read the rest of the file using two (or four) bytes per character.

Suppose I have an encoding "my-encoding" that's the same as ascii except that > and < are swapped round. Then the following is a well-formed document:

    >?xml version="1.0" encoding="my-encoding"<
    >foo<hello>/foo<

The parser knows it's been handed an xml file and can tell that it's not going to parse as utf8, so there must be an xml declaration, so the first few bytes must encode "<?xml". It sees the bytes it sees, and the only encoding it knows about in which that sequence encodes "<?xml" is the "my-encoding" encoding, so it proceeds on that basis, which means it successfully finds encoding="my-encoding" and knows all is well...

David

_______________________________________________________________________

XML-DEV is a publicly archived, unmoderated list hosted by OASIS to support XML implementation and development.
List archive: http://lists.xml.org/archives/xml-dev/
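For reference, the auto-detection procedure described in the section of the XML Recommendation linked in the question can be sketched roughly as follows. This is only a minimal illustration in Python, not how any particular parser implements it. The names BOMS, PREFIXES, sniff, declared_encoding and decode_xml are made up for this example, cp037 is picked arbitrarily as one EBCDIC flavour, and the 512-byte cap on the declaration is an assumption. The point it tries to show is that the first octets only narrow things down to a family of encodings. Within the ASCII-compatible family (UTF-8, ISO 8859-x, Shift_JIS, ...) the bytes of <?xml are identical, so the encoding declaration is what tells the members apart.

```python
import re

# Byte order marks, longest first so FF FE 00 00 is not read as UTF-16LE.
BOMS = [
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xfe\xff", "utf-16-be"),
    (b"\xff\xfe", "utf-16-le"),
]

# Known byte patterns for the start of "<?xml" when there is no BOM.
PREFIXES = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
    (b"<?xm", "ascii"),              # UTF-8, ISO 8859-x, Shift_JIS, EUC, ...
    (b"\x4c\x6f\xa7\x94", "cp037"),  # EBCDIC; cp037 is an assumed flavour
]

# Matches the encoding pseudo-attribute inside an XML declaration.
DECL = re.compile(
    r'<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff(data):
    """Return (provisional codec, BOM length) from the first octets."""
    for bom, codec in BOMS:
        if data.startswith(bom):
            return codec, len(bom)
    for prefix, codec in PREFIXES:
        if data.startswith(prefix):
            return codec, 0
    return "utf-8", 0  # no BOM and no declaration: UTF-8 is the default


def declared_encoding(data):
    """Read the declaration using the provisional codec; None if absent."""
    family, bom_len = sniff(data)
    # The declaration uses only characters that every member of the
    # family encodes the same way, so a provisional decode is enough.
    head = data[bom_len:bom_len + 512].decode(family, errors="replace")
    match = DECL.match(head)
    return match.group(1) if match else None


def decode_xml(data):
    """Decode the whole document; raises LookupError for unknown names."""
    family, bom_len = sniff(data)
    if family == "ascii":
        # The sniffed bytes cannot distinguish members of this family,
        # so the declared name decides (UTF-8 if nothing is declared).
        codec = declared_encoding(data) or "utf-8"
    else:
        codec = family
    return data[bom_len:].decode(codec)
```

With this sketch, a document saved as ISO-8859-1 and one saved as UTF-8 both sniff as the "ascii" family, and only the declared name lets decode_xml interpret a byte such as 0xE9 correctly. A real parser would also check that the declared name is consistent with the sniffed family and report an error otherwise.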