Re: [Summary] Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?
Not quite, I'm afraid :-)

You can also get little-endian UCS-2, little-endian UCS-4, little-endian UTF-16, and various permutations thereof, e.g.

UCS-2 LE is 00111100 00000000
UCS-4 LE is 00111100 00000000 00000000 00000000

(although I think support for UCS-4 is optional.)

In our implementation we basically take the tables contained in http://www.w3.org/TR/REC-xml/#sec-guessing and convert them into an if-else based decision tree, so that we can read a byte at a time and make successive deductions about the encoding in use. This is an implementation issue though, and grabbing the first 4 bytes is also likely to work (subject to there being 4 bytes available!).

Note also that the prolog (the bit that may contain the xml-decl - http://www.w3.org/TR/REC-xml/#NT-prolog) may consist of just white space, hence the opening character may also be a whitespace character.

HTH,

Pete.
--
=============================================
Pete Cordell
Codalogic
for XML Schema to C++ data binding visit
http://www.codalogic.com/lmx/
=============================================

----- Original Message -----
From: "Rudick, Tom" <tmrudick@m...>
To: <xml-dev@l...>
Sent: Thursday, September 20, 2007 6:05 PM
Subject: RE: [Summary] Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?

So we know that the first character in an XML document must be <, which has the ASCII value 60. So a parser will keep reading in bytes until it gets to 60.

ASCII is 00111100
UCS-2 is 00000000 00111100

So with ASCII (or UTF-8), we encounter 60 in the first byte. After that, characters will be considered to be one byte long until we read in the correct encoding attribute.

With UCS-2, read up to 60, see that it took two bytes, and now all characters are two bytes long.

Is this correct?

Thanks again,
-Tom

-----Original Message-----
From: Philippe Poulard [mailto:philippe.poulard@s...]
Sent: Thursday, September 20, 2007 12:00 PM
To: Rudick, Tom
Cc: xml-dev@l...
Subject: Re: [Summary] Why is Encoding Metadata (e.g. encoding="UTF-8") put Inside the XML Document?

Rudick, Tom wrote:
> If the HTTP headers do not indicate what the encoding of the document
> is, you must read the document (at least the first line) and figure out
> what the encoding is. However, how is this accomplished? If you don't
> know the encoding of the document to begin with, how can you read even
> the first line?
>
> After reading this http://www.w3.org/TR/REC-xml/#sec-guessing, it seems
> that instead of reading what <?xml encoding="utf-8"?> has to say,
> parsers simply look at the first few octets of the document and compare
> it to several known encodings of the text <?xml. Then, they just
> continue to read the rest of the document.

Not exactly: the first few octets will indicate whether <?xml encoding="blah-blah"?> is coded on 1, 2, or even 4 bytes (for UCS). The character set of the sequence <?xml encoding="blah-blah"?> is limited to 7-bit ASCII, which is fortunately compatible with UTF-8, ISO-8859-1 and some others, and is easily decodable if coded on 2 or 4 bytes, because the same sequence is mapped to 7-bit ASCII whatever the number of bytes (zero-extension). For example:

Bits  Encoding     Hex  Dec  Char  Binary
7     US-ASCII     41   65   A     1000001
8     8-bit ASCII  41   65   A     01000001
16    UCS-2        41   65   A     00000000 01000001
32    UCS-4        41   65   A     00000000 00000000 00000000 01000001

So the encoding (if any) can be read.

I guess some parsers have additional heuristics for successfully reading the sequence <?xml encoding="blah-blah"?>; maybe some try-catch applied over the set of charsets they know?

--
Regards,

               ///
              (. .)
 --------ooO--(_)--Ooo--------
|      Philippe Poulard       |
 -----------------------------
 http://reflex.gforge.inria.fr/
        Have the RefleX !

_______________________________________________________________________

XML-DEV is a publicly archived, unmoderated list hosted by OASIS to support XML implementation and development.
To minimize spam in the archives, you must subscribe before posting.

[Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
Or unsubscribe: xml-dev-unsubscribe@l...
subscribe: xml-dev-subscribe@l...
List archive: http://lists.xml.org/archives/xml-dev/
List Guidelines: http://www.oasis-open.org/maillists/guidelines.php
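
As a rough illustration of the "grab the first 4 bytes" approach mentioned in the thread, the following Python sketch compares the opening bytes of a document against the patterns tabulated in http://www.w3.org/TR/REC-xml/#sec-guessing. The function name guess_encoding_family and the returned labels are made up for this example. It is not the decision tree from Pete's implementation, and it only identifies an encoding family; a real parser would then go on to read the encoding declaration, if there is one.

```python
# Byte patterns from the autodetection appendix of the XML 1.0 Recommendation.
# Order matters: the 4-byte UCS-4 byte order marks must be tried before the
# 2-byte UTF-16 marks, because they share their first two bytes.
# The labels on the right are illustrative names, not official charset names.
_SIGNATURES = [
    # With a byte order mark
    (b"\x00\x00\xfe\xff", "UCS-4-BE"),
    (b"\xff\xfe\x00\x00", "UCS-4-LE"),
    (b"\x00\x00\xff\xfe", "UCS-4-2143"),   # unusual octet order
    (b"\xfe\xff\x00\x00", "UCS-4-3412"),   # unusual octet order
    (b"\xfe\xff", "UTF-16-BE"),
    (b"\xff\xfe", "UTF-16-LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    # Without a byte order mark: encodings of "<" or "<?" or "<?xm"
    (b"\x00\x00\x00\x3c", "UCS-4-BE"),
    (b"\x3c\x00\x00\x00", "UCS-4-LE"),
    (b"\x00\x00\x3c\x00", "UCS-4-2143"),
    (b"\x00\x3c\x00\x00", "UCS-4-3412"),
    (b"\x00\x3c\x00\x3f", "UTF-16-BE"),
    (b"\x3c\x00\x3f\x00", "UTF-16-LE"),
    (b"\x3c\x3f\x78\x6d", "ASCII-compatible"),  # "<?xm" in one-byte units
    (b"\x4c\x6f\xa7\x94", "EBCDIC"),
]


def guess_encoding_family(head: bytes) -> str:
    """Guess an encoding family from the first bytes of an XML document.

    Falls back to UTF-8 when nothing matches, which also covers documents
    that are shorter than 4 bytes or that start with white space.
    """
    for prefix, family in _SIGNATURES:
        if head.startswith(prefix):
            return family
    return "UTF-8"
```

For instance, guess_encoding_family('<?xml'.encode('utf-16-le')) yields 'UTF-16-LE', whereas a document that starts with white space matches none of the patterns and falls through to the UTF-8 default.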