In article <B8415163A689094689542C617ECA036601E3A42B@I...> you write:

>Typically XML and HTML documents are exchanged on the Internet using
>the HTTP protocol.

If you mean "of XML documents exchanged on the internet, most are
exchanged using the HTTP protocol", this may well be true. But if you
mean "most uses of XML documents involve exchange on the internet", I
am more doubtful. Most of my XML processing is of local documents, and
having the encoding data embedded (and maintained by XML tools) is a
big advantage compared with plain text.

>Here's how: all XML documents must begin with this XML declaration:
>
>  <?xml version="1.0" encoding="..."?>

It would be more accurate to say that all XML documents must either be
encoded in UTF-8 (or UTF-16 with a byte order mark), or have that
declaration. It's also allowed for the encoding to be provided by
external means. For example, if a document is being served by HTTP
then it need not have an encoding declaration, because the HTTP header
gives the encoding. Of course, if the HTTP server gets it from a file
on disk we're back in the situation you describe.

>These are all ASCII characters. Thus, an XML parser opens the
>document, interprets the bit strings as ASCII characters up to the
>first ">" symbol.

No! The characters are all ones present in the ASCII character set,
but the declaration must be in the same encoding as the rest of the
file. A UTF-16 file has its XML declaration in UTF-16, not ASCII. What
you say is only true for ASCII supersets like UTF-8 and Latin-*.

XML parsers typically examine the first few bytes to determine the
encoding sufficiently to read the declaration; if there is a
declaration, the first two characters must be less-than, question-mark,
so this is fairly straightforward. This will be enough to decide
whether it's an ASCII superset, UTF-16 or -32 (and determine the byte
order), or even EBCDIC.

It's a well-formedness error if the encoding specified by the
declaration isn't compatible with the encoding of the declaration
itself.
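To make the byte-sniffing step concrete, here is a rough Python sketch of it. It is only an illustration under my own assumptions, not how any particular parser does it: the names `sniff_family` and `declared_encoding` are made up for this example, cp037 stands in for "EBCDIC" in general, and "ascii" is used as a stand-in for any ASCII superset.

```python
import codecs
import re

# Byte order marks. UTF-32 comes before UTF-16 because the UTF-32LE BOM
# begins with the same two bytes as the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# With no BOM, look at how "<" / "<?" / "<?xm" is laid out in the first bytes.
_PREFIXES = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
    ("<?xm".encode("cp037"), "cp037"),  # assumption: cp037 as the EBCDIC flavour
    (b"<?xm", "ascii"),                 # some ASCII superset; exact one not yet known
]

# Applied to already-decoded text, so it works for every family above.
_DECL = re.compile(
    r'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_family(data):
    """Return (codec, bom_length) guessed from the first few bytes."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return codec, len(bom)
    for prefix, codec in _PREFIXES:
        if data.startswith(prefix):
            return codec, 0
    # No BOM and no recognisable declaration: UTF-8 is the default.
    return "utf-8", 0


def declared_encoding(data):
    """Return (family, name from the encoding declaration or None)."""
    family, skip = sniff_family(data)
    # Decode just enough to read the declaration; undecodable bytes after
    # it are replaced rather than raising.
    head = data[skip:skip + 400].decode(family, errors="replace")
    match = _DECL.match(head)
    return family, (match.group(1) if match else None)
```

A real parser would then compare the two results and reject the document if the declared name is not compatible with the sniffed family, which is the error case described above.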
-- Richard
--
"Consideration shall be given to the need for as many as 32 characters
in some alphabets" - X3.4, 1963.