Re: How to specify a Processing Instruction? (better: howtocontrolencodi
From: "Chris Bayes" <chris@b...>

> P.s. your original UPS document is invalid. It is declared as
> <?xml version="1.0"?> and yet contains "UPS ONLINER TOOLS ACCESS USER
> TERMS".
> R is invalid in a utf-8 document.

I don't understand this comment. The 8-bit code used for LATIN CAPITAL LETTER R in ASCII and ISO 8859-1 is encoded as the same single byte in UTF-8.

But it is good to understand how things work.

1) An XML parseable text entity can be encoded in almost any encoding (that has an IANA-registered charset name). The encoding declaration lets you say what encoding your entity is in. (It may be stripped by a parser: you certainly cannot rely on the data coming out in the same encoding when it is re-serialized from the DOM. That is a matter of how the software has been designed.)

2) An XML parser operates in terms of Unicode characters, so it will convert from the external encoding into some kind of Unicode. This includes treating numeric character references as the corresponding Unicode character number.

3) Inside any software, the Unicode characters will be represented in some way. This is typically using 8-bit variable-length encodings (i.e. UTF-8) or 16-bit variable-length encodings (e.g. UTF-16, loosely a.k.a. "Unicode" proper or UCS-2, no flames from codeheads please). Almost all characters in the Unicode Character Set are < 2^16 at the moment, so to most intents and purposes you can take it that a Unicode character is 16 bits. (This assumption will change, but will not affect many people.)

4) DOM is defined in terms of UTF-16 (apparently COM is too): that is, in terms of the storage units of a character.

5) XPath, however, is defined in terms of full characters. For characters < 2^16 in Unicode, this is the same as the DOM's storage index.

6) If a DOM serializes an XML header which still has the original encoding parameter, but actually outputs the document in a different encoding (e.g. its default), then the document is likely to fail when any unexpected codes appear.

7) The default encoding for XML is UTF-8 (or UTF-16, if there is a special Byte Order Mark at the beginning of the XML entity). The default encoding for HTML is ISO 8859-1.

8) The idea is that the only way systems that have multiple encodings and different defaults can work together is a) by making data carry around explicit labels so that there is no guesswork, and b) by all of us moving to UTF-* sooner or later, since that is what modern systems use internally anyway (Java, Microsoft).

Cheers
Rick Jelliffe
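
As a postscript to the opening remark and to points 4 to 6, here is a minimal Python sketch of three of the claims: that "R" is the same byte in ASCII, ISO 8859-1 and UTF-8; that a document written in one encoding but read under another label only breaks once a non-ASCII character appears; and that character counts and UTF-16 storage-unit counts agree below 2^16 but not above. It is only an illustration under stated assumptions: the helper names `utf16_units` and `survives_relabel` and the sample strings are made up for this example and are not part of any XML or DOM API.

```python
def utf16_units(text: str) -> int:
    """Count UTF-16 storage units (hypothetical helper; DOM-style length)."""
    # utf-16-le emits no Byte Order Mark, so every 2 bytes is one unit.
    return len(text.encode("utf-16-le")) // 2


def survives_relabel(text: str, actual: str, declared: str) -> bool:
    """True if bytes written in `actual` read back unchanged as `declared`.

    Hypothetical helper: models a serializer that keeps the old encoding
    label (`declared`) but really writes the bytes in `actual`.
    """
    try:
        return text.encode(actual).decode(declared) == text
    except UnicodeDecodeError:
        return False


# "R" is one and the same byte in all three encodings.
r_encodings = {"R".encode(enc) for enc in ("ascii", "iso-8859-1", "utf-8")}

# Point 6: a wrong label is harmless for pure ASCII text...
ascii_ok = survives_relabel("UPS", "iso-8859-1", "utf-8")
# ...but fails as soon as an "unexpected code" (here U+00E9) turns up.
accent_ok = survives_relabel("caf\u00e9", "iso-8859-1", "utf-8")

# Points 4 and 5: characters vs. UTF-16 storage units.
bmp_char = "\u00e9"          # below 2^16: one character, one unit
astral_char = "\U0001D11E"   # above 2^16: one character, two units

print(r_encodings, ascii_ok, accent_ok)
print(len(bmp_char), utf16_units(bmp_char))
print(len(astral_char), utf16_units(astral_char))
```

Python 3 strings are indexed by full characters, so `len()` plays the role of the XPath-style count here, while `utf16_units` stands in for the DOM-style count.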