
Re: How to specify a Processing Instruction? (better: howtocontrolencodi

  • From: Rick Jelliffe <ricko@a...>
  • To: xml-dev@l...
  • Date: Thu, 30 Aug 2001 14:38:45 +1000

From: "Chris Bayes" <chris@b...>
 
> P.s. your original UPS document is invalid. It is declared as 
> <?xml version="1.0"?> and yet contains "UPS ONLINER TOOLS ACCESS USER
> TERMS".
> R is invalid in a utf-8 document.

I don't understand this comment. The 8-bit code used for LATIN CAPITAL LETTER R in ASCII and ISO 8859-1 is the same single byte in UTF-8.
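
A quick way to check this claim is to encode the letter in all three encodings and compare the bytes. The sketch below uses Python purely for illustration (the thread does not mention it); the variable names are made up for the example.

```python
# Encode LATIN CAPITAL LETTER R (U+0052) in three encodings and compare.
encodings = ["ascii", "iso-8859-1", "utf-8"]
encoded = {name: "R".encode(name) for name in encodings}

for name, data in encoded.items():
    print(f"{name:>10}: 0x{data.hex()}")

# Every encoding yields the same single byte, 0x52.
assert set(encoded.values()) == {b"\x52"}
```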

But it is good to understand how things work.

1) An XML parseable text entity can be encoded in almost any encoding
(that has an IANA-registered charset). The encoding declaration lets you
say what encoding your entity is in. (It may be stripped by a parser: you certainly
cannot rely on the data coming out in the same encoding when it is re-serialized from the DOM: that is a matter of how the software has been designed.)

2) An XML parser operates in terms of Unicode characters, so it will convert
from the external encoding into some kind of Unicode. This includes treating
numeric character references as the corresponding Unicode character number.

3) Inside any software, the Unicode characters will be represented in some way.
This is typically using 8-bit variable-length encodings (e.g. UTF-8) or 16-bit
variable-length encodings (e.g. UTF-16, loosely a.k.a. "Unicode" proper or UCS-2, no flames from codeheads please). Almost all characters in the Unicode Character Set are < 2^16 at the moment, so to most intents and purposes you can take it that a Unicode character is 16 bits. (This assumption will change, but that will not affect many people.)

4) DOM is defined in terms of UTF-16. Apparently COM is too. That is, strings are
indexed by the 16-bit storage units of a character, not by whole characters.

5) XPath, however, is defined in terms of full characters. For characters < 2^16 in Unicode, this is the same as the DOM's storage index.

6) If a DOM serializes an XML header which still has the original encoding declaration, but actually outputs the document in a different encoding (e.g. its default), then
parsing the document is likely to fail when any unexpected codes appear.

7) The default encoding for XML is UTF-8 (or UTF-16, if there is a special
Byte Order Mark at the beginning of the XML entity). The default encoding
for HTML is ISO 8859-1.

8) The idea is that the only way systems that have multiple encodings and different
defaults can work together is
   a) by making data carry around explicit labels so that there is no guesswork, and
   b) we all move to UTF-* sooner or later, since that is what modern systems use internally anyway (Java, Microsoft)
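
Points 2, 4, 5 and 6 above can be tried out in a few lines. This is only a sketch, assuming Python and its standard xml.etree.ElementTree parser (neither comes from the thread); the sample document and the variable names are invented for the example, and other parsers may report the error in point 6 differently.

```python
import xml.etree.ElementTree as ET

# Point 2: the parser decodes the external encoding and expands numeric
# character references, so both spellings of e-acute become U+00E9.
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><a>\xe9&#233;</a>'
text = ET.fromstring(latin1_doc).text
assert text == "\u00e9\u00e9"

# Points 4 and 5: a character above 2^16 is one character (XPath's view)
# but two 16-bit storage units in UTF-16 (the DOM's view).
ch = "\U00010400"  # an arbitrary character outside the 16-bit range
characters = len(ch)                            # counts whole characters
utf16_units = len(ch.encode("utf-16-be")) // 2  # counts 16-bit units
assert (characters, utf16_units) == (1, 2)

# Point 6: the same bytes relabelled as UTF-8 are rejected, because the
# byte 0xE9 followed by '&' is not a valid UTF-8 sequence.
mislabelled = latin1_doc.replace(b"ISO-8859-1", b"UTF-8")
try:
    ET.fromstring(mislabelled)
    rejected = False
except ET.ParseError:
    rejected = True
assert rejected
```

The last step is the failure mode in point 6: the label and the actual bytes disagree, and nothing goes wrong until a non-ASCII code appears.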

Cheers
Rick Jelliffe 

