Re: How to specify a Processing Instruction? (better: howtocontrolencodi

From: Rick Jelliffe <ricko@a...>
To: xml-dev@l...
Date: Thu, 30 Aug 2001 14:38:45 +1000

From: "Chris Bayes" <chris@b...>
 
> P.s. your original UPS document is invalid. It is declared as 
> <?xml version="1.0"?> and yet contains "UPS ONLINER TOOLS ACCESS USER
> TERMS".
> R is invalid in a utf-8 document.

I don't understand this comment. The 8-bit code used for LATIN CAPITAL LETTER R in ASCII and ISO 8859-1 is the same single byte (0x52) in UTF-8.
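
A quick way to check this, as a minimal sketch in Python (my choice of language for illustration, not something used in the thread): encode the letter in all three encodings and compare the bytes.

```python
# Encode LATIN CAPITAL LETTER R (U+0052) in three encodings and compare.
letter = "R"
encoded = {name: letter.encode(name) for name in ("ascii", "iso-8859-1", "utf-8")}

for name, data in encoded.items():
    # Each encoding should produce the same single byte, 52 in hex.
    print(name, data.hex())
```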

But it is good to understand how things work.

1) An XML parseable text entity can be encoded in almost any encoding
(that has an IANA registered charset.)  The encoding declaration lets you
say what encoding your entity is in. (It may be stripped by a parser: you certainly
cannot rely on the data coming out in the same encoding when it is re-serialized from the DOM: that is a matter of how the software has been designed.)
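
To illustrate point 1, here is a small sketch using Python's standard `xml.etree.ElementTree` (not a DOM, and not the software discussed in this thread; the element name and content are made up). It parses an entity labelled ISO 8859-1 and re-serializes it with the library's defaults, which do not preserve the original encoding or its declaration.

```python
import xml.etree.ElementTree as ET

# A made-up entity, labelled and stored as ISO 8859-1 (0xE9 is e-acute there).
source = b'<?xml version="1.0" encoding="ISO-8859-1"?><name>caf\xe9</name>'

root = ET.fromstring(source)   # the parser decodes the bytes to Unicode
text = root.text

# Re-serialize without asking for any particular encoding.
reserialized = ET.tostring(root)
print(reserialized)
```

With this library's defaults the output is ASCII, the non-ASCII character comes back as a numeric character reference, and the original declaration is gone. Other software will make other choices.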

2) An XML parser operates in terms of Unicode characters, so it will convert
from the external encoding into some kind of Unicode. This includes treating
numeric character references as the corresponding Unicode character number.
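
A short sketch of point 2, again with Python's `xml.etree.ElementTree` as a stand-in parser (the element name and the choice of character number 174 are arbitrary): decimal and hexadecimal references to the same number both arrive in the application as the same Unicode character.

```python
import xml.etree.ElementTree as ET

# &#174; (decimal) and &#xAE; (hexadecimal) both name Unicode character 174.
root = ET.fromstring("<terms>&#174;&#xAE;</terms>")
text = root.text

# Show the character numbers the application actually sees.
print([hex(ord(ch)) for ch in text])
```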

3) Inside any software, the Unicode characters will be represented in some way.
This is typically using 8-bit variable-length encodings (e.g. UTF-8) or 16-bit
variable-length encodings (e.g. UTF-16, loosely a.k.a. "Unicode" proper or UCS-2, no flames from codeheads please).  Almost all characters in the Unicode Character Set are < 2^16 at the moment, so to most intents and purposes you can take it that a Unicode character is 16 bits. (This assumption will change, but that will not affect many people.)
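
To make "variable-length" concrete, this sketch counts how many storage units a few sample characters need in each representation. The sample characters and the helper name `storage_units` are my own picks for illustration.

```python
# Sample characters, including one above 2^16.
samples = {
    "LATIN CAPITAL LETTER R": "\u0052",
    "LATIN SMALL LETTER E WITH ACUTE": "\u00e9",
    "EURO SIGN": "\u20ac",
    "MUSICAL SYMBOL G CLEF": "\U0001d11e",  # above 2^16
}

def storage_units(ch: str) -> tuple:
    """Return (UTF-8 bytes, UTF-16 16-bit units) needed for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le")) // 2

sizes = {name: storage_units(ch) for name, ch in samples.items()}
for name, (utf8_bytes, utf16_units) in sizes.items():
    print(f"{name}: {utf8_bytes} UTF-8 byte(s), {utf16_units} UTF-16 unit(s)")
```

Only the character above 2^16 needs two 16-bit units, which is why the "a character is 16 bits" shortcut works most of the time.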

4) DOM is defined in terms of UTF-16.  Apparently COM is too. That is, lengths and
indexes count the 16-bit storage units of a character, not whole characters.

5) XPath, however, is defined in terms of full characters. For characters < 2^16 in Unicode, this is the same as the DOM's storage index.
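
The difference between points 4 and 5 can be sketched as follows. Python strings happen to count full characters, as XPath does, so the DOM-style count is emulated here by encoding to UTF-16 and counting 16-bit units. The helper name `utf16_units` and the sample string are assumptions for illustration.

```python
def utf16_units(s: str) -> int:
    """Count 16-bit storage units, which is how a DOM-style index counts."""
    return len(s.encode("utf-16-le")) // 2

# "a", then a character above 2^16 (MUSICAL SYMBOL G CLEF), then "b".
text = "a\U0001d11eb"

char_index = text.index("b")                  # position counted in full characters
unit_index = utf16_units(text[:char_index])   # position counted in 16-bit units

print(len(text), utf16_units(text))   # total length, both ways
print(char_index, unit_index)         # index of "b", both ways
```

For a string such as "abc", where every character is below 2^16, the two counts agree.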

6) If a DOM serializes an XML header that still has the original encoding declaration, but actually outputs the document in a different encoding (e.g. its default), then
the document is likely to fail when any unexpected codes appear.
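
The failure in point 6 can be reproduced with a small sketch. The document content and the helper name `parses` are made up, and Python's `xml.etree.ElementTree` stands in for whatever parser receives the data. The header claims UTF-8, but the bytes are actually written as ISO 8859-1.

```python
import xml.etree.ElementTree as ET

declaration = '<?xml version="1.0" encoding="UTF-8"?>'

def parses(data: bytes) -> bool:
    """Return True if the parser accepts the bytes as well-formed XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False

# Same header each time; only the actual byte encoding and content differ.
correct = (declaration + "<name>caf\u00e9</name>").encode("utf-8")
mislabelled = (declaration + "<name>caf\u00e9</name>").encode("iso-8859-1")
ascii_only = (declaration + "<name>cafe</name>").encode("iso-8859-1")

for label, data in [("correct", correct), ("mislabelled", mislabelled), ("ascii only", ascii_only)]:
    print(label, parses(data))
```

The mislabelled document only fails once a non-ASCII character shows up, which is why this kind of bug can stay hidden for a long time.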

7) The default encoding for XML is UTF-8 (or UTF-16, if there is a special
Byte Order Mark at the beginning of the XML entity). The default encoding
for HTML is ISO 8859-1.
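
As a rough sketch of the XML rule in point 7, the function below picks the default for an entity that carries no encoding declaration. The function name `default_xml_encoding` is hypothetical, and the check is deliberately simplified: it ignores encoding declarations, external labels such as HTTP headers, and other byte order marks.

```python
import codecs

def default_xml_encoding(data: bytes) -> str:
    """Guess the default encoding of an XML entity with no encoding declaration."""
    # A UTF-16 Byte Order Mark (either byte order) signals UTF-16.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # Otherwise the default is UTF-8.
    return "utf-8"

plain = b"<a/>"                    # no BOM
with_bom = "<a/>".encode("utf-16")  # Python's "utf-16" codec writes a BOM

for data in (plain, with_bom):
    encoding = default_xml_encoding(data)
    print(encoding, data.decode(encoding))
```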

8) The idea is that the only way systems that have multiple encodings and different
defaults can work together is
   a) by making data carry around explicit labels so that there is no guesswork, and
   b) by all of us moving to UTF-* sooner or later, since that is what modern systems use internally anyway (Java, Microsoft)
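
One way to follow both parts of point 8 when saving, sketched with Python's `xml.etree.ElementTree` (the `xml_declaration` argument to `tostring` needs Python 3.8 or later; the element name and content are made up): ask for UTF-8 explicitly and have the declaration written out, so the label and the bytes cannot disagree.

```python
import xml.etree.ElementTree as ET

root = ET.Element("name")
root.text = "caf\u00e9"

# Request the encoding explicitly and write the matching declaration.
labelled = ET.tostring(root, encoding="utf-8", xml_declaration=True)
print(labelled)

# A receiving parser needs no guesswork to read it back.
roundtrip = ET.fromstring(labelled).text
```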

Cheers
Rick Jelliffe

Follow-Ups:
- RE: How to specify a Processing Instruction? (better: howtocontrolencoding on saving)
  - From: Chris Bayes <chris@b...>

References:
- RE: How to specify a Processing Instruction? (better: howtocontrolencoding on saving)
  - From: Chris Bayes <chris@b...>
