
Re: A FAQ question about non-Latin characters in XT output

Subject: Re: A FAQ question about non-Latin characters in XT output
From: David Carlisle <davidc@xxxxxxxxx>
Date: Wed, 13 Oct 1999 10:07:19 +0100 (BST)

> You use the phrase 'which does not directly encode
> position #x0107 '

probably, I shouldn't have:-(

> Guessing: position hex 107 in the utf-8 list of ?characters?

> What do you mean by 'encode' please?

Stepping back a bit.
The xml/unicode character set consists of the numbered characters
in the range 1 through to hex 10FFFF (with some slots disallowed,
but ignore that for now).

That is the `Universal Character Set (UCS)' 

utf8 is a particular encoding of that range (actually, as originally
defined it can encode the full UCS4 range, up to hex 7FFFFFFF, although
`only' the first 17 planes of 2^16 characters are currently in Unicode,
and only the first 2^16 characters, up to FFFF, are in Unicode 2.x).

Note that utf8 is just an `encoding' of the character number into
a sequence of one or more 8bit bytes, it does not re-order or subset the
available characters.
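
A small sketch in Python (my choice of language, nothing in the thread
depends on it) of what that means for the character under discussion.
The variable names are only for illustration.

```python
# c-acute is character number hex 107, i.e. decimal 263.
c_acute = chr(0x107)

# utf8 turns that one number into a sequence of bytes.
utf8_bytes = c_acute.encode("utf-8")
print(ord(c_acute))       # 263
print(utf8_bytes.hex())   # c487  (two bytes)

# Characters below 128 take a single byte.
print("c".encode("utf-8").hex())  # 63

# Decoding gives back the same character number: nothing was
# re-ordered or dropped along the way.
round_trip = ord(utf8_bytes.decode("utf-8"))
print(hex(round_trip))    # 0x107
```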

Now `traditional' encodings like `latin1' or `latin2' or `windows ansi'
or `microsoft code page 850' or the 8bit cyrillic encodings
are subsets of the available characters in UCS (if they are not subsets
they can not be used in XML as the underlying character set in XML is
always unicode). 

> The charset in the xml declaration I believed
> to be one of inclusion/exclusion rather than
> 'encoding'.

No, it's encoding (that's why the syntax is encoding= :-)

If you say

<?xml version="1.0" encoding="microsoft-weirdness" ?>

then the available characters and the way they are encoded as bytes
(ie effectively their order) is whatever Bill Gates says it is.
So the byte with value 255 may or may not be y-umlaut (which is what
position 255 is in latin1 and unicode). However the syntax &#255;
(and equivalently &#xFF;) _always_ refers to the unicode numbering,
not the current encoding used to decode bytes of character data.
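
To make the difference concrete, here is a hedged sketch using Python's
standard xml.etree.ElementTree. I have swapped in a real encoding,
iso-8859-2 (latin-2), and the byte value 230 instead of 255, because in
latin-2 that byte is c-acute while unicode position 230 is the ae
ligature. It assumes the bundled parser accepts an iso-8859-2
declaration, which CPython's does as far as I know.

```python
import xml.etree.ElementTree as ET

# One raw byte 0xE6 (decimal 230), then the same number written as a
# hex and as a decimal character reference.
doc = (b'<?xml version="1.0" encoding="iso-8859-2"?>'
       b'<a>\xe6&#xE6;&#230;</a>')

text = ET.fromstring(doc).text
code_points = [hex(ord(ch)) for ch in text]

# The raw byte is decoded via latin-2 and becomes unicode 0x107;
# both references use the unicode numbering and stay at 0xe6.
print(code_points)  # ['0x107', '0xe6', '0xe6']
```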


So....

If the encoding is the default utf8 encoding and an XML system wants
to output the character hex 107 (which is c-acute) then 
it can _always_ output it as either
&#x107; or &#263;
however since that is 6 or 7 bytes, if the xml declaration specifies
an encoding for character data that includes this slot then probably
the system will just do that. This is a latin-2 character so if
the encoding is specified as latin-2 then c acute can be encoded in the
single byte with value 230. If the encoding is utf8 then there will
be a two byte representation of character position 263, as shown
in the original poster's question.
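
The four possible spellings of c-acute can be compared side by side.
This is only an illustrative Python sketch; the variable names are mine.

```python
c_acute = "\u0107"  # character number hex 107, decimal 263

# Character references, valid whatever the document encoding is.
as_hex_ref = f"&#x{ord(c_acute):X};"
as_dec_ref = f"&#{ord(c_acute)};"

# Direct encodings of the character data itself.
as_latin2 = c_acute.encode("iso-8859-2")
as_utf8 = c_acute.encode("utf-8")

print(as_hex_ref, len(as_hex_ref))     # &#x107; 7
print(as_dec_ref, len(as_dec_ref))     # &#263; 6
print(list(as_latin2))                 # [230]
print(as_utf8.hex(), len(as_utf8))     # c487 2
```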

Since the request in this case was to force the system to use the
character reference form, the actual encoding for the character data
did not matter, as long as this character was _not_ part of the
encoding.

If you pick latin-1 (or ascii, or presumably a cyrillic encoding) then
in that encoding there is no encoding for c-acute, ie no encoding for
unicode #x107, so with any of these encodings the only way to get a c
acute is to use &#x107; (actually you could use c followed by a combining
acute character, but whether or not that is the same thing depends on
who you are, and what you are doing...)
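
As a rough illustration (not what XT itself does internally, which I
have not checked), Python's codecs show the same fallback: with the
"xmlcharrefreplace" error handler, any character the target encoding
cannot hold comes out as a decimal character reference. I picked
iso-8859-5 as an example cyrillic encoding. The last lines show why the
combining form may or may not count as "the same thing".

```python
import unicodedata

c_acute = "\u0107"

# Encodings with no slot for c-acute fall back to a reference.
refs = {name: c_acute.encode(name, errors="xmlcharrefreplace")
        for name in ("ascii", "iso-8859-1", "iso-8859-5")}
print(refs)  # every value is b'&#263;'

# latin-2 has a slot, so no reference is needed.
direct = c_acute.encode("iso-8859-2", errors="xmlcharrefreplace")
print(direct)  # b'\xe6'

# c followed by a combining acute accent (hex 301).
combined = "c\u0301"
print(combined == c_acute)                                # False
print(unicodedata.normalize("NFC", combined) == c_acute)  # True
```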

David


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list