Re: [Summary] UTF-8 Question: e with acute accent should requi

  • From: Rick Jelliffe <rjelliffe@a...>
  • To: xml-dev@l...
  • Date: Mon, 01 Oct 2007 14:13:14 +1000

Internationalization experts, who need precision in order to be clear
about their meaning when discussing things, tend to use the following
terms distinctly:

 * Character repertoire: unordered bag of characters. E.g. Latin 1

 * Coded character set (CCS): ordered set of characters: one or more
repertoire mapped to numbers (usually but not always distinct numbers.)
E.g. ISO 646-US 

 * Character encoding scheme (CES): a function that gives a sequence of
bytes for a string of characters from a character set (or from multiple
character sets in the case of escaped encodings.)  E.g. UTF-8

 * Higher order protocol: e.g. XML numeric character references.
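
To make the layers concrete, here is a minimal Python sketch (not part
of the original discussion) that looks at one character, e with acute
accent, at each level. The helper name in_repertoire is hypothetical,
and testing repertoire membership by attempting an encode is only a
convenient proxy, since Python codecs bundle repertoire, CCS and CES
together.

```python
# One abstract character viewed at each layer (illustrative only).
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

def in_repertoire(c, codec):
    """Hypothetical helper: approximate repertoire membership by
    checking whether the codec can represent the character at all."""
    try:
        c.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# Repertoire: is the character in the bag?
in_latin1 = in_repertoire(ch, "latin-1")   # Latin 1 contains it
in_ascii = in_repertoire(ch, "ascii")      # ISO 646-US does not

# CCS: the number assigned to the character (Unicode code point U+00E9).
code_point = ord(ch)

# CES: the byte sequence, which depends on the encoding scheme chosen.
utf8_bytes = ch.encode("utf-8")        # two bytes
utf16_bytes = ch.encode("utf-16-be")   # two different bytes
latin1_bytes = ch.encode("latin-1")    # one byte

# Higher order protocol: an XML numeric character reference.
ncr = "&#x{:X};".format(code_point)
```

The same code point yields different byte sequences under different
encoding schemes, which is why a question about bytes is a question
about the CES rather than about the repertoire or the CCS.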

So "character" is only used either to mean 
 * the thing that is the same between a repertoire, CCS and CES, or
 * character in a particular repertoire, CCS or CES. 

Two terms that are rarely used, or used condescendingly or
pedagogically, are ASCII and ANSI (the character repertoire/set/encoding
scheme) for several reasons. Obviously for a start because "ANSI" is not
from ANSI. And also because ASCII has regional variants, so very often
it is IS646 that is meant, and so ISO646-US is used to be clear which
member of the ASCII family is meant. (In other words,
English-speaking-country people use ASCII to mean two different
concepts: 7-bit clean strings (which could be any IS646 variant) and
actual ASCII characters.)  But perhaps primarily ASCII and ANSI are
avoided because they come from a time before the three-fold distinction
above was widely accepted. Sometimes people use US-ASCII rather than ISO
646-US or IS646-US (http://en.wikipedia.org/wiki/Character_encoding is

Another term that is rarely used is plain "Character set", because no-one
knows whether you mean repertoire, CCS or CES. And so most material on
the web and even in standards that dates from before 1990 (and perhaps even
1999) is terribly confused in terminology. Originally Unicode was a 16
bit CES (UCS-2) but now it is the CCS and UTF-* are the CES, for

People interested in studying this should look at Dan Connolly's
"Charset considered harmful"
The XML encoding declaration is "encoding" not "charset" on purpose.

It probably goes without saying on this forum, but there is also "ASCII"
considered as a set of glyphs (e.g. an "ASCII font"). People who want to
get up to speed on the character issue might well start with the ISO

So what is the point of this?  That any discussions on characters other
than trivial ones do well to explicitly state whether character is being
used as a member of a repertoire, a code point in a CCS, or a byte
sequence from a CES, or whatever. Roger's question was clearly about CES
and responses in terms of repertoire and CCS, though interesting, are
surely tangential. 

So ISO 646-US (i.e. ASCII) as a repertoire is a subset of the ISO 10646
repertoire. And as a CCS it is a subset of the Unicode CCS. And as a CES
it is a subset of the UTF-8 CES.
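
The CCS and CES parts of that claim can be checked mechanically. The
following is a small Python sketch, assuming Python's "ascii" codec is
an acceptable stand-in for ISO 646-US; the variable names are only
illustrative.

```python
# All 128 characters of ISO 646-US, taken by their code numbers.
ascii_text = "".join(chr(n) for n in range(128))
ascii_bytes = ascii_text.encode("ascii")

# CCS: each character has the same number in ASCII as in Unicode.
same_numbers = all(b == ord(c) for b, c in zip(ascii_bytes, ascii_text))

# CES: the UTF-8 byte sequence is identical to the ASCII byte sequence.
same_bytes_utf8 = ascii_bytes == ascii_text.encode("utf-8")

# Contrast: this does not hold for every Unicode encoding scheme.
same_bytes_utf16 = ascii_bytes == ascii_text.encode("utf-16-be")
```

Here same_numbers and same_bytes_utf8 come out true while
same_bytes_utf16 comes out false, so the CES-level subset relation is
specific to UTF-8 and is not a property of Unicode encodings in general.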

Rick Jelliffe 

P.S. Even the three-fold repertoire/CCS/CES distinction is not really
good enough for every case. However, to get more complicated drowns us
in the sea of details rather than rescuing us. 
