Re: [Summary] UTF-8 Question: e with acute accentshould requi
Internationalization experts, who need precision in order to be clear about their meaning when discussing things, tend to use the following terms distinctly:

* Character repertoire: an unordered bag of characters. E.g. the Latin 1 repertoire.
* Coded character set (CCS): an ordered set of characters: one or more repertoires mapped to numbers (usually but not always distinct numbers). E.g. ISO 646-US.
* Character encoding scheme (CES): a function that gives a sequence of bytes for a string of characters from a character set (or from multiple character sets in the case of escaped encodings). E.g. UTF-8.
* Higher-order protocol: e.g. XML numeric character references.

So "character" is only used to mean either

* the thing that is the same between a repertoire, CCS and CES, or
* a character in a particular repertoire, CCS or CES.

Two terms that are rarely used, or used only condescendingly or pedagogically, are ASCII and ANSI (as character repertoire/set/encoding scheme), for several reasons. For a start, "ANSI" is not from ANSI. Also, ASCII has regional variants, so very often it is ISO 646 that is meant, and ISO 646-US is used to be clear which member of the ASCII family is meant. (In other words, people in English-speaking countries use ASCII to mean two different concepts: 7-bit clean strings (which could be any ISO 646 variant) and actual ASCII characters.) But perhaps primarily ASCII and ANSI are avoided because they come from a time before the three-fold distinction above was widely accepted. Sometimes people use US-ASCII rather than ISO 646-US or IS646-US. (http://en.wikipedia.org/wiki/Character_encoding is good.)

Another term that is rarely used is plain "character set", because no one knows whether you mean repertoire, CCS or CES. And so most material on the web, and even in standards, from before 1990 (and perhaps even 1999) is terribly confused in terminology.
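As a rough illustration of the layers listed above, here is a minimal Python sketch that looks at the thread's "e with acute accent" at each level. The variable names are made up for this example, and Python's `str` type is assumed to stand in for the Unicode CCS; it is a sketch, not a definitive treatment.

```python
import unicodedata

ch = "\u00e9"  # the character under discussion: e with acute accent

# Repertoire level: the abstract character, identified here by its name.
name = unicodedata.name(ch)

# CCS level: the number (code point) that the Unicode CCS assigns to it.
code_point = ord(ch)

# CES level: the byte sequence that a particular encoding scheme produces.
utf8_bytes = ch.encode("utf-8")
utf16be_bytes = ch.encode("utf-16-be")
latin1_bytes = ch.encode("iso-8859-1")

# Higher-order protocol: an XML numeric character reference,
# which is itself written using only ISO 646-US characters.
xml_ncr = "&#x{:X};".format(code_point)

print(name)                 # LATIN SMALL LETTER E WITH ACUTE
print(hex(code_point))      # 0xe9
print(utf8_bytes.hex())     # c3a9
print(utf16be_bytes.hex())  # 00e9
print(latin1_bytes.hex())   # e9
print(xml_ncr)              # &#xE9;
```

The same character has one code point in the CCS but a different byte sequence under each CES, which is why it helps to say which level a question is about.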
As an example of how the terminology has shifted: originally Unicode was a 16-bit CES (UCS-2), but now it is the CCS and UTF-* are the CESs.

People interested in studying this should look at Dan Connolly's "Charset considered harmful": http://www.w3.org/MarkUp/html-spec/charset-harmful.html

The XML encoding declaration is "encoding", not "charset", on purpose.

It probably goes without saying on this forum, but there is also "ASCII" considered as a set of glyphs (e.g. an "ASCII font").

People who want to get up to speed on the character issue might well start with the ISO document http://standards.iso.org/ittf/PubliclyAvailableStandards/c027163_ISO_IEC_TR_15285_1998(E).zip

So what is the point of this? That any non-trivial discussion of characters does well to state explicitly whether "character" is being used to mean a member of a repertoire, a code point in a CCS, a byte sequence from a CES, or something else. Roger's question was clearly about the CES, and responses in terms of repertoire and CCS, though interesting, are surely tangential.

So ISO 646-US (i.e. ASCII) as a repertoire is a subset of the ISO 10646 repertoire. As a CCS it is a subset of the Unicode CCS. And as a CES it is a subset of the UTF-8 CES.

Cheers
Rick Jelliffe

P.S. Even the three-fold repertoire/CCS/CES distinction is not really good enough for every case. However, getting more complicated drowns us in a sea of details rather than rescuing us.
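
The subset claims in the message above can be checked with a short Python sketch. It assumes that Python's "ascii" codec corresponds to ISO 646-US and that `ord()` gives the Unicode code point; the variable names are hypothetical.

```python
# All 128 ISO 646-US characters, taken by code point.
ascii_chars = [chr(n) for n in range(0x80)]
ascii_text = "".join(ascii_chars)

# CCS level: each character has the same number in ISO 646-US and in Unicode.
same_numbers = list(ascii_text.encode("ascii")) == [ord(c) for c in ascii_chars]

# CES level: an ASCII-only string encodes to identical bytes in UTF-8.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# e with acute accent is outside the ISO 646-US repertoire...
e_acute = "\u00e9"
try:
    e_acute.encode("ascii")
    in_ascii = True
except UnicodeEncodeError:
    in_ascii = False

# ...and every byte of its UTF-8 form has the high bit set, so none of
# those bytes can be mistaken for an ISO 646-US character.
high_bit_only = all(b >= 0x80 for b in e_acute.encode("utf-8"))

print(same_numbers, same_bytes, in_ascii, high_bit_only)
```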