RE: Historical I18n Note
Bullard, Claude L (Len) wrote at 16 Jul 2001 14:25:22 -0500:
> While SDATA is interesting in its own right, the more applicable
> part of the SGML Declaration is the document character set
> clause that enables a document to contain characters
> that are not defined in the document's concrete syntax.
> This uses the reserved name
>
> CHARSET
>
> followed by one or more character set descriptions. Again
> from Martin Bryan:
>
> "Each character set description consists of a base character
> set statement followed by a described character set
> portion identifying the roles of individual characters.
>
> More than one reference (base) character set can be used
> to build up a character set description...
>
> When using the document character set clause to create
> a translation table for an incoming document it is important
> to remember that character references to reassigned codes
> will also need to be changed during translation. For example,
> if a document prepared ... is to be transferred to an
> EBCDIC-based system, an ISO 646 character reference such as
> &#34; in an entity declaration will need to be changed to
> &#125;, the EBCDIC code for a quotation mark."
>
> Ok, now, which parts of that are hard and expensive? Feel
> free to fill in details I missed.

Yes, the document character set is defined in terms of characters from one or more base character sets, but your SGML system works by mapping the characters in those base character sets to characters in the (one or more) base character sets that are referenced in the "syntax reference character set" later in the SGML Declaration.

Actually, in the syntax portion of the SGML Declaration, you assign roles to character numbers, and each character number is equated to a character in a base character set. Then, in the document character set portion, you define the character numbers that can be used in your document and map them to characters in a base character set (I'm ignoring characters defined in terms of minimum literals). The whole thing works because of the correspondence of characters in the two lots of base character sets.

The interesting thing is that there was never great agreement on how to specify the base character sets. At least one SGML parser worked with only the character sets that it could recognise from (the decimal representation of) the charset's ISO 2022 escape sequence in the charset identifier, and while OmniMark and nsgmls let you map from the charset identifier to an external file describing the character set, they each used a different file format for the external file.

So, aside from the fact that character set definitions in the SGML declaration are incantations to most people, a "novel" character set definition in an SGML declaration is not necessarily portable.

Also, the definition of numeric character references such as &#34; and &#125; has been subject to reinterpretation in recent years: numeric character references are evaluated in terms of the syntax reference character set, not the document character set, which is why you can use &#38; to represent '&' in any XML or HTML document no matter what encoding you are using.

Regards,

Tony Graham
------------------------------------------------------------------------
Tony Graham                            mailto:tony.graham@i...
Sun Microsystems Ireland Ltd           Phone: +353 1 8199708
Hamilton House, East Point Business Park, Dublin 3            x(70)19708
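
As a rough illustration of the last point in the message (that a numeric character reference means the same character whatever encoding the document is stored in), here is a minimal sketch in Python. It assumes the standard library's xml.etree.ElementTree, which is not a tool mentioned in the thread; the sample document, the `TEMPLATE` name, and the `parse_in` helper are hypothetical. In XML the references are resolved against Unicode code points, so &#38; and &#34; come out the same each time even though the bytes on disk differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: it contains a non-ASCII letter (e-acute),
# whose byte representation depends on the encoding, plus numeric character
# references, whose meaning should not depend on the encoding.
TEMPLATE = (
    '<?xml version="1.0" encoding="{enc}"?>'
    '<note>caf\u00e9 &#38; &#34;th\u00e9&#34;</note>'
)


def parse_in(encoding):
    """Serialize the sample in the given encoding, then parse it back.

    Returns the raw bytes and the text content of the root element.
    """
    data = TEMPLATE.format(enc=encoding).encode(encoding)
    return data, ET.fromstring(data).text


# Parse the same logical document under three different encodings.
results = {enc: parse_in(enc) for enc in ("utf-8", "iso-8859-1", "utf-16")}

for enc, (data, text) in results.items():
    # The byte length varies with the encoding; the parsed text does not.
    print(enc, len(data), repr(text))
```

The byte strings differ between the three runs, but the parsed text is identical in each, because the references are interpreted in terms of a fixed reference character set rather than the encoding of the file.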