RE: Historical I18n Note
Thanks for responding, Tony.

I have to open up the books to sort this out, given that the SGML Declaration was, for me and I suspect most SGML users, a black box. When I've had to tweak one in the past, it usually wasn't the character set descriptions. Latin liturgies sound like incantations if one isn't trained in Latin. That doesn't absolve us from trying our best to explore this aspect of the foundations of XML, and to ask whether we might improve it, given what is emerging from member requirements, by asking what in the SGML Declaration for XML might be useful.

Since it is obvious that XML has to be changed, and the debate is now over how this should be done, one might ask:

1. Should the XML SGML Declaration be real and be open to use by XML developers? Do we go forward only by building new Blueberry-capable parsers, or do we solve the problem once by using SGML facilities more deliberately? It would be prudent to apply the standard at a deeper level than the infamous rejoinder from an XML father to the SGML father, "I have my own ideas about how standards should be used." That isn't smart.
2. Should some portion of that remain closed?
3. Could some portion of it be used for requirements such as those Blueberry presents?
4. Should information about how the declaration could be improved be fed back to ISO as part of the review of SGML, so that SGML may work better with XML?

In other words, pertaining to Leigh Dodds's question as to whether XML is pulling away from SGML, it may be the case that XML as a subset now has lessons learned that ought to be folded back into SGML, to converge the international standard and the consortium specification.

Bryan states that variant concrete syntax declarations are the way to respond when a system is not based on the International Reference Version (IRV) character set defined in ISO 646, which requires alterations to the SYNTAX clause of the SGML Declaration. Three ways are provided:

1. In the SYNTAX clause of the SGML Declaration, specify a public concrete syntax (itself a variant concrete syntax).
2. Use the SWITCHES option to modify the reference concrete syntax (or another publicly declared syntax).
3. Completely redefine the SYNTAX clause.

Bryan provides an example of an alternative syntax-reference character set description for EBCDIC that changes the reference concrete syntax. This makes use of public identifiers. I am curious whether a URI-based identifier might be used if a stable external file format were provided, such as you mention, when FORMAL is set to NO in the features clause.

Also, what about the SYSTEM declarations? Using a SYSTEM declaration we see something such as Martin Bryan's example:

    SCOPE Instance <!-- indicates system can handle more than one syntax at a time -->
    SYNTAX PUBLIC "ISO 8879-1986//SYNTAX Reference//EN"
      CHANGES SWITCHES
    SYNTAX PUBLIC "ISO-1986//SYNTAX MULTICODE Basic//EN"
    SYNTAX PUBLIC "+//IBM//SYNTAX EBCDIC//EN"
      CHANGES DELIMLEN 3 SEQUENCE YES SRCNT 100 SRLEN 10

I don't want to trivialize the difficulty. On the other hand, I don't want to see a Blueberry pop up every two years and find out "oops, we need yet more of SGML or we need to reinvent SGML" or "those Han characters just aren't business requirements so...".

Len
http://www.mp3.com/LenBullard

Ekam sat.h, Vipraah bahudhaa vadanti.
Daamyata. Datta. Dayadhvam.h

-----Original Message-----
From: Tony Graham [mailto:Tony.Graham@i...]
Sent: Tuesday, July 17, 2001 10:21 AM
To: xml-dev@l...
Subject: RE: Historical I18n Note

Bullard, Claude L (Len) wrote at 16 Jul 2001 14:25:22 -0500:
> While SDATA is interesting in its own right, the more applicable
> part of the SGML Declaration is the document character set
> clause that enables a document to contain characters
> that are not defined in the document's concrete syntax.
> This uses the reserved name
>
>   CHARSET
>
> followed by one or more character set descriptions.
> Again from Martin Bryan:
>
> "Each character set description consists of a base character
> set statement followed by a described character set
> portion identifying the roles of individual characters.
>
> More than one reference (base) character set can be used
> to build up a character set description...
>
> When using the document character set clause to create
> a translation table for an incoming document it is important
> to remember that character references to reassigned codes
> will also need to be changed during translation. For example,
> if a document prepared ... is to be transferred to an
> EBCDIC-based system, an ISO 646 character reference such as
> &#34; in an entity declaration will need to be changed to
> &#125;, the EBCDIC code for a quotation mark."
>
> Ok, now, which parts of that are hard and expensive? Feel
> free to fill in details I missed.

Yes, the document character set is defined in terms of characters from one or more base character sets, but your SGML system works by mapping the characters in those base character sets to characters in the (one or more) base character sets that are referenced in the "syntax reference character set" later in the SGML Declaration.

Actually, in the syntax portion of the SGML Declaration, you assign roles to character numbers, and each character number is equated to a character in a base character set; then, in the document character set portion, you define the character numbers that can be used in your document and map them to characters in a base character set (I'm ignoring characters defined in terms of minimum literals). The whole thing works because of the correspondence of characters in the two lots of base character sets.

The interesting thing is that there was never great agreement on how to specify the base character sets.
At least one SGML parser worked with only the character sets that it could recognise from (the decimal representation of) the charset's ISO 2022 escape sequence in the charset identifier, and while OmniMark and nsgmls let you map from the charset identifier to an external file describing the character set, they each used a different file format for the external file.

So, aside from the fact that character set definitions in the SGML declaration are incantations to most people, a "novel" character set definition in an SGML declaration is not necessarily portable.

Also, the definition of numeric character references such as &#34; and &#125; has been subject to reinterpretation in recent years: numeric character references are evaluated in terms of the syntax reference character set, not the document character set, which is why you can use &#38; to represent '&' in any XML or HTML document no matter what encoding you are using.

Regards,

Tony Graham
------------------------------------------------------------------------
Tony Graham                            mailto:tony.graham@i...
Sun Microsystems Ireland Ltd           Phone: +353 1 8199708
Hamilton House, East Point Business Park, Dublin 3   x(70)19708

------------------------------------------------------------------
The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
initiative of OASIS <http://www.oasis-open.org>

The list archives are at http://lists.xml.org/archives/xml-dev/

To unsubscribe from this elist send a message with the single word
"unsubscribe" in the body to: xml-dev-request@l...
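
Tony's closing point, that a numeric character reference means the same character whatever encoding the document is stored in, can be checked with a small sketch. This is only an illustration using Python's standard xml.etree.ElementTree parser; the element name `t`, the helper `parse_in`, and the choice of three encodings are assumptions made for the example and do not come from the thread.

```python
import xml.etree.ElementTree as ET

# One logical document. Only the declared (and actual) encoding varies.
# The content uses numeric character references only: & (38), " (34), e-acute (233).
TEMPLATE = '<?xml version="1.0" encoding="{enc}"?><t>&#38;&#34;&#233;</t>'


def parse_in(encoding: str) -> str:
    """Serialise the template in `encoding`, parse the bytes, return the element text."""
    data = TEMPLATE.format(enc=encoding).encode(encoding)
    return ET.fromstring(data).text


# Hypothetical selection of encodings, chosen because the byte streams differ.
results = {enc: parse_in(enc) for enc in ("UTF-8", "ISO-8859-1", "UTF-16")}

for enc, text in results.items():
    print(enc, repr(text))
```

The byte streams differ between the three runs, yet each parse should yield the same three characters, because in XML the number in a character reference is a code point in one fixed character set rather than a code in the document's own encoding.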