RE: Historical I18n Note
Bullard, Claude L (Len) wrote at 17 Jul 2001 11:18:18 -0500:
> Latin liturgies sound like incantations if one isn't trained
> for Latin. It doesn't absolve us from trying our best to

I guess I shouldn't have included that line about incantations, since my main point was that the level of support for arbitrary character sets among SGML parsers was mixed, to put it mildly.

Of course, there was neither the emphasis on nor the knowledge of multiple character sets when SGML was designed or when most of the SGML parsers were written. The SGML Declaration's character set definitions really only became useful with large character sets when it adopted the ERCS proposed by Rick Jelliffe (although I and many other people processed a lot of Chinese, Japanese, and Korean SGML on 8-bit clean SGML parsers without the SGML Declaration being any the wiser).

SGML was designed with ISO 2022 in mind -- see the definitions of MSOCHAR, MSICHAR, MSSCHAR, and FUNCHAR in the SGML Handbook -- which, in a way, would make SGML well suited for Internet protocols that use ISO 2022-based character sets, but the current interpretation of SGML normalises all of that character set switching before the characters are compared against the document character set, so the SGML Declaration deals with abstract character numbers (scalar values, in Unicode terms), not the numeric value of the bytes used to encode the characters.

> explore this aspect of the foundations of XML to inquire
> if we might improve it given what is emerging from member
> requirements in the context of asking
> what might be useful in the SGML Declaration for XML. It
> being obvious that XML has to be changed and now the debate
> is how this should be done, one might ask:
>
> 1. Should the XML SGML Declaration be real and be open
> to use by XML developers? Do we go forward only by

No. There's too much stuff that you would never change, because changing it would break XML interoperability.
When I described the SGML Declaration for XML 1.0 in my book, I covered the character set stuff and omitted the rest as not relevant to SGML systems that support Unicode.

> building new Blueberry-capable parsers, or do we
> solve the problem once using SGML facilities more
> deliberately? It would be prudent to go to a level of
> applying the standard that is deeper than the infamous
> rejoinder from an XML father to the SGML father,
> "I have my own ideas about how standards should be
> used." That isn't smart.
>
> 2. Should some portion of that remain closed?

You shouldn't use it. Since I've never understood why the SGML Declaration isn't written in SGML, I think a hypothetical SGML Declaration equivalent for XML should be written in XML. I don't think you can convince many people of the need for a new SGML Declaration for XML, and I don't think that you could convince many of those to use something that isn't itself XML.

> 3. Could some portion of it be used for requirements
> such as Blueberry presents?

You might use some of the ideas that an SGML Declaration represents, but its syntax is appalling.

> 4. Should information about how the declaration could be
> improved be fed back to ISO as part of the review of SGML
> to improve it such that it may better work with XML?

An SGML Declaration is capable of expressing the naming rules that Blueberry proposes, but it seems to me that you can't add &#x85; (NEXT LINE) as a line delimiter alongside &#10; (LINE FEED) and &#13; (CARRIAGE RETURN) because you can assign only one character to the RS (Record start character) role and one to the RE (Record end character) role, and those are currently assigned to &#10; and &#13;, respectively. You could, however, declare &#x85; as a SEPCHAR (Separator character) alongside &#9; (HORIZONTAL TABULATION) for much the same effect.

> In other words, pertaining to Leigh Dodd's question as to
> is this XML pulling away from SGML, it may be the case that
> XML as a subset now has lessons learned that ought to be
> folded back into SGML to converge the international standard
> and the consortium specification.

There's nothing particularly significant about having to change the set of characters that are allowed in names. Supporting three line separator characters when there are only two record separator character roles might be a problem, but it remains to be seen whether a majority of the people who can decide the question for XML think that having three line separator characters is necessary for XML.

> Bryan states that the variant concrete syntax declarations
> are the way to respond when a system not based on the International
> Reference Version (IRV) character set defined in ISO 646 is used
> thus requiring alterations to the SYNTAX clause of the SGML
> Declaration. Three ways are provided:
>
> 1. in the SYNTAX clause of the SGML Declaration, a public
> concrete syntax is specified (itself, a variant concrete syntax)

That just saves space in the SGML Declaration, since what you would put in the SYNTAX clause is now in an external file (or built into the SGML parser). Only the SYNTAX clause would differ between XML 1.0 and Blueberry, so you'd end up with separate SGML Declaration files that refer to separate syntax files.

> 2. Use the SWITCHES option to modify the reference concrete
> syntax (or another publicly declared syntax)

No. SWITCHES changes the role of a specific character number. For both name characters and line delimiters, Blueberry proposes adding more characters, but you can't switch in a new name character, for example, without switching out an old one.

> 3. Completely redefine the SYNTAX clause.
> Bryan provides an example of an alternative syntax-reference
> character set description for EBCDIC that changes the reference
> concrete syntax.

That's what you'd have to do.

> This makes use of public identifiers. I am curious if a
> URI based identifier might be used if a stable external
> file format were provided such as you mention if formal
> is set to NO in the features clause.

The SGML Declaration has always identified things by name, not by location (where the ISO 2022 escape sequences in CHARSET identifiers are really just an alternative name, I suppose). Also, identifiers in the SGML Declaration are currently limited to "minimum literals", which is a different set of characters to those allowed in URLs.

> Also, what about the SYSTEM declarations?

And you thought SGML Declarations weren't widely understood!

> Using a SYSTEM declaration we see something such as
> Martin Bryan's example:
>
> SCOPE Instance <!-- indicates system can handle more than one syntax at a
> time -->
>
> SYNTAX PUBLIC "ISO 8879-1986//SYNTAX Reference//EN"
> CHANGES SWITCHES
> SYNTAX PUBLIC "ISO-1986//SYNTAX MULTICODE Basic//EN"
> SYNTAX PUBLIC "+//IBM//SYNTAX EBCDIC//EN"
> CHANGES DELIMLEN 3
> SEQUENCE YES
> SRCNT 100
> SRLEN 10

If you wrote separate syntax clauses for XML 1.0 and Blueberry and gave them separate identifiers, then an XML processor that wanted to behave like an SGML parser could provide a System Declaration that stated which syntax clauses it supported. The System Declaration, even more so than the SGML Declaration, is meant for people to read, since if you don't read the System Declaration and give the SGML system something that the software can't support, the software will just choke and die.
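
To make the earlier RS/RE versus SEPCHAR point concrete, here is a minimal sketch in Python. It is only a toy model of the constraint as described in this message, not an SGML Declaration parser; the class name `FunctionCharacters` and its methods are hypothetical names made up for the illustration.

```python
from dataclasses import dataclass, field

# Character numbers discussed in this message.
TAB, LF, CR, NEL = 0x09, 0x0A, 0x0D, 0x85


@dataclass
class FunctionCharacters:
    """Toy model (an assumption, not a real API): RS and RE each hold
    exactly one character number, while separators form a set."""
    record_start: int = LF
    record_end: int = CR
    separators: set = field(default_factory=lambda: {TAB})

    def line_delimiters(self):
        # Only the two single-valued roles count as record boundaries.
        return {self.record_start, self.record_end}

    def reassign_record_end(self, number):
        # Assigning a new character to the role displaces the old one.
        self.record_end = number

    def add_separator(self, number):
        # Separators can grow without displacing anything.
        self.separators.add(number)


chars = FunctionCharacters()
chars.add_separator(NEL)
print(sorted(chars.separators))         # [9, 133]
print(sorted(chars.line_delimiters()))  # [10, 13]
```

In this model, calling `chars.reassign_record_end(NEL)` would make NEL a record end only by dropping CARRIAGE RETURN from that role, which is the limitation described above.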
Over the years, people have proposed various schemes for documenting the capabilities of XML processors that have all reminded me of SGML's System Declaration, and indicating Blueberry support or lack of it is probably best left to such an XML mechanism because there's a lot of stuff in a System Declaration that will never change for XML and that is of absolutely no interest to someone checking on Blueberry support.

> I don't want to trivialize the difficulty. On the other hand,
> I don't want to see a Blueberry pop up every two years and
> find out "oops, we need yet more of SGML or we need to
> reinvent SGML" or "those HAN characters just aren't business
> requirements so...".

Yes, you can describe post-Blueberry XML using an SGML Declaration (although you might need to fudge on &#x85;), but since there's so much stuff in an SGML Declaration that will never change for XML, I question why you'd want to add parsing SGML Declarations to all XML processors.

As John Cowan pointed out in a post a while ago, in SGML you can now refer to an SGML Declaration rather than having to include the SGML Declaration in the input stream the way that you used to. (I haven't actually seen that implemented by any SGML parser, but nor have I looked very hard.) If you really wanted to base post-Blueberry XML on a post-Blueberry SGML Declaration, then you could standardise the identifier for the post-Blueberry SGML Declaration and include the SGML Declaration reference in every post-Blueberry XML file (which would certainly be sufficient to stop XML 1.0 processors from using the file). The post-Blueberry SGML Declaration could be assumed to be built in to the XML processor (or obtainable by dereferencing the name, for systems that care to implement it that way).

...
> -----Original Message-----
> From: Tony Graham [mailto:Tony.Graham@i...]
> Sent: Tuesday, July 17, 2001 10:21 AM
> To: xml-dev@l...
> Subject: RE: Historical I18n Note
...
> Also, the definition of numeric character references such as &#34; and
> &#x7D; has been subject to reinterpretation in recent years: numeric
> character references are evaluated in terms of the syntax reference
> character set, not the document character set, which is why you can
> use &#38; to represent '&' in any XML or HTML document no matter what
> encoding you are using.

Oops, wrong. What I should have said (prompted by a post by Lars Marius Garshol on the Unicode mailing list) is that the numeric character references are to characters in the document character set, but whatever "character encoding" or "storage representation of characters" that you use is able to be mapped to whatever character representation that the SGML system cares to use that can represent every character in your document character set.

&#38;, no matter what document it appears in, refers to character number 38 in the document character set. What bit or byte value the SGML system uses internally to represent character number 38 isn't your concern, just as you don't have to worry about what internal representation your XML processor uses for characters.

Regards,

Tony Graham
------------------------------------------------------------------------
Tony Graham                            mailto:tony.graham@i...
Sun Microsystems Ireland Ltd           Phone: +353 1 8199708
Hamilton House, East Point Business Park, Dublin 3            x(70)19708
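
A small illustration of the point about `&#38;`, using XML rather than SGML: the sketch below serialises the same tiny document in three storage encodings and parses each with Python's standard `xml.etree.ElementTree`. The helper name `resolve` and the choice of encodings are assumptions made for the example; it shows only that the reference names character number 38 regardless of the bytes used to store the document.

```python
import xml.etree.ElementTree as ET


def resolve(encoding):
    """Parse a one-element document stored in `encoding` and return
    the text produced by the numeric character reference &#38;."""
    text = '<?xml version="1.0" encoding="%s"?><a>&#38;</a>' % encoding
    return ET.fromstring(text.encode(encoding)).text


for encoding in ("utf-8", "iso-8859-1", "utf-16"):
    value = resolve(encoding)
    # The byte stream differs per encoding; the character number does not.
    print(encoding, repr(value), ord(value))
```

Each iteration reports character number 38, even though the stored byte sequences differ between the three encodings.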