XML-DEV Mailing List Archive
RE: Historical I18n Note
Bullard, Claude L (Len) wrote at 18 Jul 2001 09:43:35 -0500:
> From: Tony Graham [mailto:Tony.Graham@i...]
...
> >Of course, there was neither the emphasis on nor the knowledge of
> >multiple character sets when SGML was designed or when most of the
> >SGML parsers were written.
>
> But the abstractions "in principle" reveal some foresight in design.

That's true, but, for example, the original need to specify character numbers for both uppercase and lowercase forms of any characters that you add to names (and the lack of a mechanism to specify that any of A-Z and a-z are not allowed in names) shows that the foresight only saw so far.

I think my previous statement still stands, but the fact that SGML did later change to include the ERCS that made it easier to make declarations for large, caseless character sets does show that the SGML designers followed through on the intent of the original design (even if maybe only one parser implemented ERCS).

> Again, the Declaration is the ultimate escape hatch: use wisely
> and with regard to costs. CALS systems usually had to specify
> the Declaration in effect. No one said it was simple but no one assumed
> a priori a single universal system. I think it is that assumption
> by Berners-Lee et al that drives W3C design. I think it is an
> optimistic assumption even if necessary. But assuming we don't
> need that escape hatch is beyond optimistic and into foolhardy.

I'm not doubting that the full generality of the SGML Declaration's character set definition mechanism is useful, I'm just doubting that you'll see it implemented in every XML processor.

That's not to say that everything that an SGML Declaration lets you specify is wonderful.
I used to really like being able to do concurrent markup (even if the one SGML parser that supported it didn't support it according to the standard), but anyone who uses DATATAG ("to some extent an accident of history" according to the SGML Handbook) or RANK ("a concession to application design practices in the early days of generic coding") needs to have a long, hard look at their requirements.

...
> > 2. Should some portion of that remain closed?
>
> >You shouldn't use it. Since I've never understood why the SGML
> >Declaration isn't written in SGML, I think a hypothetical SGML
> >Declaration equivalent for XML should be written in XML.
>
> It requires the reference concrete syntax.

I think that you're confusing syntax and syntax.

The reference concrete syntax is the default rules about what's a name character, the maximum length of names, the maximum number of attributes on an element, the maximum length of an attribute value, etc.

The syntax of the SGML Declaration as keywords and values separated by whitespace was a design decision by SGML's designers.

The main argument that I used to hear against using SGML markup in the SGML Declaration was that you would need an SGML parser to bootstrap an SGML parser. Right now the SGML Declaration is in a format that you have to parse with a hardwired parser. That hardwired parser has to recognise '<' and '>' in the SGML Declaration because the Declaration is delimited by them. I've never understood why the SGML Declaration isn't some really limited SGML markup format. That would require a different hardwired parser, but the SGML Declaration would have been in a form that the people who use SGML were familiar with.

> >I don't think you can convince many people of the need for a new SGML
> >Declaration for XML, and I don't think that you could convince many of
> >those to use something that isn't itself XML.
>
> Times change and so do requirements.
> Today the alternative is
> yetAnotherMagicName
> inside the file or to turn the names into syntax puree (relax the
> draconian parse). Again, one might really want to use the standard
> as intended instead of how personally interpreted. That is the
> Bad Thing About XML: privatization of public assets by consortia
> with a follow on distortion of the perception of the need for
> international standards. We aren't doing ourselves
> or our heirs any favors with that policy or practice. We can
> logically justify something based on current systems,
> but that won't make it right.
>
> >> 3. Could some portion of it be used for requirements
> >> such as Blueberry presents?
>
> >You might use some of the ideas that an SGML Declaration represents,
> >but its syntax is appalling.
>
> Please clarify: the reference concrete syntax is appalling? Why?

The reference concrete syntax is appalling because names are limited to eight characters, you can only use A-Z, a-z, '-', and '.' in names, and '_' isn't allowed in names.

However, what I was saying was appalling is the keywords and whitespace nature of the SGML declaration itself.

Quick Quiz (answers below):

1. What is the correct order of SPACE, RS, and RE in the FUNCTION portion:
   (a) It doesn't matter
   (b) RS, RE, SPACE
   (c) RE, RS, SPACE
   (d) SPACE, RE, RS
   (e) SPACE, RS, RE

2. What does "GENERAL YES" mean:
   (a) Names are case sensitive
   (b) Names are not case sensitive

3. What is the correct order of GENERAL and ENTITY in the NAMECASE portion:
   (a) It doesn't matter
   (b) GENERAL, ENTITY
   (c) ENTITY, GENERAL

4. What is the correct order of the General Delimiters in the GENERAL portion:
   (a) It doesn't matter
   (b) It does matter but the list is too long to go into here

5. What's the difference between the two uses of GENERAL in the SGML Declaration?

6. What's the difference between the two uses of CHARSET in the SGML Declaration?

7. The SYNTAX portion starts with the SYNTAX keyword. Where does it end?

FWIW, I had to look up the answers to some of my own questions, and I used to give tutorials on this stuff.

I contend that part of why the SGML Declaration is seen as so unapproachable is that its format is so unapproachable. Yes, the keywords are all eight characters or fewer because that's what's allowed by the reference concrete syntax, but the meanings of some of the YES or NO options are hard to remember, as are the rules for when things have a required order and when they don't.

For many people, the stuff in the SGML Declaration is yetAnotherMagicName inside the file.

> >> Bryan states that the variant concrete syntax declarations
> >> are the way to respond when a system not based on the International
> >> Reference Version (IRV) character set defined in ISO 646 is used
> >> thus requiring alterations to the SYNTAX clause of the SGML
> >> Declaration. Three ways are provided:
...
> >> 3. Completely redefine the SYNTAX clause. Bryan provides
> >> an example of an alternative syntax-reference character
> >> set description for EBCDIC that changes the reference
> >> concrete syntax.
>
> >That's what you'd have to do.
>
> It seems useful at the very least as the normative way to document the
> differences.

Yes, but do you want every XML processor to have to parse and act on that document?

> >> This makes use of public identifiers. I am curious if a
> >> URI based identifier might be used if a stable external
> >> file format were provided such as you mention if formal
> >> is set to NO in the features clause.
>
> >The SGML Declaration has always identified things by name, not by
> >location (where the ISO 2022 escape sequences in CHARSET identifiers
> >are really just an alternative name, I suppose). Also, identifiers in
> >the SGML declaration are currently limited to "minimum literals",
> >which is a different set of characters to those allowed in URLs.
>
> That might be worth changing. The URN is a name, so enabling
> it in the declaration should be viable.

Whether or not it's worth changing is a separate discussion to whether or not XML processors should use SGML Declarations.

...
> >Over the years, people have proposed various schemes for documenting
> >the capabilities of XML processors that have all reminded me of SGML's
> >System Declaration, and indicating Blueberry support or lack of it is
> >probably best left to such an XML mechanism because there's a lot of
> >stuff in a System Declaration that will never change for XML and that
> >is of absolutely no interest to someone checking on Blueberry support.
>
> Again, it seems best to use the standard as intended rather than
> building in system-specific flags. There will be no end of it.

The intent of the standard was that it would be followed a year or two after its publication by a companion formatting standard, that SGML documents would be shipped around as ASN.1 data streams, that styles would be associated with elements using link sets, and that you would use a FSV classification code to rate your system's conformance and its support of the four concrete syntaxes that were all that you were ever going to need.

We got DSSSL eventually, and there were a few people passionate about link sets, but some of the other intents never really influenced the way we work. Following the intent of the standard would be very lonely, I think.

I contend that the pre-1986 intent of the standard was that passing documents from my SGML system to your SGML system would be a major event and that you'd carefully examine my SGML Declaration -- especially its capacities, quantities, concrete syntax, and required features -- and carefully compare them against your System Declaration before you even thought about processing my document on your system.

That changed in time, and I doubt that many of the thousands of people who've downloaded nsgmls ever analysed its System Declaration before parsing their first document.
Nor would they have had to, since nsgmls had capacities greater than you were allowed to specify in an SGML Declaration. The System Declaration was only ever useful to a fraction of a percent of SGML users, and now you want to require it for the majority of XML users. I wish you luck.

...
> Why not? Do we change XML or change the requirement for the Blueberry
> support such that only Blueberry systems have to recognize Blueberry
> documents?

No comment. I joined this thread because you gave half the story on the SGML Declaration's character set definition and didn't mention how well or badly those character set definitions were handled by the available software.

Regards,

Tony Graham
------------------------------------------------------------------------
Tony Graham                                  mailto:tony.graham@i...
Sun Microsystems Ireland Ltd                   Phone: +353 1 8199708
Hamilton House, East Point Business Park, Dublin 3        x(70)19708

Answers:

1. (c) It does matter. The order of any FUNCHAR, SEPCHAR, MSOCHAR, MSICHAR, or MSSCHAR characters after RE, RS, and SPACE, however, doesn't matter and they all require a unique added function name in addition to their keyword.

2. (b) YES means replace lowercase letters with uppercase.

3. (b) GENERAL comes first, as in "NAMECASE GENERAL YES ENTITY NO".

4. (a) Which sometimes seems a bit odd considering how many other things in the SGML Declaration can only appear in a prescribed order. Actually, you have to have the SGMLREF keyword, but if you're changing any from the default, they can follow the keyword in any order.

5. GENERAL in the NAMECASE portion controls case folding of names (other than entity names), name tokens, number tokens, and delimiter strings. GENERAL in the DELIM portion is where you specify which character numbers (in the syntax-reference character set) are assigned to which roles. For example, '&' is typically assigned the AND and ERO roles, and '&#' is assigned the CRO role.

6. One defines the document character set and the other defines the syntax-reference character set.

7. The SYNTAX portion ends after the declaration of the quantity set, but since that can drag on a bit with no real sign that it's ended, it's simpler to consider that the SYNTAX portion ends before the FEATURES keyword.
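
As a rough illustration of the naming rules discussed in the message (the eight-character limit, the restricted name characters, and the case folding that "GENERAL YES" switches on), here is a minimal Python sketch. It is not taken from any SGML parser: the names NAMELEN_LIMIT, fold_name, and is_valid_name are hypothetical, the limits are hardwired rather than read from an SGML Declaration, and it assumes that digits are also allowed after the first character, which the message does not list.

```python
import string

# Assumed reference concrete syntax limits, hardwired for illustration only.
NAMELEN_LIMIT = 8
NAME_START = set(string.ascii_letters)
# Assumption: digits count as name characters in addition to '-' and '.'.
NAME_CHARS = NAME_START | set(string.digits) | {"-", "."}

# ASCII-only folding table, so non-ASCII characters are left untouched.
_FOLD = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


def fold_name(name: str, namecase_yes: bool) -> str:
    """Fold lowercase to uppercase when the NAMECASE setting is YES."""
    return name.translate(_FOLD) if namecase_yes else name


def is_valid_name(name: str) -> bool:
    """Check a name against the hardwired limits above."""
    return (
        0 < len(name) <= NAMELEN_LIMIT
        and name[0] in NAME_START
        and all(ch in NAME_CHARS for ch in name)
    )


if __name__ == "__main__":
    # GENERAL YES: element names are folded, so "para" and "PARA" match.
    print(fold_name("para", True) == fold_name("PARA", True))
    # ENTITY NO: entity names keep their case, so "amp" and "AMP" differ.
    print(fold_name("amp", False) == fold_name("AMP", False))
    # '_' is not a name character and nine letters is one too many.
    print(is_valid_name("my_name"), is_valid_name("paragraph"))
```

With a typical "NAMECASE GENERAL YES ENTITY NO" setting, element names would go through fold_name with True and entity names with False, which is why the two uses behave differently in the example.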