Re: Weak DTDs
At 06:28 PM 10/13/97, Peter Murray-Rust wrote:
[...]
>I'd like *constructive* views on the value of DTDs in XML. [I know that the
>community has strongly held ones, so please avoid too much passion :-).
>There was a very interesting discussion a few weeks back on the aesthetics
>of DTDs - a good DTD is a thing of beauty.] I can see the following reasons
>for DTDs.
[...]
>In creating CML documents I find myself:
> (a) wanting to introduce foreign names (e.g. <DC:author>, or <MathML:EQN>)
>These could reasonably come at many places in the document
> (b) forgetting my own 'rules', e.g. order of elements within a content
>model. So I can't expect others to follow them :-)
> (c) adding new components to content models - for good reasons. There is
>no reason why an <MOLECULE> cannot contain a <FIGURE>, but I didn't think
>of that earlier. I don't want to have to think of all combinations and ask
>'is that reasonable?'.

Peter has run head-on into one of the fundamental problems with DTDs as currently defined by SGML (and XML): we want them to describe *classes* of documents when they actually describe *individual* documents (and are incapable of defining classes of documents except in very weak ways).

It was clearly the intent of the SGML designers that DTDs describe *classes* of documents (thus the term 'document type'). Unfortunately, because the DTD declarations are a property of individual documents, they cannot be used in that way except in the most draconian fashion: all documents of a type must have *exactly* the same rules (because they all share exactly the same declaration set as part of their syntactic content). Valiant attempts at making configurable declaration sets, typified by the TEI and Docbook, simply emphasize the problem: there is no useful way with DTDs alone to define flexible document classes that can be easily specialized at the document level.

Draconian rules are fine when your use scenario requires draconian policies, such as when creating military documents or documents that drive well-defined and specific processes. However, not all uses of SGML require draconian policies (e.g., the TEI). XML, in particular, is expressly designed for situations that *probably don't* require draconian policies (as evidenced by the potential lack of any DTD declarations).

In other words, there is a continuum of possible constraint policies, from 'no variation allowed' to 'anything is allowed'. Unfortunately, SGML only really supports the 'no variation' end of that spectrum and XML only really supports the 'anything is allowed' or the 'no variation' ends, with no obvious support for the middle ground, where you want some constraints but not necessarily full constraint.

Thus the frustration that Peter describes is unavoidable with DTDs alone: he has clearly defined a general document type, the CML, that needs to allow a range of specialization options. However, if the CML is defined as a set of declarations to be used directly in documents as their DTD declarations, it cannot do that, as the declarations define the *complete set* of constraints on those documents. The CML must either impose arbitrary constraints that are not necessarily appropriate for all CML documents or it must be so loose as to define no constraints beyond type names.

In short: DTDs don't define document classes. The use of parameter entities to create configurable declaration sets is a very weak way of expressing the allowed range of specialization, one that depends entirely on syntax tricks and conventions and one that cannot be reliably machine processed (it is impossible to impute meaning to the names and/or positions of parameter entities in the general case). It also cannot be used at the document level with any of the commercial SGML editors I'm familiar with (because none allow element or attribute declarations in the internal subset).

This is why something like architectures is required for the productive and large-scale use of SGML and XML: you must have a way to define true document classes with clear, machine-processible and validatable specialization constraints that don't, at the same time, impose unnecessary constraints on individual documents.

SGML architectures, as defined by the AFDR (http://www.ornl.gov/sgml/wg8/docs/n1920/html/clause-A.3.html), provide such a mechanism. An architecture is defined by the *combination* of a set of DTD declarations and accompanying documentation that together define the rules for a class of documents (the documentation is vitally important because there will always be rules and constraints that cannot be expressed through syntax, regardless of what syntax you are using to formally express constraints). As part of these rules, the range of allowed variation among documents that conform to the class can be defined, both formally in the syntax and completely in the documentation. The DTD declarations form a "meta-DTD", that is, a DTD that defines the syntactic rules for the class, not for instances. Instances will have their individual DTDs (explicit or implicit) that define their individual syntax rules.

Architectures can themselves be derived from other architectures, allowing you to form a hierarchy of document classes. By the same token, any architecture can be used as the base for a more specialized architecture. In addition, a single document or architecture can be derived from many different architectures (for example, the CML might be derived in part from some RDF architecture in order to standardize the way the CML structures metadata).

Because architectures are defined using normal DTD syntax, any existing DTD declaration set can be used as an architecture without modification (although most existing DTDs can benefit from some redesign in order to make them better architectures).

Thus, the CML, in the abstract, is clearly an architecture in the general sense: it defines the rules for a class of documents. It does (or needs to) define specialization constraints. The current definition of the CML includes a declaration set... ...Thus, the CML is an SGML architecture because the CML DTD can be used as an architectural meta-DTD (with the possible addition of a few small changes to better express its specialization constraints).

To use this architecture with documents, you need to define a mapping between the elements, attributes, and data of the document and the elements and attributes in the architectural meta-DTD. The AFDR mechanism does this with attributes and provides a natural automatic mapping mechanism so that documents that are very similar to their meta-DTDs need provide mappings only for those things that differ from the meta-DTDs (that is, those things that are specialized beyond what the architectures define).

[...]
>These are powerful conditions, but if we try to express them in DTDs,
>validation will fail. What I'd like to have is a wildcard #ANY (this has
>already been suggested) which can be used for content models something like
>the (currently illegal) XML:

The idea of a "wildcard" for content models is expressed in the AFDR by the notion of "bridging" element forms, "bridging" in the sense that they bridge between the architecture and non-architectural stuff. In the meta-DTD, a bridging form simply says "anything can go here". Thus, rather than saying the following in the document's declarations:

    <!ELEMENT MOL (#ANY,ATOMS,BONDS)*>

You would say this in the meta-DTD:

    <!ELEMENT ANY  -- Bridging form that allows anything to occur --
       - -
       (#PCDATA | ANY)*
    >

This is essentially the same as what Rick suggested, except that we're doing it in the meta-DTD, rather than the document's DTD (the document may not have a DTD).

To define the mapping from a document to a governing architecture, you declare the architecture and then define the mapping. In the AFDR as written the architecture is declared using a NOTATION declaration [several people, including myself and Peter Newcomb, have suggested alternative PI-based mechanisms for doing these declarations as XML doesn't yet provide data attributes, which the AFDR mechanism relies on--what's important is making the connection, not the precise syntax by which it is made.].

A document that is derived from the CML and takes advantage of the above might look like this:

    <!DOCTYPE CML [
    <!NOTATION CML PUBLIC "-//VSMS//DTD Chemical Markup Language Architecture//EN">
    <!ATTLIST #NOTATION CML
        ArcDTD    CDATA #FIXED "CML.meta-DTD"
        ArcBridge NAME  #FIXED "ANY"
    >
    <!NOTATION SGML PUBLIC
        "ISO 8879:1986//NOTATION Standard Generalized Markup Language//EN">
    <!ENTITY CML.meta-DTD SYSTEM "http://www.vsms.nottingham.ac.uk/vsms/cml.dtd"
        CDATA SGML
    >
    <!-- Map the local element type 'MyElement' to the CML bridging form 'ANY': -->
    <!ATTLIST MyElement
        CML NAME #FIXED "ANY"
    >
    ]>
    <CML>
    ...[normal CML stuff]
    <MOL>
    <MyElement>...</MyElement>
    <ATOMS>...</ATOMS>
    <BONDS>...</BONDS>
    </MOL>
    </CML>

By the normal rules of automatic architectural mapping, I only had to explicitly map the 'MyElement' element to something in the CML meta-DTD because everything else used the same names as in the meta-DTD. This means that I didn't need any other DTD declarations in the document in order to be able to interpret it as a CML document (that is, as a document that conforms to the general rules defined by the CML architecture).

To process it as a CML document, I can simply derive the "architectural instance" using an architecture processor like SGMLNORM:

    sgmlnorm -A CML mydoc.sgml > cmlai.sgm

The code samples Peter shows in his note could easily be used for architecture-aware processing simply by looking at the result of the architectural mapping rather than directly at element types.

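As a rough illustration of the automatic mapping rule described above, here is a minimal Python sketch. It is not the AFDR algorithm and not what SGMLNORM does internally. It assumes a well-formed XML instance in which the architecture attribute (named CML here) appears directly on the element rather than being supplied by a #FIXED default, and it assumes a hand-written set of meta-DTD form names (CML_FORMS) in place of actually loading the meta-DTD. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: a hand-listed subset of the element forms in the CML meta-DTD.
CML_FORMS = {"CML", "MOL", "ATOMS", "BONDS", "ANY"}


def architectural_form(elem, arch_attr="CML", forms=CML_FORMS):
    """Return the architectural form an element maps to, or None."""
    explicit = elem.get(arch_attr)
    if explicit is not None:
        # Explicit mapping: the architecture attribute names the form.
        return explicit if explicit in forms else None
    # Automatic mapping: an element whose name matches a form maps to it.
    return elem.tag if elem.tag in forms else None


def resolve(doc_text, arch_attr="CML", forms=CML_FORMS):
    """List (element type, architectural form) pairs in document order."""
    root = ET.fromstring(doc_text)
    return [(e.tag, architectural_form(e, arch_attr, forms)) for e in root.iter()]


SAMPLE = (
    '<CML><MOL><MyElement CML="ANY">note</MyElement>'
    "<ATOMS/><BONDS/><Extra/></MOL></CML>"
)

for tag, form in resolve(SAMPLE):
    print(tag, "->", form)
```

Only MyElement carries an explicit mapping; the other CML elements map by name, and Extra maps to nothing, so a processor working on the architectural view would ignore it.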
Resolving architectural mapping in an ad-hoc way requires about 20 lines of code if you make some reasonable assumptions about the use of the architecture (assuming you aren't prepared to do fully-general architectural processing involving actually loading the meta-DTD, which you don't usually need to do for most purposes).

To define the sort of attribute constraints Peter wants, you must still rely either on documentation that states the rules, which must then be enforced by an architecture-aware processor, or on something like the lextype facility in ISO/IEC 10744 Annex A.2. However, if you're building a processor for a specific architecture (e.g., a CML-aware processor), building in rules for specific attributes isn't a big deal and is no different than the sorts of things people do in specialized SGML processors every day. The architecture does give you a central place to put the documentation of the constraint and lets you make your implementation as generalized as you want (or have time for).

Thus we can use architectural meta-DTDs to really and truly define the syntactic rules for classes of documents and then create documents that are specialized from those classes. The specialization rules are (mostly) machine processible and enforceable (there will always be semantic rules that can't be enforced by syntax alone). Because of automatic mapping, documents derived from architectures need have no explicit declarations of their own except as needed to express specific specializations (as shown above).

Note that if you have an existing SGML document with an explicit DTD, you can make that SGML document into a DTD-less XML document simply by using the existing DTD as an architectural meta-DTD. This removes the necessity of parsing the declarations with the document any time you want to parse it, without removing the connection between the document and its syntactic and semantic constraints (thus allowing validation on demand).
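
On the attribute constraints mentioned above: a processor built for one architecture can carry such rules in a table keyed by architectural form. The Python sketch below assumes a made-up rule (a count attribute on the ATOMS form must be a non-negative integer), which is purely illustrative and not taken from the CML; it also assumes the architecture attribute is present on the element itself. It does not use the lextype facility.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule table: (architectural form, attribute name) -> pattern.
RULES = {
    ("ATOMS", "count"): re.compile(r"[0-9]+"),
}


def check_attributes(root, arch_attr="CML", rules=RULES):
    """Return messages for attribute values that violate a rule."""
    errors = []
    for elem in root.iter():
        # Use the explicit mapping if present, else the element's own name.
        form = elem.get(arch_attr, elem.tag)
        for name, value in elem.attrib.items():
            pattern = rules.get((form, name))
            if pattern is not None and not pattern.fullmatch(value):
                errors.append(f"{elem.tag}/@{name}: bad value {value!r}")
    return errors


DOC = ET.fromstring(
    '<CML><MOL><ATOMS count="3"/>'
    '<MyAtoms CML="ATOMS" count="three"/></MOL></CML>'
)

for message in check_attributes(DOC):
    print(message)
```

Because the rule is keyed by form rather than by element type, the specialized element MyAtoms is checked against the same rule as ATOMS.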
This is particularly useful when the DTD you use is large (e.g., Docbook, full TEI, etc.).

This then continues to raise the question: why have DTDs for documents at all? In fact, most documents need never have a full set of explicit declarations if they are derived from an architecture and are also well formed. The only time you'd need explicit declarations would be to define specializations or to drive non-architecture-aware authoring or validation.

But wouldn't it be cool if XML editors *were* architecture aware such that you could say "I want to create documents that conform to architecture X" and the editor would determine and enforce the specialization rules, letting you define new element types (or modify existing ones) and either warn you when you were doing something outside the architecture or prevent you from doing something outside the architecture (depending on what your local specialization policies are)? I think so. In fact, I think this is the only way you can have a useful XML editor at all [I find it interesting that the ADEPT*Editor product has had for many years a non-SGML-conforming mechanism for creating specialized element types while editing, although ADEPT does it through the use of PIs and creates documents that are really only processible in that form by ADEPT tools. But clearly they recognized a strong requirement to allow specialization of documents by authors--unfortunately, no architectural mechanism, certainly not a standardized one, existed at the time they built that facility. I wonder how difficult it would be to make ADEPT into an architecture-aware editor that provided the same specialization facilities it does now, but expressed using the AFDR syntax rather than the proprietary ADEPT syntax? Certainly the work that Paul Grosso has done to demonstrate XML editing and on-the-fly element declaration suggests it might be possible, even if it requires something of a hack in the short term.]

If you don't have an editor like this, then you are requiring the author to know the architecture's rules, which, as Peter points out, can be difficult, even when you are the creator of the rules to begin with. In other words, "DTD-less authoring" is not attractive for most people because most people create documents that need to have at least some minimal consistency with other documents.

My personal feeling is that without architectures [in the general sense, not necessarily using the AFDR mechanism, although I think the AFDR is a very good mechanism] neither SGML nor XML is really very useful--meaning that architectures are required to use SGML and XML at large scales and across wide domains. Almost all the problems people have with using SGML at large scales come not from technological limitations but from limitations in the ability of document types alone to define document classes and the inability of SGML processors to operate at the class level, rather than the document level.

Having said that, let me stress that SGML and XML are still the best thing going for creating structured documents. Obviously, we need to add the architectural mechanism to SGML and XML, not discard them in favor of something else. I think publication of ISO/IEC 10744:1997 demonstrates the desire to do this addition and, in fact, accomplished it (at least within the constraints of 8879:1986--there's lots of room for improvement to this mechanism as the syntax of SGML is improved through the SGML revision).

Cheers,

Eliot
--
<Address HyTime=bibloc>
W. Eliot Kimber, Senior Consulting SGML Engineer
Highland Consulting, a division of ISOGEN International Corp.
2200 N. Lamar St., Suite 230, Dallas, TX 95202. 214.953.0004
www.isogen.com
</Address>

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)