Eliot,

Like you, I'm not really wedded to the notion of parser-mediated transclusion. On the other hand, I'm not really convinced we can 100% jettison it, either, or preach that the very concept is somehow evil. It's a hack, that's all. (Frankly, hacks are what get us through the day.)

What you've said is packed, as usual, with terrific insights. I guess I just have trouble with the rhetoric. It wouldn't bother me so much if I didn't think your words are (quite deservedly) influential. If I didn't already know you so well, I might gather that it is your opinion that either:

(1) entities have no identity, or

(2) entities may have identity but it doesn't make any difference, because entities have no purpose other than content-level reuse in the context of parsing operations.

Assuming I'm right, then I'm going to guess that it's your opinion that the *only* reason why we have the "lt" (less-than) general entity is to bypass the parser's natural inclination to recognize a STAGO (start tag open character). As a purely practical matter, I must admit that I don't think I've *ever* used the "lt" general entity name for any other purpose. And, truly, that purpose is a hack, pure and simple!

BUT: there's a vital principle here, and I don't want it to be trampled and lost. The principle at work here, at least for me, is that when I'm invoking an entity by name, I'm using a defined name to refer to an abstract thing, namely that abstraction which is shared by all "less-than" characters in all character sets, fonts, encodings, and whatnot. The fact that I'm invoking the notion of "the less-than character" in the context of parsed character data is irrelevant. I might instead use the entity name "lt" as the value of an ENTITY attribute, for example, or in any of the many ways that HyTime, for example, exploited the notion of entity identity.

In SGML, every aspect of the use of names to identify things is founded on the notions inherent in DTD syntax.
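The STAGO bypass is easy to observe with any XML parser: invoking "lt" by name causes the parser, not the application, to substitute the abstract less-than character into the parsed text. A minimal sketch using Python's standard library:

```python
import xml.etree.ElementTree as ET

# "&lt;" invokes the predefined "lt" entity by name; the parser
# substitutes the less-than character, so it is never read as a
# STAGO (start tag open) delimiter.
elem = ET.fromstring("<p>a &lt; b</p>")
print(elem.text)  # -> "a < b"
```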
And DTDs can be fully or partially shared among many documents that invoke those element-type names, attribute names, and entity names, so that they are all (presumably, *cough*) invoking the same things whenever they utter the same names. And the DTDs themselves can also have "universal" names by invoking the universes (somehow) identified in PUBLIC identifiers.

So I would argue that you err in portraying entities and document types as different things. Instead, I think they are in fact best understood as different perspectives on one and the same organic whole, a single "grounding tree", if you like. Entity identity is the invisible root of the grounding tree. In my view, the names declared in document types and invoked in document instances are merely the visible, above-ground parts of it.

Now, one may claim that we don't need entity identity for that purpose, just as we don't need gold to back up the U.S. dollar. Hmmmm. But there's still identity, even there, and in the case of U.S. dollars -- even the huge majority of them that don't have individual identity -- their root-existence and root-nature is arguably testable in the form of U.S. military power. Where's the power of URIs, if there's no testable "there" there, and they don't even necessarily resolve? Where's the identity of a document type, if not in an entity of some kind that is testably somewhere and ideally has properties that are useful for testing instances that claim to be of the type?

I don't see how your explanation of DITA's approach resolves the problem. When you say:

   ...stop caring about the grammar as an artifact and care only about the set of (abstract) vocabulary modules the document says it (may) use. That is, actually declare the abstract document type in an unambiguous way and worry about validation details separately.

...you don't say how to resolve the problem, other than, implicitly, anyway, via entity identity: the identities of the DITA modules, wherever they are. Right? You just don't admit it up front in ENTITY declarations. It's just understood by everybody, more or less intuitively, I guess.

How is that better?

Steve

On 05/04/2016 01:12 PM, Eliot Kimber wrote:

These are really two different subject domains: entities (content-level reuse) and document types (defining and determining correctness of instances against some understood set of rules).

On general entities:

General entities are absolute evil. They should never be used under any circumstances. Fortunately, the practical reality of XML is that they almost never are used. I only see them in XML applications that reflect recent migration from legacy SGML systems.

The alternative is link-based reuse, that is, reuse at the application processing level, not at the serialization parser level. Or more precisely: reuse is an application concern, not a serialization concern.

Entities in SGML and XML are string macros. To the degree that string macros are useful, they have value, and in the context of DTD declarations parameter entities have obvious value and utility. Parameter entities are not evil. But in the context of content, that is, the domain of the elements themselves, string macros are a big problem, not because they aren't useful, but because people think they do something they don't, namely provide a way to do reliable reuse. The set of use cases where string macros are useful, relative to the use cases where they are actively dangerous, is so small as to make their value not at all worth the cost of their certain misuse. Even for apparently simple use cases like string value parameterization in content (e.g., product names or whatever), string macros fail because they cannot be related to specific use contexts.
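The context-blindness of a string macro is easy to demonstrate: a general entity declared in the internal subset expands to exactly the same text at every point of invocation, with no way to vary by use context. A sketch (the entity name "prod" and the document shape are invented for illustration):

```python
import xml.etree.ElementTree as ET

# A general entity is a string macro: one definition, identical
# expansion everywhere, regardless of the invoking context.
doc = """<!DOCTYPE d [<!ENTITY prod "Widget Pro">]>
<d><intro>&prod;</intro><legal>&prod;</legal></d>"""
root = ET.fromstring(doc)

# Both contexts receive the same expansion; a legal notice that
# needs, say, the full registered trade name cannot get it from
# the same macro.
print(root.find("intro").text, "/", root.find("legal").text)
```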
When you push on the requirements for reuse, you quickly realize that only application-level processing gives you the flexibility and opportunities required to properly implement reuse requirements, in particular, providing the correct resolution for a given use in a given use context.

The solution was in HyTime, namely the content reference link type, which was a link with the base semantic of use-by-reference. Because it is a link, it is handled in the application domain, not the parsing domain. This is transclusion as envisioned by Ted Nelson.

You see this in DITA through DITA's content reference facility and the map-and-topic architecture, both of which use hyperlinks to establish reuse relationships. With DITA 1.3 the addressing mechanism is sufficiently complete to satisfy most of the requirements (the only missing feature is indirection for references to elements within topics, but I defined a potential solution that does not require any architectural changes to DITA, just additional processing applied to specific specializations).

I'm not aware of any other documentation XML application that has equivalent use-by-reference features, but DITA is somewhat unusual in being driven primarily by reuse requirements, which is not the case for older specifications like DocBook, NLM/JATS, and TEI. Of course, there's no barrier to adding similar features to any application. However, there are complications and policy considerations that have to be carefully worked out, such as: what are the rules for consistency between referencing and referenced elements? DITA has one policy, but it may not be the best policy for all use cases.

On DTDs and grammars in general:

I do not say that DTDs (or grammars in general) are evil. I only say that the way people applied them was (and is) misguided, because they misunderstood (or willfully ignored, in the face of no better alternative) their limitations as a way to associate documents with their abstract document types.
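Because use-by-reference is resolved by the application after parsing, the application owns the resolution policy and can vary it per use context. A minimal sketch of a DITA-conref-like resolver, assuming a simplified `conref="#id"` addressing form (the element names are invented for illustration):

```python
import copy
import xml.etree.ElementTree as ET

def resolve_conrefs(root):
    """Replace elements bearing conref="#id" with a copy of the target.

    This is application-level transclusion: the parser has already
    finished, and the resolution policy belongs entirely to this code
    (it could just as well filter, rewrite, or refuse the reference).
    """
    targets = {el.get("id"): el for el in root.iter() if el.get("id")}
    for parent in list(root.iter()):
        for i, child in enumerate(list(parent)):
            ref = child.get("conref")
            if ref and ref.lstrip("#") in targets:
                parent[i] = copy.deepcopy(targets[ref.lstrip("#")])
    return root

doc = ET.fromstring(
    '<topic><p id="warn">Hot surface!</p>'
    '<section><p conref="#warn"/></section></topic>')
resolve_conrefs(doc)
print(ET.tostring(doc.find("section"), encoding="unicode"))
```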
Of course DTDs and grammars in general have great value as a way of imposing some order on data as it flows through its communication channels and goes through its life cycle. But grammars do not define document types.

At the time namespaces were being defined, I tried to suggest some standard way to identify abstract document types separate from any particular implementation of them: basically a formal document that says "This is what I mean by abstract document type 'X'". You give it a URI so it can be referred to unambiguously, and you can connect whatever additional governing or augmenting artifacts to it you want. By such a mechanism you could have as complete a definition of a given abstract document type as you wanted, including prose definitions as well as any number of implementing artifacts (grammars, Schematrons, validation applications, phone numbers to call for usage advice, etc.).

But of course that was too heavy for the time (or for now). Either people simply didn't need that level of definitional precision, or they used the workaround of pointing in the other direction, that is, by having specifications that say "I define what abstract document type 'X' is."

This was in the context of the problem that namespace names don't point to anything: people had the idea that namespace names told you something, but we were always clear that they did not--they were simply magic strings that used the mechanics of URIs to ensure that you have a universally unique name. But the namespace tells you nothing about the names in the space (that is, what is the set of allowed names, where are their semantics and rules defined, etc.). The namespace spec specifically says "You should not expect to find anything at the end of the namespace URI and you should not try to resolve it."

So if the namespace name is not the name of the document type, what is? I wanted there to be one because I like definitional completeness.
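The magic-string nature of namespace names is visible in any XML API: the URI travels as an opaque key glued onto the local name, and nothing is ever fetched from it. A small sketch (the URI is invented and deliberately unresolvable):

```python
import xml.etree.ElementTree as ET

# The namespace name below is never dereferenced; it only serves
# as a universally-unique qualifier on the local name "doc".
doc = ET.fromstring(
    '<d:doc xmlns:d="http://example.org/no-such-resource">text</d:doc>')
print(doc.tag)  # -> "{http://example.org/no-such-resource}doc"
```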
But in fact it's clear now that that level of completeness is either not practical or not sufficiently desired to make it worth trying to implement. So we're where we were 30 years ago: we have grammar definitions for documents, but we don't have a general way to talk about abstract document types as distinct from their implementing artifacts (grammars, validation processors, output processors, prose definitions, etc.). But experience has shown that it's not that big a deal in practice. In practice, having standards or standards-like documents is sufficient for those cases where it is important.

As for addressing the problem that the reference from a document instance to a grammar in fact tells you nothing reliable, a solution is what DITA does: stop caring about the grammar as an artifact and care only about the set of (abstract) vocabulary modules the document says it (may) use. That is, actually declare the abstract document type in an unambiguous way and worry about validation details separately.

DITA does this as follows:

1. Defines an architecture for layered vocabulary.

The DITA standard defines an invariant and mandatory set of base element types and a mechanism for the definition of new element types in terms of the base types. All conforming DITA element types and attributes MUST be based on one of the base types (directly or indirectly) and must be at least as constrained as the base type (that is, you can't relax constraints). This is DITA specialization. It ensures that all DITA documents are minimally processable in terms of the base types (or any known intermediate types). It allows for reliable interoperation and interchange of all conforming DITA documents. Because the definitional mechanism uses attributes, it is not dependent on any particular grammar feature in the way that HyTime is. Any normal XML processor (including CSS selectors) can get access to the definitional base of any element and thus do what it can with it.
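A processor's access to the definitional base via @class can be sketched in a few lines: walk the element's @class tokens from most-specialized to base and apply the first rule the processor knows. This is an illustrative sketch, not DITA Open Toolkit code; the handler table is invented:

```python
def pick_handler(class_value, handlers):
    """Return the most-specialized known handler for a DITA element.

    class_value is a DITA @class string such as
    "- topic/p mydomain/my-para "; the first token ("-" or "+")
    flags the kind of specialization and is skipped here.
    """
    tokens = class_value.split()[1:]
    for token in reversed(tokens):  # most-specialized token first
        if token in handlers:
            return handlers[token]
    return None

handlers = {"topic/p": "format-as-paragraph"}  # hypothetical rule table
print(pick_handler("- topic/p mydomain/my-para ", handlers))
# With no "mydomain/my-para" rule registered, the processor falls
# back to the base "topic/p" rule and still produces something
# reasonable.
```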
The definitional details of an element are specified on the required @class attribute, e.g. class="- topic/p mydomain/my-para ", which reflects a specialization of the base type "p" in the module "topic" by the module "mydomain" with the name "my-para". Any general DITA-aware processor can thus process "my-para" elements using the rules for "p" or, through extension, can have "mydomain/my-para" processing, which might be different. But in either case you'll get something reasonable as a result.

2. Defines a modular architecture for vocabulary such that each kind of vocabulary definition (map types, topic types, or mix-in "domains") follows a regular pattern.

There is no sense of "a" DITA DTD, only collections of modules that can be combined into document types (both in the abstract sense of "DITA document type" and in the implementation sense of "a working grammar file that governs document instances that use a given set of modules"). DITA requires that a given version in time of a module is invariant, meaning that every copy of the module should be identical to every other (basically, you never directly modify a vocabulary module's grammar implementation). Each module is given a name that should be globally unique, or at least unique within its expected scope of use. Experience has shown us that it's actually pretty easy to ensure practical uniqueness just by judicious use of name prefixes and general respect for other people's namespaces. No need to step up to full GUID-style uniqueness à la XML namespaces.

In addition to vocabulary modules, which define element types or attributes, you can have "constraint modules", which impose constraints on vocabulary defined in other modules. Constraint modules let you further constrain the vocabulary without the need to directly modify a given module's grammar definition. Again, the rule is that you can only constrain, you can't relax.

3. Defines a "DITA document type" as a unique set of modules, identified by module name.
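Identity as a set of module names reduces, in code, to plain set comparison: parse the declared names and compare, ignoring order. The @domains value syntax used here is simplified for illustration (real DITA @domains groups record specialization ancestry):

```python
import re

def declared_modules(domains_value):
    """Parse a simplified @domains value such as
    "(topic hi-d) (topic ut-d)" into the set of module names
    it declares."""
    modules = set()
    for group in re.findall(r"\(([^)]*)\)", domains_value):
        modules.update(group.split())
    return frozenset(modules)

# Same module set, different declaration order: by definition the
# same DITA document type.
a = declared_modules("(topic hi-d) (topic ut-d)")
b = declared_modules("(topic ut-d) (topic hi-d)")
print(a == b)  # -> True
```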
If two DITA documents declare the use of the same set of modules, then by definition they have the same DITA document type. This works because of rule (2): all copies of a given module must be identical, so it is sufficient to simply identify the modules. In theory one could go from the module names to some set of implementations of the modules, although I don't know of any tools that do that, because in practice most DITA documents have associated DTDs that already integrate the grammars for the modules being used. But it is possible. The DITA document type is declared on the @domains attribute, which is required on DITA root elements (maps and topics).

Note that you could have a conforming DITA vocabulary module that is only ever defined in prose. As long as documents reflected the types correctly in the @class attributes and reflected the module name in the @domains attribute, the DITA definitional requirements would be met. It would be up to tool implementors to do whatever was appropriate for your domain (which might be nothing, if your vocabulary exists only to provide distinguishing names and doesn't require any processing different from the base). Nobody would do this, *but they could*.

Thus DITA completely divorces the notion of "document type" from any implementation details of grammar, validation, or processing, with the clear implication that there had better be clear documentation of what a given vocabulary module is.

Cheers,

E.

----
Eliot Kimber, Owner
Contrext, LLC
http://contrext.com

On 5/4/16, 11:06 AM, "Steve Newcomb" <srn@c...> wrote:

Eliot,

In order to avoid potential misunderstandings, I think it might be worth clarifying your position on the following points:

(1) Resolved: the whole idea of entity identity was a mistake, is worthless, and is evil.

(2) Resolved: the whole idea of document type identity was a mistake, is worthless, and is evil.
I have deliberately made these statements extreme and obviously silly in order to dramatize the fact that, even though there are problems with SGML's and/or XML's operational approaches to them, we cannot discard these ideas altogether. The ideas themselves remain profound and necessary. They will always be needed.

The usefulness of their various operational prostheses will always be limited to certain cultural contexts. Even within their specific contexts, those prostheses will always be imperfect. They will always require occasional repair and replacement, in order that they remain available for use even as that context's notions of "entity", "document", and "identity" continue to evolve and diversify.

The operational prostheses with which these ideas were fitted at SGML's birth are things of their time. That was then, this is now, and "time makes ancient good uncouth". Their goodness in their earlier context is a matter of record; they were used, a lot, for a lot of reasons and in a lot of ways. At the time, it was not stupid or evil to make the notion of document type identity depend on the notion of entity identity, nor was it stupid or evil to make the notion of entity identity dependent on PUBLIC identifiers. And in many ways, it still isn't.

What is your proposed alternative, and why is it better?

Steve

On 05/04/2016 11:23 AM, Eliot Kimber wrote: