Re: The Power of Groves
Steve Schafer wrote:

> I was rereading some old material on groves, and came across the
> following in a post by Eliot Kimber to comp.text.sgml (it was at the
> end of a paragraph discussing the definition of customized property
> sets for various kinds of data; the full context is available at
> http://www.oasis-open.org/cover/grovesKimber1.html):
>
> "However, there is no guarantee that the property set and grove
> mechanism is capable of expressing all aspects of any notation other
> than SGML."
>
> (Notes 440 and 442 in section A.4 of the HyTime spec say much the same
> thing.)
>
> On the face of it, this is a perfectly sensible thing to say. At the
> same time, however, it is rather disturbing, because it suggests that
> there might exist data sets for which the grove paradigm is wholly
> unsuited. I would certainly hate to expend a lot of effort building a
> grove-based data model for a data set, only to discover part way
> through that groves and property sets simply won't work for that data
> set.

The point of this statement is that we could not at the time *guarantee* that groves could express all aspects of a given notation. In fact I'm quite sure that, just as with XML, there does not exist a form of data for which a usable grove representation could not be defined. We did not have the time or skills to mathematically prove that groves could be used for everything, and I for one did not want to make an absolute claim I couldn't prove.

It is likely that a grove-based representation would not be *optimal* for many kinds of data. But that doesn't really matter, because the purpose of groves is to enable processing of data for specific purposes (addressing, linking, transforming). A grove therefore need not express all aspects of any particular notation, only those aspects that are needed by the processing for which the grove has been constructed.
Different types of processing might even use different grove representations of the same notation to suit their own specific needs. It's important to remember that a grove is an abstraction of data (or of the result of processing data), not the data itself.

Also, whether or not a grove representation is useful or appropriate depends as much on the implementation as it does on the details of groves themselves. For example, it might not seem reasonable to represent a movie as a grove where every frame is a node, but in fact a clever grove implementation could make that representation about as efficient as some more optimized format. For example, you need not preconstruct all the nodes, constructing them only when necessary. Also, as computers become faster, the cost of abstraction goes down for the same volume of data. Ten years ago streaming media had to be superoptimized just to be playable at all. Today we don't need that level of optimization (what we have been doing instead is putting more and more information into the same presentation time (MPEG movies) or doing more and more compression (MP3)).

It's also important to remember that any form of data representation, standardized or not, will be optimized for some things and non-optimized for others. Groves were explicitly optimized for representing data that is like SGML and XML. It happens that SGML and XML data is more complicated and demanding than most other kinds of data, so it's likely that anything that satisfies those requirements will ably satisfy the requirements of most types of data, certainly most types of structured data. But it's no guarantee, at least not without some mathematical proof that I am not qualified or able to provide (not being a mathematician).

> So the first question is this:
>
> 1) Does a Universal Data Abstraction exist?
>
> Note that, like a Universal Turing Machine, such an abstraction need
> not be particularly efficient or otherwise well suited to any specific
> task.
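The lazy-construction point above can be sketched in a few lines of Python. This is a hypothetical, minimal illustration, not any real grove implementation; the class names, the `frame()` accessor, and the properties are invented for the example:

```python
# Sketch of lazy grove-node construction for the "movie as a grove of
# frames" case: nodes exist conceptually, but are only materialized
# when something addresses them. All names here are invented.

class FrameNode:
    """A grove node representing one frame, with simple properties."""
    def __init__(self, movie, index):
        self.properties = {
            "class": "frame",
            "index": index,
            "timecode": index / movie.fps,  # derived on demand, not stored
        }

class MovieGrove:
    """Root node: conceptually has one child node per frame, but only
    builds a FrameNode the first time that frame is addressed."""
    def __init__(self, frame_count, fps=24):
        self.frame_count = frame_count
        self.fps = fps
        self._cache = {}  # index -> FrameNode, filled lazily

    def frame(self, index):
        if not (0 <= index < self.frame_count):
            raise IndexError(index)
        if index not in self._cache:
            self._cache[index] = FrameNode(self, index)
        return self._cache[index]

movie = MovieGrove(frame_count=24 * 60 * 90)  # a 90-minute movie
node = movie.frame(1000)  # exactly one node has been built so far
```

The abstraction still presents "every frame is a node," but the cost paid is proportional to the nodes actually addressed, which is the sense in which a clever implementation can compete with a more optimized format.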
> The only requirement is that it be universal in the sense of
> being capable of representing any conceivable data set (or at least
> any "reasonable" data set). (And no, I don't have a formal definition
> of what "reasonable" would mean in this context; all I can say is that
> the definition itself should be reasonable....) The real importance of
> a Universal Data Abstraction is that it would provide a formal basis
> for the construction of one or more Practical Data Abstractions.

First, let me stress the importance of the last sentence: that is, I think, the key motivator for things like groves. I want things like the DOM, which are extremely practical, but I want them bound to a formal, testable, standardized abstraction.

I know of two standardized, universal, implementation-independent data abstractions: groves and EXPRESS entities (ISO 10303 Part 11). Both of these standards provide a simple but complete data abstraction that is completely divorced from implementation details. For groves it's nodes with properties; for EXPRESS it's entities with attributes. Both can be used to represent any kind of data structure. These two representations have different characteristics and were designed to meet different purposes. There is currently an active preliminary work item within the ISO 10303 committee (ISO TC184/SC4) to define a formal mapping between groves and EXPRESS entities so that, for example, one can automatically provide a grove view of EXPRESS data or an EXPRESS view of groves.

XML *appears* to be a universal data abstraction, but it's not quite, because it is already specialized from nodes with properties to elements, attributes, and data characters. This is why Len's recent comment about an XML representation of VRML not working well with the DOM is not at all surprising. Of course it doesn't. The DOM reflects the data model of XML (elements, attributes, and data characters), not the data model of VRML. This is always the case for XML.
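The difference between "nodes with properties" and the more specialized XML model can be made concrete with a toy sketch. The `Node` class and all property names below are invented for illustration; this is not any standard grove or DOM API:

```python
# A bare "nodes with properties" abstraction, as described above.
# A node has a class (what kind of thing it is) and arbitrary named
# properties. Nothing here presumes elements, attributes, or characters.

class Node:
    def __init__(self, node_class, **properties):
        self.node_class = node_class        # what kind of node this is
        self.properties = dict(properties)  # named property values

# The same abstraction can carry XML-shaped data...
para = Node("element",
            gi="para",
            attributes={"id": "p1"},
            content=[Node("data-char", char="H"),
                     Node("data-char", char="i")])

# ...and VRML-shaped data, whose native model (typed nodes with typed
# fields) does not reduce naturally to elements/attributes/characters:
sphere = Node("vrml-node",
              type="Sphere",
              fields={"radius": 2.5})
```

The DOM fixes the node classes and properties to XML's (element, attribute, data character); a grove-style abstraction leaves them open, which is what makes it the more universal of the two.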
I have observed that the world desperately needs a universal data abstraction. I think that one of the reasons that XML has gotten so much attention is that it *looks like* such an abstraction (even though it's not). I also don't think it really matters what the abstraction looks like in detail--what's important is that we agree on what it is as a society. Once we have that, we can stop worrying about stupid details like how to specify the abstract model for XML or RDF or XLink or XSL or what have you: you'll just do it. It doesn't matter whether we use groves as is, or EXPRESS entities as is, or make something up that we can all agree on. What's important is that we do it and stick to it.

I think that groves are a pretty good first cut, but we could certainly improve on them. The advantage that groves have at the moment is that they are standardized and they have been implemented in a number of tools, including James Clark's Jade, HyBrick from Fujitsu, the Python grove stuff from STEP Infotek, my PHyLIS tool, TechnoTeacher's GroveMinder product, Alex Milowski's now-unavailable code he wrote before he got bought by CommerceOne, and others I'm sure. The abstraction satisfies immediate requirements well, it has at least two useful standards built around it, and it's a reasonably good base for future refinement (about to get under way with the DSSSL 2 project being led by Didier Martin).

> Assuming that the answer is "yes" (and I have no real justification
> other than optimism to believe that it is), the second question
> follows immediately:
>
> 2) Does the grove paradigm, or something similar to the grove
> paradigm, constitute a Universal Data Abstraction?

Yes, obviously.

> 3) Does there exist any "reasonable" data set for which the grove
> paradigm inherently cannot provide an adequate representation?

You'd have to define "adequate," but I don't think so. Groves obviously handle hierarchical structures quite well. Relational tables are just shallow hierarchies.
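The "relational tables are just shallow hierarchies" observation can be sketched directly. The function and property names below are invented for the example; the point is only the shape of the resulting tree (table node, row nodes, cell nodes):

```python
# Sketch: a relational table rendered as a three-level grove-like
# hierarchy, table -> rows -> cells. All names invented for illustration.

def table_to_grove(name, columns, rows):
    """Build nested nodes (plain dicts here) from column names and rows."""
    return {
        "class": "table",
        "name": name,
        "children": [
            {
                "class": "row",
                "children": [
                    {"class": "cell", "column": col, "value": val}
                    for col, val in zip(columns, row)
                ],
            }
            for row in rows
        ],
    }

grove = table_to_grove("people",
                       columns=["name", "age"],
                       rows=[("Ada", 36), ("Alan", 41)])
```

Every cell is then addressable by position (row 0, cell 1) or, with a little more machinery, by column name, which is exactly the kind of uniform addressing a grove view is meant to enable.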
Streaming media is more of a problem, but even it can be decomposed into groups of frames or data units (e.g., a movie decomposes into scenes, a scene into frames, and frames carry sound and image properties).

> When attempting to answer this third question, it is important to
> avoid getting caught up in unwarranted topological arguments. The
> topology of groves may not map onto the topology of a particular data
> set, but that does not mean that that data set is unrepresentable as a
> grove. Consider XML: An XML document consists of a linear, ordered
> list of Unicode characters, yet the XML format is quite capable of
> representing any arbitrary directed acyclic graph.

This is a very important point, and it's well worth stressing again. Any "universal" data abstraction will be suboptimal for many types of data or data structures. That's what implementations are for: getting the optimization characteristics needed by specific applications or use environments. The main purpose, in my mind, for a universal abstraction like groves is to enable reliable addressing (because you have some common basis on which to define and predict the structures of things) and to enable the creation of data access APIs that may be individually optimized for different use scenarios but that are all provably consistent, because they all reflect the same underlying data model.

> ========
>
> On a somewhat related note, I've noticed that in discussions regarding
> the Power of Groves, the arguments by the proponents seem to fall into
> two distinct groups. On the one hand, some people see groves as being
> quite universal in their applicability. On the other, some people talk
> about groves almost exclusively within the context of SGML, DSSSL
> and/or HyTime. As an outsider and relative latecomer to the party, I
> find it difficult to determine whether this dichotomy of viewpoints is
> real, or merely reflects the differences in the contexts in which the
> discussions have taken place.
> If the schism _is_ real, it would be
> helpful if those sitting on either side of the fence could add their
> thoughts regarding why the schism is there, and why the people on the
> other side are wrong. :)

I think it's largely a function of context. But it's important to remember that groves were defined as part of a larger standards framework of which SGML, DSSSL, and HyTime are the chief parts. There is a sense in which these three standards cover pretty much all of data representation and access at the abstract level (as opposed to the implementation level, where we rely on things like APIs, programming languages, communications protocols, and other building blocks of working systems). But groves certainly have general application outside the use of the DSSSL and HyTime standards. It's just that the ability to implement those standards is what has motivated most of us who have implemented groves.

Because groves can be applied to any kind of data (per the discussion above), it follows that the DSSSL and HyTime standards can be applied to any kind of data. That is, I can do generalized, consistent linking, addressing, styling, and transforming of anything I can put into a grove, which is anything. That covers almost all of what one needs to do to data in an application. This provides tremendous leverage once you have the layers of infrastructure built up.

> An example of why I am concerned by this question is given by the
> property set definition requirements in section A.4 of HyTime. The
> definition of property sets is given explicitly in terms of SGML. That
> is, a property set definition _is_ an SGML document. But it seems to
> me that if property sets have any sort of widespread applicability
> outside of SGML, then a property set definition in UML or IDL or some
> other notation would serve just as well (assuming that those other
> notations are sufficiently expressive; I'm fairly confident that UML
> is, but I'm not so sure about IDL).

I agree completely.
That is one reason we're working on rationalizing EXPRESS and groves. As part of that effort, we have created EXPRESS models for the SGML and HyTime property sets, providing an example of using a more generalized formal modeling language to specify the data models that groves reflect. You could, of course, do the same thing with UML and define a generic algorithm for going from a UML model to a grove representation of the data objects conforming to that model.

One key problem we ran into with EXPRESS (and would run into with UML) is that groves have the explicit and essential notion of name spaces (for addressing nodes by name, not for disambiguating names). EXPRESS has no formal notion of grove-style name spaces, nor does UML. You can define the appropriate constraints using population constraints (OCL in UML), but it's not nearly as convenient as in a property set definition document.

> Of course, it can be argued that _some_ notation had to be used, so
> why not SGML? My response to that is that I believe that the
> mathematical approach of starting with a few extremely basic axioms
> and building on those as required to develop a relevant "language" for
> expressing a model would be far superior, as it would allow people to
> fully visualize the construction of the property set data model (or
> "metamodel," if you prefer), without getting bogged down in arcane
> SGML jargon. After all, SGML can hardly be described as minimalist.

Again, I couldn't agree more. We have what we have largely because we were in a hurry and it was expedient (and because it's what James Clark did and, at the time, the rest of the editors didn't have anything better to offer). It's too bad that we didn't appreciate the existence or applicability of EXPRESS at the time, because if we had, we very well might have used it. But in any case, it would be easy enough to revise the spec to provide a more complete and modern formalism.
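The grove-style name space mentioned above (addressing nodes by name, not merely disambiguating names) can be sketched as follows. The class, method, and property names are invented for this illustration and are not drawn from the property set standard:

```python
# Sketch of a grove-style name space: an ordered list of child nodes
# in which each child is also directly addressable by the value of a
# designated naming property. All names here are invented.

class NamedNodeList:
    """Children kept in document order, but also reachable by name."""
    def __init__(self, name_property):
        self.name_property = name_property
        self.nodes = []      # document order preserved
        self._by_name = {}   # name -> node, for direct addressing

    def add(self, node):
        self.nodes.append(node)
        self._by_name[node[self.name_property]] = node

    def named(self, name):
        return self._by_name[name]

# E.g., the entities of a document, addressed by entity name:
entities = NamedNodeList(name_property="name")
entities.add({"name": "chap1", "system-id": "chap1.sgm"})
entities.add({"name": "chap2", "system-id": "chap2.sgm"})
chapter = entities.named("chap2")  # direct address, no positional search
```

This is the facility that EXPRESS and UML lack as a built-in notion: the constraint that a property both names a node and serves as its address within the parent has to be stated separately (e.g., via OCL), rather than being part of the model itself.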
There's no particular magic to the property set definition document except that, being in SGML/XML form, it was easy for us to process and work with.

> (An aside: I believe that a lot of the resistance to acceptance of
> SGML and HyTime has its basis in the limitation of identifiers to
> eight characters, leading to such incomprehensible abominations as
> "rflocspn" and "nmndlist." Learning a completely new body of ideas is
> hard enough without having to simultaneously learn a foreign--not to
> mention utterly unpronounceable--language.)

Almost certainly true. We felt that we had an obligation of backward compatibility with legacy SGML, which meant that we had to have names that could be used with the reference concrete syntax. I'm not sure that we could have done otherwise. It's a historical legacy, just like 512 scan lines for TV signals. In practice it probably wouldn't have caused anyone harm if we had required support for longer names.

Cheers,

Eliot