Storing Lots of Fiddly Bits (was Re: What is XML for?)
At 02:17 PM 1/29/99 -0800, Tim Bray wrote:
>At 02:49 PM 1/29/99 -0600, Paul Prescod wrote:
>>The data structures observed in XML are "annotated tree with second-class
>>links." This can be used to model "annotated directed graph" and even just
>>"annotated graph" if you pretend that the links are first-class.
>>"Annotated graphs" are the basic structures used by object databases. So
>>you seem to be saying that it would be really nice if there were
>>high-performance object databases.
[...]
>What was really worrying me was what I thought was an assertion that
>a repository that directly models XML document structures on a large
>scale wasn't interesting; I think it is. -T.

But aren't XML document structures just one instance of a more general class of data that is composed of lots of little fiddly bits organized into complex hierarchies and graphs? Other instances include vector graphics, descriptions of power plants, models of human enterprises, etc. Doesn't this suggest that rather than trying to use XML's abstract data model as a base for modeling other kinds of data, we should develop a more general data model, develop supporting technologies around it, and then apply that to XML? If the result can handle XML at scale, it should be able to handle the rest as well.

Thus, I agree with Tim, but probably for a different reason (notwithstanding that my main job is explicitly to help people handle documents, and not power plants).

In other words, there seems to be a bit of poor reasoning at work in a lot of quarters that goes like this [not that I'm accusing Tim of this--he just provided a convenient segue for the following rant]:

1. I have data that doesn't fit into a relational database.
2. XML lets me represent this data using an easy-to-see and easy-to-create-and-use data format.
3. I can use "XML tools" to manage this data and it will be cheap and effective.

Unfortunately, the jump from 2 to 3 is not justified.
That's because at point 2 you are working in the *syntactic domain*. XML works very well for serializing complex data structures because its hierarchy provides rich organizational facilities and its robust definition helps ensure transmission fidelity. But the XML data model, that is, the abstract model for the *serialization*, is not the same as the abstract model of the data being serialized. That is, the representation is not the thing. Thus, when you move from the syntactic domain back to the abstract domain, the abstraction you get is not the abstraction you started with--it's the abstraction of an XML document serialization of the abstraction you started with. There's another step you have to perform before you get back your original abstraction, which is to translate the serialization back into the original abstraction.

For example, I start with the following abstract model (using EXPRESS syntax just because I happen to know it and it doesn't get out much, so why not--also I've been working on the XML serialization grammar for EXPRESS and EXPRESS-driven data, so it's fresh in my mind):

TYPE gender = ENUMERATION OF (male, female, unknown);
END_TYPE;

ENTITY person SUBTYPE OF (being);
  name     : STRING;
  sex      : gender;
  employer : OPTIONAL enterprise;
END_ENTITY;

ENTITY enterprise;
  name    : STRING;
  address : STRING;
END_ENTITY;

Now I create some instance data (using Lisp syntax to represent the in-memory abstractions):

(person (oid 1)
        (name "Eliot")
        (sex male)
        (employer (oid-ref 2)))
(enterprise (oid 2)
            (name "ISOGEN International Corp")
            (address "Dallas, TX")
            (derived::employs (oid-ref 1)))

Here is one possible (of an infinite number of possible) XML serializations:

<?xml version="1.0"?>
<data-serialization>
 <schema-ref>business objects schema</schema-ref>
 <data-instances>
  <entity-instance id="i0000">
   <types>
    <type>person</type>
    <attributes>
     <attribute>
      <attname>name</attname>
      <attvalue>Eliot</attvalue>
     </attribute>
     <attribute>
      <attname>sex</attname>
      <attvalue>male</attvalue>
     </attribute>
     <attribute>
      <attname>employer</attname>
      <attvalue><entity-ref>i0001</entity-ref></attvalue>
     </attribute>
    </attributes>
   </types>
  </entity-instance>
  <entity-instance id="i0001">
   <types>
    <type>enterprise</type>
    <attributes>
     <attribute>
      <attname>name</attname>
      <attvalue>ISOGEN International Corp.</attvalue>
     </attribute>
     <attribute>
      <attname>address</attname>
      <attvalue>Dallas, TX</attvalue>
     </attribute>
    </attributes>
   </types>
  </entity-instance>
 </data-instances>
</data-serialization>

If you now parse this document into an abstraction conforming to the DOM, SGML Property Set, or similar rational abstract data model for XML documents, you'll get something like this:

(xml-document
 (prolog (pi xml version="1.0")
         (doctype-decl))
 (document-element
  (gi data-serialization)
  (content
   (element (gi schema-ref)
            (content (literal "business objects schema")))
   (element (gi data-instances)
            (content ...)))))

You get the idea--clearly the in-memory abstraction of the document bears no direct relationship to the in-memory abstraction of the original data. Even if you do an early-bound abstraction where you take the element types as node types, you still get something that is not the original abstraction:

(xml-document
 (prolog (pi xml version="1.0")
         (doctype-decl))
 (data-serialization
  (schema-ref (literal "business objects schema"))
  (data-instances
   (entity-instance
    (types "person")
    (attributes
     (attribute (attname "name") (attval "Eliot"))
     ...)))))

Even in this early-bound form, the abstraction still reflects the structure of the serialization, not the original abstraction. To get the original abstraction back, I have to apply the reverse of the original serialization algorithm. I might do this literally or I might do it by providing a set of query functions over my document that does it (e.g., translating the query "select person where name is 'Eliot'" into a more complex XML-specific query defined in terms of the semantics and structure of the serialization).
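To make the "reverse of the serialization algorithm" concrete, here is a minimal sketch in Python. The element names (entity-instance, attname, attvalue, entity-ref) come from the example serialization above; the Instance class and the two-pass structure are invented for illustration, not part of any standard mapping:

```python
# Sketch: parse the generic XML serialization back into typed in-memory
# objects, i.e. recover the original abstraction from the document abstraction.
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0"?>
<data-serialization>
 <schema-ref>business objects schema</schema-ref>
 <data-instances>
  <entity-instance id="i0000">
   <types><type>person</type>
    <attributes>
     <attribute><attname>name</attname><attvalue>Eliot</attvalue></attribute>
     <attribute><attname>sex</attname><attvalue>male</attvalue></attribute>
     <attribute><attname>employer</attname>
      <attvalue><entity-ref>i0001</entity-ref></attvalue></attribute>
    </attributes>
   </types>
  </entity-instance>
  <entity-instance id="i0001">
   <types><type>enterprise</type>
    <attributes>
     <attribute><attname>name</attname>
      <attvalue>ISOGEN International Corp.</attvalue></attribute>
    </attributes>
   </types>
  </entity-instance>
 </data-instances>
</data-serialization>"""

class Instance:
    """One typed object from the original abstraction (hypothetical class)."""
    def __init__(self, oid, type_name):
        self.oid, self.type_name, self.attrs = oid, type_name, {}

def deserialize(xml_text):
    root = ET.fromstring(xml_text)
    instances = {}
    # Pass 1: create one typed object per entity-instance element.
    for ei in root.iter("entity-instance"):
        inst = Instance(ei.get("id"), ei.findtext("types/type"))
        for att in ei.iter("attribute"):
            name = att.findtext("attname")
            value_elem = att.find("attvalue")
            ref = value_elem.find("entity-ref")
            # Record either a literal value or an unresolved reference.
            inst.attrs[name] = (("ref", ref.text) if ref is not None
                                else ("lit", value_elem.text))
        instances[inst.oid] = inst
    # Pass 2: resolve references into direct object pointers -- this is
    # where the second-class links become first-class again.
    for inst in instances.values():
        inst.attrs = {k: (instances[v] if kind == "ref" else v)
                      for k, (kind, v) in inst.attrs.items()}
    return instances

objects = deserialize(DOC)
eliot = objects["i0000"]
print(eliot.type_name, eliot.attrs["name"], eliot.attrs["employer"].attrs["name"])
```

Note that the mapping knowledge lives entirely in deserialize(): a query like "select person where name is 'Eliot'" could equally be answered by translating it into a search over the entity-instance/attname/attvalue structure instead of materializing the objects first.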
Either way, the mapping has to be defined and implemented. Whether you do it literally (that is, importing the data back into some "non-XML" repository) or virtually on top of an "XML repository" is an implementation/optimization choice. Thus, even saying "XML means the data abstraction you get from XML syntax" isn't very helpful, because the resulting abstraction isn't really what you want.

However, the characteristics of the XML in-memory abstraction *as a class of data* are very much like the characteristics of other abstractions. For example, the abstract data objects that describe a power plant are very much like the abstract data objects that describe a document:

- There are a lot of them (every pipe, valve, pump, joint, etc., represents at least one node, with many relationships to other nodes)
- Each node has lots of properties (position, identifier, operating characteristics, geometry, status, age, etc.)
- The nodes exist both in a hierarchy reflecting their physical structure (plant-unit-assembly-subassembly-part) and in a graph representing their connections to other parts (valve one must be closed before valve two can be opened)
- They are equally static and dynamic; that is, a large part of the data never changes, while a large part of it is constantly changing
- I want to ask a lot of questions about the data, and I can't predict what sort of questions I might want to ask
- If something is wrong in the data, bad things may happen

This suggests that the technology that can handle documents at large scale can also handle power plants at large scale (or ships or airplanes or buildings or electronic components or enterprises or governments or ...). This, I think, leads to an excitement about XML and its application to managing large data stores, because it provides an easy-to-understand entry into the problem space and an easy place to start stressing and testing the technology.
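The shared shape behind those characteristics can be sketched directly. This is a hedged illustration, not a proposed design; all names (Node, props, links, the valve and chapter examples) are invented to show that one "annotated graph" structure serves both a power plant and a document:

```python
# Minimal sketch of the general structure: many nodes, each with properties,
# arranged in a containment hierarchy plus cross-cutting graph links.
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                                     # e.g. "valve", "chapter"
    props: dict = field(default_factory=dict)     # annotations on the node
    children: list = field(default_factory=list)  # physical/structural hierarchy
    links: list = field(default_factory=list)     # graph edges across the tree

    def add(self, child):
        self.children.append(child)
        return child

    def find(self, predicate):
        """Ad hoc, unpredictable queries: walk the hierarchy, filter."""
        if predicate(self):
            yield self
        for c in self.children:
            yield from c.find(predicate)

# A tiny power-plant fragment...
plant = Node("plant", {"name": "Unit 1"})
unit = plant.add(Node("unit", {"id": "U1"}))
v1 = unit.add(Node("valve", {"id": "V1", "status": "closed"}))
v2 = unit.add(Node("valve", {"id": "V2", "status": "open"}))
v1.links.append(("must-close-before-opening", v2))  # operational constraint

# ...and the same shape serving as a document.
doc = Node("book", {"title": "Fiddly Bits"})
ch1 = doc.add(Node("chapter", {"title": "Intro"}))
ch1.links.append(("cites", doc))                    # a cross-reference

valves = list(plant.find(lambda n: n.kind == "valve"))
print([v.props["id"] for v in valves])  # -> ['V1', 'V2']
```

The point is only that nothing in Node is specific to either domain; a repository good at storing and querying this structure at scale serves both.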
This is all good, but we have to be careful not to lose sight of the fact that the goal shouldn't be to shoe-horn all complex structures into XML's abstract data model; it should be to develop data management technologies that handle documents well, because if we do, they will also handle power plants and airplanes well. And the reverse is true as well--if I have a database that can handle a power plant or an aircraft, chances are it will handle documents at scale too.

Near as I can tell from my work in the STEP world and in the document world, the technology to manage data of this sort at the scales we need simply doesn't yet exist. I don't know if this is a hardware problem or a science problem, but I suspect it's a bit of both. I suspect that the solution requires an entirely new way of thinking about storing little fiddly bits of data that is neither relational nor object nor object-relational, but something else entirely (or at least significantly enough else to be something different).

Cheers,

E.
--
<Address HyTime=bibloc>
W. Eliot Kimber, Senior Consulting SGML Engineer
ISOGEN International Corp.
2200 N. Lamar St., Suite 230, Dallas, TX 75202. 214.953.0004
www.isogen.com
</Address>

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)