ACM Queue Special Issue on Semi-Structured Data
The October 2005 issue of ACM Queue [1] is dedicated to the topic of semi-structured data, and has several excellent articles on XML. Here's an excerpt from one:

"XML and Semi-Structured Data"
By C. M. Sperberg-McQueen (World Wide Web Consortium)
In: ACM Queue Volume 3, Number 8 (October 2005), pages 34-41
Special Issue on Semi-Structured Data

Excerpts:

XML makes several contributions to solving the problem of semi-structured data, the term database theorists use to denote data that exhibits any of the following characteristics:

(1) Numerous repeating fields and structures in a naive hierarchical representation of the data, which lead to large numbers of tables in a second- or third-normal form representation;
(2) Wide variation in structure;
(3) Sparse tables.

XML provides a natural representation for hierarchical structures and repeating fields or structures. Further, XML document type definitions (DTDs) and schemas allow fine-grained control over how much variation to allow in the data: Vocabulary designers can require XML data to be perfectly regular, or they can allow a little variation, or a lot. In the extreme case, an XML vocabulary can effectively say that there are no rules at all beyond those required of all well-formed XML. Because XML syntax records only what is present, not everything that might be present, sparse data does not make the XML representation awkward; XML storage systems are typically built to handle sparse data gracefully.

The most important contribution XML makes to the problem of semi-structured data, however, is to call into question the nature and existence of the problem. As the description makes clear, semi-structured data is just data that does not fit neatly into the relational model.
Referring to 'the problem of semi-structured data' suggests subliminally that the problem lies in the failure of the data to live up fully to the relational model, rather than in the model and its failure fully to support the natural structure of the data.

XML invites us to model the structure of our information with elements that form a tree structure, attributes that decorate the nodes of the tree, and inter-nodal links that allow us to model arbitrary graphs, not just trees. For this tree structure, XML provides a straightforward linear representation in the form of a labeled bracketing, which can be used for serial transfer of information. Fundamentally, XML is simply a labeled bracketing in which every element is labeled both at its beginning, with a start-tag, and at its end, with an end-tag.

XML invites us to model information as a tree, but it need not be processed in that form. XML can be understood, and processed, at several different levels of abstraction:

* As a character stream (this is the layer actually defined by the XML spec itself)
* As a sequence of data characters interspersed with markup (a regular language)
* As a tree in the obvious way, with one node per element, and the attributes as decorations on the nodes
* As a graph in which internodal links are defined by parent-child relations between XML elements, by ID/IDREF links, or by application-specific methods of linking between elements
* As a tree or graph annotated with information about data types and validity (as the output of schema validation)
* As an instance of an application data structure, with arbitrary structure, built on the basis of the XML input.

[...]

By offering tree structures, instead of just lines of characters or tabular structures, XML dramatically enriches the possibilities for representation of documents and other information.
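
(Illustration added to this excerpt, not part of the article.) The levels of abstraction listed above can be sketched with Python's standard library. The sample document, its element names (library, book, title, ref) and its attribute names (id, target) are invented for illustration. Without a DTD, the id attribute is treated as an ID only by convention here. The schema-annotated level is omitted because the standard library has no schema validator.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document; all names are illustrative only.
DOC = (
    '<library>'
    '<book id="b1"><title>First</title></book>'
    '<book id="b2"><title>Second</title><ref target="b1"/></book>'
    '</library>'
)

# Character stream: the document is just a sequence of characters.
chars = list(DOC)

# Data characters interspersed with markup. A simple regular expression
# is enough for this sample (no comments, CDATA sections, or '>' inside
# attribute values); it is not a general XML tokenizer.
tokens = re.findall(r'<[^>]+>|[^<]+', DOC)

# Tree: one node per element, attributes as decorations on the nodes.
root = ET.fromstring(DOC)

# Graph: parent-child edges plus ID/IDREF-style links between elements.
# Here 'id' and 'target' play the roles of ID and IDREF by convention.
by_id = {el.get('id'): el for el in root.iter() if el.get('id') is not None}
links = [
    (book.get('id'), ref.get('target'))
    for book in root.iter('book')
    for ref in book.iter('ref')
    if ref.get('target') in by_id
]

# Application data structure built from the XML input.
titles = {book.get('id'): book.findtext('title') for book in root.iter('book')}
```

Each name (chars, tokens, root, links, titles) holds the same information viewed at a different level; which one an application uses depends on how much structure it needs.
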
Many kinds of information, documents among them, have prominent hierarchical organization and their representation using XML is dramatically more natural and convenient than using competing notations. But the hard fact is that in many kinds of interesting data, hierarchical structures coexist with other, competing hierarchical structures, or with information that resists any kind of hierarchy.

To take a simple example: A book typically has a hierarchical logical structure of front matter, body, and back matter, with the body being subdivided into chapters, sections, subsections, and so on; but books also have a physical structure of volume, gathering, opening, page, column, line. Whenever paragraphs flow across page boundaries -- that is, virtually always -- these two hierarchies come into conflict. This topic has been of interest to markup theorists for at least 20 years, and new proposals continue to appear: Concurrent markup hierarchies, colored XML, GODDAG (general ordered-descendant directed acyclic graph) structures, just-in-time trees, LMNL (Layered Markup Annotation Language), and range algebras are just a few of the more interesting recent proposals.

XML has inherited from formal language theory as defined by Noam Chomsky in 1957 the notion that a language is a Boolean set of strings. Applied to documents and document grammars, this means that documents are either valid and members of the set or else invalid and not members. In reality, some errors are more severe than others, and our systems would be less rigid and brittle if our notion of validity allowed continuous ('fuzzy') values instead of forcing a black/white distinction. The rigidity of the distinction is one reason that some XML users prefer not to use document grammars. A more flexible notion of validity would make writing flexible applications possible without giving in to dirty data.
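
(Illustration added to this excerpt, not part of the article.) The contrast between Boolean and continuous validity can be made concrete with a small sketch. The article does not propose a specific measure; the scoring rule below (the fraction of elements that conform to a table of permitted child names) and the rule table itself are assumptions chosen only to show the idea.

```python
import xml.etree.ElementTree as ET

# Hypothetical content rules: element name -> permitted child element names.
ALLOWED_CHILDREN = {
    "book": {"title", "chapter"},
    "chapter": {"title", "para"},
    "title": set(),
    "para": set(),
}

def validity_score(root):
    """Return the fraction of elements that satisfy the rules (1.0 = fully valid)."""
    elements = list(root.iter())
    conforming = 0
    for el in elements:
        allowed = ALLOWED_CHILDREN.get(el.tag)
        # An element conforms if its name is declared and every child is permitted.
        if allowed is not None and all(child.tag in allowed for child in el):
            conforming += 1
    return conforming / len(elements)

def is_valid(root):
    """The classical Boolean notion: valid only if nothing at all is wrong."""
    return validity_score(root) == 1.0

good = ET.fromstring("<book><title/><chapter><title/><para/></chapter></book>")
flawed = ET.fromstring("<book><title/><chapter><title/><para/><note/></chapter></book>")
```

Under the Boolean notion, the document in flawed is simply rejected. Under the graded notion it scores between 0 and 1, so an application could choose to accept documents above some threshold.
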
Given the massive proliferation of schemaless XML vocabularies, the need for tools to support grammar induction is increasing: Given a body of XML data, what grammars can be written that describe the data? There are several more or less widely known efforts in this area, from the attempt to generate a grammar for the New Oxford English Dictionary in the late 1980s to the industrially oriented grammar induction of the Fred project at OCLC (Online Computer Library Center).

Schemaless or not, the number of XML vocabularies is exploding and unlikely to shrink anytime soon. Both in the context of data integration projects that provide searching over a federation of data sources, and in the context of a single project working with an evolving document grammar, applications of the data-exchange problem to XML are important. Given two schemas S1 and S2, allow the convenient specification of a mapping from S1 to S2 or find such a mapping automatically. Given that mapping and a query against schema S2, translate the query into terms of schema S1 to allow the data to be filtered without first being materialized in schema S2.

How does XML help solve the semi-structured data problem? XML provides a tool for representing and grappling with the data and recognizing the complexity of its inbuilt structure.

[1] http://www.acmqueue.org/
[2] http://www.acmqueue.org/modules.php?name=Content&pa=list_pages_issues&issue_id=27

-- Robin Cover