Re: xml taxonomy
Rick and Len:

Disclaimer: The following is my own terminology that helps me sort out the world. I'm not trying to impose it on anyone, just sharing it for what it's worth.

I distinguish the following:

1. "atomic/electronic document" and "xml document"
2. "messages/protocols", "forms", and "documents"
3. "tight" versus "loose" schema
4. "data dictionary", "schema" and "schema framework"

================
1. "atomic document/electronic document" and "xml document"
================

An "XML document" is an xml document as defined by W3C's XML 1.x specification(s).

By "atomic" or "electronic" document, I mean one electronic file that has both style and data information in it, whether xml or not. For example, an MS Word, Word Perfect, or PDF document is an atomic/electronic document.

An XML document can also be an atomic/electronic document. However, if a stylesheet is necessary to render the document, then I do not consider the *two* separate files (xml and stylesheet) to be an atomic/electronic document. I would only consider an XML document to be atomic/electronic if the data and style are in the same file.

IMPORTANT: This is not to say that I advocate mixing data and style, because I do not. However, it is possible to put both style information and data into the same electronic file and still keep them separate. Some "electronic/atomic" document formats separate data/style better than others. Separation is always better in my view, even if the style and data are in the same electronic file.

================
2. "messages/protocols", "forms", and "documents"
================

Generally, a "message" is a machine-to-machine data transfer (e.g., from one database to another database). Neither the order in which the data appears nor a precise visual representation is important. What is important is moving data from one system to another system.
I find this is where people care the most about capturing "relational" structures in XML, because most often there is legacy data flowing from one relational database into another relational database. (XML provides hierarchical structure, so it is not always intuitive how to capture "relational" structures . . . but this is another discussion.)

A "protocol" is a series of messages that follow one of many request/response patterns. For example, a filing xml might require a confirmation xml.

The important point here is that a message generally does not require a stylesheet or, if it does, one or several stylesheets might show different views/subsets of the information, but the order and the format of the final output is not important.

A "form" is similar to a "message" in that data in a form is highly structured and "fill-in-the-blank." Forms differ from messages because the visual representation of the form is important to a human user. Forms technologies, therefore, must pay close attention to the final output. In this area, users tend to want the "electronic" form to look exactly like the "paper" form.

A "form" is a type of "document"; however, not all "documents" are "forms." A "form" is primarily "fill-in-the-blank" data. In contrast, a "document" includes "prose." "Prose" is free flowing text, structured as headings, paragraphs, and outlines/lists. A "document" can also contain "fill-in-the-blank" data, but, again, unlike a "form," it includes prose.

In the area of legal forms and documents, in which I work, a form might be, for example, a "coversheet" on a pleading, whereas the "pleading" would be a "document." A brief supporting the pleading would be a "document."

Note (to add a bit of confusion): In legal practice, lawyers use "form books" -- which are templates for making legal claims, such as fraud or medical malpractice. These "form books" contain what I define above as a "document".
Also, in some states, especially California, state government has codified the format of certain traditional "documents" (form books) into what it calls "forms" (to be more precise - judicial council forms).

Examples of legal documents include court/justice documents, transcripts, legislative documents, contracts, treaties and letters.

There is a fine line between a "form" and a "document." To borrow an analogy from a Supreme Court Justice, distinguishing forms/documents is like pornography -- you know it when you see it.

================
3. "tight" versus "loose" schema
================

A "tight" schema is a schema that precisely validates data in an xml document. Qualities of a "tight" schema, for example, are that it is neither underinclusive nor overinclusive. Elements have strict/precise content models. For example, no mixed content; precise use of sequence/choice elements; precise, well-defined enumerations, min/max occurs, data types, other facets.

A "loose" schema is the opposite of a "tight" schema. For example, there may be overinclusive elements, there may be mixed content, there may be many "string/text" nodes that do not define enumerations, min/max occurs, data types, other facets.

Different applications need schema that are "tighter" or "looser" than others. In practice, I find that there is a continuum from tight to loose when one moves from messages to forms to documents. That is, message formats tend to be very "tight" whereas document formats tend to be much more "loose". Forms are somewhere in the middle.

================
4. "data dictionary", "schema" and "schema framework"
================

A "schema" is a DTD, XML Schema, Relax NG Schema, or the like.

A "data dictionary" is a set of defined terms. In my view, it does not (or should not) mandate, define, or require a data structure (such as a relational structure or a hierarchical structure).
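
To make the schema/data-dictionary distinction concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the sample XML Schema, its element and attribute names, and the helper `extract_terms` are all hypothetical and do not come from any real repository or tool. The sketch keeps the declared names and discards the content model (nesting, sequence, types), which yields a flat set of terms with no data structure attached.

```python
import xml.etree.ElementTree as ET

# Namespace prefix used by ElementTree for XML Schema elements.
XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema, invented for this illustration only.
SAMPLE_XSD = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="Filing">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="CaseNumber" type="xs:string"/>
        <xs:element name="FilingDate" type="xs:date"/>
      </xs:sequence>
      <xs:attribute name="courtId" type="xs:string"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""


def extract_terms(xsd_text):
    """Return the sorted, de-duplicated names declared in one schema.

    Only element and attribute names are kept; nesting, ordering, and
    types (the content model) are deliberately ignored.
    """
    root = ET.fromstring(xsd_text)
    terms = set()
    for node in root.iter():
        if node.tag in (XS + "element", XS + "attribute"):
            name = node.get("name")
            if name:  # declarations using ref="..." carry no name
                terms.add(name)
    return sorted(terms)


terms = extract_terms(SAMPLE_XSD)
print(terms)
```

The result is just a list of names. Definitions and synonym links would have to be added from another source, since a schema alone does not carry them.
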
Most data dictionaries that I run across in my work are petrified in paper documents or MS Word/Word Perfect/HTML/PDF documents. This is unfortunate, because it greatly limits usability. Every once in a while, I'll get lucky and find a data dictionary in an electronic spreadsheet or in a database.

A good dictionary will have a lot of terms in it. If it contains synonyms, then there will be some mechanical means to determine that two terms are synonyms.

I would expect an XML data dictionary to be in a simple XML format that shows simple relationships among terms or in RDF or perhaps one of the emerging ontology formats.

In my view, "schema" developers should use "data dictionaries" for element and attribute names, but *not* for content models. This is necessary because different applications need different types of schemas (e.g., tight versus loose) with different combinations/mixtures of terms (e.g., elements/attributes).

A "schema framework" is a set of best practices and conventions for creating (arbitrary) schema. We have found that the use of a "schema framework" *greatly reduces* the time it takes to create, manage, develop, store, and write code/applications around schema. We have, for example, a set of rules that apply to creating message schema (all schema), additional rules that apply to creating form schema, and yet additional rules that apply to creating document schema.

In relation to Rick and Len's comments, we have found that the use of a "schema framework" allows us to automate and speed the development of "data dictionaries" (or taxonomies). I would disagree with Len that this is a purely academic exercise. We have implemented real, working techniques that greatly reduce the cost and time of using XML and XML Schema.

For example, in our schema repository, we have perhaps 400-500 schema.
Because each schema follows the rules of the schema framework, we are able to automatically generate one or more data dictionaries based on either all or a subset of the schema in the repository. (I am not contending that this could not be done with schema that are not in a "framework" -- I'm simply saying it is easier and has more benefits if a "framework" is used.)

Automating the creation of data dictionaries has benefits that Rick touches on -- I call this "aggregation" -- that is, it is possible to aggregate and efficiently analyze terms (and potentially content models) in a group of schema. Aggregation (just as one would do with financial data) allows one to observe patterns in terminology use, harmonize terminology, and better use/reuse and define terms.

I hope you find this useful.

Thanks,
Todd

----- Original Message -----
From: "Bullard, Claude L (Len)" <clbullar@i...>
To: <rjm@z...>; <xml-dev@l...>
Sent: Wednesday, August 27, 2003 9:47 AM
Subject: RE: xml taxonomy

> That is somewhat like saying take the infoset specification
> apart and analyze how the individual information items in
> combination enable different kinds of provable properties
> given some set of axioms and operations. Sounds like fun
> but I suspect a rigorous result will require some serious
> resources and that is why I would expect this from the
> academic community presenting papers at conferences, not
> from the developer community on a mail list where as soon
> as the frustration goes past a certain threshold, someone
> will derail into Godel and use strange loops to
> admonish all about the fruitlessness of universal proofs.
>
> Proofs are nice to have, but all a real programmer needs is
> to make it run then make it run faster. ;-)
>
> len
>
> -----Original Message-----
> From: Rick Marshall [mailto:rjm@z...]
>
> following several discussions we've had lately, mostly on relational
> models and document management i'm going to float the idea - which may
> be covered elsewhere, please redirect me if appropriate - that having a
> taxonomy of xml may help us to understand what forms, and when are good
> for different problems.
>
> if we take numbers as an analogy (and that's all it is, there are plenty
> of others) they can be divided into sets - integer, real, rational,
> irrational, complex, etc and we increase our understanding and use of
> numbers by developing theorems that cover the different sets.
>
> it seems to me that xml is as diverse as numbers or any similar grouping
> and that by focusing on well defined sets of xml structures and their
> properties we can get the theorems to improve our use and understanding.
>
> eg one set might be xml with tags only - no attributes; another might be
> xml that is constrained to two levels; etc
>
> by understanding the properties and operators that are valid on these
> sets we can then see the analogies to other technologies such as
> relational models, markup, etc.