XML-DEV Mailing List: The privilege of XML parsing - Data types, binary XML and XML pipelines
I've given a lot of thought recently to what it is about data typing in XML and Binary XML that makes me so nervous. What follows is my most concerted attempt at articulating what causes me to be so nervous, and a suggestion for how we might proceed.

Executive summary

Welding data typing into the core of XML is a really bad idea, and it is well on the way to splitting the XML world in two, which is a very bad thing. Binary XML is coming; it must be embraced - and relegated to its proper place - before it does terrible damage. The best way to deal with these - and other thornies like namespace expansion, XLink, XInclude etc. - is to infuse the concept of "XML processing", from the lexical level to the application level, with a pipeline processing architecture, before we all go completely ga ga, or stop talking to each other, or both.

Data Types:

In order for any two systems to communicate they need to have a shared understanding of at least one "bootstrap" data type for the bag of ones and zeros that ultimately goes across the wire. It is a self-evident fact that decades of computing have failed to produce a set of universal datatypes - types that one can reasonably expect to be commodity datatypes on most architectures, most programming languages, most databases and so on.

Originally the only universal datatype was 1 or 0. Then came the universality of ASCII. Now we are seeing ASCII evolve into Unicode. The nearest thing we have to a universal datatype is Unicode - or, in programming language terms, the STRING datatype. The wonderful thing about strings - apart from their universality - is the fact that you can use a universal string-based notation to represent pretty much any higher-order datatype. Programming languages have used this fact to great effect over the years by storing programs as STRINGS. These days, there is a general-purpose notation for sharing higher-order datatypes in a universal way. That notation is XML.
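The point can be made concrete with a minimal Python sketch (the document and field names are invented for illustration): the only thing sender and receiver share is a Unicode string with angle brackets, and each receiver decides for itself what higher-order types to build from it.

```python
import xml.etree.ElementTree as ET
from datetime import date
from decimal import Decimal

# The wire format is just a Unicode string; no type information
# travels with it. (Hypothetical document for illustration.)
doc = "<invoice><issued>2002-11-04</issued><total>19.99</total></invoice>"

root = ET.fromstring(doc)

# The *receiver* chooses the data model. Here we choose datetime.date
# and Decimal; another consumer of the same bytes could equally well
# keep both fields as plain strings, or map them to entirely
# different types.
issued = date.fromisoformat(root.findtext("issued"))
total = Decimal(root.findtext("total"))
```

Nothing beyond Unicode was agreed between the two parties, yet the receiver ends up with exactly the datatypes its own application needs.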
XML did not - and should not now be allowed to - fall into the trap of declaring the existence of a universal set of datatypes, for the following reasons:

1. No such set of datatypes exists. The world is full of systems that have only a thematic consensus on things like "int", "date" and so on. Datatypes for aggregates like "person" or "business" have proven to be essentially impossible to canonicalize.

2. Applications come and go but data lives forever (or can do). The trick of making your data outlive your applications is to divorce application-level data models from the XML. Burying data model information into the XML binds the XML to the application in a way which will bite when the application is changed or retired.

3. Doing so significantly increases the semantic consensus required by communicating processes to share data. The beauty of *HAVING* to create your own data model[1] from a stream of Unicode with angle brackets is that you do not have to share any semantics or expectations other than Unicode with the originator of that XML. Far from being a burden, it is a *privilege* to be able to parse the XML and treat the data the way you want to, rather than have a data model imposed on you.

4. There is no need to infuse this right into the core of XML - it fits perfectly naturally into a post-parse, application-domain-specific pipeline, which is where it belongs.

Mind you, those who think their interoperability problems will be solved by agreeing on a set of basic datatypes are sorely mistaken.

Binary XML:

I use binary XML every day. My OpenOffice files are binary (zipped XML), my serialized pyxie trees (Python pickles) are binary. My RDBMSs that contain XML fragments are binary. I often send messages over MOMs that contain XML plus a Python pickle. Simply put, there is nothing wrong with Binary XML within the confines of an application. It is a very useful optimization which can and should be treated as "compiler" output.
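The "binary XML as compiler output" idea is exactly what the OpenOffice format does, and it is easy to sketch with the Python standard library (file name and document are illustrative): the zipped form is a derived, disposable optimization, and the XML text round-trips out of it losslessly.

```python
import io
import zipfile

# A hypothetical document, stored canonically as XML text.
xml_source = "<doc>" + "<para>Hello, world</para>" * 200 + "</doc>"

# "Compile" it: a zipped copy for transmission or fast loading,
# much as OpenOffice packages its content.xml.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("content.xml", xml_source)
compiled = buf.getvalue()

# The binary form is an optimization only; the XML source is never
# discarded, and "decompiling" recovers it byte for byte.
with zipfile.ZipFile(io.BytesIO(compiled)) as zf:
    restored = zf.read("content.xml").decode("utf-8")
```

The compressed form is much smaller on the wire, yet nothing application-specific has leaked into it: any receiver that understands zip and Unicode gets the original XML back.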
You would never throw away your source code having passed it through a compiler. The same should be the case with your XML. It is the portable representation of your data, just as source files are the portable version of your machine code. A standardized zipped XML notation is something the community needs to think about (perhaps in the context of packaging), because many programmers see XML transmission size as a problem. If they end up using strongly typed "compiled" XML to get around this, they will have tightly bound their XML to their process, which is a bad thing. Standardized marshallings of XML (XML infoset compilers) for Java, .NET etc. need to be done so that the notion of binary XML is both catered for and COMPREHENSIVELY RELEGATED to the realm of "compiled" output. Something you just use for optimization reasons, but NEVER use as primary storage for your data.

Pipeline processing:

I think we can keep peace amongst the data heads and the doc heads, the infoset heads and the lex heads etc. I think the way to do it is to infuse XML parsing with a layered, phased, time-ordered processing model so that data typing, XIncluding etc. can be incorporated into a single, flexible framework. Those who don't want infoset annotation should be able to leave it out of parsing by simply configuring the parser. This is where XPipe, DSDL, XVIF etc. are coming from. (Note that, in the sense I am using the word "pipe" here, the W3C XML Pipeline Note is more of a dependency resolver than a pipe.)

I have not had the time to devote to XPipe that I would have liked, but I'm a big believer in XML pipelining. My company, Propylon, is about to announce a commercial J2EE implementation of XPipe which I'm hoping will fuel interest in the open source community in this approach to XML processing. (Anybody going to XML 2002 in Baltimore interested in seeing this can contact me.)

Summary:

I suggest we make one core twist to XML.
Let's express the various layers of XML parsing in terms of a pipeline and see if it can help us accommodate the data typing folk, the binary XML folk etc. without throwing out the baby with the bathwater. The baby is that an XML document is *always* just a Unicode string to start with. It is the world's only universally available, bootstrappable data type apart from 1's and 0's.

[1] My use of "data model" here is, I have come to realize, at odds with Tim Bray's usage of the term, which is, I think, the reason we have in the past disagreed on the data model issue. Given my predisposition to pipeline thinking, I am acutely interested in being able to build black box XML processing components whose only interface to the outside world is that universal datatype we all know and love - UnicodeWithAngleBrackets. Without a mechanism for specifying what parts of the infoset are preserved through such a black box, it is difficult to know which boxes can be hooked up to which other boxes. Interop suffers and complexity results. When I say I want a data model for XML, I mean that I want to be able to say rigorously which parts of the lexical structure my black box sees on its input side can be faithfully replicated on the output side. In a pipelined world, this would equate to pre- and post-conditions on pipeline components that express the infoset fidelity of the component.

http://seanmcgrath.blogspot.com
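P.S. The pipeline idea above can be sketched in a few lines of Python. This is not XPipe itself - the stage names and behaviors are invented for illustration - but it shows the shape: each stage is a black box whose only interface to the outside world is the universal datatype, a Unicode string of XML in and a Unicode string of XML out.

```python
import xml.etree.ElementTree as ET

def strip_comments(xml_text: str) -> str:
    # Illustrative stage: reparse and reserialize; ElementTree's
    # default parser drops comments, so this normalizes them away.
    return ET.tostring(ET.fromstring(xml_text), encoding="unicode")

def uppercase_titles(xml_text: str) -> str:
    # Illustrative stage: an application-specific transformation,
    # again expressed purely as string -> string.
    root = ET.fromstring(xml_text)
    for el in root.iter("title"):
        el.text = (el.text or "").upper()
    return ET.tostring(root, encoding="unicode")

def pipeline(stages, xml_text: str) -> str:
    # Chain the stages: any stage can be added, removed or reordered
    # because every interface is just UnicodeWithAngleBrackets.
    for stage in stages:
        xml_text = stage(xml_text)
    return xml_text

doc = "<book><title>the privilege of parsing</title></book>"
out = pipeline([strip_comments, uppercase_titles], doc)
```

Because no stage imposes a data model on its neighbours, the data-typing folk can bolt a typing stage onto the end of such a chain without that stage's assumptions ever leaking back into the core parse.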