Re: Parsing efficiency? - why not 'compile'????
On Tue, 25 Feb 2003 15:59:27 +0100 Robin Berjon wrote:

> Karl Waclawek wrote:
>
> > Let's assume we would have had a binary XML specification from
> > the beginning, everything basically the same, just binary streaming format,
> > but same Infoset, same APIs for reporting XML content.
> > What would be the difference? For the programmer? For the platforms?
>
> It would be horrible. Quite simply horrible. But then, it would never have taken
> off so we wouldn't be discussing it. :-)

Let me modify Karl's assumption a little:

Let's assume we /now have/ a binary XML specification [snip], everything basically the same, just binary streaming format, but same Infoset, same APIs /as/ for reporting XML content.

And again ask these questions: What would be the difference? For the programmer? For the platforms?

> Binary XML is a contradiction in adjecto. That's why I'm anti-binxml: simply
> because there is no such thing as "Binary XML". Binary Infosets however are
> another story completely, and much more interesting :)

True, there's no such thing as "Binary XML". Let's say we're talking about "Binary XML-like Markup", condensed to "binary markup". One step in creating a binary markup scheme is to replace the terminals in the XML grammar (which are essentially combinations of Unicode characters) with some other form. Binary infosets are not necessarily binary markup. They're just a serialization of some data structure.

Extreme optimization based on knowledge of the schema might be unattractive because:

# Interpreting involved binary constructs could be more difficult: Consider the variable length symbols that I have used in Xqueeze[1] (as has Dennis Sosnoski in XMLS, IIRC). The symbols are easy to understand - unsigned integers serialized as octets in big-endian order, with the least significant bit of each octet acting as a continuation flag. However, reading one symbol requires a loop that runs once for every octet in it. Each iteration involves one comparison (check if the LSb is 1), one multiplication (promotion of the previous octet by 8 bits) and one addition (value of the current octet). It's not difficult to see the computation involved in arriving at "Wed Jan 3rd 2003, 14:00 GMT" from a variable length integer that counts the number of seconds since the Epoch[2].

# Forced validation: The above situation would be even more ironic if the application didn't care about the actual value of the date and was only interested in some string that looked like a date. With XML, validation of data types is an option; in the above scheme it is enforced as a requirement. Even where validation is required, how far can a parser validate? A value may be syntactically or semantically acceptable but contextually invalid (lame e.g. - a date of birth being in the future). My point: validation is and should remain an option.

# Tight coupling between schema revisions: XML is quite resilient to changes in the schema, as long as the changes are done smartly enough to allow old documents to pass validation against the new schema. The more the binary encoding depends on the schema, the more this flexibility is restricted. (I still have to reach XML's level of compatibility in Xqueeze Associations (data dictionary). Fortunately, achieving that wouldn't require changes in the grammar of the encoding.)

# What is gained in the end? With schema-based compaction done with all the aggressiveness possible, how much would be gained over a simple markup binarization scheme? Perhaps a compaction factor of, say, 5 over XML. Would this really be significant when compared to a compaction factor of, say, 4 achieved by markup binarization?

This is an optimization issue - the smaller the binary scheme, the more computation is required to extract information out of it. I'm not totally against a type-aware encoding, but for a standard binary encoding to evolve, it would have to be in a "sweet spot" on the size vs. computation vs. generality plane.

[1] http://xqueeze.sourceforge.net
[2] http://www.alaric-snell.com/xml-dev-threads.html#binxml

PS: I've revised the xqML specifications to allow document parsing without knowledge of the schema, and other goodies. I'll release a draft spec. on Monday (3rd March) when my vacation gets over. Random access would be addressed shortly thereafter. :-)

--
Tahir Hashmi
http://staff.ncst.ernet.in/tahir
tahir AT ncst DOT ernet DOT in

We, the rest of humanity, wish GNU luck and Godspeed
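The decoding loop for the variable length symbols described in the message above can be sketched roughly as follows. This is only an illustration, not the actual Xqueeze or XMLS code: the function names `encode_symbol` and `decode_symbol` are hypothetical, and the bit layout is an assumption (7 payload bits in the upper part of each octet, with the least significant bit set to 1 when more octets follow). The message itself speaks of promoting by 8 bits, so the real layout may well differ.

```python
def encode_symbol(value: int) -> bytes:
    """Encode an unsigned integer as big-endian octets.

    Assumed layout (not taken from the Xqueeze spec): each octet holds
    7 payload bits in its upper bits, and its least significant bit is
    1 if another octet follows, 0 on the final octet.
    """
    if value < 0:
        raise ValueError("symbols are unsigned")
    # Split into 7-bit groups, least significant group first.
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()  # big-endian: most significant group first
    last = len(groups) - 1
    return bytes((g << 1) | (1 if i < last else 0) for i, g in enumerate(groups))


def decode_symbol(data: bytes, pos: int = 0) -> tuple:
    """Decode one symbol starting at pos; return (value, next_pos).

    The loop runs once per octet. Each pass promotes the value read so
    far, adds the payload of the current octet, and tests the flag bit.
    Truncated input raises IndexError.
    """
    value = 0
    while True:
        octet = data[pos]
        pos += 1
        value = (value << 7) | (octet >> 1)  # promote, then add payload
        if octet & 1 == 0:                   # flag clear: last octet
            return value, pos


if __name__ == "__main__":
    seconds = 1041602400  # a seconds-since-Epoch count in early January 2003
    encoded = encode_symbol(seconds)
    print(len(encoded), decode_symbol(encoded))
```

Under these assumptions a seconds-since-Epoch timestamp from 2003 occupies five octets, so the loop runs five times before the application can even begin converting the number into a calendar date, which is the cost the message is pointing at.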