
Re: Parsing efficiency? - why not 'compile'????


Tahir Hashmi wrote:
> Robin Berjon wrote:
>>It would be horrible. Quite simply horrible. But then, it would never have taken 
>>off so we wouldn't be discussing it.
> 
> Let me modify Karl's assumption a little:
> 
>   Let's assume we /now have/ a binary XML specification [snip],
>   everything basically the same, just binary streaming format, but
>   same Infoset, same APIs /as/ for reporting XML content.
> 
> And again ask these questions:
> 
>   What would be the difference? For the programmer? For the platforms?

(Note that your question is a bit flawed, as we already have standard 
specifications for binary infosets.)

You basically have two groups of people:

   - those who don't need it. For them, it'll make no difference; they wouldn't 
use it. This is not the WXS type of technology that dribbles its way through 
many others.

   - those who do need it. These folks will be able to use XML where they 
couldn't before. And when I say XML, I mean AngleBracketedUnicode. Conversion to 
binary will happen only in the steps where it is needed, so most of what those 
people see will be actual XML.

> Extreme optimization based on the knowledge of Schema might be
> unattractive because:
> 
> # Interpreting involved binary constructs could be more difficult:
> 
>   Consider the variable length symbols that I have used in Xqueeze[1]
>   (as also Dennis Sosnoski in XMLS, IIRC). The symbols are easy to
>   understand - unsigned integers serialized as octets in Big-endian
>   order, with the least significant bit of each octet acting as a
>   continuation flag. However, parsing them requires a loop that runs
>   as many times as there are octets in the symbol to read one. Each
>   iteration involves one comparison (check if LSb is 1),
>   multiplication (promotion of the previous octet by 8 bits) and
>   addition (value of the current octet). It's not difficult to see the
>   computation involved in arriving at "Wed Jan 3rd 2003, 14:00 GMT"
>   from a variable length integer that counts the number of seconds
>   since the Epoch[2].
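(For readers who haven't seen the scheme: the decoding loop described above can be sketched as below. This is a minimal Python illustration of the encoding as quoted — seven payload bits per octet, most significant group first, least significant bit set while more octets follow — not Xqueeze's or XMLS's actual code, and the names are mine.)

```python
def encode_varint(n):
    """Encode an unsigned int as big-endian octets, 7 payload bits each;
    the least significant bit of every octet except the last is set as a
    continuation flag (illustrative sketch of the quoted scheme)."""
    groups = []
    while True:
        groups.append(n & 0x7F)   # take the low 7 bits
        n >>= 7
        if n == 0:
            break
    groups.reverse()              # most significant group first
    out = bytearray()
    for i, g in enumerate(groups):
        cont = 1 if i < len(groups) - 1 else 0
        out.append((g << 1) | cont)
    return bytes(out)

def decode_varint(data, pos=0):
    """The loop the quoted text describes: one comparison (is the LSb set?),
    one shift (promote the accumulated value), one OR per octet."""
    value = 0
    while True:
        b = data[pos]
        pos += 1
        value = (value << 7) | (b >> 1)
        if not (b & 1):           # LSb clear: this was the last octet
            break
    return value, pos
```

For example, 300 encodes to the two octets `[5, 88]` and decodes back in two loop iterations.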

Errr... I really am not sure what you mean, notably by "involved binary 
constructs". I think you can distinguish between two situations: a) the 
application wants a date, in which case seconds since the Epoch or a time_t 
struct might be exactly what it wants, it'll be cheaper than strptime(3) for 
sure; b) the application wants a string containing a date in which case you're 
free to store dates as strings in your binary format.
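To make situation (a) concrete, here is a hedged sketch contrasting the two routes: reading a fixed-width binary count of seconds since the Epoch versus parsing the equivalent text. The timestamp and field width are illustrative, not taken from any actual binary-XML format.

```python
import struct, time, calendar

# Binary route: a 4-octet big-endian unsigned int holding seconds since
# the Epoch -- one fixed-width read, no text parsing at all.
payload = struct.pack(">I", 1041602400)   # illustrative instant: 2003-01-03 14:00 GMT
(secs,) = struct.unpack(">I", payload)

# Text route: the same instant as an ISO-8601 string, parsed the
# strptime way and converted back to seconds since the Epoch.
text = "2003-01-03T14:00:00Z"
parsed = time.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
secs_from_text = calendar.timegm(parsed)

# Both routes arrive at the same number; the binary one skips the
# character-by-character format matching entirely.
```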


> # Forced validation:
> 
>   The above situation would be even more ironic if the application
>   didn't care about the actual value of the date and was only
>   interested in some string that looked like a date. With XML,
>   validation of data types is an option, but the above scheme enforces
>   it as a requirement. Even where validation is required,
>   how far can a parser validate? A value may be syntactically or
>   semantically acceptable but contextually invalid (lame e.g. - a date
>   of birth being in the future). My point: validation is and should
>   remain an option.

This is completely orthogonal to the subject.


> # Tight coupling between schema revisions:
>   
>   XML is quite resilient to changes in the schema, as long as the
>   changes are made carefully enough that old documents still pass
>   validation against the new schema. This flexibility shrinks as the
>   binary encoding's dependence on the schema grows. (I still have to
>   reach XML's level of compatibility in Xqueeze Associations (data
>   dictionary). Fortunately, achieving that wouldn't require changes
>   to the grammar of the encoding.)

This is a solved problem in BinXML: multiple versions of the same schema can 
co-exist.
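One common way to achieve this (I can't speak for BinXML's internals, so this is only an illustrative sketch) is to tag each encoded document with the version of the symbol table that produced it, so a decoder can keep several tables side by side and old documents stay readable after the schema grows:

```python
# Hypothetical symbol tables for two revisions of the same schema.
# Version 2 only appends names, so version-1 documents stay decodable.
SYMBOL_TABLES = {
    1: {0: "order", 1: "item", 2: "price"},
    2: {0: "order", 1: "item", 2: "price", 3: "currency"},
}

def encode(version, symbol_id):
    """Prefix the payload with the schema version it was encoded against."""
    return bytes([version, symbol_id])

def decode(data):
    """Look the symbol up in the table matching the document's own version."""
    version, symbol_id = data[0], data[1]
    return SYMBOL_TABLES[version][symbol_id]
```

A version-1 document keeps decoding correctly even after version 2 is deployed, which is the co-existence property claimed above.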


> # What is gained in the end?
> 
>   With schema-based compaction done in all the aggressiveness
>   possible, how much would be gained against a simple markup
>   binarization scheme? Perhaps a compaction factor of, say, 5 over
>   XML. Would this be really significant when compared to a factor of,
>   say, 4 compaction achieved by markup binarization? This is an
>   optimization issue - the smaller the binary scheme, the more
>   computation required to extract information out of it. I'm not
>   totally against a type-aware encoding but for a standard binary
>   encoding to evolve, it would have to be in a "sweet spot" on the
>   size vs. computation vs. generality plane.

I'm all for finding a sweet spot, but pulling random numbers out of a hat and 
making broad assumptions about size vs computation won't contribute much toward 
getting there. I am talking about empirically proven, tested, retested, put to 
work in a wide variety of situations, factors of 10, 20 or 50 (or more, but 
testing on SOAP is cheating ;).

As for your remark on the speed of decompaction, note that you may be right for 
a naive implementation of the same thing, but there's comp-sci literature out 
there on making such tasks fast.

-- 
Robin Berjon <robin.berjon@e...>
Research Engineer, Expway        http://expway.fr/
7FC0 6F5F D864 EFB8 08CE  8E74 58E6 D5DB 4889 2488

