Re: Parsing efficiency? - why not 'compile'????
On Thu, 27 Feb 2003 11:02:41 +0100
Robin Berjon wrote:

> Tahir Hashmi wrote:

[snip]

> > Let me modify Karl's assumption a little:
> >
> > Let's assume we /now have/ a binary XML specification [snip],
> > everything basically the same, just binary streaming format, but
> > same Infoset, same APIs /as/ for reporting XML content.
> >
> > And again ask these questions:
> >
> > What would be the difference? For the programmer? For the platforms?
>
> (note that your question is a bit flawed as we already have standard
> specifications for binary infosets.)

I didn't get it... I mean, isn't a binary substitute what we're trying
to develop? I'm not talking about API, I'm talking about syntax or
serialization or whatever - the thing that can be stored in files or
passed down the wire.

> You basically have two groups of people:
>
> - those that don't need it. For them, it'll make no difference. They wouldn't
> use it. This is not the WXS type of technology that dribbles its way through
> many others.
>
> - those that do need it. These folks will be able to use XML where they
> couldn't before. And when I say XML, I mean AngleBracketedUnicode. Conversion to
> binary will only happen in the steps where it is needed so that most of what
> those people will see will be actual XML.

In the first group, there could be a subgroup that doesn't need binary
markup but may use it simply because it can, without affecting the way
its applications work. That's the group that doesn't need human
read/write-ability for its XML docs - the group of WYSIWYG Office
suites, XML-based instant messaging protocols and so on. I hope not all
the people in this group would be same as those described by Elliot ;-)

> > # Interpreting involved binary constructs could be more difficult:

[snip]

> Errr... I really am not sure what you mean, notably by "involved binary
> constructs".
> I think you can distinguish between two situations: a) the
> application wants a date, in which case seconds since the Epoch or a time_t
> struct might be exactly what it wants, it'll be cheaper than strptime(3) for
> sure; b) the application wants a string containing a date in which case you're
> free to store dates as strings in your binary format.

Consider this: the application is only interested in strings for date
but the schema designer specified a date type because it is the Right
Thing(TM) for a date (so that the schema need not be changed if at some
point of time the same application or another application does get
interested in the value). In a binary representation, the processor
will decode the variable length binary value to arrive at the number of
seconds since epoch, then re-construct a string for the application.
Note that the processor will be *synthesizing* a string that could be
read straight off the document. This approach would be better only if
the benefits of saved bandwidth are greater than the cost of
synthesizing the date string. And we can't assume that limited
bandwidth is *always* going to be the motivating factor for using
binary markup.

> > # Forced validation:
> >
> > The above situation would be even more ironic if the application
> > didn't care about the actual value of the date and was only
> > interested in some string that looked like a date. With XML
> > validation of data types is an option that is being enforced as a
> > requirement in the above scheme. Even where validation is required,
> > how far can a parser validate? A value may be syntactically or
> > semantically acceptable but contextually invalid (lame e.g. - a date
> > of birth being in the future). My point: validation is and should
> > remain an option.
>
> This is completely orthogonal to the subject.

This may not be *completely* orthogonal.
In the cited case, despite the date string being typed as date, the
application is free to ignore the value by choosing not to validate it.
In a strongly typed encoding, the decoder does type-checking implicitly
and takes the pains to compute a meaningful value whether or not the
application required it. The particular example I gave is illustrative
only and, as stated earlier, I'm not against type-awareness. I'm simply
being wary of how much flexibility might possibly be lost, and in some
cases computation be wasted, in the quest for a super-optimized binary
encoding.

> As for your remark on the speed of decompaction, note that you may be right for
> a naive implementation of the same thing but there's compsci literature out
> there on making such tasks fast.

Well yes, naivete may lead to bad design. The point is that the more
logic that goes into decoding a format, the higher the bar is raised
for small devices. While one can have small non-validating SAX parsers
for XML, the size of a binary format parser may go up since it would
have to know about synthesizing dates from integers, deducing document
structure from the schema etc., besides the indispensable passing of
strings around. The encoding scheme should require the least possible
context information and minimal parsing logic to be accessible there.

Hope I'm able to explain myself better this time!

-- 
Tahir Hashmi (VSE, NCST)
http://staff.ncst.ernet.in/tahir
tahir AT ncst DOT ernet DOT in

We, the rest of humanity, wish GNU luck and Godspeed
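
[Illustration] The date round trip described in this message can be
sketched roughly as follows. This is only a sketch under assumptions: the
variable-length integer layout (7 bits per byte, little-endian) and all
function names are hypothetical and are not taken from any binary XML
specification discussed in the thread. It shows that a typed binary
decoder has to decode the integer and then synthesize the date string,
while the text path hands over the characters already in the document.

```python
from datetime import datetime, timezone


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low bits first."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Inverse of encode_varint."""
    result = 0
    shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result
        shift += 7
    raise ValueError("truncated variable-length integer")


def date_to_binary(text: str) -> bytes:
    """Hypothetical typed encoding: a date stored as seconds since the epoch (UTC)."""
    dt = datetime.strptime(text, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return encode_varint(int(dt.timestamp()))


def binary_to_date_string(data: bytes) -> str:
    """Binary path: decode the integer, then synthesize the string."""
    seconds = decode_varint(data)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d")


def text_to_date_string(text: str) -> str:
    """Text path: the string is read straight off the document."""
    return text


original = "2003-02-27"
encoded = date_to_binary(original)
```

Here `encoded` is shorter than the ten characters of `original`, which is
the bandwidth saving; `binary_to_date_string(encoded)` is the extra work
an application pays for when all it wanted was the string.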