
Re: Binary XML == "spawn of the devil" ?

  • To: "xml-dev@l..." <xml-dev@l...>
  • Subject: Re: Binary XML == "spawn of the devil" ?
  • From: Elliotte Rusty Harold <elharo@m...>
  • Date: Sun, 10 Aug 2003 15:09:13 -0400
  • In-reply-to: <oprstl350fezizxn@localhost>
  • References: <oprstl350fezizxn@localhost>

Apologies for restarting this thread. I've just returned from my 
vacation, and I'm working my way through a lot of e-mail that built 
up. Having now read this entire thread, there's one issue I noticed 
that's been feinted at a couple of times, but nobody seems to have 
taken it head-on. So please allow me to do that now.

One of the goals of some of the developers pushing binary XML is to 
speed up parsing, to provide some sort of preparsed format that is 
quicker to parse than real XML. I am extremely skeptical that this 
can be achieved in a platform-independent fashion. Possibly some of 
the ideas for writing length codes into the data might help, though I 
doubt they help that much, or are robust in the face of data that 
violates the length codes.  Nonetheless this is at least plausible.
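
To make the robustness concern concrete, here is a minimal sketch of a length-coded field reader. The layout (a four-byte big-endian length followed by that many UTF-8 bytes) and the names LengthPrefixed, writeField, and readField are assumptions made for illustration; they are not taken from any binary XML proposal discussed in this thread.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class LengthPrefixed {
    // Hypothetical layout: four-byte big-endian length, then that many UTF-8 bytes.
    static byte[] writeField(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + bytes.length).putInt(bytes.length).put(bytes).array();
    }

    // Reads one field, rejecting a length code that does not fit the data.
    static String readField(ByteBuffer in) {
        if (in.remaining() < 4) {
            throw new IllegalArgumentException("truncated length code");
        }
        int length = in.getInt();
        // Without this check a bad length code would read past the field
        // or fail with a less informative exception.
        if (length < 0 || length > in.remaining()) {
            throw new IllegalArgumentException("length code " + length + " does not fit the data");
        }
        byte[] bytes = new byte[length];
        in.get(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] field = writeField("hello");
        System.out.println(readField(ByteBuffer.wrap(field)));

        field[3] = 100; // corrupt the length code
        try {
            readField(ByteBuffer.wrap(field));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The reader still has to validate every length code against the bytes actually present, so some of the work a text parser does when scanning for delimiters reappears here as bounds checking.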

However, this is not the primary preparsing of XML I've seen in 
existing schemes. A much more common approach assigns types to the 
data and then writes the data into the file as a binary value that 
can be directly copied to memory. For example, an integer might be 
written as a four-byte big-endian int. A floating point number might 
be written as an eight-byte IEEE-754 double, and so forth. This might 
speed things up a little in a few cases. However, it's really only 
going to help on those platforms where the native types match the 
binary formats. On platforms with varying native binary types, it 
might well be slower than performing string conversions.
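
The two approaches can be put side by side in a short sketch. It assumes the four-byte big-endian int and eight-byte IEEE-754 double described above; the class and method names are hypothetical, and it makes no claim about which path is faster on any given platform.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

class BinaryVsText {
    // Reads a four-byte int, interpreting the bytes in the given byte order.
    static int readBinaryInt(byte[] data, ByteOrder order) {
        return ByteBuffer.wrap(data).order(order).getInt();
    }

    // Reads an eight-byte IEEE-754 double stored big-endian.
    static double readBinaryDouble(byte[] data) {
        return ByteBuffer.wrap(data).order(ByteOrder.BIG_ENDIAN).getDouble();
    }

    // The text path: the same values arrive as ASCII digits and are parsed.
    static int readTextInt(byte[] data) {
        return Integer.parseInt(new String(data, StandardCharsets.US_ASCII));
    }

    static double readTextDouble(byte[] data) {
        return Double.parseDouble(new String(data, StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        // ByteBuffer writes big-endian by default.
        byte[] binaryInt = ByteBuffer.allocate(4).putInt(1234567).array();
        byte[] textInt = "1234567".getBytes(StandardCharsets.US_ASCII);

        System.out.println(readBinaryInt(binaryInt, ByteOrder.BIG_ENDIAN));
        System.out.println(readTextInt(textInt));

        // A reader that assumes the other byte order gets a different number
        // from the same bytes unless it swaps them first.
        System.out.println(readBinaryInt(binaryInt, ByteOrder.LITTLE_ENDIAN));
    }
}
```

Java hides the byte order behind ByteBuffer, but a platform whose native integers are little-endian, or whose floating point is not IEEE-754, has to convert every value it reads, which is where the hoped-for advantage over string conversion can shrink or disappear.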

Unicode decoding is a related issue. It's been suggested that this is 
a bottleneck in existing parsers, and that directly encoding Unicode 
characters instead of UTF code units might help. However, since in a 
binary format you're shipping around bytes, not characters, it's not 
clear to me how this encoding would be any more efficient than 
existing encodings such as UTF-8 and UTF-16. If you just want 32-bit 
characters then use UTF-32. Possibly you could gain some speed by 
slamming bytes into the native string or wstring type (UTF-16 for 
Java, possibly other encodings for other languages). However, as with 
numeric types this would be very closely tied to the specific 
language. What worked well for Java might not work well for C or Perl 
and vice versa.

Nonetheless it should be doable. It should be possible to implement 
a Java parser that works directly on UTF-16 code units and does not 
decode them into characters. Verifying the well-formedness of surrogate 
pairs might be more expensive, but is rarely needed in practice. I 
think this could be fully implemented within the bounds of XML 1.0. I 
don't see why a new serialization format would be necessary to remove 
this bottleneck from the process.
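
As a rough sketch of the idea, the following views big-endian UTF-16 bytes as Java chars without decoding them into code points, and runs the surrogate check as a separate pass. The class and method names are hypothetical, and this is only the character-level scan, not a parser.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.CharBuffer;
import java.nio.charset.StandardCharsets;

class SurrogateScan {
    // Views big-endian UTF-16 bytes as 16-bit code units. No decoding to
    // code points happens here; a trailing odd byte is ignored by the view.
    static CharBuffer asCodeUnits(byte[] utf16be) {
        return ByteBuffer.wrap(utf16be).order(ByteOrder.BIG_ENDIAN).asCharBuffer();
    }

    // Returns true if every high surrogate is immediately followed by a low
    // surrogate and no low surrogate appears on its own.
    static boolean surrogatesWellFormed(CharSequence units) {
        for (int i = 0; i < units.length(); i++) {
            char c = units.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= units.length() || !Character.isLowSurrogate(units.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low surrogate that completes the pair
            } else if (Character.isLowSurrogate(c)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // U+1F600 is outside the Basic Multilingual Plane, so it is stored
        // as a surrogate pair in UTF-16.
        byte[] bytes = "<a>\uD83D\uDE00</a>".getBytes(StandardCharsets.UTF_16BE);
        CharBuffer units = asCodeUnits(bytes);
        System.out.println(units.length() + " code units, well formed: "
                + surrogatesWellFormed(units));
    }
}
```

Markup characters such as < and & are single code units in UTF-16, so a scanner can look for them without pairing surrogates at all; the pairing check only matters where the parser has to decide whether a character is legal.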

In summary, I am very skeptical that any preparsed format that 
accepts schema-invalid documents is going to offer significant 
speedups across different platforms and languages. I do not accept as 
an axiom that binary formats are naturally faster to parse than text 
formats. Possibly this can be proved by experiment, but I tend to 
doubt it.
-- 

   Elliotte Rusty Harold
   elharo@m...
   Processing XML with Java (Addison-Wesley, 2002)
   http://www.cafeconleche.org/books/xmljava
   http://www.amazon.com/exec/obidos/ISBN%3D0201771861/cafeaulaitA
