[XML-DEV Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message]

RE: Textual transmission typically faster?

  • From: David Megginson <david@m...>
  • To: xml-dev@l...
  • Date: Thu, 18 Jan 2001 10:15:50 -0500 (EST)

transmission of textual data
Eugene Kuznetsov writes:

 > That shouldn't be surprising -- generally speaking a parse tree is
 > larger than the text input to the parser. (And java serialization 
 > doesn't win any awards for efficiency in either time or space).
 > 
 > However, a binary representation of the same data (specified in 
 > ASN.1 and encoded using BER, say) would be much smaller and more
 > efficient.

Consider

  <foobar>data</foobar>

If we are using UTF-8, this will require 21 bytes.  

In a non-navigable binary format (i.e. for one-pass reading, like XML
itself), we would have the "foobar" name in a name table and would end
up with something like this:

  1 byte: type code for start element
  4 bytes: pointer to name "foobar" in table
  2 bytes: number of attributes (0)
  1 byte: type code for text
  4 bytes: data length (4)
  4 bytes: data
  1 byte: type code for end element

Using UTF-8 (again) this gives us 17 bytes, a slight improvement.  If
the amount of text were increased to only say, 20 bytes, then the
ratio would be 37:33 rather than 21:17 (not that exciting at all), and
that's pre-compression.  If you knew for sure that none of your
documents would be very long, you could shorten some of the pointers
to 3 or even 2 bytes, but then your format wouldn't be generally
usable for XML processing.

In a navigable binary format, things become much worse, because the
binary format needs left/right sibling and parent/first-child
pointers, so we end up with something like this:

  1 byte: type code for start element
  4 bytes: pointer to name in table
  4 bytes: pointer to left sibling
  4 bytes: pointer to right sibling
  4 bytes: pointer to parent
  4 bytes: pointer to first child
  2 bytes: number of attributes (0)
  1 byte: type code for text
  4 bytes: data length (4)
  4 bytes: data
  1 byte: type code for end element

That gives us a total of 33 bytes (assuming no navigation pointers at
the end of the element) in binary format vs. 21 bytes in clear text,
and that's before compression.  There would be significant advantages
to such a binary format, though, since you could create a DOM
interface that could navigate large documents without loading them
completely into memory first.  Just don't count on file-system space
savings.

As for time efficiency, that depends on my much time people put into
writing the loaders.  James Clark's C-based Expat already parses XML
at pretty-much I/O speed, so you'd be hard pressed to beat it.  The
Java-based XML parsers aren't quite as fast, but they're still pretty
impressive (there was a bit of friendly rivalry a couple of years ago
over benchmarks, and several of the main XML parsers were optimized
almost beyond belief).

Even if you could write a one-pass binary-format parser that could
run, say, 20% faster than the best XML parsers in the same language,
that advantage would make almost no difference to total application
time, since, in my experience, actual XML parsing (as opposed to
building object trees, DOM or otherwise) accounts for well under 5% of
execution time for even the simplest real-world applications.  A 20%
improvement in parsing time would give you at most a 1% improvement in
application execution time.  If the document is coming over the net
rather than from a local drive, even that small advantage will
probably be lost in network latency.

In summary, then, there are some good arguments for writing a
navigable XML-like binary format to allow large documents to be
processed with a DOM without loading the whole document into memory.
I can see very little argument for using a binary format simply for
space or efficiency.


All the best,


David

-- 
David Megginson                 david@m...
           http://www.megginson.com/

PURCHASE STYLUS STUDIO ONLINE TODAY!

Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced!

Buy Stylus Studio Now

Download The World's Best XML IDE!

Accelerate XML development with our award-winning XML IDE - Download a free trial today!

Don't miss another message! Subscribe to this list today.
Email
First Name
Last Name
Company
Subscribe in XML format
RSS 2.0
Atom 0.3
 

Stylus Studio has published XML-DEV in RSS and ATOM formats, enabling users to easily subcribe to the list from their preferred news reader application.


Stylus Studio Sponsored Links are added links designed to provide related and additional information to the visitors of this website. they were not included by the author in the initial post. To view the content without the Sponsor Links please click here.

Site Map | Privacy Policy | Terms of Use | Trademarks
Free Stylus Studio XML Training:
W3C Member
Stylus Studio® and DataDirect XQuery ™are products from DataDirect Technologies, is a registered trademark of Progress Software Corporation, in the U.S. and other countries. © 2004-2013 All Rights Reserved.