
Re: Compiled XML


On Tuesday 26 March 2002 20:02, Mike Champion wrote:

> The first issue is definitely one best handled by compression, but whether
> a generic compression such as gzip or a compression scheme that exploits
> the specific regularity of XML is still debated.  Well, at least for
> wireless devices one can make a credible case that an XML-specific
> compression scheme is more efficient of the various limited resources on a
> wireless device.  In general, though, you would be well-advised to not try
> to compress arbitrary text better than gzip can.  You'll fail.

Ahem! Arithmetic encoding and block sorting come to mind for a start - 
combine those two and you can shave off a good few tens of percent over gzip, 
IIRC... and even using gzip, you can do better by gzipping the element 
content separately from the XML syntax.

Eg, 
<person><name>Alaric</name><email>alaric@a...</email></person> 
goes into "person_name_Alaric_email_alaric@a...", gzipped (the _ is 
U+001E, RECORD SEPARATOR - those control characters come in handy!) along 
with (preceded by?) a string of packed 3-bit codes, where the possible values 
are:

000 - text node; read a string from the data stream up to a U+001E
001 - open element; read the element name from the data stream up to a U+001E
010 - close element
011 - attribute; read the attribute name from the data stream up to a U+001E, 
then the attribute value up to a U+001E
100 - processing instruction, also used for <?xml version='1.0'?>; this is a  
      purely *syntactic* encoding. Content read from data stream.
101 - comment; content read from data stream
110 - <!DOCTYPE [content read from data stream]>
111 - End of document
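
A minimal sketch of how such 3-bit codes could be packed into bytes. The 
code values come from the table above; the constant names, the MSB-first bit 
order and the zero padding of the final byte are assumptions made for 
illustration, not part of the scheme as posted.

```python
# Code values from the table above; the names are illustrative only.
TEXT, OPEN, CLOSE, ATTR, PI, COMMENT, DOCTYPE, END = range(8)


def pack_codes(codes):
    """Pack 3-bit codes MSB-first into bytes, zero-padding the last byte."""
    acc = 0      # bit accumulator
    nbits = 0    # number of pending bits held in acc
    out = bytearray()
    for code in codes:
        if not 0 <= code <= 7:
            raise ValueError("code does not fit in 3 bits")
        acc = (acc << 3) | code
        nbits += 3
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
            acc &= (1 << nbits) - 1   # keep only the bits not yet written
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)


def unpack_codes(data):
    """Yield 3-bit codes up to and including END (111)."""
    acc = 0
    nbits = 0
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 3:
            nbits -= 3
            code = (acc >> nbits) & 0b111
            acc &= (1 << nbits) - 1
            yield code
            if code == END:
                return   # anything after END is padding


# The <person> example: open, open, text, close, open, text, close, close, end.
person_codes = [OPEN, OPEN, TEXT, CLOSE, OPEN, TEXT, CLOSE, CLOSE, END]
packed = pack_codes(person_codes)   # 27 bits of codes fit in 4 bytes
```

The explicit END code matters here: the zero padding bits would otherwise 
decode as spurious 000 (text node) codes.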

As one potential optimisation (gzip's DEFLATE window is only 32 KB, so it 
sometimes needs some hand holding with repeated strings), you could define 
that a string in the data stream consisting of U+001B (ESCAPE) followed by a 
16-bit unsigned integer in network byte order is treated as a repeat of the 
string that many strings ago - this is useful for dealing with element and 
attribute names and even some repeated content.
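
The back-reference idea can be sketched as follows. Several details are 
assumptions that the description leaves open: a distance of 1 means the 
immediately preceding string, back-references count as strings in the 
history, a back-reference is a fixed 3 bytes with no U+001E after it, and a 
reference is only emitted when it is shorter than the literal. The function 
names are hypothetical. Literal strings never start with U+001B here, which 
is safe for XML 1.0 content since that character is not allowed in it.

```python
import struct

RS = b"\x1e"   # U+001E, terminates each literal string
ESC = 0x1B     # U+001B, introduces a back-reference


def encode_strings(strings):
    """Encode strings as UTF-8, replacing repeats with ESC + uint16 distance."""
    out = bytearray()
    last_seen = {}   # string -> index of its most recent occurrence
    for i, s in enumerate(strings):
        raw = s.encode("utf-8")
        prev = last_seen.get(s)
        distance = i - prev if prev is not None else None
        # A reference costs 3 bytes; a literal costs len(raw) + 1 for the RS.
        if distance is not None and distance <= 0xFFFF and len(raw) + 1 > 3:
            out.append(ESC)
            out += struct.pack("!H", distance)   # network byte order
        else:
            out += raw + RS
        last_seen[s] = i
    return bytes(out)


def decode_strings(data):
    """Inverse of encode_strings; returns the list of decoded strings."""
    history = []
    pos = 0
    while pos < len(data):
        if data[pos] == ESC:
            # Read exactly two bytes: the integer may itself contain 0x1E.
            (distance,) = struct.unpack_from("!H", data, pos + 1)
            history.append(history[-distance])
            pos += 3
        else:
            end = data.index(RS, pos)
            history.append(data[pos:end].decode("utf-8"))
            pos = end + 1
    return history


names = ["person", "name", "Alaric", "person", "name", "Bob"]
encoded = encode_strings(names)
```

In this example the second "person" and "name" each become the three bytes 
1B 00 03, "the string three strings ago".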

Decoding consists of opening the command and data streams side by side (for 
streaming, ideally they would be in two intertwined gzipped streams) and 
converting the command stream into SAX events, pulling stuff from the data 
stream when required. Encoding consists of converting SAX events to command 
stream codes, merging adjacent character events and removing whitespace.
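
A rough sketch of that encode/decode loop using Python's xml.sax. To keep 
it short, the command and data streams are plain Python lists rather than 
packed, gzipped byte streams, and only the element, attribute and text codes 
are handled. The names Encoder and replay are hypothetical, "removing 
whitespace" is interpreted as stripping each merged text run and dropping 
empty ones, and attribute codes are assumed to directly follow their open 
element code.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

TEXT, OPEN, CLOSE, ATTR, END = 0b000, 0b001, 0b010, 0b011, 0b111


class Encoder(xml.sax.ContentHandler):
    """Turn SAX events into a command stream and a data stream."""

    def __init__(self):
        super().__init__()
        self.codes = []   # command stream (one 3-bit code per entry)
        self.data = []    # data stream (one string per entry)
        self._text = []   # adjacent character events awaiting a merge

    def _flush_text(self):
        text = "".join(self._text).strip()
        self._text = []
        if text:
            self.codes.append(TEXT)
            self.data.append(text)

    def startElement(self, name, attrs):
        self._flush_text()
        self.codes.append(OPEN)
        self.data.append(name)
        for key, value in attrs.items():
            self.codes.append(ATTR)
            self.data += [key, value]

    def endElement(self, name):
        self._flush_text()
        self.codes.append(CLOSE)   # the name is not stored

    def characters(self, content):
        self._text.append(content)

    def endDocument(self):
        self.codes.append(END)


def replay(codes, data, handler):
    """Convert the command stream back into SAX events on handler."""
    strings = iter(data)
    stack = []   # open element names, since CLOSE carries no name
    i = 0
    handler.startDocument()
    while codes[i] != END:
        code = codes[i]
        i += 1
        if code == OPEN:
            name = next(strings)
            attrs = {}
            while codes[i] == ATTR:   # SAX wants attributes with the start tag
                key = next(strings)
                attrs[key] = next(strings)
                i += 1
            stack.append(name)
            handler.startElement(name, AttributesImpl(attrs))
        elif code == CLOSE:
            handler.endElement(stack.pop())
        elif code == TEXT:
            handler.characters(next(strings))
        else:
            raise ValueError(f"code {code:03b} not handled in this sketch")
    handler.endDocument()


doc = "<person><name>Alaric</name><email>alaric@a...</email></person>"
encoder = Encoder()
xml.sax.parseString(doc.encode("utf-8"), encoder)

regenerated = io.StringIO()
replay(encoder.codes, encoder.data, XMLGenerator(regenerated))
```

Here XMLGenerator stands in for whatever application handler would 
normally consume the events; it writes an XML declaration followed by the 
reconstructed markup.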

That was just off the top of my head - there is potential for improvement, 
of course.

> The binary XML issue comes up every few months and generates a lot of
> dispute. The "mainstream" position seems to be that XML is really not all
> that hard to parse, the parsers are well-optimized, the overhead of doing
> the byte swapping and other binary format conversion to transfer parsed
> data from one platform to another outweighs any theoretical advantage of
> having a "compiled" form,

Endianness conversion is less of a hassle than converting to and from 
ASCII-coded decimal, I would like to note :-)

Endianness conversion is as little as a single instruction on most CPUs, 
while converting from base 10 involves... integer multiplication! Looping! 
Exception handling! Ew!
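
To make the contrast concrete, here is an illustrative sketch of the two 
decoding paths. The function names are hypothetical, and Python hides the 
machine-level cost, so this shows the shape of the work rather than a 
benchmark.

```python
import struct


def read_u32_be(buf):
    """Fixed-width binary: one unpack call, which swaps bytes if needed."""
    (value,) = struct.unpack("!I", buf)   # "!" means network (big-endian) order
    return value


def parse_decimal(text):
    """ASCII decimal: a loop, a multiply per digit, and error paths."""
    if not text:
        raise ValueError("empty number")
    value = 0
    for ch in text:
        digit = ord(ch) - ord("0")
        if not 0 <= digit <= 9:
            raise ValueError(f"not a digit: {ch!r}")
        value = value * 10 + digit
    return value


from_binary = read_u32_be(b"\x00\x01\xe2\x40")
from_text = parse_decimal("123456")
```

Both calls yield the same integer; the binary path has no per-digit loop 
and only one failure mode (a buffer of the wrong length).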

> and that the whole issue is a red herring.  I
> expect that the holders of the minority view (Hi, Al!)

Hi, Mike! How's the weather? :-)

> will let you know
> their response.

ABS

-- 
                               Alaric B. Snell
 http://www.alaric-snell.com/  http://RFC.net/  http://www.warhead.org.uk/
   Any sufficiently advanced technology can be emulated in software  
