Re: Compiled XML
On Tuesday 26 March 2002 20:02, Mike Champion wrote:
> The first issue is definitely one best handled by compression, but whether
> a generic compression such as gzip or a compression scheme that exploits
> the specific regularity of XML is still debated. Well, at least for
> wireless devices one can make a credible case that an XML-specific
> compression scheme is more efficient of the various limited resources on a
> wireless device. In general, though, you would be well-advised to not try
> to compress arbitrary text better than gzip can. You'll fail.

Ahem! Arithmetic encoding and block sorting come to mind for a start -
combine those two and you can shave off a good few tens of percent over
gzip, IIRC... and even using gzip, you can do better by gzipping the
element content separately from the XML syntax.

E.g., <person><name>Alaric</name><email>alaric@a...</email></person> goes
into "person_name_Alaric_email_alaric@a...", gzipped (the _ is U+001E,
RECORD SEPARATOR - those control characters come in handy!) along with
(preceded by?) a string of packed 3-bit codes, where the possible values
are:

000 - text node; read a string from the data stream up to a U+001E
001 - open element; read the element name from the data stream up to a
      U+001E
010 - close element
011 - attribute; read the attribute name from the data stream up to a
      U+001E, then the attribute value up to a U+001E
100 - processing instruction, also used for <?xml version='1.0'?>; this is
      a purely *syntactic* encoding. Content read from data stream.
101 - comment; content read from data stream
110 - <!DOCTYPE [content read from data stream]>
111 - end of document

As one potential optimisation (gzip has a limited window size, so it
sometimes needs some hand holding with repeated strings), you could define
that a string in the data stream of the form U+001B (ESCAPE) followed by a
16-bit network byte order unsigned integer is considered a repeat of the
string that many strings ago - this is useful for dealing with element and
attribute names and even some repeated content.

Decoding consists of opening the command and data streams side by side
(for streaming, ideally they would be in two intertwined gzipped streams)
and converting the command stream into SAX events, pulling stuff from the
data stream when required.

Encoding consists of converting SAX events to command stream codes,
merging adjacent character events and removing whitespace.

That was just off the top of my head - there is potential for improvement,
of course.

> The binary XML issue comes up every few months and generates a lot of
> dispute. The "mainstream" position seems to be that XML is really not all
> that hard to parse, the parsers are well-optimized, the overhead of doing
> the byte swapping and other binary format conversion to transfer parsed
> data from one platform to another outweighs any theoretical advantage of
> having a "compiled" form,

Endianness conversion is less of a hassle than converting to and from
ASCII-coded decimal, I would like to note :-) Endianness conversion is as
little as a single instruction on most CPUs, while converting from base 10
involves... integer multiplication! Looping! Exception handling! Ew!

> and that the whole issue is a red herring. I
> expect that the holders of the minority view (Hi, Al!)

Hi, Mike! How's the weather? :-)

> will let you know
> their response.

ABS

-- 
Alaric B. Snell
http://www.alaric-snell.com/  http://RFC.net/  http://www.warhead.org.uk/
Any sufficiently advanced technology can be emulated in software
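
For readers who want to experiment, here is a rough Python sketch of the
command/data split described in the post above. It is only one possible
reading of the scheme, not the author's implementation. The names Encoder,
pack_codes, unpack_codes, encode and decode are made up for illustration,
and user@example.org is a placeholder address. The sketch assumes Python's
standard xml.sax and gzip modules, handles only codes 000, 001, 010, 011 and
111 (processing instructions, comments, DOCTYPE and the U+001B back-reference
are left out), returns the command stream uncompressed and separate from the
data stream, and assumes no text contains U+001E. "Removing whitespace" is
interpreted here as trimming each merged text run and dropping empty ones.

```python
import gzip
import xml.sax

RS = "\x1e"  # U+001E, the string terminator used in the data stream
TEXT, OPEN, CLOSE, ATTR, END = 0b000, 0b001, 0b010, 0b011, 0b111


class Encoder(xml.sax.ContentHandler):
    """Turns SAX events into a list of 3-bit codes plus a list of strings."""

    def __init__(self):
        super().__init__()
        self.codes = []    # command stream, one small int per event
        self.strings = []  # data stream, in the order the decoder reads it
        self._text = []    # pending character events, merged on flush

    def _flush_text(self):
        text = "".join(self._text).strip()
        self._text = []
        if text:
            self.codes.append(TEXT)
            self.strings.append(text)

    def startElement(self, name, attrs):
        self._flush_text()
        self.codes.append(OPEN)
        self.strings.append(name)
        for key, value in attrs.items():
            self.codes.append(ATTR)
            self.strings += [key, value]

    def endElement(self, name):
        self._flush_text()
        self.codes.append(CLOSE)

    def characters(self, content):
        self._text.append(content)

    def endDocument(self):
        self.codes.append(END)


def pack_codes(codes):
    """Pack 3-bit codes into bytes, most significant bits first."""
    bits = 0
    for code in codes:
        bits = (bits << 3) | code
    nbits = 3 * len(codes)
    pad = -nbits % 8  # zero bits appended to reach a byte boundary
    return (bits << pad).to_bytes((nbits + pad) // 8, "big")


def unpack_codes(packed):
    """Yield 3-bit codes until the end-of-document code is seen."""
    bits = int.from_bytes(packed, "big")
    for shift in range(len(packed) * 8 - 3, -1, -3):
        code = (bits >> shift) & 0b111
        yield code
        if code == END:
            return


def encode(xml_bytes):
    """Return (packed command stream, gzipped data stream)."""
    enc = Encoder()
    xml.sax.parseString(xml_bytes, enc)
    data = "".join(s + RS for s in enc.strings).encode("utf-8")
    return pack_codes(enc.codes), gzip.compress(data)


def decode(commands, data):
    """Replay the command stream as a list of simple event tuples."""
    strings = iter(gzip.decompress(data).decode("utf-8").split(RS))
    events = []
    for code in unpack_codes(commands):
        if code == OPEN:
            events.append(("open", next(strings)))
        elif code == ATTR:
            events.append(("attr", next(strings), next(strings)))
        elif code == TEXT:
            events.append(("text", next(strings)))
        elif code == CLOSE:
            events.append(("close",))
        elif code == END:
            events.append(("end",))
    return events
```

Calling encode(b"<person><name>Alaric</name><email>user@example.org</email></person>")
and passing both results to decode should give back the open/text/close
events in document order. A real decoder would fire SAX callbacks rather than
build a list, and on a document this small gzip's fixed overhead will
outweigh any saving; the split only has a chance of paying off on larger
inputs.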