
Re: XML / HTML Transport size


[sorry if this gets through twice, I got weird messages from the mail
server]

John Cowan wrote:
 > Robin Berjon scripsit:
 >>>A variety of small-scale studies have shown that general-purpose
 >>>compression is generally as good as, or better than, some scheme that
 >>>knows it's compressing XML.
 >>
 >>Err, quite the opposite.  XMill beats gzip.
 >
 > This one is news to me, but I'm looking into it now.

You may also wish to take a look at Box (http://box.sf.net/). I don't
remember how well it compares to gzip in compression but it's fast to
decode (the website is down today with all other SF sites so I can't
look it up right now).
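
For anyone who wants to poke at the gzip-versus-XML-aware question on their
own data, here is a rough Python sketch. It is not XMill or Box, only an
illustration of the general idea of separating structure from text and
grouping text values per element name before compressing. The function
names are made up for this example, attributes and tail text are ignored,
and nothing is written back out, so it estimates sizes rather than
producing a reversible encoding.

```python
import xml.etree.ElementTree as ET
import zlib
from collections import defaultdict


def plain_size(xml_text: str) -> int:
    """Size in bytes of the whole document compressed as one zlib stream."""
    return len(zlib.compress(xml_text.encode("utf-8"), 9))


def split_size(xml_text: str) -> int:
    """Total size when structure and text are compressed as separate streams.

    Text values go into one container per element name, so that similar
    values sit next to each other. Attributes and tail text are ignored.
    """
    structure = []                   # open tags and "/" close markers, in order
    containers = defaultdict(list)   # element name -> text values

    def walk(elem):
        structure.append(elem.tag)
        if elem.text and elem.text.strip():
            containers[elem.tag].append(elem.text.strip())
        for child in elem:
            walk(child)
        structure.append("/")

    walk(ET.fromstring(xml_text))
    streams = ["\n".join(structure)]
    streams += ["\n".join(values) for _, values in sorted(containers.items())]
    return sum(len(zlib.compress(s.encode("utf-8"), 9)) for s in streams)
```

Which of the two numbers comes out smaller depends on the document. Each
separate stream carries its own fixed overhead, so small inputs in
particular may well favour the single stream.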

 >>BiM/BiX requires a schema,
 >
 > Yes: by "knows it's compressing XML" I meant to imply "and doesn't know
 > anything more than that".

I know, and that obviously makes things a little bit more complicated.
However in most non-pathological cases it is possible to apply
machine-learning techniques to deduce schema information (it also works
on pathological cases -- i.e. instances for which the only fathomable
pattern is the instance itself -- but it's rather useless there). That's
something we're seriously investigating in order to efficiently support
xs:any and xs:anyAttribute (for instance).

There is also a fair number of cases in which there is no schema per se,
but it can be usefully inferred from other metadata such as a WSDL
document, an XQuery...

 >>but there are many ways in which a schema can be deduced, even with just
 >>a raw document (and it can be done more intelligently than most tools
 >>that deduce schema information from instances I've seen out there do
 >>it).
 >
 > Pointer(s)?

The schema deducers I was referring to are the one included in Castor,
and the one on gotdotnet.com:

    http://www.castor.org/
    http://gotdotnet.com/team/xmltools/xsdinference/

Those tools are probably useful in cases where you just need a schema
and don't care whether it is the simplest schema for the given instance
or set of instances. They tend to produce schemata that are pretty much
snapshots of the instance and more or less exactly mirror it.
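
To make the snapshot-versus-generalisation distinction concrete, here is a
minimal Python sketch. It is neither how Castor or the gotdotnet tool work
nor our inferencer, just an illustration under simple assumptions: every
occurrence of an element name, across all sample documents, is folded into
one shared content model that records the minimum and maximum number of
times each child was seen. The function name and the output shape are
invented for this example, and child order, attributes and datatypes are
ignored.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def infer_content_models(documents):
    """Map each element name to {child name: (min_occurs, max_occurs)}.

    Merging all occurrences of a name into one model is what makes the
    result more general than a mirror of any single instance.
    """
    models = {}  # element name -> {child name: [min, max]}
    for text in documents:
        for elem in ET.fromstring(text).iter():
            counts = Counter(child.tag for child in elem)
            model = models.get(elem.tag)
            if model is None:
                # First occurrence: the model is exactly what we see.
                models[elem.tag] = {name: [n, n] for name, n in counts.items()}
                continue
            for name in set(model) | set(counts):
                n = counts.get(name, 0)
                if name in model:
                    lo, hi = model[name]
                    model[name] = [min(lo, n), max(hi, n)]
                else:
                    # Absent from earlier occurrences, so it is optional.
                    model[name] = [0, n]
    return {tag: {c: tuple(r) for c, r in m.items()} for tag, m in models.items()}
```

Given the two samples `<a><b/><b/><c/></a>` and `<a><b/></a>`, the model for
`a` ends up with `b` occurring one to two times and `c` zero to one times,
rather than two separate snapshots.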

The schema inferencer we're developing tries hard to get the simplest
schema. The reason for this is that we need it to produce a schema that
strikes the correct balance between generality and concision. Obviously
if you are to send a decoder update (using decoder bytecode) in the
stream, you want that update to save more bytes on the encoded data than
it costs to send the decoder itself. I expect to have something to show
in that area early next year.
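
The break-even condition is simple arithmetic. As a sketch in Python (the
function and parameter names are hypothetical, and real sizes would have to
be measured or estimated by the encoder):

```python
def update_pays_off(decoder_bytes: int, size_without: int, size_with: int) -> bool:
    """True if sending the decoder update makes the total stream smaller.

    decoder_bytes: cost of shipping the decoder update itself
    size_without:  encoded size of the remaining data without the update
    size_with:     encoded size of the same data once the update is in place
    """
    return decoder_bytes + size_with < size_without
```

A simpler, more general schema tends to keep `decoder_bytes` small, while a
more specific one tends to shrink `size_with`, which is the balance
described above.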

-- 
Robin Berjon <robin.berjon@e...>
Research Engineer, Expway
7FC0 6F5F D864 EFB8 08CE  8E74 58E6 D5DB 4889 2488


