Re: XML and entropy, again
Michael Champion wrote:

> We had a classically xml-devish thread back in October about the
> implications of Shannon's information theory for XML. I must say I
> didn't understand much of that thread, but Kurt Cagle has an
> intriguing entry in his weblog
> http://metaphoricalweb.blogspot.com/2004/12/xml-and-entropy.html that
> puts forth some ideas that seem both interesting and somewhat
> practical.
>
> "Entropy is important because it can better clarify the domain at
> which it is best to work with a given document. XQuery I think
> provides a good case in point here. XQuery supports XPath, and so it
> has some of the advantages that XSLT has, but it's not really all that
> useful for dealing with documents -- converting a DocBook document
> into WordML or vice versa would be impossible in XQuery, but for many
> business schemas with comparatively low entropies, XSLT is definitely
> overkill."
>
> I for one like the idea of his interpretation of the entropy of an
> XML document in terms of the number of discrete states that its
> (implicit or explicit?) schema allows. I also like the idea that
> certain tools are more or less appropriate depending on the entropy of
> the documents being processed -- perhaps it's something like SAX and
> DOM for low entropy, XQuery for medium entropy, and XSLT for high
> entropy (very document-ish) documents. I wonder, however, about the
> assertions made for the appropriateness of XQuery and XSLT, e.g.
> "converting a DocBook document into WordML or vice versa would be
> impossible in XQuery". It gets back into our XSLT vs XQuery
> permathread -- do the two have radically different capabilities with
> respect to handling recursive structures and/or recursive algorithms,
> or are they more or less different syntaxes for the same capabilities?
>
> Thoughts, anyone?
> Sorry to reopen the permathread, but I think Kurt's approach might
> lead to a more focused and possibly conclusive discussion. Maybe we
> can all trade ideas about this with our relatives over the holidays :-)

I think Cagle's interpretation is fundamentally wrong (and I may be
slightly at odds with Shannon too, but I don't think so). The number of
possible states is not the entropy of a particular message, and in
Cagle's example of the two-bit integer he is talking about the limit of
the number of possible states as entropy increases. Arguments about the
smallest program and so on aside, a message is a subset of the possible
states, and it's the message that has entropy, not the set of possible
states. If the message can be described simply, it has low entropy; if
it can't be described simply, it has high entropy, relative to the
total number of possible states. (Remember, that's how we're trying to
find the aliens: very low entropy must come from simple physical
processes, very high entropy is just background noise, but something in
between, we postulate, must come from a living source.)

The entropy of a message is therefore not a function of the complexity
of a schema, but of the information content (the character data that
Cagle chose to ignore) within it. The schema adds structure and
therefore a minimal amount of information; the character data adds all
the real information. So the total number of possible states is
determined not by the schema but by the allowed content of the
messages. If a schema allows any number of repetitions of even one
element, or allows the content of one element to be unbounded, then the
total number of states of the document is theoretically unlimited. If
everything is bounded, then you can set an upper limit on the number of
states. I suspect most schemas out there actually fall into the former
category.
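The point that entropy belongs to the message's content, not to the schema's state count, can be illustrated with a short Python sketch (my own illustration, not from the thread; the function name and sample documents are hypothetical). Two documents with identical markup structure can have very different entropy depending only on their character data:

```python
import math
from collections import Counter

def shannon_entropy(message: str) -> float:
    """Shannon entropy (bits per symbol) of a message, estimated
    from its empirical character distribution."""
    counts = Counter(message)
    n = len(message)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Two documents valid against the same trivial schema (<doc>text</doc>).
# The markup contributes the same structure to both; the character data
# drives the difference in entropy.
low  = "<doc>" + "aaaaaaaaaaaaaaaa" + "</doc>"   # repetitive, easily described content
high = "<doc>" + "q7#Lp2@xZ!mK9$vB" + "</doc>"   # varied content, resists short description

print(shannon_entropy(low))   # lower
print(shannon_entropy(high))  # higher
```

This is only a character-level estimate, but it captures the distinction: the schema fixes the envelope, while the content determines how surprising any particular message is.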
But as these are open systems, we can inject some energy into the
system by way of improved schema design and reduce the total number of
states, and therefore the entropy of the messages, by getting to a
completely bounded state.

All of this seems to me irrelevant to the idea of recursive (XSLT) vs
non-recursive (XQuery) processing. It is not irrelevant, however, to a
discussion of DOM vs SAX. DOM processing implies the ability to
represent the whole document in memory before processing, which in turn
implies that the total document size must be limited. SAX, on the other
hand, can keep processing the stream (forever, if necessary). It also
implies that an unbounded schema must be impossible to validate in the
general case.

And finally, for those who like playing with pipes, you could establish
network plumbing from input processes taking an infinite stream,
converting it to conform to a schema, and passing it to SAX processors
that can consume infinite messages, identifying bounded components and
passing them off to either other SAX- or DOM-based processes for
further processing.

Rick
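The DOM-vs-SAX point above can be sketched with Python's standard-library `xml.sax` (my own example, not from the thread): a SAX `ContentHandler` sees each element as an event and keeps only what it chooses to, so its memory use stays constant no matter how long the document runs, whereas a DOM build must hold the entire tree first.

```python
import xml.sax
from io import StringIO

class ElementCounter(xml.sax.ContentHandler):
    """Counts elements as parse events arrive. Only the running tallies
    are retained, so memory use does not grow with document length --
    unlike building a DOM tree before processing."""
    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1

# A small stand-in for a (potentially unbounded) stream of repeated elements.
doc = "<orders>" + "<order><item/></order>" * 3 + "</orders>"
handler = ElementCounter()
xml.sax.parse(StringIO(doc), handler)
print(handler.counts)
```

Because the handler never materializes the document, the same code could in principle consume an endless stream of `<order>` elements, which is exactly the case a DOM-based processor cannot handle.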
--
Rick Marshall
rjm@z...
cell: +61 411 287 530