Re: Data streams
thank you steven. that was the experiment i proposed a month or so ago. and you have just shown very neatly that the entropy of the message hasn't changed with the representation.

the only piece missing is to put out a file of a million 32 bit integers (4MB by definition) and see how much it compresses - ie more than 50%? then we really do have a lower bound on the entropy.

i'm choosing to ignore the compact formula/algorithmic representation at this stage because that's not a general solution.

regards
rick

Steven J. DeRose wrote:

> At 16:31 +0000 2004-12-04, Michael Kay wrote:
> At 10:42 -0500 2004-12-04, tedd wrote:
>
>>> In everything I have read, it appears that every chunk of content
>>> must be encapsulated by tags, such as:
>>>
>>> <data>123.456</data>
>>
>> These are both legitimate XML documents. The question is, what are you
>> trying to achieve? If you use XML markup around the data, then an XML
>> parser (and tools such as XSLT) will understand it. If you use commas,
>> then you have to parse it yourself. If you want to parse the data by
>> hand, then why use XML in the first place?
>
> Michael makes a good point -- how to do it depends on your goals.
>
> But to be extra clear, there is no real way to make XML *itself* aware
> of fields that are delimited only by commas (or some other single
> character delimiter). Such syntax was considered as a possibility but
> rejected. SGML can do this via the SHORTREF feature, if you're
> absolutely set on it.
>
> For cases like your example, where there is very little structure to
> demarcate, it seems important: a million copies of "<data></data>"
> versus "," adds up. However, consider:
>
> 1: if your data is "text files that are literally tens of thousands of
> characters in length", that is small enough that the overhead won't
> disturb most software running even on a cell phone.
> If we were talking many millions or billions of *records*, then this
> would be more of an issue (as it is for some users).
>
> 2: If you want the data formatted by CSS or XSL-FO, or transformed by
> XSLT, or whatever, having all the data in one syntax that the
> applications *already* know about is much easier than rewriting the
> applications or working around them to add some syntax (like commas)
> that they *don't* know about. You'll never have to debug the XML
> parser you use to parse all those "<data>" tags, but you will spend a
> lot of time if you try to introduce a new syntax in your process.
>
> 3: Any text file that contains zillions of instances of a certain
> string, is necessarily very compressible. The first thing a
> compression program will do is discover that "<data>" is real common,
> and assign it a really short code. A comma-delimited file is
> inherently less compressible.
>
> Here are some empirical results:
>
> I created a file with the numbers from one to a million, delimited in
> different ways. zero.dat has just a linefeed between numbers;
> comma.dat just has a comma and a linefeed; tag01 has a start and
> end-tag with the one-character element type "d" (and the linefeed);
> tag02 has element type "da", on up to tag20 which has a
> 20-character-long element type. 5-line Ruby program available on request.
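The Ruby program itself does not appear in the thread. As a rough sketch only, here is one way the files described above could be produced, in Python rather than Ruby. The exact line formats, the element names beyond "d" and "da", and the helper names are all assumptions. The range 1 to 999,999 is inferred from the file sizes reported in the message (they are consistent with 999,999 lines rather than a full million); the message does not state it.

```python
def render(i, name=None, comma=False):
    """Format one integer as a line in one of the three delimiting styles."""
    if name:
        # Tagged style: start-tag, number, end-tag, linefeed.
        return f"<{name}>{i}</{name}>\n"
    # Comma style adds a comma before the linefeed; otherwise linefeed only.
    return f"{i},\n" if comma else f"{i}\n"


def element_name(length):
    """Hypothetical element names: "d", "da", "dat", ... up to 20 characters."""
    return ("data" * 5)[:length]


def total_size(n_max=999_999, name=None, comma=False):
    """Size in bytes of the file for 1..n_max without writing it (ASCII only)."""
    return sum(len(render(i, name, comma)) for i in range(1, n_max + 1))


def write_file(path, n_max=999_999, name=None, comma=False):
    """Write the integers 1..n_max, one per line; return the byte count."""
    size = 0
    with open(path, "w", encoding="ascii", newline="\n") as out:
        for i in range(1, n_max + 1):
            size += out.write(render(i, name, comma))
    return size
```

Under these assumptions, `total_size()` comes to 6888888 and `total_size(name=element_name(20))` to 51888843, the sizes reported for zero.dat and tag20.dat. A call such as `write_file("tag20.dat", name=element_name(20))` would write the corresponding file to disk.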
>
> Here are the original sizes:
>
>    6888888  4 Dec 13:35 zero.dat
>    7888887  4 Dec 12:50 comma.dat
>   13888881  4 Dec 12:51 tag01.dat
>   15888879  4 Dec 12:52 tag02.dat
>   17888877  4 Dec 12:53 tag03.dat
>   19888875  4 Dec 12:56 tag04.dat
>   21888873  4 Dec 13:06 tag05.dat
>   31888863  4 Dec 13:07 tag10.dat
>   51888843  4 Dec 13:08 tag20.dat
>
> Here are the sizes after gzipping:
>
>    2129148  4 Dec 13:35 zero.dat.gz
>    2130082  4 Dec 12:50 comma.dat.gz
>    2377733  4 Dec 12:51 tag01.dat.gz
>    2376912  4 Dec 12:52 tag02.dat.gz
>    2518197  4 Dec 12:53 tag03.dat.gz
>    2638489  4 Dec 12:56 tag04.dat.gz
>    2631120  4 Dec 13:06 tag05.dat.gz
>    2661673  4 Dec 13:07 tag10.dat.gz
>    2596261  4 Dec 13:08 tag20.dat.gz
>
> You can see that:
>
> the linefeed-only file reduces to 2129148 / 6888888 = 31% of its
> original size
>
> the 20-char tagged file reduces to 2596261 / 51888843 = 5% of its
> original size
>
> And even though the 20-char tagged file was over 7.5 times bigger than
> the linefeed-only file when uncompressed, once they're compressed it
> is only about 1.2 times bigger -- a mere 22% increase despite every
> number having 2 tags with 20-character tag names, instead of nothing
> but a line break.
>
> I wouldn't worry about the extra bytes much. If you've got enough data
> for it to matter, buy a disk-compression utility and you can forget
> the issue.
>
> Steve DeRose
>
> -----------------------------------------------------------------
> The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> initiative of OASIS <http://www.oasis-open.org>
>
> The list archives are at http://lists.xml.org/archives/xml-dev/
>
> To subscribe or unsubscribe from this list use the subscription
> manager: <http://www.oasis-open.org/mlmanage/index.php>
>

begin:vcard
fn:Rick Marshall
n:Marshall;Rick
email;internet:rjm@z...
tel;cell:+61 411 287 530
x-mozilla-html:TRUE
version:2.1
end:vcard
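For anyone wanting to repeat the measurements in this thread, including the packed 32 bit integer file proposed at the top of this message, the following is a minimal sketch in Python. It is not the program used for the figures above. The function names, the 20-character element name "dddddddddddddddddddd", the big-endian packing, and the reduced count of 100,000 integers are all assumptions made to keep the run short. Byte counts from gzip vary with the implementation and compression level, so no expected output is shown. A compressed size is also only an upper estimate of the information content, not a proof of a bound.

```python
import gzip
import struct


def text_payload(n_max, name=None):
    """Integers 1..n_max as ASCII text, one per line, optionally tagged."""
    if name:
        body = "".join(f"<{name}>{i}</{name}>\n" for i in range(1, n_max + 1))
    else:
        body = "".join(f"{i}\n" for i in range(1, n_max + 1))
    return body.encode("ascii")


def binary_payload(n_max):
    """The same integers packed as unsigned 32-bit big-endian, 4 bytes each."""
    return struct.pack(f">{n_max}I", *range(1, n_max + 1))


def report(label, raw):
    """Gzip the bytes in memory, print both sizes, return (raw, compressed)."""
    packed = gzip.compress(raw)
    ratio = len(packed) / len(raw)
    print(f"{label:8} {len(raw):>10} -> {len(packed):>9} ({ratio:.0%})")
    return len(raw), len(packed)


n = 100_000  # assumption: far fewer integers than in the thread's files
plain = report("zero", text_payload(n))
tagged = report("tag20", text_payload(n, "d" * 20))
binary = report("int32", binary_payload(n))
```

Raising `n` to 999,999 gives text inputs the same size as zero.dat and tag20.dat above, at the cost of a longer run and more memory.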