RE: Data streams

To: "Stephen E. Beller" <sbeller@n...>
Subject: RE: Data streams
From: david.lyon@c...
Date: Tue, 7 Dec 2004 10:19:25 +1100
Cc: xml-dev@l...
In-reply-to: <00cd01c4dbbe$8c3ca3b0$6501a8c0@dell8100>
References: <00cd01c4dbbe$8c3ca3b0$6501a8c0@dell8100>
User-agent: Internet Messaging Program (IMP) 3.2.6


All,

Mind if I pull apart this report for some further analysis?

Quoting "Stephen E. Beller" <sbeller@n...>:

> I tried Steven's experiment from a different angle. I filled an Excel XP
> spreadsheet with a single-digit number, saved it in both XML and in a
> comma-delimited text file (CSV). I then compressed both with WinZip and then
> opened both with Excel. Here's what I found:
>
> The XML file was 840MB, the CSV 34MB -- a 2,500% difference
> Compressed, the XML file was 2.5MB, the CSV 0.15MB (150KB) -- a 1,670%
> difference.

True. XML files are usually bigger.

> Equally dramatic is the time it took to uncompress and render the files as
> an Excel spreadsheet: It took about 20 minutes with the XML file; the CSV
> took 1 minute -- a 2,000% difference.

True. The old parts of Excel are written in assembly language
by true masters. They are efficient. CSV support dates from the
same era as that assembly language code.

The new XML parts are written by programmers of the bloatware
era. They are not optimised to the same degree.

They are probably written in high level languages and I would
guess have never been "profiled". That's an old word... maybe
it's something that is never done with xml... wouldn't be surprised.

In perspective, Excel isn't a tool (imho) that a user would
use to deal with xml data in a commercial environment, as
rendering tags is absolutely no use to a business user. They
want the product data printed like a pricelist, or a purchase
order printed like a purchase order. xml tags are alienspeak
or geekspeak at best.

But some people do optimise and profile their XML. I would bet
a "real" xml trading app would fare better than Excel.

> My conclusion is that delimited text files handle large
> arrays of data more efficiently.

Maybe, but only provided a single array is used.

Most business apps need to hold multiple sets of arrays,
hence the need for something like xml.

Finally...

> The XML file was 840MB, the CSV 34MB -- a 2,500% difference
> Compressed, the XML file was 2.5MB, the CSV 0.15MB (150KB) -- a 1,670%
> difference.

Put another way, the compressed xml file was 2.5MB and the
uncompressed CSV file was 34MB.

Therefore, sending compressed XML data is more efficient
than sending uncompressed CSV and requires fewer resources
to transmit.
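
The percentages in the quoted report are ratios times 100, so they
can be checked with a few lines of arithmetic. A minimal sketch in
Python, assuming the compressed CSV figure is the 150KB given in
parentheses (0.15MB); the dictionary keys are only labels chosen for
illustration:

```python
# File sizes in MB as reported in the quoted experiment.
# Assumption: the compressed CSV is taken as 150KB = 0.15MB.
sizes_mb = {"xml": 840.0, "csv": 34.0, "xml_zip": 2.5, "csv_zip": 0.15}


def ratio(a, b):
    """Return how many times larger a is than b."""
    return a / b


def as_percent(r):
    """Express a ratio the way the quoted report does (a ratio times 100)."""
    return r * 100


raw = ratio(sizes_mb["xml"], sizes_mb["csv"])
zipped = ratio(sizes_mb["xml_zip"], sizes_mb["csv_zip"])
csv_vs_xml_zip = ratio(sizes_mb["csv"], sizes_mb["xml_zip"])

print(f"raw XML vs raw CSV:       {raw:.1f}x ({as_percent(raw):.0f}%)")
print(f"zipped XML vs zipped CSV: {zipped:.1f}x ({as_percent(zipped):.0f}%)")
print(f"raw CSV vs zipped XML:    {csv_vs_xml_zip:.1f}x")
```

The last line is the comparison made just above: uncompressed CSV
against compressed XML.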

David

>
> -----Original Message-----
> From: Rick Marshall [mailto:rjm@z...]
> Sent: Saturday, December 04, 2004 4:50 PM
> To: Steven J. DeRose
> Cc: xml-dev@l...
> Subject: Re:  Data streams
>
> thank you steven. that was the experiment i proposed a month or so ago.
> and you have just shown very neatly that the entropy of the message
> hasn't changed with the representation.
>
> the only piece missing is to put out a file of a million 32 bit integers
> (4MB by definition) and see how much it compresses - ie more than 50%?
> then we really do have a lower bound on the entropy. i'm choosing to
> ignore the compact formula/algorithmic representation at this stage
> because that's not a general solution.
>
> regards
>
> rick
>
> Steven J. DeRose wrote:
>
> > At 16:31 +0000 2004-12-04, Michael Kay wrote:
> > At 10:42 -0500 2004-12-04, tedd wrote:
> >
> >>>> In everything I have read, it appears that every chunk of content
> >>>> must be encapsulated by tags, such as:
> >>>>
> >>>> <data>123.456</data>
> >>
> >> These are both legitimate XML documents. The question is, what are you
> >> trying to achieve? If you use XML markup around the data, then an XML
> >> parser (and tools such as XSLT) will understand it. If you use commas,
> >> then you have to parse it yourself. If you want to parse the data by
> >> hand, then why use XML in the first place?
> >
> >
> > Michael makes a good point -- how to do it depends on your goals.
> >
> > But to be extra clear, there is no real way to make XML *itself* aware
> > of fields that are delimited only by commas (or some other single
> > character delimiter). Such syntax was considered as a possibility but
> > rejected. SGML can do this via the SHORTREF feature, if you're
> > absolutely set on it.
> >
> > For cases like your example, where there is very little structure to
> > demarcate, it seems important: a million copies of "<data></data>"
> > versus "," adds up. However, consider:
> >
> > 1: if your data is "text files that are literally tens of thousands of
> > characters in length", that is small enough that the overhead won't
> > disturb most software running even on a cell phone. If we were talking
> > many millions or billions of *records*, then this would be more of an
> > issue (as it is for some users).
> >
> > 2: If you want the data formatted by CSS or XSL-FO, or transformed by
> > XSLT, or whatever, having all the data in one syntax that the
> > applications *already* know about is much easier than rewriting the
> > applications or working around them to add some syntax (like commas)
> > that they *don't* know about. You'll never have to debug the XML
> > parser you use to parse all those "<data>" tags, but you will spend a
> > lot of time if you try to introduce a new syntax in your process.
> >
> > 3: Any text file that contains zillions of instances of a certain
> > string, is necessarily very compressible. The first thing a
> > compression program will do is discover that "<data>" is real common,
> > and assign it a really short code. A comma-delimited file is
> > inherently less compressible.
> >
> > Here are some empirical results:
> >
> > I created a file with the numbers from one to a million, delimited in
> > different ways. zero.dat has just a linefeed between numbers;
> > comma.dat just has a comma and a linefeed; tag01 has a start and
> > end-tag with the one-character element type "d" (and the linefeed);
> > tag02 has element type "da", on up to tag20 which has a
> > 20-character-long element type. 5-line Ruby program available on request.
> >
> > Here are the original sizes:
> >
> >  6888888  4 Dec 13:35 zero.dat
> >  7888887  4 Dec 12:50 comma.dat
> > 13888881  4 Dec 12:51 tag01.dat
> > 15888879  4 Dec 12:52 tag02.dat
> > 17888877  4 Dec 12:53 tag03.dat
> > 19888875  4 Dec 12:56 tag04.dat
> > 21888873  4 Dec 13:06 tag05.dat
> > 31888863  4 Dec 13:07 tag10.dat
> > 51888843  4 Dec 13:08 tag20.dat
> >
> > Here are the sizes after gzipping:
> >
> >  2129148  4 Dec 13:35 zero.dat.gz
> >  2130082  4 Dec 12:50 comma.dat.gz
> >  2377733  4 Dec 12:51 tag01.dat.gz
> >  2376912  4 Dec 12:52 tag02.dat.gz
> >  2518197  4 Dec 12:53 tag03.dat.gz
> >  2638489  4 Dec 12:56 tag04.dat.gz
> >  2631120  4 Dec 13:06 tag05.dat.gz
> >  2661673  4 Dec 13:07 tag10.dat.gz
> >  2596261  4 Dec 13:08 tag20.dat.gz
> >
> > You can see that:
> >
> > the linefeed-only file reduces to 2129148 / 6888888 = 31% of its
> > original size
> > the 20-char tagged file reduces to 2596261 / 51888843  = 5% of its
> > original size
> >
> > And even though the 20-char tagged file was over 7.5 times bigger than
> > the linefeed-only file when uncompressed, once they're compressed it
> > is only about 1.2 times bigger -- a mere 22% increase despite every
> > number having 2 tags with 20-character tag names, instead of nothing
> > but a line break.
> >
> > I wouldn't worry about the extra bytes much. If you've got enough data
> > for it to matter, buy a disk-compression utility and you can forget
> > the issue.
> >
> > Steve DeRose
>
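
Steven's size comparison quoted above is easy to rerun. He offered
his Ruby program on request; what follows is not that program but a
rough Python sketch of the same idea, with Rick's packed 32-bit
integer case added. Assumptions: the element names are prefixes of
"datadata..." (the post only gives "d" and "da"), gzip runs at
level 6, and the range stops at 999,999, which is what reproduces
the uncompressed byte counts he lists. All function names are
made up for this sketch.

```python
import gzip
import struct

N = 1_000_000  # range(1, N) yields 1..999,999


def build(prefix="", suffix="", n=N):
    """Return the numbers 1..n-1, one per line, each wrapped in prefix/suffix."""
    return "".join(f"{prefix}{i}{suffix}\n" for i in range(1, n)).encode("ascii")


def tagged(name_len, n=N):
    """Wrap each number in a start and end tag whose name has name_len characters."""
    name = ("data" * 5)[:name_len]  # hypothetical names: "d", "da", "dat", ...
    return build(f"<{name}>", f"</{name}>", n)


def packed(n=N):
    """Return the same numbers as little-endian unsigned 32-bit integers."""
    return struct.pack(f"<{n - 1}I", *range(1, n))


def report(label, raw):
    """Print raw size, gzipped size, and the gzipped/raw percentage."""
    zipped = len(gzip.compress(raw, compresslevel=6))
    print(f"{label:8s} {len(raw):>10d} {zipped:>10d} {zipped / len(raw):6.1%}")


def main():
    report("zero", build())
    report("comma", build(suffix=","))
    for k in (1, 2, 3, 4, 5, 10, 20):
        report(f"tag{k:02d}", tagged(k))
    report("int32", packed())


if __name__ == "__main__":
    main()
```

The compressed sizes will vary a little with the gzip implementation
and level, so only the uncompressed sizes should be expected to match
the listing exactly.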
