RE: Data streams
At 16:31 +0000 2004-12-04, Michael Kay wrote:
At 10:42 -0500 2004-12-04, tedd wrote:
>> In everything I have read, it appears that every chunk of content
>> must be encapsulated by tags, such as:
>>
>> <data>123.456</data>
>
>These are both legitimate XML documents. The question is, what are you
>trying to achieve? If you use XML markup around the data, then an XML parser
>(and tools such as XSLT) will understand it. If you use commas, then you
>have to parse it yourself. If you want to parse the data by hand, then why
>use XML in the first place?

Michael makes a good point -- how to do it depends on your goals. But to be extra clear, there is no real way to make XML *itself* aware of fields that are delimited only by commas (or some other single-character delimiter). Such syntax was considered as a possibility but rejected. SGML can do this via the SHORTREF feature, if you're absolutely set on it.

For cases like your example, where there is very little structure to demarcate, the overhead seems important: a million copies of "<data></data>" versus "," adds up. However, consider:

1: If your data is "text files that are literally tens of thousands of characters in length", that is small enough that the overhead won't disturb most software, even running on a cell phone. If we were talking many millions or billions of *records*, then this would be more of an issue (as it is for some users).

2: If you want the data formatted by CSS or XSL-FO, or transformed by XSLT, or whatever, having all the data in one syntax that the applications *already* know about is much easier than rewriting the applications or working around them to add some syntax (like commas) that they *don't* know about. You'll never have to debug the XML parser you use to parse all those "<data>" tags, but you will spend a lot of time if you try to introduce a new syntax in your process.

3: Any text file that contains zillions of instances of a certain string is necessarily very compressible. The first thing a compression program will do is discover that "<data>" is really common, and assign it a really short code. A comma-delimited file is inherently less compressible.

Here are some empirical results: I created a file with the numbers from one to a million, delimited in different ways. zero.dat has just a linefeed between numbers; comma.dat has a comma and a linefeed; tag01 has a start-tag and end-tag with the one-character element type "d" (and the linefeed); tag02 has element type "da", on up to tag20, which has a 20-character-long element type. 5-line Ruby program available on request.

Here are the original sizes:

   6888888  4 Dec 13:35 zero.dat
   7888887  4 Dec 12:50 comma.dat
  13888881  4 Dec 12:51 tag01.dat
  15888879  4 Dec 12:52 tag02.dat
  17888877  4 Dec 12:53 tag03.dat
  19888875  4 Dec 12:56 tag04.dat
  21888873  4 Dec 13:06 tag05.dat
  31888863  4 Dec 13:07 tag10.dat
  51888843  4 Dec 13:08 tag20.dat

Here are the sizes after gzipping:

   2129148  4 Dec 13:35 zero.dat.gz
   2130082  4 Dec 12:50 comma.dat.gz
   2377733  4 Dec 12:51 tag01.dat.gz
   2376912  4 Dec 12:52 tag02.dat.gz
   2518197  4 Dec 12:53 tag03.dat.gz
   2638489  4 Dec 12:56 tag04.dat.gz
   2631120  4 Dec 13:06 tag05.dat.gz
   2661673  4 Dec 13:07 tag10.dat.gz
   2596261  4 Dec 13:08 tag20.dat.gz

You can see that:

   the linefeed-only file reduces to 2129148 / 6888888 = 31% of its original size
   the 20-char tagged file reduces to 2596261 / 51888843 = 5% of its original size

And even though the 20-char tagged file was over 7.5 times bigger than the linefeed-only file when uncompressed, once they're compressed it is only about 1.2 times bigger (2596261 / 2129148) -- a mere 22% increase despite every number having 2 tags with 20-character tag names, instead of nothing but a line break.

I wouldn't worry about the extra bytes much. If you've got enough data for it to matter, buy a disk-compression utility and you can forget the issue.

Steve DeRose
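
For anyone who wants to repeat the measurement above, here is a minimal Python sketch. It is not the Ruby program mentioned in the post, which is not shown. Several things in it are assumptions: the function names are made up for illustration, the tag names beyond "d" and "da" are guessed as prefixes of "datadata...", and the data is built in memory rather than written to .dat files. The raw sizes quoted in the post work out to the numbers 1 through 999,999, so that is the count to use when comparing. The gzipped sizes will probably differ a little from the ones listed, since they depend on the gzip version and settings.

```python
import gzip


def make_records(count, tag=None, comma=False):
    """Build the test data in memory: the numbers 1..count, one per line.

    tag   -- wrap each number in <tag>...</tag> when given
    comma -- otherwise, follow each number with a comma when True
    """
    if tag is not None:
        template = f"<{tag}>{{}}</{tag}>\n"
    elif comma:
        template = "{},\n"
    else:
        template = "{}\n"
    return "".join(template.format(n) for n in range(1, count + 1)).encode("ascii")


def measure(name, raw):
    """Return (name, raw size, gzipped size) for one variant."""
    return name, len(raw), len(gzip.compress(raw))


def compare(count, tag_lengths=(1, 2, 3, 4, 5, 10, 20)):
    """Measure the linefeed-only, comma, and tagged variants in turn."""
    rows = [
        measure("zero", make_records(count)),
        measure("comma", make_records(count, comma=True)),
    ]
    for length in tag_lengths:
        # Assumed tag names: "d", "da", "dat", "data", "datad", ...
        tag = ("data" * 5)[:length]
        rows.append(measure(f"tag{length:02d}", make_records(count, tag=tag)))
    return rows


if __name__ == "__main__":
    # 99_999 keeps the run short; 999_999 matches the raw sizes in the post.
    for name, raw, packed in compare(99_999):
        print(f"{name:6} {raw:>9} {packed:>9} {packed / raw:6.1%}")
```

Each output row gives the variant name, the raw size, the gzipped size, and the compressed size as a percentage of the raw size, which is the figure the post uses for its comparison.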