XML-DEV Mailing List Archive
RE: Request: Techniques for reducing the size of XML instances
On Tue, 31 Jul 2001, HUGHES,MARK (Non-HP-FtCollins,ex1) wrote:

> That's an excellent point - passing around a tokenized form of an XML
> document to simplify parsing is a reasonable idea. Personally, I'd just
> use the Pyxie format <http://www.pyxie.org/>, as it's *VERY* easy to
> produce and to parse again, and has the tremendous advantage of still
> being plain-text, so it's easy to debug and test.

That's certainly in keeping with some of the binary XML approaches. The distinction between "binary" and "textual" is bogus, really, but it's the nomenclature we're stuck with for now. It's all binary anyway; "text" just uses a fairly standardish binary format (although the Blueberry thread shows that this "text" format is a bit shifty anyway).

PS: I just ran a quick test, timing gzip. gzipping 11449004 bytes on a K6-2 400 took 10.693 seconds of CPU time, compressing them to 2738792 bytes. If this machine were serving compressed XML, it wouldn't be able to max out a 10Mbit link, even assuming that whatever processing it was doing to create the data took zero time. (It was a core dump I compressed rather than a large amount of XML, which will skew the results a bit, but it looks like three to four times the CPU power of my laptop would be needed just to handle the communications overhead of generating a 10Mbit gzipped XML stream.)

I recently helped implement a system that read a small amount of data from disk, performed some computation, and sent the data over a 100Mbit link to the next stage of servers. It had to pretty much fill that 100Mbit link to meet spec[1], and it ran on a machine with less power than my laptop. gzipping XML would not have been an option; the system could only just fit the raw data down a 100Mbit link with the required TCP/IP protocol overhead, let alone with XML markup all over it.
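The timing test in the PS above is easy to re-run. A minimal sketch using Python's zlib (the same DEFLATE engine gzip uses) on a synthetic, XML-ish payload rather than the original core dump - so the absolute numbers are only illustrative; the point is CPU seconds per megabyte of input:

```python
# Rough reproduction of the gzip timing experiment: compress a block of
# markup-heavy data and report the input throughput that compression
# rate could sustain. Payload and level are assumptions, not the original.
import time
import zlib

payload = b"<number>123456789</number>" * 500_000  # ~13 MB of markup

start = time.process_time()
compressed = zlib.compress(payload, 6)  # level 6 is gzip's default
elapsed = max(time.process_time() - start, 1e-6)  # guard div-by-zero

mbit_per_sec = len(payload) * 8 / 1e6 / elapsed
print(f"{len(payload)} -> {len(compressed)} bytes "
      f"in {elapsed:.3f}s CPU (~{mbit_per_sec:.0f} Mbit/s of input)")
```

If the reported input rate comes out below the line speed you want to serve, the box can't fill that link with gzipped output no matter how fast the rest of the pipeline is.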
Non-gzipped XML would probably have been OK in this situation since, luckily, this data happens to be a series of strings about 20k in length, so the overhead of <?xml version='1.0' ?><message>...</message> wouldn't be an issue. But if it were highly structured or numerical data, the overhead of <number>123456789</number> over a single 32-bit word (26 bytes versus 4 - a factor of more than six) would have meant we'd need six or seven 100Base-T links coming from this machine to carry the required just-under-100Mbit/sec of raw data - or gigabit Ethernet. Raw data processing took just under 50% of the machine's CPU; if we'd had to emit XML, we'd have had to gzip it all to fit it down the 100Mbit/sec Ethernet, and there just wouldn't have been enough CPU to do that.

ABS

[1] The spec mandated something along the lines of 1,000 80Kb data packets a second, IIRC - add TCP/IP overhead to that and you're pushing a 100Mbit/sec Ethernet, which was what the machine had connected to it.

--
Alaric B. Snell
http://www.alaric-snell.com/ http://RFC.net/ http://www.warhead.org.uk/
Any sufficiently advanced technology can be emulated in software
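[The markup-overhead factor from the message above can be checked in a couple of lines of Python, with struct standing in for the raw 32-bit word on the wire:]

```python
# Size of the example element versus the same value as a packed 32-bit word.
import struct

xml_form = b"<number>123456789</number>"
raw_form = struct.pack(">I", 123456789)  # 4-byte big-endian unsigned word

print(len(xml_form), len(raw_form), len(xml_form) / len(raw_form))
# prints: 26 4 6.5
```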