XML processing experiments
One nice feature of XML is that it is easily processable by the Desperate C/C++/Java/Perl hacker: the syntax is simple enough that you can do useful things with XML without a full XML parser. I've been exploring this sort of processing. If all you want to do is correctly parse well-formed XML, and you don't care about detecting whether or not it is well-formed, how much code does it take, and is it significantly faster than using an XML parser that does proper well-formedness or validity checking?

I used Jon's Old Testament XML file as test data (after removing the doctype line), which is about 3.7MB. I ran the tests on a Toshiba Tecra 720CDT (133MHz Pentium, 80MB RAM) with Windows NT 4.0. I used the IE 4.0 Java VM. The timings I give are after a couple of runs, so there's little or no disk I/O involved.

Lark 0.97 parsed the file in about 10.5 seconds, MSXML in about 24 seconds. I suspect the difference is partly because MSXML is building a tree (I didn't see any command line switch to turn this off). By comparison nsgmlsu -s took about 8 seconds.

I also tried LT XML (which is written in C). I didn't find a program that did nothing but parsing. The fastest one I found was the sgcount program (which counts the number of each element type); it took about 11 seconds. That's much slower than I expected; I suspect there may be some Windows-specific performance problems.

The code I wrote is available at <URL:ftp://ftp.jclark.com/pub/test/xmltok.zip>.

First I wrote a little library in C for doing XML "tokenization". This code just splits the input up into "tokens", where each token is data or some kind of XML markup (start-tag, end-tag, comment etc). The idea is that it does the minimum necessary to do any kind of useful XML-aware processing. I wrote a little application, xmlec, that just counts the number of elements in an XML document. This can be compiled either to use Win32 file mapping (if FILEMAP is defined) or normal read() calls.
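
To make the tokenization idea concrete, here is a minimal sketch in C of an element counter in that spirit. It is not the code from xmltok.zip: the names count_elements and skip_past are hypothetical, and it assumes the whole document is already in memory, is well-formed, uses a single-byte-compatible encoding such as UTF-8 or ASCII, and has no DOCTYPE internal subset (the test file had its doctype line removed). It classifies each piece of markup by its first few characters, skips to the matching terminator, and does no error checking at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return a pointer just past the first occurrence of pat in [p, end),
   or end if the pattern does not occur. */
static const char *skip_past(const char *p, const char *end, const char *pat)
{
    size_t n = strlen(pat);
    for (; (size_t)(end - p) >= n; p++)
        if (memcmp(p, pat, n) == 0)
            return p + n;
    return end;
}

/* Count start-tags and empty-element tags in buf[0..len).
   Assumption: the input is well-formed XML with no DOCTYPE internal
   subset. Malformed input is not detected and gives a meaningless count. */
size_t count_elements(const char *buf, size_t len)
{
    const char *p, *end;
    size_t count = 0;

    assert(buf != NULL);
    p = buf;
    end = buf + len;

    while (p < end) {
        size_t left = (size_t)(end - p);

        if (*p != '<') {
            /* Character data: jump straight to the next markup. */
            const char *lt = memchr(p, '<', left);
            p = lt ? lt : end;
        } else if (left >= 4 && memcmp(p, "<!--", 4) == 0) {
            p = skip_past(p + 4, end, "-->");      /* comment */
        } else if (left >= 9 && memcmp(p, "<![CDATA[", 9) == 0) {
            p = skip_past(p + 9, end, "]]>");      /* CDATA section */
        } else if (left >= 2 && p[1] == '?') {
            p = skip_past(p + 2, end, "?>");       /* processing instruction */
        } else if (left >= 2 && (p[1] == '!' || p[1] == '/')) {
            p = skip_past(p + 2, end, ">");        /* declaration or end-tag */
        } else {
            /* Start-tag or empty-element tag. A '>' inside a quoted
               attribute value does not end the tag. */
            char quote = 0;
            count++;
            for (p++; p < end; p++) {
                if (quote) {
                    if (*p == quote)
                        quote = 0;
                } else if (*p == '"' || *p == '\'') {
                    quote = *p;
                } else if (*p == '>') {
                    p++;
                    break;
                }
            }
        }
    }
    return count;
}
```

A caller would read the file into a buffer (or map it into memory) and pass the pointer and length to count_elements. Because the routine only scans for delimiters and never validates names, attributes, or nesting, it illustrates where the time saved over a checking parser would come from.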
You'll probably have to tweak the xmltok code a little if you're using anything other than Visual C++. I then translated this into Java (I'm not much of a Java programmer, so there's probably plenty of scope for improvement in the Java version).

xmlec parses the test file in about 0.5 seconds. Using read() instead of file mapping increases the time to about 0.65 seconds. The Java version takes about 1.5 seconds.

I also wrote a Java version of the LT XML textonly program (which extracts the non-markup of an XML document). The LT XML version ran in about 13.5 seconds. My Java version ran in about 3.5 seconds.

The class files for the Java element counting program total about 6k. The source for the C version is about 750 lines, including both the file mapping and read() versions.

I was quite surprised that there was such a big performance difference between real, conforming XML processing that does well-formedness checking, and quick and dirty XML processing that does the minimum necessary to get the correct result. This doesn't seem right to me...

James

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)