Re: XML processing experiments

  • From: Tim Bray <tbray@t...>
  • To: xml-dev@i...
  • Date: Wed, 05 Nov 1997 00:36:07 -0800

First off, thanks to James for some very thought-provoking work.

At 07:03 PM 04/11/97 +0700, James Clark wrote:
>If all you want to do is be able to
>correctly parse well-formed XML, and you don't care about detecting
>whether or not it is well-formed, how much code does it take and is it
>significantly faster than using an XML parser ...
>Lark: 10.5 seconds .. MSXML: 24 .. nsgmlsu: 8 .. sgcount:11 .. 
>xmlec (C): 0.5 seconds .. (Java): 1.5 seconds.

[BTW, when I got Lark to run "almost as fast as SP", I decided that
was qualitatively fast enough for now].

>I was quite surprised that there was such a big performance difference

No kidding.  

Discussions here are a bit dangerous, since in the Java domain, we are 
kind of operating in the dark; we don't have profiling tools 
with really good granularity.  This is my excuse for engaging in 
performance analysis based on intuition, something for which I have 
personally fried more than one junior programmer.

Let's look at James' code eating up a "-quoted literal, where characters 
are in the byte array buf[], start and end being integer indices therein:
      case (byte)'"':
        {
          for (++start; start != end; ++start) {
            if (buf[start] == (byte)'"') {
              nextTokenIndex = start + 1;
              return TOK_LITERAL;
            }
          }
          ...
        }

The following are candidates for why a program like Lark or MSXML
might run slower.
 - works with Java char rather than byte variables
 - does a method dispatch (or at least a few conditionals) per 
   character processed for at least two reasons: to manage the entity
   stack, and to have a place to put the different character encoding
   processing modules.
   [Note: A look at James' code makes me wonder if this is
    *really* as necessary as I thought]
 - does quite a bit more work upon recognizing some markup
   constructs; in particular for a start tag it pulls
   appart the attribute list and packages up the element type
   & attributes in a nice structure convenient for an API user

I went and looked at Lark's main loop, and for a 'typical' character 
processing mode, i.e. it's not the begin or end of a tag or attribute or 
something and no buffers run out but the text is being saved, it ends up 
executing 25 lines of Java including one getXmlCharacter() method 
dispatch; none of them are monster conditionals or anything.  

James' code above, in the equivalent case, is executing 3, I think.

So while lines-of-code is a very shaky yardstick indeed, the difference is
8 or 9 to 1, which is not out of line with the observed performance
difference. 
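
For illustration, the contrast between the two kinds of inner loop can be
sketched roughly as below. This is a minimal sketch, not code from Lark or
from James' tokenizer: the class LiteralScan, the CharSource interface, and
the method names are all hypothetical, and the wrapper assumes one byte per
character. How costly the per-character call is depends on the Java runtime;
a compiler that inlines the call would narrow the gap.

```java
// Hypothetical names throughout; not taken from Lark or from James' code.
final class LiteralScan {
    /** Per-character source, standing in for an entity stack or decoder layer. */
    interface CharSource {
        int next(); // returns -1 at end of input
    }

    /** Tight loop over bytes: index just past the closing quote, or -1 if none. */
    static int scanBytes(byte[] buf, int start, int end) {
        // buf[start] is assumed to be the opening quote
        for (++start; start != end; ++start) {
            if (buf[start] == (byte) '"') {
                return start + 1;
            }
        }
        return -1;
    }

    /** Dispatching loop: one interface call per character.
     *  Returns the number of characters before the closing quote, or -1 if none. */
    static int scanDispatched(CharSource in) {
        int n = 0;
        for (int c = in.next(); c != -1; c = in.next()) {
            if (c == '"') {
                return n;
            }
            n++;
        }
        return -1;
    }

    /** Wraps buf[from..to) as a CharSource, one byte per character (an assumption). */
    static CharSource over(final byte[] buf, final int from, final int to) {
        return new CharSource() {
            private int pos = from;

            public int next() {
                return pos < to ? (buf[pos++] & 0xFF) : -1;
            }
        };
    }
}
```

Both methods find the same closing quote; the second simply pays for a call
on every character, which is the cost candidate (a) below is about.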

My intuition is that what's holding Lark back is 
(a) the per-char dispatching, and
(b) turning the DFA crank, which requires a 2D array reference, then
    a shift & mask
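
To make (b) concrete, here is a rough sketch of a table-driven step of that
shape: one 2D array lookup, then a shift to get an action and a mask to get
the next state. The packing layout, the two-state automaton, and every name
in it (PackedDfa, STATE_BITS, countLiterals) are assumptions made up for the
example; Lark's actual tables are not shown in this message.

```java
// Hypothetical layout: low 4 bits = next state, remaining bits = action code.
final class PackedDfa {
    static final int STATE_BITS = 4;
    static final int STATE_MASK = (1 << STATE_BITS) - 1;

    static final int ACT_NONE = 0, ACT_LITERAL = 1;   // actions
    static final int OUTSIDE = 0, IN_LITERAL = 1;     // states
    static final int CLS_QUOTE = 0, CLS_OTHER = 1;    // character classes

    static int entry(int action, int next) {
        return (action << STATE_BITS) | next;
    }

    // TABLE[state][characterClass] -> packed (action, next state)
    static final int[][] TABLE = {
        /* OUTSIDE    */ { entry(ACT_NONE, IN_LITERAL), entry(ACT_NONE, OUTSIDE) },
        /* IN_LITERAL */ { entry(ACT_LITERAL, OUTSIDE), entry(ACT_NONE, IN_LITERAL) },
    };

    /** Counts complete "-quoted literals by cranking the DFA once per character. */
    static int countLiterals(String text) {
        int state = OUTSIDE;
        int literals = 0;
        for (int i = 0; i < text.length(); i++) {
            int cls = text.charAt(i) == '"' ? CLS_QUOTE : CLS_OTHER;
            int packed = TABLE[state][cls];      // the 2D array reference
            int action = packed >>> STATE_BITS;  // the shift
            state = packed & STATE_MASK;         // the mask
            if (action == ACT_LITERAL) {
                literals++;
            }
        }
        return literals;
    }
}
```

Every character pays for the lookup, the shift, and the mask, even when the
state does not change, which is what makes this a per-character cost.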

I have some ideas on how to fix both, but first I have to make Lark
do conditional sections and validate (neither should slow it down
significantly). 

One other experiment would be useful, that might shed light from
a different angle.  James, how about doing element counts per type;
i.e. actually *using* some of the info coming back from the tokenizer,
nothing fancy, just use a java.util.Hashtable or some such; should be 
able to run very similar code on Lark and your TokenStream thing; I
wonder if it would change the numbers.  I'll get around to this sometime
if nobody else does, but not for the next 2-3 weeks. -Tim
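
A rough sketch of what such an element-count experiment might look like,
using java.util.Hashtable. The ElementCounter class and its naive start-tag
scan are hypothetical stand-ins for a real tokenizer (neither Lark's API nor
the TokenStream interface is shown here). It assumes well-formed input and
does not treat comments, CDATA sections, or DOCTYPE internals specially, so a
"<" inside those would be miscounted.

```java
import java.util.Hashtable;

// Hypothetical stand-in for "tokenizer + per-type counting"; assumes well-formed input.
final class ElementCounter {
    /** Returns a table mapping element type name to number of start tags seen. */
    static Hashtable<String, Integer> count(String xml) {
        Hashtable<String, Integer> counts = new Hashtable<String, Integer>();
        int i = 0;
        while ((i = xml.indexOf('<', i)) != -1) {
            i++;
            if (i >= xml.length()) {
                break;
            }
            char c = xml.charAt(i);
            if (c == '/' || c == '!' || c == '?') {
                continue; // end tag, comment/declaration, or processing instruction
            }
            int j = i;
            while (j < xml.length() && !isNameEnd(xml.charAt(j))) {
                j++;
            }
            String name = xml.substring(i, j);
            Integer old = counts.get(name);
            counts.put(name, old == null ? 1 : old + 1);
            i = j;
        }
        return counts;
    }

    static boolean isNameEnd(char c) {
        return c == '>' || c == '/' || Character.isWhitespace(c);
    }
}
```

The point of the experiment would be that the hashing and string creation
are the same whichever tokenizer feeds them, so they add a fixed cost to both
sides of the comparison.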

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)

