Data-binding versus XQuery performance

From: "Michael Kay" <mike@s...>
To: <xml-dev@l...>
Date: Thu, 11 Dec 2008 16:14:53 -0000

I decided to investigate further how the performance of data binding
compares with that of XQuery, using the data proposed by Boris Kolpackov in

http://markmail.org/message/utwrryr5ojvsu3jg

First, I implemented the data binding approach using the reference
implementation of JAXB.

I wanted to see how long it takes to parse/unmarshal the raw XML data.
The results:

  JAXB                          69ms
  Saxon without validation      75ms
  Saxon with validation        192ms

The interesting thing here is that although JAXB uses a schema to extract
type information, it appears not to perform full validation of the XML
against the schema. This can be verified by introducing errors, for example
a disallowed attribute. Saxon offers no equivalent lightweight mode: if you
want typed data, you have to do full validation. It would be interesting to
see whether a mode of operation that provides typed data without full
validation is possible, and whether it would close this gap.
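
The check described above (add a disallowed attribute and see whether anything complains) can be sketched using only the JDK's javax.xml.validation API, which does perform full validation. This is an illustration, not the JAXB or Saxon code that was timed. The schema is an assumption: the real schema is not shown in this post, so this one only mirrors the element and attribute names used in the query further down. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

final class ValidationCheck {
    // Assumed schema: a namespaced <people> element holding unqualified
    // <person> elements that allow only the attributes age and gender.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
      + " targetNamespace='http://www.example.com/test'>"
      + "<xs:element name='people'><xs:complexType><xs:sequence>"
      + "<xs:element name='person' minOccurs='0' maxOccurs='unbounded'>"
      + "<xs:complexType>"
      + "<xs:attribute name='age' type='xs:int' use='required'/>"
      + "<xs:attribute name='gender' type='xs:string' use='required'/>"
      + "</xs:complexType></xs:element>"
      + "</xs:sequence></xs:complexType></xs:element></xs:schema>";

    /** Returns true if the document passes full schema validation. */
    static boolean isValid(String xml) {
        Schema schema;
        try {
            SchemaFactory factory =
                SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            schema = factory.newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("schema itself is broken", e);
        }
        Validator validator = schema.newValidator();
        try {
            // With no ErrorHandler set, validation errors are thrown.
            validator.validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String prefix = "<t:people xmlns:t='http://www.example.com/test'>";
        String good = prefix + "<person age='30' gender='male'/></t:people>";
        String bad = prefix
            + "<person age='30' gender='male' nickname='x'/></t:people>";
        System.out.println("good: " + isValid(good));
        System.out.println("bad:  " + isValid(bad));
    }
}
```

A processor that accepts the second document while still delivering typed values is, by this test, not doing full validation.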

It's interesting that JAXB is marginally faster in building the data than
Saxon even without validation: I hadn't expected this. It's not a big
difference, and I don't know whether it would still be there for a dataset
with a less trivial structure (only three element types). It's possible that
it's solely due to the use of a different XML parser. Needs further
investigation.

Then I looked at the cost of executing the query originally proposed by
Boris: find persons whose age is less than X and whose gender is male, over
a variety of age values (I used a linear distribution between 1 and 100).
The average execution time was:

  JAXB                         0.42ms
  Saxon-B (untyped)            9.43ms
  Saxon-SA (typed)             6.85ms

This is running a query that counts the person elements meeting the
criteria, but generates no output.
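
On the data-binding side, a counting query of this kind amounts to one linear scan over the unmarshalled objects. A minimal sketch, assuming a simplified Person class in place of the JAXB-generated one; the code that was actually timed is not shown in this post, so the class and method names here are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

final class CountQuery {
    /** Hypothetical stand-in for the JAXB-generated Person class. */
    static final class Person {
        final int age;
        final String gender;

        Person(int age, String gender) {
            this.age = age;
            this.gender = gender;
        }
    }

    /** Linear scan: count the persons who are male and younger than maxAge. */
    static int countMalesYoungerThan(List<Person> people, int maxAge) {
        int count = 0;
        for (Person p : people) {
            if (p.age < maxAge && "male".equals(p.gender)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
            new Person(25, "male"),
            new Person(40, "male"),
            new Person(30, "female"));
        System.out.println(countMalesYoungerThan(people, 35));
    }
}
```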

Then I thought I would try a slightly more realistic query: produce an HTML
table showing the number of males of each age between 1 and 100, something
like this:

      <table>
         <tr>
            <th>Age</th>
            <th>Number of males</th>
         </tr>
         <tbody>
            <tr><td>1</td><td>0</td></tr>
            <tr><td>2</td><td>0</td></tr>

This time I measured the cost of doing the query, serializing it to HTML,
and writing the results to a file:

  JAXB                       42.1ms
  Saxon-B                   676.0ms
  Saxon-SA                    7.5ms

Clearly Saxon-SA has optimized the query by building an index. I can do the
same by hand-optimizing my JAXB version of the query, which brings the cost
of the query down to 1.02ms (in my first attempt, it also made it crash when
it found someone over 100 years old).
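
The effect of that hand optimization can be sketched as the difference between rescanning the data once per age and scanning it once in total. The first JAXB version is not shown in this post, so the repeated-scan method below is only a guess at its general shape. The Person class and all names are hypothetical stand-ins.

```java
import java.util.Arrays;
import java.util.List;

final class MaleAgeCounts {
    /** Hypothetical stand-in for the JAXB-generated Person class. */
    static final class Person {
        final int age;
        final String gender;

        Person(int age, String gender) {
            this.age = age;
            this.gender = gender;
        }
    }

    /** One full scan per age value: work grows as ages x persons. */
    static int[] countByRepeatedScan(List<Person> people, int maxAge) {
        int[] counts = new int[maxAge + 1];
        for (int age = 1; age <= maxAge; age++) {
            for (Person p : people) {
                if (p.age == age && "male".equals(p.gender)) {
                    counts[age]++;
                }
            }
        }
        return counts;
    }

    /** One scan in total, using the age as an array index. */
    static int[] countBySinglePass(List<Person> people, int maxAge) {
        int[] counts = new int[maxAge + 1];
        for (Person p : people) {
            // Skip out-of-range ages instead of overrunning the array.
            if ("male".equals(p.gender) && p.age >= 1 && p.age <= maxAge) {
                counts[p.age]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
            new Person(1, "male"), new Person(1, "male"),
            new Person(100, "male"), new Person(101, "male"),
            new Person(1, "female"));
        int[] slow = countByRepeatedScan(people, 100);
        int[] fast = countBySinglePass(people, 100);
        System.out.println(Arrays.equals(slow, fast));
    }
}
```

The array here plays the role of the index: it works only because the key is a small non-negative integer, which is exactly the kind of assumption that breaks when the data changes.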

In the JAXB code here I did the serialization using crude write("<html>")
statements. This is of course not recommended practice, because of issues
like character escaping, character encoding etc. So I'm giving JAXB an
unfair advantage here.
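
To make the escaping issue concrete: any data value written with a raw write() call would first need something like the helper below. This is a minimal sketch with hypothetical names, covering only the four characters that matter in HTML text and double-quoted attribute values. It does not address character encoding, which would additionally require a writer created with an explicit charset.

```java
final class HtmlEscape {
    /** Escapes the characters that are special in HTML text and attributes. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Males < 30 & \"others\""));
    }
}
```

A serializer does this for every text node and attribute automatically, which is part of the cost the hand-written version avoids paying.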

For comparison, here is the query:

  declare namespace t='http://www.example.com/test';
  <html>
    <head><title>Number of males, by age</title></head>
    <body><h1>Number of males, by age</h1>
      <table><tr><th>Age</th><th>Number of males</th></tr>
        <tbody>{for $age in 1 to 100 return 
           <tr>
             <td>{$age}</td>
             <td>{count(/t:people/person[@age = $age and @gender = 'male'])}</td></tr>
        }</tbody></table></body>
  </html>

and here is the second version of the JAXB code:

  // Uses java.io.PrintWriter, FileOutputStream and File, plus the
  // JAXB-generated Person and Gender classes.
  PrintWriter writer = new PrintWriter(new FileOutputStream(new File("e:/temp/test.out")));
  writer.write("<html><head><title>Number of males, by age</title></head>");
  writer.write("<body><h1>Number of males, by age</h1>");
  writer.write("<table><tr><th>Age</th><th>Number of males</th></tr>");
  writer.write("<tbody>");
  // Single pass over the data: histogram[n] counts the males aged n.
  int[] histogram = new int[101];
  for (Person person : people.getPerson()) {
      if (person.getGender().equals(Gender.MALE)) {
          int age = person.getAge();
          // Ignore anyone over 100 rather than overrun the array.
          if (age <= 100) {
              histogram[age]++;
          }
      }
  }
  for (int age = 1; age <= 100; age++) {
      writer.write("<tr><td>" + age + "</td><td>" + histogram[age] + "</td></tr>");
  }
  writer.write("</tbody></table></body></html>");
  writer.close();

So, what are the lessons?

(a) In many typical message transformation scenarios, the cost will be
dominated by parsing, validation and serialization costs. For simple
documents there's no significant performance difference between the two
architectural approaches in these areas, though there may well be
significant differences between products. We haven't measured what happens
when the document structure becomes more complex.

(b) If you apply some effort to your Java coding, you will probably be able
to get the query to go faster using data binding than it will run in an
XQuery engine. However, the more complex the query becomes, the more likely
it is that a smart query optimizer will do better than an "average"
(non-optimized) hand-coded implementation.

(c) But the difference may be irrelevant. Against a fixed parsing cost of
about 70ms, the difference between an 8ms query and a 1ms query (roughly
78ms against 71ms end to end, or about 10%) is very unlikely to make the
XQuery approach unviable.

(d) The JAXB code I wrote is a lot less robust than the XQuery solution. It
falls over when it gets out-of-range data, and it doesn't do serialization
properly. Producing a robust solution is more effort and will reduce
performance.

Basically, no surprises here. Declarative code is fewer lines of code,
easier to read, closer to the user statement of requirements, more amenable
to automatic optimization. Procedural code is faster if well-written (slower
if not so well written), more likely to be buggy, more lines of code, harder
to change if the requirements change, more likely to need rewriting if the
structure of the data changes.

(In the real world, I would probably split the XQuery approach into two
phases: computing the results, and formatting them as HTML. This would lose
some performance, but gain a considerable advantage in application
modularity. The performance is good enough that one can probably afford to
do this. I'm not sure how one would achieve this separation between logic
and presentation in the JAXB case. Judging from a lot of PHP code I've seen,
you wouldn't even try.)
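
For what it is worth, one plausible way to get that separation on the Java side is to have the computation return a plain data structure and leave the markup to a second method. This is a sketch under assumptions, not a recommendation drawn from the measurements above: the names are hypothetical, the input is reduced to an array of male ages, and the rendering still builds markup by string concatenation.

```java
final class AgeReport {
    /** Phase 1, logic only: counts[n] is the number of males aged n. */
    static int[] compute(int[] maleAges, int maxAge) {
        int[] counts = new int[maxAge + 1];
        for (int age : maleAges) {
            // Ages outside 1..maxAge are ignored.
            if (age >= 1 && age <= maxAge) {
                counts[age]++;
            }
        }
        return counts;
    }

    /** Phase 2, presentation only: turns the counts into an HTML table. */
    static String render(int[] counts) {
        StringBuilder html = new StringBuilder();
        html.append("<table><tr><th>Age</th><th>Number of males</th></tr><tbody>");
        for (int age = 1; age < counts.length; age++) {
            html.append("<tr><td>").append(age).append("</td><td>")
                .append(counts[age]).append("</td></tr>");
        }
        html.append("</tbody></table>");
        return html.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(compute(new int[] {1, 2, 2, 7}, 2)));
    }
}
```

The intermediate int[] is the seam: either phase can be replaced or tested without touching the other.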

Those who said that declarative query languages for database access would
never offer adequate performance were proved wrong by the end of the 1980s.
It surprises me that 20 years later people don't recognize that the same
applies to XML access.

Michael Kay
http://www.saxonica.com/

Follow-Ups:
- Re: Data-binding versus XQuery performance
  - From: Boris Kolpackov <boris@c...>
