Data-binding versus XQuery performance
I decided to do some further investigations comparing data binding with XQuery performance using the data proposed by Boris Kolpakov in http://markmail.org/message/utwrryr5ojvsu3jg

First, I implemented the data binding approach using the reference implementation of JAXB. I was interested to see how much time it would take to parse/unmarshall the raw XML data. The results were interesting:

JAXB                        69ms
Saxon without validation    75ms
Saxon with validation      192ms

The interesting thing here is that although JAXB uses a schema to extract type information, it appears that it does not actually perform full validation of the XML against the schema. This can be verified by introducing errors, for example a disallowed attribute. Saxon doesn't provide this option: if you want typed data, you have to do full validation. It would be interesting to see whether a mode of operation that provides typed data without doing full validation would be possible and would close this gap.

It's interesting that JAXB is marginally faster in building the data than Saxon even without validation: I hadn't expected this. It's not a big difference, and I don't know whether it would still be there for a dataset with a less trivial structure (only three element types). It's possible that it's solely due to the use of a different XML parser. Needs further investigation.

Then I looked at the cost of executing the query originally proposed by Boris: find persons whose age is less than X and whose gender is male, over a variety of age values (I used a linear distribution between 1 and 100). The average execution time was:

JAXB                0.42ms
Saxon-B (untyped)   9.43ms
Saxon-SA (typed)    6.85ms

This is running a query that counts the person elements meeting the criteria, but generates no output.
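The JAXB code for this first query is not shown in the post. On the data-binding side it presumably amounts to a linear scan over the unmarshalled objects. The following is only a minimal sketch of that idea, not the code that was measured: `Person` and `Gender` are hypothetical stand-ins for the JAXB-generated classes, and the synthetic dataset (100,000 people, uniformly distributed ages) is an assumption made so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class FirstQuerySketch {

    // Hypothetical stand-ins for the JAXB-generated classes, which the post does not show.
    enum Gender { MALE, FEMALE }

    static final class Person {
        final int age;
        final Gender gender;

        Person(int age, Gender gender) {
            this.age = age;
            this.gender = gender;
        }
    }

    // Counts males whose age is strictly below ageLimit, using one linear scan.
    static int countMalesYoungerThan(List<Person> people, int ageLimit) {
        int count = 0;
        for (Person person : people) {
            if (person.gender == Gender.MALE && person.age < ageLimit) {
                count++;
            }
        }
        return count;
    }

    // Average wall-clock time per query, in milliseconds, over age limits 1..100.
    static double averageQueryMillis(List<Person> people) {
        long total = 0; // accumulate the results so the loop cannot be discarded
        long start = System.nanoTime();
        for (int ageLimit = 1; ageLimit <= 100; ageLimit++) {
            total += countMalesYoungerThan(people, ageLimit);
        }
        long elapsedNanos = System.nanoTime() - start;
        if (total < 0) {
            throw new IllegalStateException("count overflow");
        }
        return elapsedNanos / 100.0 / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Synthetic data: size and distribution are assumptions, not the original dataset.
        Random random = new Random(42);
        List<Person> people = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
            Gender gender = random.nextBoolean() ? Gender.MALE : Gender.FEMALE;
            people.add(new Person(1 + random.nextInt(100), gender));
        }
        System.out.printf("average query time: %.3f ms%n", averageQueryMillis(people));
    }
}
```

A single timed pass like this includes JIT warm-up, so the number it prints is not comparable with the figures above; a serious measurement would repeat the loop many times and discard the early runs.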
Then I thought I would try a slightly more realistic query: produce an HTML table showing the number of males of each age between 1 and 100, something like this:

    <table>
    <tr> <th>Age</th> <th>Number of males</th> </tr>
    <tbody>
    <tr><td>1</td><td>0</td></tr>
    <tr><td>2</td><td>0</td></tr>

This time I measured the cost of doing the query, serializing it to HTML, and writing the results to a file:

JAXB        42.1ms
Saxon-B    676.0ms
Saxon-SA     7.5ms

It's immediately obvious, of course, that Saxon-SA has optimized the query by building an index. Of course, I can do that by hand-optimizing my JAXB version of the query as well, and this brings the cost of the query down to 1.02ms (in my first attempt, it also made it crash when it found someone over 100 years old).

In the JAXB code here I did the serialization using crude write("<html>") statements. This is of course not recommended practice, because of issues like character escaping, character encoding etc. So I'm giving JAXB an unfair advantage here.

For comparison, here is the query:

    declare namespace t='http://www.example.com/test';
    <html>
      <head><title>Number of males, by age</title></head>
      <body><h1>Number of males, by age</h1>
      <table><tr><th>Age</th><th>Number of males</th></tr>
      <tbody>{for $age in 1 to 100 return
        <tr>
          <td>{$age}</td>
          <td>{count(/t:people/person[@age = $age and @gender = 'male'])}</td></tr>
      }</tbody></table></body>
    </html>

and here is the second version of the JAXB code:

    PrintWriter writer = new PrintWriter(new FileOutputStream(new File("e:/temp/test.out")));
    writer.write("<html><head><title>Number of males, by age</title></head>");
    writer.write("<body><h1>Number of males, by age</h1>");
    writer.write("<table><tr><th>Age</th><th>Number of males</th></tr>");
    writer.write("<tbody>");
    // Hand-built index: histogram[age] = number of males of that age
    int[] histogram = new int[101];
    for (Iterator<Person> iter = people.getPerson().iterator(); iter.hasNext();) {
        Person person = iter.next();
        if (person.getGender().equals(Gender.MALE)) {
            int age = person.getAge();
            // Guard added after the first attempt crashed on ages over 100
            if (age <= 100) {
                histogram[age]++;
            }
        }
    }
    for (int age=1; age<=100; age++) {
        writer.write("<tr><td>" + age + "</td><td>" + histogram[age] + "</td></tr>");
    }
    writer.write("</tbody></table></body></html>");
    writer.close();

So, what are the lessons?

(a) In many typical message transformation scenarios, the cost will be dominated by parsing, validation and serialization costs. For simple documents there's no significant performance difference between the two architectural approaches in these areas, though there may well be significant differences between products. We haven't measured what happens when the document structure becomes more complex.

(b) If you apply some effort to your Java coding, you will probably be able to get the query to go faster using data binding than it will run in an XQuery engine. However, the more complex the query becomes, the more likely it is that a smart query optimizer will do better than an "average" (non-optimized) hand-coded implementation.

(c) But the difference may be irrelevant. Compared with a fixed parsing cost of 70ms, the difference between 8ms query time and 1ms is very unlikely to make the XQuery approach unviable.

(d) The JAXB code I wrote is a lot less robust than the XQuery solution. It falls over when it gets out-of-range data, and it doesn't do serialization properly. Producing a robust solution is more effort and will reduce performance.

Basically, no surprises here. Declarative code is fewer lines of code, easier to read, closer to the user statement of requirements, more amenable to automatic optimization. Procedural code is faster if well-written (slower if not so well written), more likely to be buggy, more lines of code, harder to change if the requirements change, more likely to need rewriting if the structure of the data changes.

(In the real world, I would probably split the XQuery approach into two phases: computing the results, and formatting them as HTML.
This would lose some performance, but gain a considerable advantage in application modularity. The performance is good enough that one can probably afford to do this. I'm not sure how one would achieve this separation between logic and presentation in the JAXB case. Judging from a lot of PHP code I've seen, you wouldn't even try.)

Those who said that declarative query languages for database access would never offer adequate performance were proved wrong by the end of the 1980s. It surprises me that 20 years later people don't recognize that the same applies to XML access.

Michael Kay
http://www.saxonica.com/
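The two-phase split mentioned in the closing parenthetical (compute the results, then format them as HTML) can be approximated on the Java side too. What follows is one possible sketch, not the code that was measured: `Person` and `Gender` are hypothetical stand-ins for the JAXB-generated classes, `countMalesByAge`, `escape` and `renderHtml` are illustrative names, and the choice to ignore out-of-range ages silently is an assumption made to match what the XQuery version does (it only counts ages 1 to 100).

```java
import java.util.Arrays;
import java.util.List;

final class TwoPhaseReportSketch {

    // Hypothetical stand-ins for the JAXB-generated classes.
    enum Gender { MALE, FEMALE }

    static final class Person {
        final int age;
        final Gender gender;

        Person(int age, Gender gender) {
            this.age = age;
            this.gender = gender;
        }
    }

    // Phase 1 (logic): counts[age] = number of males of that age, for 1..maxAge.
    // Ages outside that range are skipped rather than indexing past the array.
    static int[] countMalesByAge(List<Person> people, int maxAge) {
        int[] counts = new int[maxAge + 1];
        for (Person person : people) {
            if (person.gender == Gender.MALE && person.age >= 1 && person.age <= maxAge) {
                counts[person.age]++;
            }
        }
        return counts;
    }

    // Minimal escaping for element content; '&' must be replaced first.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Phase 2 (presentation): turns the computed counts into an HTML page.
    static String renderHtml(String title, int[] counts) {
        String safeTitle = escape(title);
        StringBuilder html = new StringBuilder();
        html.append("<html><head><title>").append(safeTitle).append("</title></head>");
        html.append("<body><h1>").append(safeTitle).append("</h1>");
        html.append("<table><tr><th>Age</th><th>Number of males</th></tr><tbody>");
        for (int age = 1; age < counts.length; age++) {
            html.append("<tr><td>").append(age).append("</td><td>")
                .append(counts[age]).append("</td></tr>");
        }
        html.append("</tbody></table></body></html>");
        return html.toString();
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
            new Person(30, Gender.MALE),
            new Person(30, Gender.MALE),
            new Person(130, Gender.MALE),   // out of range, ignored
            new Person(30, Gender.FEMALE));
        int[] counts = countMalesByAge(people, 100);
        System.out.println(renderHtml("Number of males, by age", counts));
    }
}
```

Returning a string leaves the choice of output encoding to the caller, which is one of the serialization issues the crude write() version leaves open. The extra range check and escaping are the kind of robustness that lesson (d) says costs effort and some performance.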