Re: XML spec and XSD
Andrew Welch wrote:
> I mentioned the streaming aspect for 2 reasons:
>
> 1) If validation performance is an issue, can RNG + Schematron still
> be considered when XSD validation is so fast?

I think Andrew is identifying one of the basic flaws of XSD's evolution: an optimization controls the design. But I don't believe the claim that XSD validation is necessarily faster than RNG + Schematron, even at the poor state of optimization we have at the moment.

1) There are many kinds of constraints that Schematron can express that XSD 1.1 assertions etc. cannot. XSD cannot be considered faster at things that are completely out of its reach! On these, I can state that RELAX NG and Schematron are absolutely faster than XSD :-)

2) An XSD assertion requires that a local, trimmed XDM branch be constructed. So the worst case for XSD is where every element has an assertion: the same amount of tree building would have to go on as for building a full XDM tree in the typical non-streaming implementation of Schematron. There would be a space saving, but not necessarily a speed saving. (In fact, I think only the non-leaf elements would need assertions to reach this same point.)

3) There are Schematron implementations with a terminate-on-fail construct. (Ken Holman contributed this, IIRC.) So where only pass/fail testing is required, these can be very fast in the amount of work they need to do. Combine them with a streaming implementation, or even a lazily constructed DOM, and they certainly could be faster than an XSD implementation that attempts to run over a whole large file.

3a) A RELAX NG implementation that just provides a validation result does much less work than an XSD implementation that produces a full PSVI, too.

4) XSD implementations are not necessarily streaming, but may be random access. For example, my implementation of XSD by converting it to Schematron would use whatever access pattern the Schematron implementation used.
Or a validator that ran over data in a database directly, without pickling it first.

5) Where the application that uses the XML requires a tree, the tree needs to be built even if you have streaming validation: so you aren't actually saving any tree-construction time or space. In fact, since the PSVI has no standard XML form or standard streaming API form, I imagine that most uses of XSD actually result in a tree being built (or the data being entered into a DBMS): the point of the PSVI is to make extra information available to systems that are typically random-access, keyed-access or object trees (anything except streaming!).

6) Where a document is not large, it is not certain that a streaming implementation of a validator, in a modern language with automatic garbage collection, will actually allocate or use fewer objects than a tree-building implementation. And object-allocation-avoidance strategies, such as a cross-thread pool of DOM objects, can benefit in-memory implementations just as much as streaming implementations. The size of documents limits the number of simultaneous processes more in the case of the tree-building implementation, but not necessarily the number of objects allocated. (In fact, if the system is a validator, the event stream may need to be queued until validation has finished before being passed on to the application: this limits the opportunities for speed-ups from reduced object allocation, for example from pooling or singleton strategies.) The exception might be XSD validation used for firewalls. But when you look at, for example, the Lloyd's London Market system, they validate incoming data using XSD for coarse-grain validation, then Schematron for fine-grain validation: non-streaming is not a bar for their documents.

7) Where there is a resource constraint, such as a real-time constraint, benchmarking is ultimately the most objective way of determining performance.
Whitebox knowledge of algorithms and implementation details may certainly give hints about behaviour, but they are just armchair hints that may vary with different implementations, schemas and input documents: an algorithm with small constants but explosive growth may give better performance in practice than an algorithm with large constants but modest growth. (For example, a system that takes 10+n^2 performs better than a system that takes 110+10n for all n up to 16. And we know of XSLT engines where the slowest is at least 24 times slower than the fastest, even for basic transformations, so the constants could plausibly swamp the exponents.)

8) XSD schemas can be very large and verbose, with multiple files and many internal checks of the schema components, such as derivation by restriction and UPA. I see no reason to expect that loading a large XSD schema, with all that extra work, would necessarily be more efficient than the effort of loading an RNC or Schematron schema. Indeed, in XSD it is quite common for the schema to be larger than the instance: even when there is streaming validation, most of the process is taken up with creating persistent objects for the schema.

Putting all these together: I certainly concede that if you have a large document, a small memory, a compiled and pre-loaded schema, a small schema with only a few assertions, constraints that are only local, a document that is thrown away after validation (with the PSVI or tree or stream not passed on), a document that is parsed from XML rather than arriving as a DOM, and you want the validation to report as many outcomes as possible, then you might reasonably suspect, in the absence of benchmarking, that a streaming implementation (whether XSD or RNG + Schematron) would be faster at going through an entire document to confirm that no errors exist than an in-memory implementation built with the same attention to memory issues.
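To see how constants can swamp growth rates, here is a minimal sketch of the hypothetical cost models from point 7 above (the functions and search range are purely illustrative, not measurements of any real validator):

```python
# Hypothetical cost models from point 7: small constants with quadratic
# growth versus large constants with linear growth.
def quadratic_cost(n):
    return 10 + n ** 2

def linear_cost(n):
    return 110 + 10 * n

# Largest input size n at which the "explosive" quadratic algorithm
# is still cheaper than the "slow but scalable" linear one.
crossover = max(n for n in range(1, 1000) if quadratic_cost(n) < linear_cost(n))
print(crossover)  # -> 16
```

Below the crossover size the asymptotically worse algorithm wins, which is why only benchmarking on representative document sizes tells you which regime you are actually in.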
Added to that, I think there is tremendous scope for optimization of XPath and XSLT, and consequently of Schematron. Michael Kay's optimization work on XSLT and XPath is interesting. There are a lot of fun possibilities for Schematron-specific optimization, based on getting fast results (e.g. http://www.topologi.com/public/SchematronHeuristic.pdf) or on tries and feature sets (http://broadcast.oreilly.com/2009/06/validation-using-tries-and-fea.html).

> 2) Isn't it the case that some of the complexities of XSD are that way
> to allow for that validation speed?

Do you have an example? (I imagine it causes some simplifications as well as some complexities.)

Cheers
Rick Jelliffe