XML Parser

A parser is a piece of a program that takes a physical representation of some data and converts it into an in-memory form that the rest of the program can use. Parsers are used everywhere in software. An XML parser is a parser designed to read XML and make its contents available to a program. There are different types, and each has its advantages. Unless a program simply copies the whole XML file as a unit without looking inside it, it must implement or call on an XML parser.

The main types of parsers are known by some funny names: SAX, DOM and pull. For each type, there are some excellent implementations freely available for a variety of languages, including Java, C++, C#, VB.NET (in fact, any .NET language), PHP, Perl, Python, Ruby and so on.


What is SAX?

SAX stands for Simple API for XML. Its main characteristic is that as it reads each unit of XML, it raises an event that the calling program can handle. This allows the calling program to ignore the bits it doesn't care about, and just keep or use what it likes. The disadvantage is that the parser retains nothing once an event has been delivered, so the calling program must itself keep track of everything it might need later. SAX is often used in high-performance applications or where the size of the XML might exceed the memory available to the running program.
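The event model described above can be sketched with the SAX implementation in Python's standard library (xml.sax). This is only an illustration, not how Stylus Studio uses SAX: the sample document, the TitleCollector class and the collect_titles function are all made up for this example.

```python
import xml.sax

# Hypothetical sample document, invented for this sketch.
SAMPLE = (
    b"<catalog>"
    b"<book id='1'><title>XML Basics</title><price>10</price></book>"
    b"<book id='2'><title>Streaming XML</title><price>25</price></book>"
    b"</catalog>"
)


class TitleCollector(xml.sax.ContentHandler):
    """Keeps the text of <title> elements and ignores every other event."""

    def __init__(self):
        super().__init__()
        self.titles = []
        # SAX keeps no history, so the handler tracks its own state.
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


def collect_titles(xml_bytes):
    """Parse the document and return the titles seen along the way."""
    handler = TitleCollector()
    xml.sax.parseString(xml_bytes, handler)
    return handler.titles


if __name__ == "__main__":
    print(collect_titles(SAMPLE))
```

Note that the price elements generate events too; the handler simply does nothing with them. The _in_title flag is the kind of bookkeeping the calling program has to do for itself.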

The design inspiration and subsequent coordination came from David Megginson, who continues to maintain the SAX Project website. The SAX standard is currently at version 2.0.

SAX is used everywhere in Stylus Studio®. It is used for building certain representations of XML structure for the XSLT and XQuery Mappers, and is also used extensively within the XML Converters.

There have been many implementations of SAX parsers. The Apache project has sponsored some, including Crimson and its successor, Xerces (available in both C++ and Java). Dr. Michael Kay, the author of Saxon, also maintained a version of Ælfred, another SAX parser.


What's a DOM?

DOM stands for Document Object Model. It differs from SAX in that it builds the entire XML document representation in memory and then hands the calling program the whole chunk of memory. DOM can be very memory intensive; by the time you figure in the overhead for managing the relationships of the nodes, you might be talking 4× to 8× the size of the original document in memory usage.
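As a rough illustration of the whole-document approach, here is a sketch using xml.dom.minidom, a small DOM implementation in Python's standard library. The sample document and the prices_by_id function are invented for this example.

```python
from xml.dom import minidom

# Hypothetical sample document, invented for this sketch.
SAMPLE = (
    "<catalog>"
    "<book id='1'><title>XML Basics</title><price>10</price></book>"
    "<book id='2'><title>Streaming XML</title><price>25</price></book>"
    "</catalog>"
)


def prices_by_id(xml_text):
    """Build the full tree in memory, then query it freely."""
    doc = minidom.parseString(xml_text)  # the entire document is now in memory
    prices = {}
    for book in doc.getElementsByTagName("book"):
        price = book.getElementsByTagName("price")[0]
        # Nodes keep links to parents, children and siblings, so the
        # program can move around the tree in any direction.
        assert price.parentNode is book
        prices[book.getAttribute("id")] = float(price.firstChild.data)
    return prices


if __name__ == "__main__":
    print(prices_by_id(SAMPLE))
```

The parent, child and sibling links that make this navigation convenient are also where much of the memory overhead comes from.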

There are places in Stylus Studio® where a DOM is necessary: the Tree View in the XML Editor, and all XSLT and XQuery processors, no matter what the brand, with two notable exceptions. Both the underlying Saxon engine and DataDirect XQuery support pull parsing, which is covered below. The XML Pipeline deployer is very smart; it knows for each component what the optimal representation is, and will work hard to ensure that memory is conserved wherever possible by avoiding unnecessary transformations from DOM to SAX and back.

Implementations include Xerces (again both in C++ and Java), and Microsoft's MSXML and System.Xml classes.

DOM (currently up to Level 3) has been widely criticized for being too complicated; it has tried to maintain the same programming interface in whatever language it is implemented in, even where that violates some of the conventions of that language. This has led to some DOM-like implementations that are more in keeping with the philosophy of the local language. Examples in Java include TinyTree (used only in Saxon), JDOM, DOM4J and XOM.


What's a Pull Parser?

SAX is a push parser, since it pushes events out to the calling application. Pull parsers, on the other hand, sit and wait for the application to come calling. The application asks for the next available event, and basically loops until it runs out of XML.

Pull parsers are useful in streaming applications, which are areas where either the data is too large to fit in memory, or the data is being assembled just in time for the next stage to use it. Pull parsing is designed to be used with large data sources, and unlike SAX, which delivers every event, an application using a pull parser can choose to skip events (or in some implementations, whole sections of the document) that it is not interested in. The converters are designed to work with both the SAX and the pull parser interfaces.
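The pull loop can be sketched with xml.dom.pulldom from Python's standard library. It is used here only as a stand-in for pull parsers in general and is not one of the parsers named in this article. The sample document and the expensive_titles function are invented for this example.

```python
from xml.dom import pulldom

# Hypothetical sample document, invented for this sketch.
SAMPLE = (
    "<catalog>"
    "<book id='1'><title>XML Basics</title><price>10</price></book>"
    "<book id='2'><title>Streaming XML</title><price>25</price></book>"
    "</catalog>"
)


def expensive_titles(xml_text, threshold):
    """Return the titles of books priced above the threshold."""
    events = pulldom.parseString(xml_text)
    titles = []
    # The application drives the loop and asks for one event at a time.
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "book":
            # Materialize just this subtree; the rest is never held in memory.
            events.expandNode(node)
            price = float(node.getElementsByTagName("price")[0].firstChild.data)
            if price > threshold:
                title = node.getElementsByTagName("title")[0]
                titles.append(title.firstChild.data)
    return titles


if __name__ == "__main__":
    print(expensive_titles(SAMPLE, 20))
```

Because the application owns the loop, it can stop early, or build a small tree for one element and let everything else stream past.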

In Java, the current leading contender for streaming parsers appears to be StAX, while on Microsoft's .NET platform, the XmlReader class in System.Xml is built right in.

StAX — Streaming API for XML

The StAX pull parser is defined in the Java world by a specification called JSR 173. Both Saxon and DataDirect XQuery support pull parsing. In some instances, particularly in DataDirect's implementation, pull parsing can give a significant performance boost, but both implementations have been so highly tuned that the choice between SAX, DOM and StAX for any given application is a matter for testing. Since the XML Pipeline constructor in Stylus Studio® XML Enterprise Suite knows the capabilities of each node in the pipeline, this choice is handled automatically for you.
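For readers who want to run such a test themselves, the following is a minimal timing sketch. It uses Python's standard-library parsers, with xml.dom.pulldom standing in for a pull parser because StAX itself is Java-only. It says nothing about the relative speed of Saxon, DataDirect XQuery or any Java parser. The generated document and every function name are invented for this example, and real measurements should use your own documents.

```python
import timeit
import xml.sax
from xml.dom import minidom, pulldom


def make_doc(n):
    """Generate a hypothetical test document with n <item> elements."""
    return "<items>" + "".join(f"<item>{i}</item>" for i in range(n)) + "</items>"


class ItemCounter(xml.sax.ContentHandler):
    """Counts <item> start events."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "item":
            self.count += 1


def count_sax(text):
    handler = ItemCounter()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.count


def count_dom(text):
    return len(minidom.parseString(text).getElementsByTagName("item"))


def count_pull(text):
    return sum(
        1
        for event, node in pulldom.parseString(text)
        if event == pulldom.START_ELEMENT and node.tagName == "item"
    )


def compare(text, repeat=3):
    """Return the best wall-clock time in seconds for each approach."""
    return {
        fn.__name__: min(timeit.repeat(lambda: fn(text), number=1, repeat=repeat))
        for fn in (count_sax, count_dom, count_pull)
    }


if __name__ == "__main__":
    for name, seconds in compare(make_doc(20000)).items():
        print(f"{name}: {seconds:.4f}s")
```

The numbers depend heavily on document shape, parser implementation and what the application does with each event, which is why the choice is best settled by measuring.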


Standards and Stylus Studio®

One very important point is that each of these is an industry-recognized standard. This means that whatever you do is portable across implementations. Whether you like the SJSXP or Woodstox StAX parsers, whether you like Crimson or Ælfred or Xerces, the point is you have a choice. If one fails to perform for you in the way you hope, or you want to migrate from one platform to another for deployment, you are never locked in by Stylus Studio®. As a division of DataDirect Technologies, we have a legacy of participation in and conformance to the standards process, and we are proud of this heritage. Download and examine a copy of Stylus Studio® today, and see how powerful a fully standards-compliant XML application you can design and deploy.

 