Re: DESIGN PROPOSAL: Java XMLIterator

> This is a first design for XMLIterator, a third base-level API
> which allows an application to pull content from XML. This
> avoids the memory demand and navigation issues of DOM, and
> is a more straightforward programming model than SAX, which
> requires magic data connections between the event handlers in
> order to maintain application state. XMLIterator extends
> the familiar Iterator interface, so it models an XML document
> as a linear collection of partially specified nodes.

I very much agree that we need such an API. SAX works great for some kinds of application. In particular, it works well for generic XML applications which do not have to parse a particular XML vocabulary. However, SAX is really awkward for some applications, particularly applications that parse a particular XML vocabulary with a complex, highly nested structure.

As it happens, I have been working on a similar API for the last few months. One impetus for doing this was my experience in implementing Jing. I was struck by how painful it was to parse a RELAX NG schema into an internal form using SAX. The equivalent non-XML syntax was easily parsed using a straightforward recursive descent parser. By contrast, the parser for the XML syntax was a warped and twisted mess.

My API is currently called "pullax" (pull API for XML). This is still very much work in progress. I hadn't been planning to release for a month or two yet. But since you have started this discussion, I think the most constructive thing I can do is to release what I have now. I do have quite a comprehensive API and I do have a fairly complete sample implementation. I have made this available at

  http://www.thaiopensource.com/pullax/

I chose to do my initial sample implementation on top of Xerces 2 because it provides a native interface (XNI) with a "pull" parser API. (I would call it a "controlled push" rather than a "pull" API. Roughly, it has a variant of XMLReader.parse which you call multiple times; on each call, it parses some portion of the document making SAX-like callbacks on handlers.) This allows an implementation that neither requires the whole document in memory (as would an implementation on top of DOM), nor the use of threads (as would an implementation on top of SAX). XNI also provides a very rich set of information. You'll need Xerces 2 Beta 3 if you want to play with my implementation. See

  http://xml.apache.org/xerces2-j/index.html

Obviously, SAX and DOM adapters are on my list of things to do.

The bad news is that the API documentation is pretty pathetic at the moment and still needs a lot of work. This message will have to serve as an overview of the API for now.

In designing pullax, I have tried to follow modern Java best practices, for example, in favoring immutability and using classes for type-safe enumerations. One of my main guides here has been Joshua Bloch's book "Effective Java" (http://java.sun.com/docs/books/effective/). This is a truly excellent book done by the guy who designed several of the better recent Java platform APIs (including the Collections API).

Perhaps the most fundamental decision in designing a pull API is whether the properties for each node are provided

(a) by methods on some sort of node object returned by the scanner/parser/iterator object

(b) by methods on the scanner/parser object itself; the scanner/parser object has methods to move to the next node

You've chosen (a). A couple of notable pull APIs use (b):

- the XmlReader API in .NET; this is the principal XML parser API for .NET (see http://msdn.microsoft.com/library/en-us/cpref/html/frlrfsystemxmlxmlreaderclasstopic.asp)

- XML Pull Parser (http://www.extreme.indiana.edu/soap/xpp/)

I tried it both ways in pullax. I ended up, like you, with (a), for the following reasons:

1. Handling attributes in (b) is messy

2. (a) works more like the java.util.Iterator and java.util.Enumeration that are familiar to every Java programmer

3. (a) makes it much easier to construct filters/processing pipelines; for example, writing a RELAX NG validator that wraps around a non-validating parser.

The main argument against (a) is that it involves more object creation, which, according to Java folklore, is a performance killer. Now, you've minimized object creation by having next() implicitly invalidate any previously returned nodes. I don't think this is an acceptable design for an API intended for widespread public use:

1. It's a common requirement to need to lookahead in the document when deciding how to process the current node. Your design makes this awkward. It also makes it very awkward to write a filter that needs lookahead in doing its filtering (imagine a filter that merges adjacent text nodes).

2. This behavior would be a big surprise to the average Java user. The Iterators and Enumerations which a typical Java user will be familiar with just don't work like this.

3. It's the kind of API that leads to "Write Once, Debug Everywhere" rather than "Write Once, Run Everywhere". A typical scenario is that a user writes an application that needs lookahead; they incorrectly access an XMLNode object after another call to next(); they test their application with an implementation that allocates a new XMLNode object for each next() call; their application appears to work fine. Then somebody else tries to use the application with a parser implementation that reuses XMLNode objects and the application mysteriously and silently gives the wrong results.

In summary, this design does not promote reliability. I believe priority should be given to reliability over performance.

My "solution" is simply to accept the object creation.
Modern Java VMs (like Hotspot) do a fantastic job of efficient allocation of short-lived objects; object creation has much less performance overhead with modern VMs than it used to with classic VMs. In any case, a user that is prepared to sacrifice programming convenience for an extra ounce of performance can use SAX. (Also, since the objects returned are immutable, there is an opportunity for reducing object creation by sharing.)

The central interface in my API is XmlScanner. (I'm planning a companion XmlPrinter interface for writing XML.) This corresponds to your XMLIterator interface. This interface is similar to java.util.Iterator but I chose not to derive XmlScanner from Iterator, for two reasons:

1. the equivalents of the next() and hasNext() methods need to be able to throw a java.io.IOException

2. it's awkward and inefficient to have always to cast the return value of next()

My XmlScanner object returns XmlItem objects. I call these objects "items" rather than "nodes" because "node" to me suggests a tree view where elements have children rather than a flat view with start-element and end-element objects. My XmlItem object has similar methods to your XMLNode object to return the item type, the local name, namespace URI, QName, prefix, value etc. The method names are chosen based on the Infoset and XPath.

I toyed with the approach to attributes that you took, that is, having ATTRIBUTE items following the START_ELEMENT item. This has the advantage of being simple. However, I found it inconvenient to work with and felt it would seem rather strange to anybody with exposure to SAX or DOM. So instead an XmlItem of type START_ELEMENT has getAttribute() methods that return an XmlItem for an attribute identified by name or index.

XmlItem has a getContext() method returning an XmlContext object. This provides information about the context of the item, such as the in-scope namespaces. Typically, many XmlItem objects can share the same XmlContext object.
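To make the shape of such an interface concrete, here is a minimal sketch. It is not the pullax API: ItemScanner, Item, ItemType, over(), leafValues(), sample() and demo() are hypothetical names invented for illustration, and only the item type names START_ELEMENT, END_ELEMENT and TEXT come from this message. It shows an Iterator-like interface whose methods may throw IOException and return a typed item, plus a one-item lookahead that keeps a reference to the previously returned item across a call to next().

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class PullSketch {
    // Item kinds named in the message; a real scanner could expose more.
    enum ItemType { START_ELEMENT, END_ELEMENT, TEXT }

    // Immutable item (hypothetical shape, not the real pullax XmlItem).
    static final class Item {
        final ItemType type;
        final String name;  // local name; null for TEXT items
        final String value; // string value; null for element items

        Item(ItemType type, String name, String value) {
            this.type = type;
            this.name = name;
            this.value = value;
        }
    }

    // Iterator-like, but typed (no casts) and allowed to throw IOException.
    interface ItemScanner {
        boolean hasNext() throws IOException;
        Item next() throws IOException;
    }

    // Toy scanner over a prepared list, standing in for a real parser.
    static ItemScanner over(List<Item> items) {
        final Iterator<Item> it = items.iterator();
        return new ItemScanner() {
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public Item next() { return it.next(); }
        };
    }

    // Records "name=value" whenever a START_ELEMENT is directly followed by TEXT.
    // Keeping 'prev' across next() is safe here because next() never
    // invalidates or mutates an item it has already returned.
    static List<String> leafValues(ItemScanner scanner) throws IOException {
        List<String> out = new ArrayList<>();
        Item prev = null;
        while (scanner.hasNext()) {
            Item cur = scanner.next();
            if (prev != null && prev.type == ItemType.START_ELEMENT
                    && cur.type == ItemType.TEXT) {
                out.add(prev.name + "=" + cur.value);
            }
            prev = cur;
        }
        return out;
    }

    // Items for <doc><a>1</a><b>2</b></doc>.
    static List<Item> sample() {
        return Arrays.asList(
            new Item(ItemType.START_ELEMENT, "doc", null),
            new Item(ItemType.START_ELEMENT, "a", null),
            new Item(ItemType.TEXT, null, "1"),
            new Item(ItemType.END_ELEMENT, "a", null),
            new Item(ItemType.START_ELEMENT, "b", null),
            new Item(ItemType.TEXT, null, "2"),
            new Item(ItemType.END_ELEMENT, "b", null),
            new Item(ItemType.END_ELEMENT, "doc", null));
    }

    // Runs the lookahead over the sample, wrapping the checked exception.
    static List<String> demo() {
        try {
            return leafValues(over(sample()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [a=1, b=2]
    }
}
```

With a scanner that reused a single mutable node object, the same loop would compare the current node against itself and silently report nothing, which is the failure mode described earlier.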
A major challenge in designing a general-purpose XML API is to deal with the diversity of XML applications. At one end of the spectrum are simple applications that need no more than elements, attributes and text (the "holy trinity of XML" as I think David Megginson once called them). At the other end of the spectrum are applications such as XML editors that want as much detail about the markup as they can get, including things like comments and entities.

Just as there is a diversity of XML applications, so is there a diversity of XML processors/parsers. There are large, complex parsers like Xerces that provide a very rich set of information but take a corresponding hit in terms of size and speed. There is also a need for simpler parsers that do less but can be smaller and faster.

The solution I use in pullax is based on the "feature" concept of SAX2. An implementation of the pullax API implements the XmlScannerFactory interface. By default an XmlScanner created by an XmlScannerFactory returns exactly three types of XmlItem: START_ELEMENT, END_ELEMENT, TEXT. Also by default TEXT items are maximal. So, for example, the document

  <doc>4<!-- a silly comment -->2</doc>

will be returned as three items: a START_ELEMENT item, a TEXT item with string value "42", and an END_ELEMENT item. If an application wishes to see, for example, comments, it must request the SHOW_COMMENT feature from the XmlScannerFactory before creating the XmlScanner. If the parser cannot satisfy the request, it must throw an exception.

XmlScannerFactory objects are designed to be dynamically discoverable using the service provider mechanism (like JAXP). XmlScannerFactoryFinder is a utility class that takes a set of features and dynamically finds an XmlScannerFactory implementation that supports those features.

This approach ensures that the support for a rich information set in pullax does not get in the way of simple applications or simple XML processors.
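The default behavior for the example document can be illustrated with a small, self-contained sketch. This is not pullax code: the Item class, the filter() method and its boolean showComments flag are hypothetical stand-ins for whatever the real implementation does when the SHOW_COMMENT feature is or is not requested. It drops comment items unless they were asked for and merges adjacent TEXT items so that text is maximal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MaximalTextSketch {
    enum Type { START_ELEMENT, END_ELEMENT, TEXT, COMMENT }

    // Immutable item (hypothetical; stands in for an XmlItem).
    static final class Item {
        final Type type;
        final String data; // element name, text, or comment content

        Item(Type type, String data) {
            this.type = type;
            this.data = data;
        }
    }

    // Drops COMMENT items unless showComments is set, then merges any
    // TEXT item into an immediately preceding TEXT item.
    static List<Item> filter(List<Item> raw, boolean showComments) {
        List<Item> out = new ArrayList<>();
        for (Item item : raw) {
            if (item.type == Type.COMMENT && !showComments) {
                continue; // feature not requested: the comment is invisible
            }
            int last = out.size() - 1;
            if (item.type == Type.TEXT && last >= 0
                    && out.get(last).type == Type.TEXT) {
                // Items are immutable, so build a new merged TEXT item.
                out.set(last, new Item(Type.TEXT, out.get(last).data + item.data));
            } else {
                out.add(item);
            }
        }
        return out;
    }

    // Raw items for <doc>4<!-- a silly comment -->2</doc>.
    static List<Item> sample() {
        return Arrays.asList(
            new Item(Type.START_ELEMENT, "doc"),
            new Item(Type.TEXT, "4"),
            new Item(Type.COMMENT, " a silly comment "),
            new Item(Type.TEXT, "2"),
            new Item(Type.END_ELEMENT, "doc"));
    }

    public static void main(String[] args) {
        for (Item item : filter(sample(), false)) {
            System.out.println(item.type + " " + item.data);
        }
    }
}
```

Without comments requested, the five raw items collapse to three, with the middle one carrying the text "42". With showComments set to true, all five items come through unchanged, since the comment separates the two text runs.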
The pullax API aims to provide a very rich information set. As far as the document instance is concerned, it is intended to support the union of SAX2, DOM2 core, and the XML Infoset, and then some. As far as the DTD is concerned, pullax currently provides approximately the same information as the union of the XML Infoset and DOM Level 2 core. I have opted not to provide the detailed lexical information about the DTD that SAX2 provides. It seems to me that it is not much use having lexical information about DTDs if you lose information about parameter entities within declarations; but dealing with parameter entities within declarations is just too hard for a general-purpose API, especially when you consider nested parameter entity references. I believe DTD editor type applications really require specialized APIs and parsers (e.g. DTDinst, see http://www.thaiopensource.com/dtdinst).

Another respect in which pullax's approach to DTDs differs from SAX is that it represents the DOCTYPE declaration as a single item. There does not seem much point in breaking it down into multiple items. Most of the information is in the XmlDtd object which is available from the XmlContext. Note that the XmlDtd object is immutable. I'm planning to extend the API to allow straightforward DTD caching: the idea is that a user-supplied XmlDtdResolver object will map the system id, public id and internal subset to an XmlDtd object.

I've written too much already. I'll be happy to answer any questions people may have about the design and I'll try to get the API doc into shape as soon as possible.

James