ANN: half a parser
I've written a bit of Java code that reports XML documents as a series of text-based events, with a context object reflecting the structure of the document so far. It's not technically an XML parser, but it is designed to be something on which you could build an XML parser or even an XML editor. More information is available at:

http://simonstl.com/projects/gorille/

Ripper, DocProcI, and ContextI are the most relevant bits, and RipperTest provides a command-line interface.

Ripper is designed to report every character in an XML document, from the XML declaration to the DOCTYPE (which it doesn't process) to spaces and quote styling inside tags to entity references to whitespace and comments at the end of the document. This approach should make it easier to perform minimal transformations which preserve as much of an original document as possible, and to build custom entity handlers and character testing.

More details on why I did this and project status follow, if you're interested. I'll also be presenting on this project at XML Europe in May.

-----------------------------------

About four years ago I wrote an article called "Toward A Layered Model for XML" [1]. At the time I was inspired by a variety of problems that XML 1.0 and Namespaces in XML had created for XML 1.0 [2]. Breaking down the parsing process into a series of smaller and better-defined parts seemed like a possible answer to a number of complex problems.

[1] - http://simonstl.com/articles/layering/layered.htm
[2] - http://simonstl.com/articles/interop/

More recently, I've been exploring character entity processing in the absence of a DTD [3], as well as a new problem that arose with XML 1.1: the prospect of different rules for the characters in XML components [4]. Both of these issues are closely tied to the parsing process, and fixing them is difficult without writing a whole new parser.
[3] - http://simonstl.com/projects/ents/
[4] - http://simonstl.com/projects/gorille/

While SAX2, DOM, and a variety of other APIs provide access to document information, these APIs are designed rather explicitly around the expectation that the document will have already been parsed. For a variety of questionable reasons I took the long way around and created an API, Markup Object Events (MOE) [5], that was capable of storing information in a parsed but not completely processed form. Things like entity boundaries, CDATA sections, and additional metadata can all be stored in this framework.

[5] - http://simonstl.com/projects/moe/

Unfortunately for MOE, there doesn't seem to be much of an audience for Java events that can be combined into object models and vice-versa; various tools for SAX2, DOM, and other frameworks already had that covered. Just as important, there were no parsers around that could provide MOE with the level of content it was capable of storing. It's nice to be able to keep track of entities used in attribute values, but since parsers squash them into simple strings anyway, there hasn't been much point.

The next piece of the puzzle was the Tiny API for Markup (TAM) [6], which included a J2ME MIDP 1.0 parser. It skipped the DOCTYPE declaration completely, so it wasn't an XML parser, and it turns out I forgot to implement CDATA sections anyway. In any event, while TAM provided only a simplified SAX-like view of parsed documents, it offered a foundation on which later parsing work could build.

[6] - http://simonstl.com/projects/tam/

The latest piece of the puzzle is a part of the Gorille package but builds on the TAM work. As a test project building on J2ME code, it isn't the loveliest programming, but so far it does appear to work. Most of the information on what "Ripper" produces is presently in two javadoc files, one covering the DocProcI interface [7] and one covering the ContextI interface [8].
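As a rough illustration of the "report every character" idea described above, here is a minimal sketch in Java. It is not the Gorille code: the names RoundTripSketch, TextEvents, rip, roundTrip, and events are hypothetical, and the real DocProcI and ContextI interfaces are documented in the javadoc. The sketch assumes a much cruder split than Ripper makes (whole markup chunks versus text) and ignores cases such as '>' inside attribute values, CDATA sections, and a DOCTYPE internal subset.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: these names are NOT the Gorille DocProcI/ContextI API.
class RoundTripSketch {

    /** Hypothetical receiver of raw text events. */
    interface TextEvents {
        void markup(String raw); // a chunk starting at '<' and running to its closing delimiter
        void text(String raw);   // everything between two markup chunks
    }

    /** Splits doc into markup and text chunks, reporting every character exactly once. */
    static void rip(String doc, TextEvents out) {
        int i = 0;
        while (i < doc.length()) {
            int stop;
            if (doc.charAt(i) == '<') {
                // Comments end at "-->"; all other markup is assumed to end at the next '>'.
                boolean comment = doc.startsWith("<!--", i);
                String close = comment ? "-->" : ">";
                int end = doc.indexOf(close, comment ? i + 4 : i + 1);
                // Unterminated markup is reported as-is rather than dropped.
                stop = (end < 0) ? doc.length() : end + close.length();
                out.markup(doc.substring(i, stop));
            } else {
                int next = doc.indexOf('<', i);
                stop = (next < 0) ? doc.length() : next;
                out.text(doc.substring(i, stop));
            }
            i = stop; // stop is always greater than i, so the loop terminates
        }
    }

    /** Rebuilds the document by concatenating the raw events in order. */
    static String roundTrip(String doc) {
        final StringBuilder sb = new StringBuilder();
        rip(doc, new TextEvents() {
            public void markup(String raw) { sb.append(raw); }
            public void text(String raw) { sb.append(raw); }
        });
        return sb.toString();
    }

    /** Lists the events, prefixed "M:" for markup and "T:" for text. */
    static List<String> events(String doc) {
        final List<String> seen = new ArrayList<String>();
        rip(doc, new TextEvents() {
            public void markup(String raw) { seen.add("M:" + raw); }
            public void text(String raw) { seen.add("T:" + raw); }
        });
        return seen;
    }

    public static void main(String[] args) {
        String doc = "<?xml version='1.0'?>\n<a  b = 'c'>x &amp; y<!-- > --></a>\n";
        System.out.println(roundTrip(doc).equals(doc)); // prints "true"
        System.out.println(events("<a>x</a>"));
    }
}
```

Because every character lands in exactly one event, concatenating the events reproduces the input whatever the chunk boundaries are. A real tool would refine the chunking without giving up that property.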
The parser feeds both interfaces with information, sending a raw text view to DocProcI and a more Infoset-like tree view to ContextI.

[7] - http://simonstl.com/projects/gorille/docs/com/simonstl/gorille/DocProcI.html
[8] - http://simonstl.com/projects/gorille/docs/com/simonstl/gorille/ContextI.html

My initial tests with this simple processor have shown that it's possible to parse a document and preserve every character in it, which is a rather expensive reinvention of the Unix cat command. Perhaps more promising is the hope that developers can build tools which combine textual awareness and an understanding of markup context on top of this framework.

I need to build unit tests for various pieces of the parser and the context objects, as well as exercise the parser on a greater variety of cases. Currently the parser only works on UTF-8 documents, at least without intervention in Java.

Future concrete work will focus on creating layers on top of these interfaces which integrate with the surrounding Gorille work as well as Ents. A DOCTYPE processor which can modify both the character and context objects will hopefully follow, as will a consumer that turns these events into SAX2 events and MOE events.

There's a lot to do yet, and it'll be a while coming, but hopefully what I've done might at least make other folks consider what's possible rather than just what's easy today.

--
Simon St.Laurent
Ring around the content, a pocket full of brackets
Errors, errors, all fall down!
http://simonstl.com -- http://monasticxml.org