[XML-DEV Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message] RE: XML too hard for programmers?
Hi, I'm afraid I'm new to this list, so am probably breaking 300 list taboos... I sent the below to Tim Bray in response to his article (the one presumably spawned this thread); he suggested I post it to this list. It's just a simple and small solution to an XML parsing problem I had, which was satisfactory at the time, seems to me to be generally applicable, and I haven't seen anything similar. Sorry in advance about the length. First thing: this is no panacea, and to be quite honest, I'm actually (heretic!) not at all keen on XML, but this API at least made things bearable; the structure of the code dealing with the XML was quite logical, and the space used in doing so was largely bounded. I developed it when I was writing a browser for the Open Ebook standard to run on limited memory platforms. Obviously DOM was out of the question, and SAX became really awkward as it would have been necessary to traverse the whole XML tree from the start when moving back a page; moreover I found it difficult to write code that corresponded directly to the DTDs. The idea is very simple: treat the XML as a multi-level stream, and provide an interface that allows one to *mark* a place in the stream, and *go to* a previously marked place. The basic API looked something like: open: fn(f: string): ref Parser; Parser: adt { next: fn(p: self ref Parser): ref Item; up: fn(p: self ref Parser); down: fn(p: self ref Parser); mark: fn(p: self ref Parser): ref Mark; goto: fn(p: self ref Parser, m: ref Mark); atmark: fn(p: self ref Parser, m: ref Mark): int; }; Item: adt { pick { Tag => name: string; attrs: Attributes; Text => ch: string; Process => target: string; data: string; Doctype => name: string; public: int; params: list of string; Stylesheet => attrs: Attributes; Error => loc: Locator; msg: string; } }; Attributes: adt { all: fn(a: self Attributes): list of Attribute; get: fn(a: self Attributes, name: string): string; }; Attribute: adt { name: string; value: string; }; The unconventional language is Limbo (see http://www.vitanuova.com/inferno/papers/limbo.html), but the concepts are familiar: an "adt" is basically a struct with function members, and a "pick" is a type-safe discriminated union; the type declarations are Pascal-like (i.e. type after identifier name). Open()ing an XML file produces a parser p; then p.next() produces the next XML item in the file *at the same nesting level*. Therefore, p := xml->open("foo.xml"); while ((i := p.next()) != nil) process_item(i); will only process the top level elements. e.g. given the following document: <item> <title>MetaData</title> <date>2003-01-12T00:18:05-05:00</date> <link>http://bitworking.org/news/8</link> <description>Upon waking, the dinosaur...</description> </item> you'll get one item of type Tag, with name "item". A crucial point is that when you get to the end of the current nesting level, next() returns nil; this allows one to easily write a recursive-descent-style parser, for instance (from the ebook reader) parses a <head> tag: e_head(p: ref Parser, i: ref Item.Tag) { p.down(); while ((t0 := nexttag(p)) != nil) { case t0.name { "title" => e_title(p, t0); "link" => e_link(p, t0); "style" => e_style(p, t0); } } p.up(); } Here, nexttag() is a locally defined function that returns the next Item that's a Tag, ignoring everything else. The various e_* functions deal with the various kinds of tags that can be found within an XHTML <head> tag. This style of interface means that it's possible to write code that matches fairly closely the DTD, does not parse the whole document into one in-core data sstructure, and avoids having to write abstruse state machines! If the XML is in a seekable file, you can mark a place in the file (which records all the state of the XML parser at that point in the file, and a place to seek to), and then return to it later, or even store the mark externally and use it as an index for rapid retrieval at a later date. This means that for files containing a large dataset (e.g. Ebooks!) you don't necessarily have to store all the dataset, even in a derived data structure. I'm aware that my parsing of XML is probably hopelessly naive, and perhaps there is some facet of XML that makes this approach impossible for XML in general (I came up with this for a specific problem, after all). If so, I'd love to know why. If not, I hope I've managed to contribute a thought or two to the debate... cheers, rog.
|
PURCHASE STYLUS STUDIO ONLINE TODAY!Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced! Download The World's Best XML IDE!Accelerate XML development with our award-winning XML IDE - Download a free trial today! Subscribe in XML format
|