[XML-DEV Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message]

RE: XML too hard for programmers?

  • To: xml-dev@l...
  • Subject: RE: XML too hard for programmers?
  • From: rog@v...
  • Date: Wed, 19 Mar 2003 14:28:15 0000

limbo parser
Hi,
I'm afraid I'm new to this list, so am probably breaking 300 list
taboos...

I sent the below to Tim Bray in response to his article (the one
presumably spawned this thread); he suggested I post it to this list.

It's just a simple and small solution to an XML parsing problem I had,
which was satisfactory at the time, seems to me to be generally
applicable, and I haven't seen anything similar. Sorry in advance
about the length.

First thing: this is no panacea, and to be quite honest, I'm actually
(heretic!)  not at all keen on XML, but this API at least made things
bearable; the structure of the code dealing with the XML was quite
logical, and the space used in doing so was largely bounded.

I developed it when I was writing a browser for the Open Ebook
standard to run on limited memory platforms.  Obviously DOM was out of
the question, and SAX became really awkward as it would have been
necessary to traverse the whole XML tree from the start when moving
back a page; moreover I found it difficult to write code that
corresponded directly to the DTDs.

The idea is very simple: treat the XML as a multi-level stream, and
provide an interface that allows one to *mark* a place in the stream,
and *go to* a previously marked place.

The basic API looked something like:

	open: fn(f: string): ref Parser;

	Parser: adt {
		next:		fn(p: self ref Parser): ref Item;
		up:		fn(p: self ref Parser);
		down:	fn(p: self ref Parser);
		mark:	fn(p: self ref Parser): ref Mark;
		goto:	fn(p: self ref Parser, m: ref Mark);
		atmark:	fn(p: self ref Parser, m: ref Mark): int;
	};

	Item: adt {
		pick {
		Tag =>
			name:	string;
			attrs:		Attributes;
		Text =>
			ch:		string;
		Process =>
			target:	string;
			data:		string;
		Doctype =>
			name:	string;
			public:	int;
			params:	list of string;
		Stylesheet =>
			attrs:		Attributes;
		Error =>
			loc:		Locator;
			msg:		string;
		}
	};

	Attributes: adt {
		all:			fn(a: self Attributes): list of Attribute;
		get:			fn(a: self Attributes, name: string): string;
	};

	Attribute: adt {
		name:			string;
		value:			string;
	};

The unconventional language is Limbo (see
http://www.vitanuova.com/inferno/papers/limbo.html), but the concepts
are familiar: an "adt" is basically a struct with function members,
and a "pick" is a type-safe discriminated union; the type declarations
are Pascal-like (i.e.  type after identifier name).

Open()ing an XML file produces a parser p; then p.next() produces the
next XML item in the file *at the same nesting level*.

Therefore,

	p := xml->open("foo.xml");
	while ((i := p.next()) != nil)
		process_item(i);

will only process the top level elements.
e.g. given the following document:

	<item>
	  <title>MetaData</title>
	  <date>2003-01-12T00:18:05-05:00</date>
	
	  <link>http://bitworking.org/news/8</link>
	  <description>Upon waking, the dinosaur...</description>
	</item>

you'll get one item of type Tag, with name "item".

A crucial point is that when you get to the end of the current nesting
level, next() returns nil; this allows one to easily write a
recursive-descent-style parser, for instance (from the ebook reader)
parses a <head> tag:

	e_head(p: ref Parser, i: ref Item.Tag)
	{
		p.down();
		while ((t0 := nexttag(p)) != nil) {
			case t0.name {
			"title" =>
				e_title(p, t0);
			"link" =>
				e_link(p, t0);
			"style" =>
				e_style(p, t0);
			}
		}
		p.up();
	}

Here, nexttag() is a locally defined function that returns the next
Item that's a Tag, ignoring everything else.  The various e_*
functions deal with the various kinds of tags that can be found within
an XHTML <head> tag.

This style of interface means that it's possible to write code that
matches fairly closely the DTD, does not parse the whole document into
one in-core data sstructure, and avoids having to write abstruse state
machines!

If the XML is in a seekable file, you can mark a place in the file
(which records all the state of the XML parser at that point in the
file, and a place to seek to), and then return to it later, or even
store the mark externally and use it as an index for rapid
retrieval at a later date.

This means that for files containing a large dataset (e.g.  Ebooks!)
you don't necessarily have to store all the dataset, even in a derived
data structure.

I'm aware that my parsing of XML is probably hopelessly naive, and
perhaps there is some facet of XML that makes this approach impossible
for XML in general (I came up with this for a specific problem, after
all).  If so, I'd love to know why.

If not, I hope I've managed to contribute a thought or two to the
debate...

  cheers,
    rog.


PURCHASE STYLUS STUDIO ONLINE TODAY!

Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced!

Buy Stylus Studio Now

Download The World's Best XML IDE!

Accelerate XML development with our award-winning XML IDE - Download a free trial today!

Don't miss another message! Subscribe to this list today.
Email
First Name
Last Name
Company
Subscribe in XML format
RSS 2.0
Atom 0.3
 

Stylus Studio has published XML-DEV in RSS and ATOM formats, enabling users to easily subcribe to the list from their preferred news reader application.


Stylus Studio Sponsored Links are added links designed to provide related and additional information to the visitors of this website. they were not included by the author in the initial post. To view the content without the Sponsor Links please click here.

Site Map | Privacy Policy | Terms of Use | Trademarks
Free Stylus Studio XML Training:
W3C Member
Stylus Studio® and DataDirect XQuery ™are products from DataDirect Technologies, is a registered trademark of Progress Software Corporation, in the U.S. and other countries. © 2004-2013 All Rights Reserved.