Re: More on taming SAX (was Re: ANN: Amara XMLToolki
Jeff Rafter wrote:

>> While on the topic of SAX taming features in Amara, there is also
>> amara.saxtools.xpattern_sax_state_machine, which I didn't even bother
>> mentioning in the announcement (too much to cram in).
>
> Can you expand on your expansion?

Why not? This is fun.

> As I was reading this I was thinking that in the Java/C# world an
> interesting approach would be to keep a pseudo DOM stack for the event
> hierarchy. Maybe something where you keep everything at an ancestral
> level intact while parsing
>
> <foo>
>   <bar1>
>     <baz1/>
>     <baz2/>
>   </bar1>
>   <bar2>
>     <baz1>
>       <sub/>
>     </baz1>
>     <baz2>text</baz2>
>   </bar2>
> </foo>
>
> So when the event stream reached /foo/bar2/baz2/text() you would have
> the following in a DOM like structure:
>
> foo
>  \
>   bar1 (... no children)
>   bar2
>    \
>     baz1 (... no children, just the previous sibling and attrs)
>     baz2 (only the StartTag)
>
> I am not sure that the preceding siblings would be very useful and
> have more chances for pathological cases but when I construct
> mini-trees this is the subset I find handy.

This would save space in a deep, but not a wide tree, no?

> It is useful when working with an editor to
> understand the immediate context. Unfortunately by requiring the
> previous siblings you have to maintain quite a bit more... the whole
> preceding branch of the tree.

Yep. Anyway, this whole idea is different from the pushdom. I considered it, but decided the complexity/benefit ratio was too poor. pushdom only instantiates the nodes at the "end" of the XPattern: only baz2 in your example. If you try to go up the tree from that node you get a dead end at the document fragment node.

As to the behavior of amara.saxtools.xpattern_sax_state_machine, it doesn't have anything to do with instantiating nodes or anything like that: it's a pure state machine, and the SAX handler code can choose to use the states to do such processing, if it likes (which is what pushdom does).
Using your example to illustrate the workings of amara.saxtools.xpattern_sax_state_machine:

    from amara.saxtools import xpattern_sax_state_machine
    patterns = ['/foo/bar2/baz2']
    machine = xpattern_sax_state_machine(patterns, {})  #{} = no namespace
    import pprint  #Python standard data structure pretty-printer
    pprint.pprint(machine._state_table)

The output of this is:

    {0: {(1, None, None): 1},
     1: {(1, None, u'foo'): 2, (0, None, None): 0, (1, '?', False): 3},
     2: {(0, None, u'foo'): 1, (1, None, u'bar2'): 4, (1, '?', False): 5},
     3: {(0, '?', False): 1},
     4: {(0, None, u'bar2'): 2, (1, None, u'baz2'): 6, (1, '?', False): 7},
     5: {(0, '?', False): 2},
     6: {(0, None, u'baz2'): 4},
     7: {(0, '?', False): 4}}

This is the state machine that was constructed for you from the pattern '/foo/bar2/baz2'. As you can see, amara.saxtools.xpattern_sax_state_machine does its most interesting work even before you know what the XML instance to be processed looks like. A SAX handler can then use the resulting machine object to automatically manage the state as it is processing the instance. All it has to do is notify the machine object of the SAX events as they come, and the machine will take care of state transitions. The machine object will notify the SAX handler when it has found an event that matches one of the registered patterns.

All this does is free the programmer from managing all the familiar hair-pulling logic "OK, I know I'm supposed to bargle when I see a baz within a bar, but only if it's a child of a top-level foo" and all that. If the programmer can express things in XSLT patterns (and I find you usually can), amara.saxtools.xpattern_sax_state_machine does the heavy lifting for them. Pushdom, for example, uses such a machine object to trigger "start creating a DOM node for this subtree" based on the patterns provided by the caller.
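To make the notify-and-transition loop concrete, here is a minimal sketch that walks the printed table by hand using only the standard library's xml.sax. It is not Amara's implementation or API: TableWalker, Forwarder, ACCEPTING and skip_depth are hypothetical names invented for the illustration. Two things are assumed rather than taken from the message: that state 6 is the "matched" state (read off the table by eye), and that elements nested inside an unexpected element are handled by counting depth. The event tuples follow the encoding explained further on in this message (1 = start, 0 = end, '?' = wildcard).

```python
import xml.sax

# State table copied from the pprint output above (u'' prefixes dropped).
STATE_TABLE = {
    0: {(1, None, None): 1},
    1: {(1, None, 'foo'): 2, (0, None, None): 0, (1, '?', False): 3},
    2: {(0, None, 'foo'): 1, (1, None, 'bar2'): 4, (1, '?', False): 5},
    3: {(0, '?', False): 1},
    4: {(0, None, 'bar2'): 2, (1, None, 'baz2'): 6, (1, '?', False): 7},
    5: {(0, '?', False): 2},
    6: {(0, None, 'baz2'): 4},
    7: {(0, '?', False): 4},
}
START, END = 1, 0
WILD_START = (START, '?', False)  # start of an element the pattern does not expect
WILD_END = (END, '?', False)      # end of that unexpected element
ACCEPTING = {6}                   # assumption: state 6 means "/foo/bar2/baz2 matched"


class TableWalker:
    """Hypothetical driver for the table; not Amara's implementation."""

    def __init__(self, table, accepting):
        self.table = table
        self.accepting = accepting
        self.state = 0
        self.skip_depth = 0  # assumption: nesting depth inside an unexpected element
        self.matches = []

    def event(self, kind, ns, name):
        if self.skip_depth:
            # Inside an unexpected subtree: only track nesting until it closes.
            self.skip_depth += 1 if kind == START else -1
            if self.skip_depth == 0:
                self.state = self.table[self.state].get(WILD_END, self.state)
            return
        transitions = self.table[self.state]
        key = (kind, ns, name)
        if key in transitions:
            self.state = transitions[key]
            if kind == START and self.state in self.accepting:
                self.matches.append(name)
        elif kind == START:
            # Unexpected element: take the wildcard transition if there is one.
            self.state = transitions.get(WILD_START, self.state)
            self.skip_depth = 1
        else:
            raise ValueError("no transition for %r in state %d" % (key, self.state))


class Forwarder(xml.sax.ContentHandler):
    """Plain SAX handler that only forwards events to the walker."""

    def __init__(self, walker):
        super().__init__()
        self.walker = walker

    def startDocument(self):
        self.walker.event(START, None, None)

    def endDocument(self):
        self.walker.event(END, None, None)

    def startElement(self, name, attrs):
        self.walker.event(START, None, name)

    def endElement(self, name):
        self.walker.event(END, None, name)


DOC = (b"<foo><bar1><baz1/><baz2/></bar1>"
       b"<bar2><baz1><sub/></baz1><baz2>text</baz2></bar2></foo>")

walker = TableWalker(STATE_TABLE, ACCEPTING)
xml.sax.parseString(DOC, Forwarder(walker))
print(walker.matches)  # ['baz2']
```

Only the baz2 under bar2 is reported. The baz2 under bar1 is passed over because bar1 takes the wildcard transition out of state 2, and the walker returns to state 0 at the end of the document.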
The state table structure is probably not clear unless you're pretty familiar with Python, so let me expand on that a bit:

    {0: {(1, None, None): 1},
     1: {(1, None, u'foo'): 2, (0, None, None): 0, (1, '?', False): 3},
     2: {(0, None, u'foo'): 1, (1, None, u'bar2'): 4, (1, '?', False): 5},
     3: {(0, '?', False): 1},
     4: {(0, None, u'bar2'): 2, (1, None, u'baz2'): 6, (1, '?', False): 7},
     5: {(0, '?', False): 2},
     6: {(0, None, u'baz2'): 4},
     7: {(0, '?', False): 4}}

This is a dictionary within a dictionary. A Python dictionary is basically what some languages call an associative array and some (confusingly, IMO) call a hashtable. In {foo: bar, spam: eggs}, the key foo is mapped to bar and the key spam to eggs.

In the above, all the keys of the outer dict are states. The value is a nested dictionary that defines the transitions possible from that state. Each transition is defined as a mapping from a SAX event (the tuple that serves as the key) to a next state (the number that serves as the value).

Syntax for the event tuples is as follows:

    (1, None, None) = start document
    (0, None, None) = end document
    (1, None, u'foo') = start element "foo" (no namespace)
    (0, None, u'foo') = end element "foo" (no namespace)
    (1, '?', False) = wildcard for any element that is not "expected" in the pattern

The wildcard is needed so that, say, the pattern 'foo/baz' does not unexpectedly match any node in:

    <foo>
      <bar>
        <baz/>
      </bar>
    </foo>

When the state machine sees foo, its state becomes, in effect, "waiting for baz", but we don't want to match the baz within the bar, so we switch to a new state (triggered by a wildcard) that is, in effect, "waiting for the end of this unexpected element".

Clear as mud? I hope it at least covers the ground.

>> This module takes an XPattern (e.g. "/xbel/folder/bookmark") and
>> generates a state machine which can be plugged into any regular SAX
>> handler.
>> In this way, you can automatically look for certain XPatterns
>> which have interesting bits of code for you to process, and ignore the
>> rest. This is sort of the opposite of Tenorsax: embrace the state
>> machine, but automate it, rather than sweeping it under a fancy
>> framework.
>
> Karl Waclawek has done some work in this area in both Delphi and C# in
> his toolkit XPEA. But I am sure he will take some ideas from this
> thread as well... it is all very interesting.

http://sourceforge.net/projects/xpea/ ?

Interesting. Thanks for the ref. Yes, this seems along the general lines of what I'm doing.

--
Uche Ogbuji                                    Fourthought, Inc.
http://uche.ogbuji.net    http://4Suite.org    http://fourthought.com
Use CSS to display XML - http://www.ibm.com/developerworks/edu/x-dw-x-xmlcss-i.html
Full XML Indexes with Gnosis - http://www.xml.com/pub/a/2004/12/08/py-xml.html
Be humble, not imperial (in design) - http://www.adtmag.com/article.asp?id=10286
UBL 1.0 - http://www-106.ibm.com/developerworks/xml/library/x-think28.html
Use Universal Feed Parser to tame RSS - http://www.ibm.com/developerworks/xml/library/x-tipufp.html
Default and error handling in XSLT lookup tables - http://www.ibm.com/developerworks/xml/library/x-tiplook.html
A survey of XML standards - http://www-106.ibm.com/developerworks/xml/library/x-stand4/
The State of Python-XML in 2004 - http://www.xml.com/pub/a/2004/10/13/py-xml.html