RE: Streaming XML and SAX
Hi Nathan,

<YourComment>
Building a DOM every time is inefficient, but I have to agree with Tom that having XML act as the protocol as well is quite elegant. Why presume that the XML processor capable of handling the protocol layer would have to build a _generic_ object model? And why presume that an XML processor has to build a _single_ object from all the information?

> <purchase xmlns="http://www.ecommerce.net/ns/ec/">
>   <seqno>12345678</seqno>
>   <customer-id>87654321</customer-id>
>   <vendor-id>18273645</vendor-id>
>   <invoice-id>81726354</invoice-id>
>   <total>92674.12</total>
> </purchase>

It seems like parsers could be made a whole lot more configurable than they currently are. If they were more configurable, the top-level XML processor could build the domain-specific objects itself. Continuing with your <purchase> example, I can envision a processing model like this:

  Parser sees: <purchase>
  Checks: Is a 'purchase' parser registered?
    Yes: Pass control to it; the 'purchase' parser reads until </purchase>,
         then returns control to the top-level parser.
     or
    Yes: Slurp text until </purchase>, pass "<purchase>...</purchase>"
         (unparsed) to a 'purchase' parser running under another thread.
     or
    Yes: Slurp text until </purchase> and store it (unparsed) in the DOM
         to be handled on a later pass.
    No:  Keep parsing text and adding nodes to the DOM.
     or
    No:  Throw away text (unparsed) up until </purchase>.

It would then be up to the subparser to build its own objects, which could be used later. Or the subparser could return an already processed node to be inserted into the generic object model (or DOM).

Is this model possible with any existing parsers?
</YourComment>

<Reply>
This architecture involves more work than required. Another way to do it (in fact, we are already doing this with our DSSSL and XSL interpreters) would be:

a) parse the document or the stream
b) an interpreter router checks for certain GIs (generic identifiers, i.e. element names) or PIs (processing instructions).
   On matching one, it loads the appropriate interpreter
c) the interpreter interprets parsed GIs until the end of the document (in your example: </purchase>)
d) when the end of the document is reached, the router goes back to listen mode for this multiplexed channel (a channel is a multiplexed stream within a session) and the interpreter is unloaded

For document-based parsing, as usual, we use file protocols. For streaming parsing, we are using HTTP-NG or MEMUX techniques. MEMUX is a work in progress, but basically it is multiplexing on a single session. Because this protocol level takes care of the multiplexing, the parser does not have to care about mixing streams: its universe is only a single stream with documents organized in strict sequence. In a multiplexed stream all documents are in a row and follow a strict sequence. However, globally, on a single session, several documents are sent simultaneously.

Thus, this architecture has several layers:

  interpreters
  -----------
  Interpreter router
  -----------
  SGML/XML parser
  -----------
  MEMUX
  -----------
  Transports

For file-based or blob-based documents, replace the first two layers by the file protocol (file, http, ftp, etc.).

An SGML/XML document without an interpreter is like a sleeping beauty :-). To transform an XML document into something useful, you not only have to parse it but also have to interpret what to do with each GI.

Actually, because MEMUX is still a moving target, we implemented our own version of it until we get a consensus around a new spec, which should be the outcome of the newly created IETF MEMUX workgroup.

<YourComment>
Building the object model is probably the more expensive part, but in many cases multiple selective parsing passes (skimming) would be more efficient than parsing everything completely the first time through.
It seems that all current parsers assume that their duty is always to create a faithful model of the entire document they are presented with, and thus parse the entire document in a single pass with a single thread of control. Why this assumption?
</YourComment>

<Reply>
Not all parsers make this assumption :-) In our case, our parser either does event-based processing or builds a grove or a DOM. In fact, for a DOM-like interface, we prefer a new model we use internally, which is based on generalized property sets. This kind of interface can deal with either directory service objects or document objects. We merged both worlds because, when you look at them through the perspective of property sets, both are very similar. Then, with a property-set-based model, an interpreter supports an interface based on the composite pattern (ref: "Patterns" - Gamma et al.). It can _do_ something with directory service objects, relational database rows or document elements. This abstraction separates the interpretation from the parsing operations.

What is a property-set-based API then? Imagine this:

A hierarchy of objects, where each object has a property set attached to it. An object can contain other objects (i.e. the composite pattern). Thus, if each object is a collection of objects, and each member of the collection is indexed through an associative array (e.g. a map, B+ tree, etc.), then because an object can contain another object, you obtain a tree.
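As a rough illustration of the hierarchy just described, here is a minimal Python sketch, not the actual implementation discussed in this message. The class name `Node`, its methods, and the choice to key children by name are all assumptions made for the example; element text is stored under a property named `Content`.

```python
class Node:
    """Hypothetical composite object: a property set plus named child objects."""

    def __init__(self, name, **properties):
        self.name = name
        self.properties = dict(properties)  # the property set (an associative array)
        self.children = {}                  # contained objects, keyed by name

    def add(self, child):
        # Assumption for this sketch: member names are unique within a parent.
        self.children[child.name] = child
        return child

    def find(self, name):
        return self.children.get(name)

    def walk(self, depth=0):
        # Depth-first enumeration; containment is what makes this a tree.
        yield depth, self
        for child in self.children.values():
            yield from child.walk(depth + 1)


# Part of the <purchase> example expressed as objects with property sets.
purchase = Node("purchase", xmlns="http://www.ecommerce.net/ns/ec/")
purchase.add(Node("vendor-id", Content="18273645"))
purchase.add(Node("total", Content="92674.12"))

for depth, node in purchase.walk():
    print("  " * depth + node.name, node.properties)
```

Nothing in the walk depends on the objects being document elements; the same loop would serve directory entries or database rows, provided each member can be given a name.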
A) the object has members to manipulate its collection of objects, such as:
     add
     remove
     update
     get/find
     get enumerator
B) a property set is also a collection, and the property set interface has the same members:
     add
     remove
     update
     get/find
     get enumerator
C) an enumerator can be implemented as:
     next
     previous
     ResetTo

Thus, if this is implemented with object languages or object middleware (Java, DCOM, CORBA, ILU, etc.), an interpreter just has to get/find the object, enumerate its content and, for each object, get its properties. In the case of a document object, one of the properties is the GI content. For instance, to get a property from the <vendor-id> GI we call property->Get("Content", Content) or, with an interpreted language, Content = Get("Content").

Note that with a composite pattern interface we don't need to know all property names in advance, nor do we have to know every object's type. Therefore, the interface is more lightweight and we don't have to create a new interface for each new object. The interface is general enough to process a lot of whole-part structures, as long as each member can be associated with a name.

If we can't use a property-set-based interface because of memory footprint or other constraints, the interpreter is event based and implements an object event handler which receives a property set enumerator as a parameter, like:

  On_Object(PROPERTYSETENUM propertySetEnum)
  {
      for (.....) {
          propertySetEnum.next......
      }
  }

The interpreter just enumerates the property set and does something with it. Because the interpreter knows only certain keywords, it will process only these keywords. But all event-based interpreters use, in this case, the same interface whether the source is a parser or something else, like a directory service.

To replicate DSSSL or XSL mechanisms, each event handler can carry the GI name. For instance, On_Vendor-id corresponds to <vendor-id>, On_Customer to <customer>, etc.
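The per-GI event handlers can be sketched on top of an ordinary SAX parser. What follows is a hedged Python illustration using the standard `xml.sax` module, not the interpreter described in this message: the class name `PurchaseInterpreter`, the `on_<GI>` naming rule (hyphens mapped to underscores) and the `seen` dictionary are assumptions made for the example, and only leaf element text is collected.

```python
import xml.sax


class PurchaseInterpreter(xml.sax.ContentHandler):
    """Hypothetical interpreter: routes each closed element to on_<GI>, if defined."""

    def __init__(self):
        super().__init__()
        self._text = []
        self.seen = {}

    def startElement(self, name, attrs):
        self._text = []  # begin collecting character data for this element

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        # Build a tiny property set for the element (leaf text only in this sketch).
        properties = {"Content": "".join(self._text).strip()}
        handler = getattr(self, "on_" + name.replace("-", "_"), None)
        if handler is not None:  # GIs without a handler are simply ignored
            handler(iter(properties.items()))
        self._text = []

    def on_vendor_id(self, property_set_enum):
        for key, value in property_set_enum:
            if key == "Content":
                self.seen["vendor-id"] = value

    def on_total(self, property_set_enum):
        for key, value in property_set_enum:
            if key == "Content":
                self.seen["total"] = float(value)


DOC = b"""<purchase xmlns="http://www.ecommerce.net/ns/ec/">
  <seqno>12345678</seqno>
  <customer-id>87654321</customer-id>
  <vendor-id>18273645</vendor-id>
  <invoice-id>81726354</invoice-id>
  <total>92674.12</total>
</purchase>"""

interpreter = PurchaseInterpreter()
xml.sax.parseString(DOC, interpreter)
print(interpreter.seen)
```

Each handler receives an enumerator over the property set and reacts only to the keywords it knows, so an unknown GI or property costs nothing beyond the lookup.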
This way, each event handler can process its property sets differently. This last mechanism replicates, in a certain way, the pattern-matching mechanism found in style languages or transformation languages. Thus, to process your document, we would have:

  function On_purchase()
  {
      // enumerate all properties with enumerator
      // and _do_ something
  }

  function On_Customer()
  {
  }

  etc...

So, not all architectures are primitive :-).
</Reply>

Regards
Didier PH Martin
mailto:martind@n...
http://www.netfolder.com

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)