Re: Enhancements to RAX / RSS angle
In my email to Sean that started this discussion, I mentioned that I had some ideas about integrating PYX with RAX to provide a very simple "pull-mode" interface for parsing XML. Many XML documents include one or more "records" that are processable with the RAX API, but also include some "loose" elements. For example, an RSS file includes "records" like <image> and <item> but also includes elements like <title>, <description>, and <managingEditor>. A simple way to parse such a document would be to read the "loose" elements as PYX lines and the "record" elements as RAX records.

I'm going to call this the "PYXRAX" interface. It will be identical to the RAX interface with the addition of one method, ReadPYX(), which returns the next PYX line from the input as a string. ReadPYX() will return the lines of the PYX stream corresponding to the input being parsed, with one exception: before returning the start-tag event for an element that has been defined as a record delimiter (using SetRecord()), it will return a special PYX line, consisting of the letter 'R' followed by the element name, indicating that a RAX record is waiting to be processed.

At this point the caller has two choices. It may call ReadRecord(), in which case the record will be read and the next call to ReadPYX() will return the next PYX event for the portion of the input after the closing tag for the record delimiter (i.e. the contents of the record will have been "swallowed" as far as ReadPYX() is concerned; note that this event could be another 'R' event if there are consecutive records). Or it may continue calling ReadPYX(), in which case no record will be recognized and the PYX events corresponding to the record's contents will be returned as if no record had been set.

Calling ReadRecord() before a record delimiter has been seen, or in the middle of a record that has been partially read by ReadPYX(), will skip to the next opening record delimiter, if any; this corresponds to the current ReadRecord() behavior.
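To make the proposed behavior concrete, here is a minimal Python sketch of how ReadPYX() and ReadRecord() might interact. It is not the RAX implementation. Only the method names SetRecord(), ReadRecord(), and ReadPYX() come from the description above. The class name PyxRax, the (name, fields) tuple returned by ReadRecord(), the sample data, and the choice to feed the class pre-converted PYX lines rather than raw XML are all assumptions made for illustration. PYX escape sequences are not decoded.

```python
from collections import deque

# Hypothetical sample: the PYX form of a tiny RSS-like document.
SAMPLE = """\
(rss
Aversion 0.91
(channel
(title
-Example Channel
)title
(image
(title
-Logo
)title
(url
-http://example.com/logo.gif
)url
)image
(item
(title
-First story
)title
(link
-http://example.com/1
)link
)item
)channel
)rss
"""


class PyxRax:
    """Illustrative sketch of the proposed PYXRAX interface (not real RAX)."""

    def __init__(self, pyx_lines):
        # Assumption: input is already a sequence of PYX lines, one event each.
        self._lines = deque(line.rstrip("\n") for line in pyx_lines)
        self._records = set()
        self._announced = False  # True right after an 'R' line was returned

    def SetRecord(self, name):
        self._records.add(name)

    def _at_record_start(self):
        head = self._lines[0] if self._lines else ""
        return head.startswith("(") and head[1:] in self._records

    def ReadPYX(self):
        if not self._lines:
            return None
        if self._at_record_start() and not self._announced:
            # Announce the waiting record without consuming its start tag.
            self._announced = True
            return "R" + self._lines[0][1:]
        self._announced = False
        return self._lines.popleft()

    def ReadRecord(self):
        self._announced = False
        # Skip forward to the next opening record delimiter, if any.
        while self._lines and not self._at_record_start():
            self._lines.popleft()
        if not self._lines:
            return None
        name = self._lines.popleft()[1:]
        fields, field, depth = {}, None, 1
        while self._lines and depth > 0:
            line = self._lines.popleft()
            if line.startswith("("):
                depth += 1
                if depth == 2:  # a direct child of the record is a field
                    field = line[1:]
                    fields.setdefault(field, "")
            elif line.startswith(")"):
                depth -= 1
                if depth == 1:
                    field = None
            elif line.startswith("-") and field is not None:
                # Assumption: text of nested elements is concatenated.
                fields[field] += line[1:]
        return name, fields


def demo():
    """Read loose elements as PYX lines and record elements as records."""
    parser = PyxRax(SAMPLE.splitlines())
    parser.SetRecord("image")
    parser.SetRecord("item")
    events = []
    while (event := parser.ReadPYX()) is not None:
        events.append(parser.ReadRecord() if event.startswith("R") else event)
    return events


if __name__ == "__main__":
    for event in demo():
        print(event)
```

The caller's loop is the whole point of the design: it dispatches on the 'R' line, so loose elements and records can be handled in one pass without any callbacks. A caller that ignores the 'R' line and keeps calling ReadPYX() simply sees the record's ordinary PYX events.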
Thus, in parsing a typical RSS file where "image", "item", and "textinput" were set as record types, ReadPYX() would return standard PYX events for all the elements prior to the <image> and would then return an "Rimage" event. If ReadPYX() were immediately called again, it would return a start-tag event "(image", then a start-tag event "(title", etc. Calling ReadRecord() at this point would *not* return an "image" record; it would return the first "item" record. Calling ReadRecord() immediately after reading the "Rimage" event *would* return an "image" record, and calling ReadPYX() after reading the record would return an "Ritem" event.

Is everybody hopelessly confused by now? Should I present an example of reading an OCS file?

On another matter, the documentation for RAX doesn't clearly specify what should be done if a "field"-level element contains nested elements. If I understand correctly, Sean's Python implementation omits the content of nested elements, whereas Robert Hanson's Perl implementation concatenates their text values in document order, similar to the way XPath computes the value of a node with children. I tend to favor the latter. (A third alternative would stringify the nested tags as well as their content, resulting in a value that could be "microparsed" by another parser.)
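The three ways of computing a field's value when it contains nested elements can be compared side by side. The following is a rough sketch using Python's standard xml.etree.ElementTree module, not code from either RAX implementation. The function names and the sample <description> element are made up for illustration, and the "omit" variant reflects just one possible reading of that behavior (it keeps text that follows a nested element but drops everything inside it).

```python
import xml.etree.ElementTree as ET

# Hypothetical field element containing nested markup.
FIELD = ET.fromstring(
    "<description>Visit <b>our <i>new</i></b> site</description>"
)


def value_omit_nested(el):
    """Keep only text directly inside the field; drop nested content."""
    parts = [el.text or ""]
    for child in el:
        parts.append(child.tail or "")  # text after the nested element
    return "".join(parts)


def value_concatenate(el):
    """Concatenate all descendant text in document order (XPath-like)."""
    return "".join(el.itertext())


def value_stringify(el):
    """Keep nested tags as markup so the value can be parsed again later."""
    parts = [el.text or ""]
    for child in el:
        # tostring() serializes the child together with its tail text.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


if __name__ == "__main__":
    print(repr(value_omit_nested(FIELD)))   # 'Visit  site'
    print(repr(value_concatenate(FIELD)))   # 'Visit our new site'
    print(repr(value_stringify(FIELD)))     # 'Visit <b>our <i>new</i></b> site'
```

Only the concatenating variant returns the text a reader would actually see, which is part of why it seems the least surprising default. The stringified variant is the only one that loses no information.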