Re: XML processing experiments
> Given XML's requirement that entity references in the instance are
> synchronous, I would have thought that the overhead of an entity stack
> could be avoided for parsing the instance. The parser passes the
> application an entity reference event, and the application can then, if
> it chooses, recursively invoke the parser to parse the referenced
> entity.

A pedant might note that the XML standard requires that for internal entities "the processor must ... retrieve its replacement text ... passing the result to the application in place of the reference". No doubt the same pedant could draw the line between processor and application such that this was satisfied.

This scheme seems reasonable for a parser that works in terms of events implemented by callbacks. Our parser, on the other hand, returns "bits" (essentially start tags, end tags, and pcdata) *sequentially*, following the model of reading a plain text file. Entity references are expanded, and a bit may end in a different entity from the one it started in (suppose foo is defined as "a<b/>c"; then the first bit returned from "x&foo;y" is "xa" - as far as I can tell this is quite legal XML). In a language with threads it's easy to implement this on top of a callback interface (in a sense the procedure stack in the parsing thread would replace the entity stack), but it's much messier in plain C.

Partly the reason for using the sequential model is historical: this parser is used in the LT-NSL system, which already worked like that. But it's also for simplicity: I want this parser to be easily usable with existing C applications (for example, someone here wants to be able to read XML-marked-up text into his speech synthesizer).

> [...]
> This is particularly the case if you want to get
> correct byte offsets when using a variable width encoding (such as
> UTF-8); it's hard to do this without a method call per character.
Misha Wolf tells me that my earlier comment about the non-invertibility of UTF-8 is wrong: the Unicode standard requires that the shortest encoding be used. So, for example, if you know the byte offset of the start of a line, then you can find the byte offset of a character in the line by calculating the encoded length of the preceding characters.

On the other hand, I note that current low-end machines can do about 10 million trivial non-leaf procedure calls per second, so maybe the overhead of a call per character is not unacceptable (in C I would be doing something like parser->source->get_translated_char(); there would probably be more overhead in an object-oriented language).

> [...]
> there would be a one stage process that
> converted a stream of bytes into a stream of characters already split up
> into tokens.

Yes - I have been thinking about that too. Outside the DTD the tokenisation is relatively trivial, and the speed of DTD processing is unimportant in many applications, so it can just use character-at-a-time translation.

-- Richard

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)