From: "Michael Kay" <michael.h.kay@n...>

> I would be much more interested in any innovation that allowed a parser
> to report more than one well-formedness error in a single run.

Does XML 1.0 anywhere actually state that parsing happens starting from the beginning? :-)

A madman could parse starting from the end. Interestingly, it would give you a different result from most XML parsers with:

  <x> <!-- xxx --> yyy ---> </x>

Now you could parse starting from the end, but still get the same result as parsing from the start, by being a little more clever. For example, you find a -->, then you search backwards for a <!--; if you find an intervening --> first, you parse (backwards or forwards) the intervening text. A progressive cha-cha.

Actually, my editor has a small backwards parser in it: because we cannot guarantee that we are working on a well-formed tree, when you want to close the current element, we parse backwards to find what the context is. This is quite a useful technique for avoiding building a DOM (or if you are working with pre-WF documents); however, it carries a worst-case performance penalty if you try to simulate forward parsing too faithfully.

So you can take the errors from a forward parse and combine them with the errors from a backwards parse. For example, say we had the text "XYZ" and tried to WF check it: a forwards parser might say "X not allowed here" and a backwards parser might say "Y not allowed here".

In any case, there are many WF errors that a forwards parser can recover from: for example, a missing entity reference close delimiter or a strange character in a name.

One of the differences between writing a streaming parser and a checkpointing incremental parser (such as an editor uses) is that in the checkpointing parser almost every legitimate state requires a corresponding error state and/or recovery state: not only do you have to parse, but you also have to contain errors to just around where they occur.

Cheers
Rick Jelliffe
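
The two backwards-parsing ideas above can be illustrated with a rough Python sketch. It is not the editor's actual code. The function names (`first_comment_forward`, `last_comment_backward_naive`, `enclosing_element`) are hypothetical, and the tag scanner assumes the text before the cursor contains only start tags, end tags and empty-element tags: no comments, CDATA sections or processing instructions, and no ">" inside attribute values.

```python
def first_comment_forward(text):
    """Forward scan: the comment ends at the first --> after <!--."""
    start = text.find("<!--")
    if start == -1:
        return None
    end = text.find("-->", start + 4)
    if end == -1:
        return None
    return text[start:end + 3]


def last_comment_backward_naive(text):
    """Naive backward scan: take the last -->, then the nearest <!-- before it.

    It does not check for an intervening -->, so it can disagree with the
    forward scan.
    """
    end = text.rfind("-->")
    if end == -1:
        return None
    start = text.rfind("<!--", 0, end)
    if start == -1:
        return None
    return text[start:end + 3]


def enclosing_element(text, pos):
    """Scan backwards from pos for the innermost element still open there.

    Returns the element name, or None if every start tag before pos has
    been closed. No tree is built; only a counter of end tags is kept.
    """
    pending_closes = 0  # end tags seen whose start tag is not yet reached
    end = pos
    while True:
        gt = text.rfind(">", 0, end)
        if gt == -1:
            return None
        lt = text.rfind("<", 0, gt)
        if lt == -1:
            return None
        tag = text[lt + 1:gt]
        end = lt
        if tag.startswith("/"):
            pending_closes += 1       # end tag: skip its matching start tag
        elif tag.endswith("/"):
            continue                  # empty-element tag opens nothing
        elif pending_closes:
            pending_closes -= 1       # start tag already closed later on
        else:
            parts = tag.split()
            if parts:
                return parts[0]       # first unmatched start tag


sample = "<x> <!-- xxx --> yyy ---> </x>"
print(first_comment_forward(sample))
print(last_comment_backward_naive(sample))
print(enclosing_element("<doc><sec><p>one</p><br/>", 25))
```

On the sample from the message, the forward scan stops at the first -->, while the naive backward scan swallows everything up to the final one; checking for an intervening --> is what brings the two back into agreement. The element scan stops at the first unmatched start tag, so its cost depends on how far back that tag sits, not on the size of the whole document.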



