Re: Why the Infoset?
Sean McGrath wrote:
>
> >John Cowan wrote:
> >> Character references are lost, it is true.
> >> If you want them back, shout now.
> >
>
> At 21:56 01/08/00 +0800, Rick JELLIFFE wrote:
> >Can I shout the opposite: "the fact that a character was entered
> >directly or by reference should not be information available for any
> >other specification or general-purpose application: it should not be
> >part of the infoset."
...
> This is a good case in point where the in/not-in dualism of the
> OTI (One True Infoset) approach falls down. If character references
> are not in the infoset then it is impossible to
> write an XML parser based app that processes them.

Yes. This is a great thing.

> The only way to process them would be to do so *lexically*.
> In shifting to a lexical based algorithm you would need to
> basically *re-write* an XML parser in order to be sure
> that you were identifying character entity references correctly
> every time.

You couldn't do it reliably: you could only guess based on some other
out-of-band information (such as a "character collection"
specification).

> Oh, sure you can write a regexp that will work "most of the time" but
> try tell that to the client of the m-commerce/healthcare/rocket launching
> XML application your are building.

I don't understand this point at all. If the infoset contained only
resolved characters, then any regexp on the XML-parsed string
(normalization issues aside) will work the same every time.

If you say that a character reference is a part of the infoset, that
will suggest that you want the default behaviour of applications to be
to preserve them: that is not robust, because no application has been
built with this in mind. And if it means that you want the presence of
a character reference to signify some processing instruction or
semantic, it is tag abuse: use a PI or entityref or element.

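The point about regexps on the parsed string can be illustrated with a minimal sketch. It assumes Python's standard `xml.etree.ElementTree` as a stand-in for "an XML parser", and the `<name>` sample documents are hypothetical; nothing here comes from the original thread.

```python
import re
import xml.etree.ElementTree as ET

# The same content twice: once with a numeric character reference,
# once with the character entered directly (hypothetical sample data).
with_reference = "<name>Caf&#233;</name>"
with_literal = "<name>Caf\u00e9</name>"

# The parser resolves the reference, so the application sees plain text.
text_from_reference = ET.fromstring(with_reference).text
text_from_literal = ET.fromstring(with_literal).text
same_text = text_from_reference == text_from_literal

# A regexp applied to the parsed text behaves identically for both inputs.
pattern = re.compile("\u00e9")
matches_reference = bool(pattern.search(text_from_reference))
matches_literal = bool(pattern.search(text_from_literal))

# Applied lexically to the raw markup, the same regexp misses the
# referenced form, which is the fragility of purely lexical processing.
lexical_reference = bool(pattern.search(with_reference))
```

Once the reference is resolved, the application cannot tell which form was used, and the same pattern matches both inputs; only the lexical search over the raw markup depends on how the character was entered.
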
Furthermore, it suggests that you think that preservation of character
references should be the default behaviour for round-tripping
applications. However, I expect that the default behaviour of
XML-generating routines will be to generate something closer to
c14nized XML, as well as to perform Unicode early normalization.

Finally, it would introduce incompatibilities into something that all
systems currently agree on (as they should).

In SGML days, the first thing we did with incoming data (after making
sure it validated somehow) was to normalize it, so that all tags were
explicit and all characters were represented in the same way, either as
direct characters or as references. XML has reduced the need for data
normalization because it is fully tagged. But if you have problems with
data coming from different sources with different referencing
conventions, the last thing you would want would be for references to
be preserved in the infoset, or for it not to be easy to write a data
normalizer.

Rick Jelliffe
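
As a rough sketch of the kind of data normalizer described above, the following parses each document and re-serializes it with one consistent character convention. It assumes Python's standard `xml.etree.ElementTree`; the function name `normalize`, its `as_references` parameter, and the sample inputs are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def normalize(xml_text: str, as_references: bool = False) -> str:
    """Re-serialize a document so all characters use one convention.

    Hypothetical helper: with as_references=False, characters are
    written directly; with as_references=True, non-ASCII characters
    are written as numeric character references.
    """
    root = ET.fromstring(xml_text)
    if as_references:
        # Serializing to US-ASCII makes the serializer emit numeric
        # character references for anything outside ASCII.
        return ET.tostring(root, encoding="us-ascii").decode("ascii")
    return ET.tostring(root, encoding="unicode")


# Three sources with three different referencing conventions.
sources = [
    "<name>Caf&#233;</name>",   # decimal reference
    "<name>Caf&#xE9;</name>",   # hexadecimal reference
    "<name>Caf\u00e9</name>",   # direct character
]

direct = {normalize(s) for s in sources}
referenced = {normalize(s, as_references=True) for s in sources}
```

Because the parser hands over only resolved characters, the normalizer needs no knowledge of how each source wrote them, and all three inputs collapse to a single form under either convention.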