Re: Why the Infoset?

  • From: Rick JELLIFFE <ricko@g...>
  • To: xml-dev@x...
  • Date: Wed, 02 Aug 2000 00:25:29 +0800

Sean McGrath wrote:
> 
> >John Cowan wrote:
> >>  Character references are lost, it is true.
> >> If you want them back, shout now.
> >
> At 21:56 01/08/00 +0800, Rick JELLIFFE wrote:
> >Can I shout the opposite: "the fact that a character was entered
> >directly or by reference should not be information available for any
> >other specification or general-purpose application: it should not be
> >part of the infoset."
... 
> This is a good case in point where the in/not-in dualism of the
> OTI (One True Infoset) approach falls down. If character references
> are not in the infoset then it is impossible to
> write an XML parser based app that processes them.

Yes. This is a great thing.

> The only way to process them would be to do so *lexically*.
> In shifting to a lexical based algorithm you would need to
> basically *re-write* an XML parser in order to be sure
> that you were identifying character entity references correctly
> every time.

You couldn't do it reliably: you could only guess based on some other
out-of-band information (such as a "character collection"
specification).
 
> Oh, sure you can write a regexp that will work "most of the time" but
> try telling that to the client of the m-commerce/healthcare/rocket launching
> XML application you are building.

I don't understand this point at all. If the infoset contained only
resolved characters, then any regexp on the XML-parsed string
(normalization issues aside) will work the same every time. If you say
that a character reference is part of the infoset, that suggests that
you want the default behaviour of applications to be to preserve them:
that is not robust, because no application has been built with this in
mind. And if it means that you want the presence of a character
reference to signify some processing instruction or semantic, it is tag
abuse: use a PI or entityref or element. Furthermore, it suggests that
you think that preservation of character references should be the
default behaviour for round-tripping applications; however, I expect
that the default behaviour of XML-generating routines will be to
generate something closer to c14nized XML as well as to perform Unicode
early normalization. Finally, it would introduce incompatibilities into
something that all systems currently agree on (as they should).
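
To make the regexp point concrete, here is a minimal sketch in Python
using the standard xml.etree.ElementTree parser. The two sample
documents, the pattern, and the helper names are assumptions made up
for illustration; they are not taken from any specification. The
sketch only shows that a pattern applied to the parsed string behaves
the same whether the character was entered directly or by reference,
while the same pattern applied to the raw source does not.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample documents: the same text, with U+00E9 entered
# directly in one and as a numeric character reference in the other.
DIRECT = "<name>caf\u00e9</name>"
BY_REFERENCE = "<name>caf&#233;</name>"

# Hypothetical pattern an application might search for.
PATTERN = re.compile("caf\u00e9")


def parsed_text(xml_source: str) -> str:
    """Return the character data of the root element after parsing."""
    return ET.fromstring(xml_source).text


def matches_parsed(xml_source: str) -> bool:
    """Apply the pattern to the parsed string, not to the raw source."""
    return PATTERN.search(parsed_text(xml_source)) is not None


def matches_raw(xml_source: str) -> bool:
    """Apply the pattern lexically, to the unparsed source text."""
    return PATTERN.search(xml_source) is not None
```

With these definitions, matches_parsed gives the same answer for both
documents, whereas matches_raw finds the text in DIRECT but misses it
in BY_REFERENCE.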

In SGML days, the first thing we did on data coming in (after making
sure it validated somehow) was to normalize it, so that all tags were
explicit and all characters were represented in the same way, either as
direct characters or as references.
XML has reduced the need for data normalization because it is fully
tagged. But if you have problems with data coming from different sources
with different referencing conventions, the last thing you would want
would be for references to be preserved in the infoset, or for it not to
be easy to write a data normalizer.
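
As a rough illustration of how small such a data normalizer can be when
the parser resolves references for you, here is a sketch in Python with
the standard xml.etree.ElementTree module. The function name and its
flag are hypothetical, and it ignores everything a real normalizer
would also have to care about (DTDs, entities, comments outside the
root, Unicode normalization). It simply parses the input and writes it
back out with one consistent convention: either all direct characters,
or references for everything outside ASCII.

```python
import xml.etree.ElementTree as ET


def normalize(xml_source: str, use_references: bool = False) -> str:
    """Re-serialize a document with one character convention.

    Hypothetical helper for illustration only. The parser resolves
    character references, so how a character was originally entered
    does not survive parsing; the output convention is chosen here.
    """
    root = ET.fromstring(xml_source)
    if use_references:
        # Serializing to ASCII makes ElementTree write non-ASCII
        # characters as numeric character references.
        return ET.tostring(root, encoding="us-ascii").decode("ascii")
    # Serializing to a Python string keeps characters as direct characters.
    return ET.tostring(root, encoding="unicode")
```

Feeding it the same text from sources that use decimal references,
hexadecimal references, or direct characters should yield identical
output for a given setting of the flag.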

Rick Jelliffe
