Re: Simple approaches to XML implementation
[from PMR]
>
> ESIS doesn't retain everything from the original document(s) and I've
> been asking the experts what gets lost.

In case someone wants even more precise information, ESIS (Element Structure Information Set) is fully defined in annex G of document ISO/IEC/JTC1/SC18/WG8/N1035: Recommendations for a Possible Revision of ISO 8879 (SGML). You can find an exact replication of this passage in Charles Goldfarb's "SGML Handbook" (Clarendon Press, 1990), pp 588 to 591.

> My rough summary is that
> XML->ESIS loses:
> - comments (this matters if you want to edit the document or have
>   it read by humans. However comments should not be used
>   by machines - simply passed through)

True

> - entities. If your document includes entities such as &chapter1;
>   these may be expanded and replaced by their contents. In
>   this way some of the structure may be less clear

It's actually more complex than that. SGML *text* entity references, whether the entities are "internal" or "external", are indeed fully expanded, and you are not even notified of this in the ESIS event stream. Therefore, ESIS does not convey the "entity structure" of an SGML document. This is, by the way, irrelevant to most applications ... except for those, such as some SGML editors, whose purpose is seen as being able to manipulate SGML documents without arbitrarily altering their entity structure (in addition to their element structure).

External data entity references and internal SDATA and PI entity references are signaled in the ESIS, while CDATA internal entity references are expanded without being reported. This may appear to be a bizarre design choice, but there is something even more disturbing: in the case of internal SDATA entity references, only the entity "replacement value" is passed, not the entity "name".
This is one of the reasons why ESIS information alone is not enough to implement an "identity transformation" for SGML documents, even when you don't care about the physical decomposition of the document into several files (SGML entities). Note that SDATA entities disappear in XML, so that THIS PROBLEM DISAPPEARS AS WELL!

> - conditional markup. If you use INCLUDE and/or IGNORE then the
>   IGNORE'd sections won't come through and the INCLUDE'd
>   ones won't be marked as such

True

> [I think that processing instructions come through OK?

True

> And that you can determine whether an attribute value was defaulted
> or not?]

Unfortunately not. This information is unavailable in ESIS, and you would need to access some "DTD information set" to be able to recover it. Besides attribute names and de facto values, the only side information you have in ESIS is whether the value of an #IMPLIED attribute has been left unspecified.

There is one more piece of information missing in ESIS, which causes a problem when implementing an "identity transformation" for plain SGML documents: you don't know WHICH ELEMENTS HAVE BEEN DECLARED EMPTY in the DTD. You may know when an element has null content, but you don't know whether this is because it happens to be so (optional content) or because it can't have any (declared EMPTY). Therefore, you do not know whether you should output an end tag for it or not. Again, you would need some "DTD information" to disambiguate.

Maybe not everyone has realized it yet, but this *is* the one and only reason why XML introduces the explicit <EMPTY/> syntax for empty elements. This, again, makes the problem disappear with XML.

All in all, you can see that some design decisions in XML were precisely motivated by the desire to make an ESIS event stream sufficient to implement an identity transformation, even with no access to DTD information.
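
To make the empty-element point concrete, here is a minimal sketch in Python. It does not use the real ESIS notation: the event tuples and the names start_tag, write_sgml, write_xml and declared_empty are hypothetical, chosen only for illustration, and the escaping borrowed from xml.sax.saxutils is a simplification on the SGML side. The SGML writer cannot decide about end tags without a set of declared-empty element names taken from the DTD, while the XML writer needs nothing but the events.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical event shapes (an assumption, not the real ESIS notation):
#   ("start", name, attrs_dict), ("end", name), ("data", text)

def start_tag(name, attrs):
    """Build a start tag without its closing delimiter."""
    return "<" + name + "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())

def write_sgml(events, declared_empty):
    """Serialize as SGML. `declared_empty` has to come from the DTD:
    the event stream alone cannot supply it."""
    out = []
    for ev in events:
        if ev[0] == "start":
            out.append(start_tag(ev[1], ev[2]) + ">")
        elif ev[0] == "end":
            # An element declared empty must not get an end tag.
            if ev[1] not in declared_empty:
                out.append(f"</{ev[1]}>")
        elif ev[0] == "data":
            out.append(escape(ev[1]))
    return "".join(out)

def write_xml(events):
    """Serialize as XML using only the events, no DTD information."""
    out = []
    pending = None  # start tag whose closing delimiter is not written yet
    for ev in events:
        if ev[0] == "start":
            if pending is not None:
                out.append(pending + ">")
            pending = start_tag(ev[1], ev[2])
        elif ev[0] == "end":
            if pending is not None:
                # Nothing came between start and end: empty-element tag.
                out.append(pending + "/>")
                pending = None
            else:
                out.append(f"</{ev[1]}>")
        elif ev[0] == "data":
            if pending is not None:
                out.append(pending + ">")
                pending = None
            out.append(escape(ev[1]))
    return "".join(out)

events = [
    ("start", "doc", {}),
    ("start", "br", {}), ("end", "br"),            # declared empty in the DTD
    ("start", "p", {"id": "x"}), ("end", "p"),     # merely happens to be empty
    ("end", "doc"),
]

print(write_sgml(events, {"br"}))   # <doc><br><p id="x"></p></doc>
print(write_sgml(events, set()))    # <doc><br></br><p id="x"></p></doc>
print(write_xml(events))            # <doc><br/><p id="x"/></doc>
```

The two SGML outputs differ only because of the extra argument, which is exactly the "DTD information" an ESIS consumer lacks. In the XML case either spelling of an empty element is acceptable, so the event stream alone is enough to write the document back out.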
This is, of course, totally consistent with the idea that DTDs should not be systematically needed for processing XML fragments.

Whether you work with an event stream or an abstract tree(*) is orthogonal to this discussion: we are discussing the *available* information, not the way it is represented. This does not mean that I see abstract trees as useless, quite the contrary (see my previous mail).

I hope I helped clarify what ESIS is.

(*): I use the term "abstract tree" instead of "parse tree" to designate the "tree of typed nodes with attributes" (you could also say "SGML object tree", but this term seems to be somewhat overloaded these days...). From an SGML parser's point of view, an SGML "parse tree" would have distinct nodes for start tags and end tags, which is not what you are looking for when you want a useful representation that allows cutting and pasting SGML elements (seen as atomic, typed text objects with attached properties).

_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/
_/ François CHAHUNEAU              phone: [+33] 1 40 64 43 00    _/
_/ Directeur Général/General Manager                             _/
_/ AIS S.A.                        FAX: [+33] 1 40 64 43 10      _/
_/ 15-17 rue Rémy Dumoncel         email: fcha@a...              _/
_/ 75014, Paris, FRANCE            WWW: http://www.berger-levrault.fr _/
_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/_/

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo@i... the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa@i...)