RE: Saxon and Sun Serializer problems?
At 2009-05-31 10:11 -0700, Jim Tivy wrote:
> comments below.

Thanks, Jim. I think you've answered my question, and I'll offer some comments as well.

> Numeric character references are not "dropped" they are converted into their
> equivalent form according to the encoding.

Numeric character references are unrelated to the encoding and are, in fact, dropped when replaced with the equivalent Unicode character, independent of the encoding. In the data model you will find only the character, without any record of whether the character was natively included or included by means of a character reference.

> If entities are inlined in the parsing process they are not "lost", rather
> they are inlined.
>
> CDATA are characters so do not need to be "lost" - just the fact that they
> were treated in a special CDATA section.

Yes to both ... my point in all of these is in regard to what I interpreted you to require, which was input syntax preservation. All three things I cited are syntax features that are dropped when they are processed into the content they represent. It is the syntax that is dropped, not the information. In all three cases, when you look at the information in the data model, you have no idea what syntactic mechanisms may have been used.

> So given so many features for round-tripping that are not there, just
> putting in the DOCTYPE won't fix any of the ones I've cited.
> [<JT>] How many of these are fully lossy and how many have a logical
> equivalent.

Forgive me for not understanding what you mean by "fully lossy". If you are talking syntax, sure, many things are lost, but that's just syntax (the means to the end), not information (the end).

> How many are we trying to discourage for fully interoperable
> Xml. My point is DocType limited to Name, PublicId and SystemId is an
> important thing to round trip - sax does it.

*There* is the answer to my question: you want three items expressed in the data model from the DOCTYPE declaration and no aspect of the internal declaration subset that is part of the DOCTYPE. Thank you.

> [<JT>] I am not sure I agree XML editors process the syntax of Xml
> serializations. Many XML editors operate on DOMs.

I understand a number of editors work on their own private extensions to the DOM, but I also understand that using the DOM as standardized does not support all of the syntax of an XML document, which I acknowledged in my earlier email.

> In the DOM the input tree *is* the output tree, unlike
> XSLT and XQuery where the input tree is read-only and the output tree
> is write-only: created, from scratch, in a single pass, without
> backtrack or repair or inspection.
> [<JT>] Without backtrack is a bit unclear - since most XSLT processors are
> based on DOM.

That's merely an implementation perspective. The XSLT and XQuery language definitions do not allow a transformation to backtrack, repair or inspect any part of the result tree that has been constructed to that point (thus, none at all). A processor is allowed during serialization to serialize and forget an element's start tag once the element's content begins. There are no aspects of the language that give the stylesheet writer any information about the result tree they've created.

> The XSLT feature of adding a SYSTEM
> identifier is there as I see it really only for the validation
> bit. Because what is serialized is the information that was used to
> build the result tree ... not the syntax borrowed from the source tree.
> [<JT>] Why does this feature exist in XSLT if DocTypes are irrelevant as you
> suggested in your first question above.

Because one is creating an XML document from scratch and may want to ascribe a DOCTYPE declaration to it, as I said, for validation purposes.

> Ummmmm ....
> I can't agree for anything other than XML editors which
> are XML syntax applications not XML information
> applications.
> [<JT>] By syntax I assume you mean "exact serialized form syntax". XML
> Editors do not have to be "syntax" based applications - they operate on DOM
> many times (XMetal).

Again, I think you'll find that XMetal works on its own extensions to the DOM and not purely on the DOM as standardized. Citing http://www.w3.org/TR/DOM-Level-3-Core/core.html I read:

  "Note that character references and references to predefined entities are considered to be expanded by the HTML or XML processor so that characters are represented by their Unicode equivalent rather than by an entity reference."

So, right there, an XML editor based solely on the DOM cannot preserve the user's typing of a numeric character reference. And yet XML editors do preserve that information ... so they are working on the syntax of the XML document and not solely on the DOM model of the XML document, because that information isn't in the DOM model. Which is what I've been trying to say: these standardized interfaces to XML documents are not designed to support general-purpose XML syntax editors.

> [<JT>] I am not saying syntax should be preserved. I am saying that
> information items should not be "dropped" or lost especially when it is not
> replaced by some other "logical" equivalent. And DocType is an information
> item that has a purpose in its own right and it should not be dropped.
> Unlike character references which are converted into their equivalent
> underlying character.

You've narrowed it down to the parts of the DOCTYPE that you are interested in: the name of the document element (which is already there), the PUBLIC identifier and the SYSTEM identifier.

> [<JT>] My focus is on the idea of progress.
> Perhaps in the name of progress
> we should not use DocTypes and DTDs but instead use Xml Schema to store our
> validation information since the schema location will not be lost as it is
> an attribute in the XDM.

Why W3C Schema and not RELAX NG or NVDL? Anyway, many people don't subscribe to embedding schema references in XML documents, because validation constraints are arbitrary and any XML document should be validatable against any set of validation constraints, not just the constraints that are embedded. I grant there is a convenience to some, and there is a project in ISO to standardize a processing instruction pointing to a document model independent of the model syntax. And that will show up in the XDM.

> This will not happen since many people agree DTDs
> are here to stay. Then, perhaps we should make the DocType with its public
> and systemIds accessible in the XDM and thus accessible in the input
> document.
> <!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN"
> "/SysSchema/dita/topic.dtd">

Fine ... if all you want are those two identifiers and not anything from the internal declaration subset of the DOCTYPE, then you've answered my question.

Thank you, Jim, for taking the time to clarify your needs. I don't have any further questions in this regard.

. . . . . . . . . . . . Ken

--
XQuery/XSLT/XSL-FO hands-on training - Los Angeles, USA 2009-06-08
Crane Softwrights Ltd.          http://www.CraneSoftwrights.com/x/
Training tools: Comprehensive interactive XSLT/XPath 1.0/2.0 video
Video lesson:    http://www.youtube.com/watch?v=PrNjJCh7Ppg&fmt=18
Video overview:  http://www.youtube.com/watch?v=VTiodiij6gE&fmt=18
G. Ken Holman                 mailto:gkholman@CraneSoftwrights.com
Male Cancer Awareness Nov'07  http://www.CraneSoftwrights.com/x/bc
Legal business disclaimers:  http://www.CraneSoftwrights.com/legal
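
Illustration (not part of the original message): a minimal Python sketch of the two points the exchange above settles on. It assumes Python's standard-library ElementTree and minidom as stand-ins for "a data model"; neither is the XDM, and minidom is only one DOM implementation, so other processors may behave differently. The helper names text_seen_by_model and doctype_identifiers and the sample documents are hypothetical; the DOCTYPE is the one quoted in the thread.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Four spellings of the same one-character content (hypothetical samples).
VARIANTS = {
    "literal": "<a>A</a>",
    "character reference": "<a>&#65;</a>",
    "internal entity": '<!DOCTYPE a [<!ENTITY e "A">]><a>&e;</a>',
    "CDATA section": "<a><![CDATA[A]]></a>",
}

# The DOCTYPE quoted in the thread, wrapped around an empty document element.
DITA_DOC = (
    '<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" '
    '"/SysSchema/dita/topic.dtd"><topic/>'
)


def text_seen_by_model(xml_text):
    """Return the element text as ElementTree exposes it after parsing."""
    return ET.fromstring(xml_text).text


def doctype_identifiers(xml_text):
    """Return (name, publicId, systemId) from the DOM DocumentType node."""
    doctype = minidom.parseString(xml_text).doctype
    return doctype.name, doctype.publicId, doctype.systemId


if __name__ == "__main__":
    # The syntax differs, but the parsed text is the same in every case.
    for label, xml_text in VARIANTS.items():
        print(label, "->", repr(text_seen_by_model(xml_text)))
    # The three DOCTYPE items discussed in the thread.
    print(doctype_identifiers(DITA_DOC))
```

In this sketch the parsed ElementTree text keeps no record of which spelling was used, while the DOM DocumentType node still carries the name and the PUBLIC and SYSTEM identifiers. The external DTD is not fetched here, so the identifiers are reported as written in the document.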