Hello,

Still about white space, sorry :-)

First part: comments on the XML draft approach to WS handling.
Second part: comments on Neil Bradley's five rules for WS handling (version 1).

**First part**

In the current draft, I see 3 rules concerning WS:

*Rule 1*: all WS is preserved and fed to the application.

A very simple rule indeed, in accordance with XML design goals. But Neil Bradley's five rules are simple to implement too (though incorrect). On the contrary, consider parameter entities: the committee members acknowledged they had some difficulty designing a grammar for DTD declarations, because of PEs. So implementing such a grammar won't be trivial (BTW, someone said he had designed a W grammar. It could be interesting to see what it looks like. Please post!), far less trivial than replacing CR, LF, CRLF by a single character! (NB: the WG agreed a few days ago on that rule :-) So the simplicity argument doesn't hold.

The real issue is that the application must be fed with a credible tree structure. Take a document without a DTD:

<DOC>CR
<PART>CR
<P> foo</P>CR
</PART>CR
</DOC>

What kind of tree structure will the processor offer us? A root node "DOC". So far, so good. But everybody now expects a single child node (the "PART" element). The processor gives us *three* for the same price: the very useful "CR" node, the "PART" element, and another "CR" node. What kind of ridiculous tree is that? A Chernobyl tree, I guess.

*Rule 2*: a validating parser must distinguish WS in element content and signal to the application that such WS is not significant.

I observe that it is not said how the parser will tell the application about such insignificant WS. A minor point, I concede. Whether the parser is validating or not, a solution should be found where WS in element content is *discarded*: this is the important point. No node should contain only WS: that is completely against the philosophy of SGML/XML, which is (well) *structured* content.
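The three-child tree described for the DOC example can be reproduced with any non-validating parser. The following is only a sketch, assuming Python's standard `xml.dom.minidom` as the processor and LF in place of the CR shown in the example; the helper names `child_summary` and `drop_ws_only_text` are made up for this illustration.

```python
from xml.dom import minidom

# The DOC example, with LF standing in for the CR of the original.
SOURCE = "<DOC>\n<PART>\n<P> foo</P>\n</PART>\n</DOC>"

XML_WS = " \t\r\n"  # the four XML white space characters


def child_summary(node):
    """Return element names, and '#text' for each text node, in order."""
    summary = []
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            summary.append(child.tagName)
        elif child.nodeType == child.TEXT_NODE:
            summary.append("#text")
    return summary


def drop_ws_only_text(node):
    """Recursively remove text nodes that contain nothing but white space."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE and child.data.strip(XML_WS) == "":
            node.removeChild(child)
        elif child.nodeType == child.ELEMENT_NODE:
            drop_ws_only_text(child)


root = minidom.parseString(SOURCE).documentElement
before = child_summary(root)   # white space nodes on both sides of PART
drop_ws_only_text(root)
after = child_summary(root)    # only PART is left
print(before, after)
```

This sketch discards every white-space-only node blindly, so it would also remove a single space that separates two inline elements; it shows the easy part (discarding), not the hard part (knowing what is element content).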
If the parser is able to distinguish what is element content and what is not (the hard part without a DTD), it should discard that completely useless WS (the easy part).

*Rule 3*: A special attribute may be inserted in documents to signal an intention that the element to which this attribute applies requires all white space to be treated as significant by applications. The value DEFAULT signals that applications' default white-space processing modes are acceptable for this element; the value PRESERVE indicates the intent that applications preserve all the white space.

As someone observed, this contradicts the position "the application should manage WS issues, the parser doesn't intervene". BTW, the attribute is hardly useful: suppose I put a document on the web, with a "FOO" element whose "XML-SPACE" attribute is set to "DEFAULT". Application A normalizes WS by default. Application B does nothing with WS by default. As a result, an attribute set to "DEFAULT" conveys absolutely no information. It will be the same as "PRESERVE" with some applications. Basically, it will be a mess :-) But we are used to that :-))

What is strange too is that there is no default value for this attribute. Those SGML guys are really subtle :-)) A default value of "DEFAULT" would seem natural, but in that case the application does anything it wants to, so who cares :-)

**Second part**

Neil Bradley proposed some simple rules (this is "version 1"; a second version, a little more complex but simple enough, was proposed later). I really like the approach, even if it doesn't work for the moment.

*Rule 1*: standardization of input from different OSs. CR, LF, CRLF are translated to a line end code.

OBVIOUS!!!!!

*Rule 2*: line end codes after a start tag or before an end tag are discarded.

A simple rule.
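Rules 1 and 2 as stated can be sketched with two text substitutions. This is only an illustration under assumptions: LF is chosen as the "line end code", tags are recognized with a crude regular expression rather than a real tokenizer, and the function names are invented here.

```python
import re

LINE_END = "\n"  # assumption: LF plays the role of the normalized line end code


def normalize_line_ends(text):
    """Rule 1: translate CRLF, CR and LF to a single line end code."""
    return re.sub(r"\r\n|\r|\n", LINE_END, text)


def drop_line_ends_at_tags(text):
    """Rule 2: discard a line end right after a start tag or right before an end tag."""
    # A start tag: '<', a letter, anything but '<' or '>', not ending in '/>'.
    text = re.sub(r"(<[A-Za-z][^<>]*(?<!/)>)\n", r"\1", text)
    # An end tag: '</', a letter, anything but '<' or '>', then '>'.
    text = re.sub(r"\n(</[A-Za-z][^<>]*>)", r"\1", text)
    return text


def apply_rules_1_and_2(text):
    return drop_line_ends_at_tags(normalize_line_ends(text))


print(apply_rules_1_and_2("<P>\r\nblabla\r</P>"))
```

The order matters: normalizing first means Rule 2 only has to look for one kind of line end.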
For usual elements, it is exactly what you expect:

<P>
blabla
</P>

becomes <P>blabla</P>. For PRE-like elements:

<PRE>
SPSPblabla
</PRE>

becomes <PRE>SPSPblabla</PRE>, so two line ends are discarded. It nevertheless seems natural that these line ends are dropped. BTW, this rule was in the first (11/14/96) XML draft.

There is a first problem with this approach, in default content (preserved content will be examined later):

<P><EM>Two
</EM>words</P>

becomes <P><EM>Two</EM>words</P>. The space between "Two" and "words" evaporated. Same thing with:

<P><EM>
Two
</EM>words</P>

I don't think this particular problem is important: the encoding is not natural. It should be an error! I think everybody would write <P><EM>Two</EM> words</P>, or <P> <EM>Two</EM> words </P>, etc.

Inside a preserved element, line end codes are wrongly discarded after element start tags and before element end tags:

<PRE XML-SPACE="PRESERVE">
blabla
<EM>
bloblo</EM>
blublu
</PRE>

The coding in this case is natural: bla, blo and blu are very aesthetically aligned! But a line end code is discarded after "<EM>", and it shouldn't be. So preserved elements need a special rule. It seems quite natural that they need a special rule concerning line end codes (and space codes). A possibility: when the parser closes a "default" (not preserved) element and opens a "preserved" element, the line end codes after the start tag and before the end tag are discarded. But for a preserved element directly embedded in a preserved element, line end codes are left intact.

*Rule 3*: WS in element content is discarded.

WS in element content *must* be discarded. The problem is: without a DTD, one doesn't know if an element contains only other elements. Suppose we have:

<P><EM>blabla</EM>SP<EM>bloblo</EM></P>

We could choose a rule like: an element in which the parser finds only other elements and WS (no characters) is an element content element. But as the above example shows, it doesn't work.
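The candidate rule just stated (only child elements and WS means element content) is easy to try out. The sketch below assumes Python's `xml.dom.minidom` as the parser; `looks_like_element_content` and `strip_by_heuristic` are names invented for this illustration, not part of any proposal.

```python
from xml.dom import minidom

XML_WS = " \t\r\n"  # the four XML white space characters


def looks_like_element_content(elem):
    """True if elem has at least one child element and no non-WS text."""
    has_element = False
    for child in elem.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            has_element = True
        elif child.nodeType == child.TEXT_NODE and child.data.strip(XML_WS) != "":
            return False
    return has_element


def strip_by_heuristic(elem):
    """Discard WS text in every element the heuristic takes for element content."""
    if looks_like_element_content(elem):
        for child in list(elem.childNodes):
            if child.nodeType == child.TEXT_NODE:
                elem.removeChild(child)
    for child in elem.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            strip_by_heuristic(child)


# The P example, with a real space where the text writes SP.
root = minidom.parseString("<P><EM>blabla</EM> <EM>bloblo</EM></P>").documentElement
strip_by_heuristic(root)
result = root.toxml()
print(result)  # the space between the two EM elements is gone
```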
If we follow this rule, we have a tree with a root node "P" and two child nodes "EM". And what we want is a root node with three child nodes: two "EM" elements and, between the two, a "PCDATA" node (the space between "blabla" and "bloblo"). So a different method must be found.

A radical constraint put on the user would be: don't input a single space character in element content. With this rule the parser will be able to recognize element content easily. But you can forget about indentation in that case. The rule for the user would be: "when you type a space, you mean a space". BTW, this is always the case, except for indentation. If the semantic overloading of the space character is removed (today a space is either a "real" space or an indentation space), things are so much easier.

*Rule 4*: Except in preserved elements (elements with a space attribute set to "PRESERVE") line end codes are discarded when preceded by a hard or soft hyphen (in the process, a soft hyphen is also discarded) and remaining line end codes are treated as space.

The rule concerning hyphens is not necessary. If it's a hard hyphen, don't put it at line end (who would do that?). Moreover, there is no use in an XML source file for a soft hyphen at line end. Who would do that? In my poor life, I have had no occa-
sion to see some text with hyphens at line end.

There is a possible problem with the replacement of line end codes by a space character in default (that is, not preserved) elements. Suppose we have a text coded with Unicode (that could happen :-)), with Chinese ideographs. In Chinese, there is no concept of a word (sequence of letters): each ideograph is a "word". I don't know how the Chinese in fact encode their texts, but there is obviously no use in putting a space after each ideograph. The Chinese must nevertheless use the end of line character. And one shouldn't replace such a character by a space, which would be an error, but simply discard it.
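A class-dependent treatment of line ends could look like the sketch below. It rests on loud assumptions: "ideograph" is taken to mean only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), the hyphen part of Rule 4 is left out, and `is_cjk_ideograph` and `join_lines` are hypothetical names.

```python
def is_cjk_ideograph(ch):
    """Assumption: only the basic CJK Unified Ideographs block is checked."""
    return "\u4e00" <= ch <= "\u9fff"


def join_lines(text):
    """Replace each line end by a space, except between two ideographs."""
    lines = text.split("\n")
    out = lines[0]
    for line in lines[1:]:
        if out and line and is_cjk_ideograph(out[-1]) and is_cjk_ideograph(line[0]):
            out += line        # both neighbours are ideographs: discard the line end
        else:
            out += " " + line  # otherwise treat the line end as a space
    return out


print(join_lines("two\nwords"))
print(join_lines("\u4e2d\n\u6587"))
```

Even this small version has to look at the characters on both sides of every line end, which hints at how quickly such a rule grows.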
Depending on the class of characters, there could be a different treatment of line end codes. But this becomes complex :-(

Another approach: simply ignore line end codes. But then you have to put a space at the end of a line. The idea is quite natural: line end codes are there for our eyes, they don't add anything to the meaning of a text. The XML tree should reflect the substance of a text, not the particular way it was input:

<P>
We should get rid of line end codes
</P>

and

<P>We should get rid of line end codes</P>

should give the same node in the document tree. If line end codes must be preserved, use a preserved element or an empty element (<BR/>).

*Rule 5*: except in preserved elements, consecutive WS characters are reduced to a single space.

I don't like this rule. If I put two spaces after a period, I mean two spaces. It's a typographic decision. Rule 5 is meant to allow some indentation:

<P>
  He said:
  <QUOTE>
    I need some indentation.SPSPIndentation is needed.
  </QUOTE>
</P>

In the above example, it is necessary to get rid of the spaces caused by indentation. But the two spaces marked "SP" should be retained. So the new rule would be: SPs at the beginning of a line should be discarded. This rule must be applied before line end codes are discarded, i.e. before rule 2. What a headache :-)

Perhaps a simple rule could be: don't use indentation in XML files, or you'll get burned. More generally, if we want the parser to produce a clean data structure out of an XML file, some burden will have to be put on the user's shoulders. The contract could be: the user accepts some limitations on the way the source is input. Instead of the above, he could have to write something like:

<P>
He said:
<QUOTE>
I need some indentation.SPSPIndentation is needed.
</QUOTE>
</P>

The (invaluable) reward will be: a clean data structure available for applications.

Thanks for your attention!
Regards,
Arnaud

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo@i... the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa@i...)