Re: Specification Questions
In message <199708020838.JAA11135@a...> "Neil Bradley" writes:
[...]
> <p>This is a long paragraph that is broken over two
> <!-- comment -->
> lines, with an implied space between 'two' and 'lines'.</p>
>
> Is this interpreted as "two <!-- comment --> lines...", which reduces
> to "two lines"?

Some additional - hopefully constructive - thoughts on whitespace.

The XML-lang spec does not (and I suspect will not) give detailed
guidance on how whitespace will be managed. My impression is that it is
up to implementers and/or groups like this to come up with particular
solutions. My worry is that these will be inconsistent and not
inter-operable.

*** Therefore I propose that those on XML-DEV who care about this
problem come up with some guidelines for implementers. ***

XML does NOT treat whitespace like SGML and does NOT behave like HTML
(although it can be configured to do so). As far as I see them, the
rules are:

'All characters that are not markup are passed to the application'.

(This is independent of any value of XML-SPACE (see below), processing
instructions, stylesheets, etc.) These characters include HT, CR, LF,
SP, and probably a number of other Unicode 'whitespace' characters. What
the application does with them is *undefined* in XML-lang.

Note that this means that CR and LF are passed as separate characters.
No normalisation takes place. Therefore

    Line one\n\rline two

is different from

    Line one\nline two

even if they are visually similar on various text editors/displays, etc.
(My impression was that SGML normalised these two strings to the same
ESIS output - is that right?). This means that the author/processor
'contract' has to be aware of this.

Note also that *all* line-ends are passed (even immediately before/after
markup) unlike SGML. Therefore:

    <FOO>
    line one
    </FOO>

and

    <FOO>line one</FOO>

are different.
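
The last pair above can be checked with any parser that hands character data to the application untouched. A minimal sketch, assuming Python's standard `xml.etree.ElementTree` as a stand-in for an XML-lang parser (it implements the later XML 1.0 recommendation, which normalises line ends, so the CR/LF comparison is deliberately left out):

```python
import xml.etree.ElementTree as ET

# Two documents that differ only in the line ends next to the tags.
with_line_ends = "<FOO>\nline one\n</FOO>"
without_line_ends = "<FOO>line one</FOO>"

text_a = ET.fromstring(with_line_ends).text
text_b = ET.fromstring(without_line_ends).text

# The line ends immediately after <FOO> and before </FOO> reach the
# application as part of the character data.
print(repr(text_a))
print(repr(text_b))
print(text_a == text_b)
```

The two strings compare unequal, so an application that wants them treated alike has to say so itself.
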

Note also that:

    <FOO><BAR>baz</BAR></FOO>

is different from

    <FOO>
    <BAR>baz</BAR>
    </FOO>

The latter contains two pseudo-elements which contain only whitespace
(line-end characters) and FOO therefore has three children.

[Note that to make documents readable, the following trick can be used:

    <FOO
    ><BAR
    >baz</BAR
    ></FOO
    >

since whitespace within the tag is ignored. I do not think newcomers
will adopt this easily, and I suspect it can lead to errors in document
editing.]

*** In some cases the document author and the application author are
both aware of this problem and so the whitespace characters inserted by
the author will be processed in the way that they expect. However, in
most cases I suspect this will NOT be true and that authors will
inadvertently create documents that are processed differently. ***

XML provides an attribute XML-SPACE (local to an element BUT inherited
by its children) which can have three values:

- #IMPLIED (no signals about whitespace handling)
- PRESERVE (applications preserve all the whitespace)
- DEFAULT (the *application's* default white-space processing modes are
  acceptable for this element).

PRESERVE seems clear. All whitespace is passed to the application. The
others seem to be dangerous unless there are some general conventions.

[Note also that XML parsers or processors have to ensure that children
inherit the XML-SPACE attributes of their parents. Where does this get
done? In the parser? (It's part of XML-lang.) In the processor? In that
case there is ample scope for inconsistent treatment... Inheritance is
already required in two places - XML-SPACE and XML-ATTRIBUTES (XML-link).
This is a generic mechanism and presumably should be implemented in some
package independently of the application. Comments?]

If possible, we should propose a *general* default mechanism for
whitespace handling for XML-SPACE="DEFAULT". If everyone adopts this, it
will greatly reduce this problem. Is this a reasonable strategy?

If so, we can propose that the DEFAULT mode for any whitespace
processing is something along the following lines (similar to HTML?).

Within an element with XML-SPACE="DEFAULT":

- All whitespace sequences are mapped into a single space character.
- All whitespace pseudo-elements are ignored (i.e. whitespace between
  markup).
- All leading and trailing whitespace in #PCDATA is ignored.

Does this cover everything? Is it workable?

Example:

    <FOO XML-SPACE="DEFAULT">
    <BAR>
    this <!-- comment -->
    is<!-- comment -->
    a bar
    </BAR></FOO>

folds to:

    <FOO XML-SPACE="DEFAULT"><BAR>this is a bar</BAR></FOO>

[Note that the Xpointer STRING syntax and the use of pseudo-elements
works on the *raw* data (i.e. all non-markup characters). Therefore the
application has to have access to this - it has to maintain a PRESERVEd
version of the document as well as (say) displaying or transforming a
DEFAULTed document.]

I think it's important to address this, since otherwise I predict we
shall have considerable confusion, especially when implementors of
authoring or processing software have not thought this through
completely.

P.

--
Peter Murray-Rust, domestic net connection
Virtual School of Molecular Sciences
http://www.vsms.nottingham.ac.uk/

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo@i... the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa@i...)
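
The three DEFAULT rules proposed in the message can be tried out in a few lines. What follows is only a sketch under stated assumptions: `fold_default` is a hypothetical helper, not part of any XML package; it uses Python's standard `xml.dom.minidom`; it treats the whole subtree as being in DEFAULT mode and does not look at XML-SPACE on descendants; and the line breaks in the test document are one plausible reading of the example. A second, untouched parse stands in for the PRESERVEd version that XPointer-style addressing would need.

```python
import re
from xml.dom import minidom

# Whitespace as the message lists it: HT, CR, LF and SP.
WHITESPACE_RUN = re.compile(r"[\t\r\n ]+")


def fold_default(element):
    """Fold whitespace in place below `element` (hypothetical helper)."""
    # Comments are markup, so drop them before looking at character data.
    for child in list(element.childNodes):
        if child.nodeType == child.COMMENT_NODE:
            element.removeChild(child)
    # Merge the text runs that the removed comments used to separate.
    element.normalize()
    for child in list(element.childNodes):
        if child.nodeType == child.TEXT_NODE:
            # Rule 1: collapse each run to one space; rule 3: trim both ends.
            folded = WHITESPACE_RUN.sub(" ", child.data).strip(" ")
            if folded:
                child.data = folded
            else:
                # Rule 2: a whitespace-only pseudo-element is ignored.
                element.removeChild(child)
        elif child.nodeType == child.ELEMENT_NODE:
            fold_default(child)


RAW = (
    '<FOO XML-SPACE="DEFAULT">\n'
    "<BAR>\n"
    "this <!-- comment -->\n"
    "is<!-- comment -->\n"
    "a bar\n"
    "</BAR></FOO>"
)

preserved = minidom.parseString(RAW)  # raw view, all characters kept
folded = minidom.parseString(RAW)     # view handed to display code
fold_default(folded.documentElement)
result = folded.documentElement.toxml()
print(result)

# The three-children case: two whitespace pseudo-elements around BAR.
three = minidom.parseString("<FOO>\n<BAR>baz</BAR>\n</FOO>")
children_before = len(three.documentElement.childNodes)
fold_default(three.documentElement)
children_after = len(three.documentElement.childNodes)
print(children_before, children_after)
```

Trimming every text run, as the third rule is implemented here, would also remove the space next to an inline element in mixed content, which may be one of the cases the "Does this cover everything?" question needs to settle.
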