Re: Whitespace rules (v2)
At 6:36 AM -0500 8/25/97, Peter Murray-Rust wrote:
>I have been away for a few days so maybe it's a useful time to try to
>summarise
>the Whitespace debate and to ask a few questions. You don't need to read the
>rest of this unless you believe there is a problem to be addressed :-)

Afraid that I have to chime in when I see a non-problem consuming valuable time...

>
>In message <v03007800b01fa935a1f1@[205.181.197.116]> dgd@c... (David
>G. Durand) writes:
>> I observed with dismay that the issue of whitespace has surfaced on this
>> list, after we finally gave it the wooden-stake-in-the-heart treatment on
>> the WG discussion lists. As a chief proponent of the current method, I'll
>
>:-) I am not sure what has been killed :-)

The discussion, I had hoped. And certainly the shibboleth of a parser "normalizing" whitespace on behalf of the application.

>I will take David's points first, because I *do* believe that many of those
>who were involved in the development of the spec feel that there is no scope
>for further discussion of this *IN THE SPEC*.

I agree with this. Actually, the only question remaining, in my mind, is how the XML stylesheet language should allow whitespace to be processed. I disagree that there is any need for a non-stylesheet, non-application convention for whitespace. Note that, in some sense, the document type _description_ (i.e. descriptive prose describing the intent of a DTD) and the "schema" notions are application specifications, and are entitled to declare whitespace handling rules.

>Essentially the spec says:
> - This is a difficult problem. [Actually it doesn't say this, but
>it might help if it did in a footnote.]

It's only difficult if you think that it's a parser problem. It's easy in XML, because all whitespace is visible. I can think of no _simpler_ rule that a _parser_ could implement.
> - We have taken a minimalist approach where we do not give any support
>to any whitespace philosophy [other than PRESERVE which passes everything and
>can be platform-dependent], but leave this to the community. DEFAULT is simply
>the absence of PRESERVE.

Yes, since there is not a universal "whitespace philosophy" even for a single document (see my response to Marcus for an example), there's no reason to declare it in the instance.

>I believe this solves one species of problem, where the authoring tool/system
>is closely coupled to the application. CDF might be such a system (e.g. I have
>never seen a native CDF file).

No, it's a case where the "philosophy" is coupled to the application, not to the "document" in the abstract -- except insofar as it is defined by a "document type description" or "schema" -- which is essentially a set of ideal constraints that applications are expected to follow.

>(A) There is a defined DTD (e.g. TEI, HTML) but a variety of authoring tools
>and a variety of applications from different providers. Traditionally these
>will come from the SGML community. I believe that there will certainly be
>initial problems where m'facturer X emits whitespace in a particular way
>which is incompatible with Y's tools for rendering/transforming it. It may
>also be platform dependent. We've seen this in the development of HTML
>systems
>although they are improving.

TEI defines where whitespace is significant (almost nowhere, if I remember correctly).

>Remember that most SGML systems are currently implemented within a single site
>(the tools are chosen to be compatible throughout the process). Very little
>SGML is delivered over the WWW to be consistent between different m'facturers.
>XML is specifically designed to be delivered over the WWW in (I assume)
>a platform and m'facturer-independent way. Do we expect to see 'this XML
>file best viewed with FOO software'??? If so, we might as well give up now.
No, but every document will _have_ to either conform to a well-known DTD or schema of some sort, or be delivered with a stylesheet, and those are useful places for this behavior to be explained.

>IMO any developer needs to be able to say:
> (i) I support a wide range of XML DTDs.
> (ii) I can easily customise my software to support a range of commonly
>used DTDs
> (iii) Documents authored by my software should be readable by software
>from another m'facturer with whom I have had no formal discussions
> (iv) My system can support a range of applications which read documents
>produced by other m'facturers systems and with whom I have had no formal
>discussions

Nothing in a stylesheet-based solution violates this, to my mind.

>If all the manufacturers tell me this is a non-problem, I'll shut up (on this
>issue!) If each DTD defines its own use of whitespace (or worse, doesn't
>define it) they may have a lot of work.
>
>(B) There are generic XML applications. The XML community continues to discuss
>documents which 'contain information from more than one DTD' or 'are WF but
>not necessarily valid(atable)'. Examples of these are:
> (i) an XML document to which meta-data has been prepended.

I'm probably not the best person to address this, as I think that the mix-and-match proposals are ill thought out, but since the data is supposed to be recognizable, presumably it is also to be ignored by all applications other than "meta-applications". So that's not a problem.

> (ii) an XML document which includes chunks conforming to well-defined
>DTDs such as MathML.

In which case, they should have well-known stylesheets or descriptions that explain any whitespace conventions in use.

>
>The possible combinations are indefinitely large.

But since each individual part must have defined behavior, this should not be a problem.

>It is impossible to write bespoke software to process these documents, and we
>need generic mechanisms.
>Perhaps many will be dealt with by stylesheets, and
>maybe the WS issue is a question of developing appropriate conventions in
>stylesheets. In documents of this sort there have to be conventions and flags
>that indicate how to interpret the documents. The spec has indicated that it
>shouldn't be in the XML markup - no problem. Somehow conventions have to
>evolve, either conveyed implicitly or explicitly (e.g. through PIs).
>[Remember that there are - as yet - no agreed conventions as to what a PI can
>look like - you can put anything in after the target.]

I used to think this might be useful, but I can't actually think of any application that could plausibly care about whitespace folding and also do meaningful processing without knowledge of the DTD. A text-indexer can work without a DTD, but also doesn't need any whitespace info (folding is always good enough) -- and it needs to see every byte, because it may have to track file offsets of hits. Can you think of any other useful examples of "DTD-blind" applications that might care about how the document _intended_ the whitespace to be processed? I confess that I can't.

>Note; I am NOT trying to find a universal solution here. I am suggesting that
>we develop some common, useful approaches which will solve a reasonable
>number of problems.

But I don't actually see what problems we can solve with such solutions that are not better addressed in either the stylesheet or the DTD/schema.

>> The problem with this is that there are a large number of ways that
>> whitespace can be used: the "tokens" form mentioned at the end, for
>> example, has never been proposed for XML.
>
>I agree there are a large number of ways. Some classification would be
>valuable and IMO the sort of thing that XML-DEV could usefully provide.
>[The WS-separated tokens are no different from 'words' in HTML and I would
>expect that a large number of people would welcome a convention on
>normalising whitespace between 'words'.]
Enumerating these might have some pedagogical value, but I no longer see the practical value of declaring the behaviors. I used to think it might be useful, but I'm not so sure.

>Then the application needn't implement them :-) Applications have to do
>*something* about whitespace. This can be:
> - ignore the problem (or use PRESERVE)
> - their own thing
> - a set of choices which is understood by the community
> - refuse to process the document.

Only the second option (their own thing) makes any sense -- and it is typically driven by their knowledge of a DTD, or by possessing a stylesheet and following its dictates.

>It 'works' in that it shifts the problem to the application developer. I like
>the idea of an XML->XML transducer - perhaps in front of the application, or
>callable within it. If David thinks that such tools could be built
>independently of applications that is exactly what I am suggesting :-)

They are close to a _null_ application, and require _no_ whitespace normalization, since they need only pass any whitespace they see straight through. This was my original point. Only if you insist on "normalizing" do you _create_ problems with transduction.

>it's clear that an application *must* have access to all whitespace if it
>wants it (this is made clear by, say, the requirement of XML_LINK to search
>on pseudoelements). However it should also be able to access a normalised
>form of the document.

Why? I think I've argued effectively that this is not useful without a stylesheet or well-known DTD, and in those cases, it is not necessary (as the DTD or stylesheet should declare the conventions in use).

>> This is the option that XML universally adopts. That means that any other
>> method can be implemented _by any processor that cares_.
>> If one can imagine
>> destroying meaning of a document's content by the flattening of all
>> whitespace strings to a single space, then you may need more elements in
>> your content model, if you are not able to control the software that will
>> process the document.
>
>This is a good point.
>
>>
>> In other words the parser guarantees all WS will be visible to applications
>> -- this makes designing and implementing WS dependent processing easy --
>> but since applications are _not_ constrained as to folding or other WS
>> processing behaviour, document authors will have to be cautious in using
>> significant whitespace. If you can't assume that applications to process
>> your markup will do the right thing, then you should not play games with WS.
>
>Yes. But where is the rigour in authoring going to come from? This is where
>I believe that XML-DEV has a role.

I'm not sure what you mean here... If the application or DTD depends on whitespace critically (a bad idea, probably, but a permissible one) -- then it is the author's responsibility to use it properly (and to select a tool that lets her). Since the generic dumb text editor is such a tool, and it's widely available, I don't see a big problem here.

>> This actually is not much of an issue for CML, since it's a reasonable
>> assumption that any implementation of CML markup-display will have to do
>> lots of special things, of which whitespace is the least.
>
>No, the point was that CML wishes to re-use HTML and MathML as additional
>components in the document. And then meta-data, and ... So that the
>application will become bloated unless it can re-use the approaches from
>the rest of the community.

I'm afraid I don't see how you're going to share code with an HTML processor. Nor can I psych myself up to believe that whitespace folding code:

    while (isspace(c = getchar())) ;   /* swallow the rest of the run */
    outchar = ' ';                     /* emit a single space for it */

is a big bloat problem in a program that can render organic chem reaction diagrams.
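
The two-line fragment above is shorthand for the idea rather than a complete routine. A minimal self-contained sketch in C might look like the following. The function name `fold_whitespace` and its buffer-based interface are illustrative assumptions, not anything taken from an XML parser or from CML code, and "whitespace" here simply means whatever `isspace` accepts in the current locale.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper: copy `in` to `out`, replacing every run of
 * whitespace characters with exactly one space.
 * `out` must have room for at least strlen(in) + 1 bytes.
 * Returns the length of the folded string (excluding the terminator).
 */
size_t fold_whitespace(const char *in, char *out)
{
    size_t n = 0;
    int in_run = 0;                 /* nonzero while inside a whitespace run */

    for (; *in != '\0'; in++) {
        /* The cast avoids undefined behaviour for negative char values. */
        if (isspace((unsigned char)*in)) {
            if (!in_run) {
                out[n++] = ' ';     /* first character of the run: emit one space */
                in_run = 1;
            }
            /* later characters of the same run are dropped */
        } else {
            out[n++] = *in;
            in_run = 0;
        }
    }
    out[n] = '\0';
    return n;
}
```

Calling `fold_whitespace("C6H6  +\n\tBr2", buf)` with a sufficiently large `buf` leaves `"C6H6 + Br2"` in it. Leading and trailing runs are folded to a single space rather than stripped, which is one of several possible policies and exactly the sort of choice that belongs to the application, not the parser.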
>> I think XML's agnostic position is the correct one for the language.
>> Authors should probably assume (unless they anticipate absolutely no
>> re-use) that HTML-style draconian normalization might occur anywhere and
>> use markup rather than whitespace, or at least CDATA sections. This
>> position _may_ be moderated (a little) where a well-known DTD with
>> well-defined WS rules can be used (like the TEI or HTML).
>
>I agree on this. The point I have been trying to promote is that it should
>be possible to collate the requirements of such systems and offer them
>on a re-usable basis.

If it's useful, just list some policies and be done with it, I guess. In answering this mail I've found that I no longer believe that it's very important, because I don't see how to use it effectively anywhere.

>An author could then say:
> - the content of FOO, BAR, FLIP can be expected to be treated by
>XML-DEV-HTML-like WS normalisation.
> - the content of BAZ, BLORT suffers WS stripping as described in
>XML-DEV-HTML-like-stripping.
>
>and that's about it. If we can get something along those lines, then
>I think a reasonable number of people would take note. It doesn't just have
>to apply to HTML DTDs.

Why not? Make a web page for the policies, create a notation declaration that points at it, and then use that notation as a prefix on a PI to declare these things. It can't do any harm other than maybe wasting time.

-- David

_________________________________________
David Durand              dgd@c...         \ david@d...
Boston University Computer Science        \ Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \ Dynamic Diagrams
--------------------------------------------\ http://dynamicDiagrams.com/
                  MAPA: mapping for the WWW  \__________________________

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo@i... the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa@i...)