Re: Whitespace rules (v2)

From: dgd@c... (David G. Durand)
To: xml-dev@i...
Date: Mon, 25 Aug 1997 10:58:13 -0500
At 6:36 AM -0500 8/25/97, Peter Murray-Rust wrote:
>I have been away for a few days so maybe it's a useful time to try to
>summarise
>the Whitespace debate and to ask a few questions. You don't need to read the
>rest of this unless you believe there is a problem to be addressed :-)

Afraid that I have to chime in when I see a non-problem consuming valuable
time...

>
>In message <v03007800b01fa935a1f1@[205.181.197.116]> dgd@c... (David
>G. Durand) writes:
>> I observed with dismay that the issue of whitespace has surfaced on this
>> list, after we finally gave it the wooden-stake-in-the-heart treatment on
>> the WG discussion lists. As a chief proponent of the current method, I'll
>
>:-) I am not sure what has been killed :-)

I hoped the discussion. Certainly I hoped the shibboleth of a parser
"normalizing" whitespace on behalf of the application.

>I will take David's points first, because I *do* believe that many of those
>who were involved in the development of the spec feel that there is no scope
>for further discussion of this *IN THE SPEC*.  I agree with this.

Actually, the only question remaining, in my mind, is how the XML
stylesheet language should allow whitespace to be processed. I disagree
that there is any need for a non-stylesheet, non-application convention for
whitespace. Note that, in some sense, the Document type _description_ (i.e.
descriptive prose describing the intent of a DTD) and the "schema" notions
are application specifications, and are entitled to declare whitespace
handling rules.

>Essentially the spec says:
>	- This is a difficult problem.  [Actually it doesn't say this, but
>it might help if it did in a footnote.]
It's only difficult if you think that it's a parser problem. It's easy in
XML, because all whitespace is visible. I can think of no _simpler_ rule
that a _parser_ could implement.

>	- We have taken a minimalist approach where we do not give any support
>to any whitespace philosophy [other than PRESERVE which passes everything and
>can be platform-dependent], but leave this to the community. DEFAULT is simply
>the absence of PRESERVE.

Yes, since there is not a universal "whitespace philosophy" even for a
single document (see my response to Marcus for an example), there's no
reason to declare it in the instance.

>I believe this solves one species of problem, where the authoring tool/system
>is closely coupled to the application. CDF might be such a system (e.g. I have
>never seen a native CDF file).

No, it's a case where the "philosophy" is coupled to the application, not
to the "document" in the abstract -- except insofar as it is defined by a
"document type description" or "schema" -- which is essentially a set of
ideal constraints that applications are expected to follow.

>(A) There is a defined DTD (e.g. TEI, HTML) but a variety of authoring tools
>and a variety of applications from different providers. Traditionally these
>will come from the SGML community. I believe that there will certainly be
>initial problems where m'facturer X emits whitespace in a particular way
>which is incompatible with Y's tools for rendering/transforming it. It may
>also be platform dependent.  We've seen this in the development of HTML
>systems
>although they are improving.

TEI defines where whitespace is significant (almost nowhere if I remember
correctly).

>Remember that most SGML systems are currently implemented within a single site
>(the tools are chosen to be compatible throughout the process). Very little
>SGML is delivered over the WWW to be consistent between different m'facturers.
>XML is specifically designed to be delivered over the WWW in (I assume)
>a platform and m'facturer-independent way.  Do we expect to see 'this XML
>file best viewed with FOO software'??? If so, we might as well give up now.

No, but every document will _have_ to either conform to a well-known DTD or
schema of some sort, or be delivered with a stylesheet, and those are
useful places where this behavior should be explained.

>IMO any developer needs to be able to say:
>	(i) I support a wide range of XML DTDs.
>	(ii) I can easily customise my software to support a range of commonly
>used DTDs
>	(iii) Documents authored by my software should be readable by software
>from another m'facturer with whom I have had no formal discussions
>	(iv) My system can support a range of applications which read documents
>produced by other m'facturers systems and with whom I have had no formal
>discussions

Nothing in a stylesheet based solution violates this to my mind.

>If all the manufacturers tell me this is a non-problem, I'll shut up (on this
>issue!) If each DTD defines its own use of whitespace (or worse, doesn't
>define it) they may have a lot of work.
>
>(B) There are generic XML applications. The XML community continues to discuss
>documents which 'contain information from more than one DTD' or 'are WF but
>not necessarily valid(atable)'. Examples of these are:
>	(i) an XML document to which meta-data has been prepended.
I'm probably not the best person to address this, as I think that the
mix-and-match proposals are ill-thought out, but since the data is supposed
to be recognizable, presumably it is also to be ignored by all applications
other than "meta-applications". So that's not a problem.

>	(ii) an XML document which includes chunks conforming to well-defined
>DTDs such as MathML.

In which case, they should have well-known stylesheets or descriptions that
explain any whitespace conventions in use.
>
>The possible combinations are indefinitely large.

But since each individual part must have defined behavior, this should not
be a problem.

>It is impossible to write bespoke software to process these documents, and we
>need generic mechanisms. Perhaps many will be dealt with by stylesheets, and
>maybe the WS issue is a question of developing appropriate conventions in
>stylesheets.  In documents of this sort there have to be conventions and flags
>that indicate how to interpret the documents. The spec has indicated that it
>shouldn't be in the XML markup - no problem.  Somehow conventions have to
>evolve, either conveyed implicitly or explicitly (e.g. through PIs).
>[Remember that there are - as yet - no agreed conventions as to what a PI can
>look like - you can put anything in after the target.]

I used to think this might be useful, but I can't actually think of any
application that could plausibly care about whitespace folding and also do
meaningful processing without knowledge of the DTD. A text-indexer can work
without a DTD, but also doesn't need any whitespace info (folding is always
good enough) -- and it needs to see every byte, because it may have to
track file offsets of hits.

Can you think of any other useful examples of "DTD-blind" applications that
might care about how the document _intended_ the whitespace to be
processed? I confess that I can't.



>Note; I am NOT trying to find a universal solution here.  I am suggesting that
>we develop some common, useful approaches which will solve a reasonable
>number of problems.

But I don't actually see what problems we can solve with such solutions,
that are not better addressed in either the stylesheet or DTD/schema
problems.

>> The problem with this is that there are a large number of ways that
>> whitespace can be used: the "tokens" form mentioned at the end, for
>> example, has never been proposed for XML.
>
>I agree there are a large number of ways.  Some classification would be
>valuable and IMO the sort of thing that XML-DEV could usefully provide.
>[The WS-separated tokens are no different from 'words' in HTML and I would
>expect that a large number of people would welcome a convention on
>normalising whitespace between 'words'.]

Enumerating these might have some pedagogical value, but I no longer see
the practical value of declaring the behaviors. I used to think it might be
useful, but I'm not so sure.

>Then the application needn't implement them :-)  Applications have to do
>*something* about whitespace.  This can be:
>	- ignore the problem (or use PRESERVE)
>	- their own thing
>	- a set of choices which is understood by the community
>	- refuse to process the document.

Only 2 (their own thing) makes any sense -- and is typically driven by
their knowledge of a DTD or by possessing and following the dictates of a
stylesheet.
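
To make "their own thing" concrete, here is a minimal sketch of an
application keeping its own per-element whitespace table. Everything in it
is hypothetical: the element names, the ws_policy enum and the
ws_policy_for() helper are invented for illustration, and falling back to
"preserve" for unknown elements is just one possible choice.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical policies an application might pick for an element. */
enum ws_policy { WS_PRESERVE, WS_FOLD };

struct ws_rule {
    const char *element;     /* element type name */
    enum ws_policy policy;   /* what this application does with its whitespace */
};

/* Illustrative table only. In practice the entries would be derived from
   the DTD's prose description or from a stylesheet. */
static const struct ws_rule ws_rules[] = {
    { "pre",  WS_PRESERVE },
    { "para", WS_FOLD },
};

/* Look up the policy for an element. Unknown elements fall back to
   WS_PRESERVE (an assumption of this sketch, not a rule from the spec). */
enum ws_policy ws_policy_for(const char *element)
{
    size_t i;

    assert(element != NULL);
    for (i = 0; i < sizeof ws_rules / sizeof ws_rules[0]; i++) {
        if (strcmp(ws_rules[i].element, element) == 0)
            return ws_rules[i].policy;
    }
    return WS_PRESERVE;
}
```

The table lives entirely in the application; the parser still hands over
every whitespace character unchanged.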

>It 'works' in that it shifts the problem to the application developer. I like
>the idea of an XML->XML transducer - perhaps in front of the application, or
>callable within it.  If David thinks that such tools could be built
>independently of applications that is exactly what I am suggesting :-)

They are close to a _null_ application, and require _no_ whitespace
normalization, since they need only pass any whitespace they see straight
through. This was my original point. Only if you insist on "normalizing" do
you _create_ problems with transduction.

>it's clear that an application *must* have access to all whitespace if it
>wants it (this is made clear by, say, the requirement of XMl_LINK to search
>on pseudoelements).  However it should also be able to access a normalised
>form of the document.
Why? I think I've argued effectively that this is not useful without a
stylesheet or well-known DTD, and in those cases, it is not necessary (as
the DTD or stylesheet should declare the conventions in use).

>> This is the option that XML universally adopts. That means  that any other
>> method can be implemented _by any processor that cares_. If one can imagine
>> destroying meaning of a document's content by the flattening of all
>> whitespace strings to a single space, then you may need more elements in
>> your content model, if you are not able to control the software that will
>> process the document.
>
>This is a good point.
>
>>
>> In other words the parser guarantees all WS will be visible to applications
>> -- this makes designing and implementing WS dependent processing easy --
>> but since applications are _not_ constrained as folding or other WS
>> processing behaviour, document authors will have to be cautious in using
>> significant whitespace. If you can't assume that applications to process
>> your markup will do the right thing, then you should not play games with WS.
>
>Yes. But where is the rigour in authoring going to come from? This is where
>I believe that XML-DEV has a role.
I'm not sure what you mean here... If the application or DTD depend on
whitespace critically (a bad idea, probably, but a permissible one) -- then
it is the author's responsibility to use it properly (and select a tool
that lets her). Since the generic dumb text-editor is such a tool, and
it's widely available, I don't see a big problem here.

>> This actually is not much of an issue for CML, since it's a reasonable
>> assumption that any implementation of CML markup-display will have to do
>> lots of special things, of which whitespace is the least.
>
>No, the point was that CML wishes to re-use HTML and MathML as additional
>components in the document. And then meta-data, and ... So that the
>application will become bloated unless it can re-use the approaches from
>the rest of the community.

I'm afraid I don't see how you're going to share code with an HTML
processor. Nor can I psych myself up to believe that whitespace folding
code:
  while (isspace(c = getc(stdin))) ;  /* skip to the end of the whitespace run */
  outchar = ' ';                      /* emit one space in its place */
is a big bloat problem in a program that can render organic chem reaction
diagrams.
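
For concreteness, a slightly fuller sketch of the same folding idea,
assuming the text is already in memory as a NUL-terminated string rather
than read from a stream. The name fold_whitespace() is made up for this
example and is not taken from any XML library.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: copy src into dst, replacing every run of
   whitespace with a single space. dst must have room for strlen(src) + 1
   bytes. Returns the number of characters written, excluding the NUL. */
size_t fold_whitespace(const char *src, char *dst)
{
    size_t n = 0;
    int in_space = 0;   /* nonzero while inside a whitespace run */

    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; src++) {
        if (isspace((unsigned char)*src)) {
            if (!in_space)
                dst[n++] = ' ';   /* first character of a run: emit one space */
            in_space = 1;
        } else {
            dst[n++] = *src;      /* ordinary character: copy it through */
            in_space = 0;
        }
    }
    dst[n] = '\0';
    return n;
}
```

Note that this folds leading and trailing runs to a single space instead
of stripping them; whether to strip is yet another policy the application
would have to choose.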

>> I think XML's agnostic position is the correct one for the language.
>> Authors should probably assume (unless they anticipate absolutely no
>> re-use) that HTML-style draconian normalization might occur anywhere and
>> use markup rather than whitespace, or at least CDATA sections. This
>> position _may_ be moderated (a little) where a well-known DTD with
>> well-defined WS rules can be used (like the TEI or HTML).
>
>I agree on this.  The point I have been trying to promote is that it should
>be possible to collate the requirements of such systems and offer them
>on a re-usable basis.

If it's useful, just list some policies and be done with it, I guess. In
answering this mail I've found that I no longer believe that it's very
important, because I don't see how to use it effectively anywhere.

>An author could then say:
>	- the content of FOO, BAR, FLIP can be expected to be treated by
>XML-DEV-HTML-like WS normalisation.
>	- the content of BAZ, BLORT suffers WS stripping as described in
>XML-DEV-HTML-like-stripping.
>
>and that's about it. If we can get something along those lines, then
>I think a reasonable number of people would take note. It doesn't just have
>to apply to HTML DTDs.

Why not? Make a web page for the policies, create a notation declaration
that points at it, and then use that notation as a prefix on a PI to
declare these things. It can't do any harm other than maybe wasting time.

  -- David

_________________________________________
David Durand              dgd@c...  \  david@d...
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________



xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo@i... the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa@i...)