
Whitespace rules (v2)

  • From: "Neil Bradley" <neil@b...>
  • To: xml-dev@i...
  • Date: Sun, 10 Aug 1997 22:48:32 +0000

Due to some useful feedback, and further thoughts of my own, I would 
like to amend my list of 5 whitespace rules in a few respects.

For people who read the previous set of rules, the corrections are:

a) block-enclosing elements must be identified via a list or style 
sheet
b) PI, comment and empty-element processing has changed completely
c) all rules explicitly apply to both validating and non-validating applications
d) the rules are explicitly to be applied in sequence

The new rules can be summarized as:

1. Normalize line-end codes
2. Remove block surrounding whitespace
3. Remove leading/trailing block line-ends
4. Join lines and de-hyphenate
5. Remove surplus spaces in text

------WHITESPACE RULES------

A formatting application should remove or transform whitespace characters 
received from the XML-processor according to the following 5
rules. These rules are to be applied in sequence, by both validating and 
non-validating applications.

Note 1: PI's, comments and empty elements may be removed at any 
point in the process. 

Note 2: in some cases, 'line-end' codes (CR and LF) are distinguished 
from 'spacing' characters (SP and TAB), but the term 'whitespace' 
continues to indicate all these characters.


----------
RULE 1. Every line-end code is regarded as a line terminator, except
when it immediately follows the other code ([CR] following [LF] or 
[LF] following [CR]), in which case it is discarded. A discarded code
is ignored entirely, so it is never treated as the first half of a
further pair. This rule also applies in 'preserved' content.
---
Note: this rule standardizes input from documents prepared on Mac, Unix and
MS-DOS/Windows platforms.

[CR] ---> line-end
[LF] ---> line-end
[CR][LF] ---> line-end
[LF][CR] ---> line-end
[LF][LF] ---> line-end, line-end
[CR][CR] ---> line-end, line-end
[CR][LF][CR][LF] ---> line-end, line-end (because both LF's are 
ignored)

Note: by including this rule in preserved content, we avoid alternate blank
lines appearing in documents prepared on an MS-DOS system but viewed
on another system.
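
As a rough illustration of rule 1 (not part of the original proposal),
the pairing logic might be sketched in Python as follows. The function
name normalize_line_ends is hypothetical, and representing a line-end
internally as "\n" is an assumption made for the sketch.

```python
def normalize_line_ends(text: str) -> str:
    """Sketch of rule 1: CR, LF, CR+LF and LF+CR each yield one line-end."""
    out = []
    prev = None  # last line-end code that counted as a terminator, if any
    for ch in text:
        if ch in "\r\n":
            if prev is not None and prev != ch:
                # Second half of a CR/LF or LF/CR pair: discard it and
                # forget it, so it cannot pair with the next character.
                prev = None
                continue
            out.append("\n")  # assumption: "\n" stands for 'line-end'
            prev = ch
        else:
            out.append(ch)
            prev = None
    return "".join(out)
```

Under these assumptions, normalize_line_ends("a\r\n\r\nb") gives two
line-ends between "a" and "b", matching the last case in the table above.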


----------
RULE 2. All whitespace preceding the start-tag and following the end-tag 
of a 'block enclosing' element is discarded.
---
Note: a non-validating application must refer to a style sheet or
configuration file to identify 'block enclosing' elements (perhaps by 
applying this rule to all elements not specified as in-line elements).
A validating application cannot easily identify these elements from the
content model (the first mixed content element in the hierarchy is 
block enclosing, as are all outer layers), so it may choose the same 
approach. 


Note:

 <chapter>[SP]<note>[SP][TAB]<p>This is a[SP]<em>para</em>...

becomes:

 <chapter><note><p>This is a[SP]<em>para</em>

and:

 <p>Para 1.</p>[CR]
 <p>Para 2.</p>

becomes:

 <p>Para 1.</p><p>Para 2.</p>

Note: If PI's, comments or empty elements remain in the data stream,
they are deemed transparent to this process, so:

 [SP]<!--comment--><p>Some text...

becomes:

 <!--comment--><p>Some text...
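
One possible way to sketch rule 2 in Python is shown below. It is only
an illustration: BLOCK_ELEMENTS is a hypothetical stand-in for the list
or style sheet mentioned above, the function names are invented, and the
tokenizer is a simple regular expression that only recognises tags,
comments and PI's (no DOCTYPE or CDATA handling).

```python
import re

# Hypothetical list of block-enclosing element names (an assumption).
BLOCK_ELEMENTS = {"chapter", "note", "p"}
WS = " \t\r\n"

# Splits a document into text (even indices) and markup (odd indices).
TOKEN = re.compile(r"(<!--.*?-->|<\?.*?\?>|<[^>]*>)", re.S)


def is_transparent(tok: str) -> bool:
    """Comments, PI's and empty-element tags are transparent to the rule."""
    return tok.startswith(("<!--", "<?")) or tok.endswith("/>")


def tag_name(tok: str):
    m = re.match(r"</?([^\s/>]+)", tok)
    return m.group(1) if m else None


def strip_block_whitespace(xml: str) -> str:
    toks = TOKEN.split(xml)
    for i in range(1, len(toks), 2):
        tok = toks[i]
        if is_transparent(tok) or tag_name(tok) not in BLOCK_ELEMENTS:
            continue
        # Look forward from an end-tag, backward from a start-tag.
        step = 1 if tok.startswith("</") else -1
        j = i + step
        while 0 <= j < len(toks):
            if j % 2 == 0:  # text token: trim the side facing the tag
                toks[j] = toks[j].lstrip(WS) if step == 1 else toks[j].rstrip(WS)
                if toks[j]:
                    break  # real text reached, stop looking
            elif not is_transparent(toks[j]):
                break  # another ordinary tag ends the search
            j += step
    return "".join(toks)
```

With these assumptions, strip_block_whitespace(" <!--comment--><p>Some
text") drops the leading space because the comment is skipped over.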


----------
RULE 3. A sequence of one or more line-end codes immediately
following a start-tag, or immediately preceding an end-tag, is
discarded (except in preserved content).
---
Note:

 <note>[CR]
 <p>[CR]
 This is a para in a note.[CR]
 </p>

becomes:

 <note><p>This is a para in a note.</p>

Note: If PI's, comments or empty-elements remain in the data stream, 
they are deemed transparent to this process, so:

 <p><!-- a comment -->[CR]
 some text...

becomes:

 <p><!-- a comment -->some text...
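
Rule 3 could be sketched along similar lines. The Python below is an
illustration under stated assumptions: rule 1 has already run, so every
line-end is "\n"; preserved content is not handled at all; and the
function names are hypothetical.

```python
import re

# Splits a document into text (even indices) and markup (odd indices).
TOKEN = re.compile(r"(<!--.*?-->|<\?.*?\?>|<[^>]*>)", re.S)


def is_transparent(tok: str) -> bool:
    """Comments, PI's and empty-element tags are transparent to the rule."""
    return tok.startswith(("<!--", "<?")) or tok.endswith("/>")


def strip_edge_line_ends(xml: str) -> str:
    toks = TOKEN.split(xml)
    for i in range(1, len(toks), 2):
        if is_transparent(toks[i]):
            continue
        # Look backward from an end-tag, forward from a start-tag.
        step = -1 if toks[i].startswith("</") else 1
        j = i + step
        while 0 <= j < len(toks):
            if j % 2 == 0:  # text token: drop line-ends facing the tag
                if step == 1:
                    toks[j] = toks[j].lstrip("\n")
                else:
                    toks[j] = toks[j].rstrip("\n")
                if toks[j]:
                    break  # something other than line-ends remains
            elif not is_transparent(toks[j]):
                break  # another ordinary tag ends the search
            j += step
    return "".join(toks)
```

Only line-end codes are stripped here; spaces and tabs next to a tag are
left for the later rules.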


----------
RULE 4.  A remaining line-end code is converted into a space, except when it is 
preceded by a normal (hard) hyphen, or by a soft hyphen ('&#173;'), 
in which case it is removed (a soft hyphen is also then removed). 
---
Note:

 A[CR]
 line-[CR]
 end code sep&#173;[CR]
 arates lines.

becomes:

 A line-end code separates lines.

Note: PI's, comments and empty elements are treated as text, so:

 <p>Some[CR]
 <!-- comment -->[CR]
 text.

becomes:

 <p>Some[SP]<!-- comment -->[SP]text.

Note: if a space is required after the hyphen, it must be inserted before the 
line-end:

 4 -[SP][CR]
 3 = 1

becomes:

 4 -[SP][SP]3 = 1 
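
A minimal Python sketch of rule 4 follows, assuming rule 1 has already
reduced every line-end to "\n" and that the soft hyphen has been
resolved to the character U+00AD (SOFT HYPHEN). The function name
join_lines is hypothetical.

```python
SOFT_HYPHEN = "\u00ad"  # assumption: entity already resolved to U+00AD


def join_lines(text: str) -> str:
    """Sketch of rule 4: join lines, de-hyphenating where needed."""
    # Soft hyphen before a line-end: remove both.
    text = text.replace(SOFT_HYPHEN + "\n", "")
    # Hard hyphen before a line-end: keep the hyphen, drop the line-end.
    text = text.replace("-\n", "-")
    # Every remaining line-end becomes a space.
    return text.replace("\n", " ")
```

Because comments and PI's are treated as text, the function can be run
over the whole string without tokenizing it first.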


----------
RULE 5. Consecutive whitespace characters (including translated 
line-end codes) are reduced to a single space, except in preserved
mode.
---
Note:

 4 -[SP][SP]3 = 1 

becomes:

 4 -[SP]3 = 1 

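Note: if PI's, comments or empty elements are removed after rule 5 has
been applied, the spaces on either side become adjacent and rule 5 must
be applied again:

 <p>Some[SP]<!-- comment -->[SP]text.

becomes, once the comment is removed:

 <p>Some[SP][SP]text.

and then, once rule 5 is reapplied:

 <p>Some[SP]text.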

Note: Multiple spaces can be preserved using the non-break space
character ('&#160;').

 <p>Some&#160;&#160;&#160;spaces.
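
Rule 5 might be sketched in Python as below. This is an illustration
only: it ignores preserved mode, and collapse_spaces is a hypothetical
name. The character class is spelled out because Python's \s also
matches the non-break space (U+00A0), which this rule must leave alone.

```python
import re

# Only SP, TAB, CR and LF count as whitespace here; \s is avoided
# because it would also match the non-break space U+00A0.
WHITESPACE_RUN = re.compile(r"[ \t\r\n]+")


def collapse_spaces(text: str) -> str:
    """Sketch of rule 5: reduce each whitespace run to a single space."""
    return WHITESPACE_RUN.sub(" ", text)
```

Calling the function a second time after comments have been removed
covers the reapplication case described in the note above.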
------------------------------

-----------------------------------------------
Neil Bradley - Author of The Concise SGML Companion.
neil@b...
www.bradley.co.uk

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo@i... the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa@i...)

