
Retain or discard whitespace surrounding an element?

  • From: Roger L Costello <costello@mitre.org>
  • To: "xml-dev@lists.xml.org" <xml-dev@lists.xml.org>
  • Date: Mon, 27 Dec 2021 12:03:58 +0000

[Definition] Lexer: a tool that inputs a linear sequence of characters and assembles them into meaningful groups (tokens). A lexer is also called a scanner or a tokenizer.

Hi Folks,

In the following XML document, what is the content of the <Document> element? 

<Document>
    <Test>Hello, world</Test>
</Document>

Is it: 

(a) Just the <Test> element?
(b) The whitespace following <Document>, plus the <Test> element, plus the whitespace (newline) following </Test>?

Should a lexer discard or retain the whitespace surrounding the <Test> element?
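To make the question concrete, here is a minimal sketch of a lexer for the document above. It is only an illustration, not a conforming XML tokenizer: the TOKEN pattern, the lex function, and the token kind names are all made up for this example, and it ignores comments, CDATA sections, and attribute values that contain ">".

```python
import re

# Hypothetical token pattern: either a tag, or a run of characters up to the next tag.
TOKEN = re.compile(r"<[^>]+>|[^<]+")

def lex(xml_text):
    """Split a simple XML string into (kind, text) tokens, keeping whitespace."""
    tokens = []
    for match in TOKEN.finditer(xml_text):
        piece = match.group(0)
        if piece.startswith("<"):
            kind = "TAG"
        elif piece.strip() == "":
            kind = "WHITESPACE"   # whitespace-only run between two tags
        else:
            kind = "TEXT"
        tokens.append((kind, piece))
    return tokens

DOC = "<Document>\n    <Test>Hello, world</Test>\n</Document>"
tokens = lex(DOC)
for kind, text in tokens:
    print(kind, repr(text))
```

The two WHITESPACE tokens, one before <Test> and one after </Test>, are exactly the tokens in question: the lexer can emit them or drop them, and nothing in the character stream says which is right.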

The answer is this: The content of the <Document> element could be either (a) or (b). A lexer may or may not need to retain the whitespace surrounding the <Test> element. From the document alone, it is ambiguous.

Yikes!

If the XML document must conform to this XML Schema:

<xs:element name="Document">
    <xs:complexType>
        <xs:sequence>
            <xs:element name="Test" type="xs:string" />
        </xs:sequence>
    </xs:complexType>
</xs:element>

then the answer is (a). A lexer may safely discard the whitespace surrounding the <Test> element. The whitespace is not significant. Presumably the whitespace was placed there to make it easier for humans to read the document.

If the XML document must conform to this XML Schema:

<xs:element name="Document">
    <xs:complexType mixed="true">  <!-- Notice mixed="true" -->
        <xs:sequence>
            <xs:element name="Test" type="xs:string" />
        </xs:sequence>
    </xs:complexType>
</xs:element>

then the answer is (b). A lexer may not discard the whitespace surrounding the <Test> element. The whitespace is significant. Presumably the whitespace has some special meaning to applications that process the XML document.
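The two schema cases can be sketched as a whitespace filter that takes the schema's answer as a parameter. This is a rough illustration using Python's xml.etree.ElementTree, not real schema validation: the strip_ignorable function and its mixed_names argument (the set of element names whose type is declared mixed="true") are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def strip_ignorable(elem, mixed_names):
    """Drop whitespace-only text between child elements, unless elem is mixed."""
    has_children = len(elem) > 0
    if has_children and elem.tag not in mixed_names:
        # Whitespace between the start tag and the first child.
        if elem.text is not None and elem.text.strip() == "":
            elem.text = None
        # Whitespace following each child's end tag.
        for child in elem:
            if child.tail is not None and child.tail.strip() == "":
                child.tail = None
    for child in elem:
        strip_ignorable(child, mixed_names)

DOC = "<Document>\n    <Test>Hello, world</Test>\n</Document>"

# First schema: <Document> has element-only content, so answer (a).
element_only = ET.fromstring(DOC)
strip_ignorable(element_only, mixed_names=set())

# Second schema: <Document> is mixed="true", so answer (b).
mixed = ET.fromstring(DOC)
strip_ignorable(mixed, mixed_names={"Document"})

print(ET.tostring(element_only, encoding="unicode"))
print(ET.tostring(mixed, encoding="unicode"))
```

The same input yields two different trees, and the only thing that changed is the mixed_names argument, which comes from outside the document.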

If the XML document is not associated with a schema (XSD, DTD, or RNG), then the answer is always (a) and the whitespace may be safely discarded.
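As a point of comparison, here is what one schema-unaware parser does with the document. The sketch below uses Python's xml.etree.ElementTree, which reads no schema; it keeps the surrounding whitespace and leaves the decision to discard it to the application. Other parsers and configurations may behave differently, and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

DOC = "<Document>\n    <Test>Hello, world</Test>\n</Document>"

root = ET.fromstring(DOC)
leading = root.text        # whitespace between <Document> and <Test>
trailing = root[0].tail    # whitespace between </Test> and </Document>

print(repr(leading), repr(trailing))
```

Here the whitespace is reported as ordinary character data, so an application that wants answer (a) has to drop it itself.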

So, sometimes the content of <Document> is one thing, sometimes it's another thing. This complicates lexers (and parsers) because they must have external, out-of-band knowledge about the document. Is that good language design?

/Roger

