
Re: whitespace in 1.1


From: "Amelia A.Lewis" <amyzing@t...>

> I asked about this, and was told that it's supposed to be normalized to
> LF before whitespace processing happens.  At which point I asked why CR
> was part of the S production, and was given this hideous hack, using
> parameter entities, that allows one to force an un-normalized CR into
> attribute content.  

XML 1.1 has this problem even more so because of the (important) restrictions
on direct representation of the C1 control characters (which let you know
in many common cases whether you are about to corrupt your nice databases).

In particular, the XML 1.n specs could stand to be clarified about whether the
productions refer to external entities, internal entities, or
the post-parse document.  In the case of CRs in data, CR is in the S production
because a CR could end up in the infoset, not because it can appear directly in
an external entity.

I think the specs should be recast in terms of a (notional) preprocessing filter 
on external entities that
 0) converts encodings
 1) barfs if a non-allowed character is present, such as a C1
 2) normalizes newlines
 3) normalizes data (SHOULD) 
and which then removes all these considerations from impinging on the
XML productions. 
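
The filter above could be sketched roughly as follows. This is only an
illustration, not anything from the specs: the function name
preprocess_entity and its encoding parameter are made up, the encoding is
assumed to be known already (no declaration sniffing), the rejected set is
the XML 1.1 RestrictedChar range plus NUL, and step 3 is assumed to mean
Unicode NFC.

```python
import re
import unicodedata

# Characters that may not appear directly: the XML 1.1 RestrictedChar range
# plus NUL. NEL (U+0085) is left out because 1.1 treats it as a line end.
_RESTRICTED = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x84\x86-\x9f]")

# XML 1.1 line ends, two-character sequences first so they match as a unit.
_LINE_ENDS = re.compile(r"\r\n|\r\x85|\r|\x85|\u2028")


def preprocess_entity(raw: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical pre-parse filter for one external entity."""
    # 0) convert encodings (assumption: caller already knows the encoding)
    text = raw.decode(encoding)

    # 1) barf if a non-allowed character is present, such as a C1
    bad = _RESTRICTED.search(text)
    if bad:
        raise ValueError(
            f"restricted character U+{ord(bad.group()):04X} "
            f"at offset {bad.start()}"
        )

    # 2) normalize newlines to LF
    text = _LINE_ENDS.sub("\n", text)

    # 3) normalize data (a SHOULD; NFC is an assumption here)
    return unicodedata.normalize("NFC", text)
```

For example, preprocess_entity(b"a\r\nb\rc") gives "a\nb\nc", and a direct
U+0080 in the input raises ValueError. After such a filter the productions
would only ever see LF line ends and allowed characters.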

> Which struck me as a completely bizarre and useless
> form of backward compatibility with SGML (the reason, insofar as I
> understand it, to preserve the hackishness of this particular hack), but
> so it goes.
 
No, I don't think this comes from SGML. SGML has a completely different 
approach to lines: it does not even have CR and LF, but instead brackets
every line inside Record Start (RS) and Record End (RE) characters or signals.
That has the dubious virtue of not really corresponding to any text format,
and the side effect of making it more challenging to implement SGML with
standard libraries that use the now-ubiquitous \n.    (RS/RE will make more
sense to regex people, who may be more used to thinking in terms of boundaries
between characters as well as characters themselves. XML didn't need it.)

> Seriously strange corners of XML.  CR cannot appear in content when the
> S production is applied, except if you pull some 'rageous nonsense to
> make it do so, at which point one really *wonders* why it ought to be
> considered a space at all.
 
Not strange at all. The range of characters allowed directly is different from the
range of characters allowed using references.  There are characters that are
nice to have that are not nice to use (C1 controls), and there are characters that 
are nice to use but not nice to have (CR in markup). 
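
The direct-versus-reference difference can be seen for CR with a small
test. This sketch uses Python's standard ElementTree, whose underlying
expat parser only handles XML 1.0, so it says nothing about the 1.1 C1
rules; the CR behaviour shown is assumed to be the same in both versions.

```python
import xml.etree.ElementTree as ET

# A literal CR typed directly into the document text (content and attribute).
direct = ET.fromstring('<a b="x\ry">x\ry</a>')

# The same character written as a numeric character reference.
referenced = ET.fromstring('<a b="x&#13;y">x&#13;y</a>')

# Show what the application actually receives in each case.
print(repr(direct.text), repr(direct.get("b")))
print(repr(referenced.text), repr(referenced.get("b")))
```

The direct CR never reaches the application: it comes back as LF in content
and as a space in the attribute value. The referenced CR survives as CR in
both places, which is how a CR can end up in the infoset at all.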

Cheers
Rick Jelliffe


