Re: whitespace in 1.1
From: "Amelia A.Lewis" <amyzing@t...>
> I asked about this, and was told that it's supposed to be normalized to
> LF before whitespace processing happens. At which point I asked why CR
> was part of the S production, and was given this hideous hack, using
> parameter entities, that allows one to force an un-normalized CR into
> attribute content.

XML 1.1 has this problem even more so because of the (important) restrictions on direct representation of the C1 control characters (which let you know in many common cases whether you are about to corrupt your nice databases).

In particular, the XML 1.n specs could stand to be clarified about whether the productions refer to external entities, internal entities, or the post-parse document. In the case of CRs in data, CR is in the productions because a CR could end up in the infoset, not because it can appear directly in an external entity.

I think the specs should be recast in terms of a (notional) preprocessing filter on external entities that

  0) converts encodings
  1) barfs if a non-allowed character is present, such as a C1
  2) normalizes newlines
  3) normalizes data (SHOULD)

and which then removes all these considerations from impinging on the XML productions.

> Which struck me as a completely bizarre and useless
> form of backward compatibility with SGML (the reason, insofar as I
> understand it, to preserve the hackishness of this particular hack), but
> so it goes.

No, I don't think this comes from SGML. SGML has a completely different approach to lines: it does not even have CR and LF, but instead brackets every line inside Record Start (RS) and Record End (RE) characters or signals. That has the dubious virtue of not really corresponding to any text format, and the side effect of making it more challenging to implement SGML with standard libraries that use the now-ubiquitous \n.

(RS/RE will make more sense to regex people, who may be more used to thinking in terms of boundaries between characters as well as the characters themselves. XML didn't need it.)

> Seriously strange corners of XML. CR cannot appear in content when the
> S production is applied, except if you pull some 'rageous nonsense to
> make it do so, at which point one really *wonders* why it ought to be
> considered a space at all.

Not strange at all. The range of characters allowed directly is different from the range of characters allowed using references. There are characters that are nice to have that are not nice to use (C1 controls), and there are characters that are nice to use but not nice to have (CR in markup).

Cheers
Rick Jelliffe
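
The notional preprocessing filter described in the message above could be sketched roughly as follows. This is only an illustration under stated assumptions, not anything the specs define: the function name `preprocess_entity` is hypothetical, the rejected set is taken to be the XML 1.1 RestrictedChar range plus NUL, U+FFFE and U+FFFF, the newline step follows the XML 1.1 end-of-line rules, and NFC is assumed as the meaning of "normalizes data".

```python
import re
import unicodedata

# Assumed reject set: XML 1.1 RestrictedChar (C0 controls other than tab,
# LF and CR; DEL; C1 controls other than NEL) plus NUL, U+FFFE and U+FFFF.
_FORBIDDEN = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x84\x86-\x9f\ufffe\uffff]")

# XML 1.1 line ends: CR LF, CR NEL, NEL, LS, or a lone CR.
_LINE_END = re.compile(r"\r\n|\r\x85|\x85|\u2028|\r")


def preprocess_entity(raw: bytes, encoding: str = "utf-8") -> str:
    """Hypothetical filter applied to an external entity before parsing."""
    # 0) convert the declared encoding to Unicode text
    text = raw.decode(encoding)

    # 1) fail if a character that may not appear directly is present
    bad = _FORBIDDEN.search(text)
    if bad:
        raise ValueError(
            f"disallowed character U+{ord(bad.group()):04X} at offset {bad.start()}"
        )

    # 2) normalize every line end to a single LF
    text = _LINE_END.sub("\n", text)

    # 3) normalize data (a SHOULD; NFC is an assumption here)
    return unicodedata.normalize("NFC", text)
```

With a filter like this in front of the parser, a literal CR never reaches the grammar, so a CR seen by the productions can only have come from a character reference such as `&#13;`, which is expanded after this stage.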