RE: CDATA by any other name... (was The raw and the cooked)

From: "Rick Jelliffe" <ricko@a...>
To: <xml-dev@i...>
Date: Sat, 31 Oct 1998 17:31:29 +1100


Henry Thompson wrote:

> The DOM made a serious mistake here in my opinion: it's
> stranded in no-person's-land between raw and cooked, without being
> either.  It's not cooked, because it gives you EntityReference and
> CDATA nodes.  It's not raw, because it DOESN'T give you character
> entity references.

CHARACTER REFERENCES
I think Henry means "numeric character reference", and this is the heart of
the matter. A numeric character reference is not an entity, any more than a
directly-entered character is. It is just an alternative encoding of the
character, and should be of no more interest to a general API than the
charset encoding of the document is. (I am putting words into his mouth: or
does Henry mean the [XML 4.6] predefined entities?)

Even if you make
	<!ENTITY example "&#123;">
The numeric character reference is not an entity: it is the value of an
entity with the name "example".
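
To illustrate the point (my sketch, not anything from the DOM spec): Python's
standard xml.etree.ElementTree parser is used here only as a convenient
"cooked" API, and the element name "a" is made up. However the character is
written, the application sees the same string.

```python
import xml.etree.ElementTree as ET

# Four spellings of the same content: a literal "{", a decimal and a
# hexadecimal character reference, and an entity whose value is "{".
direct = ET.fromstring("<a>{</a>").text
decimal = ET.fromstring("<a>&#123;</a>").text
hexadecimal = ET.fromstring("<a>&#x7B;</a>").text
via_entity = ET.fromstring(
    '<!DOCTYPE a [<!ENTITY example "&#123;">]><a>&example;</a>'
).text

# The parser hands back plain characters; how each one was spelled in
# the source is not visible through this API.
print(direct, decimal, hexadecimal, via_entity)
```

Only the last case involves an entity at all, and even there the character
reference is just part of the entity's declared value.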

MARKED SECTIONS
On the subject of marked sections, I personally think that (in SGML) marked
sections should do more than just alter delimiter recognition: I think they
delimit anonymous inline entities, and label the entity with text-type
information. Underlying this is the idea that marked sections actually mark
up notations: at ISO there has been discussion of whether to allow something
like (for example)
	<![JAVA[ java code here ]]>

This is not something that I would expect to make its way into XML (and I
think the ISO people are now more keen to help XML/WebSGML than on tidying
up SGML) but I think the idea that a marked section not only alters
delimiter recognition but also labels the data can be seen (in embryo or
residually) in the DOM's elevation of CDATASection to node-worthiness, which has
so perplexed Henry.
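
For anyone who has not seen this node-worthiness in practice, here is a small
sketch using Python's xml.dom.minidom, chosen only as one DOM implementation
that is easy to try; the element name "p" is made up. The CDATA section comes
back as its own node, distinct from the text on either side.

```python
from xml.dom import Node, minidom

doc = minidom.parseString("<p>a<![CDATA[ b ]]>c</p>")

# For each child of <p>, record whether the DOM reports it as a
# CDATA section node, together with the character data it carries.
kinds = [
    (child.nodeType == Node.CDATA_SECTION_NODE, child.data)
    for child in doc.documentElement.childNodes
]
print(kinds)
```

So the marked section survives as a label on the data, not merely as a change
in how delimiters were recognised while parsing.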

If you take the view that a CDATA section labels the data as character data
(i.e. not ignorable whitespace) then <![CDATA[ ]]> is clearly invalid in
Henry's example: because the " " is marked as data and data is not allowed.
But that is ephemera: what does the spec say?

I think the answer is clear from the spec:
[43]  content ::=  (element | CharData | Reference | CDSect | PI | Comment)*
so a CDSect is not CharData. Therefore a CDSect is only valid in mixed
content, even though it is well-formed to have it in element content.

I think this is doubly clear from the discussion of "white space" in [XML
2.10]: white space for xml:space considerations (in element content) is
space added for "greater readability". <![CDATA[ ]]> does not do this!! It
disrupts readability. So from the purpose of valid white space in element
content it is clear that <![CDATA[ ]]> is not legitimate. The text is just as
important as the productions.
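
A rough sketch of the check this reading implies, again with Python's
xml.dom.minidom. It is not a validator: the function name and the list of
element names are hypothetical, and the caller has to say which elements are
declared with element content, because nothing here reads the DTD.

```python
from xml.dom import Node, minidom

XML_S = " \t\r\n"  # the characters matched by the S production


def invalid_children(doc_text, element_content_names):
    """List (element name, problem) pairs under the reading above."""
    doc = minidom.parseString(doc_text)
    problems = []
    for name in element_content_names:
        for elem in doc.getElementsByTagName(name):
            for child in elem.childNodes:
                if child.nodeType == Node.CDATA_SECTION_NODE:
                    # A CDSect is not white space, whatever it contains.
                    problems.append((name, "CDATA section"))
                elif child.nodeType == Node.TEXT_NODE and child.data.strip(XML_S):
                    # Non-white-space character data in element content.
                    problems.append((name, "character data"))
    return problems


print(invalid_children("<list><![CDATA[ ]]><item>x</item></list>", ["list"]))
print(invalid_children("<list> <item>x</item> </list>", ["list"]))
```

The first document is well-formed but is flagged; the second, with an
ordinary space between the tags, is not.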

SPACES
Henry's problem brings up a further important consideration.  XML gives an
attribute "xml:space" by which an application can know whether white space
may be collapsed or not. Can <![CDATA[ ]]> be used to override
xml:space="default"?  The answer is NO, because

* an application is free to decide whether to collapse spaces inside CDATA
marked sections or not;

* in PCDATA, ISO 10646 provides a specific character to indicate
non-collapsible whitespace: IDEOGRAPHIC SPACE  &#x3000;

* outside mixed content <![CDATA[ ]]> is not valid for the reasons above.

XML, by adopting ISO 10646, takes the line that the only way to overcome the
problems that (ASCII) people have with spaces is to un-overload that damned
space character. The basic principle of markup is that if a user wants
something, they should unambiguously mark it up in their data: if they want
non-collapsible space, the correct answer is "Use &#x3000;" or "Use
xml:space='preserve'". (However, font issues are important here: IDEOGRAPHIC
SPACE may be twice as wide as " " spaces, so the xml:lang attribute may be
important.)
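
A minimal sketch of what I mean, in Python. The function name is made up, and
it assumes the application collapses only the four characters of the XML S
production (space, tab, CR, LF), so IDEOGRAPHIC SPACE passes through
untouched.

```python
import re

# Only the characters of the XML S production count as collapsible here.
XML_WHITESPACE = re.compile(r"[ \t\r\n]+")


def collapse_xml_space(text):
    """Collapse runs of XML white space to one space and trim the ends."""
    return XML_WHITESPACE.sub(" ", text).strip(" \t\r\n")


sample = "a  \n b\u3000\u3000c "
collapsed = collapse_xml_space(sample)
print(repr(collapsed))
```

Beware of general-purpose helpers: Python's own str.split() with no argument
treats U+3000 as white space, so it would not preserve the marked-up spacing.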

I urge developers to make sure that their products handle the 17 ISO 10646
spacing/hyphenation characters properly. There have been previous postings on
this group (what happened to that XML jewels website: it was there too?), or
get the Unicode book, or get ISO 10646, or (best option:-) get my book (XML
& SGML Cookbook, p 3-90).

Rick Jelliffe


