Re: A heavier-weight proposal for character entity definition

To: James Clark <jjc@j...>
Subject: Re: A heavier-weight proposal for character entity definition
From: ht@c... (Henry S. Thompson)
Date: 06 Feb 2002 11:21:32 +0000
Cc: xml-dev@l...
In-reply-to: <269036974.1012997921@[192.168.0.198]>
References: <269036974.1012997921@[192.168.0.198]>
User-agent: Gnus/5.0808 (Gnus v5.8.8) XEmacs/21.4 (Civil Service)

James Clark <jjc@j...> writes:

> Before getting into the details of a schema for an XML syntax for
> declaring character entities, I think we should step back and ask what the
> real requirements are.

For sure.  I think there are a number of obvious use cases, from which
we might derive requirements:

1) Hand-authoring an XML document that needs to include a few
well-known, useful non-ASCII characters, e.g. &eacute;, &bullet;,
&copyright;

2) Post-processing arbitrary XML to make it encoding='ISO-646' or
'ISO-8859-1';

3) Authoring MathML, with or without helpful UI;

4) Marshalling implementation data, e.g. from a database, whose string
fields may have arbitrary Unicode, where e.g. ISO-8859-1 is the
required encoding (similar to (2)).
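
Cases (2) and (4) come down to the same serialisation step: any
character the target encoding cannot represent has to be written as a
reference.  A minimal sketch in Python, using numeric character
references because no entity-name table is assumed here.  The function
name to_latin1_xml is made up for illustration, and escaping of markup
characters such as & and < is deliberately left out.

```python
def to_latin1_xml(text: str) -> bytes:
    """Encode text as ISO-8859-1, writing unencodable characters as &#N; references."""
    # "xmlcharrefreplace" is a standard Python codec error handler that
    # substitutes a decimal character reference for each unencodable character.
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


# e-acute fits in ISO-8859-1; BULLET and THAI CHARACTER KO KAI do not.
sample = "caf\u00e9 \u2022 \u0e01"
encoded = to_latin1_xml(sample)
```

With named entities available, the serialiser would consult a
name table before falling back to the numeric form.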

<snip/>

> - if you have user-defined character entity names, then users will
> start demanding the ability to preserve those names, which means that
> the DOM/SAX/Infoset will need to record which entity name if any was
> used for a character

As now, that demand can be responded to sensibly by saying editors are
not vanilla applications.

> So I'm wondering whether a more constrained approach to character
> entities would work.  Suppose for example there is a standard
> W3C-defined builtin entity set; this would have a version number and
> would add new characters from time to time (but never change existing
> entity names).  There would be a standard mapping from a version
> number to a URI where a XML specification of the entity set would be
> available.  However, parsers wouldn't have to fetch and parse this,
> they could just recognize the version number and refer to an
> appropriate compiled-in table.  The XML declaration would declare the
> version number of the builtin entity set that was being used; if the
> XML declaration didn't specify a version number, only the 5 XML 1.0
> builtin entities could be used. Just as now, the SAX/DOM/infoset
> wouldn't record whether a particular character was entered literally
> or using a builtin entity reference. Instead programs that serialize
> XML (like XSLT) would have options saying when to use builtin entity
> references to represent characters.

I think this works for use-cases (2) and (4) above, but at a pretty
high cost.  Conformant parsers will have no choice but to read or
build in the complete set (40K names or so, at the moment, is it?) in
order to handle any entity references at all.  This seems too high a
cost for cases (1) and (3) above.
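
To make the cost concrete, here is a rough sketch of the compiled-in,
version-keyed lookup the quoted proposal implies.  Everything about the
versioning is an assumption: the version label "1" is invented, and
Python's html.entities table (the HTML 4 names) stands in for whatever
the first published set would contain.

```python
from html.entities import name2codepoint

# The five entities predefined by XML 1.0.
XML10 = {"lt": 0x3C, "gt": 0x3E, "amp": 0x26, "apos": 0x27, "quot": 0x22}

# Hypothetical compiled-in tables, keyed by the version named in the XML
# declaration.  None means "no version declared".
BUILTIN_SETS = {
    None: XML10,
    "1": {**XML10, **name2codepoint},
}


def resolve(name, version=None):
    """Return the character for &name; under the given entity-set version."""
    if version not in BUILTIN_SETS:
        raise ValueError(f"unknown entity set version: {version!r}")
    table = BUILTIN_SETS[version]
    if name not in table:
        raise KeyError(f"&{name}; is not defined in version {version!r}")
    return chr(table[name])
```

Every conformant parser has to carry every table it claims to support,
however few names a given document actually uses.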

> For the first version of the standard builtin entity set we could start with
> 
> - HTML entities
> - MathML entities
> - maybe a set of entity names algorithmically generated from the
> standard Unicode names in Unicode 3.2; 0xe01, which has a Unicode name
> of "THAI CHARACTER KO KAI", might be entered as &thai_character_ko_kai;.

I'm also concerned that centralising maintenance and updating of this
mechanism is a recipe for frustration and interop nightmares.
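
For what it's worth, the algorithmic naming in the last item of the
quoted list is cheap wherever a Unicode name table is already to hand.
A sketch in Python; the mapping rule (lower-case, spaces to
underscores) is inferred from the single example given, the function
names are invented, and the names come from whatever Unicode version
the Python runtime ships, not specifically 3.2.

```python
import unicodedata


def entity_name_for(ch: str) -> str:
    """Derive an entity name from a character's Unicode name."""
    # unicodedata.name raises ValueError for characters that have no name.
    return unicodedata.name(ch).lower().replace(" ", "_")


def char_for_entity(entity_name: str) -> str:
    """Invert entity_name_for by looking the Unicode name back up."""
    # unicodedata.lookup raises KeyError if no character has that name.
    return unicodedata.lookup(entity_name.replace("_", " ").upper())


ko_kai = entity_name_for("\u0e01")
```

Here ko_kai comes out as "thai_character_ko_kai", matching the example
in the quoted text.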

What about a middle way, combining the two proposals:

1) Some document type for entity definitions is adopted by W3C;
2) XML n.m is appropriately modified to provide for exploitation of
   such definitions;
3) W3C publishes definitions of at least the three sets you name above
   at stable URIs with a public versioning policy;
4) Then full-featured parsers that want to can build in tables for the
   published URIs, but light-weight parsers that don't want to can
   operate a "read only what's required" policy, thereby handling the
   simple cases simply.
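
Point (4) might look something like the following.  This is only a
sketch under stated assumptions: the class, the URI and the fetch
callback are all hypothetical, and "read only what's required" is
interpreted here as loading a definition set the first time a document
refers to it.

```python
from typing import Callable, Dict, List, Optional

EntityTable = Dict[str, str]  # entity name -> replacement character


class EntityResolver:
    """Resolve entities from built-in tables, fetching unknown sets on demand."""

    def __init__(self,
                 builtin: Optional[Dict[str, EntityTable]] = None,
                 fetch: Optional[Callable[[str], EntityTable]] = None) -> None:
        self._tables: Dict[str, EntityTable] = dict(builtin or {})
        self._fetch = fetch
        self.fetched: List[str] = []  # URIs that had to be read at run time

    def resolve(self, set_uri: str, name: str) -> str:
        table = self._tables.get(set_uri)
        if table is None:
            if self._fetch is None:
                raise LookupError(f"no table for {set_uri} and no way to fetch it")
            table = self._fetch(set_uri)   # read the published definitions
            self._tables[set_uri] = table  # cache so each set is read once
            self.fetched.append(set_uri)
        return table[name]


# Hypothetical stable URI for a published set, plus a stand-in for the
# network read and parse of its definition document.
HTML_SET = "http://example.org/entities/html/1"


def fake_fetch(uri: str) -> EntityTable:
    return {"eacute": "\u00e9", "nbsp": "\u00a0"}


full = EntityResolver(builtin={HTML_SET: fake_fetch(HTML_SET)})
light = EntityResolver(fetch=fake_fetch)
```

The full-featured parser never touches the network for a set it has
built in; the light-weight one pays for a set only when a document
actually uses it.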

ht
-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
          W3C Fellow 1999--2001, part-time member of W3C Team
     2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
	    Fax: (44) 131 650-4587, e-mail: ht@c...
		     URL: http://www.ltg.ed.ac.uk/~ht/
