Re: A heavier-weight proposal for character entity definition
James Clark <jjc@j...> writes:

> Before getting into the details of a schema for an XML syntax for
> declaring character entities, I think we should step back and ask what
> the real requirements are.

For sure. I think there are a number of obvious use cases, from which
we might derive requirements:

 1) Hand-authoring an XML document that needs to include a few
    well-known useful non-ASCII characters, e.g. é, •, ©;
 2) Post-processing arbitrary XML to make it encoding='ISO-646' or
    'ISO-8859-1';
 3) Authoring MathML, with or without helpful UI;
 4) Marshalling implementation data, e.g. from a database, whose string
    fields may have arbitrary Unicode, where e.g. ISO-8859-1 is the
    required encoding (similar to (2)).

<snip/>

> - if you have user-defined character entity names, then users will
> start demanding the ability to preserve those names, which means that
> the DOM/SAX/Infoset will need to record which entity name if any was
> used for a character

As now, that demand can be responded to sensibly by saying editors are
not vanilla applications.

> So I'm wondering whether a more constrained approach to character
> entities would work. Suppose for example there is a standard
> W3C-defined builtin entity set; this would have a version number and
> would add new characters from time to time (but never change existing
> entity names). There would be a standard mapping from a version
> number to a URI where an XML specification of the entity set would be
> available. However, parsers wouldn't have to fetch and parse this,
> they could just recognize the version number and refer to an
> appropriate compiled-in table. The XML declaration would declare the
> version number of the builtin entity set that was being used; if the
> XML declaration didn't specify a version number, only the 5 XML 1.0
> builtin entities could be used. Just as now, the SAX/DOM/infoset
> wouldn't record whether a particular character was entered literally
> or using a builtin entity reference. Instead programs that serialize
> XML (like XSLT) would have options saying when to use builtin entity
> references to represent characters.

I think this works for use cases (2) and (4) above, but at a pretty
high cost. Conformant parsers will have no choice but to read or build
in the complete set (40K names or so, at the moment, is it?) in order
to handle any entity references at all. This seems too high a cost for
cases (1) and (3) above.

> For the first version of the standard builtin entity set we could start with
>
> - HTML entities
> - MathML entities
> - maybe a set of entity names algorithmically generated from the
> standard Unicode names in Unicode 3.2; 0xe01; which has a Unicode name
> of "THAI CHARACTER KO KAI" might be entered as &thai_character_ko_kai;.

I'm also concerned that centralising maintenance and updating of this
mechanism is a recipe for frustration and interop nightmares.

What about a middle way, combining the two proposals:

 1) Some document type for entity definitions is adopted by W3C;
 2) XML n.m is appropriately modified to provide for exploitation of
    such definitions;
 3) W3C publishes definitions of at least the three sets you name
    above at stable URIs with a public versioning policy;
 4) Then full-featured parsers that want to can build in tables for
    the published URIs, but light-weight parsers that don't want to
    can operate a "read only what's required" policy, thereby handling
    the simple cases simply.

ht
-- 
  Henry S. Thompson, HCRC Language Technology Group, University of Edinburgh
  W3C Fellow 1999--2001, part-time member of W3C Team
  2 Buccleuch Place, Edinburgh EH8 9LW, SCOTLAND -- (44) 131 650-4440
  Fax: (44) 131 650-4587, e-mail: ht@c...
  URL: http://www.ltg.ed.ac.uk/~ht/
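
P.S. To make the algorithmic naming idea and use cases (2) and (4)
concrete, here is a minimal Python sketch. It is an illustration only,
not part of either proposal: the function names entity_name and
serialize are made up for this example, the lowercase-and-underscore
rule is inferred from the single &thai_character_ko_kai; example, and
Python's unicodedata module reflects whatever Unicode version ships
with the interpreter rather than Unicode 3.2 specifically.

```python
import unicodedata


def entity_name(ch: str) -> str:
    """Derive a hypothetical entity name from a character's Unicode name.

    Assumed rule: lowercase the standard name and replace spaces with
    underscores. Raises ValueError for code points that have no name.
    """
    return unicodedata.name(ch).lower().replace(" ", "_")


def serialize(text: str, encoding: str = "iso-8859-1") -> str:
    """Keep characters the target encoding can represent as literals and
    replace the rest with generated entity references.

    Markup characters such as '&' and '<' are not escaped here; a real
    serializer would handle those separately.
    """
    out = []
    for ch in text:
        try:
            ch.encode(encoding)
            out.append(ch)
        except UnicodeEncodeError:
            out.append("&" + entity_name(ch) + ";")
    return "".join(out)


if __name__ == "__main__":
    print(entity_name("\u0e01"))
    print(serialize("caf\u00e9 \u0e01"))
    print(serialize("caf\u00e9", encoding="ascii"))
```

Note that a parser receiving such output would still need the name
table (or the same algorithm run in reverse) to resolve the
references, which is exactly the cost question raised above.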