Re: Character Entities: An XML Core WG View
From: "John Cowan" <jcowan@r...>

> But character entity names are all drawn from the same Unicode character space.
> I have yet to see a principled defense for supporting inconsistency here.

I am not sure I get what you are saying, but the ISO, AMS, HTML and seemingly the MathML entities were developed separately and sometimes prior to Unicode. The Unicode Consortium has used some rules for unifying characters, but sometimes the entities serve the purpose of getting specialized variants: for example, the ISOGrk4 entities have no equivalents in Unicode (still, I believe). They are supposed to be bold versions of the Greek characters for maths use. So a real mapping of these characters should include some PI or html:font (no flames) element to select the style (or something to map them to the PUA).

This is a class of publishing characters that the Unicode Consortium says are variants, to be handled by a higher layer. Entities allow these characters to be named, and specific-purpose mappings to be made. And that is where it falls apart: you cannot provide these kinds of mappings without resorting to elements or PIs. PIs being out of fashion, it means that these publishing characters can only be used in concert with a specific element vocabulary: MathML is probably the exemplar here.

Another example of this problem is where Unicode has unified a character, but there have been regional variants: then it becomes essential to have an indication of the locale (xml:lang).
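To make the idea of a named character with specific-purpose mappings concrete, here is a minimal sketch in Python. The target names ("html", "pua", "plain"), the `<b>` element and the private-use code point U+E000 are all illustrative assumptions, not part of any standard mapping; only the entity name b.alpha comes from ISOgrk4.

```python
# Hypothetical per-target mappings for a single ISOgrk4 entity name.
# Each target system supplies the replacement it can actually render.
MAPPINGS = {
    # Style selected by an element (assumed vocabulary: a bare <b>).
    "html": {"b.alpha": "<b>\u03b1</b>"},
    # A private-use code point agreed between sender and receiver (assumed).
    "pua": {"b.alpha": "\ue000"},
    # Lossy fallback: the plain Greek letter with no bold styling.
    "plain": {"b.alpha": "\u03b1"},
}


def expand(name: str, target: str) -> str:
    """Return the replacement text for an entity name on a given target."""
    return MAPPINGS[target][name]
```

Note that two of the three mappings are not plain Unicode strings in any portable sense: one needs markup and the other needs a private agreement about the PUA, which is exactly the higher-level protocol described above.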
But every avenue to cope with this is being blocked off:

* The W3C I18n WG deprecates PUA characters, or ways to make use of them in public.

* Various W3C groups deprecate the use of PIs, for example to allow CSS properties embedded in text for font selection.

* W3C XML Schemas splits schemas out to be post-parse and has not provided, say, an annotation for allowing entity definitions to be bundled with schemas.

* The Unicode Consortium has helped/hindered things by providing a variant selector character, but this is as yet disconnected from any standard for making use of it, and such a standard (in the markup world) would ultimately map to a PUA, an element, or a PI anyway.

* Almost no non-core W3C XML specification treats attribute inheritance seriously: there is no way to say for an element type "my ancestor's attribute xxx is in scope for me; if I am cut out from my context I need to take that attribute with me". (A defaulting type similar to SGML's #CURRENT but only working on ancestors would be a big step forward.)

* The infoset simplification that treats entities as macros, to be forgotten after parsing. Imagine, for example, how much simpler life would be if there were an XSLT mode that silently shoved through undeclared entity references as part of text. This is certainly no criticism of the XML Infoset spec.

(I believe this shows a systematic problem in Unicode. Of course, we have to play with the cards we have been dealt, but another approach would have been that used by the CCCII format: every character is made from a base and a variant selector, potentially allowing systems to fall back to a close glyph if the desired glyph is not available, and avoiding the need for higher-level protocols to supply variant information. But recognizing that an approach has certain problems is not to say Unicode is not correct for XML. XML has to work with fairly unified characters; publishers and technical people have to work with very specific characters; and we need the glue.)
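The pass-through mode wished for in the list above (undeclared entity references shoved through as part of text) might behave roughly like the following sketch. It works on raw text with a regular expression rather than inside a real XML parser or XSLT processor, and the function name `expand_known` and the sample entity table are assumptions made for illustration.

```python
import re

# Matches a general entity reference such as &hellip; or &b.alpha;
# (numeric character references like &#x20; are deliberately not matched).
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.\-]*);")


def expand_known(text: str, known: dict[str, str]) -> str:
    """Expand references found in `known`; leave all others untouched."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        # Unknown names are returned verbatim, so they survive as
        # entity references for a later stage to resolve.
        return known.get(name, match.group(0))

    return ENTITY_REF.sub(replace, text)
```

For example, `expand_known("wait&hellip; &fjlig;", {"hellip": "\u2026"})` expands the first reference and leaves `&fjlig;` in place for a downstream system that knows how to render it.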
So it is quite possible that two entities could be mapped to the same Unicode string (e.g. "-"), or that an entity could be mapped by different people to different strings (e.g. should &heart; be the filled or unfilled character?), or that there is no corresponding Unicode string for an entity (e.g. the fj ligature, which has the trivial fallback "fj"), or that a mapping for an entity requires some kind of variant selector or markup, and therefore a higher-level protocol.

The standard entity sets were designed to allow workarounds to system-specific issues. "The system" in XML's case is Unicode. The XML-as-"atomic strings in trees" view, which is particularly associated with database people, assumes entity = Unicode string, and so keeps solutions from being developed.

The thing is that this can almost all be coped with by

* catalogs, which let a terminal system provide the mappings it understands for the standard entity sets, including PIs

* transformation systems that allow standard entity references to emerge after the transformation as entity references again, transparently to the user (i.e. as part of data)

These were fairly commonplace things for SGML systems, and XML needs to catch up in order to support MathML and publishing. We hear very often that XML needs adjustment to cope with the needs of data exchange, but it also needs adjustment to cope with quality document production issues. (Actually, not so much XML itself as infoset-manipulating systems such as XSLT.)

Cheers
Rick Jelliffe