Re: Blueberry/Unicode/XML
Tim Bray scripsit:

> The problem isn't
> 3.1, it's that Unicode is an unfinished standard that
> continues to grow actively, whereas it would be nice if
> we could declare XML syntax finished and go back to our
> plows.

It surely would, but that isn't the Real World. You will note that one of the Requirements is for the Core WG to consider the future evolution of Unicode.

> XML 1.0 took a design decision in favor of enumeration of
> name characters, simply because the alternative - outsourcing
> the problem to the Unicode/ISO10646 process - had two
> problems:
>
> (a) We didn't know them well enough to trust them, and
> (b) writing a satisfying set of rules for XML name chars
> based solely on Unicode metadata is pretty hard.
>
> The force of argument (b) is unabated.

Actually, it turns out to be pretty easy. The following isn't official, but it's what I have in mind (and so far nobody has really poked holes in it):

1. Basic name-start characters are Unicode classes Ll (lower case), Lu (upper case), Lm (modifier letters), Lo (other letters, including ideographs), and Nl (a handful of oddballs).

2. Basic name characters are the above plus Mn (non-spacing combining marks), Mc (Indic vowels and the like), Nd (digits), and Pc (connective punctuation like KATAKANA MIDDLE DOT).

These two rules constitute the Unicode 3.1 rules for "what is an identifier" (except that Unicode allows invisible formatting characters that are also invisible to name matching, a concept that doesn't fit XML), so already XML and Unicode are in good alignment.

3. Exclude all compatibility characters, and all characters in the Compatibility Zone (which are mostly, but not entirely, compatibility characters) except the 12 IBM ideographs that aren't unifiable with anything else. Unicode rules would leave these in, but only if loose matching is allowed. With XML's strict name matching, they would just cause hopeless confusion.

4. Add the XML-specific name-start characters colon and underscore, and the XML-specific name characters hyphen, dot, and middle dot.

5. Finally, there are 21 characters (18 are name-start) that XML 1.0 included that aren't covered by these rules for a variety of reasons, so just include them as a fixed list of exceptions. 21 out of 90,000+ isn't bad.

> And what happens if ISO and
> Unicode stop getting along one of these centuries, whose
> side is XML on?

Sooner the moon will fall from heaven!

> 1. Leave it the way it is.
> 2. Do Blueberry and then repeat the process for Unicode 3.2
> and 4.0 and so on every couple of years forever.

One thing to say about this is that the list of characters to be added is shrinking all the time. Unicode 3.2 will add only 139 name characters, of which less than 20 are actually used by modern scripts. If we add another rule

6. Omit all characters from archaic scripts, as they have no native users any more.

then the next change will be scarcely a ripple, affecting IIRC only Ainu (a minority language of Japan that uses additional katakana).

> I think (3.) will prove to be really hard to do well - and
> then the Unicode metadata fields might get changed and screw
> it all up.

Unicode has come a long way toward stabilizing the relevant categories.

> But I really can't see how anyone can get behind any of
> these positions and feel entirely comfortable with where
> they find themselves standing. I sure don't. -Tim

a) Slippery slopes can get to be a habit, I guess.
b) It's a dirty job, but someone's got to do it.

--
John Cowan cowan@c...
One art/there is/no less/no more/All things/to do/with sparks/galore
        --Douglas Hofstadter
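
Rules 1 through 5 in the message above can be sketched in Python with the standard `unicodedata` module. This is only an illustration of the proposal as worded there, not an official or complete implementation. The function and constant names are made up for the example. The Compatibility Zone is assumed to be U+F900..U+FFFF, and the 12 ideographs are assumed to be the ones the Unicode Standard singles out in the CJK Compatibility Ideographs block. The message does not list the 21 exceptions of rule 5, so they are left as empty placeholders.

```python
import unicodedata

# Rules 1 and 2: Unicode general categories named in the message.
NAME_START_CATEGORIES = {"Ll", "Lu", "Lm", "Lo", "Nl"}
NAME_EXTRA_CATEGORIES = {"Mn", "Mc", "Nd", "Pc"}

# Rule 3 (assumption): treat U+F900..U+FFFF as the Compatibility Zone.
COMPAT_ZONE = range(0xF900, 0x10000)

# Rule 3 (assumption): the 12 ideographs that are not unifiable with others.
IBM_IDEOGRAPHS = {chr(cp) for cp in (
    0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
    0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29,
)}

# Rule 4: XML-specific additions.
XML_NAME_START_EXTRAS = {":", "_"}
XML_NAME_EXTRAS = {"-", ".", "\u00B7"}  # hyphen, dot, middle dot

# Rule 5: placeholders only; the message does not enumerate the 21 characters.
LEGACY_NAME_START = frozenset()
LEGACY_NAME_ONLY = frozenset()


def _excluded(ch):
    """Rule 3: compatibility characters and the Compatibility Zone."""
    if ch in IBM_IDEOGRAPHS:
        return False
    if ord(ch) in COMPAT_ZONE:
        return True
    # A decomposition tagged like "<compat>" or "<wide>" marks a
    # compatibility character; canonical decompositions carry no tag.
    return unicodedata.decomposition(ch).startswith("<")


def is_name_start(ch):
    if ch in XML_NAME_START_EXTRAS or ch in LEGACY_NAME_START:
        return True
    return unicodedata.category(ch) in NAME_START_CATEGORIES and not _excluded(ch)


def is_name_char(ch):
    if is_name_start(ch) or ch in XML_NAME_EXTRAS or ch in LEGACY_NAME_ONLY:
        return True
    return unicodedata.category(ch) in NAME_EXTRA_CATEGORIES and not _excluded(ch)


def is_name(s):
    """True if s is non-empty, starts with a name-start char, and continues with name chars."""
    return bool(s) and is_name_start(s[0]) and all(is_name_char(c) for c in s[1:])
```

Note that `unicodedata` reflects whatever Unicode version the Python build ships with, not Unicode 3.1, so individual characters may be classified differently from what the message assumes.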