Re: First draft of proposed XML TC for Unicode 3.0 (unofficial)
Even though the XML 1.0 Rec does specify the use of Unicode 2.0, I heartily agree that we should be moving to support Unicode 3.0, rather than remaining with the older version of Unicode.

John Cowan wrote:
> :
>In addition, the following characters no longer pass the tests given
>in Appendix B for valid name or name-start characters, but should
>remain legal in XML names for backward compatibility, and therefore
>should be explicitly enumerated in the corrigendum:
>
>03D0;GREEK BETA SYMBOL
>03D1;GREEK THETA SYMBOL
>03D2;GREEK UPSILON WITH HOOK SYMBOL
>03D5;GREEK PHI SYMBOL
>03D6;GREEK PI SYMBOL
>03F0;GREEK KAPPA SYMBOL
>03F1;GREEK RHO SYMBOL
>03F2;GREEK LUNATE SIGMA SYMBOL
> :

I disagree that these characters should remain legal in XML names.

1) Were the above changes based upon the recognition that Unicode 2.1 erroneously classified these symbols as letters?
2) If these characters continue to be considered legal name-start characters, won't productions [4], [5], [84], and [85] now contradict the text (following the legal characters table in Appendix B) regarding legal name and name-start characters?
3) If the answer to question #2 is yes, won't the text then need to be modified to read: "Name start characters must have one of the categories Ll, Lu, Lo, Lt, Nl [, except for these "special" ones..]"?

This change of classification may well break some existing XML parsers and/or apps, whether or not these characters remain legal in XML names. Consider that there are two ways that an XML parser might have implemented production [85]:

1) use a simple table of character ranges, copied directly from the XML 1.0 Rec; or
2) a truly Unicode-aware parser might have instead used a table of categories derived from the Unicode data file, and implemented the "Ll, Lu, Lo, Lt, Nl" rule, based upon that table.
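The two approaches can be sketched roughly as follows. This is only an illustration, not taken from any actual parser: the choice of Python, the function names, and the constant names are all assumptions. The static table holds just the BaseChar ranges quoted in the excerpt at the end of this message (the full production [85] is longer), and Python's `unicodedata` module reports categories from whichever Unicode version that Python build ships with, which will be neither 2.0 nor 3.0.

```python
import unicodedata

# Approach 1: a static table of ranges. Only the BaseChar ranges quoted in
# the excerpt from production [85] are listed; the real production is longer.
BASECHAR_EXCERPT = [
    (0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x00D6), (0x00D8, 0x00F6),
    (0x00F8, 0x00FF), (0x0100, 0x0131), (0x0134, 0x013E), (0x0141, 0x0148),
    (0x014A, 0x017E), (0x0180, 0x01C3), (0x01CD, 0x01F0), (0x01F4, 0x01F5),
    (0x01FA, 0x0217), (0x0250, 0x02A8), (0x02BB, 0x02C1), (0x0386, 0x0386),
    (0x0388, 0x038A), (0x038C, 0x038C), (0x038E, 0x03A1), (0x03A3, 0x03CE),
    (0x03D0, 0x03D6), (0x03DA, 0x03DA), (0x03DC, 0x03DC), (0x03DE, 0x03DE),
    (0x03E0, 0x03E0), (0x03E2, 0x03F3),
]

# Approach 2: the category rule quoted from Appendix B.
NAME_START_CATEGORIES = {"Ll", "Lu", "Lo", "Lt", "Nl"}

# The eight Greek symbols enumerated in the quoted proposal.
DISPUTED = [0x03D0, 0x03D1, 0x03D2, 0x03D5, 0x03D6, 0x03F0, 0x03F1, 0x03F2]


def in_static_table(ch: str) -> bool:
    """Static approach: the answer changes only when the table is edited."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in BASECHAR_EXCERPT)


def in_name_start_category(ch: str) -> bool:
    """Derived approach: the answer follows the Unicode data in use."""
    return unicodedata.category(ch) in NAME_START_CATEGORIES


if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    for cp in DISPUTED:
        ch = chr(cp)
        print(f"U+{cp:04X} static={in_static_table(ch)} "
              f"category={unicodedata.category(ch)} "
              f"by_category={in_name_start_category(ch)}")
```

Any character for which the two functions disagree is one where a table-driven parser and a category-driven parser would accept different names, which is the breakage described above.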
If I were the developer of serious Unicode-aware software, I'd probably have chosen the second approach, since it is _extensible_ (my parser changes in sync with the Unicode changes); whereas the first is based upon a _static_ table (that changes only when the W3C decrees, if ever). I do suppose we could argue that Unicode was expected to change more often than XML, and that the first approach would therefore require less frequent parser software updates. Either way -- if Unicode changes, then those things built upon it (e.g. Java, XML) also have to change.

I argue that keeping simple "legal name character" rules is more important than the rather slight possibility of breaking some existing XML documents. At the risk of being labeled Anglo-centric, how many docs are likely to have used these Greek, Arabic, Thai, Lao, or Tibetan symbols in XML names? (I do suppose that James Clark's choice of residence might have skewed the frequency of Thai in XML, though ;-).

IMHO, "backward compatibility" does not justify a special rule for the treatment of these characters! If symbols, in general, are not legal name characters, then these symbols should not receive special treatment, just because they were erroneously classified in an earlier Unicode. If these characters indeed aren't letters, then they should be removed from production [85]. This way the corrigendum need only correct [85], a relatively simple change.

Also, won't the entry in "A.1 Normative References" also need to be changed to reference the Unicode 3.0 spec, rather than the older version?

I, too, have no insight into the W3C process in this matter. Presumably there will one day be an XML 1.1, if only after the XML 1.0 errata reach a critical mass, and/or the Namespaces issue is resolved...

Regards,
Nik O,
Teton Data Systems,
Jackson, Wyo.

======= Begin excerpt (from XML 1.0 Rec) =======
[4]  NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
[5]  Name     ::= (Letter | '_' | ':') (NameChar)*
 :
[84] Letter   ::= BaseChar | Ideographic
[85] BaseChar ::= [#x0041-#x005A] | [#x0061-#x007A] | [#x00C0-#x00D6]
                | [#x00D8-#x00F6] | [#x00F8-#x00FF] | [#x0100-#x0131]
                | [#x0134-#x013E] | [#x0141-#x0148] | [#x014A-#x017E]
                | [#x0180-#x01C3] | [#x01CD-#x01F0] | [#x01F4-#x01F5]
                | [#x01FA-#x0217] | [#x0250-#x02A8] | [#x02BB-#x02C1]
                | #x0386 | [#x0388-#x038A] | #x038C | [#x038E-#x03A1]
                | [#x03A3-#x03CE] | [#x03D0-#x03D6] | #x03DA | #x03DC
                | #x03DE | #x03E0 | [#x03E2-#x03F3]
 :
The character classes defined here can be derived from the Unicode character database as follows:
* Name start characters must have one of the categories Ll, Lu, Lo, Lt, Nl.
* Name characters other than Name-start characters must have one of the categories Mc, Me, Mn, Lm, or Nd.
 :
======= End excerpt =======

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)