XML-DEV Mailing List Archive

Re: XML Blueberry (long response on CJK background)
From: "Murata Makoto" <mura034@a...> > Rick Jelliffe wrote: > > >Of these, most are CJK Unified Ideographs Extension B. > >These are characters which must be considered bad practise > >for use in markup, perhaps with some exceptions. They are mostly > > characters which readers may easily find confusing, > >being archaic, regional, variant, uncommon or non-interoperable. > > This is completely different from what I have heard from CJK experts. > Do you have any supporting evidence? 1) To answer a question with a question first, have these experts also given any indication of how many of the approx 71,000 Han ideographs in Unicode 3.1 are in *current* common use (not being personal names or place names)? If we allow 12,000 characters in common use (i.e. where a substantial proportion of the population could read and write them) in Taiwan, Japan, and Korea each (surely a rather large figure) and no overlap (very generous), that would still make 50% of the characters uncommon, merely on rule-of-thumb. I do have a number, at least for getting an inkling for Chinese use. CCCII classifies: 4,808 common Chinese characters 17,032 less common Chinese 20,583 rare Chinese characters (mostly variants?) 11,517 simplified Chinese gives about 59,000 characters. (However, I believe this usage is not current usage, but usage from the historical sources. ) It is not impossible that the IRG has found an extra 10,000 Han characters in common current use. But that still leaves, from the CCCII classification at least, perhaps 20,000 to 40,000 characters that are less common or rare. 2) The IRG's unification principles do not include anything to remove characters based on their rarity. A rare or archaic character included in a source set will be included under the round-tripping rule. Where the source sets are small, then there will be fewer uncommon characters. If large sets, constructed on historic "catch-all" principles, are included, then there must archaic, uncommon, regional, etc. 
characters.

If an expert is saying that archaic or uncommon characters are not used, are such characters being removed by some undocumented protocol, or are only source sets with no archaic characters being considered, or is the expert making the categorical claim that there are no archaic or variant characters at all?

3) In Unicode 3.1, an extra 42,711 Han characters are being added. Of these (all numbers +/- 2 counting error):

  30,713 are found in Taiwanese sources (CNS 11643 in particular)
  30,529 are found in mainland Chinese sources, most typically from the two
         major lexicons (the KangXi and the HanyuDaZidian)
   4,775 are from Vietnam
   1,088 are from Hong Kong
     303 are from Japanese sources
     160 are from South Korea
   5,760 are from North Korea

(These figures are not mutually exclusive.) Let's look at them in more detail.

Hong Kong
---------
I was told by a staff member of the Hong Kong government (who had some involvement with GCCS) that most of the Hong Kong characters are connected with place or personal names. I have not verified this, but that is what I was told. These kinds of characters are unlikely to be used as element or attribute names. Hence the comment about "regional" characters.

Mainland Chinese
----------------
There is obviously a lot of overlap between the mainland and Taiwanese sources. I cannot count them readily, but at least 18,520 are the same (and at most all of them). (18,486 is also about the same as the number sourced from the KangXi, but this looks to be coincidence.) About 28,922 of the characters are sourced from the HanyuDaZidian.

Nevertheless, as mainland China does not use traditional characters, and limits the characters it does use, characters that come from the mainland dictionary sources but not from Taiwan, Japan, Korea or Vietnam must be considered archaic. This could be up to 10,000 of the characters, on the numbers above. Hence the mention of "archaic".
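As an aside on the "at least 18,520" overlap figure: a lower bound of this kind follows from inclusion-exclusion on the source counts. A minimal sketch (mine, not from the original post), assuming the Taiwanese, mainland, and total counts listed above; it lands close to the cited figure:

```python
# Lower bound on the Taiwan/mainland overlap by inclusion-exclusion:
# |T ∩ M| >= |T| + |M| - |T ∪ M|, and the union cannot exceed the total
# number of new Han characters.
total = 42_711      # new Han characters in Unicode 3.1
taiwan = 30_713     # found in Taiwanese sources
mainland = 30_529   # found in mainland Chinese sources

min_overlap = taiwan + mainland - total
print(min_overlap)  # 18531
```

The result, 18,531, is in the neighbourhood of the 18,520 cited, given the stated counting error in each figure.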
Taiwan
------
In the Taiwanese sources there are about 350 (?) characters of which http://www.unicode.org/unicode/reports/tr27/ states:

  "CJK Compatibility Ideographs Supplement: U+2F800-U+2FA1D
  This block consists of additional compatibility ideographs required for
  round-trip compatibility with CNS 11643-1992, planes 3, 4, 5, 6, 7, and
  15. They should not be used for any other purpose."

Presumably, use in XML Names is such an "other purpose". These characters are probably considered variants or mistakes. Hence the mention of "variants".

Vietnamese
----------
It seems many (most? all?) of the Vietnamese characters are also found in CNS (or among the Korean characters).

Japanese, Korean
----------------
I leave the Japanese and Korean characters out. Most of the North Korean characters are also found in CNS or a lexicon.

Comment
-------
We can attribute at least 30,000 of the characters in Unicode 3.1 to characters which Unicode 2.0 and 3.0 considered variants or secondary: the CNS characters. These are characters which Unicode 3 (http://www.unicode.org/unicode/uni2book/ch10.pdf) says could not be included because CNS (etc.) used unification rules that were "substantially different" from Unicode's. So what has made these characters suddenly not dismissable variants but needed characters?

The paragraph from http://www.unicode.org/unicode/reports/tr27/ quoted above seems to hold the answer: CNS 11643 is now included in the list of round-trippable sources. Even though Unicode 3.1 says that the same unification principles are being applied as in Unicode 3.0 and 2.0, and even though 3.0 (and I think 2.0) promised that no more characters would come in by the round-tripping rule (p. 259), in fact it looks like over 30,000 characters have come in en masse. (Strictly, we can say that only the few hundred characters of the CJK Compatibility Ideographs Supplement have come in for round-tripping against previously announced policy.
) It looks rather like an embarrassing change of an announced policy, with some face-saving wording. Nevertheless, I would not question that it is the best policy: having grappled with the issue for so long, I am sure that the IRG would only have made this change if they felt it was warranted. I am not questioning that their decision is correct.

But I see nothing to go against my original statement: that with some exceptions (e.g. the modest number of additional Japan-sourced characters), the CJK Unified Ideographs Extension B must indeed contain a preponderance of uncommon, archaic, regional, and variant characters. The function of markup is not to preserve historical characters, but to communicate in common language, including common "terms of art", which may well include otherwise unusual characters to some extent.

5) But even as regards jargon, there is a strong tendency in XML to name things generically and to use attribute values to subclass (i.e. "generic identifiers"). So we are more likely to have

  <zoo>
    <primate type="mandrill" />
  </zoo>

rather than

  <zoo>
    <mandrill class="primate" />
  </zoo>

The set of generic terms for the things we typically use in markup is probably quite small, and certainly made up of current and common words. So even where uncommon characters are used in terms of art, if those terms are specific rather than generic they may still not be good for use in markup (as element and attribute names).

6) The other aspect is the question: does the absence of these characters prevent any markup? Given that mainland Chinese users won't use them, Japanese users can use kana or variants, and traditional Chinese users can spell the words by the customary methods, it seems that not having these characters does not prevent native-language markup: it just makes it marginally less satisfactory.
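The restriction at issue can be made concrete. Under XML 1.0 (Second Edition), whose name grammar was frozen against Unicode 2.0, an Extension B ideograph such as U+20000 is outside the Ideographic production, while a common BMP ideograph like U+4E00 is inside it; and representing U+20000 in UTF-16 requires a surrogate pair. A sketch of both points (mine, not from the post; the check below covers only the Ideographic ranges, not the full NameStartChar rules):

```python
def utf16_surrogates(cp):
    """Split a supplementary-plane code point (>= U+10000) into its
    UTF-16 high/low surrogate pair."""
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def is_xml10_ideographic(ch):
    """Simplified: only the Ideographic ranges of XML 1.0 (2nd ed.):
    [#x4E00-#x9FA5] | #x3007 | [#x3021-#x3029]."""
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FA5 or cp == 0x3007 or 0x3021 <= cp <= 0x3029

high, low = utf16_surrogates(0x20000)       # first Extension B code point
print(hex(high), hex(low))                  # 0xd840 0xdc00
print(is_xml10_ideographic("\u4e00"))       # True:  BMP ideograph
print(is_xml10_ideographic("\U00020000"))   # False: Extension B
```

This is also why, as noted in point 7, support is cheap: a processor that already stores text as UTF-16 only needs to stop rejecting surrogate code units in names.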
(Indeed, for Japanese it may be that an uncommon term of art is better understood by lay programmers when spelled out in kana than when written with an obscure kanji.)

7) Furthermore, to restate a previous comment, the greatest need for native script is to allow end users to use native-language NMTOKENs (enumerations) or IDs. The availability of XML Schema datatypes and richer datatyping removes much of the commercial imperative for native-script element and attribute names.

I don't believe that many people will accept that the extra characters are required by pressing need, enormous benefit, or blatant inequity. But people do not necessarily have to accept the argument from benefit: the extra characters can also be justified by their low cost. Supporting them does not mean we all have to introduce 32-bit characters, just that surrogates get used. It can be implemented largely by removing some constraints on which UTF-16 16-bit code points are allowed, in those systems which currently prevent surrogates in names.

Just because I, as a Westerner, cannot see much benefit is no reason why the 3.1 changes should not be adopted. I naturally want to err on the side of conserving what we have, but perhaps it is better for us to err on the side of respect.

Cheers
Rick Jelliffe