From: "Tim Bray" <tbray@t...>

> The point about doing strings, not characters, is well-taken, and one of
> the things in the W3C i18n draft that gave me an "aha" moment. On the
> other hand, I think that when I say a "Unicode character", that has a
> very well-defined semantic, and COMBINING UMLAUT is one while codepoints
> from the surrogate blocks aren't, and any API that doesn't make that
> clear is, well, wrong. Put another way, something that is a Unicode
> character in UTF-16 should also be a character in UTF-8 and UTF-32,
> which the surrogates aren't, so they are just not characters in any
> meaningful sense of the word.

I'm puzzled. What is the "aha moment" here? Your point seems to be that
Java char != Unicode character. True. Exactly like UTF-8 octet != Unicode
character. The fact that half a surrogate pair is not a Unicode character
doesn't seem like breaking news.

Do you mean to say that use of UTF-16 character encoding in a programming
language is broken as designed? In the perfect language of your own design,
would you have the "char" type be 32 bits? Is that what this is all about?

Bob
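For anyone following along, the char-vs-character point can be seen directly in Java. A minimal sketch (the choice of U+1D518 MATHEMATICAL FRAKTUR CAPITAL U as the example code point is mine, any supplementary-plane character would do):

```java
public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1D518 lies outside the Basic Multilingual Plane, so in
        // UTF-16 (Java's internal string encoding) it is stored as a
        // surrogate pair: two char values for one Unicode character.
        String s = "\uD835\uDD18";

        System.out.println(s.length());                      // 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // 1 (Unicode characters)

        // Neither half of the pair is a Unicode character on its own:
        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
    }
}
```

So Java's char is a UTF-16 code unit, exactly as a UTF-8 octet is a code unit, and code that wants whole characters has to use the code-point APIs (codePointAt, codePointCount) rather than indexing chars.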