On Thu, 2002-01-10 at 20:24, Rick Jelliffe wrote:
> And also, do surrogate pairs really introduce any issues that
> are not already present in combining character sequences?

Perhaps not, but they happen at a different level of processing.
Surrogate sequences require processing before combining character
sequences, unless there's a ban on surrogates participating in
combinations I haven't heard of.

Surrogates also interact more directly with the productions in XML 1.0 -
combining chars are permitted, but there's no need to perform the
combination to see if your characters are acceptable. Normalization is
a good idea, but not required for basic syntactical checking.

> Using IBM's Internationalization Classes for Unicode
> (bulk kudos to Mark Davis), it is quite straightforward
> to add normalization to data import and character
> entry in an interactive application. This means that
> your application uses combined characters where they
> are available rather than combining character sequences.
> For most Western Latin languages, Unicode provides
> pre-combined characters: enough even to support
> Vietnamese with multiple levels of accent.

This looks very cool, but it also seems like a lot more overhead than is
necessary for a trivial character check like Gorille performs.

> The other issue here is that 1 Java char = 1 glyph
> assumption does not imply that every character is
> the same width: if you support proportional width
> characters you can still support Chinese and Japanese.
>
> The W3C I18n WG has a new version of their "Character
> Model for the WWW" at http://www.w3.org/TR/
> which is looking pretty good. It is really well written
> and anyone who wants to get a grip on internationalization
> or character issues should find it a good place to start.

It's a great document, but its call for processing at the character
string level doesn't mesh well with the current exigencies of Java -
where a char is a glyph under many circumstances, not a glyph under
others, and normalizing combining characters doesn't help with surrogate
processing issues.

I don't think normalization answers the kinds of issues Gorille is
designed to address. Fortunately, I don't think surrogates will be a
common problem for most people (both developers and users), but they'll
continue to irk a lot of people dealing with Java.

--
Simon St.Laurent
Ring around the content, a pocket full of brackets
Errors, errors, all fall down!
http://simonstl.com
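
To make the ordering argument concrete, here is a minimal sketch of a surrogate-aware check against the XML 1.0 Char production. It is not Gorille's code: the class name XmlCharScan and both method names are hypothetical, and it relies on Character.isHighSurrogate, Character.isLowSurrogate and Character.toCodePoint, which only arrived in Java 5, after this message was written. Surrogate pairs are joined into a code point before the range test, while combining characters are tested one at a time and never combined or normalized.

```java
// Hypothetical sketch, not Gorille's API: a per-character acceptability
// check that pairs surrogates first and performs no normalization.
final class XmlCharScan {

    // XML 1.0 Char production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the index of the first unacceptable Java char, or -1 if the
    // whole string passes.
    static int firstInvalidIndex(String s) {
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int cp;
            int width;
            if (Character.isHighSurrogate(c) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // Well-formed pair: two Java chars become one code point.
                cp = Character.toCodePoint(c, s.charAt(i + 1));
                width = 2;
            } else {
                // Ordinary char, or a lone surrogate. A lone surrogate lies
                // in D800-DFFF, which the production excludes, so it fails.
                cp = c;
                width = 1;
            }
            if (!isXmlChar(cp)) {
                return i;
            }
            i += width;
        }
        return -1;
    }
}
```

Under these assumptions, a string holding "e" followed by U+0301 passes without being combined into a single character, a well-formed surrogate pair passes as one code point, and an unpaired surrogate is reported at its own index.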