Unicode, xml:lang, and variant glyphs

  • From: John Cowan <cowan@l...>
  • To: XML Dev <xml-dev@i...>
  • Date: Tue, 03 Nov 1998 15:20:15 -0500

Rick Jelliffe wrote:

> Not so. The additions are usually composed of standard radicals and
> combinations. There are various projects around (such as C.C.Hsieh in
> Taiwan) to figure out encodings to "spell" Han ideographs by component
> radicals. 

I'm glad to hear about this; I find the IRG archives utterly
impenetrable.

> I guess the point is that John thinks that if an XML system can produce
> characters which a recipient system cannot process, because it does not use
> ISO 10646, that is not something that CDATA sections should be used to
> address. I think his reasons are that he cannot see it in the spec. [...]
> I think a lot
> of people now think that any non-ISO10646 system is for losers anyway
> (except for whatever character set they use, probably).

Well, actually I would say the latter rationale has more effect on me
than the former, if I must choose either.  It just seemed to me that
using CDATA sections to constrain the behavior of editors was not
particularly user-friendly; if the user wants a character, let her
have it, using a character reference if possible.

In general, transcoding XML documents involves inserting NCRs as needed,
unless the target is UTF-8 or UTF-16.
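
To make that concrete, here is a minimal sketch in Python of transcoding
character data into a narrower encoding. The helper name transcode_text
and the sample string are illustrative assumptions, not part of any XML
toolkit; the only library feature relied on is the standard
"xmlcharrefreplace" error handler of str.encode, which substitutes a
decimal NCR for each character the target encoding cannot represent.

```python
def transcode_text(text: str, target_encoding: str) -> bytes:
    """Encode XML character data, falling back to decimal numeric
    character references (&#NNNN;) for anything the target encoding
    cannot represent. (Hypothetical helper, for illustration only.)"""
    return text.encode(target_encoding, errors="xmlcharrefreplace")


# U+00E9 (e with acute) followed by U+8349 ("grass").
sample = "caf\u00e9 \u8349"

print(transcode_text(sample, "ascii"))    # b'caf&#233; &#33609;'
print(transcode_text(sample, "latin-1"))  # b'caf\xe9 &#33609;'
print(transcode_text(sample, "utf-8"))    # b'caf\xc3\xa9 \xe8\x8d\x89'
```

With UTF-8 as the target, no NCRs are needed at all. The sketch covers
only character data and attribute values: character references are not
recognized inside element or attribute names, and a real transcoder
would also have to rewrite the encoding declaration.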

> The primary purpose of xml:lang, as far as I am concerned, should be to
> convey the information lost by ISO 10646 unification: where the Japanese and
> Chinese glyphs

Actually, the problem isn't that clearcut.  As John Jenkins posted
to the Unicode list last year:

# FACT.  It is true that some Unihan characters are typically written 
# differently within the Japanese, Taiwanese, Korean, and Mainland Chinese 
# typographic traditions.  
# 
# FACT.  These differences of writing style are within the general range of 
# allowable differences within each typographic tradition.  
# 
# E.g., the official "Taiwanese" glyph for U+8349 ("grass") per ISO/IEC 
# 10646 uses four strokes for the "grass" radical, whereas the PRC, 
# Japanese, and Korean glyphs use three.  As it happens, Apple's LiSung 
# Light font for Big Five (which follows the "Taiwanese" typographic 
# tradition) uses three strokes.  
# 
# (This is easily confirmed by accessing 
# http://www.unicode.org/unihan/unihan.acgi$8349.)  
# 
# FACT.  Japanese users prefer to see Japanese text written with "Japanese" 
# glyphs.  
# 
# FACT.  It is also acceptable to Japanese users to see Chinese text 
# written with "Japanese" glyphs.  
# 
# E.g., I just borrowed from Lee Collins a standard Japanese dictionary 
# which quotes Chinese authors (e.g., Mencius) to show how a character is 
# used.  When doing so, they use "Japanese" glyphs, not Chinese ones. 
# 
# In particular, it is acceptable within Japanese typography for a small 
# stretch of Chinese quoted in a predominantly Japanese text to be written 
# with "Japanese" glyphs.  
# 
# FACT.  Han unification allows for the possibility that a Japanese user 
# might be required to use a Chinese font to display some Japanese text 
# (e.g., if it uses a rare kanji).  
# 
# FACT.  Ditto for JIS or an ISO 2022-based solution.  
# 
# FACT.  Unicode doesn't include all the characters in actual use in Japan 
# today, particularly for personal names.  
# 
# FACT.  Neither does JIS or an ISO 2022-based solution.  There are vendor 
# sets which include many of these characters, and Unicode is working with 
# the IRG and East Asian national bodies to add them.
  
> (or Polish and Russian)

How's that again?

Polish uses Latin, Russian uses Cyrillic!  What could possibly
count as a unification between these two??  *Nobody* thinks that
LATIN LETTER A and CYRILLIC LETTER A should be unified....
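
For anyone who wants to check this, a short Python sketch using the
standard unicodedata module shows the two capital letters as separate
code points with separate names. The helper describe is a made-up name
for illustration only.

```python
import unicodedata


def describe(ch: str) -> tuple[str, str]:
    """Return (code point in U+XXXX form, Unicode character name)."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch))


latin_a = "\u0041"     # looks like "A"
cyrillic_a = "\u0410"  # also looks like "A" in most fonts

print(describe(latin_a))      # ('U+0041', 'LATIN CAPITAL LETTER A')
print(describe(cyrillic_a))   # ('U+0410', 'CYRILLIC CAPITAL LETTER A')

# Similar glyphs, but never the same character.
print(latin_a == cyrillic_a)  # False
```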

> for a unified character differ, then
> I think transcoding and unifying the characters into ISO 10646 can lose
> information unless the xml:lang attribute is set.

It doesn't lose information about meaning.  It may make characters
harder to read, but the distinction is one of typographic tradition,
not language, and can cross languages.

-- 
John Cowan	http://www.ccil.org/~cowan		cowan@c...
	You tollerday donsk?  N.  You tolkatiff scowegian?  Nn.
	You spigotty anglease?  Nnn.  You phonio saxo?  Nnnn.
		Clear all so!  'Tis a Jute.... (Finnegans Wake 16.5)
