
Re: ArchForms and LPDs

  • From: Rick Jelliffe <rjelliffe@allette.com.au>
  • To: xml-dev <xml-dev@lists.xml.org>
  • Date: Sat, 31 Jul 2021 17:50:54 +1000

{{ Normalization:
For background, for readers who don't know what normalization is: consider A with a ring (angstrom) diacritical, Å. A legacy character set may use two characters, one to represent A and one to represent the combining ring, or it may use a single precomposed character. Unicode supports both forms (U+0041 U+030A, i.e. NFD, and U+00C5, i.e. NFC). The two are indistinguishable to the eye but disruptive for simple collating and string matching. So Unicode supports various kinds of decomposing and combining operations, called normalization. W3C has a Character Model specification which recommends using Unicode Normalization Form C.
 }}
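
As a small illustration of the background note, here is a minimal Python sketch using the standard `unicodedata` module. It assumes the character meant is Å (A with ring above); the variable names are only illustrative.

```python
import unicodedata

# Two encodings of the same visible character, A with ring above.
decomposed = "\u0041\u030A"   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
precomposed = "\u00C5"        # LATIN CAPITAL LETTER A WITH RING ABOVE

# A plain string comparison sees two different code point sequences.
print(decomposed == precomposed)                                   # False
print(len(decomposed), len(precomposed))                           # 2 1

# Normalizing converts one form into the other.
print(unicodedata.normalize("NFC", decomposed) == precomposed)     # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)     # True
```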

First, to confirm the status quo: as I understand it:
  • W3C Charmod (https://www.w3.org/TR/charmod-norm/#unicodeNormalization) does not endorse blanket normalization of a document before parsing. (I believe one of the reasons is that many fonts are normalization-form dependent, so arbitrary normalization can be unproductive.) It prefers NFC for comparisons etc. Therefore, it seems to me that XML 1.1 may not conform to W3C CharMod, while XML 1.0 does, in this respect.
(My proposal for my system is that normalization of names (to NFC) is a server-side responsibility, which clients may check for, or they may build name normalization in themselves too. This only applies to tokens that are not in double quotes, not to strings or literals. I will update the documentation on www.schematron.com for RAN: Random Access Notation with this.)

On Sat, Jul 31, 2021 at 8:04 AM John Cowan <johnwcowan@gmail.com> wrote:


On Tue, Jul 27, 2021 at 11:44 AM Rick Jelliffe <rjelliffe@a...> wrote:

In XML, it is needed because XML supports data coming in with legacy character sets;

Not at all.  Conversion from legacy charsets to Unicode ones already produces NFC normalization (except in a few rare cases like XCCS), because those charsets don't have combining characters, nor both Hangul jamo and Hangul syllables.  It's data in Unicode charsets that may or may not be normalized.
 
I don't understand this. I don't think we disagree, but clearly there are transcoders in the wild that actually do not produce NFC for every legacy charset. (I think John may be reading "it is needed" as "the only reason it is needed" but I meant "it is at least needed".)
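
A receiver that does not want to rely on the transcoder can check the decoded text itself. The following is only a sketch: `is_nfc` is a hypothetical helper name, Latin-1 is an arbitrary choice of legacy charset, and the "decomposing transcoder" is simulated by applying NFD, not taken from any real converter.

```python
import unicodedata

def is_nfc(text: str) -> bool:
    """Return True if text is already in Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text

# Typical case: Latin-1 has only precomposed letters, so decoding yields NFC.
decoded = b"\xc5ngstr\xf6m".decode("latin-1")
print(is_nfc(decoded))            # True

# Simulated transcoder that emits base letter + combining mark instead.
from_decomposing_transcoder = unicodedata.normalize("NFD", decoded)
print(is_nfc(from_decomposing_transcoder))   # False
```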
 
Normalization had to be the responsibility of the receiver system because it could not be the responsibility of the generating system.
 
Well, it was originally the *creating* system that was supposed to NFC-normalize, not the receiving system or a retransmitting system.  But that has never applied to XML or HTML, and as a systems property is too hard to manage.  So you should normalize just in case you need to compare: it's not normalization but equality under normalization that really matters.

Yes. But hard does not mean impossible: if you have a media-type intended for speed or random-access, then it may become in the sender's interest to produce normalized data. (And especially if the media-type was developed mainly for trusted and private use.)
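
The quoted point that equality under normalization is what matters can be sketched as a comparison that normalizes only for the test and leaves the data itself alone. `same_name` is a hypothetical helper; the Hangul example picks up the jamo versus syllable case mentioned earlier in the thread.

```python
import unicodedata

def same_name(a: str, b: str) -> bool:
    """Compare two names under NFC without modifying either of them."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

syllable = "\uD55C"              # precomposed Hangul syllable
jamo = "\u1112\u1161\u11AB"      # the same syllable as three conjoining jamo

print(syllable == jamo)            # False: the code points differ
print(same_name(syllable, jamo))   # True: equal under normalization
```

If the sender guarantees NFC, the receiver can skip the two `normalize` calls and compare directly, which is where a format intended for speed would benefit.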

Really, the issue is building normalization checking into the APIs for creating element objects, etc., which requires doing it on the ground floor.
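
One way to read "on the ground floor" is a check inside the constructor for element objects. The sketch below is an assumption about what such an API could look like, not any existing library: `Element`, `NameNotNormalizedError`, and the `strict` flag are all hypothetical. It checks names only and leaves string content untouched, in line with the proposal above.

```python
import unicodedata

class NameNotNormalizedError(ValueError):
    """Raised when an element name is not in NFC (hypothetical error type)."""

class Element:
    def __init__(self, name: str, text: str = "", *, strict: bool = True):
        normalized = unicodedata.normalize("NFC", name)
        if normalized != name:
            if strict:
                # Checking mode: reject the name, do not repair it silently.
                raise NameNotNormalizedError(f"name is not NFC: {name!r}")
            # Lenient mode: normalize the name on the way in.
            name = normalized
        self.name = name
        self.text = text   # content strings are deliberately not normalized

ok = Element("\u00C5ngstr\u00F6m", "caf\u0065\u0301")   # NFC name, NFD content
fixed = Element("A\u030Angstrom", strict=False)          # name gets normalized

try:
    Element("A\u030Angstrom")                            # strict by default
except NameNotNormalizedError as err:
    print("rejected:", err)
```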

Rick

