
How best to represent unrepresentable characters in NAME tokens?

  • From: agreene@b... (Andrew Greene)
  • To: xml-dev@i...
  • Date: Mon, 3 Nov 1997 14:52:50 -0500


If you have a Unicode-friendly XML environment, then users can create
elements whose GIs or attribute names contain "interesting"
characters. (Yes? A NAME token can contain "BaseChars", which includes
characters beyond ASCII and even beyond Latin-1.)

So, if a user requests that the document instance be saved as an ASCII
file, what is the best way for a Unicode-aware and standards-compliant
application to represent these characters? It's not legal to say

   <Stra&szlig;e>

and the user may already have an element type called "Strasse" so it
would be inappropriate to "reduce" it. [I chose this example because
it is easy to describe in email; the problem is much more difficult
if, instead of German, the user has used Cyrillic or Hebrew NAMEs.]
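
To illustrate the constraint, here is a minimal sketch that uses Python's standard xml.etree.ElementTree parser (chosen only for illustration; the helper name is_well_formed and the sample documents are hypothetical). It checks that the raw character is accepted in a name when the encoding can carry it, while an entity or character reference in the same position is rejected.

```python
import xml.etree.ElementTree as ET

def is_well_formed(document: bytes) -> bool:
    """Return True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

# The raw character is fine when the encoding can carry it (UTF-8 here).
utf8_doc = "<Straße>foo bar</Straße>".encode("utf-8")

# References are recognized in content and attribute values only,
# so neither form below is a legal name.
entity_in_name = b"<Stra&szlig;e>foo bar</Stra&szlig;e>"
charref_in_name = b"<Stra&#xDF;e>foo bar</Stra&#xDF;e>"

for doc in (utf8_doc, entity_in_name, charref_in_name):
    print(doc, is_well_formed(doc))
```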

I've thought of three solutions:

1. It's an error. Tell the user "Sorry, your file could not be saved
   in that character encoding because the element name 'StraBe' could
   not be represented."

   Advantages: It's fully compliant and no data can get lost.

   Disadvantages: No data can get out, either. Perhaps the user has
   an 8-bit app to massage the data in a particular way, and she
   doesn't want to rename all her elements.

2. Rename all the offending elements and attributes, and use PIs to
   ensure that when they're read back in we can patch things up.
   So, for example, the file could contain:

   <?GoodCitizen MangledGI Strae1="Stra&#x00DF;e"?>
   <Strae1>foo bar</Strae1>

   Advantages: It's fully compliant.

   Disadvantages: It assumes that all other processing applications
   will be nice and won't lose my processing instructions, and it
   makes the file hard to read. It's also non-portable; unless we
   as a community decide on a "semi-standard" PI to use, no one else 
   will know how to interpret this convention. (On the other hand, 
   this is exactly why I'm bringing the issue up here. Maybe we can 
   all agree on a semi-standard and I'll feel less uneasy about
   doing something like this....)

3. Violate the standard and use character references to represent the 
   ineffable, for example:

   <Stra&#xDF;e>foo bar</Stra&#xDF;e>

   Advantages: It's compact and unambiguous (even if it's illegal :-).

   Disadvantages: It violates both XML and ISO 8879 (SGML) in a new
   and perverse way. The user's file will not be usable by any other
   piece of 
   standards-compliant software. That's worse than refusing to write
   the file at all (number 1).
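
For solution 1, the check itself is simple. The following is a minimal sketch, assuming the document is held as a Python xml.etree.ElementTree tree; the function names unrepresentable_names and check_saveable are hypothetical, and the error text only paraphrases the message suggested above.

```python
import xml.etree.ElementTree as ET

def unrepresentable_names(root, encoding="ascii"):
    """Collect element and attribute names the target encoding cannot carry."""
    bad = set()
    for element in root.iter():
        if not isinstance(element.tag, str):
            continue  # skip comment and processing-instruction nodes
        for name in (element.tag, *element.attrib):
            try:
                name.encode(encoding)
            except UnicodeEncodeError:
                bad.add(name)
    return sorted(bad)

def check_saveable(root, encoding="ascii"):
    """Refuse to save (solution 1) if any name does not fit the encoding."""
    bad = unrepresentable_names(root, encoding)
    if bad:
        raise ValueError(
            f"cannot save as {encoding}: names {bad} could not be represented")

sample = ET.fromstring('<doc><Straße größe="1">x</Straße><Strasse/></doc>')
print(unrepresentable_names(sample, "ascii"))
print(unrepresentable_names(sample, "latin-1"))
```

Character data and attribute values need no such check, because they can always fall back to character references; only names are stuck.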

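Solution 2 can be sketched as a round trip. This is only an illustration under several assumptions: the PI target GoodCitizen and the MangledGI keyword come from the example above and are not an agreed convention; the helper names (save_ascii, load_ascii) are hypothetical; attribute names are left out for brevity; and because references are not expanded inside processing instructions, the reader has to decode the &#x...; notation itself.

```python
import re
import xml.etree.ElementTree as ET
from xml.parsers import expat

PI_TARGET = "GoodCitizen"  # taken from the example above; not a standard

def _mangle(name, taken):
    """Drop non-ASCII characters, then add a numeric suffix until unique."""
    base = "".join(ch for ch in name if ord(ch) < 128)
    if not (base[:1].isalpha() or base[:1] == "_"):
        base = "_" + base  # keep the result a legal name
    n = 1
    while f"{base}{n}" in taken:
        n += 1
    return f"{base}{n}"

def _escape(name):
    """Spell non-ASCII characters in hexadecimal reference notation."""
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):04X};" for ch in name)

def save_ascii(root):
    """Serialize to ASCII bytes, renaming (in place) element types that do not fit."""
    taken = {el.tag for el in root.iter()}
    mapping = {}  # original name -> mangled name
    for el in root.iter():
        if not el.tag.isascii():
            if el.tag not in mapping:
                mapping[el.tag] = _mangle(el.tag, taken)
                taken.add(mapping[el.tag])
            el.tag = mapping[el.tag]
    pis = "".join(f'<?{PI_TARGET} MangledGI {new}="{_escape(old)}"?>\n'
                  for old, new in mapping.items())
    return pis.encode("ascii") + ET.tostring(root, encoding="us-ascii")

def load_ascii(data):
    """Parse bytes written by save_ascii and restore the original names."""
    mapping = {}  # mangled name -> original name

    def on_pi(target, text):
        if target != PI_TARGET:
            return
        m = re.fullmatch(r'MangledGI\s+(\S+)="([^"]*)"', text)
        if m:
            mapping[m.group(1)] = re.sub(
                r"&#x([0-9A-Fa-f]+);",
                lambda ref: chr(int(ref.group(1), 16)),
                m.group(2))

    parser = expat.ParserCreate()
    parser.ProcessingInstructionHandler = on_pi
    parser.Parse(data, True)    # first pass: collect the mapping PIs
    root = ET.fromstring(data)  # second pass: build the tree
    for el in root.iter():
        el.tag = mapping.get(el.tag, el.tag)
    return root

original = ET.fromstring("<doc><Straße>foo bar</Straße><Strasse>x</Strasse></doc>")
saved = save_ascii(original)
print(saved.decode("ascii"))
restored = load_ascii(saved)
```

The round trip only works if every tool between save and load preserves the processing instructions, which is exactly the weakness noted under solution 2.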

My questions to the assembled multitudes are:

* Is there a need for a "semi-standard" solution to this problem, or am
  I the only one struggling with it?

* Is there interest in adopting some variation of number 2 so that we're
  better able to exchange such data?

* I can't help but think that number 3 would be the most elegant solution
  if it were only legal. Yet I'm also sure that the XML committee had a 
  good reason for disallowing it. I'd be interested in hearing what their
  reason was, so that I may become enlightened. :-)

Thanks for your thoughts,
  Andrew Greene


xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)

