How best to represent unrepresentable characters in NAME tokens?
If you have a Unicode-friendly XML environment, then users can create elements whose GIs or attribute names contain "interesting" characters. (Yes? A NAME token can contain "BaseChars", which includes characters beyond ASCII and even beyond Latin-1.)

So, if a user requests that the document instance be saved as an ASCII file, what is the best way for a Unicode-aware and standards-compliant application to represent these characters? It's not legal to say <Stra&sz;e>, and the user may already have an element type called "Strasse", so it would be inappropriate to "reduce" it. [I chose this example because it is easy to describe in email; the problem is much more difficult if, instead of German, the user has used Cyrillic or Hebrew NAMEs.]

I've thought of three solutions:

1. It's an error. Tell the user "Sorry, your file could not be saved in that character encoding because the element name 'StraBe' could not be represented."

   Advantages: It's fully compliant and no data can get lost.

   Disadvantages: No data can get out, either. Perhaps the user has an 8-bit app to massage the data in a particular way, and she doesn't want to rename all her elements.

2. Rename all the offending elements and attributes, and use PIs to ensure that when they're read back in we can patch things up. So, for example, the file could contain:

       <?GoodCitizen MangledGI Strae1="Straße"?>
       <Strae1>foo bar</Strae1>

   Advantages: It's fully compliant.

   Disadvantages: It assumes that all other processing applications will be nice and won't lose my processing instructions, and it makes the file hard to read. It's also non-portable; unless we as a community decide on a "semi-standard" PI to use, no one else will know how to interpret this convention. (On the other hand, this is exactly why I'm bringing the issue up here. Maybe we can all agree on a semi-standard and I'll feel less uneasy about doing something like this....)

3. Violate the standard and use character references to represent the ineffable, for example:

       <Stra&#xDF;e>foo bar</Stra&#xDF;e>

   Advantages: It's compact and unambiguous (even if it's illegal :-).

   Disadvantages: It violates both XML and 8879 in a new and perverse way. The user's file will not be usable by any other piece of standards-compliant software. That's worse than refusing to write the file at all (number 1).

My questions to the assembled multitudes are:

* Is there a need for a "semi-standard" solution to this problem, or am I the only one struggling with it?

* Is there interest in adopting some variation of number 2 so that we're better able to exchange such data?

* I can't help but think that number 3 would be the most elegant solution if it were only legal. Yet I'm also sure that the XML committee had a good reason for disallowing it. I'd be interested in hearing what their reason was, so that I may become enlightened. :-)

Thanks for your thoughts,
Andrew Greene

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)
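For concreteness, here is a minimal Python sketch of the renaming described in number 2 above, with the check from number 1 as a helper. It is only an illustration under stated assumptions: the function names are hypothetical, the PI target "GoodCitizen" and the "MangledGI" layout are simply copied from the example in the message and are not an agreed convention, and the original name is written into the PI as a decimal character reference so that the PI itself survives in an ASCII file. An XML parser does not expand references inside a PI, so the reading application would have to decode that itself.

```python
# Hypothetical PI target, copied from the example in the message.
PI_TARGET = "GoodCitizen"


def is_representable(name, encoding="ascii"):
    """Return True if every character of name can be written in encoding."""
    try:
        name.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def mangle_names(names, encoding="ascii"):
    """Pick a unique, representable stand-in for each offending name.

    Returns a dict {original: mangled} covering only the renamed names.
    """
    # Names that are already fine must never be reused as stand-ins.
    taken = {n for n in names if is_representable(n, encoding)}
    renamed = {}
    for name in names:
        if name in taken or name in renamed:
            continue
        # Drop the characters that cannot be encoded ("Straße" -> "Strae").
        stem = name.encode(encoding, errors="ignore").decode(encoding)
        # Keep the result a legal name: it must start with a letter or "_".
        if not stem or not (stem[0].isalpha() or stem[0] == "_"):
            stem = "_" + stem
        n = 1
        while f"{stem}{n}" in taken:
            n += 1
        renamed[name] = f"{stem}{n}"
        taken.add(renamed[name])
    return renamed


def mapping_pi(renamed, encoding="ascii"):
    """Build the PI that records mangled="original" for later patching up."""
    pairs = " ".join(
        '{}="{}"'.format(
            mangled,
            # Assumption: originals are stored as decimal character references.
            original.encode(encoding, errors="xmlcharrefreplace").decode(encoding),
        )
        for original, mangled in renamed.items()
    )
    return f"<?{PI_TARGET} MangledGI {pairs}?>"


renamed = mangle_names(["Straße", "Strasse"])
print(mapping_pi(renamed))
# prints: <?GoodCitizen MangledGI Strae1="Stra&#223;e"?>
```

The existing "Strasse" is left alone and "Straße" becomes "Strae1", so the two element types stay distinct. A name with no ASCII characters at all (Cyrillic or Hebrew, say) would come out as "_1", "_2", and so on, which is legal but shows how unreadable the file becomes.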