Re: Quiz: How do you put a Euro sign in your data if your XML

  • From: Norman Gray <norman@astro.gla.ac.uk>
  • To: Michael Kay <mike@saxonica.com>
  • Date: Fri, 1 Mar 2013 17:45:38 +0000

Greetings.

On 2013 Mar 1, at 11:36, Michael Kay wrote:

>> I hinted at this months ago on this list that I believe the level of misunderstanding of encoding and Unicode concepts is both high and not self recognized.  Which is a deadly combination.
>> Is there more "the community" can do to make it clearer?
>> 
> 
> If there is, please let me know.
> 
> I've been advising people how to solve character encoding issues for about 100 years, but our own internal system for handling Saxon license requests still gets it wrong. It ain't easy.

For what it's worth, 1: Joel Spolsky's article on "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" <http://www.joelonsoftware.com/articles/Unicode.html> is quite good, I think.  It's (surely) been mentioned here before, but it might be worth mentioning again in this thread.

For what it's worth, 2: on the couple of occasions when I've had to explain 'Unicode' to a colleague, it's been the notion of the abstract Unicode codepoints that has turned out to be the key to the illumination.  The structure of my successful explanations has been something like this:

  * The Unicode Consortium has (with much agonising and negotiation) managed to give a number to a large fraction of the characters in use.  These numbers are (jargon) called 'codepoints'.

  * 'The Letter A' has a codepoint, and this is independent of fonts.  Thus 'A' and 'a' have different codepoints, but roman, bold, italic, serif, sans serif (et very much cetera) are not distinguished.  Japanese kanji have codepoints; Tengwar and Klingon are not in the standard itself, though both have unofficial assignments in the private-use area (mentioning them still gets attention).

  * A 'unicode string' is (conceptually) a sequence of codepoints.  This is a sequence of mathematical integers.  It does not make sense to ask whether these are bytes, 2-byte or 4-byte words; the sequence has nothing to do with computers.

  * If you want to send that sequence of integers to someone, or save it on a computer disk, you have to do something to encode it.  (You could also write the numbers down on a piece of paper, but let's specialise to computers at this point.)  On a computer, that means transforming the integers into a sequence of bytes.  There are multiple procedures for doing that, and each of these procedures is named an 'encoding'.  One of these 'encodings' is UTF-8.

  * When you 'read a Unicode file', you are starting with a sequence of bytes, on disk, and conceptually ending up with a sequence of integers.  If the 'unicode file' is indicated, somehow, to be encoded in UTF-8, then you have to decode that sequence of bytes to get the sequence of integers.  All of the subsequent operations on the 'unicode string' are defined in terms of the sequence of codepoints, and the fact that it started off, on disk, as 'UTF-8' is forgotten.
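
The steps in the list above can be sketched in a few lines of Python.  This is only an illustration, not part of the original explanation: the sample string and the variable names are assumptions, and Python's built-in str type is used as a stand-in for the abstract 'sequence of codepoints'.

```python
# Illustrative sketch: the string and variable names are made up for this example.
text = "A\u20ac"  # 'A' followed by the Euro sign (U+20AC)

# The abstract view: a sequence of integers (codepoints), no bytes involved yet.
codepoints = [ord(ch) for ch in text]
print(codepoints)  # [65, 8364]

# Encoding: turn the sequence of integers into a sequence of bytes.
# Different encodings give different bytes for the same codepoints.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")
print(utf8_bytes.hex())   # 41e282ac
print(utf16_bytes.hex())  # 004120ac

# Decoding: go from bytes back to the sequence of codepoints.
# After this step, the fact that the bytes were UTF-8 is forgotten.
decoded = utf8_bytes.decode("utf-8")
print(decoded == text)  # True
```

Note that the Euro sign is one codepoint in both cases, but three bytes in UTF-8 and two in UTF-16.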

The key point seems to me to be making it clear that 'UTF-8' is no more than a detail -- a necessary complication occasioned by the need to save the 'unicode string' to a disk.
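
A small Python sketch of that point, under the assumption that the same Euro sign has been saved once as UTF-8 and once as Windows-1252 (the choice of second encoding and the variable names are illustrative only): the bytes on disk differ, the decoded strings are identical, and trouble starts only when the bytes are decoded with the wrong encoding.

```python
# Illustrative sketch: the encodings and names are chosen for this example only.
euro = "\u20ac"  # the Euro sign as a single codepoint

# Two different on-disk representations of the same codepoint.
as_utf8 = euro.encode("utf-8")      # three bytes: e2 82 ac
as_cp1252 = euro.encode("cp1252")   # one byte: 80

# Decoded with the matching encoding, both give back the same string.
same = as_utf8.decode("utf-8") == as_cp1252.decode("cp1252")
print(same)  # True

# Decoded with the wrong encoding, the bytes become three unrelated characters.
misread = as_utf8.decode("cp1252")
print(len(misread), misread == euro)  # 3 False
```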

Depending on audience, it takes more or fewer words than that.  But not much more, and I think that Spolsky's explanation is still longer than it has to be.  In any case, that ordering of points works for me.

Best wishes,

Norman


-- 
Norman Gray  :  http://nxg.me.uk
SUPA School of Physics and Astronomy, University of Glasgow, UK


