Subject: RE: encoding woes: ISO-8859-1 vs. UTF-8
From: Tony Graham <Tony.Graham@xxxxxxx>
Date: Wed, 24 Jul 2002 11:10:59 +0100
Michael Kay wrote at 24 Jul 2002 09:05:31 +0100:
 > > > ISO-8859-1 can only encode the characters in the
 > > > range 0-255.
 > > 
 > > That's what I thought as well.  How did Saxon
 > > convert those two control chars into the proper
 > > encoding for &#8220; and &#8221; even though the input
 > > XML was marked as encoded in ISO-8859-1?  I was fully
 > > expecting the import would fail, but somehow it was successful.
 > 
 > I have no idea. This isn't done by Saxon, it's done by the XML parser.
 > If you were using the default parser (AElfred), I think that it actually
 > accepts bytes x80-x9F with encoding="iso-8859-1", converting them into
 > characters x80-x9F.

Windows code pages, e.g. CP 1252, typically encode #x201C, LEFT DOUBLE
QUOTATION MARK, and #x201D, RIGHT DOUBLE QUOTATION MARK, as 0x93 and
0x94, respectively.

The Windows 2000 "Character Map" utility, for example, shows the
characters with those byte values for their encoding when the
"Character set" is "Windows: Western" or "Windows: Central Europe",
etc.
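
As a minimal sketch of the difference (using Python's standard codecs,
which are not part of the original discussion), the same two bytes can
be decoded under each label.  The variable names are illustrative only.

```python
# The two bytes that CP 1252 uses for the curly double quotes.
raw = bytes([0x93, 0x94])

# Decoded as CP 1252, they are the quotation marks.
as_cp1252 = raw.decode("cp1252")

# Decoded as strict ISO 8859-1, the same bytes are C1 control characters.
as_latin1 = raw.decode("iso-8859-1")

print([f"U+{ord(c):04X}" for c in as_cp1252])  # ['U+201C', 'U+201D']
print([f"U+{ord(c):04X}" for c in as_latin1])  # ['U+0093', 'U+0094']
```

A parser that honours the ISO-8859-1 label literally would be expected
to behave like the second decode, not the first.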

#x201C and #x201D aren't part of ISO 8859-1, so when the encoding
really is ISO 8859-1 and not CP 1252 (or similar), then the only way
to represent #x201C and #x201D is as numeric character references:
&#x201C; (or &#8220;) and &#x201D; (or &#8221;).
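
To illustrate the point about numeric character references, here is a
small sketch using Python's built-in "xmlcharrefreplace" error handler
(an assumption for illustration; it is not what Saxon or AElfred do
internally).  The sample string is made up.

```python
text = "\u201cquoted\u201d"

# A strict ISO 8859-1 encoder has no byte for U+201C or U+201D.
try:
    text.encode("iso-8859-1")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Falling back to decimal numeric character references keeps the
# output pure ISO 8859-1 while still representing both characters.
escaped = text.encode("iso-8859-1", errors="xmlcharrefreplace")

print(strict_ok)  # False
print(escaped)    # b'&#8220;quoted&#8221;'
```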

It appears that AElfred is accommodating the extras in the Windows
code page even when the input is labelled ISO-8859-1.  It used to be
said (and may still be true) that some Microsoft software labelled
CP 1252 text as ISO 8859-1 (although I thought that Outlook was the
main culprit), and "real" ISO 8859-1 isn't going to use the byte
values for the CP 1252 extras (until we get NEL, that is), so it's
forgiving of AElfred to accept the extras.  It's just that this
"principle of least surprise" action surprised several of us.

 > > Good point.  For export output, I changed the encoding to
 > > UTF-8.  That seems to have resolved the problem, and the
 > > export is now successful.  When I open the exported CSV in a
 > > hex editor, those two chars are shown as hex 93/94,
 > > respectively.
 > > 
 > Now I really am puzzled.

I'm puzzled too. #x201C is not 0x93 in UTF-8.
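
For reference, a quick sketch (again using Python's standard codecs as
a stand-in, with illustrative variable names) of what the bytes would
look like if the output really were UTF-8:

```python
left_quote = "\u201c"

# U+201C takes three bytes in UTF-8, none of which is 0x93.
utf8_bytes = left_quote.encode("utf-8")
print(utf8_bytes.hex())  # e2809c

# Even the C1 control character U+0093 takes two bytes in UTF-8.
c1_bytes = "\x93".encode("utf-8")
print(c1_bytes.hex())  # c293

# A lone 0x93 byte is a continuation byte, so it is not valid UTF-8.
try:
    b"\x93".decode("utf-8")
    lone_93_is_valid = True
except UnicodeDecodeError:
    lone_93_is_valid = False
print(lone_93_is_valid)  # False
```

So a bare 93 in a hex editor suggests, though it does not prove, that
the file was written in CP 1252 or a similar single-byte encoding
rather than in UTF-8.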

Regards,


Tony Graham
------------------------------------------------------------------------
XML Technology Center - Dublin                mailto:tony.graham@xxxxxxx
Sun Microsystems Ireland Ltd                       Phone: +353 1 8199708
Hamilton House, East Point Business Park, Dublin 3            x(70)19708

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list