
RE: encoding woes: ISO-8859-1 vs. UTF-8

Subject: RE: encoding woes: ISO-8859-1 vs. UTF-8
From: Xiaocun Xu <xiaocunxu@xxxxxxxxx>
Date: Wed, 24 Jul 2002 07:23:13 -0700 (PDT)
--- Tony Graham <Tony.Graham@xxxxxxx> wrote:
> Michael Kay wrote at 24 Jul 2002 09:05:31 +0100:
>  > > > ISO-8859-1 can only encode the characters in the
>  > > > range 0-255.
>  > > 
>  > > That's what I thought as well.  How did Saxon convert
>  > > those two control chars into the proper encoding for “
>  > > and ” even though the input XML was marked as encoded
>  > > in ISO-8859-1?  I was fully expecting the import would
>  > > fail, but somehow it was successful.
>  > 
>  > I have no idea. This isn't done by Saxon, it's done by the
>  > XML parser. If you were using the default parser (AElfred),
>  > I think that it actually accepts bytes x80-x9F with
>  > encoding="iso-8859-1", converting them into characters
>  > x80-x9F.
> 
> Windows code pages, e.g. CP 1252, typically encode #x201C, LEFT
> DOUBLE QUOTATION MARK, and #x201D, RIGHT DOUBLE QUOTATION MARK,
> as 0x93 and 0x94, respectively.
> 
> The Windows 2000 "Character Map" utility, for example, shows the
> characters with those byte values for their encoding when the
> "Character set" is "Windows: Western" or "Windows: Central
> Europe", etc.
> 
> #x201C and #x201D aren't part of ISO 8859-1, so when the encoding
> really is ISO 8859-1 and not CP 1252 (or similar), then the only
> way to represent #x201C and #x201D is as numeric character
> references: &#x201C; (or &#8220;) and &#x201D; (or &#8221;).
> 
> It appears that AElfred is accommodating the extras in the
> Windows code page even when the input is labelled ISO-8859-1.
> Since it used to be said (and may still be true) that some
> Microsoft software labelled CP 1252 text as ISO 8859-1 (although
> I thought that Outlook was the main culprit) and since "real"
> ISO 8859-1 isn't going to use the byte values for the CP 1252
> extras (until we get NEL, that is), then it's forgiving of
> AElfred to accept the extras.  It's just that this "principle of
> least surprise" action surprised several of us.

Thanks for the explanation, that makes a lot of sense.  It sounds
like the entire MSOffice suite is a culprit, if not more.  If this
is only allowed by AElfred, I guess I will really have to resolve
this problem when I upgrade to Saxon7.x and XercesJ2.
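
To make the difference concrete, here is a minimal sketch in Python 3
(my choice for illustration only; nothing in this thread uses Python).
It decodes the same two bytes once as CP 1252 and once as strict
ISO 8859-1.  The helper name code_points is made up for this example.

```python
# Bytes 0x93 and 0x94, as a Windows application would write curly quotes.
raw = bytes([0x93, 0x94])

# Decoded as CP 1252, they are the curly double quotes U+201C and U+201D.
as_cp1252 = raw.decode("cp1252")

# Decoded as strict ISO 8859-1, each byte maps to the code point with the
# same value, which falls in the C1 control range (U+0080 to U+009F).
as_latin1 = raw.decode("iso-8859-1")


def code_points(text):
    """Return the code points of text as 'U+XXXX' strings (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(as_cp1252))  # ['U+201C', 'U+201D']
print(code_points(as_latin1))  # ['U+0093', 'U+0094']
```

A parser that treats the ISO-8859-1 label strictly would therefore hand
the stylesheet two control characters rather than quotation marks.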

>  > > Good point.  For export output, I changed encoding to
>  > > UTF-8, that seems to have resolved the problem, now
>  > > export is successful.  Open the exported CSV in Hex
>  > > editor, those two chars are shown as Hex 93/94,
>  > > respectively.
>  > > 
>  > Now I really am puzzled.
> 
> I'm puzzled too. #x201C is not 0x93 in UTF-8.

Very strange indeed.  I checked the hex values stored in SQLServer
after import: both chars are stored as hex 22, the plain quotation
mark in ISO-8859-1.  How did they get turned into hex 93 and hex 94
on export?  Even though I marked the exported proprietary XML as
UTF-8, Saxon/AElfred had no problem processing it.
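
For reference, a small Python 3 sketch (again only an illustration,
not part of the Saxon/AElfred setup) of what the two quotation marks
look like in real UTF-8, plus a rough check of whether a byte string
is valid UTF-8 at all.  The function name is_valid_utf8 is made up.

```python
left, right = "\u201c", "\u201d"  # LEFT and RIGHT DOUBLE QUOTATION MARK

# In UTF-8 each of these characters takes three bytes.
utf8_left = left.encode("utf-8")    # b'\xe2\x80\x9c'
utf8_right = right.encode("utf-8")  # b'\xe2\x80\x9d'

# In CP 1252 each takes the single byte seen in the hex editor.
cp1252_left = left.encode("cp1252")    # b'\x93'
cp1252_right = right.encode("cp1252")  # b'\x94'


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(utf8_left.hex(), utf8_right.hex())      # e2809c e2809d
print(cp1252_left.hex(), cp1252_right.hex())  # 93 94
print(is_valid_utf8(b"\x93quoted\x94"))       # False
```

So a file that shows single bytes 93/94 in a hex editor is not UTF-8,
whatever its label says; it looks like CP 1252.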

To use UTF-8 consistently, I guess I need to run native2ascii on the
imported Excel CSV before I start the XSLT transformation.  But what
happens on export?  Opened in a hex editor, the CSV uses one byte
per char, so how could the export generate a CSV containing the
&#8220; and &#8221; chars?
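
One possible pre-processing step, sketched in Python 3 under the
assumption that the Excel CSV really is CP 1252.  This is a stand-in
for the conversion idea, not what native2ascii does (native2ascii
writes \uXXXX escapes, not XML character references).  The function
names and the sample row are made up for illustration.

```python
def cp1252_to_utf8(data: bytes) -> bytes:
    """Re-encode CP 1252 bytes as UTF-8.

    Assumes every byte is defined in Python's cp1252 codec; undefined
    bytes such as 0x81 raise UnicodeDecodeError.
    """
    return data.decode("cp1252").encode("utf-8")


def to_ascii_with_char_refs(text: str) -> bytes:
    """Encode text as ASCII, writing non-ASCII chars as &#NNNN; references."""
    return text.encode("ascii", errors="xmlcharrefreplace")


# Hypothetical CSV row containing CP 1252 curly quotes (bytes 0x93, 0x94).
sample = b'id,comment\r\n1,\x93hello\x94\r\n'

print(cp1252_to_utf8(sample))
# b'id,comment\r\n1,\xe2\x80\x9chello\xe2\x80\x9d\r\n'

print(to_ascii_with_char_refs(sample.decode("cp1252")))
# b'id,comment\r\n1,&#8220;hello&#8221;\r\n'
```

The second form is pure ASCII, so it reads the same whether the file
is labelled ISO-8859-1 or UTF-8, but the references only mean
something to an XML parser, not to a plain CSV reader.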

Thanks,
Xiaocun

