Re: output encoding="iso-8859-1"

Subject: Re: output encoding="iso-8859-1"
From: Mike Brown <mike@xxxxxxxx>
Date: Mon, 4 Jun 2001 20:04:27 -0600 (MDT)
Daniel Florian wrote:
> <?xml version="1.0" encoding="utf-8"?>
> <?xml-stylesheet type="text/xsl" href="Untitled2.xsl"?>
> <start>
> á °
> </start>

Everyone else's answers weren't to my satisfaction, so I'm jumping in on 
this one even though it's a few days old.

Your email was iso-8859-1 encoded. In other words, "á" (Latin small letter
a with acute) is byte 0xE1 and "°" (degree sign) is byte 0xB0. I'm
guessing that your original file is iso-8859-1 encoded, too.

Your XML is misdeclaring its encoding. It is an error to say it is utf-8
encoded when it is actually iso-8859-1. The bytes 0xE1 0x20 0xB0 are not a
valid UTF-8 sequence, so the file shouldn't even be parseable as XML, but
apparently your parser doesn't care.
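
To make the byte-level claim concrete, here is a minimal Python sketch. It
assumes the file contained exactly a-acute, an ordinary space, and the
degree sign, as described above; the variable names are only illustrative.
It encodes those characters as iso-8859-1 and then tries a strict utf-8
decode of the result.

```python
# The three characters from the source file: a-acute, space, degree sign.
# (Assumption: the middle character is an ordinary space, U+0020.)
text = "\u00e1 \u00b0"

# Encode them the way an iso-8859-1 editor would write them to disk.
raw = text.encode("iso-8859-1")
hex_bytes = " ".join(f"{b:02X}" for b in raw)
print(hex_bytes)

# A strict utf-8 decoder rejects these bytes: 0xE1 announces a three-byte
# sequence, but 0x20 is not a valid continuation byte.
try:
    raw.decode("utf-8")
    strict_utf8_ok = True
except UnicodeDecodeError as err:
    strict_utf8_ok = False
    print("not valid utf-8:", err.reason)
```

A decoder that accepts these bytes anyway is doing something more lenient
than the strict check shown here.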

&#6192; is the same as &#x1830;, and the character U+1830 is encoded in
utf-8 as the bytes 0xE1 0xA0 0xB0. I'd say your parser is being very
liberal with its interpretation of the bytes.
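
The arithmetic behind that can be checked by hand. The following Python
sketch (names are illustrative, not from any particular tool) builds the
three utf-8 bytes for code point 6192 from the standard three-byte bit
pattern and compares them with Python's built-in encoder.

```python
code_point = 6192  # decimal form used in &#6192;
hex_form = f"{code_point:04X}"  # hexadecimal form used in &#x1830;

# utf-8 three-byte pattern: 1110xxxx 10xxxxxx 10xxxxxx
b1 = 0xE0 | (code_point >> 12)           # top 4 bits
b2 = 0x80 | ((code_point >> 6) & 0x3F)   # middle 6 bits
b3 = 0x80 | (code_point & 0x3F)          # low 6 bits
manual = bytes([b1, b2, b3])

# Cross-check against the built-in utf-8 encoder.
builtin = chr(code_point).encode("utf-8")

print(hex_form, " ".join(f"{b:02X}" for b in manual))
```

Note that the middle byte comes out as 0xA0, which in iso-8859-1 is the
no-break space, not the ordinary space 0x20.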

> What character reference is the &#6192?  This is supposed to be ISO-8859-1
> isn't it?

The 7 characters "&" "#" "6" "1" "9" "2" ";" are encoded in the output 
as their 7 respective iso-8859-1 bytes, as per your xsl:output 
instruction, yes. What "&#6192;" means, however, in the context of an XML 
or HTML document, is the single character known as MONGOLIAN LETTER SA.
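
The difference between the 7 serialized characters and the 1 character
they stand for can be seen in a short Python sketch. It uses the standard
library's html.unescape as a stand-in for what an XML or HTML parser does
with a numeric character reference; that substitution is my assumption for
illustration, not what your XSLT processor runs.

```python
import html

reference = "&#6192;"

# On the wire: 7 ASCII characters, hence 7 iso-8859-1 bytes.
serialized = reference.encode("iso-8859-1")
print(len(serialized), serialized)

# After parsing: the reference resolves to one abstract character.
decoded = html.unescape(reference)
print(len(decoded), f"U+{ord(decoded):04X}")
```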

>  Then how come I can't seem to find the character code for 6192

Maybe because you weren't looking at The Unicode Standard at unicode.org,
or the Letter Database at http://www.eki.ee/letter/, or at the standard
that is referenced by both the XML and HTML specs: ISO/IEC 10646-1.
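
If a copy of the standard isn't handy, a quick lookup is also possible with
Python's unicodedata module, which ships a copy of the Unicode character
database. This is just one convenient way to do it, sketched with
illustrative variable names.

```python
import unicodedata

# Look up the character behind &#6192; by its decimal code point.
ch = chr(6192)
name = unicodedata.name(ch)
print(f"U+{ord(ch):04X} {name}")

# The two characters that were intended in the source document.
intended = [unicodedata.name(c) for c in "\u00e1\u00b0"]
print(intended)
```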

> And also, what happened to the 2 distinct characters from the
> source xml?

Your 3 characters (including the space in between them) became 3 bytes in
the encoding used by the editor that made the file. When an XML parser read
those bytes back in under the assumption that they were utf-8, and given
that your parser is apparently very forgiving of the illegal byte sequence,
the 3 bytes together were taken to mean 1 abstract character -- that
Mongolian character that you probably won't find in any font. When this
character is copied to the result tree in your XSL transformation, it
retains its identity as a single character. When the result tree is
serialized as iso-8859-1 bytes in the HTML syntax, the character has no
byte of its own in that encoding, so the only way to represent it is as a
character reference: "&#6192;" or "&#x1830;".
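
The serialization step can be imitated in Python. This is only a sketch of
the idea, not what an XSLT processor does internally: Python's
"xmlcharrefreplace" error handler writes a decimal character reference for
any character the target encoding can't hold, which is the same fallback
your serializer used.

```python
result_char = "\u1830"  # the single character in the result tree

# iso-8859-1 has no byte for U+1830, so a plain encode fails.
try:
    result_char.encode("iso-8859-1")
    direct_ok = True
except UnicodeEncodeError:
    direct_ok = False

# Falling back to a numeric character reference always works.
escaped = result_char.encode("iso-8859-1", errors="xmlcharrefreplace")
print(direct_ok, escaped)

# By contrast, the two intended characters fit in one byte each.
intended = "\u00e1\u00b0".encode("iso-8859-1", errors="xmlcharrefreplace")
print(intended)
```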


   - Mike
_____________________________________________________________________________
mike j. brown, software engineer at  |  xml/xslt: http://skew.org/xml/
webb.net in denver, colorado, USA    |  personal: http://hyperreal.org/~mike/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

