Re: [xsl] output encoding="iso-8859-1"

Subject: Re: output encoding="iso-8859-1"
From: Mike Brown <mike@xxxxxxxx>
Date: Mon, 4 Jun 2001 20:04:27 -0600 (MDT)

Daniel Florian wrote:
> <?xml version="1.0" encoding="utf-8"?>
> <?xml-stylesheet type="text/xsl" href="Untitled2.xsl"?>
> <start>
> á °
> </start>

Everyone else's answers weren't to my satisfaction, so I'm jumping in on 
this one even though it's a few days old.

Your email was iso-8859-1 encoded. In other words, "á" (Latin small letter
a with acute) is byte 0xE1 and "°" (degree sign) is byte 0xB0. I'm
guessing that your original file is iso-8859-1 encoded, too.

Your XML is misdeclaring its encoding. It is an error to say it is utf-8
encoded when it is actually iso-8859-1. The bytes 0xE1 0x20 0xB0 work out
to an invalid UTF-8 sequence (0xE1 opens a 3-byte sequence, and 0x20 is
not a legal continuation byte), so it shouldn't even be parseable XML, but
apparently your parser doesn't care.
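
A minimal Python sketch of what a strict decoder makes of those three
bytes. Python and the variable names are choices made for illustration
only; nothing here comes from the XML parser discussed in this thread.

```python
# The text between <start> and </start>, minus the line breaks:
# U+00E1 (a with acute), a space, U+00B0 (degree sign).
text = "\u00e1 \u00b0"

# What an editor saving the file as iso-8859-1 writes to disk.
raw = text.encode("iso-8859-1")
print(" ".join(f"{b:02x}" for b in raw))  # e1 20 b0

# A strict UTF-8 decoder refuses these bytes.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError as err:
    strict_ok = False
    print("not valid UTF-8:", err.reason)

# Decoding with the encoding that was actually used recovers the text.
assert raw.decode("iso-8859-1") == text
```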

&#6192; = &#x1830;, which is the bytes 0xE1 0xA0 0xB0 in utf-8. I'd say
your parser is being very liberal with its interpretation of the bytes:
it seems to have read the 0x20 as if it were the continuation byte 0xA0.
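
One way a decoder could arrive at that result is by masking off the
payload bits of each byte without checking that the trailing bytes really
are continuation bytes. The sketch below is a guess at that behaviour,
written in Python; `lax_decode_3byte` is a hypothetical helper, not the
code of any actual parser.

```python
def lax_decode_3byte(b0: int, b1: int, b2: int) -> int:
    """Combine a 3-byte UTF-8 pattern WITHOUT validating the trailing bytes.

    A conforming decoder must first check that b1 and b2 are in the range
    0x80..0xBF; this sketch deliberately skips that check.
    """
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


well_formed = lax_decode_3byte(0xE1, 0xA0, 0xB0)  # legal UTF-8 for U+1830
malformed = lax_decode_3byte(0xE1, 0x20, 0xB0)    # the bytes in the file

print(hex(well_formed), well_formed)  # 0x1830 6192
print(hex(malformed), malformed)      # 0x1830 6192

# Sanity check against Python's own (strict) UTF-8 encoder.
assert chr(well_formed).encode("utf-8") == b"\xe1\xa0\xb0"
```

The low six bits of 0x20 and 0xA0 are identical, so once the check is
skipped the space is indistinguishable from a continuation byte.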

> What character reference is the &#6192?  This is supposed to be ISO-8859-1
> isn't it?

The 7 characters "&" "#" "6" "1" "9" "2" ";" are encoded in the output 
as their 7 respective iso-8859-1 bytes, as per your xsl:output 
instruction, yes. What "&#6192;" means, however, in the context of an XML 
or HTML document, is the single character known as MONGOLIAN LETTER SA.
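
The difference between the 7 bytes and the 1 character they stand for can
be checked with Python's standard library. This is only an illustration;
`html.unescape` here stands in for whatever a browser or XML parser does
when it resolves a numeric character reference.

```python
import html
import unicodedata

reference = "&#6192;"

# On the wire: 7 ASCII characters, so 7 iso-8859-1 bytes.
as_bytes = reference.encode("iso-8859-1")
print(len(as_bytes))  # 7

# As markup: a single character, identified by its code point.
char = html.unescape(reference)
print(len(char), hex(ord(char)))  # 1 0x1830
print(unicodedata.name(char))     # MONGOLIAN LETTER SA
```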

>  Then how come I can't seem to find the character code for 6192

Maybe because you weren't looking at The Unicode Standard at unicode.org,
or the Letter Database at http://www.eki.ee/letter/, or at the standard
that is referenced by both the XML and HTML specs: ISO/IEC 10646-1.

> And also, what happened to the 2 distinct characters from the
> source xml?

Your 3 characters (including the space in between them) became 3 bytes in
the encoding supported by the editor that made the file. When read back in
by an XML parser under the assumption that utf-8 was the encoding used,
and taking into account the fact that your parser is apparently very
forgiving of the illegal byte sequence, the 3 bytes together imply 1
abstract character -- that Mongolian character that you probably won't
find in any font. When this character is copied to the result tree in your
XSL transformation, it retains its identity as a single character. When
the result tree is serialized as iso-8859-1 bytes in HTML syntax, this
character has no byte of its own in that encoding, so it is impossible to
represent it as anything other than "&#6192;" or "&#x1830;".

   - Mike
_____________________________________________________________________________
mike j. brown, software engineer at  |  xml/xslt: http://skew.org/xml/
webb.net in denver, colorado, USA    |  personal: http://hyperreal.org/~mike/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
