
Re: Trouble with special characters

Subject: Re: Trouble with special characters
From: "Peter West lists@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Mon, 25 Jan 2016 21:57:25 -0000
Replace "ASCII" in the following with "ISO-8859-1"?

Peter West
...as they were delivered to us by those who from the beginning were
eyewitnesses...

> On 26 Jan 2016, at 5:36 am, Eliot Kimber ekimber@xxxxxxxxxxxx
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> For a situation like this you have to look closely at the chain of custody
> of the data as it comes in and out of different tools--any component that
> touches it has the opportunity to mess things up.
>
> As others have pointed out, if the data coming in is correct then the data
> going out as produced directly by Saxon should be correct as well. That
> is, the mapping from Unicode characters to ISO-8859 should be handled
> correctly by the serializer Saxon is using.
>
> The "gibberish" you're showing is the three bytes of the UTF-8 encoded
> "REPLACEMENT CHARACTER" interpreted as individual Unicode characters. The
> UTF-8 encoding of this character, Unicode code point FFFD, is 0xEF 0xBF
> 0xBD. Character 0xEF (239) is i-umlaut in ISO-8859-1, 0xBF (191) is inverted
> question mark, and 0xBD (189) is the 1/2 fraction. Thus your gibberish.
> (http://www.fileformat.info/info/unicode/char/0fffd/index.htm)
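
The byte arithmetic above can be checked in a few lines. This is only a minimal sketch in Python (not something from the original poster's tool chain); the variable names are made up for illustration.

```python
# U+FFFD REPLACEMENT CHARACTER, encoded as UTF-8.
utf8_bytes = "\ufffd".encode("utf-8")
# utf8_bytes is the three bytes 0xEF 0xBF 0xBD.

# The same three bytes misread as ISO-8859-1: one character per byte.
misread = utf8_bytes.decode("iso-8859-1")

# Code points of the three resulting characters:
# i with diaeresis, inverted question mark, one-half fraction.
code_points = [ord(ch) for ch in misread]
print(utf8_bytes.hex(), code_points)
```

The decimal code points match the 239, 191 and 189 quoted above.
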
>
> So the following is happening somewhere in your tool chain:
>
> 1. Something is not recognizing the character you think should be a degree
> symbol as a known Unicode character and is replacing it with the Unicode
> replacement character.
>
> 2. Something is then reading the bytes resulting from (1) as ASCII rather
> than UTF-8 and treating each byte of the replacement character sequence as
> individual ASCII characters.
>
> 3. The remaining stages don't know any better and continue to treat the
> characters as characters, resulting in the three characters i-umlaut,
> inverted question mark, 1/2 fraction in the output.
>
> I think the most likely thing is that something is reading the incoming
> ASCII as Unicode, not recognizing the ASCII byte "0xB0" (degree symbol) as
> a Unicode character (because it's not one in any Unicode-defined
> encoding), and replacing it with the Unicode replacement character.
>
> Something then reads this byte sequence as ASCII, not UTF-8, but then
> generates UTF-8 output (otherwise the byte sequence would be the same on
> input and output), resulting in the gibberish.
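
The suspected chain can be reproduced end to end. The following is a rough sketch in Python under stated assumptions: the sample input `25°C` is hypothetical, and ISO-8859-1 stands in for what the quoted text calls "ASCII". It does not identify which tool in the real pipeline performs each step.

```python
# Hypothetical input: "25°C" with the degree sign stored as the single
# ISO-8859-1 byte 0xB0.
raw = b"25\xb0C"

# Step 1: a component wrongly decodes the bytes as UTF-8. A lone 0xB0 is
# not valid UTF-8, so a lenient decoder substitutes U+FFFD.
step1 = raw.decode("utf-8", errors="replace")

# Step 2: the text is written back out as UTF-8; U+FFFD becomes 3 bytes.
step2 = step1.encode("utf-8")

# Step 3: a later component reads those bytes as ISO-8859-1, turning
# each byte into its own character.
step3 = step2.decode("iso-8859-1")

# What a correct first step would have produced instead.
correct = raw.decode("iso-8859-1")
```

After step 3 the degree sign has become the three-character sequence described earlier, while `correct` still holds the intended text.
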
>
> Some tools write XML in one encoding but put in a different encoding
> declaration, e.g., a file is written as ISO-8859 but with a UTF-8 encoding
> declaration. This would lead to the behavior we're seeing here, where the
> degree symbol should be encoded as two UTF-8 bytes but is output as a
> single ASCII byte.
>
> Using Java it's easy to forget to specify the encoding when writing a byte
> sequence using a Writer or when constructing a String instance. This will
> result in the bytes being written in the default encoding for the system
> running the application, which is almost always *not* a Unicode encoding.
> Other languages have similar pitfalls.
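
As one illustration of the same pitfall outside Java: in Python, `open()` in text mode without an `encoding` argument has traditionally used a locale-dependent default. The sketch below sidesteps that by naming the encoding explicitly, and shows how the bytes on disk differ between the two encodings. The file name and sample text are hypothetical.

```python
import os
import tempfile

text = "25\u00b0C"  # hypothetical sample containing a degree sign
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Explicit UTF-8: the degree sign is written as two bytes, 0xC2 0xB0.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "rb") as f:
    utf8_on_disk = f.read()

# Explicit ISO-8859-1: the degree sign is written as the single byte 0xB0.
with open(path, "w", encoding="iso-8859-1") as f:
    f.write(text)
with open(path, "rb") as f:
    latin1_on_disk = f.read()

print(utf8_on_disk.hex(), latin1_on_disk.hex())
```

Leaving out `encoding=` would make the result depend on the machine running the script, which is the situation described above.
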
>
> I find the free Windows tool Unipad to be invaluable when trying to track
> down this type of encoding problem--it does a good job of guessing the
> real encoding and also has tools for converting between many encodings,
> inspecting files in uncommon encodings, and so on. However, oXygenXML has
> a lot of good tools for this now, so I depend on Unipad less than I used
> to 10 years ago. (http://www.unipad.org/main/)
