Re: using xsl:message with UTF-8 characters

Subject: Re: using xsl:message with UTF-8 characters
From: "Andrew Welch" <andrew.j.welch@xxxxxxxxx>
Date: Mon, 23 Apr 2007 14:21:04 +0100
On 4/23/07, Abel Braaksma <abel.online@xxxxxxxxx> wrote:
When the
Regional settings are set to US or some Western European country, the
codepage will default to CP1252 (windows-1252) (which is, like I said,
incompatible with the codepage for the console, giving the weird
characters in the U+0127+ range).

In the 8-bit character range there are two blocks, C0 (0x00-0x1F) and C1
(0x80-0x9F), which contain "control characters": non-printable characters
that were used to control printing equipment, for example "move print
head here" (sorry for the lack of depth here :)

Apparently Microsoft decided to wedge more characters into the 8-bit
range by replacing the control characters in the C1 range (0x80-0x9F)
with more useful characters, which seems fair enough, but ISO-8859-1
and Unicode (afaik) keep control characters in that range, so CP1252
disagrees with both of them there.
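
To see the remapping concretely, here is a minimal sketch in Python 3 (my choice of language, not something from the thread). The helper name `describe_c1_range` is made up for illustration. It decodes each byte from 0x80 to 0x9F both ways; note that Python's `cp1252` codec leaves a few of these bytes undefined, which the sketch reports as `None`.

```python
import unicodedata


def describe_c1_range():
    """Return (byte, Unicode category under ISO-8859-1, CP1252 character or None)
    for every byte in 0x80..0x9F."""
    rows = []
    for b in range(0x80, 0xA0):
        raw = bytes([b])
        # Python's iso-8859-1 codec maps byte N straight to code point N,
        # so these all come out as C1 control characters (category "Cc").
        latin1_char = raw.decode("iso-8859-1")
        try:
            cp1252_char = raw.decode("cp1252")
        except UnicodeDecodeError:
            cp1252_char = None  # byte not assigned in Python's cp1252 table
        rows.append((b, unicodedata.category(latin1_char), cp1252_char))
    return rows


if __name__ == "__main__":
    for byte, category, char in describe_c1_range():
        print(f"0x{byte:02X}  iso-8859-1: {category}  cp1252: {char!r}")
```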

The problem arises when you save a file without being explicit about
the encoding, and then read it back in a different encoding.  This
happens a lot (in Windows) when you save an XML file with a
non-XML-aware editor (say Notepad), and then open it in an XML-aware
editor.  The file will be saved in CP1252, with characters like "en
dash" and "em dash" saved as the single bytes #150 and #151 instead of
as the code points #8211 and #8212 respectively.  So when you open the
file in an XML-aware editor, it reads the XML prolog and decodes the
file as, say, UTF-8 or ISO-8859-1, and you get invalid or non-printable
characters instead of the dashes... which can be displayed as either a
box or a question mark depending on (...I'm not sure what that depends
on actually).
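
The mismatch can be reproduced without any editor. This is a rough sketch in Python 3; the function names and the sample string are invented for illustration, and `errors="replace"` stands in for whatever a lenient viewer does with bad bytes.

```python
SAMPLE = "pages 10\u201320 \u2014 draft"  # contains an en dash and an em dash


def save_as_cp1252(text):
    """Simulate an editor that silently writes CP1252 bytes."""
    return text.encode("cp1252")


def reopen(data, encoding):
    """Simulate a second tool decoding the same bytes with another encoding.
    Undecodable bytes become U+FFFD instead of raising an error."""
    return data.decode(encoding, errors="replace")


if __name__ == "__main__":
    data = save_as_cp1252(SAMPLE)
    print(data[8], data[12])                   # the two dash bytes as decimals
    print(repr(reopen(data, "iso-8859-1")))    # dashes become C1 control chars
    print(repr(reopen(data, "utf-8")))         # dashes become U+FFFD
    print(reopen(data, "cp1252") == SAMPLE)    # same encoding round-trips
```

A strict decoder (Python's default `errors="strict"`) would raise `UnicodeDecodeError` on the UTF-8 case rather than substituting a replacement character.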

To compound the issue, if your XML is specified as ISO-8859-1 in the
prolog, some MS tools will see bytes in the C1 control range and
auto-switch the encoding to CP1252, giving the impression everything is
fine.
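
I don't know how any particular tool makes that decision, so the following is only a guess at the kind of heuristic involved, written in Python 3. Both function names are hypothetical.

```python
def looks_like_cp1252(data):
    """Hypothetical heuristic: a byte in 0x80..0x9F is a control character
    in ISO-8859-1, so its presence suggests the text is really CP1252."""
    return any(0x80 <= b <= 0x9F for b in data)


def decode_declared_latin1(data):
    """Decode bytes that were declared as ISO-8859-1, silently switching to
    CP1252 when the heuristic fires (an assumption, not a documented rule)."""
    encoding = "cp1252" if looks_like_cp1252(data) else "iso-8859-1"
    return data.decode(encoding, errors="replace")


if __name__ == "__main__":
    print(repr(decode_declared_latin1(b"a\x96b")))    # treated as CP1252
    print(repr(decode_declared_latin1(b"caf\xe9")))   # stays ISO-8859-1
```

The result looks right on screen, but the file still does not match its own declaration, so a stricter parser will read it differently.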

The simple rule is: always read and write using the same encoding, and
be aware when something is converting between characters and bytes
behind the scenes - servlets for example.  Make sure the font you're
viewing the result in contains the glyphs for the characters you're
trying to view (helpfully the no-glyph character is often the same box
or question mark used to mean no-mapping in the encoding... requiring a
hex editor to check the underlying bytes), and be certain the viewer
is showing the result in the right encoding (the cmd window here, or
say the Eclipse output window, is another notorious spot).
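
As a small illustration of that rule, here is a sketch in Python 3 that names the encoding on both the write and the read, and dumps the bytes in hex so the check does not depend on a font or a console. The helper `round_trip` and the temporary file are assumptions made for the example.

```python
import os
import tempfile


def round_trip(text, encoding="utf-8"):
    """Write text with an explicit encoding, then return a hex dump of the
    bytes on disk and the text read back using the same encoding."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    os.close(fd)
    try:
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write(text)
        with open(path, "rb") as f:
            raw = f.read()
        with open(path, "r", encoding=encoding, newline="") as f:
            back = f.read()
    finally:
        os.remove(path)
    hex_dump = " ".join(f"{b:02x}" for b in raw)
    return hex_dump, back


if __name__ == "__main__":
    dump, back = round_trip("a\u2013b", "utf-8")
    print(dump)                  # the en dash as three UTF-8 bytes
    print(back == "a\u2013b")
    dump, back = round_trip("a\u2013b", "cp1252")
    print(dump)                  # the en dash as the single byte 0x96
```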


cheers andrew
