
Re: xml invalid characters

Subject: Re: xml invalid characters
From: Mike Brown <mike@xxxxxxxx>
Date: Fri, 22 Mar 2002 16:08:11 -0700 (MST)
stevenson wrote:
> How can I avoid these problem. The data is from the database, and the
> character crashing it is £

You probably have an encoding problem. I assume that you're having trouble
with the British currency symbol for a Pound? At least, that's what it looks
like on my screen.

Quick lesson:

The POUND SIGN is character number A3 (hex) in Unicode. "U+00A3" is how you
can write it unambiguously in prose.

Encoding provides a way of representing that A3 as bytes.

iso-8859-1:  A3
     utf-8:  C2 A3
    utf-16:  00 A3 (big endian)
             A3 00 (little endian)
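
To check those byte sequences yourself, here is a minimal sketch in Python (my choice for illustration; nothing in the original question uses it). The helper name hex_bytes is made up for this example.

```python
# Encode U+00A3 POUND SIGN in several encodings and show the resulting bytes.
POUND = "\u00a3"

def hex_bytes(text, encoding):
    """Return the encoded bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))

# "utf-16-be" / "utf-16-le" are Python's codec names for big/little endian
# UTF-16 without a byte order mark.
for name in ("iso-8859-1", "utf-8", "utf-16-be", "utf-16-le"):
    print(f"{name:>10}:  {hex_bytes(POUND, name)}")
```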

utf-8 and utf-16 can represent any Unicode character, but other encodings are 
more limited; the single-byte ones, such as iso-8859-1, can represent at most
256 characters.

If a character cannot be represented in a particular encoding, you write it as
a sequence of characters that can be represented in any encoding (spaces added
for clarity):

   & # x A 3 ;    or    & # 1 6 3 ;
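
As a rough illustration of the two forms, this Python sketch builds both references from a character's code point. The function name char_refs is hypothetical, and html.unescape is used only as a convenient decoder for numeric references, not as an XML parser.

```python
import html

def char_refs(ch):
    """Return the hexadecimal and decimal character references for one character."""
    code_point = ord(ch)
    return f"&#x{code_point:X};", f"&#{code_point};"

hex_ref, dec_ref = char_refs("\u00a3")
print(hex_ref, dec_ref)

# Both forms refer to the same character, U+00A3.
print(html.unescape(hex_ref) == html.unescape(dec_ref) == "\u00a3")
```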

For example, us-ascii does not have POUND SIGN (this may be the source of your 
problem; it's hard to say, without knowing all the stages of processing of 
your data, and the role Cold Fusion plays in it). So you'd have to use this 
escaped format.

             &  #  x  A  3  ;
  us-ascii:  26 23 78 41 33 3B

And this escaped format (a "character reference") also works just as well in 
other encodings:

iso-8859-1:  26 23 78 41 33 3B
     utf-8:  26 23 78 41 33 3B
    utf-16:  00 26 00 23 00 78 00 41 00 33 00 3B (big endian)
    utf-16:  26 00 23 00 78 00 41 00 33 00 3B 00 (little endian)
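
A small Python sketch of the same point, under the assumption that you control the code that writes the bytes: the reference encodes to identical bytes in the ASCII-compatible encodings, and Python's built-in "xmlcharrefreplace" error handler can produce such references for you when a character does not fit the target encoding. The sample text "Price: £5" is invented for the example.

```python
REF = "&#xA3;"

# The reference uses only ASCII characters, so these encodings agree on its bytes.
ascii_bytes = REF.encode("us-ascii")
print(" ".join(f"{b:02X}" for b in ascii_bytes))
for name in ("iso-8859-1", "utf-8"):
    print(name, REF.encode(name) == ascii_bytes)

# Characters that us-ascii cannot represent are replaced by decimal
# character references instead of raising UnicodeEncodeError.
escaped = "Price: \u00a35".encode("us-ascii", errors="xmlcharrefreplace")
print(escaped)
```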

Now check your XML document. When you look at the document in a text editor, 
it might say 

<?xml version="1.0" encoding="utf-8"?>
                    ^^^^^^^^^^^^^^^^

This encoding declaration is an assertion made by the document as to how its
bytes map to Unicode characters. It is just a hint for the XML parser to use
when reading the document; it is not secret code that causes anything about
the document's *actual* encoding to change. 

If this declaration is missing, the parser assumes UTF-8 or UTF-16
(UTF-8 unless the document begins with the bytes FF FE or FE FF).
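
That default rule can be sketched in a few lines of Python. This is deliberately simplified: it only looks for the two UTF-16 byte order marks mentioned above, whereas the autodetection described in the XML specification covers more cases. The function name guess_default_encoding is hypothetical.

```python
import codecs

def guess_default_encoding(data):
    """Guess the encoding of an XML document that has no encoding declaration.

    Simplified sketch: checks only for a UTF-16 byte order mark.
    """
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    return "utf-8"

print(guess_default_encoding(b"<doc/>"))
print(guess_default_encoding(b"\xff\xfe<\x00d\x00"))
```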

It is your responsibility to ensure that the encoding declaration is an
accurate reflection of the document's *actual* encoding.
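
To see what an inaccurate declaration does, here is a sketch using Python's standard xml.etree.ElementTree parser (an assumption on my part; your own toolchain will report the error differently). The <price> element and the try_parse helper are invented for the example.

```python
import xml.etree.ElementTree as ET

# The byte A3 is POUND SIGN in iso-8859-1, but is not valid UTF-8 by itself.
BODY = b"<price>\xa3</price>"
honest = b'<?xml version="1.0" encoding="iso-8859-1"?>' + BODY
wrong = b'<?xml version="1.0" encoding="utf-8"?>' + BODY

def try_parse(document):
    """Return the root element's text, or None if the parser rejects the bytes."""
    try:
        return ET.fromstring(document).text
    except ET.ParseError:
        return None

print(try_parse(honest) == "\u00a3")  # declaration matches the actual bytes
print(try_parse(wrong))               # declaration does not match the bytes
```

The bytes of the document body are the same in both cases; only the declaration differs.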

As you can guess, this is where most people run into problems. They are
passing "text" around in their software without paying attention to whether
and how it has been encoded. So, in order to diagnose encoding-related
problems, you must trace the processes that your data passes through, and
determine how it is encoded/decoded at each step.
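
As one example of what such a trace might turn up, this Python sketch shows a POUND SIGN written as UTF-8 by one step and read as iso-8859-1 by the next. I am only guessing that something like this happened to your data; the variable names are illustrative.

```python
original = "\u00a3"

# Step 1: some component writes the character as UTF-8 (bytes C2 A3).
stored = original.encode("utf-8")

# Step 2: another component wrongly reads those bytes as iso-8859-1,
# turning one character into two (U+00C2 followed by U+00A3).
misread = stored.decode("iso-8859-1")
print(len(misread), [f"U+{ord(c):04X}" for c in misread])

# Reversing the wrong step recovers the original character.
repaired = misread.encode("iso-8859-1").decode("utf-8")
print(repaired == original)
```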

Also, you didn't say what your problem has to do with XSLT. This is the 
xsl-list. If you have general xml processing questions, ask them on xml-dev.

If you're using XSLT, then you usually only need to make sure that

 - the source and stylesheet XML documents have accurate encoding 
   declarations

 - the output encoding, as controlled by <xsl:output encoding="..."/>,
   is what you want (there is a FAQ regarding invoking MSXML
   from scripts, where the output becomes UTF-16, depending on how
   you capture it)
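
The effect of choosing an output encoding can be sketched without an XSLT processor. The following uses Python's xml.etree.ElementTree serializer as a stand-in, which is an assumption for illustration only and not how <xsl:output> is implemented; the <price> element and the serialize helper are invented for the example.

```python
import xml.etree.ElementTree as ET

price = ET.Element("price")
price.text = "\u00a35"

def serialize(element, encoding):
    """Serialize the element to bytes in the requested output encoding."""
    return ET.tostring(element, encoding=encoding)

# The same character comes out as different bytes, or as a character
# reference when the output encoding cannot represent it directly.
for encoding in ("us-ascii", "iso-8859-1", "utf-8"):
    print(encoding, serialize(price, encoding))
```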

Good luck.

   - Mike
____________________________________________________________________________
  mike j. brown                   |  xml/xslt: http://skew.org/xml/
  denver/boulder, colorado, usa   |  resume: http://skew.org/~mike/resume/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

