
Subject: RE: invalid character (Unicode: 0xa0) in xsl document - LONG
From: "Joshua Allen" <joshuaa@xxxxxxxxxxxxx>
Date: Sat, 28 Apr 2001 22:41:50 -0700
This is correct -- 0xA0 cannot appear as the first byte of a UTF-8
sequence [1].  This character could easily appear as the second byte of
a two-byte sequence, and I could also see the error appearing IF you
receive a UTF-8 file that does not have a BOM, and is in a different
byte-order than your system expects (for example, little-endian, and
your system uses big-endian for two-byte sequences).  In this case, the
parser would (perhaps) assume the preferred byte order, and since 90% of
the file is single-byte characters anyway, it would not die until it
reaches a sequence that has two bytes or more (perhaps A0E0 in
little-endian, your processor would be expecting big-endian, so would
expect to see that character as E0A0, and would see instead a character
starting with A0 and would throw the error you are seeing).  So this
error could very well occur when exchanging valid UTF-8 with no BOM
between systems with differing byte-orders.  Lesson is, always use a BOM
:-)

Also note that just using an encoding stream that does UTF-8 as
suggested below will not solve all of your problems.  There are
characters which are not valid XML [2], but which are perfectly valid
UTF-8.  I am not aware of any streamwriters that automatically strip
these out for you.
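
For what it's worth, a filter of that kind is not hard to write by hand. The following is only a sketch, assuming the Char production of XML 1.0 as given in [2]; the class and method names are made up for illustration and do not come from any library.

```java
final class XmlCharFilter {
    /** True if the code point matches the Char production of XML 1.0. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input with every non-XML character dropped. */
    static String stripInvalidXmlChars(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // codePointAt joins surrogate pairs; a lone surrogate comes back
            // as itself, falls outside the allowed ranges, and is dropped.
            int cp = in.codePointAt(i);
            if (isXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // NUL and vertical tab are not XML characters; U+00A0 is.
        String dirty = "a\u0000b\u000Bc\u00A0d";
        System.out.println(stripInvalidXmlChars(dirty).length()); // prints 5
    }
}
```

Note that the non-breaking space survives the filter: U+00A0 is a perfectly legal XML character, it just has to be encoded correctly.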


[1] http://www.unicode.org/unicode/uni2errata/UTF-8_Corrigendum.html
(see table 3.1b)
[2] http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char

Regards,
Joshua


> -----Original Message-----
> From: Eric Jacobson [mailto:ericjacobson@xxxxxxxxxxxx]
> Sent: Saturday, April 28, 2001 8:47 PM
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: Re:  invalid character (Unicode: 0xa0) in xsl document -
> LONG
> 
> The essay below may or may not pertain to your actual problem.
> However, it may very likely be that your XML is declaring itself to
> be encoded as UTF-8 without that actually being the case.
> 
> jackson wrote:
> >
> > Alan
> >
> > > I'm processing an xsl file with the apache xalan 2 processor, and
> > > am getting the following error message when i run my application:
> > >
> > > javax.xml.transform.TransformerConfigurationException: An invalid
> > > XML character (Unicode: 0xa0) was found in the element content of
> > > the document.
> >
> > Well, your document says it's UTF-8. I'm not an expert on Unicode
> > and related issues, but i think 0xa0, while it is Unicode, is not a
> > possible UTF-8 character.
> >
> > The character 0xa0 is a non-breaking space. I don't know how
> > it might have got in your document (possibly from some HTML?),
> > but you could find it and get rid of it. Since it's white space,
> > it's not going to be obvious.
> >
> > You could write a script to look for this character and change
> > it - say, to a normal space. You could also do it in your java
> > program i suppose, before parsing.
> >
> > I suppose you could also turn 0xa0 into the UTF-8 equivalent
> > (i can't help you there). Java classes might be able to do it for
> > you - from what i remember (quite a while ago), there is a class
> > for writing to a UTF file?
> >
> > David Jackson
> >
> 
> A brief note before the long-winded part: I suspect you are referring
> to the DataInputStream and DataOutputStream classes, which have
> methods to readUTF() and writeUTF(). These methods read and write a
> modified form of UTF-8 that will not be meaningful to a
> standards-compliant processor.  Specifying an encoding name to the
> constructor of an InputStreamReader or OutputStreamWriter will work,
> as will passing an encoding name to the String method getBytes().
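
To make the difference concrete, here is a small sketch comparing the two routes on the same string. The class and helper names are invented for the example; the behaviour shown (a two-byte length prefix, and U+0000 written as C0 80) is what the DataOutput documentation describes for writeUTF().

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class ModifiedUtf8Demo {
    /** Bytes produced by DataOutputStream.writeUTF (modified UTF-8). */
    static byte[] viaWriteUTF(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Bytes produced by an OutputStreamWriter set to standard UTF-8. */
    static byte[] viaWriter(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(buf, StandardCharsets.UTF_8)) {
            out.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Formats bytes as space-separated upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "\u00A0\u0000"; // non-breaking space followed by NUL
        System.out.println(hex(viaWriteUTF(s))); // prints 00 04 C2 A0 C0 80
        System.out.println(hex(viaWriter(s)));   // prints C2 A0 00
    }
}
```

The leading 00 04 is the length prefix that writeUTF() adds, which by itself is enough to confuse an XML parser reading the stream.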
> 
> Your other option is to figure out what encoding your system uses
> by default and declare that in the encoding attribute in your XML
> prolog. However, the only two encodings required for all XML
> processors by the standard are UTF-8 and UTF-16.
> 
> Now for the long part:
> 
> UTF-8 is a method for representing Unicode characters (code points,
> commonly handled as 16-bit values) on a stream of 8-bit units. Given
> that a large volume of data is still primarily composed of the
> traditional ASCII characters, which require only 7 bits to represent,
> using 16 bits per character would be quite inefficient. UTF-8
> represents characters in the ASCII range as a single octet whose high
> bit is 0. For character codes that are larger, more than one octet is
> used. The leading bits of the first octet indicate (1) that more than
> one octet should be read and (2) how many. The following octets begin
> with a pattern that indicates that they are not the start of a
> character. The remaining bits in each octet are then used to hold the
> actual value being stored.
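
The bit layout described above can be sketched by hand. This is an illustration only: the class name is made up, it does not reject surrogate code points, and in real code you would let the JDK do the encoding.

```java
final class Utf8ByHand {
    /** Encodes one code point (0..0x10FFFF) as UTF-8; no surrogate check. */
    static byte[] encode(int cp) {
        if (cp < 0x80) {
            // 0xxxxxxx: plain ASCII, one octet
            return new byte[] { (byte) cp };
        } else if (cp < 0x800) {
            // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else if (cp < 0x10000) {
            // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        } else {
            // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F)) };
        }
    }

    public static void main(String[] args) {
        // U+00A0 (non-breaking space) needs two octets.
        StringBuilder sb = new StringBuilder();
        for (byte b : encode(0xA0)) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(sb.toString().trim()); // prints C2 A0
    }
}
```

So the non-breaking space is C2 A0 in UTF-8: the byte A0 does appear, but only as a continuation octet, never on its own.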
> 
> The overall effect is that if your data is all ASCII, the UTF-8
> encoding comes out just like a traditional ASCII file - one
> character for every 8 bits. You can create and read such files
> with traditional software that has never heard of UTF-8. If the data
> uses characters whose codes are 128 or greater, UTF-8 translates
> those into multiple octets, and a system that is not making the
> appropriate interpretation will misread them or report an error.
> 
> XML requires all XML processors to support UTF-8, and the prolog
> <?xml version="1.0" encoding="UTF-8" ?> has been added to a great
> number of XML files as a hard-coded string, based in part on copying
> examples. The data in those files is then generated by a system that
> may not be aware of what UTF-8 really means and that uses some other
> actual encoding scheme (Cp1252 aka winAnsi aka Windows-Latin-1, for
> example). The end result is that the XML processor expects UTF-8
> encoding, finds a bit pattern that is not valid in UTF-8, and screams.
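
One way to check for this situation before handing a file to the XML processor is to run the bytes through a strict UTF-8 decoder. The sketch below assumes the whole document is already in a byte array; the class name is invented, and ISO-8859-1 stands in for Cp1252 in the demo because both encode the non-breaking space as the single byte A0.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Check {
    /** True if the bytes decode as UTF-8 without any malformed sequence. */
    static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently substituting
        // U+FFFD for bad input.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "a\u00A0b";
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1); // 61 A0 62
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);        // 61 C2 A0 62
        System.out.println(isValidUtf8(latin1)); // prints false
        System.out.println(isValidUtf8(utf8));   // prints true
    }
}
```

A false result only says the bytes are not UTF-8; it does not say which encoding they really are.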
> 
> In Java, a character is an unsigned 16 bit value containing a
> Unicode character code. When reading or writing characters from
> 8-bit byte oriented streams or buffers, many Java classes give the
> option of specifying the name of an encoding to use and apply a
> system default otherwise. The String method getBytes("UTF8")
> would return a buffer of bytes representing the String's characters
> using the UTF-8 encoding. Alternatively, you could wrap an
> OutputStreamWriter around your actual OutputStream with the
> encoding set in the constructor.
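
Putting the reader and writer together gives a simple transcoder: read the bytes with the encoding they were actually written in, then write them back out as UTF-8. This is a minimal sketch under the assumption that you know the real source encoding; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class Transcode {
    /** Decodes input using its actual charset and re-encodes it as UTF-8. */
    static byte[] toUtf8(byte[] input, Charset actual) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Reader in = new InputStreamReader(
                 new ByteArrayInputStream(input), actual);
             Writer out = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer was closed (and so flushed) by try-with-resources.
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] latin1 = { 0x61, (byte) 0xA0 }; // "a" plus a Latin-1 NBSP
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        System.out.println(utf8.length); // prints 3
    }
}
```

For a real file you would substitute FileInputStream and FileOutputStream for the in-memory streams; the encoding handling stays the same.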
> 
> Hope this helps.
> 
> Eric Jacobson
> 
>  XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

