
Re: Parse Error - Invalid Character


From: "Thomas B. Passin" <tpassin@c...>

> [Karl Stubsjoen]
>>Here is an outline of my current problem then:
> >
> > 1.  original data submitted - unicode "TM" submitted as part of data
> > 2.  server side XML generated and encoded as ISO-8859-1
> > 3.  ixmlhttprequest made for XML data - which is *blindly* downloaded and
> > encoded as UTF-8
> > 4.  MSXML3 chokes when attempting to load xml, error is "Invalid
> > characters..."
 
> I've really been surprised at all the places that Microsoft is either
> non-conforming or simply does things in a way that can be unworkable in
> certain situations.  I've seen it in .NET web services, and SQL server
> querying an xml file for query parameters, and now this.

Actually, I would not be too hard on Microsoft here. (I am happy to supply
other reasons :-)

Throughout the computing world, transcoders (the software that converts text
between encodings) typically do not provide proper facilities to cope with
characters that are missing from the output encoding.  If you are lucky, the
transcoder will fail and tell you there is something wrong.  But typically
transcoders will just strip the missing character or substitute '?' for it.
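
As a small illustration of that silent loss (a sketch only, using Python's
built-in codecs rather than any particular vendor's transcoder; the sample
string is made up), exporting a string containing the trade mark sign to
ISO-8859-1 with the lenient error modes either substitutes '?' or drops
the character, with no warning:

```python
# U+2122 TRADE MARK SIGN has no code point in ISO-8859-1.
text = "Acme\u2122 widgets"  # hypothetical sample data

# Lenient export: the missing character becomes '?'.
replaced = text.encode("iso-8859-1", errors="replace")

# Even more lenient: the missing character is silently stripped.
ignored = text.encode("iso-8859-1", errors="ignore")

print(replaced)  # b'Acme? widgets'
print(ignored)   # b'Acme widgets'
```

Neither result can be converted back to the original text, which is why
export is the risky direction.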

It is not just Microsoft but the state of play in our computing infrastructure.
When you are working with data in different encodings and Unicode
infrastructure, importing from different encodings is safe but exporting
is not safe. At least, you need to take especial care.

How could API vendors help in this?  For a start, they should offer
a mode for all text export so that an encoding error can cause the
export to fail.   Even better would be to offer "smart transcoders"
which would allow characters not in the output character encoding
to be replaced by numeric character references (e.g. \uHHHH or
&#xHHHH; ) of various kinds.
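
Python's codec machinery happens to offer both behaviours, so it can serve
as a rough sketch of what such an API might look like.  The function names
and the "xmlhexref" handler name below are hypothetical, invented for this
example; the error modes "strict", "xmlcharrefreplace" and
"backslashreplace" and the codecs.register_error hook are standard Python.
Note that the built-in "xmlcharrefreplace" emits decimal references, so a
small custom handler is registered to get the &#xHHHH; form.

```python
import codecs


def export_strict(text, encoding):
    """Fail loudly (UnicodeEncodeError) if any character cannot be encoded."""
    return text.encode(encoding, errors="strict")


def export_with_decimal_refs(text, encoding):
    """Replace unencodable characters with decimal references such as &#8482;"""
    return text.encode(encoding, errors="xmlcharrefreplace")


def export_with_backslash_escapes(text, encoding):
    """Replace unencodable characters with escapes such as \\u2122."""
    return text.encode(encoding, errors="backslashreplace")


def _hex_char_refs(exc):
    # Custom error handler: emit &#xHHHH; for each unencodable character.
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join("&#x%X;" % ord(ch) for ch in bad), exc.end


# "xmlhexref" is a made-up handler name for this sketch.
codecs.register_error("xmlhexref", _hex_char_refs)


def export_with_hex_refs(text, encoding):
    """Replace unencodable characters with hexadecimal references."""
    return text.encode(encoding, errors="xmlhexref")


sample = "A\u2122"
print(export_with_decimal_refs(sample, "iso-8859-1"))       # b'A&#8482;'
print(export_with_backslash_escapes(sample, "iso-8859-1"))  # b'A\\u2122'
print(export_with_hex_refs(sample, "iso-8859-1"))           # b'A&#x2122;'
```

The character-reference forms are only lossless if the receiving end knows
to interpret them, for example as XML character data.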

A couple of years ago I created a couple of lossless transcoders:
see http://www.ascc.net/xml/en/utf-8/i18n-index.html.
AT&T's licensing of tcs put the kibosh on the tcs-based version.

Actually, I believe that the general way we think about character
encodings is faulty: we need to think in terms of coping with 
variants. The GLUE project (GLUE Loses User Encodings!)
at http://www.ascc.net/xml/en/utf-8/glue.html  was an attempt
to move in a different direction, but we dropped it in favour of
Mark Davis' ICU effort which looked promising. 

The other culprit is C and byte-based DBMS.   The generation
of programmers who grew up expecting a character to be
8 bits (or expecting that all strings will be in their local encoding)
-- which is my generation -- have made an infrastructure that breaks 
easily.  The more recent APIs from Java, .NET, Apple, etc. are
much better in this, but we still have a lot of older code floating
about, and code written by private individuals and contributed
to open source is often really bad in this regard.
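
To make the "one character is one byte" assumption concrete, here is a
short sketch (the sample string is invented) showing that character count
and byte count diverge in UTF-8, and that truncating at a byte offset, as
a fixed-width byte column might, can leave bytes that no longer decode:

```python
text = "na\u00efve \u2122"      # 7 characters: n a ï v e space ™
utf8 = text.encode("utf-8")

char_count = len(text)          # counts characters
byte_count = len(utf8)          # counts bytes: ï takes 2, ™ takes 3

# Cutting at a byte boundary splits the two-byte sequence for ï.
truncated = utf8[:3]
try:
    truncated.decode("utf-8")
    truncation_ok = True
except UnicodeDecodeError:
    truncation_ok = False

print(char_count, byte_count, truncation_ok)  # 7 10 False
```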

Even HTTP has not been immune to this: when you send a 
request, what encoding is used?  Until recently it was up
in the air.  

That is why XML is so strict and definite about encodings:
you have to know every step of the chain.  Ultimately many
programmers will conclude that it is simpler to mandate 
UTF-8 at every part of their processing chain, wherever
possible. 
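
The failure described at the top of this thread can be reproduced in
miniature.  This is only a sketch using Python's xml.etree.ElementTree
(not MSXML3), with a made-up document: the same ISO-8859-1 bytes parse
when labelled correctly and are rejected when labelled as UTF-8, because
the byte 0xE9 does not begin a valid UTF-8 sequence here.

```python
import xml.etree.ElementTree as ET

# Document body encoded as ISO-8859-1: 'é' becomes the single byte 0xE9.
body = "<a>caf\u00e9</a>".encode("iso-8859-1")

correct = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + body
mislabelled = b'<?xml version="1.0" encoding="UTF-8"?>' + body

# Declared encoding matches the bytes: parses fine.
parsed_text = ET.fromstring(correct).text

# Declared encoding does not match the bytes: the parser refuses.
try:
    ET.fromstring(mislabelled)
    mislabelled_parsed = True
except ET.ParseError:
    mislabelled_parsed = False

print(parsed_text, mislabelled_parsed)
```

The parser's refusal is the strictness at work: it reports the broken link
in the chain instead of passing corrupted text downstream.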

Furthermore, this is why it is important that XML keep
enough characters unused to be able to detect encoding 
errors.  XML 2.0 should ban all non-whitespace control
characters. See
 http://www.topologi.com/public/XML_Naming_Rules.html
for more on that.
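
As a rough sketch of how unused characters help detection (the function
name and the ban_c1 flag are invented for this example): the first check
follows the Char production of XML 1.0, and the optional stricter check
also flags U+007F to U+009F, which is one possible reading of the proposal
above and not anything a current XML version requires.  A Windows-1252
trade mark byte (0x99) that has been mislabelled as ISO-8859-1 turns into
the control character U+0099, which only the stricter check catches.

```python
def disallowed_xml_chars(text, ban_c1=False):
    """Return (index, code point) pairs for characters that are not allowed.

    By default this applies the XML 1.0 Char production.  With ban_c1=True
    (an assumption made for this sketch) DEL and the C1 controls,
    U+007F..U+009F, are rejected as well.
    """
    def allowed(cp):
        if ban_c1 and 0x7F <= cp <= 0x9F:
            return False
        return (cp in (0x9, 0xA, 0xD)
                or 0x20 <= cp <= 0xD7FF
                or 0xE000 <= cp <= 0xFFFD
                or 0x10000 <= cp <= 0x10FFFF)

    return [(i, ord(ch)) for i, ch in enumerate(text) if not allowed(ord(ch))]


# 0x99 is the trade mark sign in Windows-1252; read as ISO-8859-1 it
# becomes the C1 control character U+0099.
misread = b"Acme\x99".decode("iso-8859-1")

print(disallowed_xml_chars("ok\x00"))              # [(2, 0)]
print(disallowed_xml_chars(misread))               # []
print(disallowed_xml_chars(misread, ban_c1=True))  # [(4, 153)]
```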

Cheers
Rick Jelliffe
