Re: Parse Error - Invalid Character
From: "Thomas B. Passin" <tpassin@c...>

> [Karl Stubsjoen]
> > Here is an outline of my current problem then:
> >
> > 1. original data submitted - unicode "TM" submitted as part of data
> > 2. server side XML generated and encoded as ISO-8859-1
> > 3. ixmlhttprequest made for XML data - which is *blindly* downloaded and
> >    encoded as UTF-8
> > 4. MSXML3 chokes when attempting to load xml, error is "Invalid
> >    characters..."
>
> I've really been surprised at all the places that Microsoft is either
> non-conforming or simply does things in a way that can be unworkable in
> certain situations. I've seen it in .NET web services, in SQL Server
> querying an xml file for query parameters, and now this.

Actually, I would not be too hard on Microsoft here. (I am happy to supply
other reasons :-) Throughout the computing world, transcoders (the software
that converts text between encodings) typically do not provide proper
facilities to cope with characters missing from the output encoding. If you
are lucky, the transcoder will fail and tell you there is something wrong.
But typically transcoders will just strip the missing character or
substitute '?' for it. It is not just Microsoft but the state of play in our
computing infrastructure.

When you are working with data in different encodings and Unicode
infrastructure, importing from different encodings is safe but exporting is
not. At the least, you need to take special care.

How could API vendors help with this? For a start, they should offer a mode
for all text export in which an encoding error causes the export to fail.
Even better would be to offer "smart transcoders" which allow characters not
in the output character encoding to be replaced by numeric character
references (e.g. \uHHHH or &#xHHHH;) of various kinds. A couple of years ago
I created a couple of lossless transcoders: see
http://www.ascc.net/xml/en/utf-8/i18n-index.html. AT&T's licensing of tcs
put the kibosh on the tcs-based version.
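[A minimal Python sketch of the behaviors discussed above, for readers who
want to experiment. The string literal and variable names are illustrative,
not from the original thread. It shows the three transcoder modes (fail
loudly, lossy '?' substitution, "smart" numeric character references), plus
the failure in Karl's pipeline, where bytes in an 8-bit encoding are blindly
decoded as UTF-8:]

```python
# U+2122 TRADE MARK SIGN does not exist in ISO-8859-1.
tm = "Brand\u2122"

# Mode 1: the transcoder fails and tells you something is wrong.
try:
    tm.encode("iso-8859-1")           # strict is the default
except UnicodeEncodeError as e:
    print("strict export failed:", e.reason)

# Mode 2: the typical lossy behavior - substitute '?'.
print(tm.encode("iso-8859-1", errors="replace"))
# b'Brand?'

# Mode 3: a "smart transcoder" - emit a numeric character reference.
print(tm.encode("iso-8859-1", errors="xmlcharrefreplace"))
# b'Brand&#8482;'

# And the original problem: 0x99 is TM in windows-1252, but it is an
# invalid byte sequence when blindly decoded as UTF-8.
try:
    b"Brand\x99".decode("utf-8")
except UnicodeDecodeError as e:
    print("utf-8 decode failed:", e.reason)
```

Note that xmlcharrefreplace is lossless only if the consumer parses the
output as XML/HTML; for plain text it merely trades silent loss for a
visible escape.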
Actually, I believe that the general way we think about character encodings
is faulty: we need to think in terms of coping with variants. The GLUE
project (GLUE Loses User Encodings!) at
http://www.ascc.net/xml/en/utf-8/glue.html was an attempt to move in a
different direction, but we dropped it in favour of Mark Davis' ICU effort,
which looked promising.

The other culprit is C and byte-based DBMSs. The generation of programmers
who grew up expecting a character to be 8 bits (or expecting that all
strings will be in their local encoding) -- which is my generation -- has
made an infrastructure that breaks easily. The more recent APIs from Java,
.NET, Apple etc. are much better in this regard, but we still have a lot of
older code floating about, and code written by private individuals and
contributed to open source is often really bad in this respect.

Even HTTP has not been immune: when you send a request, what encoding is
used? Until recently it was up in the air. That is why XML is so strict and
definite about encodings: you have to know every step of the chain.
Ultimately many programmers will conclude that it is simpler to mandate
UTF-8 at every part of their processing chain, wherever possible.

Furthermore, this is why it is important that XML keep enough characters
unused to be able to detect encoding errors. XML 2.0 should ban all
non-whitespace control characters. See
http://www.topologi.com/public/XML_Naming_Rules.html for more on that.

Cheers
Rick Jelliffe