Re: character encodings
Pam Huntley wrote:
> I'm having a problem where my XML file is in utf-8 (and has
> english characters in it), but my XSL file has DBCS characters in it, and
> although I saved it as UTF-8, I don't really know what the encoding is (I
> think for japanese it's ms_kanji, big5 for chinese).
> [...]

It is the responsibility of the XML document author (whether that XML doc
is an XSLT stylesheet or any other kind of XML) to know what the encoding
of their document is, and to accurately declare it in the encoding
declaration part of the prolog...

  <?xml version="1.0" encoding="whatever"?>

at the top of your document, even if it is a stylesheet. This is a
requirement for well-formedness (it is an error if you misdeclare the
encoding, though it is not always detectable, such as when you have only
ASCII characters and you say it's anything but utf-16 or ebcdic).

> When I go to transform using the microsoft msxml stuff, I get an error
> saying the XSL does not contain a document element. However, if I use the
> exact same XSL, only the untranslated version (or any single byte version),
> saved as utf-8, it works.

Right, utf-8 uses 1 to 4 bytes per character in unambiguous sequences,
while these other encodings tend to use 2 or 4 per character, or 1 per
character but with the interjection of certain bytes to "shift" into an
alternate "page" in their character maps, thus requiring stateful decoding
algorithms. You can't expect an XML parser to know that up until byte x in
your file the encoding is utf-8 and then suddenly it switches to big5.

> I got the strings translated, and they came back in an ANSI file.

By ANSI do you mean windows-1252? I don't see how that could be, because
there are less than 256 characters in windows-1252, and none of them are in
CJK scripts. You said you get them as big5 or whatever.

> I couldn't send the XSL off to be translated because our translation centers
> don't really know what to do with it. Then I used a program to go replace
> the strings back where they belong in the XSL.

Yeah, you can't really do that. You're pasting encoded strings (bytes) into
the middle of a bunch of bytes derived through some other encoding. You can
only do that if your encodings are the same, and even then, it's not an
advisable way to go about things.

> So, for single byte
> languages, I save the resultant XSL in utf-8 and everyone is happy. But
> for the DBCS languages, even if I save the resulting file in utf-8, I get
> the error.
>
> I don't have any control over the XML file - it comes from a server, and I
> just save it to a file. Is there some way to make the XSL work, even if it
> is not utf-8?

You really need to know what the encoding is of what you're getting back.
I don't know the API, exactly, but you use that info to decode all your
strings into Unicode string objects. Then you can stitch them together
however you want, and then encode the entire result as utf-8.

   - Mike
____________________________________________________________________________
  mike j. brown                   |  xml/xslt: http://skew.org/xml/
  denver/boulder, colorado, usa   |  resume: http://skew.org/~mike/resume/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
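
For illustration only, the decode / stitch / re-encode approach described in the reply above might look roughly like the following in Python. This is a sketch under assumptions, not what the original poster's tooling does: the `@@GREETING@@` placeholder, the `stitch` helper, the sample stylesheet and the sample Big5 string are all hypothetical, and it assumes you already know the encoding of each translated string. It also shows why pasting the raw Big5 bytes into the UTF-8 file fails.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical stylesheet template, stored as UTF-8 bytes, with a placeholder
# where the translated string belongs.
TEMPLATE_UTF8 = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<xsl:stylesheet version="1.0" '
    'xmlns:xsl="http://www.w3.org/1999/XSL/Transform">\n'
    '  <xsl:template match="/"><p>@@GREETING@@</p></xsl:template>\n'
    '</xsl:stylesheet>\n'
).encode("utf-8")

# Hypothetical translated string as it might come back from translation:
# raw bytes in Big5 (assumed known here; in practice you have to find out).
TRANSLATED_BIG5 = "你好".encode("big5")


def stitch(template_bytes, template_encoding, translations):
    """Decode everything to Unicode, substitute, then encode once as UTF-8.

    translations maps placeholder -> (raw_bytes, encoding_of_those_bytes).
    """
    text = template_bytes.decode(template_encoding)
    for placeholder, (raw, encoding) in translations.items():
        # Decode with the string's own encoding, and escape XML
        # metacharacters (&, <, >) before inserting into markup.
        text = text.replace(placeholder, escape(raw.decode(encoding)))
    return text.encode("utf-8")


if __name__ == "__main__":
    # Naive approach: paste Big5 bytes straight into the UTF-8 byte stream.
    naive = TEMPLATE_UTF8.replace(b"@@GREETING@@", TRANSLATED_BIG5)
    try:
        naive.decode("utf-8")
    except UnicodeDecodeError as exc:
        print("naive byte paste is not valid utf-8:", exc.reason)

    # Decode, stitch, re-encode: the result matches its encoding declaration.
    good = stitch(TEMPLATE_UTF8, "utf-8",
                  {"@@GREETING@@": (TRANSLATED_BIG5, "big5")})
    ET.fromstring(good)  # raises ParseError if the result is not well-formed
    print("stitched stylesheet parses as XML")
```

The point of the sketch is that substitution happens on Unicode strings, never on bytes, and the file is encoded exactly once at the end, so the bytes on disk agree with the `encoding="utf-8"` declaration.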