XML-DEV Mailing List Archive
Re: SAX InputSource and character streams
From: Rob Lugt <roblugt@e...>
>... So, in
>this case, by deciding to pass the SAX Parser a character stream, the
>application has taken on part of the responsibility of an XML processor -
>namely the responsibility of dealing with any encoding issues, thereby
>relieving the SAX processor of any need, indeed any right, to have an
>opinion on how the encoding is performed.

For encoding, the general rule is that (reliable) information provided by a higher-level protocol takes precedence over the header. So if the XML header says the entity is Shift-JIS, but a Japanese transcoding proxy has converted the entity to Japanese EUC encoding and rewritten the MIME header accordingly, then the MIME header should be used. So a processor needs some mechanism to override auto-detection.

However, because transcoding proxies are only an issue in a few (one?) countries and perhaps for gatewaying EBCDIC onto the WWW, and because there may be a legitimate expectation that application/xml should never be transcoded anyway, in effect a lot of applications will not override the XML declaration.

(If the proxy has not rewritten the MIME header, then parsing the entity should fail at the first occurrence of a code sequence that could not be Shift-JIS. If the entity is saved to a file without fixing up the XML header, then parsing that file should fail at the first occurrence of a code sequence that could not be Shift-JIS. If the chain of information breaks, the entity is lost; fair enough.)

Actually, I think there is nothing stopping a processor from being very strict and rejecting application/*xml entities if the XML header and the MIME header disagree: this would rule out transcoding proxies that do not rewrite the XML header. I think that is a perfectly appropriate approach, but it may go beyond what XML specifies. (For text/*xml, transcoding or line-break fiddling is a desirable feature.)

Personally, I think the MIME headers are an inappropriate way to specify character encoding.
This is because setting system defaults is a system-administrator task, and even setting local Apache .htaccess directory defaults is too much for normal users. Is XML software, when writing out an entity, supposed to also rewrite any .htaccess file? What about the config file formats for other webservers?

The encoding information should not be labelled in-band with the entity: Apple's late lamented resource forks would have been fine for this. So the technology is not in place for reliable end-to-end out-of-band signalling, though it can be done if you have control over every step in the chain. In practice it only works because most people use a single encoding for all their work or for a particular language; that is not the case for many people (e.g. Singapore has four languages and three scripts in common use). At my multilingual site, I ended up reserving directories for different encodings.

Ultimately, the solution is for people to use UTF-8 for all web transmissions and data files, and to set up their servers to provide the correct information. (For people concerned with Chinese/Japanese/Korean file blowout with UTF-8, the answer is that compressed UTF-8 is about the same size as compressed UTF-16, compressed Big5, etc.)

Cheers
Rick Jelliffe
Taipei
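
For readers who want to see the override mechanism from this thread in concrete terms, here is a minimal sketch, not a definitive implementation. It uses Python's xml.sax port of the SAX InputSource rather than the Java API under discussion, and it assumes a transcoding-proxy scenario: the bytes are really EUC-JP, but the XML declaration still says Shift_JIS. The names `TextCollector`, `parse_with_external_charset`, `labels_agree`, `DOC` and `EUC_BYTES` are hypothetical, invented for this illustration. The application decodes the bytes itself using the externally supplied charset and hands the parser a character stream, so the stale declaration is not used for decoding. `labels_agree` sketches the "very strict" check, and only handles ASCII-compatible encodings.

```python
import codecs
import io
import re
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import InputSource

# Hypothetical sample: the declaration claims Shift_JIS, but a proxy has
# transcoded the bytes to EUC-JP without rewriting the declaration.
DOC = '<?xml version="1.0" encoding="Shift_JIS"?><greeting>日本語</greeting>'
EUC_BYTES = DOC.encode("euc_jp")

# Matches the encoding pseudo-attribute of an XML declaration at the start of
# the entity. Assumption: the encoding is ASCII-compatible, so the declaration
# itself can be read as plain ASCII bytes.
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


class TextCollector(ContentHandler):
    """Accumulates all character data reported by the parser."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_with_external_charset(raw, charset):
    """Decode raw bytes with the externally supplied charset, then parse.

    The parser is given a character stream, so it never sees the raw bytes
    and the encoding named in the XML declaration is not used for decoding.
    """
    source = InputSource()
    source.setCharacterStream(io.StringIO(raw.decode(charset)))
    handler = TextCollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.parse(source)
    return "".join(handler.chunks)


def labels_agree(raw, charset):
    """Strict check: does the declaration name the same codec as the label?

    A missing declaration is treated as "nothing to disagree with", which is
    a simplification made for this sketch.
    """
    match = _DECL.match(raw)
    if match is None:
        return True
    declared = match.group(1).decode("ascii")
    return codecs.lookup(declared).name == codecs.lookup(charset).name


if __name__ == "__main__":
    # The external label (e.g. a rewritten MIME header) says EUC-JP.
    print(parse_with_external_charset(EUC_BYTES, "euc_jp"))
    print(labels_agree(EUC_BYTES, "euc_jp"))
```

A strict processor in the sense described above would call something like `labels_agree` first and refuse the entity on a mismatch; a lenient one would trust the external label and go straight to the character-stream parse.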