Character Encoding and the XML PR (was Re: PR.xml)
Peter Murray-Rust writes:

> Thanks. I am also aware of it now :-). Can I make the assumption that:
>
> - ISO-8859-1 and UTF-8 look identical to not-very-experienced humans.

They look identical to most English speakers, but differ in their treatment of accented characters (> 0x7f), so French and German speakers will probably notice.

> - in principle I should be able to sort this by adding something like
>
> <?xml version="1.0" encoding="ISO-8859-1"?>
> to the top of the document

Correct. The other alternative is to configure your web server to send the encoding ISO-8859-1 in the HTTP header for this document if the text/xml MIME type is approved, but the problem will reappear if you download the file and then parse it on your own system.

> - in practice this fails because by the time it gets to the encoding
> declaration it has already assumed the encoding is UTF-8 and has crashed :-)

It should not fail with AElfred -- I just downloaded the PR and added your XML declaration to the top, and AElfred reported no errors. In fact, the XML declaration is guaranteed to use only ASCII characters, which are the same in UTF-8 and ISO-8859-*. AElfred is very careful not to read too far into the document until it has discovered whether there is an explicit encoding declaration.

> I am not quite clear why we need this problem. Do different tools emit
> different encodings? If so, what should I work with? Can I convert this
> document?

ISO-8859-1, which is used for most web pages, contains characters only for Western European languages. UTF-8 can encode any Unicode character up to 0xffff (and a little higher with surrogates), so it can handle Kanji, Han Chinese, Arabic, etc. The PR rightly specifies that any entity that begins with neither an encoding declaration nor a byte-order mark (for UCS-2) should be assumed to be encoded in UTF-8.

Conversion should be fairly simple -- take a look at the AElfred source to see how the different encodings are constructed.
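
To make the detection and conversion steps concrete, here is a minimal Python sketch. It is not AElfred's code (AElfred is written in Java), and the names `sniff_encoding` and `to_utf8` are hypothetical. It assumes the document either starts with a UTF-16 byte-order mark or is in an ASCII-compatible encoding, so the XML declaration can be read as plain ASCII before anything else is decoded; it does not attempt UCS-4 or UTF-16 without a byte-order mark.

```python
import codecs
import re

# encoding="..." inside an XML declaration at the very start of the raw bytes.
_DECL_BYTES = re.compile(
    rb"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)
# The same declaration in decoded text, split so the encoding name can be replaced.
_DECL_TEXT = re.compile(r"""\A(<\?xml[^>]*?encoding\s*=\s*["'])[^"']*(["'])""")


def sniff_encoding(data: bytes) -> str:
    """Guess the encoding of an XML entity from its first bytes."""
    # A byte-order mark signals 16-bit text.
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    # The declaration uses only ASCII, so it can be matched on raw bytes
    # without committing to UTF-8 or ISO-8859-1 first.
    match = _DECL_BYTES.match(data[:200])
    if match:
        return match.group(1).decode("ascii")
    # Neither a byte-order mark nor an encoding declaration: assume UTF-8.
    return "utf-8"


def to_utf8(data: bytes) -> bytes:
    """Re-encode an XML entity as UTF-8 and update its declaration to match."""
    text = data.decode(sniff_encoding(data))
    text = _DECL_TEXT.sub(r"\g<1>UTF-8\g<2>", text, count=1)
    return text.encode("utf-8")


# A Latin-1 document: 0xE9 is e-acute in ISO-8859-1 but is not valid UTF-8 here.
doc = b'<?xml version="1.0" encoding="ISO-8859-1"?>\n<p>caf\xe9</p>'
converted = to_utf8(doc)
```

With this sample, `converted` holds the same document with the accented character stored as the two bytes 0xC3 0xA9 and the declaration changed to UTF-8. A declared name that Python does not recognise raises `LookupError`, which a real tool would need to handle.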
Just for the record, AElfred accepts the following encodings, and to my knowledge, supports them completely and correctly to the extent allowed by Java's 16-bit characters and by surrogates:

- UTF-8
- ISO-10646-UCS-2 (both byte orders)
- ISO-10646-UCS-4 (four byte orders)
- UTF-16
- ISO-8859-1

All the best,

David

--
David Megginson ak117@f...
Microstar Software Ltd. dmeggins@m...
http://home.sprynet.com/sprynet/dmeggins/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)