Re: Question about UTF-8
Gustaf Liljegren wrote:

>In an XML-aware editor, yes. But the question is about general
>('non-XML-aware') text editors. A general editor has no idea of the
>encoding detection mechanism in XML, so I wonder how it knows that the
>octets C3 A4 should be written 'ä' and not 'Ã¤' (or something else).

Operating systems (or, if you are lucky, particular user sessions) have a setting called "locale". Among other things, this sets the default character encoding used for processing text. For example, in Java when you open a stream and don't specify an encoding, Java uses the locale's default encoding.

On Western PCs (English-speaking countries and their neighbours) this encoding will be CP1252, a superset of ISO 8859-1. However, on older Macs, it may be MacRoman, which is different. On newer Macs and Linux it may be ISO 8859-15, which is slightly different again. Many modern text editors understand the Byte Order Mark that UTF-16 allows.

>Many users who see 'Ã¤' when they open a UTF-8 encoded XML document in a
>text editor, prefer to use ISO 8859-1 to avoid this effect.

You are right that if you use an encoding that the text editor does not understand, the results will not be satisfactory. Worse than nasty glyphs, you may find that your data is actually corrupted. Or you may find that some parts of an entity are in one encoding and other sections are in another.

Unfortunately, people have this idea that all "text editors" will be able to edit all "text": but there is no such beast as "text"--it is always "text in a particular encoding".

XML allows you to alter the encoding to suit your tools. Encoding isn't important, within reason. If one set of tools works best with a particular encoding, transcode your data to use that encoding. And if you are really worried, use character references such as &#228; (for ä) to prevent stuff-ups.
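As a rough illustration of the point about default encodings, the following Python sketch decodes the same two octets, C3 A4, under each of the encodings named above, and shows the character-reference fallback. The codec names (`cp1252`, `mac_roman`, `iso8859_15`) are Python's standard codec names and are not taken from the original message; `decode_all` and `to_ascii_with_references` are hypothetical helper names introduced only for this example.

```python
# The two octets from the question: the UTF-8 encoding of U+00E4 (a-umlaut).
OCTETS = bytes([0xC3, 0xA4])

# Encodings mentioned in the message, spelled as Python codec names (assumption).
ENCODINGS = ("utf-8", "cp1252", "mac_roman", "iso8859_15")


def decode_all(octets, encodings=ENCODINGS):
    """Return a dict mapping each encoding name to how it reads the octets."""
    return {name: octets.decode(name) for name in encodings}


def to_ascii_with_references(text):
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
```

Only the `utf-8` entry of `decode_all(OCTETS)` comes back as the single character U+00E4; each of the other three encodings reads the octets as two unrelated characters, which is the effect described in the question. `to_ascii_with_references` turns U+00E4 into `&#228;`, which reads the same whatever ASCII-compatible encoding the editor assumes.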
You should be free to change encodings* because XML forces you to label which encoding has been used; that way there can be no ambiguity--which is not to say that there will be no confusion as you figure out which is the appropriate encoding for your particular toolset.

>Maybe the answer is to stay in ISO 8859-1 (or whatever default encoding the
>editor has), but I was hoping it was possible to recommend using UTF-8 all
>the time (for European scripts).

Modern editors allow the user to select the encoding used. Some editors, <plug>such as Topologi's</plug>, have XML encoding detection built in, but overridable. Perhaps your people should consider moving away from non-Unicode-based text editors.

When XML was being developed, many people just wanted to use UTF-8/UTF-16 and to ignore "legacy encodings" and "legacy systems". I had expected that by 2002 Unicode would be so entrenched that other encodings (in the West) would be relatively unimportant; however, it seems that (especially in the Linux world, and also in the PC world) the legacy applications are still very much alive and kicking.

You might think "wouldn't it be simpler if unlabelled XML just used my system's default encoding?" Well, how would that work unless there is someone at the receiving end to check that the encoding you used is the same as theirs? Ordinary users don't have the ability to check encodings, especially with any kind of large document, and often the receiving end may be a computer.

It is much simpler to state what the encoding used is than to have some guessing system, especially given that the encoding cannot always be guessed, particularly not without a performance cost.

Cheers
Rick Jelliffe
http://www.topologi.com

* providing the characters you have used are in both character sets
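The argument for labelling over guessing, and the footnote's condition on transcoding, can be sketched in Python. This is a minimal illustration under stated assumptions, not a conforming XML encoding detector: it ignores byte order marks and UTF-16/UTF-32 input, and the names `declared_encoding` and `can_transcode` are hypothetical.

```python
import re

# Matches the encoding label in an XML declaration such as
# <?xml version="1.0" encoding="ISO-8859-1"?>.
# Assumption: the document is in an ASCII-compatible encoding with no BOM.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(octets, default="utf-8"):
    """Return the encoding named in the XML declaration, else the default.

    XML treats an unlabelled document without a BOM as UTF-8, hence the
    default value used here.
    """
    match = _DECL.match(octets)
    return match.group(1).decode("ascii").lower() if match else default


def can_transcode(text, target):
    """Return True if every character of text exists in the target encoding."""
    try:
        text.encode(target)
        return True
    except UnicodeEncodeError:
        return False
```

With a label present, the receiver simply reads it: `declared_encoding` on a document starting `<?xml version="1.0" encoding="ISO-8859-1"?>` returns `iso-8859-1`, and nothing has to be inferred from the octets themselves. `can_transcode` expresses the footnote: U+00E4 can move to ISO 8859-1, while the euro sign (U+20AC) cannot, although it can move to ISO 8859-15.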