Re: Identifying the encoding of a document
Lisa Retief wrote:
>
> > Do you mean you need to detect in some documents which encoding they
> > use?
>
> Yes - I need to do this programmatically - sometimes the user has not
> specified an encoding and I would prefer not to default to something if I
> can figure it out.

If you can figure out three things about the text, it will help narrow
your choices:

1) Do you know the language (really, the script) used in the documents
   by some external mechanism? In particular, is it a Latin-based script
   or something else?

2) Is the encoding ASCII-family, EBCDIC-family, or something exotic? A
   simple way to check is to open the document in a vanilla ASCII text
   editor: if the places where you expect ASCII characters (a-z, A-Z,
   0-9, simple punctuation) display correctly, the encoding is almost
   certainly ASCII-family.

3) What locale was it created in (or what locale were the tools)? E.g.
   was it made in Japan, or by or for Japanese users?

When you know all three of these things, you will often be left with
only one or two main choices.

There are automated programs available too: many encodings have a
distinct signature that can be detected automatically. Some (such as the
ISO 8859-n encodings) may not have distinct signatures, but if your
documents provide a large enough sample it is possible to use
statistical techniques to figure out the characters.

This is of course a skill which you don't need if you are using XML:
all documents must be labelled with explicit information about the
encoding used (or else be in UTF-8 or UTF-16). Authors and programmers
are not used to the discipline of doing this, but it is the only thing
that can work reliably: guesswork isn't good enough.

> > Or which encoding is best to use when generating XML documents for
> > different locales?
> I am interested in this question too, as I need to advise clients and
> users of the application I am developing about this.

It is prudent to limit yourself to international or national standard
encodings: stay away from encodings that are regional or vendor-specific
(e.g. Microsoft's "ANSI" code pages, Macintosh "MacRoman", or IBM's
EBCDIC family). You may find it useful in the short run to converge on
UTF-8: there are many text conversion programs that can help with this,
such as GNU iconv and IBM's International Components for Unicode (ICU).

Rick Jelliffe
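
The signature detection mentioned above can be sketched in a few lines
of Python. This is only an illustration under stated assumptions, not a
complete detector: the function names `sniff_encoding` and `decode_text`
are hypothetical, and the sketch looks only for a byte-order mark or an
XML declaration written in an ASCII-compatible encoding. An unlabelled
ISO 8859-n or EBCDIC file has no such signature, so the function returns
`None` and the encoding has to come from somewhere else (the three
questions above, or statistical analysis).

```python
import codecs
import re

# Byte-order marks, with the UTF-32 marks first so that the UTF-32 LE
# mark (FF FE 00 00) is not mistaken for the UTF-16 LE mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF16_LE, "utf-16"),
]

# Encoding label inside an XML declaration at the very start of the data.
# Assumption: the declaration itself is readable as ASCII bytes.
_XML_DECL = re.compile(
    rb"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def sniff_encoding(data: bytes):
    """Return a codec name taken from a signature in `data`, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _XML_DECL.match(data)
    if match:
        return match.group(1).decode("ascii")
    return None  # no signature: do not guess here


def decode_text(data: bytes, fallback=None) -> str:
    """Decode `data` using its signature, or `fallback` if it has none."""
    name = sniff_encoding(data) or fallback
    if name is None:
        raise ValueError("no signature found; encoding must be supplied")
    return data.decode(name)
```

Calling `decode_text(raw).encode("utf-8")` gives UTF-8 bytes, which is
one way to converge on UTF-8 as suggested above. If the text carried an
XML declaration naming another encoding, that label would need to be
updated before saving, since the sketch does not rewrite it.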