
Re: Identifying the encoding of a document

  • From: Rick JELLIFFE <ricko@g...>
  • To: xml-dev@x...
  • Date: Mon, 07 Aug 2000 19:21:19 +0800

Lisa Retief wrote:
> 
> > Do you mean you need to detect in some documents which encoding they
> > use?
> 
> Yes - I need to do this programmatically - sometimes the user has not
> specified an encoding and I would prefer not to default to something if I
> can figure it out.

If you can figure out three things about the text, it will help narrow
your choices:

 1) Do you know the language (really, the script) used in the documents
    by some external mechanism?  In particular, is it a Latin-based
    script or something else?

 2) Is the encoding ASCII-family, EBCDIC-family, or something exotic?
    A simple way to check is to open the document in a vanilla ASCII
    text editor: if the places where you expect ASCII characters
    (a-z, A-Z, 0-9, simple punctuation) show up readably, the encoding
    is ASCII-family.

 3) What locale was it created in (or what locale were the tools)?
    E.g. was it made in Japan, or by or for Japanese users?

When you know all three of these things, you will often have only one
or two main choices.
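
The check in question 2 can be roughed out in code instead of by eye.
What follows is only a sketch in Python, under the assumption that the
text is mostly lowercase Latin letters and spaces; the function name
guess_family and the 50% threshold are made up for illustration and are
not taken from any existing tool.

```python
def guess_family(data: bytes) -> str:
    """Rough guess at the encoding family: 'ascii', 'ebcdic' or 'unknown'.

    Assumption: the sample is mostly lowercase Latin letters and spaces.
    """
    if not data:
        return "unknown"
    # ASCII-family encodings: space is 0x20, a-z are 0x61-0x7A.
    ascii_hits = sum(1 for b in data if b == 0x20 or 0x61 <= b <= 0x7A)
    # EBCDIC: space is 0x40, a-z sit in 0x81-0x89, 0x91-0x99, 0xA2-0xA9.
    ebcdic_hits = sum(
        1
        for b in data
        if b == 0x40
        or 0x81 <= b <= 0x89
        or 0x91 <= b <= 0x99
        or 0xA2 <= b <= 0xA9
    )
    threshold = 0.5 * len(data)  # arbitrary cut-off, tune for your data
    if ascii_hits >= threshold and ascii_hits > ebcdic_hits:
        return "ascii"
    if ebcdic_hits >= threshold and ebcdic_hits > ascii_hits:
        return "ebcdic"
    return "unknown"
```

This only separates the two big families. It says nothing about which
ASCII-family encoding (ISO 8859-1, Shift_JIS, ...) is actually in use;
that is what questions 1 and 3 are for.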

There are automated programs available too: many encodings have a
distinct signature that can be detected this way. Some (such as the ISO
8859-n encodings) may not have distinct signatures, but if your documents
provide a large enough sample it is possible to use statistical
techniques to figure out the encoding.
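
As one example of a signature: the Unicode encodings may start with a
byte order mark. A minimal sketch in Python of checking for it, using
the BOM constants from the standard codecs module (the function name
sniff_signature is hypothetical, and the statistical techniques are not
shown here):

```python
import codecs

# UTF-32 must be tested before UTF-16: the UTF-32 LE mark (FF FE 00 00)
# begins with the UTF-16 LE mark (FF FE).
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_signature(data: bytes):
    """Return the encoding named by a leading byte order mark, or None."""
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            return name
    return None
```

A result of None only means there was no byte order mark; the document
may still be UTF-8, or one of the encodings without any signature.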

This is of course a skill which you shouldn't need if you are using XML:
every document that is not in UTF-8 or UTF-16 must be labelled with
explicit information about the encoding used, either in its encoding
declaration or by the transport (e.g. a MIME charset). Authors and
programmers are not used to the discipline of doing this, but it is the
only thing that can work reliably: guesswork isn't good enough.
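
For documents that do carry the label, reading it is straightforward. A
sketch in Python, assuming the encoding is ASCII-compatible so that the
XML declaration can be read as plain bytes (this will not work for
UTF-16 or EBCDIC input); the function name declared_encoding is
hypothetical:

```python
import re

# Matches <?xml ... encoding="NAME" or encoding='NAME' at the very start.
_DECL = re.compile(
    rb'^<\?xml[^>]*?\sencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(head: bytes):
    """Return the encoding named in the XML declaration, or None.

    `head` is the first few hundred bytes of the document.
    """
    match = _DECL.match(head)
    return match.group(1).decode("ascii") if match else None
```

If this returns None and nothing external (such as a MIME header) gives
the encoding, an XML processor has to treat the document as UTF-8, or as
UTF-16 when a byte order mark is present.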

> > Or which encoding is best to use when generating XML documents for
> > different locales?

> I am interested in this question too, as I need to advise clients and 
> users of the application I am developing about this.

It is prudent to limit yourself to international or national encodings:
stay away from encodings that are regional or vendor-specific (e.g.
Microsoft's "ANSI", Macintosh "MacRoman" or IBM's EBCDIC family). You
may find it useful in the short run to converge on UTF-8: there are many
text conversion programs that can help with this, such as GNU iconv and
IBM's International Components for Unicode (ICU).
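
Once the source encoding is known, the conversion itself is short. A
sketch in Python using only the built-in codecs, as a stand-in for what
iconv does on the command line; the function name to_utf8 is
hypothetical:

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode `data` from `source_encoding` to UTF-8.

    Raises UnicodeDecodeError if the bytes are not valid in the source
    encoding, and LookupError if Python does not know the encoding name.
    """
    text = data.decode(source_encoding)  # strict: fail rather than guess
    return text.encode("utf-8")
```

After converting an XML document this way, remember to change or remove
its encoding declaration too, or the label will no longer match the
bytes.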

Rick Jelliffe
