
Re: How to read the encoding of an XML document

Subject: Re: How to read the encoding of an XML document
From: "Christopher R. Maden" <crism@xxxxxxxxx>
Date: Thu, 25 Oct 2001 14:46:54 -0700
At 14:18 25-10-2001, James Garriss wrote:
> I've been looking at a lot of European web pages, viewing source to see what charset they define in the HTML META tag. The majority use iso-8859-1, but a few don't. Most notably Turkey and Greece have character sets that are quite different. How do I determine if UTF-16 (or UTF-8) will work for those languages?

Time for the primer again.


A character is an abstract notion, like "Latin capital letter A".

A character repertoire is a collection of characters - like "Latin upper-case letters". Different languages require different character repertoires.

A character set is an ordered, numbered character repertoire. ISO 8859-1 is one such character set, assigning numbers 0-255 to 256 characters. Its repertoire covers nearly all of the characters needed for western European languages like French, Spanish, German, and Italian, as well as English, Icelandic, Swedish, Norwegian, and Dutch. There are other ISO 8859 character sets that cover characters needed by other languages like Turkish, Polish, Greek, Russian, Hebrew, and Arabic.
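
To make "ordered, numbered repertoire" concrete, here is a minimal Python sketch. The helper name char_at is made up for illustration, and it assumes Python's built-in codecs for the ISO 8859 family. It shows that one number can stand for different characters depending on which character set you read it in:

```python
def char_at(number: int, charset: str) -> str:
    """Return the character that `charset` assigns to `number` (0-255)."""
    return bytes([number]).decode(charset)

# The same number, read through two different character sets.
latin1 = char_at(0xE9, "iso-8859-1")   # Latin small letter e with acute
greek = char_at(0xE9, "iso-8859-7")    # Greek small letter iota

# A Turkish letter that ISO 8859-9 places where ISO 8859-1 has y with acute.
turkish = char_at(0xFD, "iso-8859-9")  # Latin small letter dotless i

print(latin1, greek, turkish)
```

This is why a page labelled with the wrong charset shows the wrong letters: the numbers are the same, but the table used to look them up is not.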

Unicode is also a character set. It assigns numbers in the range 0 through 0x10FFFF (1,114,111) to a whole lot of characters, leaving much of that range unassigned for now. Its repertoire covers the characters found in the major national and international standards, including all of the ISO 8859 sets.
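
As a rough illustration (a sketch, not a proof), the following Python looks up the Unicode number and name for one character from each of three ISO 8859 sets. The helper unicode_entry and the choice of sample bytes are my own assumptions:

```python
import unicodedata

# One sample number from each of three ISO 8859 character sets.
samples = {
    "iso-8859-1": 0xE9,
    "iso-8859-7": 0xE9,
    "iso-8859-9": 0xFD,
}

def unicode_entry(byte_value: int, charset: str) -> tuple:
    """Return (Unicode number, Unicode name) for a number in `charset`."""
    ch = bytes([byte_value]).decode(charset)
    return ord(ch), unicodedata.name(ch)

entries = {cs: unicode_entry(b, cs) for cs, b in samples.items()}

# ISO 8859-1 lines up with the first 256 Unicode numbers, one for one.
latin1_is_prefix = all(
    bytes([n]).decode("iso-8859-1") == chr(n) for n in range(256)
)

for charset, (number, name) in entries.items():
    print(charset, hex(number), name)
print(latin1_is_prefix)
```

Each character gets its own Unicode number, so the Greek and Turkish letters no longer compete with the western European ones for the same slots.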

An encoding is a mapping between bit patterns and the numbers of a character set. UTF-8 and UTF-16 are encodings of Unicode. In a sense, ISO 8859-1 and its kin are also encodings of Unicode, but ones that cannot represent all of the characters.
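
A small Python sketch of the difference, assuming the standard codecs and using a Greek word of my own choosing as sample text: the same five characters become different byte sequences under each encoding, and ISO 8859-1 has no bytes for them at all.

```python
# "Athens" in Greek, written with escapes so the source stays ASCII.
text = "\u0391\u03b8\u03ae\u03bd\u03b1"

utf8 = text.encode("utf-8")         # two bytes per Greek letter
utf16 = text.encode("utf-16")       # byte order mark, then two bytes each
greek = text.encode("iso-8859-7")   # one byte per letter

# ISO 8859-1 covers only part of Unicode, so Greek does not fit.
try:
    text.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(len(utf8), len(utf16), len(greek), latin1_ok)

# Both UTF encodings round-trip back to the original characters.
print(utf8.decode("utf-8") == text, utf16.decode("utf-16") == text)
```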

In short: Unless you are working in Klingon, Minbari, or Silvestri, Unicode covers the characters you need in its repertoire. UTF-8 and UTF-16 are both capable of representing all of the characters in Unicode. All XML parsers are required to read UTF-8 and UTF-16 data.
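
If you want to see this with a parser, here is a hedged sketch using Python's xml.etree.ElementTree (one parser among many, picked only for illustration). The sample document and the helper parse_as are invented for the example:

```python
import xml.etree.ElementTree as ET

# A tiny document holding a Greek word; {enc} is filled in per encoding.
DOC = (
    '<?xml version="1.0" encoding="{enc}"?>'
    "<city>\u0391\u03b8\u03ae\u03bd\u03b1</city>"
)

def parse_as(encoding: str) -> str:
    """Serialize the sample in `encoding`, parse the bytes, return the text."""
    data = DOC.format(enc=encoding).encode(encoding)
    return ET.fromstring(data).text

utf8_text = parse_as("UTF-8")
utf16_text = parse_as("UTF-16")

# The parser hands back the same characters either way.
print(utf8_text == utf16_text)
```

The point is that the choice between UTF-8 and UTF-16 is invisible to whatever consumes the parsed document; only the bytes on disk differ.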

Use them. Know them. Love them.

-Chris
--
Christopher R. Maden, Principal Consultant, HMM Consulting Int'l, Inc.
DTDs/schemas - conversion - ebooks - publishing - Web - B2B - training
<URL: http://www.hmmci.com/ > <URL: http://crism.maden.org/consulting/ >
PGP Fingerprint: BBA6 4085 DED0 E176 D6D4  5DFC AC52 F825 AFEC 58DA


XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list


