Re: International Characters in attributes
> we have had problems with the encoding set in our
> documents [affecting] what was displayed in the browser

This might have to do with whether the browser has been properly configured to detect and use the encoding that you set. You also do not want to just make up encodings; the document has only one encoding, and if you declare it to be something that it is not, it is only natural that the browser will misinterpret it. Since you said you are fuzzy on encoding...

--------------------------------------------------------------------------

The encoding that we are talking about here is the mapping of characters (which are abstract) to sequences of bits (which are... less abstract). Strictly speaking, this is a "character encoding scheme".

Take, for example, the non-breaking space character, which in HTML we often write as "&nbsp;", a predefined (in HTML, not XML) entity reference defined as equivalent to "&#160;", which in turn is interpreted as the single non-breaking-space character. Different encoding schemes will represent this character as different bit sequences.

For example, in the "iso-8859-1" encoding, the non-breaking space character maps to the bit sequence 10100000, an 8-bit byte representing a value that we can also easily express as decimal 160 or hex A0. But in "utf-8", the non-breaking space maps to the bit sequence 11000010 10100000. If we interpret this as a pair of 8-bit bytes, we could say they represent the values hex C2 followed by A0 (194 and 160).

Now imagine you are the web browser, receiving an HTTP message containing an HTML document. All you see in the message is a stream of bits. How do you know what 1100001010100000 means? If you think the document is encoded using utf-8, you'll correctly interpret this sequence as one single NO-BREAK SPACE character (that's its Unicode name).
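(A quick way to check the byte values above for yourself, not part of the original post: Python's standard codecs use the same encoding names.)

```python
# The abstract character: U+00A0 NO-BREAK SPACE
nbsp = "\u00a0"

# In iso-8859-1 it encodes to the single byte 0xA0 (decimal 160).
assert nbsp.encode("iso-8859-1") == b"\xa0"

# In utf-8 it encodes to the two bytes 0xC2 0xA0 (decimal 194, 160).
assert nbsp.encode("utf-8") == b"\xc2\xa0"
```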
If you think the document is encoded using iso-8859-1, you will incorrectly interpret it as *two* characters: (0xC2) LATIN CAPITAL LETTER A WITH CIRCUMFLEX followed by (0xA0) NO-BREAK SPACE.

Where do you get info about the document's encoding? Well, there are 3 places to get it:

- from the transport (e.g., one of the HTTP message headers);
- from within the document itself (e.g., assume the document is us-ascii encoded, read until you find a META tag that is intended to mean the same thing as the HTTP header from the first option, then reprocess the document using whatever encoding was declared there); or
- by analyzing the bit sequences in the document and making an educated guess.

The first option is supposed to take highest precedence. I believe that in the case of HTML documents, the second option has higher precedence in practice, even though it is in violation of the relevant specs for it to do so. The last option is difficult, but browsers will make a stab at it if properly configured. XML makes this option much more feasible for XML parsers than HTML does for HTML user agents, because you know an XML document always begins with the bits for "<?xml ", possibly preceded by a UTF-16 byte order mark; and if it doesn't, then the parser is required to assume it is UTF-8 or UTF-16 (and it is an error if the document is not!).

If you use the first or second option, you have to be sure that the encoding being declared is accurate. If you saved your document to disk from a text editor, it exists on disk as (essentially) a sequence of bits, so it must have been subjected to some encoding. Your editor might have given you the option of choosing this encoding. If it didn't, then it probably stored it using your operating system's default encoding, which can vary depending on the OS and locale (e.g., windows-1252, aka cp1252, on USA versions of Windows). You must declare this encoding, or an encoding that is a superset of it, to be the encoding of the document.
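(To see the misinterpretation concretely, here is a small Python sketch of my own, decoding the same two bytes under each scheme:)

```python
data = b"\xc2\xa0"  # the utf-8 encoding of NO-BREAK SPACE

# Decoded as utf-8: one character, U+00A0 NO-BREAK SPACE.
assert data.decode("utf-8") == "\u00a0"

# Decoded as iso-8859-1: two characters, U+00C2 LATIN CAPITAL
# LETTER A WITH CIRCUMFLEX followed by U+00A0 NO-BREAK SPACE.
assert data.decode("iso-8859-1") == "\u00c2\u00a0"
```

Same bits, two different results; everything depends on which decoder the receiver applies.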
So let's say you have in your HTML document a declaration of the encoding that was used, and that this declaration is accurate. Whether or not the browser will actually honor this declaration and decode the document appropriately is an entirely separate matter! You will find that in many cases, if you have not gone to the trouble of configuring your browser to auto-detect the encoding, it will proceed under the assumption that the document is in some default encoding that it shipped with, or the one that your operating system uses by default.

Above and beyond this, most browsers give you the option of manually resetting the encoding while you are viewing a page, which really means you are choosing to *decode* the document's bit sequences according to that particular scheme. I saw a post on the Unicode list, I believe from Microsoft, explaining that in one of their 3.x browsers they either had auto-detect on by default, or they didn't allow users to override the encoding... and this resulted in innumerable complaints from people who could no longer view web pages with misdeclared encodings. Apparently there are a lot of Shift-JIS documents out there claiming to be ISO-8859-1, and people needed to be able to make the browser ignore these misdeclarations by default so their surfing experience wouldn't be traumatic.

FWIW, http://www.hclrss.demon.co.uk/unicode/ contains some good info that compares encoding support in various browsers.

- Mike

____________________________________________________________________
Mike J. Brown, software engineer at   My XML/XSL resources:
webb.net in Denver, Colorado, USA     http://skew.org/xml/

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list