Re: character encoding questions
Christina Portillo wrote:
>
> My questions are:
> 1. How are the software vendors (browser, parser, authoring) planning on
> supporting documents which utilize the UNICODE character set?
>

The Double Byte Edition of the Balise SGML/XML transformation tool represents characters internally with 16 bits, which lets it transform any Unicode document transparently (see http://www.balise.com/).

Balise can parse, read and write the most common encoding schemes: UCS-2, UTF-8, ISO-8859-[1-9], Shift-JIS, EUC-JP, EUC-KR, CN-GB, and Big5.

The Balise XML scanner switches to the appropriate decoder when the XML PI changes the encoding. For instance,

<?XML version='1.0' encoding='ISO-8859-1' ?>

specifies that the stream should be interpreted according to the ISO Latin 1 encoding scheme.

When reading or writing character files with Balise, you can specify the encoding scheme to use, and through this mechanism transform a document from one encoding scheme to another (as long as they are compatible).

Because characters are coded internally on two bytes, the user sees a single flat Unicode character set. This is particularly important for operations such as searching and sorting.

The Single Byte Edition of Balise supports ISO-8859-1 and UTF-8 (in its ASCII subset).

> 2. a) Can all the characters referenced in ISO LAT,1 positions 0-256, be
> referenced in the document without benefit of escape codes?
>

Only the UCS-2 and UTF-8 encoding schemes are absolutely required by the XML spec. A tool must support the ISO-8859-1 encoding scheme in order to process characters in the range 160-255 of ISO-8859-1 directly. If ISO-8859-1 is not available, you must code those characters with character references.

> 2. b) What about positions 0-125?

Characters of ISO Latin 1 between 32 and 127 (the ASCII part) are OK because they are mapped to the same positions in most encoding schemes, including UTF-8.

>
> 2. c) Must the characters above 126 be escaped?
>

No, if the appropriate encoding scheme (here ISO-8859-1) is used.
If not, you should use character references like &#233; or encode the desired character directly in the document's current encoding scheme (UCS-2 or UTF-8).

For ISO Latin 1, the code of every character is the same as in Unicode. This is not true for the other ISO-8859 encoding schemes, nor for ISO TECH, ISO PUB, ... This means that tools using an 8-bit internal representation are obliged to store such characters internally in an escaped form, which may be inefficient or inadequate for some encodings and some kinds of processing.

> 3. At what point in the ISO10646 character set must escaping be
> instituted in order to reference a character within the set?
>

Character references (like &#233;) are a convenience that covers any Unicode character, even one that cannot be represented in the encoding scheme of the document.

Tools like Balise can be used to transform documents between any character formats: special characters can be coded directly by their Unicode character code (if compatible with the encoding scheme), as XML character references, or as SGML SDATA entity references. You can use Balise at different steps of your process to adapt your data to the capabilities and limitations of other tools.

When tools do not code characters internally in 16 bits, they are obliged to store characters outside their internal range in an escaped form.

--------------------------------------------------------------------------------
Nicolas Paris
AIS Software                 tel. : (33+1) 40 64 43 00
17 rue Remy Dumoncel         fax. : (33+1) 40 64 43 10
75014 Paris                  email: nico@A...
FRANCE                       web:   http://www.balise.com/

xml-dev: A list for W3C XML Developers
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/
To unsubscribe, send to majordomo@i... the following message;
unsubscribe xml-dev
List coordinator, Henry Rzepa (rzepa@i...)
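
As a rough illustration of the answers to questions 2 and 3 above, here is a minimal Python sketch of the character reference fallback: characters that the target encoding scheme can represent are written directly, and the others are written as numeric character references. It uses only the Python standard library, not Balise; the helper name and the sample string are hypothetical and chosen only for this example.

```python
# Illustrative sketch only: standard-library Python, not Balise.
# The helper name and the sample text are assumptions made for this example.

def to_encoding_with_charrefs(text: str, encoding: str) -> bytes:
    """Encode text; characters the encoding cannot represent
    are replaced by numeric character references (&#N;)."""
    return text.encode(encoding, errors="xmlcharrefreplace")

sample = "caf\u00e9 \u20ac"  # e-acute (code 233) and the euro sign (code 8364)

# ISO-8859-1 can hold e-acute directly, but not the euro sign.
latin1 = to_encoding_with_charrefs(sample, "iso-8859-1")

# Plain ASCII can hold neither, so both become character references.
ascii_only = to_encoding_with_charrefs(sample, "ascii")

# UTF-8 can hold every Unicode character, so nothing is escaped.
utf8 = to_encoding_with_charrefs(sample, "utf-8")

# Every ISO-8859-1 byte value equals the Unicode code of its character.
latin1_matches_unicode = all(
    bytes([b]).decode("iso-8859-1") == chr(b) for b in range(256)
)

# Characters 32-126 have identical bytes in UTF-8 and ISO-8859-1.
ascii_text = "".join(chr(c) for c in range(32, 127))
ascii_same = ascii_text.encode("utf-8") == ascii_text.encode("iso-8859-1")

print(latin1)
print(ascii_only)
print(utf8)
print(latin1_matches_unicode, ascii_same)
```

The same text therefore needs zero, one, or two character references depending on the encoding scheme chosen for the output, which is the trade-off described in the answers to 2 c) and 3.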