Re: SDATA or UNICODE
> From: Paul Prescod <papresco@t...>
>
> On Wed, 28 Jan 1998, Gavin McKenzie wrote:
> >
> > XML provides a way for specifying the encoding of an entity with the
> > ?XML pi encoding declaration. Why wouldn't this be sufficient? If the
> > euro or florin symbol is available in some non-Unicode character
> > encoding scheme, isn't it sufficient to encode the text which requires
> > the symbol in the appropriate scheme and use the encoding declaration?
>
> No, for the reason Tim points out. On the other hand, you might be on the
> right track. A processing instruction would serve as a hack to tell the
> application where to insert the euro. <?EURO>

XML has, underlying its decisions, the SGML model, which separates the
encoding of data (i.e. "storage management") from its logical
representation as streams of characters in a single character set (i.e.
"entity management").

This is a very flexible model, since it allows any system of encoding that
anyone can dream up to be used without having to alter XML/SGML: an entity
can be sourced from files, multipart MIME, a database, a random number
generator, standard input, anything.

Allowing multiple encodings within an XML file, delimited using PIs or
elements or internal entities, would violate this model, and I would
strongly recommend against it. If your customers require multiple
encodings, then they have to source each one from a separate external
entity. These entities can be bundled up or interleaved in any fashion you
like, but this is a *PRE* XML storage management issue, not an XML issue.

I think there is a great desire that XML will be a Trojan horse to force
the development of wide-character applications, and Universal Character
Set-using ones (UCS = ISO 10646 ~= Unicode) in particular. I, for one,
hope that by disconnecting encoding and character "repertoire", XML will
marginalise the character encoding issue to the extent that, in the long
run, it will become easier to use Unicode than a regional encoding.

> I think you should implement a language that allows this and is preprocessed
> into XML. If I were you I would use marked sections and not attributes to
> describe the boundaries. Marked sections are really easy to scan for.

But once you have changed encodings, do you scan for the end of the marked
section using the old or the new encoding? This kind of ISO 2022 mode
switching is what we are trying to get rid of in XML (and in SGML). So you
can have multiple encodings before the parser, but they cannot be
presented to the parser.

The other choice is multiple encodings after the parser: e.g. embedding
the SJIS encoded in a Latin-1-safe way. This is the same as Dave's comment
about transliteration using notation. You can have a document like

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE x SYSTEM "x.dtd" [
<!NOTATION sjis-Qencoded SYSTEM "SjisQ.pl">
<!ELEMENT SJIS-SECTION ( #PCDATA ) >
<!ATTLIST SJIS-SECTION I-need-decoding NOTATION ( sjis-Qencoded ) "sjis-Qencoded" >
]>
<x>
...
<SJIS-SECTION><![CDATA[ smdkfjhhjwfnnweofijslkdm ]]></SJIS-SECTION>
...
</x>

(You cannot do the same thing using internal entities in XML, since you
cannot put a notation on an internal entity declaration.)

Rick Jelliffe
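To make the post-parse step concrete, here is a minimal sketch of what the
application layer (not the XML parser) might do with each SJIS-SECTION once
parsing is finished. It is only an illustration, not the SjisQ.pl handler
named in the notation declaration: it is written in Python rather than Perl,
it assumes the "Q" encoding is the ordinary quoted-printable scheme, and the
file name x.xml and the function decode_sjis_section are invented for the
example.

import quopri
import xml.etree.ElementTree as ET

def decode_sjis_section(qencoded_text):
    """Turn the Latin-1-safe payload back into a Unicode string:
    undo the quoted-printable-style "Q" encoding to recover the raw
    Shift-JIS bytes, then decode those bytes as Shift-JIS."""
    raw = quopri.decodestring(qencoded_text.encode("ascii"))
    return raw.decode("shift_jis")

# The parser sees only one encoding (ISO-8859-1); the notation-driven
# decoding of SJIS-SECTION content happens afterwards, in the application.
tree = ET.parse("x.xml")
for section in tree.iter("SJIS-SECTION"):
    print(decode_sjis_section(section.text))

The point of the sketch is the division of labour: the document presented to
the parser stays in a single encoding, and anything else is recovered by the
application using the declared notation as its cue.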