Re: Specifying a Unicode subset
What *exactly* do you hope to accomplish? Because I'm not seeing any
value at all here, and as a programmer I feel like you're compelling me
to stare into the fires of hell.

Unicode has arrived to kill off all of the short-sighted legacy
character encodings, and while Unicode has a *lot* of problems for Asian
languages (Han unification was *NOT* a good idea), it remains infinitely
better than the Tower of Babel we had before. Besides, there are good
libraries
(http://oss.software.ibm.com/developerworks/opensource/icu/project/)
for dealing with internationalization and the legacy encodings, and once
they are done I hope never to revisit this nightmare again. Anybody
building any kind of development environment that does not take
advantage of this extensive body of code is a fool who deserves interop
with nothing more than his navel.

Let's move on. UTF-8 is your transfer encoding; use UCS-2 in memory
(unless you are planning to process ancient Sumerian or something, in
which case use UCS-4), and let's all move on to something remotely
interesting.

On Monday, October 21, 2002, at 06:03 PM, Gustaf Liljegren wrote:

> One thing I remember from SGML is the flexibility it allows in
> defining the character repertoire and even mapping characters from a
> BASESET to a DESCSET. While there are many longtime SGML users here,
> there are probably many without this experience too, so here's a quick
> review:
>
> In the SGML declaration (that's a file apart from the document and the
> DTD with settings for a certain application), you first declare a
> BASESET that closely resembles the characters you'll use. The BASESET
> is given by a name which is understood by the system:
>
> BASESET "ISO 646:1983//CHARSET ..."
>
> The information carried in this string is a numbered character
> repertoire (a.k.a. coded character set, or CCS). ASCII is one numbered
> character repertoire, where the number 65 is assigned to the character
> 'A'.
> EBCDIC is another, where the character 'A' is assigned the number 193.
>
> In a DESCSET you map characters encountered in the document to
> positions in the BASESET. So if you parse a document using EBCDIC and
> the parser encounters a character numbered 193, it may be mapped
> automatically to 65, if your tools prefer ASCII:
>
> DESCSET 193 1 65
>
> This means you map 1 character in the document, starting at position
> 193, to character position 65 in the BASESET. You can map several
> characters at the same time by increasing the number in the middle.
> The last number may be set to 'UNUSED' to indicate that the parser
> should exclude characters with these numbers:
>
> DESCSET 0 9 UNUSED -- 0 to 8 are not used --
>
> Today, everyone seems to support the idea of one true CCS (Unicode).
> Therefore, with XML we don't have the kind of problem illustrated in
> the first DESCSET example; a character number can have only one
> meaning in XML. However, there's no way to specify which characters to
> include or exclude in XML, as illustrated in the second example.
>
> With XML 1.1 (here's my point), there's a proposal to include more
> characters from Unicode in XML. So while people nowadays agree on
> which CCS to use, there's still discussion about which *part* of that
> CCS should be included in XML. Maybe XML needs a more flexible
> solution?
>
> I see three aspects in this:
>
> 1. Which CCS is used?
> 2. Which subset from the CCS is used?
> 3. Which algorithm is used to encode character numbers to binary
>    sequences?
>
> As far as I'm concerned, it's a good thing that XML clearly specifies
> the unconditional use of Unicode as its CCS. By doing so, XML removes
> one level of complexity and most of the character conversion
> headaches.
>
> The third aspect, if I'm not mistaken, is exactly what is specified in
> the 'encoding' attribute in the XML declaration. That is good too.
>
> However, some want more characters in XML, while others don't want
> them. Perhaps we can allow for both by letting documents declare their
> own subset of Unicode?
>
> <?xml version="1.0" encoding="iso-8859-1"?>
> <?xml-characters plain="add_nel.xml" charref="add_c0.xml"?>
> <doc>
> <p><!-- Unicode characters, some not standard in XML --></p>
> </doc>
>
> The PI would point to one or two files that (one way or the other)
> specify a subset of Unicode. The 'plain' subset is for characters that
> may be written directly (i.e. it acts as a replacement for the 'Char'
> production in the specification). The 'charref' subset is for
> characters that may be represented as character references.
>
> I need help in understanding the implications of this solution. Would
> it break something fundamental?
>
> Gustaf
>
> -----------------------------------------------------------------
> The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> initiative of OASIS <http://www.oasis-open.org>
>
> The list archives are at http://lists.xml.org/archives/xml-dev/
>
> To subscribe or unsubscribe from this list use the subscription
> manager: <http://lists.xml.org/ob/adm.pl>
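
For readers without the SGML background, the two ideas in the quoted message can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `map_code` and `disallowed`, the tuple layout for DESCSET entries, and the sample `plain` ranges are all hypothetical, and nothing here reflects how an actual SGML or XML parser implements these checks.

```python
from typing import List, Optional, Tuple

# Each entry mirrors one DESCSET line from the message: (start, count, base),
# where base is the first target position in the BASESET, or None for UNUSED.
DescSet = List[Tuple[int, int, Optional[int]]]


def map_code(descset: DescSet, code: int) -> Optional[int]:
    """Map a document character number to its BASESET position.

    Returns None when the number is marked UNUSED or is not described.
    """
    for start, count, base in descset:
        if start <= code < start + count:
            return None if base is None else base + (code - start)
    return None


# A hypothetical per-document Unicode subset, as inclusive code point ranges.
Subset = List[Tuple[int, int]]


def disallowed(text: str, subset: Subset) -> List[str]:
    """Return the characters of text that fall outside the declared subset."""
    return [ch for ch in text
            if not any(lo <= ord(ch) <= hi for lo, hi in subset)]


# The two DESCSET lines quoted in the message.
descset: DescSet = [(0, 9, None), (193, 1, 65)]
print(map_code(descset, 193))  # 65
print(map_code(descset, 5))    # None

# Assumed 'plain' subset: tab, LF, CR, printable ASCII, plus NEL (U+0085).
plain: Subset = [(0x09, 0x0A), (0x0D, 0x0D), (0x20, 0x7E), (0x85, 0x85)]
print(disallowed("ok\x85\x01", plain))  # ['\x01']
```

The subset check is the part the proposed `xml-characters` PI would make configurable per document; the DESCSET mapping is the part that the unconditional use of Unicode already makes unnecessary.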