Specifying a Unicode subset
One thing I remember from SGML is the flexibility it allows in defining the character repertoire, and even in mapping characters from a BASESET to a DESCSET. While there are many longtime SGML users here, there are probably many without this experience too, so here's a quick review.

In the SGML declaration (a file separate from the document and the DTD, with settings for a particular application), you first declare a BASESET, which closely resembles the characters you'll use. The BASESET is identified by a name understood by the system:

    BASESET "ISO 646:1983//CHARSET ..."

The information carried in this string is a numbered character repertoire (a.k.a. coded character set, or CCS). ASCII is one numbered character repertoire, where the number 65 is assigned to the character 'A'. EBCDIC is another, where 'A' is assigned the number 193.

In a DESCSET you map characters encountered in the document to positions in the BASESET. So if you parse a document using EBCDIC and the parser encounters a character numbered 193, it can be mapped automatically to 65, if your tools prefer ASCII:

    DESCSET 193 1 65

This means you map 1 character in the document, starting at position 193, to character position 65 in the BASESET. You can map several characters at once by increasing the number in the middle. The last number may be set to 'UNUSED' to indicate that the parser should exclude characters with these numbers:

    DESCSET 0 9 UNUSED -- 0 to 8 are not used --

Today, everyone seems to support the idea of one true CCS (Unicode). Therefore, with XML we don't have the kind of problem illustrated in the first DESCSET example; a character number can have only one meaning in XML. However, there's no way to specify which characters to include or exclude in XML, as illustrated in the second example. With XML 1.1 (here's my point), there's a proposal to include more characters from Unicode in XML.
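The DESCSET mechanics above can be sketched in a few lines of Python. The table values are taken from the two examples; the data layout and the function name are my own, just to make the semantics concrete:

```python
# Sketch of DESCSET-style remapping, using the two examples above.
# Each entry: (doc_start, count, base_start) -- base_start of None marks UNUSED.
DESCSET = [
    (0, 9, None),     # DESCSET 0 9 UNUSED  -- 0 to 8 are not used --
    (193, 1, 65),     # DESCSET 193 1 65    -- EBCDIC 'A' -> BASESET 65 (ASCII 'A')
]

def remap(code):
    """Map a document character number to a BASESET position.
    Returns None if the character is UNUSED or not covered by any entry."""
    for doc_start, count, base_start in DESCSET:
        if doc_start <= code < doc_start + count:
            if base_start is None:
                return None                      # declared UNUSED
            return base_start + (code - doc_start)
    return None                                  # not mapped at all

print(remap(193))  # 65   -- EBCDIC 'A' remapped to ASCII 'A'
print(remap(0))    # None -- excluded by the UNUSED range
```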
So while people nowadays agree on which CCS to use, there's still discussion about which *part* of that CCS should be included in XML. Maybe XML needs a more flexible solution? I see three aspects in this:

1. Which CCS is used?
2. Which subset of the CCS is used?
3. Which algorithm is used to encode character numbers into binary sequences?

As far as I'm concerned, it's a good thing that XML clearly specifies the unconditional use of Unicode as its CCS. By doing so, XML removes one level of complexity and most of the character conversion headaches. The third aspect, if I'm not mistaken, is exactly what is specified by the 'encoding' attribute in the XML declaration. That is good too.

However, some want more characters in XML, while others don't. Perhaps we can allow for both by letting documents declare their own subset of Unicode?

    <?xml version="1.0" encoding="iso-8859-1"?>
    <?xml-characters plain="add_nel.xml" charref="add_c0.xml"?>
    <doc>
      <p><!-- Unicode characters, some not standard in XML --></p>
    </doc>

The PI would point to one or two files that (one way or another) specify a subset of Unicode. The 'plain' subset is for characters that may be written directly (i.e. it acts as a replacement for the 'Char' production in the specification). The 'charref' subset is for characters that may be represented as character references.

I need help in understanding the implications of this solution. Would it break something fundamental?

Gustaf
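To make the 'plain' vs. 'charref' distinction concrete, here's a rough Python sketch. The XML10_CHAR ranges are the 'Char' production from the XML 1.0 Recommendation; the classify function is hypothetical, since the proposal doesn't specify a format for the subset files (I represent each subset as a list of code-point ranges here):

```python
# The 'Char' production from XML 1.0:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML10_CHAR = [(0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
              (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]

def allowed(code, ranges):
    """True if code point falls in one of the given (lo, hi) ranges."""
    return any(lo <= code <= hi for lo, hi in ranges)

def classify(code, plain_ranges, charref_ranges):
    """Hypothetical check under the proposed xml-characters PI:
    'plain'   -> may be written directly in the document
    'charref' -> may only appear as a character reference (&#n;)
    None      -> excluded entirely"""
    if allowed(code, plain_ranges):
        return 'plain'
    if allowed(code, charref_ranges):
        return 'charref'
    return None

print(allowed(0x41, XML10_CHAR))  # True: 'A' is a legal XML 1.0 character
print(allowed(0x1, XML10_CHAR))   # False: C0 control, excluded in XML 1.0

# With a charref subset adding the C0 controls (the add_c0.xml idea):
print(classify(0x1, XML10_CHAR, [(0x1, 0x8)]))  # charref
```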