
Specifying a Unicode subset


One thing I remember from SGML is the flexibility it allows in defining the
character repertoire, and even in mapping characters from a BASESET to a
DESCSET.
While there are many longtime SGML users here, there are probably many
without this experience too, so here's a quick review:

In the SGML declaration (a file separate from the document and the DTD,
holding the settings for a particular application), you first declare a
BASESET that closely resembles the characters you'll use. The BASESET is
identified by a name that the system understands:

BASESET "ISO 646:1983//CHARSET ..."

The information carried in this string is a numbered character repertoire
(a.k.a. coded character set, or CCS). ASCII is one numbered character
repertoire, where the number 65 is assigned to the character 'A'. EBCDIC is
another, where the character 'A' is assigned the number 193.
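
As a quick check of the two numbers above, here is a small Python sketch. It
assumes that Python's `cp037` codec (one common EBCDIC code page; EBCDIC has
several variants) is an acceptable stand-in for "EBCDIC".

```python
# Look up the number assigned to 'A' in two coded character sets.
# Assumption: cp037 (EBCDIC US/Canada) stands in for "EBCDIC" here.
ascii_number = "A".encode("ascii")[0]
ebcdic_number = "A".encode("cp037")[0]

print(ascii_number, ebcdic_number)  # 65 193
```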

In a DESCSET you map characters encountered in the document to positions in
the BASESET. So if the parser reads a document that uses EBCDIC and
encounters a character numbered 193, that character may be mapped
automatically to 65, if your tools prefer ASCII:

DESCSET   193     1     65

This means you map 1 character in the document, starting at position 193,
to character position 65 in the BASESET. You can map several characters at
the same time by increasing the number in the middle. The last number may
be set to 'UNUSED' to indicate that the parser should exclude characters
with these numbers:

DESCSET     0     9     UNUSED  -- 0 to 8 are not used --
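
The two DESCSET lines above can be read as (start, count, target) triples.
The following Python sketch shows that reading only; it is not how an SGML
parser is actually implemented, and the names `build_descset` and
`map_character` are made up for the illustration.

```python
UNUSED = None  # marker for document positions the parser should reject


def build_descset(entries):
    """Expand (start, count, base_start) triples into a lookup table."""
    table = {}
    for start, count, base_start in entries:
        for offset in range(count):
            if base_start is UNUSED:
                table[start + offset] = UNUSED
            else:
                table[start + offset] = base_start + offset
    return table


def map_character(table, number):
    """Return the BASESET position for a document character number."""
    if number not in table:
        raise KeyError(f"character number {number} is not described")
    target = table[number]
    if target is UNUSED:
        raise ValueError(f"character number {number} is declared UNUSED")
    return target


descset = build_descset([
    (193, 1, 65),    # DESCSET 193 1 65
    (0, 9, UNUSED),  # DESCSET 0 9 UNUSED (positions 0 to 8)
])

print(map_character(descset, 193))  # 65
```

Asking for position 0 through 8 raises an error in this sketch, which is the
"exclude" behaviour described for UNUSED.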

Today, everyone seems to support the idea of one true CCS (Unicode).
Therefore, with XML we don't have the kind of problem illustrated in the
first DESCSET example; a character number can have only one meaning in XML.
However, there's no way to specify which characters to include or exclude
in XML, as illustrated in the second example.
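
To make the fixed character set concrete, here is a small Python sketch
using the standard library's `xml.etree.ElementTree` (which follows XML 1.0).
The helper name `parses` is invented for the example. A reference to
character number 65 is accepted, while a reference to number 1 is rejected,
and the document has no means to change that.

```python
import xml.etree.ElementTree as ET


def parses(document):
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


print(parses("<doc>&#65;</doc>"))  # character 'A', allowed in XML 1.0
print(parses("<doc>&#1;</doc>"))   # C0 control, outside the XML 1.0 Char set
```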

With XML 1.1 (here's my point), there's a proposal to include more
characters from Unicode in XML. So while people nowadays agree on which CCS
to use, there's still discussion about which *part* of that CCS should be
included in XML. Maybe XML needs a more flexible solution?

I see three aspects in this:

1. Which CCS is used?
2. Which subset from the CCS is used?
3. Which algorithm is used to encode character numbers as binary sequences?

As far as I'm concerned, it's a good thing that XML clearly specifies the
unconditional use of Unicode as its CCS. By doing so, XML removes one level
of complexity and most of the character conversion headaches.

The third aspect, if I'm not mistaken, is exactly what is specified in the
'encoding' attribute in the XML declaration. That is good too.
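
The separation between the character number and its binary form can be
illustrated with a short Python sketch. The character chosen (U+00E9) and the
three encodings are just examples; the number stays the same while the bytes
differ.

```python
ch = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE
number = ord(ch)  # the Unicode character number, independent of encoding

# The same character number encoded by three different algorithms.
encoded = {name: ch.encode(name) for name in ("iso-8859-1", "utf-8", "utf-16-be")}

print(number)
for name, data in encoded.items():
    print(name, data.hex())
```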

However, some want more characters in XML, while others don't want them.
Perhaps we can allow for both by letting documents declare their own subset
of Unicode?

<?xml version="1.0" encoding="iso-8859-1"?>
<?xml-characters plain="add_nel.xml" charref="add_c0.xml"?>
<doc>
  <p><!-- Unicode characters, some not standard in XML --></p>
</doc>

The PI would point to one or two files that (one way or another) specify a
subset of Unicode. The 'plain' subset is for characters that may be written
directly (i.e. it acts as a replacement for the 'Char' production in the
specification). The 'charref' subset is for characters that may be
represented as character references.
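
As a rough sketch of what a processor might do with the two subsets, the
Python below checks raw text against a 'plain' set and a 'charref' set. All
of it is hypothetical: the range-list representation, the function names, and
the choice of adding the C0 controls to the 'charref' set are assumptions
made for the example, and nothing here defines what the referenced files
would contain. Only the XML 1.0 'Char' ranges are taken from the
specification.

```python
import re

# XML 1.0 'Char' production, as inclusive code point ranges.
XML10_CHAR = [(0x9, 0xA), (0xD, 0xD), (0x20, 0xD7FF),
              (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]
# Hypothetical extra ranges: C0 controls not already in XML 1.0 'Char'.
C0_CONTROLS = [(0x1, 0x8), (0xB, 0xC), (0xE, 0x1F)]

CHARREF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def in_subset(code_point, ranges):
    """True if the code point falls inside any of the inclusive ranges."""
    return any(low <= code_point <= high for low, high in ranges)


def check(text, plain, charref):
    """Return a list of problems found in raw document text."""
    problems = []
    for match in CHARREF.finditer(text):
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if not in_subset(cp, charref):
            problems.append(f"U+{cp:04X} not allowed as a character reference")
    # Characters written directly: everything outside character references.
    for ch in CHARREF.sub("", text):
        if not in_subset(ord(ch), plain):
            problems.append(f"U+{ord(ch):04X} not allowed as a plain character")
    return problems


strict = check("<p>&#1;</p>", XML10_CHAR, XML10_CHAR)
relaxed = check("<p>&#1;</p>", XML10_CHAR, XML10_CHAR + C0_CONTROLS)
print(strict)
print(relaxed)
```

With the XML 1.0 set used for both subsets, the reference `&#1;` is reported;
with the C0 controls added to the 'charref' subset only, the same document
passes, while a literal U+0001 would still be reported.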

I need help in understanding the implications of this solution. Would it
break something fundamental?

Gustaf


