Re: About sml and internationalization

From: nisse@l... (Niels Möller)
To: Sean McGrath <digitome@i...>
Date: 29 Nov 1999 17:06:18 +0100


Sean McGrath <digitome@i...> writes:

> I am thinking about the issue to do with allowing/disallowing
> sets of Unicode characters in element type names as per XML
> 1.0.
> 
> If SML has very few special tokens
> e.g. "<", "&" and whitespace, what would happen
> if any character outside this teeny weeny set is
> allowed in an element type name.

I would say this is the way to go. And I have seen it done before,
both with eight-bit charsets like latin1 and with unicode.

It gives people the ability to shoot themselves in the foot by using
strange characters (my favourite is using non-breakable space in
variable names in emacs lisp). But I still think it is the way to go:
The parser and language can define a small set of characters as
special, and just pass on whatever is between those special characters
to the application.
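
As a minimal sketch of that idea (not SML's actual syntax, which is not
settled here): the set of special characters below is an assumption, and
scan_name is a hypothetical helper. Everything that is not special is
taken as part of the name and handed on unchanged.

```python
# Hypothetical set of special characters; the real set would be fixed by
# the language definition. ">" is included here purely as an assumption.
SPECIAL = frozenset("<>& \n")


def scan_name(text, start=0):
    """Return (name, next_index) for the run of non-special characters
    beginning at start. The parser does not inspect the name at all."""
    end = start
    while end < len(text) and text[end] not in SPECIAL:
        end += 1
    return text[start:end], end
```

Note that a non-breakable space is not in the special set, so
scan_name("a\u00a0b c") yields a name containing it. That is the foot-gun
mentioned above, and it is the application's business to reject it if it
cares.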

If you think about it this way, most of the charset considerations can
be removed from the parser. Treat the input as a sequence of
non-negative integers (which may be 7, 8 or 36 bits wide, depending on
the application; if you think in C++, the parser could be a template
parameterized on the character type). If an application needs to
handle several charsets, it can use something like a content-type:
text/sml; charset = iso-8859-2 header to convert the input into
unicode before feeding it into the parser.
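
A rough sketch of this split, in Python rather than as a C++ template.
The names to_code_points and split_tokens are made up for illustration,
and the special set is again an assumption. The application does the
charset conversion; the tokenizer only ever sees integers and compares
them against a few ascii values.

```python
# Special characters as plain ascii values (an assumed set).
LT, GT, AMP, SPC, NL = 0x3C, 0x3E, 0x26, 0x20, 0x0A
SPECIAL = frozenset({LT, GT, AMP, SPC, NL})


def to_code_points(raw, charset):
    """Application side: decode bytes using the charset named in the
    content-type header and return a list of unicode code points."""
    return [ord(ch) for ch in raw.decode(charset)]


def split_tokens(codes):
    """Parser side: split a sequence of non-negative integers into
    special values and tuples of consecutive non-special values."""
    tokens, run = [], []
    for c in codes:
        if c in SPECIAL:
            if run:
                tokens.append(tuple(run))
                run = []
            tokens.append(c)
        else:
            run.append(c)
    if run:
        tokens.append(tuple(run))
    return tokens
```

The same split_tokens works whether it is fed raw 7-bit or 8-bit values
(for example list(b"<a b>")) or decoded code points, since it never looks
at anything but the special values.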

One could define the special characters more abstractly, and leave it
to the application to tell the parser how a "<" is represented today,
but I think that's overabstracting things. Using plain ascii values
(possibly embedded into an ascii superset like unicode or latin-2)
should be good enough.

This line of thinking also means that "whitespace", as far as the
parser is concerned, should be limited to a few ascii characters. SPC
and NL ought to be enough. To keep with tradition, perhaps TAB and CR
as well. Having the parser recognize all unicode whitespace characters
as whitespace adds some complexity. (There are 5 spacing control
characters in traditional ASCII, plus ordinary space, non-breakable
space (in latin-x and unicode), and an additional 18 in the rest of
unicode, i.e. 25 in all.)
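
To illustrate how small the parser's notion of whitespace can be, here
is a sketch with the four ascii characters named above. The names
PARSER_WHITESPACE and is_parser_whitespace are hypothetical.

```python
# SPC, NL, TAB and CR as plain ascii values; nothing else counts as
# whitespace as far as the parser is concerned.
PARSER_WHITESPACE = frozenset({0x20, 0x0A, 0x09, 0x0D})


def is_parser_whitespace(code_point):
    """True only for the four ascii whitespace characters."""
    return code_point in PARSER_WHITESPACE
```

Non-breakable space (0xA0) and the other unicode spaces fall through as
ordinary characters, so no unicode tables are needed in the parser.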

/Niels

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To unsubscribe, mailto:majordomo@i... the following message;
unsubscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)

Follow-Ups:
- DTD.com - new repository
  - From: Avi Rappoport <xml@s...>

References:
- About sml and internationalization
  - From: "Didier PH Martin" <martind@n...>
- RE: About sml and internationalization
  - From: Sean McGrath <digitome@i...>

