
Re: About sml and internationalization

  • From: nisse@l... (Niels Möller)
  • To: Sean McGrath <digitome@i...>
  • Date: 29 Nov 1999 17:06:18 +0100

Sean McGrath <digitome@i...> writes:

> I am thinking about the issue to do with allowing/disallowing
> sets of Unicode characters in element type names as per XML
> 1.0.
> 
> If SML has very few special tokens
> e.g. "<", "&" and whitespace, what would happen
> if any character outside this teeny weeny set is
> allowed in an element type name?

I would say this is the way to go. And I have seen it done before,
both with eight-bit charsets like latin1 and with unicode.

It gives people the ability to shoot themselves in the foot by using
strange characters (my favourite is using non-breakable space in
variable names in emacs lisp). But I still think it is the way to go:
The parser and language can define a small set of characters as
special, and just pass on whatever is between those special characters
to the application.
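
A minimal sketch of that idea in Python, to make it concrete. The
names SPECIAL, WHITESPACE and tokenize are made up for illustration,
and the four-character whitespace set is an assumption, not anything
SML defines. Everything outside the small special set is handed on
unexamined:

```python
# Hypothetical special set: the two markup characters plus a small
# ASCII whitespace set. Nothing else is ever inspected.
SPECIAL = {"<", "&"}
WHITESPACE = {" ", "\n", "\t", "\r"}


def tokenize(text):
    """Yield each special character as its own token and every other
    run of characters, unexamined, as one opaque token."""
    run = []
    for ch in text:
        if ch in SPECIAL or ch in WHITESPACE:
            if run:
                # A special character or whitespace ends the current run.
                yield "".join(run)
                run = []
            if ch in SPECIAL:
                yield ch
        else:
            run.append(ch)
    if run:
        yield "".join(run)
```

With this sketch, a non-breakable space (U+00A0) is not special, so
list(tokenize("<na\u00efve\u00a0name &x")) keeps "na\u00efve\u00a0name"
together as a single token, which is exactly the foot-shooting case
mentioned above.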

If you think about it this way, most of the charset considerations can
be removed from the parser. Treat the input as a sequence of
non-negative integers (which may be 7, 8 or 36 bits wide, depending on
the application; if you think in C++, the parser could be a template
parameterized on the character type). If an application needs to
handle several charsets, it can use something like a content-type:
text/sml; charset = iso-8859-2 header to convert the input into
unicode before feeding it into the parser.
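
As a rough illustration of that conversion step, here is a Python
sketch. The helper names charset_from_content_type and to_code_points
are hypothetical, the header parsing is deliberately naive, and the
"ascii" default is an assumption. The application decodes the bytes
according to the header and hands the parser nothing but integers:

```python
from typing import List


def charset_from_content_type(value: str, default: str = "ascii") -> str:
    """Pull the charset parameter out of a content-type value such as
    'text/sml; charset = iso-8859-2' (naive parsing, sketch only)."""
    for param in value.split(";")[1:]:
        name, _, val = param.partition("=")
        if name.strip().lower() == "charset":
            return val.strip().strip('"')
    return default


def to_code_points(raw: bytes, content_type: str) -> List[int]:
    """Decode raw bytes per the declared charset and return the input
    as a sequence of non-negative integers (unicode code points)."""
    text = raw.decode(charset_from_content_type(content_type))
    return [ord(ch) for ch in text]
```

The parser proper then only ever compares these integers against its
few special values (0x3C for "<", 0x26 for "&", and so on) and never
needs to know which charset the bytes arrived in.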

One could define the special characters more abstractly, and leave it
to the application to tell the parser how an "<" is represented today,
but I think that's overabstracting things. Using plain ascii values
(possibly embedded into an ascii superset like unicode or latin-2)
should be good enough.

This line of thinking also means that "whitespace", as far as the
parser is concerned, should be limited to a few ascii characters. SPC
and NL ought to be enough. To keep with tradition, perhaps TAB and CR
as well. Having the parser recognize all unicode whitespace characters
as whitespace adds some complexity. (There are 5 spacing control
characters in traditional ASCII, an ordinary space, a non-breakable
space (in latin-x and unicode), and an additional 18 in the rest of
unicode, i.e. 25 in all.)
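
To show how little the parser needs under this proposal, a small
Python sketch (PARSER_WHITESPACE and is_parser_whitespace are
illustrative names, and including TAB and CR is the optional
"tradition" variant):

```python
# The parser's entire notion of whitespace: SPC, NL, TAB, CR,
# expressed as plain ASCII code points.
PARSER_WHITESPACE = frozenset({0x20, 0x0A, 0x09, 0x0D})


def is_parser_whitespace(code_point: int) -> bool:
    """True only for the four ASCII characters the parser treats as
    whitespace; every other code point is ordinary content."""
    return code_point in PARSER_WHITESPACE
```

For comparison, Python's own str.isspace() follows the wider unicode
notion: chr(0xA0).isspace() is true for the non-breakable space, while
is_parser_whitespace(0xA0) is false, so that character would simply be
passed on to the application as part of a name.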

/Niels

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To unsubscribe, mailto:majordomo@i... the following message;
unsubscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)


