
Re: Tokenizer question


From: "zhengyu" <zhengyu@a...>


> I was reading W3C documents earlier today. Boy, how complicated the
> character-set definitions are!!

> I can't help but wonder: does anyone really bother implementing all of
> these in their tokenizer? If they really do, how incredibly slow is it
> going to be?

Speed is not the only criterion for what makes a good markup language.

Under the XML rules, you only need to look for whitespace or delimiters
when scanning incoming text to find the end of a name.  You never need
to check the characters of an end-tag: you just need to match them
against the characters of the start-tag.  Most documents use only ASCII
or Latin-1 characters for markup, so these need only a range test
(<= 0xFF) and a lookup of a single entry in a 256-entry table, and
chances are that much of the table will fit into a CPU's cache and so
not really cost that much.  It is prudent to disallow characters that
can be used as delimiters in other languages (of course <, >, &, %, ",
', ?, / for XML, and = for URLs, though the horse has bolted on -, :, .
and _), and to disallow digits at the start of a name, so you have to
test for those characters in the ASCII range anyway.
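
The range test plus table lookup described above might look roughly
like the following.  This is only a sketch: the names (LATIN1_TABLE,
scan_name, slow_path_is_name_char) are made up for illustration, the
table covers the XML 1.0 name characters up to U+00FF, and the slow
path for characters above U+00FF is a crude stand-in, not the real
XML 1.0 tables.

```python
import unicodedata

NAME_START = 1  # character may begin a name
NAME_CHAR = 2   # character may appear after the first position

def build_latin1_table():
    """Build a 256-entry classification table for U+0000..U+00FF."""
    table = [0] * 256
    # Letters in the ASCII and Latin-1 range that XML 1.0 allows in names.
    letters = [(0x41, 0x5A), (0x61, 0x7A), (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0xFF)]
    for lo, hi in letters:
        for cp in range(lo, hi + 1):
            table[cp] = NAME_START | NAME_CHAR
    for ch in "_:":
        table[ord(ch)] = NAME_START | NAME_CHAR
    # Digits, '.', '-' and the middle dot may not start a name.
    for ch in "0123456789.-\u00b7":
        table[ord(ch)] = NAME_CHAR
    return table

LATIN1_TABLE = build_latin1_table()

def slow_path_is_name_char(ch):
    # Stand-in for the full rules above U+00FF (an assumption, NOT the
    # real XML 1.0 rule): accept anything Unicode classes as a letter.
    return unicodedata.category(ch).startswith("L")

def scan_name(text, pos=0):
    """Return the index just past the name starting at pos (pos if none)."""
    end = pos
    need = NAME_START
    while end < len(text):
        cp = ord(text[end])
        if cp <= 0xFF:
            ok = LATIN1_TABLE[cp] & need   # one range test, one table lookup
        else:
            ok = slow_path_is_name_char(text[end])
        if not ok:
            break
        need = NAME_CHAR                   # later positions allow more characters
        end += 1
    return end
```

For input that stays in ASCII or Latin-1, the slow path is never
reached, which is the point being made above.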

So actually the XML 1.0 name rules need cause no performance penalty
for people who are just using ASCII or Latin-1 characters in names.
If there is a penalty, it comes from an implementation decision.
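
As an illustration of one such implementation decision, the end-tag
remark can be sketched as below: the end-tag is compared against the
name already validated in the start-tag, so no per-character
classification is needed.  The function name match_end_tag and its
return convention are hypothetical, chosen only for this example.

```python
def match_end_tag(text, pos, open_name):
    """If text[pos:] is the end-tag for open_name, return the index
    just past '>'; otherwise return -1."""
    if not text.startswith("</", pos):
        return -1
    pos += 2
    # Plain string comparison against the start-tag name: the characters
    # were already checked when the start-tag was scanned.
    if not text.startswith(open_name, pos):
        return -1
    pos += len(open_name)
    # Optional whitespace is allowed before the closing '>'.
    while pos < len(text) and text[pos] in " \t\r\n":
        pos += 1
    if pos < len(text) and text[pos] == ">":
        return pos + 1
    return -1
```

A longer name such as "parax" is rejected when "para" is open because
the character after the matched prefix is neither whitespace nor ">".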

And for people using characters outside that range, if they are
using Chinese characters, then their names probably contain about
half as many characters anyway, so the performance impact of
testing each character is relatively smaller.

What do you gain by these tests?

Here are five things:

1) Robustness by detecting some kinds of encoding errors
   - see http://www.topologi.com/public/XML_Naming_Rules.html

2) Baseline readability
   - no non-graphical characters are allowed, so you won't need a 
  hex editor to view what your names actually are. (Normalization
  is also appropriate for XML documents for the same reason.)

3) Near compatibility with the Unicode Consortium's guidelines on
   characters suitable for identifiers.  As programming languages implement
   these guidelines more, XML names can be used as tokens in
   programming languages.

4) Accessibility.  Symbols and marks typically have no "reading" in
  speech synthesizers or Braille readers, so allowing such characters
  creates a disability where none needs to exist.  

5) A clear message to implementers that if they do not accept
characters outside ASCII in XML names, they do not conform.
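
To make point 1 concrete, here is a small sketch of one kind of
encoding error that a name check can catch: UTF-8 bytes wrongly decoded
as Latin-1.  The example is an assumption chosen for illustration and
is not taken from the page linked above; the helper names are
hypothetical, and only characters up to U+00FF are checked.

```python
# Letters in the ASCII and Latin-1 range that XML 1.0 allows in names.
LATIN1_NAME_RANGES = [(0x41, 0x5A), (0x61, 0x7A), (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0xFF)]

def is_latin1_name_char(ch):
    """True if ch (at or below U+00FF) may appear inside an XML 1.0 name."""
    if ch in "_:.-\u00b7" or "0" <= ch <= "9":
        return True
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in LATIN1_NAME_RANGES)

def first_suspect_index(name):
    """Index of the first character at or below U+00FF that is not a
    name character, or -1 if there is none."""
    for i, ch in enumerate(name):
        if ord(ch) <= 0xFF and not is_latin1_name_char(ch):
            return i
    return -1

intended = "r\u00e9sum\u00e9"                            # the name "résumé"
misdecoded = intended.encode("utf-8").decode("latin-1")  # wrong decoder applied

print(first_suspect_index(intended))    # no suspect character
print(first_suspect_index(misdecoded))  # position of the stray U+00A9
```

The misdecoded name contains U+00A9 (the copyright sign), which is not
a name character, so the error surfaces at the name check instead of
passing through silently.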

So the rules provide a safety net, and then best practices can be
followed for the particular names chosen: for example, to
use names taken from a single natural language.

Cheers
Rick Jelliffe
