Re: Tokenizer question
From: "zhengyu" <zhengyu@a...>

> I was reading W3C documents early today. Boy, how complicated the
> character-set definitions are!!
> I can't help but wondering, does anyone really bother implementing all these
> into their tokenizer at all, if they
> really do, how incredibly slow it is going to be?

Speed is not the only criterion for what makes a good markup language.

In the XML rules, you only need to look for whitespace or delimiters to scan incoming text to find a name. You never need to check the characters of an end-tag: you just need to match them against the characters of the start-tag.

Most documents use only ASCII or Latin-1 characters for markup, so each character needs only a range test (<= xFF) and a lookup of a single entry in a 256-entry table. Chances are that much of the table will fit into the CPU's cache, so the lookup does not really cost that much.

It is prudent to disallow characters that can be used as delimiters in other languages (of course <, >, &, %, ", ', ?, / for XML, and = for URLs, though the horse has bolted on -, :, . and _) and for digits, so you have to test for those characters in the ASCII range anyway.

So the XML 1.0 name rules need cause no performance penalty for people who use only ASCII or Latin-1 characters in names. If they do, it is an implementation decision. And for people using characters outside that range: if they are using Chinese characters, they are probably using half the number of characters anyway, so the performance impact of testing characters is relatively less.

What do you gain by these tests? Here are five things:

1) Robustness, by detecting some kinds of encoding errors - see http://www.topologi.com/public/XML_Naming_Rules.html

2) Baseline readability - no non-graphical characters are allowed, so you won't need a hex editor to view what your names actually are. (Normalization is also appropriate for XML documents for the same reason.)

3) Near compatibility with the Unicode Consortium's guidelines on characters suitable for identifiers. As programming languages implement these guidelines more, XML names can be used as tokens in programming languages.

4) Accessibility. Symbols and marks typically have no "reading" in speech synthesizers or Braille readers, so allowing such characters creates a disability where none needs to exist.

5) A clear message to implementers that if they do not accept characters outside ASCII in XML names, they do not conform.

So the rules provide a safety net, and then best practices can be followed for the particular names chosen: for example, to use names taken from a single natural language.

Cheers
Rick Jelliffe
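To make the table-lookup argument in the message above concrete, here is a minimal Python sketch of a name scanner with a fast path for code points up to xFF. It is an illustration only, not any real parser's code. The names NAME_START, NAME_CHAR, scan_name and end_tag_matches are hypothetical. The Latin-1 ranges reflect the XML 1.0 name productions as I understand them, and the slow path for characters above xFF is a deliberately simplified stand-in based on Unicode general categories, not the character classes the spec actually lists.

```python
import unicodedata

def _latin1_table(extra=""):
    """Build a 256-entry boolean table for code points 0x00-0xFF."""
    table = [False] * 256
    # ASCII letters plus the Latin-1 letter ranges (0xD7 and 0xF7 are symbols).
    for lo, hi in ((0x41, 0x5A), (0x61, 0x7A), (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0xFF)):
        for cp in range(lo, hi + 1):
            table[cp] = True
    for ch in "_:" + extra:
        table[ord(ch)] = True
    return table

NAME_START = _latin1_table()                     # legal as the first character
NAME_CHAR = _latin1_table("0123456789-.\u00b7")  # legal in later positions

def _slow_path(ch, first):
    # ASSUMPTION: simplified stand-in for the full XML character classes.
    cat = unicodedata.category(ch)
    if cat[0] == "L":
        return True
    return not first and cat in ("Mn", "Mc", "Nd")

def is_name_char(ch, first=False):
    cp = ord(ch)
    if cp <= 0xFF:  # one range test, then one table lookup
        return (NAME_START if first else NAME_CHAR)[cp]
    return _slow_path(ch, first)

def scan_name(text, pos=0):
    """Return the end index of the name starting at pos, or pos if none starts there."""
    if pos >= len(text) or not is_name_char(text[pos], first=True):
        return pos
    end = pos + 1
    while end < len(text) and is_name_char(text[end]):
        end += 1
    return end

def end_tag_matches(text, pos, open_name):
    """Check an end-tag name by plain comparison with the start-tag name."""
    end = pos + len(open_name)
    return (text.startswith(open_name, pos)
            and end < len(text) and text[end] in " \t\r\n>")
```

For example, scan_name("para id='x'>") stops at the space after "para", and end_tag_matches("</para>", 2, "para") compares the end-tag against the already validated start-tag name without running any character-class tests.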