
Re: Blueberry is not "closed"

  • From: Toby Speight <streapadair@g...>
  • To: XML developers' list <xml-dev@l...>
  • Date: Wed, 25 Jul 2001 13:48:01 +0100

0> In article <5.1.0.14.2.20010724225913.020f5760@p...>,
0> Tim Bray <URL:mailto:tbray@t...> ("Tim") wrote:

Tim> Ouch, it's worse than I thought.  One of the "nice" things about
Tim> the UTF16 surrogate system is that if you don't have the apparatus
Tim> around to deal with astral-plane chars, you can just obliviously
Tim> treat 'em as pairs of characters you don't know.

Except that you have to be careful about how you count "characters".
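To illustrate the counting problem, here is a minimal Java sketch. The class and method names are made up for this example, and `String.codePointCount` only arrived in Java 5, after this message was written, so treat it as a present-day illustration rather than something available at the time.

```java
final class SurrogateCount {
    // Number of UTF-16 code units; this is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points; a well-formed surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // U+10000 written as its surrogate pair, followed by 'a'.
        String s = "\uD800\uDC00a";
        System.out.println(codeUnits(s));   // prints 3
        System.out.println(codePoints(s));  // prints 2
    }
}
```

Code that treats the pair obliviously sees three "characters" here, while anything counting in XML's terms sees two.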


Tim> But XML carefully rules out that possibility, prod [2] for "Char"
Tim> rules excludes surrogate blocks.  In retrospect, maybe that was
Tim> dumb?

In a Java environment, it's sensible to pass around surrogates in String
objects - think of it as using UTF-16 as the internal representation,
which is trivial if the input is UTF-16 and (potentially) less trivial
otherwise.

Production [2] doesn't say anything about what happens internally, of
course, as this is external syntax - it rules out numeric character
references to the surrogate area, or surrogate characters in UCS-2,
etc.  This actually makes things easier for a Java implementation,
since whenever you see a character from the surrogate area, you know
it's being used as one half of a surrogate pair.
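As a rough sketch of what production [2] permits, the XML 1.0 ranges can be written as a check on a whole code point. The class and method names are hypothetical; only the ranges come from the XML 1.0 Recommendation.

```java
final class XmlCharCheck {
    // XML 1.0 production [2]:
    // Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    // The gap between #xD7FF and #xE000 is the surrogate area.
    static boolean isXmlChar(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
            || (codePoint >= 0x20 && codePoint <= 0xD7FF)
            || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
            || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    public static void main(String[] args) {
        System.out.println(isXmlChar(0xD800));   // prints false: a surrogate is never a Char
        System.out.println(isXmlChar(0x10000));  // prints true: the character itself is allowed
    }
}
```

Under this reading a numeric character reference to 0xD800 is rejected, while the supplementary character U+10000 is accepted however the encoding carries it.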


Tim> Which means in effect that Dave's right, basically you just totally
Tim> can't use a java's String or char in dealing with Blueberry docs.
Tim> Or am I missing something... please?

It seems that you might need to combine surrogates at least temporarily
whilst parsing (or write your parser so that UTF-16 state is taken into
account), but I don't think the parser would need to retain the
UCS-4 form, and it seems okay to pass UTF-16 to downstream components
(as long as you don't split surrogate pairs!).
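A minimal sketch of that idea, assuming the input has already been decoded to UTF-16: walk the code units, combine each surrogate pair only long enough to test it, and leave the text itself as UTF-16. The class and method names are invented for illustration, the Char ranges are those of XML 1.0, and the `Character` helpers used here date from Java 5, later than this message.

```java
final class PairCombiner {
    // XML 1.0 production [2], applied to a whole code point.
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns the index of the first offending code unit, or -1 if the
    // whole sequence is acceptable. Nothing is copied or converted: the
    // combined code point exists only inside the loop.
    static int firstBadIndex(CharSequence utf16) {
        int i = 0;
        while (i < utf16.length()) {
            char c = utf16.charAt(i);
            int cp;
            if (Character.isHighSurrogate(c)) {
                if (i + 1 >= utf16.length()
                        || !Character.isLowSurrogate(utf16.charAt(i + 1))) {
                    return i; // high surrogate without its partner
                }
                cp = Character.toCodePoint(c, utf16.charAt(i + 1));
            } else if (Character.isLowSurrogate(c)) {
                return i; // low surrogate with no preceding high surrogate
            } else {
                cp = c;
            }
            if (!isXmlChar(cp)) {
                return i;
            }
            i += Character.charCount(cp); // advance by 1 or 2 code units
        }
        return -1;
    }
}
```

A caller might run `PairCombiner.firstBadIndex(text)` once per buffer and hand the unchanged `String` downstream when it returns -1. Buffers that end between the two halves of a pair would need extra state, which this sketch does not handle.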


Tim> Or re-open the door to the UTF-16 hack by putting the surrogate
Tim> blocks back into [2] as part of the Blueberry update.

Ugh!


I knew a 16-bit char type would be a nuisance before too long!
