
Lexical vs value spaces (re: Binary content and allowed characters in XM


Nicolas LEHUEN wrote:

> I don't think readability alone is a sufficient reason to forbid binary
> content from appearing in an XML document.

I agree: there is a much better and simpler reason to forbid binary 
content from XML documents ;=) ...

> What defines the set of allowed characters in XML content ? Is it technical
> reasons, or readability reasons ?

IMO, neither, but rather a fundamental design decision: an XML 
entity is Unicode text (possibly stored in another encoding) and not a 
stream of bytes.

This should be a sufficient reason to close the debate IMO!

The problem with including arbitrary binary content would not so much be 
the "control characters", but the fact that the physical value of this 
content read as bytes would change depending on the encoding used for 
the document (what if I save it as utf-16 while it has been created as 

We are using a layered model where XML is built on Unicode and that 
would be a short-circuit of the lower level...
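
A minimal sketch of the encoding problem described above, assuming a made-up one-element document in which U+00E9 stands in for a "binary" character. The element name and the two encodings chosen are illustrative only.

```python
# Assumption: U+00E9 stands in for "binary" content placed directly
# in the text of a hypothetical XML document.
text = "<data>\u00e9</data>"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no byte order mark

# Same characters, different bytes on disk.
print(utf8_bytes.hex())
print(utf16_bytes.hex())

# Decoding each form with its own encoding gives back identical text,
# which is the only thing the XML layer promises to preserve.
assert utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-le") == text
```

Anything that depends on the byte values themselves is lost as soon as the document is re-saved in another encoding.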

That being said, using XML as a serialization format for integers, 
floats or dates doesn't seem to be a problem, so why should it be for 
binary data?

The trick is just to realize that, to take a notion which I find very 
useful in W3C XML Schema, there is a decoupling between lexical and 
value spaces and to define the best lexical space for the binary content 
you want to serialize.
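
To illustrate the decoupling, here is a small sketch in which several lexical forms (strings) denote a single value. Python's own `int` and `Decimal` parsing is used as a stand-in for a schema processor, so this is only an approximation of what W3C XML Schema defines; the sample strings are made up.

```python
from decimal import Decimal

# Several lexical forms (what appears in the document) per value.
integer_forms = ["42", "+42", "042"]
decimal_forms = ["1.0", "1.00", "01.0"]

# Mapping each lexical form into the value space collapses the variants.
integer_values = {int(s) for s in integer_forms}
decimal_values = {Decimal(s) for s in decimal_forms}

print(len(integer_forms), "lexical forms ->", len(integer_values), "value")
print(len(decimal_forms), "lexical forms ->", len(decimal_values), "value")
```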

For arbitrary binary data, hex or base64 seem to be obvious choices but 
for data which is "almost text" with special "things" embedded, other 
solutions can be found.
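
A short sketch of the hex and base64 options, using only the Python standard library. The payload bytes and the `blob` element name are hypothetical.

```python
import base64

# Hypothetical binary value, including bytes that are not legal XML characters.
payload = bytes([0x00, 0x01, 0xFF, 0x10, 0x80])

# Two possible lexical spaces for the same value.
hex_form = payload.hex()
b64_form = base64.b64encode(payload).decode("ascii")

# Both forms are plain ASCII text, so they survive any document encoding.
xml_fragment = f"<blob>{b64_form}</blob>"
print(hex_form)
print(xml_fragment)

# Going back from the lexical space to the value space.
assert bytes.fromhex(hex_form) == payload
assert base64.b64decode(b64_form) == payload
```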

One of them is to serialize the "things" found in the text as elements 
(and you then have mixed content), the other is to define a specific 
lexical space for them (like "=00" or whatever). Which one you want to 
use comes back to the debate of using structured values in elements or 
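
As a sketch of the second option, here is one possible "=XX" lexical space for data that is mostly ASCII text. The function names and the exact set of escaped characters are assumptions made for illustration, not an existing convention; "=", "<", ">" and "&" are escaped as well so the result can sit in element content as-is.

```python
# Characters that pass through unescaped: printable ASCII except "=<>&".
SAFE = {chr(c) for c in range(0x20, 0x7F)} - set("=<>&")

def encode_almost_text(data: bytes) -> str:
    """Map bytes to a lexical form, writing unsafe bytes as =XX (hex)."""
    return "".join(
        chr(b) if chr(b) in SAFE else f"={b:02X}"
        for b in data
    )

def decode_almost_text(lexical: str) -> bytes:
    """Inverse mapping: turn =XX escapes back into single bytes."""
    out = bytearray()
    i = 0
    while i < len(lexical):
        if lexical[i] == "=":
            out.append(int(lexical[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(ord(lexical[i]))
            i += 1
    return bytes(out)

raw = b"name\x00value=1\x07"
encoded = encode_almost_text(raw)
print(encoded)
assert decode_almost_text(encoded) == raw
```

The readable parts stay readable, which is the advantage over hex or base64 for this kind of data.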

I think that it's important to realize that the cases where the lexical 
and value spaces are identical are fairly uncommon (except in the 
"document" world) and that for a vast majority of datatypes a coding 
needs to be performed and these spaces are different.

BTW, when you think about it, this decoupling goes beyond the XML world... 
In Europe, the Euro has already been there for a couple of years and 
what will happen in 12 days is just a harmonization of the many lexical 
spaces to cut the processing costs ;=) ...

See you in Paris for the Electronic Business Days 2002.
Eric van der Vlist       http://xmlfr.org            http://dyomedea.com
http://xsltunit.org      http://4xt.org           http://examplotron.org

