RE: Lexical vs value spaces (re: Binary content and allowed ch
>IMO, none of them, but rather a fundamental design decision: a XML
>entity is a Unicode text (eventually using another encoding) and not a
>stream of bytes.
>
>This should be a sufficient reason to close the debate IMO!

OK, that's the good reason I was waiting for :). I was kind of playing the devil's advocate here, but without knowing the proper answer :P.

>The problem with including arbitrary binary content would not so much be
>the "control characters", but the fact that the physical value of this
>content read as bytes would change depending on the encoding used for
>the document (what if I save it as utf-16 while it has been created as
>utf-8).
>
>We are using a layered model where XML is built on Unicode and that
>would be a short-circuit of the lower level...

Agreed. There would be no way for the parser to distinguish between text content and binary content, so we could expect the parser to try to decode our binary content as encoded Unicode strings, which would lead to nonsense. I get it now.

>That being said, this doesn't seem to be a problem to use XML as a
>serialization format for integers, float or dates, why should it be for
>binary data?

Well, some people aren't happy because they can't directly embed binary content within an XML document. But alas, even if they could, they would have to escape the byte sequence corresponding to '<' in the document encoding, which is sometimes unknown at the time of document creation (especially if you use a SAX or DOM API without taking care of the serialization part). In XML, you just CANNOT embed anything WITHOUT taking care of escaping the XML markup characters, which are '<', '&' and quotation marks, depending on the current parser/serializer state.
That's a direct consequence of the XML format, which uses delimiter characters; it's too bad that those delimiters come from the "useful" set of characters rather than from special control characters, which forces us to escape even simple text (well, at least technical texts with '<' inside). When you're working with text in a Unicode-aware programming language, escaping is easy, since you compare characters with '<'. If you were embedding binary data, you would have to compare your data with the bytes that '<' encodes to, which are not always known at document-building time (in UTF-16 it would be the two bytes 0x00 0x3C, or 0x3C 0x00 in little-endian order; in UTF-8 and ISO-8859-1, the single byte 0x3C).

So, since you're forced to encode your binary content into *characters* (not bytes) that will then be encoded into bytes according to the character encoding, why not use Base64? Note that there are other solutions which may be more economical [1].

>The trick is just to realize that, to take a notion which I find very
>useful in W3C XML Schema, there is a decoupling between lexical and
>value spaces and to define the best lexical space for the binary content
>you want to serialize.
>
>For arbitrary binary data, hex or base64 seem to be obvious choices but
>for data which is "almost text" with special "things" embedded, other
>solutions can be found.
>
>One of them is to serialize the "things" found in the text as elements
>(and you have then a mixed content), the other is to define a specific
>lexical space for them (like "=00" or whatever). Which one you want to
>use comes back to the debate of using structured values in elements or
>attributes.
>
>I think that it's important to realize that the cases where the lexical
>and value spaces are identical are fairly uncommon (except in the
>"document" world) and that for a vast majority of datatypes a coding
>needs to be performed and these spaces are different.

So why do people keep on insisting that their XML content be readable with vi?
Why does it matter so much for people to be able to read XML documents with inappropriate tools, when we could easily have true XML viewers? I could invent a stupid Unicode encoding that would make any XML document unreadable in vi (for example, U+0123 would be encoded as 0x32 0x10), yet perfectly correct provided that the parser has the corresponding decoder. But nobody would want to use it, because they would not be able to read it in the lexical space... We (humans) don't care about the lexical space; it's the value space that has some meaning!

Regards,
Nicolas

[1] http://www.javaworld.com/javaworld/javatips/jw-javatip117.html
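
As a rough illustration of the encoding argument made in this message, here is a minimal Python sketch. It is only an assumption-laden example, not anything from the thread: the element name `blob`, the helper names `to_xml` and `from_xml`, and the sample payload are all hypothetical. It shows that '<' maps to different bytes under different encodings, and that a Base64 lexical form lets the same binary value survive serialization in either UTF-8 or UTF-16.

```python
import base64
import xml.etree.ElementTree as ET

# The same character '<' becomes different bytes depending on the encoding.
for enc in ("utf-8", "iso-8859-1", "utf-16-be", "utf-16-le"):
    print(enc, "<".encode(enc).hex())

# Hypothetical binary payload; it contains the byte 0x3C and bytes that
# are not legal XML characters when read as text.
payload = bytes([0x00, 0x3C, 0xFF, 0x10, 0x3C])


def to_xml(data: bytes, encoding: str) -> bytes:
    """Serialize binary data as Base64 characters inside a <blob> element."""
    elem = ET.Element("blob")  # hypothetical element name
    # Base64 output is plain ASCII characters, so it is safe as XML text.
    elem.text = base64.b64encode(data).decode("ascii")
    return ET.tostring(elem, encoding=encoding)


def from_xml(doc: bytes) -> bytes:
    """Parse the document and decode the Base64 text back to bytes."""
    return base64.b64decode(ET.fromstring(doc).text)


# The serialized bytes differ per encoding, but the value is unchanged.
for enc in ("utf-8", "utf-16"):
    doc = to_xml(payload, enc)
    assert from_xml(doc) == payload
```

The serializer, not the application, decides how the Base64 characters become bytes, so the code that builds the document never needs to know the final document encoding.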