(char)0 handling proposal
Hi, I just joined the list to ask about a &#0; issue, and the first three posts I see are about it! Serendipity!

Is there any standard convention for representing a character of value 0 in XML (and other control characters)? I understand that we can't actually *have* such a character - that's why &#0; is illegal - but sometimes we want to output data that includes such characters. (I'm thinking of Java, which doesn't use the nul char as a string terminator.) Is there a convention (albeit informal) for doing this? It's not really "binary data", but the rare control characters that sometimes appear in strings that are otherwise mostly printable characters.

Below is a (short) essay on the problem, and possible solutions. I'd most welcome further comment/criticism. ;-)

Cheers,
Brendan

-----------------------------------------------------------------------------

Here's the problem: how can we represent (char)0 and control characters in XML, in a way that standard XML tools (like SAX and DOM) can read them?

Here's the proposal: encode them Java-style, like this: \u0000

Here's more detail: JSX needs to represent characters of value 0 (and other control characters), because Java permits them to occur in Strings. In practice, they rarely occur there - but they are very common in StringBuffers, for example, where they pad out the unused portion. Because JSX needs to be able to map *all* objects to XML, it needs to be able to handle this. But XML doesn't allow 0 characters - and the "&#1234;" and "&#x4D2;" character-reference syntax explicitly forbids "&#0;". Of course, JSX has the option of encoding it in any way that it can read back in - for example, at present it simply writes the control characters directly, as is. But we want it to be legitimate XML, so it can interoperate with other tools, such as SAX and DOM and XSLT and so on. This is a real issue!

Note that the problem is not exactly "binary data", like a set of pixel values for an image. For that, an array of bytes might be more appropriate.
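To make the problem concrete, here is a minimal sketch (my own, not from JSX - the class and method names are made up) showing that Java happily stores (char)0 in a String, while the XML 1.0 Char production (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD], ignoring supplementary characters) forbids it:

```java
// Sketch: Java permits (char)0 in Strings, but XML 1.0 does not
// permit it in documents, even as a character reference.
public class XmlCharCheck {

    // True iff c is allowed by the XML 1.0 Char production
    // (BMP-only sketch; supplementary characters are ignored here).
    public static boolean isXmlChar(char c) {
        return c == 0x9 || c == 0xA || c == 0xD
                || (c >= 0x20 && c <= 0xD7FF)
                || (c >= 0xE000 && c <= 0xFFFD);
    }

    public static void main(String[] args) {
        String s = "abc\u0000def";          // a perfectly legal Java String
        System.out.println(s.length());     // the nul is a real character
        System.out.println(isXmlChar(s.charAt(3))); // false: can't go in XML
        System.out.println(isXmlChar('a'));         // true
    }
}
```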
But for Strings, most of the characters are regular human-readable characters - it will usually be only a few that are control characters or (char)0 etc. Some potential solutions to encode individual (char)0 are:

(1). external unparsed entities
(2). introduce a new scheme to XML, like Java's \u0000
(3). use a different range of Unicode characters for this purpose
(4). treat the whole String as containing binary data

(1). "External unparsed entities"
---------------------------------
An external unparsed entity can appear as an attribute value, if the DTD specifies it to be so - but SAX and DOM won't know that they are supposed to include it in the document. We could have a list of all the chars, like nul, bel, etc. This is how SAX would deal with reading it: when it is notified of an external unparsed reference, it needs to read in the appropriate value (it would need a list of what they mean). A limitation is that external unparsed entities can only be referenced from within an attribute (not embedded within a String) - and furthermore, the type of the attribute must be ENTITY or ENTITIES. To use this scheme, every possible character would need an unparsed entity - since chars are two bytes, that's 2**16 or 65536 possible values!

(2). A New Scheme, like \u0000
------------------------------
Include our own proprietary encoding scheme, like \u0000 - but it's not XML.

(3). Shift range of Unicode characters
--------------------------------------
Represent the ASCII control chars (ie 0x00 - 0x1F) with chars permitted in XML (eg: 0x7F - 0x9F).

"Encodings using the *bytes* 0x7F to 0x9F aren't the issue. What counts here is the Unicode *characters* U+007F to U+009F, which are solely the control characters."
Citation: http://lists.xml.org/archives/xml-dev/200006/msg00502.html

The big problem with this approach is how to encode characters which were already in the range 0x7F - 0x9F...
it might not happen often, but a bijective (ie reversible) mapping needs to be able to handle all cases!

(4). Treat as Binary Data
-------------------------
This approach is probably more in keeping with the spirit of XML: if a String or StringBuffer etc contains *any* control characters, it is no longer character data (from the point of view of XML), but really is "binary data". Therefore, encode it as such - for example, treat it as an array of short: each char can become a short (both are stored in two bytes), with some kind of markup saying it should be converted back into an array of char. Thus, <ArrayOf-char ... /> becomes <ArrayOf-short reallyChar="true" ... />.

But this raises an interesting issue... after all, the whole ArrayOf convention is an invention of JSX - why should we worry about other XML conventions, if we are happy to make up this one? Aha! The key thing is that the ArrayOf convention is built *on top of* the XML conventions, and is consistent with them. SAX and DOM can read such documents fine, even though they don't know what to do with them - that is, it takes additional code to parse them fully (handling ArrayOf etc). Thus, the important factors in how to handle (char)0 are: build on top of the XML conventions (consistently), with a scheme that is easy to understand and easy to write code to parse and unpack fully.

Which scheme is easier to parse? Let's review the four choices in this light:

(1). External unparsed entities don't seem too bad, though they can't be embedded in Strings.
(2). The "\u0000" scheme is also not too bad: just check all char data (including Strings) for \u0000 (etc), and if present, convert to a char of that value. This may be a bit inefficient, since SAX will have already done this kind of test for & etc. It can be embedded in Strings.
(3). A shifted range is very easy to parse back, and it can be embedded in Strings.
(4).
Binary data is a little complex, and the code would need to understand how JSX handles arrays in some depth: if reallyChar's value is true, then cast the remaining attributes to char. It can't be embedded in Strings.

OK! We've looked at four different possibilities, and considered what factors are important in the choice. Here's a conclusion: it seems that \u0000 would be best, because:

- It is *obvious* what this means to any (Java-aware) human.
- It is easy to write a parser for it.
- It can be embedded in the middle of a String.
- It only affects the parts of the String that are "binary" - the rest is still rendered as perfectly readable text, instead of the whole thing being treated as binary (not one apple spoiling the lot!).
- It doesn't require any extra mucking about (like a DTD, or a strange variation on String encoding, or an initial pass over the entire String to check whether it contains any binary data, etc).

Here's a sketch of an implementation for JSX:

To encode:
(1). If the char is in the control range, convert it to hex and output exactly 4 chars... [is there an existing Java method for this?], preceded by "\u".
(2). If "\", then write "\\" - we need to escape the escape char!

To decode:
(1). If we see a "\" followed by a "u", grab the next four characters, parse them as an int, and cast to char. [is there a sign-unsigned issue here?]
(2). If we see a "\" followed by a "\", return a '\'.

We'd put these in with the same code that presently encodes and decodes the "&" etc.

This is quite exciting! As always, your thoughts are not only welcome, but actively sought and requested! Hope all this wasn't too much of an ordeal to get through!
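The encode/decode sketch above might look something like this in Java. This is a minimal sketch of my own (the class name is made up, and it only handles the 0x00-0x1F control range; real code would fold it into the existing &-escaping pass). Note the decode side also answers the bracketed question: parsing four hex digits yields at most 0xFFFF, which is a positive int, so the cast to char loses nothing.

```java
// Sketch of \u0000-style escaping for control characters in otherwise
// readable Strings, plus escaping of the backslash itself as \\.
public class NulEscape {

    public static String encode(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\') {
                out.append("\\\\");                 // escape the escape char
            } else if (c < 0x20) {                  // control range, incl. (char)0
                // %04x pads to exactly 4 hex digits
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static String decode(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char next = s.charAt(++i);
                if (next == 'u') {
                    // grab the next four hex chars, parse, and cast to char;
                    // 4 hex digits are at most 0xFFFF, so no sign trouble
                    out.append((char) Integer.parseInt(s.substring(i + 1, i + 5), 16));
                    i += 4;
                } else {
                    out.append(next);               // "\\" becomes '\'
                }
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

One design consequence worth noting: because only the control characters and the backslash are rewritten, the rest of the String survives as readable text - exactly the "not one apple spoiling the lot" property argued for above.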