Binary content and allowed characters in XML
I don't think readability alone is a sufficient reason to forbid binary content from appearing in an XML document. What defines the set of allowed characters in XML content? Is it technical reasons, or readability reasons?

The technical reasons come from the need to keep XML parsers simple to implement. A parser should be able to follow a very simple set of states and rules, essentially a finite-state automaton with very few states. This means, for example, that certain delimiter characters must be forbidden in names, attribute values or text nodes, the best example being '<'. We could remove this limitation with various tricks, but that would complicate the parser and/or the serializer. I think technical reasons alone forbid only a very small set of characters from appearing in content: it may be limited to '<', whitespace and quotation marks, depending on the state of the parser (e.g. in text nodes, whitespace and quotation marks are allowed).

Moreover, the fact that some characters have to be forbidden at all is simply a consequence of XML using delimiter characters; there are other ways of encoding XML-like labeled trees that do not use delimiters. For example, we could encode the length of each content string instead of marking its end with a state-dependent stop character. I'm not sure this would complicate the parser. Strings could then be any arbitrary byte sequence, so we could encode text as well as binary data in XML, at the cost of readability (no one wants to read all those bytes encoding string lengths).

The problem is, readability is a subjective concept. There is no character encoding that both contains all the characters required for a given language and is readable on all platforms. If you define readability as "I can read it with vi under my Unix variant", you'll have a hard time finding such an encoding.
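To make the length-prefix idea above a bit more concrete, here is a minimal sketch of what I have in mind. It is not XML and not any existing format: the `Node` type, the `encode`/`decode` names and the choice of a 4-byte big-endian length are all assumptions made up for this illustration.

```python
import struct
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """A labeled tree node; label and value are arbitrary byte strings."""
    label: bytes
    value: bytes = b""
    children: List["Node"] = field(default_factory=list)


def _put(chunk: bytes) -> bytes:
    # 4-byte big-endian length, then the raw bytes: no delimiter, no escaping.
    return struct.pack(">I", len(chunk)) + chunk


def encode(node: Node) -> bytes:
    # Layout (hypothetical): label, value, child count, then each child in turn.
    out = _put(node.label) + _put(node.value)
    out += struct.pack(">I", len(node.children))
    for child in node.children:
        out += encode(child)
    return out


def _get(data: bytes, pos: int) -> Tuple[bytes, int]:
    # Read one length-prefixed string starting at pos; return it and the new position.
    (length,) = struct.unpack_from(">I", data, pos)
    start = pos + 4
    end = start + length
    if end > len(data):
        raise ValueError("truncated input")
    return data[start:end], end


def decode(data: bytes, pos: int = 0) -> Tuple[Node, int]:
    # The reader never inspects the content bytes, so '<', quotes and NUL are all fine.
    label, pos = _get(data, pos)
    value, pos = _get(data, pos)
    (count,) = struct.unpack_from(">I", data, pos)
    pos += 4
    children = []
    for _ in range(count):
        child, pos = decode(data, pos)
        children.append(child)
    return Node(label, value, children), pos
```

With this layout, a node such as `Node(b"blob", b"<\x00\xff\"&>")` should survive `decode(encode(node))` unchanged, because the reader only counts bytes and never looks for a stop character. The price is exactly the one mentioned above: the serialized form is no longer something you would want to open in a text editor.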
Forget about English-centric character encodings. Lots of people have to encode content with weird accented characters (I'm French, so I know a bit about that), and many more people don't have a 26-character alphabet but thousands of ideograms. Chances are that an XML file containing Kanji characters is not readable in vi running on whatever Unix variant. How do you define readability in this context? How do you justify the fact that you cannot directly embed binary data within an XML document, whereas Kanji text would look like binary data to me on my Western computer?

I don't think the readability of the serialized form is so important. What matters to me is that I can correctly exchange labeled trees while keeping the serialization/parsing process simple and platform-independent. XML answers this need, as long as labels and values are 'text', whatever that means for the W3C (once again, a UTF-8 string is no longer text for me as soon as I've got French accented characters in my strings). But for binary data, I have to use tricks. I don't want to.

Regards,
Nicolas Lehuen