Re: MSXML DOM Special Chars Less Than 32
The W3C XML Core Working Group (I am not a member, but John Cowan is) has apparently been discussing these kinds of issues as part of its XML 1.1 escapade. Some interesting issues that arise out of that include:

* The RFC on MIME types talks about "textual" data rather than the text/binary distinction. So control characters in an ASCII file may be "text" to some, but they are not "textual". The debate is therefore not between XML as text or binary, but whether XML should be text-with-controls or textual, in the sense of being legible in a capable vanilla editor (or being character-by-character speakable in a speech synthesizer for the locale of that document, I guess).

* Unicode has changed since 3.0 to allocate by default the ISO 6429 control codes. In Unicode 2.x, which XML 1.0 was based on, the control range was not allocated to particular points. If XML 1.1 mandates that NEL is U+0085, then it adopts the ISO 6429 characters. However, some ISO 6429 controls do not have corresponding Unicode mappings: they are reserved only for lower-layer use, presumably in legacy systems, such as PAD, 0x80, 0x81, 0x82. Should these characters be left free and privately defined, or banned?

* For the robustness reasons I give in that note, it is highly desirable that XML 2.0 ban as many C1 (0x80-0x9F) characters as possible. However, the XML Core WG seems to be backing itself into a corner: they say they cannot deprecate or shun the C1 controls because of XML 1.0 compatibility, but also that they want to close the character repertoire issue once and for all. This in effect closes the door on improvements to repertoire and robustness in XML 2.0: they can only allow supersets. (Of course, every issue can be revisited, so I think they are fooling themselves if they think that repertoire issues can ever go away.)

From: "Michael Kay" <michael.h.kay@n...>

> I don't want to dumb XML down. But we do sometimes need to store data (e.g.
> WebDAV property values) which can potentially contain characters that are
> not permitted in XML. In fact, it's very unlikely that a WebDAV property
> value will contain such a character, but the software still needs to allow
> for the possibility.

XML has never been about guaranteed interoperability. Rather, it means that if you pick a conservative character encoding, and conservative name characters, and conservative data characters, and only use reliable URIs for system identifiers and links etc., and send standalone documents, and normalize your document correctly before you send it, you can expect your data to go through.

Around this core of expectable interoperability is a cloud of regional interoperability, where, say, people in Taiwan probably only use XML systems that support Big5, and people in Uganda are probably happy to use systems whether or not they support Big5.

It might be that people who want to exchange UTF-* with control characters are better off treated as if they are a region. So an XML processor reading a document labelled

    <?xml version="1.1" encoding="utf-8"?>

would barf if it found a C1 control (for the robustness/mislabelling reasons), but would accept C1 controls if it found

    <?xml version="1.1" encoding="utf-8-with-controls"?>

or

    <?xml version="1.1" encoding="utf-8" controls="allow" ?>

That has the advantage of turning the issue into one of labelling rather than invisible characters, with a safe default. And it would save the XML Core WG from accusations of favouring Westerners over Asians, since non-ASCII users do face robustness issues that ASCII-repertoire users do not. (And, actually, because of the Euro issue, this is now more like [English, Bahasa] users versus non-[English, Bahasa] users.)

> I don't personally see any good reason why C0 (and C1) characters shouldn't
> be permitted XML characters, with the restriction that they must be written
> as character references.

There is no reason from SGML compatibility. It would merely be an additional requirement for XML that these particular characters are only referenced, not used directly, and something that serializers should attend to.

Another alternative is to define built-in named character references (i.e. like &lt;) based on the actual control characters, so people can type

    <p>blah&BEL;blah&EOT;blah</p>

Of course, it is likely that people who want to send information using C1 controls are not actually using the ISO 6429 characters: they are using the characters for some private, proprietary or nefarious purpose rather than the public, resolved, robust, safe data interchange for which XML was created. So named character references would not really answer their needs.

Cheers
Rick Jelliffe
(Writing personally)

One point about whether XML is textual or not is whether you need an API to access, create or read the data. If you need an API, then the issue arises of who controls the API.
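
For anyone who wants to experiment with the two ideas discussed above (controls written only as character references, and C1 controls rejected unless the document is labelled as allowing them), here is a minimal Python sketch. The function names `escape_controls` and `check_c1` and the `controls` parameter are hypothetical, invented for this illustration. The `controls="allow"` label is only the proposal from this post, not XML 1.1 syntax. NUL is left alone on the assumption that it stays forbidden even as a character reference.

```python
import re

# C0 controls other than TAB, LF and CR, plus DEL and the C1 range 0x80-0x9F.
# NUL (0x00) is deliberately not matched: it is assumed to remain forbidden.
_CONTROLS = re.compile(r"[\x01-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")


def escape_controls(text: str) -> str:
    """Serializer side: write each control as a numeric character reference."""
    return _CONTROLS.sub(lambda m: "&#x{:X};".format(ord(m.group())), text)


def check_c1(text: str, controls: str = "deny") -> None:
    """Parser side: reject literal C1 controls unless the label allows them.

    `controls` stands in for the hypothetical controls="allow" label.
    The safe default is to reject.
    """
    if controls == "allow":
        return
    for offset, ch in enumerate(text):
        if 0x80 <= ord(ch) <= 0x9F:
            raise ValueError(
                "C1 control U+{:04X} at offset {}".format(ord(ch), offset)
            )


if __name__ == "__main__":
    print(escape_controls("blah\x07blah\x04blah"))
    check_c1("x\x80", controls="allow")  # labelled, so accepted
    try:
        check_c1("x\x80")  # unlabelled, so rejected
    except ValueError as err:
        print("rejected:", err)
```

The default of `"deny"` mirrors the safe default argued for above: a mislabelled legacy encoding tends to show up as stray bytes in 0x80-0x9F, so an unlabelled document fails early instead of silently carrying invisible characters.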