Re: Some comments on the 1.1 draft
From: "Alan Kent" <ajk@m...>

> To separate the two issues - I have no opinion on name characters.
> PCDATA however is different. I read through your entire post twice
> and must admit I still don't quite understand what your point is
> exactly. I *think* you might be saying "it's good to specify the
> encoding because that way it's possible to make sure characters
> not valid in that encoding are rejected." (My reading of the XML spec
> is that 0x85 is legal in the Unicode character set - that is, it's
> not marked as UNUSED in the good old SGML jargon.)

Sorry I am not being clear. I am saying that it is vital in practice that there are enough characters that are UNUSED (and characters that are NAME characters, SEPCHAR, etc.) to catch the most common mislabellings of character encoding. More than "good": vital.

It is one of the best software engineering features of XML: it can make several very difficult problems effectively disappear. Every time someone complains "I cannot process this document because the XML parser says I have an unexpected code point", it is a victory for software quality and reliable data interchange. Encoding problems must be detected and dealt with at source, and not allowed to propagate and corrupt distributed systems.

This is an area in which ASCII developers' judgements about the tradeoffs may easily conflict with those of people from the rest of the world. If you only ever use ASCII, then having the C1 controls (0x80 to 0x9F) available would probably cause you no grief. Even if you only use ISO 8859-1, it is still important: the Euro=0x80 mistake will be increasingly common, and we need to make sure that XML processors continue to catch this error.

Character encodings are hard. Programmers are not trained to deal with them -- Computer Science classes teach what a float is and what a byte is, but usually not what a character is, and almost never what a multibyte code is.
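The Euro=0x80 mistake can be made concrete with a small Python sketch (my own illustration, not from the original post): Windows-1252 places the Euro sign at byte 0x80, but in ISO 8859-1 that byte is the C1 control U+0080, which XML 1.0 forbids in documents.

```python
# The "Euro = 0x80" mistake: Windows-1252 places the Euro sign at
# byte 0x80, but ISO 8859-1 maps that byte to the C1 control U+0080.
euro_bytes = "€".encode("cp1252")
assert euro_bytes == b"\x80"

# A reader that trusts an "ISO-8859-1" label decodes the byte as U+0080,
# a C1 control -- which XML 1.0 forbids, so the parser flags the mislabelling.
char = euro_bytes.decode("latin-1")
assert char == "\u0080"
assert 0x80 <= ord(char) <= 0x9F  # inside the C1 control range
```

So a document serialized as Windows-1252 but labelled ISO-8859-1 is caught at the first Euro sign.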
APIs do not expose the character set (Java is getting better at this). DBMSs do not check that the encoding their locale says they use is in fact the one that any particular string does use. The only place we can verify character encodings is at the point of data interchange: as XML.

Before XML, the only way people had to make reliable systems for data interchange was to agree on a common encoding. That is impractical for all sorts of reasons. XML has given us an alternative, where it can be safe, in practice and in many situations, to use different encodings, because we have a coarse net that exposes mislabelled encodings. To allow 0x80 to 0x9F, and to allow silly characters in XML names, takes us back to the 80s, when there were no checks at any part of a system that encodings were correct. Unrestricted ranges are not the future; they are the past, and a past that failed miserably.

> If this is your point, then would it be possible to define a new
> encoding which permitted the full range of Unicode characters
> (including control characters which are valid in Unicode).
> Would that address your issues?

I don't believe the world is crying out for more encodings :-) When we consider the solutions available, we do not have the ability to force people to choose encodings, or the ability to make APIs that transmit the character encoding: not only because we are not ISO or Microsoft, but because Pandora's box is already open (or the horse has already bolted).

> But I must admit that I do not understand why allowing control
> characters in PCDATA results in "we won't actually increase the number
> of characters that can be reliably sent: we will just make non-ASCII
> characters suspect and unreliable." It may make translation between
> different character sets harder, but hey - how do I turn Unicode
> encoded Chinese into plain ASCII? My point is that not permitting
> a small number of characters does not solve all such problems.
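To show how coarse the net is in the other common direction, here is a hypothetical Python sketch: UTF-8 data mislabelled as ISO 8859-1 almost immediately yields code points in the C1 range, exactly the kind an XML processor that restricts C1 would reject.

```python
# UTF-8 bytes for the Euro sign, wrongly read under an ISO 8859-1 label.
utf8_bytes = "€".encode("utf-8")        # b'\xe2\x82\xac'
misread = utf8_bytes.decode("latin-1")  # three characters, not one

# The middle continuation byte lands in the C1 range U+0080..U+009F,
# so a parser enforcing the C1 restriction detects the mislabelling.
c1_hits = [c for c in misread if 0x80 <= ord(c) <= 0x9F]
assert c1_hits == ["\x82"]
```

Half of all UTF-8 continuation bytes (0x80 to 0xBF) fall in the C1 range, so any non-trivial amount of non-ASCII text trips the check.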
(Off the point: one can transmit Chinese in ASCII using numeric character references.)

UTF-8, VISCII and Big5 all use the bytes 0x80 to 0x9F. Most transcoding systems that read ISO 8859-1 into Unicode merely widen unsigned bytes to unsigned shorts to read the data in. So merely labelling an XML file as ISO 8859-1 is enough to silence most XML processors, in the absence of a restriction on the C1 control characters. If 0x80 to 0x9F are errors, then it becomes a statistical question of how many non-ASCII characters can occur in, say, a UTF-8, VISCII or Big5 document labelled as ISO 8859-1 before the error is detected.

As far as the issue of reliability and "trust" goes, Alan is certainly correct that disallowing 0x80 to 0x9F will not catch some errors ("solve all problems"), such as mislabelling ISO 8859-15 as ISO 8859-1 (especially if element names are all ASCII-repertoire characters). And in some cases it may take some small statistical number of characters before the problem is detected (e.g. a Chinese Big5 character has a one in eight chance of having a second byte in 0x80-0x9F, ignoring the higher rates of some common characters such as MA, so we can expect the problem to be detected for most documents with more than 8 Chinese characters). But XML does not need to "solve all problems". It just needs to catch an adequate number of important problems in a straightforward way, where users can have statistical expectations about the problems detected. There are some problems (such as whether a Japanese document is using backslash or Yen) which cannot be detected by this method; that is a pity for the people with those problems, not a sign that the C1 restrictions are not worthwhile.

> If you are only talking about name characters (element names, attribute
> names etc), then that is a different matter.

Restricting the control characters catches some significant problems. Restricting the name characters catches some more.
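The statistical expectation can be sketched numerically (my own illustration, using the one-in-eight figure from above): if each Big5 character independently has a 1/8 chance of a second byte in 0x80-0x9F, the probability that a mislabelled document escapes detection falls off quickly with document length.

```python
# Probability that a Big5 document mislabelled as ISO 8859-1 is caught,
# assuming each character independently has a 1/8 chance of a second
# byte in the forbidden C1 range 0x80-0x9F.
def detection_probability(n_chars, per_char=1 / 8):
    return 1 - (1 - per_char) ** n_chars

assert detection_probability(8) > 0.65   # most 8-character documents are caught
assert detection_probability(50) > 0.99  # longer documents are almost always caught
```

This is the sense in which the C1 restriction gives users statistical, not absolute, guarantees.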
> But I think it's wrong to put too much trust into XML to protect
> against data corruption. This seems (to me) to be a poor rationale
> for omitting a small select number of characters.

In the abstract, sure. But since there is *nothing* else that catches such errors, the abstract is irrelevant. It is not "data corruption" in the sense of errors that creep in; it is data corruption by programmer action. APIs usually just use the default encoding of the locale to serialize text, so there is no way for programmers to become aware that they have mislabelled their documents' encoding unless the XML processor tells them.

Anyone who has had to populate a database with feeds coming in in different character sets knows that keeping track of the character encoding is vital. Without it, the database becomes useless.

I would be interested in the people who say we should make C1 available specifying an alternative way to detect these errors, or explaining why the problem is not real. (I am sure that developers from China, Japan, Korea and Vietnam would be interested to hear that character encoding issues cause so few problems that we should not have machine checks to help us.)

Cheers
Rick Jelliffe

For more info, see my old GLUE transcoder project:
http://www.ascc.net/xml/en/utf-8/glue.html