Re: Is it a well-formedness error to use a character not in t
> Rick,
> Unicode tech reports 22 and 36 both describe transcoding producing both 1A
> and FFFD characters as a result of character mismatches depending on context
> and direction. It appears to me that 1a can be introduced when transcoding
> either into or out of Unicode, but this is not my area of specialisation.
> Could you point me at where the XML standard says that transcoding problems
> that result in the introduction of substitution characters into transcoded
> text should "cause processing to report an error"? I had a look for exactly
> this earlier and must have missed it. The W3C document seems to leave
> transcoding issues to the Unicode standards. U+FFFD is apparently a valid
> XML character so there should be no issue with processing it.

There are two issues:

1) What should an XML processor do when faced with a bad byte sequence? The answer is very clear: s4.3.3

"It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding."

2) Is the character FFFD allowed in data? Again, the answer is very clear: s2.2

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

So as I said, XML can have U+FFFD in data, but it must not be put there by a transcoder. So I don't think it is correct behaviour to fall back to any character, including U+FFFD, especially silently. Silent failure undercuts the XML approach.

(I will modify this: an implementation could choose to put in a SUB or FFFD or any other signal anywhere it likes, as long as it is clear that the DOM or stream or whatever is not WF XML and that there has been a fatal error. But this is not something "allowed" by XML or Unicode, because by this stage you don't have XML.)

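To make the two points concrete, here is a minimal Python sketch. It is only an illustration, not how any particular XML processor is written: the names `decode_entity` and `is_xml_char` are made up for this example, and raising `ValueError` is just a stand-in for reporting a fatal error. It uses Python's built-in strict codec error handling for issue 1 and a regular expression transcribed from the s2.2 Char production for issue 2.

```python
import re

# Character class transcribed from the XML 1.0 s2.2 Char production.
XML_CHAR = re.compile(
    "[\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def is_xml_char(ch: str) -> bool:
    """Return True if the single character ch matches the Char production."""
    return XML_CHAR.fullmatch(ch) is not None


def decode_entity(raw: bytes, encoding: str) -> str:
    """Decode an entity strictly (hypothetical helper).

    An illegal byte sequence is reported as an error instead of being
    replaced by a substitution character.
    """
    try:
        return raw.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        # Stand-in for the processor's fatal error report.
        raise ValueError(
            f"fatal error: byte sequence not legal in {encoding}: {exc}"
        ) from exc


if __name__ == "__main__":
    # A U+FFFD that is really in the data is an ordinary Char.
    print(is_xml_char("\ufffd"))

    # A lenient transcoder would silently substitute U+FFFD here;
    # the strict decode reports the problem instead.
    bad = b"<a>\xff</a>"
    print(repr(bad.decode("utf-8", errors="replace")))
    try:
        decode_entity(bad, "utf-8")
    except ValueError as err:
        print(err)
```

Note that, read against the production quoted above, #x1A (SUB) falls outside Char, so `is_xml_char("\x1a")` is false in this sketch, while U+FFFD passes. The lenient `errors="replace"` call is shown only for contrast with the strict path.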
On the issue of what to do if you are using some magical encoding that has characters that are not in Unicode: it is a really specialist topic and should not be confused with the general case. (There are a few CJK dictionary character repertoires which have more characters than Unicode, for example. However, these are not in any off-the-shelf transcoders, so that is not the case here.)

Cheers
Rick Jelliffe