
Re: Is it a well-formedness error to use a character not in t

  • From: rjelliffe@allette.com.au
  • To: xml-dev@lists.xml.org
  • Date: Fri, 19 Mar 2010 20:21:54 +1100

> Rick,
> Unicode tech reports 22 and 36 both describe transcoding producing both 1A
> and FFFD characters as a result of character mismatches depending on context
> and direction.  It appears to me that 1a can be introduced when transcoding
> either into or out of Unicode, but this is not my area of specialisation.

> Could you point me at where the XML standard says that transcoding problems
> that result in the introduction of substitution characters into transcoded
> text should "cause processing to report an error"? I had a look for exactly
> this earlier and must have missed it.  The W3C document seems to leave
> transcoding issues to the Unicode standards.  U+FFFD is apparently a valid
> XML character so there should be no issue with processing it.

There are two issues:

1) What should an XML processor do when faced with a bad byte sequence?

The answer is very clear: s4.3.3.
"It is a fatal error if an XML entity is determined (via default,
encoding declaration, or higher-level protocol) to be in a certain
encoding but contains byte sequences that are not legal in that encoding."

2) Is the character FFFD allowed in data?

Again, the answer is very clear: s2.2

[2]   Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
               [#x10000-#x10FFFF]
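
As an illustration only, the production above can be transcribed into a
small Python check. The function name is_xml_char is made up for this
sketch and is not part of any XML library. It shows that U+FFFD matches
Char while SUB (U+001A) does not.

```python
def is_xml_char(ch: str) -> bool:
    """Return True if ch matches the Char production quoted above.

    Hypothetical helper for illustration; a direct transcription of
    production [2], not an API from any XML toolkit.
    """
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # excludes the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # excludes U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


if __name__ == "__main__":
    for label, ch in [("U+FFFD", "\ufffd"), ("U+001A (SUB)", "\x1a")]:
        print(label, is_xml_char(ch))
```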

So, as I said, XML can have U+FFFD in data, but it should not be put there by
a transcoder.

So I don't think it is correct behaviour to fall back to any character,
including U+FFFD, especially silently. Silent failure undercuts the XML
approach.
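
A minimal sketch of the difference, using Python's built-in codecs rather
than any particular XML processor. The two function names and the sample
bytes are invented for the illustration. Strict decoding raises on the
illegal byte sequence, while the lenient mode quietly substitutes U+FFFD,
and the result cannot be told apart from text that really contained U+FFFD.

```python
# Latin-1 bytes inside an entity that is (wrongly) labelled UTF-8.
RAW = b"<a>caf\xe9!</a>"


def decode_strict(data: bytes, encoding: str) -> str:
    """Raise UnicodeDecodeError on any byte sequence illegal in encoding."""
    return data.decode(encoding, errors="strict")


def decode_lenient(data: bytes, encoding: str) -> str:
    """Silently substitute U+FFFD for illegal byte sequences."""
    return data.decode(encoding, errors="replace")


if __name__ == "__main__":
    try:
        decode_strict(RAW, "utf-8")
    except UnicodeDecodeError as exc:
        print("error reported at byte offset", exc.start)

    # No error here: the damage is hidden behind a legal XML character.
    print(repr(decode_lenient(RAW, "utf-8")))
```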

(I will modify this: however, an implementation could choose to put in a
SUB or FFFD or any other signal anywhere it likes, as long as it is clear
that the DOM or stream or whatever is not WF XML and there has been a
fatal error. But this is not something "allowed" by XML or Unicode,
because by this stage you don't have XML.)
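
One way such an implementation might look, purely as a sketch: keep the
best-effort text for diagnostics, but carry an explicit flag saying that a
fatal error occurred. DecodeResult and decode_for_diagnostics are
hypothetical names, and the code uses Python's standard codecs, not an XML
parser.

```python
from dataclasses import dataclass


@dataclass
class DecodeResult:
    text: str    # best-effort text; U+FFFD stands in for bad byte sequences
    fatal: bool  # True means the input was NOT well-formed XML


def decode_for_diagnostics(data: bytes, encoding: str) -> DecodeResult:
    """Decode strictly; on failure, return marked-up text plus a fatal flag."""
    try:
        return DecodeResult(data.decode(encoding, errors="strict"), fatal=False)
    except UnicodeDecodeError:
        # The substituted text is for error reporting only, never for
        # passing on as if it were the document's real content.
        return DecodeResult(data.decode(encoding, errors="replace"), fatal=True)
```

A caller would check the fatal flag before doing anything else with the
text, so the substitution never goes unnoticed.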

On the issue of what to do if you are using some magical encoding that has
characters that are not in Unicode, it is a really specialist topic and
should not be confused with the general case. (There are a few CJK
dictionary character repertoires which have more characters than Unicode,
for example. However, these are not in any off-the-shelf transcoders, so
that is not the case here.)

Cheers
Rick Jelliffe


