Re: Is it a well-formedness error to use a character not in the encoding specified by the XML declaration?

From: Greg Hunt <greg@firmansyah.com>
To: Rick Jelliffe <rjelliffe@allette.com.au>
Date: Fri, 19 Mar 2010 19:34:23 +1100


Rick,

Unicode Technical Reports 22 and 36 both describe transcoding as producing either U+001A or U+FFFD characters when characters cannot be mapped, depending on context and direction. It appears to me that U+001A can be introduced when transcoding either into or out of Unicode, but this is not my area of specialisation.

Could you point me at where the XML standard says that transcoding problems that result in the introduction of substitution characters into transcoded text should "cause processing to report an error"? I had a look for exactly this earlier and must have missed it. The W3C document seems to leave transcoding issues to the Unicode standards. U+FFFD is apparently a valid XML character, so there should be no issue with processing it.

Greg

On Fri, Mar 19, 2010 at 6:17 PM, Rick Jelliffe <rjelliffe@allette.com.au> wrote:

On Roger's initial question about an XML processor failing to report a non-ASCII code sequence, this is not at all impossible. In fact, most transcoders that were made before XML, or independently of any consideration of XML's requirements, do not report wrong codes unless they get into serious trouble. They may substitute some bogus character, or silently strip the character out; sometimes they will actually fall back to the default encoding of the platform (if it is an ASCII superset at the encoding level).

These kinds of transcoders are not sufficient for use in XML WF detection. The general character set infrastructure of our software systems started off broken, and it is only by taking care that anything will work in this area: the standards must have good enough policies, the users must implement these policies in their markup/configuration, the transcoder libraries must be chosen to implement the policies, and other sources of information about bad encodings (e.g. the presence of disallowed control characters) must be utilized to try to fill in any gaps. The world is full of programmers determined to remain ignorant of basic working knowledge of character encoding issues and to complicate the life of people downstream.
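As a rough illustration of the difference between a lenient transcoder and one suitable for well-formedness checking, here is a minimal Python sketch. The helper name `decode_strict` and the sample bytes are assumptions made for this example; they are not taken from any particular XML processor.

```python
# Bytes for "café" in ISO 8859-1; the final byte 0xE9 is not valid UTF-8 here.
raw = "café".encode("iso-8859-1")

# Lenient transcoding: the bad byte is silently replaced by U+FFFD.
lenient = raw.decode("utf-8", errors="replace")

# Lenient transcoding: the bad byte is silently stripped out.
stripped = raw.decode("utf-8", errors="ignore")


def decode_strict(data: bytes, encoding: str) -> str:
    """Hypothetical helper: decode, but report any transcoding failure."""
    try:
        return data.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(
            f"byte 0x{data[exc.start]:02X} at offset {exc.start} "
            f"is not valid in {encoding}"
        ) from exc
```

Only the strict variant gives the caller a chance to report the problem; the other two hand back text that looks fine but has lost information.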

On Greg's question about the ASCII SUB character: this is a control character intended to be used for transmission-level problems. The encoding relates to signals on wires when transmitting ASCII, not to transcoding mismatches, as I understand it. (The Wikipedia entry incorrectly states that this is to be used for signalling that the following character needs to have its 5th bit inverted as an escape mechanism. I think this may be the EBCDIC operation? Anyway, see
http://www.itscj.ipsj.or.jp/ISO-IR/001.pdf )

The correct Unicode character would not be U+001A SUB but U+FFFD REPLACEMENT CHARACTER. However, because of XML's rules, transcoding errors should cause processing to report an error. In other words, if U+FFFD were to appear in a WF document, it should only be because there was some pre-existing text which had that character in it and which was then marked up: the data correctly contains the REPLACEMENT CHARACTER due to some prior flaw. (See http://www.unicode.org/versions/Unicode5.2.0/ch16.pdf and search for FFFD.)
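To make the distinction between the two characters concrete, here is a small sketch of a check against the Char production of XML 1.0 (#x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]). The function names are assumptions made for this example.

```python
from typing import Optional


def is_xml10_char(ch: str) -> bool:
    """True if the single character ch matches the XML 1.0 Char production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def first_disallowed(text: str) -> Optional[int]:
    """Return the index of the first character XML 1.0 does not allow, or None."""
    for index, ch in enumerate(text):
        if not is_xml10_char(ch):
            return index
    return None
```

Under this production U+001A SUB is rejected, so a transcoder that inserts it leaves evidence a processor can catch, while U+FFFD passes the check and can only be caught at transcoding time.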

Note that Unicode does not define semantics for SUB and other control characters, but defers to implementations and other standards, such as ISO/IEC 6429:1992. You can see from the front matter at http://webstore.iec.ch/preview/info_isoiec6429%7Bed3.0%7Den.pdf that the scope of that standard (page 1) is intended to be used "in particular with character-imaging devices": think of a Teletype printer's BEL and BS and, by a stretch, a modem's XON/XOFF flow control. It isn't for use in data exchange as part of the data, but for simple transmission protocols underneath the data.

Finally, Greg should note that the correct transcoding from UTF-8 to ISO8859-1 is not to use any substitution characters, but 1) to replace the character with a numeric character reference when the item is in data content, and 2) to fail when the character is in markup. If you need more detailed transcoding than that, then it is not something that XML processors will provide, and you will have to make your own preprocessor.
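A minimal sketch of that two-part policy, using Python's standard `xmlcharrefreplace` error handler. The two function names are hypothetical, and the sketch assumes the caller already knows whether a string is data content or markup and has already escaped `&` and `<` in the content.

```python
def encode_data_content(text: str, encoding: str = "iso-8859-1") -> bytes:
    """Data content: unencodable characters become numeric character references."""
    return text.encode(encoding, errors="xmlcharrefreplace")


def encode_markup(name: str, encoding: str = "iso-8859-1") -> bytes:
    """Markup (e.g. an element name): no reference is possible, so fail."""
    return name.encode(encoding, errors="strict")
```

For example, `encode_data_content("price: 5 €")` yields `b'price: 5 &#8364;'`, whereas `encode_markup("prix€")` raises `UnicodeEncodeError`, because a character reference cannot appear inside a name.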

Now there have been formats that mix multiple character sets: indeed, RTF allows sections in different embedded encodings. The result is that if you want to use a text editor, it must be 8-bit clean (not do any transcoding) and you have to change the screen encoding to view different sections correctly. XML did not take this route.

Cheers
Rick Jelliffe

P.S. The most common transcoding error I used to see is where there is a UTF-8 data stream and someone puts in the byte 0xA0, intending it to be the non-breaking space character. More common now is where there is a UTF-8 stream that has the UTF-16 Byte Order Mark converted to UTF-8 rather than stripped (this is not so much a code error as an operational error).
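Both cases can be reproduced in a few lines of Python. This is only a sketch; the helper names are assumptions made for this example.

```python
import codecs
from typing import Optional


def utf8_error_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that is invalid as UTF-8, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# A bare 0xA0 is a continuation byte in UTF-8 and cannot stand alone;
# the non-breaking space U+00A0 is the two-byte sequence C2 A0.
bare_nbsp = b"a\xa0b"
real_nbsp = "a\u00a0b".encode("utf-8")

# U+FEFF converted to UTF-8 becomes the three bytes EF BB BF.
converted_bom = "\ufeff".encode("utf-8")


def strip_leading_bom(data: bytes) -> str:
    """Decode UTF-8, dropping one leading converted BOM if present."""
    return data.decode("utf-8-sig")
```

A strict decoder reports the bare 0xA0 immediately, while the converted BOM decodes without complaint, which fits the description of it as an operational error rather than a code error.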

_______________________________________________________________________

XML-DEV is a publicly archived, unmoderated list hosted by OASIS
to support XML implementation and development. To minimize
spam in the archives, you must subscribe before posting.

[Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
subscribe: xml-dev-subscribe@lists.xml.org
List archive: http://lists.xml.org/archives/xml-dev/
List Guidelines: http://www.oasis-open.org/maillists/guidelines.php

Follow-Ups:
- Re: Is it a well-formedness error to use a character not in the encoding specified by the XML declaration?
  - From: rjelliffe@allette.com.au

References:
- Is it a well-formedness error to use a character not in the encoding specified by the XML declaration?
  - From: "Costello, Roger L." <costello@mitre.org>
- Re: Is it a well-formedness error to use a character not in the encoding specified by the XML declaration?
  - From: Michael Glavassevich <mrglavas@ca.ibm.com>
- RE: Is it a well-formedness error to use a character not in the encoding specified by the XML declaration?
  - From: "Michael Kay" <mike@saxonica.com>
- Re: Is it a well-formedness error to use a character not in the encoding specified by the XML declaration?
  - From: Greg Hunt <greg@firmansyah.com>
- Re: Is it a well-formedness error to use a character not in the encoding specified by the XML declaration?
  - From: Liam R E Quin <liam@w3.org>
- Re: Is it a well-formedness error to use a character not in the encoding specified by the XML declaration?
  - From: Greg Hunt <greg@firmansyah.com>
- Re: Is it a well-formedness error to use a character not in the encoding specified by the XML declaration?
  - From: Rick Jelliffe <rjelliffe@allette.com.au>
