Re: Is it a well-formedness error to use a character not in th

  • From: Liam R E Quin <liam@w3.org>
  • To: Greg Hunt <greg@firmansyah.com>
  • Date: Fri, 19 Mar 2010 00:19:28 -0400

On Fri, 2010-03-19 at 13:59 +1100, Greg Hunt wrote:
> Liam,
> I can assure you that I don't WANT to put these characters in.

I did put a smiley there :-)

>    What I'm asking about is the mapping from the ASCII substitution
> character to the Unicode one.

[...]

>   I suspect that the 8859 substitution character (1a) is not getting
> mapped to the (valid for XML) UTF-8 substitution character (FFFD) by
> the XML parser's transcoding.

I think that ASCII SUB isn't quite the same as Unicode Substitute:
SUB (which is also in Unicode) indicates that the following character
is from a different character set; Substitute appears to replace the
character altogether [1].

There is nothing like the SUB mechanism for XML directly, because it's
poorly defined (_which_ other character set?) and because in XML you'd
normally use named character entities in this circumstance... although
XML punts on the values of the replacement text. We thought we were
going to work on SGML-style "SDATA" entities shortly after XML was
published, more than a decade ago....

At any rate, XML does not allow such control characters.  I'd suggest
using an external tool to map them to the Unicode private use area,
written as an entity reference or a numeric character reference,
not as the literal character, so that your XML is 8-bit clean and will
work in an ISO 8859-1 environment.

You could use "tr" or "sed" on a Unix or Linux system.
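
If a script is easier than "tr" or "sed", here is a rough Python sketch
of the same idea.  It rewrites each control byte that XML 1.0 forbids
as a numeric character reference into the private use area.  The
function name and the mapping (byte 0xNN becomes U+E0NN) are my own
assumptions for illustration, not anything XML or Unicode prescribes;
pick whatever private-use code points suit your application.

```python
import re

# Assumption for illustration: control byte 0xNN is mapped to U+E0NN,
# which lies in the Unicode Private Use Area (U+E000..U+F8FF).
PUA_BASE = 0xE000

# C0 control bytes that XML 1.0 does not allow: everything below 0x20
# except TAB (0x09), LF (0x0A) and CR (0x0D).
_FORBIDDEN = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def escape_controls(data: bytes) -> bytes:
    """Replace forbidden control bytes with ASCII numeric character references."""

    def to_ref(match):
        code = PUA_BASE + match.group(0)[0]
        return b"&#x%X;" % code

    return _FORBIDDEN.sub(to_ref, data)


if __name__ == "__main__":
    # SUB (0x1A) becomes a reference to U+E01A; TAB, LF and CR are left alone.
    print(escape_controls(b"before\x1aafter\n"))
```

This works on raw bytes, so it is only safe for ASCII-compatible
encodings such as ISO 8859-1 or UTF-8, and the references it produces
are only meaningful in text content and attribute values, not inside
names, comments or CDATA sections.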

> Unfortunately I don't have a development box to play with at the
> moment to work on this further.  I don't know whether I'm looking at a
> bug or correct behaviour.

I don't think software needs to change SUB in converting from UTF-8 to
ISO 8859-1, since it has the same meaning in both, so I don't think it's
a bug. I think it would probably be a mistake to convert it to
Substitute, but I'd need to delve into the Unicode report to give a
better answer.  At any rate this sort of chicanery is not expected in
XML files -- the XML answer is that you should use explicit markup.
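
To illustrate the point about transcoding, here is a small check using
Python's built-in codecs (only one implementation, so treat it as an
example rather than proof of what every XML parser does).  The helper
name is made up for this sketch.  SUB passes through an ISO 8859-1 to
UTF-8 conversion unchanged, and U+FFFD only turns up when a decoder is
asked to replace input it cannot decode.

```python
SUB = "\x1a"            # U+001A, the ASCII/ISO 8859-1 SUB control
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def latin1_to_utf8(data: bytes) -> bytes:
    """Transcode ISO 8859-1 bytes to UTF-8 (hypothetical helper name)."""
    return data.decode("iso-8859-1").encode("utf-8")


# SUB has the same code point in both encodings, so the byte is unchanged.
assert latin1_to_utf8(b"a\x1ab") == b"a\x1ab"

# U+FFFD appears only when undecodable input is replaced on request.
assert b"\xff".decode("utf-8", errors="replace") == REPLACEMENT
assert SUB != REPLACEMENT
```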

[1] http://www.interfacebus.com/ASCII_Table.html has a short summary,
    although there's obviously a typo in the entry for SUB.
    SUB is actually a safer mechanism than shift-in/shift-out, because
    it only affects the single next character (octet).

Liam

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org www.advogato.org
