Re: Is it a well-formedness error to use a character notin th
On Fri, 2010-03-19 at 13:59 +1100, Greg Hunt wrote:
> Liam,
> I can assure you that I don't WANT to put these characters in.

I did put a smiley there :-)

> What I'm asking about is the mapping from the ASCII substitution
> character to the Unicode one.
[...]
> I suspect that the 8859 substitution character (1a) is not getting
> mapped to the (valid for XML) UTF-8 substitution character (FFFD) by
> the XML parser's transcoding.

I think that ASCII SUB isn't quite the same as Unicode Substitute: SUB (which is also in Unicode) indicates that the following character is from a different character set; Substitute appears to replace the character altogether [1].

There is nothing like the SUB mechanism for XML directly, because it's poorly defined (_which_ other character set?) and because in XML you'd normally use named character entities in this circumstance... although XML punts on the values of the replacement text. We thought we were going to work on SGML-style "SDATA" entities shortly after XML was published, more than a decade ago....

At any rate, XML does not allow such control characters. I'd suggest using an external tool to map them to the private use area in UTF-8, either using an entity reference or a numeric character reference, not the literal character, so that your XML is 8-bit clean and will work in an ISO 8859-1 environment. You could use "tr" or "sed" on a Unix or Linux system.

> Unfortunately I don't have a development box to play with at the
> moment to work on this further. I don't know whether I'm looking at a
> bug or correct behaviour.

I don't think software needs to change SUB in converting from UTF-8 to ISO 8859-1, since it has the same meaning in both, so I don't think it's a bug. I think it would probably be a mistake to convert it to Substitute, but I'd need to delve into the Unicode report to give a better answer.

At any rate this sort of chicanery is not expected in XML files -- the XML answer is that you should use explicit markup.
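As a rough illustration of the external-tool suggestion, here is a minimal sketch in Python rather than "tr" or "sed". It rewrites each control character that XML 1.0 forbids as a numeric character reference into the private use area. The U+E000 offset, the PUA_BASE name and the escape_controls function are assumptions made for this example only; neither XML nor Unicode defines any such mapping, so both ends of an exchange would have to agree on it.

```python
import re

# Assumption for illustration: map control code N to U+E000 + N,
# the first block of the Unicode private use area.
PUA_BASE = 0xE000

# C0 controls that XML 1.0 does not allow: everything below U+0020
# except tab (09), line feed (0A) and carriage return (0D).
_FORBIDDEN = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def escape_controls(text: str) -> str:
    """Replace forbidden control characters with numeric character
    references, so the result contains only plain ASCII markup for them."""
    return _FORBIDDEN.sub(
        lambda m: "&#x{:X};".format(PUA_BASE + ord(m.group())), text
    )
```

With this mapping, SUB (1A) becomes the reference &#xE01A;, which a parser reads back as the single private-use character U+E01A. Because the reference is written in ASCII, the file stays the same whether it is labelled UTF-8 or ISO 8859-1.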
[1] http://www.interfacebus.com/ASCII_Table.html has a short summary, although there's obviously a typo in the entry for SUB. SUB is actually a safer mechanism than shift-in/shift-out, because it only affects the single next character (octet).

Liam

--
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org www.advogato.org