Subject: RE: unparsed-text() and illegal characters
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Thu, 27 Jul 2006 20:21:40 +0100
The spec is very strict on this point: characters not allowed in XML must
cause an error. This is a change since the book was written.
However, the spec is very loose about how URIs are resolved. So a conformant
product could take the URI
thing.txt?substitute-illegal-chars=FFFD
as a reference to "the document formed by taking thing.txt and substituting
illegal characters with xFFFD."
Perhaps I'll do that.
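
Until a processor offers something like that, the same substitution can be
done as a pre-processing step outside XSLT. What follows is only a minimal
sketch in Python, assuming the XML 1.0 definition of a legal character (the
"Char" production). The names ILLEGAL_XML10, substitute_illegal_chars and
read_as_xml_safe_text are hypothetical and are not part of Saxon or of any
XSLT processor; the ISO-8859-1 default is taken from the original question.

```python
import re

# Matches every character NOT allowed by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
ILLEGAL_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def substitute_illegal_chars(text, replacement="\uFFFD"):
    """Replace each character that is illegal in XML 1.0 with `replacement`."""
    return ILLEGAL_XML10.sub(replacement, text)


def read_as_xml_safe_text(path, encoding="iso-8859-1"):
    """Decode a file and substitute illegal characters (hypothetical helper).

    newline="" keeps the original line endings untouched.
    """
    with open(path, encoding=encoding, newline="") as handle:
        return substitute_illegal_chars(handle.read())
```

The cleaned text could then be written back out as UTF-8 and read with
unparsed-text() in the usual way. Passing an empty string as the replacement
drops the illegal characters instead of marking them.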
Michael Kay
http://www.saxonica.com/
> -----Original Message-----
> From: Abel Braaksma Online [mailto:abel.online@xxxxxxxxx]
> Sent: 27 July 2006 19:10
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: unparsed-text() and illegal characters
>
> Dear List,
>
> Trying to "import" a non-XML file of an undefined encoding, I
> received the following error when using Saxon8: "The unparsed
> text file contains a character illegal in XML (line=1
> column=4 value=hex 11)". I only found one reference about
> this error
> (http://www.stylusstudio.com/xsllist/200510/post90470.html),
> which is actually a post about illegal characters inside the
> XSLT document.
>
> Michael Kay points out in that post that this error is merged
> into XTDE1190 (see
> http://www.w3.org/TR/xslt20/#err-XTDE1190). It is claimed in
> the specs that non-understood characters or byte sequences
> should result in this non-recoverable dynamic error.
>
> In his indispensable book, the XSLT 2.0 Programmer's
> Reference, he states the following:
> "Some processors will provide configuration options that pass
> this choice on to the user. If the file contains characters that
> are invalid in XML (this applies to most control characters
> in the range x00 to x1F under XML 1.0, but only to the null
> character x00 under XML 1.1) then the invalid characters are
> substituted by the special Unicode character xFFFD, which is
> specifically intended for such purposes."
>
> I understand that the book was written before XSLT 2.0 was
> finalized (it is still a Candidate Recommendation), but I wonder
> if a treatment like the one above is still possible somehow. The
> content of the file is ISO-8859-1, apart from the start and end
> headers, which contain control characters. I only need the
> part that is parsable as text; the rest can be discarded.
>
> Am I asking too much from XSLT, or is this somehow possible?
> It would really add to the possibilities, and it means I
> don't need some extra filter or preparse step.
>
> Cheers,
> Abel Braaksma
> www.nuntia.nl