Subject: RE: Safe-guarding codepoints-to-string() from wrong input
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Wed, 20 Dec 2006 15:19:54 -0000
There's no obvious way of doing this within the language, other than
defining a function that knows which codepoints are valid characters.
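
As a rough sketch of such a function: the XML 1.0 Char production allows #x9, #xA, #xD, #x20-#xD7FF, #xE000-#xFFFD and #x10000-#x10FFFF, so a pure XSLT 2.0 test can be written against those ranges. The function names and the f: namespace below are made up for illustration, and leaving an invalid [NNN] marker untouched is only one possible fallback.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="urn:example:local-functions"
    exclude-result-prefixes="xs f">

  <!-- Hypothetical helper: true if $cp matches the XML 1.0 Char production
       (#x9, #xA, #xD, #x20 to #xD7FF, #xE000 to #xFFFD, #x10000 to #x10FFFF) -->
  <xsl:function name="f:is-xml10-char" as="xs:boolean">
    <xsl:param name="cp" as="xs:integer"/>
    <xsl:sequence select="
        $cp = (9, 10, 13)
        or ($cp ge 32 and $cp le 55295)
        or ($cp ge 57344 and $cp le 65533)
        or ($cp ge 65536 and $cp le 1114111)"/>
  </xsl:function>

  <!-- Hypothetical wrapper: expands [NNN] markers, but leaves the marker
       as it is when NNN is not a valid XML 1.0 character -->
  <xsl:function name="f:expand-codepoints" as="xs:string">
    <xsl:param name="input" as="xs:string"/>
    <xsl:variable name="parts" as="xs:string*">
      <xsl:analyze-string select="$input" regex="\[(\d+)\]">
        <xsl:matching-substring>
          <xsl:variable name="cp" select="xs:integer(regex-group(1))"/>
          <xsl:sequence select="if (f:is-xml10-char($cp))
                                then codepoints-to-string($cp)
                                else ."/>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:sequence select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <xsl:sequence select="string-join($parts, '')"/>
  </xsl:function>

</xsl:stylesheet>
```

Note that 1234567 is larger than #x10FFFF (1114111), so it falls outside every range and is rejected by the test above.
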
In Saxon, there's an internal method which should be easy enough to call as
an extension function:
<xsl:if test="nc:isXML11Valid($codepoint)"
xmlns:nc="java:net.sf.saxon.om.XML11Char">
or
<xsl:if test="nc:isXML10Valid($codepoint)"
xmlns:nc="java:net.sf.saxon.om.XML10Char">
depending on which version of XML you are using.
You could of course run this on all the possible codepoints to generate a
lookup file: you'll want to use keys to make the lookup efficient.
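
A minimal sketch of that lookup approach, assuming the Saxon class named above is callable as shown and that the generated file is saved as valid-codepoints.xml (the file name, the element names, the key name and the f: namespace are all invented for the example):

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:nc="java:net.sf.saxon.om.XML10Char"
    xmlns:f="urn:example:local-functions"
    exclude-result-prefixes="xs nc f">

  <!-- Step 1 (run once, Saxon only): write one <c n="..."/> element
       for every codepoint that the extension function accepts -->
  <xsl:template name="generate-lookup">
    <codepoints>
      <xsl:for-each select="0 to 1114111">
        <xsl:if test="nc:isXML10Valid(.)">
          <c n="{.}"/>
        </xsl:if>
      </xsl:for-each>
    </codepoints>
  </xsl:template>

  <!-- Step 2: load the saved result (assumed file name) and index it -->
  <xsl:variable name="lookup" select="document('valid-codepoints.xml')"/>
  <xsl:key name="valid-cp" match="c" use="xs:integer(@n)"/>

  <!-- Hypothetical helper: true if $cp appears in the lookup file.
       The third argument of key() selects the document to search. -->
  <xsl:function name="f:is-valid" as="xs:boolean">
    <xsl:param name="cp" as="xs:integer"/>
    <xsl:sequence select="exists(key('valid-cp', $cp, $lookup))"/>
  </xsl:function>

</xsl:stylesheet>
```

Once the file exists, step 2 no longer depends on the extension function, so in principle it should work in any XSLT 2.0 processor. The lookup file will contain over a million entries, so whether it beats a simple range test is something to measure.
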
Michael Kay
http://www.saxonica.com/
> -----Original Message-----
> From: Abel Braaksma [mailto:abel.online@xxxxxxxxx]
> Sent: 20 December 2006 14:34
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: Safe-guarding codepoints-to-string() from wrong input
>
> Hi all,
>
> In a translation stylesheet, I take user input (an arbitrary
> string) and replace a set of numbers with a set of characters,
> like this:
>
> $input = "some [34]quoted[34] string"
> output --> some "quoted" string
>
> <xsl:analyze-string select="$input" regex="\[(\d+)\]">
> <xsl:matching-substring>
> <xsl:value-of
> select="codepoints-to-string(xs:integer(regex-group(1)))" />
> </xsl:matching-substring>
> <xsl:non-matching-substring>
> <xsl:value-of select="." />
> </xsl:non-matching-substring>
> </xsl:analyze-string>
>
> Because we are talking about tons of data containing strings
> like the above (in text files), I'd like to make the
> codepoints-to-string() a bit more rock-solid. In normal
> operation, it fails hard. But I'd like it to gracefully
> degrade: be liberal in what you accept.
>
> I know that control characters are not allowed and throw an
> "Invalid XML character" error. Also, very large numbers
> (like "1234567") give a plural form of the same error (I'm
> not sure why). Some characters (I believe these are the ones
> that are not assigned in Unicode) result in an empty string
> (like "12345").
>
> Is there a robust way of allowing/disallowing a set of
> codepoints (other than making one huge lookup list)?
>
> Cheers,
> Abel