
Re: decoding percent-escaped octet sequences

Subject: Re: decoding percent-escaped octet sequences
From: "Imsieke, Gerrit, le-tex" <gerrit.imsieke@xxxxxxxxx>
Date: Sat, 21 May 2011 12:13:32 +0200
On 2011-05-20 18:14, Julian Reschke wrote:
On 2011-05-20 17:52, Brandon Ibach wrote:
Generally, when you're doing string manipulations inside XSLT/XPath,
there really is no such thing as ISO-8859-1, UTF-8 or any other
encoding, since the "string" data type in XPath is just a string of
Unicode characters. The encoding of the input is used to map the
sequence of octets to Unicode characters on the way in and the
requested encoding of the output is used to do the reverse on the way
out.

Percent-escaping is sort of an exception since it is, really, a form
of encoding, but not one that is generally handled automatically by
the parser, serializer, etc. So, you may need to decode the
percent-escapes, but you shouldn't have to worry about the overall
encoding.

If you think your use case requires that you really do need to deal
with encodings, please tell us a little more about it, so that we
might be able to better suggest a solution. How is this string
getting into your transform while still being encoded?
...

The XSLT code reads an XML document containing test cases for HTTP header fields using a variety of encoding styles, some of which are the ones I mentioned (either ISO-8859-1 or UTF-8, percent-escaped).

The goal is to transform the escaped strings from the test cases to XSLT
strings (Unicode sequences), essentially implementing the header field
parsing in XSLT (yes, this is a proof-of-concept, nothing more).

Best regards, Julian
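
As an aside, the task described above splits into two steps: turn the %XX escapes into octets, then map those octets to Unicode characters under the charset the test case declares. Below is a minimal Java sketch of that idea, not taken from the thread. The class and method names (PercentDecodeSketch, toOctets, decode) are made up for illustration, and it assumes that everything outside the escapes is plain ASCII.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;

// Hypothetical helper, for illustration only.
final class PercentDecodeSketch {

    // Value of a single ASCII hex digit; anything else is rejected.
    private static int hex(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        throw new IllegalArgumentException("not a hex digit: " + c);
    }

    // Step 1: %XX escapes become raw octets; unescaped ASCII is copied as-is.
    static byte[] toOctets(String escaped) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < escaped.length()) {
            char c = escaped.charAt(i);
            if (c == '%') {
                if (i + 2 >= escaped.length()) {
                    throw new IllegalArgumentException("truncated escape at index " + i);
                }
                out.write((hex(escaped.charAt(i + 1)) << 4) | hex(escaped.charAt(i + 2)));
                i += 3;
            } else {
                if (c > 0x7F) {
                    throw new IllegalArgumentException("unescaped non-ASCII character at index " + i);
                }
                out.write(c);
                i++;
            }
        }
        return out.toByteArray();
    }

    // Step 2: interpret the octets in the declared charset.
    // Note: octets that are invalid in that charset are silently replaced, not reported.
    static String decode(String escaped, String charsetName) {
        return new String(toOctets(escaped), Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // One octet in ISO-8859-1 and two octets in UTF-8 both denote U+00A3.
        System.out.println(decode("%A3%20rates", "ISO-8859-1"));
        System.out.println(decode("%c2%a3%20and%20%e2%82%ac%20rates", "UTF-8"));
    }
}
```

The same character (the pound sign) comes out of one octet in the ISO-8859-1 case and two octets in the UTF-8 case, which is why the charset label has to travel with the escaped string.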

OK, the following approach isn't quite a pure XSLT/XPath proof of concept, but maybe you'll still find it useful:


===========8<------------------------
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:my="my"
  xmlns:java-urldecode="java:java.net.URLDecoder"
  >

<xsl:output method="xml" indent="yes" />

  <!-- see comment below for ' escaping -->
  <xsl:variable name='input' as='xs:string*'
    select="(
             'us-ascii''en-us''This%20is%20%2A%2A%2Afun%2A%2A%2A',
             'iso-8859-1''en''%A3%20rates',
             'UTF-8''''%c2%a3%20and%20%e2%82%ac%20rates'
            )" />

  <my:input>
    <val>us-ascii'en-us'This%20is%20%2A%2A%2Afun%2A%2A%2A</val>
    <val>iso-8859-1'en'%A3%20rates</val>
    <val>UTF-8''%c2%a3%20and%20%e2%82%ac%20rates</val>
  </my:input>

  <xsl:template name="decode">
    <test>
      <!-- If you use select="$input" in the following for-each instead,
           note that a literal ' has to be doubled as '' inside the
           single-quoted string literals of $input. -->
      <xsl:for-each select="document('')//my:input/val">
        <!-- Each value has the form charset'language'percent-escaped-text -->
        <xsl:analyze-string select="." regex="^(.*?)'(.*?)'(.*)$">
          <xsl:matching-substring>
            <string encoding="{regex-group(1)}" lang="{regex-group(2)}" encoded="{regex-group(3)}">
              <xsl:value-of select="java-urldecode:decode(regex-group(3), regex-group(1))" />
            </string>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </xsl:for-each>
    </test>
  </xsl:template>

</xsl:stylesheet>
===========8<------------------------

It requires a Java-based XSLT 2.0 processor such as Saxon or Altova. In the case of Saxon, I think it works only with the PE or EE editions, or with older 9.1 versions.

Output (invoke Saxon with -it:decode):
<?xml version="1.0" encoding="UTF-8"?>
<test xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:my="my"
xmlns:java-urldecode="java:java.net.URLDecoder">
<string encoding="us-ascii" lang="en-us"
encoded="This%20is%20%2A%2A%2Afun%2A%2A%2A">This is ***fun***</string>
<string encoding="iso-8859-1" lang="en" encoded="%A3%20rates">£ rates</string>
<string encoding="UTF-8" lang=""
encoded="%c2%a3%20and%20%e2%82%ac%20rates">£ and € rates</string>
</test>
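
For readers who want to see what the extension call does outside of XSLT, here is a rough Java equivalent of the template: the same regular expression splits a value into charset, language and escaped text, and java.net.URLDecoder.decode(String, String) does the decoding. The class name ExtValueSketch and the parse method are hypothetical, chosen only for this sketch.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class ExtValueSketch {

    // Same pattern as in the stylesheet: charset ' language ' escaped text.
    private static final Pattern EXT_VALUE = Pattern.compile("^(.*?)'(.*?)'(.*)$");

    // Returns { charset, language, decoded text }.
    static String[] parse(String input) {
        Matcher m = EXT_VALUE.matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("not of the form charset'lang'value: " + input);
        }
        String charset = m.group(1);
        String language = m.group(2);
        try {
            // The call the stylesheet makes via the java-urldecode prefix.
            String decoded = URLDecoder.decode(m.group(3), charset);
            return new String[] { charset, language, decoded };
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String[] inputs = {
            "us-ascii'en-us'This%20is%20%2A%2A%2Afun%2A%2A%2A",
            "iso-8859-1'en'%A3%20rates",
            "UTF-8''%c2%a3%20and%20%e2%82%ac%20rates"
        };
        for (String input : inputs) {
            String[] parts = parse(input);
            System.out.println(parts[0] + " | " + parts[1] + " | " + parts[2]);
        }
    }
}
```

One caveat with this shortcut: URLDecoder implements application/x-www-form-urlencoded decoding, so it also turns a literal + into a space. The three test values above contain no +, so it makes no difference here, but a decoder meant strictly for percent-escapes would have to leave + alone.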


-Gerrit

--
Gerrit Imsieke
Geschäftsführer / Managing Director
le-tex publishing services GmbH
Weissenfelser Str. 84, 04229 Leipzig, Germany
Phone +49 341 355356 110, Fax +49 341 355356 510
gerrit.imsieke@xxxxxxxxx, http://www.le-tex.de

Registergericht / Commercial Register: Amtsgericht Leipzig
Registernummer / Registration Number: HRB 24930

Geschäftsführer: Gerrit Imsieke, Svea Jelonek,
Thomas Schmidt, Dr. Reinhard Vöckler
