Re: Tokenize question: tokenize on words, spaces and punctuation


Subject: Re: Tokenize question: tokenize on words, spaces and punctuation
From: Martin Holmes <mholmes@xxxxxxx>
Date: Wed, 16 Mar 2011 21:27:31 -0700

This looks perfect. I'm actually dealing with relatively modern French, so I think the Unicode character categories should work fine.

Thanks indeed,
Martin

On 11-03-16 09:19 PM, Brandon Ibach wrote:

The main trick here seems to be simply constructing an appropriate
character class for each type of token and then matching sequences of
one or more of each.

The following does just that, though it also tosses in a twist to
handle words with embedded dashes, so that the dash won't break the
word into three separate tokens.  Further adjustments along those
lines may be needed, depending on your requirements.

The use of Unicode character categories for the character classes
should ensure that this works for most languages, I think, though
non-English languages aren't my strong suit, so I make no guarantees.
:)

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0" xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:f="urn:stylesheet-func" exclude-result-prefixes="xs f">
    <xsl:output method="text"/>
    <xsl:param name="s" select="'Oh, what a fun-filled day!'"/>
    <!-- Returns the tokens of $string in order: words (which may contain
         dashes), runs of punctuation or control characters, and runs of
         separators. -->
    <xsl:function name="f:tokens" as="xs:string*">
        <xsl:param name="string"/>
        <!-- regex is an attribute value template, so the pattern is passed
             as a string literal inside {...} to keep its braces literal. -->
        <xsl:analyze-string select="$string"
                            regex="{'\w[-\w]*|[\p{P}\p{C}]+|\p{Z}+'}">
            <xsl:matching-substring>
                <xsl:sequence select="."/>
            </xsl:matching-substring>
        </xsl:analyze-string>
    </xsl:function>
    <xsl:template match="/">
        <xsl:text>('</xsl:text>
        <xsl:value-of select="f:tokens($s)" separator="', '"/>
        <xsl:text>')</xsl:text>
    </xsl:template>
</xsl:stylesheet>

-Brandon :)
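
One possible "further adjustment" of the kind mentioned above, offered only as a sketch: with the pattern \w[-\w]*, a dash that trails a word (as in "fun- filled") is absorbed into the word token. Assuming a dash should join a word only when more word characters follow it, the word alternative can be written as \w+(-\w+)* instead. The function name f:tokens-strict and the sample string are hypothetical and are not part of the original post.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0" xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:f="urn:stylesheet-func" exclude-result-prefixes="xs f">
    <xsl:output method="text"/>
    <!-- Hypothetical sample input containing a dangling dash after "fun". -->
    <xsl:param name="s" select="'A fun- filled, fun-filled day!'"/>
    <!-- Hypothetical variant of f:tokens: a dash is kept inside a word only
         when at least one word character follows it. -->
    <xsl:function name="f:tokens-strict" as="xs:string*">
        <xsl:param name="string"/>
        <xsl:analyze-string select="$string"
                            regex="{'\w+(-\w+)*|[\p{P}\p{C}]+|\p{Z}+'}">
            <xsl:matching-substring>
                <xsl:sequence select="."/>
            </xsl:matching-substring>
        </xsl:analyze-string>
    </xsl:function>
    <xsl:template match="/">
        <xsl:text>('</xsl:text>
        <xsl:value-of select="f:tokens-strict($s)" separator="', '"/>
        <xsl:text>')</xsl:text>
    </xsl:template>
</xsl:stylesheet>
```

Run against any input document with an XSLT 2.0 processor, this should print ('A', ' ', 'fun', '-', ' ', 'filled', ',', ' ', 'fun-filled', ' ', 'day', '!'). The trailing dash becomes its own punctuation token, while "fun-filled" still comes through as a single word. Apostrophes are punctuation in both versions, so a French elision such as "l'homme" is still split into three tokens.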

On Wed, Mar 16, 2011 at 8:33 PM, Martin Holmes <mholmes@xxxxxxx> wrote:

Hi there,

This is really a question for XPath regex gurus:

I need to tokenize a string of text such that words, punctuation and spaces
are split. So from this:

Oh, what a great day!

I need to get:

('Oh', ',', ' ', 'what', ' ', 'a', ' ', 'great', ' ', 'day', '!')

I've been hacking away at this for a while, but regexps aren't my strong
suit. Can anyone help?

Cheers,
Martin

