
Re: regex, shortest match

Subject: Re: regex, shortest match
From: Dave Pawson <davep@xxxxxxxxxxxxx>
Date: Fri, 01 Aug 2008 10:19:05 +0100
David Carlisle wrote:
I'm looking to parse sentences out of paras.

To be more exact, you are trying to parse a sentence with a regular expression, which would cause you to fail a logic course, as natural language must be the canonical example of a non-regular language :-)
Highly likely.


You need to define a sentence.

I tried with the worst examples in the source text.



So perhaps a sentence is terminated by "." followed by end of string or
whitespace:

([^.]|\.[^ \n\r\t])*\.(\s+|$)





But this would of course still fail if the sentence were to contain
". " coming from "D. P. Carlisle" or "Dr. " or ...
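
As a quick illustration of that failure, here is a minimal sketch in Python rather than XSLT. Python's `re` module is only a stand-in for the XPath regex flavour, and the helper name `split_sentences` and both sample strings are made up for this example, not taken from the thread. It applies the expression above to a plain string and to one containing initials.

```python
import re

# Expression quoted in the message: a body of non-dots, or dots NOT followed
# by whitespace, closed by a dot that is followed by whitespace or end of string.
SENTENCE = re.compile(r"([^.]|\.[^ \n\r\t])*\.(\s+|$)")

def split_sentences(text):
    """Return each matched 'sentence' with surrounding whitespace trimmed."""
    return [m.group(0).strip() for m in SENTENCE.finditer(text)]

plain = "It rained. We stayed in."
tricky = "I met D. P. Carlisle. He was in."

print(split_sentences(plain))
# ['It rained.', 'We stayed in.']
print(split_sentences(tricky))
# ['I met D.', 'P.', 'Carlisle.', 'He was in.']
```

The second result shows the initials being cut into separate "sentences", because every ". " is treated as a terminator.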

If you try to parse natural language with a single regular expression,
it will _always_ fail. But you can cover more or less arbitrarily
complicated subsets of the language by making the regexp
correspondingly more complicated (and slower).
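
One way such an extension might look, sketched in Python as a stand-in for the XPath regex flavour: an extra alternative swallows a title plus its ". " into the sentence body. The title list (Dr, Mr, Mrs), the names `BASIC`, `WITH_TITLES` and `sentences`, and the sample text are all assumptions for illustration. The result also relies on Python trying alternatives left to right, which an XSLT processor may or may not mirror.

```python
import re

# Original expression from the thread.
BASIC = re.compile(r"([^.]|\.[^ \n\r\t])*\.(\s+|$)")

# Assumed extension: a title followed by ". " is consumed as part of the
# sentence body instead of ending it. The title list is illustrative only.
WITH_TITLES = re.compile(r"((Dr|Mr|Mrs)\.\s+|[^.]|\.[^ \n\r\t])*\.(\s+|$)")

def sentences(pattern, text):
    """Return each match of `pattern` in `text`, whitespace-trimmed."""
    return [m.group(0).strip() for m in pattern.finditer(text)]

text = "Ask Dr. Kay. He met D. P. Carlisle."

print(sentences(BASIC, text))
# ['Ask Dr.', 'Kay.', 'He met D.', 'P.', 'Carlisle.']
print(sentences(WITH_TITLES, text))
# ['Ask Dr. Kay.', 'He met D.', 'P.', 'Carlisle.']
```

The longer expression fixes "Dr. " but still splits the initials, so each new case tends to need yet another alternative.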



<para>Sentance containing Dr. Michael Kay and D.P. Carlisle</para>


<grin/> I'd expect that to break most regexen :-)



  <xsl:template match="para">
    <para>
      <xsl:analyze-string select="." regex="([^.]|\.[^ \n\r\t])*\.(\s+|$)">
        <xsl:matching-substring>
          <s> <xsl:value-of select="normalize-space(.)"/></s>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <error> <xsl:value-of select="normalize-space(.)"/> </error>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </para>
  </xsl:template>
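
For checking the expression outside a stylesheet, the way the template above partitions a string can be mimicked roughly as follows. This is only a sketch in Python, not XSLT: the function name `analyze` and the tuple output are invented here, `" ".join(s.split())` is an approximation of `normalize-space()`, and Python's `re` may differ from the XPath regex flavour in corner cases.

```python
import re

# Same expression as in the template's regex attribute.
SENTENCE = re.compile(r"([^.]|\.[^ \n\r\t])*\.(\s+|$)")

def normalize_space(s):
    """Rough equivalent of XPath normalize-space(): trim and collapse whitespace."""
    return " ".join(s.split())

def analyze(text):
    """Split text into ('s', ...) for matches and ('error', ...) for the gaps."""
    out = []
    pos = 0
    for m in SENTENCE.finditer(text):
        if m.start() > pos:                      # text skipped before this match
            out.append(("error", normalize_space(text[pos:m.start()])))
        out.append(("s", normalize_space(m.group(0))))
        pos = m.end()
    if pos < len(text):                          # trailing text with no terminator
        out.append(("error", normalize_space(text[pos:])))
    return out

para = "Sentance containing Dr. Michael Kay and D.P. Carlisle"
for kind, chunk in analyze(para):
    print(kind, "|", chunk)
# s | Sentance containing Dr.
# s | Michael Kay and D.P.
# error | Carlisle
```

On the sample para the trailing "Carlisle" has no closing ".", so it falls into the non-matching branch.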

Thanks David. That's better than my improvement.
No 'error' elements in 12000 lines.

Much appreciated.

regards

--
Dave Pawson
XSLT XSL-FO FAQ.
http://www.dpawson.co.uk
