
Re: regex, shortest match

Subject: Re: regex, shortest match
From: David Carlisle <davidc@xxxxxxxxx>
Date: Fri, 1 Aug 2008 09:42:22 +0100
> I'm looking to parse sentences out of paras.

To be more exact, you are trying to parse a sentence with a regular
expression, which would cause you to fail a logic course, as natural
language must be the canonical example of a non-regular language :-)

> "((.+).)

. is a metacharacter matching any character (other than a newline,
unless the s flag is set), so (.+). is a sequence of one or more
characters followed by one more character, i.e. it matches any sequence
of two or more characters.
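
A quick way to check this for yourself is the sketch below. The test
strings are made up, and the ^ and $ anchors are added here so that
matches() has to cover the whole string. Run it against any XML input.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text"/>

<xsl:template match="/">
  <!-- Hypothetical test strings: only the one-character string fails. -->
  <xsl:for-each select="('a', 'ab', 'a.', 'Hello world')">
    <!-- ^ and $ force the regex to match the entire string. -->
    <xsl:value-of select="concat('[', ., '] ', matches(., '^(.+).$'), '&#10;')"/>
  </xsl:for-each>
</xsl:template>

</xsl:stylesheet>
```

This should report false for [a] and true for the other three: the
literal "." in 'a.' is matched just like any other character.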




You need to define a sentence. If a sentence cannot contain a ".", but
always ends with a ".", then you could use [^.]*\.

but then

it cost $2.00.

is two sentences.
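
To see the split, here is a minimal sketch that applies that regex to
the example string (the bracketed output format is only for
illustration):

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text"/>

<xsl:template match="/">
  <!-- Each match is a run of non-dot characters plus the dot that ends it. -->
  <xsl:analyze-string select="'it cost $2.00.'" regex="[^.]*\.">
    <xsl:matching-substring>
      <xsl:value-of select="concat('sentence: [', ., ']&#10;')"/>
    </xsl:matching-substring>
  </xsl:analyze-string>
</xsl:template>

</xsl:stylesheet>
```

The two matches are "it cost $2." and "00.", because the decimal point
is treated as a sentence end.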



So perhaps a sentence is terminated by . followed by end of string or
whitespace

 ([^.]|\.[^ \n\r\t])*\.(\s+|$)




<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 

<xsl:output method="text"/>

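<!-- A sentence is any run of non-dot characters, or dots not followed by
     whitespace, ended by a dot followed by whitespace or end of string. -->
<xsl:template match="para">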

new para
<xsl:analyze-string select="." regex="([^.]|\.[^ \n\r\t])*\.(\s+|$)">
<xsl:matching-substring>
 sentence: <xsl:value-of select="normalize-space(.)"/>
</xsl:matching-substring>
<xsl:non-matching-substring>
 oops:  <xsl:value-of select="normalize-space(.)"/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
</xsl:stylesheet>





 saxon9 para.xml para.xsl



new para

 sentence: It is sometimes desired to have a specific heading which should not be numbered.
 sentence: This corresponds to unnumbered list headers in lists (see sections 4.3).
 sentence: To facilitate this, an optional attribute text:is-list-header can be used.
 sentence: If true, the given header will not be numbered, even if an explicit list-style is given.


new para

 sentence: A text:style-name attribute references a paragraph style, while a text:cond-style-name attribute references a conditional-style, that is, a style that contains conditions and maps to other styles (see section 14.1.1).
 sentence: If a conditional style is applied to a paragraph, the text:style-name attribute contains the name of the style that was the result of the conditional style evaluation, while the conditional style name itself is the value of the text:cond-style-name attribute.
 sentence: This XML structure simplifies [XSLT] transformations because XSLT only has to acknowledge the conditional style if the formatting attributes are relevant.
 sentence: The referenced style can be a common style or an automatic style.


new para

 sentence: A text:class-names attribute takes a whitespace separated list of paragraph style names.
 sentence: The referenced styles are applied in the order they are contained in the list.
 sentence: If both, text:style-name and text:class-names are present, the style referenced by the text:style-name attribute is as the first style in the list in text:class-names.
 sentence: If a conditional style is specified together with a style:class-names attribute, but without the text:style-name attribute, then the first style in the style list is used as the value of the missing text:style-name attribute.


new para

 sentence: A page sequence element <text:page-sequence> specifies a sequence of master pages that are instantiated in exactly the same order as they are referenced in the page sequence.
 sentence: If a text document contains a page sequence, it will consist of exactly as many pages as specified.
 sentence: Documents with page sequences do not have a main text flow consisting of headings and paragraphs as is the case for documents that do not contain a page sequence.
 sentence: Text content is included within text boxes for documents with page sequences.
 sentence: The only other content that is permitted are drawing objects.




but this would of course still fail if the sentence were to contain
". " coming from "D. P. Carlisle" or "dr. " or ...

If you try to parse natural language with a single regular expression,
it will _always_ fail. But you can cover more or less arbitrarily
complicated subsets of the language by making the regexp
correspondingly more complicated (and slower).
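
As one illustration of that trade-off, the sketch below protects a few
abbreviations before splitting. It is not a general solution. The
abbreviation list in $abbrev (single capital initials, Dr, dr, Mr, Mrs)
is an assumption, as is the trick of temporarily replacing the
whitespace after them with a no-break space (&#xA0;), which \s does not
match in XPath regular expressions.

```xml
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text"/>

<!-- Hypothetical abbreviation list: an initial or title after start of
     string, whitespace or a no-break space, then a dot and whitespace. -->
<xsl:variable name="abbrev"
              select="'(^|[\s&#xA0;])([A-Z]|[Dd]r|Mr|Mrs)\.\s+'"/>

<xsl:template match="para">
  <!-- Two passes: in "D. P. Carlisle" the first pass consumes the space
       before "P", so "P." is only caught on the second pass. -->
  <xsl:variable name="pass1" select="replace(., $abbrev, '$1$2.&#xA0;')"/>
  <xsl:variable name="protected" select="replace($pass1, $abbrev, '$1$2.&#xA0;')"/>

new para
  <xsl:analyze-string select="$protected" regex="([^.]|\.[^ \n\r\t])*\.(\s+|$)">
    <xsl:matching-substring>
 sentence: <xsl:value-of select="normalize-space(translate(., '&#xA0;', ' '))"/>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
 oops:  <xsl:value-of select="normalize-space(translate(., '&#xA0;', ' '))"/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>

</xsl:stylesheet>
```

With this, "written by D. P. Carlisle. Next one." should come out as two
sentences rather than four, but the heuristic has failures of its own:
"vitamin C. Next one." is now wrongly kept as a single sentence.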


David

