
RE: segmenting a paragraph

Subject: RE: segmenting a paragraph
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Tue, 2 Oct 2007 09:36:58 +0100
When you need to apply regex matching to text that crosses node boundaries,
two approaches have been proposed in the past:

(a) create a string in which the node boundaries are represented by some
recognizable textual markup (you could use saxon:serialize()), then apply
the regex processing, then reinstate the node structure (e.g. by using
saxon:parse()).

(b) do a deep copy, while processing each of the text nodes to replace the
significant features (such as end of sentence) by nodes (e.g. an
<end-of-sentence/> element). Then apply positional grouping techniques to
transform this into your target structure.
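
As a rough illustration of approach (b), here is a minimal XSLT 2.0 sketch,
not a tested solution. It handles only the sentence level; the <seg> level
could be added with a second marker element in the same way. The mode name
"mark", the variable name "marked" and the regex [.!?]+ are assumptions made
for this example, and the regex is deliberately naive (it will split on
abbreviations such as "e.g.").

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity copy, used in every mode. This is what leaves <note>
       and its content untouched. -->
  <xsl:template match="@*|node()" mode="#all">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" mode="#current"/>
    </xsl:copy>
  </xsl:template>

  <!-- Pass 1 (hypothetical mode "mark"): scan only text that is a direct
       child of <p> and emit an <end-of-sentence/> marker after each run
       of sentence-ending punctuation. Text inside <note> is not matched
       here and falls through to the identity template. -->
  <xsl:template match="p/text()" mode="mark">
    <xsl:analyze-string select="." regex="[.!?]+">
      <xsl:matching-substring>
        <xsl:value-of select="."/>
        <end-of-sentence/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- Pass 2: positional grouping over the marked-up copy of <p>. -->
  <xsl:template match="p">
    <xsl:variable name="marked">
      <xsl:apply-templates select="." mode="mark"/>
    </xsl:variable>
    <p>
      <xsl:copy-of select="@*"/>
      <xsl:for-each-group select="$marked/p/node()"
                          group-ending-with="end-of-sentence">
        <xsl:choose>
          <!-- A group that ends with the marker is a complete sentence:
               wrap it in <s> and drop the marker itself. -->
          <xsl:when test="current-group()[last()][self::end-of-sentence]">
            <s>
              <xsl:copy-of
                  select="current-group()[not(self::end-of-sentence)]"/>
            </s>
          </xsl:when>
          <!-- Trailing material with no sentence end (for example a
               <note> after the last full stop) is copied as it is. -->
          <xsl:otherwise>
            <xsl:copy-of select="current-group()"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </p>
  </xsl:template>

</xsl:stylesheet>
```

With this arrangement a <note> that follows the last sentence stays outside
any <s>, while a <note> in the middle of a sentence is copied intact inside
the enclosing <s>. Whitespace between sentences ends up at the start of the
following <s>, so some trimming would probably still be needed.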

Neither is particularly easy, I'm afraid.

Michael Kay
http://www.saxonica.com/

> -----Original Message-----
> From: Christian Wittern [mailto:cwittern@xxxxxxxxx] 
> Sent: 02 October 2007 09:05
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject:  segmenting a paragraph
> 
> Dear XSL-list readers,
> 
> In trying to solve the following problem I am seeking your help:
> I want to segment paragraphs in a text, so that sentences are 
> enclosed in a <s> element and within the sentences, words 
> between interpunction are within <seg> elements.
> 
> So far, I have been capturing the content of <p> in a string 
> and then using two nested <xsl:analyze-string> blocks with 
> regexes, which work nicely and do what I want.  Now I 
> discovered that there are <note> elements with additional 
> markup in some paragraphs, which get lost in this process. 
> However, I really want to leave these notes alone, as they are.  So:
> 
> <p>Some text.  Some more text, with a comma. <note>This 
> stuff, how boring</note></p>
> 
> should look like:
> 
> <p><s><seg>Some text.</seg></s><s><seg>Some more 
> text,</seg><seg> with a comma.</seg></s><note>This stuff, how 
> boring</note></p>
> 
> I wonder how I tell the processor to leave the note stuff alone?
> 
> Any help appreciated,
> 
> Christian
> 
> --
>   Christian Wittern
>   Institute for Research in Humanities, Kyoto University
>   47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN
