Subject: RE: segmenting a paragraph
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Tue, 2 Oct 2007 09:36:58 +0100
Two approaches have been proposed in the past for applying regex matching
to text that crosses node boundaries:
(a) create a string in which the node boundaries are represented by some
recognizable textual markup (you could use saxon:serialize()), then apply
the regex processing, then reinstate the node structure (e.g. by using
saxon:parse()).
(b) do a deep copy, while processing each of the text nodes to replace the
significant features (such as end of sentence) by nodes (e.g. an
<end-of-sentence/> element). Then apply positional grouping techniques to
transform this into your target structure.
Neither is particularly easy, I'm afraid.
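
As a rough illustration of approach (b), here is a minimal XSLT 2.0 sketch
for the sentence level only. It is untested against your real data and
makes several assumptions: the mode name "mark" and the variable name
"marked" are made up for this example, a sentence is taken to end at any
".", "!" or "?" (so abbreviations and decimal points will be split
wrongly), and only text that is a direct child of <p> is marked.

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity copy in every mode: <note> and any other markup
       pass through unchanged. -->
  <xsl:template match="@*|node()" mode="#all">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" mode="#current"/>
    </xsl:copy>
  </xsl:template>

  <!-- Pass 1 (hypothetical mode "mark"): in text directly under <p>,
       emit an <end-of-sentence/> marker after each sentence terminator.
       The regex is an assumption; refine it for your texts. -->
  <xsl:template match="p/text()" mode="mark">
    <xsl:analyze-string select="." regex="[.!?]">
      <xsl:matching-substring>
        <xsl:value-of select="."/>
        <end-of-sentence/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- Pass 2: positional grouping over the marked-up copy. -->
  <xsl:template match="p">
    <!-- Temporary tree holding the children of <p> plus the markers. -->
    <xsl:variable name="marked">
      <xsl:apply-templates select="node()" mode="mark"/>
    </xsl:variable>
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:for-each-group select="$marked/node()"
                          group-ending-with="end-of-sentence">
        <xsl:choose>
          <!-- A group that ends with a marker is a complete sentence:
               wrap it in <s> and drop the marker itself. -->
          <xsl:when test="current-group()[last()][self::end-of-sentence]">
            <s>
              <xsl:copy-of
                  select="current-group()[not(self::end-of-sentence)]"/>
            </s>
          </xsl:when>
          <!-- Trailing material with no terminator (for example a
               <note> after the last sentence) is copied as it is. -->
          <xsl:otherwise>
            <xsl:copy-of select="current-group()"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

With the sample paragraph from the original question, this should wrap
each of the two sentences in <s> and leave the trailing <note> outside,
untouched. A <note> that sits in the middle of a sentence would end up
inside that sentence's <s>, still intact. The <seg> level could be
handled the same way, with a second marker element for punctuation and a
nested grouping step inside each <s>.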
Michael Kay
http://www.saxonica.com/
> -----Original Message-----
> From: Christian Wittern [mailto:cwittern@xxxxxxxxx]
> Sent: 02 October 2007 09:05
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: segmenting a paragraph
>
> Dear XSL-list readers,
>
> In trying to solve the following problem I am seeking your help:
> I want to segment paragraphs in a text, so that sentences are
> enclosed in a <s> element and within the sentences, words
> between interpunction are within <seg> elements.
>
> So far, I have been capturing the content of <p> in a string
> and then using two nested <xsl:analyze-string> blocks with
> regexes, which work nicely and do what I want. Now I
> discovered that there are <note> elements with additional
> markup in some paragraphs, which get lost in this process.
> However, I really want to leave these notes alone, as they are. So:
>
> <p>Some text. Some more text, with a comma. <note>This
> stuff, how boring</note></p>
>
> should look like:
>
> <p><s><seg>Some text.</seg></s><s><seg>Some more
> text,</seg><seg> with a comma.</seg></s><note>This stuff, how
> boring</note></p>
>
> I wonder how I tell the processor to leave the note stuff alone?
>
> Any help appreciated,
>
> Christian
>
> --
> Christian Wittern
> Institute for Research in Humanities, Kyoto University
> 47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN