Re: Splitting a paragraph into sentences and keep mark
There's a package for splitting at arbitrarily deeply nested nodes. It
is part of a paper that I presented at XML Prague this year:
https://archive.xmlprague.cz/2019/files/xmlprague-2019-proceedings.pdf#page=347
The package itself is at
https://subversion.le-tex.de/common/presentations/2019-02-09_xmlprague_xslt-upward-projection/lib/split.xsl
Using this package, Martin's p-matching template becomes:
<xsl:template match="p[node()]">
  <xsl:variable name="p-with-markers" as="element(p)">
    <xsl:apply-templates select="." mode="insert-marker"/>
  </xsl:variable><!-- this hasn't changed -->
  <xsl:variable name="chunks" as="document-node(element(split:chunks))">
    <xsl:apply-templates select="$p-with-markers"
                         mode="split:split-entrypoint"><!-- mode provided by lib/split.xsl -->
      <xsl:with-param name="group-start-exp" as="xs:string"
                      select="'self::eos'"/>
      <!-- Will be evaluated as an XPath expression for each node in a
           for-each-group[@group-starting-with] population. If a population
           node satisfies the expression, it will start a group. -->
      <xsl:with-param name="keep-splitting-node" as="xs:boolean"
                      select="false()"/><!-- remove <eos/> after splitting -->
    </xsl:apply-templates>
  </xsl:variable>
  <xsl:copy-of select="$chunks/split:chunks/split:chunk/p[node()]"
               copy-namespaces="no"/>
</xsl:template>
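The insert-marker mode used above is not shown in this message. As a rough illustration only, and not necessarily what Martin's original stylesheet does, a marker-inserting mode could look like the sketch below. It assumes that a sentence ends at a "." or "?" followed by whitespace and that the marker is an empty <eos/> element placed after that whitespace; the regex and the priority value are assumptions made for this example.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <!-- Identity copy, restricted to the marker mode -->
  <xsl:template match="@* | node()" mode="insert-marker">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()" mode="#current"/>
    </xsl:copy>
  </xsl:template>

  <!-- Assumed rule: "." or "?" plus trailing whitespace ends a sentence.
       The explicit priority avoids a conflict with the node() rule above,
       since text() and node() share the same default priority. -->
  <xsl:template match="text()" mode="insert-marker" priority="1">
    <xsl:analyze-string select="." regex="[.?]\s+">
      <xsl:matching-substring>
        <!-- keep the punctuation and whitespace, then drop in the marker -->
        <xsl:value-of select="."/>
        <eos/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Because the marker follows the whitespace, a group that starts at <eos/> begins with the first character of the next sentence, and a final "." with nothing after it gets no marker.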
The complete stylesheet is at
https://gist.github.com/gimsieke/529dab000386a45d6136e850a80ac726
Applying it to your input, David, will yield:
<?xml version="1.0" encoding="UTF-8"?><root>
<p>This has one <span class="zzz">sentence? </span></p><p><span
class="zzz">Actually, it has
<emphasis>two</emphasis>. </span></p><p><span class="zzz">No,</span> it
has three.</p>
</root>
Gerrit
On 24.11.2019 15:32, David Carlisle d.p.carlisle@xxxxxxxxx wrote:
can we assume the easy case (as in your example) where all the
sentences end at the top level?
a more challenging example is
<root>
<p>This has one <span class="zzz">sentence? Actually, it has
<emphasis>two</emphasis>. No,</span> it has three.</p>
</root>
as then you need to force-close any open elements at the sentence end
and re-open them in the new sentence so something like
<p>This has one <span class="zzz">sentence?</span></p>
<p><span class="zzz">Actually, it has <emphasis>two</emphasis>.</span></p>
<p><span class="zzz">No,</span> it has three.</p>
David
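One way to get the force-close and re-open behaviour David describes is to group the leaf nodes of the paragraph and rebuild the ancestor elements once per group. The sketch below is an illustration under assumptions, not the implementation in lib/split.xsl: it assumes that <eos/> markers have already been inserted at the sentence boundaries, and the mode name "rebuild" and the tunnel parameter "keep" are made up for this example.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <!-- Copy everything outside p unchanged -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Group the leaves (nodes without children) of each p; every <eos/>
       starts a new group, and p is rebuilt once per group -->
  <xsl:template match="p">
    <xsl:variable name="p" select="."/>
    <xsl:for-each-group select="descendant::node()[not(node())]"
                        group-starting-with="eos">
      <xsl:apply-templates select="$p" mode="rebuild">
        <!-- the marker itself is left out of the kept set -->
        <xsl:with-param name="keep" select="current-group()[not(self::eos)]"
                        tunnel="yes"/>
      </xsl:apply-templates>
    </xsl:for-each-group>
  </xsl:template>

  <!-- Copy a node only if it is, or contains, a leaf of the current group -->
  <xsl:template match="node()" mode="rebuild">
    <xsl:param name="keep" as="node()*" tunnel="yes"/>
    <xsl:if test="descendant-or-self::node() intersect $keep">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates mode="rebuild"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

An element such as the span that straddles two sentences contains leaves from more than one group, so it is copied, attributes included, into each resulting p. A group that holds nothing but a marker produces no output.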
On Sun, 24 Nov 2019 at 13:34, Rick Quatro rick@xxxxxxxxxxxxxx
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
Hi All,
I have a situation where I want to split a short paragraph into sentences and use them in different parts of my output. I am using <xsl:analyze-string> because I want to account for a sentence ending with a . or ?. This works unless the paragraph has child elements, like the <emphasis> in the second sentence. Can I split a paragraph into sentences and still keep the markup?
Here is my input document:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<p>This has one sentence? Actually, it has <emphasis>two</emphasis>. No, it has three.</p>
</root>
My stylesheet:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:rq="http://www.frameexpert.com"
    exclude-result-prefixes="xs rq"
    version="2.0">

  <xsl:output indent="yes"/>
  <xsl:strip-space elements="root"/>

  <xsl:template match="/root">
    <xsl:copy>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="p">
    <xsl:variable name="sentences" select="rq:splitParagraphIntoSentences(.)"/>
    <p><xsl:value-of select="$sentences[1]"/></p>
    <note>Something in between.</note>
    <p><xsl:value-of select="$sentences[position()>1]"/></p>
  </xsl:template>

  <xsl:function name="rq:splitParagraphIntoSentences">
    <xsl:param name="paragraph"/>
    <xsl:analyze-string select="$paragraph" regex=".+?[\.\?](\s+|$)">
      <xsl:matching-substring>
        <sentence><xsl:value-of select="replace(.,'\s+$','')"/></sentence>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:function>

</xsl:stylesheet>
My output:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<p>This has one sentence?</p>
<note>Something in between.</note>
<p>Actually, it has two. No, it has three.</p>
</root>
What I want is this:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<p>This has one sentence? </p>
<note>Something in between.</note>
<p>Actually, it has <emphasis>two</emphasis>. No, it has three. </p>
</root>
Any suggestions will be appreciated.
Rick