
Re: How to split text element to separate spans?

Subject: Re: How to split text element to separate spans?
From: "Imsieke, Gerrit, le-tex" <gerrit.imsieke@xxxxxxxxx>
Date: Tue, 08 Jun 2010 01:28:33 +0200
Dear Israel,

I once wrote a generic splitting routine where you can split at arbitrary XPath expressions, at any depth. It uses saxon:evaluate, though, and is too complicated to be instructive here. So I tried to simplify it, below.

Let's consider this input:

=========8<-------------------

<?xml version="1.0" encoding="utf-8"?>
<doc>
<p dir="ltr"><span class="smaller">text1
            <br />
             text2
            text3.
            <br />
            </span> <span class="smalleritalic">no</span> <span
class="smaller">problems.
            <br />


<br /></span></p>


<p dir="ltr"><br/><span class="smaller">text1
            <br />
             <span class="reallytiny">text2 <br /></span>
            text3.
            <br />
            </span> <span class="smalleritalic">no</span> <span
class="smaller">problems.
            <br />


<br /></span></p>


<p dir="ltr">  <span class="regular">"What else?"</span></p>
</doc>

=========8<-------------------

The first p contains your original input, the second p contains a br within *nested* spans (and a br immediately below p), and the third one doesn't contain a br.

Applying the stylesheet quoted below, we'll arrive at this output:

=========8<-------------------

<?xml version="1.0" encoding="UTF-8"?><doc>
<p dir="ltr"><span class="smaller">text1
</span><br/><span class="smaller">
text2
text3.
</span><br/><span class="smaller">
</span> <span class="smalleritalic">no</span> <span class="smaller">problems.
</span><br/><span class="smaller">



</span><br/></p>


<p dir="ltr"><br/><span class="smaller">text1
</span><br/><span class="smaller">
<span class="reallytiny">text2 </span></span><br/><span class="smaller">
text3.
</span><br/><span class="smaller">
</span> <span class="smalleritalic">no</span> <span class="smaller">problems.
</span><br/><span class="smaller">



</span><br/></p>


<p dir="ltr">  <span class="regular">"What else?"</span></p>
</doc>

=========8<-------------------

You might find it dissatisfying that the XML code doesn't look as pretty-printed as your desired output. In order to arrive at an output as neat as specified, you will need to apply three more passes of whitespace extraction/normalization (left, right, middle) to the top-level spans. If you really have to pretty-print the XML in such a way, I will send you the complete stylesheet.

So here's the version that does just the splitting:

=========8<-------------------

<?xml version="1.0" encoding="utf-8"?>
<xsl:transform
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:my="my"
  version="2.0"
  exclude-result-prefixes="my">

<xsl:output method="xml" indent="no" />

  <!-- Default identity transform: -->
  <xsl:template match="@* | *">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="p/span">
    <xsl:sequence select="my:split-at-br(.)"/>
  </xsl:template>


  <!-- split-at-br is intended for
       <p>foo<br/>bar</p>
       -> <p>foo</p><br/><p>bar</p> -->
  <xsl:function name="my:split-at-br" as="element(*)+">
    <xsl:param name="top" as="element(*)" />
    <!-- group adjacent leaves (text nodes, empty elements) which are not br: -->
    <xsl:for-each-group
      select="$top//node()[ count(node()) = 0 ]"
      group-adjacent="not(self::br)">
      <xsl:choose>
        <xsl:when test="current-grouping-key()">
          <!-- output the top element and its subtree, restricted to
               all ancestors of the current leaf group and the current leaf group itself: -->
          <xsl:apply-templates select="$top" mode="split">
            <xsl:with-param name="restricted-to" select="current-group()" tunnel="yes"/>
          </xsl:apply-templates>
        </xsl:when>
        <xsl:otherwise>
          <!-- copy every br of the group, so that directly adjacent
               br elements (no text in between) are all kept: -->
          <xsl:copy-of select="current-group()"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each-group>
  </xsl:function>


  <xsl:template match="*" mode="split">
    <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
    <!-- Only process this element if it's within the restriction group
         or its members' ancestors: -->
    <xsl:if test="generate-id(.) = (
                    for $n in $restricted-to
                    return (
                      for $a in $n/ancestor-or-self::*
                      return generate-id($a)
                    )
                  )">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates mode="#current">
          <xsl:with-param name="restricted-to" select="$restricted-to" tunnel="yes"/>
        </xsl:apply-templates>
      </xsl:copy>
    </xsl:if>
  </xsl:template>


  <xsl:template match="node()[count(node()) = 0]" mode="split">
    <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
    <xsl:if test="generate-id(.) = (for $n in $restricted-to return generate-id($n))">
      <xsl:copy-of select="." />
    </xsl:if>
  </xsl:template>


</xsl:transform>

=========8<-------------------

(Please note that I called it xsl:transform instead of xsl:stylesheet, as a tribute to Roger L. Costello. But that's another thread, a dead thread.)

The stylesheet (or transformation program, if you prefer) does the following:

For each span immediately below a p, call a function that returns multiple spans, interspersed with br's.

This function works as follows:

Of all descendants of the span, only select the leaves. So if the structure is
p
  span(1)
    span(2)
      text(a)
      br
      text(b)
    span(3)
      text(c)
it selects the sequence (text(a), br, text(b), text(c)).
Then it groups adjacent items of this sequence by the key not(self::br): every run of non-br nodes forms one group (key is true), and every run of br elements forms another (key is false).
So we now have the following groups:
(text(a)) -- grouping key is true
(br) -- grouping key is false
(text(b), text(c)) -- grouping key is true
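
To look at the grouping in isolation, here is a minimal sketch (not part of the stylesheet above) that only reports the leaf groups of each p/span. The element names groups and group and the attributes is-br and size are made up for illustration:

=========8<-------------------

<?xml version="1.0" encoding="utf-8"?>
<xsl:transform
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="2.0">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <groups>
      <xsl:for-each select="//p/span">
        <!-- same leaf selection and grouping key as in my:split-at-br -->
        <xsl:for-each-group
          select=".//node()[count(node()) = 0]"
          group-adjacent="not(self::br)">
          <!-- is-br is 'true' for a run of br elements, 'false' for a run of other leaves -->
          <group is-br="{not(current-grouping-key())}"
                 size="{count(current-group())}"/>
        </xsl:for-each-group>
      </xsl:for-each>
    </groups>
  </xsl:template>

</xsl:transform>

=========8<-------------------

Note that whitespace-only text nodes are leaves, too, so they take part in the grouping unless you strip them beforehand.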


For each of the non-br groups, span(1) -- the span to be split at br -- is processed in mode="split", with the parameter $restricted-to set to the current group.

First, span(1) is processed in mode="split" with $restricted-to = (text(a)).
Only if span(1) is among the ancestors of the $restricted-to nodes (or among the $restricted-to nodes themselves) will it be copied and its contents be processed.
Its contents will be processed in mode="split" as well, with the same $restricted-to parameter.
Being an ancestor of text(a), span(2) will be processed, while nothing happens for span(3).
As a result of processing span(2) in mode="split" with $restricted-to = (text(a)), only text(a) will be output.


Going back to for-each-group: the next group is br which will be reproduced as br, but on the same level as span(1).

So far, our result tree looks like
p
  span(1)
    span(2)
      text(a)
  br

The next group is (text(b), text(c)). Again, span(1) will be processed in mode="split", now with $restricted-to = (text(b), text(c)).
As an ancestor of at least one of the $restricted-to leaf nodes, span(1) will be reproduced (the element and its original attributes, not the entire subtree!).
As ancestors of one leaf node each, both span(2) and span(3) will be reproduced below span(1).
When processing the subtree of span(2) with the restriction to (text(b), text(c)), only text(b) will be output. For span(3), only text(c) will be output.
So finally we have
p
  span(1)
    span(2)
      text(a)
  br
  span(1)
    span(2)
      text(b)
    span(3)
      text(c)
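
As a side note, the two generate-id() comparisons can also be written as node identity tests with the XPath 2.0 intersect operator. The following is only a sketch of that alternative, not what the stylesheet above does. It assumes that the stylesheet above has been saved as split.xsl (a made-up file name) and overrides its two mode="split" templates by import precedence:

=========8<-------------------

<?xml version="1.0" encoding="utf-8"?>
<xsl:transform
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="2.0">

  <!-- split.xsl is an assumed file name for the stylesheet above -->
  <xsl:import href="split.xsl"/>

  <!-- copy an element only if it is an ancestor of a node in $restricted-to -->
  <xsl:template match="*" mode="split">
    <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
    <xsl:if test="exists(. intersect $restricted-to/ancestor-or-self::*)">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <!-- tunnel parameters are passed on automatically -->
        <xsl:apply-templates mode="#current"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <!-- copy a leaf only if it is itself in $restricted-to -->
  <xsl:template match="node()[count(node()) = 0]" mode="split">
    <xsl:param name="restricted-to" as="node()*" tunnel="yes"/>
    <xsl:if test="exists(. intersect $restricted-to)">
      <xsl:copy-of select="."/>
    </xsl:if>
  </xsl:template>

</xsl:transform>

=========8<-------------------

Whether this is faster than comparing generated IDs depends on the processor; I would choose between the two on readability alone.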


Although it may seem like overkill at first sight, the big advantage of this approach is that it works well for br within nested spans.

With the generic approach (arbitrary XPath expressions for splitting), you can extend analyze-string to process markup. In a preparatory pass, use plain analyze-string on the text nodes to replace each regex match with some unique markup. Then use the generic splitting function to split at this markup, and treat the resulting nodes as you would have treated matching or non-matching substrings.
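
A minimal sketch of such a preparatory pass might look like this. The marker element split-here and the parameter split-regex (with ';' as an arbitrary default) are made-up names for illustration; the generic splitting function itself is not shown here:

=========8<-------------------

<?xml version="1.0" encoding="utf-8"?>
<xsl:transform
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="2.0">

  <!-- hypothetical regex; it must not match the empty string -->
  <xsl:param name="split-regex" select="';'"/>

  <!-- identity transform for everything but text nodes -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- explicit priority, because text() and node() have the same default priority -->
  <xsl:template match="text()" priority="1">
    <xsl:analyze-string select="." regex="{$split-regex}">
      <xsl:matching-substring>
        <!-- replace each match with the (made-up) marker element -->
        <split-here/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:transform>

=========8<-------------------

A second pass would then split at split-here in the same way the stylesheet above splits at br.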

-Gerrit


On 07.06.2010 13:36, Israel Viente wrote:
Thank you for your answer Mukul.
It does put the br between the spans but loses the spaces between spans
and replaces them with br.

The result of the code you sent gives the following output:

<p dir="ltr"><span class="smaller">text1</span><br /><span
class="smaller">text2 text3.</span><br /><span
class="smalleritalic">no</span><br /><span
class="smaller">problems.</span><br /><br /></p>

The desired one is:
<p dir="ltr"><span class="smaller">text1</span>
            <br />
             <span class="smaller">text2 text3.</span>
            <br />
            <span class="smalleritalic">no</span>  <span
class="smaller">problems.</span>
            <br />
            <br />
            </p>
