
Re: finding and removing duplicate string

Subject: Re: finding and removing duplicate string
From: Wolfgang Laun <wolfgang.laun@xxxxxxxxx>
Date: Fri, 2 Dec 2011 18:22:02 +0100
Unless your <p> paragraphs are very short, you should not use pattern
matching like this, because this pattern exhibits quadratic
performance in the string length.

I ran a quick test comparing Java's regex engine to the substring
comparison proposed earlier in this thread.

The "hit" case (2 x "the quick brown..."):
   pattern:  0.000003061s - substr:  0.000000134s, a factor of 22

The "fail" case ("the quick brown..." vs "okkokoko...", equal lengths)
   pattern:  0.000004452s - substr:  0.000000026s, a factor of 171

Some XSLT regex engines might do better, but their execution time is
still bound to grow as O(n^2).
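
For reference, here is a minimal Java sketch of the two approaches being
compared. It is not the actual test code: the class and method names, the
trimming in normalize(), the test strings, and the crude timing loop are
all assumptions made for illustration. The regex is the one from the
quoted stylesheet; the substring variant relies on the fact that a
normalized doubled string "X X" has odd length with a single space in
the middle.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DuplicateCheck {
    // Same pattern as in the quoted stylesheet: a group, whitespace, the group again.
    private static final Pattern DOUBLED = Pattern.compile("^(.+)\\s+\\1$");

    // Collapse whitespace runs to one space. The trim() is an added assumption;
    // the stylesheet's replace() call does not trim.
    static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }

    // Regex variant: the engine backtracks over candidate lengths for group 1.
    static String dedupRegex(String text) {
        String s = normalize(text);
        Matcher m = DOUBLED.matcher(s);
        return m.matches() ? m.group(1) : s;
    }

    // Substring variant: one comparison of the two halves around the middle space.
    static String dedupSubstring(String text) {
        String s = normalize(text);
        int n = s.length();
        if (n % 2 == 1) {
            int half = n / 2;
            if (s.charAt(half) == ' ' && s.regionMatches(0, s, half + 1, half)) {
                return s.substring(0, half);
            }
        }
        return s;
    }

    public static void main(String[] args) {
        // Stand-in inputs; the strings used in the original test are not known.
        String hit = "the quick brown fox the quick brown fox";
        String fail = "the quick brown fox okkokokokokokokokok";
        System.out.println(dedupRegex(hit));
        System.out.println(dedupSubstring(hit));

        // Crude timing only; a serious measurement needs a proper benchmark harness.
        int runs = 100_000;
        long sink = 0;
        long t0 = System.nanoTime();
        for (int i = 0; i < runs; i++) sink += dedupRegex(fail).length();
        long t1 = System.nanoTime();
        for (int i = 0; i < runs; i++) sink += dedupSubstring(fail).length();
        long t2 = System.nanoTime();
        System.out.println("regex ns/call:  " + (t1 - t0) / runs);
        System.out.println("substr ns/call: " + (t2 - t1) / runs);
        System.out.println("(checksum " + sink + ")");
    }
}
```

Both methods return the same result on whitespace-normalized input; only
the amount of work differs.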

-W


On 2 December 2011 17:29, Imsieke, Gerrit, le-tex
<gerrit.imsieke@xxxxxxxxx> wrote:
>  <xsl:template match="p">
>    <xsl:copy>
>      <xsl:copy-of select="@*" />
>      <!-- use replace() for normalizing the input first, i.e., replace
>           the newline with a space: -->
>      <xsl:analyze-string select="replace(., '\s+', ' ')"
>                          regex="^(.+)\s+\1$">
>        <!-- \1 is a back-reference to the first match, which is allowed
>             according to http://www.w3.org/TR/xpath-functions/#regex-syntax -->
>        <xsl:matching-substring>
>          <xsl:value-of select="regex-group(1)"/>
>        </xsl:matching-substring>
>        <xsl:non-matching-substring>
>          <!-- output the whole string if above regex doesn't match: -->
>          <xsl:value-of select="."/>
>        </xsl:non-matching-substring>
>      </xsl:analyze-string>
>    </xsl:copy>
>  </xsl:template>
>
>
> On 2011-12-02 16:32, Jacob L wrote:
>>
>> All,
>>
>>
>> I am using <xsl:stylesheet version="2.0">. In the input XML file,
>> the text in the <p> tag sometimes repeats itself, such as
>>
>>
>>
>> <text>
>>
>> <p>Bradley Cooper named Peoples Sexiest man alive 2011  Bradley
>> Cooper named Peoples Sexiest man alive 2011</p>
>>
>> </text>
>>
>>
>>
>> I want to write code that detects the repetition and omits it. After
>> adding the check to the XSL and deleting the duplicate string, the
>> output should be:
>>
>>
>>
>>  <text>
>>         <p>Bradley Cooper named Peoples Sexiest man alive 2011</p>
>>    </text>
>>
>>
>> Thanks for the help!
>>
>
> --
> Gerrit Imsieke
> Geschäftsführer / Managing Director
> le-tex publishing services GmbH
> Weissenfelser Str. 84, 04229 Leipzig, Germany
> Phone +49 341 355356 110, Fax +49 341 355356 510
> gerrit.imsieke@xxxxxxxxx, http://www.le-tex.de
>
> Registergericht / Commercial Register: Amtsgericht Leipzig
> Registernummer / Registration Number: HRB 24930
>
> Geschäftsführer: Gerrit Imsieke, Svea Jelonek,
> Thomas Schmidt, Dr. Reinhard Vöckler
