[XSL-LIST Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message]

Re: Efficient way to check sequence membership

Subject: Re: Efficient way to check sequence membership
From: Dimitre Novatchev <dnovatchev@xxxxxxxxx>
Date: Wed, 2 Mar 2011 13:34:45 -0800
Re:  Efficient way to check sequence membership
One way is to have the strings sorted. Then use binary search -- this
is O(log(N)).

There is a standard FXSL function for this:  f:binSearch and here is the
code:

 <xsl:function name="f:binSearch" as="xs:boolean">
   <xsl:param name="pSeq" as="item()+"/>
   <xsl:param name="pVal" as="item()"/>
   <xsl:param name="pStart" as="item()"/>
   <xsl:param name="pEnd" as="item()"/>

   <xsl:sequence select=
    "if($pStart gt $pEnd) then false()
       else
        for $mid in ($pStart + $pEnd) idiv 2,
             $sVal in $pSeq[$mid]
           return
             if($sVal eq $pVal) then true()
             else
                if($sVal lt $pVal)
                   then f:binSearch($pSeq, $pVal, $mid+1, $pEnd)
                else
                  f:binSearch($pSeq, $pVal, $pStart, $mid - 1)
       "/>
 </xsl:function>

I have used the f:binSearch() functions for years -- for example in an
efficient spelling checker in which the dictionary is a sequence of
sorted strings. Even 5 years ago it was working with a speed of 3000
words per second when checking Shakespeare's Othello.


Another way is using XPath 3.0 and the Binary Search tree datatype
that I recently implemented.

This is described (and with full code) here:

http://dnovatchev.wordpress.com/2011/02/12/part-2-the-binary-search-data-stru
cture-with-xpath-3-0-deleting-a-node/



Hope this helped :)



On Wed, Mar 2, 2011 at 1:23 PM, Henry S. Thompson <ht@xxxxxxxxxxxx> wrote:
> A common requirement for me is to check if a particular string is a
> member of a set of other strings. B I often need this in an inner loop,
> doing it hundreds if not thousands of times.
>
> So, what's the most efficient way to do this?
>
> Consider the example (a real one) of checking to see if a word is what
> the IR people call a 'stop' word -- a short common word of little
> substance.
>
> Here's a function which checks this in a straightforward way:
>
> <xsl:variable
>
name="stopPat">it|its|itself|they|them|their|what|which|who|whom|this|that|th
ese|those|am|is|are|was|were|be|been|being|have|has|had|having|do|does|did|do
ing|a|an|the|and|but|if|or|because|as|until|while|of|at|by|for|with|about|aga
inst|between|into|through|during|before|after|above|below|to|from|up|down|in|
out|on|off|over|under|again|further|then|once|here|there|when|where|why|how|a
ll|any|both|each|few|more|most|other|some|such|no|nor|not|only|own|same|so|th
an|too|very|s|t|can|will|just|don|should|now</xsl:variable>
>
> B <xsl:variable name="stops" select="tokenize($stopPat,'\|')"/>
>
> B <xsl:function name="my:stop1" as="xs:boolean">
> B <xsl:param name="w" as="xs:string"/>
> B <xsl:sequence select="some $s in $stops satisfies ($s eq $w)"/>
> B </xsl:function>
>
> For obvious reasons this seems unlikely to be very efficient.
>
> I've come up with 4 other ways to do this, and the best of them is
> over 6 times faster than that one, using a realistic set of test
> data. B The worst is more than twice as _slow_ as that one. B That's a
> factor of 12 between best and worst!
>
> Timing was done using saxon91 on a Windows box, using -repeat:100,
> subtracting a baseline function which always returns true w/o checking
> anything. B I called the stop function 289 times. B On 95 occasions the
> argument was a stop word (drawn from only 23 word types). B There are
> 104 stop words in $stopPat.
>
> I'll follow this message with another containing the details: skip it
> if you don't care, or have a think about what _you_ think is the best
> answer before reading.
>
> ht
> --
> B  B  B  Henry S. Thompson, School of Informatics, University of Edinburgh
> B  B  B 10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131
650-4440
> B  B  B  B  B  B  B  B Fax: (44) 131 651-1426, e-mail: ht@xxxxxxxxxxxx
> B  B  B  B  B  B  B  B  B  B  B  URL: http://www.ltg.ed.ac.uk/~ht/
> B [mail from me _always_ has a .sig like this -- mail without it is forged
spam]
>
>



--
Cheers,
Dimitre Novatchev
---------------------------------------
Truly great madness cannot be achieved without significant intelligence.
---------------------------------------
To invent, you need a good imagination and a pile of junk
-------------------------------------
Never fight an inanimate object
-------------------------------------
You've achieved success in your field when you don't know whether what
you're doing is work or play
-------------------------------------
Facts do not cease to exist because they are ignored.

Current Thread

PURCHASE STYLUS STUDIO ONLINE TODAY!

Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced!

Buy Stylus Studio Now

Download The World's Best XML IDE!

Accelerate XML development with our award-winning XML IDE - Download a free trial today!

Don't miss another message! Subscribe to this list today.
Email
First Name
Last Name
Company
Subscribe in XML format
RSS 2.0
Atom 0.3
Site Map | Privacy Policy | Terms of Use | Trademarks
Free Stylus Studio XML Training:
W3C Member
Stylus Studio® and DataDirect XQuery ™are products from DataDirect Technologies, is a registered trademark of Progress Software Corporation, in the U.S. and other countries. © 2004-2013 All Rights Reserved.