

Subject: Spelling Othello (Was: Re: Text processing on XSLT 2.0)
From: Dimitre Novatchev <dnovatchev@xxxxxxxxx>
Date: Tue, 5 Apr 2005 07:00:07 +1000
I didn't mention that the text I was spelling was the play:

  "Othello"

by William Shakespeare

On Apr 5, 2005 6:56 AM, Dimitre Novatchev <dnovatchev@xxxxxxxxx> wrote:
> On Apr 5, 2005 6:41 AM, M. David Peterson <m.david.x2x2x@xxxxxxxxx> wrote:
> > Working on projects such as XBiblio/Citeproc, led by Bruce D'Arcus,
> > I have realized that, however far the XSLT 2.0 working draft goes in
> > bringing Perl-esque text processing to the XML
> > developer, it is still up to the developer to fine-tune these
> > capabilities to cover their specific needs.  For example, a spell
> > checker.
> >
> > Can anyone who has extensive experience developing
> > such capabilities in XSLT share that experience with the rest
> > of us?
>
> Hi Mark,
>
> Recently I have been having fun with an f:binSearch() function and
> then, logically, with f:spell().
>
> I have a dictionary of about 47000 English wordforms, on which I
> search with f:binSearch()
>
> I had to write a faster function than the current quadratic
> str-split-to-words template -- this is the f:getWords() function.
>
> All these functions can be downloaded from the FXSL CVS (just let me
> know if you'd want me to send you the zip archive).
>
> The combination of these functions works quite well.
>
> This transformation (test-FuncSpell.xsl):
>
> <xsl:stylesheet version="2.0"
> xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
> xmlns:xs="http://www.w3.org/2001/XMLSchema"
> xmlns:f="http://fxsl.sf.net/"
> exclude-result-prefixes="f xs"
> >
>  <xsl:import href="../f/func-getWords.xsl"/>
>  <xsl:import href="../f/func-spell.xsl"/>
>
>  <xsl:output omit-xml-declaration="yes"/>
>
> <xsl:variable name="vDelim" as="xs:string">
> ,:.-&#9;&#10;&#13;'!?;</xsl:variable>
>
> <!-- To be applied on ../data/othello.xml -->
>  <xsl:template match="/">
>    <xsl:variable name="vwordNodes" as="element()*">
>       <xsl:for-each select="//text()/lower-case(.)">
>         <xsl:sequence select="f:getWords(., $vDelim, 1)"/>
>       </xsl:for-each>
>    </xsl:variable>
>
>    <xsl:variable name="vUnique" as="xs:string+">
>      <xsl:perform-sort select="distinct-values($vwordNodes)">
>        <xsl:sort select="."/>
>      </xsl:perform-sort>
>    </xsl:variable>
>
>    <xsl:variable name="vnotFound" as="xs:string*"
>     select="$vUnique[not(f:spell(.))]"/>
>
>    <xsl:value-of separator="&#xA;"
>     select="$vnotFound"/>
>
>    A total of <xsl:value-of select="count($vwordNodes)"/> words
>    were spelt, (<xsl:value-of select="count($vUnique)"/>) distinct.
>
>    <xsl:value-of select="count($vnotFound)"/> not found.
> </xsl:template>
> </xsl:stylesheet>
>
> when applied on othello.xml (around 29000 words)
>
> produces this result:
>
> Saxon 8.3 from Saxonica
> Java version 1.5.0_01
> Stylesheet compilation time: 1140 milliseconds
> Processing file:/C:\xml\Parsers\Saxon\Ver.8.3\samples\data\othello.xml
> Building tree for
> file:/C:\xml\Parsers\Saxon\Ver.8.3\samples\data\othello.xml using
> class net.sf.saxon.tinytree.TinyBuilder
> Tree built in 94 milliseconds
> Tree size: 18539 nodes, 154557 characters, 0 attributes
> Building tree for file:/C:/CVS-DDN/fxsl-xslt2/f/func-getWords.xsl
> using class net.sf.saxon.tinytree.TinyBuilder
> Tree built in 0 milliseconds
> Tree size: 43 nodes, 143 characters, 22 attributes
> Building tree for file:/C:/CVS-DDN/fxsl-xslt2/data/dictEnglish.xml
> using class net.sf.saxon.tinytree.TinyBuilder
> Tree built in 188 milliseconds
> Tree size: 139140 nodes, 528397 characters, 0 attributes
> Execution time: 7015 milliseconds
>
> <a-list-of-567-unknown-words-omitted/>
>
>    A total of 28622 words
>    were spelt, (3669) distinct.
>
>    567 not found.
>
> So, checking 3669 distinct words in 7015  milliseconds makes
>
>  523.02 words/sec.
>
> The actual speed is faster, as the total time includes splitting up
> the words and finding the distinct words.
>
> Among the unknown words are such nice words as:
>
> affordeth
> affrighted
> ariseth
> arithmetician
> arrivance
> bethink
> betimes
> bewhored
>
> :o)
>
> Cheers,
>
> Dimitre
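
For readers without the FXSL sources at hand, the pipeline described above (split the text into words, reduce to sorted distinct words, look each one up by binary search in a sorted dictionary) can be sketched roughly as follows. This is only an illustration in Python, not the FXSL code: get_words, is_known and spell_check are hypothetical names standing in for f:getWords(), f:binSearch() and f:spell(), and treating a plain space as a delimiter is an assumption.

```python
from bisect import bisect_left

# Delimiters mirroring $vDelim in the stylesheet: , : . - TAB LF CR ' ! ? ;
# The trailing plain space is an assumption, not taken from the stylesheet.
DELIMS = ",:.-\t\n\r'!?; "

# Map every delimiter to a space so str.split() can do the splitting.
_TO_SPACE = str.maketrans({c: " " for c in DELIMS})


def get_words(text):
    """Lower-case the text and split it into words in a single linear pass."""
    return text.lower().translate(_TO_SPACE).split()


def is_known(word, sorted_dict):
    """Binary search for word in an alphabetically sorted list of wordforms."""
    i = bisect_left(sorted_dict, word)
    return i < len(sorted_dict) and sorted_dict[i] == word


def spell_check(text, sorted_dict):
    """Return (total word count, distinct word count, sorted unknown words)."""
    words = get_words(text)
    unique = sorted(set(words))
    not_found = [w for w in unique if not is_known(w, sorted_dict)]
    return len(words), len(unique), not_found
```

Called as spell_check(text, dictionary), where dictionary is a sorted list of lower-case wordforms, it returns the same three figures the stylesheet reports. Each lookup takes about log2(47000), roughly 16, comparisons against a dictionary of the size mentioned above.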
