Subject: RE: distinct-values() optimization, sorting by frequency
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Fri, 8 Feb 2008 14:48:28 -0000
In the alphabetical list,

count($persNames[normalize-space(lower-case(.)) = $current-name])

could be optimized by:

(a) using keys

(b) using Saxon-SA, which will optimize it to use a key automatically

(c) using xsl:for-each-group rather than distinct-values(), though that will
require some restructuring of your code.
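
For (a), something along these lines might do as a starting point. It is only a sketch: the key name persName-by-norm and the variable $n are made up for illustration, and I'm assuming your tei: prefix is bound to the usual TEI P5 namespace. Because key() only searches the document containing the context node, the lookup is written as a path step over $docs so that it runs once per document in the collection.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">
  <!-- Namespace URI above is an assumption; use whatever your stylesheet binds tei: to -->

  <!-- Hypothetical key name: indexes each persName by its normalized form -->
  <xsl:key name="persName-by-norm"
           match="tei:text//tei:persName"
           use="normalize-space(lower-case(.))"/>

  <xsl:variable name="docs"
      select="collection('../../working/xml/files.xml')"/>

  <xsl:template name="main">
    <xsl:variable name="persNames" select="$docs//tei:text//tei:persName"/>
    <xsl:variable name="distinct-persNames"
        select="distinct-values($persNames/normalize-space(lower-case(.)))"/>
    <list type="unordered">
      <xsl:for-each select="$distinct-persNames">
        <xsl:sort select="."/>
        <xsl:variable name="current-name" select="."/>
        <!-- $docs/key(...) evaluates the key once per document and
             returns the matching persName nodes from all of them -->
        <xsl:variable name="n"
            select="count($docs/key('persName-by-norm', $current-name))"/>
        <item><xsl:value-of select="concat($current-name, '  --  ', $n)"/></item>
      </xsl:for-each>
    </list>
  </xsl:template>

</xsl:stylesheet>
```

The point of the key is that each count becomes an index lookup rather than a scan of the whole of $persNames for every distinct name.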

In the frequency-sorted list, I think for-each-group would definitely be
better:

<xsl:for-each-group select="$persNames" group-by="lower-case(.)">
  <xsl:sort select="count(current-group())"/>
  ...

(Note also that you could use a case-blind collation rather than
lower-case(), as discussed in another thread today.)
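
Filled out a little, the frequency list might look roughly like the sketch below. Some assumptions on my part: I've kept normalize-space() in the grouping key to match your original expression, I've sorted in descending order with the name as a secondary sort key (drop either if that isn't what you want), and the namespace URI is the usual TEI P5 one.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">
  <!-- Namespace URI above is an assumption; use whatever your stylesheet binds tei: to -->

  <xsl:variable name="docs"
      select="collection('../../working/xml/files.xml')"/>

  <xsl:template name="main">
    <xsl:variable name="persNames" select="$docs//tei:text//tei:persName"/>
    <div>
      <head>Frequency List</head>
      <list type="unordered">
        <!-- One pass over $persNames: each group holds all the persName
             elements sharing the same normalized, lower-cased value -->
        <xsl:for-each-group select="$persNames"
            group-by="normalize-space(lower-case(.))">
          <!-- count() returns an integer, so this sorts numerically -->
          <xsl:sort select="count(current-group())" order="descending"/>
          <!-- Secondary key: ties are listed alphabetically -->
          <xsl:sort select="current-grouping-key()"/>
          <item>
            <xsl:value-of select="concat(count(current-group()),
                '  --  ', current-grouping-key())"/>
          </item>
        </xsl:for-each-group>
      </list>
    </div>
  </xsl:template>

</xsl:stylesheet>
```

This way the count is available from current-group() both in the sort key and in the output, so nothing has to be computed twice, and distinct-values() is no longer needed for this list. The alphabetical list can be done the same way by sorting on current-grouping-key() alone.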

Michael Kay
http://www.saxonica.com/


 

> -----Original Message-----
> From: James Cummings [mailto:cummings.james@xxxxxxxxx] 
> Sent: 08 February 2008 14:28
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject:  distinct-values() optimization, sorting by frequency
> 
> Hiya,
> 
> I'm wondering the best way to optimize a distinct-values() 
> based transformation.  What I'm basically doing is:
> ======
> <xsl:variable name="docs"  
> select="collection('../../working/xml/files.xml')"/>
> 
> <xsl:template name="main" >
>  <xsl:variable name="persNames" 
> select="$docs//tei:text//tei:persName"/>
>  <xsl:variable name="norm-persNames"
> select="$persNames/normalize-space(lower-case(.))"/>
>  <xsl:variable name="distinct-persNames"
> select="distinct-values($norm-persNames)"/>
> <!-- I realize that I could be more specific on the 
> $persNames variable, but doing so doesn't seem to affect 
> speed much at all. -->
> <div type="main">
> 
> <!-- Some overall counts -->
> <div><head>Overall Counts</head>
> <list type="unordered">
>   <item>Number of <gi>persName</gi> elements total:
>     <xsl:value-of select="count($persNames)"/></item>
>   <item>Number of <gi>persName</gi> elements which have a  
> @key attribute total: <xsl:value-of 
> select="count($persNames[@key])"/></item>
> <item>Number of distinct-value <gi>persName</gi> elements total:
> <xsl:value-of select="count($distinct-persNames)"/></item>
> </list></div>
> 
> <!-- An Alphabetical List -->
> <div><head>Alphabetical List</head>
>   <list type="unordered">
>     <xsl:for-each select="$distinct-persNames">
>       <xsl:sort select="."/>
>       <xsl:variable name="current-name" select="."/>
>       <xsl:variable name="count-distinct-current-name"
>      select="count($persNames[normalize-space(lower-case(.)) 
> =$current-name])"/>
>       <item><xsl:value-of select="concat($current-name,
>           '  --  ', $count-distinct-current-name)"/></item>
>       </xsl:for-each>
>    </list>
> </div>
> 
> <!-- A Frequency Sorted List  -->
> <div>
>   <head>Frequency List</head>
>   <list type="unordered">
>     <xsl:for-each select="$distinct-persNames">
>       <xsl:sort 
> select="count($persNames[normalize-space(lower-case(.))
>         = .])"/>
> <!-- I think it is this sort statement which slows things 
> down, since I have to repeat it twice. -->
>       <xsl:variable name="current-name" select="."/>
>       <xsl:variable name="count-distinct-current-name"
>         select="count($persNames[normalize-space(lower-case(.))
>         = $current-name])"/>
>       <item><xsl:value-of select="concat($count-distinct-current-name,
>           '  --  ', $current-name)"/> </item>
>     </xsl:for-each>
>   </list>
> </div>
> </div>
> ======
> 
> I think the real slow-down comes in the second xsl:for-each 
> where I want to sort by frequency of distinct-value by doing:
> <xsl:sort 
>   select="count($persNames[normalize-space(lower-case(.)) = .])"/>
> I have to have it for the sort, and then I have to 
> re-do it for the output inside the <item> element.  I'm 
> obviously not allowed a variable between the for-each and the 
> sort... but I have a feeling I'm missing some clever 
> optimization here.
> 
> Although this is for a pre-generated transformation, it 
> currently takes a *hugely* long time, and I'm thinking I must 
> be able to optimize it somehow.
> 
> Any suggestions appreciated,
> 
> -James
