Subject: Re: Which is less expensive group by or select distinct-values
From: "Michael Kay mike@xxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Fri, 15 Jul 2016 21:14:08 -0000
group-by and distinct-values are both going to have fairly similar time and
memory characteristics, but of course the details depend on the specific
processor.

But there are some very odd things going on in this code.

>
> <xsl:variable name="TermList">
> <xsl:value-of select="distinct-values(.//term[not(@keyref)])"
> separator=", " />

xsl:variable with an xsl:value-of child always has a bad smell. Why are you
constructing an XML tree fragment when all you want is a string? In 99% of
cases it should be <xsl:variable name="x" select="y"/>.

More important, why are the distinct values being concatenated into a single
comma-separated string, only to be tokenized again immediately afterwards?
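To illustrate the point, here is a minimal sketch of binding the variable directly to the sequence of distinct values, so that no temporary tree is built and nothing has to be tokenized afterwards. The variable name `terms` and the `match="/"` template are my own assumptions for the sake of a complete stylesheet; the `data` element is taken from the quoted code. Sorting is left out here to keep the example small.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="/">
    <!-- Bound with select=: the variable holds the sequence of distinct
         values itself, not a tree fragment containing a joined string. -->
    <xsl:variable name="terms"
        select="distinct-values(.//term[not(@keyref)])"/>

    <data type="topicreport" name="WDTermList">
      <!-- The separator is applied only at output time, so there is
           no intermediate string to tokenize. -->
      <xsl:value-of select="$terms" separator=", "/>
    </data>
  </xsl:template>

</xsl:stylesheet>
```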

> </xsl:variable>
> <data type="topicreport" name="WDTermList">
>   <xsl:for-each select="tokenize(normalize-space($TermList), ', ')">
>     <xsl:sort select="." />
>     <xsl:value-of select="."/>
>     <xsl:if test="position() != last()">, </xsl:if>
>   </xsl:for-each>
> </data>

And then turned back into a comma-separated string again, this time in sorted
order.
>
> If this hadn't existed in the stylesheet already, I would have probably
> done something like:
>
> <xsl:for-each-group select=".//term[not(@keyref)]" group-by=".">
>   <xsl:sort select="current-grouping-key()" />
>   <xsl:value-of select="current-grouping-key()"/>
>   <xsl:if test="position() != last()">, </xsl:if>
> </xsl:for-each-group>

That's certainly a lot better, assuming the comma-separation of the sorted
list is actually wanted. Personally, I would write:

<xsl:for-each select="distinct-values(.//term[not(@keyref)])">
  <xsl:sort select="."/>
  <xsl:if test="position() ne 1">, </xsl:if>
  <xsl:value-of select="."/>
</xsl:for-each>

Note that putting a comma before every item except the first, rather than
after every item except the last, is less likely to disrupt the processing
pipeline by calling last() right at the beginning, and can therefore reduce
memory usage. Saxon will usually handle either form OK, but you don't want to
be over-reliant on the optimizer recognizing such coding patterns.
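If you would rather not test position() or last() at all, one possible alternative is to sort the values first and let the separator attribute of xsl:value-of insert the commas. This is only a sketch: the variable name `sorted-terms` and the surrounding template are assumptions of mine, and I make no claim about how any particular processor optimizes it.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:template match="/">
    <!-- Sort the distinct values once and keep them as a sequence. -->
    <xsl:variable name="sorted-terms" as="xs:anyAtomicType*">
      <xsl:perform-sort select="distinct-values(.//term[not(@keyref)])">
        <xsl:sort select="."/>
      </xsl:perform-sort>
    </xsl:variable>

    <data type="topicreport" name="WDTermList">
      <!-- separator= writes ", " between adjacent items, so neither
           position() nor last() appears in the stylesheet. -->
      <xsl:value-of select="$sorted-terms" separator=", "/>
    </data>
  </xsl:template>

</xsl:stylesheet>
```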

With distinct-values, the memory needed is for the set of distinct values.
With for-each-group, it's much more likely that the memory requirement will be
one entry for each distinct value, where the entry holds both the value, and
the list of nodes having that value, which you don't need in this case.
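For contrast, here is a sketch of a case where the list of nodes per value is actually used, which is where for-each-group earns its keep: reporting how often each term occurs. The output element names `terms` and `term` and the `count` attribute are hypothetical, chosen only for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="/">
    <terms>
      <xsl:for-each-group select=".//term[not(@keyref)]" group-by=".">
        <xsl:sort select="current-grouping-key()"/>
        <!-- current-group() is the list of term elements sharing this
             value; distinct-values() gives you no equivalent. -->
        <term count="{count(current-group())}">
          <xsl:value-of select="current-grouping-key()"/>
        </term>
      </xsl:for-each-group>
    </terms>
  </xsl:template>

</xsl:stylesheet>
```

If all you need is the values themselves, as in the original question, nothing here reads current-group(), and distinct-values() expresses the intent more directly.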
>
> I don't think the above is my major time sink in this process but it is
> one class of things that I'm reporting. I think the real processing time
> issue is coming from a lot of string analysis/parsing that is occurring.
>
Indeed, the costs might not come from this part of the code at all.

Michael Kay
Saxonica
