Re: Which is less expensive group by or select distin
At 02:14 PM 7/15/2016, Michael Kay mike@xxxxxxxxxxxx wrote:
group-by and distinct-values are both going to have fairly similar
time and memory characteristics, but of course the details depend on
the specific processor.
But there are some very odd things going on in this code.
>
> <xsl:variable name="TermList">
> <xsl:value-of select="distinct-values(.//term[not(@keyref)])"
> separator=", " />
xsl:variable with an xsl:value-of child always has a bad smell. Why
are you constructing an XML tree fragment when all you want is a
string? In 99% of cases it should be <xsl:variable name="x" select="y"/>.
>>>The separator attribute caused that nesting. The values returned
might have spaces, but not commas, so that was being used to break up
the results to sort them.
More important, why are the distinct values being concatenated into
a single comma-separated string, only to be tokenized again
immediately afterwards?
> </xsl:variable>
> <data type="topicreport" name="WDTermList">
> <xsl:for-each select="tokenize(normalize-space($TermList), ', ')">
> <xsl:sort select="." />
> <xsl:value-of select="."/>
> <xsl:if test="position() != last()">, </xsl:if>
> </xsl:for-each>
> </data>
And then turned back into a comma-separated string again, this time
in sorted order.
>
> If this hadn't existed in the stylesheet already, I would have probably
> done something like:
>
> <xsl:for-each-group select=".//term[not(@keyref)]" group-by=".">
> <xsl:sort select="current-grouping-key()" />
> <xsl:value-of select="current-grouping-key()"/>
> <xsl:if test="position() != last()">, </xsl:if>
> </xsl:for-each-group>
That's certainly a lot better, assuming the comma-separation of the
sorted list is actually wanted. Personally, I would write:
<xsl:for-each select="distinct-values(.//term[not(@keyref)])">
<xsl:sort select="."/>
<xsl:if test="position() ne 1">, </xsl:if>
<xsl:value-of select="."/>
</xsl:for-each>
Note that putting a comma before every item except the first, rather
than after every item except the last, is less likely to disrupt the
processing pipeline by calling last() right at the beginning, and
can therefore reduce memory usage. Saxon will usually handle either
form OK, but you don't want to be over-reliant on the optimizer
recognizing such coding patterns.
>>>Agreed
With distinct-values, the memory needed is for the set of distinct
values. With for-each-group, it's much more likely that the memory
requirement will be one entry for each distinct value, where the
entry holds both the value, and the list of nodes having that value,
which you don't need in this case.
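As a sketch of how both points could be combined (keep only the distinct values, and avoid position() and last() tests altogether), the sorted sequence can be held in a typed variable and joined with the separator attribute. This is not code from the thread: the match="/" template, the variable name sorted-terms and the xs namespace prefix are assumptions made for illustration, and it assumes an XSLT 2.0 processor.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Assumed context: the original ran inside some topic-level template;
       match="/" is used here only to make the sketch self-contained. -->
  <xsl:template match="/">

    <!-- The as attribute makes this a sequence of atomic values rather
         than a temporary tree, so no string is built and re-tokenized. -->
    <xsl:variable name="sorted-terms" as="xs:anyAtomicType*">
      <xsl:perform-sort select="distinct-values(.//term[not(@keyref)])">
        <xsl:sort select="."/>
      </xsl:perform-sort>
    </xsl:variable>

    <data type="topicreport" name="WDTermList">
      <!-- separator joins the items, so no comma bookkeeping is needed -->
      <xsl:value-of select="$sorted-terms" separator=", "/>
    </data>
  </xsl:template>

</xsl:stylesheet>
```

Whether this is measurably cheaper than the xsl:for-each form will depend on the processor; the main gain is that the data stays a sequence from start to finish.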
>
> I don't think the above is my major time sink in this process but it is
> one class of things that I'm reporting. I think the real processing time
> issue is coming from a lot of string analysis/parsing that is occurring.
>
Indeed, the costs might not come from this part of the code at all.
>>>I've confirmed that this is at least where some of my memory
problems are coming from. I need to do some more work to figure out
what is actually going on. Typically each topic has 2-3 taxonomy
values and we only have about 200 unique terms in the taxonomy. So
worst case if things are running the way I think, there might be
12,000 uses that reduce down to fewer than 200 distinct values. There
are two other statements like this looking at different elements with
the same sort of scope. If they are included, I cross the memory
limit. When I had these in place and increased the memory, the run
time increased from 60 min to 300+ min before failing.
Michael Kay
Saxonica
---------------------------------------------------------------------------
Danny Vint
Panoramic Photography
http://www.dvint.com
voice: 619-647-5780