Re: Keys and select distinct - is that the solution ?

Subject: Re: Keys and select distinct - is that the solution ?
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Mon, 05 Jun 2006 15:54:18 -0400
Hi Christian,

At 07:30 PM 6/2/2006, you wrote:
> I have now tried the solutions, but none of them works.

Actually, I kind of doubt that. :-> What you have tried is either an attempt at solving the problem blind, posted by contributors (me) who worked with a partial data set and partial problem description, or attempts of your own at patching such code.


Believe me, "the solution" works just fine. You just haven't figured out how to write it yet, and neither have we. This doesn't mean that the solution is not known -- we've written it plenty of times before, just not fitted to your particular problem (which we nevertheless recognize as a member of the species).

> Actually I don't think I need to use the generic_id, do I?
> Because I don't need to make all the elements unique!!? As far as I
> can see, I only have to pick out all the distinct codes.

The generate-id() idiom I suggested is not for the purposes of "making an element unique". It is merely a way of checking whether one node is the same node as another node. Consider this document:


<a>
  <b>100</b>
  <b>100</b>
</a>

Are /a/b[1] and /a/b[2] the same node? No.

How does a stylesheet know this? It can't tell by comparing their names: they're both named 'b'. Nor by comparing their values, which are both '100'.

It would be possible to write a template that produced for each node a unique identifier, which we could compare. For example, it could generate for the first b node the identifier "/a/b[1]" and for the second, "/a/b[2]". We could compare these strings to establish the two nodes are not the same node.
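To make that concrete, here is a minimal sketch of such a path-building template. The mode name "path" and the driver template for "/" are my own choices for illustration, not anything from your stylesheet, and this version also puts a position on every step, not just the last one.

```xslt
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Write the path of the matched element, e.g. /a[1]/b[2].
       The mode name "path" is a hypothetical choice. -->
  <xsl:template match="*" mode="path">
    <!-- First write the path of the parent element, if there is one -->
    <xsl:apply-templates select="parent::*" mode="path"/>
    <xsl:text>/</xsl:text>
    <xsl:value-of select="name()"/>
    <xsl:text>[</xsl:text>
    <!-- Position among preceding siblings that have the same name -->
    <xsl:value-of
      select="count(preceding-sibling::*[name() = name(current())]) + 1"/>
    <xsl:text>]</xsl:text>
  </xsl:template>

  <!-- Driver: list the path of every b, one per line -->
  <xsl:template match="/">
    <xsl:for-each select="//b">
      <xsl:apply-templates select="." mode="path"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Run against the little document above, this writes /a[1]/b[1] and /a[1]/b[2], two different strings for two different nodes.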

Or, since generate-id() generates, for any node, an identifier that is unique to the node, we could just use this function, and not have to write that template.

Or, there's another way to test whether these are the same. Say we have

<xsl:variable name="first-b" select="/descendant::b[1]"/>

<xsl:template match="b">
  <xsl:choose>
    <xsl:when test="count(.|$first-b)=1">This b is the first</xsl:when>
    <xsl:otherwise>This b is not the first</xsl:otherwise>
  </xsl:choose>
</xsl:template>

Using generate-id() instead, we could say

<xsl:template match="b">
  <xsl:choose>
    <xsl:when test="generate-id() = generate-id($first-b)">This b is the first</xsl:when>
    <xsl:otherwise>This b is not the first</xsl:otherwise>
  </xsl:choose>
</xsl:template>


which also works.
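If you want to watch the two tests agree, a complete stylesheet along these lines should do it. This is only a sketch for the two-b sample document above; the output messages and the xsl:strip-space declaration (there to keep stray whitespace out of the text output) are my additions.

```xslt
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <!-- The first b in the document, in document order -->
  <xsl:variable name="first-b" select="/descendant::b[1]"/>

  <xsl:template match="b">
    <!-- Test 1: a union never holds the same node twice, so a union
         of size 1 means both operands are the same node -->
    <xsl:if test="count(. | $first-b) = 1">
      <xsl:text>count() says: this b is the first&#10;</xsl:text>
    </xsl:if>
    <!-- Test 2: the same node always yields the same generated id -->
    <xsl:if test="generate-id() = generate-id($first-b)">
      <xsl:text>generate-id() says: this b is the first&#10;</xsl:text>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

Both messages come out once, for the first b only; the second b produces nothing.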

Either of these can be applied to solve the problem of "am I a unique representative of a given group of nodes", which is part of the grouping problem. (And David C is correct: yours is a grouping problem.)

> By doing that I do have to match on the content of the node, and not
> the element name, right!?

Actually you match on a node, not on its content or name.


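We do match nodes *by* name. Indeed this is the normal way of doing it. In XSLT 1.0 a match pattern can look at content only through a predicate (for example, match="b[.='100']"), and a pattern may not contain variable references or current(), so matching templates on content is rarely the practical route.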

> If I match on the content/text of the node
> couldn't I say something like take all the elements whose content is
> not in any preceding sibling content ???

You could match a node and test to see if its content appeared on a preceding element or preceding-sibling element, yes. And indeed, that is a solution available to us for grouping. But it is a slow solution: every element has to look back over everything that comes before it, so it doesn't scale well even to medium-sized data sets.
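For reference, the look-backwards approach is usually written something like the sketch below. I am assuming the values to compare sit in b elements, as in the sample document; your real element names will differ.

```xslt
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Keep a b only if no earlier b in the document has the same
         string value. Each b scans everything before it, which is
         where the poor scaling comes from. -->
    <xsl:for-each select="//b[not(. = preceding::b)]">
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

On the two-b sample this writes 100 once. It gives the right answer; the objection is only to the cost.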


It's much quicker to do something like

<xsl:template match="b">
  <xsl:variable name="bs-like-this"
    select="/descendant::b[.=current()]"/>
  <xsl:if test="generate-id()=generate-id($bs-like-this[1])">
    <xsl:text>I'm a b; my content is </xsl:text>
    <xsl:apply-templates/>
  </xsl:if>
</xsl:template>

Instead of using the painful traversal along the preceding axis, this template works like this:

1. Bind to a variable all the 'b' nodes in the document whose content
   is the same as the b node matched
2. Test to see whether the b node matched is the first of the nodes
   bound to the variable; if it is, report its content

If we can do this, then grouping all the bs by content (*not* by name) is as simple as processing all the bs bound to the variable in step 1. This is a trivial tweak to what I just wrote above (which I leave to you to figure out).
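In case it helps to check your own attempt, one possible reading of that tweak is sketched here. What to do with each group is up to you; the report text, the member numbering and the xsl:strip-space declaration are placeholders of mine.

```xslt
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="b">
    <!-- Step 1: every b whose content equals this b's content -->
    <xsl:variable name="bs-like-this"
      select="/descendant::b[. = current()]"/>
    <!-- Step 2: act only when this b is the first of its group -->
    <xsl:if test="generate-id() = generate-id($bs-like-this[1])">
      <xsl:text>Group </xsl:text>
      <xsl:value-of select="."/>
      <xsl:text> has </xsl:text>
      <xsl:value-of select="count($bs-like-this)"/>
      <xsl:text> member(s)&#10;</xsl:text>
      <!-- The tweak: process the whole group, not just its first b -->
      <xsl:for-each select="$bs-like-this">
        <xsl:text>  member </xsl:text>
        <xsl:value-of select="position()"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:for-each>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

Each distinct value is reported once, followed by one line per b that carries it.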

This is still slow, however, since for every b matched by the template we have to assemble the set /descendant::b[.=current()], which entails looking through the entire document. Accordingly, for this we usually use keys (this was Steve Muench's contribution to the method), since a processor can index the keyed nodes ahead of time, which makes the lookup fast:

<xsl:variable name="bs-like-this" select="key('bs-by-value',.)"/>

which grabs those nodes without having to traverse the entire tree.

In this case the key 'bs-by-value' would index the 'b' nodes by their content (value):

<xsl:key name="bs-by-value" match="b" use="."/>
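Put together, the key declaration and the first-of-its-group test make a complete stylesheet like the sketch below. It is written against the sample document, so the element name b and the report text are stand-ins for whatever your data actually needs.

```xslt
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <!-- Index every b by its string value -->
  <xsl:key name="bs-by-value" match="b" use="."/>

  <xsl:template match="b">
    <!-- All bs sharing this b's value, fetched through the key -->
    <xsl:variable name="bs-like-this" select="key('bs-by-value', .)"/>
    <!-- Report a value only at the first b that carries it -->
    <xsl:if test="generate-id() = generate-id($bs-like-this[1])">
      <xsl:value-of select="."/>
      <xsl:text> occurs </xsl:text>
      <xsl:value-of select="count($bs-like-this)"/>
      <xsl:text> time(s)&#10;</xsl:text>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

The template body is the same as before; only the way the set is collected has changed.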

If you really want to pursue a solution based on checking backwards along the preceding:: axis, we can help with that. By pointing you to the grouping solutions (which build on what I just showed you above), we are trying to skip you past that point, since it's not the best solution available.

If you need more help disentangling this, please feel free to post again. But when you do, post your sample code again please, so we can point the way using examples that make sense.

Good luck,
Wendell
