
Extraction of data using key() and matches()

Subject: Extraction of data using key() and matches()
From: Jakob Fix <jakob.fix@xxxxxxxxx>
Date: Sat, 5 Jun 2010 21:02:20 +0200
Hello,

I have a large number of XML data files (converted from Excel), each of
which contains a table made up of rows and data cells.

I want to find out whether the table's data cells contain any of a
known set of country names and, if so, record in another file every
country name that appears in the data file. The country name may be
the only content of the data cell (<col>United Kingdom</col>), or it
may be surrounded by other text (<col>Data has been provided for
United Kingdom only.</col>). A single cell may also contain more than
one country name. The cells contain no child elements, only character
data.

My current approach is to keep an exhaustive lookup file with *all*
country names that might be used. For each XML data file, I loop over
all country names and test whether any data cell in the file matches
the current country name.

The following works but is rather slow:

countries.xml

<countries>
  <country code="ABW">
    <fr>Aruba</fr>
    <en>Aruba</en>
  </country>
  <country code="AFG">
    <fr>Afghanistan</fr>
    <en>Afghanistan</en>
  </country>
  ...
</countries>

data.xml

<workbook>
  <sheet>
    <name><![CDATA[Figure 1.1 (I)]]></name>
    <row number="0">
      <col number="0"><![CDATA[United Kingdom]]></col>
    </row>
    <row number="1">
      <col number="0"><![CDATA[Part I. ]]></col>
      <col number="1"><![CDATA[These data apply to France, Germany and
a couple of other countries.]]></col>
     ...
    </row>
   ...
  </sheet>
</workbook>

extract.xsl

<xsl:for-each select="document($country-file)/countries/country/en">
  <xsl:variable name="current-node" select="."/>
  <xsl:if test="$data-doc//col[matches(., $current-node/text())]">
    <country><xsl:value-of select="$current-node/../@code"/></country>
  </xsl:if>
</xsl:for-each>


In order to speed up the process I was thinking about indexing all
data cells using xsl:key. However, I cannot see how the key() and the
matches() function can be combined to use the former's speed with the
latter's regex power.

I was hoping to do something along these lines, but I would need some
help, as this doesn't currently work:

<!-- create an index of the cells' contents -->
<xsl:key name="cell" match="col" use="text()"/>

<xsl:for-each select="document($country-file)/countries/country/en">
  <!-- don't lose the current node -->
  <xsl:variable name="current-node" select="."/>
  <!-- change context to the data document -->
  <xsl:for-each select="document($data-file)">
    <!-- key() returns a node-set, so count the nodes in it.
         This only finds cells whose entire content is the country name. -->
    <xsl:if test="count(key('cell', $current-node)) > 0">
      <country><xsl:value-of select="$current-node/../@code"/></country>
    </xsl:if>
  </xsl:for-each>
</xsl:for-each>
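
One direction that might combine the two, given here only as an untested sketch under assumptions: instead of keying the cells, key the country list by name, build a single regular expression out of all the names, and scan each cell once with xsl:analyze-string. The sketch assumes an XSLT 2.0 processor, that the data file is the principal input document, and that the country list is not empty (an empty regex would be an error). The key name country-by-name, the variable names and the <found> wrapper element are hypothetical. Like the matches() version, it does plain substring matching, so a short name occurring inside a longer word would still be reported.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Assumed parameter: location of the lookup file -->
  <xsl:param name="country-file" select="'countries.xml'"/>
  <xsl:variable name="countries" select="document($country-file)"/>

  <!-- Index the lookup file (not the data cells) by English name -->
  <xsl:key name="country-by-name" match="country" use="en"/>

  <!-- All names joined with '|', regex metacharacters escaped,
       longer names listed first so they are tried before shorter ones -->
  <xsl:variable name="name-regex">
    <xsl:for-each select="$countries/countries/country/en">
      <xsl:sort select="string-length(.)" order="descending"/>
      <xsl:if test="position() > 1">|</xsl:if>
      <xsl:value-of select="replace(., '[.\[\]\\|\-\^$?*+{}()]', '\\$0')"/>
    </xsl:for-each>
  </xsl:variable>

  <xsl:template match="/">
    <found>
      <!-- Collect every country name found in any cell -->
      <xsl:variable name="hits" as="xs:string*">
        <xsl:for-each select="//col">
          <xsl:analyze-string select="." regex="{$name-regex}">
            <xsl:matching-substring>
              <xsl:sequence select="."/>
            </xsl:matching-substring>
          </xsl:analyze-string>
        </xsl:for-each>
      </xsl:variable>
      <!-- Report each distinct name once, looked up via the key -->
      <xsl:for-each select="distinct-values($hits)">
        <country>
          <xsl:value-of select="key('country-by-name', ., $countries)/@code"/>
        </country>
      </xsl:for-each>
    </found>
  </xsl:template>

</xsl:stylesheet>
```

With this arrangement each cell is scanned once rather than once per country, and key() is only used for the exact-match lookup from name to code, which is what it is suited for. Whether this is actually faster would depend on the processor and on how it handles a large alternation.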

Maybe there's another solution that is more elegant and more efficient
than what I've shown above. I'd love to know about it.  Thank you in
advance for your help.

Jakob.
