Extraction of data using key() and matches()
Hello,

I have a large number of XML data files, each of which contains a table with rows and data cells (they were previously Excel files). I'm interested in finding out whether a given country name appears in the table's data cells. If so, I want to record in another file all country names that appear in the data file.

The country name may be the only content of the data cell (<col>United Kingdom</col>), or it may be surrounded by other text (<col>Data has been provided for United Kingdom only.</col>). More than one country name may also appear in a single table cell. There won't be other elements in the cell, just character data.

My current approach is to have an exhaustive lookup file with *all* country names that are potentially used. For each XML data file, I loop over all country names and check whether any cell in the data file matches the current country name. The following works but is rather slow:

countries.xml

    <countries>
      <country code="ABW">
        <fr>Aruba</fr>
        <en>Aruba</en>
      </country>
      <country code="AFG">
        <fr>Afghanistan</fr>
        <en>Afghanistan</en>
      </country>
      ...
    </countries>

data.xml

    <workbook>
      <sheet>
        <name><![CDATA[Figure 1.1 (I)]]></name>
        <row number="0">
          <col number="0"><![CDATA[United Kingdom]]></col>
        </row>
        <row number="1">
          <col number="0"><![CDATA[Part I. ]]></col>
          <col number="1"><![CDATA[These data apply to France, Germany and a couple of other countries.]]></col>
          ...
        </row>
        ...
      </sheet>
    </workbook>

extract.xsl

    <xsl:for-each select="document($country-file)/countries/country/en">
      <xsl:variable name="current-node" select="."/>
      <xsl:if test="$data-doc//col[matches(., $current-node/text())]">
        <country><xsl:value-of select="$current-node/../@code"/></country>
      </xsl:if>
    </xsl:for-each>

In order to speed up the process I was thinking about indexing all data cells using xsl:key. However, I cannot see how the key() and matches() functions can be combined so as to get the former's speed together with the latter's regex power.
I was hoping to do something along these lines, but would need some help as this doesn't currently work:

    <!-- create an index of the cells' contents -->
    <xsl:key name="cell" match="col" use="text()"/>

    <xsl:for-each select="document($country-file)/countries/country/en">
      <!-- don't lose the current node -->
      <xsl:variable name="current-node" select="."/>
      <!-- change context to data document -->
      <xsl:for-each select="document($data-file)">
        <!-- key returns a nodeset, so count the number of nodes in the nodeset.
             this doesn't work if the country name is not the only content -->
        <xsl:if test="count(key('cell', $current-node)) > 0">
          <country><xsl:value-of select="$current-node/../@code"/></country>
        </xsl:if>
      </xsl:for-each>
    </xsl:for-each>

Maybe there's another solution that is more elegant and more efficient than what I've shown above. I'd love to know about it.

Thank you in advance for your help.

Jakob.
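One possible direction, given here as an untested sketch rather than a confirmed solution: since key() only does exact-value lookups, the key can index the country lookup file instead of the data cells, while a single regex built from all country names scans each cell once. The stylesheet below assumes an XSLT 2.0 processor, that it is run with a data file as the source document, and that countries.xml is non-empty. The names `country-by-name`, `names-regex` and `hits` are made up for illustration and are not from the original stylesheet.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Assumption: path to the lookup file, as in the original post -->
  <xsl:param name="country-file" select="'countries.xml'"/>
  <xsl:variable name="countries" select="document($country-file)"/>

  <!-- Index the lookup file (not the data cells) by English name -->
  <xsl:key name="country-by-name" match="country" use="en"/>

  <!-- One alternation of all names, with regex metacharacters escaped -->
  <xsl:variable name="names-regex" as="xs:string"
      select="string-join(
                for $n in $countries/countries/country/en
                return replace($n, '[.\\?*+(){}\[\]|\^\$\-]', '\\$0'),
                '|')"/>

  <xsl:template match="/">
    <!-- Collect every country name found in any cell: one pass per cell -->
    <xsl:variable name="hits" as="xs:string*">
      <xsl:for-each select="//col">
        <xsl:analyze-string select="." regex="{$names-regex}">
          <xsl:matching-substring>
            <xsl:sequence select="."/>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </xsl:for-each>
    </xsl:variable>

    <countries>
      <!-- Map each distinct matched name back to its code via the key -->
      <xsl:for-each select="distinct-values($hits)">
        <country>
          <xsl:value-of select="key('country-by-name', ., $countries)/@code"/>
        </country>
      </xsl:for-each>
    </countries>
  </xsl:template>
</xsl:stylesheet>
```

The intent is that each cell is scanned once against the combined pattern rather than once per country, and the three-argument form of key() resolves a matched name to its code in the lookup document. Like the matches() version, this also matches names inside longer words, and when one name is a prefix of another (for example Niger and Nigeria) the order of the alternatives may decide which one is reported, so sorting the names longest-first before joining them may be needed.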