Re: Using XSLT to build an index
I have now normalized and isolated every phrase I wish to index into a few
thousand structures similar to:
<Text lang="en" data="Zlutice Hymnal 1558" title="Czech Republic Stamp 664" ref="2010-664.htm"/>

and want to break the @data attribute string into individual words, each associated with the element's title and ref attributes. How do I use "distinct-values(tokenize(@data))" to construct a sequence of <Word> elements from the <Text> element similar to the following? That is, I don't see how to get at the words returned from distinct-values(tokenize(@data)) one at a time to do this.

<Word title="Czech Republic Stamp 664" ref="2010-664.htm">Zlutice</Word>
<Word title="Czech Republic Stamp 664" ref="2010-664.htm">Hymnal</Word>
<Word title="Czech Republic Stamp 664" ref="2010-664.htm">1558</Word>

Thanks,
Mark

-----Original Message-----
From: G. Ken Holman
Sent: Sunday, October 30, 2011 3:07 PM
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: Using XSLT to build an index

At 2011-10-30 14:47 -0700, Mark wrote:

The list archives did not seem to contain an XSLT stylesheet that could index an XML file, but I may have missed it. Is it practical to write my own XSLT 2 indexing stylesheet? If so, I have a bilingual XML file that I want to index.
My assumptions are that I must get rid of the punctuation properly, then isolate the words, sort them, remove stop words, and so on. To get started, I need a bit of help. All of the phrases are found in two attributes: @czech and @eng.
    translate($inValue,'-,#.$%',' ')

... where the first argument is your input, the second starts with a "-" followed by any other characters you want removed, and the third indicates that the hyphen becomes a space and the rest are removed.

(2) I assume that to get rid of extra spaces (if any), I can use a construct like: normalize-space(replace(@czech, 'some regex expression')).

That will reduce all sequences of white-space characters to a single space.

(3) I assume that tokenize(normalize-space(replace(@czech, 'some regex expression'))) will permit me to write out a list of the words found in those attributes to an XML document. I am not completely clear as to what tokenize() returns, or how to access that return.

tokenize() returns a sequence. But the input is only a single string. Actually, you want to turn the expression inside-out to get a list of words from the entire document; something along these lines should work:

    distinct-values( (//@czech)/tokenize(translate(normalize-space(.),'-,$%.#',' '), '\s+') )

That gives you a sequence of unique words. Can you work from that in order to do the hyperlinking, or do you need help there as well?

Remember you will have to do the same translation when creating your links, so perhaps you should have a user function:

    mark:words(.) as tokenize(translate(normalize-space($arg),'-,$%.#',' '), '\s+')

... then use:

    (//@czech)/mark:words(.)

... then when creating your links you'll have the function available to ensure the same tokenizing is done at that point in time.

I hope this helps.

. . . . . . . . . . Ken
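
For illustration only, here is a minimal XSLT 2.0 sketch of how the suggested user function might be combined with xsl:for-each to turn each <Text> element from the follow-up question into <Word> elements. Several details are assumptions rather than anything stated in the thread: the namespace URI bound to the mark prefix is a placeholder, the <Words> wrapper element is hypothetical, and translate() is applied before normalize-space() so that a hyphen replaced by a space cannot leave an empty token. tokenize() is given an explicit pattern because the XPath 2.0 function requires one.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:mark="urn:x-example:mark"
    exclude-result-prefixes="xs mark">

  <!-- The mark namespace URI above is a placeholder (assumption). -->
  <xsl:output method="xml" indent="yes"/>

  <!-- Shared tokenizer: the hyphen becomes a space, the other listed
       characters are dropped, white space is collapsed, and the
       result is split on single spaces. -->
  <xsl:function name="mark:words" as="xs:string*">
    <xsl:param name="arg" as="xs:string?"/>
    <xsl:sequence
        select="tokenize(normalize-space(translate($arg, '-,$%.#', ' ')), ' ')"/>
  </xsl:function>

  <!-- Hypothetical wrapper element so the output has a single root. -->
  <xsl:template match="/">
    <Words>
      <xsl:apply-templates select="//Text"/>
    </Words>
  </xsl:template>

  <xsl:template match="Text">
    <!-- Inside for-each the context item is a string, not the <Text>
         element, so keep the element in a variable to reach its
         attributes. -->
    <xsl:variable name="text" select="."/>
    <xsl:for-each select="distinct-values(mark:words(@data))">
      <Word title="{$text/@title}" ref="{$text/@ref}">
        <xsl:value-of select="."/>
      </Word>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

The point of the $text variable is that xsl:for-each over a sequence of strings changes the context item to each string in turn, which is how the words can be handled "one at a time". Run against the sample <Text> element, this should produce one <Word> per distinct word in @data, each carrying the title and ref of its parent; the order in which distinct-values() returns the words is implementation-dependent.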