
Re: Matching string values across element boundaries

Subject: Re: Matching string values across element boundaries
From: Michael Sokolov <msokolov@xxxxxxxxxxxxxxxxxxxxx>
Date: Mon, 08 Apr 2013 19:00:00 -0400
David - I think whether there is any improvement to be made to your system will depend in detail on just how the matching algorithm works. Clearly, if it expects a string, you have to give it one, and you are left with something like your approach. If you're willing to revisit the matching algorithm (I expect you don't want to; it sounds hairy), you could probably also change the markup generation. One idea that springs to mind is the highlighters available in search platforms: these typically operate on text only, remembering the position of every word, and let you mark the matched words with tags in a highlighting pass. You can later coalesce those tags using XSLT or some other markup-aware process. If you can cast the matching problem as a search problem, you could leverage MarkLogic, Lucene, or something like that. Maybe that would be better than what you have, I don't know.
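
To make the position-based idea concrete, here is a rough sketch in Python rather than XSLT, and not taken from any particular search platform. It assumes the content has already been split into "runs" of text, each optionally carrying the tags that wrap it. The names mark_matches and runs are made up for illustration. The pattern is matched against the concatenated plain text only, and the <ref> tags are placed afterwards by character offset.

```python
import re
from xml.sax.saxutils import escape

def mark_matches(runs, pattern, open_tag="<ref>", close_tag="</ref>"):
    """runs: list of (text, wrapper); wrapper is None or a (start_tag, end_tag) pair."""
    # Match against plain text only; the markup plays no part in the regex.
    flat = "".join(text for text, _ in runs)
    spans = [m.span() for m in pattern.finditer(flat) if m.end() > m.start()]
    starts = {s for s, _ in spans}
    ends = {e for _, e in spans}
    cuts = starts | ends

    out = []
    pos = 0  # offset of the current run within the flat text
    for text, wrapper in runs:
        if not text:
            continue
        run_end = pos + len(text)
        # Split the run wherever a match begins or ends inside it.
        points = [pos] + sorted(c for c in cuts if pos < c < run_end) + [run_end]
        for a, b in zip(points, points[1:]):
            if a in ends:
                out.append(close_tag)
            if a in starts:
                out.append(open_tag)
            piece = escape(text[a - pos:b - pos])
            if wrapper is not None:
                piece = wrapper[0] + piece + wrapper[1]
            out.append(piece)
        pos = run_end
    if pos in ends:  # a match that runs to the very end of the text
        out.append(close_tag)
    return "".join(out)

HI = ('<hi rend="italic">', "</hi>")
runs = [("See Sprague, ", None), ("Braintree Families", HI), (", p. 4.", None)]
print(mark_matches(runs, re.compile(r"Sprague, Braintree Families")))
```

Because every run is split at the match boundaries, a match that starts or ends in the middle of a <hi> yields two adjacent <hi> elements instead of ill-nested tags. Those adjacent elements are what a later markup-aware pass would coalesce.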

-Mike

On 4/8/2013 2:15 PM, David Sewell wrote:
I expect this has been discussed here before, but I can't locate any relevant
discussion, so here goes.

We have input data with many unmarked short-title citations that look like this:

Sprague, <hi rend="italic">Braintree Families</hi>

We want to wrap them inside another element, in our case a <ref> to the
bibliographic expansion. We have a venerable chain of XSLT 2.0 transforms that
does this, and pretty well, by preprocessing the data to convert all those <hi>
tags into a pair of unique ASCII characters, so that we can do string-matching
operations within a single text node that now includes something like

Sprague, "Braintree Families%

which is easy to handle with xsl:analyze-string. Then, once we've wrapped all the
strings we need to, we post-process with xsl:analyze-string to put the <hi>
elements back in.
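
For readers who want to see the shape of that round trip, here is a minimal sketch in Python, not our actual XSLT chain. It works on a serialized string rather than on nodes, and the sentinel characters, the citation pattern, and the function name wrap_citations are all placeholders chosen for illustration.

```python
import re

# Hypothetical sentinels: two private-use characters assumed never to occur in the data.
OPEN, CLOSE = "\uE000", "\uE001"
HI_OPEN, HI_CLOSE = '<hi rend="italic">', "</hi>"

# Illustrative pattern only: a capitalised surname, a comma, then one italic title.
CITATION = re.compile("[A-Z][a-z]+, " + OPEN + "[^" + CLOSE + "]+" + CLOSE)

def wrap_citations(fragment, citation_re=CITATION):
    # 1. Pre-process: swap each <hi> tag for a single sentinel character.
    flat = fragment.replace(HI_OPEN, OPEN).replace(HI_CLOSE, CLOSE)
    # 2. Match within the flattened string and wrap each hit in <ref>.
    wrapped = citation_re.sub(lambda m: "<ref>" + m.group(0) + "</ref>", flat)
    # 3. Post-process: put the <hi> tags back.
    return wrapped.replace(OPEN, HI_OPEN).replace(CLOSE, HI_CLOSE)

print(wrap_citations('See Sprague, <hi rend="italic">Braintree Families</hi>, p. 4.'))
```

The real regexes are of course much more involved, since the citations vary a good deal in structure.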

In practice, given the proper regexes, this works quite well and provides the
desired output, but I always feel a bit guilty about the hackishness of the
approach. Given that the citations are quite variable in structure (usually but
not always containing <hi> elements, with various combinations of text nodes at
start and end), I've never come up with a good general-purpose way to operate
purely on elements and text nodes without the convert-tags-to-characters step.
Is there one (or more)?

David S.
