
Re: Matching string values across element boundaries

Subject: Re: Matching string values across element boundaries
From: "steve.majewski@xxxxxxxxx" <steve.majewski@xxxxxxxxx>
Date: Mon, 8 Apr 2013 17:01:09 -0400
xsl:analyze-string can match the string and wrap a <ref> around the matching
substring easily.
The problem is that the matching substring (the context item inside
xsl:matching-substring) is a plain string with its markup stripped out, so what
you would get is:

<note>See for example <ref target="st002">Jay, Unpublished Papers</ref>,

instead of:

<note>See for example <ref target="st002">Jay, <i>Unpublished
Papers</i></ref>, 4:123.</note>
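
A minimal sketch of why the markup disappears, written in Python rather than XSLT only so that it runs standalone. It assumes that matching against the string value of <note> behaves like joining the element's descendant text nodes, which is what "." gives you in matches() or xsl:analyze-string. The variable names are illustrative, not part of anyone's transform.

```python
import re
import xml.etree.ElementTree as ET

# Sample input taken from the thread.
note = ET.fromstring(
    "<note>See for example Jay, <i>Unpublished Papers</i>, 4:123.</note>"
)

# The string value of <note>: all descendant text nodes concatenated.
# The <i> element boundaries are gone at this point.
string_value = "".join(note.itertext())

# Wrap the matched abbreviation, as a regex pass over the string value would.
wrapped = re.sub(
    re.escape("Jay, Unpublished Papers"),
    lambda m: '<ref target="st002">' + m.group(0) + "</ref>",
    string_value,
)

print(wrapped)
```

The match succeeds, but the replacement is built from text that no longer knows where the <i> started or ended.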

And I can't think of any way to preserve the markup except by escaping it in
some manner.
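
For illustration, here is a rough sketch of that escaping round trip in Python (not XSLT, and not David's actual transform). It assumes the only inline tag is a bare <i>, that the two private-use code points chosen as markers never occur in the data, and that working on serialized text is acceptable; the names OPEN, CLOSE, to_markers, from_markers and wrap_citations are all hypothetical.

```python
import re

# Hypothetical markers: private-use code points standing in for <i> and </i>.
OPEN, CLOSE = "\uE000", "\uE001"


def to_markers(text):
    """Replace <i> tags with single marker characters."""
    return text.replace("<i>", OPEN).replace("</i>", CLOSE)


def from_markers(text):
    """Turn the marker characters back into <i> tags."""
    return text.replace(OPEN, "<i>").replace(CLOSE, "</i>")


def wrap_citations(xml_text, lookup):
    """Wrap every known abbreviation in <ref target="id">...</ref>.

    lookup maps an id to the abbreviation as it appears in the table,
    including its <i> tags.
    """
    flat_to_id = {to_markers(abbr): ref_id for ref_id, abbr in lookup.items()}
    # Longest alternatives first, so a longer citation wins over a shorter
    # one that happens to be its prefix.
    alternatives = sorted(flat_to_id, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(a) for a in alternatives))

    def wrap(match):
        ref_id = flat_to_id[match.group(0)]
        return '<ref target="%s">%s</ref>' % (ref_id, match.group(0))

    # Flatten the tags, match in one pass, then restore the tags.
    return from_markers(pattern.sub(wrap, to_markers(xml_text)))


lookup = {
    "st002": "Jay, <i>Unpublished Papers</i>",
    "st003": "<i>JCC</i>",
}
source = "<note>See for example Jay, <i>Unpublished Papers</i>, 4:123.</note>"
result = wrap_citations(source, lookup)
print(result)
```

Because the markers travel through the regex match as ordinary characters, the <i> boundaries survive inside the new <ref>. Matching all abbreviations in a single alternation avoids re-matching text that an earlier pass has already wrapped.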

-- Steve Majewski / UVA Alderman Library

On Apr 8, 2013, at 4:26 PM, David Sewell wrote:

> A sample of the citation abbreviations that need to be matched (for
> simplicity, <i> is used to indicate italics), from the lookup table used by
> the transforms (omitting the expansions of the abbreviations that are in the
> lookup table also):
> <abbr xml:id="st001"><i>Cal. Franklin Papers</i>, A.P.S.</abbr>
> <abbr xml:id="st002">Jay, <i>Unpublished Papers</i></abbr>
> <abbr xml:id="st003"><i>JCC</i></abbr>
> <abbr xml:id="st004"><i>Oxford Classical Dicy.</i></abbr>
> <abbr xml:id="st005">U.S. Census, 1790</abbr>
> In the incoming XML, abbreviations like those above appear in running text
> without wrapper elements. The automated process to add wrappers needs to
> operate on string values that often cross <i> boundaries, as in the first two
> examples. So one might find in running text:
>    <note>See for example Jay, <i>Unpublished Papers</i>, 4:123.</note>
> which needs to be transformed into
>    <note>See for example <ref target="st002">Jay, <i>Unpublished Papers</i></ref>, 4:123.</note>
> The XPath //note[matches(., 'Jay, Unpublished Papers')] will match the
> input <note>, but the complexity is writing a template that wraps the
> appropriate portions of the note in a <ref> element. That's why our
> preprocessing converts <i> tags in both input and lookup table to single text
> characters to make the string matching relatively simple.
> And we do in fact use unusual Unicode for markers in our current transform,
> the example I gave substituted markers that would show up in everyone's
> David
> On Apr 8, 2013, at 2:58 PM, Michael Müller-Hillebrand <mmh@xxxxxxxxx> wrote:
>> David,
>> Can you give a more complex example of how "variable in structure" those
>> citations may be? This may also shed some light on the kind of processing you
>> want to do. Changing tags to characters (why are you using ASCII instead of
>> some high Unicode character from the private use area?) and then back to tags
>> seems not a very interesting thing
>> - Michael
>> Am 08.04.2013 um 20:15 schrieb David Sewell <dsewell@xxxxxxxxxxxx>:
>>> I expect this has been discussed here before, but I can't locate any
>>> discussion, so here goes.
>>> We have input data with many unmarked short-title citations that look like
>>> Sprague, <hi rend="italic">Braintree Families</hi>
>>> We want to wrap them inside another element, in our case a <ref> to the
>>> bibliographic expansion. We have a venerable chain of XSLT 2.0 transforms
>>> does this, and pretty well, by preprocessing the data to convert all those
>>> tags into a pair of unique ASCII characters, so that we can do
>>> operations within a single text node that now includes something like
>>> Sprague, "Braintree Families%
>>> which is easy to handle with xsl:analyze-string. Then once we've wrapped all the
>>> strings we need to, we post-process with xsl:analyze-string to put the
>>> elements back in.
>>> In practice, given the proper regexes, this works quite well and provides
>>> desired output, but I always feel a bit guilty about the hackishness of
>>> approach. Given that the citations are quite variable in structure (usually but
>>> not always containing <hi> elements, with various combinations of text nodes at
>>> start and end), I've never come up with a good general-purpose way to
>>> purely on elements and text nodes without the convert-tags-to-characters
>>> Is there one (or more)?
>>> David S.
