
Re: two <xsl:analyze-string> questions

Subject: Re: two <xsl:analyze-string> questions
From: Brandon Ibach <brandon.ibach@xxxxxxxxxxxxxxxxxxx>
Date: Sat, 22 Oct 2011 12:43:20 -0400
The following might work for part 2.

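  <!-- The regex attribute is an attribute value template, so keeping the
       pattern in a variable avoids having to double the braces in \p{L}. -->
  <xsl:variable name="regex" select="'(\p{L})6(\p{L}?)|(\p{L}?)6(\p{L})'"/>
  <xsl:analyze-string select="." regex="{$regex}">
    <xsl:matching-substring>
      <!-- Groups from the alternative that did not match are empty strings,
           so only the letters actually captured are written back out. -->
      <xsl:value-of select="concat(regex-group(1), regex-group(3),
                                   'b', regex-group(2), regex-group(4))"/>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>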
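
If nothing needs to be wrapped around the match, the same correction can probably be done with replace() alone. This is only a sketch, assuming an XSLT 2.0 processor; the shortened regex is my own simplification and not something from the original question. It relies on the rule that a group which did not take part in the match is replaced by a zero-length string.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Sketch only: rewrites a "6" that touches a letter on either side.
       $1 holds a preceding letter, $2 a following letter; whichever group
       did not take part in the match contributes an empty string. -->
  <xsl:template match="text()">
    <xsl:value-of select="replace(., '(\p{L})6|6(\p{L})', '$1b$2')"/>
  </xsl:template>

</xsl:stylesheet>
```

With a text node such as "a6c 6 6d" this should give "abc 6 bd": the isolated "6" is left alone because neither alternative can match without a letter. Since only text nodes are handled here, element markup in the input is dropped by the built-in rules; add an identity template if it needs to survive.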

-Brandon :)


On Sat, Oct 22, 2011 at 10:55 AM, Birnbaum, David J <djbpitt@xxxxxxxx> wrote:
> Dear XSLT-List,
>
> I'd be grateful for advice about a two-part <xsl:analyze-string> problem.
> I'm post-processing messy OCR output, and the situation I'm trying to address
> involves patterns and patterned errors that can be identified through regex
> matching. Some of the patterns are traditional up-conversion (e.g., find a
> certain pattern of digits and punctuation and wrap markup around it); some of
> them are corrections (e.g., the digit "6" and the letter "b" are confused, but
> a digit "6" adjacent to a letter is probably an error and should be corrected
> automatically, while a digit "6" not adjacent to a letter probably isn't and
> should be left alone).
>
> 1. The first part of my problem involves general program logic. I'm
> currently using a strategy like the following:
>
>    <xsl:template match="text()">
>        <xsl:call-template name="editionLineNo">
>            <xsl:with-param name="current" select="."/>
>        </xsl:call-template>
>    </xsl:template>
>    <xsl:template name="editionLineNo">
>        <!-- 1. check for digits plus period, \d+\., edition line no -->
>        <xsl:param name="current"/>
>        <xsl:analyze-string select="$current" regex="(\d+)\.">
>            <xsl:matching-substring>
>                <editionLineNo>
>                    <xsl:value-of select="regex-group(1)"/>
>                </editionLineNo>
>            </xsl:matching-substring>
>            <xsl:non-matching-substring>
>                <xsl:call-template name="msFolioNo">
>                    <xsl:with-param name="current" select="."/>
>                </xsl:call-template>
>            </xsl:non-matching-substring>
>        </xsl:analyze-string>
>    </xsl:template>
>
> That is, at the beginning I grab a pristine text node and look for a
> pattern. If it's there, I'm done; if not, I pass the non-matching substring to
> the next template to look for a different pattern. One template calls another,
> passing the unmatched substrings, until the end, when I just output the text.
>
> This works, but is it the best approach? Should I instead, for example, use
> a single callable template and pass it both the haystack string and the needle
> regex? My highest priorities are legibility and ease of development and
> maintenance; efficiency of operation is less important. In case this is
> important, the order in which the patterns are matched matters, at least in a
> few instances. For example, digits followed by a period get one kind of markup
> and digits not followed by a period get another, so I want to capture the
> first type first and get them out of the way before looking for the second.
>
> 2. The second part of my problem involves a particular type of regex, one
> that will, for example, identify a digit "6" that is adjacent to a letter and
> replace it with a letter "b". The adjacent letter could precede or follow the
> digit or both. If I make the preceding and following letter(s) optional in the
> pattern, I've made both optional, and I'll erroneously catch an isolated digit
> "6". If I use a disjunct pattern, it becomes harder to capture the pieces and
> output the ones I want to retain with regex-group(). I suspect that this is a
> common problem with a standard solution, but I haven't run into it before and
> no single, elegant but legible regex leaps to mind. Is there one?
>
> Thanks for any advice,
>
> David
> djbpitt@xxxxxxxxx
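
On part 1, the "single callable template" idea asked about above could look roughly like the following. Treat it as an untested sketch, assuming XSLT 2.0. The names applyRules, rule and $rules are made up for illustration, and the second rule's regex is a guess, since the original msFolioNo pattern was not shown. Rules are tried in document order, so the ordering requirement is kept by the order of the rule elements.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical rule table: earlier rules win. These attributes are
       attribute value templates, so a pattern such as \p{L} would need
       its braces doubled here. -->
  <xsl:variable name="rules" as="element(rule)+">
    <rule name="editionLineNo" regex="(\d+)\."/>
    <rule name="msFolioNo" regex="(\d+)"/>
  </xsl:variable>

  <xsl:template match="text()">
    <xsl:call-template name="applyRules">
      <xsl:with-param name="text" select="."/>
      <xsl:with-param name="rules" select="$rules"/>
    </xsl:call-template>
  </xsl:template>

  <xsl:template name="applyRules">
    <xsl:param name="text"/>
    <xsl:param name="rules" as="element(rule)*"/>
    <xsl:choose>
      <!-- No rules left: output whatever text remains. -->
      <xsl:when test="empty($rules)">
        <xsl:value-of select="$text"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:variable name="rule" select="$rules[1]"/>
        <xsl:analyze-string select="$text" regex="{$rule/@regex}">
          <xsl:matching-substring>
            <!-- Wrap the first captured group in the rule's element. -->
            <xsl:element name="{$rule/@name}">
              <xsl:value-of select="regex-group(1)"/>
            </xsl:element>
          </xsl:matching-substring>
          <xsl:non-matching-substring>
            <!-- Hand only the unmatched piece to the remaining rules. -->
            <xsl:call-template name="applyRules">
              <xsl:with-param name="text" select="."/>
              <xsl:with-param name="rules" select="subsequence($rules, 2)"/>
            </xsl:call-template>
          </xsl:non-matching-substring>
        </xsl:analyze-string>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

Adding a pattern then means adding one rule element rather than another template. The trade-off is that every rule has to fit the same shape (one captured group, one wrapper element); a correction like the 6/b fix, which rewrites text instead of wrapping it, would still need its own step.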
