Subject: RE: Using analyze-string to catch roman numerals?
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Thu, 9 Oct 2008 23:05:57 +0100
The two things wrong with your solution are:
(a) you're matching any sequence of letters that could be a Roman numeral,
without looking at the context, hence matching the IX in APPENDIX;
(b) you're only matching the first thing in each element that looks like a
Roman numeral.
The second is easily fixed: don't use an anchored regex in analyze-string,
like this:
regex="^(.*?)([IVXL]+)(.*?)$"
Because it is anchored at both ends, it can match the string only once, and
the reluctant (.*?) stops at the first run of IVXL. Instead use an
unanchored regex:
regex="([IVXL]+)"
and add an xsl:non-matching-substring element that copies unmatched
substrings across unchanged (or case-converted if you want).
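As a minimal sketch of that fix for (b), assuming an XSLT 2.0 processor and
keeping your match="head" template, the xsl:choose wrapper is no longer
needed, because analyze-string hands every part of the string to one branch
or the other:

```xml
<xsl:template match="head">
  <!-- Unanchored: each run of I, V, X, L in the string is a separate match. -->
  <xsl:analyze-string select="." regex="([IVXL]+)">
    <xsl:matching-substring>
      <!-- Candidate numeral: copy it as it stands (already upper case). -->
      <xsl:value-of select="."/>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <!-- Everything between the matches: lower-case it. -->
      <xsl:value-of select="lower-case(.)"/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>
```

This only addresses (b). It still treats the IX in APPENDIX and the ILL in
ILLUSTRATION as numerals, which is problem (a).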
Problem (a) is much harder. You can get a fair way by requiring the sequence
of IVXL to have non-letters before and after it. But you'll still be
matching the word "ILL" as a roman numeral when it clearly isn't. Like all
up-conversion tasks, though, it's very much up to you how much time you want
to spend fine-tuning the patterns and rules that you define.
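One way to sketch the "non-letters before and after" rule, again assuming
XSLT 2.0, is to match whole words and then test each word separately. The
XPath regex dialect has no \b or lookaround, so this is an illustration
rather than the only way to do it. The character class [A-Za-z] is an
assumption that your headings contain only unaccented letters:

```xml
<xsl:template match="head">
  <!-- Match maximal runs of letters, so every match is a whole word with
       non-letters (or the start/end of the string) on both sides. -->
  <xsl:analyze-string select="." regex="[A-Za-z]+">
    <xsl:matching-substring>
      <xsl:choose>
        <!-- A word consisting only of I, V, X, L is treated as a numeral. -->
        <xsl:when test="matches(., '^[IVXL]+$')">
          <xsl:value-of select="."/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="lower-case(.)"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <!-- Spaces, punctuation, digits: copied across unchanged. -->
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>
```

This leaves APPENDIX, CALVIN and ILLUSTRATION alone, because the test is
applied to the whole word, but a standalone "ILL" (or the pronoun "I") is
still kept as a numeral. If that matters, the test could be tightened to
something like matches(., '^(XL|L?X{0,3})(IX|IV|V?I{0,3})$'), which accepts
only well-formed numerals up to LXXXIX and so rejects ILL. The braces are
safe there because test is not an attribute value template; in the regex
attribute of analyze-string they would have to be doubled.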
Michael Kay
http://www.saxonica.com/
> -----Original Message-----
> From: Tony Zanella [mailto:tony.zanella@xxxxxxxxx]
> Sent: 09 October 2008 20:18
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: Using analyze-string to catch roman numerals?
>
> Hello all,
>
> Given the following input:
>
> <root>
> <head>CHAPTER II. THE WRECKED FOUNDATIONS OF DOMESTICITY</head>
> <head>PROBLEMA. HELOISE XXIX.</head>
> <head>Selected Letters</head>
> <head>The Second Part of Henry IV.</head>
> <head>VIII</head>
> <head>APPENDIX VII</head>
> <head>Appendix VII</head>
> <head>APPENDIX</head>
> <head>CALVIN XVII</head>
> <head>ILLUSTRATION</head>
> </root>
>
> and the following template:
>
> <xsl:template match="head">
> <xsl:choose>
> <xsl:when test="not(matches(.,'^(.*?)([IVXL]+)(.*?)$'))">
> <xsl:value-of select="lower-case(.)"/>
> </xsl:when>
> <xsl:when test="matches(.,'^(.*?)([IVXL]+)(.*?)$')">
> <xsl:analyze-string select="."
> regex="^(.*?)([IVXL]+)(.*?)$">
> <xsl:matching-substring>
> <xsl:value-of
> select="lower-case(regex-group(1))"/>
> <xsl:value-of
> select="upper-case(regex-group(2))"/>
> <xsl:value-of
> select="lower-case(regex-group(3))"/>
> </xsl:matching-substring>
> </xsl:analyze-string>
> </xsl:when>
> <xsl:otherwise/>
> </xsl:choose>
> </xsl:template>
>
> I'm trying to use analyze-string to do the following:
> Test for a roman numeral. If there isn't one, lower-case(.).
> If there is one, break (.) into its roman numeral and
> non-roman numeral parts, lower-case()ing the latter.
>
> The output I get is:
>
> chapter II. the wrecked foundations of domesticity
> probLema. heloise xxix.
> selected Letters
> the second part of henry IV.
> VIII
> appendIX vii
> appendix VII
> appendIX
> caLVIn xvii
> ILLustration
>
> When what I want is this:
>
> chapter II. the wrecked foundations of domesticity
> problema. heloise XXIX.
> selected letters
> the second part of henry IV.
> VIII
> appendix VII
> appendix VII
> appendix
> calvin XVII
> illustration
>
> Between my relative inexperience with both regexes and XSLT,
> thanks for any help!
> Tony