Subject: RE: Regular expression functions (Was: Re: comments on December F&O draft)
From: "Steven Noels" <stevenn@xxxxxxxxxxxxxxxx>
Date: Wed, 9 Jan 2002 22:04:15 +0100
> -----Original Message-----
> From: owner-xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> [mailto:owner-xsl-list@xxxxxxxxxxxxxxxxxxxxxx]On Behalf Of Michael Kay
> Sent: woensdag 9 januari 2002 12:40
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: RE: Regular expression functions (Was: Re:  comments on
> December F&O draft)

> I'm interested in your exploration of the use-cases for regexp matching and
> possible XSLT constructs to support those use cases, though so far I've had
> difficulty following the "make-it-up-as-you-go-along" style of specification!
>
> Mike Kay

We are currently working on a small tool (packaged as a Cocoon
generator, an Ant task and a CLI app) that is more or less
OmniMark-like: it lets you 'up-translate' a non-XML document
(HTML, delimited ASCII, ...) into an XML document.

We baptised it Regexslt since it borrows (a little bit) from the XSLT
language design.

It is based on the Jakarta ORO regex library.

Given the input document (which can be a URL)
http://www.bloomberg.com/bbn/technology.html and this regexslt
specification:

<?xml version="1.0" encoding="UTF-8"?>
<regexslt xmlns="http://outerx.org/ns/regexslt/transform/1.0">
  <element name="feed">
    <element name="title">
      <text>Bloomberg &gt; Technology</text>
    </element>
    <element name="url">
      <text>http://www.bloomberg.com/bbn/technology.html</text>
    </element>
    <call-matcher name="feeddate"/>
    <call-matcher name="items"/>
  </element>
  <matcher
regex="CLASS=&quot;story3&quot;&gt;([^&lt;]+)&lt;BR&gt;&lt;/SPAN&gt;&lt;/FONT&gt;&lt;/STRONG&gt;&lt;FONT\sCOLOR=&quot;#333333&quot;\sFACE=&quot;sans-serif,\sarial&quot;&gt;&lt;SPAN\sCLASS=&quot;story&quot;&gt;([^&lt;]+)&amp;nbsp;(.+)&lt;A\sHREF=&quot;([^&quot;]+)&quot;&gt;More"
name="items">
    <element name="item">
      <element name="blurb">
        <value-of select-group="1"/>
      </element>
      <element name="body">
        <value-of select-group="2"/>
      </element>
      <element name="url">
        <value-of select-group="4"/>
      </element>
    </element>
  </matcher>
  <matcher
regex="&lt;SPAN\sCLASS=&quot;date&quot;&gt;([^&lt;]+)&lt;/SPAN&gt;"
name="feeddate">
    <element name="date">
      <value-of select-group="1"/>
    </element>
  </matcher>
</regexslt>

the tool produces the following output:

<?xml version="1.0" encoding="UTF-8"?>
<feed>
  <title>Bloomberg &gt; Technology</title>
  <url>http://www.bloomberg.com/bbn/technology.html</url>
  <date>Wed, 09 Jan 2002, 3:48pm EST</date>
  <item>
    <blurb>Oracle, BEA, Software Stocks Surge After SAP Says 2001 Sales
Beat Forecast</blurb>
    <body>The shares of	Oracle Corp., BEA Systems Inc. and other
software companies surged	after SAP AG, the largest maker of
business-management programs,	said it surpassed a lowered 2001 sales
forecast.</body>

    <url>http://quote.bloomberg.com/fgcgi.cgi?ptitle=Technology%20News&amp;s1=blk&amp;tp=ad_topright_tech&amp;T=markets_bfgcgi_content99.ht&amp;s2=ad_right1_technology&amp;bt=ad_position1_technology&amp;middle=ad_frame2_technology&amp;s=APDyfihUCT3JhY2xl</url>
  </item>
[...]
</feed>

One thing that does not work well at present is specifying the regex as
an attribute of the <matcher> element: as the example shows, every quote
and angle bracket in the pattern has to be escaped as an entity. We plan
to avoid this by putting the regex inside a CDATA section of a <regex>
subelement (this will be optional; we are testing it right now). I am not
sure whether this is good practice, so advice is welcome. It is only
partially related to this discussion, of course.
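
As a rough sketch of what that could look like, here is the feeddate
matcher from the specification above with its pattern moved into a CDATA
section. The <regex> child element is the one mentioned in the previous
paragraph; its position as the first child of <matcher> and the omission
of the regex attribute are assumptions made for illustration, not the
syntax of any released version.

```xml
<matcher name="feeddate">
  <!-- Hypothetical form: the pattern lives in a child element, not an attribute. -->
  <!-- Inside CDATA the quotes and angle brackets need no entity escaping. -->
  <regex><![CDATA[<SPAN\sCLASS="date">([^<]+)</SPAN>]]></regex>
  <element name="date">
    <value-of select-group="1"/>
  </element>
</matcher>
```

The pattern is the same one as in the attribute form, only written
literally. The one sequence that still needs care is ]]>, which ends the
CDATA section and so cannot appear inside the pattern as-is.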

We plan to release regexslt "when it's ready" (weeks, not months)
under a liberal license (ASF). People who are willing to play around
with it can contact me. There is also an XML Schema for the language (we
found validation of the transformation sheet very important).

But we would appreciate criticism and suggestions from the people on
this thread even more :-)

Pointers to other regex libraries that come closer to Perl regexes
would be welcome, too.

Regards,

Steven Noels
http://outerthought.org/
(+32)478 292900


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

