

Subject: How to parse text into words, phrases, clauses, sentences, and paragraphs
From: mark bordelon <markcbordelon@xxxxxxxxx>
Date: Wed, 6 Jun 2007 14:52:23 -0700 (PDT)
Hey XML gurus,

I am still somewhat new to XML/XSL and need some help
getting started with using regular expressions and
tokenizing to transform English text into an XML
document marked up for:

1. words (delimited by whitespace, excluding any external
   punctuation, but allowing internal punctuation)
2. phrases (delimited by the comma)
3. clauses (delimited by colon or semicolon)
4. sentences (delimited by the period, question mark,
   or exclamation mark)
5. paragraphs (delimited by a line break)
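
To make these delimiter rules concrete, here is a minimal sketch in Python
rather than XSL. The function names are made up for illustration, and the
code assumes each delimiter stays attached to the unit it closes. The same
character classes should carry over to regex support such as XSLT 2.0's
tokenize() and xsl:analyze-string, but that part is untested here.

```python
import re
import string


def split_after(text, delimiters):
    """Split text after each run of delimiter characters.

    The delimiter stays attached to the unit it closes (an assumption).
    """
    cls = re.escape(delimiters)
    # One unit = a run of non-delimiters plus any delimiters that follow it.
    pattern = "[^{0}]+[{0}]*".format(cls)
    pieces = (piece.strip() for piece in re.findall(pattern, text))
    return [piece for piece in pieces if piece]


def paragraphs(text):
    # Rule as stated in the list: a line break ends a paragraph.
    return [line.strip() for line in text.splitlines() if line.strip()]


def sentences(text):
    return split_after(text, ".?!")


def clauses(text):
    return split_after(text, ":;")


def phrases(text):
    return split_after(text, ",")


def words(text):
    # Split on whitespace, then trim punctuation from the edges only,
    # so internal punctuation (unravish'd, foster-child) survives.
    stripped = (token.strip(string.punctuation) for token in text.split())
    return [word for word in stripped if word]


print(sentences("What men or gods are these? What maidens loth?"))
# ['What men or gods are these?', 'What maidens loth?']
print(words("THOU still unravish'd bride of quietness,"))
# ['THOU', 'still', "unravish'd", 'bride', 'of', 'quietness']
```

A purely character-based split like this will also break on abbreviations
such as "Mr." or on decimal numbers, so it is only a starting point.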

Ideally, every tag would also get a sequential id,
either numbered consecutively from beginning to end
or restarting from 1 at every level of nesting.
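
The difference between the two numbering styles can be sketched as follows,
again in Python rather than XSL. The tree representation (a list of
(tag, children) tuples) and the name assign_ids are hypothetical and only
serve the illustration.

```python
from collections import Counter


def assign_ids(nodes, continuous=False, running=None):
    """Number a nested list of (tag, children) tuples.

    continuous=False: ids restart at 1 inside every parent.
    continuous=True:  one running counter per tag name for the whole text.
    Returns a nested list of (tag, id, children) tuples.
    """
    if running is None:
        running = Counter()
    local = Counter()  # counters that only live inside this parent
    numbered = []
    for tag, children in nodes:
        counter = running if continuous else local
        counter[tag] += 1
        node_id = counter[tag]  # id is fixed before descending (document order)
        numbered.append((tag, node_id, assign_ids(children, continuous, running)))
    return numbered


# One sentence with two clauses holding two phrases and one phrase.
tree = [("sent", [
    ("clause", [("phrase", []), ("phrase", [])]),
    ("clause", [("phrase", [])]),
])]

restarting = assign_ids(tree)
running_ids = assign_ids(tree, continuous=True)

# The only phrase of the second clause:
print(restarting[0][2][1][2][0])   # ('phrase', 1, [])
print(running_ids[0][2][1][2][0])  # ('phrase', 3, [])
```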

In more concrete terms,

To transform this text:

THOU still unravish'd bride of quietness,
 Thou foster-child of Silence and slow Time,
Sylvan historian, who canst thus express
 A flowery tale more sweetly than our rhyme:
What leaf-fringed legend haunts about thy shape
 Of deities or mortals, or of both,
 In Tempe or the dales of Arcady?
 What men or gods are these? What maidens loth?
What mad pursuit? What struggle to escape?
 What pipes and timbrels? What wild ecstasy?

into this XML: (using indexing that renumbers for each
sub-group)

<para id="1">
 <sent id="1">
  <clause id="1">
   <phrase id="1">THOU still unravish'd bride of
quietness,</phrase>
   <phrase id="2">Thou foster-child of Silence and slow
Time,</phrase>
   <phrase id="3">Sylvan historian,</phrase>
   <phrase id="4"> who canst thus express A flowery tale
more sweetly than our rhyme</phrase>:
  </clause>
  <clause id="2">
   <phrase id="1">What leaf-fringed legend haunts about thy shape Of
deities or mortals,</phrase>
   <phrase id="2"> or of both,</phrase>
   <phrase id="3"> In Tempe or the dales of Arcady?</phrase>
  </clause>
 </sent>
 <sent id="2">What men or gods are these?</sent>
 <sent id="3">What maidens loth?</sent>
 <sent id="4">What mad pursuit?</sent>
 <sent id="5">What struggle to escape?</sent>
 <sent id="6">What pipes and timbrels?</sent>
 <sent id="7">What wild ecstasy?</sent>
</para>


or into this XML: (using indexing that is continuous
per tag)

<para id="1">
 <sent id="1">
  <clause id="1">
   <phrase id="1">THOU still unravish'd bride of
quietness,</phrase>
   <phrase id="2">Thou foster-child of Silence and slow
Time,</phrase>
   <phrase id="3">Sylvan historian,</phrase>
   <phrase id="4"> who canst thus express A flowery tale
more sweetly than our rhyme</phrase>:
  </clause>
  <clause id="2">
   <phrase id="5">What leaf-fringed legend haunts about thy shape Of
deities or mortals,</phrase>
   <phrase id="6"> or of both,</phrase>
   <phrase id="7"> In Tempe or the dales of Arcady?</phrase>
  </clause>
 </sent>
 <sent id="2">What men or gods are these?</sent>
 <sent id="3">What maidens loth?</sent>
 <sent id="4">What mad pursuit?</sent>
 <sent id="5">What struggle to escape?</sent>
 <sent id="6">What pipes and timbrels?</sent>
 <sent id="7">What wild ecstasy?</sent>
</para>
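
For comparison, a rough Python sketch (not XSL) that produces nested markup
of this general shape with either id style. It makes several assumptions
that differ from the samples above: paragraphs are separated by a blank
line (so a stanza stays in one para), every sentence always gets clause and
phrase children, delimiters stay inside the unit they close, and the
wrapper element "text" is invented so the result has a single root.

```python
import re
import xml.etree.ElementTree as ET

# Nesting order below <para>, each level with its closing delimiters.
LEVELS = [("sent", ".?!"), ("clause", ":;"), ("phrase", ",")]


def split_after(text, delimiters):
    """Split after each run of delimiters, keeping them with their unit."""
    cls = re.escape(delimiters)
    pieces = re.findall("[^{0}]+[{0}]*".format(cls), text)
    return [piece.strip() for piece in pieces if piece.strip()]


def build(parent, text, levels, counters, continuous):
    """Recursively add one element per unit of the current level."""
    tag, delimiters = levels[0]
    local = 0  # used when ids restart inside every parent
    for piece in split_after(text, delimiters):
        if continuous:
            counters[tag] = counters.get(tag, 0) + 1
            node_id = counters[tag]
        else:
            local += 1
            node_id = local
        element = ET.SubElement(parent, tag, id=str(node_id))
        if len(levels) > 1:
            build(element, piece, levels[1:], counters, continuous)
        else:
            element.text = piece  # innermost level carries the text


def markup(text, continuous=False):
    root = ET.Element("text")  # hypothetical wrapper element
    counters = {}
    # Assumption: a blank line separates paragraphs.
    blocks = [block for block in re.split(r"\n\s*\n", text) if block.strip()]
    for number, block in enumerate(blocks, start=1):
        para = ET.SubElement(root, "para", id=str(number))
        # Collapse line breaks and runs of spaces before splitting.
        build(para, " ".join(block.split()), LEVELS, counters, continuous)
    return root


sample = "One, two; three. Four, five."
print(ET.tostring(markup(sample), encoding="unicode"))
print(ET.tostring(markup(sample, continuous=True), encoding="unicode"))
```

Passing continuous=True keeps one counter per tag name for the whole text,
while the default restarts the count inside each parent element.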

Surely this has been done before. I have searched
through the archives and have not found anything, probably
because I am searching with the wrong terminology.

Would really appreciate the help as it would give me
insight into using regular expressions and sequencing
in XSL.

Thanks in advance

Mark Bordelon



 