
cleaning up ill-structured html

Subject: cleaning up ill-structured html
From: Ole Sandum <osandum@xxxxxxxxxxx>
Date: Thu, 23 Jan 2003 21:54:43 +0100
I have the task of migrating a number of legacy html
pages that were authored without regard to proper
structuring. Body text paragraphs are delimited by any
combination of <p> (sometimes nested!) and runs of
<br>. I would like the result to consist of a flat list
of non-empty <p>'s.

I use JTidy to get into proper XML, but still face the
challenge of flattening the nested <p>'s and converting
runs of <br>'s to <p>'s.

Example:

   <p>Some <i>stuff</i>
   that should be cleaned.<br/>
   More <b>stuff.</b>
   <p>
   Yet more.<br/>
   </p>
   Stuff.
   </p>

Should become:

   <p>Some <i>stuff</i> that should be cleaned.</p>
   <p>More <b>stuff.</b></p>
   <p>Yet more.</p>
   <p>Stuff.</p>

I assume it is easiest to do in two steps, first (step
1) convert into something like this:

   <break/>
   Some <i>stuff</i> that should be cleaned.
   <break/>
   More <b>stuff.</b>
   <break/>
   Yet more.
   <break/>
   <break/>
   Stuff.
   <break/>

and then (step 2) detect contiguous runs of
non-<break/> nodes and wrap each run in <p></p>.

Do I make sense?

I can do step 1, but step 2 gives me trouble. To
formalise: how do I convert a structure like

<break/>+ { other+ <break/>+ }*

into

{ <p> other+ </p> }*
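
To make the intended grouping concrete, here is a minimal sketch of step 2 written in Python with the standard library rather than in XSLT, so it illustrates the logic only and is not a stylesheet solution. The function name `wrap_runs`, its `break_tag` parameter, and the `<body>` wrapper element are assumptions made up for this example; it also assumes step 1 has already produced a flat sequence of <break/> markers, text, and inline elements.

```python
import copy
import xml.etree.ElementTree as ET


def wrap_runs(container, break_tag="break"):
    """Return one <p> element per non-empty run of nodes between breaks.

    Hypothetical helper: `container` holds the flat output of step 1.
    """
    paragraphs = []
    current = ET.Element("p")

    def add_text(text):
        # Text goes after the last child of the open run, or becomes its leading text.
        if not text:
            return
        if len(current):
            last = current[-1]
            last.tail = (last.tail or "") + text
        else:
            current.text = (current.text or "") + text

    def flush():
        # Close the open run; keep it only if it holds an element or non-blank text.
        nonlocal current
        if len(current):
            current.text = (current.text or "").lstrip() or None
            current[-1].tail = (current[-1].tail or "").rstrip() or None
            paragraphs.append(current)
        else:
            text = (current.text or "").strip()
            if text:
                current.text = text
                paragraphs.append(current)
        current = ET.Element("p")

    add_text(container.text)
    for child in container:
        if child.tag == break_tag:
            flush()  # a break ends the current run; repeated breaks yield nothing
        else:
            node = copy.deepcopy(child)
            node.tail = None  # the tail is re-attached by add_text below
            current.append(node)
        add_text(child.tail)
    flush()
    return paragraphs


if __name__ == "__main__":
    flat = ET.fromstring(
        "<body><break/>Some <i>stuff</i> that should be cleaned."
        "<break/>More <b>stuff.</b><break/>Yet more."
        "<break/><break/>Stuff.<break/></body>"
    )
    for p in wrap_runs(flat):
        print(ET.tostring(p, encoding="unicode"))
```

The sketch treats every <break/> as a separator and discards empty runs, which is why the doubled <break/> before "Stuff." does not produce an empty <p>.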

I fear the solution is really simple. Any ideas?

Thanks,
Ole Sandum, osandum@xxxxxxxxxxx




XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list


