
cleaning up ill-structured html

Subject: cleaning up ill-structured html
From: Jim_Albright@xxxxxxxxxxxx
Date: Fri, 24 Jan 2003 13:41:10 -0500
With this input:

<p>Some <i>stuff</i>
that should be cleaned.<br/>
More <b>stuff.</b>
<p>
Yet more.<br>
</p>
Stuff.
</p>

I get this XML output, which you can then clean up with XSLT:

<sample>
<p>Some <emphasis>stuff</emphasis> that should be cleaned.</p>
<paragraph>More <strong>stuff.</strong></paragraph>
<p>Yet more.</p>
<paragraph>Stuff.</paragraph>
</sample>
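
For anyone who wants to see what that second cleanup pass amounts to, here is a minimal sketch in Python rather than XSLT (a stand-in, not the stylesheet I would actually use). It assumes the only work left is renaming paragraph, emphasis and strong back to p, i and b; the RENAME table and the normalize function are names made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from the C2X element names back to HTML names.
RENAME = {"paragraph": "p", "emphasis": "i", "strong": "b"}

SAMPLE = """<sample>
<p>Some <emphasis>stuff</emphasis> that should be cleaned.</p>
<paragraph>More <strong>stuff.</strong></paragraph>
<p>Yet more.</p>
<paragraph>Stuff.</paragraph>
</sample>"""


def normalize(xml_text):
    """Rename every element listed in RENAME; leave text and other tags alone."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        elem.tag = RENAME.get(elem.tag, elem.tag)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(normalize(SAMPLE))
```

In XSLT the same step would be an identity transform plus one renaming template per element.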

Using this XML control file:

<?xml version="1.0" encoding="ISO-8859-1" ?>
<!DOCTYPE convert2xml SYSTEM "c:\d\xml\convert2xml.dtd" >

<!--

file:       HTML-cleanup.ctl
Purpose:    Control file for c2x program
Author:     jaa
Date:       20020124

            Clean up dirty HTML and make it into good XML
-->


<convert2xml>
<root-element name="sample">
      </root-element>
<recognize-element name="paragraph">
      <start-token>
            <pattern>\pp</pattern>
            <before>&#xa;</before>
      </start-token>
      <end-token>
            <pattern>&#xa;&lt;/p></pattern>
      </end-token>
      <allowed-child ref="emphasis"/>
      <allowed-child ref="strong"/>
</recognize-element>

<recognize-element name="p">
      <start-token>
            <pattern>&lt;p>&#xa;</pattern>
            <before>&#xa;</before>
      </start-token>
      <start-token>
            <pattern>&lt;p></pattern>
            <before>&#xa;</before>
      </start-token>
      <end-token>
            <pattern>&lt;/p></pattern>
      </end-token>
      <end-token>
            <pattern>&lt;b>&#xa;&lt;/p></pattern>
      </end-token>
      <end-token>
            <pattern>&lt;br/>&#xa;</pattern>
            <parsed-after>\pp</parsed-after>
      </end-token>
      <end-token>
            <pattern>&lt;br/>&#xa;&lt;/p></pattern>
            <parsed-after>\pp</parsed-after>
      </end-token>
      <end-token>
            <pattern>&lt;br>&#xa;&lt;/p>&#xa;</pattern>
            <parsed-after>\pp</parsed-after>
      </end-token>
      <end-token>
            <pattern>&lt;br/></pattern>
            <parsed-after>\pp</parsed-after>
      </end-token>
      <end-token>
            <pattern>&lt;br></pattern>
      </end-token>
      <end-token>
            <pattern>&#xa;&lt;/p></pattern>
      </end-token>

      <allowed-child ref="emphasis"/>
      <allowed-child ref="strong"/>
</recognize-element>

<recognize-element name="emphasis">
      <start-token>
            <pattern>&lt;i></pattern>
      </start-token>
      <end-token>
            <pattern>&lt;/i></pattern>
      </end-token>
      <end-token>
            <pattern>&lt;/i>&#xa;</pattern>
            <after> </after>
      </end-token>
</recognize-element>

<recognize-element name="strong">
      <start-token>
            <pattern>&lt;b></pattern>
      </start-token>
      <end-token>
            <pattern>&lt;/b></pattern>
      </end-token>
      <end-token>
            <pattern>&lt;/b>&#xa;</pattern>
      </end-token>
</recognize-element>

</convert2xml>

The control file is run by a free program called C2X (convert to XML).

Ask me off-list if you want more information, since C2X is off-topic here.

Date: Thu, 23 Jan 2003 21:54:43 +0100
From: Ole Sandum <osandum@xxxxxxxxxxx>
Subject:  cleaning up ill-structured html

Example:

    <p>Some <i>stuff</i>
    that should be cleaned.<br/>
    More <b>stuff.</b>
    <p>
    Yet more.<br>
    </p>
    Stuff.
    </p>

Should become:

    <p>Some <i>stuff</i> that should be cleaned.</p>
    <p>More <b>stuff.</b></p>
    <p>Yet more.</p>
    <p>Stuff.</p>
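
The requested transformation can be read as: treat every <p>, </p>, <br> and <br/> as a paragraph boundary, then re-wrap each non-empty run of text. A rough sketch of that reading in Python follows. It is an assumption about the intent, not a general HTML cleaner: it ignores attributes on <p>, and the names BREAK and clean are made up for this illustration.

```python
import re

# Any of <p>, </p>, <br>, <br/> counts as a paragraph boundary (assumption:
# no attributes on these tags, as in the example input).
BREAK = re.compile(r"</?p\s*>|<br\s*/?>", re.IGNORECASE)

RAW = """<p>Some <i>stuff</i>
that should be cleaned.<br/>
More <b>stuff.</b>
<p>
Yet more.<br>
</p>
Stuff.
</p>"""


def clean(html):
    """Split on paragraph boundaries and wrap each non-empty chunk in <p>."""
    paragraphs = []
    for chunk in BREAK.split(html):
        text = " ".join(chunk.split())  # collapse newlines and repeated spaces
        if text:
            paragraphs.append("<p>" + text + "</p>")
    return paragraphs


if __name__ == "__main__":
    print("\n".join(clean(RAW)))
```

Inline markup such as <i> and <b> passes through untouched, because only the paragraph-level tags are treated as separators.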





 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

