HTML Tidy

HTML Tidy is a program originally created by Dave Raggett for turning HTML into something that can be parsed as XML.

This capability is extremely useful as it allows XSLT and XQuery programs to fetch HTML pages and act upon them — expanding the realm of reachable documents to include the vast number of HTML pages available on the internet.

Within Stylus Studio, HTML Tidy is used in two places: the HTML-to-XSLT Wizard and the HTML Tidy Converter.

HTML to XSLT Wizard

HTML Tidy is used within the HTML-to-XSLT Wizard (File|Document Wizards|XSLT Editor|HTML to XSLT) to create a stub XSLT file that would generate the given HTML file, suitable for splicing in your own code for generating the details.

This way, if you have an HTML template, whether existing or custom-created, you can use it as the basis for a report or other output and put your transformation or reporting code right in the middle.

The HTML Tidy Converter

This is likely the far more interesting case. Using this converter, any piece of reachable HTML can be used as a source for XSLT, XQuery, or even more complex operations like the XML Publisher.

The trick is to prepend the converter scheme to the URL, like this:

converter:HTMLTidy?http://.....

Now, wherever we use this prefixed URL in place of the original, our process will see XML instead of HTML, thanks to the on-the-fly conversion. Let's put this into practice with some demonstrations.

HTML Tidy and XSLT

Suppose that you wanted to wrap a web query, such as the one at http://www.weather.com/, so that it acted like a web service for your application. This particular query tells you what the weather might be for the next few days near your location. It takes a URL like this and returns some HTML:

http://www.weather.com/weather/tenday/your zip code here

So, for the area where Stylus Studio's headquarters is, we'd issue

http://www.weather.com/weather/tenday/01730

(Note: because The Weather Channel website changes, as does the weather itself, a cached copy of just the HTML, without any images, is provided.)

But how would we automate it? To fetch the HTML and turn it into XML, we'll use the converter trick from above and set that as the input source to plain old-fashioned XSLT. So feeding

converter:HTMLTidy?http://www.weather.com/weather/tenday/01730

to

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:h="http://www.w3.org/1999/xhtml">
    <xsl:output method="html" encoding="UTF-8" indent="yes"/>

    <xsl:template match="/">
        <html><body><table border="1">
            <xsl:apply-templates select="/h:html/h:body/h:center[1]/h:div/h:table/h:tr/h:td/h:table/h:tr/h:td/h:div/h:table/h:tr/h:td/h:form/h:table/h:tr/h:td/h:table/h:tr"/>
        </table></body></html>
    </xsl:template>

    <xsl:template match="h:tr">
        <tr>
            <td><xsl:value-of select="h:td[1]/h:div/h:b/h:a/h:b/text()"/></td>
            <td><xsl:value-of select="h:td[1]/h:div/h:b/text()"/></td>
            <td><xsl:value-of select="h:td[2]/h:table/h:tr/h:td[2]/text()"/></td>
            <td><xsl:value-of select="concat(h:td[3]/h:b/text(), 'F')"/></td>
            <td><xsl:value-of select="concat(translate(h:td[3]/text(), '/', ''), 'F')"/></td>
            <td><xsl:value-of select="h:td[4]/text()"/></td>
        </tr>
    </xsl:template>
</xsl:stylesheet>

will yield

Today  Oct 10  Partly Cloudy   62°F  46°F  20%
Wed    Oct 11  Cloudy          60°F  53°F  20%
Thu    Oct 12  Rain / Thunder  66°F  47°F  70%
Fri    Oct 13  Showers         54°F  33°F  60%
Sat    Oct 14  Sunny           55°F  36°F  20%
Sun    Oct 15  Mostly Sunny    57°F  44°F  20%
Mon    Oct 16  Showers         58°F  45°F  40%
Tue    Oct 17  Showers         63°F  44°F  60%
Wed    Oct 18  Partly Cloudy   64°F  42°F  20%
Thu    Oct 19  Partly Cloudy   62°F  38°F  20%

Congratulations! You've just scraped existing Web content, using the HTML Tidy Converter to produce new HTML via XML! But you could do anything with the source data once it's in XML form. Download a copy of Stylus Studio® XML Enterprise Suite and try this with your own location, or investigate other web sites. Mine your own company's intranet for information that rightly belongs in other locations. The possibilities are only limited by the number of HTML pages on the internet!

HTML Tidy and XQuery

The equivalent program for your XQuery weather report using the HTML Tidy converter would be:

xquery version "1.0";
declare namespace h = "http://www.w3.org/1999/xhtml";
<html><body><table border="1">{
for $w in doc('converter:HTMLTidy:errors=no?http://www.weather.com/weather/tenday/01730')
    /h:html/h:body/h:center[1]/h:div/h:table/h:tr/h:td/h:table/h:tr
    /h:td/h:div/h:table/h:tr/h:td/h:form/h:table/h:tr/h:td/h:table/h:tr
return
    <tr>
        <td>{$w/h:td[1]/h:div/h:b/h:a/h:b/text()}</td>
        <td>{$w/h:td[1]/h:div/h:b/text()}</td>
        <td>{$w/h:td[2]/h:table/h:tr/h:td[2]/text()}</td>
        <td>{concat($w/h:td[3]/h:b/text(), 'F')}</td>
        <td>{concat(translate($w/h:td[3]/text(), '/', ''), 'F')}</td>
        <td>{$w/h:td[4]/text()}</td>
    </tr>
}</table></body></html>

What's notable here is that the converter URL can be embedded directly in the source program. It can also be passed in as context or as a parameter, giving you maximum flexibility. The output of this sample is identical to that of the XSLT example above.
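
As a rough sketch of the parameter approach, the query below takes the ZIP code as an external variable and builds the converter URL from it. The variable names $zip and $source are hypothetical and do not come from the original sample. How the external variable gets bound depends on the XQuery processor and is not shown here. Only two of the six columns are emitted to keep the sketch short.

```xquery
xquery version "1.0";
declare namespace h = "http://www.w3.org/1999/xhtml";

(: $zip is a hypothetical parameter name, not part of the original sample.
   The caller is assumed to bind it to a ZIP code such as "01730". :)
declare variable $zip as xs:string external;

(: Build the converter URL from the same pieces used in the sample above. :)
declare variable $source as xs:string :=
    concat('converter:HTMLTidy:errors=no?',
           'http://www.weather.com/weather/tenday/', $zip);

<html><body><table border="1">{
    for $w in doc($source)
        /h:html/h:body/h:center[1]/h:div/h:table/h:tr/h:td/h:table/h:tr
        /h:td/h:div/h:table/h:tr/h:td/h:form/h:table/h:tr/h:td/h:table/h:tr
    return
        <tr>
            <td>{$w/h:td[1]/h:div/h:b/h:a/h:b/text()}</td>
            <td>{$w/h:td[2]/h:table/h:tr/h:td[2]/text()}</td>
        </tr>
}</table></body></html>
```

Keeping the ZIP code outside the query means the same program can serve any location without editing the source. The element path is still tied to the page layout, so it would need updating whenever the site changes.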

To learn more about the advanced XSLT and XQuery tools, as well as the other options for deploying XML applications, see the various pages on this site or download and run a free evaluation copy of Stylus Studio® right now.


XML and Weather

The United States National Oceanic and Atmospheric Administration's National Weather Service (whew! that's a mouthful) has published schemas to promote the interchange of weather-related information. This set of files can be shown in expanded form under Digital Weather Markup Language.
 