Re: Converting poorly formed HTML into well-formed XML
> The HTML has been written by various web developers over a period of time,
> so it is very inconsistent in formatting, use of quotation marks in
> attributes, etc.

But, most of all, is the HTML correct, or conformant?

> Does XSLT have the facilities to directly read in the poorly formed HTML?
> And if so, what needs to be done.

Nope, unless it is valid XML (that would be XHTML).

> I've already begun developing the latter (custom) solution, but thought I'd
> double check to see if there are any HTML -> XHTML converters available.

Check out HTML Tidy, from the W3C (www.w3.org). It's a C application that
cleans up messy (and incorrect) HTML and has an option to generate XHTML.

The main problem with developing your own converter is that, unless you are
sure your HTML is correct (in which case you only need to fix case, quotes in
attributes and entities, and close the few HTML empty tags), you will go crazy
trying to cope with all the possible errors that the "official" web browsers
accept but that would kill any simple parser.

Anyway, I would be interested in knowing if there is any similar
application/package in Java. I would like to convert some pages (where I
pretty much know the format) into XHTML and from there output the content
in XML.

The only other package I found is in Perl (HTML::TreeBuilder). It has a smart
input parser, and the author explains how he had to add a lot of hardcoded
stuff to cover a lot of weird cases.

I wrote a few lines of Perl that read in an HTML file and output XHTML, if
anyone is interested.

-- Raffaele
-----------------------------------------------------
raff@xxxxxxxxxxxx
(::) http://www.aromatic.org/~raff/

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
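
The fixes the message lists for HTML that is already correct (lowercase names, quoted attributes, escaped entities, closed empty tags) can be sketched roughly as follows. This is only an illustration in Python using the standard library's html.parser. It is not the Perl script mentioned above and not how HTML Tidy works. The names to_xhtml, XhtmlWriter and EMPTY_TAGS are hypothetical. The sketch assumes every non-empty element is already explicitly closed. It does not insert omitted end tags such as </p> or </li>, and it drops comments and the DOCTYPE.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: the HTML 4 elements that have no content and must be
# written as <tag /> in XHTML.
EMPTY_TAGS = {"area", "base", "br", "col", "hr", "img",
              "input", "link", "meta", "param"}


class XhtmlWriter(HTMLParser):
    """Re-serialise correct-but-sloppy HTML with XHTML-style syntax."""

    def __init__(self):
        # convert_charrefs=True delivers decoded text; it is re-escaped below.
        super().__init__(convert_charrefs=True)
        self.out = []

    def _open(self, tag, attrs, empty):
        # HTMLParser has already lowercased tag and attribute names.
        parts = [tag]
        for name, value in attrs:
            # A valueless attribute such as CHECKED becomes checked="checked".
            value = name if value is None else value
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + (" />" if empty else ">"))

    def handle_starttag(self, tag, attrs):
        self._open(tag, attrs, tag in EMPTY_TAGS)

    def handle_startendtag(self, tag, attrs):
        self._open(tag, attrs, True)

    def handle_endtag(self, tag):
        # Empty elements were already closed when they were opened.
        if tag not in EMPTY_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def to_xhtml(html_text):
    """Return html_text re-serialised with XHTML syntax rules."""
    writer = XhtmlWriter()
    writer.feed(html_text)
    writer.close()  # flush any text still buffered by the parser
    return "".join(writer.out)


if __name__ == "__main__":
    print(to_xhtml('<P ALIGN=center>Fish &amp; chips<BR><IMG SRC=a.gif ALT="x"></P>'))
```

Anything beyond these syntactic fixes (unbalanced tags, misnested elements, missing end tags) is the "go crazy" case described above, and a dedicated tool such as HTML Tidy is the more realistic choice there.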