Re: Converting HTML to plain text
Radha,
This is (much) harder, in the general case, than it looks. This is due to the famous looseness of what is considered "HTML". (This laxity was once touted by HTML developers as a desirable feature, and probably did promote HTML's adoption in some respects.) HTML being more or less tag soup, saving it as plain text more or less means implementing a parser, a major part of a browser (XML parsing is comparatively trivial).

If you can constrain the "HTML" coming in to a controlled dialect of XML (using HTML tags if you like for browser friendliness), you can achieve this straightforwardly using stylesheets.

Alternatively, if you truly have to accept arbitrary "HTML", you can look at parsing technologies such as HTML tag soup parsers (see e.g. http://mercury.ccil.org/~cowan/XML/tagsoup/) that will emit XML SAX parsing events from HTML, or HTML DOM implementations that can write out XML from HTML, or an analogous tool; such a processor can be hooked into an XML pipeline.

When it comes to writing out nice plain text output with XSLT (which is a perfectly fine tool for the job), you may find multiple passes to be a good way to proceed in any case.

Generally, XSLT can't be used on arbitrary HTML. A poor man's solution is to use a tool like HTML Tidy to make XML for XSLT from the HTML, but I don't know if that could be adapted to your requirement for "a platform independent way" (IIRC it is compiled for different platforms).

But if in general HTML-to-formatted-plain-text were easy, I think we'd see lots more of it.

Cheers,
Wendell

At 03:15 PM 6/21/2004, you wrote:
> I am looking around for any tools to convert html to plain text in a
> platform independent way. I also need support for UTF-8 encoding as well
> as a well formatted output of nested tables. What is the best way to do
> this? Is XSL FO recommended for this? I looked around for any XSL to
> convert HTML to FO, but I did not find any.
___&&__&_&___&_&__&&&__&_&__&__&&____&&_&___&__&_&&_____&__&__&&_____&_&&_
"Thus I make my own use of the telegraph, without consulting the directors, like the sparrows, which I perceive use it extensively for a perch." -- Thoreau
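
As a rough illustration of the "lenient parser first, text output second" approach described in the message, here is a minimal sketch in Python. It is not the TagSoup or HTML Tidy route named above: it assumes Python's standard-library html.parser as a stand-in tolerant parser, and the names TextExtractor, html_to_text, BLOCK_TAGS and SKIP_TAGS are hypothetical, chosen only for this example. It puts block-level elements on their own lines and separates table cells with a space; it does not attempt well formatted output for nested tables, which is the hard part of the original question.

```python
from html.parser import HTMLParser

# Assumption: these tags start or end a line in the text output.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "table",
              "h1", "h2", "h3", "h4", "h5", "h6"}
# Assumption: the content of these tags is never shown as text.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects readable text from lenient ("tag soup") HTML."""

    def __init__(self):
        # convert_charrefs=True turns references such as &eacute; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")
        elif tag in ("td", "th"):
            # Keep adjacent table cells from running together.
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return plain text for an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    lines = "".join(parser.parts).splitlines()
    # Collapse runs of whitespace inside each line and drop empty lines.
    cleaned = (" ".join(line.split()) for line in lines)
    return "\n".join(line for line in cleaned if line)


if __name__ == "__main__":
    # Unclosed <p> and <b> tags are accepted without complaint.
    print(html_to_text("<p>Hello <b>world<p>caf&eacute; <br>next"))
```

The function works on Python strings, so UTF-8 input would be decoded first (for example with bytes.decode("utf-8")) and the result encoded again on output. For anything beyond simple pages, the multi-pass pipeline described in the message is likely to give better control over layout.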