Re: generating DOM from ill-formed HTML docs
7/14/2002 10:00:06 PM, Robert Mena <rt_mena@y...> wrote:

>Hi, I am developing an application that will have to
>build a DOM tree of html pages.
>
>I'll use such DOM trees to perform some
>analysis/comparisons.
>
>Since most of the time I'll find ill-formed documents
>I'd like to know if there are any parsers out there
>that "accept" this flaws and builds the tree anyway.
>
>I've tried domxml (php) with no luck.

The standard answer is to use tidy to convert to XHTML.
http://tidy.sourceforge.net/
and then parse it with an ordinary XML parser.

The possibly wacky answer is to use Javascript in a browser if at all possible. For better or worse (mostly worse!) the browser vendors have worked hard to "accept the flaws and build a tree anyway", and then expose that tree with the DOM API. You can essentially pretend the ill-formed HTML is XML and use the XML Core DOM to work with it.

You might need to use some server-side PHP or whatever to grab web pages, filter in the Javascript code, and feed it to the browser to work around the browser/Javascript "sandbox" limitations.
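
For what it's worth, the tidy-then-parse route might look roughly like the sketch below. It is written in Python rather than PHP purely for illustration, and it assumes a `tidy` executable is on the PATH. The helper names (`tidy_to_xhtml`, `parse_xhtml`, `html_to_dom`, `element_names`) are made up for this example, and `xml.dom.minidom` is just one "ordinary XML parser" among many.

```python
import subprocess
from xml.dom import minidom


def tidy_to_xhtml(html: str, tidy_cmd: str = "tidy") -> str:
    """Pipe messy HTML through the external tidy program and return XHTML.

    Assumes a tidy executable is installed; tidy_cmd is a hypothetical
    parameter for pointing at a different binary.
    """
    result = subprocess.run(
        [
            tidy_cmd,
            "-asxml",                  # emit well-formed XHTML
            "-numeric",                # numeric entities, so no DTD is needed
            "-utf8",                   # read and write UTF-8
            "-quiet",
            "--show-warnings", "no",
            "--force-output", "yes",   # still write output if errors were found
        ],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=False,                   # tidy exits non-zero on mere warnings
    )
    if not result.stdout.strip():
        raise RuntimeError(result.stderr.decode("utf-8", "replace"))
    return result.stdout.decode("utf-8")


def parse_xhtml(xhtml: str) -> minidom.Document:
    """Parse well-formed XHTML into a DOM tree with a plain XML parser."""
    return minidom.parseString(xhtml.encode("utf-8"))


def html_to_dom(html: str) -> minidom.Document:
    """Ill-formed HTML in, DOM tree out: tidy first, then the XML parser."""
    return parse_xhtml(tidy_to_xhtml(html))


def element_names(doc: minidom.Document) -> list:
    """List element names in document order, a crude basis for comparisons."""
    return [el.tagName for el in doc.getElementsByTagName("*")]
```

Calling `html_to_dom("<p>unclosed <b>bold")` should then give a DOM whose structure tidy has repaired, and `element_names` on two such trees is one simple starting point for the analysis/comparison step. Exactly which elements tidy inserts or closes depends on the tidy version and its configuration.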