RE: generating DOM from ill-formed HTML docs
> The standard answer is to use tidy to convert to XHTML.
> http://tidy.sourceforge.net/ and then parse it with an
> ordinary XML parser.
>

I wake up some nights dreaming that I'm working in a sweatshop writing HTML parsing code and they won't let me go to the bathroom until it's 100% (:-)

I've got two nightmare HTML parsing stories...

The first was back in '96 when we were writing a web browser from scratch. There was so much bad HTML out there already that the guy writing the parser basically had to completely violate all rules of *HTML* to make things come out the way browsers showed it. Both Netscape and IE allowed completely bad HTML to go through (but then again, most people already know that).

A couple of years ago, we tried to write a single-pass combo XML/HTML parser for a product we were working on. Again, it was a total *nightmare* with daily 'exception' reports. The engineer working on it wasn't too thrilled about having to rewrite the YACC grammar on a weekly basis--the W3C HTML specs were practically useless in real life. There were things being done at popular web sites (like AOL) that would set your teeth on edge. And visual editors like DreamWeaver weren't helping any. It became an exercise in futility.

After about six months of this, we finally threw our hands up in the air, ripped it all out, and went with tidy and Xerces. It still doesn't do a 100% job (tidy sometimes generates bad output, i.e. XHTML that doesn't look anything like the original). But it's better than anything else out there. Most other HTML parsing toolkits (including the ones in Java) just give up.

If somebody hasn't done so already, they should extract the Mozilla HTML parser/DOM-builder and graft it onto a standard XML parser... I know it's against XML rules, but it would have a lot of practical uses (like that of the original poster).

Ramin

---
Ramin Firoozye - Wizen Software
- multum in parvo -
---
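
For readers who want to try the tidy-then-XML-parser approach described above, here is a minimal sketch in Python. It is not the setup from the message (that one used Xerces): it assumes the `tidy` command-line tool is installed and on the PATH, and it swaps in Python's standard `xml.dom.minidom` as the "ordinary XML parser". The function names `tidy_to_xhtml` and `html_to_dom` are hypothetical, chosen for illustration only.

```python
import subprocess
from xml.dom import minidom


def tidy_to_xhtml(html, tidy_cmd="tidy"):
    """Pipe HTML through the external tidy tool and return XHTML text.

    Assumption: `tidy_cmd` names an installed HTML Tidy executable.
    """
    proc = subprocess.run(
        [
            tidy_cmd,
            "-asxhtml",   # emit XHTML instead of HTML
            "-numeric",   # numeric entities, so the XML parser needs no DTD
            "-quiet",     # suppress non-essential output
            "-utf8",      # read and write UTF-8
            "--show-warnings", "no",
        ],
        input=html.encode("utf-8"),
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # tidy exits with 0 on success, 1 on warnings, 2 on errors.
    if proc.returncode >= 2:
        raise RuntimeError("tidy failed: " + proc.stderr.decode("utf-8", "replace"))
    return proc.stdout.decode("utf-8")


def html_to_dom(html, cleaner=tidy_to_xhtml):
    """Clean up HTML with `cleaner`, then build a DOM with an XML parser."""
    xhtml = cleaner(html)
    # parseString raises an ExpatError if the cleaned text is still not
    # well-formed, which matches the caveat that tidy is not 100% reliable.
    return minidom.parseString(xhtml.encode("utf-8"))
```

A call such as `html_to_dom("<p>one<p>two")` would hand the unclosed paragraphs to tidy and return a DOM document; passing a different `cleaner` makes it easy to swap tidy for another repair step.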