XML-DEV Mailing List Archive

RE: generating DOM from ill-formed HTML docs


> The standard answer is to use tidy to convert to XHTML.
> http://tidy.sourceforge.net/ and then parse it with an
> ordinary XML parser.
>

I wake up some nights dreaming that I'm working in a sweatshop writing HTML
parsing code and they won't let me go to the bathroom until it's 100% (:-).
I've got two nightmare HTML parsing stories...

The first was back in '96 when we were writing a web browser from scratch.
There was so much bad HTML out there already that the guy writing the parser
basically had to violate all the rules of *HTML* to make pages come out the
way the other browsers rendered them. Both Netscape and IE let completely bad
HTML go through (but then again, most people already know that).

A couple of years ago, we tried to write a single-pass combo XML/HTML parser
for a product we were working on. Again, it was a total *nightmare* with
daily 'exception' reports. The engineer working on it wasn't too thrilled
about having to rewrite the YACC grammar on a weekly basis--the W3C HTML
specs were practically useless in real life. There were things being done at
popular web sites (like AOL) that would set your teeth on edge. And visual
editors like Dreamweaver weren't helping any. It became an exercise in
futility. After about six months of this, we finally threw our hands up in
the air, ripped it all out, and went with tidy and Xerces.

It still doesn't do a 100% job (tidy sometimes generates bad output, i.e.
XHTML that doesn't look anything like the original). But it's better than
anything else out there. Most other HTML parsing toolkits (including the
ones in Java) just give up.
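
For anyone who wants to try the tidy-then-XML-parser route, here is a minimal
sketch in Python. It is not our actual code: Python's xml.dom.minidom stands in
for Xerces, the function names are made up for illustration, and it assumes a
tidy binary is on the PATH. The tidy options shown are my choice, not a
recommended set.

```python
import subprocess
from xml.dom import minidom

# Assumed tidy options: emit well-formed XHTML, write numeric entities so the
# XML parser does not need the XHTML DTD, and still produce output on errors.
TIDY_CMD = ["tidy", "-asxml", "-numeric", "-quiet", "-utf8",
            "--show-warnings", "no", "--force-output", "yes"]


def tidy_to_xhtml(html: str) -> bytes:
    """Pipe HTML through the external tidy binary and return XHTML bytes."""
    result = subprocess.run(TIDY_CMD, input=html.encode("utf-8"),
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    # tidy exits with 0 (clean), 1 (warnings) or 2 (errors).
    if result.returncode > 1 and not result.stdout:
        raise RuntimeError(result.stderr.decode("utf-8", "replace"))
    return result.stdout


def parse_xhtml(xhtml):
    """Parse XHTML with a strict XML parser; raises ExpatError if ill-formed."""
    return minidom.parseString(xhtml)


def html_to_dom(html: str):
    """Full pipeline (hypothetical helper): tidy first, then the XML parser."""
    return parse_xhtml(tidy_to_xhtml(html))
```

With tidy installed, html_to_dom("<p>unclosed <b>bold<p>next") should hand back
a DOM document, while passing the same string straight to parse_xhtml fails,
which is the whole point of running tidy first. As noted above, the cleaned-up
tree is not guaranteed to resemble the original page.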

If somebody hasn't done so already, they should extract the Mozilla HTML
parser/DOM-builder and graft it onto a standard XML parser... I know it's
against XML rules, but it would have a lot of practical uses (like the
original poster's).

Ramin
---
Ramin Firoozye - Wizen Software
- multum in parvo -
---

