
Re: HTML parser


In Java you have JTidy - http://lempinen.net/sami/jtidy/ or 
http://sourceforge.net/projects/jtidy/
It builds its own W3C DOM tree, but you can traverse that tree to 
generate SAX events, or build a new Xerces or JDOM tree from the SAX 
events. But Tidy doesn't handle duplicate attributes, among other things.
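
The DOM-to-SAX step can be sketched with nothing but the JDK: an 
identity transform walks a W3C DOM tree and replays it as SAX events. 
This is only a sketch under assumptions. The class and method names 
(DomToSax, elementNames, parse) are made up for illustration, and the 
Document here comes from the JDK's own parser as a stand-in for the 
tree JTidy would build; I have not checked that JTidy's DOM 
implementation supports everything the transformer calls.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.sax.SAXResult;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of JTidy or Xerces.
class DomToSax {

    /** Replays a DOM tree as SAX events and collects element names in document order. */
    static List<String> elementNames(Document doc) throws Exception {
        final List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                names.add(qName);
            }
        };
        // A Transformer created without a stylesheet is an identity transform:
        // it walks the DOMSource and fires the matching SAX callbacks.
        TransformerFactory.newInstance().newTransformer()
                .transform(new DOMSource(doc), new SAXResult(handler));
        return names;
    }

    /** Stand-in for the tidy step: builds a DOM from markup that is already well-formed. */
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><body><p>Hello<br/>world</p></body></html>");
        System.out.println(elementNames(doc));
    }
}
```

Any ContentHandler can take the place of the collecting handler, for 
example a builder that constructs a JDOM or Xerces tree from the events.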

In C you have Tidy for all major platforms, and it is very fast. GUIs 
exist. It can be found here - http://tidy.sourceforge.net/

In Java, Andy Clark, an IBM Xerces programmer, has made a "preview" of 
an HTML parser using the new Xerces XNI. He posted the source code to 
the Xerces mailing list. Andy Clark is a parser pro, so he knows what he 
is doing.

I'm also working on an HTML parser, but it is too early to talk about. 
Parsing HTML documents is often about capturing information from a page, 
and I find myself using XSLT, XMLFilters etc. to extract data. It is 
powerful but not very simple.

HTML parsing is not always about well-formedness, but about extracting 
information using regular expressions or simple text patterns. Then you 
have XPath and XSL, which require a well-formed (X)HTML document to 
work, and that requires building DOM trees, which is a memory and speed 
problem. Much more could be said on this...
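
To make the contrast concrete, here is a rough sketch that pulls link 
targets out of a page both ways. The names (LinkGrabber, hrefsByRegex, 
hrefsByXPath) and the sample markup are invented for illustration, and 
the regular expression is deliberately simple: it only covers quoted 
href values and is not a general HTML tokenizer. The XPath route uses 
the JDK's javax.xml.xpath API and needs well-formed input.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper class for illustration only.
class LinkGrabber {

    // Matches a quoted href value inside an <a ...> start tag, ignoring case.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Text-pattern route: works on tag soup, no tree is built. */
    static List<String> hrefsByRegex(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            out.add(m.group(2));
        }
        return out;
    }

    /** XPath route: the input must be well-formed, and a full DOM tree is built first. */
    static List<String> hrefsByXPath(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//a/@href", doc, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getNodeValue());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        String soup = "<P>See <A HREF=\"http://tidy.sourceforge.net/\">Tidy</A><br>"
                + "and <a class=x href='/jtidy'>JTidy</a>";
        String xhtml = "<p>See <a href=\"http://tidy.sourceforge.net/\">Tidy</a><br/>"
                + "and <a class=\"x\" href=\"/jtidy\">JTidy</a></p>";

        System.out.println(hrefsByRegex(soup));
        System.out.println(hrefsByXPath(xhtml));
        try {
            hrefsByXPath(soup);
        } catch (SAXException e) {
            // The XML parser rejects the unclosed and unquoted markup.
            System.out.println("XPath route failed on tag soup: " + e.getMessage());
        }
    }
}
```

The regex version never builds a tree, so it stays cheap on large 
pages, but it knows nothing about document structure. The XPath version 
is more precise but only works once the page has been cleaned up into 
well-formed markup.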

Digital (now Compaq) tried to make a "web language" that you can use to 
fetch pages from the web and extract data. Take a look at it at - 
http://www.research.compaq.com/SRC/WebL/. There are problems with Java 
1.3; you need to make some small changes to the source code (I'm running 
it on Java 1.3 on Mac OS X).

Niels Peter


On Monday, March 4, 2002, at 06:24 PM, Alexey N. Shananin wrote:

 >Hi!
 >I'm looking for a parser for HTML.
 >I know that XML parsers can't correctly handle HTML tags because
 >these tags might be unclosed (I mean the <br> tag, but not <br/> or for
 >example...).
 >I heard about the XHTML standard. It's supported by XML parsers, as far
 >as I

