I forgot one (two) more HTML parsers:

You have Anders Kristensen's HEX -
http://www-uk.hpl.hp.com/people/sth/java/hex.html. It is quite old. It
uses SAX 1 to build a DOM tree. I updated it to SAX2 and the Xerces
DOM, but then started on my own project. HEX enforces well-formedness
while it builds the DOM tree. I moved that logic into an XMLFilter,
which lets you enforce well-formedness on the SAX stream instead.
Anders is no longer at HP, he works for another company, so the mail
address won't work!

IBM has a "system" called ANDES, which is (or was, I don't know) used
to parse HTML pages (and a lot of other interesting things). From the
papers I read it sounded just like the tool I wanted, but I could not
find any information on IBM's sites. Does anyone have any info about
ANDES?

Niels Peter

On Monday, March 4, 2002, at 08:12 PM, Niels Peter Strandberg wrote:

> In Java you have JTidy - http://lempinen.net/sami/jtidy/ or
> http://sourceforge.net/projects/jtidy/
> It builds its own W3C DOM tree. You can traverse the tree to generate
> SAX events, or build a new Xerces or JDOM tree from the SAX events.
> But Tidy doesn't handle duplicate attributes, among other things.
>
> In C you have Tidy for all major platforms, and it is very fast. GUIs
> exist. It can be found here - http://tidy.sourceforge.net/
>
> In Java, Andy Clark, a Xerces programmer at IBM, has made a "preview"
> of an HTML parser using the new Xerces XNI. He posted the source code
> to the Xerces mailing list. Andy Clark is a parser professional, so
> he knows what he is doing.
>
> I'm also working on an HTML parser, but it is too early to talk about.
> Parsing HTML documents is often about capturing information from a
> page, and I find myself using XSLT, XMLFilters etc. to extract data.
> It is powerful but not very simple.
>
> HTML parsing is not always about well-formedness, but about
> extracting information using regular expressions or simple text
> patterns. Then you have XPath and XSL, which require a well-formed
> (X)HTML document to work, and that requires building DOM trees, which
> is a memory and speed problem. Much more could be said on this.....
>
> Digital (now Compaq) tried to make a "web language" that you can use
> to fetch pages from the web and extract data. Take a look at it at -
> http://www.research.compaq.com/SRC/WebL/. There are problems with
> Java 1.3; you need to make some small changes to the source code (I'm
> running it on Java 1.3 on Mac OS X).
>
> Niels Peter
>
>
> On Monday, March 4, 2002, at 06:24 PM, Alexey N. Shananin wrote:
>
> > Hi!
> > I'm looking for a parser for HTML.
> > I know that XML parsers can't correctly handle HTML tags because
> > these tags might be unclosed (I mean <br> tag, but not <br/> or for
> > example...).
> > I heard about the XHTML standard. It's supported by XML parsers, as
> > far as I
>
> -----------------------------------------------------------------
> The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> initiative of OASIS <http://www.oasis-open.org>
>
> The list archives are at http://lists.xml.org/archives/xml-dev/
>
> To subscribe or unsubscribe from this list use the subscription
> manager: <http://lists.xml.org/ob/adm.pl>
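
For readers who want to see what "well-formedness on the SAX stream" can
look like, here is a minimal sketch. It is not the code from HEX or from
the filter described in the message above. The class name BalancingFilter
and its demo() helper are made up for illustration, and it assumes the
simplest possible repair rule: keep a stack of open elements, close any
that are still open when an outer end tag or the end of the document
arrives, and drop end tags that match nothing. Namespaces are ignored.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical example class: repairs element nesting in a stream of SAX events.
class BalancingFilter extends XMLFilterImpl {
    // qNames of the elements that are currently open, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        open.push(qName);
        super.startElement(uri, local, qName, atts);
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (!open.contains(qName)) {
            return; // stray end tag with no matching start: drop it
        }
        // Close every element left open inside the one being ended, then the element itself.
        while (!open.isEmpty()) {
            String top = open.pop();
            super.endElement("", top, top); // simplification: no namespace URI is tracked
            if (top.equals(qName)) {
                break;
            }
        }
    }

    @Override
    public void endDocument() throws SAXException {
        // Close whatever is still open before the document ends.
        while (!open.isEmpty()) {
            String top = open.pop();
            super.endElement("", top, top);
        }
        super.endDocument();
    }

    // Feeds hand-made events for "<p><br></p></b><i>" and returns what the filter passes on.
    static List<String> demo() {
        final List<String> log = new ArrayList<>();
        BalancingFilter filter = new BalancingFilter();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String u, String l, String q, Attributes a) {
                log.add("<" + q + ">");
            }

            @Override
            public void endElement(String u, String l, String q) {
                log.add("</" + q + ">");
            }
        });
        Attributes none = new AttributesImpl();
        try {
            filter.startDocument();
            filter.startElement("", "p", "p", none);
            filter.startElement("", "br", "br", none); // never closed by the input
            filter.endElement("", "p", "p");
            filter.endElement("", "b", "b");           // never opened by the input
            filter.startElement("", "i", "i", none);   // still open at end of document
            filter.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the filter is an ordinary XMLFilterImpl, it can sit between any
SAX event source (for example a tag-soup tokenizer) and a DOM or JDOM
builder, so the tree builder only ever sees balanced events. Real HTML
needs more rules than this (empty elements, implied parents and so on),
which the sketch does not attempt.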