Re: sets of parsing rules
Hi,

Nathan Young -X (natyoung - Artizen at Cisco) wrote:
> Hi.
>
> I have seen parts of this question addressed but I think it's worth
> asking the whole question anyway, since I'm sure others have run into
> this problem but I haven't been able to dig up any best practices in my
> searching so far. I may just need to search with the right terminology,
> in which case this should be an easy one for someone who already
> knows...
>
> I have an application that parses a large number of HTML pages. A few
> of them are well formed XHTML but that's the exception rather than the
> rule. By grabbing pages, manipulating them a bit (regexps have been
> sufficient here so far), then tidying them I can get them to a state
> where they are parsable XML.

TagSoup and NekoHTML are tools that do this job. NekoHTML is bundled in RefleX, so getting a DOM tree from ill-formed HTML sources is straightforward:

    <xcl:parse-html name="myHtml" source="file:///path/to/file.html"/>

Then you can use XPath on it:

    $myHtml//div

(Beware of the namespaces that CyberNeko might set on HTML. I don't remember what the default is, but you can of course change it.)

> From there I can use XSL to get them the
> rest of the way (although I have a process that allows me to run regexps
> here too, supplementing XSLT 1.0).
>
> The wrinkle is that I have several kinds of pages, each one requiring a
> distinct set of steps in order to parse it. I'm starting down the road
> of modularizing the transforms so that I can handle more page types over
> time in a way that's transparent to the rest of my application.
>
> I've been exposed to XML-only pipelines; are there pipeline tools that
> allow for non-XML steps?
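To make the question concrete, here is a minimal sketch, in Python rather than RefleX, of one way to chain non-XML (regexp) steps and XML (tree) steps per page type. Everything named here is hypothetical and chosen for illustration: the `Pipeline` class, the step functions, and the page types "article" and "listing". None of it is part of RefleX, TagSoup or NekoHTML. It also assumes that a few regexp repairs are enough to make the markup well formed; real tag soup would need a tidying parser in place of the regexp stage.

```python
import re
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List

# A text step works on raw markup (the non-XML stage).
TextStep = Callable[[str], str]
# A tree step works on the parsed document (the XML stage).
TreeStep = Callable[[ET.Element], ET.Element]


def close_void_tags(markup: str) -> str:
    """Regexp repair: turn <br> and <hr> into self-closing tags."""
    return re.sub(r"<(br|hr)\s*>", r"<\1/>", markup, flags=re.IGNORECASE)


def escape_bare_ampersands(markup: str) -> str:
    """Regexp repair: escape '&' that does not start an entity reference."""
    return re.sub(r"&(?!#?\w+;)", "&amp;", markup)


def drop_script_elements(root: ET.Element) -> ET.Element:
    """Tree step: remove every <script> element, wherever it sits."""
    for parent in list(root.iter()):  # snapshot first, then mutate
        for child in list(parent):
            if child.tag == "script":
                parent.remove(child)
    return root


class Pipeline:
    """Runs text steps, parses once, then runs tree steps."""

    def __init__(self, text_steps: List[TextStep], tree_steps: List[TreeStep]):
        self.text_steps = text_steps
        self.tree_steps = tree_steps

    def run(self, markup: str) -> ET.Element:
        for text_step in self.text_steps:
            markup = text_step(markup)
        root = ET.fromstring(markup)  # raises ParseError if still ill-formed
        for tree_step in self.tree_steps:
            root = tree_step(root)
        return root


# One pipeline per page type (hypothetical types); adding a page type
# means adding an entry here, with no change for the callers.
PIPELINES: Dict[str, Pipeline] = {
    "article": Pipeline([close_void_tags, escape_bare_ampersands],
                        [drop_script_elements]),
    "listing": Pipeline([close_void_tags], []),
}


def parse_page(page_type: str, markup: str) -> ET.Element:
    """Entry point for the rest of the application."""
    return PIPELINES[page_type].run(markup)


if __name__ == "__main__":
    page = "<html><body><div>Fish & chips<br></div><script>x()</script></body></html>"
    tree = parse_page("article", page)
    for div in tree.findall(".//div"):
        print(div.text)
```

The rest of the application only calls `parse_page`, so the set of steps behind each page type can change without the callers noticing.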
See the section "dealing with non-XML data source":
http://reflex.gforge.inria.fr/tips.html

There are also tutorials that show you how to convert a plain-text source to XML:
http://reflex.gforge.inria.fr/tutorial.html#textToXML

or how to parse a multipart SOAP message with a regular expression:
http://reflex.gforge.inria.fr/tutorial.html#N801BD1

Another useful example shows how to filter, with XPath patterns, a very big XML source that would cause an OutOfMemoryError if you were using XSLT or DOM-based processing:
http://reflex.gforge.inria.fr/tutorial.html#N801C30

etc.

--
Cordialement,

Philippe Poulard
http://reflex.gforge.inria.fr/
Have the RefleX !
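The memory issue with very big XML sources mentioned above can also be illustrated outside RefleX. The following is a minimal sketch using `iterparse` from Python's standard library. The function name `stream_matching` is hypothetical, and matching is by element name only, which is far weaker than the XPath patterns the RefleX tutorial describes. The idea is the same, though: handle each matching element as it is parsed, and discard what has already been seen instead of building the whole tree.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator


def stream_matching(source: BinaryIO, tag: str) -> Iterator[ET.Element]:
    """Yield each element named `tag` without keeping the whole tree."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem    # the element is complete at its "end" event
            root.clear()  # detach everything parsed so far from the root


if __name__ == "__main__":
    data = (b"<log><entry id='1'><msg>a</msg></entry><other/>"
            b"<entry id='2'><msg>b</msg></entry></log>")
    for entry in stream_matching(io.BytesIO(data), "entry"):
        print(entry.get("id"), entry.findtext("msg"))
```

Each yielded element should be used before asking for the next one; once the root is cleared, the element is no longer attached to the document.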