Subject:Grab html to xml Author:Yaniv Gatigno Date:16 May 2008 06:49 PM
Suppose I have an HTML file, such as an Amazon book description.
I wish to get an XML file containing the values of selected fields in that HTML.
For example, for the page referred to at the bottom of the post, I'd like to get an XML file containing the author name, title, and price.
1. How do I do that?
2. What if I have many files with the same template? Can I batch the operation?
Subject:Grab html to xml Author:Alberto Massari Date:20 May 2008 09:43 AM
You can convert the HTML page into XML by running the Document Wizard "HTML to XML"; then you can write an XSLT or XQuery program that extracts the values using XPath expressions like //*[@class='buying']//*[@id='btAsinTitle'] (to get the title), //*[@class='buying']//a[contains(@href,'field-author')] (to get the author) and //*[@class='buying']//*[@class='priceLarge'] (to get the price).
But as you can guess, this method can easily be broken by minor changes in the HTML generated by Amazon (and it was deprecated by the web community several years ago); the proper way to get such information from Amazon is to use their Web Service interface, as described at http://www.amazon.com/E-Commerce-Service-AWS-home-page/b/ref=sc_fe_l_6?node=12738641
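To illustrate the XPath-based extraction and the batch question, here is a minimal sketch in Python using only the standard library. It is not the Document Wizard or an XSLT/XQuery program: it assumes the HTML has already been converted to well-formed XML, and the SAMPLE markup and the function names (text_of, extract_fields, to_xml, books_to_xml) are made up for illustration. ElementTree supports only a subset of XPath and has no contains(), so the author link is filtered in Python instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the XML produced by an HTML-to-XML conversion.
# The real Amazon markup is not shown in this thread, so this is an assumption.
SAMPLE = """<html><body>
<div class="buying">
  <h1><span id="btAsinTitle">Example Title</span></h1>
  <a href="/s?field-author=Jane%20Doe">Jane Doe</a>
</div>
<div class="buying"><b class="priceLarge">$12.34</b></div>
</body></html>"""


def text_of(elem):
    """Return the concatenated, stripped text of an element ('' if missing)."""
    return "".join(elem.itertext()).strip() if elem is not None else ""


def extract_fields(xml_text):
    """Pull title, author and price out of one converted page."""
    root = ET.fromstring(xml_text)
    title = author = price = None
    # Same idea as //*[@class='buying']//... in the post above.
    for block in root.findall(".//*[@class='buying']"):
        if title is None:
            title = block.find(".//*[@id='btAsinTitle']")
        if price is None:
            price = block.find(".//*[@class='priceLarge']")
        if author is None:
            # ElementTree has no contains(), so test the href in Python.
            for link in block.iter("a"):
                if "field-author" in link.get("href", ""):
                    author = link
                    break
    return {
        "title": text_of(title),
        "author": text_of(author),
        "price": text_of(price),
    }


def to_xml(fields):
    """Serialize one record as a <book> element."""
    book = ET.Element("book")
    for name in ("title", "author", "price"):
        ET.SubElement(book, name).text = fields[name]
    return ET.tostring(book, encoding="unicode")


def books_to_xml(documents):
    """Batch version: one <book> per input document under a <books> root."""
    root = ET.Element("books")
    for xml_text in documents:
        root.append(ET.fromstring(to_xml(extract_fields(xml_text))))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_xml(extract_fields(SAMPLE)))
    print(books_to_xml([SAMPLE, SAMPLE]))
```

For many files sharing the same template, the contents of each converted file could be read and passed to books_to_xml as a list. The same fragility applies: any change to the class or id names in the page breaks the extraction.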