Re: Structured from/within unstructured documents
Stephen Green wrote:
> What methods are there, these days, for extracting structured data from
> unstructured documents (such as PDF)?
>
> [!!! SNIP !!!]
>
> Is this all there is?

Microsoft Word and Open Office both export to XML, and Antiword is a program that does a pretty good job of converting Word files to DocBook.

For PDF, though, I don't know of any really good tools. The following page, from someone who has played with the problem, gives a summary of what's out there:

http://discerning.com/hacks/docutils/pdf2xml/readme.html

I'd love it if someone would tell me there's something actively maintained in the open source world that does this job. I don't know of one yet.

Jonathan

Red Hat Enterprise MRG: http://www.redhat.com/mrg/