Re: Re: Structured from/within unstructured documents
On Tue, 18 Dec 2007 10:17:31 +1100, Marcus Carr wrote
> Stephen Green wrote:
>
> > What methods are there, these days, for extracting structured data from
> > unstructured documents (such as PDF)?
>
> Maybe I'm missing something, but I didn't see anyone suggest saving
> the PDF as XML straight from Acrobat. If you have a full licence, it

To be honest, I've not looked at it for years (I don't have Acrobat, only the Reader), but if I recall, the "save as XML" functionality was part of their XML architecture (borrowed, I think, from FrameMaker+SGML, which I do have). This means that either the data was pre-tagged or one defined an appropriate mapping table. This can be good in a controlled environment where one is converting existing documentation in a consistent corporate style, but it is ill-suited to converting the typical wild-west mix that most companies tend to have.

Given the effort involved, I think one is better off using old-school, purpose-built tools in the spirit of OmniMark, or something like ClearForest, or any of a number of auto-tagging and content-categorization solutions in between. It's an industry with a host of companies specialized in the conversion of data to XML using these and their own proprietary tools.

As I wrote earlier: [expletive deleted] in the metadata and marking up sentences, paragraphs and pages can be done with good quality in a relatively generic manner (adequate enough, I found, to be applied to all-purpose PDF indexing). You really need to decide what you need.

> does a pretty respectable job, getting you paragraph and character
> tagging, tables and images. You can also batch process, converting
> entire directories or what have you. The results are at least as
> good as saving the PDF to something like Word first and you could be
> forgiven for expecting that they might even be better.

Using Word as an in-between is like flying through Mogadishu to get to Los Angeles from Boston. It may get you there, but chances are that you'll lose some luggage.

--
Edward C. Zimmermann, Basis Systeme netzwerk, Munich
Office Leo (R&D): Leopoldstrasse 53-55, D-80802 Munich,
Federal Republic of Germany
http://www.nonmonotonic.net