Re: Re: Structured from/within unstructured documents

  • From: "Edward C. Zimmermann" <edz@b...>
  • To: Marcus Carr <mcarr@a...>, xml-dev@l...
  • Date: Tue, 18 Dec 2007 12:46:55 +0100

On Tue, 18 Dec 2007 10:17:31 +1100, Marcus Carr wrote
> Stephen Green wrote:
> 
> > What methods are there, these days, for extracting structured data from
> > unstructured documents (such as PDF)?
> 
> Maybe I'm missing something, but I didn't see anyone suggest saving 
> the PDF as XML straight from Acrobat. If you have a full licence, it 

To be honest, I've not looked at it for years (I don't have Acrobat,
only the Reader), but, if I recall, the "save as XML" functionality was
part of their XML architecture (borrowed, I think, from FrameMaker+SGML,
which I do have). This means that either the data was pre-tagged or one
defined an appropriate mapping table. This can be good in a controlled
environment where one is converting from existing documentation in a
consistent corporate style, but it is ill-suited to converting the typical
wild-west mix that most companies tend to have. For the effort, I think,
one is better off using old-school, purpose-built tools in the spirit
of OmniMark or something like ClearForest or any of a number of auto-tagging
and content categorization solutions in between. It's an industry with a host
of companies specializing in the conversion of data to XML using these and
their own proprietary tools.

As I wrote earlier: [expletive deleted] in the metadata and marking up sentences,
paragraphs and pages can be done with good quality in a relatively generic
manner (adequate enough, I found, to be applied to all-purpose PDF
indexing). You really need to decide what you need.
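
To give a rough idea of what "relatively generic" markup of pages,
paragraphs and sentences might look like, here is a minimal Python sketch.
It is not the tooling referred to above. It assumes the text has already
been pulled out of the PDF by some extractor that separates pages with a
form feed and paragraphs with blank lines, and the element names (doc, page,
p, s) and the markup() function are made up for illustration. The sentence
splitter is deliberately naive and will trip over abbreviations.

```python
import re
import xml.etree.ElementTree as ET

# Naive sentence boundary: '.', '!' or '?' followed by whitespace.
# (An assumption for illustration; real text needs a better splitter.)
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def markup(text: str) -> ET.Element:
    """Wrap extracted text in hypothetical <doc>/<page>/<p>/<s> elements."""
    doc = ET.Element("doc")
    # Assumes pages are separated by a form feed character.
    for page_no, page in enumerate(text.split("\f"), start=1):
        if not page.strip():
            continue
        page_el = ET.SubElement(doc, "page", n=str(page_no))
        # Assumes paragraphs are separated by one or more blank lines.
        for para in re.split(r"\n\s*\n", page):
            para = " ".join(para.split())  # collapse line wraps and runs of spaces
            if not para:
                continue
            p_el = ET.SubElement(page_el, "p")
            for sent in SENTENCE_END.split(para):
                if sent:
                    ET.SubElement(p_el, "s").text = sent
    return doc


sample = "First sentence. Second one!\n\nNew paragraph here.\fPage two text."
print(ET.tostring(markup(sample), encoding="unicode"))
```

Anything beyond this level (tables, headings, semantic tagging) is where
the choice of tool and the decision about what you need start to matter.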

> does a pretty respectable job, getting you paragraph and character 
> tagging, tables and images. You can also batch process, converting 
> entire directories or what have you. The results are at least as 
> good as saving the PDF to something like Word first and you could be 
> forgiven for expecting that they might even be better.

Using Word as an in-between is like flying through Mogadishu to get to
Los Angeles from Boston. It may get you there, but chances are that you'll
lose some luggage.


--

Edward C. Zimmermann, Basis Systeme netzwerk, Munich
Office Leo (R&D):
   Leopoldstrasse 53-55, D-80802 Munich,
   Federal Republic of Germany
http://www.nonmonotonic.net
