
Re: Structured from/within unstructured documents

  • From: "Stephen Green" <stephengreenubl@g...>
  • To: "XML Developers List" <xml-dev@l...>
  • Date: Sun, 16 Dec 2007 13:03:25 +0000

Many thanks for the helpful answers.

I guess what would be particularly helpful would be an API or equivalent.
Has anyone heard of the like? Perhaps even a tool with a scripting language,
to allow lots of documents to be converted?
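
As a rough sketch of the kind of scripted batch conversion I have in mind: the following assumes the pdftotext command-line tool (shipped with Xpdf and Poppler, both mentioned further down) is installed and on the PATH. The function names build_command, text_to_xml and convert_all, and the <document>/<para> element names, are made up purely for illustration; the real structure would depend on the documents. It is a starting point under those assumptions, not a reliable converter.

```python
import subprocess
import xml.etree.ElementTree as ET
from pathlib import Path


def build_command(pdf_path, txt_path, keep_layout=False):
    """Return the pdftotext argument list for one file."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical page layout
        # (columns and so on) rather than its default text ordering.
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def text_to_xml(text, source_name):
    """Wrap blank-line-separated blocks in hypothetical <para> elements."""
    root = ET.Element("document", {"source": source_name})
    # pdftotext separates pages with form feeds; treat them as block breaks.
    for block in text.replace("\f", "\n\n").split("\n\n"):
        # Collapse line breaks and runs of whitespace inside a block.
        para = " ".join(block.split())
        if para:
            ET.SubElement(root, "para").text = para
    return ET.tostring(root, encoding="unicode")


def convert_all(src_dir, dst_dir):
    """Convert every *.pdf in src_dir; return the names of files that failed."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        txt = dst / (pdf.stem + ".txt")
        result = subprocess.run(build_command(pdf, txt), capture_output=True)
        if result.returncode != 0:
            # Record the failure and carry on with the remaining files.
            failed.append(pdf.name)
            continue
        text = txt.read_text(encoding="utf-8", errors="replace")
        xml_path = dst / (pdf.stem + ".xml")
        xml_path.write_text(text_to_xml(text, pdf.name), encoding="utf-8")
    return failed
```

Calling convert_all("pdfs", "out") would then write one .txt and one .xml file per PDF and return the list of files pdftotext could not handle, which at least gives a first measure of the error rate over a given set of documents.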

On 16/12/2007, Greg Hunt <greg@f...> wrote:
> Stephen,
>  If the data is critical, then you should look at the specific source
> documents and their origins and confirm for yourself whether any particular
> tool has a low-enough error rate for the population of source documents that
> you have to deal with.  It is always possible to create documents that
> cannot be converted; the question is whether you have to deal with them.
>
>  Greg
>
> On 12/16/07, Stephen Green <stephengreenubl@g...> wrote:
> > I notice there are commercial tools advertised to convert PDF to .doc
> > or .odt, etc., or to extract data in one way or another. How reliable do
> > people find such tools? Is it realistic yet to be extracting data and
> > converting it to, say, XML documents in large volumes and with crucial
> > data such as financial, technical or medical records?
> >
> > On 16/12/2007, Edward C. Zimmermann <edz@b...> wrote:
> > > On Sun, 16 Dec 2007 18:15:05 +1100, Greg Hunt wrote
> > > > Stephen,
> > > > The problem with processing the physical PDF file is precisely its
> > > > presentation orientation.
> > >
> > > You have to render PDF (at least internally into a buffer). It's a format
> > > with a graphical "language" not totally unlike (and built upon) PostScript
> > > where, among a host of features, each individual character can be positioned.
> > >
> > > Popular "freely" available PDF tools that can be used to "extract text"
> > > are, among others, Adobe's Acrobat Reader, Derek Noonburg's Xpdf, Poppler
> > > and Ghostscript. M$ Windows includes a "filter mechanism" called IFilter
> > > for its own search. It apparently includes, among others, a filter
> > > supplied by Adobe intended for the extraction of text from PDF.
> > >
> > > > A perverse document can mix image and text or even embed the text in the
> > > > reverse order that it would be displayed in.
> > >
> > > Not really all that uncommon: calculating glyph position from the right.
> > >
> > > >
> > >
> > > In rendering, however, you need or want to keep paragraph blocks together
> > > and **not** (as is the case with a "screen scrape" of a display-rendered
> > > page) preserve the columns and visual flow elements, as these not only make
> > > it much more difficult to extract simple things like sentences but also
> > > don't deliver any contextual information. That a text was set in two
> > > columns with a center picture is a result of its chosen style and not its
> > > content structure; recall that output can look different on different
> > > devices. Linking structural semantics to many of these style elements is
> > > tenuous.
> > >
> > >
> > >
> > > --
> > >
> > >  Edward C. Zimmermann, Basis Systeme netzwerk, Munich
> > >  Office Leo (R&D):
> > >    Leopoldstrasse 53-55, D-80802 Munich,
> > >    Federal Republic of Germany
> > >  http://www.nonmonotonic.net
> > >
> > >
> >
> >
> > --
> > Stephen Green
> >
> > Partner
> > SystML, http://www.systml.co.uk
> > Tel: +44 (0) 117 9541606
> >
> > http://www.biblegateway.com/passage/?search=matthew+22:37 .. and voice
> >
>
>


-- 
Stephen Green

Partner
SystML, http://www.systml.co.uk
Tel: +44 (0) 117 9541606

http://www.biblegateway.com/passage/?search=matthew+22:37 .. and voice

