XML-DEV Mailing List Archive
Re: Structured from/within unstructured documents
Many thanks for the helpful answers.

I guess what would be particularly helpful would be an API or
equivalent. Has anyone heard of the like? Perhaps even a tool with a
scripting language, to allow lots of documents to be converted?

On 16/12/2007, Greg Hunt <greg@f...> wrote:
> Stephen,
> If the data is critical, then you should look at the specific source
> documents and their origins and confirm for yourself whether any
> particular tool has a low-enough error rate for the population of
> source documents that you have to deal with. It is always possible to
> create documents that cannot be converted; the question is whether you
> have to deal with them.
>
> Greg
>
> On 12/16/07, Stephen Green <stephengreenubl@g...> wrote:
> > I notice there are commercial tools advertised to convert PDF to
> > .doc or .odt, etc., or to extract data in one way or another. How
> > reliable do people find such tools? Is it realistic yet to be
> > extracting data and converting it to, say, XML documents in large
> > volumes and with crucial data such as financial, technical or
> > medical records?
> >
> > On 16/12/2007, Edward C. Zimmermann <edz@b...> wrote:
> > > On Sun, 16 Dec 2007 18:15:05 +1100, Greg Hunt wrote
> > > > Stephen,
> > > > The problem with processing the physical PDF file is precisely
> > > > its presentation orientation.
> > >
> > > You have to render PDF (at least internally into a buffer). It's a
> > > format with a graphical "language" not totally unlike (and built
> > > upon) PostScript where, among a host of features, each individual
> > > character can be positioned.
> > >
> > > Popular "freely" available PDF tools that can be used to "extract
> > > text" are, among others, Adobe's Acrobat Reader, Derek Noonburg's
> > > Xpdf, Poppler and Ghostscript. M$ Windows includes a "filter
> > > mechanism" called IFilter for their own search. It apparently
> > > includes, among others, a filter supplied by Adobe intended for
> > > the extraction of text from PDF.
> > >
> > > > A perverse document can mix image and text or even embed the
> > > > text in the reverse order that it would be displayed in.
> > >
> > > Not really that wholly uncommon--- calculating glyph position from
> > > the right.
> > >
> > > In rendering, however, you need or want to keep paragraph blocks
> > > together and **not** (as is the case with a "screen scrape" of a
> > > display-rendered page) preserve the columns and visual flow
> > > elements, as these not only make it much more difficult to extract
> > > simple things like sentences but also don't deliver any contextual
> > > information. That a text was set in two columns with a center
> > > picture is a result of its chosen style and not content
> > > structure--- recall that different output devices can look
> > > different. Linking structural semantics to many of these style
> > > elements is tenuous.
> > >
> > > --
> > > Edward C. Zimmermann, Basis Systeme netzwerk, Munich
> > > Office Leo (R&D):
> > > Leopoldstrasse 53-55, D-80802 Munich,
> > > Federal Republic of Germany
> > > http://www.nonmonotonic.net
> >
> > --
> > Stephen Green
> >
> > Partner
> > SystML, http://www.systml.co.uk
> > Tel: +44 (0) 117 9541606
> >
> > http://www.biblegateway.com/passage/?search=matthew+22:37
> > .. and voice

--
Stephen Green

Partner
SystML, http://www.systml.co.uk
Tel: +44 (0) 117 9541606

http://www.biblegateway.com/passage/?search=matthew+22:37
.. and voice
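
As a rough illustration of the kind of scripted bulk conversion asked about above, here is a minimal sketch, not a recommendation of any particular tool. It assumes the `pdftotext` command-line program (shipped with both Xpdf and Poppler, two of the tools named in the thread) is installed and on the PATH. The function names and the `<document>`/`<para>` element names are hypothetical, chosen only for this example. It does not use `-layout`, so the extractor attempts reading order rather than reproducing columns, and it treats blank-line-separated blocks as paragraphs, which is a crude assumption that will fail on many real documents.

```python
import shutil
import subprocess
import xml.etree.ElementTree as ET
from pathlib import Path


def pdftotext_command(pdf_path, layout=False):
    """Build a pdftotext command line; the trailing '-' sends text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        # -layout reproduces the physical page layout, columns included
        cmd.append("-layout")
    cmd += [str(pdf_path), "-"]
    return cmd


def text_to_xml(text, source_name):
    """Wrap blank-line-separated blocks in <para> elements (hypothetical schema)."""
    root = ET.Element("document", {"source": source_name})
    # Normalise line endings and treat form feeds (page breaks) as block breaks
    text = text.replace("\r\n", "\n").replace("\f", "\n\n")
    for block in text.split("\n\n"):
        words = block.split()
        if words:
            # Re-join hard-wrapped lines into one run of text per paragraph
            ET.SubElement(root, "para").text = " ".join(words)
    return ET.tostring(root, encoding="unicode")


def convert_directory(in_dir, out_dir):
    """Convert every *.pdf in in_dir to a same-named .xml file in out_dir."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(in_dir).glob("*.pdf")):
        result = subprocess.run(
            pdftotext_command(pdf), capture_output=True, check=True
        )
        text = result.stdout.decode("utf-8", errors="replace")
        xml = text_to_xml(text, pdf.name)
        (out / (pdf.stem + ".xml")).write_text(xml, encoding="utf-8")
```

Calling `convert_directory("pdfs", "xml_out")` (both directory names are placeholders) would process a whole folder. In line with Greg's advice, the output for a sample of the actual source documents would need checking by hand before trusting it with critical data.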