
Structured from/within unstructured documents

  • From: "Stephen Green" <stephengreenubl@g...>
  • To: "XML Developers List" <xml-dev@l...>
  • Date: Sat, 15 Dec 2007 18:04:41 +0000

What methods are there, these days, for extracting structured data from
unstructured documents (such as PDF)?

I'm aware it is quite straightforward to extract data from semi-structured
documents such as spreadsheets (as previous XML-Dev discussions have
shown, e.g. via ODF with XSLT and macros, Ant, Ant-Contrib, etc.).

As yet, the only ways I'm aware of for doing the same with PDF would be to
print out to paper and use OCR (which sounds a little ridiculous), or maybe
to convert the PDF to some XML-based or other text-based print/archive
format and go from there (perhaps with something akin to a screen-scraper).

Is this all there is?
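
To make the text-based route concrete, here is a minimal sketch in Python of
the "screen-scraper" idea. It assumes the PDF has already been converted to
plain text by some separate tool (that step is not shown), and that the text
contains lines of the form "Label: value". The sample text, the field labels
and the function name text_to_xml are all hypothetical, chosen only for
illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical text, standing in for the output of some PDF-to-text converter.
SAMPLE_TEXT = """\
Patient Record
Name: Jane Doe
Date of birth: 1970-01-31
Blood type: O+
"""

# Matches a line shaped like "Label: value".
FIELD = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def text_to_xml(text, root_name="record"):
    """Build an XML element with one child per "Label: value" line."""
    root = ET.Element(root_name)
    for line in text.splitlines():
        match = FIELD.match(line)
        if match is None:
            continue  # skip headings and anything not shaped like a field
        label, value = match.groups()
        # Derive an element name from the label,
        # e.g. "Date of birth" -> "date_of_birth".
        tag = re.sub(r"\W+", "_", label.lower()).strip("_")
        if not tag or tag[0].isdigit():
            tag = "field_" + tag  # XML names may not be empty or start with a digit
        ET.SubElement(root, tag).text = value
    return root


if __name__ == "__main__":
    print(ET.tostring(text_to_xml(SAMPLE_TEXT), encoding="unicode"))
```

This only works where the converted text keeps a predictable layout; tables,
multi-column pages and scanned images would need considerably more than a
single regular expression.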

Also, how does one then turn the extracted data into an XML (or
equivalent) document and embed that in the PDF, or equivalent
unstructured document file, for later extraction?

I'd very much appreciate any light on this. Thank you. I'm interested not
so much in metadata as in the actual data, or full structured equivalents
of the unstructured documents, rather than just enough data to create an
index.

E.g., what about patient records held in both PDF and XML formats: how
does one turn the former into the latter and/or embed the latter in the
former?

Best regards

-- 
Stephen Green

Partner
SystML, http://www.systml.co.uk
Tel: +44 (0) 117 9541606

http://www.biblegateway.com/passage/?search=matthew+22:37 .. and voice

