Re: Structured from/within unstructured documents
Any instance of an LR(1) language can be processed in pure XSLT, and one possible result is an XML document.

See, for example, the json-document() function of FXSL. This function uses FXSL's generic LR(1) parsing system: the lr-parse() function.

More information can be found here:

http://dnovatchev.spaces.live.com/Blog/cns!44B0A32C2CCF7488!367.entry
http://www.stylusstudio.com/xsllist/200711/post20640.html

Cheers,
Dimitre Novatchev

"Stephen Green" <stephengreenubl@g...> wrote in message news:92040e120712151004n13dec762x770cbe02afa1abb8@m...
> What methods are there, these days, for extracting structured data from
> unstructured documents (such as PDF)?
>
> I'm aware it is quite straightforward to extract data from semi-structured
> documents such as spreadsheets (as previous XML-Dev discussions have
> shown, such as via ODF with XSLT and macros/Ant/Ant Contrib, etc).
>
> As yet, the only way I'm aware of for doing the same from PDF would be to
> print out to paper and use OCR (sounds a little ridiculous) or maybe to
> convert PDF, etc. to some XML-based or other text-based print/archive
> file somehow and go from there (perhaps with something akin to a
> screen-scraper?).
>
> Is this all there is?
>
> Plus, how does one then convert the data into, say, some XML
> or equivalent document and embed that in, say, the PDF or equivalent
> unstructured document file (for later extraction, say)?
>
> I'd very much appreciate any light on this. Thank you. I'm interested not
> so much in metadata but in actual data or full structured equivalents of the
> unstructured documents, rather than just enough data to create an index.
>
> E.g., what about patient records held in PDF and in XML formats, and how
> to turn the first into the latter and/or embed the latter in the first?
>
> Best regards
>
> --
> Stephen Green
>
> Partner
> SystML, http://www.systml.co.uk
> Tel: +44 (0) 117 9541606
>
> http://www.biblegateway.com/passage/?search=matthew+22:37 .. and voice
>
> _______________________________________________________________________
>
> XML-DEV is a publicly archived, unmoderated list hosted by OASIS
> to support XML implementation and development. To minimize
> spam in the archives, you must subscribe before posting.
>
> [Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
> Or unsubscribe: xml-dev-unsubscribe@l...
> subscribe: xml-dev-subscribe@l...
> List archive: http://lists.xml.org/archives/xml-dev/
> List Guidelines: http://www.oasis-open.org/maillists/guidelines.php
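
The reply above suggests table-driven LR(1) parsing as a route from plain text to XML. The following is only a rough sketch of that idea in Python, not in XSLT, and it is not FXSL's lr-parse() or json-document(). It assumes a toy grammar of nested bracketed lists. The ACTION/GOTO tables were built by hand for that grammar as SLR(1) tables, a simple special case of LR(1). The element names `list` and `atom` and the function `parse_to_xml` are hypothetical, chosen just for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# Toy grammar (an assumption for illustration, not taken from FXSL):
#   1: V -> atom      2: V -> [ ]      3: V -> [ E ]
#   4: E -> V         5: E -> E , V

def _atom(vals):
    el = ET.Element("atom")
    el.text = vals[0]
    return el

def _list(children):
    el = ET.Element("list")
    el.extend(children)
    return el

# production number -> (left-hand side, right-hand-side length, semantic action)
PRODUCTIONS = {
    1: ("V", 1, _atom),
    2: ("V", 2, lambda v: _list([])),
    3: ("V", 3, lambda v: _list(v[1])),
    4: ("E", 1, lambda v: [v[0]]),
    5: ("E", 3, lambda v: v[0] + [v[2]]),
}

# ACTION: (state, lookahead) -> ("s", state) shift | ("r", production) reduce | ("acc",)
ACTION = {
    (0, "atom"): ("s", 2), (0, "["): ("s", 3),
    (1, "$"): ("acc",),
    (3, "atom"): ("s", 2), (3, "["): ("s", 3), (3, "]"): ("s", 4),
    (5, "]"): ("s", 7), (5, ","): ("s", 8),
    (8, "atom"): ("s", 2), (8, "["): ("s", 3),
}
# Reduce entries: (state, production, lookaheads on which the reduction applies)
for state, prod, lookaheads in [(2, 1, ",]$"), (4, 2, ",]$"), (7, 3, ",]$"),
                                (6, 4, ",]"), (9, 5, ",]")]:
    for la in lookaheads:
        ACTION[(state, la)] = ("r", prod)

GOTO = {(0, "V"): 1, (3, "E"): 5, (3, "V"): 6, (8, "V"): 9}

TOKEN = re.compile(r"\s*(?:([\[\],])|(\w+))")

def tokenize(text):
    """Yield (kind, value) pairs, ending with the end marker ("$", None)."""
    pos, text = 0, text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        pos = m.end()
        yield (m.group(1), m.group(1)) if m.group(1) else ("atom", m.group(2))
    yield ("$", None)

def parse_to_xml(text):
    """Run the shift-reduce loop and return the root ElementTree element."""
    states, values = [0], []
    tokens = tokenize(text)
    kind, value = next(tokens)
    while True:
        action = ACTION.get((states[-1], kind))
        if action is None:
            raise SyntaxError(f"unexpected {kind!r} in state {states[-1]}")
        if action[0] == "s":        # shift: push state and token value
            states.append(action[1])
            values.append(value)
            kind, value = next(tokens)
        elif action[0] == "r":      # reduce: pop the handle, push the result
            lhs, size, build = PRODUCTIONS[action[1]]
            args = values[len(values) - size:]
            del states[len(states) - size:]
            del values[len(values) - size:]
            values.append(build(args))
            states.append(GOTO[(states[-1], lhs)])
        else:                       # accept
            return values[0]
```

For example, `ET.tostring(parse_to_xml("[a, [b, c]]"), encoding="unicode")` gives `<list><atom>a</atom><list><atom>b</atom><atom>c</atom></list></list>`. Only the tables and the semantic actions depend on the grammar. The driver loop stays the same for any LR(1) grammar, which is what makes a generic parsing function of this kind possible.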