
Re: Re: Structured from/within unstructured documents

  • From: Marcus Carr <mcarr@a...>
  • To: "Edward C. Zimmermann" <edz@b...>
  • Date: Wed, 19 Dec 2007 10:05:44 +1100

Edward C. Zimmermann wrote:

> To be honest I've not looked at it for years-- I don't have Acrobat,
> only the reader--- but, if I recall, the "save as XML" functionality
> was part of their XML-architecture (borrowed, I think, from 
> Framemaker+SGML which I do have). This means that either the data was
> pre-tagged or one defined an appropriate mapping table.

Nope, it's a simple "save as". If the document was tagged, it will use 
whatever information it has about styles, but it doesn't depend on 
tagging to produce valid XML.

> This can be good in a controlled environment where one is converting
> from existing documentation in a consistent corporate style but is
> ill-suited for conversion of the typical wild-west mix that most
> companies tend to have.

My objective is always to get out of any proprietary file format and into 
some form of XML as quickly as possible, then assess what I've got and 
move forward from there. You have to tame the data somewhere - I prefer 
to do it in XML, but before I reach my target structure.
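
A minimal sketch of the "assess what I've got" step, assuming the raw export is already well-formed XML. The function name and the sample document are hypothetical and only for illustration; a real export from Acrobat will use its own element names. The sketch simply tallies element names so you can see what the export contains before designing any mapping to the target structure.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tally_elements(xml_text):
    """Count how often each element name occurs in an XML document."""
    root = ET.fromstring(xml_text)
    # root.iter() walks the root element and all of its descendants.
    return Counter(el.tag for el in root.iter())


# Hypothetical sample standing in for a raw "save as XML" export.
sample = "<doc><P>Intro</P><P>Body</P><H1>Title</H1></doc>"

counts = tally_elements(sample)
for name, n in counts.most_common():
    print(name, n)
```

A tally like this is only a first look; it says which elements exist and how often, not whether they are used consistently.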

> With the effort, I think, one is better off using old school mission 
> designed tools in the spirit of Omnimark or something like 
> ClearForest or any of a number of auto-tagging and content 
> categorization solutions in between.

Yep, I started coding with OmniMark when it was still XTran from 
Exoterica and it's a great tool. If you're doing whatever the modern 
equivalent is to a cross- or up-translate though, the additional 
information in the form of the XML tagging is only going to assist, I 
would have thought. I don't see the two approaches as being incompatible 
at all.

> It's an industry with a host of companies specialized in the
> conversion of data to XML using these and their own proprietary
> tools.

Sure, if you're willing to send your data out you don't care what tools 
they're using, but that wasn't what the original poster was after.

> As I wrote earlier: [expletive deleted] in the metadata and marking up sentences,
> paragraphs and pages can be done with good quality in a relatively
> generic manner (sufficiently adequate I found to be applied for
> all-purpose PDF indexing). You really need to decide what you need.

Agreed - identifying sentences and pages is a very different task to 
anything more concentrated on the information.

> Using Word as in-between is like flying through Mogadishu to get to 
> Los Angeles from Boston. It may get you there but chances are that
> you'll lose some luggage.

I wasn't advocating it, but saving as RTF from PDF and then going to XML 
was mentioned. I was offering an alternative.


Marcus

