Re: Re: Structured from/within unstructured documents
Edward C. Zimmermann wrote:

> To be honest I've not looked at it for years. I don't have Acrobat,
> only the reader, but, if I recall, the "save as XML" functionality
> was part of their XML architecture (borrowed, I think, from
> FrameMaker+SGML, which I do have). This means that either the data was
> pre-tagged or one defined an appropriate mapping table.

Nope, it's a simple "save as". If the document was tagged, it will use whatever information it has about styles, but it doesn't depend on tagging to produce valid XML.

> This can be good in a controlled environment where one is converting
> from existing documentation in a consistent corporate style but is
> ill-suited for conversion of the typical wild-west mix that most
> companies tend to have.

My objective is always to get out of any proprietary file format and into some form of XML as quickly as possible, then assess what I've got and move forward from there. You have to tame the data somewhere. I prefer to do it in XML, but before moving to my target structure.

> With the effort, I think, one is better off using old-school,
> mission-designed tools in the spirit of Omnimark or something like
> ClearForest or any of a number of auto-tagging and content
> categorization solutions in between.

Yep, I started coding with OmniMark when it was still XTran from Exoterica and it's a great tool. If you're doing whatever the modern equivalent of a cross- or up-translate is, though, the additional information in the XML tagging is only going to help, I would have thought. I don't see the two approaches as being incompatible at all.

> It's an industry with a host of companies specialized in the
> conversion of data to XML using these and their own proprietary
> tools.

Sure, if you're willing to send your data out you don't care what tools they're using, but that wasn't what the original poster was after.
> As I wrote earlier: [expletive deleted] in the metadata and marking up sentences,
> paragraphs and pages can be done with good quality in a relatively
> generic manner (adequate enough, I found, to be applied to
> all-purpose PDF indexing). You really need to decide what you need.

Agreed. Identifying sentences and pages is a very different task from anything focused more closely on the information itself.

> Using Word as in-between is like flying through Mogadishu to get to
> Los Angeles from Boston. It may get you there but chances are that
> you'll lose some luggage.

I wasn't advocating it, but saving as RTF from PDF and then going to XML was mentioned. I was offering an alternative.

Marcus