
Re: PDF to XSL-FO

Subject: Re: PDF to XSL-FO
From: "W. Eliot Kimber" <eliot@xxxxxxxxxx>
Date: Fri, 22 Nov 2002 09:47:30 -0600
Noel Golding wrote:
> One business problem would be to transform already existing pdf document to
> xml.  FO could be the first step to getting it into an xml schema for the
> organization.  I would benefit from such a tool.

I don't think that approach would bear much fruit--FO wouldn't add any value to what's already in the PDF, because an FO instance is itself a formatted document. The fact that it's in XML syntax doesn't help a to-XML process: the markup describes formatting, not the structure of the content.


It would almost certainly be more effective to use traditional data conversion approaches to getting the data into XML.

In any case, the content of a PDF document is quite accessible using available PDF libraries such as PJ and the Adobe PDF library. If you could convert the PDF to FO you could just as easily convert it to some specific DTD--the problem is essentially the same and has the same level of difficulty.

But it's also the case that recognizing semantic structures from the composed page as printed is usually easier than recognizing them from the raw PDF data stream--that's because something like a bold indented title only has one visual representation but could be defined in the PDF stream in any number of ways within the same PDF document, many of which would be quite difficult to recognize heuristically. It's not uncommon, for example, to find a PDF page that's defined as a sequence of text commands, each containing one character that is positioned independently of all the other characters. That makes it very difficult to determine things like word boundaries, line boundaries, and so on, without actually doing the rendering those text commands define. At that point, you might as well scan the rendition. You could, I suppose, use the PDF text content as a post-scan quality check, but that's just a frill.

That is, it's much easier for an OCR system to recognize a structural title by its formatting than it is for a PDF interpreter to recognize a structural title by the sequence of PDF commands that happen to have been used to render it.
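
To make the word-boundary problem concrete, here is a rough Python sketch of the kind of heuristic you end up writing when every character arrives as its own positioned text command. Everything in it is an assumption for illustration: the Glyph record, the group_into_words function, and the gap_ratio and line_tol thresholds are hypothetical, not part of PJ, the Adobe PDF library, or any other real API. It also assumes you have already extracted each character's position and advance width from the content stream, which is itself a good part of the work.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One independently positioned character (hypothetical record)."""
    char: str
    x: float      # left edge, in points
    y: float      # baseline, in points (PDF y grows upward)
    width: float  # advance width, in points


def group_into_words(glyphs, gap_ratio=0.3, line_tol=2.0):
    """Rebuild lines of words from glyphs given in arbitrary order.

    gap_ratio and line_tol are guessed thresholds, not values taken
    from any specification.
    """
    # Cluster glyphs into lines: walk from the top of the page down and
    # start a new line whenever the baseline moves by more than line_tol.
    lines = []
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - g.y) <= line_tol:
            lines[-1].append(g)
        else:
            lines.append([g])

    result = []
    for line in lines:
        line.sort(key=lambda g: g.x)  # left-to-right text assumed
        words, current = [], line[0].char
        for prev, g in zip(line, line[1:]):
            # Horizontal space between the end of prev and the start of g.
            gap = g.x - (prev.x + prev.width)
            if gap > gap_ratio * max(prev.width, g.width):
                words.append(current)  # gap wide enough: word boundary
                current = g.char
            else:
                current += g.char
        words.append(current)
        result.append(words)
    return result
```

Even this toy version has to guess at two thresholds, and it assumes left-to-right text on straight baselines. Kerned, justified, rotated, or multi-column text will defeat it, which is the point: you are reimplementing pieces of the renderer just to find out where the words are.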

Of course, if you have tagged PDF (PDF with embedded markup), things may be a little easier, but the use of tagged PDF is, I think, pretty rare, and there are numerous limitations on what you can do with it in any case.

Cheers,

Eliot
--
W. Eliot Kimber, eliot@xxxxxxxxxx
Consultant, ISOGEN International

1016 La Posada Dr., Suite 240
Austin, TX  78752 Phone: 512.656.4139


XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list


