Re: PDF to XSL-FO

Subject: Re: PDF to XSL-FO
From: "W. Eliot Kimber" <eliot@xxxxxxxxxx>
Date: Fri, 22 Nov 2002 09:47:30 -0600

Noel Golding wrote:

One business problem would be to transform already existing pdf document to
xml.  FO could be the first step to getting it into an xml schema for the
organization.  I would benefit from such a tool.

I don't think that approach would bear much fruit. FO wouldn't add any value to what's already in the PDF, because an FO instance is itself a formatted document. The fact that it's in XML syntax doesn't mean anything for a to-XML process, since the FO still describes formatting rather than the structure your organization's schema would need.

It would almost certainly be more effective to use traditional data conversion approaches to get the data into XML.

In any case, the content of a PDF document is quite accessible using available PDF libraries such as PJ and the Adobe PDF library. If you could convert the PDF to FO you could just as easily convert it to some specific DTD--the problem is essentially the same and has the same level of difficulty.

But it's also the case that recognizing semantic structures from the composed page as printed is usually easier than recognizing them from the raw PDF data stream. That's because something like a bold indented title has only one visual representation but could be defined in the PDF stream in any number of ways within the same PDF document, many of which would be quite difficult to recognize heuristically. It's not uncommon, for example, to find a PDF page that's defined as a sequence of text commands, each containing one character that is positioned independently of all the other characters. That makes it very difficult to determine things like word boundaries, line boundaries, and so on, without actually doing the rendering those text commands define. At that point, you might as well scan the rendition. You could, I suppose, use the PDF text content as a post-scan quality check, but that's just a frill.
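To make the per-character case concrete, here is a minimal sketch of the kind of heuristic a converter would need. It assumes some PDF library has already produced one record per character with its position and advance width; the Glyph class, the function names, and the tolerance values are all hypothetical and chosen only for illustration. Note that the code is effectively redoing part of the layout work that rendering would do anyway.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One independently positioned character (hypothetical record type)."""
    char: str
    x: float      # left edge, in points
    y: float      # baseline, in points (PDF y grows upward)
    width: float  # advance width, in points


def group_into_lines(glyphs, y_tolerance=2.0):
    """Cluster glyphs whose baselines are within y_tolerance of each other."""
    lines = []
    # Sort top to bottom; stream order tells us nothing reliable.
    for g in sorted(glyphs, key=lambda g: (-g.y, g.x)):
        if lines and abs(lines[-1][0].y - g.y) <= y_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])
    return lines


def line_to_text(line, gap_ratio=0.3):
    """Join one line's glyphs, guessing a space wherever the gap is wide."""
    line = sorted(line, key=lambda g: g.x)
    out = [line[0].char]
    for prev, cur in zip(line, line[1:]):
        gap = cur.x - (prev.x + prev.width)
        # Assumed threshold: a gap over 30% of the previous glyph's width.
        if gap > gap_ratio * prev.width:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


def reconstruct_text(glyphs):
    """Return the page text as a list of lines, top to bottom."""
    return [line_to_text(line) for line in group_into_lines(glyphs)]


# Glyphs deliberately listed out of reading order, as a PDF stream may do.
sample = [
    Glyph("k", 106.0, 686.0, 5.0),
    Glyph("t", 112.0, 700.0, 4.0),
    Glyph("H", 100.0, 700.0, 6.0),
    Glyph("e", 122.0, 700.0, 5.0),
    Glyph("o", 100.0, 686.0, 6.0),
    Glyph("i", 106.0, 700.0, 3.0),
    Glyph("h", 116.0, 700.0, 6.0),
    Glyph("r", 127.0, 700.0, 4.0),
    Glyph("e", 131.0, 700.0, 5.0),
]

if __name__ == "__main__":
    for text in reconstruct_text(sample):
        print(text)
```

Even this toy version depends on two tuned thresholds, and it ignores kerning, rotated text, multiple columns, and font changes, all of which a real page can throw at you.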

That is, it's much easier for an OCR system to recognize a structural title by its formatting than it is for a PDF interpreter to recognize a structural title by the sequence of PDF commands that happen to have been used to render it.

Of course, if you have tagged PDF (PDF with embedded markup), things may be a little easier, but the use of tagged PDF is, I think, pretty rare, and in any case there are numerous limitations in what you can do with it.

Cheers,

Eliot
--
W. Eliot Kimber, eliot@xxxxxxxxxx
Consultant, ISOGEN International

1016 La Posada Dr., Suite 240
Austin, TX  78752 Phone: 512.656.4139

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
