
XML cleanup for Word 2K documents

  • From: "Simon St.Laurent" <simonstl@s...>
  • To: xml-dev@l...
  • Date: Mon, 31 Jul 2000 12:24:23 -0400

At 08:12 AM 7/31/00 -0700, Chris Lovett wrote:
>General XML authoring was not a stated goal.  It does however embed some
>islands of well-formed XML inside the HTML pages.  This is intended for
>Office use only.  If you can figure out how to post-process the HTML to
>extract and manipulate this XML then more power to you.

It may be a bit early to announce, but since everyone's talking about it...

http://www.simonstl.com/projects/o2kxml/

I've been working on exactly such a post-processor, which filters Word 2K
(and I think Office 2K) files before they go through an XML parser.
Technically, it's a Java FilterReader.  By wrapping your parser input in
this filter, you allow the code to track the characters as they come in,
making the syntactic modifications necessary to turn them into well-formed
XML.
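To give a rough idea of the shape of such a filter, here is a minimal sketch. It is not the o2kxml code: the class name EmptyElementFilter, the list of elements it closes, and the choice to buffer the whole input rather than work incrementally are all assumptions made to keep the example short, and it uses current Java rather than 1.1.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical filter, not the o2kxml implementation: it closes a few empty
// HTML elements so a downstream XML parser sees <br/> instead of <br>.
class EmptyElementFilter extends FilterReader {
    private Reader fixed; // repaired text, built lazily on the first read

    EmptyElementFilter(Reader in) {
        super(in);
    }

    // Drain the wrapped reader once, repair the markup, and serve the result.
    private Reader fixed() throws IOException {
        if (fixed == null) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
            fixed = new StringReader(closeEmptyElements(sb.toString()));
        }
        return fixed;
    }

    // Assumed rule set: only br, hr, img, meta and link are treated as empty.
    // Unquoted attribute values and other syntax problems are out of scope.
    static String closeEmptyElements(String html) {
        return html.replaceAll(
            "(?i)<(br|hr|img|meta|link)\\b([^>]*?)\\s*/?>", "<$1$2/>");
    }

    @Override
    public int read() throws IOException {
        return fixed().read();
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        return fixed().read(cbuf, off, len);
    }

    @Override
    public long skip(long n) throws IOException {
        return fixed().skip(n);
    }

    @Override
    public boolean markSupported() {
        return false; // keep callers from marking the already-drained source
    }
}
```

To use it, wrap whatever Reader you would normally give the parser, for example new InputSource(new EmptyElementFilter(new FileReader("doc.htm"))) with a SAX parser, where doc.htm is a placeholder file name.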

Apart from a few empty HTML elements, Word does a pretty good job of
presenting clean structures in its HTML output, but not clean syntax.
Pretty good, of course, doesn't make it XML, but that's what this filter is
for.

This isn't a general XHTML clean-up program like Tidy - it only works for
O2K files, and may introduce problems in well-formed XHTML documents.  It
preserves all of the information stored in the O2K file, including the
strange conditionals Microsoft uses, though these are converted into an
element with an attribute.
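
As an illustration of the conditional conversion, here is a sketch of one way it could be done. The element name "conditional" and the attribute name "test" are made up for this example and are not necessarily what the filter emits; the two regular expressions assume the conditionals look like <![if ...]> or <!--[if ...]> and end with <![endif]> or <![endif]-->.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: rewrites Office conditionals as ordinary elements.
class ConditionalRewriter {
    // Matches both <![if expr]> and the comment-wrapped <!--[if expr]>.
    private static final Pattern IF_OPEN =
        Pattern.compile("<!(?:--)?\\[if\\s+([^\\]]+)\\]>");
    // Matches both <![endif]> and <![endif]-->.
    private static final Pattern IF_CLOSE =
        Pattern.compile("<!\\[endif\\](?:--)?>");

    static String rewrite(String html) {
        Matcher m = IF_OPEN.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // Keep the original condition text, escaped, as the attribute value.
            String open = "<conditional test=\""
                + escapeAttr(m.group(1).trim()) + "\">";
            m.appendReplacement(sb, Matcher.quoteReplacement(open));
        }
        m.appendTail(sb);
        return IF_CLOSE.matcher(sb.toString()).replaceAll("</conditional>");
    }

    // Escape the characters that would break a double-quoted attribute value.
    static String escapeAttr(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }
}
```

Because the condition survives as an attribute value, nothing is lost, and later processing can select or discard the conditional content by matching on that attribute.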

The filter doesn't remove any of Microsoft's XML or HTML, leaving it all
there for later processing with XSLT, the DOM, or the XML tool of your choice.
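
For the DOM route, a minimal sketch of that later processing might look like the following. It assumes the markup arriving on the Reader has already been cleaned up into well-formed XML; the class and method names are placeholders, and it uses the standard JAXP API from current Java rather than anything specific to the filter.

```java
import java.io.Reader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: parses cleaned-up markup and counts elements by name.
class CleanedDocReader {
    static int countElements(Reader cleaned, String name) throws Exception {
        // Build a DOM tree from the character stream.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(cleaned));
        // The default factory is not namespace-aware, so this matches the
        // tag name exactly as written in the document.
        NodeList hits = doc.getElementsByTagName(name);
        return hits.getLength();
    }
}
```

Passing a filtering Reader in place of a plain one is all it takes to put the cleanup step in front of the parser.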

You don't need to have Office 2000 to use the code - it only requires Java
1.1 or higher.  I haven't tested it extensively, just a few dozen Word
files, but so far it seems to do okay.  It's not pretty code, though I'm
cleaning and documenting as I find time.  I only got Word 2K last week, so
this is early, probably too early.

Test reports are welcome, as are code contributions.  

Simon St.Laurent
XML Elements of Style / XML: A Primer, 2nd Ed.
http://www.simonstl.com - XML essays and books
