XML cleanup for Word 2K documents
At 08:12 AM 7/31/00 -0700, Chris Lovett wrote:
>General XML authoring was not a stated goal. It does however embed some
>islands of well-formed XML inside the HTML pages. This is intended for
>Office use only. If you can figure out how to post-process the HTML to
>extract and manipulate this XML then more power to you.

It may be a bit early to announce, but since everyone's talking about it...

http://www.simonstl.com/projects/o2kxml/

I've been working on exactly such a post-processor, which filters Word 2K (and I think Office 2K) files before they go through an XML parser. Technically, it's a Java FilterReader. By wrapping your parser input in this filter, you allow the code to track the characters as they come in, making the syntactic modifications needed to turn them into legitimate XML files.

Apart from a few empty HTML elements, Word does a pretty good job of presenting clean structures in its HTML output, but not clean syntax. Pretty good, of course, doesn't make it XML, but that's what this filter is for.

This isn't a general XHTML clean-up program like Tidy. It only works for O2K files, and may introduce problems in well-formed XHTML documents. It preserves all of the information stored in the O2K file, including the strange conditionals Microsoft uses, though these are converted into an element with an attribute. The filter doesn't remove any of Microsoft's XML or HTML, leaving it all there for later processing with XSLT, the DOM, or the XML tool of your choice.

You don't need to have Office 2000 to use the code. It only requires Java 1.1 or higher.

I haven't tested it extensively, just a few dozen Word files, but so far it seems to do okay. It's not pretty code, though I'm cleaning and documenting as I find time. I only got Word 2K last week, so this is early, probably too early. Test reports are welcome, as are code contributions.

Simon St.Laurent
XML Elements of Style / XML: A Primer, 2nd Ed.
http://www.simonstl.com - XML essays and books
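
For readers unfamiliar with the FilterReader approach mentioned in the message, here is a minimal sketch of the general idea. It is not the o2kxml code: the class names `EmptyElementFilterReader` and `EmptyElementFilterDemo`, the list of empty elements, and the single rewrite it performs (adding the closing slash to empty HTML elements) are all assumptions chosen for illustration. It also uses modern Java rather than Java 1.1, and it naively treats the first `>` as the end of a tag, so comments, conditionals, and attribute values containing `>` are not handled.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not the filter announced in the message.
class EmptyElementFilterReader extends FilterReader {
    // Assumed sample of HTML elements that never take an end tag.
    private static final Set<String> EMPTY =
        new HashSet<>(Arrays.asList("br", "hr", "img", "meta", "link"));

    // Holds one buffered (and possibly rewritten) tag until it is read out.
    private final StringBuilder pending = new StringBuilder();
    private int pos = 0;

    EmptyElementFilterReader(Reader in) {
        super(in);
    }

    @Override
    public int read() throws IOException {
        if (pos < pending.length()) {
            return pending.charAt(pos++);      // drain the buffered tag first
        }
        int c = in.read();
        if (c != '<') {
            return c;                          // ordinary text or end of stream
        }
        // Buffer one whole tag so it can be inspected before it is released.
        pending.setLength(0);
        pos = 0;
        pending.append('<');
        int t;
        while ((t = in.read()) != -1) {
            pending.append((char) t);
            if (t == '>') {
                break;
            }
        }
        if (t == '>') {
            closeIfEmpty();
        }
        return pending.charAt(pos++);
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = 0;
        while (n < len) {
            int c = read();
            if (c == -1) {
                break;
            }
            buf[off + n++] = (char) c;
        }
        return (n == 0 && len > 0) ? -1 : n;
    }

    // Turns <br> into <br/> when the tag name is in EMPTY and not already closed.
    private void closeIfEmpty() {
        int end = pending.length() - 1;        // index of the closing '>'
        if (end < 2 || pending.charAt(end - 1) == '/') {
            return;
        }
        int i = 1;
        while (i < end && Character.isLetterOrDigit(pending.charAt(i))) {
            i++;
        }
        String name = pending.substring(1, i).toLowerCase(Locale.ROOT);
        if (EMPTY.contains(name)) {
            pending.insert(end, '/');
        }
    }
}

class EmptyElementFilterDemo {
    // Runs a string through the filter and collects the result.
    static String filter(String html) {
        try (Reader r = new EmptyElementFilterReader(new StringReader(html))) {
            StringBuilder out = new StringBuilder();
            int c;
            while ((c = r.read()) != -1) {
                out.append((char) c);
            }
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(filter("<p>one<br>two<img src=\"a.gif\"></p>"));
    }
}
```

With a SAX parser, a reader like this could be passed in through `new org.xml.sax.InputSource(reader)`, so the parser only ever sees the rewritten character stream.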