RE: Word and XML (was: XML standards coherency and so forth)
> From: "Rick Jelliffe" <ricko@a...>
> Date: Sun, 24 Jan 1999 16:15:36 +1100
> Subject: Re: Word and XML (was: XML standards coherency and so forth)
>
> From: Biron,Paul V <Paul.V.Biron@k...>
>
> >Word 97 also produced several well-formedness violations when doing anything
> >more than simple nested lists.
>
> Dave Raggett's program "tidy" is excellent for fixing up badly formed
> HTML and making it valid (it figures out which HTML DTD the document is
> valid according to, and generates the appropriate DOCTYPE for it). It
> also is great for converting to HTML-in-XML (e.g. our website
> www.ascc.net/xml/ uses it).
>
> The program is available at
> http://www.w3.org/People/Raggett/tidy/
>
> I think website developers should consider making tidy a standard part
> of website maintenance. Each HTML editing program can do strange things
> to markup; using tidy on the maintenance fileset and then updating the
> website fileset is a good way to keep a WF site without forcing you to
> give up non-WF tools.
>
> Rick
>

Wow! I've been so busy lately that I haven't been able to keep up with XML-DEV and had no idea my "innocent" post on Word and HTML/XML had been so long lived!

On this matter, tidy was one of the first "fix-it" approaches we tried. Unfortunately, tidy doesn't fix this particular problem. Tidy does many, many VERY important things! Fixing this problem is not one of them.

The HTML produced by Word '97 from my example is:

    <P>This is <B>a test <I>of the</B> emergency</I> broadcast system</P>

The output produced by tidy (22jan99 version) is:

    <P>This is <B>a test <I>of the</I> emergency</B> broadcast system</P>

While this is "well-formed" HTML (it does not contain improper nesting), it is NOT the output that is wanted. The problem is that in the original, the BOLD stops after "the" (where it should stop); in the tidy version it continues until after "emergency".

What Word should have produced in the first place is:

    <P>This is <B>a test <I>of the</I></B> <I>emergency</I> broadcast system</P>

That is, the fix is to insert a </I> when the </B> is seen and then to reopen <I> after the </B>. Tidy just replaces the </B> with </I> and then replaces the original </I> with </B>.

The only tool I've found so far that fixes this problem correctly is FrontPage v1.1 (about 4 years old, funny they had it working back then :-).

In truth, we've spent a great deal of time writing tools (a big daisy chain of FrontPage v1.1 -> hand-rolled perl script 1 -> hand-rolled perl script 2 -> etc.) just to clean up the HTML output from Word '97. What has made this all the more frustrating for us is that the HTML is not really what we want in the end. We just want a "clean" HTML version so that the transformation to the XML DTD that we're interested in is "easier". The BOLD and ITALIC that our authors see actually represent more "semantic" XML elements, e.g., <allergy> and <medication>.

Such is life.

Paul V. Biron
SGML Business Analyst
Kaiser Permanente, So Cal.

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)
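
The repair described in the message (emit a </I> when the </B> is seen, then reopen <I> after it) can be sketched with a stack of open elements. This is only an illustration under stated assumptions, not the logic of FrontPage, tidy, or the perl scripts mentioned above: the function name `fix_overlap`, the regular expression, and the short list of void elements are all hypothetical, and the tag matching is far too naive for real-world HTML (comments, scripts, and attribute values containing ">" are not handled).

```python
import re

# Assumption: tags are simple enough to be matched by a regular expression.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")
# Assumption: a small, incomplete set of elements that never take a close tag.
VOID = {"BR", "HR", "IMG", "INPUT", "META", "LINK"}


def fix_overlap(html: str) -> str:
    """Repair overlapping inline tags by closing and reopening inner elements."""
    out = []      # pieces of the repaired document
    stack = []    # open elements as (name as written, full open tag)
    pos = 0
    for m in TAG.finditer(html):
        out.append(html[pos:m.start()])   # copy text before this tag
        pos = m.end()
        is_close, name = m.group(1) == "/", m.group(2)
        if not is_close:
            out.append(m.group(0))
            if name.upper() not in VOID:
                stack.append((name, m.group(0)))
            continue
        keys = [n.upper() for n, _ in stack]
        if name.upper() not in keys:
            continue                      # stray close tag: drop it
        # Index of the innermost open element with this name.
        idx = len(keys) - 1 - keys[::-1].index(name.upper())
        inner = stack[idx + 1:]           # elements opened after it, still open
        for n, _ in reversed(inner):
            out.append(f"</{n}>")         # close the inner elements first
        out.append(m.group(0))            # then the close tag we actually saw
        for _, open_tag in inner:
            out.append(open_tag)          # reopen the inner elements
        stack = stack[:idx] + inner
    out.append(html[pos:])
    return "".join(out)


if __name__ == "__main__":
    word = "<P>This is <B>a test <I>of the</B> emergency</I> broadcast system</P>"
    print(fix_overlap(word))
```

On the Word '97 example this keeps the bold ending after "the", as wanted. The one difference from the hand-written target above is whitespace placement: the sketch reopens <I> immediately after </B>, so the space before "emergency" ends up inside the italic element rather than between the two elements.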