
HTML section headings to XML document sections

Subject: HTML section headings to XML document sections
From: Michel Goossens <Michel.Goossens@xxxxxxx>
Date: Thu, 9 Aug 2001 09:01:22 +0200 (METDST)
I have a lot of XHTML documents (mostly sanitized HTML with tidy and saved
with the -asxml option) that I would like to transform into XML (e.g.,
DocBook). The structure of HTML is however drastically different in
that standard HTML does not mark up the hierarchical subdivisions of a
document apart from indicating the start of each level by <h1>, <h2>,
<h3>, etc. Therefore my problem is to find a general algorithm, probably 
using recursion, to transform an HTML document into a valid XML equivalent, 
in particular indicating its hierarchical structure. For instance, suppose
I have an HTML source like this:

<html>
<h1>...</h1>....
<h2>...</h2>....
<h2>...</h2>....
<h3>...</h3>....
<h1>...</h1>....
<h2>...</h2>....
<h3>...</h3>....
<h3>...</h3>....
<h2>...</h2>....
</html>

this should become something like

<html>
<sect1><title>...</title>
....
<sect2><title>...</title>        
....
</sect2>
<sect2><title>...</title>
....
<sect3><title>...</title>
....
</sect3>
</sect2>
</sect1>
<sect1><title>...</title>
....
<sect2><title>...</title>
....
<sect3><title>...</title>
....
</sect3>
<sect3><title>...</title>
....
</sect3>
</sect2>
<sect2><title>...</title>
....
</sect2>
</sect1>
</html>

So the question is how to know, each time an <hx> (h1, h2, h3, ...) element
is encountered, which sections are still "open" at the same level or deeper
(that is, with a level number greater than or equal to that of the current
element), so that we can "close" them. In particular, before exiting the
document we should also close the complete hierarchy correctly.
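
To make the "open levels" bookkeeping concrete, here is a minimal sketch in
Python rather than XSLT, so it only illustrates the state-tracking idea and is
not a stylesheet solution. It assumes the headings and the content are all
direct children of one container element (e.g. <body>), and the names
nest_sections and local_name are hypothetical. A stack holds the currently
open sections; a new heading pops every entry at the same level or deeper.

```python
import re
import xml.etree.ElementTree as ET

HEADING = re.compile(r"^h([1-6])$")


def local_name(tag):
    """Drop a '{namespace}' prefix, as found in tidy -asxml output."""
    return tag.rsplit("}", 1)[-1]


def nest_sections(flat_root):
    """Build a new tree in which each hN child of flat_root opens a <sectN>.

    Assumption: headings and content are direct children of flat_root.
    Only the heading's leading text becomes the title; inline markup
    inside a heading is ignored in this sketch.
    """
    out_root = ET.Element(flat_root.tag)
    out_root.text = flat_root.text
    # Stack of (level, element). Level 0 is the root and is never popped.
    stack = [(0, out_root)]
    for child in flat_root:
        match = HEADING.match(local_name(child.tag))
        if match is None:
            # Ordinary content goes into the innermost open section.
            stack[-1][1].append(child)
            continue
        level = int(match.group(1))
        # "Close" every open section at the same level or deeper.
        while stack[-1][0] >= level:
            stack.pop()
        sect = ET.SubElement(stack[-1][1], f"sect{level}")
        title = ET.SubElement(sect, "title")
        title.text = child.text
        # Text that followed the heading now follows the title.
        title.tail = child.tail
        stack.append((level, sect))
    # Nothing to do at the end: sections left on the stack are already
    # complete subtrees, so the whole hierarchy is "closed" implicitly.
    return out_root


if __name__ == "__main__":
    flat = ET.fromstring(
        "<html><h1>A</h1><p>a</p><h2>B</h2><h2>C</h2><h3>D</h3></html>"
    )
    print(ET.tostring(nest_sections(flat), encoding="unicode"))
```

Because the output is built as a tree, there is no explicit "close" step at the
end of the document. A skipped level (an <h3> directly after an <h1>) simply
yields a <sect3> inside the <sect1>; whether that is acceptable for the target
DTD is a separate question.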

I have read with interest an article by Benoit Marchal mentioned here
recently: "recurse, not divide, to conquer", where he describes the use of
recursion for "hierarchising" a flat document, but I cannot really see how
to apply his approach in the present case without somehow also knowing the
"state" (hierarchical level) at the given point in the document. Reading
the discussion of recursion in MK's book or in "Professional XSL" did not
make me a lot wiser on how to solve this in an elegant way. Therefore, all
suggestions are very welcome. Thanks in advance. mg

Dr. Michel Goossens              Phone:(+41 22) 767-4902
CERN, IT Division                Fax:  (+41 22) 767-8630
CH-1211 Geneva 23, Switzerland   Email: michel.goossens@xxxxxxx


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

