Re: HTML section headings to XML document sections
Michel,
The best solutions to this currently (IMHO) are Jeni's (references already posted). She and I kind of leap-frogged development of a solution. (I've called it "levitation", and you'll find my contributions in the list archives, I'll bet, if you search for that, but that name for the problem doesn't seem to have stuck :-). But of course Jeni writes great code *and* documents it.

The solution is to treat the problem as a special case of grouping, driving it all with keys that associate each node with the node that indicates its proper place in the hierarchy (generally the head of the invisible section it's in).

But I think you'll find you'll have problems, since your HTML coming in is not likely to be very regular. For example, if (when) you get something like...

h1 p p h3 p p p

you need to decide whether to interpolate a missing level (one that would be headed by an h2 but just happens to have no header; these things do happen in structured text), or whether to promote the h3 and its following p elements to the second level.

Unfortunately, which of these ways is "correct" will depend on the documents: it may vary, and from the purist's point of view it might demand an interpretation on a case-by-case basis. Not good.

So it will come down to (a) how good (bad) your data actually is, and (b) how brutal you can afford to be.

Enjoy,
Wendell

At 03:01 AM 8/9/01, you wrote:

I have a lot of XHTML documents (mostly HTML sanitized with tidy and saved with the -asxml option) that I would like to transform into XML (e.g., DocBook). The structure of HTML is, however, drastically different, in that standard HTML does not mark up the hierarchical subdivisions of a document apart from indicating the start of each level with <h1>, <h2>, <h3>, etc. Therefore my problem is to find a general algorithm, probably using recursion, to transform an HTML document into a valid XML equivalent, in particular one that indicates its hierarchical structure.
For instance, suppose I have an HTML source like this:

======================================================================
Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
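
The choice raised in the reply for a sequence like h1 p p h3 p p p (interpolate an untitled level, or promote the h3) can be made concrete with a small sketch. This is not the key-based XSLT grouping the message recommends; it swaps in a plain stack-based pass written in Python, purely to show the two policies side by side. The function name `nest_sections`, the `fill_gaps` flag, and the `doc`/`section`/`title` output element names are all hypothetical, and the sketch assumes the headings and paragraphs are direct children of one container element.

```python
import copy
import xml.etree.ElementTree as ET

HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


def nest_sections(body, fill_gaps=True):
    """Wrap a flat run of headings and blocks in nested <section> elements.

    fill_gaps=True  : interpolate an untitled <section> for each skipped level
    fill_gaps=False : promote the heading so it sits directly under its parent
    """
    root = ET.Element("doc")
    # Stack of (heading level, element); level 0 stands for the document root.
    stack = [(0, root)]
    for child in list(body):
        # Drop any namespace prefix such as "{http://www.w3.org/1999/xhtml}".
        local_name = child.tag.rsplit("}", 1)[-1]
        level = HEADINGS.get(local_name)
        if level is None:
            # Not a heading: copy it into the innermost open section.
            stack[-1][1].append(copy.deepcopy(child))
            continue
        # Close every open section at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        if fill_gaps:
            # Open one untitled section per missing intermediate level.
            while stack[-1][0] < level - 1:
                depth = stack[-1][0] + 1
                filler = ET.SubElement(stack[-1][1], "section")
                stack.append((depth, filler))
        section = ET.SubElement(stack[-1][1], "section")
        title = ET.SubElement(section, "title")
        # Only the leading text is kept; inline markup in headings is dropped.
        title.text = child.text
        stack.append((level, section))
    return root


flat = ET.fromstring(
    "<body><h1>A</h1><p>1</p><p>2</p><h3>B</h3><p>3</p><p>4</p><p>5</p></body>"
)
filled = nest_sections(flat, fill_gaps=True)
promoted = nest_sections(flat, fill_gaps=False)
print(ET.tostring(filled, encoding="unicode"))
print(ET.tostring(promoted, encoding="unicode"))
```

With `fill_gaps=True` the h3 section ends up three levels deep inside an untitled wrapper; with `fill_gaps=False` it becomes a direct child of the h1 section. Neither is right in general, which is the point made above: the policy has to be chosen to suit the documents at hand.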