troff to unicode conversion

Subject: troff to unicode conversion
From: David J Birnbaum <djbpitt+xml@xxxxxxxx>
Date: Mon, 11 Sep 2006 07:42:32 -0400
Dear Dimitre (cc xsl-list),

> If you can send me the actual troff file and
> the definition of the mappings I will be
> interested to look for a better solution.

Thank you for your willingness to look at this. Because the troff file is quite large (6.7MB), instead of sending it by mail I have uploaded it to:

http://clover.slavic.pitt.edu/~djb/troff-to-unicode.zip

troff-to-unicode.zip contains:

temp3.xml: xml file with troff character coding. Note that by this stage I have already converted the troff structural and procedural markup to xml; the only part of the conversion still to be done involves the character coding of the textual data.

pvl_mappings.xml: xml file with troff/unicode mapping pairs

pvl_regex_fix.xsl: xsl stylesheet that inserts extra backslashes into the mapping file so that replace() will work in the subsequent stylesheet. I built the mapping file in two stages this way because that makes it easier for me to read.

pvl_mappingGenerator.xsl: operates on the output of pvl_regex_fix.xsl to produce a new stylesheet (which I call pvl_unicode.xsl), which can be used to convert the character coding in temp3.xml from troff to unicode. I don't include pvl_unicode.xsl in the zip file because it can be generated from the included files (see below).

To process:

saxon8 -o pvl_mappings1.xml pvl_mappings.xml pvl_regex_fix.xsl
saxon8 -o pvl_unicode.xsl pvl_mappings1.xml pvl_mappingGenerator.xsl
saxon8 -o temp4.xml temp3.xml pvl_unicode.xsl

Step 1 adds extra backslashes to the mapping file so that the regular expressions will work correctly. Step 2 reads the output of Step 1 and builds the stylesheet (which I call pvl_unicode.xsl) that will do the actual character conversion. Step 3 applies that stylesheet to temp3.xml, which is the troff-encoded input. temp4.xml is the final output. It has the same structure as temp3.xml, but the troff character coding in temp3.xml is replaced with unicode in temp4.xml.
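
The reason Step 1 is needed can be illustrated outside XSLT. What follows is only a minimal Python sketch, not the contents of pvl_regex_fix.xsl: troff codes begin with a backslash and often contain characters such as "(", which are regex metacharacters, so each code has to be escaped before it can serve as a pattern. The mapping pairs and the function name escape_key are hypothetical; the real pairs are in pvl_mappings.xml.

```python
import re

# Hypothetical troff/unicode pairs, for illustration only.
# The real pairs are in pvl_mappings.xml.
MAPPINGS = {
    r"\(qq": "\u0467",
    r"\(q": "\u0463",
}

def escape_key(troff_code: str) -> str:
    """Backslash-escape regex metacharacters so the code matches literally."""
    return re.escape(troff_code)

# Same mapping, but with keys that are safe to use as regex patterns.
escaped_mappings = {escape_key(code): char for code, char in MAPPINGS.items()}
```

With the keys escaped, a call such as re.sub(escape_key(code), char, text) treats the troff code as literal text rather than as regex syntax.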

The problem is the inefficiency of the actual character conversion (the application of pvl_unicode.xsl to temp3.xml to produce temp4.xml).

Thank you for any advice or suggestions.

> It seems to me that the str-map template of FXSL 1.x
> should be more efficient, as it only performs a single
> pass on the string and will do all the replacements.

I haven't had occasion to use FXSL in any projects yet (although I was very interested in and impressed by the demonstration at Extreme), so if that proves to be an effective solution, I'll look forward to learning more about it.
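
The difference between repeated replacement and a single pass can be sketched in Python. This is not FXSL's str-map, and it is only an assumption that the generated pvl_unicode.xsl applies one replace() per mapping pair. The sketch uses a regex alternation as a stand-in for the single-pass idea, and the mapping pairs and function names are hypothetical.

```python
import re

# Hypothetical troff/unicode pairs, for illustration only.
MAPPINGS = {
    r"\(qq": "\u0467",
    r"\(q": "\u0463",
}

def convert_repeated(text, mappings):
    """Scan the whole text once per mapping pair."""
    for code, char in mappings.items():
        text = text.replace(code, char)
    return text

def build_single_pass(mappings):
    """Build a converter that scans the text once for all codes together."""
    # Longest codes first, so a code is not shadowed by its own prefix.
    codes = sorted(mappings, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(code) for code in codes))

    def convert(text):
        # Each match is looked up in the mapping and replaced in place.
        return pattern.sub(lambda m: mappings[m.group(0)], text)

    return convert

convert_single_pass = build_single_pass(MAPPINGS)
```

With N mapping pairs, convert_repeated reads the text N times, while convert_single_pass reads it once. On a 6.7MB input with many pairs that difference is likely to matter, although the actual gain would need to be measured.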

Best,

David
djbpitt+xml@xxxxxxxx
