
Re: Efficiency and replace()

Subject: Re: Efficiency and replace()
From: "Dimitre Novatchev" <dnovatchev@xxxxxxxxx>
Date: Sun, 10 Sep 2006 12:08:20 -0700

Note: the Cyrillic characters in the quoted message have been replaced
by spaces, because they caused Gmail to use base64 encoding, which the
xsl-list server rejected.

Hi David,

If you can send me the actual troff file and the definition of the
mappings, I would be interested in looking for a better solution.

It seems to me that the str-map template of FXSL 1.x should be more
efficient, as it performs all the replacements in a single pass over
the string.
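
To make the single-pass idea concrete, here is a minimal sketch in Python rather than XSLT. It is not the FXSL str-map implementation, only an illustration of the general technique: all keys are combined into one alternation, longest first, so the text is scanned once instead of once per mapping. The function name make_converter is hypothetical, and the Unicode values in the table are placeholders, since the real ones were stripped from the quoted message.

```python
import re

def make_converter(mapping):
    """Return a function that applies every replacement in `mapping`
    during a single left-to-right scan of its input string."""
    # Longest keys first, so that a key which is a prefix of a longer
    # key cannot shadow the longer one at the same position.
    keys = sorted(mapping, key=len, reverse=True)
    # re.escape neutralises regex metacharacters such as "\", "(" and "?".
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return lambda text: pattern.sub(lambda m: mapping[m.group(0)], text)

# Hypothetical table: the troff keys come from the quoted message, but the
# Unicode values are placeholders, not the real correspondences.
TROFF_TO_UNICODE = {
    r"\(qb": "\u0411",  # many-to-one mapping
    r"\(?s": "\u0441",
    r"\(?c": "\u0446",
    "a": "\u0430",      # one-to-one mappings can share the same table
    "b": "\u0431",
}

convert = make_converter(TROFF_TO_UNICODE)
```

Calling convert(r"a\(qbb") replaces the leading "a", the four-character sequence, and the trailing "b" in one scan. Because replaced text is never rescanned, the many-to-one and one-to-one mappings can live in the same table without a separate translate() step.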



On 9/10/06, David J Birnbaum <djbpitt+xml@xxxxxxxx> wrote:
> Dear XSLTians,
>
> For a troff-to-XML/Unicode conversion I've implemented a strategy that
> produces the desired result, but that does the conversion to Unicode
> slowly, and I would be grateful for advice about improving the efficiency.
>
> I handle the conversion of the structural markup to XML first, and I
> wind up with all of my XML tagging in place, but the text strings use
> troff escape sequences rather than Unicode. The text is almost all
> medieval Cyrillic, and most of the Cyrillic characters are represented
> in the troff with sequences of several ASCII characters. The strategy I
> adopted to convert the troff character encoding to Unicode was to create
> a mapping file for the troff-to-Unicode character correspondences.
> Here's a snippet (a single mapping correspondence):
>
> <mapping>
>   <troff>\(qb</troff>
>   <unicode> </unicode>
> </mapping>
>
> I then wrote an XSLT script that reads the file of mappings and
> generates another XSLT script that will do the actual remapping. Here's
> a snippet of the generated XSLT script; this snippet is taken from
> within a template rule for text() nodes (the named template that gets
> called follows the snippet):
>
> <xsl:variable name="temp52">
>   <xsl:call-template name="replacement">
>     <xsl:with-param name="text">
>       <xsl:value-of select="$temp51"/>
>     </xsl:with-param>
>     <xsl:with-param name="troff">\\\(\?s</xsl:with-param>
>     <xsl:with-param name="unicode"> </xsl:with-param>
>   </xsl:call-template>
> </xsl:variable>
> <xsl:variable name="temp53">
>   <xsl:call-template name="replacement">
>     <xsl:with-param name="text">
>       <xsl:value-of select="$temp52"/>
>     </xsl:with-param>
>     <xsl:with-param name="troff">\\\(\?c</xsl:with-param>
>     <xsl:with-param name="unicode"> </xsl:with-param>
>   </xsl:call-template>
> </xsl:variable>
> . . .
> <xsl:template name="replacement">
>   <xsl:param name="text"/>
>   <xsl:param name="troff"/>
>   <xsl:param name="unicode"/>
>   <xsl:value-of select="replace($text, $troff, $unicode)"/>
> </xsl:template>
>
> The program logic is that for each text node, the template rule passes
> the textual contents to a replace() function that replaces a troff
> encoding with the corresponding Unicode value. The replace() function is
> then called again with the next mapping. The textual content is passed
> along through repeated remappings, and when it emerges on the other end,
> all multi-character troff sequences have been replaced with Unicode
> characters. There are 64 such mappings. I use replace() only for places
> where a multi-character troff string has to be replaced by a single
> Unicode character; at the end of the series of calls to replace() I use
> translate() to do the remaining one-to-one mappings (there are
> approximately 50 of them) in a single function call. The order of the
> mappings is (obviously) important; I need to remap longer strings before
> shorter ones, since the shorter ones may be subcomponents of the longer
> ones. In particular, I can remap individual characters (the one-to-one
> mappings) only after I've taken care of all of the many-to-one ones.
>
> The input file (XML with troff character coding instead of the desired
> Unicode) is 6.7MB and the Unicode output is 7.8MB. The transformation
> takes approximately five minutes to run, which feels like an eternity,
> but I'm not sure to what extent the execution time reflects the size of
> the input file and the number of replacements that need to be
> performed, and to what extent it reflects inefficient program design.
> Can anyone suggest a revision that would provide a considerable
> improvement in efficiency (bearing in mind that the XSLT script that
> does the actual character remapping must be generated by XSLT from the
> mappings file)?
>
> Thanks,
>
> David
> djbpitt+xml@xxxxxxxx
>
>




--
Cheers,
Dimitre Novatchev
---------------------------------------
Truly great madness cannot be achieved without significant intelligence.
---------------------------------------
To invent, you need a good imagination and a pile of junk
