
Re: gibberish-to-unicode conversation

Subject: Re: gibberish-to-unicode conversation
From: "Christopher R. Maden" <crism@xxxxxxxxx>
Date: Sat, 23 Apr 2011 22:34:53 -0400

On 04/23/2011 10:27 PM, Birnbaum, David J wrote:
> My question, then, after this long-winded exposition, is: How should
> I have conceptualized this task? I broke it down into three types of
> replacements and adopted a different strategy for each, and I started
> with the easiest (the one-to-one replacements). I then realized that
> the problem was more general (there are other possible types of
> mappings), and also that there were multiple ways to deal with some
> of the types of mapping. Finally, the problem begins with a text()
> node, but once a replacement inserts some markup, it's no longer just
> a text() node, so a recursive strategy that requires a pristine
> text() node as input may become inapplicable as the replacements
> accrue.
> On the one hand, this is a one-off transformation for a particular
> project, and once it's done I'll never have to run it again, so
> efficiency of execution isn't a high priority. On the other hand,
> these kinds of gibberish-to-unicode remappings are very common in my
> world (legacy documents in unusual writing systems), and I really
> should think about the general problem type, instead of cobbling
> together a new ad hoc solution every time a new project crosses my
> desk. I'd be grateful for any advice.

The main thing that comes to mind is: Did this need to be done in XSLT?
While it's certainly possible, this very much smells like a job for
Perl (or Python, if you prefer) to me.  That makes the many-to-many case
easier, as well.
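
To illustrate why a scripting language makes the many-to-many case
easier, here is a minimal Python sketch. The mapping table is entirely
hypothetical (the real one depends on the legacy font, which this thread
does not spell out), and convert() is just an illustrative name. The
idea is to compile every source sequence into one alternation, longest
first, and replace in a single pass, so that the output of one
replacement is never re-scanned by another.

```python
import re

# Hypothetical mapping from legacy "gibberish" sequences to Unicode.
# These entries are assumptions for illustration only.
MAPPING = {
    "a": "\u0430",         # one-to-one
    "w": "\u043e\u0443",   # one-to-many
    "~e": "\u0463",        # many-to-one
    "~": "\u0483",         # a shorter key that is a prefix of "~e"
}

# Sort keys longest first so "~e" is tried before "~".
_PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(MAPPING, key=len, reverse=True))
)

def convert(text):
    """Replace every mapped sequence in one left-to-right pass."""
    return _PATTERN.sub(lambda match: MAPPING[match.group(0)], text)
```

Characters with no entry in the table pass through unchanged, which may
or may not be what a given project wants.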

If you were to run into a particular (ab)use of encoding repeatedly, you
could even implement it as an encoding module in Perl, and then just
read the input as being in that encoding and re-write it in UTF-8.
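
The same idea can be sketched in Python rather than Perl, using the
standard codecs module. Everything specific here is an assumption: the
codec name "xlegacyfont", the byte-to-character table, and the choice to
pass unmapped bytes through as Latin-1. It is a decode-only sketch, not
a complete codec.

```python
import codecs

# Hypothetical table: legacy byte value -> Unicode string.
DECODING_TABLE = {
    0x61: "\u0430",
    0x62: "\u0431",
    0x7E: "\u0483",
}

def legacy_decode(data, errors="strict"):
    """Decode legacy bytes; unmapped bytes fall back to Latin-1 (an assumption)."""
    raw = bytes(data)
    text = "".join(DECODING_TABLE.get(byte, chr(byte)) for byte in raw)
    return text, len(raw)

def legacy_encode(text, errors="strict"):
    # Encoding back to the legacy font is out of scope for this sketch.
    raise NotImplementedError("this sketch only decodes")

def _search(name):
    """Codec search function: answer only for our hypothetical codec name."""
    if name == "xlegacyfont":
        return codecs.CodecInfo(
            name="xlegacyfont", encode=legacy_encode, decode=legacy_decode
        )
    return None

codecs.register(_search)
```

With that registered, bytes read from the input in binary mode can be
decoded with data.decode("xlegacyfont") and the result written back out
as UTF-8.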

That all said, I think your approach was sound, insofar as XSLT was the
tool to use.

--
Chris Maden, text nerd  <URL: http://crism.maden.org/ >
"Those in power write the history, while those who suffer
 write the songs." - Frank Harte
GnuPG Fingerprint: C6E4 E2A9 C9F8 71AC 9724 CAA3 19F8 6677 0077 C319