Subject: Re: Advice on dictionary conversion
From: Michael Kay <mike@xxxxxxxxxxxx>
Date: Mon, 17 Jan 2011 22:57:10 +0000
This is a tough problem for three reasons:
(a) upconversion is intrinsically hard, because it depends on
recognizing the patterns that occur in the source - it's very much a
heuristic rather than an algorithmic process
(b) the XML that you get out of MS-Word is not the easiest thing to
start from, to put it mildly
(c) for the above two reasons, you'll need to use every trick in the
XSLT book, but you lack XSLT experience.
The best advice I can give for this kind of task is to build it as a
pipeline of transformations each of which gets you one step closer to
the target. Don't try to do too much in one transformation: it will get
too complicated and difficult to debug. Keep each step as simple as
possible. In the first step the focus should be on getting rid of Word
noise that you aren't interested in; in the second step you might want
to concentrate on identifying the boundaries between dictionary entries,
in the third step combining multiple documents into one, and so on.
(You can construct the pipeline as a shell script, or an Ant task, or
whatever you are comfortable with - it doesn't really matter. You can
even run the steps one at a time by hand.)
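
As an illustration only, the shape of such a pipeline can be sketched as follows. This is not XSLT: it uses Python's standard xml.etree.ElementTree, with each function standing in for what would be one stylesheet in the pipeline. The input format (a flat sequence of <run> elements with a style attribute), the rule that a bold run starts a new entry, and the style-to-tag mapping are all assumptions made up for the example; real converter output will be much messier.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified converter output: a flat sequence of
# styled runs. Note the stray bold space inside the first explanation.
SAMPLE = (
    '<doc>'
    '<run style="bold" font="Times">dog</run>'
    '<run style="plain"> a domestic</run>'
    '<run style="bold"> </run>'
    '<run style="plain">animal</run>'
    '<run style="bold">door</run>'
    '<run style="plain" font="Times"> an entrance</run>'
    '</doc>'
)


def strip_noise(doc):
    """Step 1: keep only the style attribute, neutralise formatting on
    whitespace-only runs, and merge adjacent runs that share a style."""
    out = ET.Element("doc")
    for run in doc.iter("run"):
        text = run.text or ""
        style = run.get("style", "plain")
        if not text.strip():
            # Formatting on pure whitespace is invisible, so let the run
            # inherit the style of whatever precedes it.
            style = out[-1].get("style") if len(out) else "plain"
        if len(out) and out[-1].get("style") == style:
            out[-1].text += text  # merge with the previous run
        else:
            ET.SubElement(out, "run", style=style).text = text
    return out


def group_entries(doc):
    """Step 2: start a new <entry> at every bold run (assumed headword)."""
    out = ET.Element("dictionary")
    entry = None
    for run in doc:
        if run.get("style") == "bold" or entry is None:
            entry = ET.SubElement(out, "entry")
        entry.append(run)
    return out


def tag_fields(dictionary):
    """Step 3: replace presentational runs with (assumed) semantic tags."""
    names = {"bold": "HEADWORD", "plain": "EXPLANATION", "italic": "EXAMPLE"}
    for entry in dictionary:
        for run in entry:
            run.tag = names.get(run.get("style"), "UNKNOWN")
            run.text = (run.text or "").strip()
            run.attrib.clear()
    return dictionary


def run_pipeline(xml_text, steps=(strip_noise, group_entries, tag_fields)):
    """Feed the output of each step into the next one."""
    tree = ET.fromstring(xml_text)
    for step in steps:
        tree = step(tree)
    return tree


if __name__ == "__main__":
    print(ET.tostring(run_pipeline(SAMPLE), encoding="unicode"))
```

Because each step takes a tree and returns a tree, any prefix of the pipeline can be run on its own and its output inspected, which is where most of the debugging happens. The ordering matters too: if the grouping step is run on the raw sample without the clean-up step, the stray bold space is mistaken for a headword and produces a spurious third entry.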
Michael Kay
Saxonica
On 17/01/2011 20:14, Ciaran S Duibhmn wrote:
I wish to convert a bilingual dictionary from MS-Word format to
"properly"-tagged XML, and I hope I may ask for some comment on the
feasibility of this, using XSLT or otherwise.
First I found several programs which automatically convert the Word
files to XSL-FO, either from .doc or .rtf. My preferred one of those
I examined is the Novosoft converter (http://www.rtf-to-xml.com/). I
painlessly converted the entire letter D using their online interface.
Now I have to replace the presentational tags by tags like <HEADWORD>,
<EXPLANATION>, <EXAMPLE> etc. I tried doing this manually, but it is
not practical. Besides, I have to start from scratch again for each
new letter of the alphabet. I have zero experience of XSLT, but it
seemed that an XSLT program might be what was needed. I started with
XRay2 (really nice for a beginner in some ways) and have now moved on
to the Essential XML Editor with Saxon. But progress has been minimal.
The main problem is my ignorance of XSLT, although I am an experienced
general programmer. A particular difficulty is that "italics" (for
example) might be used for more than one part of the dictionary
entry. However the choice of which tag to replace it with might well
be decided by the target DTD (if I were to formulate it). Is this an
example of what people sometimes refer to on this list as
"schema-aware XSLT"? If so, I have no idea how to make my XSLT
schema-aware.
Another problem is that the dictionary contains quite a few "mistakes"
which are all but invisible in Word, e.g. a single space might be
inadvertently bolded in an unbold field. This sort of thing is
faithfully copied by a converter and complicates the starting XML
unnecessarily, of course.
I would be grateful for advice as to how best to proceed. I took on
this job as a favour, hoping it would help me to learn something of
these technologies, but it seems now there is too much to learn on
one's own in any reasonably short space of time (XSLT is not for
amateurs :-( ). Perhaps I should advise having the job done
professionally. Unless there is something I am missing...
On a related matter, I have recently discovered LIFT as a particular
XML format for lexicographical work
(http://code.google.com/p/lift-standard/). Any experience of that as a
target format for XSLT would also be of interest.
Thanks,
Ciaran S Duibhmn.