I'm banging my head against a sequence alignment problem. I have a
feeling that this is straightforward, but I can't put my finger on
what's missing from my attempts.

Suppose I have two inputs like so, where input1//w is always a subset
of input2//w:

  <input1>
    <w n="1">I</w>
    <w n="2">am</w>
    <w n="3">a</w>
    <w n="4">sequence</w>
  </input1>

  <input2>
    <w>I</w>
    <w>am</w>
    <w>a</w>
    <w>longer</w>
    <w>longer</w>
    <w>sequence</w>
  </input2>

I'd like to get output like so:

  <output>
    <w n="1">I</w>
    <w n="2">am</w>
    <w n="3">a</w>
    <w n="skipped">longer</w>
    <w n="skipped">longer</w>
    <w n="4">sequence</w>
  </output>

I.e., for each input1//w, @n should be copied to the nearest following
sibling <w> in input2 that matches .; <w>s in input2 that aren't in
input1 should be flagged as "skipped".

P.S.: The use case is aligning an imperfect but timestamped
transcription of an audio file (input1, machine-generated) with a
perfect but not-timestamped one (input2, human-generated).

Thanks much for any help,
Markus

--
Markus Flatscher, Project Editor
ROTUNDA, The University of Virginia Press
PO Box 400314, Charlottesville VA 22904, USA
Courier: 211 Emmet Street South, Charlottesville VA 22903, USA
Email: markus.flatscher@xxxxxxxxxxxx
Web: http://rotunda.upress.virginia.edu/
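
For illustration only, the alignment described above might be sketched as a single greedy pass: walk input2 in order, keep a pointer into input1, copy @n whenever the current words match, and flag everything else as "skipped". The sketch below is in Python rather than XSLT, relies on the stated assumption that input1//w is an in-order subset of input2//w, and uses a hypothetical function name (`align`); it is not a definitive solution, and a genuinely imperfect transcription would likely need a more tolerant matcher.

```python
import xml.etree.ElementTree as ET

INPUT1 = """<input1>
  <w n="1">I</w> <w n="2">am</w> <w n="3">a</w> <w n="4">sequence</w>
</input1>"""

INPUT2 = """<input2>
  <w>I</w> <w>am</w> <w>a</w> <w>longer</w> <w>longer</w> <w>sequence</w>
</input2>"""


def align(input1_xml: str, input2_xml: str) -> ET.Element:
    """Hypothetical helper: copy @n from input1 onto matching <w> in input2.

    Assumes every input1 <w> occurs in input2 in the same order.
    """
    timed = ET.fromstring(input1_xml).findall("w")  # words carrying @n
    full = ET.fromstring(input2_xml).findall("w")   # complete word list
    output = ET.Element("output")
    i = 0  # index of the next input1 word still waiting for a match
    for w in full:
        out_w = ET.SubElement(output, "w")
        out_w.text = w.text
        if i < len(timed) and timed[i].text == w.text:
            # Nearest following match for the pending input1 word.
            out_w.set("n", timed[i].get("n"))
            i += 1
        else:
            # Word exists only in input2.
            out_w.set("n", "skipped")
    return output


result = align(INPUT1, INPUT2)
ns = [w.get("n") for w in result]
texts = [w.text for w in result]
print(ET.tostring(result, encoding="unicode"))
```

With the sample inputs, `ns` comes out as `['1', '2', '3', 'skipped', 'skipped', '4']`. Because the pass is greedy, a repeated word in input2 is always matched at its first occurrence, which is what "nearest following" asks for here.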