
Re: equivalence of <i><b></b></i> and <b><i></i></b> et. al. ?


The tool builds Markov models based on the tags seen in training 
documents and then uses Viterbi search to insert the same tags in other 
documents. These are the same algorithms used to turn speech into text 
in speech recognition, but here they transform completely flat XML 
documents into marked-up XML documents.
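
As a rough illustration of the idea (not the tool's actual code), here is a
minimal Viterbi sketch in Python. The states "author" and "title", the toy
probabilities, and the helper names viterbi() and mark_up() are all
assumptions made up for this example; a real model would estimate the
probabilities from the training documents.

```python
import math
from itertools import groupby


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state (tag) sequence for the observations."""

    def log(p):
        # Work in log space; a zero probability becomes -infinity.
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log probability of the best path ending in state s at step t
    best = [{s: log(start_p[s]) + log(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log(trans_p[p].get(s, 0.0))) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def mark_up(tokens, path):
    """Wrap each run of tokens that share a state in a tag named after it."""
    parts = []
    for tag, group in groupby(zip(path, tokens), key=lambda pair: pair[0]):
        words = " ".join(tok for _, tok in group)
        parts.append(f"<{tag}>{words}</{tag}>")
    return " ".join(parts)


# Toy model (hypothetical numbers, chosen by hand for illustration only).
states = ("author", "title")
start_p = {"author": 0.9, "title": 0.1}
trans_p = {"author": {"author": 0.6, "title": 0.4},
           "title": {"author": 0.1, "title": 0.9}}
emit_p = {"author": {"stuart": 0.4, "yeates": 0.5, "markov": 0.1},
          "title": {"markov": 0.5, "models": 0.5}}

tokens = ["stuart", "yeates", "markov", "models"]
path = viterbi(tokens, states, start_p, trans_p, emit_p)
print(mark_up(tokens, path))
```

With these toy numbers the flat token sequence comes back as
<author>stuart yeates</author> <title>markov models</title>.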

The kinds of tags that can be inserted range from part-of-speech tags 
and word segmentation tags (very useful for languages which don't 
explicitly write their inter-word spaces) to tags making bibliographic 
information explicit. Hierarchical (nested) tags can also be inserted.

One requirement is a quality set of training documents, with tags used 
consistently throughout. Without such a corpus, the Markov models are 
unable to discriminate between tags. Sequences such as <i><b></b></i> 
need to be marked up consistently to get good models.

Because the tool is very general and the applicable transformations are 
defined by the application language (of which the tool knows nothing), 
it seems unlikely that the tool itself can do much to make these 
consistent, unless someone else has already tackled the problem.
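
One possible workaround is a normalisation pass over the corpus before
training, outside the tool. The Python sketch below is only that, a sketch:
it assumes that the application language really does treat the listed tags
as order-insensitive (here a hypothetical set INLINE containing "b" and
"i"), and it only reorders tags that wrap one another directly with no
intervening text. The function name canonicalise() is made up for this
example.

```python
import xml.etree.ElementTree as ET

# Assumption: these tags may be nested in any order without changing meaning.
INLINE = {"b", "i"}


def _wraps_only_child(elem):
    """True if elem holds exactly one child element and no text of its own."""
    return (
        len(elem) == 1
        and not (elem.text or "").strip()
        and not (elem[0].tail or "").strip()
    )


def canonicalise(elem):
    """Rewrite chains of directly nested INLINE tags into alphabetical order."""
    # Collect the chain of inline elements that wrap one another directly.
    chain = [elem]
    while (
        chain[-1].tag in INLINE
        and _wraps_only_child(chain[-1])
        and chain[-1][0].tag in INLINE
    ):
        chain.append(chain[-1][0])

    if len(chain) > 1:
        # Sort (tag, attributes) pairs by tag name and reassign them down the
        # chain, so each tag keeps its own attributes.
        ordered = sorted(((e.tag, dict(e.attrib)) for e in chain),
                         key=lambda pair: pair[0])
        for e, (tag, attrib) in zip(chain, ordered):
            e.tag = tag
            e.attrib.clear()
            e.attrib.update(attrib)

    # Continue below the innermost element of the chain.
    for child in chain[-1]:
        canonicalise(child)
    return elem


doc = ET.fromstring("<p><i><b>x</b></i> and <b><i>y</i></b></p>")
normalised = ET.tostring(canonicalise(doc), encoding="unicode")
print(normalised)
```

After this pass both spellings come out as <b><i>...</i></b>, so the Markov
models see a single form. Whether the two forms are truly equivalent
remains a decision for the application language, not for this script.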

The tool will be released under the GPL when complete.

cheers
stuart

Bullard, Claude L (Len) wrote:

>So you are taking data already tagged in XML and inserting 
>more markup into it, as in adding HTML tags to text nodes?
>
>1)  You are right that markup systems are silent about 
>these semantics.  They are in the domain of the 
>application language.  However, in this case, a bold italic 
>item and an italic bold item are rendered identically, yes? 
>And rendering is the semantic, yes? So why are these not 
>equivalent semantically, if not syntactically?
>
>What do you mean by 'similar classes of constructs'?
>
>2)  An XSLT script could be used to transform this 
>example.  
>
>len
>
>
>From: Stuart A Yeates
>[mailto:stuart.yeates@c...]
>
>I have written a natural language modelling tool which marks up (inserts 
>XML tags into) natural language documents already in XML.
>
>I have come across an issue with this tool: some users and documents 
>have an expectation that <i><b></b></i> and <b><i></i></b> (and similar 
>classes of constructs) are equivalent, whereas my tool sees these as 
>completely distinct.
>
>From looking at the standards, it appears that HTML, XHTML and XML 
>are all silent on the semantics of situations such as this.
>
>Are there any systems or toolkits which have already been written to 
>help systematise documents and corpora into a single, consistent 
>representation?
>
>cheers
>stuart
>
>  
>


-- 
Stuart Yeates            stuart.yeates@c...
OSS Watch                                  http://www.oss-watch.ac.uk/
Oxford Text Archive                             http://ota.ahds.ac.uk/
Humbul Humanities Hub                         http://www.humbul.ac.uk/

