
CLaRK System - an XML-based System for Corpora Development

  • To: xml-dev@l...
  • Subject: CLaRK System - an XML-based System for Corpora Development
  • From: "Milen Kouylekov" <mkouylekov@d...>
  • Date: Thu, 13 Jun 2002 21:21:06 +0300

Dear List members,

I would like to announce the CLaRK System - an XML-based System
for Corpora Development. It is available on the web page of 
the BulTreeBank Project:

http://www.bultreebank.org/

Please follow the "CLaRK System" link and then "Download".

The system is implemented in Java.

Short description:

CLaRK is an XML-based software system for corpora development.
The main aim behind the design of the system is the minimization
of human intervention during the creation of language resources.
It incorporates several technologies: (1) XML technology; 
(2) Unicode; (3) Regular Cascade Grammars; 
(4) Constraints over XML Documents.

For document management, storage and querying, we chose XML 
technology because of its popularity and ease of understanding. 
The core of CLaRK is an XML editor, which is the main interface 
to the system. Besides XML itself, we implemented the XPath 
language for navigation in documents and the XSLT language for 
transforming XML documents.

For multilingual processing tasks, CLaRK is based on a 
Unicode encoding of the information inside the system. 
There is a mechanism for creating a hierarchy of tokenisers. 
Tokenisers can be attached to the elements in the DTDs, so 
that different parts of a document can use different 
tokenisers.
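
As a rough illustration of the idea (a sketch only, not CLaRK's code 
or API), a tokeniser hierarchy attached to element names could look 
like this in Python. The Tokeniser class, the fall-back-to-parent 
behaviour, the token categories and the element names "w" and "date" 
are all assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET


class Tokeniser:
    """Hypothetical tokeniser: (category, regex) rules plus an optional parent."""

    def __init__(self, rules, parent=None):
        self.rules = [(cat, re.compile(rx)) for cat, rx in rules]
        self.parent = parent

    def all_rules(self):
        # Own rules are tried first; the parent's rules act as a fallback.
        inherited = self.parent.all_rules() if self.parent else []
        return self.rules + inherited

    def tokenise(self, text):
        tokens, pos = [], 0
        rules = self.all_rules()
        while pos < len(text):
            for cat, rx in rules:
                m = rx.match(text, pos)
                if m and m.end() > pos:
                    tokens.append((cat, m.group()))
                    pos = m.end()
                    break
            else:
                # No rule matched: emit the single character as OTHER.
                tokens.append(("OTHER", text[pos]))
                pos += 1
        return tokens


# Base tokeniser; str patterns in Python 3 are Unicode-aware, so
# [^\W\d_]+ matches letters in any script.
base = Tokeniser([("WORD", r"[^\W\d_]+"), ("NUMBER", r"\d+"), ("SPACE", r"\s+")])
# Specialised tokeniser that recognises dates and inherits the rest.
date_tok = Tokeniser([("DATE", r"\d{1,2}\.\d{1,2}\.\d{4}")], parent=base)

# Assumed mapping from element names to tokenisers.
TOKENISERS_BY_TAG = {"date": date_tok}


def tokenise_element(elem):
    """Tokenise the text of an element with the tokeniser attached to its tag."""
    tokeniser = TOKENISERS_BY_TAG.get(elem.tag, base)
    return tokeniser.tokenise(elem.text or "")


doc = ET.fromstring("<s><w>Здравей свят</w><date>13.06.2002</date></s>")
```

Here tokenise_element(doc.find("date")) treats the whole date as one 
token, while the same text passed to the base tokeniser is split 
into numbers and punctuation.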

The basic mechanism of CLaRK for linguistic processing of 
text corpora is the cascade regular grammar processor. 
The main challenge for such grammars is how to apply them 
to an XML encoding of the linguistic information. The system 
offers a solution that uses an XPath language to construct 
the input word for the grammar and an XML encoding for the 
categories of the recognised words.
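
The general technique can be sketched in Python as follows. This is 
not CLaRK's implementation: the "pos" attribute, the element names, 
the three toy grammars and the helper functions are assumptions for 
the example, and the standard library's ElementTree supports only a 
subset of XPath. Each child of a sentence contributes one symbol to 
the input word, a regular expression over those symbols plays the 
role of a grammar rule, and each recognised span is wrapped in a new 
element named after its category, so that the next grammar in the 
cascade can refer to it.

```python
import re
import xml.etree.ElementTree as ET


def symbol(node):
    # A node contributes its "pos" attribute if it has one, else its tag,
    # so categories built by an earlier grammar are visible to later ones.
    return node.get("pos", node.tag)


def apply_grammar(parent, category, pattern):
    """Wrap every run of children whose symbols match `pattern` in <category>."""
    children = list(parent)
    # Input word: one symbol per child, each followed by a space.
    word = "".join(symbol(c) + " " for c in children)
    # Character offset at which each child's symbol starts in the input word.
    starts, offset = [], 0
    for c in children:
        starts.append(offset)
        offset += len(symbol(c)) + 1
    starts.append(offset)

    regex = re.compile(pattern)
    rebuilt, i = [], 0
    while i < len(children):
        m = regex.match(word, starts[i])
        # Accept only non-empty matches that end on a symbol boundary.
        if m and m.end() > starts[i] and m.end() in starts:
            j = starts.index(m.end())
            wrapper = ET.Element(category)
            wrapper.extend(children[i:j])
            rebuilt.append(wrapper)
            i = j
        else:
            rebuilt.append(children[i])
            i += 1
    for c in children:
        parent.remove(c)
    parent.extend(rebuilt)


# Toy cascade: each grammar sees the categories produced by the previous ones.
CASCADE = [
    ("NP", r"(Det )?(Adj )*N "),
    ("VP", r"V (NP )?"),
    ("S", r"NP VP "),
]

doc = ET.fromstring(
    '<text><s><w pos="Det">the</w><w pos="Adj">old</w><w pos="N">dog</w>'
    '<w pos="V">chased</w><w pos="Det">a</w><w pos="N">cat</w></s></text>'
)
for sentence in doc.findall(".//s"):  # path expression selecting where to apply
    for category, pattern in CASCADE:
        apply_grammar(sentence, category, pattern)
```

After the cascade has run, the sentence contains a single S element 
with an NP and a VP inside it, and the VP in turn contains the verb 
and the second NP.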

Several mechanisms for imposing constraints over XML 
documents are available. These constraints cannot be stated 
with standard XML technology. The following types of constraints 
are implemented in CLaRK: (1) Regular expression constraints - 
additional constraints over the content of given elements based 
on a context; (2) Number restriction constraints - cardinality 
constraints over the content of a document; (3) Value constraints - 
restrictions on the possible content or parent of an element in 
a document based on a context. The constraints are used in 
two modes: checking the validity of a document against a set 
of constraints, and supporting linguists in their work during 
the building of a corpus. The first mode allows the creation of 
constraints for validating a corpus according to given 
requirements. The second mode supports the underlying strategy of 
minimising human labour.
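
To make the three constraint types more concrete, here is a small 
Python sketch of the first (validation) mode. It reflects one 
possible reading of the description above and is not CLaRK's 
constraint language: the function names, the sample corpus and the 
use of ElementTree path expressions to select the context are all 
assumptions. Each check returns the elements that violate the 
constraint.

```python
import re
import xml.etree.ElementTree as ET


def check_content_regex(doc, context, pattern):
    """Regular expression constraint: the sequence of child tags of every
    element selected by `context` must match `pattern` completely."""
    regex = re.compile(pattern)
    return [e for e in doc.findall(context)
            if not regex.fullmatch(" ".join(c.tag for c in e))]


def check_number(doc, context, child_path, low, high):
    """Number restriction: each context element must contain between
    `low` and `high` nodes selected by `child_path`."""
    return [e for e in doc.findall(context)
            if not low <= len(e.findall(child_path)) <= high]


def check_parent(doc, path, allowed_parents):
    """Value constraint on the parent: every element selected by `path`
    must have a parent whose tag is in `allowed_parents`."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in doc.iter() for child in parent}
    return [e for e in doc.findall(path)
            if parents[e].tag not in allowed_parents]


doc = ET.fromstring("""
<corpus>
  <s id="s1"><NP><w>dogs</w></NP><VP><w>bark</w></VP></s>
  <s id="s2"><VP><w>run</w></VP></s>
  <s id="s3"><NP><w>cats</w></NP><VP><w>sleep</w><PP><w>here</w></PP></VP></s>
</corpus>
""")

bad_structure = check_content_regex(doc, ".//s", r"NP VP")
too_short = check_number(doc, ".//s", ".//w", 2, 10)
misplaced = check_parent(doc, ".//w", {"NP", "VP"})
```

With this sample corpus, the sentence "s2" violates both the regular 
expression constraint and the number restriction, and the word 
"here" violates the parent constraint because it sits under PP.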

Best wishes,
Milen Kouylekov
mkouylekov@d...
http://www.BulTreeBank.org
