
Re: need for defining standard APIs for xml storage

  • From: Dongwook Shin <dwshin@n...>
  • To: xml-dev@x...
  • Date: Mon, 03 Apr 2000 15:00:03 -0400

Hi, Martin:

Martin said:

> It seems also that we need two kind of queries
> a) queries based on the elements.
> b) queries based on the data content.
>
> The last query is needed when not all information is tagged. In this case we
> end up with a situation where the knowledge is stored in the data content
> but not tagged and therefore need to be indexed to be easily retrieved.

Very close, but not exactly. Most of the literature classifies queries
into two or three categories:
    a) Structural queries
    b) Content queries
    c) Attribute queries (these can be considered a subset of a)

But content queries are still posed in the context of elements. For instance,
the query "find a SPEECH whose SPEAKER contains 'hamlet'" is regarded as a
content query even though it states a certain element relationship. Structural
queries, on the other hand, address only the relationships among elements,
like "find a SPEECH having at least three SPEAKER elements".
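
To make the distinction concrete, here is a minimal sketch of the two example
queries using Python's standard xml.etree.ElementTree. The sample document and
the function names content_query and structural_query are my own illustration,
not part of any proposed API.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the style of a marked-up play (illustration only).
SAMPLE = """
<SCENE>
  <SPEECH>
    <SPEAKER>HAMLET</SPEAKER>
    <LINE>To be, or not to be</LINE>
  </SPEECH>
  <SPEECH>
    <SPEAKER>ROSENCRANTZ</SPEAKER>
    <SPEAKER>GUILDENSTERN</SPEAKER>
    <SPEAKER>OSRIC</SPEAKER>
    <LINE>We will, my lord.</LINE>
  </SPEECH>
</SCENE>
"""


def content_query(root, word):
    """SPEECH elements with a SPEAKER child whose text contains `word`.

    The match looks at character data, but it is still scoped by elements.
    """
    word = word.lower()
    return [
        speech
        for speech in root.iter("SPEECH")
        if any(word in (sp.text or "").lower() for sp in speech.findall("SPEAKER"))
    ]


def structural_query(root, minimum):
    """SPEECH elements with at least `minimum` SPEAKER children; text is ignored."""
    return [
        speech
        for speech in root.iter("SPEECH")
        if len(speech.findall("SPEAKER")) >= minimum
    ]


root = ET.fromstring(SAMPLE)
by_content = content_query(root, "hamlet")
by_structure = structural_query(root, 3)
```

Both functions walk the same element tree; only the content query ever reads
the text inside an element.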

However you define them, I think it is theoretically cleaner to assume that
all data content is tagged, even when it actually is not. For instance, if
you are given plain text, you can assume that it is enclosed by the tags
<DOC> </DOC>, or whatever you want. By doing so, you can bring all legacy
plain text into the XML framework with minimal overhead.
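
As a rough sketch of that wrapping step, assuming Python and its standard
library (the helper name wrap_plain_text is hypothetical), the only real work
is escaping the characters that XML reserves:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def wrap_plain_text(text, tag="DOC"):
    """Return an XML string holding `text` as the content of one root element.

    escape() replaces &, < and > so that arbitrary legacy text stays well-formed.
    """
    return f"<{tag}>{escape(text)}</{tag}>"


legacy = "Profit & loss for Q1 < Q2"
doc = ET.fromstring(wrap_plain_text(legacy))
```

After parsing, doc is an ordinary element, so the same element-based queries
apply to legacy text as to marked-up documents.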

Martin said:

> So, if the DOM would include a function such as: node-set = selectNodes
> (queryType, Expression) then we can have any kind of queries applied on an
> information set without having to add a new function each time we add a new
> query type.

> Now about your indexes, what kind of algorithm are you using  for the
> elements?

What I mean is that whatever you add to the DOM, you end up in the same
situation. Basically, the DOM is a representation of the whole XML document.
The index, on the other hand, is a small set of pointers to the actual data.
If you have a query like "find a SPEECH whose SPEAKER contains 'hamlet'",
you have to search the whole DOM, which does not scale to large documents.
If you have an inverted index for the document, you can get the elements
containing "hamlet" immediately.
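
The contrast can be sketched in a few lines of Python. This is an in-memory
toy under my own assumptions (the sample document, build_inverted_index and
parent_map are hypothetical names), not a description of any particular
engine; a real index would live on disk and store positions rather than
element objects.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample document (illustration only).
SAMPLE = """
<SCENE>
  <SPEECH>
    <SPEAKER>HAMLET</SPEAKER>
    <LINE>To be, or not to be</LINE>
  </SPEECH>
  <SPEECH>
    <SPEAKER>QUEEN GERTRUDE</SPEAKER>
    <LINE>Good Hamlet, cast thy nighted colour off</LINE>
  </SPEECH>
</SCENE>
"""


def build_inverted_index(root):
    """Map each lowercased word to the elements whose own text contains it."""
    index = defaultdict(list)
    for elem in root.iter():
        for word in set(re.findall(r"\w+", (elem.text or "").lower())):
            index[word].append(elem)
    return index


def parent_map(root):
    """ElementTree keeps no parent pointers, so build a child -> parent map once."""
    return {child: parent for parent in root.iter() for child in parent}


root = ET.fromstring(SAMPLE)
index = build_inverted_index(root)   # built once, at load time
parents = parent_map(root)

# Query via the index: jump straight to the elements containing the word,
# then keep those that are SPEAKER children of a SPEECH.
hits = [
    parents[e]
    for e in index.get("hamlet", [])
    if e.tag == "SPEAKER" and parents[e].tag == "SPEECH"
]

# Same query by scanning the whole tree, for comparison.
scan = [
    speech
    for speech in root.iter("SPEECH")
    if any("hamlet" in (sp.text or "").lower() for sp in speech.findall("SPEAKER"))
]
```

Note that this index matches whole words while the scan matches substrings;
they agree on this sample, but a real system would have to pick one
definition of "contains".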

Martin said:

> now, you can also imagine that one of the nodes represent a topic and that
> this topic is a keyword located in different documents.
>
> You can as well choose to have an xlink element to point to an XML fragment
> instead of an xinclude:include element.
>
> The point here to note is that a permanent information set can have its
> content in diverse forms, but still show an XML face. So, structured and
> unstructured content can be freely intermixed. B+* trees can be used to
> retrieve structured content and text indexing indexes used to retrieve
> unstructured content. So, as soon as we talk about XML storage it is better
> to have an engine that can wraps this back end diversity. To hide the fact
> that the XML hierarchy does not necessarily comes from a text document.
> Finally that this diversity is resolved by a small set of data types like
> nodes, node-sets, etc...
>
> In fact, all this stuff is what the GROVE is about. Now the DOM should
> evolve to be not only an interface to parsed text documents but also an
> interface to information sets. An information set is not necessarily coming
> from a text document. In fact, information sets could be the latest
> incarnation of hierarchical databases. Or the latest incarnation of an
> aggregation tool. We are slowly evolving toward that goal.
>

It seems to me that the notion of a "permanent information set" resembles a
data repository. The first issue here is how you store the data and refer to
it from elsewhere. Another is what the query space (the document space a
query should look at) should be: should it be limited to the current XML
fragment, or extended by following links? Your GROVE seems to be one
solution for that.

Thanks
Dongwook

--
Dongwook Shin
Visiting Scholar
Lister Hill National Center for Biomedical Communications
National Library of Medicine,
8600 Rockville Pike Bethesda 20894, MD
E-mail: dwshin@n...
Tel: (301) 435-3257
FAX: (301) 480-3035
URL: http://dlb2.nlm.nih.gov/~dwshin



