RE: need for defining standard APIs for xml storage
Hi Dongwook,

Dongwook said:

I have been developing XML indexing and retrieval engines that can scale to large XML collections. I have also seen that similar systems, which only build a DOM and search within it, fail to scale on large collections. Every time I develop an XML IR system, I need an API for XML storage. The only time I want to invoke the DOM is during indexing, which is usually performed off-line. At retrieval time I do not want to rely on the DOM, since it can consume a huge amount of memory, which badly degrades retrieval performance. Instead, I want to use a lightweight index that maps elements to the real data. To create such an index without depending on a specific repository, it seems important to have a well-defined API for XML storage.

Didier replies:

The recent posting about XML queries made me think a bit more on the subject. Concretely speaking, if the DOM were augmented with a function like

    node-set = selectNodes(queryType, expression)

where the query type could be, for instance, "XPath" or "XQL" or whatever, and the expression is a string representing the query, we would have a useful construct. It also seems that we need two kinds of queries:

a) queries based on the elements;
b) queries based on the data content.

The second kind is needed when not all information is tagged. In that case the knowledge is stored in the data content but not tagged, and therefore needs to be indexed to be easily retrieved. So, if the DOM included a function such as node-set = selectNodes(queryType, expression), we could apply any kind of query to an information set without having to add a new function each time we add a new query type.

Now, about your indexes: what kind of algorithm are you using for the elements? On our side, as we gather more experimental data, we are moving toward a world where the permanent information set uses some of the grove concepts.
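To make the idea concrete, here is a minimal sketch of such a selectNodes dispatch layer. Everything here is an assumption for illustration: the method name, the registry of query engines, and the use of Python's ElementTree (with its limited XPath subset) in place of a full DOM implementation. An "XQL" engine could be plugged into the same table without changing the interface.

```python
import xml.etree.ElementTree as ET

# Hypothetical registry of query engines, keyed by query type.
# "XPath" is backed here by ElementTree's limited XPath subset;
# other query languages would register their own handlers.
_QUERY_ENGINES = {
    "XPath": lambda node, expr: node.findall(expr),
}

def select_nodes(node, query_type, expression):
    """Sketch of node-set = selectNodes(queryType, expression)."""
    try:
        engine = _QUERY_ENGINES[query_type]
    except KeyError:
        raise ValueError("unsupported query type: %s" % query_type)
    return engine(node, expression)

doc = ET.fromstring(
    "<catalog><book genre='xml'>XML in a Nutshell</book>"
    "<book genre='sql'>SQL Basics</book></catalog>"
)
hits = select_nodes(doc, "XPath", "book[@genre='xml']")
```

The point of the dispatch table is exactly the one made above: adding a new query type means registering one more entry, not adding a new function to the interface.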
We now have the right element for this: the xinclude:include element. If a data source somewhere can take a URL as a request, and if that data source can return an XML document fragment, then even a big collection can be managed by all kinds of tools. I'll explain; be patient...

a) Imagine an information set where a big chunk of it is stored somewhere else, and that this chunk of information is dynamically created. To do so, we have a document such as:

    <mydocument>
      <element1> .... </element1>
      <element2>
        <xinclude:include href="http://myfavoritesqlserver.com/sql=select name, address, profile from customerDB where profile=good-customer"/>
      </element2>
      etc....
    </mydocument>

b) Imagine now that this document is stored in a permanent information set (or GROVE, if you wish). We store only the xinclude element in the permanent information set. This element is used as a kind of external link.

c) A user requests an XPath like /mydocument/element2[name='Albert Einstein']. The information set engine would then talk to the SQL server with the SQL request. The SQL engine uses its set of B+ trees to retrieve the information and returns an XML document. From this document we continue to resolve the XPath expression until we get the right "Albert Einstein".

d) You can imagine the same scenario with a different query language, like XQL, which should allow you to select a range.

You can also imagine that one of the nodes represents a topic and that this topic is a keyword located in different documents. You could just as well use an xlink element pointing to an XML fragment instead of an xinclude:include element. The point to note here is that a permanent information set can have its content in diverse forms but still show an XML face. So structured and unstructured content can be freely intermixed: B+ trees can be used to retrieve structured content, and text indexes to retrieve unstructured content.
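The resolution step in c) can be sketched as a small tree walk: when the engine meets an xinclude:include element, it fetches the fragment from the back end and splices it into the tree before query evaluation continues. In this sketch the SQL server is replaced by an in-memory table mapping an href to the fragment it would return; the URL, the FAKE_BACKEND table, and resolve_includes are all stand-ins, not real APIs.

```python
import xml.etree.ElementTree as ET

XINCLUDE = "{http://www.w3.org/2001/XInclude}include"

# Stand-in for the SQL back end: maps an href to the XML fragment
# the server would return. A real engine would issue the query here.
FAKE_BACKEND = {
    "http://myfavoritesqlserver.com/customers": (
        "<customers><customer><name>Albert Einstein</name>"
        "<profile>good-customer</profile></customer></customers>"
    ),
}

def resolve_includes(element):
    """Replace xinclude:include children with fragments fetched on demand."""
    for i, child in enumerate(list(element)):
        if child.tag == XINCLUDE:
            fragment = ET.fromstring(FAKE_BACKEND[child.get("href")])
            element.remove(child)
            element.insert(i, fragment)
        else:
            resolve_includes(child)
    return element

doc = ET.fromstring(
    "<mydocument xmlns:xi='http://www.w3.org/2001/XInclude'>"
    "<element1/><element2>"
    "<xi:include href='http://myfavoritesqlserver.com/customers'/>"
    "</element2></mydocument>"
)
resolve_includes(doc)
names = doc.findall("element2/customers/customer/name")
```

A production engine would resolve includes lazily, only along the path the XPath expression actually touches, rather than expanding the whole document up front; the splice-then-continue shape is the same.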
So, as soon as we talk about XML storage, it is better to have an engine that can wrap this back-end diversity: one that hides the fact that the XML hierarchy does not necessarily come from a text document, and resolves that diversity into a small set of data types like nodes, node-sets, etc. In fact, all of this is what the GROVE is about. The DOM should now evolve to be not only an interface to parsed text documents but also an interface to information sets. An information set does not necessarily come from a text document. Information sets could be the latest incarnation of hierarchical databases, or the latest incarnation of an aggregation tool. We are slowly evolving toward that goal.

Cheers

***************************************************************************
This is xml-dev, the mailing list for XML developers. To unsubscribe, mailto:majordomo@x...&BODY=unsubscribe%20xml-dev
List archives are available at http://xml.org/archives/xml-dev/
***************************************************************************