
Re: searching for search

  • From: Marcelo Cantos <marcelo@m...>
  • To: xml-dev@i...
  • Date: Mon, 24 May 1999 12:07:09 +1000

On Sun, May 23, 1999 at 11:31:43PM +0200, Edward C. Zimmermann wrote:
> > - in general, what are the features that push a technology
> > into another level of complexity and why (i.e. what is so
> > hard here?)
> Designing fulltext engines is not difficult :-)

Making them updateable, small and fast is.  Full-structure indexing,
in particular, is easy in principle but quite awkward to make
efficient and small in a dynamic update environment.  Phrase querying
is another feature that has to be implemented carefully to offer speed
while avoiding space blowouts.
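
To illustrate where the space goes with phrase querying, here is a
minimal sketch of a positional inverted index in Python.  It is not how
SIM or any particular engine works; the names build_index, phrase_query
and posting_counts are made up for this example.  The point is only that
a phrase match needs word positions, so the index stores one entry per
occurrence instead of one per (term, document) pair.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [positions]}; docs is {doc_id: text}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def phrase_query(index, phrase):
    """Return sorted doc ids in which the phrase's terms occur consecutively."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Candidate documents must contain every term of the phrase.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in candidates:
        # Start position p matches if term i occurs at p + i for every i.
        starts = set(index[terms[0]][doc_id])
        for offset, t in enumerate(terms[1:], start=1):
            starts &= {p - offset for p in index[t][doc_id]}
        if starts:
            hits.append(doc_id)
    return sorted(hits)


def posting_counts(index):
    """Return (document-level postings, positional postings)."""
    doc_level = sum(len(per_doc) for per_doc in index.values())
    positional = sum(
        len(positions)
        for per_doc in index.values()
        for positions in per_doc.values()
    )
    return doc_level, positional


# Toy collection, invented for the example.
DOCS = {
    1: "full text search is easy",
    2: "text search engines index full text",
    3: "search the full index",
}
INDEX = build_index(DOCS)
```

The positional count equals the total number of words in the
collection, which is why a naive phrase index grows with the text
itself; a real engine would compress the position lists heavily.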

It is quite common for engines to generate indexes that are two to ten
times the size of the data.  Furthermore, this is generally considered
acceptable (SIM rarely goes over half the size of the data -- we
haven't implemented full-structure or phrase indexing yet, but have
done some research to establish the cost; structure indexes should add
minimal overhead, and efficient phrase indexes should roughly double
our current index size).
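
As a rough worked example of those ratios, the sketch below applies
them to a 100GB collection (the figure used later in this message).
The function name is made up, and the 0.5 ratio is the "rarely goes
over half" upper bound quoted above, not a measured value.

```python
def index_size_gb(data_gb, ratio):
    """Index size in GB for a given index-to-data size ratio."""
    return data_gb * ratio


DATA_GB = 100  # collection size, taken from the 100GB example in this message

typical_low = index_size_gb(DATA_GB, 2)     # "two times the size of the data"
typical_high = index_size_gb(DATA_GB, 10)   # "ten times the size of the data"
sim_current = index_size_gb(DATA_GB, 0.5)   # assumed upper bound: half the data
sim_with_phrase = 2 * sim_current           # phrase indexes roughly double it

print(typical_low, typical_high, sim_current, sim_with_phrase)
```

At the ten-times end of the range, a 100GB collection needs about a
terabyte for the index alone.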

Here is a list of features that have the potential to push complexity
(the list is neither comprehensive nor in any particular order):

  * Size minimisation
  * Large collections (e.g. exceeding 2 or 4 GB can pose
    unique problems that are non-trivial to solve) 
  * Performance
  * Interactive updates
  * Full-structure querying
  * Phrase querying
  * Transactions
  * Incremental backups (important for large collections)
  * Multi-database queries
  * Multi-database ranked/sorted queries
  * Multi-server queries
  * Multi-server ranked/sorted queries
  * Multi-server multi-vendor queries
  * Multi-server multi-vendor ranked queries

(I list the various multi-database options separately because each
of them introduces new and quite different issues, though some of the
issues may only arise in the context of Z39.50, with which we deal.)
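
One example of such an issue, offered as an illustration and not taken
from any particular product: when ranked results come from several
servers, the raw scores may be on different scales and cannot simply be
sorted together.  The Python sketch below assumes min-max rescaling per
source makes them roughly comparable, which is a simplification; the
function names and the sample scores are hypothetical.

```python
import heapq


def normalise(results):
    """Rescale one source's [(doc_id, score), ...] to the range [0, 1]."""
    if not results:
        return []
    scores = [score for _, score in results]
    lo, hi = min(scores), max(scores)
    span = hi - lo
    # A source whose scores are all equal gets 1.0 for every hit.
    return [
        (doc_id, 1.0 if span == 0 else (score - lo) / span)
        for doc_id, score in results
    ]


def merge_ranked(sources, k):
    """Merge {source_name: results} into the top k (name, doc_id, score)."""
    merged = []
    for name, results in sources.items():
        merged.extend((name, doc_id, score) for doc_id, score in normalise(results))
    return heapq.nlargest(k, merged, key=lambda row: row[2])


# Two hypothetical servers that score on very different scales.
SOURCES = {
    "server_a": [("a1", 0.9), ("a2", 0.1)],
    "server_b": [("b1", 250.0), ("b2", 200.0), ("b3", 50.0)],
}
TOP = merge_ranked(SOURCES, 3)
```

Without the rescaling step every hit from server_b would outrank every
hit from server_a, whatever its actual relevance.  Multi-vendor setups
make this harder still, since the ranking formulas themselves differ.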

> > - specifically, what are the characteristics of each of
> > these in performance/reliability/features (personal experience
> > from non-vendors and public benchmarks are of course preferred,
> > but vendor claims might be of interest too)
> > 
> > - can i safely ignore the non open source ones without giving
> > up capabilities
> What do you mean (I seem to be on a roll at not understanding
> questions these days)?

Possibly the triple negative (_ignore_, _non_, _without_) contributed
in this case. :-)

> > - if all i wanted to do was boolean search on field values with
> > no stemming/concept support, then regardless of how i did the
> > indexing, what is wrong with using standard b-trees and/or just
> > putting the index data in a sql db?
> To make the answer short: depends upon what you want to do.

A slightly longer answer is, if you have 100GB of data that you want
to index in an SQL database then you'd better grab a terabyte of hard
disk and be prepared to wait a LONG time for your queries to come back
to you.
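
For concreteness, here is roughly what "putting the index data in a
SQL db" looks like, sketched in Python with the standard sqlite3 module.
The posting table, its columns and the sample records are assumptions
made up for this example.  Every (field, term, document) combination
becomes a full row plus a B-tree index entry, which is where the space
goes on a large collection.

```python
import sqlite3


def build_db(docs):
    """Load {doc_id: {field: text}} into an in-memory postings table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posting (field TEXT, term TEXT, doc_id INTEGER)")
    # SQLite implements this index as a B-tree over the three columns.
    conn.execute("CREATE INDEX posting_idx ON posting (field, term, doc_id)")
    rows = [
        (field, term, doc_id)
        for doc_id, fields in docs.items()
        for field, value in fields.items()
        for term in set(value.lower().split())
    ]
    conn.executemany("INSERT INTO posting VALUES (?, ?, ?)", rows)
    return conn


def search_and(conn, field, terms):
    """Return doc ids whose field contains all of the given terms."""
    terms = sorted({t.lower() for t in terms})
    if not terms:
        return []
    placeholders = ",".join("?" * len(terms))
    sql = (
        "SELECT doc_id FROM posting "
        f"WHERE field = ? AND term IN ({placeholders}) "
        "GROUP BY doc_id HAVING COUNT(DISTINCT term) = ? "
        "ORDER BY doc_id"
    )
    return [row[0] for row in conn.execute(sql, [field, *terms, len(terms)])]


# Sample records, invented for the example.
DOCS = {
    1: {"title": "xml search engines", "author": "smith"},
    2: {"title": "sql search", "author": "jones"},
    3: {"title": "xml and sql", "author": "smith"},
}
CONN = build_db(DOCS)
ROW_COUNT = CONN.execute("SELECT COUNT(*) FROM posting").fetchone()[0]
```

This works fine at small scale.  The trouble is that the term text is
repeated in every row and again in the index, whereas a purpose-built
engine stores each term once with a compressed list of document ids.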


Cheers,
Marcelo

-- 
http://www.simdb.com/~marcelo/

xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)

